My Random Forest Was Mostly Learning Time-to-Expiry Noise

Original link: https://illya.sh/threads/out-of-sample-permutation-feature-importance-for-random

After an initial round of random forest tuning, out-of-sample (OOS) permutation feature importance was adopted for feature optimization, to address the limitations of scikit-learn's default Gini importance. Gini importance was deemed unsuitable because it is biased toward continuous variables (many of the features are discrete), it is computed on the training data, and it behaves poorly with correlated features. The OOS approach trains the model, then measures how much predictive power drops on *separate* validation data when a single feature's values are randomly shuffled. The results show the model relies heavily on the "seconds_to_settle" feature, essentially time of day / time to expiry, which carries the entire model's predictive weight. In addition, an unusually high AUC of 0.7566 raised concerns about potential lookahead bias and overfitting. A feature cleanup is therefore underway to address this imbalance and improve the model's robustness.

Hacker News discussion (13 points by iluxonchik, 3 comments):

phyzome: What's the context here? It feels like at least three introductory paragraphs are missing.

andai: https://illya.sh/thoughts/ If you skip the articles about oil, there are dozens of posts on the same topic (mentioning random forests). It appears to be about predicting Bitcoin prices. Edit: this seems to be the most detailed post: https://illya.sh/thoughts/my-trading-ml-factory-yielded-22-r...

zzleeper: This is a refreshing article. Easy to read, and I learned something!

Original text

After the first round of tuning the random forest's parameters, it came time to optimize the features the random forest gets trained on. I'd already done a minor cleanup, but I hadn't yet tested the predictive importance of each feature in the random forest.

To optimize the random forest's features, I'm using Out-of-Sample (OOS) Permutation Feature Importance. The OOS approach consists of three core steps:

1️⃣ Train the random forest once on the training data
2️⃣ Take out-of-sample validation data (the testing data), randomly permute the values of a single feature (i.e., shuffle the values within one column), and pass the result to the model trained in step 1.
3️⃣ A feature is important to the model if the model's predictive power drops significantly when that feature's values are randomly shuffled.
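The three steps above can be sketched as follows. This is a minimal illustration on synthetic data, not the author's actual pipeline: the two features here are made-up stand-ins (one informative, one pure noise), and the metric is AUC as in the rest of the post.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: column 0 is informative, column 1 is pure noise.
X = rng.normal(size=(2000, 2))
y = (X[:, 0] + 0.3 * rng.normal(size=2000) > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

# Step 1: train once on the training split.
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
base_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

# Steps 2-3: shuffle one column at a time on the *validation* data
# and record how much the out-of-sample AUC drops.
importances = {}
for j in range(X_val.shape[1]):
    X_perm = X_val.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    perm_auc = roc_auc_score(y_val, model.predict_proba(X_perm)[:, 1])
    importances[j] = base_auc - perm_auc

print(importances)  # large drop for column 0, near-zero for column 1
```

Because the model is trained only once and only the validation copies are shuffled, this is much cheaper than retraining per feature, and the score drop reflects genuine out-of-sample predictive power.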

The "out of sample" part refers to the fact that the data used to train the model is distinct from the data used to evaluate it after permutation, which reduces the contribution of noise to the evaluation metrics. By default, scikit-learn uses Gini Importance to rank features by their utility to the model. Gini is a bad metric for my data because of:

➖ High-cardinality bias (it has an inherent bias toward continuous, high-cardinality variables, and some of my features are discrete)
➖ Gini importance is computed on the training data
➖ Given two correlated features, the random forest will randomly pick one of them at each split, so Gini ends up splitting the importance between the two
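The high-cardinality bias is easy to demonstrate: give a forest two features that are both pure noise, one continuous and one binary, and Gini importance will still favor the continuous one simply because it offers more candidate split points. A minimal sketch (synthetic data, illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 2000

# Neither feature predicts the target; they differ only in cardinality.
X = np.column_stack([
    rng.normal(size=n),           # continuous noise (high cardinality)
    rng.integers(0, 2, size=n),   # binary noise (low cardinality)
])
y = rng.integers(0, 2, size=n)    # random labels, independent of X

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Gini importance assigns most of the "importance" to the continuous
# noise column, even though both columns are equally useless.
print(model.feature_importances_)
```

An out-of-sample permutation test on the same model would correctly assign both columns roughly zero importance, since shuffling either one cannot hurt a model that never had real signal.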

Additionally, the out-of-sample Area Under the ROC Curve (AUC) of 0.7566 is unrealistically good for predicting 5-minute Bitcoin price moves. This value implies that if you pick a random winning 5-minute window and a random losing one, my model ranks the winner higher ≈76% of the time. Either I found a model that beats virtually every financial institution in existence, or there is lookahead bias and overfitting in the model.
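The "ranks the winner ≈76% of the time" reading is not a loose analogy: AUC is exactly the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A quick numerical check on synthetic scores (illustrative data, not the author's model):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)
scores = rng.normal(size=500) + y  # scores mildly correlated with labels

auc = roc_auc_score(y, scores)

# Probability that a random positive outscores a random negative,
# computed by brute force over all positive/negative pairs.
pos, neg = scores[y == 1], scores[y == 0]
pair_prob = (pos[:, None] > neg[None, :]).mean()

print(auc, pair_prob)  # the two values coincide (no tied scores here)
```

This equivalence (the Mann-Whitney U interpretation of AUC) is why a 0.7566 on 5-minute crypto returns should trigger an immediate lookahead-bias audit rather than a celebration.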

Conclusion for my meta forest: the "seconds_to_settle" feature is basically carrying the entire model 😂 So in its current state, the random forest is learning almost entirely from the time of day / time to expiration. The cleanup has started.
