My Random Forest Was Mostly Learning Time-to-Expiry Noise

Original link: https://illya.sh/threads/out-of-sample-permutation-feature-importance-for-random

After a first round of random forest tuning, out-of-sample (OOS) permutation feature importance was adopted for feature optimization, to work around the limitations of scikit-learn's default Gini importance. Gini importance was deemed unsuitable because it is biased toward continuous variables (many of the features are discrete), it is computed on the training data, and it behaves poorly with correlated features. The OOS approach consists of training the model, then measuring how much predictive power drops on *separate* validation data when a single feature's values are randomly shuffled. The results show the model relies heavily on the "seconds_to_settle" feature, essentially time of day / time to expiry, which carries the entire model's predictive weight. In addition, the unusually high AUC of 0.7566 raises concerns about lookahead bias and overfitting. Feature cleanup is therefore underway to address this imbalance and improve the model's robustness.

A Hacker News thread discussed a blog post (illya.sh) describing the author's experience using a random forest machine learning model for price prediction, most likely of Bitcoin. The author found the model was heavily influenced by "time to expiry" noise rather than a meaningful signal. The initial discussion focused on this feature's surprising strength and on requests for context, which were answered with links to related posts on the same blog describing the construction of a trading machine learning pipeline and a 22% return. One commenter suggested using SHAP values for better feature analysis. Overall, the post and its comments highlight the challenges of feature selection and the potential pitfalls of applying machine learning to financial markets. The article was praised as accessible and informative.

Original Article

After the first round of tuning the random forest's parameters, it came time to optimize the features the random forest is trained on. I had already done a minor cleanup, but I hadn't yet tested the predictive importance of each feature in the random forest.

To optimize the random forest's features, I'm using Out of Sample Permutation Feature Importance (OOS). The OOS approach consists of three core steps:

1️⃣ Train the random forest once on the training data
2️⃣ Take out-of-sample validation (testing) data, randomly permute the values of a single feature (i.e., shuffle the values within one column), and pass the data to the model trained in step 1.
3️⃣ A feature is important for the model if the model’s predictive power reduces significantly when that feature’s values are randomly shuffled.
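The three steps above can be sketched with scikit-learn's built-in `permutation_importance` helper. The data here is a synthetic stand-in (one informative feature, one pure-noise feature), not the author's actual dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: feature 0 drives the label, feature 1 is noise.
X = rng.normal(size=(2000, 2))
y = (X[:, 0] + 0.3 * rng.normal(size=2000) > 0).astype(int)

# Step 1: train once on the training split.
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Steps 2-3: shuffle each feature column on the *validation* split and
# measure how much the out-of-sample AUC drops.
result = permutation_importance(
    model, X_val, y_val, scoring="roc_auc", n_repeats=10, random_state=0
)
print(result.importances_mean)  # large AUC drop for feature 0, near zero for feature 1
```

Because the scoring happens on held-out data, a feature that the forest memorized from training noise shows little or no AUC drop when shuffled.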

The “out of sample” part refers to the fact that the set of data used to train the model and evaluate it after permutations is distinct, thus reducing the contribution of noise to the evaluation metrics. By default, scikit-learn uses Gini Importance to rank features by their utility to the model. Gini is a bad metric for my data because of:

➖ High cardinality bias (it has an inherent bias towards continuous variables, and some of my features are discrete)
➖ Gini importance is computed on the training data
➖ With two correlated features, the random forest randomly picks one of them at each split, so Gini ends up dividing the importance between the two
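The high-cardinality bias in the first point is easy to demonstrate: with two equally useless features, one continuous and one binary, Gini importance credits the continuous one far more, simply because it offers many more candidate split thresholds. A minimal synthetic example:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 2000

# Two equally useless features: column 0 is continuous, column 1 is binary.
X = np.column_stack([rng.normal(size=n), rng.integers(0, 2, size=n)])
y = rng.integers(0, 2, size=n)  # labels are pure noise: neither feature helps

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, y)

# Gini importance (computed on training data) favors the continuous column,
# even though neither feature carries any real signal.
print(model.feature_importances_)
```

Out-of-sample permutation importance would correctly assign both features an importance near zero here, since neither improves validation predictions.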

Additionally, the out-of-sample Area Under the ROC Curve (AUC) of 0.7566 is unrealistically good for predicting 5-minute Bitcoin price moves. This value implies that if you pick a random 5-minute winning window and a 5-minute losing one, my model ranks the winner window ≈76% of the time. Either I found a model that beats virtually every financial institution in existence, or there is a lookahead bias and overfitting happening in the model.
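That "ranks the winner ≈76% of the time" reading is exactly the probabilistic interpretation of AUC: it equals the fraction of (positive, negative) pairs where the positive example gets the higher score. A tiny check with toy scores (not the model's actual outputs):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores: 1 = "winning window", 0 = "losing window".
y_true = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.9, 0.7, 0.4, 0.8, 0.3, 0.2])

auc = roc_auc_score(y_true, scores)

# Brute-force the pairwise ranking probability (ties count as half).
pos, neg = scores[y_true == 1], scores[y_true == 0]
pairs = [(p > q) + 0.5 * (p == q) for p in pos for q in neg]
print(auc, np.mean(pairs))  # both equal 7/9: 7 of the 9 pairs are ranked correctly
```

So an AUC of 0.7566 literally claims the model wins roughly 3 out of every 4 such pairwise comparisons, which is why the number deserves suspicion.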

Conclusion for my meta forest: the “seconds_to_settle” feature is basically carrying the entire model 😂 So in its current state, the random forest is training almost entirely on time of day/time to expiration. The cleanup has started.
