After the first round of tuning the random forest's parameters, it was time to optimize the features the random forest is trained on. I've already done a minor cleanup, but I hadn't yet tested the predictive importance of each feature in the random forest.
To optimize the random forest's features, I'm using Out-of-Sample (OOS) Permutation Feature Importance. The OOS approach consists of three core steps:
1️⃣ Train the random forest once on the training data.
2️⃣ Take out-of-sample validation data (the testing data), permute the values of a single feature (that is, shuffle the values within one column), and pass the result to the model trained in step 1.
3️⃣ A feature is important for the model if the model's predictive power drops significantly when that feature's values are randomly shuffled.
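The three steps above can be sketched with scikit-learn's `permutation_importance`. This is a minimal illustration, not my actual pipeline: the data is synthetic (`make_classification`), the `feature_i` names are placeholders, and the split is a random stratified one, whereas real price data would need a chronological split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix (assumption).
# With shuffle=False, columns 0-2 are informative and columns 3-5 are pure noise.
X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

# Random stratified split for the demo only; time-series data needs a chronological split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Step 1: train the forest once on the training data.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Steps 2 and 3: shuffle one column of the held-out data at a time and
# measure how much the out-of-sample AUC drops, averaged over 10 shuffles.
result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=0
)

# Rank features from the largest to the smallest mean AUC drop.
ranking = np.argsort(result.importances_mean)[::-1]
for i in ranking:
    print(f"{feature_names[i]}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```

A feature whose mean AUC drop is indistinguishable from zero is a candidate for removal.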
The "out of sample" part refers to the fact that the data used to train the model and the data used to evaluate it after permutation are distinct, which reduces the contribution of noise to the evaluation metrics. By default, scikit-learn ranks features by Gini importance (the impurity-based `feature_importances_` attribute of the fitted forest). Gini is a bad metric for my data because:
• It has a high-cardinality bias (it inherently favors continuous variables, and some of my features are discrete)
• It is computed on the training data, so it can reward features the model has merely overfit to
• With two correlated features, the random forest may pick either one at a given split, so Gini importance ends up divided between the two
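The first two points can be reproduced on toy data. The sketch below is an illustration under stated assumptions, not my real dataset: `binary_signal` and `continuous_noise` are hypothetical features, where the first one actually drives the label and the second one is pure random noise.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Hypothetical features: one predictive binary column, one pure-noise continuous column.
binary_signal = rng.integers(0, 2, size=n)
continuous_noise = rng.normal(size=n)

# The label copies the binary feature, with roughly 20% of the labels flipped.
flip = rng.random(n) < 0.2
y = np.where(flip, 1 - binary_signal, binary_signal)
X = np.column_stack([binary_signal, continuous_noise])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Gini (impurity-based) importance, computed from the training data.
gini = model.feature_importances_

# Permutation importance, computed on held-out data as the mean AUC drop.
oos = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=0
).importances_mean

for name, g, p in zip(["binary_signal", "continuous_noise"], gini, oos):
    print(f"{name}: gini={g:.3f}  oos_permutation={p:.3f}")
```

Fully grown trees can split on the continuous noise column many times to memorize the training labels, so it collects a large share of the Gini importance, while shuffling it on held-out data should barely move the AUC.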
Additionally, the out-of-sample Area Under the ROC Curve (AUC) of 0.7566 is unrealistically good for predicting 5-minute Bitcoin price moves. This value implies that if you pick a random winning 5-minute window and a random losing one, my model ranks the winning window higher about 76% of the time. Either I found a model that beats virtually every financial institution in existence, or there is lookahead bias and overfitting happening in the model.
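The pairwise-ranking reading of AUC can be checked numerically. The sketch below uses made-up labels and scores (not my model's output) and compares a brute-force count over all winner/loser pairs with scikit-learn's `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Made-up data: 1 = winning window, 0 = losing window.
y_true = rng.integers(0, 2, size=500)
# Made-up model scores that are shifted upward for winners.
scores = rng.normal(size=500) + 0.8 * y_true

winners = scores[y_true == 1]
losers = scores[y_true == 0]

# Compare every (winner, loser) pair: count a full point when the winner
# scores higher and half a point on an exact tie.
diff = winners[:, None] - losers[None, :]
pairwise_auc = (diff > 0).mean() + 0.5 * (diff == 0).mean()

auc = roc_auc_score(y_true, scores)
print(f"pairwise ranking rate: {pairwise_auc:.4f}")
print(f"roc_auc_score:         {auc:.4f}")
```

The two numbers should agree up to floating-point error, which is why an AUC of 0.7566 reads as "the winner is ranked higher about 76% of the time".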
Conclusion for my meta forest: the "seconds_to_settle" feature is basically carrying the entire model. So in its current state, the random forest is training almost entirely on the time of day/time to expiration. The cleanup has started.