In December 2016, EY launched a worldwide internal data science challenge on datascience.net – an online platform for machine learning and AI competitions. The goal of such a challenge is to train mathematical models to predict a target value from correlated variables (base features). The quality of the predictions is measured by a scoring function. Competitors are usually given two datasets: the train set – containing the target and used to train models – and the test set – containing only the features, on which predictions are to be made.
The dataset used for the competition was a series of fitness-related metrics measured by a smartwatch during cyclists’ training sessions. The competition took place from December 16, 2016 to January 05, 2017. With over 850 contributors competing for the win, top teams relentlessly tuned their models until the very last day. After a couple of failed submissions and some tricky modeling issues, our team – “EY Lab Paris” – managed to win the challenge. Let’s dive into the models and data engineering techniques that got us the best predictions.
Week 1: Understanding the dataset and feature engineering
The variables and targets given for the challenge are described in the table below:
Unlike most Kaggle or datascience.net competitions, the target (missing value) was uniformly distributed across three possible columns: speed, power and cadence. The train set had 105,000 observations and the test set had 45,000 (15,000 for each missing column). The scoring function was a measure of the percent error.
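The exact scoring function was not detailed here, but a “measure of the percent error” is commonly implemented as the mean absolute percent error (MAPE) – a minimal sketch, assuming the competition used this standard form:

```python
import numpy as np

def mean_absolute_percent_error(y_true, y_pred):
    """Mean absolute percent error, in percent -- lower is better."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# toy check: predictions that are each off by 10% score 10.0
print(mean_absolute_percent_error([10, 20], [11, 22]))  # 10.0
```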
100% on speed! Wow!
After reading through the dataset a couple of times, we realized that both the travelled distance and the elapsed time were given. A couple of hours of research later, we finally came up with this complex formula: speed = distance/time. We got 100% accuracy on speed predictions. At this point, since there was no private test set, we could add the 15,000 rows where speed was no longer missing to the train set, leaving only 30,000 predictions to make, on cadence and power. Yay! Fortunately for the challenge, there were no such hacks for the two remaining columns.
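The trick amounts to a one-liner – a sketch with hypothetical column names (`distance`, `time`, `speed`), since the dataset’s actual names are not reproduced here:

```python
import pandas as pd

# hypothetical column names and toy values -- the real dataset differs
test = pd.DataFrame({
    "distance": [1000.0, 2500.0],   # metres travelled
    "time":     [120.0,  300.0],    # elapsed seconds
    "speed":    [None,   None],     # the "missing" target
})

# recover the target exactly: speed = distance / time
test["speed"] = test["distance"] / test["time"]

# rows whose speed is now known can be appended to the train set,
# leaving only the cadence and power predictions to make, e.g.:
# train = pd.concat([train, test_speed_rows], ignore_index=True)
print(test["speed"].tolist())  # both rows: ~8.33 m/s
```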
We decided to try simple models with practically no tuning to see which ones performed best: Linear Regression, K-Nearest-Neighbors, SVM, Random Forest, etc. Random Forest did quite well: with default parameters and 500 trees, it scored a 37% error. We then used this baseline model to understand which features improved our predictions. After some hard work, we came up with the following features:
- Rider features:
- Rider dummy variables
- Descriptive stats on original features (avg, std, max)
- Session features:
- Detection of “training sessions”: two different observations could simply be two measurements of a rider’s metrics over different (but possibly overlapping) periods of the same training session.
- Descriptive stats on original features (avg, std, max) within a session
- Weighted moving average (M.A.) on original features (capturing future and previous information from other observations – weights decrease the further in time the observations are from our current observation)
- Squared difference of features with their M.A. to capture strong variations in underlying data
- Other tricks:
- Detection of failures/outliers
- 10 first projections of a PCA applied on all explanatory variables
Note that three of these tricks would be unrealistic in real life:
- Exploiting future information: future information is usually not available in production
- Using target variables as explanatory variables: adding the two given target columns to the feature set to predict the missing one
- Predicting failed measures: a data scientist is unlikely to be asked to predict failed measurements
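To make the session features concrete, here is a minimal sketch of the weighted moving average, assuming hypothetical `session_id` and `timestamp` columns and exponentially decaying weights – the exact weighting scheme we used may differ:

```python
import numpy as np
import pandas as pd

def session_ma_features(df, col, half_life=3.0):
    """Weighted moving average of `col` within each session: every
    observation (past AND future) contributes, with weights decaying
    exponentially as the time gap grows."""
    out = np.empty(len(df))
    for _, grp in df.groupby("session_id"):
        t = grp["timestamp"].to_numpy(dtype=float)
        x = grp[col].to_numpy(dtype=float)
        # |t_i - t_j| gap matrix -> exponentially decaying weights
        w = np.exp(-np.abs(t[:, None] - t[None, :]) / half_life)
        out[grp.index] = w @ x / w.sum(axis=1)
    return out

# toy data: two sessions, hypothetical columns
df = pd.DataFrame({
    "session_id": [1, 1, 1, 2, 2],
    "timestamp":  [0, 1, 2, 0, 5],
    "power":      [200.0, 220.0, 500.0, 150.0, 160.0],
})
df["power_ma"] = session_ma_features(df, "power")
# squared difference with the MA flags strong local variations
df["power_dev2"] = (df["power"] - df["power_ma"]) ** 2
```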
Week 2: Tuning XGBoost
Now that we were quite confident our features captured a lot of information about the data, it was time to find the best model to make the predictions. Which model performs well in competitions? XGBoost. And so we started tuning an XGBoost model. Since Random Forest performed reasonably well, we were not surprised to get better results with XGBoost: both are tree ensembles, but XGBoost is more powerful because instead of building its trees independently, each new tree learns from the errors of the previous ones (gradient boosting). A fine-tuned XGBoost model trained on our fancy features scored 27.7%, which turned out to be enough to win the challenge.
Week 3: Stacking
To push performance even further, another common winning technique is stacking. The idea is that combining models that each perform well overall but fail on different parts of the data yields a stronger learner: for any given observation, at least one base model is likely to make a good prediction.
“Stacking (also called meta ensembling) is a model ensembling technique used to combine information from multiple predictive models to generate a new model. Often times the stacked model (also called 2nd-level model) will outperform each of the individual models due to its smoothing nature and ability to highlight each base model where it performs best and discredit each base model where it performs poorly. For this reason, stacking is most effective when the base models are significantly different.” – Wikipedia
We stacked four different models: Random Forest, XGBoost, a Perceptron, and k-NN. Although we didn’t have enough time to tune the stack nearly as well as the winning XGBoost, it scored 28.2%, and we are confident that with more time for tuning it would have outperformed XGBoost alone.
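A stack like this can be sketched with scikit-learn’s StackingRegressor – a modern convenience; the base models and parameters below are illustrative stand-ins, with XGBoost omitted to keep the example self-contained:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

# base learners err on different parts of the data; a 2nd-level model
# (here a simple Ridge) learns how to weight their predictions
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
        ("mlp", MLPRegressor(hidden_layer_sizes=(32,), max_iter=500,
                             random_state=0)),
    ],
    final_estimator=Ridge(),
    cv=3,  # out-of-fold base predictions avoid leaking the target
)

# toy data standing in for our engineered features
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=300)
stack.fit(X, y)
print(stack.score(X, y))  # in-sample R^2, just a smoke test
```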
Lesson to be learned: feature engineering matters
Once the submission deadline had passed and the winning solutions were shared, it was very interesting to see that the teams ranked second and third used roughly the same algorithms as we did: a tuned XGBoost and stacking. Although parameter tuning probably increased our lead, it cannot account for the whole score gap between us and the second-place team. Clever feature engineering is most likely what won us the challenge.