Rachel Thomas's article came across my Twitter feed. It caught my attention because of its click-baity title, "How (and why) to create a good validation set."
Or rather, I thought it was click bait, but she is really serious about this. (For those not familiar with the literature, we don't use all historical data to build machine learning models. The historical data are split, typically at random, into training and validation sets. The validation set is supposed to simulate new data the algorithms haven't seen before, a sort of honest check of the model.) She makes some alarmist claims here:
- there is such a thing as a "poorly chosen" validation set
- random selection is not a good way to make a validation set, a "poor choice for many real-world problems"
- the analyst should manufacture a validation set
- the validation set should be representative of future, currently unseen data
Even though I don't like any of her advice, I can't disagree with her diagnosis:
An all-too-common scenario: a seemingly impressive machine learning model is a complete failure when implemented in production. The fallout includes leaders who are now skeptical of machine learning and reluctant to try it again.
***
One of the examples given is a response function that has a time trend.
If the model does not detect the trend, the predictions will indeed have poor accuracy on real-world data. She claims that a validation set based on a pre-post time split is better than a random selection.
Since this is a simple linear trend, either way of making the validation set will capture the trend. So what makes the model fail in production is not the presence of this trend but a shift of the trajectory after the model is deployed. But the choice of validation set won't help prevent the problem.
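Here is a quick sketch of that point (my own toy example, not from her post): simulate a simple linear trend, fit a plain linear regression under either a random split or a pre-post time split, and compare validation error. The numbers and the 80/20 cutoff are illustrative assumptions.

```python
# Toy comparison of random vs pre-post validation splits on data with a
# stable linear time trend. Everything here is a constructed assumption.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
t = np.arange(n).reshape(-1, 1)             # time index as the sole predictor
y = 0.05 * t.ravel() + rng.normal(0, 1, n)  # linear trend plus noise

def rmse(model, X, y):
    return np.sqrt(np.mean((model.predict(X) - y) ** 2))

# Random split
X_tr, X_va, y_tr, y_va = train_test_split(t, y, test_size=0.2, random_state=0)
m_rand = LinearRegression().fit(X_tr, y_tr)

# Pre-post time split: train on the first 80%, validate on the last 20%
cut = int(0.8 * n)
m_time = LinearRegression().fit(t[:cut], y[:cut])

print("random split validation RMSE  :", rmse(m_rand, X_va, y_va))
print("pre-post split validation RMSE:", rmse(m_time, t[cut:], y[cut:]))
# With a stable linear trend, both splits yield similar validation error; the
# model "fails in production" only if the trajectory shifts after deployment.
```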
The downside of the pre-post split shows up when there are many time-varying predictors. A naive example: if an on-off switch just happens to be flipped at the time split point, then all your training examples have the "on" condition while all your validation examples have "off".
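To make that confound concrete, here is a tiny constructed example (hypothetical, just to show the mechanics):

```python
# Hypothetical on-off confound: a binary "switch" feature that flips exactly
# at the pre-post cutoff, so training never sees one of its two values.
import numpy as np

n, cut = 1000, 800
switch = (np.arange(n) >= cut).astype(int)   # 0 before the cutoff, 1 after

print("training values of switch  :", np.unique(switch[:cut]))   # only [0]
print("validation values of switch:", np.unique(switch[cut:]))   # only [1]
# The model never sees switch == 1 during training, so the validation set
# asks for an extrapolation the training data cannot support.
```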
Manufacturing the validation set to reflect some unknown future trends creates a conceptual difficulty. The training set is now materially different from the validation set, so why would we expect the trained model to perform well on the validation set? And how much degradation in validation set performance is considered the right price to pay for potentially better in-market performance? That question boils down to how much you want to generalize from the data, and it sits at the core of the statistical view of the modeling problem.
***
The subtext of the article is that if the model doesn't work, fix the data. I tend to want to fix the model. If it doesn't work in production because the nature of the time trend has shifted, then adjust the model to include the new time trend.
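As a rough sketch of what "adjust the model" might look like, suppose we suspect the trend bent at some point: adding a hinge term at that changepoint lets a linear model track the new trajectory. The changepoint location and the simulated data below are assumptions for illustration only, not a prescription.

```python
# "Fix the model" sketch: the trend slope changes at a known shift point,
# and a hinge feature lets the regression bend there. All values assumed.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n, shift = 1200, 1000
t = np.arange(n)
# slope 0.05 before the shift, -0.02 after, plus noise
y = 0.05 * np.minimum(t, shift) - 0.02 * np.maximum(t - shift, 0) + rng.normal(0, 1, n)

X_old = t.reshape(-1, 1)                                  # time only
X_new = np.column_stack([t, np.maximum(t - shift, 0)])    # time plus hinge

for name, X in [("time only   ", X_old), ("time + hinge", X_new)]:
    m = LinearRegression().fit(X, y)
    in_sample_rmse = np.sqrt(np.mean((m.predict(X) - y) ** 2))
    print(name, "in-sample RMSE:", round(in_sample_rmse, 3))
# The time-only model misses the bend; the hinge version tracks the new trend.
```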
Diagnosing the difference between production data and historical data is part of good model hygiene. It's very hard to predict unexpected shifts in the data, and even if you could, you wouldn't have any training data to support them.
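One simple way to do that hygiene check, sketched below with made-up data (my illustration, not a recipe from either post): compare each feature's distribution in the training data against a recent production batch, for instance with a two-sample Kolmogorov-Smirnov test, and flag features that appear to have drifted.

```python
# Drift diagnosis sketch: compare training vs production feature distributions
# with a two-sample KS test. Feature names and distributions are made up.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
train = {"tenure": rng.normal(24, 6, 5000), "spend": rng.gamma(2, 50, 5000)}
prod  = {"tenure": rng.normal(30, 6, 1000), "spend": rng.gamma(2, 50, 1000)}

for name in train:
    res = ks_2samp(train[name], prod[name])
    flag = "DRIFT?" if res.pvalue < 0.01 else "ok"
    print(f"{name:8s} KS={res.statistic:.3f} p={res.pvalue:.4f} {flag}")
# A flagged feature is a cue to re-specify or refit the model -- the
# "fix the model" route -- rather than to re-engineer the validation set.
```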
The "data fix" is not the solution. Refining one's model is.
P.S. While I don't agree with designing your validation set, I do advise selecting your historical dataset carefully and thinking about which units to include in or exclude from the modeling process, which Rachel discusses at the end of her post.