Data science, in general, is poorly taught. There are so many concepts that float out there but are hardly explained. I mentioned one yesterday in the post about applying a model on data that are only partially representative of the training data. Today, I'm discussing the difference between an explanatory model and a predictive model.
These two types of models are not interchangeable. A model that explains the past well does not necessarily predict the future. Because predictive models are built using historical data, it can be confusing why they don't double as explanatory models.
***
This post is inspired by a webinar I tuned into earlier in the week on sports analytics. This webinar is part of a slew of free sessions hosted by JMP.
One of the talks was about basketball rivalries such as Virginia vs Maryland. The presenter showed a decision tree model that "predicts" the winner of the next contest. For example, an important factor was holding opponents to under 59 points.
I put "predict" in quotes because of the common confusion between a predictive model and an explanatory model. Let's examine that last sentence again:
When a team holds opponents to under 59 points, it has a high chance of winning the game.
When interpreted as an explanatory model, the meaning is: among all past Virginia-Maryland matches, the teams that held opponents to under 59 points had a better winning record than those that let the opponent score 59 points or more.
When interpreted as a predictive model, the meaning is: for any future Virginia-Maryland match, the team that holds its opponent to under 59 points has a higher chance of winning the game.
An explanatory model does not need to be predictive. It's backward looking.
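To make the backward-looking reading concrete, here is a minimal sketch of the kind of summary an explanatory model produces. The game records are invented for illustration (they are not from the webinar), and the names `past_games` and `win_rate` are my own.

```python
# Hypothetical past results for one team: (points_allowed, won).
# These numbers are made up purely for illustration.
past_games = [
    (55, True), (58, True), (52, True), (57, False),
    (63, False), (70, False), (61, True), (66, False),
]

THRESHOLD = 59  # the split point mentioned in the talk


def win_rate(games):
    """Share of games won; returns None for an empty group."""
    if not games:
        return None
    return sum(1 for _, won in games if won) / len(games)


# Split completed games by whether the opponent was held under the threshold.
held_under = [g for g in past_games if g[0] < THRESHOLD]
allowed_more = [g for g in past_games if g[0] >= THRESHOLD]

rate_under = win_rate(held_under)
rate_over = win_rate(allowed_more)

print(f"held under {THRESHOLD}: {rate_under:.2f}")
print(f"allowed {THRESHOLD}+:  {rate_over:.2f}")
```

Every input here is a final score, so the comparison can only be made for games that have already finished. That is what makes it a description of the past and not a forecast.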
The problem with using this decision tree as a predictive model is that it issues a prediction only after the match is over. In order to know how many points the teams' opponents have scored, and other game statistics, we must wait till the end of the game.
To build a proper predictive model, we must specify when the prediction is required. Perhaps at half time. If so, only first-half statistics may be used for prediction. Obviously, such a model will not explain the past as well. If the goal is to explain the past, you use full-time statistics. If it is to predict the future, you use partial statistics.
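One way to enforce this is to tag every statistic with the moment its value becomes known, and let the prediction time decide which inputs are allowed. The sketch below assumes a hypothetical feature catalogue; the feature names, `AVAILABLE_AT`, `TIMELINE`, and `usable_features` are all illustrative, not taken from the presenter's model.

```python
# Hypothetical feature catalogue: each statistic is tagged with the
# earliest moment in the game at which its value is known.
AVAILABLE_AT = {
    "first_half_points_allowed": "half_time",
    "first_half_turnovers": "half_time",
    "final_points_allowed": "full_time",
    "final_rebounds": "full_time",
}

# Moments in a game, earliest first.
TIMELINE = ["pre_game", "half_time", "full_time"]


def usable_features(prediction_time):
    """Return the features whose values are already known at prediction_time."""
    cutoff = TIMELINE.index(prediction_time)
    return sorted(
        name
        for name, moment in AVAILABLE_AT.items()
        if TIMELINE.index(moment) <= cutoff
    )


# A half-time predictive model may only see first-half statistics.
predictive_inputs = usable_features("half_time")

# An explanatory model of completed games may use everything.
explanatory_inputs = usable_features("full_time")

print(predictive_inputs)
print(explanatory_inputs)
```

With this catalogue, the half-time list excludes both full-game statistics, while the full-time list includes all four. The two models are fitted on different inputs, so there is no reason to expect them to produce the same tree.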
The key point is that these two models will not be identical.