I saw an interesting talk by Brad Efron on the differences between the tasks of prediction, estimation and attribution (link). He made some provocative points. The notes below include my interpretations of what he said. (Efron is most famous for inventing the bootstrap.)
To nail down the setting, let me describe the example he used. The data came from a clinical study in which there are 100 subjects, half of whom developed cancer. For each subject, genetic data were available, containing more than 6,000 variables. So, this is a case in which p > n, i.e. the number of variables is larger than the number of subjects.
A classical statistician approaches this problem as "estimation" of global (average) risk factors - the goal is to find a small set of genes that have high explanatory power: these may be single genes (main effects) or clusters of genes (interaction effects) that are strongly associated with subjects who developed cancer. The goal of such models isn't to predict the fate of individuals.
A machine-learning engineer tackles this as a prediction problem - the goal is to predict whether or not a subject will develop cancer. These complex prediction models do not lend themselves to simple interpretation, and so have little value as scientific explanations. Much of the initial progress in ML came from explicitly discarding the scientific imperative and focusing solely on prediction: who cares if the average is biased if the individual predictions are better?
Both groups frequently deploy similar analytical machinery. For example, they may run logistic regressions. Thus, many practitioners see estimation and prediction models as substitutes, basically different ways to solve one underlying problem. The estimation model is applied to individuals to generate what look like predictions, while the ML modeling software computes "variable/factor importance" scores, which are interpreted as global "risk factors".
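To make that overlap concrete, here is a minimal sketch - my own illustration with simulated stand-in data and scikit-learn, not anything from the talk - of a single logistic regression being read both ways:

```python
# Sketch: one logistic regression, read two ways (simulated data, not Efron's).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6000))      # 100 subjects, 6,000+ "genes" (p > n)
y = rng.integers(0, 2, size=100)      # 1 = developed cancer

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

# Estimation reading: which genes carry explanatory weight?
risk_factors = np.flatnonzero(model.coef_[0])    # genes with nonzero coefficients

# Prediction reading: what is each individual's predicted risk?
individual_risk = model.predict_proba(X)[:, 1]
```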
The value of Efron's presentation is to clarify that prediction and estimation are separate pursuits. At the end, he held out hope that a unifying theory may emerge but for now, it doesn't exist.
(I) Prediction errors from prediction models are often smaller than those from estimation models
Prediction models have proven their value in predicting outcomes. For Efron's example problem, a random forest model makes 1 error out of 50 in the test set while a GBM (gradient boosting) model makes 2 errors. Estimation models applied to make individual predictions do quite a bit worse.
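For flavor, here is a rough sketch of that kind of comparison - simulated stand-in data and scikit-learn defaults; the 1-in-50 and 2-in-50 error counts are Efron's, not something this toy code reproduces:

```python
# Sketch of the prediction comparison: 50/50 train-test split, count test errors.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6000))
y = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=50, random_state=1)

for name, clf in [("random forest", RandomForestClassifier(random_state=1)),
                  ("gradient boosting", GradientBoostingClassifier(random_state=1))]:
    clf.fit(X_tr, y_tr)
    errors = (clf.predict(X_te) != y_te).sum()
    print(f"{name}: {errors} errors out of {len(y_te)}")
```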
(II) Prediction models have dubious value in estimation
Prediction models ignore bias, but that's not what Efron focused on. Prediction models are "explained" by looking at variable/factor importance values. Something like this:
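Such charts are typically built from the model's built-in importance scores; here is a minimal sketch of how one gets produced (my illustration on simulated data, not Efron's code):

```python
# Sketch: how a typical "variable importance" chart is produced (illustration only).
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 6000))
y = rng.integers(0, 2, size=100)

rf = RandomForestClassifier(random_state=2).fit(X, y)
top10 = np.argsort(rf.feature_importances_)[::-1][:10]   # ten "most important" genes

plt.barh([f"gene {i}" for i in top10], rf.feature_importances_[top10])
plt.xlabel("importance score")
plt.show()
```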
It turns out that these charts are misleading.
Since Efron had 6,000+ variables, he could afford to throw out a bunch of them. When he removed the top 10 variables and re-built the random forest (or boosting) model with the remaining variables, the new model had similarly strong predictive power - except, of course, the set of most important variables had changed. The same thing happened when he removed the top 50, top 100, etc.
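Here is a sketch of that removal experiment, again on simulated stand-in data (Efron's actual results are not reproduced):

```python
# Sketch of the removal experiment: rank variables by importance, drop the top k,
# refit, and compare test accuracy.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 6000))
y = rng.integers(0, 2, size=100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=50, random_state=3)

full = RandomForestClassifier(random_state=3).fit(X_tr, y_tr)
ranking = np.argsort(full.feature_importances_)[::-1]    # most important first

for k in [0, 10, 50, 100]:
    keep = ranking[k:]                                    # drop the top-k variables
    rf = RandomForestClassifier(random_state=3).fit(X_tr[:, keep], y_tr)
    print(f"top {k} removed: test accuracy {rf.score(X_te[:, keep], y_te):.2f}")
```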
So, in fact, the set of "important" variables is not important in the general sense but important only to the particular model.
This finding implies that such tabulations of variable importance are useless. They don't contribute to "explaining" or "interpreting" the models.
(III) Prediction is "easier" than estimation
Efron noted that estimation models attempt to fit the average value while prediction models fit individual values. Specifically, the accuracy of prediction models is computed on a "test" dataset.
A test dataset can be thought of as a sample from the underlying population. In the cancer example, the boosting or random forest models are evaluated on 50 patients randomly selected from the set of 100. Repeated random selection produces different test sets of 50, which yield different values of predictive accuracy. Some of the prediction error therefore comes from the variability of the test datasets. This portion of the prediction error does not depend on the predictive model itself. As a result, many predictive models end up with similar error rates.
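A quick sketch of that source of variability - refit and re-score one model over repeated random 50/50 splits of the same simulated stand-in data:

```python
# Sketch: repeated random 50/50 splits give a spread of test accuracies,
# a source of error that has nothing to do with the model itself.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6000))
y = rng.integers(0, 2, size=100)

accuracies = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=50, random_state=seed)
    rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    accuracies.append(rf.score(X_te, y_te))

print(f"test accuracy: mean {np.mean(accuracies):.2f}, sd {np.std(accuracies):.2f}")
```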
The estimation error does not depend on test data variability since estimation models are judged against the average. With estimation, there may be a few models that perform markedly better than others, and the modeler has to find those. This is one reason Efron said prediction is "easier".
Efron hopes that people making estimation models can figure out how to make more accurate predictions, and that people making prediction models can figure out how to draw scientific knowledge from them.
I would like to understand this distinction better.
Variable importance scores only tell us which variables the predictive model found useful. In a problem with many highly-correlated predictor variables, obviously you can remove a lot of them at random and still be able to fit a good predictive model.
As an ML person trying to become a better statistician, can you explain to me how this is different from finding "a small set of genes that have high explanatory power"?
Is the real difference that geneticists seek a causal model rather than a predictive one? In that case, we need to do more than just regressions, right?
Posted by: Thomas Dietterich | 06/22/2022 at 04:22 PM
TD: Yes, that's how I read it as well. With an "estimation" model, the goal is to identify genes that are causally related to the outcomes, and it's less about predicting individual outcomes, more about fitting the average outcome. The genes identified may lead to development of new treatments but only if they are causal in nature. The modern predictive models are good at aggregating a variety of small signals; each small signal may not be the causal mechanism but merely correlated with it - such signals I think would not be useful for developing new treatments. His bigger point is that both sides have thought their models can serve both functions but it turns out not to be true - for now.
Posted by: Kaiser | 06/22/2022 at 04:52 PM