This is **Part 4** of a multi-post series about understanding **statistical models**. Our point of departure is the Oxford study, which sparked controversy over its projections of the severity of the coronavirus epidemic in the UK.

In **Part 3**, I explained the nature of Bayesian modeling, which is at the core of the Oxford study. Recall the headline of the Financial Times article about the Oxford study: "Coronavirus may have infected half of UK population." It most likely came from a key sentence in the preprint: "[our grey model] places the start of transmission at 4 days prior to first case detection and 38 days before the first confirmed death and suggests that 68% [of the UK population] would have been infected by 19/03/2020."

Throughout these posts, I refer to their **"grey" model** by the color these researchers use to represent it on all their charts.

In **Part 2**, I suggested that mathematical modeling transcends raw data, which rarely provide objective truth. Many quantities valuable for policy-making cannot be directly measured. Models such as those used in the Oxford study put values on these unknowable quantities.

In this post and continuing in **Part 5**, I cover various **critiques of statistical modeling**. These are the reasons why you might not want to buy a model without looking inside. On balance, I'm a believer and an insider. But feel free to ask questions below if you are not convinced.

Here is our full program:

**Part 1: Does the model explain or does it predict? (link)**

**Part 2: Overcoming the inutility of raw data (link)**

**Part 3: What is a Bayesian model? (link)**

**Part 4: How is modeling vulnerable? (this post)**

**Part 5: Models = Structure + Assumptions + Data (link)**

**Part 6: Key takeaways (link)**

**EXTRA: Commentary on the data graphics in the Oxford study (link) - New 4/15/2020**

***

**Part 4: How is modeling vulnerable?**

I left you hanging earlier with a comment that modeling of this type is vulnerable. As I hope you can tell by now, these models squeeze a lot of juice out of a semi-dry lemon (which happens to be the only lemon left on the shelf).

**4a. Unknowable is not known**

You may realize that most of the quantities discussed in the previous posts are unknowable. The curve of the proportion of susceptibles in the population [Figure 1(B) in **Part 3**], for example, is unknowable. It will never be known, even after the pandemic is in the rear-view mirror. Even if we can estimate the total proportion of people who were infected at the end of the crisis (which itself is next to impossible given inadequate testing and the use of triage), it is still impossible to learn what the cumulative proportion was at any point during the crisis. We simply can't turn back the clock!

This is where the **science gives way to a debate**. A proper debate should center on identifying what assumptions are made; understanding how much each assumption matters to the conclusion; determining whether historical data inform the key assumptions; and, if so, deciding whether we believe history will repeat itself.

Don't criticize a modeler for making assumptions. That won't take you far.

Take R0 as an example. (R0 measures how many people the average infectious person infects at the start of an outbreak.) Someone can reasonably borrow the value of R0 from a similar past epidemic. Doing so is making the assumption that the current coronavirus behaves similarly to that proxy virus from the past. Is that a reasonable assumption or not? That's the debate.
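To see why the borrowed R0 matters so much, here is a minimal sketch, not the Oxford model, using the textbook SIR final-size relation z = 1 - exp(-R0 * z), where z is the fraction of the population eventually infected (assuming homogeneous mixing and no interventions):

```python
import math

def final_attack_rate(r0, tol=1e-10):
    """Solve z = 1 - exp(-R0 * z) by fixed-point iteration.

    z is the fraction of the population eventually infected in a
    simple SIR model with basic reproduction number R0, assuming
    homogeneous mixing and no interventions."""
    z = 0.5  # initial guess
    for _ in range(1000):
        z_new = 1.0 - math.exp(-r0 * z)
        if abs(z_new - z) < tol:
            break
        z = z_new
    return z

# Borrowing R0 from different past epidemics changes the conclusion a lot:
for r0 in (1.5, 2.0, 2.5, 3.0):
    print(f"R0 = {r0}: eventual attack rate = {final_attack_rate(r0):.1%}")
```

Plausible R0 values borrowed from different past epidemics lead to very different conclusions, which is exactly why the assumption deserves debate.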

Data science students are often shocked to learn that science is debatable. Perhaps this is because their teachers tell them data are objective, then train them to follow a flowchart (i.e. a workflow) for constructing models, which hides the complications. I understand the didactic decision: the class may get lost in the weeds, debating issue after issue, and never reach the end-goal of building the model. But until these students recognize what assumptions they are making, including those made implicitly, they will be hard pressed to defend their own models.

**4b. Predictive should predict**

Precisely because most of the outputs from these models are unknowable, serious researchers spend a lot of time thinking about how to "validate" their models if they intend to forecast. The overriding question is: **if our model fits the past well, does it predict the future?** The point seems counter-intuitive: if I have a set of equations that almost perfectly replicate the cumulative death counts in the first 15 days, I ought to have strong confidence that the model will predict the next 15 days. Anyone who builds models for a living knows this is a fallacy. Just ask any trader who has found patterns in past prices.

What all traders know: **a minimum standard for accepting a predictive model is "backtesting"**, that is, testing its performance on a **validation** dataset the model has never seen. To learn more about how to validate data science models, see my short video (link).
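For readers who want to see backtesting in miniature, here is a sketch on made-up cumulative death counts (illustrative numbers, not the actual UK data): fit a simple exponential curve on the first 10 days, then score it on the held-out last 5 days.

```python
import math

# Hypothetical cumulative death counts for 15 days (illustrative only)
deaths = [2, 3, 5, 8, 12, 18, 27, 40, 60, 88, 130, 190, 275, 395, 560]

def fit_exponential(ys):
    """Least-squares fit of log(y) = a + b*t, i.e. y = exp(a) * exp(b*t)."""
    ts = list(range(len(ys)))
    logs = [math.log(y) for y in ys]
    n = len(ts)
    t_bar = sum(ts) / n
    l_bar = sum(logs) / n
    b = sum((t - t_bar) * (l - l_bar) for t, l in zip(ts, logs)) / \
        sum((t - t_bar) ** 2 for t in ts)
    a = l_bar - b * t_bar
    return a, b

# Backtest: fit on the first 10 days only, score on the held-out last 5 days.
train, test = deaths[:10], deaths[10:]
a, b = fit_exponential(train)
predicted = [math.exp(a + b * t) for t in range(10, 15)]
errors = [abs(p - y) / y for p, y in zip(predicted, test)]
print("mean holdout error:", sum(errors) / len(errors))
```

The fit that looked excellent in-sample drifts off on the held-out days; that gap, not the in-sample fit, is what tells you how much to trust the forecast.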

How did the Oxford team validate their model? The short answer is they didn't.

The fundamental problem they faced is that they used the 15 days of death counts to find the "right" settings for the inputs, where "right" was decided by how well the equations replicated those same death counts. So if we ask whether the "predicted" deaths for those 15 days matched the official counts, the answer is yes, but only because the exercise is circular.

As I mentioned in **Part 1** (**link**), I don't consider the Oxford models predictive: their preprint never offered a forecast. So the media should not treat the study as one.

**4c. Everyone gets a medal**

I mentioned before that the preprint presented three models (grey, red and green). **All three models supposedly offer reasonable fits** to the cumulative death counts in the first 15 days after the first death was reported.

The red and green models paint a different picture from the grey model. Specifically, the key sentence cited at the start of this post would have to be re-written as: the model "places the start of transmission at **the date of** first case detection and **34** days before the first confirmed death and suggests that **36-40%** [of the UK population] would have been infected by 19/03/2020."

This highlights another feature of mathematical modeling. **This is not an exact science**. When we have so little data (just a small number of daily deaths), and so many tunable inputs, there are many settings that will fit the output similarly. [Go back to **Part 3** (link) to learn about the Bayesian modeling paradigm.]
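A toy illustration of this point, using a generic logistic curve rather than the Oxford model: two wildly different assumptions about the eventual epidemic size can produce nearly identical curves over the first 15 days, because both are still deep in the exponential phase.

```python
import math

def logistic(t, k, r, y0):
    """Logistic cumulative curve: eventual size k, growth rate r, seed y0."""
    return k / (1 + (k / y0 - 1) * math.exp(-r * t))

# Two very different assumptions about the eventual epidemic size...
curve_a = [logistic(t, 0.68, 0.25, 1e-6) for t in range(15)]
curve_b = [logistic(t, 0.38, 0.25, 1e-6) for t in range(15)]

# ...are nearly indistinguishable over the first 15 days.
max_gap = max(abs(a - b) for a, b in zip(curve_a, curve_b))
print(f"max gap over first 15 days: {max_gap:.2e}")
```

With so little early data, the data alone cannot tell these settings apart; only the assumptions do.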

The key driver of the difference among the Oxford models is the input called "**proportion of population at risk of severe disease**". This number is crucial because the number of deaths is computed as the UK population multiplied by the proportion infected, multiplied by the proportion with severe disease, multiplied by the death rate of the severely diseased. (For the nerds, each factor is a conditional probability.) Not surprisingly, this is yet another unknowable number.
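Here is that chain of conditional probabilities spelled out with illustrative numbers; the death rate among the severely diseased is hypothetical, not taken from the preprint:

```python
# Illustrative only: how a death count is assembled from a chain of
# conditional probabilities. The last factor is a hypothetical value.
uk_population = 66_000_000
p_infected = 0.68         # grey model: proportion infected by 19/03/2020
p_severe = 0.001          # grey model: proportion at risk of severe disease
death_rate_severe = 0.14  # hypothetical death rate among the severely diseased

deaths = uk_population * p_infected * p_severe * death_rate_severe
print(f"implied deaths: {deaths:,.0f}")
```

Because the factors multiply, an error in any one of them, such as the proportion at risk of severe disease, scales the final death count proportionally.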

The grey model, which suggested almost 7 out of 10 Britons had already been infected, assumed that 0.1% of the population is at risk of severe disease. The red and green models, which suggested about 4 out of 10, assumed instead that 10 times as many people are at risk.

The researchers cited no sources for their choice of these priors, and the choice seems contradicted by reports from the ground. As of March 31, 2020, hospitalizations in Italy already amounted to 0.08% of the total population, close to the grey model's assumed 0.1%, and Italy may not have reached its peak yet. (The UK defines "severe disease" as requiring hospitalisation.)

In practice, **multiple models will have similar statistical properties, and their relative value may come down to which makes better assumptions**. "Better" means more realistic, and that's a judgment call.

***

In **Part 4**, you learned that statistical modeling is not infallible. By their very nature, models frequently deal with unknowable quantities, unknown not just before the modeling process but long after it. Most models that explain the past well do not predict the future, so we should not grab any explanatory model and force it to issue predictions. A proper predictive model must be validated. For any complex problem, there will be multiple models with similar statistical properties, and our choice among them comes down to judging the assumptions they contain.

The discussion of why mathematical modeling is vulnerable continues in **Part 5**.

**Continue to Part 5: Models = Structure + Assumptions + Data**

**Or, go back to:**

**Part 1: Does the model explain or does it predict? (link)**

**Part 2: Overcoming the inutility of raw data (link)**

**Part 3: What is a Bayesian model? (link)**
