This is Part 5 of a multi-post series on developing intuition about statistical models. This series is inspired by the media's coverage of the Oxford study, which argued that over 50% of the UK population had already caught the novel coronavirus by March 19, 2020. This result sharply contrasted with earlier modeling by a research team at Imperial College that was credited with influencing the UK government's response to the coronavirus epidemic.
In Part 4 (link), I discussed some of the vulnerabilities of mathematical modeling. In Part 3 (link), I highlighted the reasons for building such models; specifically, I outlined the Bayesian approach to modeling. I highly recommend you read those posts first.
Part 5 is the continuation of Part 4. In a sense, we are getting to the meat of the matter.
Please comment below if anything is not clear.
Here is our full program:
Part 1: Does the model explain or does it predict? (link)
Part 2: Overcoming the inutility of raw data (link)
Part 3: What is a Bayesian model? (link)
Part 4: How is modeling vulnerable? (link)
Part 5: Models = Structure + Assumptions + Data (this post)
Part 6: Key takeaways (link)
EXTRA: Commentary on the data graphics in the Oxford study (link) - New 4/15/2020
***
Part 5: Models = Structure + Assumptions + Data
At the end of Part 4, I noted that a key driver of the modeling results is the input known as the proportion of the population at risk of severe disease. This input drives the number of deaths through a "funnel": UK population -> infected -> severely diseased -> dead.
5a. Are assumptions or data driving the results?
It turns out that all three models in the Oxford study end up producing the same death rate among the severely ill of around 15 percent -- despite the grey model assuming one-tenth the proportion of the population at risk of severe disease relative to the red and green models.
Now we can explain why the grey model places the date of virus introduction four days earlier than the red and green models do. The grey model must accelerate infections to compensate for its much lower infected-to-severely-ill transition rate. Doing so delivers the same number of severely diseased people, and then about 15 percent of them pass away. Recall that all three models are required to fit the death counts.
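To see the compensation at work, here is a minimal sketch of the funnel arithmetic. The death count and rates below are invented for illustration, not the study's calibrated values:

```python
# Hypothetical funnel arithmetic: deaths = infected x share severe x death rate.
deaths_observed = 200      # invented death count that every model must fit
death_rate_severe = 0.15   # roughly the rate all three models end up implying

for share_severe in (0.001, 0.0001):  # 0.1% (red/green) vs 0.01% (grey)
    # Fewer severe cases per infection means more infections are needed
    # to produce the same number of deaths.
    infected_needed = deaths_observed / (share_severe * death_rate_severe)
    print(f"at-risk share {share_severe:.2%}: "
          f"{infected_needed:,.0f} infections needed")
```

A ten-times-smaller at-risk share forces ten times as many infections to match the same deaths, which is why the grey model has to start the epidemic earlier.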
The green model is essentially the grey model with 0.1% of the population assumed to be at risk of severe disease, instead of 0.01%. In the green model, the date of introduction did not shift earlier by four days, and the cumulative proportion exposed by March 19, 2020 was 36% rather than 68% -- roughly half as much. So the credibility of the Oxford study rests largely upon whether we believe that only 0.01% of the population will ultimately require hospitalization (which is the definition of "severe" used in the UK).
The researchers examined two settings for this input - one ten times larger than the other - probably because they found no prior knowledge of its value. In each model, the posterior value did not shift away from its prior assumption. So the model outputs essentially reflect the model assumptions.
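Here is a minimal sketch of what "the posterior did not shift away from the prior" looks like, using a toy grid approximation (the prior below is invented for illustration). When the data cannot distinguish between values of an input -- because another input can always be adjusted to fit the deaths -- the likelihood is flat in that input, and the posterior simply reproduces the prior:

```python
import numpy as np

# Grid of candidate values for the at-risk-of-severe-disease proportion.
s_grid = np.linspace(0.0001, 0.001, 200)

# Invented prior belief, centered where the modeler assumed it should be.
prior = np.exp(-0.5 * ((s_grid - 0.0005) / 0.0001) ** 2)
prior /= prior.sum()

# If the model can always rescale infections to match the observed deaths,
# every value of s fits the data equally well: a flat likelihood.
likelihood = np.ones_like(s_grid)

posterior = prior * likelihood
posterior /= posterior.sum()

print(np.allclose(prior, posterior))  # True: the data never moved the prior
```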
This situation is similar to the effect of survey weighting in exit polls, which I previously discussed here (link). With regard to the demographic breakdown, exit poll results essentially reflect the assumptions the poll designers made prior to election day.
5b. Is the model structure driving the results?
The second - and perhaps more easily neglected - issue is the assumed structure of the models. Even if everyone accepts the logic behind the prior assumption about the proportion falling severely ill, the model's outputs are constrained by the set of equations that link the inputs to the output.
As an analogy, think about the knob in a hotel shower. Some of these knobs let you control the temperature of the water but not its volume. The way the modeler sets up the equations eliminates certain possibilities, just as the shower knob's designer removes our ability to get more water out.
The Oxford model is based on the SIR framework (i.e. Susceptible-Infectious-Recovered). Under this framework, everyone is initially susceptible (S), then anyone who gets infected (I+R) is taken out of the susceptible population, never to return. In other words, no one can ever get re-infected. The framework assumes immunity upon recovery.
Does the model support the herd immunity theory? It does, because immunity is baked into the structure of the model. (Ironically, the SIR framework is not used to model influenza because people don't become immune to it after being infected once.)
There are more assumptions baked into the SIR framework. Another is that one's infectious status does not change one's chance of contacting another infectious person: if 20 percent of the population is infectious, then 20 percent of anyone's daily contacts are with infectious people, regardless of whether one is infectious oneself. The sketch below makes both assumptions concrete.
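Here is a minimal discrete-time SIR sketch (the parameters are invented for illustration, not the Oxford study's calibrated values). Both structural assumptions appear as specific terms in the update equations: homogeneous mixing, because new infections are proportional to I/N for everyone, and permanent immunity, because people flow one-way from S to R.

```python
# Minimal discrete-time SIR sketch with hypothetical parameters.
N = 66_000_000           # a UK-sized population
beta, gamma = 0.55, 0.2  # assumed transmission and recovery rates (R0 = 2.75)
S, I, R = N - 1.0, 1.0, 0.0

for day in range(365):
    # Structural assumption 1: homogeneous mixing -- everyone's exposure
    # to infectious people is proportional to I/N.
    new_infections = beta * S * I / N
    # Structural assumption 2: the infected move one-way to R and never
    # return to S, i.e. immunity is permanent by construction.
    recoveries = gamma * I
    S -= new_infections
    I += new_infections - recoveries
    R += recoveries

print(f"Never infected: {S / N:.1%}")
```

The run ends with a nonzero share of the population never infected. That is herd immunity, and it emerges not from the data but from the one-way flow built into the structure.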
The point here is not to damn a model for having structural assumptions. Structural models play a big role in statistics. In fact, models that don't impose a structure, say a black box model, may contain unknown structures. Assuming no structure is also an assumption.
Once a structure is chosen, many modelers focus their attention only on the knobs and levers - the tunable assumptions - forgetting about the existence of structural assumptions. That's a common oversight in modeling. Hard constraints that are baked into the structure of the model are not affected by data. They are removed only if the analyst alters the structure of the model.
The simplest case of this trap is the "add trendline" function in Excel, which fits a straight line through a two-dimensional dataset. By building such a model, the analyst has imposed a linear structure. There is nothing inside the analysis that can be tuned to escape this constraint, as the sketch below shows.
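Here is the same trap outside Excel, with hypothetical data invented for illustration: a straight line fitted to data generated by a curved process. No tuning of the two fitted coefficients can recover the curvature, because the chosen structure rules it out.

```python
import numpy as np

# Hypothetical data from a curved (quadratic) process.
x = np.linspace(0, 10, 50)
y = x ** 2

# The "trendline" structure: y = slope * x + intercept, and nothing else.
slope, intercept = np.polyfit(x, y, deg=1)

# The best straight line still misses the curvature entirely; the only
# escape is to change the structure (e.g. deg=2), not to tune the knobs.
print(f"best straight line: y = {slope:.1f}x + {intercept:.1f}")
```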
The SIR framework is a simple abstraction of the world, and that's why it's attractive to modelers. Many of the structural constraints can be removed, but this inevitably involves a tradeoff. As a rule, removing structural assumptions makes the model structure more complex, meaning even more assumptions. More priors.
***
In Part 5, you learned that a statistical model can be decomposed into three components: a structure that imposes hard constraints on what the model outputs could say; assumptions within the structure that hopefully are externally validated; and raw data that condition those assumptions. It's important to understand how much of the model output is driven by the data versus the assumptions versus the structure.
The Oxford study utilizes the SIR framework, a structure in which all infected people will become immune (if they don't die). This structural assumption allows the emergence of "herd immunity".
By the way, the structure of structural models affects the vernacular. I was struck by the fact that when discussing "herd immunity", few people talk about deaths. We keep hearing that if a substantial proportion of the population becomes immune, then the remainder would not get infected. Surely, the more people get infected with the coronavirus, the more people will die. Then I realized that the basic SIR framework does not explicitly model deaths (link to Wikipedia). Deaths are lumped into the "recovered" (R) compartment; the dead are simply treated as previously infected and no longer infectious. Mathematically, they are no different from those who recovered and became immune.
In the last post of this series, Part 6 (link), I provide a summary of key takeaways.
Continue to Part 6: Key Takeaways.
Or, go back to review:
Part 1: Does the model explain or does it predict? (link)
Part 2: Overcoming the inutility of raw data (link)
Part 3: What is a Bayesian model? (link)
Part 4: How is modeling vulnerable? (link)