This is Part 3 of a series of posts on understanding statistical models. This series is motivated by the media frenzy over the Oxford study (link) that claimed the majority of Britons had already been infected with the novel coronavirus by March 19, 2020.
In Part 2, we learned that the type of model built by the Oxford research team includes unknowable quantities, such as the date the virus first appeared in the UK, and the cumulative proportion of people already infected. These are unknowable in the sense that no ground truth will ever emerge. Because the values of these quantities are used in decision-making, we introduce mathematical modeling to estimate their likely values.
As Part 2 ended, I was in the middle of drawing connections between the conclusions of the study and the outputs of the statistical modeling. Before completing that discussion, I must first explain the Bayesian approach to modeling, which is the main topic of this post.
Here is our full program:
Part 1: Does the model explain or does it predict? (link)
Part 2: Overcoming the inutility of raw data (link)
Part 3: What is a Bayesian model? (this post)
Part 4: How is modeling vulnerable? (link)
Part 5: Models = Structure + Assumptions + Data (link)
Part 6: Key takeaways (link)
EXTRA: Commentary on the data graphics in the Oxford study (link) - New 4/15/2020
***
Part 3: What is a Bayesian model?
This is a good place to describe a Bayesian model. The Oxford modeling strategy uses an input for the date of introduction, which is when the novel coronavirus first landed in the UK. It also harnesses several other inputs, such as the proportion of people at risk of severe disease and the reproduction number (R0), which is how many people are infected by the average infectious person at the start of the epidemic. A set of equations links all the inputs to the output, in this case, the cumulative death counts.
The only quantity directly measured is the output, the daily cumulative death counts. We have official statistics for these. The equations then allow us to compute the output if we provide the inputs. But none of the inputs have available values. (Other available data such as cases or severe cases are not utilized in the Oxford study.)
What to do? If we think of these inputs as buttons and levers on a machine, we try different settings and evaluate the outputs to see if they meet acceptance requirements. The goal of modeling is to find which combination of inputs can generate the observed death counts, if only approximately (brazenly assuming that the set of equations linking them holds up).
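To make the buttons-and-levers picture concrete, here is a minimal sketch in Python. It is emphatically not the Oxford team's set of equations: the growth rule, the lag from infection to death, and every number in it are placeholders I made up just to illustrate the idea of searching over input settings until the simulated deaths resemble the observed deaths.

```python
import numpy as np

def simulate_cum_deaths(intro_day, r0, prob_death, n_days=80,
                        population=66_000_000, generation_days=5, death_lag=21):
    """Toy stand-in for the model equations: cumulative infections grow
    exponentially from intro_day (capped at the population size), and
    deaths follow infections after a fixed lag."""
    growth = r0 ** (1.0 / generation_days)          # implied per-day growth factor
    days = np.arange(n_days)
    cum_infected = np.where(
        days >= intro_day,
        np.minimum(growth ** (days - intro_day), population),
        0.0,
    )
    cum_deaths = np.zeros(n_days)
    cum_deaths[death_lag:] = prob_death * cum_infected[:-death_lag]
    return cum_deaths

# Pretend these are the official cumulative death counts (made-up numbers).
observed = simulate_cum_deaths(intro_day=10, r0=2.25, prob_death=0.001)

def misfit(intro_day, r0):
    """How far the simulated deaths fall from the observed counts."""
    return np.sum((simulate_cum_deaths(intro_day, r0, prob_death=0.001) - observed) ** 2)

# "Turn the dials": try many settings and keep the combination that best
# reproduces the observed death counts.
settings = [(i, r) for i in range(0, 30) for r in np.arange(1.5, 3.01, 0.25)]
best = min(settings, key=lambda s: misfit(*s))
print("Best-fitting (intro_day, R0):", best)
```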
For Bayesians, getting the process started with an initial setting is known as setting the "priors". A prior on an input is an assumption about its value based on our best knowledge. (Strictly for the nerds, it's an assumption on the distribution of values, not just on the average value.) For the reproduction number (R0), the grey model assumes a prior average of 2.25 (a normal distribution with mean 2.25 and standard deviation 0.025). The researchers cited a few scientific papers as the basis for this assumption.
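For concreteness, here is how such a prior could be written down with scipy. The Normal(2.25, 0.025) part comes straight from the paragraph above; the particular draws and probability below are just illustration.

```python
from scipy import stats

# The grey model's stated prior on R0: normal with mean 2.25 and sd 0.025.
# The tiny standard deviation encodes strong confidence that R0 is near 2.25.
r0_prior = stats.norm(loc=2.25, scale=0.025)

print(r0_prior.rvs(size=5, random_state=0))      # a few plausible "initial settings"
print(r0_prior.cdf(2.30) - r0_prior.cdf(2.20))   # prior probability that R0 sits between 2.2 and 2.3
```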
As a reminder, what I call the grey model is the one that produced outputs that were widely publicized by the Financial Times (link). Throughout the preprint's charts, this model is given a grey color.
Now, while the model is being fitted to replicate the death counts, all inputs will be varied from their initial settings. When the grey model is complete, we look at the R0 and learn that its final setting (posterior mean) is still around 2.25. [The researchers didn't provide the actual numbers but this can be inferred from Figure 1(C).]
Think of the posterior value of R0 as a compromise between the existing state of knowledge and the wisdom of the new data. The lack of movement suggests that the observed data adequately line up with the initial guess of R0.
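The "compromise" has a simple arithmetic in the textbook case of a normal prior combined with a normal data estimate: the posterior mean is a precision-weighted average of the two. The Oxford posterior comes out of the full set of equations, not this shortcut, and the data-only estimate below is a number I invented for illustration.

```python
# Textbook normal-normal pooling, just to show the "compromise" arithmetic.
prior_mean, prior_sd = 2.25, 0.025   # prior belief about R0 (from the grey model)
data_mean, data_sd = 2.35, 0.10      # hypothetical estimate implied by the data alone

prior_precision = 1 / prior_sd ** 2  # precision = 1 / variance
data_precision = 1 / data_sd ** 2

posterior_mean = (prior_precision * prior_mean + data_precision * data_mean) \
                 / (prior_precision + data_precision)
posterior_sd = (prior_precision + data_precision) ** -0.5

print(round(posterior_mean, 3), round(posterior_sd, 3))
# With a prior this tight and a data estimate this close, the posterior
# barely moves from 2.25 -- the "lack of movement" described above.
```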
Now back to the question: How do the modeling results support the study's conclusions?
Recall that the Financial Times report focused on two key metrics computed by the Oxford models: the date of introduction, and the cumulative proportion of infected.
For the date of introduction, the prior setting was chosen to be random (a uniform distribution). By setting this prior, the researchers admitted no special knowledge of this input. The technical term is a flat or uninformative prior. So here the observed data did all the work, and when the grey model was completed, the researchers found the date of introduction to be roughly 4 days before the first report of infection.
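In code form, a flat prior simply spreads equal probability over every candidate date. The window below is made up purely for illustration, not the one used in the study.

```python
from scipy import stats

# A flat ("uninformative") prior: every candidate introduction date in the
# window is equally plausible before the data are consulted.
# The 61-day window here is an arbitrary illustration.
intro_prior = stats.randint(low=-60, high=1)      # days relative to the first reported case

print(intro_prior.pmf(-45), intro_prior.pmf(-4))  # identical -- no date favoured a priori
```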
When the press breathlessly reported that the coronavirus had been spreading around quietly for over a month before the first reported death, that's the number from the Oxford study to which they were reacting. That's a silly way to frame it; it's more appropriate to say four days before the first reported case.
Even if the Oxford study didn't exist, the official statistics depict a gap of 34 days between the first case and the first death. It is quite a stretch to claim the coronavirus was spreading around "quietly" during the 34 of those 38 days when cases were already being reported. Oops, I may have killed the fun of sensational journalism yet again.
As with any statistic, there is a margin of error associated with the number 4. Again, the preprint didn't contain actual numbers; judging from Figure 1(E), I think a reasonable range is from 0 to 8 days. (If you are looking at the chart, I will get to the red and green models eventually. Just focus on the grey model for now.)
Notice that the cumulative proportion of infected - the quantity in the Financial Times headline - is hyper-sensitive to this date of introduction because the grey model suggests an extremely fast propagation through the population. Figure 1(B) shows half the country getting infected in a mere 20 days. Shifting the start of this curve by a few days changes the story quite a bit. If the curve is shifted forward by 4 days, the 50% mark would of course be reached four days later; on the originally claimed date, the proportion of infected would be some 30% lower than claimed!
[As you can see, the lines drawn on these charts are so terrifyingly thick that one can't be precise when discussing the findings.]
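A generic S-shaped curve reproduces this sensitivity. The sketch below is not the study's fitted curve; I picked a steepness that climbs from near zero to 50% in roughly 20 days, in the spirit of Figure 1(B), and then compare the proportion infected on a fixed date when the start is shifted by four days.

```python
import numpy as np

def prop_infected(day, start_day, days_to_half=20, steepness=0.35):
    """Logistic stand-in for the epidemic curve: near zero at start_day,
    reaching 50% of the population days_to_half days later."""
    return 1 / (1 + np.exp(-steepness * (day - start_day - days_to_half)))

assessment_day = 40
print(prop_infected(assessment_day, start_day=20))   # original start: about 50% infected
print(prop_infected(assessment_day, start_day=24))   # start 4 days later: about 20% infected
```

The drop from roughly 50% to roughly 20% on the same assessment date is the kind of 30-point swing described above.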
***
In Part 3, I explained how the Oxford team constructed their Bayesian models, and how the two key quantities - the date of introduction and the cumulative proportion of infected - arose from the modeling, and plugged straight into the Financial Times report.
In Parts 4 and 5 (link), I point out where this type of statistical modeling is vulnerable. I'm not bashing statistical modeling - I'm a believer and an insider. But I think you should be a smart consumer of such models, and I'd like to give you some pointers.
Continue to Part 4: How is modeling vulnerable?