This is the concluding post in a multi-part series on developing intuition about statistical models. We began the journey with the Financial Times article that trumpeted recent results from an Oxford research team, proclaiming that over half of the UK population had already been infected with the novel coronavirus.
Here is our full program:
Part 1: Does the model explain or does it predict? (link)
Part 2: Overcoming the inutility of raw data (link)
Part 3: What is a Bayesian model? (link)
Part 4: How is modeling vulnerable? (link)
Part 5: Models = Structure + Assumptions + Data (link)
Part 6: Key takeaways (this post)
EXTRA: Commentary on the data graphics in the Oxford study (link) - New 4/15/2020
In Part 6, I summarize the key takeaways, first for the Oxford study, and then for statistical models in general.
If you made it this far, I hope you've learned something new, and are more confident when you read about statistical models in the future.
You can support my work by:
- Sending this series of posts to a friend or colleague
- Getting a copy of my books: Numbers Rule Your World (link), and Numbersense (link)
- Joining my YouTube channel (link), where I'm putting up videos about data science topics
- Hiring me for consulting or coaching (link)
Kaiser
***
Part 6: Takeaways
6a. Takeaways about the Oxford study
In featuring the Oxford study (link), the media took a limited mathematical model and blew it out of proportion. The much-fussed-about herd immunity is primarily a feature of the SIR framework (Wikipedia), which is the governing structure of the Oxford models. Within the time horizon of the Oxford study, the 15 days after the first reported death, the proportion of susceptibles in the population has not yet stabilized, so if herd immunity were to emerge, it would do so beyond the scope of the analysis.
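To see why herd immunity is built into the structure rather than discovered in the data, here is a minimal sketch of the SIR dynamic in Python. This is not the Oxford model; the transmission and recovery rates are illustrative values I picked for the sketch.

```python
# Minimal discrete-time SIR sketch (illustrative parameters, not the Oxford model).
# New infections require contact between infecteds and the shrinking pool of
# susceptibles, so the epidemic stalls once susceptibles fall below ~gamma/beta.

def simulate_sir(beta=0.25, gamma=0.1, days=365, i0=0.001):
    """Return the fraction of the population still susceptible on each day."""
    s, i, r = 1.0 - i0, i0, 0.0
    susceptible = [s]
    for _ in range(days):
        new_infections = beta * s * i
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        susceptible.append(s)
    return susceptible

s_path = simulate_sir()
print(f"Susceptible after 15 days:  {s_path[15]:.3f}")   # barely moved from 1.0
print(f"Susceptible after 365 days: {s_path[-1]:.3f}")   # has levelled off
```

In a 15-day window, the susceptible share has barely moved, which is why any talk of herd immunity is an extrapolation of the structure, not a finding within the study's horizon.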
The headline number, the proportion infected by 19 March 2020, is highly sensitive to one input - the proportion of the population at risk of severe disease. No one has a good handle on this number. The Bayesian approach to modeling works around its unavailability by using a prior assumption. The headline number of over 50 percent already infected comes from the grey model, which assumes only 1 in 10,000 will ever become severely sick from Covid-19 in the UK. The proportion infected is cut in half with a more realistic assumption of 1 in 1,000, which is closer to the hospitalization rate reported in Italy.
The other metric that caught the media's attention is the 4-day gap between the arrival of the novel coronavirus and the report of the first case. This gap is also indirectly driven by the assumed low rate of severe disease. The entire gap disappears if the more realistic assumption of 1 in 1,000 is used. The preprint presents both scenarios but the media chose to ignore the less sensational one.
It's worth noting that model assumptions rather than the data drove these results. All models are required to fit the trend of deaths in the 15-day window. The grey model must compensate for the lower conversion of infected to severely sick by having more infections. More infections are induced by accelerating the timeline by four days. The majority of the prior assumptions chosen by the researchers have not been significantly modified by the data, reinforcing the impression that the assumptions dominate the results, with the actual data playing a minor role. This is to be expected when the bulk of the model comprises unknowable quantities, and the only data available to the model are the daily death counts.
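The compensation mechanism is simple arithmetic. Here is a stylized sketch with made-up numbers (not the Oxford figures): if the death data pin down the number of severe cases, a lower assumed severity rate forces the model to posit proportionally more infections.

```python
# Toy illustration of the compensation: same severe-case count, different
# assumed severity rates. All numbers are hypothetical.

severe_cases_implied_by_deaths = 5_000     # fixed by the death data in this sketch

for p_severe in (1 / 1_000, 1 / 10_000):   # assumed risk of severe disease
    infections_needed = severe_cases_implied_by_deaths / p_severe
    print(f"p_severe = {p_severe:.4f} -> infections implied = {infections_needed:,.0f}")
```

In the actual models, the compensation also works through an earlier epidemic start date, so the headline proportion infected moves less than pure proportionality suggests, but the direction of the effect is the same.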
The researchers made no attempt to forecast beyond the 15-day window, so any extrapolation beyond 19 March 2020 is done at your own peril. The predictive power of these Oxford models has not been validated. Of course, many of the unknowable quantities, such as the proportion already infected, will simply never be known.
6b. Takeaways about statistical modeling
As an insider and a believer in statistical modeling, I want you to see both the strengths and the vulnerabilities of our practice.
The Oxford study demonstrates one usage of models - to quantify essentially unknowable quantities. We do it because the values of these quantities are helpful for decision-making. In general, models allow us to make up for the shortcomings of raw data. Another common use of models is to correct biases in the raw data. The Oxford team, however, simply ignored the various problems with the data - starting with the under-counting of cases and deaths in the U.K. due to sparse testing.
Some models are used to explain the past, and other models predict the future. Explanatory and predictive models are not interchangeable. There are many models that explain the past well but will not predict the future. The predictive power of a predictive model must be validated.
A statistical model is the sum of its structure, assumptions and data. It's important to understand when the model's outputs are driven by its structure, or by assumptions. How important is the data in shaping the modeling results?
Don't be surprised when results are driven by the model structure. The classic example of this trap is running the SLOPE function in Excel, which computes the gradient of a straight line fitted to the data inputs, and then issuing a finding of the type "sales increase by X units per day". Excel computes that slope regardless of whether the data are linear.
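Here is a minimal sketch of the trap, using a made-up sales series and numpy's polyfit as a stand-in for Excel's SLOPE.

```python
# A straight-line slope gets computed whether or not the data are linear.
# The "sales" series below is made up and grows exponentially.
import numpy as np

days = np.arange(10)
sales = 2.0 ** days                      # doubling every day, nowhere near linear

slope, intercept = np.polyfit(days, sales, 1)
print(f"Fitted slope: {slope:.1f} units per day")   # about 44 units per day

# The straight-line structure imposes the "units per day" statement on data
# whose actual behavior is "sales double every day".
```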
Don't be surprised when results are driven by prior assumptions. My recent post about exit polls explains that when talking heads babble endlessly about demographic breakouts of voters, they are merely reiterating prior assumptions about who the voters are, not the raw data collected on election day.
The Bayesian approach taken by the Oxford team (also, the Imperial College team) works well for this problem because epidemiologists have a strong structural framework, which contains many inputs with no direct data. This approach works around the sparsity of data by setting prior assumptions on the inputs. The inputs are then linked to the output through a set of equations. The model modifies the priors of the inputs to generate an output that replicates as closely as possible the observed data. The final values of the inputs reflect a compromise between the assumptions and the new data.
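To make the prior-versus-data compromise concrete, here is a minimal Bayesian sketch using a conjugate Beta-Binomial update. It is not the Oxford model; the prior and the observed counts are numbers I made up to show the mechanic.

```python
# Minimal sketch of a Bayesian update: a prior assumption on an unobservable
# proportion is nudged by sparse data, and the posterior is a compromise.
# All numbers are illustrative.

prior_alpha, prior_beta = 2, 18            # prior mean = 2/20 = 0.10, "worth" 20 observations
successes, trials = 3, 10                  # made-up observed data, mean = 0.30

# Conjugate Beta-Binomial update
post_alpha = prior_alpha + successes
post_beta = prior_beta + (trials - successes)

prior_mean = prior_alpha / (prior_alpha + prior_beta)
data_mean = successes / trials
post_mean = post_alpha / (post_alpha + post_beta)

print(f"prior = {prior_mean:.2f}, data alone = {data_mean:.2f}, posterior = {post_mean:.2f}")
# With only 10 observations against a prior worth 20, the posterior (~0.17) sits
# closer to the prior than to the raw data - the assumptions dominate, just as
# the sparse death counts do little to move the Oxford priors.
```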
Using mathematical models to transcend raw data has its vulnerabilities. The unknowable components of these models usually remain unknown. Many predictive models turn out to be bad at predicting. (I recently showed that the extremely popular, simple exponential growth model used to fit the growth in Covid-19 cases in various countries is poor at predicting future cases.) For any problem with numerous inputs, there are many settings that fit the output data well, so we don't have an exact science with a single optimal answer. In the thick of things, it's easy to miss that an insight might not be driven by the data but by the structure imposed on the data, or by the assumptions embedded in the model.
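Validation is the antidote: hold out part of the data, fit on the rest, and check the forecasts. Here is a minimal backtest of a simple exponential growth fit on a made-up case series whose growth slows in the second week.

```python
# Backtesting sketch: fit an exponential growth model on the first week of a
# hypothetical case series, then compare its forecasts to the held-out days.
import numpy as np

cases = np.array([10, 14, 21, 30, 44, 60, 85, 110, 135, 150, 160, 165])  # made-up counts
train, test = cases[:7], cases[7:]

# Fit log(cases) ~ a + b * day on the training window only
days_train = np.arange(len(train))
b, a = np.polyfit(days_train, np.log(train), 1)

# Extrapolate to the held-out days and compare
days_test = np.arange(len(train), len(cases))
predicted = np.exp(a + b * days_test)
print("actual:   ", test)
print("predicted:", predicted.round(0))

# The exponential fit tracks the first week nicely but overshoots badly once
# growth slows - the model explained the training window without being able
# to predict beyond it.
```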
6c. Related Posts
The following posts go deeper into some of the points made in the current series of posts:
How to act like a data scientist 8: Don't use lagging indicators to forecast (Structural assumptions, exponential growth curves, validation)
Note about fitting and visualizing exponential models (Evaluating and interpreting explanatory models)
How to Act Like a Data Scientist 7: Recognizing and correcting biases in surveys and polls (Exit polls, assumptions driving outputs)
New video: Validating data science models, a case study with Covid-19 data (Validating predictive models, backtesting)
Chapter 2, Numbers Rule Your World (on statistical modeling)
***
Thank you for reading. If you have questions, please add them below.
For your convenience, here is the full program again:
Part 1: Does the model explain or does it predict? (link)
Part 2: Overcoming the inutility of raw data (link)
Part 3: What is a Bayesian model? (link)
Part 4: How is modeling vulnerable? (link)
Part 5: Models = Structure + Assumptions + Data (link)
Part 6: Key takeaways (this post)
EXTRA: Commentary on the data graphics in the Oxford study (link) - New 4/15/2020