Last week (two weeks ago when this is posted), there was an Oxford-Imperial College dustup in the British media. From the news, I heard that an Oxford team led by Professor Sunetra Gupta published a new set of projections, claiming that the alarming forecast by Imperial College Professor Neil Ferguson, who was credited with influencing UK and US coronavirus response, was overly pessimistic.
This story can be traced back to an article in the Financial Times, headlined "Coronavirus may have infected half of UK population -- Oxford Study" (March 24, 2020). At the time of the Oxford study (March 20), there were only 3,983 reported cases and 177 reported deaths in the UK (note: the UK had only tested about 65,000 people at that time). If 33 million Britons had already been infected, then only about 1 in 10,000 infected people were sick enough to be identified as cases, and only about 5 out of a million infected people had died. Naturally, I received emails from people who believe the coronavirus crisis is a hoax, arguing that the Oxford study proved that Covid-19 is just like influenza.
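(If you want to check that arithmetic yourself, here is a quick back-of-the-envelope sketch in Python, using only the figures quoted above: 3,983 cases, 177 deaths, and the assumed 33 million infections.)

```python
# Back-of-the-envelope check of the ratios quoted above,
# using the figures reported around the time of the Oxford study.
reported_cases = 3_983          # UK reported cases (March 20)
reported_deaths = 177           # UK reported deaths (March 20)
assumed_infected = 33_000_000   # roughly half the UK population

cases_per_10k = reported_cases / assumed_infected * 10_000
deaths_per_million = reported_deaths / assumed_infected * 1_000_000

print(f"Cases per 10,000 infected: {cases_per_10k:.1f}")         # about 1.2
print(f"Deaths per million infected: {deaths_per_million:.1f}")  # about 5.4
```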
The FT article proclaimed that "the Oxford results would mean the country had already acquired substantial herd immunity through the unrecognized spread of Covid-19 over more than two months." Herd immunity is the bystander theory of letting the virus spread to a majority of the population. (The proponents of this concept find it absolutely distasteful to mention its implied consequence of intentional deaths.)
FT then leaped to public policy, saying that "If the findings are confirmed by testing, then the current restrictions could be removed much sooner than ministers have indicated." This is classic story time: it's like saying "if string theory is confirmed by experiment, then..." We're lulled with some data, and then it's story time; the storyteller hopes we doze off before noticing that the story has strayed far from the evidence.
The only thing in the Oxford study itself related to public policy is found in the last paragraph of its introduction: "This relationship can be used to determine how many people will require hospitalisation (and possibly die) in the coming weeks if we are able to accurately determine current levels of herd immunity."
Most of the public are not immersed in mathematical modeling. Most journalists are not schooled in statistics. So it's damn hard to know what's going on. I hear you. That's why I've written a guide to reading the Oxford study. I will paint a broad picture of how statistical models are made and highlight the key issues for you to consider. While the Oxford study serves as a running example, I intend this primer to be useful in a general sense.
To relieve readers who might not want to read an extremely long blog post (more than 6,000 words), I've broken it up into six posts. All six parts are simultaneously posted today. Read as much or as little as you like. Or spread out the reading over a few days.
Here's the full program:
Part 1: Does the model explain or does it predict? (this post)
Part 2: Overcoming the inutility of raw data (link)
Part 3: What is a Bayesian model? (link)
Part 4: How is modeling vulnerable? (link)
Part 5: Models = Structure + Assumptions + Data (link)
Part 6: Key takeaways (link)
EXTRA: Commentary on the data graphics in the Oxford study (link) - New 4/15/2020
Enjoy!
Kaiser
***
Part 1: Does the model explain or does it predict?
Is the study about the past or the future?
On reading the published preprint (link), I discovered that it is far more reserved than the news stories suggested, and crucially, the Oxford study does not even supply a forecast.
A forecast should be about the future. The Oxford study concerns 15 days of history in the UK and Italy, starting from the respective day each country reported their first death from Covid-19. The mathematical model could in theory make forecasts for days beyond those 15 but none was offered in the preprint, where all charts end on March 6 (Italy) or March 20 (UK). These researchers may have speculated about the future to reporters, but they did not have the conviction to publish actual predictions.
The Oxford model is an attempt to explain the past. Analyzing the past is also important work. Consider the following scenario.
A mass shooting occurred at a workplace, leaving a dozen dead. A data analyst at the FBI was given the name of the murderer and tasked with explaining why he did it. The analyst discovered that he had recently sent dark, threatening messages to co-workers on Facebook. What the analyst built is a model that explains the behavior of one known criminal.
Here came the media! Breaking news. Really really important. Red letters flashing: "The FBI missed red flags about the mass shooter." A panel of experts questioned why FBI agents didn't knock on the guy's door the minute he sent those Facebook messages. They could have prevented a dozen deaths!!!
What these paid talking heads had in mind was a forecasting model that predicts the identity of the future mass shooter based on monitoring Facebook threats. They thought the FBI analyst had built such a model, but they were mistaken. Predicting the future is a different problem from explaining the past. An explanatory model is usually bad at forecasting.
The data analyst knew for sure that the individual under investigation was a mass shooter, so he could start with this person's file and trace his activities. On the other hand, if the analyst were asked to identify future mass murderers, he would have to begin with a population of millions, and eventually select a few.
Sure, sending threats on Facebook may be a risk factor. But many Facebook users who have sent threats will never unleash a shooting spree, and furthermore, not all mass shooters were Facebook bullies. The explanatory tool that helped us understand the one shooter does not perform well in predicting future murderers. That's why statisticians make a distinction between these two types of models.
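To see why, plug in some purely made-up round numbers (they are hypothetical, chosen only to illustrate the base-rate problem, not taken from any real dataset):

```python
# Purely hypothetical round numbers, chosen only to illustrate the
# base-rate problem: a "red flag" that seems obvious in hindsight
# produces a flood of false positives when used to predict.
users_flagged_for_threats = 1_000_000   # hypothetical pool flagged by the rule
future_shooters_in_pool = 10            # hypothetical number who go on to offend

precision = future_shooters_in_pool / users_flagged_for_threats
print(f"Chance that a flagged user is a future shooter: {precision:.6f}")  # 0.000010
```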
A predictive model must be validated, as I explained last week in this short video (link). Validation establishes the model's ability to predict the future; without it, the model may only explain the past.
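In practice, validation means holding back part of the data: fit the model on the earlier observations, then check how well it predicts the ones you held back. Here is a minimal sketch of that mechanic, with hypothetical daily death counts and a deliberately simple exponential-growth fit (nothing here is the Oxford model):

```python
import numpy as np

# Hypothetical daily death counts over 15 days (illustration only).
days = np.arange(15)
deaths = np.array([1, 1, 2, 3, 4, 6, 8, 11, 15, 21, 28, 38, 51, 68, 90])

# Hold out the last 5 days; fit a simple exponential-growth model
# (a straight line on the log scale) to the first 10 days only.
train_days, test_days = days[:10], days[10:]
train_deaths, test_deaths = deaths[:10], deaths[10:]

slope, intercept = np.polyfit(train_days, np.log(train_deaths), 1)
predicted = np.exp(intercept + slope * test_days)

# The out-of-sample comparison below -- not the in-sample fit --
# is what tells us whether the model can actually forecast.
for d, observed, pred in zip(test_days, test_deaths, predicted):
    print(f"day {d}: observed {observed}, predicted {pred:.0f}")
```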
When reading about "models", first figure out whether the model is designed to explain the past or to forecast the future. (Or figure out which goal you care more about.) The Oxford study makes no mention of dates beyond the last date of the dataset and therefore, their models are not suitable for forecasting. So when I discuss their findings, I focus on how well their models explain what has happened so far.
***
To continue reading, please proceed to Part 2 (link) to learn about the role of mathematical modeling in data analysis. Why do we need models? (Hint: this may have something to do with the inutility of raw data, which is the topic of another recent post.)
Continue to Part 2: Overcoming the inutility of raw data