This is **Part 2** of my series of posts about how to understand statistical models, such as the Oxford study that claimed over 50% of the UK population had been infected with coronavirus by March 19, 2020.

Many scientific findings derive from mathematical models. These are often difficult for non-specialists to understand. There are some general principles that can help you gain intuition. In Part 2, I discuss the role of modeling in data analysis.

**Part 1: Does the model explain or does it predict? (link)**

**Part 2: Overcoming the inutility of raw data (this post)**

**Part 3: What is a Bayesian model? (link)**

**Part 4: How is modeling vulnerable? (link)**

**Part 5: Models = Structure + Assumptions + Data (link)**

**Part 6: Key takeaways (link)**


**Part 2: Overcoming the inutility of raw data**

**2a. Raw data do not represent truth**

If raw data were sufficient, there would be no jobs for statistical modelers.

During the Covid-19 pandemic, the things governments measure, such as counts of cases and deaths, are "raw data". In this crisis, we quickly learned that **raw data do not represent objective truth**. To pick a recent news item more or less at random: a report from California said that 64 percent of the coronavirus tests there were still awaiting results, so raw tallies from California are highly misleading when compared with those of other states.

This is a core message of my book **Numbersense** (link). Throughout the book, in sections on U.S. News rankings, obesity, economic indicators, and fantasy football, I drew attention to how metrics are defined, their inherent subjectivity, and the attendant assumptions. Check out **Numbersense** (link) for more. Or, read a different example in my recent post on the inutility of raw data (link).

Models can be used to correct biases in imperfectly measured data. Survey weighting is a case study of such corrections. See my post about exit polls to learn about this use of models (here).
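To make the idea of weighting concrete, here is a minimal Python sketch of one simple form of it (often called post-stratification). All numbers and names, including `weighted_share`, are hypothetical and invented for illustration; real exit-poll weighting is more elaborate.

```python
def weighted_share(sample, population_share):
    """Reweight group-level results to match known population shares.

    sample: dict mapping group -> (respondents, respondents saying yes)
    population_share: dict mapping group -> share of the population (sums to 1)
    """
    total = 0.0
    for group, (respondents, yes) in sample.items():
        total += population_share[group] * (yes / respondents)
    return total


# Hypothetical survey: older respondents are heavily overrepresented.
sample = {"under_50": (200, 120), "50_plus": (800, 320)}
# Hypothetical population: the two groups are actually equal in size.
population_share = {"under_50": 0.5, "50_plus": 0.5}

# Raw share simply pools all respondents, ignoring who they are.
raw = sum(yes for _, yes in sample.values()) / sum(n for n, _ in sample.values())
weighted = weighted_share(sample, population_share)

print(f"raw: {raw:.0%}, weighted: {weighted:.0%}")
```

In this made-up sample, the raw share is 44% while the weighted estimate is 50%, because the raw tally gives too much voice to the overrepresented group.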

Another use of modeling is to **quantify essentially unknowable** things, and the Oxford study demonstrates this aspect.

A key sentence in the Oxford study is this one about one of their models: it "places the start of transmission at 4 days prior to first case detection and 38 days before the first confirmed death and suggests that 68% [of the UK population] would have been infected by 19/03/2020."

In this sentence, the researchers introduced two quantities into the conversation: the date on which the novel coronavirus first appeared in the UK, which may be earlier than the day of the first reported case ("**date of introduction**"); and the cumulative proportion of people who had been infected up to a certain date ("**cumulative proportion infected**").

The first quantity is unknowable unless they miraculously locate "Patient Zero". The second quantity is unknowable since the UK does not administer coronavirus tests to a random sample of the population at regular intervals.

It is easy to see why policy-makers want those two quantities despite their inherent unknowability. This is where mathematical **modeling** enters the picture. One task of modeling is to use the observable data (death counts) to infer the values of those unknowables. At the same time, this is where modeling is the most vulnerable, and I'll return to this point in Part 4.
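A back-of-the-envelope sketch shows how such an inference might work, and why it leans so heavily on assumptions. This is not the Oxford team's model: it assumes, purely for illustration, that a fixed fraction of infected people eventually die, and it ignores the delay between infection and death. The function name and every number are hypothetical.

```python
def implied_infected_share(cumulative_deaths, fatality_rate, population):
    """Infer the share of the population ever infected from a death count.

    Assumes (hypothetically) that deaths = infections * fatality_rate,
    with no lag between infection and death.
    """
    implied_infections = cumulative_deaths / fatality_rate
    # A share cannot exceed 100% of the population.
    return min(1.0, implied_infections / population)


# Hypothetical inputs, not real UK figures.
deaths = 100
population = 1_000_000

for rate in (0.01, 0.001):
    share = implied_infected_share(deaths, rate, population)
    print(f"assumed fatality rate {rate:.1%}: implied share infected {share:.1%}")
```

The same death count implies a tenfold larger share infected when the assumed fatality rate is ten times smaller. The observed data alone cannot say which assumption is right.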

**2b. How do the modeling results support the study's conclusions?**

The Oxford preprint presented three models. That's because the research team adopted a **Bayesian approach to modeling**, which does not seek the one "best" model to fit the data; this approach generates a collection of models with different settings that are deemed good enough. Each of the three models does a decent job replicating the cumulative death counts in the 15-day window, according to these researchers.
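The idea of keeping every "good enough" setting, rather than a single best fit, can be sketched in a few lines of Python. This is a crude accept/reject filter on a toy doubling curve, not the Oxford model and not a full Bayesian analysis; the functions, the tolerance, and the death counts are all hypothetical.

```python
def cumulative_deaths(day, start_day, doubling_time):
    """Toy curve: one death on start_day, doubling every doubling_time days."""
    if day < start_day:
        return 0.0
    return 2.0 ** ((day - start_day) / doubling_time)


def good_enough(observed, start_day, doubling_time, tolerance):
    """True if the toy curve stays within a relative tolerance of every observation."""
    for day, deaths in observed.items():
        predicted = cumulative_deaths(day, start_day, doubling_time)
        if abs(predicted - deaths) > tolerance * deaths:
            return False
    return True


# Hypothetical observations: day -> cumulative deaths.
observed = {10: 4, 12: 8, 14: 16}

# Candidate settings: every start day from 0 to 10, three doubling times.
candidates = [(s, d) for s in range(0, 11) for d in (1.0, 2.0, 3.0)]

# Keep all settings that reproduce the data to within 50%, not just the best one.
accepted = [(s, d) for s, d in candidates if good_enough(observed, s, d, tolerance=0.5)]

for start_day, doubling_time in accepted:
    print(f"start day {start_day}, doubling time {doubling_time} days")
```

More than one pair of settings passes the filter, and the accepted pairs disagree about when the epidemic started, even though each reproduces the same made-up death counts reasonably well.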

Recall the key quote from the preprint, and let's see how the grey model supports it. (Grey refers to the color selected for it in the charts.)

An output of the grey model says the coronavirus appeared in the UK 4 days before the first case was confirmed. So anyone assuming the time of the first reported case is the beginning of the epidemic is off by a few days. Is that bad?

To answer this question, we make a detour into the world of Bayesian models. Please proceed to **Part 3**.

Continue to **Part 3: What is a Bayesian model?**
