Antonio forwarded this Guardian article to me, with the headline “coronavirus may have been in Wuhan in August, study suggests.” This headline is based on a preprint promoted by Harvard, available here.
The report introduced two pieces of evidence: (a) occupancy of major hospitals increasing and (b) web search volume for Covid-related keywords increasing. The data sources were respectively satellite images of parking lot traffic, and search trend data from the leading search engine in China. This type of data analysis is typical of “big data,” which I define using the OCCAM criteria (see here).
I had come across this report on a different site and planned to write about it as the latest example of what I call “story time”. It’s a feeling I frequently experience in the midst of reading studies. As I start to doze off from the lullaby of the data presentation, the researchers move seamlessly into creative narration: they have called “story time.” After I wake up, I realize that the story is only minimally connected with the data.
One of the telltale ingredients of “story time” is proxies. The most important sentence in the preprint is this one: “Between 2018 and 2020, there was a general upward trend of increased hospital occupancy as measured by the parking lot volume proxy.” (my emphasis). Everyone should master the skill of “proxy unmasking”.
What must be true in order for parking lot volume to be an effective proxy for hospital occupancy? An incomplete list includes:
- People who are sick drive to one of those six hospitals in Wuhan.
- People who drive to these hospitals park their vehicles in the hospitals’ parking lots.
- People who park in these parking lots are patients, not staff or other people who are, say, visiting places nearby.
- These hospitals have enough capacity in their parking lots.
- People who visit the hospital are themselves sick, not relatives or friends of someone who is sick.
- People who visit these hospitals will occupy hospital beds.
- People who are hospitalized keep their vehicles in the parking lots until they are discharged.
- People who are hospitalized have the novel coronavirus (an as-yet-unknown pathogen at the time).
Notice that the data from satellite images convey none of this information. These conditions are simply assumed without comment. I sensed “story time” when I realized that most of these assumptions are contradicted by my own experience. Even if I believed them, I would still distinguish between data and assumptions.
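To see why a chain of unstated assumptions matters, here is a minimal sketch in Python. Every number in it is made up purely for illustration; none comes from the preprint or from any data about Wuhan. The link names and the `fraction_relevant` and `implied_cases` helpers are hypothetical, and the links are treated as independent, which is itself an assumption.

```python
# Hypothetical fractions: every number below is invented for illustration,
# not estimated from the preprint or from any hospital data.
assumed_fractions = {
    "car belongs to a hospital visitor (not staff or passers-by)": 0.5,
    "visitor is the sick person (not a relative or friend)": 0.5,
    "sick visitor is admitted to a hospital bed": 0.2,
    "admitted patient has the novel coronavirus": 0.1,
}


def fraction_relevant(fractions):
    """Multiply the per-link fractions, treating the links as independent."""
    result = 1.0
    for value in fractions.values():
        result *= value
    return result


def implied_cases(car_count, fractions):
    """Number of parked cars that would map to the quantity of interest."""
    return car_count * fraction_relevant(fractions)


share = fraction_relevant(assumed_fractions)
print(f"Share of parked cars tied to a coronavirus inpatient: {share:.3%}")
print(f"Out of 1,000 cars: about {implied_cases(1000, assumed_fractions):.0f}")
```

Change any one of the made-up fractions and the implied count moves proportionally. The satellite images supply only the car count; everything else in the calculation is assumption.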
The same proxy unmasking applies to the web search data.
It's really a tough task to prove that one's featured story is the right story, as opposed to merely a story. For example, here is the chart showing the trend in search volume for "diarrhea":
The story of the preprint focuses on the first and third vertical dashed lines. There is a small spike in the red line ("diarrhea") around the time of the first line (August) and then a huge spike around the time of the third line (January 23, when the first case was identified). The blue line for "cough" is taken by these researchers as a "control," on the grounds that its trend is explained by seasonal influenza. Given that cough is still a key symptom of Covid-19, one should expect the volume of "cough" searches to become elevated when both influenza and Covid-19 are affecting search volume. Further, the period to the right of the third vertical dashed line was the epidemic period, when more people were worried about community spread. What, then, can explain the dramatic drop in searches?
***
When the quality of data is lower, the standard of analysis must be set higher. A chef needs a bag of tricks when working with lower-quality ingredients. Lower-quality data cannot be an excuse for lower standards. Quite the contrary: "big data," that is, OCCAM data, demands more stringent benchmarks. Is this counterintuitive?
There are 3 laws of epidemiology:
1. There must be something in the data.
2. If there isn't something, our analysis must be wrong, so change the predictors and definitions and anything else.
3. If you keep at it long enough, you will find something. People who don't find anything aren't trying hard enough.
At a place where I briefly worked, the senior statistician had been through some data and grouped the predictors in many possible ways, for example comparing groups A+B+C to D+E. She then told the clinician that if he could justify one of the significant ones, then he had a paper. I left very quickly.
Posted by: Ken | 08/31/2020 at 08:38 PM
Ken: I'm sure you know Andrew Gelman's blog is the encyclopedia of exposed nonsense published and peer-reviewed.
Posted by: Kaiser | 08/31/2020 at 11:56 PM