Last week, I read the new study by Harvard researchers (link) who found “evidence” in parking lot images and search engine traffic that the novel coronavirus started circulating in Wuhan in August 2019. This study showcases the phenomenon of “story time”. Crucially, these datasets do not contain an outcome variable (i.e., the thing you're trying to explain). The connection between the signals (parking lot images and search traffic statistics) and the outcome (Covid cases) is established only tenuously, by coincidence of timing.
Any data series about Wuhan that shows an unusual spike in the months before December 1, 2019 – the date of the first confirmed case – can be used to establish the “true” date of arrival by following the exact analytical plan used by these researchers. What elevates one such data series above another is the assumption that it is a better proxy for Covid cases than the others.
For argument’s sake, let’s say the search traffic trend for “feeling dizzy” spiked around September 2019. In spirit, this pattern is exactly the kind of evidence that buttressed the Harvard study – the only difference is the assertion that search traffic for “diarrhea” is a proxy for Covid cases while search traffic for “feeling dizzy” isn’t. But the only justification for using search traffic for “diarrhea” as a proxy for Covid cases is that it fits the story. Is it the only thing that fits the story? If “feeling dizzy” were accepted instead, the researchers would have concluded that the arrival date was September rather than August.
***
Even if these novel datasets can be trusted, the patterns themselves do not tell the story without accepting a host of unverified assumptions (see last week’s post for a list). In this post, I take a deeper look at how the trend of parking lot traffic was analyzed. The preprint provides a glimpse into the nature of surveillance data that are being touted as the "new oil".
Here is the key chart from the Harvard study:
The orange line is the trendline fitted to the data. The vertical axis is labeled “relative car volume,” which is an index. This means the raw data – counts of cars in satellite images of the parking lots – have been converted into an index, where a value of 1.0 means the traffic on that day is in line with the average traffic on similar days. Typically, one would just compare Monday’s traffic to the historical Monday average, and so on. These researchers instead divided the week into three groups – weekdays, Saturday, and Sunday – so, oddly, Monday’s traffic is compared to the historical average of all weekdays. Is it really true that parking lot traffic is indistinguishable across weekdays at these six Wuhan hospitals? I doubt it.
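To make the indexing concrete, here is a minimal sketch of how such a "relative car volume" index might be computed. This is my own illustration, not the researchers' code; the numbers, column names, and the weekday/Saturday/Sunday grouping are assumptions based on the description above.

```python
import pandas as pd

# Hypothetical daily car counts for one hospital (numbers invented for illustration)
df = pd.DataFrame({
    "date": pd.to_datetime(["2019-09-02", "2019-09-03", "2019-09-07", "2019-09-08"]),
    "cars": [105, 98, 64, 55],
})

# Group days the way the study appears to: all weekdays pooled, plus Saturday and Sunday
def day_group(d):
    if d.dayofweek < 5:
        return "weekday"
    return "saturday" if d.dayofweek == 5 else "sunday"

df["group"] = df["date"].apply(day_group)

# Relative car volume: each day's count divided by the historical mean of its day group
df["relative_volume"] = df["cars"] / df.groupby("group")["cars"].transform("mean")
print(df)
```

Note that under this grouping, a Monday is benchmarked against the average of all weekdays; a finer version would benchmark Mondays against historical Mondays only.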
In any case, when the relative car volume has a value of 1.1, it means on that day, the parking lot volume is 10% above the usual volume on comparable days. Based on the chart above, on August 1, the traffic was right around 1.0, and on September 1, it was just under 1.1. So, I don't think this orange line provides strong support for the conclusion that the virus landed in Wuhan in August.
The trendline is a loess smoother, which means that each value is a weighted average of the raw data within a "local" time window centered at the date of the value. So, what these researchers produced is smoothing (a form of averaging) on top of indexing (over subgroup averages). The raw data have been smoothed twice. Smoothing data is useful for discovering long-term trends – it is not the proper tool to look for specific points in time.
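For readers unfamiliar with loess/lowess, here is a rough sketch of what the smoothing step does – again my own illustration with made-up numbers, not the study's code; the noise level and window fraction are assumptions.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Made-up indexed series: 140 unevenly spaced observation days over an 842-day period
rng = np.random.default_rng(0)
x = np.sort(rng.choice(842, size=140, replace=False)).astype(float)
y = 1.0 + 0.1 * np.sin(x / 120) + rng.normal(0, 0.08, size=x.size)

# Each fitted value is a locally weighted average of roughly frac * n nearby points;
# a wider frac gives a smoother line but blurs the timing of any turning point
smoothed = lowess(y, x, frac=0.3)  # returns an (n, 2) array of (x, fitted value)
print(smoothed[:5])
```

The trade-off is exactly the issue here: the wider the local window, the smoother the line, and the less precisely it can pin down when a change actually began.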
To their credit, the researchers also show the individual data series of the six Wuhan hospitals for which satellite images were analyzed. Those background lines are very strange indeed. If I am interpreting these lines properly, then there were fewer than 15 data points (satellite images) per hospital during the full year between May 2018 and May 2019, only about 1.25 observations per month per hospital!
The full analysis period ran from Jan 9, 2018 to Apr 30, 2020, with a total of 140 observations across 6 hospitals, according to the preprint. These images are clustered in time, with maximum frequency during April 2020 (well after the peak of the pandemic in Wuhan). With 6 hospitals and 842 days in the analysis period, there could be up to 5,000 or so daily parking lot images. This analysis looked at 140 of them, spaced far from evenly. The irregular spacing of the data should make one very nervous about the smoothing analysis.
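A back-of-the-envelope check of the coverage, using the counts quoted above (my arithmetic, not the preprint's):

```python
hospitals = 6
days_in_period = 842        # Jan 9, 2018 through Apr 30, 2020, per the preprint
images_analyzed = 140

possible = hospitals * days_in_period
print(possible)                                        # 5,052 hospital-days that could have been imaged
print(round(images_analyzed / possible, 3))            # ~0.028, i.e. under 3% coverage
print(round(images_analyzed / hospitals / 27.7, 2))    # ~0.84 images per hospital per month (~27.7 months)
```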
Another sign of trouble is the skinny confidence interval (the gray band around the orange line) that supposedly conveys the uncertainty of the trend estimate. One should expect the uncertainty to bulge where data are sparse, since fewer observations mean we are less sure. That's not the case here: the gray band exhibits pretty much the same amount of uncertainty throughout the analysis period, in spite of the uneven spacing.
We should also expect the confidence intervals to widen where the trend makes abrupt turns – unless this is counteracted by having more observations around those turns.
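The intuition is that the standard error of a local average shrinks with the square root of the number of observations inside the local window, so sparse stretches should produce a visibly wider band. A quick sketch (my own simulation with invented numbers, not the study's data; the noise level and window width are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.08    # assumed noise level of the relative-volume index (illustrative)

# Unevenly spaced observation days: sparse early on, heavily clustered near the end,
# mimicking the clustering of images described in the preprint
days = np.concatenate([
    rng.choice(600, size=40, replace=False),                    # 40 images in the first 600 days
    rng.choice(np.arange(700, 842), size=100, replace=False),   # 100 images in the last stretch
])

window = 60     # days on either side of the date being estimated
for day in (200, 800):          # a sparse region vs. a dense region
    n_local = int(np.sum(np.abs(days - day) <= window))
    se = sigma / np.sqrt(n_local)   # standard error of a simple local average
    print(day, n_local, round(se, 3))
```

A band of near-constant width is hard to square with coverage this uneven.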
It's unclear what level of confidence is used in the gray band. Typically, one should show 95% intervals. In the context of loess smoothing, one should expect, roughly speaking, the gray band to encompass about 95% of the data within that local time window. When I look at the period around May 2019, for example, I just can't make sense of that confidence interval. It's most likely just plus or minus one standard error, which would be roughly 68% confidence.
If they are showing one standard error – and the gray band looks to be roughly 10% wide on either side of the orange line – then the usual two-standard-error (roughly 95%) intervals would be about 20% on either side of the line. This means that any difference within 20% is not statistically meaningful. That fact pretty much wipes out the entire picture, since the peak of the orange line, right around December 2019, was only slightly above 1.2 (20% above "normal"). Thus, the satellite image data are too noisy even to establish that the parking lot traffic volume was unusual in December 2019, let alone August 2019.
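The confidence levels quoted above follow from a normal approximation; a quick check (the 10% band width is an assumption based on eyeballing the chart, not a number from the preprint):

```python
from scipy.stats import norm

# Coverage of +/- k standard errors under a normal approximation
for k in (1, 1.96, 2):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(k, round(coverage, 3))   # 1 SE -> ~0.683, 1.96 SE -> 0.95, 2 SE -> ~0.954

# If +/- 1 SE corresponds to roughly 10% of "normal" volume on either side of the line,
# the conventional ~95% interval spans roughly 20% on either side, the threshold discussed above.
one_se = 0.10
print(2 * one_se)   # 0.2
```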
***
Surveillance data are touted as the new oil. They're being sold to businesses and governments as some kind of X-ray vision that reveals all. The problem is not just privacy. The bigger problem is that such data frequently yield incorrect insights – and, when trusted blindly, those insights can actively harm customers and citizens. In addition to asking questions about who owns such data, we need to start measuring their impact: whether incorrect insights are harming people, and which subgroups are disproportionately hurt.
Good piece.
Yes, using standard error to suggest a confidence interval is a clue to the purpose of the research, and the remaining data come up with a December conclusion.
In which case, no earlier than Italy or Spain – intriguing.
Posted by: Michael Droy | 06/22/2020 at 10:18 AM