Over the weekend, Jason P. alerted me on Twitter to a new study from Yale that has caught a lot of attention, because it's about scientists poking around our sewage. The conversation started with this chart:
The Twitter user was annoyed by the dual axes, of which he knows I'm not a fan. After reading the preprint, I was hoping someone might introduce these authors to the good people in Yale's statistics department.
The chart shown above is the centerpiece of the study. It shows on the left axis (red line) a measure of the viral load of SARS-CoV-2 found in human sewage, and on the right axis (black line) the number of new reported Covid-19 cases. Both metrics were measured for the New Haven metropolitan area, which is served by the sewage plant from which the samples were obtained.
The claim is that those two lines are highly correlated, with the red line leading the black line by roughly 7 days. The research team concluded that the viral load in sewage is a "leading indicator" of infections.
They then quantified this relationship by creating a piecewise linear regression model. As shown in charts D and E below, they fitted one line to the left of the peak and one line to the right of the peak.
***
These researchers are interested in a predictive model. If it is true that the viral load in sewage predicts the number of new cases seven days later, then they have themselves a predictive model. Unfortunately, the analytical methods used in this study fail to make this case.
A predictive model must predict the future, and not just explain the past. As I said before, some models explain the past well but that doesn't mean these models will predict the future accurately. Predictive models must be validated with data that haven't been used in building the model. This Yale study fitted a model to the entire time series (Mar 19 to Apr 30), which precluded any validation. That's the first problem.
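To make the idea concrete, here is a minimal sketch of a time-ordered split, assuming the daily series lives in a pandas DataFrame (the function name and the two-week holdout are my assumptions, not the preprint's): fit the model only on the earlier window, then check its predictions against the later window it has never seen.

```python
import pandas as pd

def time_ordered_split(df: pd.DataFrame, holdout_days: int = 14):
    """Split a daily time series into a training window and a later
    validation window. The model is fit on `train` only; `test` is
    reserved for checking predictions the model has never seen."""
    df = df.sort_index()
    train = df.iloc[:-holdout_days]
    test = df.iloc[-holdout_days:]
    return train, test
```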
Given their stated goal, the simplest model has the form:
New cases on day X = a + b * (viral load in sewage on day X-7)
Here, a and b are numbers learned from the training data. The "slope" of the regression line is b. For this model, if one wants to know what the new cases are for tomorrow (day X), one looks up the viral load from seven days ago, plugs that into the equation above, and out pops the prediction.
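As a sketch of how such a model could be fit and used (this is not the authors' code; the column names, the 7-day lag, and the use of ordinary least squares are my assumptions for illustration):

```python
import pandas as pd
import statsmodels.api as sm

def fit_lagged_model(df: pd.DataFrame, lag: int = 7):
    """Fit: new cases on day X = a + b * (viral load on day X - lag)."""
    x = df["viral_load"].shift(lag)    # predictor: viral load seven days earlier
    y = df["new_cases"]                # target: new cases reported today
    data = pd.concat([x, y], axis=1).dropna()
    X = sm.add_constant(data["viral_load"])    # adds the intercept a
    return sm.OLS(data["new_cases"], X).fit()

def predict_new_cases(model, viral_load_7_days_ago: float) -> float:
    """Look up the viral load from seven days ago, plug it in, out pops the prediction."""
    a = model.params["const"]
    b = model.params["viral_load"]
    return a + b * viral_load_7_days_ago
```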
What is described in the preprint isn't operational as a predictive model.
As shown in the panel above, the researchers actually ran two regressions (D and E). As they explained it, D captures the slope as the cases ascend while E estimates the slope as the cases descend. This means that we must know where the peak is to even decide which model (D or E) should be applied. But the peak is not known until after the fact - so this model can't predict.
In addition, all time series in the analysis were smoothed using LOWESS. In other words, the number to plug into the above equation isn't the viral load from seven days ago; it's really the LOWESS-smoothed viral load from seven days ago. LOWESS smoothing averages values within a window centered at each day of the time series (it's a more sophisticated way of doing a moving average). This means the smoothed value for today depends on values from the past as well as values from the future. A predictive model can't use data from the future. The researchers need a smoother that averages only past values, but that might hurt their thesis because such smoothers always lag.
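Here is a small illustration of the difference, on made-up data (the window widths are arbitrary assumptions): the LOWESS fit at each day borrows from days before and after it, while a trailing moving average uses only the current and past days, and therefore lags.

```python
import numpy as np
import pandas as pd
from statsmodels.nonparametric.smoothers_lowess import lowess

# Made-up daily counts standing in for a real series such as viral load
rng = np.random.default_rng(1)
values = pd.Series(rng.poisson(50, size=43).astype(float))

# Centered smoother: each fitted value is a weighted average of neighbors
# on BOTH sides, so "today's" smoothed value leaks information from the future.
days = np.arange(len(values))
centered = lowess(values, days, frac=0.3, return_sorted=False)

# Causal smoother: averages only the current day and the previous six;
# usable for prediction, but it always trails behind turning points.
trailing = values.rolling(window=7, min_periods=1).mean()
```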
The same problem applies to the data series of new cases. Because of smoothing, the model has been trained to predict the smoothed value, not the actual value on day X. The smoothed value is a weighted average of the values on day X, on some days before day X, and on some days after day X. If you look at Figure 2C, the peak number of new cases appears to be just under 100. The following chart shows the raw data:
There were at least nine days on which the number of new cases exceeded 100. The effect of smoothing is to dampen sharp peaks.
And did they run the wrong regressions? I'm not sure since the actual equations were not included in the preprint. What is telling is their description of the slope as "1,305 virus RNA copies/mL per new COVID-19 case" (for the ascending model). Also, in visualizing the regression line, it's conventional to put the target variable (Y) on the vertical axis and the predictor on the horizontal axis. In the simple model formula stated above, the slope b is typically described as the number of new COVID-19 cases per unit increase in virus RNA copies/mL. In other words, it's the reciprocal of their description.
It's odd to describe the slope as viral load per new case because the point of this predictive model is to use viral load to predict new cases. Recall that viral load is time shifted back by seven days. So, the slope as they described it is viral load seven days ago per new case today. It just doesn't sound right.
As I explained in my article for DataJournalism.com (here), regressing Y on X and regressing X on Y are two different things. The slope of one is not the reciprocal of the slope of the other. That's why we don't put the target variable on the horizontal axis and the predictor on the vertical axis: doing so visualizes the wrong slope.
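A quick numeric check of this point on made-up data: the slope from regressing Y on X multiplied by the slope from regressing X on Y equals the squared correlation, so one slope is the reciprocal of the other only when the correlation is perfect.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=1.5, size=200)   # noisy linear relationship

slope_y_on_x = np.polyfit(x, y, 1)[0]   # slope from regressing y on x
slope_x_on_y = np.polyfit(y, x, 1)[0]   # slope from regressing x on y
r = np.corrcoef(x, y)[0, 1]

print(slope_y_on_x, 1 / slope_x_on_y)       # not equal
print(slope_y_on_x * slope_x_on_y, r**2)    # these two match
```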
***
What is presented in the Yale preprint is at best an explanatory model. A common mistake is to assume that models that explain the past well will also predict the future accurately. See this post to learn more about why those types of models are different species.
To fix this preprint, they need to clarify what their model is, remove any dependence on future data not available to the model on the day of prediction, and include a validation study. A quick visit to Yale Statistics may prove very helpful.
There is a more basic problem. What if new cases are not driven by actual new infections but by something much earlier? Let's suppose new cases are driven by increased testing, and increased testing is driven in practice by increased deaths and by government/local authority decisions to test more (or perhaps, in smart locations, by increased hospitalization as an indicator of imminent deaths).
In other words, the test results are being driven by events (deaths) that are in turn driven by initial events (initial infection) 3-4 weeks earlier.
In this situation, the red line is not predicting the black line one week later; it is "predicting" the cause of the black line 2-3 weeks earlier. A pretty useless prediction, I'm sure you'll agree.
And deaths driving test results is almost certainly the case.
Posted by: Michael Droy | 05/28/2020 at 10:04 AM
I heard about this work on the BBC World Service (Science Matters). I think it is fascinating, and perhaps these data may provide a more objective estimate of the extent of Covid-19 infection rather than a way of "predicting" future patterns. In particular, it would be very interesting to see comparative data from several areas where the prevalence of Covid-19 is known to differ. However, the paper raises so many questions for me:
1. The daily sludge data are essentially serially correlated, as the samples will represent the sludge collected from pretty much the same group of people over time, whereas the epidemiological data will only count each person once (will it? or might some admissions be the same as earlier tested patients?)
2. The authors state that the concentration results were "normalized": how?
3. Was any investigation into the distribution of the concentration data done? Was it Normally distributed and if not, were any data transformations explored? Does this matter in this context?
4. I was not absolutely clear on how the replicates were defined. Assuming that the sludge samples were split into two to form these replicates, is a simple regression of the data from these replicates appropriate? If this is the case, I would be surprised if there were not a very high correlation between replicates. Perhaps an analysis that incorporates the structure of the data, i.e. the replicates, would be more appropriate?
Posted by: Caroline Whately-Smith | 05/29/2020 at 06:00 AM