


Michael Droy

There is a more basic problem. What if new cases are not driven by actual new infections but by something much earlier? Let's suppose new cases are driven by increased testing, and increased testing is driven in practice by increased deaths and the government/local authority response to test more (or perhaps, in smarter locations, by increased hospitalization as an indicator of imminent deaths).
In other words, the test results are being driven by events (deaths) that are in turn driven by initial events (initial infections) 3-4 weeks earlier.

In this situation the red line is not predicting the black line 1 week later; it is "predicting" the cause of the black line 2-3 weeks earlier. A pretty useless prediction, I'm sure you'll agree.

And deaths driving test results is almost certainly the case.
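This scenario can be sketched with a toy simulation (every number and lag here is hypothetical, chosen only to illustrate the mechanism, not taken from the paper): if sludge RNA tracks true infections, but reported cases only appear after a testing surge triggered by deaths, then the sludge series will "lead" the case series by weeks even though it predicts nothing about the future.

```python
import numpy as np

rng = np.random.default_rng(0)

days = 120
t = np.arange(days)
# Hypothetical epidemic curve of true new infections (a smooth bump).
infections = np.exp(-0.5 * ((t - 50) / 12) ** 2)

# Sludge RNA concentration tracks current infections (plus assay noise).
sludge = infections + rng.normal(0, 0.05, days)

# Deaths lag infections by ~21 days; reported cases follow the testing
# surge that comes a few days after deaths, not current infections.
deaths = np.roll(infections, 21)
cases = np.roll(deaths, 3) + rng.normal(0, 0.05, days)

# Find the lag (in days) that maximizes correlation between sludge and cases.
lags = range(0, 40)
corrs = [np.corrcoef(sludge[:days - k], cases[k:])[0, 1] for k in lags]
best_lag = max(lags, key=lambda k: corrs[k])
print(best_lag)  # sludge appears to "lead" cases by weeks, by construction
```

The apparent lead time is just the death-plus-testing lag fed back into the reporting pipeline, which is exactly the "useless prediction" being described.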

Caroline Whately-Smith

I heard about this work on the BBC World Service (Science Matters). I think it is fascinating, and perhaps these data may provide a more objective estimate of the extent of Covid-19 infection rather than a way of "predicting" future patterns. In particular, it would be very interesting to see comparative data from several areas where the prevalence of Covid-19 is known to differ. However, the paper raises so many questions for me:
1. The daily sludge data are essentially serially correlated, as the samples represent sludge collected from pretty much the same group of people over time, whereas the epidemiological data will only count each person once (will they? or might some admissions be the same as earlier-tested patients?)
2. The authors state that the concentration results were "normalized": how?
3. Was any investigation done into the distribution of the concentration data? Were they normally distributed, and if not, were any data transformations explored? Does this matter in this context?
4. I was not absolutely clear on how the replicates were defined. Assuming the sludge samples were split in two to form these replicates, is a simple regression of the data from these replicates appropriate? If so, I would be surprised if there were not a very high correlation between replicates. Perhaps an analysis that models the structure of the data, i.e. incorporating the replicates, would be more appropriate?
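On point 4, a quick simulation (all numbers entirely hypothetical, not from the paper) shows why split-sample replicates should correlate highly almost by construction: both halves share the same day's true concentration, so day-to-day variation in the signal dominates any assay noise.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical daily "true" RNA concentration (log scale) over an
# epidemic wave, with substantial day-to-day variation.
days = 70
truth = 3 + 2 * np.sin(np.linspace(0, np.pi, days)) + rng.normal(0, 0.3, days)

# Split each day's sample in two; each replicate adds only small assay noise.
rep1 = truth + rng.normal(0, 0.1, days)
rep2 = truth + rng.normal(0, 0.1, days)

r = np.corrcoef(rep1, rep2)[0, 1]
print(round(r, 3))  # very high: reflects shared daily signal, not assay agreement
```

A high replicate correlation here says little about measurement quality, which is why an analysis that models the data's structure (e.g. day as a grouping factor) seems more informative than a simple regression of one replicate on the other.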


Kaiser Fung. Business analytics and data visualization expert. Author and Speaker.