I mentioned the other day that the testing data out of California looked weird (link).
On April 22, 2020, the state released a gigantic pile of data, more than 165,000 results. This increased the total number of test results by more than 50 percent in one day. Most people kept moving along, telling themselves it's just a backlog being released.
If you're a good data scientist, you won't let this go so quickly. In fact, I won't hire anyone who accepts the backlog explanation at face value.
***
Tests are patient samples being processed at labs. Labs are currently swamped, so they are running at full capacity, which means a near-constant rate of output. It's not that the labs suddenly found extra capacity to process 165,000 tests all at once. The normal rate of test results is in the range of 10,000 to 20,000 per day.
So, after the labs completed the test analysis, the test results were held up somewhere. It's not a one-day delay. It's about 10 days' worth of results being locked up.
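Here is a back-of-the-envelope check of that "10 days" figure, as a rough sketch. It assumes daily output stayed within the 10,000 to 20,000 range quoted above; the midpoint of 15,000 per day is my own choice for illustration, not a reported number.

```python
# Rough estimate of how many days of lab output the April 22 batch represents.
BATCH_SIZE = 165_000                      # results released on April 22 (from the post)
DAILY_LOW, DAILY_HIGH = 10_000, 20_000    # normal daily range (from the post)


def days_of_backlog(batch_size: int, daily_rate: float) -> float:
    """Days needed to accumulate batch_size results at a constant daily_rate."""
    return batch_size / daily_rate


daily_mid = (DAILY_LOW + DAILY_HIGH) / 2  # assumed midpoint: 15,000 per day
shortest = days_of_backlog(BATCH_SIZE, DAILY_HIGH)
typical = days_of_backlog(BATCH_SIZE, daily_mid)
longest = days_of_backlog(BATCH_SIZE, DAILY_LOW)

print(f"{shortest:.1f} to {longest:.1f} days, about {typical:.0f} at the midpoint")
```

Anywhere in that range, the batch represents well over a week of results, not a one-day hiccup.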
In addition, this batch of results is completely different from those of other days. From April 3 to 21, about 11 percent of all reported tests were positive. In the batch of 165,000 for April 22, only 1 percent were positive. This is an order of magnitude different, far beyond what random chance can produce. For some reason, the test results that were delayed looked different from those that were released on schedule.
When I wrote the last post, I hadn't looked at the test results released after April 22. On April 23, the proportion of positive results was... you guessed it, back to 12 percent. On April 24, it was 16 percent.
Here's why the innocent explanation doesn't fly: let's assume the person who verified results before they were released disappeared for 10 days. So the results piled up on the person's desk, and s/he published them all at once. Those results should be even closer to the average proportion of positives because the sample size is 10 times bigger. Instead, the proportion positive is 10 times smaller!
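To put rough numbers on the sample-size argument, here is a minimal sketch. It treats each test as an independent draw with the 11 percent baseline rate (a simple binomial model, which is an idealization), and the typical daily count of 15,000 is an assumed value inside the normal range, not a reported figure.

```python
import math

BASELINE_RATE = 0.11   # share positive, April 3-21 (from the post)
BATCH_RATE = 0.01      # share positive in the April 22 batch (from the post)
BATCH_SIZE = 165_000   # size of the April 22 batch (from the post)
TYPICAL_DAY = 15_000   # assumed typical daily count, for comparison only


def standard_error(p: float, n: int) -> float:
    """Standard error of a sample proportion under a binomial model."""
    return math.sqrt(p * (1 - p) / n)


se_day = standard_error(BASELINE_RATE, TYPICAL_DAY)
se_batch = standard_error(BASELINE_RATE, BATCH_SIZE)

# How many standard errors the batch sits away from the baseline rate.
z_batch = (BATCH_RATE - BASELINE_RATE) / se_batch

print(f"standard error, typical day: {se_day:.4f}")
print(f"standard error, big batch:   {se_batch:.4f}")
print(f"batch deviation from baseline: {z_batch:.0f} standard errors")
```

Under these assumptions, the larger batch should be pinned much more tightly around 11 percent than any single day, yet it lands more than a hundred standard errors below it.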
As a data scientist, you've got to ask questions, and keep asking until you get a satisfactory answer.
***
What are some possible reasons?
Here's another plausible "innocent" reason. We might learn that the 165K results came from a separate, one-off initiative to get tests to a broader set of people including those without symptoms. Of course, we haven't heard of such a thing. So this is not accepted unless real evidence turns up.
Another innocent explanation might be that the state of California contracted with additional labs to process the backlog. This can explain the 50% jump in test results, but it fails to account for the extreme negative bias in the sample. The labs can't know which samples will come up negative before they process them, so for this reasoning to hold, the percent positive of those 165,000 would have to be close to 11 percent.
Asking these questions led me to examine the daily testing report from California (link). I focus just on April because there was a reporting change in the first week of April; the numbers from March were small anyway. I saw a few other days in which 100% of test results were reported as positive (or negative). The chance of every single test coming up positive (or all negative) for any given day is practically zero. This means the release of the data is dependent on the test results. There is no scientific or statistical reason for this.
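How close to zero is "practically zero"? Here is a rough sketch, assuming independent tests with the 11 percent baseline positive rate. The count of 100 tests in a day is a deliberately small, hypothetical figure; the actual counts on those days would need to be read off the state's report.

```python
import math

BASELINE_RATE = 0.11   # share positive, April 3-21 (from the post)
N_TESTS = 100          # hypothetical, deliberately small daily count


def log10_prob_unanimous(p_each: float, n_tests: int) -> float:
    """log10 of the chance that n_tests independent tests all give the
    same outcome, where each has probability p_each of that outcome."""
    return n_tests * math.log10(p_each)


log10_all_positive = log10_prob_unanimous(BASELINE_RATE, N_TESTS)
log10_all_negative = log10_prob_unanimous(1 - BASELINE_RATE, N_TESTS)

print(f"all positive: about 10^{log10_all_positive:.0f}")
print(f"all negative: about 10^{log10_all_negative:.0f}")
```

Working in logarithms avoids numerical underflow for larger counts. The probabilities only shrink as the number of tests grows, so a day with thousands of tests and a 100% result is even harder to explain by chance.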
As a citizen, I find this deeply concerning. It makes one wonder if there are political considerations. As you can imagine, the tone of the media coverage can be swayed by selectively releasing positive (or negative) results. This conjecture, too, must be backed up with evidence.
***
I don't think we have an answer. Backlog is definitely not an acceptable answer. I encourage Californians to push their government to explain this. The questioning will be worth it. You will learn something about the testing program along the way.
P.S. Thanks to the folks at the COVID Tracking Project who are keeping close tabs on the testing data.
Apparently Calif opened up testing to a broader audience on 4/21, and there is capacity to run 80,000 tests a day.
https://www.latimes.com/science/story/2020-04-21/california-first-state-coronavirus-tests-without-symptoms?fbclid=IwAR1miFdFBtY8rPPyA16bv55DB-trm2o4FI7YABLdNbCDgIxaKXJeGZbTQno
Posted by: TBW | 04/28/2020 at 12:56 AM
TBW: That might be true but it would be shocking to think they could process 165,000 tests in a single day, especially given previous reports of backlogs. It is also unlikely that the types of people getting tested switched diametrically overnight. The broadening is to "high risk" groups so I'm still skeptical.
Here's another innocent explanation: could it just be a typo? If we turn 165,000 into 16,000 tests, then both the volume and the percent positive go back to the normal range.
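A quick sketch of that arithmetic. The count of positives is not quoted directly, so it is backed out from the rounded "1 percent" figure, and the 16,000 is the hypothetical corrected total, so treat the result as approximate.

```python
REPORTED_TESTS = 165_000        # batch size as published (from the post)
REPORTED_POSITIVE_RATE = 0.01   # "only 1 percent", a rounded figure
CORRECTED_TESTS = 16_000        # hypothetical typo-corrected total

# Positives implied by the published numbers (roughly 1,650).
implied_positives = REPORTED_TESTS * REPORTED_POSITIVE_RATE

# Percent positive if the same positives came from the smaller total.
corrected_rate = implied_positives / CORRECTED_TESTS

print(f"implied positives: {implied_positives:.0f}")
print(f"percent positive with corrected total: {corrected_rate:.1%}")
```

Under those assumptions, the corrected rate lands close to the 11 percent seen from April 3 to 21.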
Posted by: Kaiser | 05/07/2020 at 11:43 AM