As predicted, test accuracy is getting major play in the media this week. One of the big headlines involved the claim that about 3 percent of the U.S. population may already have been infected and recovered, according to a study coming out of Stanford. The good thing about this study is its attempt to escape from the straitjacket of "triage" or targeted testing. It recruited about 3,330 people without requiring that they have symptoms. They were tested for antibodies, an indication that the person has recovered from a previous SARS-CoV-2 infection. Showing that a large proportion of the population has been infected and recovered is key to those pursuing the "herd immunity" theory.
The bad news is that the methodology of the study has already been debunked by Andrew Gelman. Andrew's post is here in its entirety. In this post, I give you an executive summary, especially for those not trained in the statistics of testing.
***
Before I get to the numbers, the idea that this study supports the "herd immunity" theory is far-fetched. Herd immunity models typically show the vast majority of the population would be infected - much higher than five percent. We would need about 20 times the number in Santa Clara to get to herd immunity. Also, the presence of antibodies is not identical to having immunity. It is still not clear whether humans have immunity after recovery from Covid-19, and if so, for how long. The headline framing - that infections are 50 to 85 times higher than reported - buries the fact that antibodies were found in only about 3 percent of the samples.
If the result of this study is accepted, it exposes the sham of diagnostic testing in Santa Clara.
***
There are two major statistical issues with the Stanford study. Both are related to the construction of the sample - its composition and its size.
1. The way volunteers are recruited introduces bias into the statistics.
The 3 percent headline result is statistically adjusted. The 3,330 tests actually returned only 50 positives (1.5 percent). The adjustments were necessary because the researchers discovered biases in their sample by comparing basic demographic statistics to the county averages.
Bias means certain subgroups are under-represented in the sample. If a certain subgroup (say, men) is over-represented, then the complementary subgroup (women) is under-represented, so it's really two sides of the same coin.
While the sample size of 3,330 for one county is pretty large, the selection of volunteers was decidedly not random. People were recruited through Facebook ads, and they had to drive themselves to the testing centers.
A non-random selection has the chance of introducing unknown biases. In this case, some possible sources of biases are:
- people who have had difficulty getting tested (e.g. health workers) may show up in larger numbers
- some people who strictly adhere to the shelter-in-place order may not show up
- some people may like getting tested for free
- some people may believe Covid-19 is a hoax, or testing is pointless
- some people may find it more convenient to get to one of those test centers
- some people do not have Facebook
- some people do not respond to Facebook ads
- people who have already been tested, especially those who tested positive and recovered, may be more motivated to participate
These biases draw more or fewer infected people to the testing sites, and the strength of one bias versus another also matters. There is one bias that is irrefutable: anyone who has severe symptoms or is hospitalized will not be represented at all.
It seems hard, if not impossible, to measure and then correct for these biases, which is why statisticians usually recommend random selection. An ounce of prevention is worth a pound of cure. (A proverb that is most appropriate for this pandemic!)
It is sometimes mistakenly believed that larger samples cure sample bias. They don't. No matter how big the sample is, the researchers will not find hospitalized patients in it. But those patients are almost surely infected.
***
2. The proportion of antibody-positive reported in the study is so small it may be all noise.
It turns out that even with 3,330 tests, a pretty large sample, the key result that 3 percent have antibodies could all be noise.
What is statistical noise? Imagine you repeat this experiment many times. For each set of 3,330 volunteers, you record the number of positive results. You aren't going to get 3 percent each time. The proportion with antibodies fluctuates from sample to sample, but these variations are meaningless, an artifact caused by sampling.
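To make the sampling noise concrete, here is a minimal simulation sketch. It is not from the study: the 1.5 percent "true" rate and the number of repetitions are assumptions chosen for illustration. It draws many hypothetical samples of 3,330 people and counts the positives in each.

```python
import numpy as np

rng = np.random.default_rng(0)

n_tests = 3330        # sample size, as in the study
true_rate = 0.015     # assumed "true" antibody rate, for illustration only
n_repeats = 10_000    # number of hypothetical repetitions of the experiment

# For each hypothetical repetition, count how many of the 3,330 volunteers test positive
positives = rng.binomial(n_tests, true_rate, size=n_repeats)

print(f"positive counts range from {positives.min()} to {positives.max()} "
      f"({positives.min()/n_tests:.1%} to {positives.max()/n_tests:.1%})")
# The counts swing widely (roughly from the mid-20s to the mid-70s) even though
# the underlying rate never changes - that spread is pure sampling noise.
```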
The following analysis follows the plan used in Chapter 4 of Numbers Rule Your World (link) to analyze steroids testing and terrorist prediction systems. So if you have the book, you can compare the current situation with those stories.
In the raw data, the researchers found 50 positives out of 3,330 tests. (As mentioned in the first section, this 1.5 percent was revised up to 3 percent due to statistical adjustments which I won't describe here. If interested, Andrew has comments on those.)
Every test has errors. Some of the 50 positives are false positives. The false positive rate of a test describes the chance that the test will say someone has antibodies when they don't. The FP rate is one measure - but not the only one - of test accuracy.
To illustrate the issue, we consider the implausible, extreme scenario where none of the 3,330 volunteers have antibodies. So in a perfect test, the number of positives should be zero. If we get 50 positives, all 50 are false positives since no one has antibodies. That implies a false positive rate of 50/3330 = 1.5 percent.
Next, we assume that the researchers are right, that indeed 3 percent of the population has antibodies. (We further assume that the sampling is not biased.) That means the sample would be split 3% and 97%, that is, 100 people with antibodies and 3,230 without.
Let's proceed. We think that the test has a false positive rate of 1.5 percent. Applied to the 3,230 test-takers without antibodies, this antibody test should report 48 (false) positives. Note that in the actual experiment, the researchers counted 50 positives, so almost all of these are false positives!
Here is where things get confusing. Even though the test is rated as very accurate - 1.5 percent false positive rate, 98.5 percent true negative rate (also known as specificity) - almost all the positives are false positives! The reason is the rarity of antibodies. The 1.5 percent false positive rate is applied to almost the entire test population because very few have antibodies. A small percentage of a large base is still a large number. (Another tie-in to the pandemic: a low fatality rate applied to a large infected base through fast community spread leads to a flood of deaths.)
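For readers who prefer to see the arithmetic laid out, here is a minimal sketch of the calculation above (the 3 percent prevalence and the 1.5 percent false positive rate are the assumptions discussed in the text, not measured quantities):

```python
n_tests = 3330
prevalence = 0.03     # assumed: 3 percent of the sample truly has antibodies
fp_rate = 0.015       # assumed false positive rate, from the extreme scenario above

with_antibodies = round(n_tests * prevalence)       # ~100 people
without_antibodies = n_tests - with_antibodies      # ~3,230 people

expected_false_positives = without_antibodies * fp_rate
print(f"expected false positives: {expected_false_positives:.0f} "
      f"out of the 50 positives actually observed")
```

Under these assumptions, roughly 48 of the 50 observed positives would be test error, which is the point of the paragraph above.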
***
When developing a new test, the false positive rate is not known: if we knew who has antibodies and who doesn't, we wouldn't need to test people. We test because we don't know. In the above calculation, I used a thought experiment, the extreme case, to guess at the false positive rate.
The Stanford study had a better way of estimating it. The researchers applied the antibody test to blood samples collected before the Covid-19 pandemic, when they could be sure the samples should test negative if the test were accurate. Using that data, Andrew pegged the false positive rate at 0 to 2 percent.
At one end, the false positive rate may be 2 percent (instead of the 1.5 percent I used before). In that case, the test-takers without antibodies generate 65 false positives. This number is higher than the total number of positives (50) found in the experiment. So, test inaccuracy might account for all positive results!
At the other end, if the test were so accurate as to have zero false positives, then by definition, all 50 positives would be true positives, and we could run with their result. Anyone who knows a little about testing recognizes this scenario as implausible. It also implies a high false negative rate!
Remember that if the sample were indeed split 3 percent with antibodies and 97 percent without, as per the study's conclusion, there would be 100 people with antibodies in the test sample. The test returned only 50 positives, so the other 50 must be false negatives.
This is the crux of understanding the statistics of testing. You can't focus on just one accuracy metric. When one number goes up, another goes down. Here's the bottom line: if we think there are 100 people with antibodies in the test population of 3,330, then any reasonable test must return at least 100 positive results if it is to have any hope of catching all those with antibodies. That implies the test must have a positive rate (true or false positives included) of at least 100/3330 = 3 percent. To account for false positives, the test must return even more positive results. If it doesn't, then there must be false negatives.
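Putting the two ends of Andrew's range together, here is a rough sketch (again assuming the study's 3 percent prevalence) of how the 50 observed positives force a trade-off between false positives and false negatives:

```python
n_tests = 3330
observed_positives = 50
prevalence = 0.03                          # assumed, per the study's conclusion
with_antibodies = round(n_tests * prevalence)       # ~100
without_antibodies = n_tests - with_antibodies      # ~3,230

for fp_rate in (0.0, 0.015, 0.02):         # Andrew's range: 0 to 2 percent
    expected_fp = without_antibodies * fp_rate
    # true positives can't exceed the observed total, the antibody group, or fall below zero
    implied_tp = max(0, min(observed_positives - expected_fp, with_antibodies))
    implied_fn = with_antibodies - implied_tp
    print(f"FP rate {fp_rate:.1%}: ~{expected_fp:.0f} false positives, "
          f"~{implied_tp:.0f} true positives, ~{implied_fn:.0f} false negatives")
```

At a 0 percent false positive rate, there must be about 50 false negatives; at 2 percent, test error alone would account for more than all 50 positives. Improving one metric worsens the other.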
False negatives are considered less damaging than false positives in the context of antibody testing because these people with antibodies would be taking unnecessary caution. It's not completely harmless since they might suffer from lost income. False positives are feared as they might have a false sense of security and take unknowing risks, and get infected.
Let me now relate this back to sample design. The above analysis shows that the observed test result is consistent with 0 percent of the population having antibodies. This is different from saying we believe the true proportion is 0 percent. It could be 3 percent as the researchers asserted. But the evidence is not strong enough to rule out 0 percent. In other words, even though 3,330 is a lot for one county, the sample is still not large enough to measure such a small signal.
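One way to make the last point concrete: if the true proportion with antibodies were actually 0 percent, how surprised should we be to see 50 positives? A quick sketch (the false positive rates are the assumed range from above; scipy is used only for the binomial tail probability):

```python
from scipy.stats import binom

n_tests = 3330
observed_positives = 50

for fp_rate in (0.015, 0.02):    # plausible false positive rates from the pre-pandemic samples
    # probability of getting 50 or more positives from test error alone,
    # when nobody in the sample has antibodies
    p_at_least_50 = binom.sf(observed_positives - 1, n_tests, fp_rate)
    print(f"FP rate {fp_rate:.1%}: P(at least 50 positives | 0% prevalence) = {p_at_least_50:.2f}")
```

At these false positive rates, 50 positives would be unremarkable even if no one in the sample had antibodies, which is the sense in which the signal is too small for this sample to detect.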
***
I hope you're excited to learn more about testing.
See my previous post, or read Chapter 4 of Numbers Rule Your World (link).
"False negatives are considered less damaging than false positives in the context of antibody testing because these people with antibodies would be taking unnecessary caution. It's not completely harmless since they might suffer from lost income. False positives are feared as they might have a false sense of security and take unknowing risks, and get infected."
What do you think about the policy to not inform the tested people about their test result? Would it be feasible? useful? ethically acceptable?
Posted by: Antonio Rinaldi | 04/22/2020 at 11:03 AM
regarding the bias in the sample, all you wrote are certainly possible. I believe he said (in an interview I heard from him) that they corrected the sample by geo location; they didn't interview participants to break down all of the demographic groups. To me, that would cover a good pct of the concerns but not all. And, of course, FB doesn't cover everyone--no sampling methodology does--but it has greater reach than any other digital communications method. That stated, I'm not sure the biases mentioned, while possible or even likely, would impact the results. In other words, I'm not sure what effect bias in interest in getting a free test would have on positive rates (do people who want a free test have a greater risk of being infected? It's not clear to me they would, but maybe I'm just not thinking of them).
He also stated in the interview I heard from him that we are far from herd immunity. That's interesting but not a part of this study and from what he said, not a part of what is of interest to him.
Last, we don't know the false positive or false negative rates of this kind of test. I would presume that a test of this sort would strive to bias in favor of false positives (so you can rule out those FP later...FN are more dangerous). But given the relative sizes of the population, because the number of potential FNs is so much larger, I'd be more concerned about undercounting infections than overcounting (the tail of the distribution of uninfected is much broader). But that's conjecture because we don't know the FP or FN rates.
I wish we did far more of this. In spite of the limitations of this study, it provides valuable data. If we did 100 of these, just like political polling aggregates, we would get a much better sense of the true extent of infections, and by extension, the true mortality rate of this virus in the US.
Posted by: Dean Abbott | 04/22/2020 at 12:01 PM
AR: If antibody testing is about getting people back to work, then you have to inform people of the results. I have to understand the context a bit better for why one should withhold the results.
DA: I didn't specify the directions of the biases because as you pointed out, it's hard to hazard a guess. I'm sure you know, it's dangerous to assume there is no bias, or that the various biases balance out, just because we don't know the direction/magnitude of them. If we don't know, how can we know that correcting the demographics will correct the biases?
Posted by: Kaiser | 04/22/2020 at 02:34 PM