Last week, I reviewed the explosive exposé from Buzzfeed News on the Stanford antibody study. The first preprint from the research team was met with skepticism from numerous statisticians. Andrew, as usual, has a nice roundup of the key technical objections. Then the researchers published a revised preprint that addressed some of the complaints.
Reading these preprints is a sobering experience. I talked to a friend who's in the business of developing tests, and apparently the FDA sets a low bar for tests like this, since they are seen as posing little possibility of harm. In other words, the agency will look past most of the issues I cover in this post.
***
File Drawer Effect
The research team faced a difficult problem: outside reviewers questioned the specificity assumption central to their calculations. In particular, they knew that the test's specificity (the ability to correctly identify negatives) had to be over 99 percent to defuse the statistical challenge.
To claim an extremely high level of specificity (i.e. almost zero false positives), the researchers applied antibody testing to a sample of 30 blood specimens collected by a third party before the novel coronavirus was discovered, and reported that all 30 test results were negative. That was a woefully small sample to make such a strong statement, as I pointed out here. The Buzzfeed exposé further revealed that a Stanford-affiliated lab conducted those 30 tests in an effort to validate the accuracy of the antibody test kit, which hasn't yet been approved by the FDA. The lab backed out of the study after seeing the validation results.
A back-of-the-envelope calculation indicates that at least 10 times the sample size is required to establish beyond reasonable doubt that the specificity is above 99 percent. Of course, the more negative blood specimens, the better. Procuring such samples is a bottleneck in the test development process. It's hard and expensive to acquire them. Labs usually purchase samples from other organizations, or they may contract other labs to conduct the validation testing.
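To see where the factor of ten comes from, here is a minimal sketch of the calculation (my own arithmetic, not the preprint's): when zero false positives are observed among n negative specimens, the exact one-sided 95% lower confidence bound on specificity is 0.05 raised to the power 1/n, which is the source of the familiar "rule of three" approximation 1 - 3/n.

```python
# Lower 95% confidence bound on specificity when all n negative specimens
# test negative (zero false positives). With x = n "successes", the exact
# Clopper-Pearson bound reduces to (1 - confidence) ** (1 / n).
def specificity_lower_bound(n, confidence=0.95):
    return (1 - confidence) ** (1 / n)

for n in (30, 100, 300, 400):
    print(f"n = {n:>3}: specificity > {specificity_lower_bound(n):.1%} with 95% confidence")
```

With 30 clean specimens, the bound is only about 90.5 percent; it takes roughly 300 clean specimens before the bound clears 99 percent.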
In the revised preprint, the researchers now say they obtained 3,000 more negative blood specimens within just a few weeks (as I pointed out in the last post, it's never clearly stated whether they acquired blood samples, or just data). This effectively settles the sample-size objection. But the sudden emergence of thousands of validation samples raises an altogether different - and much thornier - question.
As one can imagine, the researchers spent the weeks in between preprints hunting for negative specimens. They knew exactly what they were looking for. They needed at least hundreds of specimens. The required sample size grew further if some of the negative blood samples ended up with false-positive results.
Statisticians hate getting caught up in this type of goal-seek dilemma. The situation can easily turn into a sequential optimization game: after each new acquisition, the researchers tally up the total number of specimens, and what proportion of the results are false positives. At each step, they decide whether to hunt for the next sample, or stop. (The third possibility is to reject the latest sample, and keep hunting.) In practice, one probably keeps hunting until a minimum sample size is reached, and the observed false-positive rate falls below 1 percent, or whatever the target level is.
Because the researchers can choose to stop at any step, and they can continue hunting until they achieve what they want, any result from this sequential process is over-optimistic. This is the exact same problem that plagues many A/B testing programs in the tech industry - the issue is known as "running tests to significance". Since these tests are never stopped when the results are insignificant (unless all patience has run out), but are stopped early at moments when the results show significance, any conclusions are biased toward the positive.
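Here is a minimal simulation of that sequential game. Every number below (the true specificity, the batch size, the stopping rule) is invented purely for illustration; this is emphatically not a description of what the research team did. A test whose true specificity sits below the 99 percent target still clears it in a sizable share of simulated hunts, and the average reported specificity is inflated.

```python
# Simulate a researcher who keeps acquiring batches of negative specimens
# and stops hunting at a favorable moment. All parameters are illustrative.
import random

random.seed(1)

TRUE_SPECIFICITY = 0.985   # assume the test's true specificity is below the target
BATCH_SIZE = 100           # hypothetical batch of negative specimens per acquisition
MIN_N = 300                # minimum sample size before stopping is allowed
MAX_N = 3000               # give up after this many specimens
TARGET_FP_RATE = 0.01      # stop once the observed false-positive rate dips below 1%

def hunt_for_specimens():
    """Keep adding batches until the observed results look good enough (or patience runs out)."""
    tested = 0
    false_positives = 0
    while tested < MAX_N:
        batch_fp = sum(random.random() > TRUE_SPECIFICITY for _ in range(BATCH_SIZE))
        tested += BATCH_SIZE
        false_positives += batch_fp
        if tested >= MIN_N and false_positives / tested < TARGET_FP_RATE:
            break  # stop hunting at a favorable moment
    return 1 - false_positives / tested

estimates = [hunt_for_specimens() for _ in range(2000)]
print(f"true specificity:             {TRUE_SPECIFICITY:.2%}")
print(f"average reported specificity: {sum(estimates) / len(estimates):.2%}")
print(f"share of runs clearing 99%:   {sum(e >= 0.99 for e in estimates) / len(estimates):.1%}")
```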
If you're a sports fan, would you be delighted if I granted you the power to end a game whenever you like? What would you do? You'd never call an end while your team is behind, but as soon as your team goes ahead, you'd call a stop. In that world, the winning percentage of your team is not an accurate reflection of your team's ability!
This statistical problem has various names, such as confirmation bias, publication bias, and file drawer effect. Andrew likes to call this "the garden of forking paths." Essentially, the researchers have too much control over the outcome.
The Stanford antibody study showed how such bias could quietly infect a study. We know that a second Stanford-affiliated lab performed a validation study of the test kit. We know that it compared the positives to a more established antibody test, and found roughly half the positives to be false positives. We know that this validation sample was not included even in the second preprint. (We learned of its existence thanks to the whistleblower report and Buzzfeed.) Its exclusion may have been a decision by the research team, or one made by the outside lab, which also severed its ties with the antibody study. Regardless of who made the decision, the effect of its exclusion is publication bias. If unhelpful data are systematically made to disappear, the conclusion will be over-optimistic.
Selection bias
A quick look at the demographics table reveals that their sample of 3,000 residents was in no way representative of Santa Clara County's population. It consisted primarily of white women. Only 5 percent of participants were 65 and older. It had roughly a third as many Hispanics as the general population.
The only acknowledged recruitment channel was Facebook ads, which rely on an opaque targeting algorithm. My own experience with Facebook targeting gives me little confidence that it is "random" even when specified as such.
To correct for the obvious selection bias, the researchers re-weighted the data. This is a common procedure I explained previously in the context of exit polls (link). After this statistical adjustment, the reweighted sample had roughly the right gender and race distributions. For example, in the raw data, whites had a weight of 64% and Hispanics 8%; in the adjusted data, the weight of whites dropped to 35% while that of Hispanics tripled to 25%. The true distribution in the county is 33% whites and 26% Hispanics.
The validity of this procedure rests on whether the 8% of Hispanics in the sample are a good representation of all Hispanics in the county, and so on for every subgroup they reweighted.
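Here is a minimal sketch of the re-weighting idea on a single variable (race), using the white/Hispanic shares quoted above; the positive counts per group are made up purely for illustration and are not figures from the study. Each group's observed positive rate is weighted by its true population share rather than by its share of the sample.

```python
# Post-stratification on one variable (race). Counts are illustrative only.
sample_counts = {"white": 1920, "hispanic": 240, "other": 840}   # roughly 64% / 8% / 28% of 3,000
population_share = {"white": 0.33, "hispanic": 0.26, "other": 0.41}
positives = {"white": 20, "hispanic": 5, "other": 9}             # hypothetical test-positive counts

n = sum(sample_counts.values())

# Crude (unweighted) positive rate
crude_rate = sum(positives.values()) / n

# Adjusted rate: each group's observed rate, weighted by its true population share
adjusted_rate = sum(
    population_share[g] * positives[g] / sample_counts[g] for g in sample_counts
)

print(f"crude positive rate:    {crude_rate:.2%}")
print(f"adjusted positive rate: {adjusted_rate:.2%}")
```

Notice how heavily the adjusted rate leans on the rate observed in the small Hispanic subsample, which is exactly why the representativeness of that 8 percent matters.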
Lurking: age
Nowhere in the two preprints did the research team address the "age" problem. They included zip codes in their reweighting but not age. As a result, the age distribution even after adjustment deviated markedly from the county's true age distribution. Thirteen percent of the county's residents are 65 and over, a demographic group that is particularly badly hit by Covid-19. Yet, less than 5 percent of the adjusted sample are people 65 and over.
The research team said they did not use any other variables because doing so would create too many tiny subgroups, which is indeed a legitimate concern. I'd still like to hear why they thought the age adjustment was less critical than the zip code adjustment.
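Some rough arithmetic, with assumed category counts rather than the study's actual cells, shows why crossing age with the other adjustment variables quickly outruns a sample of 3,000:

```python
# Assumed category counts (2 genders, 4 race groups, ~60 zip codes, 4 age bands);
# these are illustrative guesses, not the study's actual categories.
genders, races, zip_codes, age_bands = 2, 4, 60, 4
sample_size = 3000

cells_without_age = genders * races * zip_codes
cells_with_age = cells_without_age * age_bands

print(f"cells without age: {cells_without_age}  (~{sample_size / cells_without_age:.0f} people per cell)")
print(f"cells with age:    {cells_with_age}  (~{sample_size / cells_with_age:.0f} people per cell)")
```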
Pooling of validation samples
It seemed that only 30 of the 3,000 negative samples were tested at Stanford labs. All the other entries in the table of samples appear to be (a) manufacturer-supplied data, (b) data obtained by third parties, or (c) results of tests performed by third parties on contract. I'm just guessing at part (c), as there was no indication they outsourced validation testing. It is possible that the extra thousands of negative specimens are all just data, without any new testing.
It's shocking to me that the FDA may accept such data. It's also shocking that it allows the wanton pooling of arbitrary samples of data.
Here's how the specificity of 99.5% was ascertained. The researchers found from all corners (of the world?) 13 different samples of blood specimens ranging from 29 to 1,102 in size. Some of these are from the staff of the test kit manufacturer; some are "COVID-19-era PCR-negative specimens, for which some tested positive for another respiratory virus"; some are "pregnant women pre COVID-19"; etc. Basically, each sample has different characteristics, was collected for a different reason, and has different numbers of specimens. They are "pooled together" as if they are the same.
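To make concrete what pooling amounts to, here is a minimal sketch; the source labels echo the preprint's descriptions, but the sizes (apart from the quoted 29-to-1,102 range) and the false-positive counts are invented. The pooled specificity is simply a size-weighted average, so the largest sources dominate no matter how different their underlying populations are.

```python
# Pooling heterogeneous validation samples. Sizes and false-positive counts are
# illustrative; only the 29 and 1,102 endpoints echo the preprint's quoted range.
samples = [
    ("manufacturer staff",          29, 0),
    ("pre-COVID blood bank",       150, 1),
    ("pregnant women pre-COVID",   300, 1),
    ("PCR-negative, other virus", 1102, 4),
]

total_n = sum(n for _, n, _ in samples)
total_fp = sum(fp for _, _, fp in samples)

# The pooled specificity is a size-weighted average of each source's specificity.
pooled_specificity = 1 - total_fp / total_n
print(f"pooled specificity: {pooled_specificity:.2%}  (n = {total_n})")

for name, n, fp in samples:
    print(f"  {name:<26}  specificity {1 - fp / n:.2%}  weight {n / total_n:.1%}")
```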
Why is this procedure suspect? Let's simplify and pool together two samples.
In the first scenario, you have two samples of 100 men each, randomly selected from the same county. These two samples are statistically interchangeable since both are random draws from the same population. By pooling them, you have a total size of 200, still men from the same county. The pooled average should be roughly similar to the two sample averages before pooling, while the sampling error is reduced because of the larger sample size. All good.
In the second scenario, you have one sample of 180 men selected from county A, and one sample of 20 women selected from county B. Using the same pooling strategy, you have a total size of 200. After pooling, you ignore the gender and county of the people.
But surely the two scenarios can't be analyzed in the same way, can they?
In the second scenario, the average has an implicit weighting of 9 men to 1 woman, and 9 county A to 1 county B. Is your metric invariant to gender or residence? Probably not. So the pooled average is biased towards the average of men in county A. (Re-weighting is an attempt to correct such biases in the experimental sample, but it wasn't applied to the validation samples.)
Meanwhile, the variability is underestimated. Variability comes both from differences between individuals and from differences between groups (gender and county). While the former is indeed reduced by pooling, the variability due to gender and residence differences counteracts that reduction. Underestimating variability means the margins of error are too narrow.
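A rough simulation makes the narrowing visible. Every number below is invented, and the spread of false-positive rates across sources is an assumption made purely to illustrate the mechanism. Each run assembles 13 validation samples whose true false-positive rates differ from source to source, then computes the naive margin of error as if the pooled specimens were one homogeneous batch.

```python
# Compare the naive margin of error of a pooled specificity estimate with the
# actual run-to-run spread when the pooled sources are heterogeneous.
import random
import statistics

random.seed(7)

# Hypothetical sizes for 13 validation samples (only the 29 and 1,102 endpoints
# echo the preprint; the rest are invented).
SAMPLE_SIZES = [29, 80, 150, 200, 250, 300, 300, 400, 500, 600, 800, 900, 1102]

def pooled_run():
    """Assemble 13 heterogeneous samples, pool them, and compute the naive margin of error."""
    total_fp, total_n = 0, 0
    for n in SAMPLE_SIZES:
        fp_rate = random.uniform(0.0, 0.03)   # each source has its own true false-positive rate
        total_fp += sum(random.random() < fp_rate for _ in range(n))
        total_n += n
    p_hat = total_fp / total_n
    naive_moe = 1.96 * (p_hat * (1 - p_hat) / total_n) ** 0.5
    return 1 - p_hat, naive_moe

runs = [pooled_run() for _ in range(1000)]
specificities = [spec for spec, _ in runs]
naive_moes = [moe for _, moe in runs]

print(f"average naive margin of error:        +/- {statistics.mean(naive_moes):.2%}")
print(f"actual spread of the pooled estimate: +/- {1.96 * statistics.stdev(specificities):.2%}")
```

In this setup, the naive margin of error comes out markedly narrower than the actual run-to-run spread of the pooled estimate.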
***
As mentioned at the outset, the FDA tends to look past these issues because it believes the potential harm of such test kits is low. But these are practical examples of real-world statistical challenges, which I hope you find worth your time to know about.