Data analyses are flooding the airwaves. They paint a confusing, contradictory picture of all aspects of the pandemic. Sadly, most of this work is hastily put together and of low quality. Here are some things to keep in mind when you look at these studies.
What happened in an experiment will not happen again
If a vaccine trial returns a 90% vaccine efficacy (VE), we can say for sure that the vaccine’s efficacy is precisely 90% for the 30,000 or so participants in that specific trial. We are not empowered to conclude that the VE for the billions of people who will eventually get the shots will be 90%, no matter how many epidemiologists repeat this falsehood on TV.
That’s because the trial involved a small sample of people, and there is a margin of error around the 90% number. If the range estimate is 70% to 98%, then we say “with 95% confidence” that the VE for the general population is above 70%. That’s still an impressive number, and it has the advantage of respecting scientific principles.
Imagine repeating the same clinical trial over and over, each time with a different set of 30,000 participants. The margin of error says that 95% of the VE estimates from these trials will fall inside the 70% to 98% range. The chance that any of these trials reproduces the exact 90% VE is practically zero.
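To make this concrete, here is a minimal simulation sketch. All the numbers below (arm sizes, attack rates) are invented for illustration and are not taken from any actual trial: each simulated repetition of the “same” trial yields a different VE estimate, scattered around the true value.

```python
import numpy as np

# Invented numbers for illustration only: 15,000 people per arm,
# 1% attack rate in the placebo arm, true VE of 90%.
rng = np.random.default_rng(0)
n_per_arm = 15_000
placebo_rate = 0.01
vaccine_rate = placebo_rate * (1 - 0.90)

ve_estimates = []
for _ in range(10_000):  # "repeat" the trial many times
    placebo_cases = rng.binomial(n_per_arm, placebo_rate)
    vaccine_cases = rng.binomial(n_per_arm, vaccine_rate)
    # VE estimate = 1 - ratio of attack rates (equal arm sizes cancel out)
    ve_estimates.append(1 - vaccine_cases / placebo_cases)

ve_estimates = np.array(ve_estimates)
# The estimates spread out around 90%; very few repetitions land on exactly 90%.
print(np.percentile(ve_estimates, [2.5, 50, 97.5]))
```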
The randomized clinical trial (RCT) is the gold standard for establishing cause and effect. The lack of randomization in “real-world” studies opens a can of worms.
In a vaccine trial, we compare the case rates of those who are vaccinated to those who aren’t; these are the primary ingredients of the vaccine efficacy formula. We are currently being served new studies every week that also compare vaccinated people to unvaccinated people. The media are telling us that these new studies are better because (a) they are more recent, (b) they have much larger sample sizes, and (c) they constitute “real-world” evidence. A popular theme is that these studies fill in the gaps left by the vaccine trials.
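For reference, the vaccine efficacy formula is just one minus the ratio of attack rates between the two arms. A quick sketch with invented counts (not taken from any specific trial):

```python
def vaccine_efficacy(vaccinated_cases, vaccinated_n, unvaccinated_cases, unvaccinated_n):
    """VE = 1 - (attack rate among vaccinated) / (attack rate among unvaccinated)."""
    return 1 - (vaccinated_cases / vaccinated_n) / (unvaccinated_cases / unvaccinated_n)

# Invented counts: 10 cases among 15,000 vaccinated vs. 100 among 15,000 unvaccinated
print(vaccine_efficacy(10, 15_000, 100, 15_000))  # 0.90
```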
The noise you’re hearing is the chuckling of incredulous statisticians. Since RCTs are regarded as the gold standard for causal inference, no “real-world” evidence should be modifying, and certainly not correcting, findings from a scientific experiment. Doing so is like hiring a C student to tutor an A student, and siding with the C student when they disagree.
There is one critically important property missing from all real-world studies: the randomization of treatment. (See, however, so-called natural experiments.) In an RCT, a coin flip determines who gets the vaccine, but in the real world, who gets the shots is anything but random. Most countries have priority lists, and specific types of people are getting the vaccines earlier than others. In any “real-world” study, the unvaccinated group differs from the vaccinated group not just by vaccination status but also by many other factors, both known and unknown. Any of those other factors could contribute, in a major or minor way, to the observed difference between the two groups. Randomizing treatment ensures that, on average, these other factors will not bias the finding in an RCT, a condition that does not exist in real-world studies.
A partial solution is to define a better control group. We’d take a subset of the unvaccinated people who look like those who are vaccinated. This is an effort to manufacture the randomization condition artificially. This matching process is highly subjective, and we can only match people using factors that are both measurable and influential. There is no law that dictates that every important factor is measurable. As an illustrative example, perhaps people with higher-speed Internet connections are more likely to land vaccination slots. Since medical researchers do not have individual data on people’s Internet speeds, we cannot control for this effect. Even the most obvious adjustments have problems. For example, we might exclude young people because they don’t yet qualify for vaccination, but young people who are front-line workers are getting shots.
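Here is a toy sketch of what matching on measured factors looks like. The columns (age group, front-line status) are hypothetical stand-ins; real studies use many more variables and fancier machinery such as propensity scores.

```python
import pandas as pd

# Hypothetical data: pick unvaccinated "controls" whose measured profile
# matches someone in the vaccinated group.
people = pd.DataFrame({
    "vaccinated": [1, 1, 0, 0, 0, 0],
    "age_group":  ["65+", "18-40", "65+", "18-40", "41-64", "41-64"],
    "frontline":  [True, True, True, False, False, True],
})

vaccinated = people[people["vaccinated"] == 1]
unvaccinated = people[people["vaccinated"] == 0]

# Keep only unvaccinated people who match a vaccinated profile on measured factors.
matched_controls = unvaccinated.merge(
    vaccinated[["age_group", "frontline"]].drop_duplicates(),
    on=["age_group", "frontline"],
)
print(matched_controls)
# Unmeasured factors (say, internet speed) never enter the matching and stay unbalanced.
```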
Another common issue is incurable imbalance: if all care home residents have been vaccinated, we won’t find any care home residents in the unvaccinated group. The biggest problem with matching studies is the lack of transparency. The typical disclosure is vague and confusing; often I can’t even figure out what they did. None of these studies publish their data or code.
So, a real-world study that corrects for selection bias is better than one that doesn’t, but in no case is an observational study superior to an RCT. Larger sample sizes produce more precise estimates (with lower margins of error), but they do not auto-correct biases. More precise estimates derived from biased data pose even greater risk, precisely because they feel more solid when the biases are ignored.
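A small simulation of that last point, with all numbers invented: when the vaccinated group starts out at a lower baseline risk than the unvaccinated group, the VE estimate is biased upward, and enlarging the sample only tightens the interval around the wrong answer.

```python
import numpy as np

rng = np.random.default_rng(1)
true_ve = 0.60
unvax_rate = 0.02            # unvaccinated group has a higher baseline risk...
vax_baseline = 0.01          # ...than the people who got vaccinated (confounding)
vax_rate = vax_baseline * (1 - true_ve)

for n in (1_000, 100_000):   # sample size per group
    vax_cases = rng.binomial(n, vax_rate, size=5_000)
    unvax_cases = rng.binomial(n, unvax_rate, size=5_000)
    estimates = 1 - vax_cases / unvax_cases
    print(n, "people/group -> mean VE", round(estimates.mean(), 2),
          "spread", round(estimates.std(), 3))
# Both sample sizes center near 0.80 rather than the true 0.60;
# the big sample is just more precisely wrong.
```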
The randomized clinical trial is a victim of abuse.
In reviewing the torrent of studies that have come out of the vaccine trials, I can’t help but notice that analysts are mercilessly abusing the RCT framework. I’m going to echo Andrew Gelman’s criticism of the standards of research in psychology. We’re concerned about the high probability of false-positive findings.
The VE numbers from Pfizer, Moderna, AstraZeneca, etc. are not comparable to each other. That’s because each team measures case rates differently. Pfizer drops all cases prior to 7 days after the second shot; Moderna and AstraZeneca drop cases prior to 14 days after the second shot. Then, you have people reanalyzing the data, who use their own case-counting windows. The U.K. government published a post-hoc analysis of the Pfizer data, counting only cases between 15 and 28 days after the first dose; this is echoed in a Canadian re-analysis of the same data, in which they set the window at 14 to 20 days. A more recent “real-world” study out of Scotland counted from 28 to 34 days after the first dose. This is no small matter, as VE is an improvement metric relative to a baseline, and we can’t keep our heads straight about which baseline applies.
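To see why the counting window matters so much, here is a toy calculation with invented case counts (equal-sized arms assumed, so the ratio of case counts stands in for the ratio of attack rates):

```python
# Invented case counts by period, for two equal-sized arms
cases = {
    "days 0-14":  {"vaccine": 40, "placebo": 45},   # before any protection kicks in
    "days 15-90": {"vaccine": 10, "placebo": 100},  # after protection kicks in
}

def ve(vaccine_cases, placebo_cases):
    # Equal-sized arms: the ratio of case counts equals the ratio of attack rates
    return 1 - vaccine_cases / placebo_cases

all_vaccine = sum(p["vaccine"] for p in cases.values())
all_placebo = sum(p["placebo"] for p in cases.values())
print("count every case:      VE =", round(ve(all_vaccine, all_placebo), 2))  # ~0.66
print("drop the early window: VE =", round(ve(10, 100), 2))                   # 0.90
# The same trial data yields very different headline numbers depending on the window.
```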
Many scientists justified the choice of when they start counting cases, on the record no less, by pointing to the cumulative case curve, arguing for counting from the moment the vaccine’s curve starts flattening because that’s exactly how long it takes for the vaccine to do its magic. This type of post-hoc thinking is what teachers warn students not to do. These clinical trials were sized to measure the endpoint, not to identify a turning point on a case curve. Imagine repeating this trial many times over. Even if the VE after 90 days comes out to roughly 90 percent in each trial, it is unlikely that the turning point on the curve lands precisely at day 14 (or wherever it was in the first trial) every time.
Analysts appear to have looked at their individual datasets, identified the period in which the vaccine showed the best performance against the placebo, defined their VE metric to home in on that time window, and justified the definition with some sciency blather. This is a recipe for over-estimating the true performance.
The proof is the absence of new studies that go through similar post-hoc thinking to adjust the findings of the respective RCT downward. In fact, the re-analyses of the Pfizer data, which moved the estimated VE of the first dose from around 50% to 90%, are a nice example of a C student correcting an A student.
To guard against the danger of post-hoc theorizing, scientists agree to pre-register their studies so that they have to define how they will count cases ahead of time. Unfortunately, those protocols are written loosely to allow a wide range of options, such as 7 days, 14 days, 21 days, etc., as possible evaluation metrics. This perhaps exposes how little investigators know about the behavior of their inventions.
If only the start of the case-counting window were the sole “researchers’ degree of freedom”. The researchers also follow their noses on (a) the duration of the case-counting window, (b) whether a positive test is required to confirm cases, (c) which test is allowed for confirmation and the parameters for running the test, (d) the maximum length of time between reporting symptoms and testing positive, (e) what symptoms are included and excluded as qualifying, and (f) how many symptoms are qualifying. As I showed in this post, VE drops fast with just a few more cases on the vaccine arm, so any of these decisions matters.
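A quick back-of-the-envelope sketch of that sensitivity, with invented counts and equal-sized arms: shifting a handful of cases onto the vaccine arm knocks the headline VE down quickly.

```python
placebo_cases = 100   # held fixed; equal-sized arms assumed (invented numbers)
for vaccine_cases in range(10, 31, 5):
    ve = 1 - vaccine_cases / placebo_cases
    print(f"{vaccine_cases} vaccine-arm cases -> VE = {ve:.0%}")
# 10 cases gives 90% VE; just 20 more cases drags it down to 70%.
```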
The RCT framework does not protect us against these potential abuses. RCTs are the gold standard, but they aren’t fool-proof.
Story time after RCT
The potential problems outlined in the previous section relate mostly to trial design decisions. The vaccine RCTs are also being suffocated by post-hoc hacking. I already devoted an entire post to this phenomenon of “story time”. Not all results obtained from analyzing data from RCTs have the crucial randomization property. That’s because analysts can destroy the property by cherry-picking the data. We’re lulled into thinking these lesser findings have the same status as a proper RCT finding.
The most infamous example of such analysis is the 90% VE attributed to the so-called low-dose, standard-dose subgroup in the AstraZeneca-Oxford trial (link). This VE is indeed based on comparing the case rate of a vaccine arm to that of a placebo arm. In a true RCT, those two arms are statistically the same except for what was placed in the two shots. Nevertheless, the low-dose subgroup happened by accident, and it subsequently emerged that this subgroup contained only younger participants, had earlier enrollment dates, had longer intervals between doses, and so on. And yet, on the front page of the Lancet paper summarizing this trial, the investigators stated “In participants who received two standard doses, vaccine efficacy was 62.1%... and in participants who received a low dose followed by a standard dose, efficacy was 90.0%...” By that point, there had been no disclosure about the “accident,” and anyone reading this sentence is likely to assume that there were two randomized dosage schedules used in this trial.
Likewise, much ink has been spilled on dose intervals. Many analysts have compared subgroups that took their second shots later and earlier. Even though the underlying data were collected in an RCT, dose intervals were not randomized. And yet, these results were published and publicized using methods for analyzing RCTs and not methods for analyzing observational studies. In some cases, the researchers even disclosed that these subgroups differed on numerous important factors, and still they plodded on.
**
In this era, even results coming out of RCTs must be scrutinized. Early analyses, post-hoc analyses, deep-dive analyses, and side analyses typically discard the crucial randomization property, and should be treated as observational studies.
Of course, results from observational ("real-world") studies should be scrutinized even more. In a future post, I’ll outline how to review analyses of observational data.
Correct me if I am wrong, but I think an assumption of random incidence would be wrong in these cases (perhaps a carry-over from throw-one’s-hands-in-the-air epidemiological modeling) and stratification is needed. The only question is with which variables. We come across this frequently, with the main problem being the achievement of projectable sample size. Better to be partly right though....
Posted by: Coleen B | 02/25/2021 at 11:44 AM
CB: Yep, the whole enterprise of observational studies is to make adjustments and correct biases. Not adjusting at all is to allow the biases to fester. I think a good way of putting it is "we are confident the answer is partly right if our adjustments capture the key biases". Also, what did the researchers do to check their confirmation bias at the door?
Posted by: Kaiser | 02/25/2021 at 12:32 PM