Good data analysts have developed an instinct for perking up when results look "too good to be true". This is a must-have attribute that I'm always looking for in a job candidate. It is a protective instinct, a reflex that keeps analysts from getting fooled by the data. Nassim Taleb called this type of thing "fooled by randomness". Here, the trap is not randomness itself but the lack thereof.
Observational studies of the Covid-19 vaccines are a case in point. I have been spooked by the consistency of estimates of vaccine effectiveness (VE). Almost all peer-reviewed studies, and most working papers that get mainstream attention, report VE values in the 70-90 percent range. (Since these are relative ratios, even a 50-percent improvement is already a huge effect.) There is consistency across the headline results of different studies, and there is also consistency within individual studies: across segments, across outcome metrics, across methodologies, and so on.
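As a quick aside for readers unfamiliar with the metric: VE is one minus a relative risk ratio, so a value of 70% means the risk among the vaccinated is cut to 30% of the risk among the unvaccinated. The snippet below spells out the arithmetic, using made-up attack rates that are not from any study:

```python
# Standard vaccine-effectiveness formula: VE = 1 - relative risk.
# The attack rates below are hypothetical, for illustration only.

def vaccine_effectiveness(rate_vaccinated: float, rate_unvaccinated: float) -> float:
    """VE = 1 - (attack rate among vaccinated / attack rate among unvaccinated)."""
    return 1 - rate_vaccinated / rate_unvaccinated

# Hypothetical attack rates per 1,000 person-weeks
print(vaccine_effectiveness(3, 10))  # 0.70 -> "70% effective": risk cut to 30% of baseline
print(vaccine_effectiveness(5, 10))  # 0.50 -> even 50% VE halves the risk
```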
Below is Table 2 from the traffic accidents study that I have been reviewing (link):
The aggregate result (derived from the inadmissible non-adjusted comparison discussed in the first post of this series) is shown in the last row and represents a value of 1.7x. The rows above come from applying the same methodology to subsets of the study population, some of which are mutually exclusive while others overlap. Most of the values cluster near that 1.7x value. Only one segment (over 65) points in the opposite direction, and only barely. What does this consistency tell us?
Shown below is the top part of Table 3. Here, the researchers analyzed many variants of the outcome, applying the same analytical strategy. Whereas Table 2 above covers any involvement in traffic accidents, Table 3 below breaks down these accidents into subtypes, such as involvement as driver versus pedestrian, time of day, and so on.
The consistency is once again remarkable. The aggregate number is 1.7x (top row), while the other numbers range from 1.5x to 5.1x. Almost all of the variation is on the high side of the aggregate. What does such consistency tell us?
In the Appendix, the authors rolled out even more rows of analysis:
Again, all effects are in the range 1.4x to 1.7x. What does this consistency mean? That's what I want to explore in this post.
***
In the mainstream, such consistency is portrayed as resounding evidence. Each individual study that generates another impressive relative ratio similar to the previous result is regarded as another confirmation. The authors of the traffic accidents study believe that because they saw the same consistent number everywhere they looked, they must have found something really important - like Newton's Law of Gravity.
Unlike Newton, they aren't studying the realm of physics. The study's data concern human and social behavior, and it is actually shocking, rather than comforting, to find a law of nature in this arena. My prior - from decades of looking at data about humans - is that (a) aggregate effects are typically small, so that even a 10% improvement is remarkable, and (b) there is substantial variability when looking across different segments, different metrics, etc., which is another way of saying that "interaction effects" are important.
Were these researchers, mainstream scientists, and the mass media being fooled by (lack of) randomness?
***
Other than discovering some immutable law of nature, what are other possible explanations for the remarkable consistency?
File Drawer Effect
Let's say there is a medical journal editor who only publishes vaccine studies that show high vaccine efficacy - on the grounds that anything else has little societal value. The result of this editorial practice is that (almost) every study that sees the light of day will have a VE estimate above some lower bound (say, 70%). In this scenario, the consistency of all the peer-reviewed results may have nothing to do with the vaccines themselves, but can be explained by publication bias filtering out studies below that bound. This bias is sometimes called the "file drawer effect," describing researchers who stash away "failed" studies (e.g. p-value > 5%) in their file cabinets since journals would reject them.
A consequence of the file drawer effect is that the average of the published results over-estimates the true vaccine efficacy.
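Here is a minimal simulation of that filter, under assumed numbers: a true VE of 60%, study estimates scattered around it with sampling noise, and a journal that only publishes estimates above 70%. The published average comes out far higher than the truth:

```python
# Minimal sketch of the file drawer effect. All numbers are assumptions for illustration.
import random

random.seed(1)

TRUE_VE = 0.60            # assumed true vaccine effectiveness
NOISE_SD = 0.15           # assumed study-to-study sampling noise
PUBLICATION_FLOOR = 0.70  # assumed editorial cutoff

studies = [random.gauss(TRUE_VE, NOISE_SD) for _ in range(10_000)]
published = [ve for ve in studies if ve > PUBLICATION_FLOOR]

print(f"Mean of all studies:        {sum(studies) / len(studies):.2f}")      # ~0.60
print(f"Mean of published studies:  {sum(published) / len(published):.2f}")  # ~0.80, inflated
print(f"Share of studies published: {len(published) / len(studies):.0%}")
```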
In the context of Covid-19 vaccines, the file drawer effect may be related to the phenomenon of cargo cult science.
Cargo Cult Science
Cargo cult science is a term coined by physicist Richard Feynman. Roughly speaking, he warned that research can still turn out to be bad science even when scientists follow so-called best practices.
Other than the initial clinical trials conducted in 2020 (which were scientifically aborted upon emergency use authorization, when the placebo groups were destroyed), all evidence of Covid-19 vaccine efficacy has come from observational datasets. Observational datasets either lack controls or, where controls exist, treatment status was not randomly assigned, which means that unvaccinated people differ from vaccinated people in multiple dimensions besides vaccination status. Therefore, it is wrong to attribute all observed differences between the two groups to vaccination alone.
Let's teleport back to the first part of 2021, weeks after the mass vaccination campaign commenced, when the first observational study appeared in print. For the sake of argument, let's say the first analysis showed a VE of 50%. How do we know whether this estimate is to be believed, given that the people who got vaccinated were clearly different from those who didn't? The honest answer is that no one knows, because there is no ground truth in an observational study.
What circumstantial evidence do we have? The only breadcrumbs are interim analysis results from randomized clinical trials that produced VE of 90%+. We face a gnarly problem because our hypothetical observational study led to a VE estimate of 50%. One of the numbers is not like the other; they cannot both be correct. But which study is wrong?
Not surprisingly, most would give more credence to the RCT results, since RCTs are considered the gold standard of medical research. Because our observational study came in much lower, we'd go back to the drawing board and massage the dataset some more... until the estimated VE lands closer to 90%, making it appear more credible. The additional data massaging is justified by best practices - we are simply addressing residual confounding bias.
(It is important to realize that randomized experiments can be analyzed in a straightforward, standardized manner, but observational studies rely on statistical adjustments that involve subjective judgment from researchers. Analysts try to guess what biases exist in the data, and devise ways of correcting for them. It's an iterative process with limited guardrails.)
To make sure you get what's going on, think about what happens if the first analysis of an observational dataset leads to a VE value of 90%. This result would be immediately accepted. Yet it's not clear that this research team did a better job adjusting for bias than the team in the earlier scenario that came up with 50%.
Since the RCT result was the only piece of available evidence when the first observational study showed up, it became a filter for all subsequent observational studies. Any observational study that produces a number close to 90% feels comfortable, and any study that shows a number far from 90% feels "wrong". Such an editorial process directly triggers the file drawer effect.
***
I don't know how we landed on this stupid state of affairs. Using an observational study to "confirm" an RCT result - the gold standard - is like asking a C student to grade an A student. After the teacher has already given the A student a top grade, what additional information is gleaned by having a C student grade the A student's exam?
What's happening is even more dubious. The C student is not grading the same exam that the teacher saw - the C student is asked to grade a different exam submitted by the A student. By this, I mean: the observational studies are different from the RCTs in meaningful ways, e.g. timing of vaccinations, incidence of Covid-19, types of people eligible for vaccination, residence of patients, no placebo shots... There is scant justification for assuming that the observational studies should produce VE values in the same ballpark as the RCTs!
When another observational study comes out proclaiming 70-90% VE, it doesn't mean much. It just tells me that the research team has selected a set of adjustments so that the VE estimate lands in the zone that is considered acceptable for publication, i.e. it falls in the vicinity of the RCT result.
Lurking Variables
Another way in which we can get fooled by the consistency of results is via unknown or unexplored sources of bias. Observational studies can only control for factors that have been measured. I have not yet encountered an observational study of Covid-19 vaccines in which the researchers collected their own data. The current practice is to pull data from pre-existing government databases, so the variables available are typically limited to basic demographics (age, gender, etc.) or health-related data (prior health concerns, healthcare usage, etc.). Important factors, such as attitudes toward freedom and government authority, risk tolerance, and exposure to the virus, are not measured, and therefore not adjusted for.
If key sources of bias are not adjusted for, they could explain why the results are so consistent among the subgroups; each such result contains residual confounding from the missing variables.
The missing variable may be a common cause of both vaccination status and infection outcome - for example, being generally health-conscious and healthier pre-Covid. Such people are both more likely to get vaccinated and less likely to get infected (regardless of vaccination status). Since the government databases have no labels for this type of person, there is no way to correct this bias. It is delusional to think that dumping age and gender variables into a regression equation properly adjusts for such a bias. There are health-conscious people in every age-by-gender segment! Within each such demographic segment, vaccination status is still confounded with health-seeking status.
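To see how stubborn this kind of bias is, here is a sketch with made-up probabilities. The vaccine has zero effect by construction, yet the naive comparison awards it a positive "VE" - and it does so within every age-by-gender cell, because the unmeasured health-conscious trait lives inside each cell:

```python
# Sketch of residual confounding from an unmeasured "health-conscious" trait.
# All probabilities are made up. The vaccine has NO effect by construction.
import random
from collections import defaultdict

random.seed(2)

def simulate_person():
    age_group = random.choice(["18-39", "40-64", "65+"])
    gender = random.choice(["F", "M"])
    health_conscious = random.random() < 0.5          # unmeasured trait
    p_vaccinated = 0.8 if health_conscious else 0.4   # assumed uptake gap
    p_infected = 0.05 if health_conscious else 0.15   # assumed risk gap, no vaccine effect
    return age_group, gender, random.random() < p_vaccinated, random.random() < p_infected

# counts per cell: [infected & vaccinated, n vaccinated, infected & unvaccinated, n unvaccinated]
counts = defaultdict(lambda: [0, 0, 0, 0])
for _ in range(300_000):
    age, gender, vax, inf = simulate_person()
    cell = counts[(age, gender)]
    if vax:
        cell[0] += inf
        cell[1] += 1
    else:
        cell[2] += inf
        cell[3] += 1

for (age, gender), (vi, vn, ui, un) in sorted(counts.items()):
    apparent_ve = 1 - (vi / vn) / (ui / un)
    print(f"{age} {gender}: apparent VE = {apparent_ve:.0%}")  # roughly 33% in every cell, all spurious
```

The point is not the specific number but its uniformity: adjusting for age and gender cannot touch a confounder that is present in every age-by-gender cell.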
Multiple Comparisons
When so many analyses are performed using so many metrics, the best practice is to adjust for multiple comparisons. The general idea is that if one screens a lot of metrics, a few of them will accidentally show statistical significance, and those results are highly likely to be spurious. In other words, without adjustment, traditional statistical methods understate the true variability in the outcomes, so results appear more consistent than they really are. Yet most of these vaccine studies sidestep this issue.
I'd be shocked if elementary medical statistics courses don't cover the problem of multiple comparisons. Practitioners in this discipline seem to think that fishing for significance is fine so long as they pre-specify all their intended analyses. Actually, pre-registration does not solve the problem of multiple comparisons at all, unless the pre-registered statistical analysis explicitly outlines a strategy for dealing with it. (Unfortunately, the randomized trials also pre-registered multiple comparisons without correction.)
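A simple demonstration of the problem: screen a few dozen outcome metrics on pure noise, and a few of them will clear the usual significance bar by chance. The sketch below uses arbitrary assumptions (40 metrics, 500 people per group) and a crude normal-approximation test, with a Bonferroni correction shown as the simplest fix:

```python
# Sketch of the multiple-comparisons problem: test 40 metrics with no true
# differences and count "significant" results, with and without correction.
# The number of metrics, sample sizes, and the test itself are simplifying assumptions.
import random
from math import erf, sqrt
from statistics import mean, stdev

random.seed(3)

N_METRICS = 40
N_PER_GROUP = 500
ALPHA = 0.05

def two_sample_p(a, b):
    """Rough normal-approximation p-value for a difference in means."""
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = abs(mean(a) - mean(b)) / se
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

p_values = []
for _ in range(N_METRICS):
    group1 = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]  # no true difference
    group2 = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    p_values.append(two_sample_p(group1, group2))

print("Significant without correction:", sum(p < ALPHA for p in p_values))             # expect ~2 by chance
print("Significant with Bonferroni:   ", sum(p < ALPHA / N_METRICS for p in p_values))  # typically 0
```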
Poorly Defined Metrics
Another reason for consistent results across many analyses is a poorly chosen scale for one's outcome metric. An analogy to grade inflation in schools is telling. American schools use a grade-point average (GPA) typically defined on a scale of 0-4, with 4 being an A. Due to grade inflation, most of the grades given out are As, so in reality the bottom of the scale (0-3.5) is basically empty, with almost all students squeezed into the 3.5-4.0 range. Of course, all students then look more or less the same.
Something like this is at play with the relative-ratio scales used to measure vaccines. I suspect that even an ineffective vaccine would show a VE substantially higher than 0%, so the grading scale is compressed just as it is in schools.
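To make the analogy concrete, suppose - purely as an assumption - that the biases discussed above would hand even a useless vaccine an apparent VE of around 60%. Then true effects anywhere from 0% to 90% all get squeezed into a narrow band at the top of the scale, just like GPAs crowding into the 3.5-4.0 range:

```python
# Toy illustration of a compressed scale. The 60% "floor" is an assumption,
# standing in for whatever apparent VE a zero-effect vaccine would receive.
ASSUMED_FLOOR = 0.60

def apparent_ve(true_ve: float, floor: float = ASSUMED_FLOOR) -> float:
    """Map a true effect onto a scale whose bottom sits at 'floor' instead of 0."""
    return floor + (1 - floor) * true_ve

for true_ve in (0.0, 0.3, 0.6, 0.9):
    print(f"true VE {true_ve:.0%} -> apparent VE {apparent_ve(true_ve):.0%}")
# true VE 0%  -> apparent VE 60%
# true VE 30% -> apparent VE 72%
# true VE 60% -> apparent VE 84%
# true VE 90% -> apparent VE 96%
```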
***
In this post, I have offered five possible reasons why every research study of Covid-19 vaccines seems to produce similar effectiveness estimates, and why there appears to be little or no variability when aspects of the methodology are altered. As a statistician who has analyzed a lot of human data, I find the least likely explanation for the consistency to be that these vaccines are like a law of nature, such an overwhelming force that it flattens all human variability.