When reviewing the clinical trial design for the Moderna vaccine, I mentioned the statistical fallacy known as testing to significance.
Every pharma developing a vaccine or drug must prove a statistically significant (and clinically relevant) outcome. The difference between the treatment and placebo arms must be large enough to overcome background noise coming from sampling, measurement error, and so on. Any statistical result comes with a margin of error, typically set at 5 percent. For a clinical trial, this means there is a five percent chance that a statistically significant difference is detected even if the vaccine is useless.
Even when the vaccine is useless, five percent of the time, the clinical trial yields a benefit large enough to be "significant". Because of this, if we keep analyzing the difference between treatment and placebo day after day after day, eventually, there will be a day when the difference meets the significance threshold - even if the treatment is useless.
If we keep checking the differences beyond that day, we will discover that the significance does not last. That's not the problem. The fallacy arises if the trial is declared a success at first sight of significance.
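To get a rough sense of how quickly this risk builds up, here is a back-of-envelope calculation in Python (my own illustration, not part of the original analysis). It treats the repeated looks as independent, which they are not - cumulative counts at nearby looks are highly correlated, so the true rate is lower - but the direction is the same: the more looks, the more chances for a spurious "significant" day.

```python
# Chance of at least one falsely "significant" look among k looks at alpha = 0.05,
# under the simplifying assumption that the looks are independent.
alpha = 0.05
for k in (1, 5, 10, 24):
    print(f"{k:2d} looks: {1 - (1 - alpha) ** k:.0%}")
# roughly 5%, 23%, 40%, 71%
```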
***
I now use a simulation to demonstrate this statistical fallacy.
For the simulation, I generate 10,000 alternate worlds. In each world, I track 5000 people who get the vaccine and 5000 other people who receive the placebo. I then count the cumulative infections over time in each arm. I assume the vaccine is useless, so that infections grow at a normal rate within the community, regardless of whether the participant got the vaccine or the placebo.
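Here is a minimal sketch of such a simulation in Python. This is not the code behind the charts below; the daily infection probability and the binomial infection model are my own assumptions, chosen only to make the setup concrete.

```python
import numpy as np

rng = np.random.default_rng(2020)

N_WORLDS = 10_000      # alternate worlds
N_PER_ARM = 5_000      # participants in each arm
N_DAYS = 760           # follow-up horizon in days
DAILY_RISK = 1e-4      # assumed daily infection probability, identical in both arms (useless vaccine)

def simulate_arm(n_worlds=N_WORLDS, n_per_arm=N_PER_ARM,
                 n_days=N_DAYS, daily_risk=DAILY_RISK):
    """Cumulative infection counts for one arm, shape (n_worlds, n_days)."""
    # New infections each day ~ Binomial(n_per_arm, daily_risk); infections are rare,
    # so I ignore the slight shrinkage of the at-risk pool over time.
    new_cases = rng.binomial(n_per_arm, daily_risk, size=(n_worlds, n_days))
    return new_cases.cumsum(axis=1)

vaccine_cases = simulate_arm()
placebo_cases = simulate_arm()
```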
The following line chart shows a single "sample path", the growth curves of infections in one of those 10,000 scenarios. (The blue line represents the vaccine group, the green line, placebo. Fewer cases are better.)
The rates of infection are just these counts divided by 5000, so the curves of rates look exactly the same.
As per the trial protocol, I follow both groups for up to 760 days (2 years after the second injection).
In the next chart, I add a third line (in yellow) to display the difference between the placebo and the treatment infections. Positive differences mean fewer infections in the vaccine arm than in the placebo arm (the vaccine may be beneficial); negative differences mean the placebo arm has fewer infections (the vaccine may be harmful).
The raw differences don't matter as much as their statistical significance. When we say a difference is statistically significant, we are confident that the difference is driven by the vaccine, as opposed to background noise.
I perform significance testing on the infection rates over the course of the trial, at Day 60, 90, 120, ..., up to Day 760. Around Day 90, the gap between the placebo and vaccine arms is large enough to meet the statistical significance threshold.
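Continuing the sketch above, a two-proportion z-test is one standard way to judge such a gap. The 30-day look schedule and the helper below are my own illustration, not the trial's actual analysis plan.

```python
from scipy.stats import norm

def gap_p_value(placebo_count, vaccine_count, n_per_arm=N_PER_ARM):
    """Two-sided two-proportion z-test comparing the two infection rates."""
    p1, p2 = placebo_count / n_per_arm, vaccine_count / n_per_arm
    pooled = (placebo_count + vaccine_count) / (2 * n_per_arm)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n_per_arm)
    return 1.0 if se == 0 else 2 * norm.sf(abs(p1 - p2) / se)

LOOK_DAYS = range(60, 761, 30)   # Day 60, 90, 120, ..., 750

world = 0                        # one sample path
for day in LOOK_DAYS:
    p = gap_p_value(placebo_cases[world, day - 1], vaccine_cases[world, day - 1])
    print(f"Day {day:3d}: p = {p:.3f}{'  <-- significant' if p < 0.05 else ''}")
```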
If I look again on Day 240, the difference is not significant. The gap does not hold - this is in line with the assumption that the vaccine is useless. The fluctuations of the yellow line (differences in cases) are driven by random variability. (Important note: this phenomenon has nothing to do with a vaccine losing effectiveness over time. In my simulation, the vaccine is assumed useless.)
What if I test to significance? This means I end the research after finding the Day-90 difference to be significant, and conclude that the vaccine is beneficial. This is equivalent to erasing the rest of the line the first moment the line hits the significance threshold.
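Testing to significance then amounts to a one-line stopping rule, something like the hypothetical helper below: report the first look that clears the threshold and never look again.

```python
def first_significant_day(world, look_days=LOOK_DAYS, alpha=0.05):
    """Return the first scheduled look at which the gap is 'significant', else None."""
    for day in look_days:
        p = gap_p_value(placebo_cases[world, day - 1], vaccine_cases[world, day - 1])
        if p < alpha:
            return day   # analysis stops here; the later data never see the light of day
    return None
```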
Such a conclusion is spurious as the simulation stipulates that the vaccine is no better than the placebo. Besides, if I continue to track the data, the Day-90 difference does not hold up.
***
Recall that I've created 10,000 alternate worlds.
The next chart shows 20 lines: each line is the difference between placebo and vaccine in one of 20 alternate worlds (selected randomly from the 10,000). I am showing you the background noise. By Day 750, in most scenarios the observed differences hover around zero, but in a few cases the gaps are around 100 cases in one direction or the other.
The next chart shows what happens if I stop the analysis at first sight of significance. In over 25 percent of these scenarios, a winner is determined before the end of two years even though, by assumption, the vaccine and the placebo groups should respond similarly. In fact, in two scenarios (10 percent), the significance threshold is crossed by Day 90.
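Running the stopping rule over every simulated world gives this kind of false-win rate directly. The exact percentages depend on my assumed infection rate and look schedule, so treat this as illustrative rather than a reproduction of the chart above.

```python
stop_days = [first_significant_day(w) for w in range(N_WORLDS)]
stopped_early = sum(d is not None for d in stop_days)
by_day_90 = sum(d is not None and d <= 90 for d in stop_days)
print(f"Worlds with a 'significant' gap at some look: {stopped_early / N_WORLDS:.0%}")
print(f"Worlds already 'significant' by Day 90: {by_day_90 / N_WORLDS:.0%}")
```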
The basic point is that if I follow each line far enough, eventually I will find a day on which the vaccine shows a significant benefit relative to the placebo. If the trial continues indefinitely, those non-significant lines will eventually reach significance!
Here are 50 scenarios in which the vaccine can be declared the winner by Day 120. The thick part of each line is observed; the thin part will not see the light of day if I declare mission accomplished by Day 120.
I remind you once more that the simulation assumes the vaccine is useless. Any finding of statistical significance is an error. This includes the lines at the top of the chart in which significance appears to hold through Day 750.
***
Let's move from the simulated fake world to the real world. In any single clinical trial, all we observe is one of these sample paths. We don't have the luxury of comparing the one observed path to the background noise. The path reveals itself only over the course of the trial, and we cannot predict where it is heading - as the simulation demonstrates, any path can turn at any moment.
In order to prove the success of a vaccine, the trial must demonstrate a statistically significant reduction in infections. The crucial question is: by what time? The wrong answer is whenever we first see a significant difference, before we run out of resources or patience.
This answer is wrong because it essentially guarantees success, bringing along a high likelihood of a spurious result. In the simulation, I force the vaccine to be useless (no signal, all noise). In real life, the vaccine might be slightly useful (weak signal, plus noise). That situation is even harder to untangle because a few of the significant readings may in fact be real.
An antidote to the statistical fallacy is patience. This is in short supply during this pandemic. The political and societal expectations are exactly the opposite - stressing "warp speed". The lurking danger is confirmation bias. The pharmas track the trial results over time, and at first sight of significance, they face tremendous pressure to declare mission accomplished.
That's why I feel uncomfortable that the design documentation for the Moderna trial does not specify analysis time points. It gives an "up to" two-year time frame, which everyone knows is not real. They are not saying it would take two years to get a scientific result; in fact, they have told the media that we might get an answer within a few months!
***
This situation is not limited to clinical trials. I've encountered this debate throughout my career in marketing analytics. Let me give you some more color on the nature of the debate.
In one of the charts above, I show 50 scenarios in which the vaccine can be declared a winner by Day 120. The following chart displays a different set of 50 scenarios: these paths lead to the placebo reaching significance by Day 120.
If the one observed sample path comes from this chart, then by Day 120, we should stop the trial early, and declare the vaccine a failure. (As before, any such finding is spurious because I set up the simulation to assume no difference between the placebo and vaccine groups.)
You don't think the vaccine effort will be aborted, do you? Neither do I.
The decision-makers will pummel the data analysts with endless questions to make sure they did not make a mistake. They can easily come up with pages of reasons why we need more data. Shutting down the study is like closing a losing position in a stock investment; you have to immediately write off the billions invested. It feels better to keep the position in the hope that the stock price - or the response curves - will reverse course.
Now, the exact reverse happens if the initial analysis shows the vaccine to be winning. The data analysts urge patience, wanting more data before drawing a conclusion. The decision-makers demand swift action, arguing that they can't afford to wait.
In one case, the statisticians are criticized for being too "academic" and cautious; in the mirror scenario, they are blamed for not being "certain" and for relying on imperfect data.
Is there a way out of this? One solution is called "pre-registration". All parties have these discussions before the clinical trial even begins. Lay out the possible outcomes. Gain consensus on what to do under each outcome. Once the data start appearing, it is virtually impossible for any side to remain neutral.
P.S. [8/4/2020] Based on some side conversations I've been having with a few readers, let me clarify a couple of things. I'm not suggesting that Moderna is planning to test to significance. I expect that they have a protocol to mitigate this issue. I was expecting to find breadcrumbs in the document submitted to ClinicalTrials.gov, but it contains no information about how the data would be analyzed. Experts say the regulators should have received a more detailed protocol, but that has not been made public.
Statisticians have developed solutions to deal with this issue. Pre-specifying stages of analysis is important, and the threshold for statistical significance is adjusted to account for each early view of the data.
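One simple (and deliberately conservative) version of such an adjustment is a Bonferroni-style split of the 5 percent error budget across the pre-registered looks. Real trials typically use more refined group-sequential boundaries such as O'Brien-Fleming, but this sketch, reusing the earlier simulation, conveys the idea.

```python
PLANNED_LOOKS = [90, 240, 760]              # hypothetical pre-registered analysis days
ALPHA_PER_LOOK = 0.05 / len(PLANNED_LOOKS)  # Bonferroni: split the error budget across looks

def adjusted_significant_day(world):
    """Declare a result only if a pre-registered look clears the adjusted threshold."""
    for day in PLANNED_LOOKS:
        p = gap_p_value(placebo_cases[world, day - 1], vaccine_cases[world, day - 1])
        if p < ALPHA_PER_LOOK:
            return day
    return None
```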
Edited sentence per conversation with AR in the comments.
one day your pulse will flatline...
Or the Nassim Taleb analogy of the turkey fed by the butcher for 1000 days, but on day 1001...
Posted by: Michael Droy | 08/03/2020 at 03:09 PM
"there is a five percent chance that the vaccine may be useless even if a statistical significant difference is detected."
seems ambiguous to me; a clearer sentence could be:
"there is a five percent chance that a statistical significant difference is detected even if the vaccine is useless."
Posted by: Antonio Rinaldi | 08/04/2020 at 02:36 AM
AR: Of course you're right about that. The first number is not five percent as stated; it's typically more than five percent if the second number is five percent. I'll reword it to remove this confusion.
Posted by: Kaiser | 08/04/2020 at 12:11 PM