If you run A/B tests in industry, you're used to this scenario: two days (or indeed, two hours) after the test started running, the test sponsor - someone who suggested testing that variant of the marketing copy because they believe strongly that the new version is the miracle cure they have always dreamt of - points to a huge spike in performance on the real-time tracker, and urges that the test be stopped early because the new version is obviously better, and we should pocket our gains immediately rather than conducting "academic research" for the sake of "scientific purity".
For those with any statistical training, alarm bells should be ringing loudly, because such decision-making is the mother of all false positives. The key to understanding why is to imagine what would have happened if the first few hours of performance had gone in the opposite, undesirable direction. The test sponsor would have been very quiet. (The sponsor could have knocked on your door and said we should stop the test early and declare the new version an instant dud. Having run hundreds of tests, I have never once gotten that knock on the door. It's just human nature.)
Worse, what really happens is that the test sponsor remains silent until the test version displays a "significant" positive gap relative to the control version. This fallacy is technically called "running a test to significance". The demand to stop may come in two hours, two days, five days, whenever. The only time the test runs to its designed end date is when the gap between the test and control versions never grows large enough to attract notice.
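To see how badly "running a test to significance" inflates false positives, here is a minimal simulation. Everything in it is made up for illustration: it runs an A/A test (both arms share the same true conversion rate, so any "significant" result is spurious), and lets a hypothetical sponsor peek ten times, stopping the moment the z statistic crosses the usual 1.96 threshold.

```python
import math
import random

def z_stat(conv_a, n_a, conv_b, n_b):
    # Two-proportion z statistic using the pooled conversion rate.
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (p_a - p_b) / se

def peeking_trial(n_per_arm=2000, peeks=10, p=0.10, z_crit=1.96, rng=None):
    # A/A test: both arms have true rate p, so a "significant" gap
    # at any peek is a false positive.
    rng = rng or random.Random()
    conv_a = conv_b = 0
    step = n_per_arm // peeks
    for i in range(1, peeks + 1):
        conv_a += sum(rng.random() < p for _ in range(step))
        conv_b += sum(rng.random() < p for _ in range(step))
        n = i * step
        if abs(z_stat(conv_a, n, conv_b, n)) > z_crit:
            return True  # sponsor declares victory and stops early
    return False

rng = random.Random(42)
trials = 500
false_pos = sum(peeking_trial(rng=rng) for _ in range(trials))
print(f"false-positive rate with 10 peeks: {false_pos / trials:.1%}")
```

Even though each individual look uses the nominal 5% standard, letting the sponsor stop at the first "significant" peek pushes the overall false-positive rate well above 5%.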
***
A few years ago, I wrote a piece for FiveThirtyEight, mischievously suggesting that baseball fans should leave the ballpark once the score gap is larger than a certain number of runs. It appears that most baseball fans - presumably including many business managers - think such behavior is blasphemous. Well, it's also blasphemy when they demand an early stop to an A/B test because the favored version of the marketing copy has a "substantial" lead!
It's also the same behavior when a pharmaceutical executive goes to the FDA and pushes to end a clinical trial early, which is the real topic of today's post.
***
The FDA has made a bargain with the pharmaceutical companies to allow "interim analyses," the point of which is to enable early stopping of clinical trials under clearly defined circumstances. Some safeguards have been imposed on such analyses, which, as recent events have proven, are easily gamed.
Safeguard #1 is a limit on the number of interim analyses one can conduct during the course of the trial. Each analysis contributes a false-positive probability, and the more times one "peeks" at the outcomes, the higher the chance of a spurious conclusion.
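The arithmetic behind Safeguard #1 can be sketched quickly. If each peek were an independent test at the 5% level, the chance of at least one spurious "win" across k peeks would be 1 - (1 - 0.05)^k. Real peeks reuse overlapping data, so the true inflation is smaller than this independence bound, but it still climbs far above 5%:

```python
alpha = 0.05  # nominal significance level at each peek
for k in (1, 2, 5, 10):
    inflated = 1 - (1 - alpha) ** k  # worst-case (independence) bound
    print(f"{k:>2} peeks: up to {inflated:.1%} chance of a spurious 'win'")
```

With ten peeks, the worst-case chance of a false positive is roughly 40%, eight times the nominal rate - which is why the number of interim analyses must be capped in advance.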
Safeguard #2 is to raise the required standard of evidence for earlier analyses. This is a simple application of statistics: in an interim analysis, the sample size is much smaller, so the error bar around any outcome is much wider (the error bar shrinks only with the square root of the sample size, so it's worse than linear), and thus the required gap between test and control must be larger. Instead of a 5% significance level, the required level in an interim analysis could well be 0.05%. This is useful but does carry risks. One way the gap clears the higher bar is if something strange happens before the time of the first analysis (i.e. an outlier event that will not repeat). Analogously, a baseball team might score 10 runs in the first inning - clearly enough for me to leave the ballpark - but should one expect this team to score 10 runs in the first inning of its next game?
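Both halves of Safeguard #2 can be illustrated with a short sketch. The first loop shows the square-root scaling: quadrupling the sample size only halves the error bar. The second loop shows an O'Brien-Fleming-style boundary, in which the critical z value at look k of K is scaled by sqrt(K/k); the four-look schedule and the use of 1.96 as the final critical value are illustrative assumptions, not the boundaries of any actual trial.

```python
import math
from statistics import NormalDist

# Standard error shrinks only with the square root of n:
# a quarter of the data means an error bar twice as wide.
sigma = 1.0  # assumed unit standard deviation, for illustration
for n in (250, 1000, 4000):
    se = sigma / math.sqrt(n)
    print(f"n={n:>4}: standard error = {se:.4f}")

# O'Brien-Fleming-shaped boundary: z_k = z_final * sqrt(K / k)
# for look k of K (illustrative schedule, approximate final value).
K, z_final = 4, 1.96
for k in range(1, K + 1):
    z_k = z_final * math.sqrt(K / k)
    alpha_k = 2 * (1 - NormalDist().cdf(z_k))  # two-sided p required
    print(f"look {k}/{K}: need |z| > {z_k:.2f} (two-sided p < {alpha_k:.4f})")
```

Under this illustrative schedule, the first look demands a z score near 3.92, i.e. a significance level around 0.0001 rather than 0.05 - exactly the kind of much stricter interim standard described above.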
Safeguard #3 is to pre-specify three possible decisions at the interim analysis: a) stop the trial for efficacy, b) continue to collect more data, or c) stop the trial for futility. This is a sensible requirement.
***
Sadly, recent events have shown that statisticians have made a Faustian bargain. The other side has subverted the rules, rendering these safeguards toothless.
Exhibit #1: the recent approval of aducanumab, Biogen's "treatment" for Alzheimer's disease. I have not followed the entire saga, but according to this article in StatNews (link), the FDA made many concessions throughout the process, including waiving Phase 2 trials and allowing interim analyses of the Phase 3 trials. At the interim analysis, the trials were stopped for futility. In other words, while the investigators hoped that at an early read, the drug would prove so spectacularly successful that they could stop the trial and declare victory, they instead discovered that aducanumab performed badly enough that the trial had to be ended prematurely.
If the story had ended there, I would have no complaints.
Biogen continued to mine the data after the trial was terminated, and subsequently submitted "more data" that they claimed overturned the interim analysis results. This action violates all three safeguards listed above. Biogen is no longer adhering to a limited number of interim analyses - in fact, it appears to be running the trial to significance. The significance requirements in Safeguard #2 are tailored to the specific analysis schedule, and so those become irrelevant when the schedule is abandoned. Finally, the FDA has also dropped Safeguard #3 because a fourth decision path has been created: stop the trial for futility tentatively, and continue to mine for any exploratory analysis that can contradict the interim analysis.
As before, the harm of this action is most readily seen if we imagine what might have happened had the trial been stopped for efficacy. Would Biogen have continued to mine the data looking for evidence that the drug in fact did not work?
We don't even have to speculate about the answer because of other recent actions by the FDA and pharmaceutical companies.
***
Exhibit #2: the coronavirus vaccine trials have adopted a modified version of the interim analysis plan. Because of various other shortcuts, the allowable decisions at the interim analysis were a) granting emergency use authorization (EUA) on efficacy and continuing the trial to its normal end date, b) not granting EUA and continuing to collect more data, or c) stopping the trial for futility. It's actually not clear to me whether c) was ever considered an option, but let's assume it was. The vaccine developers committed to continue running the trials for at least one or two years even if they successfully received EUA.
In essence, the successful vaccine trials were stopped for efficacy - although the trials themselves were not halted, because full approval had not been granted. Did the pharmas continue to mine the data so that they could volunteer negative information to invalidate the EUA? The answer is emphatically no!
On the contrary, the pharmas argued that upon granting of the EUA, it became unethical to keep people in a placebo group (i.e. unvaccinated). Therefore, within a few months, almost everyone in the placebo groups was given the vaccine. In other words, they can't mine any more data, given that the placebo group has ceased to exist.
This situation is the worst nightmare for any statistician who made the Faustian bargain of allowing early stopping through interim analysis. They were hoping that the other side was making a genuine effort to balance science with pragmatism. They were sorely mistaken. When a test is stopped for efficacy, actions are immediately taken that prevent further data mining. When a test is stopped for futility, the sponsor does not really stop the test, and continues to mine the data.
These actions make a mockery of the interim analysis bargain, which means we are back to square one - clinical trials are being run to significance, which is the mother of all false-positive findings.
***
P.S. This post is particularly relevant in light of the Citizen Petition to the FDA about potentially granting full approval to the Covid-19 vaccines based on interim results.
[6/11/2021: So far, three members of the scientific advisory board have resigned in protest of the FDA decision (link). No one on the board voted to approve the drug, but the FDA leaders overruled the board.]
I had a look for data on the aducanumab studies and found this: https://alz-journals.onlinelibrary.wiley.com/doi/epdf/10.1002/alz.12213. Judging from the results, they would be lucky if the decrease in the CDR-SB score with treatment is 25% less than with placebo. This isn't a lot, although you might be able to convert it into something like delaying the decline by 3 months every year. Given the high price of the drug, this isn't going to satisfy any reasonable cost-benefit analysis. You would also have to compare it to Aricept, which works and, being off patent, is cheap. I expect their sample size calculations were done assuming a larger clinical difference than they found, but I can't find a protocol to check. Something else I read was that the FDA decided to approve on the basis of the effect on the plaque, which, given that there is no data linking it to cognitive decline, is not sensible.
For the vaccines, there was always going to be very good real-world data. For example, in Australia we recently had a super-spreader event. Roughly, the 6 vaccinated subjects were not infected, while of the 30-something unvaccinated, 30 were infected. We've given up having birthday parties for the moment.
Posted by: Ken | 07/04/2021 at 05:17 AM