Two weeks ago, the FDA advisory committee narrowly voted to recommend authorizing the sale of Merck's Covid-19 pill (Molnupiravir or MOV). This is a rich story worthy of a number of posts. We start with the final result and work our way backwards.

***

Molnupiravir is said to be effective at reducing the risk of hospitalization or death among Covid-19-infected people at high risk of developing severe disease. The treatment program spans five days, two 800-mg pills per day. The study was set up as a randomized clinical trial with ~1,550 participants, randomly divided into two halves, one half taking MOV pills and the other taking placebo pills. The primary endpoint was the difference in event rate between the two arms, where an event is either hospitalization or death. Each participant was followed up for at least 28 days (or until an event occurred).

At the end of the study, the event rate was 10% on placebo and 7% on MOV. The difference in event rate is -3%, meaning that for every 100 people taking the pills, 3 fewer people will be hospitalized or dead, on average. For each person to benefit, about 33 must take the pills. (This number, known as NNT or number needed to treat, looks bad because 90% will not develop severe disease whether or not they take MOV.) Note also that for every 100, 7 will be hospitalized or dead despite taking MOV.

The above result has been reported as 30% efficacy, which is the relative event rate ratio (1 - 7%/10%). I prefer using differences as they don't exaggerate the impact of starting from a small baseline. This primary endpoint is not statistically significant (but it's being portrayed as significant; I will get to this point later).
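The arithmetic behind these two ways of summarizing the same result can be checked in a few lines (a minimal sketch using only the rounded rates quoted above):

```python
# Absolute vs relative summaries of the same trial result,
# using the rounded event rates quoted above (10% placebo, 7% MOV).
p_placebo = 0.10
p_mov = 0.07

risk_difference = p_placebo - p_mov          # absolute risk reduction
nnt = 1 / risk_difference                    # number needed to treat
relative_reduction = 1 - p_mov / p_placebo   # the reported "30% efficacy"

print(f"absolute risk reduction: {risk_difference:.0%}")     # 3%
print(f"number needed to treat:  {nnt:.0f}")                 # 33
print(f"relative reduction:      {relative_reduction:.0%}")  # 30%
```

The same 3-percentage-point gap looks far more impressive when expressed as a 30% relative reduction, which is the point made above.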

Molnupiravir may have very serious side effects. None of these effects have surfaced in the trial, but the trial contained so few participants that it can only detect very common effects. Only about 700 people were treated with Merck's pills. An event with a risk of 1 in 1,000 has almost a 50% chance of not registering a single case among 700 people. An event with a risk of 1 in 10,000 has over a 90% chance of not showing up in this trial.
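These probabilities follow directly from the binomial formula P(no cases) = (1 - risk)^n, assuming independent events — a quick sketch:

```python
# Probability that a rare adverse event produces zero cases among n treated
# patients, assuming events are independent: P(0 cases) = (1 - risk) ** n.
n = 700
for denom in (1_000, 10_000):
    p_zero = (1 - 1 / denom) ** n
    print(f"risk 1 in {denom:,}: P(no cases among {n}) = {p_zero:.1%}")
# → roughly 49.6% and 93.2%, matching the figures above
```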

The FDA Briefing document that was reviewed at the FDA meeting on November 30, 2021 (link) describes several potential adverse side effects, all found in earlier animal and lab studies. These include bone marrow toxicity, mutagenicity, and reproductive toxicity.

***

How this type of treatment can be put to clinical practice should have been thought through before running this experiment, but evidently, it wasn't.

Rolling out the treatment requires an accurate prediction of who is at high risk of developing severe disease. Just like we don't know what level of antibodies is high enough to prevent infection, we don't know what factors predict disease severity. During the trial, Merck went with the usual list of "risk factors" for hospitalization or death, such as older age and obesity. This just means most patients who are hospitalized with or dead from Covid-19 are older and/or obese, but that's not the same as saying most older or obese patients will develop severe Covid-19.

What we do know is that the patients present in the Merck trial definitely had unusually high risk of being hospitalized or dying. The 10% baseline event rate meant that 1 out of every 10 Covid-19 patients enrolled in the trial ended up hospitalized or dead. This rate is over 100 times higher than the unvaccinated rate of hospitalization in the U.S., according to CDC numbers cited in the FDA briefing doc.

In real life, how does one find a comparable set of patients who would have the highest chance of benefiting from these pills?

**

The FDA document also discloses a number of other headaches - all of which are direct results of the design decision to exclude certain groups from the clinical trial, namely, vaccinated people, pregnant people and immunocompromised people.

The design decision is odd because vaccinated people constitute the vast majority of Americans, and the other two groups are apparently at higher risk of developing severe disease. Instead of oversampling these groups in the clinical trial in order to measure the treatment's effects on them, they were excluded completely from the trial. Strictly speaking, there is no evidence that these pills work for anyone who has had at least one dose of Covid-19 vaccine. This means the FDA advisors were asked to take part in a guessing game without any actual data to support their opinions.

In the trial, those who were seropositive at baseline did not benefit at all from taking Merck's pills (5 events vs 2 in placebo - a nice example of why I dislike relative ratios), indicating that MOV might not work for vaccinated people who have antibodies. The FDA analysts cast doubt on this result, arguing the sample size was too small - and yet, they have no trouble saying that the pills appear to work equally well across age groups and other variables - when those analyses contain similarly small samples (e.g. 136 who were seropositive vs 118 who were > 60 years old).

Contrast statement (a): "Given the small size of the seropositive at baseline subgroup and the small number of events in this subgroup, robust conclusions regarding MOV efficacy in seropositive participants cannot be made"

with statement (b): "Efficacy results were generally consistent across subgroups including age (>60 years), at risk medical conditions (e.g., obesity, diabetes), baseline COVID-19 severity (mild, moderate) and SARS CoV-2 variants"

This is the kind of flexible interpretation of statistical significance that frustrates your statistics professor. You either believe none of the subgroup analyses should be considered because the trial was not designed to have enough samples to read them, or you believe they should all be trusted.

***

Another place in which statistical significance was mishandled was hinted at earlier. The main result is a 3% reduction in event rate, which has a 95% confidence interval of 0.1% to 5.9%. Notice that the CI apparently missed 0% by a pip - if the CI crosses 0%, the result is statistically insignificant, leading to the conclusion that the trial did not generate sufficient evidence to prove that MOV was effective at changing the event rate.

But is it barely significant or not significant? Here are two additional issues for your consideration:

What if there were one fewer hospitalization counted in the placebo group (67 instead of 68 events)? This tilts the confidence interval to the other side of zero, making the result statistically insignificant. Talk about hanging on a thread.

Now, in Table 1 (footnote) of the FDA Briefing Doc, we learn that "participants with unknown status at Day 29 are counted as having an outcome of all-cause hospitalization or death in efficacy analysis". Said in plain English: anyone whose status was unknown by Day 29 was assumed to have been hospitalized or to have died. I highlighted this line in red because this seems out of the ordinary. Usually, we assume the opposite (presumed alive vs presumed "dead").

How many people fell into the unknown-outcome bucket and were thus counted as hospitalized or dead? Exactly one, in the placebo arm. Thus, if they had made the opposite assumption - that the unknown person was not hospitalized or dead - the statistical significance flips!
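This knife-edge behavior is easy to reproduce with a rough Wald interval for the difference in event rates. The arm sizes and event counts below (48 of 709 on MOV, 68 of 699 on placebo) are assumptions on my part, chosen to be consistent with the ~1,550 enrollment, the 68 placebo events, and the rates quoted above - check the briefing document for the exact figures:

```python
from math import sqrt

def wald_ci_95(e_mov, n_mov, e_pbo, n_pbo, z=1.96):
    """Rough 95% Wald interval for the reduction in event rate (placebo - MOV)."""
    p1, p2 = e_mov / n_mov, e_pbo / n_pbo
    diff = p2 - p1
    se = sqrt(p1 * (1 - p1) / n_mov + p2 * (1 - p2) / n_pbo)
    return diff - z * se, diff + z * se

lo, hi = wald_ci_95(48, 709, 68, 699)              # counts as assumed above
print(f"68 placebo events: ({lo:.2%}, {hi:.2%})")  # lower bound just above 0
lo, hi = wald_ci_95(48, 709, 67, 699)              # one fewer placebo event
print(f"67 placebo events: ({lo:.2%}, {hi:.2%})")  # lower bound dips below 0
```

Under these assumed counts, removing a single placebo event moves the lower bound of the interval across zero - the significance verdict really does hang on one person.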

***

The above discussion of statistical significance assumes a normal analysis schedule in which analysts perform a full analysis when the entire study population has passed the full follow-up period of 28 days. That was not what happened.

In many clinical trials, researchers are allowed to peek at early results, known as interim analyses. In this Merck study, they performed an interim analysis at the half-way point, when half the participants had completed 28 days of follow-up. (This explains the two press releases from Merck about the same trial.)

A problem with peeking at interim results is that it increases the chance of false-positive readings (i.e. wrongly declaring significance). To compensate for this risk, the required level of significance must be reduced (more stringent). We encountered this issue before when looking at Moderna's vaccine clinical trial analysis plan. (Link)

Because of early peeking, the required significance level is not the standard 5% but lower than 5%. A different way of saying this is we shouldn't look at the 95% confidence interval but something higher than 95%, which means the width of the CI is larger than the reported 95% confidence interval. In other words, if the proper threshold was applied, the reported 3% reduction is not statistically significant.
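The inflation from peeking can be demonstrated with a small simulation: run many null trials (no true difference between arms), test once at the halfway point and once at the end, and count how often either look clears the unadjusted 5% bar. This is a sketch with made-up trial dimensions, not the actual alpha-spending plan Merck used:

```python
import random
from statistics import NormalDist

random.seed(1)
z_crit = NormalDist().inv_cdf(0.975)  # two-sided 5% critical value

def z_stat(events_a, events_b, n):
    """Two-sample z-statistic for equal-sized arms using a pooled proportion."""
    pa, pb = events_a / n, events_b / n
    pooled = (events_a + events_b) / (2 * n)
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    return abs(pa - pb) / se if se > 0 else 0.0

n_half, n_full, true_rate = 350, 700, 0.10  # same true event rate in both arms
trials, false_positives = 2000, 0
for _ in range(trials):
    arm_a = [random.random() < true_rate for _ in range(n_full)]
    arm_b = [random.random() < true_rate for _ in range(n_full)]
    hit_interim = z_stat(sum(arm_a[:n_half]), sum(arm_b[:n_half]), n_half) > z_crit
    hit_final = z_stat(sum(arm_a), sum(arm_b), n_full) > z_crit
    false_positives += hit_interim or hit_final

# With two unadjusted looks, the false-positive rate lands above the nominal 5%
print(f"false-positive rate with one peek: {false_positives / trials:.1%}")
```

The theoretical rate for two equally spaced looks at the unadjusted 5% level is about 8%, which is why designs such as O'Brien-Fleming demand a stricter threshold at each look.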

They managed to come up with a justification to overturn the usual statistical practice: "Formal statistical testing was not performed for the full population assessment, because statistical significance was demonstrated at the interim population assessment. The nominal 1-sided p-value was 0.0218."

This is yet another statement that drives your statistics professor mad. Firstly, this statement literally embodies the statistical fallacy of "testing to significance", which I explained here. Secondly, even though they didn't do formal testing, they issued an "informal" testing result, showing a p-value of 0.02, suggesting that the test would be significant (under the convention of p-value < 0.05, unadjusted for early peeking). Thirdly, they showed a one-sided p-value, which makes no sense in this situation; a one-sided analysis can only be justified if there is little doubt as to the direction of the outcome, and is almost never used in clinical trials. The "nominal" two-sided p-value is over 0.04, and likely over 0.05 when adjusted for early peeking.
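For a symmetric normal-theory test (the standard setup here), converting the quoted one-sided p-value to a two-sided one is just a doubling:

```python
one_sided_p = 0.0218            # the "nominal 1-sided p-value" quoted above
two_sided_p = 2 * one_sided_p   # symmetric test: double the one-sided value
print(f"two-sided p-value: {two_sided_p:.4f}")  # 0.0436, i.e. over 0.04
```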

The situation is worse when you compare what happened in the first and second halves of the Merck study:

| Analysis | MOV event rate | Placebo event rate |
| --- | --- | --- |
| First half (interim) | 7% | 14% |
| Second half | 6% | 5% |
| Full study | 7% | 10% |

There are rather disturbing signals in these numbers. The interim analysis showed an improvement of 7% over a placebo event rate of 14%. The final analysis showed the improvement to be 3% over a placebo rate of 10%. This means the Merck pills performed very badly in the second half of the participants: a 6% event rate vs a placebo event rate of 5%.
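The second-half rates can be backed out by subtracting interim counts from full-study counts. The counts below are assumptions on my part, chosen to be consistent with the percentages quoted above (roughly 28/385 MOV and 53/377 placebo at interim; 48/709 and 68/699 at the end) - the exact figures are in the FDA briefing document:

```python
# Back out second-half event rates: (full events - interim events) divided by
# (full enrollment - interim enrollment). Counts are assumed, per the lead-in.
interim = {"MOV": (28, 385), "placebo": (53, 377)}
full = {"MOV": (48, 709), "placebo": (68, 699)}

for arm in ("MOV", "placebo"):
    e_int, n_int = interim[arm]
    e_full, n_full = full[arm]
    second_half_rate = (e_full - e_int) / (n_full - n_int)
    print(f"{arm} second-half event rate: {second_half_rate:.1%}")
```

Under these assumed counts, the second half comes out to about 6.2% on MOV versus 4.7% on placebo - consistent with the 6% vs 5% described above.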

Thus, in this particular experiment, the first half of the result is contradicted by the second half. The FDA analysts endorsed Merck's argument to ignore the second half because "statistical significance was demonstrated at the interim population assessment".

If this were my experiment, I'd investigate whether there were operational issues (or enrollment biases) that caused the placebo event rate to go from 14% in the first half to 5% in the second half, a three-fold change. (By contrast, the event rate of the treatment arm only changed slightly.)

The phenomenon could just be regression to the mean, though. That's why classical statisticians don't like early-stopping strategies. If this trial had been stopped early for efficacy, leaving the second half unanalyzed, we would have overstated the effect by a factor of two.

***

I've exceeded the usual post length again, so further comments will appear in subsequent posts.
