01/19/2022 in Covid-19, Current Affairs, Data, Errors, Ethics, Health, Mass media, Medicine, Science | Permalink | Comments (0)
If you follow sports, you could not avoid the Novak Djokovic saga at the Australian Open, which is scheduled to start this week. In brief, Australia, having pursued a zero-Covid policy for most of the pandemic, only allows vaccinated visitors to enter. Djokovic, who's the world #1 male tennis player, is also a prominent anti-vaxxer. Much earlier in the pandemic, he infamously organized a tennis tournament, which had to be aborted when several players, including himself, caught Covid-19 (link). He is still unvaccinated, and yet he was allowed into Australia to play the Open. People are upset. Some players who got themselves vaccinated in order to play in the tournament are not pleased. Spectators who also must be vaccinated in order to watch the matches in person are not amused.
When the public learned that Djokovic received a special exemption, the Australian government decided to cancel his visa. Djokovic's camp, however, proceeded to fight his case in court. This then became messier and messier, as the superstar told his side of the story. His parents, his fans, and the Serbian government aggressively supported the player. [Djokovic lost in court for the second time this Sunday, and was deported and no longer could play in the tournament.]
In the midst of it all, some enterprising data journalists uncovered tantalizing clues suggesting that the story Djokovic used to obtain the exemption is full of holes. It's a great example of the sleuthing work that data analysts undertake to understand their data.
***
A central plank of the tennis player's story is that he tested positive for Covid-19 on December 16. This test result provided grounds for an exemption from vaccination, although the Australian government tightened entry requirements due to the Omicron surge. The timing of the test result was convenient, raising the question of whether it was faked. Intriguingly, Djokovic attended a children's event the day after he said he tested positive, and also gave an in-person interview to a French reporter two days after. His team maintained that the test was authentic, and offered evolving explanations and apologies for his not isolating after testing positive.
Digital breadcrumbs caught up with Djokovic. As everyone should know by now, every email receipt, every online transaction, every use of a mobile app leaves a long trail for investigators. It turns out that test results from Serbia include a QR code. A QR code is nothing but a fancy bar code; it's not an encrypted message that can only be opened by authorized people. Since Djokovic's lawyers submitted the test result in court documents, data journalists from the German newspaper Spiegel, partnering with the consultancy Zerforschung, scanned the QR code, and landed on the Serbian government's webpage that informs citizens of their test results.
The information displayed on screen was limited and not very informative. It just showed the test result was positive (or negative), and a confirmation code. What caught the journalists' eyes was that during the investigation, they scanned the QR code multiple times, and saw Djokovic's test result flip-flop. At 1 pm on January 10, the test was shown as negative (!) but about an hour later, it appeared as positive. That's the first red flag.
Since statistical sleuthing inevitably involves guesswork, we typically want multiple red flags before we sound the alarm.
The next item of interest is the confirmation code which consists of two numbers separated by a dash. The investigators were able to show that the first number is a serial number. This is an index number used by databases to keep track of the millions of test results. In many systems, this is just a running count. If it is a running count, data sleuths can learn some things from it. (This is why even so-called metadata can reveal more than you think. Djokovic may have become the latest victim.)
Djokovic's supposedly positive test result on December 16 has serial number 7371999. If someone else's test has a smaller number, we can surmise that the person took the test prior to Dec 16, 1 pm. Similarly, if someone took a test after Dec 16, 1 pm, it should have a serial number larger than 7371999. There's more. The gap between two serial numbers provides information about the duration between the two tests. Further, this type of index is hard to manipulate. If you want to fake a test in the past, there is no index number available for insertion if the count increments by one for each new test! (One can of course insert a fake test right now, before the next real test result arrives.)
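To make the logic concrete, here is a minimal sketch in Python of the inference a running counter permits. The first serial number is the one quoted above; the second record is entirely made up for illustration.

```python
from datetime import datetime

# (label, serial number, claimed test time)
# The first serial number is the one quoted above; the second record is hypothetical.
tests = [
    ("positive test", 7371999, datetime(2021, 12, 16, 13, 0)),
    ("later test",    7560000, datetime(2021, 12, 22, 17, 0)),  # made-up values
]

def compare(test_a, test_b):
    """A running counter implies: earlier test time => smaller serial number.
    The size of the gap approximates how many tests were processed nationwide
    between the two claimed test times."""
    (name_a, serial_a, time_a), (name_b, serial_b, time_b) = test_a, test_b
    consistent = (time_a < time_b) == (serial_a < serial_b)
    gap = abs(serial_b - serial_a)
    print(f"{name_a} vs {name_b}: ordering consistent? {consistent}; "
          f"about {gap:,} tests in between")

compare(tests[0], tests[1])
```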
The researchers compared the gaps in these serial numbers with the official tally of tests conducted within a time window, and felt satisfied that the first part of the confirmation code is an index that effectively counts the number of tests conducted in Serbia. Why is this important?
It turns out that Djokovic's lawyers submitted another test result to prove that he has recovered. The negative test result was supposedly conducted on December 22. What's odd is that this test result has a smaller serial number than the initial positive test result, suggesting that the first (positive) test may have come after the second (negative) test. That's red flag #2!
To get to this point, the detectives performed some delicious work. The landing page from the QR code does not actually include a time stamp, which would be a huge blocker to any of the investigation. But... digital breadcrumbs.
While human beings don't need index numbers, machines almost always do. The URL of the landing page actually contains a disguised date. For the December 22 test result, the date was shown as 1640187792. Engineers will immediately recognize this as a "Unix date". A simple decoder returns a human-readable date: December 22, 16:43:12 CET 2021. So this second test was indeed performed on the day the lawyers had presented to the court.
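Decoding such a value takes a couple of lines; here is a quick check in Python of the number quoted above (CET is UTC+1 in winter):

```python
from datetime import datetime, timezone, timedelta

unix_ts = 1640187792                    # value embedded in the landing-page URL
cet = timezone(timedelta(hours=1))      # CET in winter

print(datetime.fromtimestamp(unix_ts, tz=cet))
# 2021-12-22 16:43:12+01:00, i.e. December 22, 16:43:12 CET 2021
```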
Dates are also a type of index, which can only increment. Surprisingly, the Unix date on the earlier positive test translates to December 26, 13:21:20 CET 2021. If our interpretation of the date values is correct, then the positive test appeared 4 days after the negative test in the system. That's red flag #3.
To build confidence that they interpreted dates correctly, the investigators examined the two possible intervals: December 16 and 22 (Djokovic's lawyers), and December 22 and 26 (apparent online data). Remember the jump in serial numbers in each period should correspond to the number of tests performed during that period. It turned out that the Dec 22-26 time frame fits the data better than Dec 16-22!
***
The stuff of this project is fun - if you're into data analysis. The analysts offer quite strong evidence that there may be something smelly about the test results, and they have a working theory about how the tests were faked.
That said, statistics do not nail fraudsters. We can show plausibility or even high probability, but we cannot use statistics alone to rule out the possibility of an outlier. Typically, statistical evidence needs to be corroborated by physical evidence. That's one of the key takeaways in Chapter 5 of Numbers Rule Your World (link).
***
Some of the reaction to the Spiegel article demonstrates what happens with suggestive data that nonetheless are not infallible.
Some aspects of the story were immediately confirmed by Serbians who have taken Covid-19 tests. The first part of the confirmation number appears to change with each test, and more recent serial numbers are larger than older ones. The second part of the confirmation number, we learned, is a kind of person ID, as it does not vary between successive test results.
One part of the story did not hold up. The date found on the landing page URL does not seem to be the date of the test, but the date on which someone requests a PDF download of the result. This behavior can easily be verified by anyone who has test results in the system.
Because of this one misinterpretation, the data journalists seem to have lost a portion of readers, who now consider the entire data investigation debunked. Unfortunately, this reaction is typical. It's even natural in some circles. It's related to the use of "counterexamples" to invalidate hypotheses. Since someone found the one thing that isn't consistent with the hypothesis, the entire argument is thought to have collapsed.
However, this type of reasoning should be avoided in statistics, which is not like pure mathematics. One counterexample does not spell doom for a statistical argument. A counterexample may well be an outlier. The preponderance of evidence may still point in the same direction. Remember there were multiple red flags. Misinterpreting the dates does not invalidate the other red flags. In fact, the new interpretation of the dates cannot explain the jumbled serial numbers, which are unaffected by when PDF downloads were requested.
***
Statistical investigations can be very powerful, and have gained strength in the Big Data era due to digital breadcrumbs. Nevertheless, statistical arguments suggest plausibility or probability, never certainty. Short of a confession or whistle-blowing or leaking, those who are inclined to disbelieve can always find reasons to disbelieve. Similarly, interested investigators can easily fool themselves.
01/17/2022 in Assumptions, Behavior, Big Data, Chapter 5, Covid-19, Crime, Current Affairs, Data, Health, Science, Sports, Statisticians, Web/Tech | Permalink | Comments (2)
The Covid-19 pandemic has transformed human communications in the U.S.
Most business meetings are (still) being conducted via Zoom or a variety of similar services, in lieu of in-person meetings. Scientists have relied on preprints instead of peer-reviewed publications; increasingly, preprints are even abandoned in favor of press releases. Merck's Covid-19 pill, known as molnupiravir (MOV), is a case in point. (The Pfizer pill is another example.)
Previously, I have mentioned how press releases have been used to seed public opinion before any preprints or peer-reviewed publications are available (link). Merck just took this to the next level. When I first researched MOV, I looked for a preprint or a journal article, and I couldn't find any. This is weeks after the FDA advisory board recommended authorizing molnupiravir. This practice means independent observers are blocked from seeing any data, except those selected to support the pharma's findings.
Thus, I had to pull together information from different places: three press releases by Merck; an FDA briefing document, which is a report by FDA analysts on the second of Merck's press releases; an erratum to that briefing document, in which the FDA disclosed misprinting the value of the key efficacy metric (52% instead of 48%); and an Addendum to the briefing document, in which the FDA acknowledged, without comment, Merck's third press release containing revised data. No detailed protocol for the trial has been released, so I have to work with the incomplete and extremely abridged version uploaded to ClinicalTrials.gov.
Those documents are substantively different from a research article, as they present specific conclusions and offer only data in support of those conclusions. They raise more questions than they answer.
***
In this post, I trace how information about Merck's pill was staged in an apparent joint production with the FDA.
The FDA convened a meeting of advisors right after Thanksgiving holiday on November 30, 2021. Reference materials were released ahead of the meeting, the key document being a briefing document prepared by FDA analysts who reviewed the Merck data. This document made no reference to the final result but reported the interim results, which were almost twice as good. The prespecified primary endpoint was a difference in event rates (3% at final analysis, 7% at interim); nevertheless, they computed a relative ratio since that sounded more impressive. They then misprinted this number as 52% when it was 48%, an embarrassing mistake disclosed in a separate Erratum. Why they didn't correct the original report directly, I cannot understand. This practice is similar to the New York Times putting up a correction several days after an article went to press, tucked into a corner in the back pages, unlikely to be noticed by most readers.
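To make the distinction between the two metrics concrete, here is a back-of-the-envelope calculation in Python. The event rates are round numbers consistent with the absolute differences quoted above (about 7 points at interim, about 3 points at final analysis); they are illustrative, not the exact trial counts.

```python
def risk_metrics(placebo_rate, treated_rate):
    """Absolute difference in event rates (the prespecified primary endpoint)
    and relative risk reduction (the headline percentage)."""
    abs_diff = placebo_rate - treated_rate
    rel_reduction = abs_diff / placebo_rate
    return abs_diff, rel_reduction

# Illustrative hospitalization-or-death rates (placebo, treated)
interim = risk_metrics(0.14, 0.07)
final = risk_metrics(0.10, 0.07)

print(f"Interim: {interim[0]:.0%} absolute difference, {interim[1]:.0%} relative reduction")
print(f"Final:   {final[0]:.0%} absolute difference, {final[1]:.0%} relative reduction")
# The same drug looks far less impressive when the absolute difference shrinks,
# even though the relative number is the one that gets quoted.
```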
Meanwhile, on Black Friday (Nov 26), the day after Thanksgiving, Merck issued a third press release announcing the final outcomes. This announcement came out literally two working days before the FDA meeting.
In an "Addendum" to the main report, the FDA analysts acknowledged being informed of the final analysis on Nov 22, four days before the press release. This development should have caused a five-star alarm as their primary briefing document would now be misleading readers. Instead of revising the briefing document, and pushing back the meeting of advisors if necessary, they continued. They now added an "Addendum".
The very first paragraph of the Addendum repeats the "50%" improvement talking point.
The second paragraph acknowledges receiving new data from Merck, and refers readers to Merck's own addendum.
The third paragraph tells us how many people were in the trial.
The fourth paragraph repeats the (now meaningless) interim analysis results again!
What are they waiting for? Finally in the fifth paragraph, at the bottom of the page and continuing on the next page, they described the full analysis results.
Were the FDA analysts alarmed by the drastically reduced effect at full analysis? You would not know based on reading the Addendum. The fact that the primary endpoint value dropped from 7% to 3% apparently did not cause any concern. Here is the sixth paragraph that appeared after they stated the final results:
The Agency continues to evaluate the known and potential benefits and risks of MOV considering the results from all randomized participants. During the meeting, the Agency will provide additional key safety and efficacy results based on all 1433 randomized participants (full population). The review issues and benefit/risk assessments may therefore differ from the original assessments provided in the briefing document which was based on the interim analysis.
I reviewed the slides presented at the meeting, and I didn't find any additional information on efficacy.
This situation reminds me of the interim analysis of the Moderna vaccine (link). The FDA briefing document endorsed data that did not meet the FDA's requirement of half the participants reaching 6 months of follow-up, while acknowledging that they had received updated data that did not make it into the briefing document.
***
What is the harm of science by press releases?
Look at the following list of key information that is absent from those press releases:
***
There are several other interesting tidbits I gathered that didn't make it to the last post.
The trial defines someone as being at high risk of a severe outcome if they meet one or more of the following criteria: age 60 or above, diabetes, obesity, chronic kidney disease, serious heart conditions, chronic obstructive pulmonary disease, active cancer.
At first glance, they obviously succeeded in selecting a subset that has high chance of severe disease as the event rate (on placebo) was 10%. On the other hand, this group does not seem as high-risk as advertised, based on the very limited information that was published.
Surprisingly, only 14% of the study population was 60 years old or above, and only 3% above 75. About 14% have diabetes. Think about that for a moment. If everyone above 60 has diabetes, then no one under 60 in the trial has diabetes. If no one above 60 has diabetes, then about 16% of the under-60s have diabetes (14 percentage points spread over the 86% who are under 60).
I find it odd that only 14% were over 60. I'd think that many of the other serious conditions are correlated with age so if you pick a random cancer patient or a random person with serious heart conditions, the person is more likely to be older than younger. Thus, I'm imagining that the trial enrolled older but healthy people, and younger people with more comorbidities. We can't be sure since they didn't disclose any details.
Also, it appears that the most at-risk are excluded from the trial. According to the abridged protocol on ClinicalTrials.gov, they excluded anyone who "is on dialysis or has reduced estimated glomerular filtration rate (eGFR) <30 mL/min/1.73m^2 by the Modification of Diet in Renal Disease (MDRD) equation." That would mean someone with chronic kidney disease can only participate in the trial if they are not on dialysis. The protocol lists many other exclusions.
Merck counts hospitalizations and deaths from all causes. This practice differs from what was used in the vaccine trials, in which each case was adjudicated as to whether it was related to Covid-19. We don't have the Merck protocol, so we don't know if any adjudication was used.
The study population was almost entirely recruited outside North America. There were only 18 participants from North America, and 40 from Western Europe. Most of the participants came from Latin America or Russia.
If you are running A/B tests in industry, you're used to this scenario: two days (or indeed, two hours) after the test starts running, the test sponsor - someone who suggested testing that variant of the marketing copy because they believe strongly that the new version is the miracle cure they have always dreamt of - points to a huge spike in performance on the real-time tracker, and urges that the test be stopped early because the new version is obviously better, and we should pocket our gains immediately rather than conducting "academic research" for the sake of "scientific purity".
For those with any statistical training, alarm bells should be ringing loudly because such decision-making is the mother of all false positives. The key to understanding why is to imagine what would have happened if the first few hours of performance had gone in the opposite, undesirable direction. The test sponsor would have been very quiet. (The sponsor could have knocked on your door and said we should stop the test early and declare the new version an instant dud. I have not gotten that knock on the door once after having run hundreds of tests. It's just human nature.)
Worse, what really happens is that the test sponsor remains silent until the test version displays a "significant" positive gap relative to the control version. This fallacy is technically called "running a test to significance". It may happen in two hours, two days, five days, whenever. The only time the test runs to its designed end date is when the gap between the test and control versions never clears the bar.
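A small simulation illustrates why this is the mother of false positives. In the sketch below (made-up parameters, not any particular tracker), both versions convert at exactly the same rate, yet a sponsor who checks for significance after every batch of visitors and stops at the first "significant" gap declares victory far more often than the nominal 5%.

```python
import numpy as np

rng = np.random.default_rng(42)

def peeking_test(n_looks=20, batch_size=500, true_rate=0.05, z_crit=1.96):
    """Simulate an A/B test with NO real difference between versions.
    Return True if any interim look shows a 'significant' gap (a false positive)."""
    a_conv = b_conv = n = 0
    for _ in range(n_looks):
        a_conv += rng.binomial(batch_size, true_rate)
        b_conv += rng.binomial(batch_size, true_rate)
        n += batch_size
        pa, pb = a_conv / n, b_conv / n
        pooled = (a_conv + b_conv) / (2 * n)
        se = np.sqrt(pooled * (1 - pooled) * 2 / n)
        if se > 0 and abs(pa - pb) / se > z_crit:
            return True      # sponsor stops the test and "pockets the gains"
    return False

n_sims = 2000
false_positive_rate = sum(peeking_test() for _ in range(n_sims)) / n_sims
print(f"False-positive rate with 20 peeks: {false_positive_rate:.0%}")
# Typically well above the nominal 5% when peeking after every batch
```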
***
A few years ago, I wrote a piece for FiveThirtyEight, mischievously suggesting that baseball fans should leave the ballpark once the score gap is larger than a certain number of runs. It appears that most baseball fans - presumably including many business managers - think such behavior is blasphemous. Well, it's also blasphemy when they demand an early stop to an A/B test because the favored version of the marketing copy has a "substantial" lead!
It's also the same behavior when a pharmaceutical executive goes to the FDA and pushes to end a clinical trial early, which is the real topic of today's post.
***
The FDA has made a bargain with the pharmaceutical companies to allow "interim analyses," the point of which is to enable early stopping of clinical trials under clearly defined circumstances. Some safeguards have been imposed on such analyses, which as recent events have proven, are easily gamed.
Safeguard #1 is a limit on the number of interim analyses one can conduct during the course of the trial. Each analysis contributes a false-positive probability, and the more times one "peeks" at the outcomes, the higher the chance of a spurious conclusion.
Safeguard #2 is to raise the required standard of evidence for earlier analyses. This is a simple application of statistics: in an interim analysis, the sample size is much smaller, so the error bar around any outcome is much wider (precision improves only with the square root of the sample size), and thus the required gap between test and control must be larger. Instead of a 5% significance level, the required level in interim analyses could well be 0.05%. This is useful but does carry risks. One way the gap clears the higher bar is if something strange happens by the time of the first analysis (i.e. an outlier event that will not repeat). Analogously, a baseball team might score 10 runs in the first inning, clearly enough for me to leave the ballpark, but should one expect this team to score 10 runs in the first inning of its next games?
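The reason the bar must be higher at an interim look can be seen from how the error bar scales with sample size. A rough sketch, assuming a simple comparison of event rates between two equal arms:

```python
import numpy as np

def se_of_rate_difference(n_per_arm, event_rate=0.10):
    """Standard error of the difference between two observed event rates,
    assuming both arms have the same size and the same true rate."""
    return np.sqrt(2 * event_rate * (1 - event_rate) / n_per_arm)

full_size = 20000   # hypothetical full-trial arm size
for fraction in (1.00, 0.50, 0.25, 0.10):
    n = int(full_size * fraction)
    print(f"{fraction:4.0%} of the data: standard error = {se_of_rate_difference(n):.4f}")
# Quartering the sample only doubles the error bar (square-root scaling),
# but a doubled error bar demands a much larger observed gap to clear the
# stricter interim significance threshold.
```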
Safeguard #3 is to pre-specify three possible decisions during the interim analysis: a) stop the trial for efficacy b) continue to collect more data and c) stop the trial for futility. This is a sensible requirement.
***
Sadly, recent events have shown that statisticians have made a Faustian bargain. The other side has gamed the rules, rendering these safeguards toothless.
Exhibit #1: the recent approval of "aducanumab", a "treatment" for Alzheimer's disease by Biogen. I have not followed the entire saga, but according to this article in StatNews (link), the FDA has made many concessions throughout the process, including waiving Phase 2 trials, and allowing interim analyses of Phase 3 trials. During the interim analysis, the trials were stopped for futility. In other words, while the investigators hoped that at early read, the drug would prove so spectacularly successful that they could stop the trial and declare victory - they instead discovered that aducanumab performed badly enough that the trial had to be ended prematurely.
If the story ended there, I would have no complaints.
Biogen continued to mine the data after the trial was terminated, and subsequently, submitted "more data" that they claimed overturned the interim analysis results. This action violates all three safeguards listed above. Biogen is no longer adhering to a limited number of interim analyses - in fact, it appeared to be running the trial to significance. The significance requirements in Safeguard #2 are tailored to the specific analysis schedule, and so those become irrelevant when the schedule is abandoned. Finally, the FDA has also dropped Safeguard #3 because a fourth decision path has been created: stop the trial for futility tentatively and continue to mine for any exploratory analysis that can contradict the interim analysis.
As before, the harm of this action is most readily seen if we imagine what might have happened if the trial had been stopped for efficacy. Would Biogen have continued to mine the data looking for evidence that the drug in fact did not work?
We don't even have to speculate about the answer because of other recent actions by the FDA and pharmaceutical companies.
***
Exhibit #2: the coronavirus vaccine trials have adopted a modified version of the interim analysis plan. Because of various other shortcuts, the allowable decisions at the interim analysis were a) granting emergency use authorization on efficacy and continuing the trial to its normal end date b) not granting EUA and continuing to collect more data c) stopping the trial for futility. It's actually not clear to me whether c) was ever considered an option but let's assume it was. The vaccine developers committed to continue to run the trials to at least one or two years even if they successfully received EUA.
In essence, the successful vaccine trials were stopped for efficacy - although the trials themselves were not halted, since full approval had not been granted. Did the pharmas continue to mine the data so they could volunteer negative information to invalidate the EUA? The answer is emphatically no!
On the contrary, the pharmas argued that upon granting of the EUA, it has become unethical to keep people in a placebo group (i.e. unvaccinated). Therefore, within a few months, almost everyone in the placebo groups has been given the vaccine. In other words, they can't be mining any more data given that the placebo group has ceased to exist.
This situation is the worst nightmare for any statistician who made the Faustian bargain of allowing early stopping through interim analysis. They were hoping that the other side was making a genuine effort to balance science with pragmatism. They were sorely mistaken. When the test is stopped for efficacy, actions are immediately taken that prevent further data mining. When the test is stopped for futility, the sponsor does not really stop the test, and continues to mine the data.
These actions make a mockery of the interim analysis bargain, which means we are back to square one - clinical trials are being run to significance, which is the mother of all false-positive findings.
***
P.S. This post is particularly relevant in light of the Citizen Petition to the FDA about potentially granting full approval to the Covid-19 vaccines based on interim results.
[6/11/2021: So far, three members of the scientific advisory board have resigned in protest of the FDA decision (link). No one on the board voted to approve the drug but the FDA leaders overruled the board.]
06/10/2021 in Assumptions, Behavior, Business, Covid-19, Current Affairs, Data, Errors, Ethics, False positive, Health, Medicine, Science, Sports, Statisticians, Tests | Permalink | Comments (1)
The scientific report on Pfizer's adolescent trial did not bring closure on the two key issues I flagged in my previous post (link). Here are my notes on this trial.
The Key Efficacy Finding Does Not Require a Randomized Controlled Trial
The researchers explained that based on the low expected Covid-19 case rates among 2,000 teenagers (split into two groups), they were not going to see enough cases to offer the same kind of analysis as in the adult trial. They have convinced the FDA (or the other way round) to accept immunogenicity findings as a substitute. That is to say, Pfizer will show only that injecting teenagers with the vaccine generates antibodies intended to fight the SARS-CoV-2 virus.
The sample size issue is real. The infection rate in teenagers is well known to be much lower than in adults. It's hard to find national numbers split by age group, though. I found that in California, roughly 4% of children below 17 have tested positive for Covid-19 during the whole pandemic while the number was close to 10% for people above 18. These numbers are cumulative over the 12-15 months of the pandemic, whereas a typical vaccine trial spans 3-4 months at the time of interim analysis. Detecting a difference in tiny signals (rare events) is challenging and requires a large sample size. Two thousand is roughly 20 times fewer participants than the adult trial.
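To see how starved of cases such a trial would be, here is a rough sketch. The 0.9% attack rate is the adult-trial placebo rate cited later in this post; the assumption that teenagers get infected at half that rate is mine.

```python
from math import sqrt

def expected_case_range(n_per_arm, attack_rate):
    """Expected case count in one arm, with a rough +/- 2 standard deviation range."""
    mean = n_per_arm * attack_rate
    spread = 2 * sqrt(mean)
    return mean, max(0.0, mean - spread), mean + spread

n_teens_per_arm = 1000                 # roughly half of the ~2,000 enrolled
for rate in (0.009, 0.0045):           # adult placebo rate, and half of it
    mean, low, high = expected_case_range(n_teens_per_arm, rate)
    print(f"attack rate {rate:.2%}: expect ~{mean:.0f} cases (roughly {low:.0f} to {high:.0f})")
# Single-digit case counts in the placebo arm (and near zero in the vaccine arm)
# would leave any VE estimate with an enormous margin of error.
```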
Typically, immunogenicity results advance a treatment to Phase 3. These lab tests prove that the vaccine produces a good amount of antibodies as intended. However, having antibodies does not guarantee the vaccinated people will be protected from infection. Nor do lab results suffice to decide what level of antibodies is sufficient to reduce infection (and transmission). Further, Phase 2 trials involve few people. Thus, a Phase 3 trial is required for FDA approval. Phase 3 trials involve thousands of people, and measure clinically relevant outcomes, which for a vaccine usually means reduction of infection.
With immunogenicity as the key efficacy metric, the adolescent trial is in spirit an expanded Phase 2 trial. You don't even need to have a placebo group, nor randomize treatment. Injecting someone with saline will not produce antibodies against the coronavirus. They also measured antibodies in only 200 kids, a 20%-ish subset of those enrolled in the adolescent trial. They could have obtained the same outcome by recruiting 200 kids, giving them two shots, and measuring their antibody levels afterwards. No RCT is necessary.
The average antibody level of the 12-to-15-year-olds is compared to a number they previously obtained from the adult trial, involving those in the 16-to-25-year-old age group. If similar levels of antibodies are found, they presume that the vaccine will work in teenagers.
There are several assumptions required to justify such a claim. First, they assume the vaccine works with the same efficacy for 16-to-25-year-olds in the adult trial as the overall VE. This is not a given since the trial did not contain enough people in that (or any other) age group to produce a reliable estimate of subgroup efficacy. Second, they assume there are no material differences between the two age groups that affect immune response. Third, they assume that a direct link between antibody levels and reduction in case rate has been established in the 16-to-25-year-old age group. If that is true, I haven't seen the evidence.
In the protocol, the scientists were supposed to provide an "immunobridging" analysis, which sounds like a mathematical model that justifies the third assumption. They ultimately did not come through with that model.
The Trial Provides Minimal Data on Safety
While the placebo arm is redundant for the pre-specified efficacy analysis, the placebo participants are useful to establish a background level of side effects, to help the vaccine developer make the case that the vaccine does not cause more adverse incidents than expected.
Nevertheless, the conclusion one can draw from a study of 1,000 vaccinated participants is highly limited. If a deadly side effect has a 1 in 5,000 chance of happening, there is over 80% chance that there will be zero cases of this side effect during this trial. With probably ~15 million Americans aged 12-15, such a side effect would affect 3,000 kids. Fewer than 200 American teens have died from Covid-19.
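The "over 80%" figure is a one-line calculation, assuming the side effect strikes participants independently:

```python
# Probability of seeing zero cases of a 1-in-5,000 side effect
# among 1,000 independently affected vaccinated participants
p_side_effect = 1 / 5000
n_vaccinated = 1000

p_zero_cases = (1 - p_side_effect) ** n_vaccinated
print(f"Chance the trial sees no such case: {p_zero_cases:.0%}")   # about 82%
```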
The trial did not identify any major concerns but what we don't know, we don't know.
What about that 100% VE talking point?
On Twitter and cable TV, the talking heads have mostly not presented the trial results as I have above. The typical take goes as follows.
Notice that they ignored the immunogenicity results and trumpeted a vaccine efficacy (VE) number. In the study, Pfizer reported a "surprise," saying that they found 16 cases of Covid-19 among the placebo group, and zero cases in the vaccine arm.
In the NEJM paper, on the other hand, the immunogenicity outcome was unmistakably featured as the primary endpoint. This raises a critical scientific issue that is perhaps hard for an outsider to grasp.
All clinical trials are required to submit protocols and analysis plans well ahead of enrollment. This process is called pre-registration (investigators are not supposed to make edits, especially major edits, while the trials are running, although this rule has become a casualty of the pandemic). Pre-specifying your analysis prevents data mining for results after the data have materialized. Such data mining frequently results in spurious findings - mistaking a one-off phenomenon for a generalizable outcome. Spurious findings will not replicate if the trials were to be repeated.
In the adolescent trial, pre-registration made Pfizer think about how to demonstrate efficacy. They feared that the usual measurement based on symptomatic, PCR-positive Covid-19 cases was likely to leave the study without a conclusion because they were only willing to run a trial with 2,000 participants. Therefore, they chose the immunogenicity outcome as the primary endpoint.
Now, after the trial finished, the Pfizer press release - and its readers - trumpeted the 100% VE metric as if that were the pre-specified analysis. This behavior subverts the principle of pre-registration.
So let me now explain why there is a good chance the observed VE is an outlier. If the whole trial were repeated, I'd be surprised to see the same result repeated.
According to the FDA's press release on approving Pfizer's vaccine for teenagers, the case rate in the placebo arm was 16/978 = 1.6%. In Pfizer's adult trial, the corresponding rate was 162/18325 = 0.9%. So, the teenagers were getting sick at a rate 80 percent higher than that of adults. If I assume that the kids were equally likely to get infected as adults at 0.9%, the chance that we would find 16 or more cases in 978 kids is less than 2 percent. If we were to run this trial 100 times, 98 or more of them would observe fewer than 16 cases in the placebo group.
But teenagers have less than half the case rate of adults as mentioned above. If I assume that the case rate for kids is at 0.45%, the chance that we would find 16 or more cases in 978 kids is 0.001%. Don't forget, also, that the case rate only counts adjudicated, lab-confirmed, symptomatic Covid-19 cases occurring at least 7 days after the 2nd dose so the real-life infection rate is even higher than the 1.6%.
This is very strong evidence that what was observed during this trial was an outlier event.
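The tail probabilities cited above are easy to check. A sketch using scipy, with the case count and attack rates quoted in this post:

```python
from scipy.stats import binom

n_placebo_teens = 978
observed_cases = 16

# Probability of seeing 16 or more cases if teens were infected at the
# adult-trial placebo rate (0.9%), or at half that rate (0.45%)
for rate in (0.009, 0.0045):
    p_tail = binom.sf(observed_cases - 1, n_placebo_teens, rate)
    print(f"attack rate {rate:.2%}: P(16 or more cases) = {p_tail:.5f}")
# Under 2% at the adult rate, and a tiny fraction of a percent at half the
# adult rate -- consistent with the figures cited above.
```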
Nevertheless, this outlier event has been painted as a "pleasant surprise". We didn't think we'd have enough cases to show anything but how could we refuse the unexpected gift of 100% efficacy? The real surprise was not the VE but the inexplicable, highly unlikely incidence of Covid-19 among teenagers enrolled in this specific trial.
In fact, this is exactly the reason why pre-registration is good practice. If we allowed this type of post-hoc data mining, spurious findings would bloom.
***
See also my previous rant about the way the paper was written up as if the adolescent and 16-to-25-year-old age group were originally planned as two arms of a trial.
See also my post on the Law of Small Numbers, one of the seminal contributions by Tversky and Kahneman. The Law of Small Numbers is the false belief that the Law of Large Numbers applies to one's small sample.
06/08/2021 in Assumptions, Cause-effect, Controls, Covid-19, Current Affairs, Decision-making, Errors, Ethics, False positive, Health, Mass media, Medicine, Models, Politics, Science, Statisticians, Tests, Variability | Permalink | Comments (0)
Last Friday, the U.S. ended the pause of the Johnson & Johnson vaccine, which may be linked to rare blood clots in women under 50. As with the EMA's stance on the Astrazeneca vaccine, they recommend continuing to administer these vaccines based on a cost-benefit analysis.
In today's post, I summarize the key information in the documents released during Friday's meeting of experts about those blood clots. The media have failed to cover the statistical reasoning behind the decision.
In its presentation, J&J highlighted the number that is most beneficial to its commercial interest - the risk of rare blood clots in the general population regardless of gender and age. The media is also fixated on this statistic, which is the least useful of all the numbers in these documents. Out of 8 million people who have been given the J&J shot since March (remember that this vaccine only got approved at the end of February), they have reported 15 cases of rare blood clots coupled with low platelet counts. This means the risk is 2 cases per million.
Measuring the risk of this side effect is a great example of why no statistic is purely objective. The subjectivity lurks behind how the risk is defined. Risk is a ratio of two numbers: the number of cases, and the relevant population at risk. The 2 cases per million number bakes in two choices: on the denominator, J&J selected the entire population regardless of age and gender; on the numerator, J&J selected the co-occurrence of rare blood clots and low platelet counts. Those are not the choices I expected.
The denominator should be women only, or younger women only, given that all 15 cases occurred in women under 60 years old. The choice of the denominator is not immaterial! The CDC ultimately used women aged 18-49 as the relevant population, and this change of denominator bumped the risk from 2 up to 7 per million. (Including men roughly doubles the denominator while adding zero cases to the numerator.)
The cutoff age of 50 is also a choice. Two cases were found in women between 50 and 60 years old. By setting the at-risk population as women under 50, those two cases are removed from the numerator. Each additional case increases the risk by roughly 0.5 per million if the denominator is not changed appreciably.
The CDC also looked at smaller subgroups. Women in their thirties accounted for the most cases, and the risk in that age group was 12 per million (7 cases among 600K vaccinations). Meanwhile, 1.4 million women aged 50-64 took the J&J vaccine, with two reported cases of blood clots and low platelets, leading to a risk of < 2 per million. Scientists can choose the most restrictive analysis by focusing only on women in their thirties, or the most expansive analysis, on all women. The CDC's choice of 50 is reasonable while not the only option.
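The sensitivity of the headline risk to these choices is easy to reproduce. In the sketch below, the case counts come from this post; the population sizes are approximate, and the 1.9 million figure for women aged 18-49 is my back-calculation, not an official number.

```python
def risk_per_million(cases, population):
    return cases / population * 1e6

# (cases, approximate number vaccinated) -- populations are rough figures
scenarios = {
    "All J&J recipients":    (15, 8_000_000),
    "Women 18-49 (approx.)": (13, 1_900_000),   # assumed denominator
    "Women 30-39 (approx.)": (7,    600_000),
    "Women 50-64":           (2,  1_400_000),
}

for label, (cases, population) in scenarios.items():
    print(f"{label}: {risk_per_million(cases, population):.1f} per million")
# Same data, same side effect: the reported risk ranges from under 2 to about 12
# per million depending on which numerator and denominator are chosen.
```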
The numerator is also unusual, and different from what the EMA used when examining the Astrazeneca vaccine. J&J and the CDC decided to define the side effect as rare blood clots coupled with low platelet count. This is a more restrictive case definition, which reduces the number of cases under study. Interestingly, this definition neatly draws a line between the adenovirus platform and the mRNA platform (Pfizer & Moderna), as so far, the mRNA vaccines have seen some cases of blood clots but zero cases of blood clots plus low platelets.
***
The next issue is how bad that risk is. For this, we have to compare it to other risks.
One obvious comparison is the baseline rate of blood clots and low platelet count. This is where the choice of numerator presents a difficulty. The baseline risk is essentially zero: that combination of conditions almost never happens. One of the CDC documents compared the observed risk to the risk of blood clots in women 20-50 years, which is invalid. The risk of blood clots plus low platelets should be quite a bit lower than the risk of blood clots (with or without low platelets). In a different CDC document, the risk of blood clots and low platelets is estimated to be under 1 case per million. In the J&J document, they claim the baseline risk is 0.1 per million. Neither of these numbers can be directly compared to the 7 per million observed risk because the baseline risk is per million Americans, not per million American females aged 18-50.
Nevertheless, it is abundantly clear that the risk of getting blood clots and low platelet count among women under 50 who took the J&J shot is much higher than the baseline risk. The risk of dying from this side effect is also much higher than the baseline mortality risk. (About 25-30% of the known cases have already died.) On an absolute scale, the condition is very rare. On a relative scale, it is concerning.
All 15 cases were reported between 6 and 15 days after the vaccination. Notably, this is the period during which scientists insist the vaccine has zero benefits.
***
Another relevant comparison is the risk of death from Covid-19. This is a key input to the cost-benefit analysis. How should we compute this risk? We should focus on March and April, the two months in which the J&J vaccine was administered.
This CDC slide indicates that among females aged 18-50, the Covid-19 death rate was around 0.3 per 100,000, or 3 per million. This means the chance of dying from Covid-19 is about the same as from dying from blood clots+low platelets after taking the J&J vaccine.
Yes, that's right. It's also confirmed in the CDC simulation model.
So, what does the CDC mean when it says the benefits outweigh the costs? The simulation model finds that slowing down the vaccinations causes a lot of excess Covid-19 deaths among men, and among females over 50, and therefore, across all age groups and genders, stopping J&J causes more harm than good.
***
Two other small details caught my attention. There is another fatal case of blood clots and low platelets that was excluded - seemingly because the situation is too complex. There are also other cases being investigated including cases affecting men.
When the investigation first started, there were 6 reported cases. About 10 days later, the number of cases more than doubled to 15. The vaccine didn't suddenly become more risky. This phenomenon is caused by self-reporting of cases. Such self-reporting has always been a weakness of how we monitor side effects. The risk may be under-estimated because mild cases go unreported.
04/26/2021 in Assumptions, Bias, Cause-effect, Covid-19, Current Affairs, Data, Decision-making, Health, Medicine, Science, Statisticians | Permalink | Comments (3)
Data analyses are flooding the airwaves. They paint a confusing, contradictory picture of all aspects of the pandemic. Sadly, most of the work is hastily put together and of low quality. Here are some things to keep in mind when you look at these studies.
What happened in an experiment will not happen again
If a vaccine trial returns a 90% vaccine efficacy, we can say for sure the vaccine’s efficacy is precisely 90% for the 30,000 or so participants in that specific vaccine trial. We are not empowered to conclude that the VE for billions of people who will eventually get the shots will be 90% - no matter how many epidemiologists repeat this falsehood on TV.
That’s because the trial involved a small sample of people, and there is a margin of error around the 90% number. If the range estimate is 70% to 98%, then we say “with 95% confidence” that the VE for the general population is above 70%. That’s still an impressive number, and it has the advantage of respecting scientific principles.
Imagine repeating the same clinical trial with different sets of 30,000 participants. The margin of error tells us that the VE estimates from these repeated trials will scatter - roughly speaking, across a range like 70% to 98% - and the chance of any of these trials repeating the exact 90% VE is practically zero.
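A small simulation makes the point. The sketch below assumes a true VE of 90%, arms of 15,000 each, and a 1% placebo attack rate (round numbers, not any specific trial):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_trial(n_per_arm=15000, placebo_rate=0.01, true_ve=0.90):
    """Simulate one trial and return its estimated vaccine efficacy."""
    placebo_cases = rng.binomial(n_per_arm, placebo_rate)
    vaccine_cases = rng.binomial(n_per_arm, placebo_rate * (1 - true_ve))
    return 1 - vaccine_cases / placebo_cases   # equal arms: ratio of case counts

estimates = np.array([simulate_trial() for _ in range(5000)])
low, high = np.percentile(estimates, [2.5, 97.5])
exact_hits = np.mean(np.isclose(estimates, 0.90, atol=0.0005))

print(f"Middle 95% of VE estimates: {low:.1%} to {high:.1%}")
print(f"Share of trials landing on 90.0% again: {exact_hits:.1%}")
# The estimates scatter around the true value; reproducing the original
# point estimate exactly is the exception, not the rule.
```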
The randomized clinical trial (RCT) is the gold standard for establishing cause and effect. The lack of randomization in “real-world” studies opens a can of worms.
In a vaccine trial, we compare the case rates of those who are vaccinated to those who aren’t, which are the primary ingredients of the vaccine efficacy formula. We are currently being served weekly new studies that also compare vaccinated people to unvaccinated people. The media are telling us that these new studies are better because (a) they are more recent (b) they have much larger sample sizes and (c) they constitute “real-world” evidence. A popular theme is that these studies fill in the gaps left by the vaccine trials.
The noise you’re hearing are the chuckles from incredulous statisticians. Since RCTs are regarded as the gold standard for causal inference, no “real world” evidence should be modifying, and certainly not correcting, findings from a scientific experiment. Doing so is like hiring a C student to tutor an A student, and siding with the C student when they disagree.
There is one critically important property missing from all real-world studies: the randomization of treatment. (See, however, so-called natural experiments.) In an RCT, a coin flip determines who gets the vaccine but in the real world, who gets the shots is anything but random. Most countries have priority lists and specific types of people are getting the vaccines earlier than others. In any “real world” study, the unvaccinated group is different from the vaccinated group, not just by vaccination status but also by many other factors, both known and unknown. Any of those other factors could contribute – majorly or minorly – to the observed difference between the two groups. Randomizing treatment ensures that on average, these other factors will not bias the finding in an RCT, a condition that does not exist in real-world studies.
A partial solution is to define a better control group. We’d take a subset of the unvaccinated people who look like those who are vaccinated. This is an effort to manufacture artificially the randomization condition. This matching process is highly subjective, and we can only match people using measurable and influential factors. There is no law that dictates that every important factor is measurable. As an illustrative example, perhaps people with higher-speed Internet connections are more likely to land vaccination slots. Since medical researchers do not have individual data on people’s Internet speeds, we cannot control for this effect. Even the most obvious adjustments have problems. For example, young people can be excluded because they don’t qualify for vaccination yet but young people who are front-line workers are getting shots.
Another common issue is incurable imbalance: if all care home residents have been vaccinated, we won't find care home residents in the unvaccinated group. The biggest problem with matching studies is lack of transparency. The typical disclosure is vague and confusing – I can’t even figure out what they did. None of these studies publish their data or code.
So, a real-world study that corrects for selection bias is better than one that doesn’t but in no case is an observational study superior to an RCT. Larger sample sizes produce more precise estimates (with lower margins of error) but they do not auto-correct biases. More precise estimates derived from biased data pose even greater risk precisely because they feel more solid when the biases are ignored.
The randomized clinical trial is a victim of abuse.
In reviewing the torrent of studies that have come out of the vaccine trials, I can’t help but notice that analysts are mercilessly abusing the RCT framework. I’m going to echo Andrew Gelman’s criticism of the standards of research in psychology. We’re concerned about the high probability of false-positive findings.
The VE numbers from Pfizer, Moderna, Astrazeneca, etc. are not comparable to each other. That’s because each team measures case rates differently. Pfizer drops all cases prior to 7 days after the second shot; Moderna and Astrazeneca drop cases prior to 14 days after the second shot. Then, you have people reanalyzing the data, who use their own case-counting windows. The U.K. government published a post-hoc analysis of the Pfizer data, counting only cases between 15 and 28 days after the first dose; this is echoed in a Canadian re-analysis of the same data, in which they set the window at 14 to 20 days. A more recent “real-world” study out of Scotland counted from 28 to 34 days after the first dose. This is no small matter, as VE is an improvement metric relative to a baseline, and we can’t keep straight which baseline applies to which number.
Many scientists justified the choice of when they count cases, on record no less, by pointing to the cumulative case curve, arguing for counting when the vaccine’s curve starts flattening because that’s exactly how long it takes for the vaccine to do its magic. This type of post-hoc thinking is what teachers warn students not to do. These clinical trials were sized to measure the endpoint, and not sized to identify a turning point on a case curve. Imagine repeating this trial many times over. Assuming that the VE after 90 days rises to roughly 90 percent in each trial, it is likely that the turning point on the curve is not precisely at day 14 (or whenever it was in the first trial).
Analysts appear to have looked at their individual datasets, identified the period in which the vaccine showed the best performance against the placebo, defined their VE metric to home in on that time window, and justified the definition with some sciency blather. This is a recipe for over-estimating the true performance.
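This over-estimation can be demonstrated with a simulation. In the sketch below, the true VE is held constant at 70%, so the vaccine offers the same protection on every day; an analyst who is allowed to pick the case-counting start day that makes the vaccine look best will still report a higher number (all parameters are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

def best_window_ve(true_ve=0.70, trial_days=60, daily_placebo_cases=2.0):
    """Simulate daily case counts in two equal arms under a constant true VE,
    then return the VE computed from the most flattering start day."""
    placebo = rng.poisson(daily_placebo_cases, size=trial_days)
    vaccine = rng.poisson(daily_placebo_cases * (1 - true_ve), size=trial_days)
    candidates = []
    for start in range(30):                 # candidate case-counting start days
        p, v = placebo[start:].sum(), vaccine[start:].sum()
        if p > 0:
            candidates.append(1 - v / p)
    return max(candidates)

best_ves = np.array([best_window_ve() for _ in range(2000)])
print(f"True VE: 70%.  Average best-window VE: {best_ves.mean():.0%}")
# Cherry-picking the start of the counting window systematically inflates
# the estimate above the true 70%.
```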
The proof is the absence of new studies that go through similar post-hoc thinking to adjust down the findings of the respective RCT. In fact, the re-analyses of the Pfizer data, which moved the estimated VE of the first dose from around 50% to 90%, are a nice example of a C student correcting an A student.
To prevent the danger of post-hoc theorizing, scientists agree to pre-register their studies so they have to define how they would count cases ahead of time. Unfortunately, those protocols are written loosely to allow a wide range of options, such as 7 days, 14 days, 21 days, etc. as possible evaluation metrics. This perhaps exposes how little investigators know about the behavior of their own inventions.
If only the start of the case-counting window were the only “researcher degree of freedom”. The researchers also follow their noses about (a) the duration of the case-counting window (b) whether a positive test is required to confirm cases (c) which test is allowed for confirmation and the parameters for running the test (d) the maximum length of time between reporting symptoms and testing positive (e) what symptoms are included and excluded as qualifying (f) how many symptoms are qualifying. As I showed in this post, VE drops fast with just a few more cases on the vaccine arm, so any of these decisions matters.
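To see how fast, take case counts in line with the published adult Pfizer trial figures (162 placebo cases, 8 vaccine cases, roughly equal arms) and add a few cases to the vaccine arm:

```python
def vaccine_efficacy(vaccine_cases, placebo_cases):
    """VE with (roughly) equal-sized arms: 1 minus the ratio of case counts."""
    return 1 - vaccine_cases / placebo_cases

placebo_cases = 162    # placebo-arm count reported for the adult Pfizer trial
for vaccine_cases in (8, 12, 16, 24, 32):
    print(f"{vaccine_cases:>2} vaccine-arm cases -> VE = "
          f"{vaccine_efficacy(vaccine_cases, placebo_cases):.1%}")
# 8 cases gives about 95%; 16 cases drops VE to about 90%; 32 cases, about 80%.
# Shifting a handful of cases in or out of the counting window moves the headline.
```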
The RCT framework does not protect us against these potential abuses. RCTs are the gold standard, but they aren’t fool-proof.
Story time after RCT
The potential problems outlined in the previous section relate mostly to trial design decisions. The vaccine RCTs are also being suffocated by post-hoc hacking. I already devoted an entire post to this phenomenon of “story time”. Not all results obtained from analyzing data from RCTs have the crucial randomization property. That’s because analysts can destroy the property by cherry-picking the data. We’re lulled into thinking these lesser findings have the same status as a proper RCT finding.
The most infamous example of such analysis is the 90% VE attributed to the so-called low-dose, standard-dose subgroup in the Astrazeneca-Oxford trial (link). This VE is indeed based on comparing case rates of a vaccine arm to that of a placebo arm. In a true RCT, those two arms are statistically the same except for what was placed in the two shots. Nevertheless, the low-dose subgroup happened as an accident, and it subsequently emerged that this subgroup contained only younger participants, had earlier enrollment dates, had longer intervals between doses, and so on. And yet, on the front page of the Lancet paper summarizing this trial, the investigators stated “In participants who received two standard doses, vaccine efficacy was 62.1%... and in participants who received a low dose followed by a standard dose, efficacy was 90.0%...” Up to this point in the paper, there had been no disclosure about the “accident,” and anyone reading this sentence is likely to assume that there were two randomized dosage schedules used in this trial.
Likewise, much ink has been spilled on dose intervals. Many analysts have compared subgroups that took their second shots later and earlier. Even though the underlying data were collected in an RCT, dose intervals were not randomized. And yet, these results were published and publicized using methods for analyzing RCTs and not methods for analyzing observational studies. In some cases, the researchers even disclosed that these subgroups differed on numerous important factors, and still they plodded on.
***
In this era, even results coming out of RCTs must be scrutinized. Early analyses, post-hoc analyses, deep-dive analyses, and side analyses typically discard the crucial randomization property, and should be treated as observational studies.
Of course, results from observational ("real-world") studies should be scrutinized even more. In a future post, I’ll outline how to review analyses of observational data.
02/24/2021 in Assumptions, Big Data, Cause-effect, Controls, Covid-19, Current Affairs, Data, Ethics, Fairness, False positive, Health, Medicine, Science, Significance, Statisticians, Story time, Tests, Variability | Permalink | Comments (2)
I spent a good chunk of my career designing, running and analyzing industrial experiments, and as policymakers stray further away from the underlying science, I have a bad feeling about the rollout of the Covid-19 vaccines.
At issue is a push to deviate from the treatment protocol that produced the highly promising clinical trial results. During the Pfizer trial, vaccinated participants were given one dose of the vaccine on day 1 and a second dose on day 21; the vaccine was then given one week to take effect, so the 95% efficacy was computed by counting cases from day 28 (i.e. nullifying cases between day 1 and day 28). Moderna and AstraZeneca had similar treatment protocols, all specifying two doses, but the number of days separating the doses, and the waiting period for the effect to take hold, varied by protocol.
When a carefully designed experiment returns a positive outcome, the straightforward next step is to roll out the treatment "per protocol". Any deviation from the protocol requires assumptions (coming from intuition, experience, and gut feelings) that modify the science. The point of running a randomized controlled trial is to follow the science rather than one's guts.
***
The specific alterations to the treatment protocol include: giving only one dose instead of two, stretching the interval between the two doses well beyond the tested schedule, and mixing doses from different manufacturers.
So far, the U.K. is the only country that adopts most of these as official policy while the U.S. and others currently claim they allow them for "exceptional cases". The U.S. announced that second doses are no longer reserved for those who have taken their first shots, increasing the chance that people would not receive their second doses at the prescribed time, if ever.
***
A fundamental best practice of running statistical experiments on random samples of a population is that once the winning formula is rolled out to the entire population, the scientists should look at the real-world data and confirm that the experimental results hold.
This post-market validation is hard even if properly done. That's because on rollout, everyone is eligible for the treatment, and those who have received the vaccine up to the time of analysis do not form a random sample of the entire population. So any difference between the vaccinated and unvaccinated groups may not be a pure effect of vaccination. (This difference is why we conduct randomized controlled trials, which allow scientists to isolate causes.)
The action of the U.K. government (and others who may follow suit) has severely hampered any post-market validation. It is almost impossible to compare real-world evidence with the experimental result, because most people are not even getting the scientifically-proven treatment per protocol!
Even if the rollout was perfect, the real-world outcomes are likely to deviate from the experimental findings. Statistical science gives us confidence that any such deviation is immaterial.
With the altered protocol, the real-world data come from a variety of dosing schedules - some with one dose, some with two, some with two doses 21 days apart, some with two doses 10 weeks apart, etc.
What happens when the aggregate real-world outcome falls below expectation? How will scientists figure out what are the reasons for the under-performance?
First, the scientists can't even say what the expected performance is because the experiment yielded data on just one dosing schedule, which could be the least used in the rollout.
Second, there may be several simultaneous changes to the dosing schedule. For example, one might be comparing someone with 1 dose of Pfizer on day 1 with another who got 1 dose of Pfizer on day 1 and 1 dose of Moderna on day 36. Is the difference of outcomes due to mixing and matching, or number of doses, or the gap between doses? How do we learn which factors are contributors, which are not, and the relative importance of these factors?
Third, many post-market validation studies are doomed by focusing exclusively on factors that the investigators can control, such as the factors listed above. It's possible that the policymakers are correct - that none of the dosing changes would affect the vaccine efficacy. In that case, an observed difference in outcomes is due to other factors, such as viral escape by new variants, as-yet unclear duration of protection, compliance to mitigation measures after vaccination, which people choose to take the vaccine, etc. Notably, these are factors outside the control of scientists.
Why do we need answers for those questions? If the rollout does not meet expectations, we need to correct the course. To rectify the problems, we need to know what they are. If we can't even diagnose the potential problems, as will be the case here, we will be swinging in the dark.
***
Actions have consequences. The decision to deviate from the treatment protocols that delivered the promising vaccine trial results makes it very challenging, if not impossible, to measure the impact of these vaccines. One of the key lessons of managing this pandemic so far is that good data drive good decisions, and bad data drive bad decisions. Unfortunately, policymakers have signed up for bad data, so no one should be surprised if future policies turn out badly.
P.S. On Twitter, some are offering this UK government document as evidence that the scientists have a plan for post-market validation. I recommend reading it to learn what appropriate validation steps would look like in normal circumstances, when the rollout follows the treatment protocol. In the vaccine efficacy section, it addresses unanswered questions from the trials, such as the duration of protection, subgroup analyses, and other outcome metrics. I didn't find anything in the document that estimates how deviations from the treatment protocol affect outcomes.
01/26/2021 in Assumptions, Behavior, Bias, Big Data, Cause-effect, Controls, Covid-19, Current Affairs, Data, Decision-making, Health, Medicine, Politics, Science, Significance, Statisticians, Variability, Web/Tech | Permalink | Comments (0)
I fear that the U.K. policy of one-dose vaccines will backfire, and cause the pandemic to continue longer than necessary. Here are several reasons why.
Partial protection provides a convenient excuse for vaccinated people to do away with inconvenient mitigation measures
Assuming that the trial results hold, people receiving two vaccine shots have a very low chance of getting infected, and thus it is acceptable for them to return to normal living after vaccination. (I myself would not, since I think two doses are closer to 70 percent effective in reality, and we don't know whether the vaccine stops asymptomatic spread, so continuing to reduce contacts is advisable. Still, under this scenario, one should not begrudge people who decide to return to normalcy.)
If people are given partial protection through just the first shot, it is irresponsible to return to normal living. Even the vaccinated must continue to wear masks, maintain social distancing, and stay at home as much as possible for the foreseeable future. Even now, it has proven impossible to enforce mitigation measures - at least in the U.S., where I live. Vaccination presents a convenient justification for non-compliance.
This behavioral problem alone should have doomed the one-dose lobby.
One-dose treatment solves a problem that does not exist
The first weeks of vaccination have shown that the problem is demand not catching up with supply, not the other way round. We have shipped more doses than have gone into arms. If people don't like your cakes, you can't solve the problem by selling half-cakes!
The high efficacy of single doses is wishful thinking
Anyone who believes that the efficacy of the first dose is 90% lives in a fantasy world. Such a conclusion requires cherry-picking the best-case scenario, ignoring results that don't fit the theory, ignoring the margin of error, assuming that the second shot is essentially worthless, and assuming that the first shot is effective for the pertinent period of time without any boost from the second shot. The last two assumptions expose the one-dose strategy as a tautology - a conclusion that stems directly from one's assumption.
In a prior post, I explained how any guess of single-dose efficacy requires an assumption of (a) the efficacy of the second dose and (b) the duration of protection of the first dose, for neither of which do we have direct evidence coming from the vaccine trials.
It's a jigsaw puzzle, not a Barbie doll
People who propose the one-dose strategy are treating the clinical trial results like a Barbie doll. They are swapping the green dress for the red skirt, or the high heels for flats. In fact, the toy is a jigsaw puzzle, in which each piece has its exact place and cannot be haphazardly substituted.
For example, the headline vaccine efficacy number of 95% - recall Moderna's decimal obsession - is based on a case definition that only counts cases occurring at least 7 to 14 days after the second shot. You simply can't talk about that number in the same breath as the efficacy of the first shot, which requires counting cases after the first shot and before the second. They are not comparable numbers.
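To make the arithmetic concrete, here is a sketch assuming equal-sized arms with equal follow-up time; the case counts are purely illustrative, not taken from any trial filing:

```python
# Sketch of the vaccine-efficacy (VE) arithmetic, assuming equal-sized arms
# with equal follow-up time. The case counts below are purely illustrative.

def vaccine_efficacy(cases_vaccine, cases_placebo):
    # VE = 1 - (attack rate in vaccine arm) / (attack rate in placebo arm)
    return 1 - cases_vaccine / cases_placebo

# Headline-style number: only cases occurring 7+ days after the second dose.
print(f"{vaccine_efficacy(8, 162):.0%}")   # ~95%

# A "first dose" number counts cases in a different window (after dose 1 and
# before dose 2), over different follow-up time, so the two figures are not
# comparable even though both are called "efficacy".
print(f"{vaccine_efficacy(39, 82):.0%}")   # ~52%
```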
Use the scientific method properly
What supporters of the one-dose strategy should do is to lobby the FDA and/or pharmas to immediately start new clinical trials to test the one-dose strategy against the two-dose treatment.
From a scientific perspective, the biggest problem is post-hoc theorizing. Let's use stock market performance as an example. Do you feel like the business press are having it both ways? If stock prices went down yesterday, it's because the unemployment numbers were terrible. But if stock prices went up yesterday, it's because the terrible unemployment numbers were not as bad as imagined, or because businesses expected the job market to start improving, or because executives shrugged off the terrible unemployment numbers. They end up fitting a story to the observed data, as opposed to using the observed data to confirm a hypothesis.
Given the setup of the vaccine trials, before seeing these results, no one would have argued for one dose. The argument emerges after one sees the particular outcome from the trial. Trouble is, if the trial were run many times, would we have seen the same efficacy curve? (The variance across different trials already tells us the answer is NO.)
If the front part of the curve looked less favorable (say 30%), these same people would say we can't draw inferences about one dose (as opposed to concluding that one dose isn't going to work). In other words, this analysis contains an optimism bias, as does any theorizing after peering at the data.
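To see how wobbly an early-window estimate can be, here is a small simulation - a sketch with invented numbers, not a reconstruction of any actual trial. Assume the true first-dose efficacy is 60% and only about 120 cases accrue between dose 1 and dose 2, then rerun the "trial" many times:

```python
import numpy as np

rng = np.random.default_rng(1)

true_ve = 0.60        # assumed "true" first-dose efficacy (invented)
total_cases = 120     # assumed cases accruing between dose 1 and dose 2 (invented)
# With equal exposure, the vaccine arm's share of cases is RR / (1 + RR),
# where RR = 1 - VE is the relative risk.
p_vaccine_case = (1 - true_ve) / (2 - true_ve)

estimates = []
for _ in range(10_000):                    # rerun the "trial" many times
    cases_vaccine = rng.binomial(total_cases, p_vaccine_case)
    cases_placebo = total_cases - cases_vaccine
    estimates.append(1 - cases_vaccine / max(cases_placebo, 1))

lo, hi = np.percentile(estimates, [2.5, 97.5])
print(f"true VE {true_ve:.0%}; 95% of simulated estimates fall between {lo:.0%} and {hi:.0%}")
```

The point estimate swings by tens of percentage points from sampling noise alone, before we even get to the post-hoc selection problem.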
***
Despite what I said above, I rate it as likely that certain countries will eventually follow the U.K. example and allow one-dose treatment. (As of now, the FDA has sensibly decided not to go down this path.)
Politicians score easy PR points by claiming they are helping "more people" when, in reality, such a policy will lengthen the public-health crisis. I have previously noted that the U.K. government's PR agency should be given an award.
P.S. Sadly, it appears that the incoming Biden administration is intent on elevating PR over science, as news has emerged that they will "release all available vaccines" and "not withhold second doses". This is a euphemism for giving out one dose to everyone (as opposed to reserving second doses for those who have gotten the first). For those "scientists" jumping behind this policy, I'd ask that they answer the questions listed in my previous post. I can be persuaded that the one-dose strategy will lead to a better outcome, but I'm having trouble understanding what set of assumptions leads to such a policy.
On top of the problems listed above, I think this latest news may turn some of us into vaccine skeptics. Those who believe the data show the vaccine needs two doses to work, and who get no assurance that the second dose will be available in the right time frame, may decide to opt out of the vaccine altogether.
P.P.S. In justifying this misguided policy, people are going to say "don't let the perfect be the enemy of the good". This is the cue that they are about to ignore the science. No one ever said the vaccine trial results are "perfect"; in fact, many of us believe they are less than perfect. I have written frequently about the shortcuts that were taken. We are not letting the perfect be the enemy of the good; we are opposing letting the unknown replace the good enough.
Other than some handwaving, I have not heard any description of what "the good" entails. If you can't quantify it, it's not science. If we are not going to argue about the science, then we must debate the assumptions, and the evidence supporting those assumptions. Let's go!
01/11/2021 in Assumptions, Bias, Cause-effect, Covid-19, Current Affairs, Decision-making, Health, Medicine, Science, Significance, Statisticians, Story time, Story-first, Tests, Variability | Permalink | Comments (3)
I love "data-driven" arguments. You probably do too if you're reading this blog.
What does it mean to be "data-driven"? One prerequisite is that the conclusion should change as the data change. The analogue I write about frequently on the dataviz blog: to qualify as a data visualization (as opposed to just a visualization), the visual should change as the data change.
***
Recently, there have been numerous arguments that are immutable in the face of changing data.
John Burn-Murdoch at FT tweeted his dislike of this argument:
If there are spare beds, this proves ex-post that mitigation measures were not necessary.
There are several potential problems with this argument.
John is talking about limiting factors. In a complex system, many factors combine to produce an outcome. Typically, only one or a few factors are "limiting" in the sense that if those factors were relaxed or tightened, the outcome would be directly affected. Meanwhile, there are many other factors whose manipulation would not alter the outcome. These other factors may matter under other combinations, but they don't matter right now.
Another counter-argument concerns the direction of causality. One reason why the government may shut Nightingale hospitals (for those in the U.S., these are akin to the makeshift beds set up at conference centers) is that mitigation measures succeeded in reducing demand.
Yet another counter-argument concerns the assumed counterfactual. Because mitigation measures were taken, we could not observe what would have happened had they not been. The absence of evidence is being taken as evidence of absence.
***
But I think the best way to counter this type of argument is to point out its inherent data-autonomous structure:
I) On the one hand, if the outcome is X, then we conclude Y.
II) On the other hand, if the outcome is not X, then we conclude Y.
In other words, in all cases, whether X or not X, we conclude Y.
In the concern over Nightingale hospitals, this structure translates to:
I) On the one hand, if there are spare beds, then we conclude that mitigation measures were unnecessary, causing harm without benefit.
II) On the other hand, if they run out of beds, then we conclude that mitigation measures were ineffective.
In all cases, we conclude that mitigation measures were unwise.
The problem is that the same conclusion results regardless of the data (whether spare beds are needed). So this is not a data-driven argument, despite the presence of data.
***
The Bayesian way of thinking explicitly endorses the premise that conclusions should change as the data change. Bayesian analysis leads to a posterior probability estimate, which explicitly depends on the data. If you feed the analysis different data, you get a different probability distribution. For an example of this, see this post in which I showed how the Pfizer vaccine efficacy estimate changes, assuming different trial outcomes.
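Here is a minimal Beta-binomial sketch of that idea - not the trial's actual analysis; the prior parameters and case splits below are assumptions for illustration. Put a Beta prior on theta, the share of trial cases falling in the vaccine arm, update it with the observed split, and translate theta back into efficacy:

```python
from scipy import stats

# With equal exposure in both arms, VE = 1 - theta / (1 - theta), where theta
# is the share of cases falling in the vaccine arm.
def posterior_ve(cases_vaccine, cases_placebo, a=0.7, b=1.0):
    post = stats.beta(a + cases_vaccine, b + cases_placebo)  # conjugate update
    theta = post.median()
    return 1 - theta / (1 - theta)

# Same analysis, different data, different conclusion - that is "data-driven".
for cv, cp in [(8, 162), (30, 140), (80, 90)]:
    print(f"{cv:>3} vs {cp:>3} cases -> posterior median VE = {posterior_ve(cv, cp):.0%}")
```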
The classical (frequentist) method of hypothesis testing also produces different conclusions given different data inputs. The "null" hypothesis does not change, but with different data, the location of the observed sample with respect to the "null" distribution changes, leading to different p-values.
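The same flavor in the frequentist framework - again a sketch, not the trials' actual success criterion. Hold the null hypothesis fixed, say "VE is at most 30%", which with equal exposure means the vaccine arm holds at least about 41% of the cases, and watch the p-value move as the observed case split moves:

```python
from scipy import stats

# Null: VE <= 30%, i.e. relative risk >= 0.7, i.e. the vaccine arm's share of
# cases is at least 0.7 / 1.7 (about 41%), assuming equal exposure in both arms.
theta_null = 0.7 / 1.7

for cases_vaccine, cases_placebo in [(8, 162), (30, 140), (60, 110)]:
    n_cases = cases_vaccine + cases_placebo
    result = stats.binomtest(cases_vaccine, n_cases, theta_null, alternative="less")
    print(f"{cases_vaccine} of {n_cases} cases in vaccine arm -> p = {result.pvalue:.2g}")
```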
[P.S. 1-1-2021. I switched from "agnostic" to "autonomous" as I think the argument being autonomous of the data conveys the meaning better.]
12/30/2020 in Analytics-business interaction, Assumptions, Bayesian, Cause-effect, Covid-19, Current Affairs, Data, Decision-making, Health, Science, Statisticians | Permalink | Comments (3)