01/19/2022 in Covid-19, Current Affairs, Data, Errors, Ethics, Health, Mass media, Medicine, Science | Permalink | Comments (0)
As we head into the third year of the novel coronavirus pandemic, I've been reviewing some questions that are still open. The last two posts can be found here and here.
Today's further questions are inspired by the following chart put out recently by the White House.
On the surface, this chart provides evidence that the Covid-19 vaccine may protect people from getting hospitalized. If you're serious about causal analysis, you'd say all it depicts is a correlation. For the above chart, which plots "real-world" data, cannot be interpreted as if the data came from an RCT. In other words, the two groups (vaccinated and unvaccinated) are not identical except for vaccination status.
6. Why are unvaccinated people being hospitalized at an accelerating rate?
Focus on the blue line for the time being. This line shows the hospitalization rate for unvaccinated people, expressed in number of hospital admissions per 100,000. Between the start of July and the end of November 2021, the weekly hospital admissions of unvaccinated people exploded from 6 per 100,000 to 68 per 100,000. Those are rates, not counts.
This seems strange to me since I'd have expected vaccine effectiveness (VE) to come from a sharply reduced rate among vaccinated people while the rate among the unvaccinated remains steady.
During that same period, the proportion of Americans who became fully vaccinated grew from 50% to 60%. The vaccines are said to be extremely effective, which should lower the amount of virus circulating in the population. Further, the unvaccinated proportion shrank from 45% to 30%, a 33% reduction!
The same dynamic shows up on the deaths chart as well. In other words, as more and more Americans get two doses, and fewer and fewer remain unvaccinated, the death rate among the remaining unvaccinated has skyrocketed.
Why? Why? Why?
7. How come VE is increasing while immunity is waning?
Vaccine effectiveness (VE) is defined as one minus the ratio of the event rate among the vaccinated to that among the unvaccinated. Events may be cases, hospitalizations, deaths, etc. Disregarding many methodological issues that I frequently discuss on the blog, I'm taking their analysis at face value. VE measured by hospitalizations is then derived from the ratio of the two curves shown. What is very odd about the chart is that the value of VE fluctuates wildly from week to week.
As a reminder, this is the overall trend of hospital admissions in the U.S. in 2021:
Hospitalized cases tumbled to a seasonal low in July, and then started rising again. In the first half of the year, the media kept reiterating that the vaccines were directly responsible for the dropping hospitalizations. Later, they said lower immunity against Delta and/or waning immunity was the reason for the return of cases. This narrative implies that the VE would be highest in July and dropping as the cases rose in the second half of 2021.
Surprise, surprise. If we compute the ratios of the rates shown in the White House chart, VE in July was at its lowest (87%), and as the hospitalization rate of the unvaccinated jumped more than 10-fold, VE nudged up from 87% to 94%.
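To spell out the arithmetic, here is a minimal sketch. The unvaccinated rates (6 and 68 per 100,000) come from the chart; the vaccinated rates are hypothetical placeholders chosen only to reproduce the stated VE figures.

```python
# Illustrative only: computing VE from weekly hospitalization rates per 100,000.
# Unvaccinated rates are from the chart; vaccinated rates are hypothetical values
# picked to match the stated VE of ~87% and ~94%.

def vaccine_effectiveness(rate_vaccinated, rate_unvaccinated):
    """VE = 1 - (rate among the vaccinated / rate among the unvaccinated)."""
    return 1 - rate_vaccinated / rate_unvaccinated

print(round(vaccine_effectiveness(0.78, 6), 2))   # early July: ~0.87
print(round(vaccine_effectiveness(4.1, 68), 2))   # late November: ~0.94
```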
Why? Why? Why?
***
Any seasoned data analyst can tell you the very first step to good data analysis is asking the right questions. In this current series of posts, I show that the media have not been asking some of the most difficult questions. We can't pretend that we understand what's going on, unless we confront these challenges head on.
01/18/2022 in Cause-effect, Covid-19, Data, Errors, Ethics, Health, Mass media, Medicine, Science, Statisticians | Permalink | Comments (2)
The media could not conceive how the CDC could revise its estimate of the proportion of the Omicron variant so drastically, from a heart-stopping 73% to a blood-curdling 59%, in a matter of two weeks (for example, Bloomberg - scratch that, since you can't even read one article on Bloomberg. Here's MSN, so 90s.)
The reason the media is surprised, stunned, shocked, dismayed is that it didn't do its homework when it excitedly reported the 73% number.
I knew because I hopped on the CDC page that contained this number. From there, you immediately learned that 73% is a "Nowcast", which is described as "a model that estimates more recent proportions of circulating variants and enables timely public health action". In plain English, it is a forecast, not actual real-world data.
My first instinct when I see a model (because I build models for a living) is to click the very helpful button that toggles between "Nowcast on" and "Nowcast off". You can't understand any model without first looking at the actual real-world data sitting beneath it.
I was indeed surprised, stunned, shocked, dismayed. Because this was what I found (these screenshots were taken before the latest revision):
The orange section is the Delta variant. The tiny sliver of purple at the bottom of the very last column is the Omicron variant. On the table, you see that the actual proportion of Omicron in the week ending Dec 4, 2021 was 0.7%.
The next screenshot was taken when Nowcast was turned on:
The last column showed 73% Omicron, which was all over the news when this came up. Notice that the date axis changed. There are two additional weeks shown: ending Dec 11 and Dec 18. The 73% apparently concerned the week ending Dec 18.
It appears that "Nowcast" is not really a forecast but a missing data imputation procedure because this information was released right after Dec 18. This CNet news article was dated Dec 20. Presumably, the flow of data did not support real-time reporting, and so they had to resort to a model.
***
What is this Nowcast model that can aggressively turn 0.7% into 73%? Unfortunately, your guess is as good as mine. The link behind the word Nowcast on the CDC page leads to the chart itself. There is nothing on the chart that explains what kind of model Nowcast is. I found nothing on the page that explains how they turned 0.7% into 73%.
But we can measure how horribly this Nowcast model has performed. The media got this wrong too. It's not 73% versus 59%. Look at the current view of the chart with Nowcast on:
The 59% estimate is for the week ending Dec 25 while the 73% estimate is for the week ending Dec 18. The correct comparison is 73% versus 23% (the purple section of the second column from the right). They "projected" a 100-fold increase but now they say it was a 30-fold increase. No wonder they didn't want to tell us what this model is!
***
To take a Bayesian perspective, the model estimate is a kind of weighted average between past data and "prior" knowledge. In this case, the prior knowledge is "art" reflecting someone's subjective belief. We don't know much about the model but we know that this prior belief exceeds a 100-fold increase because it cancelled out the past data (0.7% of cases) and more.
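Since the CDC doesn't disclose the Nowcast model, the following is nothing more than a toy version of the weighted-average idea described above, not the actual method. It asks how much weight a prior belief (here, an arbitrary 80% Omicron share) would have to carry to pull an observed 0.7% up to a published 73%.

```python
# Toy shrinkage model (NOT the CDC's Nowcast): estimate = w * prior + (1 - w) * observed.
observed = 0.007   # measured Omicron proportion (0.7%)
estimate = 0.73    # published Nowcast estimate (73%)
prior = 0.80       # hypothetical prior belief about the Omicron share

w = (estimate - observed) / (prior - observed)
print(round(w, 3))  # ~0.91: the prior would have to carry over 90% of the weight
```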
Science in the pandemic age is just like this. Scientists running away from other scientists who are capable of evaluating the science.
[12/29/2021: corrected 10 to 100. 12/30/2021: added "the proportion of" in the first sentence, responding to Antonio's tweet. 1/2/2022: corrected 3 to 30]
12/29/2021 in Bayesian, Covid-19, Data, Errors, Ethics, Health, Medicine, Models, Politics, Science, Statisticians, Variability | Permalink | Comments (7)
The Covid-19 pandemic has transformed human communications in the U.S.
Most business meetings are (still) being conducted via Zoom or a variety of similar services, in lieu of in-person meetings. Scientists have relied on preprints instead of peer-reviewed publications; increasingly, preprints are even abandoned in favor of press releases. Merck's Covid-19 pill, known as molnupiravir (MOV), represents a case in point. (The Pfizer pill is another example.)
Previously, I have mentioned how press releases have been used to seed public opinion before any preprints or peer-reviewed publications become available (link). Merck just took this to the next level. When I first researched MOV, I looked for a preprint or a journal article, and I couldn't find any, weeks after the FDA advisory board recommended authorizing molnupiravir. This practice means independent observers are blocked from seeing any data, except those selected to support the pharma's findings.
Thus, I had to pull together information from different places: three press releases by Merck; an FDA briefing document, which is a report by FDA analysts about the second of Merck's press releases; an erratum to that document, in which they disclosed misprinting the value of the key efficacy metric (52% instead of 48%); and an Addendum to the FDA briefing document, in which the FDA acknowledged, without comment, Merck's third press release containing revised data. No detailed protocol related to the trial has been released, and I have to work with an incomplete and extremely abridged version uploaded to ClinicalTrials.gov.
Those documents are substantively different from a research article, as they present specific conclusions and offer only data in support of those conclusions. They raise more questions than they answer.
***
In this post, I trace how information about Merck's pill was staged in an apparent joint production with the FDA.
The FDA convened a meeting of advisors right after the Thanksgiving holiday, on November 30, 2021. Reference materials were released ahead of the meeting, the key document being a briefing document prepared by FDA analysts who reviewed the Merck data. This document made no reference to the final result but reported the interim results, which were almost twice as good. The prespecified primary endpoint was a difference in event rates (3% at final analysis, 7% at interim); nevertheless, they computed a relative ratio since that sounded more impressive. They then misprinted this number as 52% when it was 48%, an embarrassing mistake disclosed in a separate Erratum. Why they didn't correct the original report directly, I cannot understand. This practice is similar to the New York Times putting up a correction several days after an article went to press, tucked into a corner in the back pages, unlikely to be noticed by most readers.
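To see how an absolute difference can be dressed up as a roughly 50% "improvement", here is a minimal sketch. The event rates are hypothetical, chosen only so that the absolute differences match the 7-point (interim) and 3-point (final) figures cited above; they are not the exact trial numbers.

```python
# Hypothetical event rates, chosen to match the absolute differences cited in the post.

def risk_metrics(rate_placebo, rate_treatment):
    arr = rate_placebo - rate_treatment   # absolute risk reduction
    rrr = arr / rate_placebo              # relative risk reduction
    return arr, rrr

print(risk_metrics(0.14, 0.07))  # interim: ARR = 0.07 (7 points), RRR = 0.50 (~50%)
print(risk_metrics(0.10, 0.07))  # final:   ARR = 0.03 (3 points), RRR = 0.30 (~30%)
```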
Meanwhile, on Black Friday (Nov 26), the day after Thanksgiving, Merck issued a third press release announcing the final outcomes. This announcement came out literally two working days before the FDA meeting.
In an "Addendum" to the main report, the FDA analysts acknowledged being informed of the final analysis on Nov 22, four days before the press release. This development should have caused a five-star alarm as their primary briefing document would now be misleading readers. Instead of revising the briefing document, and pushing back the meeting of advisors if necessary, they continued. They now added an "Addendum".
The very first paragraph of the Addendum repeats the "50%" improvement talking point.
The second paragraph acknowledges receiving new data from Merck, and refers readers to Merck's own addendum.
The third paragraph tells us how many people were in the trial.
The fourth paragraph repeats the (now meaningless) interim analysis results again!
What are they waiting for? Finally, in the fifth paragraph, at the bottom of the page and continuing onto the next page, they describe the full analysis results.
Were the FDA analysts alarmed by the drastically reduced effect at full analysis? You would not know based on reading the Addendum. The fact that the primary endpoint value dropped from 7% to 3% apparently did not cause any concern. Here is the sixth paragraph that appeared after they stated the final results:
The Agency continues to evaluate the known and potential benefits and risks of MOV considering the results from all randomized participants. During the meeting, the Agency will provide additional key safety and efficacy results based on all 1433 randomized participants (full population). The review issues and benefit/risk assessments may therefore differ from the original assessments provided in the briefing document which was based on the interim analysis.
I reviewed the slides presented at the meeting, and I didn't find any additional information on efficacy.
This situation reminds me of the interim analysis of the Moderna vaccine (link). The FDA briefing document endorsed data that did not meet the FDA's requirement of half the participants reaching 6 months of follow-up, while acknowledging that they had received updated data that did not make it into the briefing document.
***
What is the harm of science by press releases?
Look at the following list of key information that is absent from those press releases:
***
There are several other interesting tidbits I gathered that didn't make it to the last post.
The trial defines someone as having high risk of a severe outcome if they meet one or more of the following criteria: 60 years old or above, diabetes, obesity, chronic kidney disease, serious heart conditions, chronic obstructive pulmonary disease, active cancer.
At first glance, they obviously succeeded in selecting a subset that has a high chance of severe disease, as the event rate (on placebo) was 10%. On the other hand, this group does not seem as high-risk as advertised, based on the very limited information that was published.
Surprisingly, only 14% of the study population was 60 years old or above, and only 3% above 75. About 14% had diabetes. Think about that for a moment. If everyone above 60 had diabetes, then no one under 60 in the trial had diabetes. If no one above 60 had diabetes, then the 14% with diabetes were all drawn from the under-60 group (about 16% of that group).
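Here is a small sketch of the bounds implied by those two 14% figures; the inputs come straight from the paragraph above.

```python
# Bounds on how the 14% with diabetes overlaps with the 14% aged 60+.
p_over60 = 0.14
p_diabetes = 0.14

# Frechet bounds on the share of the trial that is both 60+ and diabetic
overlap_max = min(p_over60, p_diabetes)               # 0.14: every 60+ participant is diabetic
overlap_min = max(0.0, p_over60 + p_diabetes - 1.0)   # 0.00: no 60+ participant is diabetic

# At the lower bound, all diabetics are under 60:
share_of_under60_with_diabetes = p_diabetes / (1 - p_over60)
print(overlap_min, overlap_max, round(share_of_under60_with_diabetes, 3))  # 0.0 0.14 0.163
```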
I find it odd that only 14% were over 60. I'd think that many of the other serious conditions are correlated with age so if you pick a random cancer patient or a random person with serious heart conditions, the person is more likely to be older than younger. Thus, I'm imagining that the trial enrolled older but healthy people, and younger people with more comorbidities. We can't be sure since they didn't disclose any details.
Also, it appears that the most at-risk are excluded from the trial. According to the abridged protocol on ClinicalTrials.gov, they excluded anyone who "is on dialysis or has reduced estimated glomerular filtration rate (eGFR) <30 mL/min/1.73m^2 by the Modification of Diet in Renal Disease (MDRD) equation." That would mean someone with chronic kidney disease can only participate in the trial if they are not on dialysis. The protocol lists many other exclusions.
Merck counts hospitalizations and deaths from "all causes". This practice differs from what was used in the vaccine trials, in which each case was adjudicated as to whether it was related to Covid-19. We don't have the Merck protocol so we don't know if any adjudication was used.
The study population was almost entirely recruited outside North America. There were only 18 participants from North America, and 40 from Western Europe. Most of the participants came from Latin America or Russia.
Quick quiz. Who wrote this?
I did not test the data for irregularities, which after this painful lesson, I will start doing regularly.
Of course, you won't know. But which of the following people is the most likely to have said such a thing?
A) A Stat 101 student
B) A data scientist working in industry during the first year of employment
C) A graduate student research assistant
D) An assistant professor in the publish-or-perish game
E) A senior tenured professor with multiple best-selling books and tens of thousands of citations
***
If you think the least likely is the most likely, then you'd be right. The answer is (E).
The person who said the sentence (in 2021) is Dan Ariely, currently a Duke professor of economics but probably known to my readers as author of a series of best-sellers in behavioral economics, starting with Predictably Irrational. (You can find the quotation in this PDF.)
Andrew has documented a series of scandals in which several of his seminal studies have been called into question. The latest one involves data obtained from an insurance company by Ariely. He now claims to have had no role in collecting and processing the data.
Neither did the people who uncovered this potential research fraud play any such role. Hence, the ability to suss out data problems does not require first-hand involvement in collecting or processing data.
To read about how the fraud was detected, see here.
***
Next time you hear about a publication in a "peer-reviewed" scholarly journal, think about this case. It is entirely possible to publish data analysis in a peer-reviewed journal without "testing the data for irregularities". In fact, researchers with hundreds of peer-reviewed publications may have never once tested data for irregularities! Better late than never, right?
P.S. (1) If you need something light for the holidays, here is an advice column Andrew, hmm, wrote for the WSJ.
(2) One of Ariely's best-sellers is "The Honest Truth about Dishonesty". He is an expert on honesty.
11/24/2021 in Analytics-business interaction, Crime, Data, Errors, Ethics, Science | Permalink | Comments (3)
Observational studies have some common characteristics that annoy the analyst. In the previous post about Pfizer's booster studies, I described a few.
These problems create the need to extrapolate results. Such extrapolation is usually supported by unverified assumptions (aka expert opinion). Of interest, the FDA allowed Pfizer to impose a causal assumption: that if the booster achieves similar levels of antibodies as the original 2nd dose, then the booster also produces similar clinical outcomes as the original two doses. See the previous post for why modifying an indirect outcome does not invariably lead to a change in the direct outcome. (This matter is yet another real-world example about correlation and causation.)
As I have explained often on this blog, making assumptions is not a sin. All statisticians make assumptions - those who claim they make no assumptions make assumptions! Let me give a quick example. Suppose a college runs a survey asking recent graduates about their post-graduation salaries. Only 50% of the graduates responded with a valid number. Should the college impute the salaries of the non-respondents? If I were the analyst, I would impute the salaries, baking in the assumption that those who don't respond are likely to be earning lower salaries (possibly even zero). However, some statisticians will argue against using imputation. They'll claim that one should just report the average of those who responded, because it is a horrible thing to modify the data - one should only correct egregious data entry mistakes but not otherwise touch the data. Are they making no assumptions?
They are making a huge assumption: they assume that the non-respondents have the same average salary as the respondents. What is the basis for that assumption? There is none, other than the desire to not tamper with the dataset. In fact, this assumption is almost surely wrong. As I said in Numbersense (link),
Beware of those who don't tamper with the dataset!
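To make the two positions concrete, here is a minimal sketch with made-up salary numbers. The 30% discount applied to non-respondents is purely an assumption for illustration; the point is that both approaches embed an assumption, but only one states it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up example: 100 graduates, only 50% respond to the salary survey.
respondent_salaries = rng.normal(60000, 10000, 50)
n_nonrespondents = 50

# "No imputation": report the respondent average, which implicitly assumes
# non-respondents earn the same on average.
naive_mean = respondent_salaries.mean()

# Explicit imputation under a labeled assumption: non-respondents earn,
# say, 30% less on average (an assumption, not a fact).
imputed = np.concatenate([respondent_salaries,
                          np.full(n_nonrespondents, 0.7 * naive_mean)])
print(f"{naive_mean:.0f}", f"{imputed.mean():.0f}")
```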
***
In this post, I'll cover several additional extrapolations that were used in analyzing the booster data.
The key immunogenicity results all measure antibody levels against the original virus (so-called wild type) but as we've been constantly reminded, the only virus currently in circulation is said to be the Delta variant. How did Pfizer bridge this gap?
According to the FDA Briefing document (link to PDF), "Pfizer proposes to infer effectiveness of the booster dose against the Delta variant from exploratory descriptive analyses of 50% neutralizing antibody titers against this variant evaluated among subjects from the Phase 1 portion of the study."
"Exploratory descriptive analysis" is not regarded as a serious form of scientific evidence - the rules have changed during this pandemic emergency. I pity the statisticians working on these studies. They were assigned an impossible task. Recall that the gold-standard randomized clinical trial (RCT) has been thrown overboard, replaced by "immunobridging" analysis. The antibody levels stimulated by the booster shot are deemed comparable to those registered (by other individuals) after their second shots, and then the "non-inferiority" in antibody levels is assumed to result in the "non-inferiority" in vaccine effectiveness (a clinical outcome not directly measured).
The Delta variant introduces another complication: as it was not present around the time of the original Pfizer adult trial, researchers do not know what the VE of the first two doses is against Delta. This breaks the causal assumption of immunobridging.
We do not have even an approximation of the VE outcome against the Delta variant for the 300-odd people in the booster trial. So if one accepts that one can expose the old blood samples to the Delta variant in the lab, and show that the antibody levels are comparable, one has to make an even stronger assumption by claiming that those antibodies will result in the same VE as measured in the Pfizer adult trial.
This line of argument unfortunately results in a contradiction. The original vaccine either works just as well against Delta as it worked against the original strain; or it is less effective, which would explain the recent surge in cases. Only one of these scenarios can be true.
If the first scenario were true, then the immunobridging assumption holds as VE for Delta is the same as VE for the wild type, but in this scenario, the recent surge in cases has nothing to do with Delta, and there is no need for a booster - since VE is the same.
If the second scenario were true, then the booster may be necessary, but in this scenario, the immunobridging methodology is broken as the prior value of VE no longer applies.
Since Pfizer applied for approval for the booster, it is assuming the second scenario. One might be tempted to prefer assumption #1 because it permits the immunobridging analysis. That is a popular justification you find in academic papers, in the same vein as "computational feasibility" or "tractability". It's risky to use this logic in studies with real-world consequences. Besides, it creates a contradiction.
Beyond the logic of the analysis method, notice that the Delta analysis is performed using only Phase 1 participants, so we are talking about 11 people between 18 and 55, and 12 people between 65 and 75 years old. I'm not sure why blood from the other 300 people cannot be similarly analyzed. Without those results, the immunobridging assumption is expanded yet again to generalize from the 23 people to the 300-plus people (then to the 20K-plus in the adult trial, and finally to the U.S. population.)
A proper RCT would have yielded a much cleaner analysis of the direct clinical outcome.
***
Another missing piece of the puzzle is the provenance of the 300 or so Phase 2/3 trial participants who were "selected" to take part in the booster study. Nowhere in the FDA Briefing document can I find the selection mechanism. Usually, this means the selection was not random from the vaccine arm of the earlier trial. If it is not random, what are the criteria?
The 300 people were reduced to about 200 who were ultimately used in any of the primary calculations. That's a drop-off of a third of the starting population, which is a high drop-off rate. (Note: Table 3 claims the evaluable immunogenicity population is 268 but the results shown in Tables 4 and 5 unexpectedly have N=210 and N=179.)
Some of the reasons for exclusion are perplexing. Let's review a few of these reasons.
Six people refused the booster shot. There is an argument that they should be excluded as they did not get the treatment under study, and there is no blood to be analyzed.
Fifteen were dropped because they did not have "at least 1 valid and determinate immunogenicity result within 28 to 42 days after the booster shot." This sets my alarm bells ringing because this exclusion can only be applied after the primary outcome of the study is measured. It's not clear what constitutes "valid" and "determinate". It's concerning to lose 5% of the study population by effectively saying we failed to obtain the primary endpoint.
Then, they dropped a further 30 people (10% of the study population) because their clinician decided that these people committed "protocol deviations" before the 3D+30 day evaluation time.
Last but not least, they also removed 34 people from the study population (15%) based on a key clinical endpoint. This exclusion criterion is described as "evidence of infection up to 1 month after booster dose". Notice it says "after" booster dose, not before. So, it appears that the following happened: they selected about 300 people to start this study regardless of their prior infection status, various people were excluded for the reasons described above, blood samples were collected, PCR tests were performed (on the day of the shot, and then when the participants self-reported symptoms), and if anyone got sick within 30 days of receiving the booster shot, they were kicked out of the study.
Remember the design of this study. The indirect outcome of antibody levels is used to infer the direct outcome of infection. So when someone is dropped because of infection, we should infer that the dropped person has inadequate antibody levels. These deletions induce a bias in the primary endpoint, raising the average antibody levels of those who remain in the study.
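To see the direction of this bias, here is a small simulation with entirely made-up numbers for antibody titers and infection risk. It only illustrates the mechanism: if infection is more likely at low antibody levels, then excluding the infected mechanically raises the average titer of those who remain.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up log-normal antibody titers for 300 hypothetical participants.
titers = rng.lognormal(mean=6.0, sigma=1.0, size=300)

# Assume (for illustration) that infection risk within 30 days falls as titer rises.
p_infect = 0.3 / (1 + (titers / np.median(titers)) ** 2)
infected = rng.random(300) < p_infect

gmt_all = np.exp(np.log(titers).mean())              # geometric mean titer, everyone
gmt_kept = np.exp(np.log(titers[~infected]).mean())  # geometric mean after exclusions
print(f"{gmt_all:.0f}", f"{gmt_kept:.0f}")           # the post-exclusion average is higher
```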
Why is the case-counting window set to 30 days instead of the 7 days used for the 2nd dose? That's not explained in the Briefing document.
***
I may come back to the safety data some day but the sample size is so small that only really loud signals can be heard. If no adverse effects are found, one can't conclude that there are no adverse effects; one can only say that there are no adverse effects that can be detected by this study design.
10/12/2021 in Assumptions, Bias, Big Data, Cause-effect, Covid-19, Decision-making, Ethics, Health, Medicine, Science, Significance, Statisticians, Variability | Permalink | Comments (0)
The FDA Advisory Panel met in mid September to discuss Pfizer's application for a booster shot. This led to a partial authorization (link). The key statement is:
Today, the U.S. Food and Drug Administration amended the emergency use authorization (EUA) for the Pfizer-BioNTech COVID-19 Vaccine to allow for use of a single booster dose, to be administered at least six months after completion of the primary series in:
In today's post, I read the FDA's Briefing Document (link to PDF), released as part of the September meeting in order to understand the science behind this decision. This briefing represents the FDA's interpretation of data submitted by Pfizer.
***
The most striking discovery is that the FDA no longer requires a randomized clinical trial (RCT) to authorize vaccines. As explained in my research methods talk (link), RCTs - long considered the gold standard for data-driven medical science - have six key components: random sample of underlying population, randomization of treatment, double-blinding, control, placebo, pre-specification.
With the FDA's blessing, Pfizer offered data from two observational studies; these studies do not have most of those key components that make RCTs the standard.
The first study is a Phase 1 trial, consisting of a total of 23 people. All of them were given a third (i.e. booster) shot about 7 to 9 months after their 2nd shot.
The second study involved about 300 participants who enrolled in the big Pfizer adult trial. (Here is the link to results from that trial.) All of them were given a third (i.e. booster) shot, about 6 months after their 2nd shot.
In all, the FDA looked at two studies that enrolled fewer than 350 people, without a control group, without placebo, and not blinded. The investigators wished to generalize from about 300 to about 300 million people in the U.S. I'm unsure as to why they couldn't find 300 other participants from the big trial so that they could randomize treatment into extra shot versus placebo.
The FDA advisory committee apparently was not fully convinced by the data but they nevertheless authorized the booster shot for those 65 plus and others at high risk. This is a curious decision.
The second study with 300 participants contained zero people aged 65 and above. The age cutoff was 55 years old. So the only older participants came from the first study, and there were N=12 such people labelled as 65 to 85 years old. Table 2 in the Briefing document revealed that despite the labeling, the 12 people's ages ranged from 65 to 75 years old so there was no data submitted for people above 75 years old. This is an example of tokenism, which I described here.
Further, the Phase 1 trial excluded all people who are at high risk of Covid-19 infection, and so all the evidence we have for the 65 plus is from 12 healthy, low-risk individuals aged 65 to 75. Table 2 confirms that this dozen did not represent the diversity of Americans - not surprising given the tiny sample size. They are 100% white, 0% obese, and 0% with comorbidities.
The evidence for the second recommendation for 18-64 who are at high risk of severe Covid-19 is similarly stretched. Presumably, indicators of risk of severe disease include comorbidities and minorities. The second study only included people aged 18-55 so the only data available for the 55-64 age group came from the first study - however, none of these people are at high risk of Covid-19, let alone severe Covid-19, by virtue of the Phase 1 study's exclusion criterion.
As for those in the 18-55 age range, the second study enrolled 306, of which only 55 have comorbidities, only 28 are black Americans, and only 2 are native Americans. In short, the FDA authorization is based on observational studies that have really small sample sizes, which get attenuated further when the recommendations are restricted to subgroups. The advisors extrapolated these results to subgroups that have few or zero data based on their expert opinion. (I'm not denigrating expertise, just pointing out how it is applied.)
***
The FDA told the pharmaceutical companies months ago that they no longer have to run RCTs. They said that future "modified" vaccines can be approved based on "immunobridging" analysis. In this Briefing Document, the FDA further extends the rule to booster vaccines that are not modified from the original, which is what the Pfizer booster shot is.
The logic of immunobridging is to show "non-inferiority" in antibody levels after taking the booster shot compared to after taking the two-dose series. Instead of comparing vaccinated to placebo as in an RCT, such analyses compare people who took the booster shot to people who took the first two shots. If the antibody levels are non-inferior, as defined by statistical criteria, then it is presumed that the booster shot will provide the same level of protection as the original two doses, i.e. the famous 95% vaccine efficacy.
Pfizer submitted data that showed the antibody levels after the booster shot were non-inferior, using thresholds to which the FDA has agreed.
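For readers unfamiliar with this kind of analysis, here is a rough sketch of what a non-inferiority comparison of titers can look like. The data are simulated, and the 0.67 margin for the geometric mean titer ratio is a commonly cited value used purely for illustration; I'm not claiming these are Pfizer's numbers or the exact thresholds the FDA agreed to.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated neutralizing titers (log-normal), purely illustrative.
booster = rng.lognormal(6.2, 0.9, 200)   # after the 3rd dose
primary = rng.lognormal(6.0, 0.9, 200)   # after the 2nd dose (different people)

# Work on the log scale: difference in mean log titers = log of the GMT ratio.
log_diff = np.log(booster).mean() - np.log(primary).mean()
se = np.sqrt(np.log(booster).var(ddof=1) / len(booster) +
             np.log(primary).var(ddof=1) / len(primary))

gmt_ratio = np.exp(log_diff)
ci_lower = np.exp(log_diff - 1.96 * se)  # lower bound of the 95% CI for the GMT ratio

margin = 0.67  # illustrative non-inferiority margin
print(round(float(gmt_ratio), 2), round(float(ci_lower), 2), ci_lower > margin)
```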
***
What is "immunobridging"? This bridge is what I'd call a "causal assumption".
You see, neither Pfizer nor the FDA can define the relationship between antibody levels and getting infected. During the Advisory Committee meeting, a Pfizer scientist acknowledged, "We actually looked at our breakthrough cases in our placebo-controlled phase 3 study, and have compared the antibody titers where we had the opportunity in individuals that got the disease versus the ones that didn’t, and we were also unable to really come up with an antibody threshold. So I think there’s probably a much more complex story and not easily just addressed with neutralizing antibodies."
What they were hoping for is a simple rule that says antibody levels > X means vaccinated person is protected. The "much more complex story" implies that the rule must involve not just antibody levels but many other factors.
We therefore know that antibody levels are necessary but not sufficient to induce immunity. The causal model requires other variables, some of which may be unmeasured. The chosen solution is to make a causal assumption: just assume that antibody levels are sufficient for protection. This is the meaning of the following sentence in the summary section of the FDA Briefing Document (link to PDF):
Effectiveness of the booster dose against the reference strain is being inferred based on immunobridging to the 2-dose primary series, as assessed by SARS-CoV-2 neutralizing antibody titers elicited by the vaccine.
Two steps are involved here: the 95% VE estimated in the adult trial is assumed to be directly caused by the vaccine producing enough antibodies to fight off infection - while assuming no other factors play a role; then, the booster shot is assumed to produce 95% VE if it induces "non-inferior" antibody levels compared to the original two shots.
***
The next question is how long immunity from the booster shot will last. The short answer is 1-2 months; it could be more but we don't have the data since this is another interim analysis with a brief follow-up period.
The second study has the bulk of the people (about 300). The follow-up periods range from 1.1 to 2.8 months post booster shot. But we should subtract 1 month because there is a case-counting window that would start 30 days after the booster shot. Effectively, the period of time during which these 300 people can accumulate cases is as low as 3 days, and up to 1.8 months (54 days). (The first study with 23 participants has the same maximum follow-up time.)
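For those checking the arithmetic behind the "3 days to 54 days" range, here is the calculation, assuming roughly 30 days per month:

```python
# Rough arithmetic behind the effective case-accumulation window (illustrative).
days_per_month = 30
counting_window = 30   # cases only start counting 30 days after the booster

for follow_up_months in (1.1, 2.8):
    follow_up_days = follow_up_months * days_per_month
    effective_days = max(0, follow_up_days - counting_window)
    print(follow_up_months, round(effective_days))   # ~3 days and ~54 days
```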
Where does the 30 days come from? This is analogous to the 14-day period in which cases are removed if they occur in vaccinated people. The antibody measurements were taken 30 days after the booster shot and compared to measurements taken at 2D+30 days in the previous adult trial. Since we are inferring protection based on antibody levels, we can't infer protection for any time prior to 3D+30 days by measuring antibodies at 3D+30 days. Effectively, for the third dose, the case-counting window has been pushed from 14 days after the shot to 30 days after the shot. Anyone who gets sick within one month after getting the booster will not be considered "breakthrough".
***
This post just scratches the surface of all the issues that arise when we don't have RCTs to evaluate medical interventions. These observational studies are small, do not have randomly selected participants, do not have a placebo/holdout arm, and are not blinded. As a result, any observed outcome could have been caused by a host of factors, including unknown or unmeasured variables. This forces data analysts to make assumptions. The quality of the analysis depends heavily on the quality of these assumptions.
10/05/2021 in Assumptions, Bias, Cause-effect, Controls, Covid-19, Data, Decision-making, Ethics, Health, Medicine, Models, Science | Permalink | Comments (0)
Why do we have biased datasets?
The answer is simple. It's usually because biases have been actively injected into the data collection processes.
***
As U.S. colleges reopened to in-person teaching this semester, on the strength of a successful vaccination campaign, some schools continue to enforce testing. Last year, I covered how many colleges, such as Cornell and Georgia Tech, succeeded in keeping on-campus infections down by running a strict testing and tracing program (link, link). Staff and students were tested once or twice a week.
This term, the testing policies have been modified. At some colleges, people are still getting tested weekly. Make that some people. Specifically, these colleges require people who do not show proof of vaccination to get tested weekly. Meanwhile, fully vaccinated people are tested only when they decide to, which means when their symptoms have become so severe that they present themselves to a testing clinic.
This is a perfect example of injecting bias into one's data collection process. More testing leads to more reported cases. This bias is due to asymptomatic cases and mild cases, and as described before, is (inadequately) revealed by the positivity rate (what proportion of test results come back positive). Compulsory weekly testing adds asymptomatic and mild cases - as well as false positive cases - to the tally. This is actually a good thing, as Cornell and other schools demonstrated last year.
The new testing policies at some colleges mandate one set of rules for the unvaccinated, and another set of rules for the vaccinated. Because only the unvaccinated are subject to weekly testing, the case count for the vaccinated will include only severe cases while the case count for the unvaccinated will include everything.
***
Such biased data directly result in biased statistics, which lead to errant decision-making.
The differential testing policy guarantees that most of the reported cases happen to unvaccinated people - even if the vaccine were useless. Assume weekly surveillance testing last year found a run rate of 100 cases. A good working assumption is that half the Covid-19 cases are asymptomatic so 50 asymptomatic cases. Now this year, assume 80% on campus are vaccinated, and one's vaccination status is independent of infection risk. Then if the run rate stays the same, then 80 cases will be among the vaccinated subpopulation while 20 cases will be among the unvaccinated. However, testing policy has changed so that all 20 unvaccinated cases will be detected while only the very sick get tested among the vaccinated. Conservatively, we assume 10% of infections become severe. That means 8 of the 80 vaccinated cases will enter the database, accounting for 8/28 = about 30% of all reported cases.
This mix of cases is then turned into a naive estimate of vaccine effectiveness: the unvaccinated have (70/20)/(30/80) ≈ 9 times the chance of getting infected. The trouble is that this result is entirely driven by the reporting bias due to differential testing. No assumption of vaccine efficacy was made in the above calculation.
If we add an assumption that the vaccine is 50% effective at stopping infections, then the 80 cases among the vaccinated - the run rate from last year - should have turned into 40 cases this year. Then, only 10%, or 4 cases, would be detected because only very sick people present themselves to be tested. In this scenario, the total number of reported cases is lower, and 4 out of 24 reported cases (17%) occur among the vaccinated. The imputed "vaccine effectiveness" (using the naive methodology) shows that the unvaccinated have (83/20)/(17/80) ≈ 20 times the chance of getting infected relative to the vaccinated.
A relative ratio of 9 times corresponds to VE of 89% while 20 times is 95%. In other words, bias in the data due to differential testing explains 89% of the 95% naive vaccine effectiveness. The decision-maker operates on the real-world evidence of 95% VE, unaware that the bulk of this number is explained by the misguided differential testing policy.
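The entire worked example fits in a few lines of code. This sketch simply reproduces the post's assumptions: 80% vaccinated, a baseline of 100 cases, 10% of infections severe, full detection among the unvaccinated, and severe-only detection among the vaccinated.

```python
# Reproducing the worked example: differential testing creates apparent VE.
def naive_ve(true_ve, pct_vaccinated=0.8, baseline_cases=100, severe_share=0.10):
    # True infections split by vaccination status (independence assumed),
    # then the vaccine reduces infections among the vaccinated by true_ve.
    unvax_cases = baseline_cases * (1 - pct_vaccinated)
    vax_cases = baseline_cases * pct_vaccinated * (1 - true_ve)

    # Reporting: all unvaccinated cases detected (weekly testing);
    # only severe vaccinated cases detected (testing on symptoms).
    reported_unvax = unvax_cases
    reported_vax = vax_cases * severe_share

    # Naive relative risk: reported cases divided by population share.
    rr = (reported_unvax / (1 - pct_vaccinated)) / (reported_vax / pct_vaccinated)
    return 1 - 1 / rr   # naive "vaccine effectiveness"

print(round(naive_ve(0.0), 2))  # 0.9: apparent VE for a useless vaccine (~89% with the rounded shares above)
print(round(naive_ve(0.5), 2))  # 0.95: apparent VE for a 50%-effective vaccine
```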
***
The potential for harm goes beyond counting cases. There have been some reports that claim that vaccinated people who got sick carry even higher viral loads than unvaccinated people.
Once again, I hypothesize that the differential testing policy may explain some if not all of this effect. The average vaccinated person who gets tested has a more severe case of Covid-19 than the average unvaccinated person.
***
The reason why one should resist all biased data collection processes is that they introduce additional factors that can explain some or even all of our outcome metrics, making it harder to prove that the thing we are interested in (vaccination, in this case) is the true driver of those outcomes.
***
What about those colleges that simply require all community members to get fully vaccinated? In this case, they will only test severely sick people (under the assumption that the vaccine is an invitation to the virus). There are no unvaccinated people to draw comparisons to, and so it may appear that the above problem has been avoided.
Not so. For these colleges are likely to trumpet a comparison of case rates between this year and the previous year, and conclude that vaccinations work. However, there are other differences between the two years. One glaring difference is the amount of testing, which has declined dramatically. Last year's case count included asymptomatic and mild cases (and false positives) while this year's case count doesn't. Therefore, even in the absence of vaccinations, we should expect the reported case count to fall significantly.
***
I looked up what Cornell and Georgia Tech are doing in terms of testing. Cornell conducts weekly surveillance testing "regardless of vaccination status". This is excellent.
Georgia Tech makes participation in surveillance testing voluntary, which all but guarantees that their dataset is biased by preferentially selecting people who have experienced symptoms or severe illness. They also say, "You may participate in regular testing even if you have been fully vaccinated, but I especially encourage those who have not been vaccinated to get tested weekly," which means the symptomatic bias is more severe among the vaccinated subpopulation than the unvaccinated.
10/01/2021 in Bias, Cause-effect, Covid-19, Data, Decision-making, Errors, Ethics, Health, Medicine, Rules, Science, Tests | Permalink | Comments (3)
As your smartphone scans your face and then unlocks the device, have you ever asked how well biometrics authentication works? Does it work just as well as fingerprints? How does one go about measuring its accuracy?
***
Biometrics started with fingerprints, expanded to face recognition, and now encompasses other types of measurements such as voiceprints.
The basic steps of biometrics authentication are: capturing the signal (image, voice, video, etc.), turning the signal into data, converting data into scores, which measure the likelihood that the biometrics data came from the device owner, and determining whether to grant access based on whether the score exceeds a certain threshold.
As with any AI/predictive model, the software must be pre-trained using labeled datasets, e.g. images known to be those of the device owners.
The authentication software makes a binary decision (Allow/Block). Block might involve iterative Retries but I'll ignore this complication. Such a prediction system makes two types of errors: blocking someone who is the device owner (measured by the false rejection rate, FRR) or accepting someone who isn't the owner (measured by the false acceptance rate, FAR). While we've all heard anecdotes of people who've been erroneously shut out of their phones, I haven't actually seen a numeric error rate ... until now.
Reading Significance magazine recently, I came across one study that quantifies the error rates of biometrics authentication. A more detailed article by the same team is found in Communications of ACM.
We tend to think fingerprints are unique. It turns out that authenticating users with other biometrics data is much less accurate - multiple orders of magnitude less so! According to these authors, the field summarizes the two error rates with one number known as "Equal Error Rate" (EER), which is the setting under which FRR and FAR have the same values. If you've read Chapter 4 of Numbers Rule Your World (link), you'll hopefully question why FRR should equal FAR. The costs of the two types of errors are clearly different, and reflect an individual tradeoff between convenience and security. (Given the lack of scrutiny of these systems, one infers that most people sacrifice security at the altar of convenience.) Note also that falsely rejecting the owner is an annoyance each and every time it happens, directly felt by the true owner, while falsely accepting an imposter may not be discovered until the owner realizes s/he has been harmed.
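To make the definition concrete, here is a rough sketch of how an EER can be computed from match scores. The score distributions are simulated, not the authors' data; the point is only that one sweeps the acceptance threshold until the false rejection and false acceptance rates cross.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated match scores: genuine attempts (the true owner) score higher on
# average than impostor attempts. Entirely made-up distributions.
genuine = rng.normal(0.7, 0.15, 5000)
impostor = rng.normal(0.4, 0.15, 5000)

best = None
for t in np.linspace(0, 1, 1001):
    frr = np.mean(genuine < t)    # owner rejected: score below threshold
    far = np.mean(impostor >= t)  # impostor accepted: score at/above threshold
    if best is None or abs(frr - far) < best[0]:
        best = (abs(frr - far), t, frr, far)

_, threshold, frr, far = best
print(round(threshold, 2), round(frr, 3), round(far, 3))  # EER is where FRR ≈ FAR
```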
The researchers said that their face recognition software has an EER of 4 percent, which means that 4 out of 100 times the owner's face is read, the software erroneously denies entry, and 4 out of 100 times someone else requests access, the intrusion goes undetected.
Voice recognition software (voice-printing) is shown to be more than 8 times worse than face recognition: the error rates are 35% each. About a third of the time the owner requests access, the software would decide to block. (These authors are pitching a system that combines multiple sources of biometrics data.)
***
When reading reports about predictive models, we should examine how the error rates are measured.
Take the false rejection rate. One would have to present images of true owners to the phone, and then compute what proportion of these images the phone erroneously decides to be imposters. Whether the FRR is credible depends on how the investigators select the set of test images. For example, if the test images are the same images used to train the model, the chance of error is smaller.
Each face recognition system has its strengths and weaknesses. Some, for example, are fooled by glasses. People find that they have to take off their glasses to unlock their phones. If the set of test images does not include images of the true owners wearing glasses, then the FRR is under-estimated. If detecting glasses is a strength, not a weakness, of the system being evaluated, the FRR is now over-estimated.
The false acceptance rate is even harder to measure accurately because the investigators must compile a set of images of imposters. There are infinitely many ways to evade the software. The researchers of this study adopted a popular method: "We performed the testing through a randomly selected face-and-voice sample from a subject we selected randomly from among the 54 subjects in the database, leaving out the training samples." This method relies on "randomization".
Almost always, such a method of selecting test samples over-estimates the accuracy rates. That's because in real life, the people attempting to break in are not randomly drawn from the entire population; they are bad guys who are likely to exploit weaknesses of the technology. The authors included an interesting example of one such attack strategy. Certain software assesses the quality of the images submitted for verification, and assigns higher importance to higher-quality images. This makes sense but the system is open to a form of attack: the bad guys deliberately submit poor-quality images, knowing that the software can't keep locking out true owners who provide low-quality images.
It's challenging to compile test images for measuring false acceptance rate, as it requires specifying how imposters are likely to attack the system.
The key takeaway is that error rates coming from research studies are heavily affected by the investigators' choice of test images. As modelers, we can fool ourselves by presenting easy but unrealistic images for validation. Ideally, we should employ a neutral third party to evaluate these systems.
***
This last section dives into some technical details, which you can skip if not interested.
Below is a table and a figure included in the Significance article, which provides data on the error rates:
The researchers compared three authentication schemes: face recognition, voice recognition and one that fuses both face images and voice prints (described as "feature-level multimodal fusion"). Table 1 says that the fusion scheme has the lowest EER, followed by face recognition; voice recognition has a terrible EER of 35%.
Figure 2 presents the components of the EER in the form of an ROC curve. However, the results shown in this Figure do not match what is shown in Table 1.
The ROC curve plots the "true positive rate" against the "false positive rate". According to the authors, the false positive rate is the FAR, the chance of letting an imposter through. (This definition implies that a positive result is positively identifying the true owner.) The inverse of a "true positive" is a false negative, that is to say, to mistakenly block the true owner of the device. Thus the FRR (false rejection rate) is the false negative rate, which is 1 - the true positive rate.
With ROC curves, the top left corner represents the perfect system that makes zero false positive mistakes and 100% true positives (i.e. zero false negative mistakes). The ranking of the three authentication schemes should therefore be Blue > Red > Green, i.e. Feature/Fusion > Voice > Face. But this chart contradicts the data shown in Table 1 where the ranking is Feature/Fusion > Face >>> Voice.
The EER is "the value at which FAR and FRR are equal". The table shows a summary statistic computed from the FAR and FRR while the chart plots the two metrics separately. If the underlying values of FAR and FRR are the same, the ranking should not differ.
Let's locate the points at which FAR equals FRR on the ROC curve. Recall that the vertical axis is 1-FRR while the horizontal axis is FAR. Thus, when FAR equals FRR, y = 1-x.
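Here is a tiny sketch of that check with made-up ROC points: the EER sits where the curve crosses the line y = 1 - x.

```python
import numpy as np

# Made-up ROC points: false acceptance rate (x) and true positive rate (y).
far = np.array([0.00, 0.05, 0.10, 0.20, 0.35, 0.60, 1.00])
tpr = np.array([0.00, 0.55, 0.70, 0.82, 0.90, 0.96, 1.00])

# At the EER point, FRR = FAR, i.e. 1 - TPR = FAR, i.e. TPR = 1 - FAR.
gap = tpr - (1 - far)            # changes sign where the curve crosses y = 1 - x
idx = np.argmin(np.abs(gap))
print(far[idx], round(float(1 - tpr[idx]), 2))  # FAR and FRR near the crossing: ~0.2 each
```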
As shown above, all three schemes have EERs around 20%, with the best one closer to 10%. None of these have EER under 10% and voice recognition isn't 5 times worse than face recognition. So I'm very confused by these figures. I can't find any more details about this research beyond those two articles.
A 20% error rate is a far cry from what we have come to expect from fingerprinting.
09/30/2021 in Algorithms, Bias, Big Data, Business, Chapter 4, Data, Decision-making, Errors, Ethics, False negative, False positive, Models, NumbersRuleYourWorld, Science, Statisticians, Tests, Variability, Web/Tech | Permalink | Comments (0)
Today is a good day to review some of the things you've read on this blog all through the pandemic.
***
As governments pushed very hard for third doses of the Covid-19 vaccine, it may feel like ancient history when - back in January, only about six months ago - many experts were pushing for one-dose mRNA vaccines, claiming that a single dose gives you close to 90% effectiveness.
I sounded the alarm in a post called One Dose Vaccine Elevates PR Over Science (January 2021), which began with:
"I fear that the U.K. policy of one-dose vaccines will backfire, and cause the pandemic to continue longer than necessary."
We are now certain that those who argued for single doses advocated a policy that likely resulted in unnecessary suffering.
In another post, appearing right after the Pfizer EUA in December 2020, I laid out the reasons why the data cannot support a one-dose regimen. (One Dose Pfizer is Not Happening and Here's Why) Nothing I said in December has become outdated. It is all still true.
***
In that same post, I predicted that "Partial protection provides a convenient excuse for vaccinated people to do away with inconvenient mitigation measures." Little did I know that CDC would subsequently endorse this folly by telling vaccinated people to drop their masks.
The CDC guidance was based on hope-fueled over-interpretation of the data. Until recently, most experts claimed that the vaccine stops infection, even asymptomatic infection, and spread. Now, they have retracted those claims. But none of those outcomes were ever formally measured in the clinical trials.
Eight months ago, when vaccinations were just starting, you read here: "I think two doses are closer to 70 percent effective in reality, and we don't know that the vaccine stops asymptomatic spread, and so continuing to reduce contacts is advisable." Many places that removed those restrictions have been forced to reimpose them.
***
"One of the key lessons of managing this pandemic so far is that good data drive good decisions, and bad data drive bad decisions. Unfortunately, policymakers have signed up for bad data, so no one should be surprised if future policies turn out badly."
That was the conclusion of another post from January, Actions have Consequences: the Messy Aftermath of One-Dose Pfizer. The situation, as of August, has not improved. At every turn, governments have failed to collect relevant data, sufficient data, and good-quality data. In fact, they have actively interfered with the ability to learn from data.
The central example from that January post explains one (of many) reasons why real-world vaccine effectiveness studies have wildly exaggerated the effectiveness of the mRNA vaccines.
"A fundamental best practice of running statistical experiments on random samples of a population is that once the winning formula is rolled out to the entire population, the scientists should look at the real-world data and confirm that the experimental results hold."
"The action of the U.K. government (and others who may follow suit) has severely hampered any post-market validation. It is almost impossible to compare real-world evidence with the experimental result, because most people are not even getting the scientifically-proven treatment per protocol!"
That's right. The original vaccine efficacy measure came from clinical trials in which the participants followed a precisely-timed two-dose regimen. In the real world, many countries deviated from the prescribed protocol, adopting policies such as a single dose, extending the dose interval from three weeks to three months, or mixing and matching vaccines. It is an insult to science to pretend we are comparing apples to apples.
These issues did not catch anyone by surprise. When I raised those concerns in January, Pfizer was just being rolled out and not a single real-world study had commenced.
A particularly galling detail is very telling. In the U.K., the average duration between two doses is 80 days (almost 3 months!) When the U.K. studies apply the (in)famous 2D+14 case-counting window - that is to say, the researchers nullify all cases that occur after the first shot and before 14 days after the second shot, they are discounting cases for more than three full months! The rationale for the case-counting window is that the vaccine requires time to attain optimal effect - surely, a vaccine that needs 94 days to become effective isn't one that has practical value!
***
It is sobering to look at the data today (well, yesterday when I pulled this chart), courtesy of the fine folks at OurWorldinData:
Israel and the U.K. won countless headlines in the first half of the year when their case rates dropped to historic lows - they were the loudest in attributing the entire drop in reported cases to one and only one cause: widespread vaccinations. Today, they have some of the highest infection rates, much higher than in South Africa and India, which had experienced brutal surges but have relatively low vaccination rates.
In the first week of May, I wrote a post called Curve Watching During the Pandemic. Even when the cases were low in Israel and the U.K., if one was willing to look at non-conforming data, one would have noticed gaping holes in the idea that vaccinations were the sole explanation for the trends at the time.
In that post, I concluded: "Simple one-factor models aren't going to work to explain trends across countries and time. A good causal model should include a baseline trendline, a vaccine factor, plus lockdowns and other mitigation measures." To this day, none of the studies covered by the media follow this strategy.
***
It is well past time to admit that the real-world vaccine studies have dramatically over-stated the effectiveness of the mRNA vaccines. This is a failure of science. Not only that, it is a systematic failure because I do not know of a single study that has committed the opposite error of under-estimating the VE! It's clear science has not fielded an A team in this crisis.
On this blog, I reviewed many of the influential real-world studies that created the narrative that vaccines are as effective in the real world as they were in clinical trials. I have indicated a wide variety of problems with such studies, and all the reasons why their conclusions are overly optimistic.
If you are interested in research methodologies, read this series of posts from March 2021:
Real-world Studies: How to Interpret Them (Mayo clinic/nference)
Real-world studies: limits of knowledge (Mayo clinic/nference)
The Confusing Picture in Israel and in the Israel Study (Israel Clalit)
Note on a Simpson's Paradox in Real World Vaccine Effectiveness Studies (Denmark)
What the Danish Study tells us about the CDC Study on Real-World Effectiveness (Denmark, CDC)
Eventually in July, Public Health England published a study that provided some raw data, which I harnessed to explain these abstract concepts. This exercise led to a post called Real World Vaccine Studies Consistently Overstate Vaccine Effectiveness. In this post, I laid out several key adjustments that should be made to real-world VE calculations, and how these adjustments would have resulted in far less rosy pronouncements. My conclusion about mRNA vaccines: "A realistic estimate of VE is probably closer to 60%, which is an excellent number."
This is no mere squabbling amongst scientists. I repeat what I said above: "Good data drive good decisions, and bad data drive bad decisions." Throughout this crisis, the policymakers did not work with good data; by this, I also mean good analytical findings.
***
In my very first review of a real-world study (Mayo Clinic/nference), I stated the fundamental challenge facing real-world studies: "The simplest first analysis is to compare the case rate of the vaccinated people to the case rate of the unvaccinated people. This is hopelessly flawed because in a real-world study, we must not assume 'all else equal.' People who have received the vaccines at this point are apples and oranges to those who haven't."
That was March 2021. Fast forward to August, and I'm saddened to report that the situation has deteriorated. Unfortunately, the more recent studies have almost all adopted the "simplest first analysis" that is "hopelessly flawed".
Take the recent full court press relating to the first few days of data after Israel started giving third doses to older people. On the very next day, the case rate amongst those who got the third shot was immediately below that of those who only had two shots. To a statistician, this gap presents the best estimate of selection bias we know of - no vaccine can be expected to work at full strength within one day, so the more likely explanation is that people who rushed to the front of the line have a lower baseline propensity to get infected by the coronavirus. In other words, the third dose unintentionally reveals who the lower-risk people are.
I discuss this in a recent blog post, which showed up on the dataviz section of my blog. This selection bias is yet another reason why the early real-world studies - even those using more advanced methods - over-estimated the VE. Thus, I concluded: "Statistics is about grays. It's not either-or. It's usually some of each. ... When they rolled out two doses, we lived through an optimistic period in which most experts rejoiced about 90-100% real-world effectiveness, and then as more people get vaccinated, the effect washed away. The selection effect gradually disappears when vaccination becomes widespread. Are we starting a new cycle of hope and despair? We'll find out soon enough."
***
This is a very long post to say thank you for your past year of support. Please tell your friends about this blog. I can't promise I will get everything right but so far, the record looks good :)
P.S. The big news of this week is the FDA's "full approval" of the Pfizer vaccine. I am not devoting more space on the blog for it than this paragraph because there is nothing to discuss: the only additional information that was publicized since the interim analysis in December was the "6-month" update, which is nothing but (link). This was not a decision based on science. In fact, on cable news yesterday, the experts were not talking about the science. The entire rationale for this decision sounded as if someone at the FDA spent time reading consumer surveys. I refer you to Peter Doshi who nicely summarized the issues at BMJ (link).
08/24/2021 in Behavior, Bias, Cause-effect, Covid-19, Data, Decision-making, Errors, Ethics, Health, Mass media, Medicine, Models, Politics, Science, Statisticians, Surveys, Tests | Permalink | Comments (0)