I offered a few high-level comments on the widely publicized CDC study of real-world effectiveness of the mRNA vaccines in my previous post. Today, I take a deeper dive into the study.
The main value of this real-world study comes from the weekly swabs requested from each participant. Unlike other real-world studies based on "found data", this CDC study is an organized effort with enrolled participants who agreed to send in swabs so that their infection status could be determined each week. Of the vaccine trials I reviewed, only the AstraZeneca trial included this feature (although only in the U.K. trial, and possibly only for part of it). These swabs reveal asymptomatic cases. As a result, the measured infection rate is likely to exceed what was observed during the vaccine trials.
The study reported an admirable compliance rate on those swabs. The median participant submitted all requested swabs.
Similar to the Danish study, the analysis population consists of specific higher-risk people. The CDC study focuses on healthcare workers and essential workers. Like all real-world studies, we must be careful when generalizing study results. It's one thing to say the vaccine was found to be highly effective for healthcare & essential workers; it's a completely different thing to claim that the study showed the vaccine to be highly effective for everyone (which is what the media and many "experts" have been touting all day).
Unlike the Danish study (based on comprehensive, found data), the CDC study does not include every healthcare worker or essential worker. Because of the enrollment requirement, the study population is self-selected, so the first question to ask is whether the analysis population is representative of all healthcare workers and essential workers. There is nothing in the paper to answer this question. Table 1 discloses that 7 out of 10 people in the study are under 49, over 80% are white and non-Hispanic, and 70% have zero chronic conditions (left undefined in the paper). The analysis population is thus significantly younger and healthier than the average American.
There are two validity questions. The first, addressed above, concerns the difference between the analysis population and the general population. The second concerns the differences between the vaccinated group and the unvaccinated group within the analysis population.
There are two sentences in the CDC study that perfectly describe the challenges of any real-world effectiveness study:
For example, the infection rate in Miami, Florida was 8.6%, which is 65% higher than the average of 5.2% in the study. Twelve percent of the unvaccinated group work in Miami, compared to 3.5% of the vaccinated group. The infection rate in Portland, Oregon was 1%, much lower than average, and 90% of the Portlanders included in the study were in the vaccinated group.
Another example is occupation. The infection rate of primary healthcare workers was 2%, and over 90% of them were in the vaccinated group. Meanwhile, the infection rate of first responders was 9% (4.5 times as high), and 40% of them were unvaccinated.
Those two sentences capture the core challenge of the entire field of real-world studies. The researchers conclude that VE is very high, that is to say, the unvaccinated group has a much higher infection rate than the vaccinated group. They suggest that the vaccine is wholly responsible for the observed difference in infection rates. But any of the following statements can also be true:
- Some or all of the difference is explained by the over-representation of higher-risk males in the unvaccinated group
- Some or all of the difference is explained by the over-representation of higher-risk Hispanics in the unvaccinated group
- Some or all of the difference is explained by the over-representation of higher-risk first responders (and under-representation of lower-risk healthcare personnel) in the unvaccinated group
- Some or all of the difference is explained by the over-representation of higher-risk people living in Arizona, Florida and Texas (and under-representation of lower-risk residents of Minnesota and Oregon) in the unvaccinated group
The two groups being compared differ not only by their vaccination status but also by gender, race, occupation and state of residence. The unvaccinated group has an over-representation of higher-risk individuals of each category.
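To make this concrete, here is a minimal sketch with made-up numbers (nothing from the study): suppose the vaccine cuts the infection rate by exactly 50% at every site, but the unvaccinated are concentrated at the high-risk site. The crude comparison then makes the vaccine look far more effective than it is - the classic Simpson's paradox.

```python
# Hypothetical numbers, purely for illustration: the true within-site VE is 50%.
rates = {                       # (unvaccinated rate, vaccinated rate) per site
    "high_risk_site": (0.08, 0.04),
    "low_risk_site":  (0.01, 0.005),
}
counts = {                      # (unvaccinated n, vaccinated n) per site
    "high_risk_site": (900, 200),
    "low_risk_site":  (100, 1800),
}

unvax_cases = sum(rates[s][0] * counts[s][0] for s in rates)    # 72 + 1 = 73
vax_cases   = sum(rates[s][1] * counts[s][1] for s in rates)    # 8 + 9  = 17
unvax_rate  = unvax_cases / sum(c[0] for c in counts.values())  # 73/1000 = 7.3%
vax_rate    = vax_cases / sum(c[1] for c in counts.values())    # 17/2000 = 0.85%

print(f"Crude VE: {1 - vax_rate / unvax_rate:.0%}")             # ~88%, not 50%
```

Stratifying by site (or matching, or adjusting in a regression) recovers the 50% figure; the crude comparison does not.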
If you recall my posts about the studies by the Mayo Clinic and Israel's Clalit, this is the imbalance problem they solved with a matching procedure. The CDC study did not use matching.
For studies using regression models, the usual corrective mechanism is to include adjustment terms, similar to the calendar time bias adjustment used in the Danish study. For more details, see my previous post.
In the CDC study, the only factor they adjusted for is study-site bias.
In a side comment, the researchers said they considered regression models that adjusted for sex, age, ethnicity and occupation ("individually"), and the change to VE was "<3%." They appear to be saying that those factors are unimportant because the change to VE is small. And yet, in Table 2, we learn that the study-site adjustment - the one adjustment they did make - changed the VE from 91% to 90%, a difference of just 1%.
I'd have preferred to see a model that includes all variables that can explain the difference in infection rates between the vaccinated and unvaccinated groups. Even if the effects of some of these variables are not statistically significant, they need to be in the regression model to obtain a better estimate of the effect of vaccination status - this is because of the correlation between vaccination status and those demographic variables (which is due in part to self-selection).
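Here is a minimal sketch of what such a model could look like, with invented data and a Poisson regression with a person-time offset standing in for the Cox model the study actually ran:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# One row per group of similar participants: vaccination status, covariates,
# person-days at risk, and the number of infections observed (all made up).
df = pd.DataFrame({
    "vaccinated":  [0, 0, 0, 0, 1, 1, 1, 1],
    "site":        ["Miami", "Portland"] * 4,
    "occupation":  ["first_responder", "first_responder", "hcw", "hcw"] * 2,
    "person_days": [4000, 1000, 6000, 2000, 2000, 8000, 9000, 20000],
    "cases":       [30, 2, 25, 3, 3, 4, 8, 6],
})

model = smf.glm(
    "cases ~ vaccinated + C(site) + C(occupation)",  # all suspected confounders
    data=df,
    family=sm.families.Poisson(),
    offset=np.log(df["person_days"]),                # person-time denominator
).fit()

irr = np.exp(model.params["vaccinated"])             # incidence rate ratio
print(f"Adjusted VE = {1 - irr:.1%}")
```

Leaving the other covariates out of such a model lets their effects leak into the vaccination coefficient, which is the concern described above.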
Curiously, the CDC study did not do a calendar time bias adjustment, and therefore, the bias identified in the Danish study is present. This bias originated from the drastic decline in infection rates during the first quarter of 2021. Upon vaccination, an individual migrates from the unvaccinated group to the vaccinated group. If we tally up the person-hours for the unvaccinated, they will skew towards the start of the study compared to those for the vaccinated group.
The time bias in the CDC study might be mitigated by the unusual enrollment cadence. While the study ran from mid December (start of vaccinations) to mid March, 60 percent of participants got their first doses in December. That leaves 40 percent. Since 25 percent remains unvaccinated to the end, only 15 percent got their first shots between January and mid March. So, the vaccinations were concentrated heavily in the first two weeks of the study.
This sheds some light on the self-selection bias problem described above. It seems that the 25 percent who remain unvaccinated chose not to get inoculated as there was plenty of time for them to do so.
Further, because the case counting window applied to the vaccinated group only, infections in the first few weeks of this study period can only appear in the unvaccinated group, and we know the overall infection rate was higher in those weeks.
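A quick simulation (entirely hypothetical numbers, not a claim about the actual study) shows how these two ingredients - a declining epidemic plus a counting window that charges early post-dose infections to the unvaccinated - can manufacture apparent effectiveness for a vaccine with no effect at all:

```python
import numpy as np

# Made-up simulation: the "vaccine" below has zero effect on infection risk.
rng = np.random.default_rng(0)
n, weeks = 4000, 13
weekly_risk = np.linspace(0.02, 0.002, weeks)      # epidemic declines over the study
vax_week = np.where(rng.random(n) < 0.75,          # 75% get vaccinated...
                    rng.integers(0, 3, n),         # ...mostly in the first weeks
                    weeks)                         # 25% never do

unvax_days = vax_days = unvax_cases = vax_cases = 0
for i in range(n):
    for w in range(weeks):
        infected = rng.random() < weekly_risk[w]   # same risk, vaccinated or not
        if w < vax_week[i] + 2:                    # 2-week counting window:
            unvax_days += 7                        # early post-dose time and cases
            unvax_cases += infected                # are charged to "unvaccinated"
        else:
            vax_days += 7
            vax_cases += infected
        if infected:
            break                                  # follow-up ends at infection

irr = (vax_cases / vax_days) / (unvax_cases / unvax_days)
print(f"Apparent VE of a do-nothing vaccine: {1 - irr:.0%}")
```

The simulation prints a sizeable apparent VE even though nothing protective happened, which is the kind of bias these design choices can introduce.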
Adjusting for these remaining biases will not wipe out the vaccine effectiveness but will provide a more realistic and believable measurement.
I realize I still haven't gotten to the "partial vaccination" analysis. That will appear in a future post.
Hey Kaiser,
So in the Danish study I don't think there's much regression; they just reweight by patient-days. E.g., in the care home population they vaccinated early, leaving only 1868 not vaccinated after maybe 15-17 days.
Here is my guess at the raw cases among the unvaccinated, covered by their time windows:
454 (0-14 days), then 22, 7, 5 in the later windows.
Why do they not present this? It's very annoying. Maybe you have an idea. Possibly a feeling that it will not look robust with the shrunken unvaccinated sample?
Thinking about that, how should it be dealt with?
This connects to this post because the link I see across studies is the question of which is better for presenting: patient-days at risk or patients.
I think days are better because of the relationship between infections and exposure, but obviously this gets the same effect you detail with Simpson's paradox.
So how does this connect here? Well, you made a slip in one post saying "hours at risk", and for the CDC study this applies.
One more adjustment for these special high-risk groups would be hours of exposure to high risk × degree of risk.
So just to add one more to your great list, some of the variance might also be explained WITHIN and across risk groups by such factors.
I now look at all studies that do not present detailed patient-days with disappointment, and with the question why.
Posted by: A Palaz | 04/01/2021 at 06:04 PM
AP: I think the level of disclosure in these studies is well below what's required, especially since most of these are "interim" studies where the data are not fully baked yet. Calendar time is made much more crucial because of (a) the use of case-counting windows and (b) the changing environment. Also, I do not understand why they do not publish their models - the CDC study did not publish its model either. Saying it's a Cox regression is not enough.
The CDC study addresses the issue you brought up - that you can't use infection rates per person when more and more of the cohort are getting vaccinated. So these new studies - unlike RCTs - use person-time as the denominator. As I said above, this effectively splits a vaccinated person's timeline into two parts, first counting as unvaccinated and then as vaccinated. This is the so-called Andersen-Gill extension to Cox. It's possible that this is what the Danish study did as well but nothing in the paper tells us that.
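In code terms, the splitting might look something like this (a hypothetical helper, not anything from the study):

```python
def split_person_time(follow_up_days, vaccination_day=None):
    """Return (unvaccinated_days, vaccinated_days) for one participant."""
    if vaccination_day is None or vaccination_day >= follow_up_days:
        return follow_up_days, 0                  # never vaccinated during follow-up
    return vaccination_day, follow_up_days - vaccination_day

print(split_person_time(100, 40))    # (40, 60): 40 unvaccinated days, then 60 vaccinated
print(split_person_time(100))        # (100, 0): unvaccinated throughout
```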
But that adjustment does not deal with the sharp decline in infection rate from December to March, and the fact that the unvaccinated exposure is primarily in the earlier weeks when infection rates were much higher.
I also think - and someone please correct me if I'm wrong - that the AG extension does not address the self-selection bias problem in this data. All it does is address a timing bias that would arise if a standard analysis were applied.
Posted by: Kaiser | 04/02/2021 at 03:30 PM