Why do we have biased datasets?
The answer is simple. It's usually because biases have been actively injected into the data collection processes.
***
As U.S. colleges reopened for in-person teaching this semester, on the strength of a successful vaccination campaign, some schools have continued to enforce testing. Last year, I covered how many colleges - such as Cornell and Georgia Tech - succeeded in keeping on-campus infections down by running strict testing and tracing programs (link, link). Staff and students were tested once or twice a week.
This term, the testing policies have been modified. At some colleges, people are still getting tested weekly. Make that some people. Specifically, these colleges require people who do not show proof of vaccination to get tested weekly. Meanwhile, fully vaccinated people are tested only when they decide to - which, in practice, means when their symptoms have become severe enough that they present themselves to a testing clinic.
This is a perfect example of injecting bias into one's data collection process. More testing leads to more reported cases. This bias arises from asymptomatic and mild cases, and, as described before, is (inadequately) revealed by the positivity rate (the proportion of test results that come back positive). Compulsory weekly testing adds asymptomatic and mild cases - as well as false-positive cases - to the tally. This is actually a good thing, as Cornell and other schools demonstrated last year.
The new testing policies at some colleges mandate one set of rules for the unvaccinated and another for the vaccinated. Because only the unvaccinated are subject to weekly testing, the case count for the vaccinated will include only severe cases while the case count for the unvaccinated will include everything.
***
Such biased data directly result in biased statistics, which lead to errant decision-making.
The differential testing policy guarantees that most of the reported cases happen to unvaccinated people - even if the vaccine were useless. Assume weekly surveillance testing last year found a run rate of 100 cases. A good working assumption is that half of Covid-19 cases are asymptomatic, so that's 50 asymptomatic cases. Now this year, assume 80% of people on campus are vaccinated, and that one's vaccination status is independent of infection risk. If the run rate stays the same, then 80 cases will be among the vaccinated subpopulation while 20 cases will be among the unvaccinated. However, the testing policy has changed so that all 20 unvaccinated cases will be detected while only the very sick get tested among the vaccinated. Conservatively, we assume 10% of infections become severe. That means 8 of the 80 vaccinated cases will enter the database, accounting for 8/28 = about 30% of all reported cases.
This mix of cases is then turned into a naive estimate of vaccine effectiveness: the unvaccinated have a (70/20)/(30/80) = roughly 9 times higher chance of getting infected. The trouble is that this result is entirely driven by the reporting bias due to differential testing. No assumption about vaccine efficacy was made in the above calculation.
If we add an assumption that the vaccine is 50% effective at stopping infections, then the 80 cases among the vaccinated - the run rate from last year - should have turned into 40 cases this year. Then, only 10%, or 4 cases, would be detected because only very sick people present themselves to be tested. In this scenario, the total number of reported cases is lower, and 4 out of 24 reported cases (17%) occur among the vaccinated. The imputed "vaccine effectiveness" (using the naive methodology) shows that the unvaccinated have a (83/20)/(17/80) = roughly 20 times higher chance of getting infected relative to the vaccinated.
A relative ratio of 9 corresponds to a VE of 89%, while a ratio of 20 corresponds to 95%. In other words, bias in the data due to differential testing accounts for 89 of the 95 percentage points of naive vaccine effectiveness. The decision-maker operates on the real-world evidence of 95% VE, unaware that the bulk of this number is explained by the misguided differential testing policy.
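Here is a minimal sketch in Python of the arithmetic above, using the same made-up inputs (100 weekly infections, 80% vaccinated, 10% of infections severe); the function is purely illustrative:

```python
def naive_ve(true_ve, infections=100, pct_vax=0.8, pct_severe=0.1):
    """Naive 'vaccine effectiveness' computed from reported cases only."""
    # True infections in each group, assuming infection risk is otherwise
    # independent of vaccination status.
    vax_infections = infections * pct_vax * (1 - true_ve)
    unvax_infections = infections * (1 - pct_vax)

    # Differential reporting: weekly testing catches every unvaccinated case,
    # while only severe vaccinated cases ever reach a testing clinic.
    vax_reported = vax_infections * pct_severe
    unvax_reported = unvax_infections

    # Reported case rates relative to each group's share of the population.
    rate_vax = vax_reported / pct_vax
    rate_unvax = unvax_reported / (1 - pct_vax)
    return 1 - rate_vax / rate_unvax

print(f"Useless vaccine:       naive VE = {naive_ve(0.0):.0%}")
print(f"50% effective vaccine: naive VE = {naive_ve(0.5):.0%}")
```

(Avoiding the intermediate rounding of the case shares to 70/30, the useless-vaccine scenario comes out to roughly 90% rather than 89%; the point is unchanged.)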
***
The potential for harm goes beyond counting cases. There have been some reports claiming that vaccinated people who get sick carry even higher viral loads than unvaccinated people.
Once again, I hypothesize that the differential testing policy may explain some if not all of this effect. The average vaccinated person who gets tested has a more severe case of Covid-19 than the average unvaccinated person.
***
The reason one should resist all biased data collection processes is that they introduce additional factors that can explain some or even all of our outcome metrics, making it harder to prove that the thing we are interested in (vaccination, in this case) is the true driver of those outcomes.
***
What about those colleges that simply require all community members to get fully vaccinated? In this case, they will only test severely sick people (under the assumption that vaccination makes routine testing unnecessary). There are no unvaccinated people to draw comparisons to, and so it may appear that the above problem has been avoided.
Not so, for these colleges are likely to trumpet a comparison of case rates between this year and last year and conclude that vaccinations work. However, there are other differences between the two years. One glaring difference is the amount of testing, which has declined dramatically. Last year's case count included asymptomatic and mild cases (and false positives) while this year's doesn't. Therefore, even in the absence of vaccinations, we should expect the reported case count to fall significantly.
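To put a rough number on it, here is a back-of-the-envelope sketch in Python that reuses the earlier made-up assumptions - an unchanged infection rate, and only the roughly 10% of severe cases getting tested this year:

```python
# Hypothetical numbers reusing the post's assumptions: the same 100 weekly
# infections in both years; surveillance testing caught essentially all of
# them last year, while only the ~10% severe cases get tested this year.
infections_last_year = 100
infections_this_year = 100               # no change in the true infection rate
pct_severe = 0.1                         # share sick enough to seek a test

reported_last_year = infections_last_year                  # weekly surveillance testing
reported_this_year = infections_this_year * pct_severe     # voluntary, symptom-driven testing

drop = 1 - reported_this_year / reported_last_year
print(f"Apparent drop in reported cases with zero vaccine effect: {drop:.0%}")
```

Even with no help from the vaccine, reported cases fall by 90% in this toy example simply because far fewer tests are done.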
***
I looked up what Cornell and Georgia Tech are doing in terms of testing. Cornell conducts weekly surveillance testing "regardless of vaccination status". This is excellent.
Georgia Tech makes participation in surveillance testing voluntary, which all but guarantees that its dataset is biased by preferentially selecting people who have experienced symptoms or severe illness. The school also says, "You may participate in regular testing even if you have been fully vaccinated, but I especially encourage those who have not been vaccinated to get tested weekly," which means the symptomatic bias is more severe among the vaccinated subpopulation than among the unvaccinated.
One way of solving this is to do an additional, smaller random sample of the vaccinated. Then you can treat it as a missing data problem and do multiple imputation of the COVID status of the remaining subjects. It is similar to a problem in diagnostic testing, where there is a screening test and only the positives are given a more accurate test. They are now doing something similar in data science but don't understand what they are doing, so they end up with inflated ideas of their test accuracy.
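A rough sketch of what I mean in Python, with made-up numbers; the simple Beta-Binomial draw is just one way to do the imputation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up inputs for illustration only.
n_vaccinated = 8000          # vaccinated people on campus, mostly untested
n_side_sample = 400          # small random sample of vaccinated who get tested
positives_in_sample = 4      # positives found in the side sample

n_untested = n_vaccinated - n_side_sample

# Multiple imputation: draw the unknown prevalence from a Beta posterior
# (uniform prior), then impute infection status for the untested vaccinated.
n_imputations = 1000
prevalence = rng.beta(positives_in_sample + 1,
                      n_side_sample - positives_in_sample + 1,
                      size=n_imputations)
imputed_positives = rng.binomial(n_untested, prevalence)

total_cases = positives_in_sample + imputed_positives
print(f"Estimated vaccinated cases: {total_cases.mean():.0f} "
      f"(95% interval {np.percentile(total_cases, 2.5):.0f}"
      f"-{np.percentile(total_cases, 97.5):.0f})")
```

This gives both an estimate of the missing case count and a sense of its uncertainty, which a single imputation would hide.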
Posted by: Ken | 10/04/2021 at 02:19 AM
Hi Kaiser,
I think you will find these two papers interesting because they carry some of the errors you point out in your postings.
The problem for me is how best to make the two comparable.
http://dx.doi.org/10.1136/bmj.n2244
https://doi.org/10.1016/S2589-7500(21)00080-7
Of course it is not so easy, even though both have big samples, just to equalise the results on sample size. Can you make any suggestion ... or tell me it is not possible?
The context is to work out who would be the best candidates for boosting.
Thanks!
Posted by: A Palaz | 10/04/2021 at 03:44 PM
AP: Thanks for those links. They are interesting work that I'll review later. But they don't address the issue of bias in data collection between vaccinated and unvaccinated, as the main model only does prediction for vaccinated people. (There are biases in determination of cause of death which I'd like to see some discussion of.)
Ken pointed out one possible fix, which is to run a small random sample on the side. (The React-1 study I reviewed before uses random samples.)
Another source of information is the time series for the vaccinated. They all transitioned from unvaccinated to vaccinated at some point, so we might find a correlation between the time series of percent vaccinated and the time series of the testing rate and/or positivity rate.
These lead to crude adjustments to the averages, but that's better than not adjusting at all.
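For instance, here is a very crude sketch with made-up weekly numbers; the assumption that detected cases scale in proportion to tests performed is surely too simple, but it illustrates the direction of the adjustment:

```python
import numpy as np

# Illustrative weekly series (made-up numbers): as the vaccinated share rises,
# the testing rate falls because vaccinated people are exempt from surveillance.
pct_vaccinated = np.array([0.10, 0.30, 0.50, 0.70, 0.85, 0.90])
tests_per_1000 = np.array([950, 800, 620, 430, 280, 230])
reported_cases = np.array([40, 35, 26, 18, 11, 9])

# The correlation mentioned above: vaccination share vs. testing rate.
corr = np.corrcoef(pct_vaccinated, tests_per_1000)[0, 1]
print(f"Correlation between vaccination share and testing rate: {corr:.2f}")

# A crude adjustment: rescale each week's case count as if the testing rate
# had stayed at its initial level.
adjusted = reported_cases * tests_per_1000[0] / tests_per_1000
print("Reported:", reported_cases)
print("Adjusted:", np.round(adjusted, 1))
```

In this made-up example, the apparent decline in cases largely disappears after the adjustment - crude, but more informative than the raw counts.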
The problem with boosting (and I'm foreshadowing the next post on the blog) is that the evidence for its usefulness is weak; in particular, the idea that the booster shot works better for at-risk, older people is more of a belief than an empirical result. If the signal is weak, then it's hard to find a predictive model that works well. I'd consider using matching to balance your training dataset before modeling.
Posted by: Kaiser | 10/04/2021 at 05:55 PM