From the beginning, the U.S. has never gotten religion about accurately measuring the pandemic. In March 2020, I wrote a piece for Wired about the need for broad-based testing, because selectively testing only severe cases is bound to give us a biased picture of the crisis. It was known even then that a significant portion of infections is asymptomatic - a silent path by which the virus passes from person to person. The situation has not improved much since. Recent counting rule changes imposed selectively on vaccinated people have made it far worse, but that's the subject of a different post.
I'm delighted to report on a valiant effort to correct this oversight - in Slovenia, a country of about 2 million people. The relevant papers have just been published here and here.
***
Because we want to minimize bias, randomization is our friend. I have previously explained how randomly assigning treatment in a clinical trial creates covariate balance (here), which allows researchers to assume "all else equal" except for the treatment.
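A quick simulation shows that balancing act at work. This is a minimal sketch with made-up covariates, not data from any actual trial:

```python
# Minimal sketch: random assignment balances covariates automatically.
# All numbers here are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Two baseline characteristics of a hypothetical participant pool.
age = rng.normal(50, 15, size=n)    # years
risk = rng.uniform(0, 1, size=n)    # an arbitrary health score

# Randomly assign half to treatment, half to control.
treated = rng.permutation(n) < n // 2

for name, x in [("age", age), ("risk", risk)]:
    print(f"{name}: treated mean {x[treated].mean():.2f} vs "
          f"control mean {x[~treated].mean():.2f}")
# The group means nearly coincide - and the same holds for covariates
# we never measured, which is what "all else equal" buys us.
```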
In this case, we want to test a random sample of people selected from the population to avoid the self-selection bias that permeates the usual counts of "cases". The demographic and medical profiles of the sample should mimic those of the entire population. In the Slovenian study, they applied standard statistical sampling methodology, as follows: the country is divided into 300 census districts; from each district, they selected 10 persons at random; a total of 3,000 Slovenians were sent communications inviting them to participate in this study; just under half agreed, leading to a sample size of 1,368.
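For the curious, here is roughly what that two-stage selection looks like in code. The `registry` table is hypothetical (one row per resident, with a district column); the actual study presumably drew from something like a national population register:

```python
# Sketch of the two-stage design: 10 people drawn at random from each
# of 300 census districts. The `registry` dataframe is hypothetical.
import pandas as pd

def draw_invitations(registry: pd.DataFrame, per_district: int = 10) -> pd.DataFrame:
    # groupby-sample guarantees every district contributes exactly
    # `per_district` people, regardless of district size
    return registry.groupby("district").sample(n=per_district, random_state=0)

# invited = draw_invitations(registry)   # 300 districts x 10 = 3,000 invitations
```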
Because of this random (i.e. broad-based) testing, the researchers uncovered not only self-reported symptomatic cases but also asymptomatic cases. All participants were given PCR tests at the start of the study in April, which was towards the end of the first wave in Slovenia. The estimated prevalence was 0.15%, or 150 per 100,000.
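As a quick back-calculation (mine, not the paper's): 0.15% of 1,368 works out to about 2 positive PCR tests. With counts that small, the exact binomial confidence interval is wide, as this sketch shows:

```python
# Exact (Clopper-Pearson) confidence interval for a small count.
# Assumes 2 positives out of 1,368 - a back-calculation from the
# reported 0.15%, not a figure taken from the paper.
from scipy.stats import beta

k, n, alpha = 2, 1368, 0.05
point = k / n
lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
hi = beta.ppf(1 - alpha / 2, k + 1, n - k)
print(f"prevalence = {point:.2%}, 95% CI = ({lo:.2%}, {hi:.2%})")
# roughly: prevalence = 0.15%, 95% CI = (0.02%, 0.53%)
```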
It's important to understand prevalence as a measure of infections at a point in time (late April-early May 2020 in this case). The PCR test reveals whether someone is currently or has recently been infected with the novel coronavirus. This metric does not answer the important epidemiological question: what proportion of the population has been infected since the start of the pandemic?
***
The research team's second paper deals with this question - the canonical approach is serological testing, i.e. testing for antibodies. Most people who have been sick with Covid-19 have developed antibodies, so we expect the proportion of the population that has had Covid-19 to be larger than the proportion who are currently or recently infected. Technically, we say the seroprevalence is greater than the prevalence.
I have discussed serological testing before - at the start of the pandemic. Two research teams - one at Oxford and one at Stanford - jumped the gun when they published serological studies during the first wave of Covid-19. The Oxford team (in)famously declared that up to 40% of the U.K. population had already been infected with SARS-CoV-2 by April 2020 - a result that has proven extremely embarrassing. To be precise, they used a mathematical model to come up with the alarming headline, and called for serological studies. The Stanford team argued that total infections in California were 50 to 85 times the reported number of cases - back in April 2020. This group actually performed tests to derive their estimate. Their analysis was roundly criticized. Recent handwaving analyses tend to multiply cases by 4 times, which gives you an idea of the magnitude of the error in the Stanford study.
Meanwhile, I'm still expecting that one day, the New York State Department of Health will release the data behind the almost daily pronouncements by Governor Cuomo - again in spring and summer of 2020 - that at least 15% of New Yorkers had been infected. Those numbers were supposed to have come from antibody testing throughout the state of New York, but no data or science has seen the light of day. While not as far-fetched as the two other studies, that estimate of seroprevalence is also clearly repudiated by subsequent waves of Covid-19.
In the Slovenian study, the unadjusted estimate of seroprevalence around mid-November 2020 (in the middle of a devastating second wave) was only 4 percent. At the time, cumulative reported cases stood at around 27,000, or 1.3% of the population. This estimate is much more solid than the previous studies because these researchers used a random sample. The design of research studies matters a lot - when devices like randomization are utilized, fewer adjustments and assumptions are necessary to interpret the data. (See my three-part talk to learn more about research methods.)
***
Much of the second paper deals with the complications of the analysis, even though the team utilized a random sample. The study interviewed participants at 3-week intervals. There were two rounds of blood samples (April 2020 and October 2020).
In April 2020, reported cases (about 700) amounted to 0.03% of the Slovenian population. With serological testing, they estimated seroprevalence to be roughly 0.8-0.9%, meaning roughly 18,000 people had been infected. More precisely, they estimated that 0.8-0.9% of their random sample had evidence of SARS-CoV-2 antibodies (that are true positives), and generalized that result to the entire population.
The raw data actually showed 3 percent, which was revised down to 0.8-0.9%. This is because of two factors that artificially inflated the unadjusted number.
The first is test inaccuracy. All tests are inaccurate to some degree - which is a key message of Chapter 4 of Numbers Rule Your World (link). In this situation, because the vast majority of Slovenians have not been infected, false-positive errors can be devastating to any analysis. Even a small false-positive error rate (e.g. 2 or 3 percent) gets amplified by the large base population into an abundance of incorrect positive test results. Correcting for test inaccuracy therefore pulls down the number of positive cases, and thus seroprevalence.
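Here is a toy calculation of that amplification, with illustrative numbers rather than the study's actual parameters. The final step applies the textbook Rogan-Gladen correction; the Slovenian team used a Bayesian variant (see the P.S. below):

```python
# Toy numbers: how a small false-positive rate swamps a rare signal.
n = 100_000          # hypothetical blood samples tested
true_prev = 0.009    # assume 0.9% truly have antibodies
sensitivity = 0.90   # P(test positive | has antibodies)
specificity = 0.97   # P(test negative | no antibodies): 3% false positives

true_pos = true_prev * n * sensitivity               # ~810
false_pos = (1 - true_prev) * n * (1 - specificity)  # ~2,973

apparent = (true_pos + false_pos) / n
print(f"apparent seroprevalence: {apparent:.1%}")    # ~3.8%
# False positives outnumber true positives almost 4 to 1: the raw rate
# lands near 4% even though only 0.9% actually have antibodies.

# Inverting the relationship (the Rogan-Gladen correction) recovers
# the true rate from the apparent one:
corrected = (apparent + specificity - 1) / (sensitivity + specificity - 1)
print(f"corrected seroprevalence: {corrected:.2%}")  # 0.90%
```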
We have encountered this issue before - when pointing out the methodological weaknesses of the 2020 Stanford study. Their initial analysis did not correct for false-positive results. But it's actually a really hard problem. To know that a positive result is definitely a false positive implies that we know the true result - if we did, we would not need to run a test. In practice, the researchers must obtain an estimate of the probability of false positives from other studies.
For example, they gather blood samples collected for other reasons - preferably from before the first half of 2019, when they can be sure the blood samples did not contain any SARS-CoV-2 antibodies (or RNA, depending on the test). Then they apply the test to these blood samples, and check what proportion of them give positive results despite clearly being negative.
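In code, such a specificity study is just a proportion; the counts below are made up for illustration:

```python
# Hypothetical specificity study: run the assay on pre-pandemic blood,
# where every sample is known to be negative.
n_negatives = 400     # made-up number of pre-pandemic samples
flagged = 6           # made-up count the assay flags positive anyway

specificity = 1 - flagged / n_negatives
print(f"estimated specificity: {specificity:.1%}")   # 98.5%
# With only a few hundred samples, this estimate is itself uncertain -
# precisely the variability the study's Bayesian model accounts for.
```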
Such definitively negative blood samples are hard to procure, and so our knowledge of test accuracy is limited. Many researchers just trust the test manufacturers' claims, which is a bit shocking since manufacturers market their tests, in essence, as 100% accurate. The Slovenian team did a literature search, and discovered that the accuracy of the antibody test they were using was not 100%, and that the accuracy measures varied widely from study to study. Thus, the second factor the researchers adjusted for was the variability of test accuracy. (*)
***
Because of concerns about the first testing platform, when they analyzed the second set of blood samples in October 2020, they switched to a different testing platform, which appeared to have better and less variable accuracy statistics.
Despite random sampling, there may be non-response bias - the people who chose to participate in this study are not necessarily the same as those who opted out of the study. If the non-responders are different from responders, the result from this study cannot be generalized to the entire population without more corrections.
If you notice, this study is much more rigorous than the real-world vaccine effectiveness studies that I have recently featured - those studies sweep most of the biases under the rug.
Nevertheless, the responders and non-responders looked similar enough - on the variables they analyzed. It is therefore not surprising that the authors reported that a further correction for non-response bias did not materially change their seroprevalence estimates.
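For readers curious what such a correction looks like, one common device is post-stratification weighting. A minimal sketch with made-up margins - I am not claiming this is the authors' exact adjustment:

```python
# Post-stratification: reweight responders so their demographic mix
# matches the population's. All shares below are hypothetical.
population_share = {"under_50": 0.55, "50_plus": 0.45}
responder_share = {"under_50": 0.48, "50_plus": 0.52}

weights = {g: population_share[g] / responder_share[g]
           for g in population_share}
print(weights)   # {'under_50': ~1.15, '50_plus': ~0.87}
# Each responder's result is counted with his or her group's weight.
# When responders already resemble the population - as reported here -
# all weights sit near 1 and the estimate barely moves.
```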
***
This Slovenian study is a good example of proper science that doesn't cut corners. What's notable is that seroprevalence grew at a much lower rate than reported cases: while cases jumped almost 40 times during the six months of the study, seroprevalence rose about 4 times (call it 10 times if we allow for the error bar).
P.S. (*) The study employed Bayesian methods throughout. Here is how they adjusted for variability of test accuracy. Different prior studies have come up with a range of estimates for test accuracy. If one doesn't adjust for this, one would take the average value from these studies, and apply this to the analysis. For example, if the test specificity (true negative rate) is 98%, then you assume exactly 2% of the negative blood samples would show up as false positives.
In a Bayesian model, the test specificity is treated as a variable: its average value is 98% but it can vary within some reasonable range (as informed by the variability observed across those prior studies).
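Here is a minimal Monte Carlo sketch of that idea. The Beta priors are my own stand-ins, chosen only to mimic "average around 98%, with study-to-study spread"; this illustrates the approach, not the paper's actual model:

```python
# Propagate uncertainty in test accuracy into the seroprevalence estimate.
import numpy as np

rng = np.random.default_rng(1)
draws = 100_000

# Treat accuracy as random variables rather than fixed constants.
# Beta(196, 4) has mean 0.98 but real spread (roughly 0.95 to 0.995);
# the sensitivity prior is likewise a hypothetical stand-in.
specificity = rng.beta(196, 4, size=draws)
sensitivity = rng.beta(90, 10, size=draws)   # mean 0.90

apparent = 0.03   # the unadjusted 3% from the raw data

# Rogan-Gladen correction applied draw by draw, so the variability in
# test accuracy flows through to the corrected estimate.
corrected = (apparent + specificity - 1) / (sensitivity + specificity - 1)
corrected = np.clip(corrected, 0, 1)

print(f"median corrected seroprevalence: {np.median(corrected):.2%}")
print(f"90% interval: ({np.percentile(corrected, 5):.2%}, "
      f"{np.percentile(corrected, 95):.2%})")
# The interval is wide: a one-point swing in specificity moves the
# corrected estimate by roughly as much as the estimate itself.
```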