The tracking app studies are on rush order, just like all sorts of Covid-related preprints that have come under scrutiny. I learned about this collaboration between King’s College (UK) and an app developer, Zoe Global, through a news article proclaiming that 13 percent of the UK have already been infected with the novel coronavirus. This is the catch of the moment: being able to declare that you know what proportion of the population has already been infected.
The other day, I discussed the Stanford study, which has received robust criticism from the statistical community (e.g. Gelman). This Covid Symptom Tracker study is much less convincing as a way to measure population prevalence. Its potential value is for rationing test kits to those most likely to test positive, but notice that the objective of targeting the most afflicted conflicts with the objective of measuring prevalence, which is a property of the general population.
***
Methodology
The UK team took an indirect path to estimating the prevalence of the SARS-CoV-2 virus. They launched a mobile app, which users downloaded and through which they submitted symptoms on a daily basis. A tiny subset of these users reported having taken a test and submitted their test results. Using the users with known test results, the team then built predictive models relating the symptoms to the test results. They claimed that loss of smell and taste (anosmia) was the top predictor, ranking even higher than fever or cough. They then used the model to score the large chunk of app users who submitted symptoms but not test results. The model predicted that 13 percent of this group would test positive (should they get tested). The team then generalized this result to the general UK population. They (or the press) made a host of breathtaking claims: a) their predictive model can replace blood tests (“detect infections….without the need of extensive biospecimen testing”); b) the single predictor (anosmia) should be “used as a screening tool to help identify potential mild cases who could be instructed to self-isolate”; c) the 54,000 users predicted by the model to test positive “are likely to be infected by the virus”, which led to suggestions that official case counts drastically undercount infections.
Comparison to Stanford Study
This method of estimating population prevalence of the virus differs from the Stanford study in major ways. The Stanford study involved actual blood tests combined with survey data collected on the day of testing. The Symptom Tracker study collected survey data over time via an app interface, while the published analysis focused only on the first five days after the app launched. The Stanford study used actual results from antibody tests while the Symptom Tracker study asked users to report their test results from previously taken PCR (diagnostic) tests. The former looks for people who have recovered from a previous infection while the latter confirms people who were infected at the time they were tested.
***
I took plenty of notes while reading the preprint submitted to Nature (link). I grouped my comments into three buckets, presented in reverse order of the analytical workflow. First, I discuss their conclusions, taking the analysis and data collection as given. Then, I point out major issues with the analytical framework. Last but not least, I raise concerns about how the data are collected and processed.
[P.S. This post became so long that I have to break it up into multiple parts.]
***
The fruit is dry, and the researchers kept squeezing
#1
Despite the headline number of 1.6 million downloads, 75% of those users never made it past the starting block. The data driving the predictive model contain about 1,700 users, those who reported both their symptoms and results from prior testing to the app. The model is then applied to another 412,000 users who reported symptoms but not test results. The authors referred to these 412,000 people as a “test set”. This is an abuse of terminology. A test set must come with the answers. In this case, a proper test set should include the test results for every individual scored by the predictive model. The analyst can then compare the predicted result to the actual result in order to measure the model’s accuracy. No one knows a single test result for the 412,000 individuals in the “test set”.
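To make the distinction concrete, here is a minimal sketch of the difference between a genuine test set and what the study did. The data are random stand-ins and the scikit-learn calls are my own illustration, not the authors’ actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-ins: binary symptom indicators and known test results
X_labeled = rng.integers(0, 2, size=(1700, 5))       # ~1,700 users who reported test results
y_labeled = rng.integers(0, 2, size=1700)            # their reported results (0/1)
X_unlabeled = rng.integers(0, 2, size=(412_000, 5))  # ~412,000 users with symptoms only

# A genuine test set is a labeled holdout: predictions can be checked against answers
X_train, X_test, y_train, y_test = train_test_split(
    X_labeled, y_labeled, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("accuracy on a real test set:", accuracy_score(y_test, model.predict(X_test)))

# What the study called a "test set": scores only, nothing to compare against
scores = model.predict_proba(X_unlabeled)[:, 1]
print("share predicted positive:", (scores > 0.5).mean())  # no accuracy computable here
```

The punchline: scoring the 412,000 yields a share of predicted positives and nothing more; there is no answer key against which to measure accuracy.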
#2
The oversight of #1 led the researchers to make the gravest error. They believed their own model. In stating their conclusion, they strayed from 13 percent predicted to test positive to 13 percent “likely to be infected by the virus”. This generalization presumes 100% accuracy for model prediction as well as 100% accuracy for testing.
#3
The press took the liberty of generalizing further. Instead of 13 percent of users of the Covid Symptom Tracker app, they speak of 13 percent of the UK population. That step of logic requires that the app users be a random sample of the general public, which is obviously false. [I will address data collection and processing in a separate post.]
#4
We next look at accuracy metrics for the 1,700 users used to develop the predictive model. The reported sensitivity is 54% and the specificity is 86%. I’ll round these to 55% and 85% for our discussion. We are familiar with the concepts of false positives and false negatives because of Covid testing (if not, see my previous post). In the set of 1,700 users, the researchers said, one in three reported positive test results. So, the model’s job is to fish out the 567 positives from the pool of 1,700. Sensitivity of 55 percent means the model correctly predicted 312 of those 567 users. Specificity of 85 percent means the false positive rate is 15 percent, so out of the 1,133 users with negative test results, the model incorrectly predicted 170 to be positive. Taken together, the model issued 312+170 = 482 positive predictions out of 1,700 users, roughly 28 percent.
The so-called positive predictive value for the predictive model is 312/482 = 65 percent. Out of every 100 people predicted positive by the model, we expect 65 to be correct while 35 are false positives.
From a predictive accuracy perspective, this is pretty good. If we grab a random sample of those 1,700 users, we will find a third of them to be positive (by self-report), but if we select from the subset predicted positive, we will find almost two-thirds to be positive.
But the model (like any other) is nowhere close to perfect. Thirty-five percent of those predicted to test positive will test negative, if they take the test. The model also makes false negative errors. It flags only 28 percent of the 1,700 as positive, so even if every predicted positive were a true positive, the model could not capture all the self-reported positives, who make up one-third of the group.
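For readers who want to retrace the arithmetic, here is the back-of-the-envelope calculation in code, using the rounded figures above (one-third positive, 55% sensitivity, 85% specificity); the exact numbers in the preprint will differ slightly.

```python
# Back-of-the-envelope check of the confusion-matrix arithmetic in #4
total = 1700
positives = round(total / 3)           # ~567 users with self-reported positive tests
negatives = total - positives          # ~1,133 users with negative tests

sensitivity = 0.55                     # rounded from the reported 54%
specificity = 0.85                     # rounded from the reported 86%

true_pos  = round(sensitivity * positives)         # ~312 positives correctly flagged
false_pos = round((1 - specificity) * negatives)   # ~170 negatives incorrectly flagged
flagged   = true_pos + false_pos                   # ~482 predicted positive

print(f"share predicted positive:  {flagged / total:.0%}")     # ~28%
print(f"positive predictive value: {true_pos / flagged:.0%}")  # ~65%
```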
If you have the Numbersense book (link), you can review all of this material in Chapter 5, where I applied the same analysis to Target’s model for predicting which shopper might be pregnant.
#5
The percent truly positive is not the same as the percent predicted positive because of false positives and false negatives. A prediction cannot be equated to being “infected with the virus”.
The authors of the preprint did qualify their statement with “likely to be infected”. How likely is likely? Don’t hold your breath: the chance of testing positive can be just above 50/50.
To see this, you have to understand model scoring. The main model is a logistic regression, which means that each of the 412,000 individuals is given a score, to be interpreted as the predicted probability of testing positive for the coronavirus should they take the PCR test. The more relevant symptoms someone has, the higher the score. Since the score is a probability, its value ranges from 0% to 100%.
The analyst must translate these probabilities into a binary outcome (will the individual test positive or not?). This study did the most standard thing one can do: pick 50% as the cutoff score. All individuals scoring above 50 percent are predicted to test positive, and those scoring below 50 percent are predicted to test negative.
With a 50% cutoff probability, the marginal cases within the 13 percent predicted to test positive will have a chance of testing positive of around 50 percent. It’s quite, hmm, bold to say these people are “likely to test positive”.
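Here is a minimal sketch of that scoring step, with made-up probabilities standing in for the model’s output:

```python
import numpy as np

# Hypothetical predicted probabilities from a fitted logistic regression
scores = np.array([0.08, 0.31, 0.52, 0.55, 0.74, 0.91])

cutoff = 0.50
predicted_positive = scores > cutoff
print(predicted_positive)                 # [False False  True  True  True  True]

# The marginal cases barely clear the bar: a score of 0.52 means roughly
# a 52% chance of testing positive, i.e. just better than a coin flip.
print(scores[predicted_positive].min())   # 0.52
```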
#6
Now, let me explain why the percent predicted positive isn’t just somewhat useless: it has nothing to say about population prevalence.
It turns out that for a given predictive model, the analyst decides what percent to predict positive. It’s not a passive outcome of the model. The analyst chooses its value by trading off false positives and false negatives.
If the cutoff is raised from 50% to 75%, the proportion predicted positive drops below 13 percent, but we gain confidence that those predictions are correct: the people at the margins now have at least a 75% chance of testing positive. Conversely, we can expand the proportion of predicted positives above 13 percent, at the cost of confidence: some of the predicted positives will have less than a 50 percent chance of testing positive. The point is that the modeler decides what proportion of predictions are positive, and thus the methodology of this Symptom Tracker study can justify any level of “prevalence”. It’s really this bad.
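A quick simulation makes the point. The score distribution below is entirely made up (it is not the study’s data); the only thing that matters is that moving the cutoff moves the “prevalence”.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical model scores for a large scored population, skewed toward low probabilities
scores = rng.beta(2, 6, size=412_000)

for cutoff in (0.25, 0.50, 0.75):
    share = (scores > cutoff).mean()
    print(f"cutoff {cutoff:.2f}: {share:.1%} predicted positive")

# Raising the cutoff shrinks the flagged group (more confidence per flag);
# lowering it inflates the flagged group (less confidence per flag).
# The "prevalence" one reports is therefore a modeling choice, not a measurement.
```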
#7
I previously told Wired that one of the evils of “triage”, or targeted testing, is that it biases the samples available to build the very predictive models needed to conduct the triage. In other words, we’ve set up a vicious cycle. This predictive model is being proposed as a way to implement triage testing, that is, to let the model decide who should get tested. (It’s hard to lay bad ideas to rest! Another such idea is “herd immunity”.)
We have a real-life demonstration of the problem in this preprint. As shown in #4, this predictive model flags 28 percent of the 1,700 users with prior test results as positive. Then, the model is used to score 412,000 users who have not taken the test. It flags only 13 percent of this group as positive. Why not 28 percent?
If both groups of users have the same characteristics (say, both are random samples of the general population), then we should expect both proportions to be 28 percent.
However, in the UK, they severely ration testing, and only test people with severe symptoms. In Table 1 of the preprint, you can see that people in the 412,000 group reported lower incidence of almost all the symptoms listed, compared to the 1,700 who already were tested (even lower than the ones who tested negative).
Did the model realize that the untested people are less likely to test positive, and therefore predicted a lower proportion to test positive? I wish our models could be so smart. But no way! So why did the model flag only 13 percent of the 412,000?
I'm going to let you stew on this one a bit, and reveal the reason in my next post. It might help to use the Target pregnancy prediction model that I featured in the Numbersense book (link).
That model predicts the chance that a female shopper is pregnant, because pregnant women spend a lot of money. Assume that I built the model using a list of women shoppers. The model makes predictions based on past purchases of certain products.
Now, I apply this model to score a list of male shoppers. This means I feed in data on whether these shoppers purchased the products included in the model. What scores (probabilities) will the men get? Does the model know that the training data contain all women, and that the "test set" contains all men? Please comment below if you need help thinking through this.
#8 (a minor point)
In the preprint, the 13 percent comes with a ludicrously tiny margin of error: [12.97%, 13.15%]. Typically, a small margin of error instills confidence that what was observed was unlikely to have occurred by random chance. In this instance, the tiny margin of error is purely a result of throwing 412,000 names into the “test set”. When the sample size is that huge, the proportion flagged positive barely budges from 13 percent.
The uncertainty of the study’s conclusion arises from the modeling, not the scoring of the model. So, if you want to understand the model’s accuracy, you have to go back to the discussion in #4. Those accuracy metrics have their own margins of error, which are in the plus or minus 5 to 10 percent range.
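To see how an interval this narrow falls out of the sample size alone, here is a rough binomial calculation. It is a simplification (the authors’ exact interval will have been computed differently), but the 1/sqrt(n) scaling is the whole story.

```python
import math

n = 412_000   # number of scored app users
p = 0.13      # share flagged positive by the model

# Standard error of a proportion and a rough 95% interval
se = math.sqrt(p * (1 - p) / n)
lo, hi = p - 1.96 * se, p + 1.96 * se
print(f"95% interval: [{lo:.2%}, {hi:.2%}]")   # roughly [12.90%, 13.10%]

# The interval width shrinks with 1/sqrt(n); with n this large, the narrowness
# says nothing about whether the model's predictions are any good.
```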
***
What about the other big talking point coming out of this study - that loss of smell and taste is more important than even cough as a predictor of having SARS-CoV-2?
#9
Triage testing also dooms this finding (I know, it's sad to see). Because of pre-selection, the entire set of 1,700 people with test results has a high incidence of all the symptoms. Thirty-three percent of them experienced loss of smell and taste. For persistent cough, it’s almost 50 percent.
Here’s the problem: if half the people have cough, cough isn't a good predictor of anything. Remember that only a third of this group self-reported positive test results. Any model that uses cough as the single indicator splits the group roughly 50/50, which immediately creates a couple hundred false positives at a minimum: about 850 people get flagged, and at most around 580 of them could have tested positive.
It’s no coincidence that 33 percent experienced loss of smell and taste, compared to 34 percent (to be exact) who self-reported positive test results. The reason anosmia emerged as the “best predictor” is that its incidence in the training data approximates the frequency of the positive outcome.
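Here is a toy calculation of the ceiling on each single-symptom rule, using the rounded incidences above (not the actual microdata). It shows why a symptom twice as common as the outcome cannot help but flood the predictions with false positives.

```python
total = 1700
positives = round(0.34 * total)            # ~578 self-reported positive tests

# Rounded symptom incidences cited above
flagged_by_cough   = round(0.50 * total)   # ~850 flagged if cough alone predicts positive
flagged_by_anosmia = round(0.33 * total)   # ~561 flagged if anosmia alone predicts positive

# Best case for each rule: every true positive happens to have the symptom
min_false_pos_cough   = max(flagged_by_cough - positives, 0)    # ~272
min_false_pos_anosmia = max(flagged_by_anosmia - positives, 0)  # 0

print("cough rule, false positives at best:  ", min_false_pos_cough)
print("anosmia rule, false positives at best:", min_false_pos_anosmia)

# A symptom whose incidence matches the outcome rate can, at best, split the
# group cleanly; a symptom twice as common cannot avoid hundreds of errors.
```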
***
For data scientists and analysts, the big lesson is this: don’t feed biased data into model building.
[P.S. I have further comments on the data pre-processing and the data collection of the Covid Symptom Tracker study.]