Tracking apps are being touted as essential to life during the pandemic. We are starting to see data analytics coming out of app developers (through their research partners). One of the earliest is the Covid Symptom Tracker app, developed by a team from King’s College London and Zoe Global. I previously explained how they took such “big data” and drew conclusions about the general prevalence of SARS-CoV-2, and about loss of smell and taste being the strongest predictor of an eventual positive test. (Link to the preprint)
(Here’s how I use the term “big data”. The OCCAM criteria are different from the usual interpretation of large volume, etc.)
In this post, I turn my attention to how the data was processed before entering the analyses. This is a particularly thorny problem for big data because the analysts exercise far less control over the data collection process than in more traditional studies.
In big data studies, one of the greatest challenges is sample bias; sample size is usually of lesser concern. If the analyst wants to draw conclusions about the general population, the challenge comes from low or no representation of certain types of people in the sample. If the analyst intends to define new subgroups, there are no prior population studies against which to measure the bias.
In this post, I have divided the material into two sections. The first section discusses biases resulting from the pre-processing steps we've been told about. The second section speculates about potential biases, and asks for more disclosure on pre-processing.
Discussion of Disclosed Processing Steps
#1 Impossible values
Self-reported data are often filled with errors. In addition, bot traffic generates fake data. The Covid Symptom Tracker app is not immune to these issues. The researchers countered by filtering out impossible values. For example, body temperature had to fall between 35 and 42 °C, and weight had to fall under 200 kg (440 lb). The maximum age allowed was 90, which is their most questionable decision: Covid-19 is most deadly among those over 80. (That said, even if they had allowed the oldest cohort, that group is less likely to use mobile apps, which is a bias in itself.) What I’d like to see reported is the percentage of users dropped because of these exclusions. Such accounting is key to judging the quality of data collected from apps. The preprint disclosed just one tidbit: over 80 percent of users were under 60.
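To make that accounting concrete, here is a minimal sketch of the exclusion report I have in mind. The thresholds come from the ranges quoted above; the table and its column names are my assumptions, not the app’s actual schema.

```python
import pandas as pd

# Hypothetical user table; the column names are assumptions, not the app's actual schema.
users = pd.DataFrame({
    "temperature_c": [36.6, 44.0, 37.2, 35.5],
    "weight_kg":     [70.0, 250.0, 82.0, None],
    "age":           [34, 91, 58, 45],
})

# Plausibility filters matching the ranges quoted in the preprint.
valid = (
    users["temperature_c"].between(35, 42)
    & (users["weight_kg"] < 200)
    & (users["age"] <= 90)
)

# The accounting I'd like to see: how many users did these exclusions remove?
print(f"Dropped {(~valid).mean():.0%} of users "
      f"({(~valid).sum()} of {len(users)}) as implausible or missing.")
```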
#2 Missing values
In a second filtering, the Symptom Tracker team “selected 1,702 participants that reported (i) having had an RT-PCR COVID-19 test, (ii) having received the outcome of the test and (iii) symptoms including loss of smell and taste.” Item (iii) is difficult to fathom. It sounded like they retained only those users who answered the question about loss of smell and taste.
If this interpretation is correct, then they have flashed their hand – because they had a particular interest in loss of smell and taste (anosmia), they treated that symptom differently from other symptoms, thus introducing a confirmation bias.
Users who answered the anosmia question are more likely to have that symptom. Depending on the co-existence of anosmia and other symptoms, such filtering may have unintended consequences on the prevalence of the other symptoms in the analytical sample. There is no reason why anosmia should be treated differently from other symptoms.
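To make the asymmetry concrete, here is a minimal sketch of filters (i) through (iii) applied to a made-up reports table; the column names and values are my assumptions, not the app’s schema. Notice that the prevalence of every other symptom ends up being computed only among users who chose to answer the anosmia question.

```python
import numpy as np
import pandas as pd

# Hypothetical per-user table; column names are assumptions.
# Symptoms coded 1 = yes, 0 = no, NaN = question skipped.
reports = pd.DataFrame({
    "had_pcr_test":     [1, 1, 1, 1],
    "test_result":      ["neg", "pos", None, "pos"],
    "loss_of_smell":    [1.0, np.nan, 0.0, 1.0],
    "persistent_cough": [0.0, 1.0, 1.0, np.nan],
})

# Filter (i)-(iii): tested, result known, anosmia question answered.
sample = reports[
    (reports["had_pcr_test"] == 1)
    & reports["test_result"].notna()
    & reports["loss_of_smell"].notna()
]

# Prevalence of another symptom before vs. after the filter.
print("cough prevalence, all users:   ", reports["persistent_cough"].mean())
print("cough prevalence, after filter:", sample["persistent_cough"].mean())
```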
A clear effect is that all symptoms have missing values except for loss of smell and taste. This creates problem #3.
#3 Missing value imputation
In a subsequent step, the research team filled in the blanks. According to the preprint, “Prior to the modeling, to preserve the sample size, we imputed missing values for the symptoms of interest using missForest package in R”. So for any skipped question about symptoms, they used a model to guess what that user’s answer would have been, based on the user’s other answers. Imputing missing values always introduces errors into the dataset; we do it because it’s the lesser evil. The problem is that, because of #2, these errors affect only symptoms other than loss of smell and taste.
I can’t say more about the impact of the imputation because the preprint did not show a comparison between the variables before and after imputation. What percent of the symptoms data were missing and subsequently filled in by this model, and how did the imputation change the statistics?
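I can’t reproduce their imputation, but here is a rough sketch of the comparison I’d like to see. The preprint used the missForest package in R; as a stand-in, this Python sketch runs iterative imputation with random forests on made-up symptom data (all column names and values are assumptions), then reports how much of each column was model-filled and how the prevalence shifts.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Hypothetical symptom matrix, 1 = yes, 0 = no, NaN = skipped (column names are assumptions).
symptoms = pd.DataFrame({
    "loss_of_smell":    [1, 0, 1, 0, 1, 0],          # complete, per filter #2
    "persistent_cough": [1, np.nan, 1, 0, np.nan, 0],
    "fever":            [np.nan, 0, 1, 0, 1, np.nan],
}, dtype=float)

# Rough stand-in for missForest: iterative imputation with random forests.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    random_state=0,
)
imputed = pd.DataFrame(imputer.fit_transform(symptoms), columns=symptoms.columns)

# The comparison the preprint does not show: share of each column filled in by
# the model, and prevalence before vs. after imputation.
print(pd.DataFrame({
    "pct_missing":       symptoms.isna().mean(),
    "prevalence_before": symptoms.mean(),          # ignores missing values
    "prevalence_after":  (imputed >= 0.5).mean(),  # imputed values thresholded at 0.5
}))
```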
#4 People with no symptoms
In forming the analytical sample, the researchers also dropped all users who skipped every symptom question. But not reporting symptoms might indicate no symptoms: some users are too lazy to scroll through screens where every answer is No. Dropping all such users means the analytical sample is biased toward people with more symptoms, which correlates with the chance of infection. So the percentage testing positive in this sample will be higher than in the general population. (This is exacerbated by issue #5.)
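A tiny sketch, with made-up numbers, of the direction of this bias: if the users who skipped every symptom question are disproportionately symptom-free, dropping them inflates the positivity rate in the analytical sample.

```python
import numpy as np
import pandas as pd

# Hypothetical data: NaN = question skipped, 1/0 = answered (column names are assumptions).
df = pd.DataFrame({
    "fever":         [np.nan, 1, np.nan, 0],
    "cough":         [np.nan, 1, np.nan, 1],
    "test_positive": [0,      1, 0,      1],
})

symptom_cols = ["fever", "cough"]
answered_any = df[symptom_cols].notna().any(axis=1)

print("Positivity rate, everyone:                ", df["test_positive"].mean())
print("Positivity rate, after dropping all-blank:", df.loc[answered_any, "test_positive"].mean())
```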
#5 Triage testing
The other filter, keeping only those who have results from the PCR test, is another source of selection bias. That’s because the UK has been running “triage testing”: only those with severe symptoms, those who have been in contact with infected people, or specific high-risk subgroups are allowed to take tests. This too causes the set of app users not to mirror the general population. (The US, the only other country in which this app is currently available, also does triage testing.)
Since the testing regime missed asymptomatic carriers, any analysis of test results will also miss asymptomatic carriers.
Here is my Wired article about why triage testing creates problems for data analytics, and this Symptom Tracker App study is living proof.
***
The above discussion concerns pre-processing steps that were clearly outlined in the preprint. The next set of issues covers pre-processing for which no details were provided. I explain how these steps may also introduce bias. It would be nice to know how the researchers handled these risks.
Further Questions about Pre-processing
#6 Levels of severity of each symptom
From one of the tables, I learned that the symptom of short breath has four responses: “short breath”, “short breath mild”, “short breath sign(sic)” and “short breath severe”. A fifth level of “no short breath” probably also exists. In the modeling, this information is magically collapsed into a Yes/No variable called shortness of breath. The same pre-processing affects most symptoms.
With five levels of short breath, the analyst has four places to plant the cutoff between Yes and No. What analysis was done to determine the cutoff?
The danger here is (unintentional) p-hacking. This is a technical issue which I won’t explain here. Just realize that the conclusion of the study may change depending on where those cutoffs are positioned. If you have 10 variables, each with 4 possible cutoff positions, there are 4^10 = over a million (!!!) possible settings of these cutoffs. The analyst selected one of these settings.
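To make the degrees of freedom concrete, here is a sketch. The five level names and their ordering are my guesses based on the responses quoted above; the point is only that the analyst must pick one of four cutoffs per symptom, and different picks give different Yes/No variables.

```python
import pandas as pd

# Assumed ordering of the five severity levels (a guess, not the app's actual coding).
severity_order = ["none", "unspecified", "mild", "significant", "severe"]

def collapse(series: pd.Series, cutoff: str) -> pd.Series:
    """Map an ordered severity level to Yes (>= cutoff) / No (< cutoff)."""
    rank = {level: i for i, level in enumerate(severity_order)}
    return series.map(rank) >= rank[cutoff]

reports = pd.Series(["none", "mild", "severe", "significant", "unspecified"])
print(collapse(reports, cutoff="mild").tolist())         # one possible setting
print(collapse(reports, cutoff="significant").tolist())  # another; conclusions may differ

# With 10 symptoms and 4 possible cutoffs each:
print(4 ** 10, "possible cutoff settings")  # 1,048,576
```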
#7 The time dimension
To complicate things even more, the Symptom Tracker app asks for daily reports. At one extreme, a user may have reported on shortness of breath for five consecutive days. Given 5 levels and 5 days, there are 5^5 = 3,125 unique sequences, such as {No, Severe, Severe, Mild, Mild}. Which sequences are mapped to Yes and which to No? And why? If they went to the trouble of collecting the data in five levels, why not use them all?
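A sketch of the choice being made, using the same assumed severity ordering as in #6: every one of the 3,125 five-day trajectories has to be mapped to a single Yes or No, and the mapping rule is the analyst’s to pick.

```python
from itertools import product

# Assumed ordering of the five severity levels (a guess, not the app's actual coding).
levels = ["none", "unspecified", "mild", "significant", "severe"]

# Every distinct 5-day trajectory of one symptom:
print(len(list(product(levels, repeat=5))), "possible 5-day sequences")  # 3125

# One of many defensible mappings to a single Yes/No: "Yes if any day was at least mild".
def any_day_at_least(sequence, threshold="mild"):
    rank = {lv: i for i, lv in enumerate(levels)}
    return max(rank[day] for day in sequence) >= rank[threshold]

print(any_day_at_least(["none", "severe", "severe", "mild", "mild"]))     # True
print(any_day_at_least(["none", "none", "unspecified", "none", "none"]))  # False under this rule
```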
The research team should provide data about the frequency of reports. Within the five-day analysis window, how many users submitted 0, 1, 2, 3, 4, 5, 6+ reports? When you ask this question, you might also ask about #8.
#8 Start dates
Compare two users who both submitted exactly one report during the five-day analytical window. One user downloaded the app on Day 1, submitted a report and was never seen again. The other user downloaded the app on Day 5, submitted a report, and has been reporting every day since. A proper study should establish a moving 5-day analytical window, adjusting for start date. Most big data studies I have seen ignore this bias.
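A sketch of the adjustment, with hypothetical dates and column names: anchor each user’s 5-day window on that user’s first report rather than on the calendar.

```python
import pandas as pd

# Hypothetical report log (user_id and report_date are assumed column names).
log = pd.DataFrame({
    "user_id":     ["a", "b", "b", "b"],
    "report_date": pd.to_datetime(["2020-04-01", "2020-04-05", "2020-04-06", "2020-04-07"]),
})

# A fixed calendar window (say 2020-04-01 to 2020-04-05) treats user a and user b the same:
# one report each. A per-user window anchored on each user's first report does not.
first = log.groupby("user_id")["report_date"].transform("min")
log["day_in_study"] = (log["report_date"] - first).dt.days

in_user_window = log[log["day_in_study"] < 5]    # each user's own first 5 days
print(in_user_window.groupby("user_id").size())  # a: 1 report, b: 3 reports
```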
#9 Users with intermittent missing values
Recall from #2 that the analytical sample contained only those users who reported “symptoms including loss of smell and taste”. Does this mean a selected user must have reported symptoms at least once? Or on every day s/he submitted a report? The problem for the analysts is that the act of not reporting symptoms may be highly correlated with not having the symptom at the moment.
#10 Dropouts, and inactivity
To add even more spice, the analyst must also deal with dropouts and inactive users. Some users may have downloaded the app and submitted a couple of days of data, then disappeared. If they didn’t delete the app, how did the analyst decide whether the users dropped out or were temporarily inactive? What were the reasons for suspending or stopping usage? Could it be related to recovering from Covid, or not experiencing symptoms? I'd like to see some statistics on dropouts and activity levels. Was the partial data retained, adjusted, or dropped?
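The activity accounting I have in mind could be as simple as this sketch; the 7-day inactivity threshold, the column names, and the data are all my own assumptions.

```python
import pandas as pd

# Hypothetical report log; column names and the 7-day threshold are assumptions.
log = pd.DataFrame({
    "user_id":     ["a", "a", "b", "b", "b"],
    "report_date": pd.to_datetime(["2020-03-25", "2020-03-27",
                                   "2020-04-10", "2020-04-11", "2020-04-14"]),
})
as_of = pd.Timestamp("2020-04-15")

activity = log.groupby("user_id")["report_date"].agg(
    n_reports="count",
    last_report="max",
)
activity["days_silent"] = (as_of - activity["last_report"]).dt.days
# Any cutoff here is arbitrary; the point is that the choice should be reported.
activity["status"] = activity["days_silent"].apply(
    lambda d: "possible dropout" if d > 7 else "recently active")
print(activity)
```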
#11 Mid-survey corrections
When the preprint said “the research team can add or modify questions in real-time to capture new data”, I was more troubled than reassured. How often was the survey changed? Assuming they did make at least one change, I wonder what adjustments were applied to the data to correct for it. These are tricky problems no analyst asked for. If an answer choice was removed mid-survey, what did the analysts do with the responses collected up to that point? If a new answer choice was added mid-stream, how did the analysts infer this variable for the period prior to the change? The analysts have two bad choices: deleting data, or filling in entire sections of missing values.
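One defensible way to handle a question added mid-stream is to record why a value is blank instead of treating all blanks alike, so that structural gaps don’t silently feed the imputation model of #3. A sketch, with made-up dates and names:

```python
import numpy as np
import pandas as pd

# Hypothetical: a "skipped meals" question added on 2020-04-02 (dates and names are assumptions).
question_added = pd.Timestamp("2020-04-02")

reports = pd.DataFrame({
    "report_date":   pd.to_datetime(["2020-04-01", "2020-04-03", "2020-04-05"]),
    "skipped_meals": [np.nan, 1.0, 0.0],
})

# Distinguish "question did not exist yet" from "user skipped the question".
reports["skipped_meals_status"] = np.where(
    reports["report_date"] < question_added, "not asked",
    np.where(reports["skipped_meals"].notna(), "answered", "skipped"))
print(reports)
```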
#12 Users who reported test results but not symptoms
Did some of the app users report test results but not any symptoms? Was this group retained in the analytical sample or removed? Based on how the pre-processing was worded in the preprint (Issue #2), it appears that they were dropped from the analysis. Is it possible that this group contains the elusive asymptomatic carriers? If the analysts kept this group, did they impute symptoms using the missing-value imputation method (Issue #3)? How accurate are these imputed values for this group?
***
Without having fuller details of the pre-processing, it’s hard to predict the direction or magnitude of biases introduced by specific steps. I hope you’ll appreciate how much massaging is required of “big data”.
In almost every case, leaving the raw data alone is like stepping into the big puddle of water right in front of you – not because you didn’t see it, but because you didn’t care about how deep the puddle is and what’s in the water. The wiser thing to do is to find a bypass. You may notice there are many other puddles all around you, but there might just be one less treacherous path.
***
I still have two loose ends to tie up in the near future. I’ll take another step back in the workflow, and examine the data collection through this Covid Symptom Tracker app. The other item is to return to the puzzle I included in the previous post on the methodology of the Symptom Tracker preprint.
[P.S. The blog on data collection is now posted.]