Data science is a field filled with axioms. This is a sign that the practice is running ahead of (and hopefully not away from) the theory. Many of the things students are told to do or not do are not fully explained. One such rule of thumb is the idea that a predictive model should be built (trained) using data that are representative of how the model will eventually be applied.
This post is inspired by an op-ed column I just published in Wired. You might want to read that first to get the context. This post then goes deeper into the discussion around building predictive models, and it ties up the loose end from my prior post, where I posed a question about Target's model for predicting pregnancy.
***
Any practitioner who has seen models implemented in real life, and has had to explain an unexpected lack of performance, will have compared the analytical and scoring samples. The analytical sample is the data used to train the model. The point of having a model is to issue predictions for the scoring sample - these are, say, users for whom we have inputs and would like predictions of an as-yet-unknown outcome.
To use a recent example of a predictive model, the research team behind the Covid Symptom Tracker app analyzed a sample of users of their mobile app and built a model to predict the chance of testing positive for the coronavirus given their self-reported symptoms. In the analytical sample, users reported both their symptoms and their test results, so the model could learn from these cases. The point of having a model, though, is to score other users who have symptoms (inputs) but not test results (outcome).
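To make the distinction concrete, here is a minimal sketch in Python - my own, with made-up column names and synthetic data, not the study's actual pipeline. The analytical sample is the subset of users whose outcome is observed; the scoring sample is everyone else.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic app users with self-reported symptoms (column names are hypothetical)
rng = np.random.default_rng(0)
n = 1000
users = pd.DataFrame({
    "fever": rng.integers(0, 2, n),
    "loss_of_smell": rng.integers(0, 2, n),
    "persistent_cough": rng.integers(0, 2, n),
})
# Pretend only some users have been tested, so only they have an observed outcome
tested = rng.random(n) < 0.2
users["test_result"] = np.where(
    tested,
    (rng.random(n) < 0.1 + 0.3 * users["loss_of_smell"]).astype(int),
    np.nan,
)

symptom_cols = ["fever", "loss_of_smell", "persistent_cough"]

# Analytical sample: inputs AND outcome are known - this is what the model learns from
analytical = users[users["test_result"].notna()]
model = LogisticRegression().fit(analytical[symptom_cols], analytical["test_result"])

# Scoring sample: inputs only - the whole point of the model is to predict these users
scoring = users[users["test_result"].isna()].copy()
scoring["predicted_prob"] = model.predict_proba(scoring[symptom_cols])[:, 1]
```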
A professor might argue that it is simply invalid to build a predictive model using training data that are not representative of the scoring sample. This argument has two flaws: first, in real life, most scoring samples will not faithfully replicate the training data (usually, historical records); second, what is the basis for the claim of invalidity?
***
The Covid Symptom Tracker study is a great example to examine this claim of invalidity. The researchers built a model using an analytical sample of app users who self-reported their symptoms and their test results. The model did a reasonably good job predicting outcomes for these users in the analytical sample. Then the model was applied to a much larger scoring sample comprising other app users who self-reported symptoms but said they hadn't been tested. The model flagged 13 percent of the scoring sample as likely to test positive.
In this case, the analytical sample and the scoring sample are very different - in fact, possibly disjoint. This is because in the UK and the US, where this app is more popular, coronavirus testing is rationed and triaged - only people who experienced severe qualifying symptoms are eligible for testing. So anyone who has been tested is likely to have those symptoms, while those who haven't are likely not to have them.
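A toy simulation - mine, with made-up rates, not the study's numbers - shows how triage produces this split: if testing is reserved for severe cases, the tested (analytical) group is all severe, while the untested (scoring) group is mostly mild.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 10_000
severe = rng.random(n) < 0.15               # assume 15% of app users have severe symptoms
tested = severe & (rng.random(n) < 0.8)     # testing rationed to the severely symptomatic

df = pd.DataFrame({"severe": severe, "tested": tested})
print(df.groupby("tested")["severe"].mean())
# tested=True  -> severity rate of 1.0 (everyone tested was severe)
# tested=False -> severity rate of only a few percent (mostly mild cases left untested)
```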
What's the harm in extrapolating from one sample to another that is not statistically comparable? Is it really unacceptable?
***
In a previous post, I raised the following scenario: consider Target's model for predicting pregnancy. The analytical sample consisted of past purchase records of female shoppers. The analysts discovered 20 or so items whose purchases are correlated with pregnancy. Once the model scans the purchase records of any shopper, she can be given a score indicating the likelihood of being pregnant. What would happen if this model were fed a list of men?
The model examines the purchase records of these men, and if they bought some of those specific items (including things like a blue rug), they will be assigned a positive chance of being pregnant. Assuming they are cisgender, any man with a positive score is a false positive.
The source of the false-positive problem is the mismatch between the analytical sample and the scoring sample. This example illustrates the extreme case of complete mismatch.
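A tiny numeric illustration of this extreme case (my own sketch, with a hypothetical flag rate): when the scoring group contains no true positives, every flagged case is a false positive, no matter how well the model did on the analytical sample.

```python
import numpy as np

rng = np.random.default_rng(2)
n_men = 5_000
# Suppose the purchase-based score happens to flag 2% of men as "likely pregnant"
flagged = rng.random(n_men) < 0.02
truly_pregnant = np.zeros(n_men, dtype=bool)   # the base rate in this group is zero

false_positives = flagged & ~truly_pregnant
print(f"flagged: {flagged.sum()}, false positives: {false_positives.sum()}")
# Every flagged man is a false positive; precision in this scoring sample is zero.
```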
For the Covid Symptom Tracker, triage testing restricted the analytical sample to those with severe symptoms, while the scoring sample - those who haven't yet been tested - is disproportionately made up of people without severe symptoms. It's clear that a good chunk of the 13 percent predicted to test positive will be false positives.
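A back-of-envelope calculation makes this concrete. The sensitivity, specificity, and prevalence figures below are hypothetical, not from the study; the point is only that as prevalence drops in the scoring sample, the share of false positives among those flagged climbs sharply.

```python
sensitivity = 0.75   # hypothetical: P(flagged | truly infected)
specificity = 0.90   # hypothetical: P(not flagged | not infected)

for prevalence in (0.30, 0.05):   # e.g. a severe-symptom group vs. a mild-symptom group
    flagged = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    false_share = (1 - specificity) * (1 - prevalence) / flagged
    print(f"prevalence {prevalence:.0%}: {false_share:.0%} of flagged cases are false positives")
# prevalence 30%: ~24% of flagged cases are false positives
# prevalence  5%: ~72% of flagged cases are false positives
```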
***
Note that the predictive model itself can still be useful. If Target's model is applied to female shoppers, the performance in the scoring sample will meet expectations. If the Symptom Tracker model were applied to people with severe symptoms, we would expect to see about 20 percent false positives - roughly what was found when the model was applied to the analytical sample.
Modelers tend to re-weight their analytical samples to look like the scoring samples. This practice helps ensure the predictive model is effective at both stages. Triage testing, however, pre-empts this approach. The purpose of the Symptom Tracker app is to collect data to score those who have mild symptoms - a necessary step in implementing triage. For the analytical sample, the analysts would need mildly ill people with prior test results. No such group exists because of triage testing.
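Here is a minimal sketch of that re-weighting idea (importance weights) under hypothetical data and variable names - not the Symptom Tracker team's actual method. The closing comment notes why triage defeats it: you cannot up-weight mild cases that the analytical sample never contains.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def importance_weights(analytical: pd.DataFrame, scoring: pd.DataFrame, col: str) -> pd.Series:
    """Weight each training case by (share of its profile in scoring) / (share in analytical)."""
    p_score = scoring[col].value_counts(normalize=True)
    p_train = analytical[col].value_counts(normalize=True)
    return analytical[col].map(p_score / p_train)

# Toy data: 'severe' = severe symptoms, 'positive' = test result
analytical = pd.DataFrame({"severe": [1, 1, 1, 1, 0], "positive": [1, 0, 1, 1, 0]})
scoring = pd.DataFrame({"severe": [0, 0, 0, 1, 0, 0]})

w = importance_weights(analytical, scoring, "severe")
model = LogisticRegression()
model.fit(analytical[["severe"]], analytical["positive"], sample_weight=w)
# If the analytical sample had NO mild cases at all (severe == 1 everywhere), there would
# be nothing to up-weight - re-weighting cannot conjure training cases that triage
# testing never produced.
```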
***
To learn about the perils of analyzing low-quality data coming from mobile apps, see my Wired column.
Biostatisticians have done some work on this. A screening test is often applied to a population and then those who score highly are given a gold standard test. The only solution is to include some subjects with low scores who also receive the gold standard, then analyse the data properly. The key to analysing it properly is treating those who don't receive the gold standard as missing data. Judging by one talk I've been to, data scientists don't understand that.
Posted by: Ken | 06/03/2020 at 03:32 AM