This is a follow-up post to Part 1. Read that post first to catch up. In Part 1, I set up a scenario in which 1,000 people are flagged by a predictive-policing model and sent to jail for their future crime of burglary.
At the end, I asked how many of the 1,000 people were accurately flagged by the predictive model.
***
At the crudest level (and this is the argument made by those who sell predictive-policing technologies), the model was 100 percent accurate. Look, it flagged 1,000 people, and subsequently, none of these people committed burglary! Amazing! Genius! Wow!
The absurdity of this calculation is that such perfection could have been obtained with any model, even a random one. If I randomly selected 1,000 people from the list, and convinced the courts to imprison them for their future crimes, those 1,000 would also not commit burglary. The fact that the predicted burglars didn't commit the crime therefore proves nothing about the model's effectiveness! It merely restates the obvious: if we lock someone up, that person cannot commit burglary.
For each person flagged by the predictive model, the relevant question is: if we didn't throw this person in jail (or take other actions that would alter the outcome), would he or she have committed burglary? If the answer is yes, then the model has made one accurate prediction. The trouble is that the question is counterfactual: by acting on the model's prediction, we remove our ability to measure its accuracy.
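To make this concrete, here is a minimal simulation sketch. Every number in it is made up (including the 5 percent base rate), but it shows that under the vendor's definition of accuracy, even a model that flags people at random scores a perfect 100 percent:

```python
import random

random.seed(0)

# Hypothetical population: each person either would or would not commit
# burglary if left alone. This counterfactual is unknowable in practice.
population = [{"would_burgle": random.random() < 0.05} for _ in range(100_000)]

# A "model" with zero predictive skill: flag 1,000 people at random.
flagged = random.sample(population, 1_000)

# Everyone flagged is jailed, so none of them can commit burglary.
observed_burglaries_among_flagged = 0

# The vendor's "accuracy": share of flagged people who did not burgle.
vendor_accuracy = (len(flagged) - observed_burglaries_among_flagged) / len(flagged)
print(f"Vendor-style accuracy of a random model: {vendor_accuracy:.0%}")  # 100%

# The question that matters, which jailing them made unobservable:
# how many of the flagged would have burgled had we done nothing?
counterfactual_hits = sum(p["would_burgle"] for p in flagged)
print(f"Counterfactual hits (unknowable in real life): {counterfactual_hits} of 1,000")
```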
***
In practice, a few strategies can be deployed to assess such a predictive model.
Strategy 1 is back-testing. Apply the model to historical data: train the model using data before a cut-off date, use it to make predictions after the cut-off date, run time forward, and compare the burglary rates between the flagged and unflagged populations.
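A minimal sketch of what such a back-test might look like, assuming a hypothetical person-level dataset (`burglary_history.csv`), illustrative feature and label columns, and an arbitrary cut-off date; none of these names come from any actual system:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one row per person per period, with prior-history
# features and a label for whether that person committed burglary.
df = pd.read_csv("burglary_history.csv", parse_dates=["period_start"])

cutoff = pd.Timestamp("2013-01-01")
train = df[df["period_start"] < cutoff]    # model sees only pre-cutoff data
test = df[df["period_start"] >= cutoff]    # held-out, post-cutoff data

features = ["prior_arrests", "prior_burglaries", "age"]  # illustrative only
model = LogisticRegression(max_iter=1000).fit(train[features], train["committed_burglary"])

# Flag the 1,000 highest-risk people in the post-cutoff period.
test = test.assign(risk=model.predict_proba(test[features])[:, 1])
flagged = test.nlargest(1_000, "risk")
unflagged = test.drop(flagged.index)

print("Burglary rate, flagged:  ", flagged["committed_burglary"].mean())
print("Burglary rate, unflagged:", unflagged["committed_burglary"].mean())
```

The discipline this requires is that the post-cutoff rows get used exactly once; re-fitting the model after peeking at those numbers is the cheating described next.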
This is easy to do but fraught with dangers. It only works if the modeler does not cheat by using data after the cut-off date. The most common way of cheating is to tinker with your model after learning that it isn't accurate enough.
Since every analyst building models tinkers with the model upon learning it isn't accurate enough, every analyst "cheats". (Textbook authors, though, have memberships to the private clubs where analysts use their test sets only once.) Thus, basing any decision purely on back-testing is foolish, and so is buying predictive technologies from a vendor who reports only back-tested results.
Strategy 2 is a dry run. Run the predictive model and flag the predicted offenders, but take no action. After sufficient time has passed, compare the burglary rates of the flagged and unflagged groups. If the model is any good, the burglary rate among those flagged should be substantially higher.
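As a sketch of how the end-of-dry-run comparison might be summarized (the counts below are invented purely for illustration), a simple two-proportion test can accompany the raw rates:

```python
from statsmodels.stats.proportion import proportions_ztest

# Invented counts after a 12-month observation window:
# burglaries committed by flagged vs. unflagged people, and group sizes.
burglaries = [60, 400]           # flagged group, unflagged group
group_sizes = [1_000, 99_000]

rate_flagged = burglaries[0] / group_sizes[0]
rate_unflagged = burglaries[1] / group_sizes[1]
print(f"Flagged burglary rate:   {rate_flagged:.1%}")
print(f"Unflagged burglary rate: {rate_unflagged:.1%}")

# One-sided test: is the flagged group's burglary rate higher than the rest's?
stat, pvalue = proportions_ztest(burglaries, group_sizes, alternative="larger")
print(f"z = {stat:.2f}, p = {pvalue:.4f}")
```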
People who don't want you to know the model's accuracy counter with these objections: (a) it takes too long (depending on the neighborhood's burglary rate, collecting enough cases could well take 6 to 12 months); and/or (b) you're letting crime happen while the experiment runs. The second objection is usually delivered with contempt for science: "you're not being practical, you're being a scientist!"
Reporting results from a dry run builds confidence in the accuracy of the predictive model, on average. It still doesn't change the fact that for any individual flagged and arrested on the strength of the model's prediction, we would never know whether the model was correct.
***
Further reading:
Chapter 4, Numbers Rule Your World: discussion in the context of predicting terrorists (link)
Chapter 5, Numbersense: discussion in the context of Target's pregnancy prediction model (link)