Those who have read my books or taken my courses would not be the least bit surprised by the news that just came out about the naked emperor of Chicago's much-ballyhooed predictive algorithm for child abuse. The Chicago Tribune has the story (link).
Key points:
- Data scientists (over-)sold an algorithm to predict children "at risk for serious injury or death."
- The director of the government agency that purchased this product concluded after two years that "[the model] isn't predicting much."
- At least $366K was spent (i.e., wasted) on this program. This figure most likely does not account for the costs of support staff and infrastructure, or for the actual waste described below.
- "More than 4,100 Illinois children were assigned a 90 percent or greater probability of death or injury. And 369 youngsters, all under age 9, got a 100 percent chance of death or serious injury in the next two years" [a false-positive problem; not to bring up a sore subject, but several infamous outlets proclaimed that Hillary Clinton would win the Presidency with 90% or higher probability!]
- "high-profile child deaths kept cropping up with little warning from the predictive analytics software; Predictive analytics (wasn't) predicting any of the bad cases" [a false-negative problem]
- Selling the dream: "If it is possible to use big data to spotlight a child in trouble and intervene before he or she is hurt, then doing so is government's moral obligation, advocates for the technology say."
- The algorithm, despite being paid for by public funds, is hidden from public view but the Tribune report pieces this together: "Eckerd [the vendor] retrospectively analyzes thousands of closed abuse cases and from them draws data points that are highly correlated with serious harm. The parents' ages could be a factor — or their previous criminal records, evidence of substance abuse in the home, or the presence of a new boyfriend or girlfriend."
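The probability claims in those figures can be checked with a simple calibration test: among children scored at, say, "100 percent," the observed rate of harm should actually be near 100 percent. A minimal sketch of such a test, using invented illustrative numbers (not the Illinois data):

```python
# Calibration check: group cases by predicted probability and compare
# the average prediction to the observed outcome rate in each group.
# All scores and outcomes below are invented for illustration only.

def calibration_table(scores, outcomes, bins=(0.0, 0.5, 0.9, 1.01)):
    """For each score bin, report the mean predicted vs. observed rate."""
    rows = []
    for lo, hi in zip(bins, bins[1:]):
        group = [(s, o) for s, o in zip(scores, outcomes) if lo <= s < hi]
        if not group:
            continue
        mean_pred = sum(s for s, _ in group) / len(group)
        observed = sum(o for _, o in group) / len(group)
        rows.append((lo, hi, len(group), mean_pred, observed))
    return rows

# Hypothetical: 100 children scored 1.0 ("100% chance"), but only 5 harmed.
scores = [1.0] * 100 + [0.2] * 900
outcomes = [1] * 5 + [0] * 95 + [1] * 20 + [0] * 880

for lo, hi, n, pred, obs in calibration_table(scores, outcomes):
    print(f"scores [{lo:.1f}, {hi:.1f}): n={n}, predicted={pred:.2f}, observed={obs:.2f}")
```

A well-calibrated model would show the predicted and observed columns roughly agreeing; a "100 percent" bin with a 5 percent observed rate, as in this hypothetical, is exactly the false-positive pattern the Tribune describes.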
The story does not invalidate the practice of predictive analytics, but it points out pervasive problems within the industry. Most importantly, we are over-hyping this technology. I cover the statistics of such models in depth in the marketing chapters of Numbersense (link). Predictive models can be impressive in a relative sense (in marketing, we use the term "lift") but are usually not impressive in an absolute sense. Every social-science-based predictive model I have seen misses a sizeable number of true targets while falsely flagging lots of "cases".
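The relative-versus-absolute distinction can be made concrete. Lift compares the hit rate among flagged cases to the base rate in the whole population; when the outcome is rare, a model can have a large lift and still be wrong about most of the cases it flags. A sketch with invented numbers:

```python
# Lift vs. absolute accuracy on a rare outcome (all numbers invented).
base_rate = 0.01   # 1% of all cases end in serious harm
flagged = 1000     # cases the model flags as high risk
true_hits = 50     # flagged cases where harm actually occurred

precision = true_hits / flagged  # absolute hit rate among flagged cases
lift = precision / base_rate     # improvement over the population base rate

print(f"precision = {precision:.0%}")  # 5% -- wrong 95% of the time
print(f"lift      = {lift:.1f}x")      # 5x better than random, still poor
```

A vendor can truthfully advertise "five times better than random" while the model remains wrong about 19 of every 20 families it flags.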
When a vendor does not report on accuracy metrics, or when such metrics are not interpretable in practice, or when the metrics are generated by third parties, or when metrics are computed purely from historical data, one has to be very careful about separating the good products from the scams.
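At a minimum, the metrics worth demanding from a vendor can all be computed from a simple confusion matrix. A sketch of the relevant quantities, using hypothetical counts:

```python
# Confusion-matrix metrics a buyer should demand (counts are hypothetical).
tp, fp, fn, tn = 50, 950, 30, 9000  # true/false positives and negatives

precision = tp / (tp + fp)  # of those flagged, how many were real cases
recall    = tp / (tp + fn)  # of real cases, how many the model caught
fpr       = fp / (fp + tn)  # share of non-cases wrongly flagged

print(f"precision={precision:.3f} recall={recall:.3f} false_positive_rate={fpr:.3f}")
```

Low precision corresponds to the false-positive problem above, and low recall to the false-negative problem; a vendor report that omits either number is hiding one side of the trade-off.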
***
It bears repeating that the "big data" used in this type of analysis has multiple complications which I collectively describe as OCCAM.
Observational - we cannot run an experiment and assign people to have criminal records, or other characteristics, at random, so it is tough to get any read on causal mechanisms. Thus, there is a good chance of suffering from spurious correlations.
Seemingly Complete - but not really. Many cases are unreported and not in the training data. One of the excuses used by Eckerd when confronted with inaccuracy is that certain children are not even scored because they do not appear in the data. That is technically correct but does not change the fact that the model did not predict those deaths.
No Controls - many predictive algorithms do not use controls explicitly. Another issue is that the possible controls all come from cases reported to the authorities. Without a doubt, if we take the profiles of the at-risk children as determined by the algorithm, we will find many other children in the general population that do not show up in the database at all.
Adapted - the data collection process was not designed specifically to support predictive modeling.
Merged - various datasets get merged during the analysis, which introduces errors.
More discussion of OCCAM here and here.
***
While I appreciate that the Chicago Tribune wrote and published this article, this is yet another media report on predictive modeling using "big data" that does not contain any data or quantified metric of predictive accuracy.
The two chapters most relevant to this post are Chapter 4 in Numbers Rule Your World (link) and Chapter 5 in Numbersense (link).