Big (surveillance) data is changing everything - that is undeniable. But is Big Data changing everything for the better? That is what you have to think about.
As someone who has done data analytics for a long time, I can say confidently that the potential for harm is at least as large as the potential for good, and probably larger.
A case in point is data as related to crime and policing. This post picks up from the perceptive article in the New York Times, "How Reporting Practices can Skew Crime Statistics" (link).
The article points out something really important: data is as data is defined. Different definitions produce different, frequently contradictory, analytical results. No one should interpret analytical results without first learning how the data are defined.
The author describes two extremely popular styles of "big data" analytics: top N ranking and trend analysis. In crime statistics, sample insights are (1) "South Bend, Indiana is the 27th worst city for violent crime rate among all cities with over 100,000 people." and (2) "Violent crime has worsened in South Bend since Mayor Pete got elected in 2012."
With the volume of data available today, those two simple analyses still account for the vast majority of "analytics" out there.
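To make the two styles concrete, here is a minimal pandas sketch using made-up numbers - the city names, crime counts, and populations below are invented for illustration only. It ranks cities by violent crime rate per 100,000 residents (top N), then computes the change in one city's rate between two years (trend).

```python
import pandas as pd

# Hypothetical figures for illustration only -- not real crime statistics.
cities = pd.DataFrame({
    "city": ["City A", "City B", "City C", "City D"],
    "violent_crimes": [1200, 950, 430, 2100],
    "population": [101_000, 150_000, 120_000, 310_000],
})

# Style 1: top-N ranking by rate per 100,000 residents
cities["rate_per_100k"] = cities["violent_crimes"] / cities["population"] * 100_000
top_n = cities.sort_values("rate_per_100k", ascending=False).head(3)
print(top_n[["city", "rate_per_100k"]])

# Style 2: trend analysis -- change in one city's rate between two years
trend = pd.DataFrame({
    "year": [2012, 2019],
    "violent_crimes": [900, 1200],
    "population": [100_000, 101_000],
})
trend["rate_per_100k"] = trend["violent_crimes"] / trend["population"] * 100_000
print(trend)
print("Change in rate:", trend["rate_per_100k"].iloc[-1] - trend["rate_per_100k"].iloc[0])
```

Both computations are trivial; everything interesting hides in how the inputs - crimes, populations, city boundaries - are defined.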
"South Bend, Indiana is the 27th worst city for violent crime rate among all cities with over 100,000 people."
Here is a list of definitions that affect the interpretation of this statement, most of which are described in the NYT article:
- whether the suburbs are included as part of the "city", or more broadly, how the boundaries of the city are drawn. This matters because the crime rates in the suburbs are different from the crime rates in the city centers.
- whether the error in the count of population is comparable across all cities under consideration
- what is considered a crime
- what proportion of crime is reported
- which crimes are classified as "violent"
- whether there are any exceptions
You might think the population of a city is an objective fact, but it isn't. Definitions matter, and they matter a lot when you're comparing different cities, as the sketch below illustrates.
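Here is a minimal sketch of how the boundary definition alone can move the numbers. The counts are invented; the point is that the same metro area produces two very different rates depending on whether its (lower-crime) suburbs are included in the denominator.

```python
# Hypothetical counts for one metro area -- invented for illustration.
core_crimes, core_pop = 1000, 100_000        # city proper
suburb_crimes, suburb_pop = 200, 150_000     # surrounding suburbs

rate_city_only = core_crimes / core_pop * 100_000
rate_with_suburbs = (core_crimes + suburb_crimes) / (core_pop + suburb_pop) * 100_000

print(f"Rate, city proper only:  {rate_city_only:.0f} per 100,000")    # 1000
print(f"Rate, suburbs included:  {rate_with_suburbs:.0f} per 100,000") # 480
```

Same city, same year, same crimes - and the rate is cut in half by redrawing the boundary.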
"Violent crime has worsened in South Bend since Mayor Pete got elected in 2012."
In addition to the points listed above, we also need to know whether any of those definitions changed between 2012 and 2019. Changing how something is defined almost always breaks the continuity of an analysis, undermining any trend. I raised this point in my book Numbersense (link) in the chapter on obesity - if the medical community changes how we measure obesity, then all historical data are rendered useless, so such proposals must be treated with great caution.
Further, we should find out whether violent crime was increasing or decreasing in South Bend prior to 2012, whether it was increasing or decreasing in comparable cities since 2012, whether it was increasing or decreasing in the nation as a whole since 2012, and so on.
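To see how a definitional change - rather than a change in behavior - can manufacture a trend, here is a minimal simulation with made-up numbers: suppose that starting in 2016 a batch of offenses previously counted as non-violent is reclassified as violent.

```python
import pandas as pd

# Hypothetical series: true violent crime is flat at 800 incidents per year,
# but from 2016 onward a reclassification adds ~150 formerly "non-violent" offenses.
years = list(range(2012, 2020))
true_violent = [800] * len(years)
reclassified = [0 if y < 2016 else 150 for y in years]

df = pd.DataFrame({"year": years,
                   "reported_violent": [t + r for t, r in zip(true_violent, reclassified)]})
print(df)
# The reported series jumps in 2016 even though underlying behavior never changed --
# comparing the trend across the definition change is meaningless.
```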
***
Some readers of my book Numbersense (link) might be mystified why I spent so much energy discussing definitions - of grad school admissions statistics, of U.S. News ranking, of obesity measures, of unemployment rates, of inflation rates, of fantasy football metrics, etc., etc.
A key to numbersense - to judging the statistics and data thrown at you - is to understand clearly how the data are collected, and in particular, how any metric is defined.
The discussion of crime statistics by the New York Times makes this case cogently.
***
Now let me get to the absurdity of predictive policing.
A particularly popular and cited application of Big Data "for social good" is predictive policing. The goal of predictive policing is to prevent crime from happening.
The ideal of "Minority Report" is often foisted on us. Wouldn't it be great if some AI or machine learning or predictive model could figure out who will be burglarizing your home next year, and have the police lock up the would-be burglar before s/he could commit the crime?
I have done a good amount of predictive modeling in my career, so I am not saying this because I'm a Luddite. It's precisely because of that experience that I am leading you down the following rabbit hole:
There are not many burglaries. According to Statista, the worst state (New Mexico) suffered 770 burglaries per 100,000 residents in 2018. This means that if the analyst is given a list of 100,000 New Mexicans chosen at random, a perfect model should label 770 of those as potential burglars. That is to say, we want the model to pick out 0.77 percent of the list.
A perfect model does not exist. Any model generates false positives. Because burglars are rare, any reasonably predictive model would need to red-flag many more than 0.77 percent of the list to have any chance of capturing all 770 potential burglars. Let's say the model points the finger at 1 percent of the list. That would mean 1,000 potential burglars. Since there should be only 770 burglars, the first thing we know is that at least 230 innocent people would be flagged by this model.
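The arithmetic can be laid out as a small confusion-matrix sketch. The only inputs are the two figures above - 770 true burglars per 100,000 and a model that flags 1 percent of the list; everything else is derived.

```python
population = 100_000
true_burglars = 770               # the Statista figure cited above
flagged = int(0.01 * population)  # the model flags 1% of the list

# Best case for the model: every true burglar is among the flagged
best_case_true_positives = min(true_burglars, flagged)
min_false_positives = flagged - best_case_true_positives
best_case_precision = best_case_true_positives / flagged

print(f"Flagged: {flagged}")
print(f"False positives (at minimum): {min_false_positives}")  # 230
print(f"Precision (at best): {best_case_precision:.0%}")       # 77%
# In practice no model achieves this best case: some burglars go unflagged,
# so false positives exceed 230 and precision falls below 77%.
```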
Next, think about what happens to the 1,000 flagged people. In the ideal of predictive policing, police would knock on their doors and arrest them for their future crimes. These people, who haven't burglarized anyone, would be duly jailed by courts that buy into predictive policing. (How would they defend themselves against an allegation of a future crime?)
We could say with certainty that those 1,000 did not subsequently commit burglary - given that they were incarcerated.
So humor me, and try to answer the following question: Of the 1,000 people flagged by this model, how many of these were accurately predicted?
It is even worse because most burglars commit more than one crime, so it might be that only 1 in 1,000 persons is a burglar. There is some good news: among people already convicted of burglary, a much higher proportion will be repeat offenders. Prediction models probably work well on that group.
One point is that these models have all the same problems as diagnostic testing in medicine, but no one in the data mining/science community seems to understand them. Some of the methods of evaluating performance are wrong and would suggest better performance than is actually the case.
Posted by: Ken | 02/09/2020 at 06:29 AM
Burglaries != burglars.
This example ignores many simple issues in order to contrive a straw man. Foremost, the model is part of a decision process that includes setting thresholds for acting, and many other process steps that can mitigate or exacerbate the model results. The model does not "point the finger". Humans take the actions, using the scores to decide what to do.
Overlooking the burglaries issue, if we have 770 expected events, then we could limit the action step to just 770 cases. Fewer people should mean fewer false positives, but also fewer true positives.
What we need, what all models need, is a good economic model of the errors to choose better thresholds.
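To make that concrete, here is a minimal sketch of what such an economic model of the errors could look like, using invented scores and invented costs: assign a cost to each false positive and each false negative, then pick the score threshold that minimizes total expected cost.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scores: 770 true positives and 99,230 negatives out of 100,000.
pos_scores = rng.beta(5, 2, size=770)       # burglars tend to score higher
neg_scores = rng.beta(2, 5, size=99_230)    # non-burglars tend to score lower

COST_FP = 1.0   # cost of acting on an innocent person (hypothetical units)
COST_FN = 5.0   # cost of missing a burglary (hypothetical units)

best = None
for threshold in np.linspace(0.05, 0.95, 19):
    fp = (neg_scores >= threshold).sum()
    fn = (pos_scores < threshold).sum()
    cost = COST_FP * fp + COST_FN * fn
    if best is None or cost < best[1]:
        best = (threshold, cost)

print(f"Cost-minimizing threshold: {best[0]:.2f} (total cost {best[1]:.0f})")
```

Change the cost ratio and the "right" threshold moves - which is exactly why that choice belongs to policy, not to the model.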
After that, the police and elected officials & lawmakers(!) can decide an appropriate follow-up to the scores. Maybe extra surveillance on some people, but not necessarily precautionary incarceration.
Maybe if the prediction were really good, the burglar would only need to be diverted at the time the burglary is predicted? Like give him an education or have a Koch bro loan him $1,000?
Posted by: Chris | 02/11/2020 at 10:52 AM
Chris: Thanks for the discussion. I am not disagreeing with your comments. My concern is about the difficulty/impossibility of measuring the actual impact of such algorithms. How many people were flagged? How many of these would not have been flagged by traditional methods? How many flagged by traditional methods would not be identified by the algorithm? How many were treated in what ways? How many were false positives? How many were false negatives? And so on. If we can't measure the performance in real life, we can't assess its impact properly, and we can't improve these algorithms. Answering the questions I just listed takes a lot of effort and ingenuity. In a follow-up post, I explored some ways of measuring them. Has any municipality or vendor published detailed statistics on the performance of these algorithms? I would be happy to look at them.
Posted by: Kaiser | 02/11/2020 at 01:17 PM
Thanks Kaiser. I commented here before reading Part 2, where you do answer some of my concerns. I work in health care on models of event prediction, so I don't have first-hand knowledge of these recidivism models, but health care has the same issue with counterfactual evaluations. We cannot measure how a person would have progressed without a treatment if they in fact received the treatment. We can only search for "twins" and evaluate their futures. Yes, we need to answer all those questions you raise in the data before implementation.
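To illustrate the "twins" idea, here is a minimal sketch with made-up data: for each treated person, find the nearest untreated "twin" on observed covariates and compare their outcomes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: two covariates (e.g., age, prior events) and a binary outcome
# for a treated group and a larger untreated pool.
treated_X = rng.normal(size=(50, 2));   treated_y = rng.binomial(1, 0.30, 50)
control_X = rng.normal(size=(500, 2));  control_y = rng.binomial(1, 0.40, 500)

# For each treated person, find the nearest control "twin" on the covariates.
effects = []
for x, y in zip(treated_X, treated_y):
    twin = np.argmin(np.linalg.norm(control_X - x, axis=1))
    effects.append(y - control_y[twin])

print(f"Estimated effect (matched difference in outcomes): {np.mean(effects):.2f}")
```

Of course this only adjusts for covariates you observe, which is precisely the limitation of counterfactual evaluation.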
But I do think the solution is to look beyond the algorithm to the entire decision process that embeds the algorithm. There is a deeper moral question here as well, one well covered by ProPublica and various rebuttals concerning the COMPAS model. That question is: "To what extent should we use historical data with known systematic bias to make these life-altering decisions?"
Thanks for all the time you put into sharing these important questions. It's not easy to reach as wide an audience as you do.
Posted by: Chris | 02/12/2020 at 12:17 PM