Big (surveillance) data is changing everything - that is undeniable. But is Big Data changing everything for the better? That is what you have to think about.
As someone who's done data analytics for a long time, I can say confidently that the potential for harm is at least as large as, and probably larger than, the potential for good.
A case in point is data as related to crime and policing. This post picks up from the perceptive article in the New York Times, "How Reporting Practices can Skew Crime Statistics" (link).
The article points out something really important - data is as data is defined. Different definitions produce different, frequently contradictory, analytical results. No one should interpret analytical results without first learning the data definitions.
The author describes two extremely popular styles of "big data" analytics: top N ranking and trend analysis. In crime statistics, sample insights are (1) "South Bend, Indiana is the 27th worst city for violent crime rate among all cities with over 100,000 people." and (2) "Violent crime has worsened in South Bend since Mayor Pete got elected in 2012."
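To make these two styles concrete, here is a minimal pandas sketch of how each is typically computed. Every number, city, and column name below is invented for illustration; none of it comes from the NYT article or any official source.

```python
import pandas as pd

# Hypothetical data: one row per (city, year), with population and violent-crime counts.
crime = pd.DataFrame({
    "city":           ["South Bend", "South Bend", "Springfield", "Springfield"],
    "year":           [2012, 2019, 2012, 2019],
    "population":     [101_000, 102_000, 115_000, 117_000],
    "violent_crimes": [890, 1150, 1020, 980],
})
crime["rate_per_100k"] = crime["violent_crimes"] / crime["population"] * 100_000

# Style 1: top-N ranking -- order cities with over 100,000 people by latest-year rate.
latest = crime[(crime["year"] == 2019) & (crime["population"] > 100_000)]
ranking = latest.sort_values("rate_per_100k", ascending=False).reset_index(drop=True)
print(ranking[["city", "rate_per_100k"]])

# Style 2: trend analysis -- change in one city's rate between two years.
sb = crime[crime["city"] == "South Bend"].set_index("year")["rate_per_100k"]
print("South Bend change, 2012 to 2019:", round(sb.loc[2019] - sb.loc[2012], 1))
```

Both "insights" are just one sort and one subtraction away from the raw counts, which is precisely why the definitions behind those counts matter so much.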
With the volume of data available today, those two simple analyses still account for the vast majority of "analytics" out there.
South Bend, Indiana is the 27th worst city for violent crime rate among all cities with over 100,000 people.
Here is a list of definitions that affect the interpretation of this statement, most of which are described in the NYT article:
- whether the suburbs are included as part of the "city", or more broadly, how the boundaries of the city are drawn. This matters because the crime rates in the suburbs are different from the crime rates in the city centers.
- whether the error in the count of population is comparable across all cities under consideration
- what is considered a crime
- what proportion of crime is reported
- what crime is classified as "violent"
- whether there are any exceptions
You might think the population of a city is an objective fact but it isn't. Definitions matter, and they matter a lot when you're comparing different cities.
Violent crime has worsened in South Bend since Mayor Pete got elected in 2012.
In addition to the points listed above, we also need to know whether any of those definitions changed between 2012 and 2019. Changing how something is defined almost always breaks the continuity of the data series, undermining any trend analysis. I raised this point in my book Numbersense (link) in the chapter on obesity - if the medical community changes how we measure obesity, then all historical data are rendered useless, so such proposals must be weighed with great caution.
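A purely synthetic sketch shows how a mid-stream redefinition manufactures a trend out of nothing. The counts below are random noise around a flat level, and the 15 percent broadening of the definition in 2012 is a number I made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
years = np.arange(2005, 2020)

# Synthetic "true" crime counts: random noise around a flat level, no trend at all.
true_counts = rng.poisson(1000, size=years.size)

# Suppose the reporting definition broadens in 2012 so that ~15% more incidents
# qualify as "violent". The recorded series jumps even though behavior is unchanged.
recorded = np.where(years >= 2012, np.round(true_counts * 1.15), true_counts).astype(int)

series = pd.DataFrame({"year": years, "recorded_violent_crimes": recorded})
print(series.to_string(index=False))

# A naive before/after comparison attributes the jump to a real rise in crime,
# when it is entirely an artifact of the redefinition.
before = series.loc[series["year"] < 2012, "recorded_violent_crimes"].mean()
after  = series.loc[series["year"] >= 2012, "recorded_violent_crimes"].mean()
print("apparent increase:", round((after / before - 1) * 100, 1), "percent")
```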
Further, we should find out whether violent crime was increasing or decreasing prior to 2012 in South Bend, whether violent crime was increasing or decreasing in other cities comparable to South Bend since 2012, whether violent crime was increasing or decreasing in the nation as a whole since 2012, etc.
***
Some readers of my book Numbersense (link) might be mystified why I spent so much energy discussing definitions - of grad school admissions statistics, of U.S. News ranking, of obesity measures, of unemployment rates, of inflation rates, of fantasy football metrics, etc., etc.
A key to numbersense - to judging the statistics and data thrown at you - is to understand clearly how the data are collected, and in particular, how any metric is defined.
The discussion of crime statistics by the New York Times makes this case cogently.
***
Now let me get to the absurdity of predictive policing.
A particularly popular and frequently cited application of Big Data "for social good" is predictive policing. The goal of predictive policing is to prevent crime from happening.
The ideal of "Minority Report" is often foisted on us. Wouldn't it be great if some AI or machine learning or predictive model could figure out who will burglarize your home next year, and have the police lock up the would-be burglar before s/he could commit the crime?
I have done a good amount of predictive modeling in my career, so I'm not saying this because I'm a Luddite. It's precisely because of that experience that I am leading you down the following rabbit hole:
Relative to the size of the population, there are not many burglaries. According to Statista, the worst state (New Mexico) suffered 770 burglaries per 100,000 residents in 2018. This means that if the analyst is given a list of 100,000 New Mexicans chosen at random, a perfect model should label 770 of those as potential burglars. That is to say, we want the model to pick out 0.77 percent of the list.
A perfect model does not exist; any real model generates false positives. Because burglars are rare, any reasonably predictive model would need to red-flag many more than 0.77 percent of the list to have any chance of capturing all 770 potential burglars. Let's say the model points the finger at 1 percent of the list, which is 1,000 people. Since there are only 770 actual burglars, the first thing we know is that at least 230 innocent people would be flagged by this model - and that is the best case, in which the model catches every single burglar.
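Here is the back-of-the-envelope arithmetic in code. The 770-per-100,000 base rate and the 1 percent flag rate come from the paragraphs above; the 80 percent recall in the second scenario is an assumption added only to show how quickly the numbers deteriorate once the model is less than perfect.

```python
population    = 100_000
true_burglars = 770            # 0.77% base rate, from the Statista figure above
flagged       = 1_000          # the model red-flags 1% of the list

# Best case: the model catches every single burglar (recall = 100%).
min_false_positives = flagged - true_burglars
print("innocent people flagged, at minimum:", min_false_positives)   # 230

# More realistic case: suppose the model only catches 80% of burglars (assumed).
recall = 0.80
true_positives  = int(true_burglars * recall)   # 616
false_positives = flagged - true_positives      # 384
precision = true_positives / flagged
print("correctly flagged:", true_positives)
print("innocent people flagged:", false_positives)
print("share of flagged who are actual burglars:", round(precision, 2))  # 0.62
```

Even in the impossible best case, nearly a quarter of the flagged people are innocent; in the more realistic scenario, it is well over a third.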
Next, think about what happens to the 1,000 flagged people. In the ideal of predictive policing, police would knock on their doors and arrest them for their future crimes. These people, who haven't burglarized anyone, would be duly jailed by courts that buy into predictive policing. (How would they defend themselves against an allegation of a future crime?)
We could say with certainty that those 1,000 did not subsequently commit burglary - given that they were incarcerated.
So humor me, and try to answer the following question: Of the 1,000 people flagged by this model, how many of these were accurately predicted?