New Yorkers have been traumatized by sensational stories of subway riders being pushed onto the tracks, randomly stabbed, and so on. Personal security is a hot topic, so the MTA, which runs public transportation, has been rolling out various efforts to enhance security. Mayor Eric Adams in particular loves technological solutions.

One such effort is hiring an AI company, Evolv, which claims its scanner technology can detect who's carrying weapons.
The city conducted a pilot program in 20 stations for 30 days. It has not released full results; after multiple requests from media outlets, it issued only a four-sentence statement, so some of the following is necessarily speculative. (link)
It seems like passengers were selected and instructed to walk through the scanners, and a total of 2,749 scans triggered 130 positive signals.
The limited data that were released are incoherent in many ways.
The rate of scanning is surprisingly low. 2,749 scans in 30 days is about 90 scans per day on average, which means fewer than 5 scans per station per day. That makes no sense to me.
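The arithmetic, as a quick Python sketch (the 20 stations and 30 days are as reported; everything else follows directly):

```python
# Back-of-envelope check on the scanning rate, using the reported counts
total_scans = 2_749
days = 30
stations = 20

scans_per_day = total_scans / days                      # ~92 scans per day system-wide
scans_per_station_per_day = scans_per_day / stations    # ~4.6 scans per station per day

print(f"{scans_per_day:.0f} scans/day, {scans_per_station_per_day:.1f} scans/station/day")
```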
On top of that, who was responsible for picking which people to send through scanners, and how? They were definitely not scanning everyone. Let's say they do five scans a day at a station, and since this is a small number, let's say they do all five during one hour of the day. Which five of the passengers do they stop and scan?
Ideally, they should pick passengers at random, but the base rate of gun carrying in the NYC subway is probably low, so a proper test would have to scan many more people, far more than five per station per day.
The algo flagged 130 out of 2,749 scans, which is just under 5%. If we assume all positives are correct, then at least 5% of New York subway riders carry guns, one in 20, so on average there would be several (concealed) guns in each car of the subway during rush hour.
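To see why that implication strains belief, here's a rough sketch; the rush-hour car occupancy is my own assumption, not a reported figure:

```python
# What a ~5% gun-carry rate would imply per subway car at rush hour
implied_carry_rate = 130 / 2_749        # ~4.7%, if every positive signal were correct
riders_per_car_rush_hour = 100          # assumption: a crowded car holds on the order of 100 people

guns_per_car = implied_carry_rate * riders_per_car_rush_hour   # ~5 concealed guns per car

print(f"implied guns per crowded car ~ {guns_per_car:.0f}")
```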
But, how many of those 130 positive signals were correct? Presumably, the police then searched those suspects, and they found ... ... ... found ... ... ... zero guns. Oops. So, the experiment yielded a positive predictive value of 0% (0/130). In plain English, a positive scan result holds zero value as an indicator of gun carry. Additionally, if any of the 2,749 people who were scanned carried guns, they were not detected. The false negative rate is 100% (every actual gun carrier, if there were any, was predicted to be negative).
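To make the two error metrics concrete, a minimal sketch of the confusion-matrix arithmetic, using the reported counts and assuming (for this gun-only reading) that zero guns were found among the positives:

```python
# Confusion-matrix arithmetic for the gun-only reading of the pilot results
scans = 2_749
flagged_positive = 130
guns_found_among_flagged = 0            # per the city's statement, no guns were recovered

true_positives = guns_found_among_flagged
false_positives = flagged_positive - true_positives   # 130

# Positive predictive value: of all positive signals, how many were real?
ppv = true_positives / flagged_positive               # 0 / 130 = 0%

# False negative rate: of all actual gun carriers, what share was missed?
# We don't know how many of the 2,749 carried guns, but whatever that number is,
# none were caught, so the rate is 100% (provided the number is greater than zero).
print(f"PPV = {ppv:.0%} ({true_positives} true / {false_positives} false positives)")
```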
Wait, that's not what the city officials said in those four sentences. They disclosed that Evolv found 12 knives, so they claimed a positive predictive value of just under 10% (12/130). I guess that sounds better than 0%! I highly doubt the MTA (or any other establishment) would invest in scanners that can detect only knives but no guns.
Several reports I have seen compute the ratio 118/2,749 = 4.3% and call it the "false positive rate". This is the wrong definition of the FP rate; the denominator should be the number of actual negatives (people not carrying weapons), not the total number of scans. In any case, the 4.3% is not a useful metric.
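A short sketch of the difference, using the knife-scenario counts (12 true positives, 118 false positives); the count of actual negatives is an assumption made just to show where the denominator comes from:

```python
# Why 118 / 2,749 is not the false positive rate
scans = 2_749
flagged = 130
true_positives = 12                          # the knives, per the city's statement
false_positives = flagged - true_positives   # 118

# The reporters' calculation: false positives over all scans
reported_ratio = false_positives / scans     # ~4.3%, not a standard metric

# The textbook false positive rate: false positives over all ACTUAL negatives,
# i.e. people who were not carrying weapons. We don't know that count exactly;
# if only the 12 knife carriers were actual positives, actual negatives = 2,749 - 12.
actual_negatives = scans - true_positives    # assumption: no undetected weapon carriers
textbook_fpr = false_positives / actual_negatives

# The two numbers are nearly identical here only because true positives are so few.
print(f"reported: {reported_ratio:.2%}, textbook: {textbook_fpr:.2%}")
```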
I discussed the kind of math that underlies all security-related statistical detection problems in Numbers Rule Your World (Chapter 4; link); refer to that chapter for more details.
***
For this blog post, let's focus on a simple mental model for this type of problem. We start with a guess of the base rate of gun carry (Bayesians call this the prior). Let's say it's 1%.
If the base rate is 1%, and we picked 2,749 people randomly, then we should expect to find about 28 gun carriers. The first thing to realize is that if the predictive algo were perfect, it would flag exactly those 28 scans as positive and no others. In this case, the positive rate for the scanning program would be 28/2,749 = 1%.
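A sketch of this mental model, with the 1% base rate as the assumed prior:

```python
# Expected gun carriers under an assumed 1% base rate (the prior)
base_rate = 0.01              # assumption: 1 in 100 riders carries a gun
scans = 2_749

expected_carriers = base_rate * scans          # ~27.5, call it 28
# A perfect algorithm would flag exactly these people and no one else,
# so its positive rate would equal the base rate: 28 / 2,749 = 1%.
perfect_positive_rate = expected_carriers / scans   # equals base_rate, by construction

print(f"expected carriers ~ {expected_carriers:.0f}, perfect flag rate = {perfect_positive_rate:.0%}")
```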
But perfection is illusory. All models make mistakes. To allow for mistakes, a realistic algo ought to flag more than 28 scans as positive, which means it must make false positive mistakes.
The Evolv model tagged about 5% of the scans as positive, but at most 1 percentage point of that 5% can be true positives, so its positive predictive value is upper bounded at 1/5 = 20%. If we demand a higher positive predictive value, the algo cannot spit out that many positives.
Moreover, the positive predictive value can dip below 20% because the algo might not catch all of the gun carriers (false negatives). The lower bound is 0%, as demonstrated by the MTA experiment. If we require the algo to declare fewer positives, the chance of false negatives increases.
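The ceiling falls out of the two rates directly; a quick sketch, again assuming the 1% base rate:

```python
# Upper bound on positive predictive value, given an assumed 1% base rate
# and the observed ~5% flag rate
base_rate = 0.01              # assumed share of riders carrying guns
flag_rate = 130 / 2_749       # ~4.7%, the share of scans flagged positive

# Even if every gun carrier landed among the flagged, true positives can be at most
# base_rate of the scans, so PPV <= base_rate / flag_rate.
max_ppv = base_rate / flag_rate    # ~21%, roughly the 1-in-5 ceiling described above

# Any missed carriers (false negatives) push the realized PPV below this ceiling,
# all the way down to the 0% observed in the pilot.
print(f"PPV ceiling ~ {max_ppv:.0%}")
```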
If we were evaluating these algos, one of the most important metrics should be the false negative rate i.e. of those who were carrying guns, what proportion went through the scanners undetected?
***
What if the base rate were much higher, like 10%, meaning about 275 gun carriers among those scanned?
Because the algo tags only 5% of the scans as positive, it has effectively given up on at least half of the gun carriers, so at best it could catch half of them. And some of the scans it did flag could be false positives, pushing the catch rate even lower.
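The same arithmetic under the higher assumed base rate shows the ceiling on how many carriers could possibly be caught:

```python
# With an assumed 10% base rate, how many carriers can a ~5% flag rate possibly catch?
base_rate = 0.10              # assumption: 1 in 10 riders carries a gun
scans = 2_749
flag_rate = 130 / scans       # ~4.7%

carriers = base_rate * scans            # ~275 gun carriers among those scanned
max_caught = flag_rate * scans          # 130, even if every single flag were correct
max_recall = max_caught / carriers      # ~47%: at most about half can be caught

print(f"carriers ~ {carriers:.0f}, max caught = {max_caught:.0f}, max recall ~ {max_recall:.0%}")
```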
If the harm of a single gun is intolerable, then the algo has to flag more subway riders, inconveniencing more of them. But if the goal isn't to banish all guns from the subway, then the inconvenience of being flagged and searched can be reduced by lowering the frequency of positive signals. With Evolv, they can't lower the frequency further since they aren't finding any guns as it is.
In this Wired article (link), the reporter found that Evolv's technology has also underperformed expectations when installed at schools and hospitals.