In the previous post, we discussed how charts need to address the key question posed by the data. In this case, the journalist was trying to show that police shots often miss, and that accuracy is largely unpredictable even when the distance to the target is known.

In the comments, there is interest in seeing the hit rate vs. distance chart. Because the data came to us in buckets, we do not have enough detail to continue the analysis. If one were to guess, the real curve would start out with 100% accuracy at distance 0, fall sharply to a plateau in the 20-40% range at modest distances, and then drop again at large distances, decaying to zero.

Andrew Gelman has conducted this analysis for a similar problem, that of predicting accuracy of golf putts based on distance from the hole. Here are two key charts from his paper (joint with Deborah Nolan):

The left chart is the counterpart of our hit rate chart above, except that the golf data set is larger, which allows a curve to be fitted. The right chart is the fitted curve, which is a "model" for the true relationship between accuracy and distance from the hole. The model fits the data well.

Gelman and Nolan didn't just find any best fitting line through the data. They started out with a trigonometric model (shown on the right), with the angle of the putt as a random variable. With this setup, they wrote down the formula for computing the probability that the putt will fall in, that is, the proportion of success. The angle is assumed to follow a normal distribution with the standard deviation being an unknown parameter. The standard deviation is estimated from the available data.
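To make the idea concrete, here is a minimal Python sketch of a model of this kind. It is an illustration under stated assumptions, not a reproduction of the paper: it assumes the angular error is normal with mean zero, and that a putt from distance x drops when the absolute angle is below arcsin((R - r) / x), where R and r are the hole and ball radii (taken from the standard 4.25 in and 1.68 in diameters). The function names and the 1.5 degree value for the standard deviation are placeholders chosen for illustration; in practice that parameter would be estimated from the data. Consult the paper for the exact formulation.

```python
import math

# Standard golf dimensions, converted from inches to feet.
HOLE_RADIUS_FT = 4.25 / 2 / 12
BALL_RADIUS_FT = 1.68 / 2 / 12


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def putt_success_probability(distance_ft, sigma_rad):
    """P(|angle| < threshold) when angle ~ Normal(0, sigma_rad).

    Assumed geometry: the putt drops if the ball's centre passes
    within (hole radius - ball radius) of the centre of the hole.
    """
    margin = HOLE_RADIUS_FT - BALL_RADIUS_FT
    if distance_ft <= margin:
        # The ball already overlaps the hole enough; treat as certain.
        return 1.0
    threshold = math.asin(margin / distance_ft)
    return 2.0 * normal_cdf(threshold / sigma_rad) - 1.0


# Placeholder value for illustration only, NOT the paper's estimate.
SIGMA_RAD = math.radians(1.5)

for distance in (2, 5, 10, 20):
    p = putt_success_probability(distance, SIGMA_RAD)
    print(f"{distance:>2} ft: {p:.3f}")
```

The appeal of this setup is that a single parameter, the standard deviation of the angle, controls the whole curve, and the shape (certain success up close, decaying toward zero far away) comes from the geometry rather than from an arbitrary curve family.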

Of course, the human body is a bit harder to model than a hole in the ground, but this procedure could very well apply.

For more details, check out the paper (PDF). This example is also found in their book on teaching statistics.

Source: Gelman and Nolan, "A Probability Model for Golf Putting".

Here are the hit rates with 95% confidence intervals (these assume that shots are independent, which may not be correct, but the intervals should be conservative):

Distance   Hit rate (%)   95% CI (%)
0-6        42.5           (36.3, 48.8)
6-21       23.2           (15.1, 32.9)
21-45      40.0           (24.9, 56.7)
45-75      14.3           (0.4, 57.9)
75+        6.7            (0.8, 22.1)

The increase from 6-21 to 21-45 may just be statistical variation. I expect, though, that at shorter distances there is a need to fire rapidly, which reduces accuracy. At 21-45 there is time to aim and stabilise, so accuracy improves.

Posted by: Ken | Dec 18, 2007 at 02:29 AM

I don't think you can trust any analysis of this data because its nature makes it wildly inaccurate. Just as shooting someone in such a situation involves a lot of stress and a lack of precision, I doubt that even a rough estimate of the distance from which the shot was fired can be made with any accuracy. Police reports are written hours after the incident, when many other details were much more important than the exact distance from which each shot was fired. Besides, estimating distance is not something we are particularly good at, even under good conditions. Also, the rather large number of shots where the distance is completely unknown needs to be factored in.

So based on that, I don't think anything other than the overall hit rate from all the shots fired makes much sense to analyze. The chart makes the data look more interesting, but I wonder if it would have made more sense to compare to some other data rather than go into detail based on such bad data - perhaps the number of arrests, or the number of shots fired over the year (are there more incidents around the holidays, etc.).

Posted by: Robert Kosara | Dec 18, 2007 at 09:13 AM

"Gelman and Nolan didn't just find any best fitting line through the data. They started out with a trigonometric model..."

Excellent. A physically based model is by nature simpler and more accurate than any sixth-order polynomial fit. You can tell the statistics newbies by their reliance on such awkward fits.

Posted by: Jon Peltier | Dec 18, 2007 at 09:52 AM