## Hits and misses 2

##### Dec 18, 2007

In the previous post, we discussed how charts need to address the key question posed by the data.  In this case, the journalist was trying to show that police shots often go astray, and that accuracy is largely unpredictable even when the distance to the target is known.

In the comments, there is interest in seeing the hit rate v. distance chart.  Because the data came to us in buckets, we do not have enough to continue the analysis.  If one were to guess, the real curve would start out with 100% accuracy at distance 0, fall sharply to a plateau in the 20-40% range at modest distances, and then drop again at large distances, decaying to zero.

Andrew Gelman has conducted this analysis for a similar problem, that of predicting accuracy of golf putts based on distance from the hole.  Here are two key charts from his paper (joint with Deborah Nolan):

The left chart is the equivalent of our hit rate chart above, except the golf data set is larger, allowing a curve to be fitted.  The right chart is the fitted curve, which is a "model" for the true relationship between accuracy and distance from the hole.  The model fitted the data well.

Gelman and Nolan didn't just find any best fitting line through the data.  They started out with a trigonometric model (shown on the right), with the angle of the putt as a random variable.  With this setup, they wrote down the formula for computing the probability that the putt will fall in, that is, the proportion of success.  The angle is assumed to follow a normal distribution with the standard deviation being an unknown parameter.  The standard deviation is estimated from the available data.
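The setup described above can be sketched in a few lines.  This is only an illustration, not the authors' code: it assumes the putt drops when the angular error is smaller than the arcsine of (hole radius minus ball radius) over distance, that the error is normal with mean zero, and that the hole and ball have regulation sizes.  The function names and the value of `sigma_rad` used in the loop are placeholders; in the paper the standard deviation is estimated from data, which is not shown here.

```python
import math

# Assumed regulation sizes, in inches (not taken from the post itself).
HOLE_RADIUS_IN = 4.25 / 2
BALL_RADIUS_IN = 1.68 / 2


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def putt_success_probability(distance_ft, sigma_rad):
    """Probability that a putt from distance_ft falls in.

    The angular error is assumed normal with mean 0 and standard
    deviation sigma_rad (radians).  The putt succeeds when the absolute
    error is below the threshold angle asin((R - r) / distance).
    """
    clearance = HOLE_RADIUS_IN - BALL_RADIUS_IN
    distance_in = distance_ft * 12.0
    if distance_in <= clearance:
        # The ball is already within reach of the hole: any angle works.
        ratio = 1.0
    else:
        ratio = clearance / distance_in
    threshold = math.asin(ratio)
    return 2.0 * normal_cdf(threshold / sigma_rad) - 1.0


# Illustrative value only; a real analysis would estimate sigma from data.
for feet in (2, 5, 10, 20):
    print(feet, round(putt_success_probability(feet, sigma_rad=0.026), 3))
```

The curve produced this way starts near 100% at very short range and decays toward zero as distance grows, which is the qualitative shape guessed at for the shooting data.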

Of course, the human body is a bit harder to model than a hole in the ground, but this procedure could very well apply.

For more details, check out the paper (PDF).  This example is also found in their book on teaching statistics.

Source: Gelman and Nolan, "A Probability Model for Golf Putting".

## Hits and misses

##### Dec 16, 2007

In this NYT article, we are told that "the most likely result when a policeman discharges a gun is that he or she will miss the target completely."  That's a shocker for those of us conditioned by Hollywood movies to think anyone who picks up a gun for the first time hits the villain right on the temple.  The following graphic attempts to tell the story.

The one hit here is how the distances are visually presented.  The elliptical lines remind us of the neglected variable of direction; they also mean the scale is correct only along one direction.

The dot matrix construct highlights the absolute numbers of shots, hits and misses but barely addresses the key issue of hit rates (accuracy). Specifically, this data set was presumably collected to explore the relationship between hit rates and distances from the target.  The use of different widths clouds our judgement of proportions.  To wit, it is not obvious that the 10-wide block and the 40-wide block shown left depict roughly equal hit rates (23%, 29%).

The junkart version adopts a different approach.  This is the Lorenz curve, often used to show income inequality (see also here and here).  Here, the shots were ordered from closest to furthest from target, then summed up by distance segments.  For example, shots from 0 to 6 feet accounted for 60% of all shots but 72% of all hits.

If distance did not affect hit rates, we'd expect the closest 60% of all shots to account for 60% of all hits.  This data point would show up on the 45-degree diagonal on the chart, labelled "totally unpredictable".  Any data appearing above the diagonal indicates that closer shots are more accurate, accounting for more than their fair share of hits.
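For readers who want to build this kind of chart, here is a rough sketch of the cumulative-share calculation.  The bucket counts are made up for illustration (chosen only so that the closest bucket holds 60% of shots and 72% of hits, as in the example above); they are not the figures from the Firearms Discharge Report, and `lorenz_points` is a hypothetical helper name.

```python
def lorenz_points(buckets):
    """Return cumulative (share of shots, share of hits) points.

    buckets is a list of (label, shots, hits) tuples, already ordered
    from the closest distance to the furthest.
    """
    total_shots = sum(shots for _, shots, _ in buckets)
    total_hits = sum(hits for _, _, hits in buckets)
    points = [(0.0, 0.0)]  # the curve starts at the origin
    cum_shots = 0
    cum_hits = 0
    for _, shots, hits in buckets:
        cum_shots += shots
        cum_hits += hits
        points.append((cum_shots / total_shots, cum_hits / total_hits))
    return points


# Hypothetical counts, for illustration only (not the NYPD data).
hypothetical_buckets = [
    ("0-6 ft", 60, 18),
    ("7-20 ft", 30, 5),
    ("21+ ft", 10, 2),
]

points = lorenz_points(hypothetical_buckets)
for shot_share, hit_share in points:
    # A point sits above the diagonal when hit_share exceeds shot_share.
    print(f"{shot_share:.0%} of shots -> {hit_share:.0%} of hits")
```

Plotting these points against the 45-degree line gives the chart; the further the curve bows above the diagonal, the more distance matters.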

Comparing the fitted blue line and the diagonal, one sees that distance is a weak predictor of hit rate.  The police commissioner explains this in the article; many other variables also affect accuracy, including "the adrenaline flow, the movement of the target, the movement of the shooter, the officer, the lighting conditions, the weather..."

Note that the shots with "unknown" distances were removed from the analysis.  Also, the categories of 21-45 and 45-above were combined: the rates were similar and with only three hits, it does not make sense to treat these as separate categories.

Of course, this version would not work well in the mass media.  For that, one can just plot hit rates against the distance categories.

Source: "A Hail of Bullets, a Heap of Uncertainty", New York Times, Dec 9 2007; New York Firearms Discharge Report 2006.

## The Immigrants' Path

##### Jun 15, 2007

A recent Wall Street Journal editorial used this chart (via the National Foundation for American Policy) to claim success for the "Bracero" guest worker program, initiated in 1942.  Their analysis:

... illegal border crossings subsequently plummeted.  Between 1953 and 1959, they fell by some 95%.  In 1960, mainly in response to complaints from labor unions, the program was scaled back and eventually phased out.

Long-time readers may recall Friedman's Crossover Law of Petropolitics, where the opportune criss-crossing of lines plotted along double axes was taken as proof of causality.  Friedman's Law lurked here, right in the 1953-1959 range.

The NFAP went one better: in their original version, they blew up the 1953-1959 period to show us the criss-crossing lines!

We see trouble right from the start.  The "subsequent" effect that proved the case occurred in 1953, over 10 years after the program started. During that first decade, the number of apprehensions rose 4388%, in spite of the guest worker program.

A scatter plot (below left) now shows the lack of any meaningful relationship between these two variables.  While high admissions appeared together with low apprehensions, any level of admissions had historically been paired with low apprehensions.

On the right, I connected the dots in chronological order.  Any claim of a negative relationship between admissions and apprehensions has been debunked.  From 1942 on (as we trace the line clockwise from lower left), first the nation experienced stepwise increasing admissions coupled with stepwise increasing apprehensions; then it witnessed sharply dropping apprehensions with relatively stable admissions; and finally it saw plummeting admissions while apprehensions remained low.  Three separate episodes, three distinct patterns.  There was no association, let alone causation.

Source: "Immigration Plan B", Wall Street Journal, June 13 2007.

## Shower of bullets

##### Apr 25, 2007

Here's one of those infographics that makes the reader work hard (via Dustin J).  The graphic in its full glory is here; it's much too large to be reproduced, and I have clipped off the bottom half.

Much to the designer's credit, he extracted data of interest, rather than trying to cram everything onto the page.  In particular, he was most interested in the distribution of deaths among different age groups, the types of deaths (suicides, homicides) and the identities of the deceased (race, gender).

Just like the election fraud graphic, such rich data lend themselves to multiple levels of aggregation.  Here, the designer focuses on the most detailed level, making it easiest to see facts like "among the 18-25 age group, there were 6 black men murdered per day".

However, it takes much more attention to notice higher-level facts like "homicides per day are relatively flat across age groups while suicides heavily skew toward 40+".

In the junkart version, I decided to emphasize the more aggregated data, showing the number of deaths of each type across age groups.  The detailed breakdown by race and gender is tucked into parentheses, as it can be skipped by less serious readers.

The reader who discovers the homicide/suicide pattern described above may surmise that homicide gunfire deaths are more "random" while suicides, being premeditated, may affect older people disproportionately.  More research would be needed to confirm this and other suspicions.

Source: "An Accounting of Daily Gun Deaths", New York Times, April 21 2007.

## Embedding logic

##### Apr 20, 2007

Bernard L. (from France) submitted this bubble chart for consideration.  It accompanied an NYT article claiming the absence of evidence of election fraud.  (Of course, as is well-known, absence of evidence is not the same as evidence of absence.  Here, I'm purely interested in data presentation.)

As a seasoned consultant, Bernard asked if a Marimekko chart would be superior.

This is one ambitious chart.  Ignoring the bubbles (which are more nuisance than anything), we are asked to interpret data at three different levels of aggregation in one go.

First, there were 95 cases classified into five indictment types.  Second, these cases resulted in either convictions or acquittals/dismissals.  Third, among the cases ending in convictions (the highlighted area), we were shown the occupations of those convicted.

By flattening three levels into one table, some key information is obscured.  For example, how many cases resulted in conviction?  The reader has to compute either 95-25 or 26+31+10+3.  What percent of civil rights violation convictions were committed by party/campaign workers?  It's not 2/3 = 67% (bottom row) but rather 2/2 = 100%.

The following junkart brings out the logic that is embedded in the complicated bubble-table.  While there is a lot on the page, the text labels plus the flow directions allow readers to absorb the data one level at a time.

I have not attempted the Marimekko as I am not a fan of such charts.  You're welcome to try.

Source: "In 5-Year Effort, Scant Evidence of Voter Fraud", New York Times, April 2007.

PS. I will be working through the backlog of reader submissions.  Thanks for your patience.  Keep them coming!

Remark (Apr 25 2007): Thanks to readers for keeping me honest (see comments below).  The conviction rates shown previously were indeed the inverse.  I have now fixed them.

## Criminal chart

##### Mar 08, 2007

The Times found a sharp surge in violent crimes.

Uh-oh.

The legend for the columns is missing.

The maximum murder rate of about 45 per 100,000 in the top chart is drawn as a column 9x as tall as the one showing the minimum aggravated assault rate of about 60 per 100,000 in the bottom chart, even though the second number is the larger one.

Sorting by murder rate does disservice to the bottom chart, rendering it essentially unreadable.

Reference: "Violent Crime in Cities Shows Sharp Surge", New York Times, March 9 2007.

## Finding dots

##### Nov 04, 2006

Erik W. alerted me to this CNN map that shows FBI statistics about the safety of American cities.  As Erik pointed out, this is prototypical of chartjunk a la Tufte.  A lot of ink is used to depict 12 points of data (top 3 cities in safety, crime, improvement and decline).

Imagine the reader trying to find the 3rd most improved city.  She either has to find all the blue dots and then figure out which is #3; or she needs to find all the #3 dots and figure out which is blue.  As they say, it's "hard work".  In fact, finding the dots among the forest of large text is hard work by itself!

How would I re-make this chart?

• Highlight only the states containing data (California, Michigan, Missouri, Ohio, Georgia, New Jersey, New York); gray out all other states and their boundaries
• Separate the states from the cities; only write the State name once for each State; reduce the font size
• Instead of dots, use numbers.  So the most dangerous city (St Louis) gets a red "1", Oakland gets a purple "3", etc.
• Remove Mexico, Canada and water from the map

The map gives the false impression that crime is relevant only along the coasts and the lakes, when in fact, the map is just saying that most cities in the U.S. are located along the coasts and the lakes.  Using such a map to depict city-level statistics creates distortion because cities are not evenly distributed across America.

Beyond that, what is the point of this map?  Is it merely a geography class telling us where each city is located?  How is it better than a simple table listing the cities in order?

Reference: "U.S. City Safety Rankings", CNN, 2006.

## Where are the crimes?

##### Sep 29, 2006

The author of this data table and the readers are asking the same question, "Where are the crimes?", but for different reasons.

While the author wanted to convey regional differences in crime growth, as readers, we are not sure which part of the table to look at; every cell is given equal "weight".

Judging from this "profile plot", we can conclude:

• the Mid-West (blue line) experienced a crime spurt that is very much worse than the national average (dots) in all categories except forcible rapes and murder
• the West (red line), in general, had crime increases less severe than the national average
• that said, the regional profiles are relatively similar, showing few meaningful regional differences (compared to other profile plots I've seen)

Reference: "Communities Grapple With Rise in Violence", MSNBC.com
Thanks to Maya for sending in the link.