## What if the Washington Post did not display all the data

##### Apr 23, 2015

Thanks to reader Chris P., I was able to get the police staffing data to play around with. Recall from the previous post that the Washington Post made the following scatter plot, comparing the proportion of whites among police officers with the proportion of whites among all residents, by city.

In the last post, I suggested making a histogram. As you see below, the histogram was not helpful.

The histogram does point out one feature of the data. Despite the appearance of dots scattered about, the slopes (equivalently, angles at the origin) do not vary widely.

This feature causes problems with interpreting the scatter plot. The difficulty arises from the need to estimate dot density everywhere. This difficulty, sad to say, is introduced by the designer. It arises from using overly granular data. In this case, the proportions are recorded to one decimal place. This means that a city with 10% is plotted separately from one with 10.1%. The effect is the same as jittering the dots, which muddies up the densities.

One way to solve this problem is to use a density chart (heatmap).

You no longer have every city plotted but you have a better view of the landscape. You learn that most of the action occurs on the top row, especially on the top right. It turns out there are lots of cities (22% of the dataset!) with 100% white police forces.
This group of mostly small cities is obscuring the rest of the data. Notice that the yellow cells contain very little data, fewer than 10 cities each.
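For readers who want to try the binning idea themselves, here is a minimal sketch in Python. It is not how the charts in this post were made, and the five cities are made-up values, not the Post's data. The function names and the 10-point cell width are assumptions chosen for illustration.

```python
from collections import Counter

def bin_index(pct, width):
    """Map a percentage in [0, 100] to a bin index; 100 falls in the last bin."""
    n_bins = int(100 // width)
    return min(int(pct // width), n_bins - 1)

def density_grid(cities, width=10):
    """Count cities per (resident bin, police bin) cell."""
    return Counter(
        (bin_index(residents, width), bin_index(police, width))
        for residents, police in cities
    )

# Hypothetical (pct white residents, pct white police) pairs.
cities = [(10.0, 10.1), (10.1, 12.4), (85.3, 100.0), (92.7, 100.0), (60.2, 71.9)]
grid = density_grid(cities)

for cell, count in sorted(grid.items()):
    print(cell, count)
```

Cities at 10% and 10.1% now land in the same cell, so the count in each cell stands in for the dot density you would otherwise have to estimate by eye.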

For the question the reporter is addressing, the subgroup of cities with 100% white police forces is trivially important. Most of these places have at least 60% white residents, frequently much higher. But if every police officer is white, then the racial balance will almost surely be "off". I now remove this subgroup from the heatmap:
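Setting this subgroup aside is a simple filter. The sketch below assumes each city is a pair of percentages (white residents, white police); the data are invented for illustration and the function name is hypothetical.

```python
def split_all_white(cities):
    """Separate cities whose police force is reported as 100% white."""
    all_white = [c for c in cities if c[1] >= 100.0]
    rest = [c for c in cities if c[1] < 100.0]
    return all_white, rest

# Hypothetical (pct white residents, pct white police) pairs.
cities = [(10.0, 10.1), (10.1, 12.4), (85.3, 100.0), (92.7, 100.0), (60.2, 71.9)]
all_white, rest = split_all_white(cities)

# Share of the toy dataset that falls in the excluded subgroup.
share = len(all_white) / len(cities)
print(f"{share:.0%} of cities excluded, {len(rest)} remain")
```

Reporting the size of the excluded group alongside the filtered chart keeps the reader aware of what was taken out.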

Immediately, you are able to see much more. In particular, you see a ridge in the expected direction. The higher the proportion of white residents, the higher the proportion of white officers.

But this view is also too granular. The yellow cells now have only one or two cities. So I collapse the cells.
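Collapsing cells amounts to merging neighboring bins and adding up their counts. A rough sketch, assuming the fine grid is stored as counts keyed by (column, row) indices; the counts here are invented and the merge factor of 2 is only an example.

```python
from collections import Counter

def collapse(grid, factor=2):
    """Merge each factor-by-factor block of cells into one coarser cell."""
    coarse = Counter()
    for (i, j), n in grid.items():
        coarse[(i // factor, j // factor)] += n
    return coarse

# Hypothetical fine-grained cell counts, mostly ones and twos.
fine = Counter({(6, 7): 1, (7, 7): 2, (7, 6): 1, (2, 3): 1})
coarse = collapse(fine)

print(dict(coarse))
```

Three sparse cells become one cell holding four cities, which is easier to read as a density than a scatter of ones and twos.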

More of the data lie above the diagonal running from bottom left to top right, indicating that in the U.S., the police force is skewed white on average. When comparing cities, we can take this national bias out. The following view does this.

The circle marks the average city, which sits at relative proportions of zero and zero. Notice that now, the densest regions are clustered around the 45-degree dotted diagonal.
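One plausible way to produce relative proportions is to subtract the average from each axis. The sketch below assumes an unweighted mean across cities, which may differ from the exact calculation behind the chart; the three cities are invented.

```python
from statistics import mean

def center(cities):
    """Express each city relative to the (unweighted) average city."""
    mean_residents = mean(r for r, _ in cities)
    mean_police = mean(p for _, p in cities)
    return [(r - mean_residents, p - mean_police) for r, p in cities]

# Hypothetical (pct white residents, pct white police) pairs.
cities = [(20.0, 35.0), (50.0, 60.0), (80.0, 85.0)]
centered = center(cities)

print(centered)
```

After centering, the average city sits at the origin, and a city on the 45-degree line through it differs from the average by the same amount on both axes.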

To conclude, the Washington Post data appear to show these insights:

• There is a national bias of whites being more likely to be in the police force
• In about one-fifth of the cities, the entire police force is reported to be white. (The following points exclude these cities.)
• Most cities conform to the national bias, within an acceptable margin of error
• There are a small number of cities worth investigating further: those that are far away from the 45-degree line through the average city in the final chart shown above.
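The last point can be turned into a screening rule: measure each city's perpendicular distance from the 45-degree line through the average city, and flag the large ones. This is only a sketch. The threshold of 15 points is an arbitrary assumption, the data are made up, and the average is taken as an unweighted mean.

```python
import math
from statistics import mean

def distance_from_diagonal(cities):
    """Signed perpendicular distance from the 45-degree line through the average city."""
    mean_residents = mean(r for r, _ in cities)
    mean_police = mean(p for _, p in cities)
    return [
        ((p - mean_police) - (r - mean_residents)) / math.sqrt(2)
        for r, p in cities
    ]

def flag_outliers(cities, threshold):
    """Return the cities lying farther than threshold from the line."""
    distances = distance_from_diagonal(cities)
    return [c for c, d in zip(cities, distances) if abs(d) > threshold]

# Hypothetical (pct white residents, pct white police) pairs.
cities = [(20.0, 30.0), (40.0, 50.0), (60.0, 70.0), (80.0, 90.0), (50.0, 10.0)]
flagged = flag_outliers(cities, threshold=15.0)

print(flagged)
```

A positive distance means the police force is whiter than the national pattern would suggest for that city, and a negative distance means less white.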

Showing all the data is not necessarily a good solution. Indeed, it is frequently a suboptimal design choice.


The problem with the histogram of the ratio is that you've compressed the entire left tail (i.e. all municipalities in which there are proportionally more non-white officers) into one bar. I'd be interested to see the histogram with x on a log-scale, which might make the skew more apparent.

timthompson: a log odds scaling is appropriate for this data but not appropriate for Washington Post readers. I neglected to mention that the histogram I posted already lopped off a much longer right tail. The problem comes from this particular ratio scale. If the proportion of white residents is low, the ratio will explode easily.

I've never been called "Charles P." before ;-)

What software are you using for the Heatmaps?

All the charts in this post are created in JMP's Graph Builder, which I use often to make sketches. R also has good heatmap functions.

In the case of average it is clear: densest regions are clustered around the 45-degree dotted diagonal by default.

Average is good to compare different regions - how far from the middle is one region - but in this case it makes no sense.
