Nice analysis of racial composition of police forces

The Washington Post has a good idea. Using Census data, they computed the proportion of each police force that is white and the corresponding proportion of residents who are white, in different cities.

In the following scatter plot, they singled out North Charleston, SC where the police force is 85% white but the citizens are only 40% white: (Link to the interactive chart.)


This plot itself is well done, with helpful coloring and labels.

One must be careful about "story time": it's easy to infer from the graph that blue dots mean worse racial tension but that interpretation requires an assumption not proven in the data. (What is missing is the correlation between this data and some other data measuring tension.)

The secret to reading this chart is to look at the slopes of the lines from the origin to each point. Above the 45-degree diagonal separating the blue dots from the gray are the cities where the police force is whiter than the population. The steeper the line from the origin, the more unrepresentative the force. Below the 45-degree line, the reverse holds: the flatter the line, the more unrepresentative.

The slope is really the metric of X police per Y residents. So the two dimensions can be collapsed into one. With the one dimension, I'd try a histogram view. If you find the data, let me know. Or just post it to the comments.
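To make the collapse into one dimension concrete, here is a minimal sketch in Python. Only the North Charleston figures (85% and 40%) come from the post; the other cities, the function names and the bin width are made up for illustration. The ratio is the slope of the line from the origin, so a value of 1 sits on the 45-degree diagonal.

```python
# Pairs of (police % white, residents % white).
# Only North Charleston comes from the post; the other rows are hypothetical.
cities = {
    "North Charleston": (85.0, 40.0),
    "City A": (60.0, 60.0),
    "City B": (45.0, 50.0),
    "City C": (90.0, 60.0),
}


def representation_ratio(police_pct, resident_pct):
    """Slope of the line from the origin: police share over resident share."""
    return police_pct / resident_pct


def histogram(values, bin_width=0.5):
    """Count values into bins of the given width, keyed by each bin's lower edge."""
    counts = {}
    for value in values:
        lower_edge = (value // bin_width) * bin_width
        counts[lower_edge] = counts.get(lower_edge, 0) + 1
    return dict(sorted(counts.items()))


ratios = {name: representation_ratio(p, r) for name, (p, r) in cities.items()}
bins = histogram(ratios.values())

# Crude text histogram: one star per city in each bin.
for lower_edge, count in bins.items():
    print(f"{lower_edge:.1f} to {lower_edge + 0.5:.1f}: {'*' * count}")
```

With real data, the bin that contains 1.0 would hold the cities closest to the diagonal, and the tails would show the least representative forces in either direction.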

Three short lessons on comparisons

I like this New York Times graphic illustrating the (over-the-top) reaction by the New York police to the Eric Garner-inspired civic protests during the holidays. This is a case where the data told a story that mere eyes and ears couldn't. The semi-strike was clear as day from the visualization.

There are three sections to the graphic, and each displays a different form of comparison.

The first chart is the most straightforward, comparing the number of summonses this year to that of the same time a year ago.


One could choose lines for both data series. The combination of one line and column also works. It creates a sensation that the columns should grow in height to meet last year's level. The traffic cops appear to have returned to work more quickly. That said, I don't care for the shades of brown/orange of the columns.


The second chart accommodates a more complex scenario, one in which the simple year-on-year comparison is regarded as misleading because the overall crime rate materially dropped from 2013 to 2014. In this scenario, a before-after comparison may be more valid.


The chart has multiple sections and I am only showing the section concerning summonses. (The horizontal axis shows time, with the first black column being the first ten months and the orange columns being the individual months since then. The vertical axis is the percent change from a year ago.)

The chart shows that in the first ten months of 2014, before the semi-strike, the number of summonses issued was already slightly below the same period the year before. Through the dotted line, the reader is invited to compare this level of change against those in the ensuing months. How starkly the summons rate fell!
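The arithmetic behind this kind of before-after comparison can be sketched in a few lines of Python. All the counts below are hypothetical placeholders, not the Times's data, and the function name is my own. The idea is to compute the year-on-year percent change for the baseline period and for each later month, then measure each month against the baseline (the dotted line).

```python
def pct_change(current, previous):
    """Percent change from the same period a year ago."""
    return (current - previous) * 100.0 / previous


# Hypothetical summons counts as (this year, a year ago).
baseline_period = (950, 1000)  # stands in for the first ten months
later_months = {
    "Month 1": (90, 100),
    "Month 2": (40, 100),
    "Month 3": (10, 100),
}

baseline = pct_change(*baseline_period)
changes = {month: pct_change(cur, prev) for month, (cur, prev) in later_months.items()}

# Gap to the dotted line: how much further each month fell than the baseline did.
gap_to_baseline = {month: change - baseline for month, change in changes.items()}

for month in later_months:
    print(f"{month}: {changes[month]:+.1f}% vs. baseline {baseline:+.1f}%")
```

Comparing against the baseline change, rather than against zero, is what guards against the objection that the overall level had already dropped from one year to the next.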


The final chart reveals yet another comparison. Geography is introduced here in the form of a proportional-symbol map.


Again, you can't miss the story: across every precinct, summonses have disappeared. This chart is very helpful in making the case that the observed drop is not natural.



A small step for interactivity

Alberto links to a nice ProPublica chart on average annual spend per dialysis patient on ambulances by state. (link to chart and article)


It's a nice small-multiples setup with two tabs, one showing the states in order of descending spend and the other, alphabetical.

In the article itself, they excerpt the top of the chart containing the states that have suspiciously high per-patient spend.

Several types of comparisons are facilitated: comparison over time within each state, comparison of each state against the national average, comparison of trend across states, and comparison of state to state given the year.

The first comparison is simple as it happens inside each chart component.

The second type of comparison is enabled by the orange line being replicated on every component. (I'd have removed the columns from the first component as they are both redundant and potentially confusing, although I suspect that the designer may need them for technical reasons.)

The third type of comparison is also relatively easy. Just look at the shape of the columns from one component to the next.

The fourth type of comparison is where the challenge lies for any small-multiples construction. This is also a secret of this chart. If you mouse over any year on any component, every component now highlights that particular year's data so that one can easily make state by state comparisons. Like this for 2008:


You see that every chart now shows 2008 on the horizontal axis and the data label is the amount for 2008. The respective columns are given a different color. Of course, if this is the most important comparison, then the dimensions should be switched around so that this particular set of comparisons occurs within a chart component--but obviously, this is a minor comparison so it gets minor billing.


I love to see this type of thoughtfulness! This is an example of using interactivity in a smart way, to enhance the user experience.

The Boston subway charts I featured before also introduce interactivity in a smart way. Make sure you read that post.

Also, I have a few comments about the data analysis on the sister blog.

Conventions, novelty and the double edge

This chart from Reuters is making the rounds on Twitter today.


Quickly, tell me whether the Gun Law in Florida did well or poorly.

That of course is the entire purpose of the chart.


If you are like me, that is, you have knowledge in your head of time-series line charts, you probably experienced that moment where the bottom fell out and you didn't know which way was up.

This is the double edge of novelty in charts. There should be a very high bar against running counter to convention. Readers do bring their "baggage" to the chart, and the designer should take that into consideration.

Some commentators are complaining about trickery. That may be true. But it's also possible the designer actually thought reversing the direction of the vertical axis made the chart better.

Don't forget that we have another convention: up is good and down is bad. Fewer murders is good and more murders is bad. So why not make it such that a rising line indicates goodness (fewer murders)?


Going back to the Trifecta Checkup, this chart has dual problems. We just talked about the syncing between the data and the graphical element.

The other issue is that the data is insufficient to draw conclusions about the underlying question: what explains the shift in number of murders since the late 2000s? This is a complex problem--the chapter in Freakonomics about abortion and crime rate is still instructive, not for the disputed conclusion but for the process of testing various hypotheses. The reduction of the complex causal structure to a single factor is dissatisfying.




Advocacy graphics

Note: If you are here to read about Google Flu Trends, please see this roundup of the coverage. My blog is organized into two sections: the section you are on is about data visualization; the other section concerns Big Data and use of statistical thinking in daily life--click to go there. Or, you can follow me on Twitter which combines both feeds.


Because the visual medium is powerful, it is a favorite of advocates. Creating a chart for advocacy is tricky. One must strike the proper balance between education and messaging. The chart needs to present the policy position strongly and also enlighten the unconverted with useful information.

In my interview with MathBabe Cathy O'Neil (link), she points to this graphic by Pew that illustrates where death-penalty executions have been administered in the past two decades in the U.S. (link) Here is a screenshot of the geographic distribution for 2006:


The chart is a variant of the CDC map of obesity, which I discussed years ago. At one level, the structure of the data is the same. Each state is evaluated on a particular metric (proportion obese, and number of executions) once a year. Both designers choose to roll through a sequence of small-multiple maps.

The key distinction is that the obesity map encodes the data in color while the executions map encodes data in the density of semi-transparent, overlapping dots, each dot representing a single execution.

Perhaps the idea is to combat one of the weaknesses of color encoding: humans don't have an instinctive sense of the mapping between a numerical scale and a color scale. If the color transitions from yellow to orange, how many more executions would that map to? By contrast, if we see 200 dots instead of 160, we know the difference is 40.


The switch to the dots aesthetic introduces a host of problems.

Density, as you recall from geometry class, is the count divided by the area. High density can be due to a lot of executions or a very small area. Look at Delaware (DE) versus Georgia (GA). The density of red appears similar but there have been far fewer executions in Delaware.

This is a serious mistake. By using dot density, the designer encourages readers to think in terms of area of each state but why should the number of executions be related to area? As Cathy pointed out, a more relevant reference point is the population of each state. An even cleverer reference point might be the number of criminals/convictions in each state.
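A small sketch in Python shows how much the choice of reference point matters. The two states and all their numbers are invented for illustration (they are not Delaware and Georgia), and the function names are my own. The numbers are picked so that the dot density per unit area comes out identical while the per-capita rates differ.

```python
# Hypothetical states: all figures are made up for illustration.
states = {
    "Small State": {"executions": 4, "area_sq_mi": 2_000, "population": 1_000_000},
    "Large State": {"executions": 120, "area_sq_mi": 60_000, "population": 10_000_000},
}


def per_thousand_sq_mi(state):
    """What dot density encodes: executions divided by area."""
    return state["executions"] * 1_000 / state["area_sq_mi"]


def per_million_people(state):
    """A more relevant reference point: executions relative to population."""
    return state["executions"] * 1_000_000 / state["population"]


density = {name: per_thousand_sq_mi(s) for name, s in states.items()}
per_capita = {name: per_million_people(s) for name, s in states.items()}

for name in states:
    print(f"{name}: {density[name]:.1f} per 1,000 sq mi, "
          f"{per_capita[name]:.1f} per million people")
```

On a dot-density map these two states would look equally red, even though one has 30 times as many executions and three times the per-capita rate of the other.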

Another design issue relates to the note at the bottom of the chart (shown on the right). Here, the designer is fighting against the reader's knowledge in his/her head. It is natural for a dot on a map to represent location and yet the spatial distribution of the dots here provides no information. Credit the designer for clarifying this in a footnote; but also let this be a warning that there are other visual representations that do not require such disclaimers.


I am confused by why dots appear but never disappear. It seems that the chart is plotting cumulative counts of executions from 1977, rather than the number of executions in each year, as the chart title suggests. (If you go to the Pew website, you find a version with "cumulative" in the title; when they produced the animated gif, they decided to simplify the title, which is a poor decision.)
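The difference between the two readings is easy to see in a short Python sketch. The yearly counts below are hypothetical, and the variable names are my own. A cumulative series can only stay flat or rise, which is exactly the behavior of dots that appear but never disappear.

```python
from itertools import accumulate

# Hypothetical number of executions in each year.
annual = {1977: 1, 1978: 0, 1979: 2, 1980: 0, 1981: 1}

# What the animated map appears to plot: the running total since the first year.
cumulative = dict(zip(annual, accumulate(annual.values())))

# Going the other way: recover each year's count by differencing the running total.
years = list(cumulative)
recovered = {years[0]: cumulative[years[0]]}
for prev_year, year in zip(years, years[1:]):
    recovered[year] = cumulative[year] - cumulative[prev_year]

for year in years:
    print(f"{year}: {annual[year]} that year, {cumulative[year]} to date")
```

A chart of the annual series would show years with no dots at all, while the cumulative series never goes down, so the title needs to say which one is being shown.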

It requires a quick visit to Wikipedia to learn that there was a break in executions in the 70s. This is a missed opportunity to educate readers about the context of this data. Similarly, a good chart presenting this data should distinguish between states that have banned the death penalty and states that have zero or low numbers of executions.


A great way to visualize this data is via a heatmap. Here, I whipped up a quick sketch (pardon the sideways text on the legend):


I forgot to add the footnote listing the states where the death penalty is banned. I could also add an axis label to the side histogram showing counts.



Pets may need shelter from this terrible chart

Josh tweeted quite a shocking attack ad to me last week. He told me it came from the DC Metro. The ad is taken out by a group called HumaneWatch.Org, which apparently is a watchdog checking up on charity organizations. The ad attacks a specific group called the Humane Society of the United States. Here is the map that is the centerpiece of the copy:


I like to use the Trifecta checkup to evaluate graphics. It's a nice way to organize your visualization critique. You progress through three corners: figuring out what practical question is being addressed by the graphic, then evaluating what data is being deployed, and finally whether the graphical elements (the chart itself) are well executed in relation to the question and the data.


Based on the map, it appears that HumaneWatch is interested in the spending on pet shelters. Every number shown is tiny: on a quick scan, the range may be from 0% to 0.35%. The all-caps title "A Whole Lotta Nothing" confirms that this is the intended message.

Knowing nothing about either of these organizations leaves me confused. Should the "Humane Society" be spending the bulk of its budget on pet shelters? If it doesn't, is it because the staff is pilfering money, or because it has wasteful spending, or because pets are not its major cause, or because pet shelters are not the key way this organization helps pets?

I did look up Humane Society to learn that it is an animal rights group. The four bullet points at the bottom of the ad provide a clue as to what the designer wanted to convey: namely, that this charity is a scam, with too much overhead spending, and spending on pensions.


So I think the question being asked is sufficiently clarified, and it's a pretty important one. How is this organization spending its donations? Is it irresponsible compared to other similar organizations?


The data should be in sync with the question being addressed; that's why there is a link between the two corners of the Trifecta. Given the trouble I endured understanding the question being addressed, it would come as no surprise that this chart scores poorly on the DATA corner.

I don't understand why budget spent on pet shelters is the key bone of contention. Based on the perceived objectives, it seems that they should display directly what proportion of the budget went to overhead, and what proportion went to pensions, with suitable comparisons.

The analysis by state is a disease of having too much data. Let's imagine that the proportions averaged across all states come to 0.1%. If we replaced those 50 numbers with one number printed across all states, "The Humane Society spends less than 0.1% of its budget on pet shelters", the message would have been identical while being less confusing.

And it's not just confusion. Cutting the data by state introduces complications. The analyst would need to make sure that any differences between states are not due to factors such as the number of pets, the proportion of households owning pets, the average spending per pet, the supply and demand for pet shelters, the existence of alternatives to pet shelters, etc. None of these issues need to worry the designer who does not slice the data down.

The same reason goes for why the absolute amount of spending (encoded in the colors of individual states) is not worth the ink it's printed on. The range between 0% and 0.35% has been chopped into seven pieces, which creates artificial gaps between the states. This design muddles the graphic's key message, "A Whole Lotta Nothing".


As we land on the final corner of the Trifecta, we ignore our previous complaint and accept that the proportion of budget is an interesting data series to visualize, and turn attention to the graphical elements. This chart scores poorly on chart execution as well!

Notice that the designer simultaneously plots two data series on the same map: the dollar value of pet shelter spending, and that spending as a proportion of the budget. The former is encoded in the color of the state areas while the latter is printed directly as data labels. This is a map equivalent of "dual-axes" line charts, and equally unreadable.

Based on the color legend, our brain tells us the yellow states are better than the blue states but the huge numbers printed on the map convey the opposite message. The progression of colors makes little sense. The red and yellow stand out but those states are in the middle of the range.

It's a little blurry but I think there are a number of New England states in the high spending category (black and dark gray colors), and the map just happens to obscure this key feature.




DATA: Very Poor


Light entertainment: Behold the 10 percent change!

Reader Orjan L. sent in this Swedish delight:


It's on the last page of this report, and I'm told it's about the number of weapons seized by Swedish customs each year.


On p. 8, I found a hockey-stick chart:


Sweden in ecstasy.


For those who love cross-over charts, look no further than p. 3 which has a reverse hockey stick.


Hard work pays off

At the NY Tech Meetup, Andrei Scheinkman showed off some work his team at Huffington Post did relating to gun violence in America.



Interactive version is here. The animation shows day by day, where the victims of gun violence were located. The table below contains the details of each victim, and links to the news story covering the event.


What is not seen on the chart is even more impressive. Andrei described how they looked around for databases that would provide them the raw materials for creating this chart but no timely source exists. This means that a team of 15 (if I heard correctly) spent a month or so manually collecting all the data on a spreadsheet.

It's also the reason why they cannot continue the map indefinitely, as people have other things to do.

Andrei also contrasted this visualization with a text article that describes the state of gun violence in words. You guessed it, the visual presentation is hands-down more compelling.

Doing legwork, doing justice

The New York Times brought attention to the Bronx courtrooms this weekend. (link) The following small-multiples chart effectively illustrates how the Bronx system is uniquely unproductive, compared to the other boroughs:


The above chart shows the outcomes. The next chart shows the possible cause.

It appears that at any time of the day, at least one-third of the courtrooms are not actively conducting business. In fact, outside of the period between 10:30 and 12:30, and 2:30, fewer than half of the courtrooms have a judge present.

I want to draw your attention to the caption below the chart. It said: "The Times visited all 47 courtrooms at the Bronx County Hall of Justice in 30-minute intervals totally how many were open and actively in session, ..."

Too often, we analyze and plot whatever data has been collected conveniently by some machine. Such data frequently do not address the questions we'd like to answer. We let the data dictate our research question.

Most great work in statistics comes from people who put in the effort to define their research goals first, and then manually collect the specific data needed to accomplish those goals.


Interpreting some charts about guns

Felix linked to a set of charts about guns in the U.S. (and elsewhere). The original charts, by Liz Fosslien, are found here.

I like the clean style used by Fosslien. Some of the charts are thought-provoking. Many of them may raise more questions than they answer. Here are a few that caught my eye.


A simplistic interpretation would claim that banning handguns is futile, and may even have an adverse impact on murder rate. However, this chart does not reveal the direction of causality. Did some countries ban handguns because they are reacting to higher violence? If that is the case, this chart is confirming that the countries with handgun bans are a self-selected group.



The U.S. is an outlier, both in terms of firearm ownership and firearm homicides. This makes the analysis much harder because the U.S. is really in a class of its own. It's not at all clear whether there is a positive correlation in the cluster below, and even if there is, it is dubious whether we can extend a straight line up to the U.S. dot.



Fosslien is being cheeky to deny us the identity of the other outlier, the country with few firearms but an even higher death rate from intentional homicide. These scatter plots, by the way, are great for showing bivariate distributions.



I'd still prefer a line chart for this type of data but this particular paired bar chart works for me as well. The contents of this chart are a shock to me.



I just don't get this one. Why is there a fan?