On Twitter, someone pointed me to the following map of journalists who were killed between 1993 and 2015.
I wasn't sure if the person who posted this liked or disliked this graphic. We see a clear metaphor of gunshots and bloodshed. But in delivering the metaphor, a number of things are sacrificed:
the number of deaths is hard to read
the location of deaths is distorted, both in large countries (Russia) where the deaths are too concentrated, and in small countries (Philippines) where the deaths are too dispersed
despite the use of a country-level map, it is hard to learn the deaths by country
The Committee to Protect Journalists (CPJ), which publishes the data, used a more conventional choropleth map, which was reproduced and enhanced by Global Post:
They added country names and death counts via a list at the bottom. There is also now a color scale. (Note the different sets of dates.)
In a Trifecta Checkup, I would give this effort a Type DV. While the map is competently produced, it doesn't get at the meat of the data. In addition, these raw counts of deaths do not reveal much about the level of risk experienced by journalists working in different countries.
The limitation of the map can be seen in the following heatmap:
While this is not a definitive visualization of the dataset, I use this heatmap to highlight the trouble with hiding the time dimension. Deaths are correlated with particular events that occurred at particular times.
Iraq is far and away the most dangerous, but only after the start of the Iraq War, and primarily during the war and its immediate aftermath. Similarly, it was perfectly safe to work in Syria until the last few years.
A journalist can use this heatmap as a blueprint, and start annotating it with various events that are causes of heightened deaths.
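The table behind a heatmap like this is just a country-by-year count. Here is a minimal sketch of that tabulation, assuming each death is recorded as a (country, year) pair. The records below are made-up placeholders, not CPJ data, and `country_year_grid` is a name I invented for illustration.

```python
from collections import Counter

# Hypothetical records, one per death: (country, year). Not real CPJ data.
records = [
    ("Iraq", 2004), ("Iraq", 2004), ("Iraq", 2006),
    ("Syria", 2012), ("Syria", 2013), ("Syria", 2013),
    ("Philippines", 2009),
]

def country_year_grid(records):
    """Return (countries, years, grid), where grid[i][j] counts the deaths
    in countries[i] during years[j]. Years with no deaths are kept as zeros
    so that the time axis has no gaps."""
    counts = Counter(records)
    countries = sorted({country for country, _ in records})
    first = min(year for _, year in records)
    last = max(year for _, year in records)
    years = list(range(first, last + 1))
    grid = [[counts[(country, year)] for year in years] for country in countries]
    return countries, years, grid

countries, years, grid = country_year_grid(records)
for country, row in zip(countries, grid):
    print(f"{country:<12}", " ".join(str(n) for n in row))
```

Keeping the empty years in the grid matters: the gaps are what show that the deaths cluster around particular events.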
Now the real question in this dataset is the risk faced by journalists in different countries. The death counts give a rather obvious and thus not so interesting answer: more journalists are killed in war zones.
A denominator is missing. How many journalists are working in the respective countries? How many non-journalists died in the same countries?
Also, separating out the causes of death can be insightful.
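To make the point about the missing denominator concrete, here is a small sketch of a rate calculation. The figures are invented purely for illustration (I do not have counts of working journalists by country), and `deaths_per_thousand` is a hypothetical helper.

```python
def deaths_per_thousand(deaths, working_journalists):
    """Deaths per 1,000 working journalists.
    Returns None when the denominator is unknown or zero."""
    if not working_journalists:
        return None
    return 1000 * deaths / working_journalists

# Hypothetical figures for illustration only (not CPJ data).
example = {
    "Country A": {"deaths": 50, "journalists": 2_000},
    "Country B": {"deaths": 50, "journalists": 40_000},
}

rates = {
    name: deaths_per_thousand(d["deaths"], d["journalists"])
    for name, d in example.items()
}
for name, rate in rates.items():
    print(name, rate)
```

Both invented countries have the same raw count, yet the risk to an individual journalist differs by a factor of twenty. That is the information the raw counts cannot show.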
Thanks to reader Charles Chris P., I was able to get the police staffing data to play around with. Recall from the previous post that the Washington Post made the following scatter plot, comparing the proportion of whites among police officers relative to the proportion of whites among all residents, by city.
In the last post, I suggested making a histogram. As you see below, the histogram was not helpful.
The histogram does point out one feature of the data. Despite the appearance of dots scattered about, the slopes (equivalently, angles at the origin) do not vary widely.
This feature causes problems with interpreting the scatter plot. The difficulty arises from the need to estimate dot density everywhere. This difficulty, sad to say, is introduced by the designer. It arises from using overly granular data. In this case, the proportions are recorded to one decimal place. This means that a city with 10% is shown separately from one with 10.1%. The effect is to jitter the dots, which muddies up the densities.
One way to solve this problem is to use a density chart (heatmap).
You no longer have every city plotted but you have a better view of the landscape. You learn that most of the action occurs on the top row, especially on the top right. It turns out there are lots of cities (22% of the dataset!) with 100% white police forces. This group of mostly small cities is obscuring the rest of the data. Notice that the yellow cells contain very little data, fewer than 10 cities each.
For the question the reporter is addressing, the subgroup of cities with 100% white police forces is trivially important. Most of these places have at least 60% white residents, frequently much higher. But if every police officer is white, then the racial balance will almost surely be "off". I now remove this subgroup from the heatmap:
Immediately, you are able to see much more. In particular, you see a ridge in the expected direction. The higher the proportion of white residents, the higher the proportion of white officers.
But this view is also too granular. The yellow cells now have only one or two cities. So I collapse the cells.
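Collapsing the cells amounts to counting cities in wider bins. The following is a rough sketch of that step, not the code behind my charts. It assumes each city is a (percent white residents, percent white police) pair, the sample values are invented, and the 10-point bin width is an arbitrary choice.

```python
from collections import Counter

def bin_index(pct, width):
    """Map a percentage in [0, 100] to a bin index; 100 falls in the last bin."""
    n_bins = int(100 // width)
    return min(int(pct // width), n_bins - 1)

def density_grid(cities, width=10):
    """Count cities per (resident bin, police bin) cell of size width x width."""
    return Counter(
        (bin_index(residents, width), bin_index(police, width))
        for residents, police in cities
    )

# Hypothetical cities: (percent white residents, percent white police).
cities = [(10.0, 35.2), (10.1, 36.9), (62.5, 80.0), (68.0, 84.3), (95.0, 100.0)]

fine = density_grid(cities, width=1)     # every city lands in its own cell
coarse = density_grid(cities, width=10)  # nearby cities share a cell
print(len(fine), "cells at 1-point bins;", len(coarse), "cells at 10-point bins")
```

With 1-point bins, every invented city sits alone, so no density is visible. With 10-point bins, neighbors start to pile up in the same cell.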
More of the data lie above the bottom-left-to-top-right diagonal, indicating that in the U.S., the police force is skewed white on average. When comparing cities, we can take this national bias out. The following view does this.
The point marked by the circle is the average city, which sits at relative proportions of zero and zero. Notice that now, the densest regions are clustered around the 45-degree dotted diagonal.
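Taking the national bias out can be sketched as a simple centering step. I am assuming here that the "average city" is the unweighted mean across cities. A population-weighted average would be another reasonable choice. The sample values are invented.

```python
from statistics import mean

def center_on_average(cities):
    """Subtract the average resident and police proportions from each city,
    so that the average city sits at (0, 0)."""
    avg_residents = mean(residents for residents, _ in cities)
    avg_police = mean(police for _, police in cities)
    return [(residents - avg_residents, police - avg_police)
            for residents, police in cities]

# Hypothetical cities: (percent white residents, percent white police).
cities = [(40.0, 60.0), (60.0, 70.0), (80.0, 95.0)]
centered = center_on_average(cities)

# Vertical gap from the 45-degree line through the average city.
# Positive means the force is whiter than the national bias alone predicts.
gaps = [police - residents for residents, police in centered]
print(centered)
print(gaps)
```

Cities with gaps far from zero are the ones that sit far from the dotted diagonal.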
To conclude, the Washington Post data appear to show these insights:
There is a national bias of whites being more likely to be in the police force
In about one-fifth of the cities, the entire police force is reported to be white. (The following points exclude these cities.)
Most cities conform to the national bias, within an acceptable margin of error
There are a small number of cities worth investigating further: those that are far away from the 45-degree line through the average city in the final chart shown above.
Showing all the data is not necessarily a good solution. Indeed, it is frequently a suboptimal design choice.
The Washington Post has a good idea. Using Census data, they computed the proportion of police force who are white and the corresponding proportion of citizens who are white, in different cities.
In the following scatter plot, they singled out North Charleston, SC where the police force is 85% white but the citizens are only 40% white: (Link to the interactive chart.)
This plot itself is well done, with helpful coloring and labels.
One must be careful about "story time": it's easy to infer from the graph that blue dots mean worse racial tension but that interpretation requires an assumption not proven in the data. (What is missing is the correlation between this data and some other data measuring tension.)
The secret to reading this chart is to look at the slopes of lines from the origin to each point. Above the 45-degree diagonal separating the blue dots from the gray are the cities where the police force is more white than the residents. The steeper the line from the origin, the more unrepresentative. Once you pass the 45-degree line, do the reverse.
The slope is really the metric of X police per Y residents. So the two dimensions can be collapsed into one. With the one dimension, I'd try a histogram view. If you find the data, let me know. Or just post it to the comments.
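Collapsing the two dimensions into one might look like the sketch below. The North Charleston figures (85% and 40%) come from the chart. The function names and the 5-degree bin width are my own choices for illustration.

```python
import math
from collections import Counter

def slope(residents_white, police_white):
    """Police proportion per unit of resident proportion.
    Returns None when the resident proportion is zero."""
    if residents_white == 0:
        return None
    return police_white / residents_white

def angle_deg(residents_white, police_white):
    """Angle at the origin, in degrees; 45 means police match residents."""
    return math.degrees(math.atan2(police_white, residents_white))

def angle_histogram(cities, width=5):
    """Count cities per angle bin of the given width (in degrees)."""
    return Counter(int(angle_deg(r, p) // width) * width for r, p in cities)

# North Charleston, SC: 40% white residents, 85% white police.
nc_slope = slope(40, 85)
nc_angle = angle_deg(40, 85)
print(nc_slope, round(nc_angle, 1))
```

The angle has one practical advantage over the raw slope for a histogram: it stays between 0 and 90 degrees, while the slope blows up for cities with very few white residents.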
I like this New York Times graphic illustrating the (over-the-top) reaction by the New York police to the Eric Garner-inspired civic protests during the holidays. This is a case where the data told a story that mere eyes and ears couldn't. The semi-strike was clear as day from the visualization.
There are three sections to the graphic, and each displays a different form of comparisons.
The first chart is the most straightforward, comparing the number of summonses this year to that of the same time a year ago.
One could choose lines for both data series. The combination of one line and column also works. It creates a sensation that the columns should grow in height to meet last year's level. The traffic cops appear to have returned to work more quickly. That said, I don't care for the shades of brown/orange of the columns.
The second chart accommodates a more complex scenario, one in which the simple year-on-year comparison is regarded as misleading because the overall crime rate materially dropped from 2013 to 2014. In this scenario, a before-after comparison may be more valid.
The chart has multiple sections and I am only showing the section concerning summonses. (The horizontal axis shows time, the first black column being the first ten months, and the other orange columns being individual months since then. The vertical axis is the percent change from a year ago.)
The chart shows that in the first ten months of 2014, before the semi-strike, the number of summonses issued was already slightly below the same period the year before. Through the dotted line, the reader is invited to compare this level of change against those in the ensuing months. How starkly the summons rate fell!
The final chart reveals yet another comparison. Geography is introduced here in the form of a proportional-symbol map.
Again, you can't miss the story: across every precinct, summonses have disappeared. This chart is very helpful in making the case that the observed drop is not natural.
Alberto links to a nice ProPublica chart on average annual spend per dialysis patient on ambulances by state. (link to chart and article)
It's a nice small-multiples setup with two tabs, one showing the states in order of descending spend and the other, alphabetical.
In the article itself, they excerpt the top of the chart containing the states that have suspiciously high per-patient spend.
Several types of comparisons are facilitated: comparison over time within each state, comparison of each state against the national average, comparison of trend across states, and comparison of state to state given the year.
The first comparison is simple as it happens inside each chart component.
The second type of comparison is enabled by the orange line being replicated on every component. (I'd have removed the columns from the first component as they are both redundant and potentially confusing, although I suspect that the designer may need them for technical reasons.)
The third type of comparison is also relatively easy. Just look at the shape of the columns from one component to the next.
The fourth type of comparison is where the challenge lies for any small-multiples construction. This is also a secret of this chart. If you mouse over any year on any component, every component now highlights that particular year's data so that one can easily make state by state comparisons. Like this for 2008:
You see that every chart now shows 2008 on the horizontal axis and the data label is the amount for 2008. The respective columns are given a different color. Of course, if this is the most important comparison, then the dimensions should be switched around so that this particular set of comparisons occurs within a chart component--but obviously, this is a minor comparison so it gets minor billing.
I love to see this type of thoughtfulness! This is an example of using interactivity in a smart way, to enhance the user experience.
The Boston subway charts I featured before also introduce interactivity in a smart way. Make sure you read that post.
Also, I have a few comments about the data analysis on the sister blog.
This chart from Reuters is making the rounds on Twitter today.
Quickly, tell me whether the Gun Law in Florida did well or poorly.
That of course is the entire purpose of the chart.
If you are like me, that is, you have knowledge in your head of time-series line charts, you probably experienced that moment where the bottom fell out and you didn't know which way was up.
This is the double edge of novelty in charts. There should be a very high bar against running counter to convention. Readers do bring their "baggage" to the chart, and the designer should take that into consideration.
Some commentators are complaining about trickery. That may be true. But it's also possible the designer actually thought reversing the direction of the vertical axis made the chart better.
Don't forget that we have another convention: up is good and down is bad. Fewer murders is good and more murders is bad. So why not make it such that a rising line indicates goodness (fewer murders)?
Going back to the Trifecta Checkup. This chart has dual problems. We just talked about the syncing between the data and the graphical element.
The other issue is that the data is insufficient to draw conclusions about the underlying question: what explains the shift in number of murders since the late 2000s? This is a complex problem--the chapter in Freakonomics about abortion and crime rate is still instructive, not for the disputed conclusion but for the process of testing various hypotheses. The reduction of the complex causal structure to a single factor is dissatisfying.
Note: If you are here to read about Google Flu Trends, please see this roundup of the coverage. My blog is organized into two sections: the section you are on is about data visualization; the other section concerns Big Data and use of statistical thinking in daily life--click to go there. Or, you can follow me on Twitter which combines both feeds.
Because the visual medium is powerful, it is a favorite of advocates. Creating a chart for advocacy is tricky. One must strike the proper balance between education and messaging. The chart needs to present the policy position strongly and also enlighten the unconverted with useful information.
In my interview with MathBabe Cathy O'Neil (link), she points to this graphic by Pew that illustrates where death-penalty executions have been administered in the past two decades in the U.S. (link) Here is a screenshot of the geographic distribution for 2006:
The chart is a variant of the CDC map of obesity, which I discussed years ago. At one level, the structure of the data is the same. Each state is evaluated on a particular metric (proportion obese, and number of executions) once a year. Both designers choose to roll through a sequence of small-multiple maps.
The key distinction is that the obesity map encodes the data in color while the executions map encodes data in the density of semi-transparent, overlapping dots, each dot representing a single execution.
Perhaps the idea is to combat one of the weaknesses of color encoding: humans don't have an instinctive sense of the mapping between a numerical scale and a color scale. If the color transitions from yellow to orange, how many more executions would that map to? By contrast, if you see 200 dots instead of 160, you know the difference is 40.
The switch to the dots aesthetic introduces a host of problems.
Density, as you recall from geometry class, is the count divided by the area. High density can be due to a lot of executions or a very small area. Look at Delaware (DE) versus Georgia (GA). The density of red appears similar but there have been far fewer executions in Delaware.
This is a serious mistake. By using dot density, the designer encourages readers to think in terms of area of each state but why should the number of executions be related to area? As Cathy pointed out, a more relevant reference point is the population of each state. An even cleverer reference point might be the number of criminals/convictions in each state.
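The trouble with area as the implied denominator can be shown with a toy calculation. The two states below are invented (they are not Delaware and Georgia), and the numbers are chosen so that the dot densities match exactly.

```python
# Hypothetical states for illustration only; not real execution data.
states = {
    "Small state": {"executions": 10, "area_sq_mi": 2_000, "population": 1_000_000},
    "Large state": {"executions": 300, "area_sq_mi": 60_000, "population": 10_000_000},
}

def dot_density(state):
    """Executions per square mile: what a dot-density map shows the eye."""
    return state["executions"] / state["area_sq_mi"]

def per_million_residents(state):
    """Executions per million residents: a more relevant reference point."""
    return state["executions"] * 1_000_000 / state["population"]

density = {name: dot_density(s) for name, s in states.items()}
per_capita = {name: per_million_residents(s) for name, s in states.items()}
print(density)
print(per_capita)
```

The two invented states look identical on a dot-density map, yet one has thirty times as many executions and three times the per-capita rate. Which denominator is right depends on the question, but area is rarely it.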
Another design issue relates to the note at the bottom of the chart (shown on the right). Here, the designer is fighting against the reader's knowledge in his/her head. It is natural for a dot on a map to represent location and yet the spatial distribution of the dots here provides no information. Credit the designer for clarifying this in a footnote; but also let this be a warning that there are other visual representations that do not require such disclaimers.
I am confused by why dots appear but never disappear. It seems that the chart is plotting cumulative counts of executions from 1977, rather than the number of executions in each year, as the chart title suggests. (If you go to the Pew website, you find a version with "cumulative" in the title; when they produced the animated gif, they decided to simplify the title, which is a poor decision.)
It requires a quick visit to Wikipedia to learn that there was a break in executions in the 70s. This is a missed opportunity to educate readers about the context of this data. Similarly, a good chart presenting this data should distinguish between states that have banned the death penalty and states that have zero or low numbers of executions.
A great way to visualize this data is via a heatmap. Here, I whipped up a quick sketch (pardon the sideways text on the legend):
I forgot to add the footnote listing the states where the death penalty is banned. I could also add axis labels to the side histogram showing counts.
Josh tweeted quite a shocking attack ad to me last week. He told me it came from the DC Metro. The ad is taken out by a group called HumaneWatch.Org, which apparently is a watchdog checking up on charity organizations. The ad attacks a specific group called the Humane Society of the United States. Here is the map that is the centerpiece of the copy:
I like to use the Trifecta Checkup to evaluate graphics. It's a nice way to organize your visualization critique. You progress through three corners: figuring out what practical question is being addressed by the graphic, then evaluating what data is being deployed, and finally whether the graphical elements (the chart itself) are well executed in relation to the question and the data.
Based on the map, it appears that HumaneWatch is interested in the spending on pet shelters. Every number shown is tiny: on a quick scan, the range may be from 0% to 0.35%. The all-caps title "A Whole Lotta Nothing" confirms that this is the intended message.
Knowing nothing about either of these organizations leaves me confused. Should the "Humane Society" be spending the bulk of its budget on pet shelters? If it doesn't, is it because the staff is pilfering money, or because it has wasteful spending, or because pets are not its major cause, or because pet shelters are not the key way this organization helps pets?
I did look up Humane Society to learn that it is an animal rights group. The four bullet points at the bottom of the ad provide a clue as to what the designer wanted to convey: namely, that this charity is a scam, with too much overhead spending, and spending on pensions.
So I think the question being asked is sufficiently clarified, and it's a pretty important one. How is this organization spending its donations? Is it irresponsible compared to other similar organizations?
The data should be in sync with the question being addressed; that's why there is a link between the two corners of the Trifecta. Given the trouble I endured understanding the question being addressed, it would come as no surprise that this chart scores poorly on the DATA corner.
I don't understand why budget spent on pet shelters is the key bone of contention. Based on the perceived objectives, it seems that they should display directly what proportion of the budget went to overhead, and what proportion went to pensions, with suitable comparisons.
The analysis by state is a disease of having too much data. Let's imagine that the proportions averaged across all states come to 0.1%. If we replaced those 50 numbers with one number printed across all states: "The Humane Society spends less than 0.1% of its budget on pet shelters.", the message would have been identical, while being less confusing.
And it's not just confusion. Cutting the data by state introduces complications. The analyst would need to make sure that any differences between states are not due to factors such as the number of pets, the proportion of households owning pets, the average spending per pet, the supply and demand for pet shelters, the existence of alternatives to pet shelters, etc. None of these issues need to worry the designer who does not slice the data down.
The same reason goes for why the absolute amount of spending (encoded in the colors of individual states) is not worth the ink it's printed on. The range between 0% and 0.35% has been chopped into seven pieces, which creates artificial gaps between the states. This design muddles the graphic's key message, "A Whole Lotta Nothing".
THE CHART ITSELF:
As we land on the final corner of the Trifecta, we ignore our previous complaint and accept that the proportion of budget is an interesting data series to visualize, and turn attention to the graphical elements. This chart scores poorly on chart execution as well!
Notice that the designer simultaneously plots two data series on the same map: the dollar value of pet shelter spending, and that spending as a proportion of budget. The former is encoded in the color of the state areas while the latter is printed directly as data labels. This is a map equivalent of "dual-axes" line charts, and equally unreadable.
Based on the color legend, our brain tells us the yellow states are better than the blue states but the huge numbers printed on the map convey the opposite message. The progression of colors makes little sense. The red and yellow stand out but those states are in the middle of the range.
It's a little blurry but I think there are a number of New England states in the high spending category (black and dark gray colors), and the map just happens to obscure this key feature.