Where but not when and why: deaths of journalists

On Twitter, someone pointed me to the following map of journalists who were killed between 1993 and 2015.


I wasn't sure if the person who posted this liked or disliked this graphic. We see a clear metaphor of gunshots and bloodshed. But in delivering the metaphor, a number of things are sacrificed:

  • the number of deaths is hard to read
  • the location of deaths is distorted, both in large countries (Russia) where the deaths are too concentrated, and in small countries (Philippines) where the deaths are too dispersed
  • despite the use of a country-level map, it is hard to read the death count for any given country

The Committee to Protect Journalists (CPJ), which publishes the data, used a more conventional choropleth map, which was reproduced and enhanced by Global Post:


They added country names and death counts via a list at the bottom. There is also now a color scale. (Note the different sets of dates.)


In a Trifecta Checkup, I would give this effort a Type DV. While the map is competently produced, it doesn't get at the meat of the data. In addition, these raw counts of deaths do not reveal much about the level of risk experienced by journalists working in different countries.

The limitation of the map can be seen in the following heatmap:


While this is not a definitive visualization of the dataset, I use this heatmap to highlight the trouble with hiding the time dimension. Deaths are correlated with particular events that occurred at particular times.

Iraq is far and away the most dangerous, but only after the start of the Iraq War, and primarily during the War and its immediate aftermath. Similarly, it was perfectly safe to work in Syria until the last few years.

A journalist can use this heatmap as a blueprint, and start annotating it with various events that are causes of heightened deaths.
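The country-by-year matrix behind such a heatmap is simple to assemble. Here is a minimal sketch; the individual records below are invented for illustration and are not taken from the CPJ data:

```python
from collections import Counter

# Hypothetical records: one (country, year) pair per journalist killed.
# The CPJ dataset has this shape; these rows are made up for illustration.
deaths = [
    ("Iraq", 2004), ("Iraq", 2004), ("Iraq", 2005),
    ("Syria", 2012), ("Syria", 2013), ("Syria", 2013),
    ("Russia", 1994), ("Philippines", 2009),
]

counts = Counter(deaths)
countries = sorted({c for c, _ in deaths})
years = list(range(1993, 2016))

# Each row of the matrix is a country; each column a year.
# This matrix can be fed to any heatmap routine.
matrix = [[counts.get((c, y), 0) for y in years] for c in countries]
```

The point of the pivot is that the time dimension, hidden in the maps, becomes an explicit axis.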


Now the real question in this dataset is the risk faced by journalists in different countries. The death counts give a rather obvious and thus not so interesting answer: more journalists are killed in war zones.

A denominator is missing. How many journalists are working in the respective countries? How many non-journalists died in the same countries?

Also, separating out the causes of death can be insightful.
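The denominator adjustment is a one-line computation once such data exists. In this sketch, every number is invented; in particular, the journalist headcounts are hypothetical stand-ins for a denominator that would have to be collected:

```python
# Hypothetical death counts and entirely made-up denominators,
# purely to illustrate converting counts into risk rates.
deaths = {"CountryA": 150, "CountryB": 75}
working_journalists = {"CountryA": 800, "CountryB": 5000}  # invented

# Risk per working journalist, rather than raw counts.
risk = {c: deaths[c] / working_journalists[c] for c in deaths}
```

A country with half the deaths can still carry many times the risk if far fewer journalists work there.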

Raw data and the incurious

The following chart caught my eye when it appeared in the Wall Street Journal this month:


This is a laborious design; much sweat has been poured into it. It's a chart that requires the reader to spend time learning how to read it.

A major difficulty for any visualization of this dataset is keeping track of the two time scales. One scale, depicted horizontally, traces the dates of Fed meetings. These meetings seem to occur four times a year except in 2012. The other time scale is encoded in the colors, explained above the chart. This is the outlook by each Fed committee member of when he/she expects a rate hike to occur.

I find it challenging to understand the time scale in discrete colors. Given that time has an order, my expectation is that the colors should be ordered. Adding to this mess is the correlation between the two time scales. As time marches on, certain predictions become infeasible.
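One way to honor the ordering is to assign the outlook categories evenly spaced positions on a single light-to-dark ramp rather than unrelated hues. A minimal sketch, with the category labels assumed from the chart:

```python
# Ordinal categories (assumed labels) mapped onto one sequential ramp,
# so that later outlook years read as progressively darker.
outlooks = ["2012", "2013", "2014", "2015", "2016 or later"]

def ramp(position):
    """Linear ramp from light gray (230) to near-black (20), as an RGB triple."""
    level = round(230 - position * (230 - 20))
    return (level, level, level)

n = len(outlooks)
colors = {o: ramp(i / (n - 1)) for i, o in enumerate(outlooks)}
```

With a ramp like this, a reader can judge "earlier vs. later" at a glance instead of consulting the legend for every cell.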

Part of the problem is the unexplained vertical scale. Eventually, I realize each cell is a committee member, and there are 19 members, although two or three routinely fail to submit their outlook in any given meeting.

Contrary to expectation, I don't think one can read across a row to see how a particular member changes his/her view over time; the cells appear to be rearranged at each meeting to keep the patches of color together.


After this struggle, all I wanted was to learn something from this dataset. Here is what I came up with:


There is actually little of interest in the data. The most salient point is that a shift in view occurred back in September 2012 when enough members pushed back the year of rate hike that the median view moved from 2014 to 2015. Thereafter, there is a decidedly muted climb in support for the 2015 view.
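The median view is straightforward to track per meeting. In this sketch the submission counts are invented, chosen only to reproduce the kind of shift described:

```python
from statistics import median

# Hypothetical outlook submissions (expected year of first rate hike),
# one list per Fed meeting; the counts are invented for illustration.
meetings = {
    "2012-06": [2014] * 11 + [2015] * 6,
    "2012-09": [2014] * 7 + [2015] * 10,
}

# The median outlook summarizes each meeting in one number.
median_view = {m: median(v) for m, v in meetings.items()}
```

Plotting this one number per meeting tells the story that the full grid of colored cells buries.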


This is an example in which plotting elemental data backfires. Raw data is the sanctuary of the incurious.



Shaking up expectations for pension benefits

Ted Ballachine wrote me about his website Pension360 pointing me to a recent attempt at visualizing pension benefits in various retirement systems in the state of Illinois. The link to the blog post is here.

One of the things they did right is to start with an extended guide to reading the chart. This type of thing should be done more often. Here is the top part of this section.


It turns out that the reading guide is vital for this visualization! The reason is that they made some decisions that shake up our expectations.

For example, darker colors usually mean more but here they mean less.

Similarly, a person's service increases as you go down the vertical axis, not up.

I have recommended that they flip both scales, since there doesn't seem to be a strong reason to break these conventions.


This display facilitates comparing the structure of different retirement systems. For example, I have placed next to each other the images for the Illinois Teachers' Retirement System (blue), and the Chicago Teachers' Pension Fund (black).


It is immediately clear that the Chicago system is miserly. The light gray parts extend only to half of the width compared to the blue cells in the top chart. The fact that the annual payout grows somewhat linearly as the years of service increase makes sense.

What doesn't make sense to me, in the blue chart, is the extreme variance in the annual payout for the beneficiary with "average" tenure of about 35 years. If you look at all of the charts, there are several examples of retirement systems in which employees with similar tenure have payouts that differ by an order of magnitude. Can someone explain that?


One consideration for those who make heatmaps using conditional formatting in Excel.

These charts encode the count of people in shades of color, and the reference population is the entire table. This is not the only way to code the data, and this choice prevents us from understanding the "sparsely populated" regions of the heatmap.

Look at any of the pension charts. Darkness reigns at the bottom of each one, in the rows for people with 50 or 60 years of service. This is because there are few such employees (relative to the total population). An alternative is to color code each row separately. Then you have surfaced the distribution of benefits within each tenure group. (The trade-off is the revised chart no longer tells the reader how service years are distributed.)
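The two normalization choices differ by one line of code. A minimal sketch, with a tiny invented table of counts (rows are tenure groups, columns are payout bands):

```python
# Two ways to normalize a table of counts before mapping values to colors:
# against the whole table, or within each row. The counts are invented.
table = [
    [5, 40, 30, 2],   # e.g. a well-populated tenure group
    [1,  2,  1, 0],   # e.g. 50+ years of service -- a sparse row
]

# Whole-table reference: sparse rows are uniformly dark (or light).
table_max = max(max(row) for row in table)
by_table = [[v / table_max for v in row] for row in table]

# Row-wise reference: each tenure group's distribution is surfaced.
by_row = [[v / max(row) if max(row) else 0 for v in row] for row in table]
```

In the whole-table version the sparse row never rises above 0.05; row-wise, its peak hits 1.0 and its internal shape becomes visible.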

Excel's conditional formatting procedure is terrible. It does not remember how you code the colors. It is almost guaranteed that the next time you go back and look at your heatmap, you can't recall whether you did this row by row, column by column, or the entire table at once. And if you coded it cell by cell, my condolences.

What if the Washington Post did not display all the data

Thanks to reader Charles Chris P., I was able to get the police staffing data to play around with. Recall from the previous post that the Washington Post made the following scatter plot, comparing the proportion of whites among police officers relative to the proportion of whites among all residents, by city.


In the last post, I suggested making a histogram. As you see below, the histogram was not helpful.


The histogram does point out one feature of the data. Despite the appearance of dots scattered about, the slopes (equivalently, angles at the origin) do not vary widely.

This feature causes problems with interpreting the scatter plot. The difficulty arises from the need to estimate dot density everywhere. This difficulty, sad to say, is introduced by the designer: it comes from using overly granular data. In this case, the proportions are recorded to one decimal place, which means a city at 10.0% is shown separately from one at 10.1%. The effect is to jitter the dots, which muddies the density.

One way to solve this problem is to use a density chart (heatmap).
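The binning behind such a density chart is simple: collapse the overly granular percentages into coarse cells and count cities per cell. A sketch, with invented (resident %, police %) pairs:

```python
from collections import Counter

# Invented (white resident %, white police %) pairs for a few cities.
cities = [(10.1, 32.4), (10.0, 33.0), (85.3, 99.9), (100.0, 100.0)]

width = 10  # percentage points per cell

def cell(p):
    """Map a percentage to a bin index, clamping 100% into the top bin."""
    return min(int(p // width), 100 // width - 1)

# Count of cities per (resident-bin, police-bin) cell -- the heatmap data.
density = Counter((cell(res), cell(pol)) for res, pol in cities)
```

Note that 10.0% and 10.1% now land in the same cell, so the spurious jitter disappears.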


You no longer have every city plotted but you have a better view of the landscape. You learn that most of the action occurs on the top row, especially on the top right. It turns out there are lots of cities (22% of the dataset!) with 100% white police forces.
This group of mostly small cities is obscuring the rest of the data. Notice that the yellow cells contain very little data, fewer than 10 cities each.

For the question the reporter is addressing, the subgroup of cities with 100% white police forces is a trivial case. Most of these places have at least 60% white residents, frequently much higher; but if every police officer is white, then the racial balance will almost surely be "off". I now remove this subgroup from the heatmap:


Immediately, you are able to see much more. In particular, you see a ridge in the expected direction. The higher the proportion of white residents, the higher the proportion of white officers.

But this view is also too granular. The yellow cells now have only one or two cities. So I collapse the cells.


More of the data lie above the bottom-left-top-right diagonal, indicating that in the U.S., the police force is skewed white on average. When comparing cities, we can take this national bias out. The following view does this.


The circled point is the average city, which sits at relative proportions of zero and zero. Notice that now, the densest regions are clustered around the 45-degree dotted diagonal.
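Taking the national bias out is just a centering step: subtract each axis's mean so the average city lands at (0, 0). A sketch with invented city values:

```python
# Invented white-resident and white-police percentages for three cities.
resident_pct = [60.0, 70.0, 80.0]
police_pct   = [70.0, 80.0, 96.0]

# Center each axis on its mean so the "average city" sits at (0, 0).
res_mean = sum(resident_pct) / len(resident_pct)
pol_mean = sum(police_pct) / len(police_pct)

relative = [(r - res_mean, p - pol_mean)
            for r, p in zip(resident_pct, police_pct)]
```

After centering, distance from the 45-degree line through the origin measures how far a city departs from the national pattern, which is what the final chart shows.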

To conclude, the Washington Post data appear to show these insights:

  • There is a national bias of whites being more likely to be in the police force
  • In about one-fifth of the cities, the entire police force is reported to be white. (The following points exclude these cities.)
  • Most cities conform to the national bias, within an acceptable margin of error
  • There are a small number of cities worth investigating further: those that are far away from the 45-degree line through the average city in the final chart shown above.

Showing all the data is not necessarily a good solution. Indeed, it is frequently a suboptimal design choice.

Interactivity as overhead

Making data graphics interactive should improve the user experience. In practice, interactivity too often becomes overhead, making it harder for users to understand the data on the graph.

Reader Joe D. (via Twitter) admires the statistical sophistication behind this graphic about home runs in Major League Baseball. This graphic does present interesting analyses, as opposed to acting as a container for data.

For example, one can compare the angle and distance of the home runs hit by different players:


One can observe patterns; for instance, most of these highlighted players have more home runs to the left side than the right side. However, for this chart to be more telling, additional information should be provided. Knowing whether the hitter is left-handed, right-handed, or a switch hitter would be key to understanding the angles. Information about the home ballpark, and indeed differentiating between home and away home runs, is also critical to making sense of this data. (One strange feature of baseball is that every field has different dimensions and shape.)

But back to my point about interactivity. The original chart does not present the data in small multiples. Instead, the user must "interact" with the chart by clicking successively on each player (listed above the graphic).

Given that the graphic only shows one player at a time, the user must use his or her memory to make the comparison between one player and the next.

The chosen visual form discourages readers from making such comparisons, which defeats one of the primary goals of the chart.

How effective visualization brings data alive

Back in 2009, I wrote about a failed attempt to visualize regional dialects in the U.S. (link). The raw data came from Bert Vaux's surveys. I recently came across some fantastic maps based on the same data. Here's one:


These maps are very pleasing to look at, and also very effective at showing the data. We learn that Americans use three major words to describe what others might call "soft drinks". The regional contrast is the point of the raw data, and Joshua Katz, who created these maps while a grad student at North Carolina State, did wonders with the data. (Looks like Katz has been hired by the New York Times.)

The entire set of maps can be found here.


What more evidence do we need that effective data visualization brings data alive... the corollary being bad data visualization takes the life out of data!

Look at the side by side comparisons of two ways to visualize the same data. This is the "soft drinks" question:



 And this is the "caramel" question:



 The set of maps referred to in the 2009 post can be found here.


Now, the maps on the left are more truthful to the data (at the zip code level), while Katz applies smoothing liberally to achieve the pleasing effect.

Katz has a poster describing the methodology -- at each location on the map, he averages the closest data. This is why the white areas on the left-side maps disappear from Katz's maps.
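A bare-bones version of that idea is nearest-neighbor smoothing: at each query point, take the most common answer among the k closest survey responses. The coordinates and answers below are invented, and Katz's actual method is more refined (kernel-weighted, on a fine grid), so this is only a sketch of the principle:

```python
from collections import Counter

# Invented survey responses: ((x, y) location, answer).
responses = [((0, 0), "soda"), ((1, 0), "soda"), ((0, 1), "pop"),
             ((5, 5), "coke"), ((6, 5), "coke"), ((5, 6), "coke")]

def smoothed_answer(point, k=3):
    """Most common answer among the k nearest responses to `point`."""
    nearest = sorted(
        responses,
        key=lambda r: (r[0][0] - point[0]) ** 2 + (r[0][1] - point[1]) ** 2,
    )[:k]
    return Counter(answer for _, answer in nearest).most_common(1)[0][0]
```

Because every query point borrows from its neighbors, no location is ever empty, which is exactly why the white areas of the left-side maps vanish.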

The dot notation on the left-side maps has a major deficiency: the dot is a binary element, either present or absent. We lose any sense of how strongly the responses are biased toward that answer. This may be why, in both examples, several of the heaviest patches on Katz's maps correspond to relatively sparse regions on the left-side maps.

Katz also tells us that his maps use only part of the data. For each point on his maps, he only uses the most frequent answer; in reality, there are proportions of respondents for each of the available choices. Dropping the other responses is not a big deal if the responses are highly concentrated on the top choice, but if the responses are evenly split, or closely balanced between, say, the top two choices, then using only the top choice presents a problem.



Advocacy graphics



Because the visual medium is powerful, it is a favorite of advocates. Creating a chart for advocacy is tricky. One must strike the proper balance between education and messaging. The chart needs to present the policy position strongly and also enlighten the unconverted with useful information.

In my interview with MathBabe Cathy O'Neil (link), she points to this graphic by Pew that illustrates where death-penalty executions have been administered in the past two decades in the U.S. (link) Here is a screenshot of the geographic distribution for 2006:


The chart is a variant of the CDC map of obesity, which I discussed years ago. At one level, the structure of the data is the same. Each state is evaluated on a particular metric (proportion obese, and number of executions) once a year. Both designers choose to roll through a sequence of small-multiple maps.

The key distinction is that the obesity map encodes the data in color while the executions map encodes data in the density of semi-transparent, overlapping dots, each dot representing a single execution.

Perhaps the idea is to combat one of the weaknesses of color encoding: humans don't have an instinctive sense of the mapping between a numerical scale and a color scale. If the color transitions from yellow to orange, how many more executions would that map to? By contrast, if you see 200 dots instead of 160, we know the difference is 40.


The switch to the dots aesthetic introduces a host of problems.

Density, as you recall from geometry class, is the count divided by the area. High density can be due to a lot of executions or a very small area. Look at Delaware (DE) versus Georgia (GA). The density of red appears similar but there have been far fewer executions in Delaware.

This is a serious mistake. By using dot density, the designer encourages readers to think in terms of area of each state but why should the number of executions be related to area? As Cathy pointed out, a more relevant reference point is the population of each state. An even cleverer reference point might be the number of criminals/convictions in each state.

Another design issue relates to the note at the bottom of the chart (shown on the right). Here, the designer is fighting against the knowledge in the reader's head. It is natural for a dot on a map to represent location, and yet the spatial distribution of the dots here provides no information. Credit the designer for clarifying this in a footnote; but let this also be a warning that there are other visual representations that do not require such disclaimers.


I am confused by why dots appear but never disappear. It seems that the chart is plotting cumulative counts of executions from 1977, rather than the number of executions in each year, as the chart title suggests. (If you go to the Pew website, you find a version with "cumulative" in the title; when they produced the animated gif, they decided to simplify the title, which is a poor decision.)

It requires a quick visit to Wikipedia to learn that there was a break in executions in the 70s. This is a missed opportunity to educate readers about the context of this data. Similarly, a good chart presenting this data should distinguish between states that have banned the death penalty and states that have zero or low numbers of executions.


A great way to visualize this data is via a heatmap. Here, I whipped up a quick sketch (pardon the sideways text on the legend):


I forgot to add a footnote listing the states where the death penalty is banned. I could also add axis labels to the side histogram showing counts.



Some chart types are not scalable

Peter Cock sent this Venn diagram to me via Twitter. (Original from this paper.)


For someone who doesn't know genetics, it is very hard to make sense of this chart. It seems like there are five characteristics that each unit of analysis can have (listed on the left column) and each unit possesses one or more of these characteristics.

There is one glaring problem with this visual display. The area of each subset is not proportional to the count it represents. Look at the two numbers in the middle of the chart, each accounting for a large chunk of the area of the green tree. One side says 5,724 while the other says 13, even though both sides have the same area.

In this respect, Venn diagrams are like maps. The area of a country or state on a map is not related to the data being plotted (unless it's a cartogram).

If you know how to interpret the data, please leave a comment. I'm guessing some kind of heatmap will work well with this data. 

Beautiful spider loses its way 2

A double post today.

In the previous post, I talked about NFL.com's visualization of football player statistics. In this post, I offer a few different views of the same data.


The first is a dot plot arranged in small multiples.


Notice that I have indexed every metric against the league average. This is shown in the first panel. I use a red dot to warn readers that the direction of this metric is opposite to the others (left of center is a good thing!)
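The indexing is a simple ratio against the league average. In this sketch the metric names and numbers are invented, not taken from the NFL data:

```python
# Invented league averages and one player's stats, to illustrate
# indexing every metric so that 100 means "league average".
league_avg = {"pass_yards": 3500, "interceptions": 12}
player     = {"pass_yards": 3050, "interceptions": 7}

indexed = {m: player[m] / league_avg[m] * 100 for m in player}
# Note: for interceptions, a value below 100 is good -- the opposite of
# the other metrics. That reversal is what the red dot flags.
```

Putting every metric on the same "100 = average" scale is what lets the small multiples share one axis.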

You can immediately make a bunch of observations:

  • Alex Smith was quite poor, except for interceptions.
  • Colin Kaepernick had passing statistics similar to Smith's. His only advantage over Smith was the rushing.
  • Joe Flacco, as we noted before, is as average as it goes (except for rushing yards).
  • Tyrod Taylor is here to remind us that we have to be careful about backup players being included in the same analysis.


The second version is a heatmap.

This takes inspiration from the fact that any serious reader of the spider chart will be reading the eight spokes (dimensions) separately. Why not plot these neatly in columns and use color to help us find the best and worst?


Imagine this to be a large table with as many rows as there are quarterbacks. You will be able to locate the red (hot) zones quickly. You can also scan across a row to understand a player's performance relative to the average, on every metric.

I like this visualization best, primarily because it scales beautifully.


The final version is a profile chart, sometimes called a parallel coordinates plot. While I am an advocate of profile charts, they really only work when you have a small number of things to compare.