Ranking data provide context but can also confuse

This dataviz from the Economist had me spending a lot of time clicking around - which means it is a success.


The graphic presents four measures of wellbeing in society - life expectancy, infant mortality rate, murder rate and prison population. The primary goal is to compare nations across those metrics. The focus is on comparing how certain nations (or subgroups) rank against each other, as indicated by the relative vertical position.

The Economist staff has a particular story to tell about racial division in the US. The dotted bars represent the U.S. average. The colored bars are the averages for Hispanic, white and black Americans. The wider the gap between the colored bars, the more variant is the experiences between American races.

The chart shows that the racial gap of life expectancy is the widest. For prison population, the U.S. and its racial subgroups occupy many of the lowest (i.e. least desirable) ranks, with the smallest gap in ranking.


The primary element of interactivity is hovering on a bar, which then highlights the four bars corresponding to the particular nation selected. Here is the picture for Thailand:


According to this view of the world, Thailand is a close cousin of the U.S. On each metric, the Thai value clings pretty near the U.S. average and sits within the range by racial groups. I'm surprised to learn that the prison population in Thailand is among the highest in the world.

Unfortunately, this chart form doesn't facilitate comparing Thailand to a country other than the U.S as one can highlight only one country at a time.


While the main focus of the chart is on relative comparison through ranking, the reader can extract absolute difference by reading the lengths of the bars.

This is a close-up of the bottom of the prison population metric:

Econ_useexcept_prisonpop_bottomThe length of each bar displays the numeric data. The red line is an outlier in this dataset. Black Americans suffer an incarceration rate that is almost three times the national average. Even white Americans (blue line) is imprisoned at a rate higher than most countries around the world.

As noted above, the prison population metric exhibits the smallest gap between racial subgroups. This chart is a great example of why ranking data frequently hide important information. The small gap in ranking masks the extraordinary absolute difference in incareration rates between white and black America.

The difference between rank #1 and rank #2 is enormous.

Econ_useexcept_lifeexpect_topThe opposite situation appears for life expectancy. The life expectancy values are bunched up especially at the top of the scale. The absolute difference between Hispanic and black America is 82 - 75 = 7 years, which looks small because the axis starts at zero. On a ranking scale, Hispanic is roughly in the top 15% while black America is just above the median. The relative difference is huge.

For life expectancy, ranking conveys the view that even a 7-year difference is a big deal because the countries are tightly bunched together. For prison population, ranking shows the view that a multiple fold difference is "unimportant" because a 20-0 blowout and a 10-0 blowout are both heavy defeats.


Whenever you transform numeric data to ranks, remember that you are artificially treating the gap between each value and the next value as a constant, even when the underlying numeric gaps show wide variance.






Five steps to let the young ones shine

Knife stabbings are in the news in the U.K. and the Economist has a quartet of charts to illustrate what's going on.


I'm going to focus on the chart on the bottom right. This shows the trend in hospital admissions due to stabbings in England from 2000 to 2018. The three lines show all ages, and two specific age groups: under 16 and 16-18.

The first edit I made was to spell out all years in four digits. For this chart, numbers like 15 and 18 can be confused with ages.


The next edit corrects an error in the subtitle. The reference year is not 2010 as those three lines don't cross 100. It appears that the reference year is 2000. Another reason to use four-digit years on the horizontal axis is to be consistent with the subtitle.


The next edit removes the black dot which draws attention to itself. The chart though is not about the year 2000, which has the least information since all data have been forced to 100.


The next edit makes the vertical axis easier to interpret. The indices 150, 200, are much better stated as + 50%, + 100%. The red line can be labeled "at 2000 level". One can even remove the subtitle 2000=100 if desired.


Finally, I surmise the message the designer wants to get across is the above-average jump in hospital admissions among children under 16 and 16 to 18. Therefore, the "All" line exists to provide context. Thus, I made it a dashed line pushing it to the background.





Two good charts can use better titles

NPR has this chart, which I like:


It's a small multiples of bumps charts. Nice, clear labels. No unnecessary things like axis labels. Intuitive organization by Major Factor, Minor Factor, and Not a Factor.

Above all, the data convey a strong, surprising, message - despite many high-profile gun violence incidents this year, some Democratic voters are actually much less likely to see guns as a "major factor" in deciding their vote!

Of course, the overall importance of gun policy is down but the story of the chart is really about the collapse on the Democratic side, in a matter of two months.

The one missing thing about this chart is a nice, informative title: In two months, gun policy went from a major to a minor issue for some Democratic voters.


 I am impressed by this Financial Times effort:


The key here is the analysis. Most lazy analyses compare millennials to other generations but at current ages but this analyst looked at each generation at the same age range of 18 to 33 (i.e. controlling for age).

Again, the data convey a strong message - millennials have significantly higher un(der)employment than previous generations at their age range. Similar to the NPR chart above, the overall story is not nearly as interesting as the specific story - it is the pink area ("not in labour force") that is driving this trend.

Specifically, millennial unemployment rate is high because the proportion of people classified as "not in labour force" has doubled in 2014, compared to all previous generations depicted here. I really like this chart because it lays waste to a prevailing theory spread around by reputable economists - that somehow after the Great Recession, demographics trends are causing the explosion in people classified as "not in labor force". These people are nobodies when it comes to computing the unemployment rate. They literally do not count! There is simply no reason why someone just graduated from college should not be in the labour force by choice. (Dean Baker has a discussion of the theory that people not wanting to work is a long term trend.)

The legend would be better placed to the right of the columns, rather than the top.

Again, this chart benefits from a stronger headline: BLS Finds Millennials are twice as likely as previous generations to have dropped out of the labour force.





Is the chart answering your question? Excavating the excremental growth map

Economist_excrement_growthSan Franciscans are fed up with excremental growth. Understandably.

Here is how the Economist sees it - geographically speaking.


In the Trifecta Checkup analysis, one of the questions to ask is "What does the visual say?" and with respect to the question being asked.

The question is how much has the problem of human waste in SF grew from 2011 to 2017.

What does the visual say?

The number of complaints about human waste has increased from 2011 to 2014 to 2017.

The areas where there are complaints about human waste expanded.

The worst areas are around downtown, and that has not changed during this period of time.


Now, what does the visual not say?

Let's make a list:

  • How many complaints are there in total in any year?
  • How many complaints are there in each neighborhood in any year?
  • What's the growth rate in number of complaints, absolute or relative?
  • What proportion of complaints are found in the worst neighborhoods?
  • What proportion of the area is covered by the green dots on each map?
  • What's the growth in terms of proportion of areas covered by the green dots?
  • Does the density of green dots reflect density of human waste or density of human beings?
  • Does no green dot indicate no complaints or below the threshold of the color scale?

There's more:

  • Is the growth in complaints a result of more reporting or more human waste?
  • Is each complainant unique? Or do some people complain multiple times?
  • Does each piece of human waste lead to one and only one complaint? In other words, what is the relationship between the count of complaints and the count of human waste?
  • Is it easy to distinguish between human waste and animal waste?

And more:

  • Are all complaints about human waste valid? Does anyone verify complaints?
  • Are the plotted locations describing where the human waste is or where the complaint was made?
  • Can all complaints be treated identically as a count of one?
  • What is the per-capita rate of complaints?

In other words, the set of maps provides almost all no information about the excrement problem in San Francisco.

After you finish working, go back and ask what the visual is saying about the question you're trying to address!


As a reference, I found this map of the population density in San Francisco (link):



This map steals story-telling from the designer

Stolen drugs is a problem at federal VA hospitals according to the following map.



Let's evaluate this map from a Trifecta Checkup perspective.

VISUAL - Pass by a whisker. The chosen visual form of a map is standard for geographic data although the map snatches story-telling from our claws, just as people steal drugs from hospitals. Looking at the map, it's not clear what the message is. Is there one?

The 50 states plus DC are placed into five groups based on the reported number of incidents of theft. From the headline, it appears that the journalist conducted a Top 2 Box analysis, defining "significant" losses of drugs as 300 incidents or more. The visual design ignores this definition of "significance."

DATA - Fail. The map tells us where the VA hospitals are located. It doesn't tell us which states are most egregious in drug theft. To learn that, we need to compute a rate, based on the number of hospitals or patients or the amount of spending on drugs.

Looking more carefully, it's not clear they used a Top 2 Box analysis either. I counted seven states with the highest level of theft, followed by another seven states with the second highest level of theft. So the cutoff of twelve states awkwardly lands in between the two levels.

QUESTION - Fail. Drug theft from hospitals is an interesting topic but the graphic does not provide a good answer to the question.


Even if we don't have data to compute a rate, the chart is a bit better if proportions are emphasized, rather than counts.


The proportions are most easily understood from the base of four quarters making the whole. The first group is just over a quarter; the second group is exactly a quarter. The third group plus the first group roughly make up a half. The fourth and fifth groups together almost fills out a quarter.

In the original map, we are told about at least 400 incidents of theft in Texas but given no context to interpret this statistic. What proportion of the total thefts occur in Texas?



Excellent visualization of gun violence in American cities

I like the Guardian's feature (undated) on gun violence in American cities a lot.

The following graphic illustrates the situation in Baltimore.


The designer starts by placing where the gun homicides occured in 2015. Then, it leads readers through an exploration of the key factors that might be associated with the spatial distribution of those homicides.

The blue color measures poverty levels. There is a moderate correlation between high numbers of dots (homicides) and deeper blue (poorer). The magenta color measures education attainment and the orange color measures proportion of blacks. In Baltimore, it appears that race is substantially better at explaining the prevalence of homicides.

This work is exemplary because it transcends description (first map) and explores explanations for the spatial pattern. Because three factors are explored together in a small-multiples layout, readers learn that no single factor can explain everything. In addition, we learn that different factors have different degrees of explanatory power.

Attentive readers will also find that the three factors of poverty, education attainment and proportion black are mutually correlated.  Areas with large black populations also tend to be poorer and less educated.


I also like the introductory section in which a little dose of interactivity is used to sequentially present the four maps, now superimposed. It then becomes possible to comprehend the rest quickly.



The top section is less successful as proportions are not easily conveyed via dot density maps.


Dropping the map form helps. Here is a draft of what I have in mind. I just pulled some data from online sources at the metropolitan area (MSA) level, and it doesn't have as striking a comparison as the city-level data, it seems.



 PS. On Twitter, Aliza tells me the article was dated January 9, 2017.

Where but when and why: deaths of journalism

On Twitter, someone pointed me to the following map of journalists who were killed between 1993 and 2015.


I wasn't sure if the person who posted this liked or disliked this graphic. We see a clear metaphor of gunshots and bloodshed. But in delivering the metaphor, a number of things are sacrificed:

  • the number of deaths is hard to read
  • the location of deaths is distorted, both in large countries (Russia) where the deaths are too concentrated, and in small countries (Philippines) where the deaths are too dispersed
  • despite the use of a country-level map, it is hard to learn the deaths by country

The Committee to Protect Journalists (CPJ), which publishes the data, used a more conventional choropleth map, which was reproduced and enhanced by Global Post:


They added country names and death counts via a list at the bottom. There is also now a color scale. (Note the different sets of dates.)


In a Trifecta Checkup, I would give this effort a Type DV. While the map is competently produced, it doesn't get at the meat of the data. In addition, these raw counts of deaths do not reveal much about the level of risk experienced by journalists working in different countries.

The limitation of the map can be seen in the following heatmap:


While this is not a definitive visualization of the dataset, I use this heatmap to highlight the trouble with hiding the time dimension. Deaths are correlated with particular events that occurred at particular times.

Iraq is far and away the most dangerous but only after the Iraq War and primarily during the War and its immediate aftermath. Similarly, it is perfectly safe to work in Syria until the last few years.

A journalist can use this heatmap as a blueprint, and start annotating it with various events that are causes of heightened deaths.


Now the real question in this dataset is the risk faced by journalists in different countries. The death counts give a rather obvious and thus not so interesting answer: more journalists are killed in war zones.

A denominator is missing. How many journalists are working in the respective countries? How many non-journalists died in the same countries?

Also, separating out the causes of death can be insightful.

What if the Washington Post did not display all the data

Thanks to reader Charles Chris P., I was able to get the police staffing data to play around with. Recall from the previous post that the Washington Post made the following scatter plot, comparing the proportion of whites among police officers relative to the proportion of whites among all residents, by city.


In the last post, I suggested making a histogram. As you see below, the histogram was not helpful.


The histogram does point out one feature of the data. Despite the appearance of dots scattered about, the slopes (equivalently, angles at the origin) do not vary widely.

This feature causes problems with interpreting the scatter plot. The difficulty arises from the need to estimate dot density everywhere. This difficulty, sad to say, is introduced by the designer. It arises from using overly granular data. In this case, the proportions are recorded to one decimal place. This means that a city with 10% is shown separate from one with 10.1%. The effect is jittering the dots, which muddies up densities.

One way to solve this problem is to use a density chart (heatmap).


You no longer have every city plotted but you have a better view of the landscape. You learn that most of the action occurs on the top row, especially on the top right. It turns out there are lots of cities (22% of the dataset!) with 100% white police forces.
This group of mostly small cities is obscuring the rest of the data. Notice that the yellow cells contain very little data, fewer than 10 cities each.

For the question the reporter is addressing, the subgroup of cities with 100% white police forces is trivially important. Most of these places have at least 60% white residents, frequently much higher. But if every police officer is white, then the racial balance will almost surely be "off". I now remove this subgroup from the heatmap:


Immediately, you are able to see much more. In particular, you see a ridge in the expected direction. The higher the proportion of white residents, the higher the proportion of white officers.

But this view is also too granular. The yellow cells now have only one or two cities. So I collapse the cells.


More of the data lie above the bottom-left-top-right diagonal, indicating that in the U.S., the police force is skewed white on average. When comparing cities, we can take this national bias out. The following view does this.


The point indicated by the circle is the average city indicated by relative proportions of zero and zero. Notice that now, the densest regions are clustered around the 45-degree dotted diagonal.

To conclude, the Washington Post data appear to show these insights:

  • There is a national bias of whites being more likely to be in the police force
  • In about one-fifth of the cities, the entire police force is reported to be white. (The following points exclude these cities.)
  • Most cities confirm to the national bias, within an acceptable margin of error
  • There are a small number of cities worth investigating further: those that are far away from the 45-degree line through the average city in the final chart shown above.

Showing all the data is not necessarily a good solution. Indeed, it is frequently a suboptimal design choice.

Nice analysis of racial composition of police forces

The Washington Post has a good idea. Using Census data, they computed the proportion of police force who are white and the corresponding proportion of citizens who are white, in different cities.

In the following scatter plot, they singled out North Charleston, SC where the police force is 85% white but the citizens are only 40% white: (Link to the interactive chart.)


This plot itself is well done, with helpful coloring and labels.

One must be careful about "story time": it's easy to infer from the graph that blue dots mean worse racial tension but that interpretation requires an assumption not proven in the data. (What is missing is the correlation between this data and some other data measuring tension.)

The secret to reading this chart is to look at the slopes of lines from the origin to each point. Above the 45-degree diagonal separating the blue dots from the gray are the cities where the police is more white than the people. The steeper the line to the origin, the more unrepresentative. Once you pass the 45-degree line, do the reverse.

The slope is really the metric of X police per Y residents. So the two dimensions can be collapsed into one. With the one dimension, I'd try a histogram view. If you find the data, let me know. Or just post it to the comments.