Some Tufte basics brought to you by your favorite birds

Someone sent me this via Twitter, found on the Data is Beautiful reddit:

Reddit_whichbirdspreferwhichseeds_sm

The chart does not deliver on its promise: It's tough to know which birds like which seeds.

The original chart was also provided in the reddit:

Reddit_whichbirdswhichseeds_orig_sm

I can see why someone would want to remake this visualization.

Let's just apply some Tufte fixes to it, and see what happens.

Our starting point is this:

Slide1

First, consider the colors. Think for a second: order the colors of the cells by which ones stand out most. For me, the order is white > yellow > red > green.

That is a problem because for this data, you'd want green > yellow > red > white. (By the way, the chart doesn't explain what white means. I'm assuming it indicates the least preferred seeds, so unappealing that the bird wouldn't consider that seed type at all.)

Compare the above with this version that uses a one-dimensional sequential color scale:

Slide2

The white color still stands out more than necessary. Fix this using a gray color.

Slide3

What else is grabbing your attention when it shouldn't? It's those gridlines. Push them into the background using white-out.

Slide4

The gridlines are also too thick. Here's a slimmed-down look:

Slide5

The visual is much improved.

But one more thing. Let's re-order the columns (seeds). The most popular seeds are shown on the left, and the least on the right in this final revision.
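The reordering step is mechanical once the data is in a table. Here is a minimal sketch with pandas, using made-up birds, seeds, and preference scores (the real data isn't reproduced here): sort the seed columns by their total preference, most popular on the left.

```python
import pandas as pd

# Hypothetical preference scores (0 = not preferred, 3 = most preferred)
prefs = pd.DataFrame(
    {"sunflower": [3, 3, 2], "millet": [1, 2, 0], "nyjer": [0, 1, 3]},
    index=["cardinal", "sparrow", "finch"],
)

# Order seed columns by overall popularity, most popular first (leftmost)
order = prefs.sum().sort_values(ascending=False).index
prefs_sorted = prefs[order]
print(list(prefs_sorted.columns))
```

The same one-liner generalizes to reordering rows (birds) by their total, if that aids the reader's lookup task.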

Slide6

Look for your favorite bird. Then find out which are its most preferred seeds.

Here is an animated gif to see the transformation. (Depending on your browser, you may have to click on it to view it.)

Redojc_birdsseeds_all_2

 

PS. [7/23/18] Fixed the 5th and 6th images and also in the animated gif. The row labels were scrambled in the original version.

 


Is the chart answering your question? Excavating the excremental growth map

Economist_excrement_growth

San Franciscans are fed up with excremental growth. Understandably.

Here is how the Economist sees it - geographically speaking.

***

In the Trifecta Checkup analysis, one of the questions to ask is "What does the visual say?", considered with respect to the question being addressed.

The question is how much the problem of human waste in SF grew from 2011 to 2017.

What does the visual say?

The number of complaints about human waste has increased from 2011 to 2014 to 2017.

The areas where there are complaints about human waste expanded.

The worst areas are around downtown, and that has not changed during this period of time.

***

Now, what does the visual not say?

Let's make a list:

  • How many complaints are there in total in any year?
  • How many complaints are there in each neighborhood in any year?
  • What's the growth rate in number of complaints, absolute or relative?
  • What proportion of complaints are found in the worst neighborhoods?
  • What proportion of the area is covered by the green dots on each map?
  • What's the growth in terms of proportion of areas covered by the green dots?
  • Does the density of green dots reflect density of human waste or density of human beings?
  • Does no green dot indicate no complaints or below the threshold of the color scale?

There's more:

  • Is the growth in complaints a result of more reporting or more human waste?
  • Is each complainant unique? Or do some people complain multiple times?
  • Does each piece of human waste lead to one and only one complaint? In other words, what is the relationship between the count of complaints and the count of human waste?
  • Is it easy to distinguish between human waste and animal waste?

And more:

  • Are all complaints about human waste valid? Does anyone verify complaints?
  • Are the plotted locations describing where the human waste is or where the complaint was made?
  • Can all complaints be treated identically as a count of one?
  • What is the per-capita rate of complaints?

In other words, the set of maps provides almost no information about the excrement problem in San Francisco.

After you finish working, go back and ask what the visual is saying about the question you're trying to address!

 

As a reference, I found this map of the population density in San Francisco (link):

SFO_Population_Density

 


A look at how the New York Times readers look at the others

Nyt_taxcutmiddleclass

The above chart got some mileage on my Twitter feed when it was unveiled at the end of November last year, so it attracted some attention. A reader, Eric N., didn't like it at all, and I think he has a point.

Here are several debatable design decisions.

The chart uses an inverted axis: a tax cut (negative change) is shown on the right while a tax increase is shown on the left. This type of inversion has gotten others in trouble before, namely in the controversy over the gun deaths chart (link). The green/red color coding signals the polarity, although some will argue this is bad for color-blind readers. The annotation below the axis is probably the reason I wasn't confused in the first place, but the other charts further down the page do not repeat the annotation, and that's where the interpretation of -$2,000 as a tax increase is unnatural!

The chart does not aggregate the data. It plots 25,000 households with 25,000 points. Because of the variance of the data, it's hard to judge trends. It's easy enough to see that there are more green dots than red but how many more? 10 percent, 20 percent, 40 percent? It's also hard to answer any specific questions, say, about households with a certain range of incomes. There are various ways to aggregate the data, such as heatmaps, histograms, and so on.
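To make the point concrete, here is one simple aggregation, sketched with simulated stand-in data (the incomes and tax changes below are made up; the Times's actual data isn't reproduced here): compute the share of households receiving a cut within each income band, which answers the "how many more green than red?" question directly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Simulated stand-ins for the 25,000 plotted households
df = pd.DataFrame({
    "income": rng.uniform(40_000, 150_000, size=25_000),
    "tax_change": rng.normal(-1_000, 800, size=25_000),  # negative = tax cut
})

# Aggregate: share of households getting a cut, by $20K income band
bands = pd.cut(df["income"], bins=range(40_000, 160_001, 20_000))
share_cut = df.groupby(bands, observed=True)["tax_change"].apply(
    lambda s: (s < 0).mean()
)
print(share_cut.round(2))
```

A histogram or heatmap built on the same binned data would answer the range-of-incomes questions the dot cloud cannot.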

For those used to looking at scientific charts, the x- and y-axes are reversed. By convention, we'd have put the income ranges on the horizontal axis and the tax changes (the "outcome" variable) on the vertical axis.

***

The text labels do not describe the data patterns on the chart so much as they offer additional information. To see this, remove the labels as I have done below. Try adding the labels based on what is shown on the chart.

Nyt_taxcutmiddleclass_2

Perhaps it's possible to illustrate those insights with a set of charts.

***

While reading this chart, I kept wondering how those 25,000 households were chosen. This is a sample of middle-class households. The methodology is explained in a footnote, which defines "middle class", but unfortunately they forgot to tell us how the 25,000 households were selected from all such middle-class households.

Nyt_taxcutmiddleclass_footnote

The decision to omit households with income below $40,000 needs more explanation, as it undercuts the household-size adjustment. Also, it's not clear that the impact of the tax bill on households with incomes between $20-40K can be assumed to be the same as for those above $40K.

Are the 25,000 households a simple random sample of all "middle class" households, or are they chosen in some way to represent the relative counts? It would also be useful to know whether the $40K cutoff was applied before or after selecting the 25,000 households.

Ironically, the media kit of the Times discloses an affluent readership with a median household income of almost $190K, so it appears that the majority of readers are not represented in the graphic at all!

 


Canadian winters in cold gray

I was looking at some Canadian data graphics while planning my talk in Vancouver this Thursday (you can register for the free talk here). I love the concept behind the following chart:

Nationalpost_weather-graphic

Based on the forecasted temperature for 2015 (specifically the temperature on Christmas Eve), the reporter for National Post asked whether the winter of 2015 would be colder or warmer than the winters on record since 1990. The accompanying article is here.

The presentation of small multiples encourages readers to examine that question city by city. It is more challenging to discover larger patterns.

Here is a sketch of a different take that attempts to shed light on regional and temporal patterns:

Jc_redo_canadiantemp2

You can see that the western and central cities were warmer in the past while the eastern cities were colder in the past.

Also, there were some particularly cold years (1996, 1998, 2008, and 2012) when most of the featured cities experienced a freeze.

I am not sure why certain cities had no record of their temperature in certain years (machine malfunction?). In fact, one flaw in the original chart is the confusing legend that maps the grey color to "Data Unavailable" when most of the columns shown are grey. 

Nationalpost_weather-graphic-inset


Where but when and why: deaths of journalism

On Twitter, someone pointed me to the following map of journalists who were killed between 1993 and 2015.

Wherejournalistsarekilled

I wasn't sure if the person who posted this liked or disliked this graphic. We see a clear metaphor of gunshots and bloodshed. But in delivering the metaphor, a number of things are sacrificed:

  • the number of deaths is hard to read
  • the location of deaths is distorted, both in large countries (Russia) where the deaths are too concentrated, and in small countries (Philippines) where the deaths are too dispersed
  • despite the use of a country-level map, it is hard to learn the deaths by country

The Committee to Protect Journalists (CPJ), which publishes the data, used a more conventional choropleth map, which was reproduced and enhanced by Global Post:

Gp_wherejournalistskilled

They added country names and death counts via a list at the bottom. There is also now a color scale. (Note the different sets of dates.)

***

In a Trifecta Checkup, I would give this effort a Type DV. While the map is competently produced, it doesn't get at the meat of the data. In addition, these raw counts of deaths do not reveal much about the level of risk experienced by journalists working in different countries.

The limitation of the map can be seen in the following heatmap:

Redo_cpj_heatmap

While this is not a definitive visualization of the dataset, I use this heatmap to highlight the trouble with hiding the time dimension. Deaths are correlated with particular events that occurred at particular times.

Iraq is far and away the most dangerous, but only after the start of the Iraq War, and primarily during the War and its immediate aftermath. Similarly, it was perfectly safe to work in Syria until the last few years.

A journalist can use this heatmap as a blueprint, and start annotating it with various events that are causes of heightened deaths.

***

Now the real question in this dataset is the risk faced by journalists in different countries. The death counts give a rather obvious and thus not so interesting answer: more journalists are killed in war zones.

A denominator is missing. How many journalists are working in the respective countries? How many non-journalists died in the same countries?

Also, separating out the causes of death can be insightful.


Raw data and the incurious

The following chart caught my eye when it appeared in the Wall Street Journal this month:

Wsj_fedratehike

This is a laborious design; much sweat has been poured into it. It's a chart that requires the reader to spend time learning how to read it.

A major difficulty for any visualization of this dataset is keeping track of the two time scales. One scale, depicted horizontally, traces the dates of Fed meetings. These meetings seem to occur four times a year except in 2012. The other time scale is encoded in the colors, explained above the chart. This is the outlook by each Fed committee member of when he/she expects a rate hike to occur.

I find it challenging to understand a time scale encoded in discrete colors. Given that time has an order, my expectation is that the colors should be ordered too. Adding to this mess is the correlation between the two time scales: as time marches on, certain predictions become infeasible.

Part of the problem is the unexplained vertical scale. Eventually, I realize each cell is a committee member, and there are 19 members, although two or three routinely fail to submit their outlook in any given meeting.

Contrary to expectation, I don't think one can read across a row to see how a particular member changed his/her view over time; presumably the cells were arranged to keep the patches of color contiguous rather than to track individual members.

***

After this struggle, all I wanted was to learn something from this dataset. Here is what I came up with:

Redo_wsjfedratehike

There is actually little of interest in the data. The most salient point is that a shift in view occurred back in September 2012 when enough members pushed back the year of rate hike that the median view moved from 2014 to 2015. Thereafter, there is a decidedly muted climb in support for the 2015 view.
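The median-shift reading can be checked with basic arithmetic on the members' submitted years. A sketch with hypothetical outlooks (the counts below are invented for illustration; they are not the actual submissions):

```python
import statistics

# Hypothetical outlooks: each entry is one member's expected year of the
# first rate hike (17 submissions, since a couple of members routinely abstain)
before_shift = [2014] * 10 + [2015] * 7   # a meeting before September 2012
after_shift = [2014] * 8 + [2015] * 9     # September 2012: enough members moved

print(statistics.median(before_shift), statistics.median(after_shift))
```

Moving just two members from 2014 to 2015 is enough to flip the median, which is why a single summary line can capture what the color patches obscure.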

***

This is an example in which plotting elemental data backfires. Raw data is the sanctuary of the incurious.

 

 


Shaking up expectations for pension benefits

Ted Ballachine wrote me about his website Pension360 pointing me to a recent attempt at visualizing pension benefits in various retirement systems in the state of Illinois. The link to the blog post is here.

One of the things they did right is to start with an extended guide to reading the chart. This type of thing should be done more often. Here is the top part of this section.

Pension36_explained

It turns out that the reading guide is vital for this visualization! The reason is that they made some decisions that shake up our expectations.

For example, darker colors usually mean more but here they mean less.

Similarly, a person's service increases as you go down the vertical axis, not up.

I have recommended that they flip both, since there doesn't seem to be a strong reason to break these conventions.

***

This display facilitates comparing the structure of different retirement systems. For example, I have placed next to each other the images for the Illinois Teacher's Retirement System (blue), and the Chicago Teacher's Pension Fund (black).

  Chi_il_pension360

It is immediately clear that the Chicago system is miserly. The light gray parts extend only to half of the width compared to the blue cells in the top chart. The fact that the annual payout grows somewhat linearly as the years of service increase makes sense.

What doesn't make sense to me, in the blue chart, is the extreme variance in the annual payout for the beneficiary with "average" tenure of about 35 years. If you look at all of the charts, there are several examples of retirement systems in which employees with similar tenure have payouts that differ by an order of magnitude. Can someone explain that?

***

One consideration for those who make heatmaps using conditional formatting in Excel.

These charts encode the count of people in shades of color, with the reference population being the entire table. This is not the only way to encode the data, and it prevents us from understanding the "sparsely populated" regions of the heatmap.

Look at any of the pension charts. Darkness reigns at the bottom of each one, in the rows for people with 50 or 60 years of service. This is because there are few such employees (relative to the total population). An alternative is to color code each row separately. Then you have surfaced the distribution of benefits within each tenure group. (The trade-off is the revised chart no longer tells the reader how service years are distributed.)
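The difference between the two encodings is just a choice of denominator. A minimal sketch with pandas, using invented counts (service years as rows, payout brackets as columns; none of these numbers come from the Pension360 data):

```python
import pandas as pd

# Hypothetical counts of retirees by (years of service, payout bracket)
counts = pd.DataFrame(
    {"$0-20K": [500, 50, 2], "$20-40K": [300, 80, 1], "$40K+": [100, 20, 1]},
    index=["10 yrs", "30 yrs", "50 yrs"],
)

# Table-wide scaling: the sparse "50 yrs" row is uniformly near-zero (dark)
table_scaled = counts / counts.to_numpy().max()

# Row-wise scaling: each row's own distribution becomes visible
row_scaled = counts.div(counts.max(axis=1), axis=0)
print(row_scaled.round(2))
```

Under the table-wide scaling, every cell in the "50 yrs" row sits at the bottom of the color scale; under the row-wise scaling, the shape of payouts within that tenure group emerges.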

Excel's conditional formatting procedure is terrible. It does not remember how you code the colors. It is almost guaranteed that the next time you go back and look at your heatmap, you can't recall whether you did this row by row, column by column, or the entire table at once. And if you coded it cell by cell, my condolences.


What if the Washington Post did not display all the data

Thanks to reader Charles Chris P., I was able to get the police staffing data to play around with. Recall from the previous post that the Washington Post made the following scatter plot, comparing the proportion of whites among police officers relative to the proportion of whites among all residents, by city.

Wp_policestaffing

In the last post, I suggested making a histogram. As you see below, the histogram was not helpful.

Redo_wp_police0

The histogram does point out one feature of the data. Despite the appearance of dots scattered about, the slopes (equivalently, angles at the origin) do not vary widely.

This feature causes problems with interpreting the scatter plot. The difficulty arises from the need to estimate dot density everywhere. This difficulty, sad to say, is introduced by the designer. It arises from using overly granular data. In this case, the proportions are recorded to one decimal place. This means that a city with 10% is shown separate from one with 10.1%. The effect is jittering the dots, which muddies up densities.

One way to solve this problem is to use a density chart (heatmap).

Redo_wp_police_1

You no longer have every city plotted but you have a better view of the landscape. You learn that most of the action occurs on the top row, especially on the top right. It turns out there are lots of cities (22% of the dataset!) with 100% white police forces.
This group of mostly small cities is obscuring the rest of the data. Notice that the yellow cells contain very little data, fewer than 10 cities each.
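The binning behind such a density chart can be sketched in a few lines. The proportions below are simulated (the real staffing data isn't reproduced here); the key move is that 10-point cells absorb the 0.1% jitter introduced by the overly granular recording.

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated proportions (percent white among residents / officers), one pair per city
pct_residents = rng.uniform(0, 100, size=400).round(1)
pct_officers = np.clip(
    pct_residents + rng.normal(10, 15, size=400), 0, 100
).round(1)

# 10-point cells: a city at 10.0% and one at 10.1% now land in the same cell
edges = np.arange(0, 101, 10)
density, _, _ = np.histogram2d(pct_residents, pct_officers, bins=[edges, edges])
print(int(density.sum()))
```

Collapsing to even coarser cells, as in the later charts, just means widening `edges`.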

For the question the reporter is addressing, the subgroup of cities with 100% white police forces offers a trivial answer. Most of these places have at least 60% white residents, frequently much higher. But if every police officer is white, then the racial balance will almost surely be "off". I now remove this subgroup from the heatmap:

Redo_wp_police_2

Immediately, you are able to see much more. In particular, you see a ridge in the expected direction. The higher the proportion of white residents, the higher the proportion of white officers.

But this view is also too granular. The yellow cells now have only one or two cities. So I collapse the cells.

  Redo_wp_police_3

More of the data lie above the bottom-left-top-right diagonal, indicating that in the U.S., the police force is skewed white on average. When comparing cities, we can take this national bias out. The following view does this.

Redo_wp_police_4c

The point indicated by the circle is the average city indicated by relative proportions of zero and zero. Notice that now, the densest regions are clustered around the 45-degree dotted diagonal.
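Taking the national bias out amounts to subtracting the respective averages from each city's pair of proportions, so the average city lands at (0, 0). A sketch with simulated data (again, not the actual dataset):

```python
import numpy as np

rng = np.random.default_rng(2)
pct_residents = rng.uniform(20, 95, size=300)
pct_officers = np.clip(pct_residents + 15 + rng.normal(0, 8, size=300), 0, 100)

# Relative proportions: each city's distance from the average city
rel_residents = pct_residents - pct_residents.mean()
rel_officers = pct_officers - pct_officers.mean()

# Cities far from the 45-degree line through (0, 0) are the ones
# worth investigating further
deviation = rel_officers - rel_residents
print(round(float(deviation.mean()), 6))
```

By construction the deviations average to zero; what matters is their spread, and which cities sit in the tails.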

To conclude, the Washington Post data appear to show these insights:

  • There is a national bias of whites being more likely to be in the police force
  • In about one-fifth of the cities, the entire police force is reported to be white. (The following points exclude these cities.)
  • Most cities conform to the national bias, within an acceptable margin of error
  • There are a small number of cities worth investigating further: those that are far away from the 45-degree line through the average city in the final chart shown above.

Showing all the data is not necessarily a good solution. Indeed, it is frequently a suboptimal design choice.


Interactivity as overhead

Making data graphics interactive should improve the user experience. In practice, interactivity too often becomes overhead, making it harder for users to understand the data on the graph.

Reader Joe D. (via Twitter) admires the statistical sophistication behind this graphic about home runs in Major League Baseball. This graphic does present interesting analyses, as opposed to acting as a container for data.

For example, one can compare the angle and distance of the home runs hit by different players:

Redo_baseballhr

One can observe patterns, as most of these highlighted players have more home runs on the left side than the right side. However, for this chart to be more telling, additional information should be provided. Knowing whether the hitter is left- or right-handed or a switch hitter would be key to understanding the angles. Information about the home ballpark, and indeed differentiating between home and away home runs, would also be critical to making sense of this data. (One strange feature of baseball fields is that they all have different dimensions and shapes.)

Mode_homeruns

But back to my point about interactivity. The original chart does not present the data in small multiples. Instead, the user must "interact" with the chart by clicking successively on each player (listed above the graphic).

Given that the graphic only shows one player at a time, the user must use his or her memory to make the comparison between one player and the next.

The chosen visual form discourages readers from making such comparisons, which defeats one of the primary goals of the chart.


How effective visualization brings data alive

Back in 2009, I wrote about a failed attempt to visualize regional dialects in the U.S. (link). The raw data came from Bert Vaux's surveys. I recently came across some fantastic maps based on the same data. Here's one:

Dialectmap_soda

These maps are very pleasing to look at, and also very effective at showing the data. We learn that Americans use three major words to describe what others might call "soft drinks". The regional contrast is the point of the raw data, and Joshua Katz, who created these maps while a grad student at North Carolina State, did wonders with the data. (Looks like Katz has been hired by the New York Times.)

The entire set of maps can be found here.

***

What more evidence do we need that effective data visualization brings data alive... the corollary being bad data visualization takes the life out of data!

Look at the side by side comparisons of two ways to visualize the same data. This is the "soft drinks" question:

  Sidebyside_soda

 

 And this is the "caramel" question:

Side_by_side_caramel

 

 The set of maps referred to in the 2009 post can be found here.

 ***

Now, the maps on the left are more truthful to the data (at the zip-code level), while Katz applies smoothing liberally to achieve the pleasing effect.

Katz has a poster describing the methodology -- at each location on the map, he averages the closest data. This is why the white areas on the left-side maps disappear from Katz's maps.
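As I understand it, this is a form of nearest-neighbor smoothing: at each map location, average the responses of the k closest survey points, so no location is ever blank. A minimal sketch with synthetic coordinates and responses (Katz's actual weighting scheme may differ):

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic survey: (x, y) locations and a 0/1 response ("pop" vs "soda")
pts = rng.uniform(0, 10, size=(200, 2))
resp = (pts[:, 0] > 5).astype(float)  # the answer varies by region

def smoothed(x, y, k=25):
    """Average the k nearest responses; every location gets a value."""
    d2 = ((pts - np.array([x, y])) ** 2).sum(axis=1)
    nearest = np.argsort(d2)[:k]
    return resp[nearest].mean()

print(round(smoothed(1.0, 5.0), 2), round(smoothed(9.0, 5.0), 2))
```

Because the average is taken over the nearest points wherever you query, the gaps between survey respondents are filled in, which is exactly why the white areas of the dot maps vanish.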

The dot notation on the left-side maps has a major deficiency: the dot is a binary element, either present or absent. We lose the granularity of how strongly the responses are biased toward that answer. This may be why, in both examples, several of the heaviest patches on Katz's maps correspond to relatively sparse regions on the left-side maps.

Katz also tells us that his maps use only part of the data. For each point on his maps, he only uses the most frequent answer; in reality, there are proportions of respondents for each of the available choices. Dropping the other responses is not a big deal if the responses are highly concentrated on the top choice but if the responses are evenly split, or well-balanced say among the top two choices, then using only the top choice presents a problem.