After seeing this chart, my mouth needed a rinse

The credit for today's headline goes to Andrew Gelman, who said something like that when I presented the following chart at his Statistical Graphics class yesterday:

With this chart (which appeared in a large ad in the NY Times), Fidelity Investments wants to tell potential customers to move money into the consumer staples category because of "greater return" and "lower risk". You just might wonder what a "consumer staple" is. Toothbrushes, you see.

There are too many issues with the chart to fit into one blog post. My biggest problem concerns the visual trickery used to illustrate "greater" and "lower". The designer wants to focus readers on the two orange brushes: return for consumer staples is higher, and risk is lower, you see.

The "greater" (i.e. right-facing) toothbrush is associated with longer brushes and higher elevation; the "lower" (left-facing) toothbrush, with shorter brushes and lower elevation.

But looking carefully at the scales reveals that the return ranges from 6% to 14% and the risk ranges from 10% to 25%. So larger numbers are depicted by shorter brushes and lower elevation, exactly the opposite of one's expectation. The orange brushes happen to represent the same value of 14.3% but the one on the right is at least four times as large as the one on the left. As the dentist says, time to rinse out!

The vertical axis represents ranking of the investment categories in terms of decreasing return and/or risk so on both toothbrushes, the axis should run from 1 to 10.


How would the dentist fix this?

The first step is to visit the Q corner of the Trifecta Checkup. The purpose of this chart is for investors to realize that (using the chosen metrics) consumer staples have the best combination of risk and return. In finance, risk is measured as the volatility of return. So, in effect, all the investors care about is the probability of getting a certain level of return.

The trouble with any chart that shows both risk and return is that readers have no way of going from the pair of numbers to the probability of getting a certain level of return.

The fix is to plot the probability of returns directly.


In the above sketch, I just assumed a normal probability model, which is incorrect; but it is not hard to substitute this with an empirical distribution, if you obtain the raw data.
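For readers who want to try this at home, here is a minimal sketch of the calculation under the same (admittedly incorrect) normal assumption. The 14.3% return and 14.3% risk for consumer staples come from the chart; the function name `prob_return_above` and the thresholds are my own choices for illustration, not anything Fidelity published.

```python
import math

def prob_return_above(threshold, mean, sd):
    """P(annual return > threshold), ASSUMING returns are normally distributed."""
    z = (threshold - mean) / sd
    # Upper-tail probability of the standard normal, via the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2))

# Consumer staples per the chart: return 14.3%, risk (volatility) 14.3%
mean, sd = 14.3, 14.3

# Illustrative thresholds only
for threshold in (0.0, 10.0, 20.0):
    p = prob_return_above(threshold, mean, sd)
    print(f"P(return > {threshold:.0f}%) = {p:.2f}")
```

With the raw data in hand, one could swap the normal formula for the fraction of historical periods in which the return exceeded the threshold.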

Unlike in the original chart, consumer staples does not appear to be a clear-cut winner.



Fixing the visual versus fixing the story

It's great for me when my friend Alberto Cairo lends a helping hand (link). Here is the original chart showing deaths in African and Middle Eastern countries due to recent unrest:


This is Cairo's redesign:


There is no doubt the new version brings out the data more clearly. I like the cropping of the continent. I'd color-code the countries using the same legend as above.

I'm troubled by the concept of the original chart. I struggle to find any interesting correlation of deaths, whether with time, with government reaction, or with geography. Of the three, I think geography is the most correlated so a good design should bring that out. (Of course, geographical bias is expected and thus rather boring.)

If the intention of the chart is to answer the question of what factors affect deaths, then the wrong variables are being utilized.

So, as regards the Trifecta Checkup, Cairo solved the V problem while the D problem remains.


Data decorations, ornaments, chartjunk, and all that

Alberto Cairo left a comment about "data decorations". This is a name he's using to describe something like the windshield-wiper chart I discussed the other day. It seems the visual elements were purely ornamental and added nothing to the experience--one might argue that the experience was worse than just staring at the data table.

It just happens that I have another example of such a chart, submitted by Xan. This one is from Consumer Reports, and illustrates some findings from a recent survey on what things air travellers hate most. Good luck figuring all this out!


A few of these ideas work, such as the complaints about leg room being tied to the seated passengers inside the plane. But then, the data about people hating middle seats is placed on the upper left corner between the left wing and the tail. All of the atypically shaped charts (the cloud, the triangle, the octagon) seem to use the oft-criticized convention of coding the data onto just one dimension of these multi-dimensioned objects. I just find the organization of the text confusing and poorly structured.

Xan pulled something from a much older Consumer Reports. And they dared to use a boring bar chart:


A nice compromise would be to create some subsections under Airlines to group different types of complaints (stuff relating to seating, stuff about service, stuff about punctuality, etc.). Ask a designer to draw some icons (remember the NYT dog graphic!).

Dataviz worth your time

The New York Times Upshot team came up with a dataviz that is worth your time. This is a set of maps that gives a perspective on migration patterns within the US. The metric being portrayed is the birthplace of current residents of each state.

Here is the chart for California:


I see a few smart ideas, starting with the little map on the bottom left. It serves multiple functions. It is a legend mapping colors to four regions of the US. It serves as a visual guide to the definition of regions. It serves as an interactive tool to select states. Readers might remember the use of a pie chart as a legend in my remake of one of the Wikipedia pie charts (link).

The aggregation up to regions is what really makes this chart work. This aggregation reduces the number of pieces from about 50 to about 10.
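The aggregation step itself is simple, and a rough sketch may help readers who want to try it on their own data. Everything here is hypothetical: the counts are made up, the state-to-region mapping is a small stand-in (I am assuming the four Census-style region names), and `aggregate_by_region` is my own helper, not the Upshot team's code.

```python
from collections import defaultdict

# Hypothetical, partial state-to-region mapping (illustration only)
REGION_OF = {
    "Oregon": "West", "Washington": "West",
    "Illinois": "Midwest", "Ohio": "Midwest",
    "Texas": "South", "New York": "Northeast",
}

def aggregate_by_region(born_in_state):
    """Sum birthplace counts from the state level up to the region level."""
    totals = defaultdict(int)
    for state, count in born_in_state.items():
        totals[REGION_OF[state]] += count
    return dict(totals)

# Made-up counts (in thousands) of current residents by state of birth
born_in_state = {"Oregon": 30, "Washington": 20, "Illinois": 40,
                 "Ohio": 25, "Texas": 35, "New York": 50}
by_region = aggregate_by_region(born_in_state)
print(by_region)
```

Six pieces collapse into four here; with all the states, the same idea takes the chart from about 50 pieces to a handful.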

They also did a great job with the axes and gridlines. Most of the data labels are hidden but the most important numbers are retained. These include the proportion of residents who were born in their home state, the proportion of residents who were born outside the U.S., and any state(s) that contribute a significant portion of residents. In the California example, we see that the proportion of Midwest-born people living in California has declined by a lot over time.

Users can interactively hover over the gridlines to uncover the data labels.


As you scroll through the states, there are some recurring patterns.

Some states clearly have become more desirable over time. Georgia, for instance, has seen strong in-migration (colored pieces) especially from non-Southern states:


This pattern is repeated in other southeastern states, including Virginia, North Carolina and Tennessee.



By contrast, some states are not getting the migrants. As a result, the share of residents born in the home state has increased over time. The Midwestern states have this problem. For instance, Minnesota:


I also find a few states with special features. Nevada has always been a state of migrants:


Wyoming, on the other hand, has become popular with migrants over time, but the composition has shifted away from Midwest states.


I'd have preferred presenting the charts in clusters based on patterns.


I haven't been able to figure out the multi-color spaghetti. I think the undulations are purely for aesthetic reasons.

One way to read the chart, then, is to first see three big patches (light gray for born in current state; white patch for born in other U.S. states; dark gray for born outside the U.S.). Within the white patch, we are looking for the shift between the colors (i.e. regions).



How effective visualization brings data alive

Back in 2009, I wrote about a failed attempt to visualize regional dialects in the U.S. (link). The raw data came from Bert Vaux's surveys. I recently came across some fantastic maps based on the same data. Here's one:


These maps are very pleasing to look at, and also very effective at showing the data. We learn that Americans use three major words to describe what others might call "soft drinks". The regional contrast is the point of the raw data, and Joshua Katz, who created these maps while a grad student at North Carolina State, did wonders with the data. (Looks like Katz has been hired by the New York Times.)

The entire set of maps can be found here.


What more evidence do we need that effective data visualization brings data alive... the corollary being bad data visualization takes the life out of data!

Look at the side by side comparisons of two ways to visualize the same data. This is the "soft drinks" question:



And this is the "caramel" question:



The set of maps referred to in the 2009 post can be found here.


Now, the maps on the left are more truthful to the data (at the zip code level) while Katz applies smoothing liberally to achieve the pleasing effect.

Katz has a poster describing the methodology -- at each location on the map, he averages the closest data. This is why the white areas on the left-side maps disappear from Katz's maps.
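To see why averaging the closest data fills in the blanks, here is a bare-bones sketch of nearest-neighbor smoothing. This is not Katz's actual code or his exact method (his poster has the details); the function `knn_smooth`, the choice of k, the plain unweighted average, and all the data points are my assumptions.

```python
import math

def knn_smooth(point, observations, k=3):
    """Average the values of the k observations closest to `point`.

    observations: list of ((x, y), value) pairs, where value might be
    the share of respondents at that location giving a particular answer.
    """
    nearest = sorted(observations, key=lambda obs: math.dist(point, obs[0]))[:k]
    return sum(value for _, value in nearest) / len(nearest)

# Hypothetical survey locations with the share giving one particular answer
observations = [
    ((0.0, 0.0), 0.9),
    ((1.0, 0.0), 0.7),
    ((0.0, 1.0), 0.8),
    ((5.0, 5.0), 0.1),
]

# A location with no respondents still gets a value from its neighbors
print(knn_smooth((0.5, 0.5), observations, k=3))
```

Every point on the map has some nearest neighbors, no matter how far away they are, so nothing is left white.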

The dot notation on the left-side maps has a major deficiency, in that it is a binary element: the dot is either present or absent. We lose the granularity of how strongly the responses are biased toward that answer. This may be the reason why in both examples, several of the heaviest patches on Katz's maps correspond to relatively sparse regions on the left-side maps.

Katz also tells us that his maps use only part of the data. For each point on his maps, he only uses the most frequent answer; in reality, there are proportions of respondents for each of the available choices. Dropping the other responses is not a big deal if the responses are highly concentrated on the top choice but if the responses are evenly split, or well-balanced say among the top two choices, then using only the top choice presents a problem.
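A tiny sketch shows what gets lost. The response shares and the answer labels below are invented for illustration, and `top_choice_with_margin` is my own helper: two locations can share the same top choice while differing wildly in how dominant that choice is.

```python
def top_choice_with_margin(shares):
    """Return the most frequent answer and its lead over the runner-up."""
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    (top, top_share), (_, second_share) = ranked[0], ranked[1]
    return top, top_share - second_share

# Hypothetical response shares at two locations
concentrated = {"soda": 0.85, "pop": 0.10, "coke": 0.05}
split = {"soda": 0.45, "pop": 0.43, "coke": 0.12}

for name, shares in (("concentrated", concentrated), ("split", split)):
    answer, margin = top_choice_with_margin(shares)
    print(f"{name}: top choice = {answer}, margin = {margin:.2f}")
```

A map that uses only the top choice would treat both locations identically, even though the second is close to a coin flip.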



Book review: The Functional Art

Reading Alberto Cairo’s fabulous book, The Functional Art, feels like reading my own work. It’s staggering how closely aligned our sensibilities are, notwithstanding our disparate backgrounds: he is a data journalist by training, and I am a statistician. We could probably finish each other’s sentences, and did at this recent Analytically Speaking webcast (link to clip).

Cairo currently teaches data visualization at the University of Miami; this is after a distinguished career as a data/visual journalist, having won many awards.

The Functional Art is divided into halves, which can be read independently.

The front part is a terrific overview of data visualization concepts. Cairo’s interest is in principles, rather than recipes. The field of data visualization has developed separately under three academic disciplines: design, computer science, and statistics. Inevitably, the work products contain contradictions and much re-invention. Cairo achieves a synthesis of these schools of thought, and this book is the clarion call for more work on unifying the key intellectual threads of the field.

The second half contains a series of interviews with industry luminaries. This section is a unique contribution to the literature, glancing at behind-the-scenes of the craft. Practitioners will find these short pieces illuminating and profitable. It is often a long journey to arrive at the graphic in print. The selection of designers emphasizes mainstream media outlets although the interviewees have wide-ranging views.

Included in these pages are plenty of published data graphics, frequently work that Cairo produced while working for the Brazilian publication, Epoca. These graphics are elaborate and ambitious, and nicely reproduced in color images. They reward detailed study, with attention to composition, narrative structure, chart types, selection of statistics, etc.

There are plenty of books on the market about how to do graphics (Dona Wong, Naomi Robbins, and Nathan Yau come to mind). Cairo’s book is not about doing, but about thinking about charts. Trust me, time spent thinking about charts will much improve your charts.


I will now describe some sections of the book that particularly hold my interest:

In Chapter 3, Cairo explains the “visualization wheel,” a nice way to visualize the decisions that designers make when creating charts. Each decision is presented as a trade-off between two extremes. For example, a chart can be “light” or “dense.” This axis evokes Tufte’s data-ink ratio. Devices such as this wheel are useful for integrating the diverse viewpoints that coexist in our field. Frequently, these trade-off decisions are made implicitly—but they can really benefit from explicit consideration.

Figure 4.11 is one of the Epoca charts narrating a Brazilian election. Just recently, I linked to Cairo’s blog post about a similar chart. In both, a spider (radar) plot features prominently. On the same chart, you’ll find a nice demonstration of the small-multiples principle. I applaud the publisher of Epoca for supporting such deep data graphics.

Chapter 8 is invaluable in documenting the chart-making process. Trial and error is a key element of this process. Here, Cairo shows some of the earlier drafts of projects that eventually went to publication. This material is similar to what Kevin Quealy shows at his ChartNThings blog about New York Times graphics.

Chapter 9 is one of the more mature discussions of interactive graphics I have seen. Too often, interactivity is reduced to a feature that is layered onto any dataset. It should rightfully be seen as a problem of design.

Figure 10.1 is not strictly speaking a “data” graphic but I love John Grimwade’s visual explanation of the “transatlantic superhighway”.

Cairo also writes a blog.

Setting the right priority

On the sister blog, I wrote about a new report on the music industry lamenting that the hype over "Long Tail" retail has not really helped small artists (as a group). This was a tip sent by reader Patrick S. He was rightfully unhappy about the chart that was included in this summary of the report.


This classic Excel chart has some basic construction issues:

  • The data labels are excessive
  • The number of ticks on the vertical axis should be halved, given the choice to not show decimal places
  • With only two colors, it is a big ask for readers to shift their sight to the legend on top to understand what the blue and gray signify. Just include the legend text into the existing text annotation!

In terms of the Trifecta Checkup, the biggest problem is the misalignment between the intended message and the chart's message. If you read the report, you'd learn that one of their key findings is that the top 1% (superstar) artists continue to earn ~75 percent of total income and this distribution has not changed noticeably despite the Long Tail phenomenon.

But what is the chart's message? The first and most easily read trend is the fall in total income in the last 12-13 years. And it's a drastic drop of about $1 billion, almost 25 percent. Everything else is hard to compute on this stacked column chart. For example, the decline in the gray parts is even more drastic than the decline in the blue.

It also is challenging to estimate the proportions from these absolute amounts. Recognizing this, the designer added the proportions as text. But only for the most recent year.

So we have identified two interesting stories, one about the decline in total income and the other about the unending dominance of the 1 percent. This is where the designer has to set priorities. Given that the latter message is the headline of the report, it is better to plot the proportions directly, while hiding the story about total income. The published chart has the priority reversed. Even though you can find both messages on the same chart, it is still not a good idea to highlight your lesser message.
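Getting from the absolute amounts to the proportions is a one-line computation, which is why it is a shame to leave it to the reader. Here is a sketch with made-up numbers: the years and dollar figures below are hypothetical, chosen only to be roughly in the spirit of the report (a top share near 75 percent, a total falling by about a quarter), and `top_share` is my own helper.

```python
# Hypothetical income figures in $ billions, NOT taken from the report
income = {
    2000: {"top_1pct": 3.0, "everyone_else": 1.0},
    2013: {"top_1pct": 2.3, "everyone_else": 0.7},
}

def top_share(year_income):
    """Share of total income earned by the top 1% of artists."""
    total = sum(year_income.values())
    return year_income["top_1pct"] / total

for year, amounts in sorted(income.items()):
    print(f"{year}: top 1% share = {top_share(amounts):.0%}")
```

Plotting these shares by year puts the headline message front and center, and the falling total no longer competes for attention.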


Update on Dataviz Workshop 2

The class practised doing critiques on the famous Wind Map by Fernanda Viegas and Martin Wattenberg.


Click here for a real-time version of the map.

I selected this particular project because it is a heartless person indeed who does not see the "beauty" in this thing.

Beauty is a word that is thrown around a lot in data visualization circles. What do we mean by beauty?


The discussion was very successful and the most interesting points of discussion were these:

  • Something that is beautiful should take us to some truth.
  • If we take this same map but corrupt all the data (e.g. reverse all wind directions), is the map still beautiful?
  • What is the "truth" in this map? What is its utility?
  • The emotional side of beauty is separate from the information side.
  • "Truth" comes before the emotional side of beauty.

Readers: would love to hear what you think.


PS. Click here for class syllabus. Click here for first update.