Fixing the visual versus fixing the story

It was great for me when my friend Alberto Cairo lent a helping hand (link). Here is the original chart showing deaths in African and Middle Eastern countries due to recent unrest:


This is Cairo's redesign:


There is no doubt the new version brings out the data more clearly. I like the cropping of the continent. I'd color-code the countries using the same legend as above.

I'm troubled by the concept of the original chart. I struggle to find any interesting correlation of deaths, whether with time, with government reaction, or with geography. Of the three, I think geography is the most correlated so a good design should bring that out. (Of course, geographical bias is expected and thus rather boring.)

If the intention of the chart is to answer the question of what factors affect deaths, then the wrong variables are being used.

So, as regards the Trifecta Checkup, Cairo solved the V problem while the D problem remains.


Data decorations, ornaments, chartjunk, and all that

Alberto Cairo left a comment about "data decorations". This is a name he's using to describe something like the windshield-wiper chart I discussed the other day. It seems like the visual elements were purely ornamental and added nothing to the experience; one might argue that the experience was worse than just staring at the data table.

It just happens that I have another example of such a chart, submitted by Xan. This one is from Consumer Reports, and illustrates some findings from a recent survey on what things air travellers hate most. Good luck figuring all this out!


A few of these ideas work, such as the complaints about leg room being tied to the seated passengers inside the plane. But then, the data about people hating middle seats is placed in the upper left corner between the left wing and the tail. All of the atypically shaped charts (the cloud, the triangle, the octagon) seem to use the oft-criticized convention of coding the data onto just one dimension of these multi-dimensional objects. I also find the organization of the text confusing and poorly structured.

Xan pulled something from a much older Consumer Reports. And they dared to use a boring bar chart:


A nice compromise would be to create some subsections under Airlines to group different types of complaints (stuff relating to seating, stuff about service, stuff about punctuality, etc.). Ask a designer to draw some icons (remember the NYT dog graphic!)

Dataviz worth your time

The New York Times Upshot team came up with a dataviz that is worth your time. This is a set of maps that gives a perspective on migration patterns within the US. The metric being portrayed is the birthplace of current residents of each state.

Here is the chart for California:


I see a few smart ideas, starting with the little map on the bottom left. It serves multiple functions. It is a legend mapping colors to four regions of the US. It serves as a visual guide to the definition of regions. It serves as an interactive tool to select states. Readers might remember the use of a pie chart as a legend in my remake of one of the Wikipedia pie charts (link).

The aggregation up to regions is what really makes this chart work. This aggregation reduces the number of pieces from about 50 to about 10.

They also did a great job with the axes and gridlines. Most of the data labels are hidden but the most important numbers are retained. These include the proportion of residents who were born in their home state, the proportion of residents who were born outside the U.S., and any state(s) that contribute a significant portion of residents. In the California example, we see that the proportion of Midwest-born people living in California has declined by a lot over time.

Users can interactively hover over the gridlines to uncover the data labels.


As you scroll through the states, there are some recurring patterns.

Some states clearly have become more desirable over time. Georgia, for instance, has seen strong in-migration (colored pieces) especially from non-Southern states:


This pattern is repeated in other southeastern states, including Virginia, North Carolina and Tennessee.



By contrast, some states are not getting the migrants. As a result, the share of residents born in the home state has increased over time. The Midwestern states have this problem. For instance, Minnesota:


I also find a few states with special features. Nevada has always been a state of migrants:


Wyoming, on the other hand, has become popular with migrants over time, but the composition has shifted away from Midwest states.


I'd have preferred presenting the charts in clusters based on patterns.


I haven't been able to figure out the multi-color spaghetti. I think the undulations are purely for aesthetic reasons.

One way to read the chart, then, is to first see three big patches (light gray for born in current state; white for born in other U.S. states; dark gray for born outside the U.S.). Within the white patch, we are looking for the shift between the colors (i.e. regions).



How effective visualization brings data alive

Back in 2009, I wrote about a failed attempt to visualize regional dialects in the U.S. (link). The raw data came from Bert Vaux's surveys. I recently came across some fantastic maps based on the same data. Here's one:


These maps are very pleasing to look at, and also very effective at showing the data. We learn that Americans use three major words to describe what others might call "soft drinks". The regional contrast is the point of the raw data, and Joshua Katz, who created these maps while a grad student at North Carolina State, did wonders with the data. (Looks like Katz has been hired by the New York Times.)

The entire set of maps can be found here.


What more evidence do we need that effective data visualization brings data alive... the corollary being bad data visualization takes the life out of data!

Look at the side by side comparisons of two ways to visualize the same data. This is the "soft drinks" question:



And this is the "caramel" question:



The set of maps referred to in the 2009 post can be found here.


Now, the maps on the left are more truthful to the data (at the zip code level) while Katz applies smoothing liberally to achieve the pleasing effect.

Katz has a poster describing the methodology -- at each location on the map, he averages the closest data. This is why the white areas on the left-side maps disappear from Katz's maps.
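To make the idea of "averaging the closest data" concrete, here is a minimal sketch of nearest-neighbor smoothing. It is not Katz's code: the function name `knn_smooth`, the choice of `k`, the plain unweighted average, and the sample observations are all assumptions made for illustration.

```python
import math

def knn_smooth(points, query, k=3):
    """Average the values of the k observations closest to `query`.

    points: list of ((x, y), value) pairs; query: an (x, y) location.
    """
    by_distance = sorted(points, key=lambda p: math.dist(p[0], query))
    nearest = by_distance[:k]
    return sum(value for _, value in nearest) / len(nearest)

# Made-up survey locations with the share answering "soda" at each one.
observations = [
    ((0.0, 0.0), 0.9),
    ((1.0, 0.0), 0.8),
    ((0.0, 1.0), 0.7),
    ((5.0, 5.0), 0.1),
]

# A location with no respondents of its own still receives a value,
# borrowed from its three nearest neighbors.
estimate = knn_smooth(observations, (0.5, 0.5), k=3)
```

Because every location borrows from its neighbors, no location is left blank, which is consistent with the white areas vanishing.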

The dot notation on the left-side maps has a major deficiency, in that it is a binary element: the dot is either present or absent. We lose the granularity of how strongly the responses are biased toward that answer. This may be the reason why in both examples, several of the heaviest patches on Katz's maps correspond to relatively sparse regions on the left-side maps.

Katz also tells us that his maps use only part of the data. For each point on his maps, he only uses the most frequent answer; in reality, there are proportions of respondents for each of the available choices. Dropping the other responses is not a big deal if the responses are highly concentrated on the top choice but if the responses are evenly split, or well-balanced say among the top two choices, then using only the top choice presents a problem.
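The concern about dropping all but the top choice can be illustrated with a small sketch. The response shares below are invented, and `top_choice_with_margin` is a hypothetical helper, not part of Katz's method; it simply reports how far the winning answer leads the runner-up.

```python
def top_choice_with_margin(shares):
    """Return (top answer, lead over the runner-up) for one location."""
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    (top, top_share), (_, second_share) = ranked[0], ranked[1]
    return top, top_share - second_share

# Invented response shares at two locations.
concentrated = {"soda": 0.85, "pop": 0.10, "coke": 0.05}
split = {"soda": 0.46, "pop": 0.44, "coke": 0.10}

winner_a, margin_a = top_choice_with_margin(concentrated)
winner_b, margin_b = top_choice_with_margin(split)
```

Both locations would be colored "soda" on a top-choice map, yet the margin is large in the first case and tiny in the second. Showing the margin (say, through color intensity) is one possible way to keep that information.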



Book review: The Functional Art

Reading Alberto Cairo’s fabulous book, The Functional Art, feels like reading my own work. It’s staggering how closely aligned our sensibilities are, notwithstanding our disparate backgrounds: he is a data journalist by training, and I am a statistician. We probably can finish each other’s sentences, and did at this recent Analytically Speaking webcast (link to clip).

Cairo currently teaches data visualization at the University of Miami; this is after a distinguished career as a data/visual journalist, having won many awards.

The Functional Art is divided into halves, which can be read independently.

The front part is a terrific overview of data visualization concepts. Cairo’s interest is in principles, rather than recipes. The field of data visualization has developed separately under three academic disciplines: design, computer science, and statistics. Inevitably, the work products contain contradictions and much re-invention. Cairo achieves a synthesis of these schools of thought, and this book is the clarion call for more work on unifying the key intellectual threads of the field.

The second half contains a series of interviews with industry luminaries. This section is a unique contribution to the literature, offering a glimpse behind the scenes of the craft. Practitioners will find these short pieces illuminating and profitable. It is often a long journey to arrive at the graphic in print. The selection of designers emphasizes mainstream media outlets although the interviewees have wide-ranging views.

Included in these pages are plenty of published data graphics, frequently work that Cairo produced while working for the Brazilian publication, Epoca. These graphics are elaborate and ambitious, and nicely reproduced in color images. They reward detailed study, with attention to composition, narrative structure, chart types, selection of statistics, etc.

There are plenty of books on the market about how to do graphics (Dona Wong, Naomi Robbins, and Nathan Yau come to mind). Cairo’s book is not about doing, but about thinking about charts. Trust me, time spent thinking about charts will make your charts much better.


I will now describe some sections of the book that particularly hold my interest:

In Chapter 3, Cairo explains the “visualization wheel,” a nice way to visualize the decisions that designers make when creating charts. Each decision is presented as a trade-off between two extremes. For example, a chart can be “light” or “dense.” This axis evokes Tufte’s data-ink ratio. Devices such as this wheel are useful for integrating the diverse viewpoints that coexist in our field. Frequently, these trade-off decisions are made implicitly—but they can really benefit from explicit consideration.

Figure 4.11 is one of the Epoca charts narrating a Brazilian election. Just recently, I linked to Cairo’s blog post about a similar chart. In both, a spider (radar) plot features prominently. On the same chart, you’ll find a nice demonstration of the small-multiples principle. I applaud the publisher of Epoca for supporting such deep data graphics.

Chapter 8 is invaluable in documenting the chart-making process. Trial and error is a key element of this process. Here, Cairo shows some of the earlier drafts of projects that eventually went to publication. This material is similar to what Kevin Quealy shows at his ChartNThings blog about New York Times graphics.

Chapter 9 is one of the more mature discussions of interactive graphics I have seen. Too often, interactivity is reduced to a feature that is layered onto any dataset. It should rightfully be seen as a problem of design.

Figure 10.1 is not strictly speaking a “data” graphic but I love John Grimwade’s visual explanation of the “transatlantic superhighway”.

Cairo also writes a blog.

Setting the right priority

On the sister blog, I wrote about a new report on the music industry lamenting that the hype over "Long Tail" retail has not really helped small artists (as a group). This was a tip sent by reader Patrick S. He was rightfully unhappy about the chart that was included in this summary of the report.


This classic Excel chart has some basic construction issues:

  • The data labels are excessive
  • The number of ticks on the vertical axis should be halved, given the choice to not show decimal places
  • With only two colors, it is a big ask for readers to shift their sight to the legend on top to understand what the blue and gray signify. Just include the legend text into the existing text annotation!

In terms of the Trifecta Checkup, the biggest problem is the misalignment between the intended message and the chart's message. If you read the report, you'd learn that one of their key findings is that the top 1% (superstar) artists continue to earn ~75 percent of total income and this distribution has not changed noticeably despite the Long Tail phenomenon.

But what is the chart's message? The first and most easily read trend is the fall in total income in the last 12-13 years. And it's a drastic drop of about $1 billion, almost 25 percent. Everything else is hard to compute on this stacked column chart. For example, the decline in the gray parts is even more drastic than the decline in the blue.

It also is challenging to estimate the proportions from these absolute amounts. Recognizing this, the designer added the proportions as text. But only for the most recent year.

So we have identified two interesting stories, one about the decline in total income and the other about the unending dominance of the 1 percent. This is where the designer has to set priorities. Given that the latter message is the headline of the report, it is better to plot the proportions directly, while hiding the story about total income. The published chart has the priority reversed. Even though you can find both messages on the same chart, it is still not a good idea to highlight your lesser message.
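Plotting the proportions directly just means dividing each segment by the column total before charting. A minimal sketch follows; the dollar figures are placeholders invented to mimic the pattern described above (a falling total with a steady share), not numbers taken from the report.

```python
# Placeholder figures (in $ billions), invented for illustration only.
income = {
    "start": {"top_1_percent": 3.0, "everyone_else": 1.0},
    "end": {"top_1_percent": 2.25, "everyone_else": 0.75},
}

def top_share(period):
    """Share of total income going to the top 1 percent in one period."""
    return period["top_1_percent"] / sum(period.values())

totals = {label: sum(period.values()) for label, period in income.items()}
shares = {label: top_share(period) for label, period in income.items()}
```

With these placeholder values the total falls from 4.0 to 3.0 while the share stays at 0.75 in both periods. A chart of `shares` would show a flat line, the headline message, while a chart of `totals` would show the decline.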


Update on Dataviz Workshop 2

The class practised doing critiques on the famous Wind Map by Fernanda Viegas and Martin Wattenberg.


Click here for a real-time version of the map.

I selected this particular project because it is a heartless person indeed who does not see the "beauty" in this thing.

Beauty is a word that is thrown around a lot in data visualization circles. What do we mean by beauty?


The discussion was very successful and the most interesting points of discussion were these:

  • Something that is beautiful should take us to some truth.
  • If we take this same map but corrupt all the data (e.g. reverse all wind directions), is the map still beautiful?
  • What is the "truth" in this map? What is its utility?
  • The emotional side of beauty is separate from the information side.
  • "Truth" comes before the emotional side of beauty.

Readers: would love to hear what you think.


PS. Click here for class syllabus. Click here for first update.

Update on Dataviz Workshop 1

Happy to report on the dataviz workshop, a first-time offering at NYU. I previously posted the syllabus here.

I made minor changes to the syllabus, adding Alberto Cairo's book, The Functional Art (link), as optional reading, some articles from the recent debate in the book review circle about the utility of "negative reviews" (start here), and some blog posts by Stephen Few.

The Cairo and Few readings, together with Tufte, are closest to what I want to accomplish in the first two classes, before we start discussing individual projects: encouraging students to adopt the mentality of the course, that is to say, to think of dataviz as an artform. An artform implies many things, one of which is a seriousness about the output, and another is the recognition that the work has an audience. 

The field of data visualization is sorely lacking high-level theory, immersed as so many of us are in tools, data, and rules of thumb. It is my hope that these workshop discussions will lead to a crystallization of the core principles of the field.

We went on a tour of many dataviz blogs, and documented various styles of criticism. In the next class, we will discuss what style we'd adopt in the course.


The composition of the class brings me great excitement. There are 12 enrolled students, which is probably the maximum for a class of this type. One student subsequently dropped out, after learning that the workshop is really not for true beginners.

The workshop participants come from all three schools of dataviz: computer science, statistics, and design. Amongst us are an academic economist trained in statistical methods, several IT professionals, and an art director. This should make for rewarding conversation, as inevitably there will be differences in perspective.


REQUEST FOR HELP: A variety of projects have been proposed; several are using this opportunity to explore data sets from their work. That said, some participants are hoping to find certain datasets. If you know of good sources for the following, please write a comment below and link to them:

  • Opening-day ratings from sites like Rotten Tomatoes
  • New York City water quality measures by county (or other geographical unit), probably from an environmental agency
  • Data about donors/donations to public media companies


Since this is a dataviz blog, I want to include a chart with this post. I did a poll of the enrolled students, and one of the questions was about what dataviz tools they use to generate charts. I present here two views of the same data.

The first is a standard column chart, plotting the number of students who include a particular tool in their toolset (each student could name more than one tool). This presents a simple piece of information simply: Excel is the most popular although the long tail indicates the variety of tools people use in practice.


What the first option doesn't bring out is the correlation between tools, indicated by several tools used by the same participant. The second option makes this clear, with each column representing a student. This chart is richer as it also provides information on how many tools the average student uses, and the relationship between different tools.


The tradeoff is that the reader has to work a little more to understand the relative importance of the different tools, a message that is very clear in the first option. 

This second option is also not scalable. If there are thousands of students, the chart will lose its punch (although it will undoubtedly be called beautiful).
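The two views correspond to two different summaries of the same poll responses, which a short sketch can show. The responses below are hypothetical, not the actual poll results; the variable names are mine.

```python
from collections import Counter
from itertools import combinations

# Hypothetical responses (one set of tools per student), not the real poll.
responses = [
    {"Excel", "R"},
    {"Excel", "Tableau"},
    {"Excel", "R", "Illustrator"},
    {"D3"},
]

# View 1: how many students name each tool (the column chart).
tool_counts = Counter(tool for tools in responses for tool in tools)

# View 2: which tools are named together by the same student.
pair_counts = Counter(
    pair for tools in responses for pair in combinations(sorted(tools), 2)
)

# Extra information only the per-student view carries.
tools_per_student = sum(len(tools) for tools in responses) / len(responses)
```

The pair counts summarize co-usage in a table whose size depends on the number of tools rather than the number of students, which may be one way around the scalability problem.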

Which version do you like? Are there even better ways to present this information?


Visualizing movements of people

Long-time reader Daniel L. sends in this chart illustrating a large data set of interstate migration flows in the U.S. The original chart is at Vizynary by way of Daily Kos.



There is no denying that this chart is beautiful to look at. But what is its message? That there are people migrating from and to every state? (assuming all fifty states are present)

Daily Kos describes how one can hover over any state to see its individual patterns. Something like this:


This is a great way, perhaps the only way, to consume the chart. Essentially, the reader is asked to generate a small-multiples panel of charts. The chart does a better job at showing the pairs of states between which people migrate than at showing the relative size of the flows. The size of the flows is coded in the width of the arcs. The widths are too similar to tell apart; and it doesn't help that no legend is provided.

The choice of color is curious. Each region of the country is its own color, in a "nominal" way. It is a design decision to emphasize regions.

Another decision is to hide information on the distances of the migrations. Evidently, the designer sacrificed that information in order to create the neat circular arrangement of states.

A shortcoming of this representation is one missing dimension: the direction of the flow. I'm not sure, given any pair of states A and B, whether the net migration is into A or into B.


I propose a solution using the map while preserving the interactive element of the original.

On this map, when you hover over a particular state, it highlights all other states for which there are migration flows into or out of that state. For color, use a blue-white-red scheme with blue indicating net inflow, red indicating net outflow, and white for near-zero flows. Include a legend.
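As a rough sketch of the color rule, the function below classifies the flow between the hovered state and one other state. The name `net_flow_color` and the `tolerance` cutoff for "near-zero" are assumptions; a real map would likely use a continuous diverging scale rather than three bins.

```python
def net_flow_color(inflow, outflow, tolerance=0.05):
    """Classify net migration for the hovered state as blue, red or white.

    inflow / outflow: movers into and out of the hovered state.
    tolerance: hypothetical cutoff, as a fraction of total movement,
    below which the net flow is treated as near zero.
    """
    total = inflow + outflow
    if total == 0:
        return "white"
    net = (inflow - outflow) / total
    if net > tolerance:
        return "blue"   # net inflow
    if net < -tolerance:
        return "red"    # net outflow
    return "white"      # roughly balanced
```

For example, `net_flow_color(1200, 800)` gives "blue", while `net_flow_color(1000, 990)` gives "white".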

Another important decision for the designer is absolute versus relative scales. In an absolute scheme, you rank the entire set of flows for all pairs of states; obviously, the resulting colors would be influenced by the state populations. Alternatively, you rank the flow sizes within each state; in this case, the smaller states will feel exaggerated.
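The difference between the two schemes can be sketched with made-up numbers. For simplicity the code scales each flow against a maximum instead of ranking it, which is a substitution on my part; the state names and flow sizes are hypothetical.

```python
# Hypothetical flows (number of movers) out of a large and a small state.
flows = {
    ("BigState", "A"): 50000,
    ("BigState", "B"): 20000,
    ("SmallState", "A"): 900,
    ("SmallState", "B"): 300,
}

def absolute_scale(flows):
    """Scale every flow against the largest flow in the whole data set."""
    largest = max(flows.values())
    return {pair: size / largest for pair, size in flows.items()}

def relative_scale(flows):
    """Scale every flow against the largest flow from the same origin."""
    largest_by_origin = {}
    for (origin, _), size in flows.items():
        largest_by_origin[origin] = max(largest_by_origin.get(origin, 0), size)
    return {
        (origin, dest): size / largest_by_origin[origin]
        for (origin, dest), size in flows.items()
    }

absolute = absolute_scale(flows)
relative = relative_scale(flows)
```

Under the absolute scheme the small state's largest flow scores 0.018 and would be nearly invisible; under the relative scheme it scores 1.0, the same as the big state's largest flow, which is the exaggeration mentioned above.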

The map has the additional advantage of showing the approximate distance (and direction) moved, which, for me, is a useful piece of information.