
Knowledge in the chart and knowledge in the head

One of the many insights from Don Norman's great design book is that a user's behavior is affected by "knowledge in the world", and "knowledge in the head." Applied to graphics, this means readers of graphics use both knowledge in the chart and knowledge in the head. The recent debate between Alberto Cairo (and various others) and Andy Kirk about the following map illustrates this well.


As background, we should note that Cairo is Spanish-born and has taught at the Universities of Miami and of North Carolina while Kirk is a UK-based consultant.

Cairo is dismissive of this map, because the blue-orange split is really caused by the underlying population distribution. Notice that in coming to this conclusion, Cairo is using both knowledge imparted by the map and knowledge that pre-exists in his head. Someone who doesn't know the population density of the U.S. would not readily see the underlying cause of the pattern.

For those who can tell that the orange areas are the major urban areas, we are still making an assumption that every important metro area is painted in orange, and that there are no non-urban areas painted in orange. Do we really know the validity of these assumptions?


It is understandable that UK-based Kirk brings less knowledge of U.S. geography to the table:

Whilst I know roughly where the major cities of the US are, the size and population-density extremes of the country fascinate me so I find it interesting, particularly as a non-US person.

Graphics designers ought to think about how much knowledge the readers would be required to bring with them. It seems inappropriate to be imparting demographic lessons from a chart that depicts economic activity. Is there a way to provide Kirk with the lessons he desires without boring Cairo with "common knowledge"? That is the challenge here.


Please read the post on my sister blog for commentary on other aspects of this map. See this link.

Update on Dataviz Workshop 2

The class practised doing critiques on the famous Wind Map by Fernanda Viegas and Martin Wattenberg.


Click here for a real-time version of the map.

I selected this particular project because it is a heartless person indeed who does not see the "beauty" in this thing.

Beauty is a word that is thrown around a lot in data visualization circles. What do we mean by beauty?


The discussion was very successful and the most interesting points of discussion were these:

  • Something that is beautiful should take us to some truth.
  • If we take this same map but corrupt all the data (e.g. reverse all wind directions), is the map still beautiful?
  • What is the "truth" in this map? What is its utility?
  • The emotional side of beauty is separate from the information side.
  • "Truth" comes before the emotional side of beauty.

Readers: would love to hear what you think.


PS. Click here for class syllabus. Click here for first update.

Pets may need shelter from this terrible chart

Josh tweeted quite a shocking attack ad to me last week. He told me it came from the DC Metro. The ad is taken out by a group called HumaneWatch.Org, which apparently is a watchdog checking up on charity organizations. The ad attacks a specific group called the Humane Society of the United States. Here is the map that is the centerpiece of the copy:


I like to use the Trifecta checkup to evaluate graphics. It's a nice way to organize your visualization critique. You progress through three corners: figuring out what practical question the graphic addresses, then evaluating what data is being deployed, and finally whether the graphical elements (the chart itself) are well executed in relation to the question and the data.


Based on the map, it appears that HumaneWatch is interested in the spending on pet shelters. Every number shown is tiny: on a quick scan, the range may be from 0% to 0.35%. The all-caps title "A Whole Lotta Nothing" confirms that this is the intended message.

Knowing nothing about either of these organizations leaves me confused. Should the "Humane Society" be spending the bulk of its budget on pet shelters? If it doesn't, is it because the staff is pilfering money, or because it has wasteful spending, or because pets are not its major cause, or because pet shelters are not the key way this organization helps pets?

I did look up Humane Society to learn that it is an animal rights group. The four bullet points at the bottom of the ad provide a clue as to what the designer wanted to convey: namely, that this charity is a scam, with too much overhead spending, and spending on pensions.


So I think the question being asked is sufficiently clarified, and it's a pretty important one. How is this organization spending its donations? Is it irresponsible compared to other similar organizations?


The data should be in sync with the question being addressed; that's why there is a link between the two corners of the Trifecta. Given the trouble I endured understanding the question being addressed, it would come as no surprise that this chart scores poorly on the DATA corner.

I don't understand why budget spent on pet shelters is the key bone of contention. Based on the perceived objectives, it seems that they should display directly what proportion of the budget went to overhead, and what proportion went to pensions, with suitable comparisons.

The analysis by state is a disease of having too much data. Let's imagine that the proportions averaged across all states come to 0.1%. If we replaced those 50 numbers with one number printed across all states: "The Humane Society spends less than 0.1% of its budget on pet shelters.", the message would have been identical, while being less confusing.

And it's not just confusion. Cutting the data by state introduces complications. The analyst would need to make sure that any differences between states are not due to factors such as the number of pets, the proportion of households owning pets, the average spending per pet, the supply and demand for pet shelters, the existence of alternatives to pet shelters, etc. None of these issues need to worry the designer who does not slice the data down.

The same reason goes for why the absolute amount of spending (encoded in the colors of individual states) is not worth the ink it's printed on. The range between 0% and 0.35% has been chopped into seven pieces, which creates artificial gaps between the states. This design muddles the graphic's key message, "A Whole Lotta Nothing".


As we land on the final corner of the Trifecta, we ignore our previous complaint and accept that the proportion of budget is an interesting data series to visualize, and turn attention to the graphical elements. This chart scores poorly on chart execution as well!

Notice that the designer simultaneously plots two data series on the same map: the dollar value of pet shelter spending, and that spending as a proportion of budget. The former is encoded in the color of the state areas while the latter is printed directly as data labels. This is a map equivalent of "dual-axes" line charts, and equally unreadable.

Based on the color legend, our brain tells us the yellow states are better than the blue states but the huge numbers printed on the map convey the opposite message. The progression of colors makes little sense. The red and yellow stand out but those states are in the middle of the range.

It's a little blurry but I think there are a number of New England states in the high spending category (black and dark gray colors), and the map just happens to obscure this key feature.




DATA: Very Poor


Update on Dataviz Workshop 1

Happy to report on the dataviz workshop, a first-time offering at NYU. I previously posted the syllabus here.

I made minor changes to the syllabus, adding Alberto Cairo's book, The Functional Art (link), as optional reading, some articles from the recent debate in the book review circle about the utility of "negative reviews" (start here), and some blog posts by Stephen Few.

The Cairo and Few readings, together with Tufte, are closest to what I want to accomplish in the first two classes, before we start discussing individual projects: encouraging students to adopt the mentality of the course, that is to say, to think of dataviz as an artform. An artform implies many things, one of which is a seriousness about the output, and another is the recognition that the work has an audience. 

The field of data visualization is sorely lacking high-level theory, immersed as so many of us are in tools, data, and rules of thumb. It is my hope that these workshop discussions will lead to a crystallization of the core principles of the field.

We went on a tour of many dataviz blogs, and documented various styles of criticism. In the next class, we will discuss what style we'd adopt in the course.


The composition of the class brings me great excitement. There are 12 enrolled students, which is probably the maximum for a class of this type.  One student subsequently dropped out, after learning that the workshop is really not for true beginners.

The workshop participants come from all three schools of dataviz: computer science, statistics, and design. Amongst us are an academic economist trained in statistical methods, several IT professionals, and an art director. This should make for rewarding conversation, as inevitably there will be differences in perspective.


REQUEST FOR HELP: A variety of projects have been proposed; several are using this opportunity to explore data sets from their work. That said, some participants are hoping to find certain datasets. If you know of good sources for the following, please write a comment below and link to them:

  • Opening-day ratings from sites like Rotten Tomatoes
  • New York City water quality measures by county (or other geographical unit), probably from an environmental agency
  • Data about donors/donations to public media companies


Since this is a dataviz blog, I want to include a chart with this post. I did a poll of the enrolled students, and one of the questions was about what dataviz tools they use to generate charts. I present here two views of the same data.

The first is a standard column chart, plotting the number of students who include a particular tool in their toolset (each student is allowed to name more than one tool). This presents a simple piece of information simply: Excel is the most popular, although the long tail indicates the variety of tools people use in practice.


What the first option doesn't bring out is the correlation between tools, indicated by several tools used by the same participant. The second option makes this clear, with each column representing a student. This chart is richer as it also provides information on how many tools the average student uses, and the relationship between different tools.
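To make the difference between the two views concrete, here is a minimal sketch. The poll responses are made up for illustration (they are not the actual class data), and the variable names are my own. The first tally is what the column chart shows; the other two are what only the per-student layout can reveal.

```python
from collections import Counter
from itertools import combinations

# Hypothetical poll responses: one set of tools per student (NOT the real class data).
responses = [
    {"Excel", "R"},
    {"Excel", "Tableau"},
    {"Excel", "R", "D3"},
    {"Excel"},
    {"R", "D3"},
]

# View 1 (column chart): how many students named each tool.
tool_counts = Counter(tool for tools in responses for tool in tools)

# View 2 (one column per student): number of tools per student,
# and how often two tools are named by the same student.
tools_per_student = [len(tools) for tools in responses]
average_tools = sum(tools_per_student) / len(tools_per_student)
pair_counts = Counter(
    pair for tools in responses for pair in combinations(sorted(tools), 2)
)

print(tool_counts.most_common())
print(average_tools)
print(pair_counts.most_common())
```

The pair counts are the "correlation between tools" that the column chart throws away once each tool is tallied on its own.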


The tradeoff is that the reader has to work a little more to understand the relative importance of the different tools, a message that is very clear in the first option. 

This second option is also not scalable. If there are thousands of students, the chart will lose its punch (although it will undoubtedly be called beautiful).

Which version do you like? Are there even better ways to present this information?


A message worth repeating

When you see two time series, resist the temptation to plot them as lines on the same chart. According to the Atlantic, the following dual-axis chart has been making the rounds in the investment community: (thanks to Alberto Cairo for the tip)


There may be correlation or there may not be. When we look at a chart like this and see "correlation" -- actually a high degree of correlation -- what we are really talking about are the long-run trends being correlated. For example, the underlying data for this chart is most likely on a daily level. If you train your eye to a small part of the chart, you will notice that at the daily level, there is a lot of noise and a lot less correlation than you think.

Long-term trends being correlated does not imply short-term trends are also correlated!
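A small simulation illustrates the point. This is a sketch under stated assumptions: the two series are simulated (they are not Dow Jones data), they share the same upward drift, and their day-to-day noise is independent. The `pearson` and `daily_changes` helpers are written here for the example.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def daily_changes(series):
    """Day-over-day differences of a series."""
    return [series[i + 1] - series[i] for i in range(len(series) - 1)]


random.seed(0)
n_days = 500

# Simulated data (an assumption): same long-run drift, independent daily noise.
series_a = [0.1 * t + random.gauss(0, 2) for t in range(n_days)]
series_b = [0.1 * t + random.gauss(0, 2) for t in range(n_days)]

level_corr = pearson(series_a, series_b)
change_corr = pearson(daily_changes(series_a), daily_changes(series_b))

print(f"correlation of levels:        {level_corr:.2f}")
print(f"correlation of daily changes: {change_corr:.2f}")
```

The levels come out highly correlated because both lines drift upward, while the daily changes are close to uncorrelated, since by construction nothing links one day's wiggle in the first series to the same day's wiggle in the second.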

Furthermore, the long-run correlation is not enough to jump to the conclusion that the new trend will follow the old trend. When you make this conclusion, you are implicitly assuming that the mechanism causing the trend in the 1928-9 period is identical to that causing the current-period trend. This is when you realize that such an assumption is hard to support.


The Atlantic piece debunks this chart by re-expressing the data as indices. This means we switched from absolute changes in the Dow Jones average to relative changes. This has its own problem actually because the general level of the Dow Jones is so different between those two periods.
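Here is a rough sketch of what re-expressing a series as an index does. The numbers are invented to make the arithmetic easy (they are not actual Dow Jones levels), and `to_index` is a helper written for this example.

```python
def to_index(series, base=100.0):
    """Rescale a series so that its first value equals `base`."""
    first = series[0]
    return [base * value / first for value in series]


# Hypothetical levels (assumptions, not real index values): both series gain
# 30 points per step, but they start from very different general levels.
early_period = [300.0, 330.0, 360.0]
recent_period = [15000.0, 15030.0, 15060.0]

early_index = to_index(early_period)
recent_index = to_index(recent_period)

print(early_index)
print([round(v, 1) for v in recent_index])
```

The same 30-point move is a 10% change in the first series and a 0.2% change in the second. Indexing swaps absolute changes for relative ones, so the picture depends heavily on how different the starting levels are.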



Here are some posts I have written on dual-axis charts. I have been complaining about them since almost the beginning of this blog. Back in 2006, I wrote this piece which takes a different path to debunking a similar chart -- by compressing or expanding one of the axes.

In a more recent post, I showed an example of when it is natural to use two axes on the same chart.


A deeper look at the Bloomberg report

In the prior post, I linked to Eric P.'s (link) vetting of the Bloomberg chart on the drop in median male income in the U.S. in the last few decades.  Just as a reminder, here is the key chart:


In the 25-34 age group (blue line), the median income has suffered two waves of drastic declines, about 25% from 1972 to 1992 and then about 18% from 1999 to 2011.


There is a different way to digest the chart above, which is what I want to talk about in this post. Notice that people age over time so if you trace the blue line from left to right, at every point in time we are comparing different people.

Instead, let's trace the same people across time -- this is known as a cohort analysis. I traced a black line through the above chart:


This cohort consists of male workers who were 25 to 34 years old between 1972 and 1982. By 1982, they would have aged to between 35 and 44 years old and so they would belong to the green line. Then they shifted up to the yellow line. So over the lifespan, the median worker increased their income.

You might notice that this analysis is very rough because the data is not granular enough. For example, if you are 34 years old in 1972, by 1973, you have already moved from the blue to the green line. With the proper data, this analysis can be made precise. The weird jump (indicated by the dashed lines) is most likely a consequence of the imperfections in the cohorting.

If we have the birth year data, then we can trace people who are born in each year forward, and then stack all these traces on the same chart to figure out true generational changes. Imagine that the chart would have age on the horizontal axis.
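The difference between reading along a line and tracing a cohort can be sketched in a few lines of code. The medians below are made up for illustration (they are not the Bloomberg figures), and the ten-year step is the same rough approximation used in the black line above.

```python
# Hypothetical medians (thousands of dollars) by age band and year.
# These are illustrative assumptions, NOT the Bloomberg data.
median_income = {
    "25-34": {1972: 40, 1982: 36, 1992: 32},
    "35-44": {1972: 48, 1982: 46, 1992: 44},
    "45-54": {1972: 50, 1982: 49, 1992: 48},
}
age_bands = ["25-34", "35-44", "45-54"]


def trace_band(band):
    """Read along one line of the chart: same age band, different people."""
    return sorted(median_income[band].items())


def trace_cohort(start_band, start_year, step=10):
    """Follow the same people: move up one age band every `step` years."""
    path = []
    i = age_bands.index(start_band)
    year = start_year
    while i < len(age_bands) and year in median_income[age_bands[i]]:
        band = age_bands[i]
        path.append((year, band, median_income[band][year]))
        i += 1
        year += step
    return path


print(trace_band("25-34"))
print(trace_cohort("25-34", 1972))
```

In this toy table the 25-34 line falls over time, yet the cohort that starts on that line in 1972 sees its median rise as it ages into the older bands. Both readings come from the same data.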


One of the key elements of numbersense is realizing that there is no single way to analyze any given dataset. When the data is rich, it holds many different insights.

Applying numbersense to a Bloomberg report

Eric P. asked me to comment on his recent post via twitter. (Yes, that's another way to submit charts to me.)

Overall, I love the spirit of Eric's article. He's using his numbersense, in this case by asking what the data series would look like if we look further back in time, and if we look at more granular data, rather than being constrained by what the report (in this case, the Bloomberg reporter) shows us.

Here is the chart that appeared on Bloomberg, showing that the median income of American male workers has been in drastic decline over the last four decades across all age groups.


Here is Eric checking if the flat slopes captured the trends properly. Not quite:


I am a big fan of taking out local fluctuations and straightening out curves but in this case, they went a bit too far. It would have been better if the red and orange lines on the Bloomberg chart were shown as flat from the 70s to the 90s, followed by a downward slope. Here is an example where getting rid of the gyrations cleans up the message (link).

Here is Eric taking the data back to the 40s:


It's important to see that the reporter picked a point in time to avoid the strangeness of the kink in all those lines. I think only experts in this data can explain whether there was a change in metrics (even in the way inflation is computed since the data is adjusted for inflation) or whether there were other reasons for the abrupt reversal in trend.

Finally, Eric facetiously applied the line-straightening technique to the above chart and yielded the following:


Of course, Eric's having some fun. My own feeling is that the reporter did fine to focus on the trend (from the 70s) that is definitely real and current, and leave the other question perhaps for another time.


At the end of the analysis, I don't think the key message has changed... in particular, in the 25-34 age group, the median income has suffered two waves of drastic declines, about 25% from 1972 to 1992 and then about 18% from 1999 to 2011. Recall from your intro stats class that the median metric is supposed to be stable; if the median shifts so drastically, it is a remarkable trend!


Eric also points out a missed opportunity. This observation is so stark it really should have been highlighted on the chart:

Another conclusion we might draw from this chart is that median incomes for men were somewhat volatile depending on what political party was in power. Incomes decreased under every U.S. Republican President except Reagan (where it went up), and increased under every Democratic President except Obama (it’s too early to tell based on this data). Bloomberg’s analysis didn’t take any of this into account.

Here is the link to Eric's post; it's definitely worth reading.

Small multiples with simple axes

Jens M., a long-time reader, submits a good graphic! This small-multiples chart (via Quartz) compares the consumption of liquor from selected countries around the world, showing both the level of consumption and the change over time.


What they did right:

  • Did not put the data on a map
  • Ordered the countries by the most recent data point rather than alphabetically
  • Scale labels are found only on outer edge of the chart area, rather than one set per panel
  • Only used three labels for the 11 years on the plot
  • Did not overdo the vertical scale either

The nicest feature was the XL scale applied only to South Korea. This destroys the small-multiples principle but draws attention to the top left corner, where the designer wants our eyes to go. I would have used smaller fonts throughout.

Having done so much work to simplify the data and expose the patterns, it's time to look at whether we can add some complexity without going overboard. I'd suggest using a different color to draw attention to curves that are strangely shaped -- Ukraine comes to mind, so does Brazil.

I'd also consider adding the top liquor in each country... the writeup made a big deal out of the fact that most of the drinking in South Korea is of Soju.


One way to appreciate the greatness of the chart is to look at alternatives.

Here, the Economist tries the lazy approach of using a map: (link)


For one thing, they have to give up the time dimension.

A variation is a cartogram in which the physical size and shape of countries are mapped to the underlying data. Here's one on Worldmapper (link):


One problem with this transformation is what to do with missing data.

Wikipedia has a better map with variations of one color (link):


The Atlantic realizes that populations are not evenly distributed on the map so instead of coloring countries, they put bubbles on top of the map (link):


Unfortunately, they scaled the bubbles to the total consumption rather than the per-capita consumption. You guessed it: China gets the biggest bubble, much larger than anywhere else, but from a per-capita standpoint, China is behind many other countries depicted on the map.
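The arithmetic behind this complaint is simple enough to sketch. The two countries and their figures are invented for illustration; they are not taken from the Atlantic's map.

```python
# Hypothetical figures for two made-up countries (assumptions for illustration).
countries = {
    "Country A": {"total_litres": 40_000_000_000, "population": 1_300_000_000},
    "Country B": {"total_litres": 1_500_000_000, "population": 10_000_000},
}

# Per-capita consumption: total divided by population.
per_capita = {
    name: data["total_litres"] / data["population"]
    for name, data in countries.items()
}

largest_total = max(countries, key=lambda name: countries[name]["total_litres"])
largest_per_capita = max(per_capita, key=per_capita.get)

print(largest_total)
print(largest_per_capita)
```

Country A would get the biggest bubble on a total-consumption map purely because of its population, while Country B drinks far more per person.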


PS. A note on submissions. I welcome submissions, especially if you have a good chart to offer. Please ping me if I don't reply within a few weeks. I may have just missed your email. Also, realize that submissions take even more time to research, since they are likely to be in an area I have little knowledge about, and mostly because you sent it to me hoping I'll research it. Sometimes I give up since it's taking too much time. If you ping me again, I'll let you know if I'm working on it.

The above does not apply to emails from people who are building traffic for their infographics.


PPS. Andrew Gelman chimes in with his take on small multiples.

Oldie but goodie

Back in 2007, the New York Times graphics team produced a fabulous chart explaining the rise in prices at the pump (link).

Let's start with the tab labeled "Regional Price" which contains a well-executed map of the average gas prices by county:


The color scale is wonderful. It's just one color and yet the gradations are easily discerned. The general spatial pattern jumps out at you, with prices being higher on the Pacific coast, and lower from New England all the way down south. The Lakes region also has higher prices, as do New Mexico, Colorado and Hawaii.


The legend is just superb. Take a closer look:


What sets this legend apart is varying lengths of the segments. In particular, the darkest blue also corresponds to a wide range of prices (3.45-3.94). One can also easily figure out the lowest and highest price in the nation--the designers located exactly in which counties those prices were recorded, which is another nice touch.

To determine the breakpoints on the legend, one can use a statistical methodology: a standardized scale anchored on both sides of the national average price (from the other chart, the average price was $3.22). Then, we have each color mapping to the length of one standard deviation of prices in both directions. What this does is to put counties into standardized groups: for example, all counties whose prices were within one standard deviation above the average are given one tint while those that were one to two standard deviations above the average have a darker blue, and so on. In effect, we would have created a contour map.
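A minimal sketch of that binning rule follows. The $3.22 average is the figure quoted above; the standard deviation of $0.20 and the sample county prices are assumptions made up for the example, and `sd_band` is a helper name of my own.

```python
import math

NATIONAL_AVERAGE = 3.22   # from the NYT chart, as quoted in the post
ASSUMED_SD = 0.20         # hypothetical standard deviation of county prices


def sd_band(price, mean=NATIONAL_AVERAGE, sd=ASSUMED_SD):
    """Signed band number: 1 means 0 to 1 SD above the mean,
    2 means 1 to 2 SDs above, -1 means 0 to 1 SD below, and so on."""
    z = (price - mean) / sd
    if z >= 0:
        return math.floor(z) + 1
    return math.floor(z)


# Hypothetical county prices (illustrative values only).
sample_prices = [2.90, 3.05, 3.22, 3.40, 3.94]
bands = {price: sd_band(price) for price in sample_prices}

for price, band in bands.items():
    print(f"${price:.2f} -> band {band:+d}")
```

Each band number would then get its own tint, darker as the band moves further above the average. A real implementation would estimate the standard deviation from the county data rather than assume it.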


I see the designers' intention in clearly labeling the areas where they do not have data, with the diagonal stripes on white. My own preference is to put those areas in a mild gray, in effect blending them into the surroundings. In this way, the missing data do not distract the average reader, while the fastidious reader can still figure out where the data holes are.

This is a key learning for most research scientists. We have a tendency to train our eyes on the outliers and the data holes because they are like imperfections in diamonds. This leads us to the tendency of highlighting the least important message up front. And it's a bad habit.


In the following, I put the county and state level views side by side. The NYT graphic allows users to switch between the two views via a tab.


Much like the recent post on the age of buildings in Brooklyn, the state aggregates tell a simpler story but still capture almost all of the spatial pattern. The average prices per state are now printed directly on the chart. The question the designer should ask is what the readers want to learn from such a chart, and which one delivers more of such requirements. It's possible the Times is catering to two types of readers. Perhaps one can strike a middle ground, which is to break out certain states like Texas into contiguous "regions".