
A matter of compactness

Andrew Gelman may have nominated himself the graphics advisor for the World Happiness Report (link). That would be a very good thing.

To kick this off, I re-made Figures 2.1-2.2.8 in the report, which summarize the findings of the Gallup World Poll covering annual samples of 1,000 people aged 15 and over from each of 150 countries. (These charts are effectively the first charts to appear in the report. There is no Figure 1 because Chapter 1 has no charts. The report also inexplicably follows the outdated academic-publishing convention of banishing all diagrams to the end of the report as if they were footnotes.)

In the report, they presented histograms of the 0-10 ratings (10 = happiest) by region of the world, two charts a page running to 5 pages. Here's one such page:


If you're presenting regional data, you're expecting readers to want to compare regions. It's not very nice to make them flip back and forth and task their memory in order to do these comparisons.

This data set is where small multiples show their power. Small multiples are a set of charts all sharing the same execution (type, axes, etc.) but each showing different subsets of the data. This sort of chart is designed for group comparisons, and is one of the key propositions by Edward Tufte in his classic book.

In the following junkart version, I plotted each region's histogram against the global average histogram (indicated in gray as background). The average rating in each region is indicated with the light blue vertical line. The regions are sorted from highest average happiness to lowest.



The same data now occupies only one page of the report. (A topic for a different post: does the higher average rating in N. America/Europe indicate greater happiness or grade inflation?)


Alternatively, one can stack up the line charts into a column, as shown on the right. This view is somewhat better for any pairwise comparisons. (Calling JMP developers: how do I rotate the text labels to make them horizontal?)


Finally, I made a chart for exploratory purposes, using a scatterplot matrix (see also this post). In this version, every pair of regions is under the microscope. Since there are 10 regions (including the global total), we have (10*9)/2 = 45 pairwise comparisons. Each of these comparisons has its own chart in the matrix, indexed by the labels on the axes.

Each individual chart is a scatter plot of the proportions selecting a particular rating. If the histograms in region A and region B are identical, then we see 11 dots all lined up in a diagonal line going from bottom left to top right.

In addition, the pink area of the chart contains 95% of the data. So the more the pink area resembles a diagonal line, the more correlated are the histograms between the two regions being compared.
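To make the pair count and the diagonal concrete, here is a minimal Python sketch. The group labels and the rating counts are made up for illustration (they are not the Gallup data), and the helper names are hypothetical. It counts the panels in the matrix and measures how far a panel's dots stray from the diagonal.

```python
from itertools import combinations

# Placeholder labels: 10 groups, i.e. the regions plus the global total.
groups = [f"group_{i}" for i in range(1, 11)]
pairs = list(combinations(groups, 2))
print(len(pairs))  # 45, i.e. (10 * 9) / 2


def to_proportions(counts):
    """Turn counts for ratings 0..10 into proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def panel_points(hist_a, hist_b):
    """One (x, y) dot per rating: x is the proportion in A, y in B."""
    return list(zip(hist_a, hist_b))


def max_gap_from_diagonal(points):
    """Largest |x - y|; zero means every dot sits on the diagonal."""
    return max(abs(x - y) for x, y in points)


# Made-up counts for ratings 0..10 (illustration only).
counts_a = [1, 2, 3, 5, 8, 20, 18, 17, 14, 7, 5]
counts_b = [2, 4, 6, 10, 16, 40, 36, 34, 28, 14, 10]  # same shape, larger sample
counts_c = [5, 7, 14, 17, 18, 20, 8, 5, 3, 2, 1]      # a different shape

hist_a = to_proportions(counts_a)
hist_b = to_proportions(counts_b)
hist_c = to_proportions(counts_c)

gap_same = max_gap_from_diagonal(panel_points(hist_a, hist_b))
gap_diff = max_gap_from_diagonal(panel_points(hist_a, hist_c))
print(len(panel_points(hist_a, hist_b)))  # 11 dots, one per rating
print(gap_same < gap_diff)                # True
```

Because the dots are proportions rather than counts, two regions with the same histogram shape but different sample sizes still line up on the diagonal.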















For example, the very top chart compares CIS with East Asia. The thinness of the pink area tells us that the histograms of happiness ratings in those two regions resemble each other. You can easily verify this finding by looking at the first two line charts shown in the column of line charts above.

By contrast, the chart comparing CIS and Europe has an expansive pink area, meaning the happiness ratings follow different distributions. This is also verified by looking at the line charts, which show that Europeans are generally happier than people in CIS. There is an "excess" of people with ratings around 6-8 in Europe compared to CIS. The dots corresponding to these ratings would appear above the diagonal.

This scatterplot matrix explores all possible comparisons on one page, but it is a lab exercise, not suitable for mass consumption, because it has too much detail.


For those curious, the small multiples of line charts were made using R. The column of line charts and the scatterplot matrix were created using JMP.

Flooding the Himalayas

Just a quick post today as I've been traveling.

Reader Chris P. sent in this map showing tsunami risk around the world:


I don't have a larger version but here is Chris's comment:

 Not that residents of Lake Tahoe should worry about tsunamis, but the map makes it look like they should...


I'm not sure what's going on here because in some cases (India, Australia), apparently the entire country is subject to tsunami risk. Surely, the water won't rise up the Himalayas?

The Earth Institute needs a graphics advisor

Reader Dave S. was disturbed by the graphics in the inaugural World Happiness Report, published by Jeffrey Sachs's Earth Institute (link). It's a 200-page document with lots of graphs, many of which require rework.

Here's a pie chart showing (purportedly) what "happy" people in Bhutan are happy about:

I'm really curious how these domains add up to exactly 100%. Since the data came from some kind of survey, you typically would allow each respondent to pick more than one domain in which he or she is happy. If that is the case, then it would not make sense to add up responses, nor would the total (100%) signify anything.

If, on the other hand, respondents are forced to pick only one domain, it is very suspicious that all 9 domains would essentially receive the same number of votes. Nor would it make sense to ask survey-takers to select only one domain if all 9 domains contribute to someone's happiness.

Pie charts are perhaps the most abused chart type. There are just endless examples of poorly executed pie charts (just browse my last few posts). The prevalence of abuse may be reason enough to ban them.


Paired with Figure 4 shown above is Figure 5 shown below, which deepens the mystery:


Compare the captions. What's the difference between "In which domains do happy people enjoy sufficiency?" and "Indicators in which happy people enjoy sufficiency"? The categories are related but not identical (Education vs. Schooling, Health vs. Self reported health status, etc.). However, in Figure 5, the distribution is uniform as in Figure 4. Is the data contradictory? Or are the captions misleading?

This column chart would be better presented as a horizontal bar chart so that readers don't have to break their necks trying to read the category names.

The designer should also perform the routine task of removing the 120% tick mark on the proportion axis, which is an Excel default.




The importance of explaining your chart: the case of the red 118

Reader Jim S. was rightfully mystified by the following map that appeared on the Ars Technica blog (link), and purported to demonstrate that high temperatures of March 2012 across most of the U.S. were of historical significance.


I must say the production values of this map, produced by the people at NOAA, are superb. I love, love, love the caption that the Ars Technica editors added to the map. I wish they had blown it up to 20-point font, and made it shiny :) Besides that, the colors are well-chosen, and it doesn't feel cluttered despite having 48 numbers printed on it.

Like Jim, I'm hypnotized by the drumbeat of 118, 118, 118, ... all over the red area. What could the numbers mean? They could be temperatures in Fahrenheit (although 118 degrees in March surely would have been newsworthy). The legend does lend support to this interpretation (see right), what with the extra-large font announcing "Temperature". Jim commented: "But it seems odd that such a large area would have precisely the same high."


Not so soon, Jim. The NOAA also made the chart shown on the right (link). So indeed, the entire country could be given one value of 118.

If not Fahrenheit, what could the numbers mean? They could be some kind of index in which case the average value would seem to be 50 (the white patch). That would be one strange index.

Too bad this map is produced by specialists for specialists, leaving us commoners guessing. The only clue we got is in the title, "Statewide Ranks".

But this isn't very helpful either. The 118s are still ringing in my ears. If the numbers are ranks, then 118 would likely be the maximum rank, given that there are so many 118s. But I can't figure out which metric has 118 levels.

I finally found my way to this page, which explains what NOAA calls "climatological ranking". The page also has a chart (below), which can serve as a sort of legend for the maps, but is almost as difficult to read.

Apparently there are 118 years' worth of recorded temperatures, going back to 1895. And within each state, the annual temperatures for the past 118 years were ranked from lowest to highest, meaning that 118 is the hottest on record.

Given the lopsided attention to hotter temperatures (global warming), it would be much better to reverse the ranking so that 1 is the hottest year!

The chart also explains that the years are grouped into three equal buckets to indicate "below normal", "near normal" and "above normal".
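As a rough illustration of the ranking logic, here is a minimal Python sketch. The temperatures are synthetic, the function names are hypothetical, and the bucket cutoffs are an assumption (118 does not divide evenly by three), so this should not be read as NOAA's actual procedure.

```python
def climatological_rank(temps_by_year, year):
    """Rank a year's temperature among all years: 1 = coldest, n = hottest.

    Assumption: a tie takes the lowest rank among the tied years.
    """
    target = temps_by_year[year]
    return 1 + sum(1 for t in temps_by_year.values() if t < target)


def reversed_rank(rank, n_years):
    """Flip the scale so that 1 is the hottest year."""
    return n_years + 1 - rank


def tercile_label(rank, n_years):
    """Assign one of three roughly equal buckets (cutoffs are an assumption)."""
    if rank <= n_years / 3:
        return "below normal"
    if rank <= 2 * n_years / 3:
        return "near normal"
    return "above normal"


# Synthetic March temperatures for 1895..2012 (118 distinct made-up values).
temps = {year: 40 + ((year * 37) % 118) / 10 for year in range(1895, 2013)}
temps[2012] = 60.0  # pretend 2012 is the hottest year in this fake record

n_years = len(temps)
rank_2012 = climatological_rank(temps, 2012)

print(n_years)                             # 118
print(rank_2012)                           # 118
print(reversed_rank(rank_2012, n_years))   # 1
print(tercile_label(rank_2012, n_years))   # above normal
```

Seen this way, a map full of 118s simply says that, in each of those states, the latest value beat every earlier year in the record.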

Too bad this chart gives us three or five levels of ranking while in the map they use seven colors (levels).

They really ought to include on the map (a) the definition of the ranking and (b) the range of ranks corresponding to each color.


While researching this post, I found this wonderful page of NOAA maps (link). This is a beautiful illustration of the process of statistical aggregation. Notice the trade-off between simplicity and loss of information. The art in statistics is to figure out the right balance between the two.



I always like to explore doing away with the unofficial rule that says spatial data must be plotted on maps. Conceptually I'd like to see the following heatmap, where a concentration of red cells at the top of the chart would indicate extraordinarily hot temperatures across the states.


I couldn't make this chart because the NOAA website has this insane interface where I can only grab the rank for one state for one year one at a time. But you get the gist of the concept.


Did I tell you I love, love, love the caption? Go right ahead, and make a slogan for your chart today!


PS: Reader Mark Bulling (see his comment below) contributes a realization of my heatmap suggestion above. One of the benefits of this chart is its economy, as a small version of it shows:



Still shaken from the quake

On the news that the tsunami warning has been called off in Asia, I reminded myself of the human toll of the 2004 calamity by looking it up on Wikipedia (link).

I didn't receive a warm greeting as I was confronted with the following pie chart:



Maybe someone can help me understand what is being plotted here.

Maybe the designer was still shaken from the devastation of the quake when this was drawn?

Light entertainment: Spinning wheel at the fun fair

@TheChadd submitted the following chart via Twitter.

I don't know if "fun fairs" mean the same thing to me as to you, but that's where I got introduced to spinning wheel games. You stand 10 feet away from a multi-colored pie chart and throw darts (or other objects) at the circle. You win a gigantic teddy bear if you hit the narrow wedge, and maybe a sweet if you hit the big wedge.

To add to the fun, the pie chart is made to spin around slowly.


Well, we are at the fun fair and here is the spinning pie chart:


To see the real thing, click here.

Notice that this game has an extra level of difficulty; it spins both clockwise and counterclockwise.

Have a great weekend.

Breaking up a time series

My friend Augustine sends me to this press release by Kantar Research, via PaidContent (link).

This article expresses alarm that advertisers have cut their spending on online advertising in Q4 of 2011, especially on search and display advertising. An important person is quoted as saying that a shift to mobile ads explains this phenomenon.

Throughout this piece, it's hard to keep track of whether the growth rate is full year 2010 v. full year 2011, or Q4 2010 v. Q4 2011, or Q3 2011 v. Q4 2011. Based on the data table attached to the end, I think they use the first two metrics although the sentence "paid search fell in the 4th quarter by 1 percent" is often interpreted as falling 1 percent from Q3 to Q4.

The labeling on the following chart doesn't help:


Comparing Q4 2010 to Q4 2011 (and Q4 2009 to Q4 2010) is one way to do a crude seasonal adjustment, and I'm assuming that's what they did. If so, then each rate can be considered an annual growth rate for a particular quarter and the following chart would bring out the dramatic decline in a much clearer manner:
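To show why the choice of baseline matters, here is a minimal Python sketch. The spending figures are invented for illustration (they are not Kantar's numbers), and the function names are hypothetical. It computes the growth of the same quarter against the prior year, and against the prior quarter.

```python
def pct_change(new, old):
    """Percentage change from old to new."""
    return (new - old) / old * 100.0


def year_over_year(series, year, quarter):
    """Growth versus the same quarter one year earlier (crude seasonal adjustment)."""
    return pct_change(series[(year, quarter)], series[(year - 1, quarter)])


def quarter_over_quarter(series, year, quarter):
    """Growth versus the immediately preceding quarter."""
    previous = (year, quarter - 1) if quarter > 1 else (year - 1, 4)
    return pct_change(series[(year, quarter)], series[previous])


# Invented quarterly spending, keyed by (year, quarter).
spend = {
    (2010, 3): 100.0,
    (2010, 4): 120.0,
    (2011, 3): 110.0,
    (2011, 4): 118.8,
}

yoy_q4 = year_over_year(spend, 2011, 4)
qoq_q4 = quarter_over_quarter(spend, 2011, 4)
print(f"{yoy_q4:.1f}")  # -1.0
print(f"{qoq_q4:.1f}")  # 8.0
```

In this made-up series, "fell in the 4th quarter by 1 percent" is true year over year even though spending rose 8 percent from Q3 to Q4, which is exactly the ambiguity the labeling should resolve.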


Instead of starting another debate about line charts versus bar charts, I show them both, but continue to recommend the line chart.

In the original chart, either the data labels or the scaffolding (the vertical axis and gridlines) should be removed. If the data set is entirely printed on the chart, the designer expresses no confidence in the graphical elements.

The curiosity in this press release is the absence of mobile ad data. Apparently the key message of the article is not supported by the data set, which makes this a case of "story time". (I write about story time in the sister blog.)


Augustine writes about digital media here.

A chart that stops the story-telling impetus

We all like to tell stories. One device that has produced a lot of stories, and provoked much imagination is the dual-axis plot showing two time series. Is there a correlation or is there not? Unfortunately, most of these stories are false.

Looking at the following chart (link) showing the home sales and median home price in Claremont over the last six years, one gets the sense that the two variables move in tandem, kind of. Both time series appear to reach a peak in 2006 and a trough in 2011. In 2010, both series seem to be levelling off.

When the designer places two series on the same chart, he or she is implicitly saying: there is an interesting relationship between these two data sets.

But this is not always the case. Two data sets may have little to do with each other. This is especially true if each data set shows high variability over time, as is the case here.


Below is another view of the same data. In order to visualize any year-to-year effect or quarterly effect, I split the data along those dimensions. The year-to-year effect is quite strong although there isn't any interesting pattern. The quarterly effect is not so strong, and as the directions of the paths indicate, this effect is not consistent from year to year.


The scales on each axis are "standardized" meaning 0 is the average value, 1 is one standard deviation above the average, etc. Movements of 1 to 2 standard deviations are not unusual so one can see that almost all values on the chart are within 2 SD.
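For readers unfamiliar with standardizing, here is a minimal Python sketch. The input values are made up, the function name is hypothetical, and the use of the population standard deviation is an assumption (the post does not say which variant JMP applied).

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values so that 0 is the average and 1 is one SD above it.

    Assumption: population standard deviation (pstdev); swapping in the
    sample standard deviation would change the scale slightly.
    """
    center = mean(values)
    spread = pstdev(values)
    return [(v - center) / spread for v in values]


# Made-up values with mean 5 and population SD 2.
raw = [2, 4, 4, 4, 5, 5, 7, 9]
z_scores = standardize(raw)
print(z_scores)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]

# Share of values within 2 SD of the average.
within_2sd = sum(1 for z in z_scores if abs(z) <= 2) / len(z_scores)
print(within_2sd)  # 1.0
```

Putting both series on this common scale is what lets home sales and median price share a chart without a dual axis.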

There just doesn't seem to be a compelling story here. This chart taxes our imagination.

PS. In case you're wondering, this chart is made using Graph Builder in JMP (except for the arrows). I also wish JMP would allow me to use 1,2,3,4 (column data) as my plot objects instead of the standard dots and crosses, etc.

[4/11/2012: Thanks to Ken L. for submitting this chart. Also, Rob Simmon on Twitter points out that the house price data should be inflation-adjusted.]

Is that my third leg?

We look at another idea from the visualization project "Gaps in the US Healthcare System" (link). This was a tip from reader Jordan G. (link). One of the bright points about this project is the conscious attempt to try something different although the end result is not always successful.

A tree-like branching chart was used to represent cancer death rates, broken down by racial group, gender and type of cancer, in that order.


The tree structure loses its logic after the race and gender splits. Why link different types of cancers (the gray squares) together in a sequence? Stranger still is the existence of a third branch coming out of every race node (the four closest to the center). One branch is male, the other branch is female, what's the third leg? It appears to be prostate cancer which is male only--why doesn't it go with the male branch?

It's not easy to find the connection between what's depicted here and the idea of "gaps" in the US healthcare system. I think the question is ill-posed to begin with. The rate of death reflects both the possible differential quality of healthcare between groups and the differential incidence of cancers between groups, so no visualization tricks could be used to find reliable answers to the question being posed.

The chart fails the first corner of the Trifecta checkup. The chart type also does not fit the data.


The following chart plots the same data in a Bumps style.


I separated the male and female data since certain cancers are limited to one gender, and the gender difference is not likely to be the primary interest. The gender difference, incidentally, is clearly observed: the male death rates are generally about twice as high as the female rates of the same type of cancer, except for colorectal.

In terms of the "race gap", we find that black death rates are generally quite a bit higher than white death rates, especially for prostate cancer but except for lung cancer in females.

Asians and American Indians have practically the same death rates, but in both cases the sample sizes are small.

The raw data can be found at the CDC website here.