
New York Post simplifies a chart

Speaking of experiments in simplification (see here), I found a perfect example in the New York Post. (Courtesy of their marketing team, who had the bright idea of handing out free copies of the paper a few weeks ago.)


The chart on the left has been a staple of the Calculated Risk blog; it contains a wealth of information and, by itself, is a well-made chart. When the New York Post reported the November job loss statistic, the editors simplified the chart; their version is shown on the right. When I saw this chart, I applauded: by reducing the picture down to its essential elements, they succeeded in getting their message across more effectively.

Here are the changes that make the chart work:

  • Focusing on the last three recessions, as opposed to all past recessions
  • With only three lines, there is no need to use a rainbow of colors
  • Eliminating the minor tickmarks and labels on the time axis; this has the extra benefit of drawing attention to the zero line, which is the anchor of this chart
  • With fewer labels competing for real estate, the labels could be turned around so readers don't need to turn their heads around
  • Removing the graph paper background
  • Keeping the tickmarks but omitting some of the labels on the vertical axis. I'd however put a -2% instead of the -4% label

Credit: NYU librarians who tracked down a scanned version of the NYP chart.

Experiments in simplification

Julien D. sent us to this link, where a design agency picks up everyday objects and investigates what happens if the designs were to be simplified. Here is Nutella simplified:


I reckon this should be wonderful inspiration for chart designers.

Take your final design, remove components and interrogate whether you need those components.

(Related posts: self-sufficiency.)

Be guided by the questions 2

In a prior post, I showed a chart of Pisa test scores that can be used to investigate differences between any pair of countries. At least one reader found it confusing, containing too much data. I then realized that if the objective of the chart is re-stated as "How the UK fared relative to other OECD countries", which was the intent of the original Guardian chart, the chart could be presented in the following simplified fashion:



Simplification can be achieved in many ways, one of which is simplifying the objective. In fact, I'd not be opposed to showing just the left side of the chart, which addresses an even more general question, which is how the countries fared in a general sense.


While the lines in the Guardian chart display correlations of math, reading and science scores within specific countries, essentially a parallel coordinates plot, the same correlation can be visualized in a scatterplot matrix (see this post).

Each scatter plot here relates the scores of two subject areas as indicated by the axis labels. The simplest observation is the high degree of positive correlation on all three panels: in other words, countries in general do well in all three subjects, or poorly in all three subjects.

This pattern confirms why it isn't very productive to focus readers' attention on this set of correlations when dealing with this data set.

You'll notice the use of colored dots on the scatter plots. Imagine that I have put the countries into groups based on overall scores (rather than just reading scores) as in my earlier analysis. The dots of the same color represent countries that are deemed to have performed similarly. The black cross indicates the "average country".

Focusing on the colors for the moment, you can confirm yet again that a country doing well in one subject is highly predictive of it doing well in the other subjects.

As I pointed out at the start of the prior post, using a little statistical technique allows us to understand the data better, and plotting summaries of the data allows us to draw more interesting conclusions than putting all the data, unprocessed, onto a canvas.


Be guided by the questions

Information graphics is one of many terms used to describe charts showing data -- and a very ambitious one at that. It promises the delivery of "information". Too often, readers are disappointed, sometimes because the "information" cannot be found on the chart, and sometimes because the "information" is resolutely hidden behind thickets.

Statistical techniques are useful to expose the hidden information. They work by getting rid of the extraneous or misleading bits of data, and by accentuating the most informative parts. A statistical graphic distinguishes itself by not showing all the raw data.

Here is the Guardian's take on the OECD PISA scores that were released recently. (Perhaps some of you are playing around with this data, which I featured in the Open Call... alas, no takers so far.) I only excerpted the top part of the chart.

This graphic is not bad, could have been much worse, and I'm sure there are much worse out there.

But think about this for a moment: what question did the designer hope to address with this chart? The headline says comparing UK against other OECD countries, which is a simple objective that does not justify such a complex chart.

The most noticeable features are the line segments showing the correlation of ranks among the three subject areas within each country. So, South Korea is ranked first in reading and math, and third in science. Equally prominent is the rank of countries shown on the left-hand side of the chart (which, on inspection, shows the ranking of reading scores); this ranking also determines the colors used, another eye-catching part of this chart. (The thick black UK line is, of course, important also.)

In my opinion, those are not the three or four most interesting questions about this data set. In such a rich data set, there could be dozens of interesting questions. I'm not arguing that we have to agree on which ones are the most prominent. I'm saying the designer should be clear in his or her own mind what questions are being answered -- prior to digging around the data.

With that in mind, I decided that a popular question concerns the comparison of scores between any pair of countries. From there, I worked on how to simplify the data to bring out the "information". Specifically, I used a little statistics to classify countries into 7 groups; countries within each group are judged to have performed equally well in the test and any difference could be considered statistical noise. (I will discuss how I put countries into these groups in a future post, just focusing on the chart here.)

Here is the result: (PS. Just realized the axis should be labelled "PISA Reading Score Differentials from the Reference Country Group" as they show pairwise differences, not scores.)


Each row uses one of the country groups as the reference level. For example, the first row shows that Finland and South Korea, the two best performing countries, did significantly better than all other country groups, except those in A2. The relative distance of each set of countries from the reference level is meaningful, and gives information about how much worse they did. 

(The standard error seems to be about 3-6 based on some table I found on the web, which may or may not be correct. This value leads to very high standardized score differentials, indicating that the spread between countries is very wide.

I have done this for the reading test only. The test scores were standardized, which is not necessary if we are only concerned about the reading test. But since I was also looking at correlations between the three subjects, I chose to standardize the scores, which is another way of saying putting them on an identical scale.)
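To make "putting them on an identical scale" concrete, here is a minimal sketch of one common way to standardize: subtract the mean and divide by the standard deviation (z-scores). I am assuming this is roughly what is meant; the exact method used for the chart is not spelled out here, and the numbers below are invented for illustration, not actual PISA results.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Return z-scores: (score - mean) / population standard deviation."""
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / sigma for s in scores]


# Hypothetical scores on two very different scales (not real PISA data).
reading = [540.0, 500.0, 460.0]
math_scores = [60.0, 50.0, 40.0]

z_reading = standardize(reading)
z_math = standardize(math_scores)

# After standardizing, both subjects are expressed in standard-deviation units,
# so they can be compared or correlated directly.
print(z_reading)
print(z_math)
```

Once the scores are in standard-deviation units, a gap of 1.0 means the same thing in every subject, which is what makes cross-subject comparisons possible.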

Before settling on the above chart, I produced this version:


This post is getting too long so I'll be brief on this next point. You may wonder whether having all 7 rows is redundant. The reason why they are all there is that the pairwise differences lack "transitivity": e.g., the difference between Finland and UK is not the difference between Finland and Sweden plus the difference between Sweden and the UK. The right way to read it is to cling to the reference country group, and only look at the differences between the reference group and each of the other groups. The differences between two country groups neither of which is a reference group should be ignored in this chart: instead look up the two rows for which those countries are a reference group.

Before that, I tried a more typical network graph. It looks "sophisticated" and is much more compact but it contains less information than the previous chart, and gets murkier as the number of entities increases. Readers have to work hard to dig out the interesting bits.





The richness of nothingness

Statisticians investigate data, and it may seem like missing data should be ignored since no data means no analysis, right? Well, in practice, it turns out that the knowledge that data is missing is very powerful, and statisticians are, in fact, always wary of missingness.

A reader pointed me to Daily Kos for another chart which I'll eventually talk about -- but I got waylaid by this one (shown on the right), depicting the relative proportions of favorable and unfavorable ratings for a set of political players.

The data is simple, and the chart is sufficient although I'd avoid the blue/red coloring which connotes party affiliation in American politics. The graph also fails our self-sufficiency test.

But really, the big problem with this chart is not on the page. Alert readers might realize that relatively few people (in fact, only just over half) have an opinion of John Boehner.

In the following version, the proportion of missing/no opinion/don't know is plotted right beside favorables and unfavorables, revealing that this proportion ranges wildly from only 4% for Obama and Bush to 48% for Boehner.

This is one data set which makes stacked bar charts look better than they typically are. The two main categories of favorable and unfavorable can be stacked to the sides so that they can individually be compared. The middle part, which represents missing data, will usually not provide much information but in this dataset, the gaping blank space makes us think about how we should treat the missing data.

In this chart, we give equal weight to those who have an opinion and those who don't.


Alternatively, we could ignore the people with no opinion, and look at the proportion of favorables and unfavorables among those who have an opinion. There is a danger in doing this because, as seen above, the large proportion of don't knows would be hidden from view, and in the case of Boehner, and even for Pelosi and Tea Party, the amount of missing data raises interesting questions: have people not heard of these players? are they afraid of providing an opinion? are they conflicted? etc.
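The rescaling described above is simple arithmetic, sketched below. The function name and the favorable/unfavorable split are made up for illustration; only the 48% no-opinion share for Boehner comes from the survey numbers quoted in this post.

```python
def shares_among_opinion_holders(favorable, unfavorable):
    """Rescale favorable/unfavorable percentages to sum to 100,
    dropping respondents who gave no opinion."""
    with_opinion = favorable + unfavorable
    return {
        "favorable": 100.0 * favorable / with_opinion,
        "unfavorable": 100.0 * unfavorable / with_opinion,
        # Keep track of what the rescaled view hides.
        "no_opinion_hidden": 100.0 - with_opinion,
    }


# Hypothetical split of the 52% who have an opinion of Boehner.
boehner = shares_among_opinion_holders(favorable=24.0, unfavorable=28.0)
print(boehner)
```

Carrying the hidden no-opinion share along with the rescaled numbers is one way to keep the missingness in view, for instance as an annotation on the chart.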

Here is the alternative view, in which I have added a couple of comments to highlight things that otherwise would have been missed. One notable feature is that the respondents in this survey essentially view most of these players in similar light (40-50% favorable), as I don't see the differences in the center of the chart as meaningful.



Open call 1

I noticed a burst of activity on Twitter with "Junk Charts" nominations, too many for me to take care of. So, I'm trying a new feature, the Open Call. It's your chance to start the conversation on these charts.


Domiriel "Optimus ad with purposefully exaggerated differences...In yellow, length that matches percentages." (no link)






Boyan Penev

Rich data set, and not a bad choice, but there is better. Many decisions: scores v. ranks, colors, foreground v. background, segmentation, chart type.... (link to chart)
















Jim Williams


"the BBC normally do a good job but this is junk, impossible to compare GDP figures" (link to chart)

Failure to adjust is a mind trick: second mint

In a prior post, I looked at Mint's illustration of the 2010 retail growth data. I suggested little changes that can enhance their power of communication.

While writing that post, I noticed two larger issues with the text accompanying the first chart on their infographic poster. Here is that chart again:


The designer declared "story time" when he stated "Q3 saw the largest drop, as shoppers prepare for holiday shopping". The neighboring chart seemingly provides evidence for this holiday-accented story but, in fact, offers no support for it. How do we know the decline is fully or partially explained by holiday shopping? Read more about "Story time" on the Numbers Rule Your World blog, here and here.

The chosen narrative, in fact, does not hold water. The chart designer forgot that the data has been seasonally adjusted. The data shows that growth has slowed from Q2 to Q3 after removing the effect of seasonality. Holiday shopping, and also any associated sales contraction in its anticipation, happen every year, and their (seasonal) effects have been taken out of this picture. Relative to Q3 2009, the sales of Q3 2010 were 3% lower. This drop is not explained by holiday shopping since that factor affects both 2009 and 2010 figures; however, if there is something peculiar to the 2010 holiday season that should cause Q3 sales to decline more than usual (say), then we have a story.


A more thorough analysis would require the data be adjusted for inflation and for population growth as well. The former is necessary because a dollar in 2010 is not the same as a dollar in 2009 (well, given the low interest rate and low inflation, this factor may be immaterial at the moment); the latter is needed because retail sales should rise because of population growth even if the consumption level of the average person does not change.
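Here is a rough sketch of the two adjustments just described: deflate nominal sales by a price index, then divide by population. The function name and all the figures are hypothetical, chosen only to show how nominal growth can shrink once both adjustments are made.

```python
def real_per_capita(nominal_sales, price_index, population, base_index=100.0):
    """Convert nominal sales to base-period dollars, then to a per-person figure."""
    real_sales = nominal_sales * base_index / price_index
    return real_sales / population


# Hypothetical figures: sales in dollars, price index (base = 100), population.
prior = real_per_capita(nominal_sales=1000.0, price_index=100.0, population=100.0)
current = real_per_capita(nominal_sales=1030.0, price_index=102.0, population=101.0)

nominal_growth = 1030.0 / 1000.0 - 1.0   # growth before any adjustment
adjusted_growth = current / prior - 1.0  # growth after both adjustments

print(f"nominal growth:  {nominal_growth:.2%}")
print(f"adjusted growth: {adjusted_growth:.2%}")
```

With these invented inputs, most of the nominal growth is accounted for by higher prices and more people, which is the point of adjusting.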

Not long ago, adjusting data was attacked as a dishonest "trick" (see here). On the contrary, failing to adjust data is the more serious crime. I show this in the relatively safe haven of retail sales statistics but ask any statistician, and you will get the same advice: adjust the data.



Great charts get the little things right: first mint

Mint produced a set of charts (they call this an infographic) about the state of our retail sector prior to entering the all-important 4th quarter. There are things I like, and things I don't.

What they did well is to produce a separate chart for each key message.


Let's start with the last chart on the poster, which is the simplest and so the easiest to grasp.

The Bumps chart (shown on the right) compares the growth performance of four luxury retailers over the first three quarters of 2010. The text correctly summarizes the pattern as a "slow decline in growth over 3 quarters".

The chart focuses on the change of a change, the "second derivative"; thus, a downward-sloping line above zero indicates positive growth that decreases in magnitude. Just like the CO2 emissions charts (see Stefan's comment), the designer makes a conscious choice to invert our conventional sense of up is positive, down is negative, and would have done well by readers if he had slipped a brief note under the chart.

There is one extra detail -- better to label the vertical axis as "annualized growth rate (seasonally-adjusted)". The seasonal adjustment, as explained here recently, is needed so that every number on this chart can be directly compared to every other number. (The simplest form of seasonal adjustment is taking the difference of this year from the last year, which is what they did here; thus, Neiman Marcus's Q1 sales grew by 40% in 2010, relative to their Q1 sales in 2009.)
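The year-over-year comparison in the parenthesis above amounts to the following calculation, expressed here as a percentage change rather than a raw difference. The function name and the sales levels are assumptions for illustration; only the 40% growth figure comes from the chart.

```python
def year_over_year_growth(current, year_ago):
    """Growth of a quarter relative to the same quarter one year earlier."""
    return (current - year_ago) / year_ago


# Hypothetical index levels chosen to reproduce the 40% growth cited above.
q1_2009 = 100.0
q1_2010 = 140.0

growth = year_over_year_growth(q1_2010, q1_2009)
print(f"Q1 2010 vs Q1 2009: {growth:.0%}")
```

Because each quarter is compared with the same quarter a year earlier, any effect that recurs every year at that time drops out of the comparison.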

The current label, "change in spending per user '09-'10", juxtaposed with the time axis going from Q1 to Q3 2010, confuses and does not communicate.

How can we make a world of difference with a minor shuffle? Put the retailer names in the same order as the lines on the chart. Like this:



At the top of the poster sits a similarly-styled, slightly busier chart, comparing the change in growth rates of retail sales by type of retailer.

Four little things stick out for improvement.

Placing the line labels beside each line eliminates the need for a legend, and also removes the need to use six colors on the same page.

The 3% increment on the sales axis is unusual, as if to defy convention for the sake of defiance. I'd stay glued to a 5% increment.

The text sings a different tune from that of the chart. The only year visible to the reader is 2010 while the text refers to 2009 and 2008. This forces readers to crawl around the plumbing (to use Andrew Gelman's term); to see the connection to 2009, readers have to know the formula for seasonal adjustment. There is no linkage to 2008 as far as I can tell. Better to hide the pipes.

And then we have the little issue of the negative growth rate. In a chart like this, the negative numbers should shake up readers because they represent not just declining growth but contraction. Shoving this into a little corner does not do it justice.

The same chart with minor changes:



Thanks to reader Chris P. for sending in the link. Chris suggests that a longer time horizon be plotted.

In the next post (second mint), I will deal with the interpretation of these charts.








Handling multi-level data in multiple charts

Stefan S., whose team created the inkblot charts featured here, has updated his page of charts with a bunch of new ones, including some bubble charts. I had fun looking at a few of his experiments, and learnt a bit from them.

This chart deals with the geographical distribution of CO2 emissions and of wealth, and their correlation. The "standard" format would be to put pairs of columns onto a world map. That format has various weaknesses, and it is great to see Stefan try to reimagine the chart.

I especially love how he broke up the map into continents and subregions and arranged these pieces in a clean manner. He also recognized that stacked columns are not great for comparisons as the two pieces being compared don't usually sit at the same level. So in this chart, most pieces are level at their base. The solution is not perfect though, as for instance in the European section, it was very hard to put everything level without introducing white space.

Stefan also realized that you can't make it easy to compare distribution within a continent and across the globe in one chart, so he created the right-side column to solve this problem. Again, it's a good effort, not entirely successful but a very good start.


Another chart I like is the inkblot chart that deals with all levels of the data simultaneously. These, I think, manage to both engage our brains and entertain us.


Stefan specifically asked for feedback on this bubble chart:

I must say the map legend is a lovely touch.

As a whole, the scatter plot is effective at showing the positive [oops, I originally wrote "inverse"; see comment below] correlation between development and per-capita emissions.

The effort is enormously ambitious in terms of stuffing as many dimensions as possible onto the same chart. I feel like the data can benefit from being shown in a set of charts, rather than one.

For example, having the continental averages plotted with all the individual countries doesn't work for me. I'd rather see individual plots for each continent, with the continental average plotted against a background of all the individual countries. This proposed view is reminiscent of the Gapminder presentation from what seems like eons ago.

Also, many metrics are found here: population, per-capita emissions, total emissions, human development index. Anyone familiar with this business knows that there are controversies around different pairs of metrics, e.g. per-capita versus total emissions, total emissions v. population, per-capita emissions v. population. Thus, a panel of scatter plots that focuses on different pairs of metrics would encourage readers to ponder these practical questions. Otherwise, we may feel like we are being dumped in the deep end.

I'm sure Stefan would appreciate other comments on this or other charts.