
Lost in complexity

Felix Salmon (link) and others linked to this BBC News graphic about European debt recently.



At first sight, the use of arrows inside a ring, enhanced by an interactive filter by country, seems to be an inspired idea.

Then, I started clicking. Here is the German view.


According to the paragraph beneath the headline, the arrows show how much money is owed by each country to banks in other nations. So, it appears that German banks have borrowed about equal amounts from France and Italy, and also good amounts from the U.S., the U.K. and Japan. And German banks would be affected if these debtors were to default.

Now, take a look at the right column where BBC tells us "The biggest European economy is exposed to Greek, Irish and Portuguese, but mostly, Spanish debt." Say what?

Much more important than appearance, the designer must ensure that the data and conclusions make sense. Here, the chart doesn't support the discussion.


See also my previous post about Europe debt. 



What went wrong and how?

From Twitter @yoslevy comes this chart, and the tweet: "I am sure you don't need to understand Hebrew to find out what's going on [in this chart]". (original link here)

Indeed, we don't need to know the language.


It's always baffling how this sort of error gets into print.

Is the data wrong or are the bars wrong?

Or just maybe, this is the Hebrew Onion?

Ornaments or fireworks for Christmas?

When I saw this chart:


I was wondering if the inspiration was these Christmas lighting ornaments,


(credit for the image: here)

or perhaps this beautiful piece of fireworks.


(credit for the image: here)


The chart came from the same Warwickshire collection submitted by Alex L. (See the previous post here.)

This chart appeared to rank school districts by some arcane measure of academic performance based on public exam results (GCSE). It's anyone's guess what the color codes mean; I assume the intended readers of this chart would instinctively know the answer.

It also took me a while to figure out that the five groups on the left are the legend in disguise, carrying the aggregate statistics. Perhaps their unusually large size threw me off.


This data set illustrates the power of dot plots.


By sensibly grouping and sorting the data, one can easily understand both the average performance and the spread of performance, and both within each district group and across district groups.

If we can agree on where the acceptable standard is (say, 50% or more having "good GCSE"), then adding a vertical line to the above chart makes it that much more powerful.
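For readers who want to try the grouping and sorting themselves, here is a minimal sketch in Python. The records are made up for illustration (they are not the Warwickshire data), and the helper names `group_and_sort` and `text_dot_plot` are my own. Districts are sorted within each group, groups are sorted by their mean, and a `|` marks the assumed 50% standard on a crude text dot plot.

```python
from collections import defaultdict

# Hypothetical (group, district, % with "good GCSE") records.
# These numbers are invented for illustration only.
records = [
    ("North", "District A", 58.0),
    ("North", "District B", 47.5),
    ("North", "District C", 63.0),
    ("South", "District D", 41.0),
    ("South", "District E", 52.5),
    ("East", "District F", 70.0),
    ("East", "District G", 66.5),
]


def group_and_sort(rows):
    """Sort districts within each group, then sort groups by mean, best first."""
    groups = defaultdict(list)
    for group, district, pct in rows:
        groups[group].append((district, pct))
    ordered = []
    for group, members in groups.items():
        members.sort(key=lambda m: m[1], reverse=True)
        mean = sum(pct for _, pct in members) / len(members)
        ordered.append((group, mean, members))
    ordered.sort(key=lambda g: g[1], reverse=True)
    return ordered


def text_dot_plot(ordered, threshold=50.0, width=50):
    """Render one row per district: 'o' is the value, '|' is the threshold."""
    lines = []
    for group, mean, members in ordered:
        lines.append(f"{group} (mean {mean:.1f}%)")
        for district, pct in members:
            cells = [" "] * (width + 1)
            cells[round(threshold / 100 * width)] = "|"
            cells[round(pct / 100 * width)] = "o"
            lines.append(f"  {district:<12}" + "".join(cells))
    return "\n".join(lines)


ordered = group_and_sort(records)
plot = text_dot_plot(ordered)
print(plot)
```

The sorting does most of the work: once the rows are in order, both the average and the spread of each group can be read off at a glance, and the threshold line splits every group into those above and below the standard.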


The wall of blinking lights

Reader Alex L. submitted this chart showing the evolution of quality of life in Warwickshire in the U.K.


This wall of lights is drawing way too much power. Let's make a list of fixes:

  • Stretch out the hemisphere, turning those arcs into horizontal lines
  • Allow readers to read horizontally, rather than centrifugally (?)
  • Align horizontally all of the labels for the quality of life indicators
  • Allow readers to read indicator labels in one direction, rather than inside-out on the right hemisphere and outside-in on the left hemisphere
  • Assume readers understand that the first year for which there is data is the "baseline year"
  • Remove the distance between one data point and another, which makes unnecessary the white gridlines
  • Use rectangles (rather than circles) as they can be packed more tightly
  • Order the indicators in a meaningful way

Eventually this chart reveals itself as a heatmap:


The heatmap is much better. But the heatmap doesn't expose the trends clearly, especially the differences between indicators. The heatmap function (in R) has a built-in clustering method which automatically groups the indicators by similarity of trends. The color scheme should really be reversed since on this chart, red is good, and blue is bad; the default orientation of the column labels is also annoying. The bad indicators are clustered to the top, the good ones in the middle and the neutral ones at the bottom.


The next version uses the line chart, in a small-multiples setting. Now we have something to chew on.


Although not done here, we can order this set of charts using the clustering results from the heatmap.

The lesson is that the pretty colors in the heatmap really tell us much less than the plain levels in a line chart.



Showing off the world in charts

Stefan S., who works for the UN data project and is a regular contributor to this blog, points us to a new report they have issued that contains a host of charts. The report is an update on what has happened to our Earth since 1992 (The Earth Summit). Link to the PDF file here.


This life expectancy chart (shown on left) uses a Bumps-type chart, and is very nicely done, clean and informative. 


This age distribution chart shown on the right is unusual. It's a case of the data defeating the chart type. The magnitude of the 5-year changes is just not large enough as a percentage of the total to register. On a different data set, I can see this chart type being more effective.


Now, this criss-cross chart (bottom left) reminds me of Friedman's foolish attempt some time ago. It has various issues, like dual axes, excessive labels and inattentive titles (not indicating that the base population was only of developing countries).


Instead, I attempted an area chart, using population size as the primary metric. Perhaps a more direct way to illustrate this point is to plot the growth rate of the slum population versus that of the total population.


This map is excellent, showing the spatial distribution of the countries with above-average and below-average GDP per capita. It would be even better if smaller geographic units could be used so that the distribution within each country could also be seen.



I'd like to salute all the people around the world who work at statistical agencies and who collect and make sense of all of this data, without which none of these charts would have been possible.

Two lines dropping

Reader Ron D. was not pleased to see this dual-axis chart purporting to show a cause-effect between the decline in union membership and the drop in the proportion of income earned by middle-class households (defined as the middle 60% of households). Click here to read the original article. They credit CAP's David Madland and Karla Waters for this chart.


Using dual axes is a well-tested way of creating correlation where there may be none. Playing with the scales will do that for you. I wrote about this issue here.

However, the correlation in this data cannot be denied, as the scatter plot below shows. Note that the scatter plot is much better at revealing correlational patterns than a chart with multiple time-series lines. (Here's an example of two lines that display a spurious correlation.)


If one were to ask for a linear regression line, one would obtain a very high R-squared indeed (over 0.9). The problem is with the interpretation of this correlation. Any two data series that trend steadily with time will be highly correlated with each other, just because each series is highly correlated with time. Despite what you might believe after reading Freakonomics, regression -- especially in social science data -- cannot prove causation.
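To see how little a high R-squared means when both series trend with time, here is a small simulation sketch in Python. Everything in it is an assumption for illustration: the two series are generated independently, each as a straight-line decline plus random noise, and neither has anything to do with the union or income data. For a simple linear regression, R-squared is just the square of the Pearson correlation, which is what the code computes.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def trending_series(n, slope, noise_sd, rng):
    """A straight-line trend over n periods plus independent Gaussian noise."""
    return [slope * t + rng.gauss(0, noise_sd) for t in range(n)]


rng = random.Random(2011)  # fixed seed so the run is repeatable

# Two series generated independently; the only thing they share is time.
series_a = trending_series(40, slope=-0.5, noise_sd=1.0, rng=rng)
series_b = trending_series(40, slope=-0.2, noise_sd=0.5, rng=rng)

r_squared = pearson(series_a, series_b) ** 2
print(f"R-squared between two unrelated trending series: {r_squared:.2f}")
```

Neither series was built from the other, yet the fit looks impressive, because the common time trend accounts for nearly all of the variation in both.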

The writers at Think Progress show no such restraint, from the title "The American middle class was built by unions and will decline without them." to the sentences "these assaults have successfully decreased union membership over time... this has had a detrimental effect on the American middle class."

Note: these statements may in fact be true; I'm just pointing out that the chart does not buttress the assertions.


It's often hard to elevate a correlation to a causal effect. We have to try different tests. One such test for this data set is: if a change in union membership causes a change in middle-class incomes, then we'd expect the annual changes of one to be correlated with the annual changes of the other (at least in direction, better in magnitude).

So, in a year in which union membership declined a lot, one should expect to see middle-class incomes also drop substantially.

The next scatter plot contrasting these annual differences suggests that causation is probably absent. At this smaller time scale, one just doesn't see any correlation at all. Annual declines in the proportion of union membership have been around 2-4% for most of this period but shifts in middle-class incomes have ranged widely in direction as well as magnitude.
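The annual-change test is easy to sketch in Python. The series below are simulated stand-ins, not the actual union membership or income data, and the names `union_like` and `income_like` are labels I made up. The code compares the correlation of the levels with the correlation of the year-over-year changes for two series that are unrelated except for a shared downward drift.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def first_differences(xs):
    """Year-over-year changes: each value minus the one before it."""
    return [later - earlier for earlier, later in zip(xs, xs[1:])]


rng = random.Random(1992)  # fixed seed so the run is repeatable
years = 60

# Simulated, independent series that both drift downward over time.
union_like = [-0.5 * t + rng.gauss(0, 1.0) for t in range(years)]
income_like = [-0.2 * t + rng.gauss(0, 0.5) for t in range(years)]

r_levels = pearson(union_like, income_like)
r_changes = pearson(first_differences(union_like),
                    first_differences(income_like))

print(f"correlation of levels:         {r_levels:.2f}")
print(f"correlation of annual changes: {r_changes:.2f}")
```

Differencing strips out the common trend, so whatever correlation remains has to come from the year-to-year movements themselves. A weak correlation in the changes does not disprove causation (the effect could be lagged, for instance), but it does show that the strong correlation in the levels proves little on its own.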


P.S. Andrew suggested connecting the lines. Here are the charts with the lines:


What appears to be a very strong correlation on the left chart does not look that well-coordinated on the right chart! (The lines connect the dots in chronological order.)