
Declutter

Cluttered, and ineffective (from here)

IPMchart2_102609 


Nice!  Simple, sharp, fun even (from here)

Businessinsider


Better:

  • No data labels
  • No dots, squares, triangles
  • No legend, fun use of pictures


Could be even better:

  • No line shadows (get rid of that new Excel default)
  • Clean up the year labels, only one needed for each year
  • Include the line for "other devices" since this chart is about market share, and we don't want to miss the trend for "others"


Reference: "Chart of the Day", Oct 28 2009, BusinessInsider.com; "Apple Soars Behind iPhone 3GS Momentum", October 27, 2009, InvestorPlace.com



Life-enabling charts

In response to my call for positive examples, reader Merle H. sent in an example of how good charts can make our lives simpler and easier.

All of us have seen the following presentation of air travel data.

Travelocity

I'm not trying to pick on Travelocity; the format is the same whether you use Expedia or any of the airline sites.  For customers deciding which dates to travel so as to minimize their air fare, this format is very cumbersome to use.


Flight_chart

What about this fare chart at FuncTravel.com?

As you mouse along the line chart, the average fare for each day is visible.  Clicking on a particular day will fix the departure or return dates.

So much easier, isn't it?


A few caveats, though:

  • Instead of just providing the historical averages, they should consider including information on variability, such as bars that indicate the middle 50% or 75% of prices.  Also, what about a sliding control for customers to decide which period of past history the averages should use?  More recent data may be more representative.
  • This particular feature appeals to the price-sensitive, date-flexible customer segment.  Not everyone will pick itineraries based on those criteria.  There is an easy fix. If some controls are available for customers to indicate other preferences, e.g. exclude all British Airways flights, include only evening flights, etc., and the chart can update itself based on such selections, then the chart becomes a lot more flexible, and useful to many more customers.
  • As with many automatically generated charts, the chosen labels on the vertical axis are laughable.  That should be relatively easy to fix, you'd think.

A great start.  I happened to notice that Travelocity has a beta feature that shows a similar chart.  A revolution in how travel sites present data to us is long overdue.
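
The first caveat above (showing variability next to the average) can be sketched in a few lines.  This is only an illustration: the fares and dates are made up, and `fare_summary` is a hypothetical helper, not anything the travel sites expose.

```python
from statistics import mean, quantiles

def fare_summary(fares):
    """Return the mean and the middle-50% band (25th to 75th percentile) of fares."""
    # method="inclusive" treats the list as the full set of observed fares.
    q1, _median, q3 = quantiles(fares, n=4, method="inclusive")
    return {"mean": mean(fares), "low": q1, "high": q3}

# Hypothetical past fares (in dollars) observed for two departure dates.
history = {
    "2009-11-24": [310, 325, 340, 410, 515],
    "2009-11-25": [280, 285, 290, 295, 300],
}

for day, fares in history.items():
    s = fare_summary(fares)
    print(f"{day}: mean {s['mean']:.0f}, middle 50% from {s['low']:.0f} to {s['high']:.0f}")
```

A chart could draw the `low` to `high` range as a bar behind each day's average, so that a date with a modest average but a wide band is not mistaken for a safe choice.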






Following one's nose 1

Andrew Gelman has a great post about a so-called Immigrant paradox here, which should be interesting to our readers too.

He posed a set of sharp questions.  My read, in reverse order:

Immigrant_paradox

6. The graph is pretty effective, I agree.  This is known as an "interaction plot".  The message the authors were trying to send was that the gap between immigrants and U.S. born in terms of prevalence of mental illness is not constant across sub-groups of Latinos.  For example, the gap for Mexicans (light blue) is larger than the gap for Puerto Ricans (pink).  Thus, the authors concluded that one should be careful about speaking of an aggregate (average) gap.

The graph lays this out clearly.  The steeper the line, the bigger the gap between the  immigrants and non-immigrants.

When Andrew showed this, I knew for sure someone would cry foul that a line is drawn between unrelated, discrete things.  Indeed, the very first commenter weighed in with this complaint.  In fact, whenever I show such charts to non-statisticians, a lot of people have this reaction.

So I'll take this as another chance to convince you to release interaction plots from jail.

Mental_nolines

Typically, a dissenter will offer up a dot plot as an alternative.  So let's look at the same chart without the lines.  Since the reader is supposed to figure out how the gap between U.S. born and immigrant groups varies across different subgroups of Latinos, the proverbial nose is tracing a line from a left dot to a right dot.  Thus, to follow one's nose is to mentally draw the lines I just removed.  The chart designer has done us a favor by making the lines explicit.

In addition, as Andrew pointed out, it is always better to try to get rid of the legend and put the line labels directly onto the chart.

One shortcoming of the interaction plot is that it does not disclose the relative importance of the different lines, which correspond to the relative proportions of people in these subgroups.  Without this information, the reader will likely assume the lines have equal weight.  This assumption, as I will explain in a future post, may be a problem.


This post dealt with the graphical aspect.  I will have more to say about Andrew's other points on the statistics in a future post.


The hard work of entertaining

Stefan pointed us to his work for the UN GEO (United Nations Global Environment Outlook) data portal.  This set of information posters highlights a vexing issue that crops up on Junk Charts from time to time, that is, the proper balance between information and entertainment value of data displays.  While this blog concerns itself primarily with the former, it does not mean that we are blind to the flashier side of the enterprise.

Geo1



Recycling

Let's take Stefan's recycling spiral chart as an example.  One must admit that visually this presentation is more appealing than either a data table or a set of bar charts.  The reader can obtain the primary piece of information, which is the ranking of different countries in terms of the proportion of collected waste that is recycled.

And if the reader is curious enough, the chart also provides the data on the per-capita amount of waste collected in each of these countries.  (Like the table and bar chart, this display also has the problem that it is one-dimensional, thus the countries can be sorted by proportion of recycling but then the waste collected data will be out of order.)

Readers who would like to understand the data better would want to know some of the following:

  • Is there a relationship between amount of waste collected and amount of waste recycled?
  • Are there differences in culture resulting in different recycling rates?
  • Is the level of development of a country predictive of its recycling rate?
  • Why are some countries recycling more of their waste, and others less?


To address these types of questions, one can start with the following scatter plot.

Redo_recycling2 


With the exception of South Korea, there is a general pattern of positive correlation: the more waste collected per capita, the larger the proportion of such waste recycled.  Any dots that are not in the bottom left or top right quadrant are exceptions to the rule.  These countries are labeled in red or blue, the former indicating that the amount of collection is above average while the rate of recycling is below average, and the latter the reverse.

Because there is sampling error, dots that are close to the average dot (the center of this scatter plot) are probably just average.  Roughly speaking, dots in the gray circle are close enough to the center that I would not consider them exceptional cases.  That leaves Spain and Iceland in the red corner, and South Korea in the blue corner.  If both data series are considered together, these three countries should merit attention; if only the proportion of recycling is considered, then one would pay attention to Italy, Turkey and Slovak Republic on the lower end and South Korea on the high end.
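
The reading of the scatter plot described above can be written down as a rough rule.  This is a sketch under stated assumptions: the inputs are already in standardized units, the `radius` of 0.5 is a guess at the size of the gray circle (its actual size is not given here), the meaning of "blue" is inferred as the mirror image of "red", and the points are invented placeholders rather than real countries.

```python
from math import hypot

def classify(z_waste, z_recycle, radius=0.5):
    """Label a point by where its standardized values fall on the scatter plot.

    radius is an assumed cutoff for "close enough to average".
    """
    if hypot(z_waste, z_recycle) <= radius:
        return "about average"
    if z_waste > 0 and z_recycle < 0:
        return "red: more waste collected, less recycled"
    if z_waste < 0 and z_recycle > 0:
        return "blue: less waste collected, more recycled"
    return "follows the general pattern"

# Hypothetical standardized values (waste collected, proportion recycled).
points = {"A": (1.2, -0.9), "B": (-1.0, 1.5), "C": (0.2, -0.3), "D": (1.0, 0.8)}

for name, (zw, zr) in points.items():
    print(name, "->", classify(zw, zr))
```

Checking the distance from the center first is what keeps a point like C, which sits in an "exception" quadrant but is barely off the average, from being flagged.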

Scatter plots are very versatile.  The following one explores the issue of development level.  Surprisingly, the level of recycling seems to have little to do with development; the countries are quite widely scattered.

Redo_recycling3b
 

Technical note: The data on both axes are expressed in "standardized" units.  So the zeroes represent the average per-capita waste collected, and the average proportion of waste recycled (only of those countries depicted in the original chart).  +1 indicates an amount that is one standard deviation above the average.  Think of "standardized units" as measuring how extreme a particular country is with respect to the average.
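
For readers who want to see the arithmetic, here is a minimal sketch of standardizing a data series.  The rates are made-up numbers, not the figures from the poster, and the choice of the population standard deviation is an assumption, since the note above does not say which version was used.

```python
from statistics import mean, pstdev

def standardize(values):
    """Convert raw values to standardized units: (value - average) / standard deviation."""
    center = mean(values)
    spread = pstdev(values)  # population SD; an assumption, see the lead-in
    return [(v - center) / spread for v in values]

# Hypothetical recycling rates (percent).
rates = [10, 20, 30, 40, 50]
print([round(z, 2) for z in standardize(rates)])
```

The value equal to the average (30) maps to 0, and the others land at equal distances on either side, which is why the center of the scatter plot is the "average dot".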


Getting the Nobel wrong

Epic-fail-percentage-display-fail

Why do we pay so much attention to seemingly inconsequential details of a chart's scale, colors, labels, etc.?

Here's why. (Reader Omegatron pointed us to the FAIL blog that captured this beauty.)


Notice the messed-up horizontal scale, in particular, the failure to start the axis at 0.  The result: the tiniest difference presented as a wide gulf.
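
To put a number on the distortion, take the figures reported in one of the reader comments quoted further down: an axis starting at 45% and a vote of 53 to 47.  The helper name `bar_length_ratio` is made up for this sketch.

```python
def bar_length_ratio(larger, smaller, baseline=0.0):
    """Ratio of the drawn bar lengths when the axis starts at `baseline`."""
    return (larger - baseline) / (smaller - baseline)

# Axis starting at 45%: the bars are drawn with lengths 8 and 2.
print(bar_length_ratio(53, 47, baseline=45))  # 4.0

# Axis starting at 0%: the bars are drawn with lengths 53 and 47.
print(round(bar_length_ratio(53, 47), 2))     # about 1.13
```

A six-point gap that should look like a 13 percent difference in bar length is drawn as a fourfold difference.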


The graph, published by the Washington Post, has since been fixed.  See here.  Nevertheless, the comments left by readers lent witness to the confusion.  I copied the first bunch that mentioned the graphical display - there were plenty of for-and-against-the-prize comments, many assuming that the poll result was as lop-sided as the chart seemed to indicate.

By starting at a base of 45% (as of this reading), your graphic grossly misrepresents the results of your poll. The "no" bar is four times as big as the "yes" bar, giving the visual impression that the vote must have gone 80-20 against Obama's Nobel. On the contrary, as of now the vote is 53-47 against. Whoever produced this graphic should re-read Edward Tufte's "The Visual Display of Quantitative Information."

Posted by: threefab | October 9, 2009 8:33 AM


Someone made comment that the graph is misleading, but it really isn't if you know how to read a graph.

This type of graph highlights the difference, not the complete number of votes and is appropriate when viewing percentages.

Posted by: gconrads | October 9, 2009 8:39 AM

why is graph not drawn to scale? The vote is 51 - 49 no and the graph looks like an overwhelming number of "voters' said no

Posted by: spitts1 | October 9, 2009 8:41 AM


The graph is purposely designed to make it appear that there are a huge number of no votes and demonstrates obvious bias.

Posted by: fingersfly | October 9, 2009 8:41 AM


I'm looking at the graphic here and cannot figure out what you guys are trying to show?
The tiny slice of blue is supposed to be 50% and the huge slice of red is supposed to be 50% and the scale at the bottom is......
WHAT?

Posted by: Tomcat3 | October 9, 2009 8:53 AM


It is my understanding that the President did not "win", but was "awarded" the prize...much like a bonus given to you from your employer...deserving or not. In addition, I could not help but notice that after casting my vote, the graphic scale indicated 49% yes and 51% no; however, the graph lines seemed much more disproportionate for only a 2% spread with emphasis on “no”…

Posted by: jonbwnfd | October 9, 2009 9:40 AM


Why does the graph make the vote look so lopsided?

Posted by: subwayguy | October 9, 2009 9:48 AM


The graph is not lopsided .. the numbers are. Read the posts and you will see the posts are against the idea not for.

Posted by: T-Tom | October 9, 2009 9:53 AM 

Who designed the bar graph? The difference between the "yes" votes and the "no" votes is two percentage points, yet the "no" bar is three times as long as the "yes" bar.

In fact, "yes" is 49% and "no" is 51%: the graph is supposed to illustrate reality, not. . . well, just what DOES it illustrate?

Posted by: cmcintyr | October 9, 2009 10:02 AM


Inconsequential detail?  Big fail?  You decide.



A tale of four charts

Speaking of rules for making charts, I think the most important is "if at first you don't succeed, try, try and try again."  It's absolutely essential to produce multiple looks before settling on the one that helps tell the story.

While researching U.S. consumer credit this week, I came across these four views of presumably the same data set.

Credit_charts

The chart on the top left (via Business Insider) shows a downward sloping line, with a steep decline on the right edge of the chart.  The authors clearly wanted to show us that consumer credit in the U.S. was collapsing.  The bottom was falling out.  To aid in this effort, they chose to:

  • start the plot at the peak of the time series (2000);
  • set the minimum value of the vertical scale to coincide with the lowest value attained thus far (2009) -- we have reached rock bottom, ladies and gentlemen; and
  • pay no particular attention to the 0% line, which is placed in the lower half of the chart, rather than the middle, thus obscuring the fact that credit grew at faster rates during the height of the boom than it has been declining during the recession.

(Thanks to Excel's default setting, we are treated to a centipede effect on the horizontal axis.  Also thick lines and line shadows.)

Readers who previously complained about my willingness to draw a line through things that shouldn't be connected will be none too amused by the use of a line chart to plot year-on-year changes.  Thus, the zig-zagging of the line represented the change of change, which has been dubbed the "second derivative" during a period when optimists used technical wizardry to show us "green shoots".  To read this chart properly, one should focus on the actual annual decline (the dots) rather than the line.
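
The distinction between the change and the change of change is easier to see with numbers.  The levels below are invented for illustration and are not consumer credit data; `pct_change` and `diff` are hypothetical helper names.

```python
def pct_change(levels):
    """Year-on-year percent change for a list of annual levels."""
    return [100.0 * (curr - prev) / prev for prev, curr in zip(levels, levels[1:])]

def diff(series):
    """Difference between consecutive values (the slope of the plotted line)."""
    return [b - a for a, b in zip(series, series[1:])]

# Hypothetical annual levels: rising for three years, then a small drop.
levels = [100, 110, 115, 117, 115]

growth = pct_change(levels)  # what the dots on such a chart show
slope = diff(growth)         # what the falling line segments show

print([round(g, 1) for g in growth])
print([round(s, 1) for s in slope])
```

In this made-up series the line falls in every period, yet the level only declines in the last year; the earlier drops in the line are slower growth, not shrinkage.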

If one takes a longer-term view, as in the chart on the top right (via Rolfe Winkler), the recent drop in consumer credit can be put in perspective.  Since the 1960s, consumer credit has been positively exploding with few years of decline.  Note, again, that the falling line between 2000 and 2008 represented a slowing rate of growth, not a decline.  The question to ask is: after so many years of almost continuous growth, is the current correction such a big cause for alarm?

This situation of exploding growth followed by a slight correction is better visualized in the third chart (lower left, via chartingtheeconomy.com).  The author also answered a question asked by Rolfe, which is to compare the total consumer credit to the population, as he plotted the per-capita consumer credit.  An eyeball estimate tells us that consumer credit jumped by more than 800 percent since the 1970s, and the current retrenchment is a blip if we take a long view.  The explosion in consumer credit has no doubt enhanced wellbeing in the U.S. for decades; even though credit might well have been over-extended in the recent past, it is far from something evil.

The final chart (bottom right, via The Big Picture) should carry a health warning.  It really should not be shown by itself, or without comment.  Both charts on the right, in fact, came from the same presentation, by David Rosenberg.  This chart, not surprisingly, is the most dramatic -- the chart designer is going for the jugular.  The decline on the right side is much more exaggerated than in any of the other three.  The trick here is:

  • to plot the dollar value of credit change, rather than percentage value, thus ensuring that the further back the time, the more insignificant the change;
  • to connect dots which bring out the steep decline when in fact, the steep drop reflected the second derivative; and
  • to put the 0% line below the middle of the chart, which causes the bottom "half" of the chart to be smaller than the top "half", which plays with our perception so that we may not realize that there were multiple years of growth above the absolute level of decline recently experienced.
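
The first trick in the list above is worth a quick numerical check.  In this sketch the series is hypothetical and simply doubles every period, so the percentage change is constant by construction; the function names are made up for illustration.

```python
def dollar_changes(levels):
    """Change in absolute terms between consecutive periods."""
    return [curr - prev for prev, curr in zip(levels, levels[1:])]

def percent_changes(levels):
    """Change in percentage terms between consecutive periods."""
    return [100.0 * (curr - prev) / prev for prev, curr in zip(levels, levels[1:])]

# Hypothetical levels that double each period.
levels = [100, 200, 400, 800]

print(dollar_changes(levels))   # [100, 200, 400]
print(percent_changes(levels))  # [100.0, 100.0, 100.0]
```

Growth that is identical in percentage terms looks four times larger at the end of the series than at the start when plotted in dollars, so the early years shrink toward the axis.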

If this chart were to be believed, our focus should not be on the cliff-diving at the far right -- instead, the chart has the hallmark of a system getting completely out of control, and oscillating to oblivion.  One hopes that is not the message intended by its creator.

For anyone creating charts, it would be a great idea to have attempted all of these versions, and more.