The matter of bad choice

Right on the heels of the disastrous bubble chart comes another, courtesy of the NYT Magazine.  Bubble charts are okay for the conceptual ("this is really big, and that is really tiny").  This chart wants readers to compare the sizes of the bubbles, which highlights the worst part of such graphs.

Poor scaling is the big issue with bubble charts.  They are the prototype of charts that fail to be what I call "self-sufficient".  Without printing all the data, the chart is unscaled, and thus useless (see below middle).  When all the data is printed (as in the original, below left), it is no better than a data table.


In the above right chart, we simulated the situation of a bar or column chart, i.e. we provide a scale.  For this chart, the convenient "tick marks" are at 10, 20, 34, 41.  Unfortunately, this scaled version also fails to amuse.
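To see why comparing bubble sizes is hard, here is a minimal Python sketch.  It assumes the bubbles are drawn with area proportional to the data value, which is a common convention but is not confirmed for the NYT chart.  The function name `bubble_radius` and the maximum radius of 1.0 are made up for illustration.

```python
import math


def bubble_radius(value, max_value, max_radius):
    """Radius of a bubble whose AREA is proportional to value.

    Assumption: area scaling. The largest value gets max_radius.
    """
    if value < 0 or max_value <= 0:
        raise ValueError("value must be non-negative and max_value positive")
    return max_radius * math.sqrt(value / max_value)


# The four "tick mark" values mentioned in the post
tick_values = [10, 20, 34, 41]

for v in tick_values:
    r = bubble_radius(v, max(tick_values), 1.0)
    print(f"value {v:>2}: radius {r:.3f}")
```

Under this assumption, the largest value (41) is about four times the smallest (10), yet its radius is only about twice as long.  That is one reason readers struggle to judge bubbles without a printed scale.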

Note further that the data should have been presented in two sections: the party affiliation analysis and the gender analysis.  Also, it is customary to place "Independents" between "Republicans" and "Democrats" because they are middle-of-the-road.

A profile chart is an attractive way to show this data.  Here, we quickly learn a couple of things obscured in the bubble chart.

On the issue of abortion, Independents are much closer to Democrats than to Republicans.  Also, there is barely any difference between the genders; the only difference is the strength of support among those who want to legalize.

Reference: "A matter of Choice", New York Times Magazine, Oct 19 2008.

PS. Based on RichmondTom's suggestion, here are the cumulative profile charts.


Bernard L. suggested a "tornado" chart:

[Figure: "A matter of choice" tornado chart]

Mind the gap

When comparing two time series, one typically wants to discuss the size of the gap as it changes over time.  This Business Week chart, for example, depicted for readers the expanding gap between intra-day high and low prices of the S&P 500 for 2008.

This chart construct is effective at pointing out large changes but lacks precision in conveying smaller differences, or trends.  It is always a good idea to plot the gap directly, as we will show below.

More importantly, a better choice of scale can help a lot.  By focusing exclusively on variability (extreme values), this chart hides relevant information: the level of the S&P itself, as given by its closing prices.  A point spread of 100 points means more when the index is at 800 than at 1200.  To capture this, we can divide the point spread by the opening price of that day, so that the gap is expressed as, say, one-eighth or one-twelfth of the opening price.
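The arithmetic is simple enough to sketch in Python.  The function name `relative_spread` and the price figures are hypothetical, chosen only to mirror the 800 versus 1200 comparison; they are not actual S&P 500 quotes.

```python
def relative_spread(high, low, open_price):
    """High-low point spread expressed as a fraction of the day's opening price."""
    if open_price <= 0:
        raise ValueError("open_price must be positive")
    return (high - low) / open_price


# Two made-up days with the same 100-point spread at different index levels
low_index_day = relative_spread(high=850.0, low=750.0, open_price=800.0)
high_index_day = relative_spread(high=1250.0, low=1150.0, open_price=1200.0)

print(f"index near 800:  {low_index_day:.1%} of the open")
print(f"index near 1200: {high_index_day:.1%} of the open")
```

The same 100-point swing is one-eighth of the open in the first case and one-twelfth in the second, which is the distinction the raw point spread hides.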

The junkart version makes both changes.  The top chart fixes the scale, plotting the point spread as a percentage of daily opening prices.  Relative to the original chart, the variability in the front part of 2008 was muted because the index was at higher levels back then. 

The bottom chart plots the gap sizes (lengths of the high-low lines).  Without doubt, plotting the gaps directly showcases the key message.  The current level of volatility is more than double what occurred at the beginning of the year.

If one wants to illuminate the trend as opposed to daily fluctuations, a further improvement would be to use moving averages.
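A trailing moving average can be sketched as follows.  This is a minimal illustration, not the method behind any chart in this post; the function name `moving_average`, the window length and the sample spreads are all assumptions.

```python
def moving_average(values, window):
    """Trailing simple moving average.

    Returns one average per full window, so the result has
    len(values) - window + 1 entries (empty if there are too few values).
    """
    if window <= 0:
        raise ValueError("window must be a positive integer")
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just slid out of the window
            running_sum -= values[i - window]
        if i >= window - 1:
            averages.append(running_sum / window)
    return averages


# Hypothetical daily spreads (as a percentage of the open), not real data
daily_spread_pct = [1.0, 2.0, 3.0, 4.0, 5.0]
print(moving_average(daily_spread_pct, window=3))
```

A longer window smooths more of the daily noise but reacts more slowly to a genuine shift in volatility, so the choice of window is a judgment call.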

For those interested, shown below is a scatter plot that compares the original point spread and the derived point spread, which shows that the change is not trivial.


Reference: "The Market: A Daily Roller Coaster", Business Week, Oct 27 2008.

How to read a graph

Via Gelman, here is a nifty book-buying map from Amazon, displaying the split between "red books" and "blue books" bought by Amazon users in each state in the months leading up to the 2004 and 2008 presidential elections.


Gelman noted the similarity between the Amazon map and the red-blue split of rich voters.

This post is about how to read a graph.  Here are some things that come to mind looking at the map:

  • Sampling bias: how does Amazon's customer base compare with the U.S. population, or rich voters?  It would be prudent to check this before making generalizations.  Gelman's point may be that Amazon customers behave like rich voters.
  • Sampling period: is the period long enough to capture the average inclination of the book buyers?  As is well known, book sales follow a long-tail distribution (Chris Anderson wrote an entire book based on this observation.)  Best-sellers have a disproportionate influence on average values.  If the time period is too short, the data may only represent the best-sellers.  Consider the following two maps in successive periods in 2004:



Much of the red in the first map was due to John O'Neill's "Unfit for Command", published in August 2004, and much of the blue in the second map was due to John Dean's "Worse Than Watergate", published in April 2004.  If either of these two-month periods were used to draw conclusions, we would make big mistakes!

  • Classification: The long-tailed nature of book sales has wide-reaching implications for interpreting the data.  The most essential feature is that single books (best-sellers) have a disproportionate impact on average sales.  Since the key metric here is the proportion of red (or blue) books, it follows that whether a best-seller is classified as red or blue makes a huge difference.
Thus, one of the first things to look at is Amazon's helpful explanation of how they classified books as "red" or "blue".  We learn that they also have "purple" books, which are those they could not classify as either red or blue.  Each red or blue book is given equal weight but it appears that purple books are not tallied.  Glancing at the list of purple books, I see some hugely important books, e.g. Ron Paul's "The Revolution: A Manifesto" (Amazon rank #56 among all books) and Tom Friedman's "Hot, Flat and Crowded" (#15).

If the purple books include best-sellers, then the decision to call a best-seller purple rather than red or blue excludes an influential book from the calculation.  We often forget that the decision to exclude is not a neutral decision; it is an active decision that says the excluded data contains no useful information.
This is not to say that excluding those books is the wrong decision.  But we must make these decisions with considerable care, and excluding best-sellers when book sales have a long-tailed distribution must not be taken lightly.

  • Causality: Let's say we are sufficiently satisfied that we can make a statement about book buying habits and voting behavior.  Then we need to think about the direction of causality.  Is the map saying that red book buyers are likely to vote red?  Or that red voters are likely to buy red books?  No prolonged staring at this data set will resolve this issue as other data would be needed to address it.
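To make the classification point above concrete, here is a small Python sketch.  It assumes the metric is the red share of units sold among red and blue titles, with purple titles left out; this may not match Amazon's exact calculation.  The titles, sales figures and the function name `red_share` are all hypothetical.

```python
def red_share(units_sold, labels):
    """Red units as a fraction of red plus blue units; purple titles are skipped."""
    red = sum(n for title, n in units_sold.items() if labels[title] == "red")
    blue = sum(n for title, n in units_sold.items() if labels[title] == "blue")
    if red + blue == 0:
        raise ValueError("no red or blue units to tally")
    return red / (red + blue)


# Made-up sales for one state: one best-seller dwarfs the other titles
units_sold = {"Best-seller": 900, "Small red title": 50, "Small blue title": 50}

# Try each possible verdict on the best-seller and see how the share moves
for verdict in ("purple", "red", "blue"):
    labels = {
        "Best-seller": verdict,
        "Small red title": "red",
        "Small blue title": "blue",
    }
    print(f"best-seller called {verdict}: red share = {red_share(units_sold, labels):.0%}")
```

With these made-up numbers, the state looks evenly split when the best-seller is called purple, but swings heavily to one side or the other once it is assigned a color.  A single classification decision drives the result.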

The more data is used to create a graph, the harder it is to interpret.  But the pay-off for spending the time is all the sweeter.  Happy graph-reading!

One final note: there is no doubt that this interactive map feature is a brilliant marketing move by Amazon.  This is a great and fun way for readers to find interesting books.

Reference: "Amazon, U.S.A.", Gelman blog, Oct 5 2008.

Break it down, build it up

Thought of the day:

While commuting today, I wondered why we use the term "data analysis" or "data analyst".  I recalled that in chemistry class, we learnt that analysis means breaking things down while synthesis means building things up.

With regard to data, typically we try to collect data at the most detailed level and we build up messages and stories from the little pieces.  We don't break things down.  We can't break things down, in fact, if the data come to us in aggregated form.  (Think ecological fallacy.)

So why don't we say "data synthesis" rather than "data analysis"?

From bad to worse

Pie charts can range from bad to worse.  Brent L. pointed us to a few on the right end of that spectrum.


Brent wrote: "The background image makes it almost totally unreadable.  And what does the forest scene have to do with programming?  *sigh*"

That's not to mention the oval rather than circle, the dizzying array of colors, the Excel-style legend that inverts the order of importance ("Other" at the top), etc. etc.

Again, a column chart would have been much clearer.  Since the total number of famous programmers is arbitrary, a chart of counts would work at least as well as one that plots proportions.

More here.

Reference: "Famous programmers from Adleman to Zimmerman", grokcode.

Vanishing act

This is a well-executed chart showing the complex dealings between Wall Street firms in the last 40 years.


They found a way to present all the information without criss-crossing lines.  The right column is the clincher.  It lists all the important recent events.

Reference: "Wall Street: RIP", New York Times, Sep 28 2008.