How to read a graph

Via Gelman, here is a nifty book-buying map from Amazon, displaying the split between "red books" and "blue books" bought by Amazon users in each state in the months leading up to the 2004 and 2008 presidential elections.


Gelman noted the similarity between the Amazon map and the red-blue split of rich voters.

This post is about how to read a graph.  Here are some things that come to mind looking at the map:

  • Sampling bias: how does Amazon's customer base compare with the U.S. population, or rich voters?  It would be prudent to check this before making generalizations.  Gelman's point may be that Amazon customers behave like rich voters.
  • Sampling period: is the period long enough to capture the average inclination of the book buyers?  As is well known, book sales follow a long-tail distribution (Chris Anderson wrote an entire book based on this observation.)  Best-sellers have a disproportionate influence on average values.  If the time period is too short, the data may only represent the best-sellers.  Consider the following two maps in successive periods in 2004:



Much of the red in the first map was due to John O'Neill's "Unfit for Command", published in August 2004, and much of the blue in the second map was due to John Dean's "Worse Than Watergate", published in April 2004.  If only one of these two-month periods were used to draw conclusions, we would make big mistakes!

  • Classification: The long-tailed nature of book sales has wide-reaching implications for interpreting the data.  The most essential feature is that single books (best-sellers) have a disproportionate impact on average sales.  Since the key metric here is the proportion of red (or blue) books, it follows that whether a best-seller is classified as red or blue makes a huge difference.
Thus, one of the first things to look at is Amazon's helpful explanation of how they classified books as "red" or "blue".  We learn that they also have "purple" books, which are those they could not classify as either red or blue.  Each red or blue book is given equal weight, but it appears that purple books are not tallied.  Glancing at the list of purple books, I see some hugely important books, e.g. Ron Paul's "The Revolution: A Manifesto" (Amazon rank #56 among all books) and Tom Friedman's "Hot, Flat and Crowded" (#15).

If the purple books include best-sellers, then the decision to call a book purple rather than red or blue causes an influential book to be excluded from the calculation.  We often forget that the decision to exclude is not a neutral decision; it is an active decision that says the excluded data contains no useful information.
This is not to say that excluding those books is the wrong decision.  Rather, we must make such decisions with considerable care: excluding best-sellers when book sales follow a long-tailed distribution is not a step to be taken lightly.
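To make the point concrete, here is a minimal sketch in Python.  All sales figures are made up, and I am assuming the metric is weighted by units sold (Amazon's exact calculation is not described here).  It shows how much the red share moves depending on what is done with a single ambiguous best-seller:

```python
# Hypothetical unit sales for a long-tailed catalogue. All numbers are
# invented for illustration; they are not Amazon's data.
red_sales = [900, 700, 500, 300, 200]
blue_sales = [800, 600, 400, 350, 250]
bestseller_sales = 5000  # one ambiguous ("purple") best-seller


def red_share(red, blue):
    """Share of red units among all units classified as red or blue."""
    total_red = sum(red)
    total_blue = sum(blue)
    return total_red / (total_red + total_blue)


# Three possible treatments of the same best-seller.
share_if_excluded = red_share(red_sales, blue_sales)
share_if_red = red_share(red_sales + [bestseller_sales], blue_sales)
share_if_blue = red_share(red_sales, blue_sales + [bestseller_sales])

print(f"excluded (purple): {share_if_excluded:.0%} red")
print(f"classified red:    {share_if_red:.0%} red")
print(f"classified blue:   {share_if_blue:.0%} red")
```

With these invented numbers, the same state reads as 52% red, 76% red or 26% red depending solely on how one book is handled.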

  • Causality: Let's say we are sufficiently satisfied that we can make a statement about book buying habits and voting behavior.  Then we need to think about the direction of causality.  Is the map saying that red book buyers are likely to vote red?  Or that red voters are likely to buy red books?  No amount of staring at this data set will resolve the issue; other data would be needed to address it.

The more data goes into a graph, the harder it is to interpret.  But the pay-off for spending the time is all the sweeter.  Happy graph-reading!

One final note: there is no doubt that this interactive map feature is a brilliant marketing move by Amazon.  This is a great and fun way for readers to find interesting books.

Reference: "Amazon, U.S.A.", Gelman blog, Oct 5 2008.

Divided nation

Professor Gelman generally believes the red state, blue state paradigm is too simplistic to describe the American electorate.  He has been sharing some of his work on his blog, and has just published a book about this topic.  Recently he produced the following chart, which is gimmicky-looking but crystal clear in its message.


Here, economic and social ideology are plotted on a scatter chart, with positive values indicating conservatism and negative values liberalism.  Further, each state is represented twice on the chart, the red point for the Republicans and the blue for Democrats within the state.

This is a cluster analyst's dream data set.  The absolute separation of the Republican cluster and the Democrat cluster is astounding: imagine a diagonal line perfectly classifying all points.

We should not miss a host of details:

  • as Andrew pointed out, "the big thing we see from the graph ... is that Democrats are much more liberal than Republicans on the economic dimension: Democrats in the most conservative states are still much more liberal than Republicans in even the most liberal states."  This is clear from the wide gap on the horizontal axis.
  • there is a small degree of overlap on the social ideology axis so the nation is closer together on that front.
  • but wait a minute, the scale on the social axis is not the same as that on the economic axis.  This means that the extremes are more extreme on the social axis: the difference between MS and VT is roughly 0.8 on the social scale while the largest difference on the economic scale is roughly 0.5.  (here, I am assuming that the scales are comparable to each other)
  • there is high correlation between social and economic ideologies: the points are well-aligned along the 45-degree line
  • especially on social issues, the Democrats are divided within (the elongated shape of the blue cluster).

Reference: Gelman, "Ranking states by conservatism/liberalism of their voters", June 30 2008.

Two books

Nathan from FlowingData announces a competition to win Tufte's classic book on visual representation of data.  There are still a few days left to participate.  While Tufte's more recent books start getting repetitive, he has still published one of the most accessible books on this topic.

I also had the pleasure of reading Naomi Robbins' Creating More Effective Graphs.  She adopts a cookbook format providing hints on graphs in one, two and more dimensions, scales, visual clarity and so on.  Since she has already read Cleveland, Tufte, etc., she manages to put all that learning inside one cover.  The page design - with half of every page blank - is refreshingly easy on the eyes.  Inclusion of examples is generous.

Let's review her point of view on some of the topics we discuss frequently on Junk Charts:

Starting axis at zero: she thinks "all bar charts must include zero.  However, the answer is not as clear for line charts or other charts for which we judge positions along a common scale." (p.240)

Jittering: she does not provide a clear guideline but gives an example of a strip chart with jittered dots, commenting that "it gives a much better indication of the distributions than would a plot without jittering" (p.85), so I infer that she's generally in favor.

Parallel coordinates plot / profile plot: she provides an example of such a plot on p.141 and describes how to read such a plot.  Again, I infer she's in favor.

A dangerous equation

Graduation rates at 47 new small public high schools that have opened since 2002 are substantially higher than the citywide average, an indication that the Bloomberg administration’s decision to break up many large failing high schools has achieved some early success.

Most of the schools have made considerable advances over the low-performing large high schools they replaced. Eight schools out of the 47 small schools graduated more than 90 percent of their students.

This graphic, included in the NYT article, lent support to the "small schools movement".  In particular, note the last sentence of the above quotation: it incorporates the oft-used device of subgroup support of a hypothesis, in this case, the subgroup of eight top-performing schools.

Such analysis is "dangerous", according to Howard Wainer, who discusses this and other examples of misapplication in a recent article in American Scientist, entitled "The Most Dangerous Equation".  He alleged that billions have been wasted in the pursuit of small schools.

The issue concerns sample size.  Dr. Wainer and associates analyzed math scores from Pennsylvania public schools.  Average scores for smaller schools are based on a smaller number of students, and are therefore less stable (more variable).  More variability means more extremes.  Thus, by chance alone, we expect to find more smaller schools among the top performers.  Similarly, by chance alone, we also expect to find more smaller schools among the worst performers.

The scatter plot lays out their argument. Focusing only on the top performers (blue dots), one might conclude that smaller schools do better.  However, when the bottom performers (green) are also considered, the story no longer holds.  Indeed, the regression line is essentially flat, indicating that scores are not correlated with school size.
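The same effect can be reproduced with a small simulation, sketched here in Python.  This is not Dr. Wainer's data: I assume every student's score comes from one common distribution (mean 500, standard deviation 100, both made up), so school size has no real effect at all, and the school sizes of 50 and 1,000 students are likewise hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable


def school_mean(n_students):
    """Average score of a school; every student has the same distribution."""
    scores = [random.gauss(500, 100) for _ in range(n_students)]
    return statistics.fmean(scores)


# 1,000 small schools (50 students) and 1,000 large schools (1,000 students).
small = [school_mean(50) for _ in range(1000)]
large = [school_mean(1000) for _ in range(1000)]

# Rank all 2,000 schools by average score.
schools = [("small", m) for m in small] + [("large", m) for m in large]
schools.sort(key=lambda s: s[1])

bottom_100 = schools[:100]
top_100 = schools[-100:]
small_in_top = sum(1 for size, _ in top_100 if size == "small")
small_in_bottom = sum(1 for size, _ in bottom_100 if size == "small")

print("spread of small-school averages:", round(statistics.stdev(small), 1))
print("spread of large-school averages:", round(statistics.stdev(large), 1))
print("small schools among top 100:", small_in_top)
print("small schools among bottom 100:", small_in_bottom)
```

Even though size has no effect by construction, small schools crowd both the top and the bottom of the ranking, which is why looking only at top performers is misleading.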

This is all nicely explained via the standard error formula (De Moivre's equation) in Dr. Wainer's article.  Here is a NYT article from the mid 1990s describing this same phenomenon.
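For reference, the equation in its usual textbook form (the notation here is the standard one, not copied from the article) says that the standard error of an average shrinks with the square root of the sample size:

```latex
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
```

Here \sigma is the standard deviation of individual scores, n is the number of students, and \sigma_{\bar{x}} is the standard error of the school average.  As a worked example, a school of 25 students has \sqrt{25} = 5 in the denominator while a school of 2,500 students has \sqrt{2500} = 50, so the average of the smaller school is ten times as variable.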

File this as another comparability problem.  Because estimates based on smaller samples are less reliable, one must take extra care when comparing small samples to large samples.

Dr. Wainer is publishing a new book next year, called "The Second Watch: navigating the uncertain world".  I'm eagerly looking forward to it.  His previous books, Graphic Discovery and Visual Revelations, are both part of the Junk Charts collection.

Sources: "The Most Dangerous Equation", American Scientist, November 2007; "Small Schools Are Ahead in Graduation", New York Times, June 30 2007.

P.S. Referring back to the NYT chart above, one might wonder at the impossible feat of raising graduation rates across the board simply by breaking up large schools into smaller ones.  This topic was taken up here, here and here.  When evaluating the "small schools" policy, it is a mistake to discuss only the performance of small schools; any responsible analysis must look at improvement over all schools.  Otherwise, it's a simple matter of letting small schools skim off the cream from larger schools.


Super Crunchers

Here's something different, a mini book review of Ian Ayres's "Super Crunchers".  This book can be recommended to anyone interested in what statisticians and data analysts do for a living.  Ian is to be congratulated for making an abstruse subject lively.

His main thesis is that data analysis beats intuition and expertise in many decision-making processes; and therefore it is important for everyone to have a basic notion of the two powerful tools of regression and randomization.
He correctly points out that the ready availability of large amounts of data in recent times has empowered data analysts.

Regression is a statistical workhorse often used for prediction based on historical data.  Randomization refers to assigning subjects at random to multiple groups, and then examining if differential treatment by group leads to differential response.  (In particular, the chapter on randomization covers the topic well.)  Using regression to analyze data collected from randomized experiments allows one to establish cause-effect. 

In the following, I offer a second helping for those who have tasted Ian's first course:

  • Randomized experiments represent an ideal and are not typically possible, especially in social science settings.  (Think about assigning a group of patients at random to be "cigarette smokers".)  When these are not possible, regression uncovers only correlations, and does not say anything about causation.
  • Most large data sets amenable to "super crunching" (e.g. public records, web logs, sales transactions) are not collected from randomized experiments.
  • Regression is only one tool in the toolbox.  It is fair to say that most "data miners" prefer other techniques such as classification trees, cluster analysis, neural networks, support vector machines and association rules.  Regression has the strongest theoretical underpinning but some of the others are catching up.  (Ian did describe neural networks in a later chapter.  It must be said that many forms of neural networks have been shown to be equivalent to more sophisticated forms of regression.)
  • If used on large data sets with hundreds or thousands of predictors, regression must be used with great care, and regression weights (coefficients) interpreted with even more care.  The size of the data may even overwhelm the computation.  Particularly when the data was collected casually, as in most super crunching applications, the predictors may be highly correlated with each other, causing many problems.
  • One of the biggest challenges of data mining is to design new methods that can process huge amounts of data quickly, deal with much missing or irrelevant data, deal with new types of data such as text strings, uncover and correct for hidden biases, and produce accurate predictions consistently.
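The first two points can be illustrated with a short Python sketch.  Everything in it is simulated under my own assumptions: a hidden factor z drives both the predictor x and the outcome y, while x itself has no effect on y.  Regression on the observational data still finds a healthy slope; when x is assigned at random, the slope vanishes.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable
n = 5000


def ols_slope(x, y):
    """Least-squares slope of y on x (simple one-predictor regression)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# A hidden factor z that the analyst never observes.
z = [random.gauss(0, 1) for _ in range(n)]

# Observational data: z drives both x and y; x plays no role in y.
x_obs = [zi + random.gauss(0, 1) for zi in z]
y_obs = [2 * zi + random.gauss(0, 1) for zi in z]

# Randomized experiment: x is assigned independently of z.
x_rand = [random.gauss(0, 1) for _ in range(n)]
y_rand = [2 * zi + random.gauss(0, 1) for zi in z]

slope_obs = ols_slope(x_obs, y_obs)
slope_rand = ols_slope(x_rand, y_rand)

print(f"slope from observational data: {slope_obs:.2f}")
print(f"slope from randomized data:    {slope_rand:.2f}")
```

The observational slope comes out near 1 and the randomized slope near 0, although the true causal effect of x is zero in both cases.  The regression arithmetic is identical; only the way the data were collected differs.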

Lines of death

I've been reading my friend's anti-smoking tome, and traced this "infographic" back to its source (World Health Organization). 

I was very intrigued by the "lines of death", which seemed to make the point that the risk of death had a spatial correlation: specifically, that the death risk for male smokers was higher in the northern hemisphere (above the line), primarily developed countries, as compared to the southern hemisphere, mostly developing nations.

I find that somewhat counter-intuitive, but in a fascinating book like this, which brings together scientific, psychological and societal commentary, I was expecting to learn new things.

Looking at the legend, the red areas were regions in which deaths from tobacco use accounted for over 25% of "total deaths among men and women over 35".  This explained some of it, as perhaps there were more reasons to die (warfare, other diseases, mine accidents, etc.) in developing nations than in developed nations, or perhaps they had larger populations (so more deaths even at lower rates).

However, the description of the "lines of death" raised my eyebrows.  It is now claimed that more than 25% of middle-aged people (35-69 years old) die from tobacco use in the red regions.

Did they mean 25% of the dead middle-aged people die from smoking?  Or 25% of all middle-aged folks die from smoking?  A gigantic difference!

Percentages are very tricky things to use.  Every time I see a percentage, the first thing I ask is what is the base population.  Here, the baseline appeared to have gotten lost in translation.
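A quick back-of-the-envelope sketch in Python shows how far apart the two readings are.  The counts are purely hypothetical and chosen only to make the arithmetic easy; they are not WHO figures.

```python
# Hypothetical counts for one region over some period (NOT WHO data).
middle_aged_population = 1_000_000  # everyone aged 35-69
middle_aged_deaths = 10_000         # deaths from all causes in that group
tobacco_deaths = 2_500              # deaths attributed to tobacco

# Reading 1: base = middle-aged people who died.
share_of_deaths = tobacco_deaths / middle_aged_deaths

# Reading 2: base = all middle-aged people.
share_of_population = tobacco_deaths / middle_aged_population

print(f"share of deaths:     {share_of_deaths:.2%}")
print(f"share of population: {share_of_population:.2%}")
```

The same 2,500 deaths are 25% of one base and 0.25% of the other, a hundredfold difference that comes entirely from the choice of denominator.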

This set of maps also shows the peril of focusing too much on entertainment value, and losing the plot.

For those concerned about the effect of smoking on our society and our children, I highly recommend Dr. Rabinoff's highly readable new book, "Ending the tobacco holocaust".  It contains lots of interesting tidbits and really brings together every cogent argument that exists, including the common ones you've heard and others you haven't.

Reference: "Ending the tobacco holocaust" by Michael Rabinoff; The Tobacco Atlas by the World Health Organization



Behind the smokescreen lies the informative conclusion: among households with smokers, about 40% smoke in residence all the time while about half never smoke in residence.

This graphic, unfortunately chosen, contains many distractions from the main message, including:

  • the liberal sprinkling of colors
  • the inclusion of data for 1, 2, 3, 4, 5, 6 days, almost all of which were effectively zero
  • the redundant vertical scale, as all the data already appeared on the chart itself
  • the comparison of smokers to "total sample" (rather than non-smokers)

The last point merits special attention.  The total sample contains households with smokers as well as households without smokers. Any data from the total sample is a weighted average of these two types of households.  It is better to directly compare the two household types than to indirectly compare one type to the overall.

Further, households without smokers should be extremely likely to have no smoking in residence all week. 
And if most households have no smokers (76% of this sample), then the statistics of the total sample will mimic those of no-smoker households. That is to say, the total sample statistics do not add much to the analysis.  Our junkart version below corrects for this as well as other things.
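The weighted-average arithmetic is easy to sketch in Python.  The 76% share of no-smoker households and the roughly 50% of smoker households that never smoke in residence come from the survey as quoted above; the 98% figure for no-smoker households is my own assumption, used only for illustration.

```python
# Household mix in the sample (76% is the figure cited from the survey).
share_no_smoker_households = 0.76
share_smoker_households = 1 - share_no_smoker_households

# Proportion of households with no smoking in residence all week.
never_in_smoker_households = 0.50     # "about half", per the survey
never_in_no_smoker_households = 0.98  # ASSUMPTION, for illustration only

# The total-sample statistic is a weighted average of the two types.
never_in_total_sample = (
    share_no_smoker_households * never_in_no_smoker_households
    + share_smoker_households * never_in_smoker_households
)

print(f"smoker households:    {never_in_smoker_households:.1%}")
print(f"no-smoker households: {never_in_no_smoker_households:.1%}")
print(f"total sample:         {never_in_total_sample:.1%}")
```

Under these assumptions the total-sample figure is about 86%, much closer to the no-smoker households than to the smoker households, so comparing smokers to the total understates the gap between the two household types.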

One of the key functions of a graph is data reduction, i.e. to aggregate data in such a way as to expose the information contained within.  Typically, a graph that uses aggregated data is clearer and stronger than one that plots every piece of data.  In this example, by combining 1-6 days into a single category ("smokes in residence part of the week"), we have a graph that is much more readable.

I want to thank Dr. Mike Rabinoff for inspiring me to look up these second-hand smoking statistics.  Mike recently published a book called "Ending the Tobacco Holocaust", which tells you more than you want to know about the tobacco industry.

Reference: "Second Hand Smoke Survey: Final Report", Madison Department of Public Health, Dec 2003.

Statistical literacy

I finally got around to reading "When Genius Failed", Roger Lowenstein's account of the spectacular collapse of LTCM, the hedge fund fronted by Scholes and Merton, Nobel laureates both.

It is a sobering read for anyone in the business of statistical prediction and modeling for sure.

What also caught my eye, and caused dismay, is how Lowenstein got basic statistical principles wrong in the book.  He used the bully pulpit to sound the usual alarm against the normality assumption and for fat tails.  He began by confusing the law of large numbers (LLN) with the central limit theorem (CLT):

Statisticians have long been aware of the "law of large numbers".  Roughly speaking, if you have enough samples of a random event, they will tend to distribute in the familiar bell curve ...

In the same breath, he then equated two different probability distributions:

This is called the normal distribution, or in mathematical terms, the lognormal distribution.

Doesn't this say something about the state of statistical literacy?
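For readers who want to see the distinctions, here is a small Python simulation, a sketch under my own choice of distributions and sample sizes.  The LLN says the average of many draws settles near the true mean; it does not say the draws themselves become bell-shaped.  The CLT says that averages of many draws are approximately normal.  And a lognormal variable is the exponential of a normal one, so it is skewed rather than symmetric.

```python
import math
import random

random.seed(2)  # fixed seed so the sketch is repeatable


def skewness(values):
    """Sample skewness: close to 0 for a symmetric bell curve."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sd) ** 3 for v in values) / n


# Raw draws from a skewed distribution (exponential with mean 1).
draws = [random.expovariate(1.0) for _ in range(200_000)]

# LLN: the average of many draws settles near the true mean of 1.
overall_mean = sum(draws) / len(draws)

# More draws do not make the draws bell-shaped; they stay skewed.
raw_skew = skewness(draws)

# CLT: averages of batches of 200 draws are roughly bell-shaped.
batch_means = [sum(draws[i:i + 200]) / 200 for i in range(0, len(draws), 200)]
mean_skew = skewness(batch_means)

# Normal vs lognormal: exp() of a normal variable is lognormal.
normal = [random.gauss(0, 0.5) for _ in range(100_000)]
lognormal = [math.exp(v) for v in normal]

print(f"average of all draws:     {overall_mean:.3f}")
print(f"skewness of raw draws:    {raw_skew:.2f}")
print(f"skewness of batch means:  {mean_skew:.2f}")
print(f"skewness of normal:       {skewness(normal):.2f}")
print(f"skewness of lognormal:    {skewness(lognormal):.2f}")
```

The raw draws remain strongly skewed no matter how many are collected, while the batch means are far closer to symmetric; likewise the normal sample is symmetric and the lognormal sample is not.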

PS. Here is a link to Dunbar's "Inventing Money" (thanks Marc).  It apparently came out before Lowenstein but didn't get as much press. 

Review: Curve Ball 3

Just want to highlight one more graphic from Curve Ball, one which I consider the most innovative, highly effective and powerful.  Without much ado:
This is one of those charts that paints a vivid story.  Any fan can mentally re-trace the baseball game by reading this chart, without having seen the game itself.  The horizontal axis traces the 9 innings of a baseball game while the vertical axis plots the probability of Toronto winning the game.  This probability is updated over the course of the game as we read from left to right.  (For those asking, this plots Game 6 of the 1993 World Series.)

To quote the authors:

" We see that Toronto's probability of winning rose from the start as they prevented the Phillies from scoring in the first inning.  This trend continued as the [Toronto] Blue Jays scored three times in the first inning... The low point in the [fifth] inning for Toronto occurred just after [Phillie] John Kruk walked to load the bases.  [Phillie Dave] Hollin's big out is shown by the rise ... in Toronto's victory probability from this low point ... in the seventh the Phillies turned the tables ... scoring five runs to take the lead.  The plot of Toronto's probability of winning looks like the Dow Jones Industrial Average in free-fall.  Toronto did not score in its half of the seventh, pushing its probability of winning even further down. ... the plot rises (and the plot thickens) in the eighth inning as a result of a threat with bases loaded and two outs.  In the ninth inning, the Phillies went down quickly.  Toronto came out storming, ... quickly putting runners on base.  The triumphant ... impact of [Toronto's Joe] Carter's home run is evident in the steep rise in the final markings of the plot."

This chart belongs to the same class as the Bumps chart.  In a previous post, I traced how one can re-imagine the Bumps race just by tracing the plot from left to right.

Reference: "Curve Ball", Albert & Bennett.

Narratives that create questions

Narrative charts, like this one shown on the right, are particularly difficult to master.  The temptation is strong on the part of the designer to mislead by inclusion/exclusion and on the part of the reader to misjudge by reading between the lines.

The accompanying article made the point that James Frey's fraudulent memoir has experienced a severe drop in sales since Oprah "sacrificed" him.  Unfortunately, what the chart shows is how choppy sales have been for this book.  It experienced a similar drop after his first appearance on Oprah.

In fact, this chart raised a host of unanswered questions:

  • How to explain the end-of-year rise in sales back to post-first-Oprah-appearance level?
  • What is the purpose of the red line?  Is it fair to compare a hardcover with a paperback?
  • How to explain the "turbulence" of Frey's memoir sales even before the scandal broke?  Why is it that his other book has a much smoother sales trend?  (Is it merely a scale effect hiding its ups and downs?)
  • Oprah's influence appeared to be highly specific to the recommended book and time of the show.  There seemed to be zero impact on the sales of other books by the same author (scale effect?) and a rapidly diminishing impact even for the highlighted book.  Is this a general phenomenon?

This type of time-series chart does not provide direct evidence of cause and effect; however, it is commonly used in the media for that dubious purpose.  At best, we can conclude that those factors are correlated, prompting hypotheses and further analyses.

Reference: "James Frey's Falsehoods Improved His Tale", New York Times, Feb 1 2006.