
How to read a graph

Via Gelman, here is a nifty book-buying map from Amazon, displaying the split between "red books" and "blue books" bought by Amazon users in each state in the months leading up to the 2004 and 2008 presidential elections.

[Figure: Amazon red-blue book-buying map, last 60 days]

Gelman noted the similarity between the Amazon map and the red-blue split of rich voters.


This post is about how to read a graph.  Here are some things that come to mind looking at the map:

  • Sampling bias: how does Amazon's customer base compare with the U.S. population, or rich voters?  It would be prudent to check this before making generalizations.  Gelman's point may be that Amazon customers behave like rich voters.
  • Sampling period: is the period long enough to capture the average inclination of the book buyers?  As is well known, book sales follow a long-tail distribution (Chris Anderson wrote an entire book based on this observation), so best-sellers have a disproportionate influence on average values.  If the time period is too short, the data may only represent the best-sellers of that moment.  Consider the following two maps from different two-month periods in 2004:

[Figure: Amazon map, August 2004 ("Unfit for Command")]

[Figure: Amazon map, April 2004 ("Worse Than Watergate")]

Much of the red in the first map was due to John O'Neill's "Unfit for Command", published in August 2004, and much of the blue in the second map was due to John Dean's "Worse Than Watergate", published in April 2004.  If we drew conclusions from either of these two-month periods alone, we would make big mistakes!

  • Classification: The long-tailed nature of book sales has wide-reaching implications for interpreting the data.  The most essential feature is that single books (best-sellers) have a disproportionate impact on average sales.  Since the key metric here is the proportion of red (or blue) books, it follows that whether a best-seller is classified as red or blue makes a huge difference.
Thus, one of the first things to look at is Amazon's helpful explanation of how they classified books as "red" or "blue".  We learn that they also have "purple" books, which are those they could not classify as either red or blue.  Each red or blue book is given equal weight, but it appears that purple books are not tallied.  Glancing at the list of purple books, I see some hugely important titles, e.g. Ron Paul's "The Revolution: A Manifesto" (Amazon rank #56 among all books) and Tom Friedman's "Hot, Flat and Crowded" (#15).

If the purple books include best-sellers, then the decision to call a book purple rather than red or blue causes an influential book to be excluded from the calculation.  We often forget that the decision to exclude is not a neutral decision; it is an active decision that says the excluded data contain no useful information.
 
This is not to say that excluding those books is the wrong decision.  It is to say that such decisions must be made with considerable care: excluding best-sellers when book sales have a long-tailed distribution must not be taken lightly.

  • Causality: Let's say we are sufficiently satisfied that we can make a statement about book-buying habits and voting behavior.  Then we need to think about the direction of causality.  Is the map saying that red book buyers are likely to vote red?  Or that red voters are likely to buy red books?  No amount of staring at this data set will resolve the issue; other data would be needed to address it.
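To make the sampling-period and classification points above concrete, here is a minimal sketch in Python.  All the sales figures are made up, and the red share is assumed to be computed over units sold; Amazon's actual formula is not known here.  The sketch shows how the label given to a single best-seller (red, blue, or purple and therefore excluded) moves the red share of a state whose long tail of small titles is evenly split.

```python
# Hypothetical illustration: every number below is invented.
# Assumption: the "red share" is red units / (red + blue units),
# and purple books are simply left out of the tally.

def red_share(sales):
    """Return the red share given a list of (color, units) pairs."""
    red = sum(units for color, units in sales if color == "red")
    blue = sum(units for color, units in sales if color == "blue")
    return red / (red + blue)

# Long tail: 100 small titles, evenly split between red and blue.
tail = [("red", 10)] * 50 + [("blue", 10)] * 50

# One best-seller that sells as much as the entire tail combined.
BESTSELLER_UNITS = 1000

def share_with_bestseller(color):
    """Red share when the best-seller is labeled with the given color."""
    return red_share(tail + [(color, BESTSELLER_UNITS)])

share_excluded = share_with_bestseller("purple")  # best-seller not tallied
share_red = share_with_bestseller("red")
share_blue = share_with_bestseller("blue")

print(f"best-seller purple (excluded): {share_excluded:.2f}")
print(f"best-seller red:               {share_red:.2f}")
print(f"best-seller blue:              {share_blue:.2f}")
```

Under these made-up numbers, the same state swings from one-quarter red to three-quarters red depending on one classification decision, and calling the book purple quietly pins it at an even split.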

The more data goes into a graph, the harder it is to interpret.  But the pay-off for spending the time is all the sweeter.  Happy graph-reading!


One final note: there is no doubt that this interactive map feature is a brilliant marketing move by Amazon.  This is a great and fun way for readers to find interesting books.


Reference: "Amazon, U.S.A.", Gelman blog, Oct 5 2008.

Comments

TH

It is a sad world where we only choose to talk to those we agree with, and only read books that confirm what we already know, how we already think. This polarises the field and only fuels misunderstandings and false representations of each others' ideas. We start seeing differences where none exist.

This graph seems to assume this is already the case. Although many of the books tracked here might be "election books" that only serve narrow purposes, I don't think this assumption is very well founded.

Jon Peltier

Another skew in the data is the mismatch between area of a state on the map and population of that state. My home state of Rhode Island barely appears on the map, but its population outnumbers several states with much greater area: Montana, South Dakota, North Dakota, Alaska, and Wyoming, plus Vermont and Delaware, which are not geographically large.

G Horse

Thanks for sharing the graphing information. Very interesting topic.

Tony

That's some very useful information...

"It is a sad world where we only choose to talk to those we agree with, and only read books that confirm what we already know, how we already think"

And that is very very true lol
