Via Gelman, here is a nifty book-buying map from Amazon, displaying the split between "red books" and "blue books" bought by Amazon users in each state in the months leading up to the 2004 and 2008 presidential elections.
Gelman noted the similarity between the Amazon map and the red-blue split of rich voters.
This post is about how to read a graph. Here are some things that come to mind looking at the map:
- Sampling bias: how does Amazon's customer base compare with the U.S. population, or rich voters? It would be prudent to check this before making generalizations. Gelman's point may be that Amazon customers behave like rich voters.
- Sampling period: is the period long enough to capture the average inclination of the book buyers? As is well known, book sales follow a long-tail distribution (Chris Anderson wrote an entire book based on this observation). Best-sellers have a disproportionate influence on average values. If the time period is too short, the data may represent only the best-sellers. Consider the following two maps in successive periods in 2004:
- Classification: The long-tailed nature of book sales has wide-reaching implications for interpreting the data. The essential feature is that single books (best-sellers) have a disproportionate impact on average sales. Since the key metric here is the proportion of red (or blue) books, it follows that whether a best-seller is classified as red or blue makes a huge difference.
If the purple books include best-sellers, then the decision to call a best-seller purple rather than red or blue causes an influential book to be excluded from the calculation. We often forget that the decision to exclude is not a neutral decision; it is an active decision that says the excluded data contain no useful information.
This is not to say that excluding those books is the wrong decision. But such decisions call for considerable care: when book sales follow a long-tailed distribution, excluding best-sellers must not be taken lightly.
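To get a feel for how much one classification call can matter, here is a rough sketch in Python. Everything in it is made up for illustration: the sales figures (the book at rank r sells about 1000/r copies), the alternating red/blue labels, and the helper names `red_share` and `relabel` are assumptions, not Amazon's data or method. It compares the red share when the top seller is counted as red, counted as blue, or called purple and dropped, and then does the same relabeling on a book deep in the tail.

```python
def red_share(sales, labels):
    """Share of red copies among copies labeled red or blue.

    Books labeled "purple" are left out of both numerator and denominator.
    """
    red = sum(s for s, lab in zip(sales, labels) if lab == "red")
    blue = sum(s for s, lab in zip(sales, labels) if lab == "blue")
    return red / (red + blue)


def relabel(labels, index, new_label):
    """Return a copy of labels with one book reclassified."""
    changed = list(labels)
    changed[index] = new_label
    return changed


# Hypothetical long-tailed sales: the book at rank r sells 1000 // r copies.
ranks = range(1, 21)
sales = [1000 // r for r in ranks]

# Hypothetical labels: odd ranks are red, even ranks are blue.
labels = ["red" if r % 2 == 1 else "blue" for r in ranks]

baseline = red_share(sales, labels)

# Reclassify the best-seller (rank 1, index 0).
top_as_blue = red_share(sales, relabel(labels, 0, "blue"))
top_as_purple = red_share(sales, relabel(labels, 0, "purple"))

# Reclassify the last book in the tail (rank 20, index 19) for contrast.
tail_as_red = red_share(sales, relabel(labels, 19, "red"))

print(f"baseline red share:        {baseline:.3f}")
print(f"top seller called blue:    {top_as_blue:.3f}")
print(f"top seller called purple:  {top_as_purple:.3f}")
print(f"tail book called red:      {tail_as_red:.3f}")
```

In this toy setup, the label on the single top seller decides whether the "state" looks red or blue, while flipping a book in the tail barely moves the share. The exact numbers mean nothing; the asymmetry is the point.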
- Causality: Let's say we are sufficiently satisfied that we can make a statement about book-buying habits and voting behavior. Then we need to think about the direction of causality. Is the map saying that red book buyers are likely to vote red? Or that red voters are likely to buy red books? No amount of staring at this data set will resolve this issue; other data would be needed to address it.
The more data that goes into a graph, the harder it is to interpret. But the pay-off for spending the time is all the sweeter. Happy graph-reading!
Reference: "Amazon, U.S.A.", Gelman blog, Oct 5 2008.