
Infographing the cost of iPad

The cost of the iPad gets the infographics treatment here.

I feel a little weird about featuring this item.  Helen E., who created the chart/poster, urged me to write about it. The link seems to connect to a commercial site but doesn't look too commercial -- and since the iPad fever is upon us, I thought why not.

There are two elements on this poster that qualify it as data graphics.

  • The surreptitious blue bars, sized to match the inflation-adjusted prices... except for the Apple Lisa, whose bar was chopped prematurely.
  • The image-equation gimmick at the bottom.


The 43 iPads = 1 Lisa visualization is definitely effective. I'm not so sure if anyone should care about this particular comparison though.

The blue bars are a super-light touch on a chart that could otherwise be quite boring. The varying heights of the bars exaggerate the larger prices, and can cause confusion. For example, Apple III, costing $11,412.88, is a midget next to Macintosh Portable, costing $11,358.59.

The choice of which products get images and bolded text appears somewhat arbitrary... are those keystone products? Nice use of foreground/background though.


I sent a note back to Helen about the (ab)use of decimals, indicating that dropping decimals and rounding off to the nearest $10 would improve readability.

She replied, saying "moving forward, they will be a little cleaner without such a paranoid focus on accuracy".

She explained further:

We were really concerned with accuracy as we knew that the Apple fan base would be tough on us if our calculations were even a little out.

So legendary Apple fanboys, take your decimals with you on your way out!
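The rounding suggested to Helen takes one line of arithmetic. Here is a minimal sketch in Python, using the two prices quoted earlier in this post; the helper name `round_to_nearest` is made up for illustration.

```python
def round_to_nearest(value, step=10):
    """Round a dollar amount to the nearest multiple of `step`."""
    # Note: Python's round() sends exact .5 ties to the nearest even number.
    return int(round(value / step) * step)

# Prices as printed on the poster (inflation-adjusted).
prices = {"Apple III": 11412.88, "Macintosh Portable": 11358.59}

for name, price in prices.items():
    print(f"{name}: ${round_to_nearest(price):,}")
```

With the decimals gone, the two prices read as $11,410 and $11,360, and the eye can compare them at a glance.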

Signed-book Contest Results

Thank you for the terrific response to the book contest.  There were almost 900 entries, and the five lucky winners are: Dan Robertson, James Fiedler, Geoff Urland, Pawel G, and Robert Fowler.  Congratulations.

My heartfelt thanks to all of you for visiting the blog and joining the contest. Amazon recognized your enthusiasm by connecting a search for the title of my book to the "related search: Andrew Young".  Andrew Young was one of the choices in question 5 of the quiz.  (For those, perhaps overseas, who are not aware, Young is the former aide to former Presidential candidate John Edwards; he and his wife claimed to have private videotapes belonging to Edwards' mistress). This is a testament also to Amazon's data processing prowess.


Early reviews on Amazon and Twitter of the book have been fantastic, and I'm confident getting my book won't be a waste of your money or time.

Signed copies are available from McNally Jackson bookstore in New York City. They also ship books within the U.S. Their number is (212) 274-1160.


Using the tools on Survey Monkey and JMP software, I delved into the data a little. Below is the proportion of entries picking each answer (in each case, the "crowd" picked the right answer):


Survey Monkey's results charts are nice and clean. Sensibly, they fixed the width of the bar charts regardless of the maximum data point. Thank you for no pie charts.

The following chart shows the distribution of eligible entries over time. (Eligible entries are those with 5 correct answers.) Of the 5 winners, 3 came from March 15, and 2 from March 9, which were the two days receiving the most entries.



I used the point-and-click interface of JMP software to find the eligible entries. First, I created histograms similar to those in Survey Monkey shown above. JMP allows users to click on the bars to select (and deselect) data, so through a combination of clicks, I was able to isolate the rows (entries) that had all 5 correct answers. Then, using the Subset function, I generated a new dataset containing only the eligible entries, from which I used the Random Sampling function to draw the 5 winners. If you know what you're doing, it takes only 2 minutes.
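For readers without JMP, the same select, subset and sample steps can be sketched in a few lines of Python. This is an illustration only, not the procedure used for the contest: the field names, the answer key and the toy entries are all hypothetical.

```python
import random

def draw_winners(entries, answer_key, n_winners=5, seed=None):
    """Return (eligible entries, randomly drawn winners)."""
    # Subset step: keep only entries with every answer correct.
    eligible = [e for e in entries if e["answers"] == answer_key]
    # Random sampling step: draw without replacement from the eligible rows.
    rng = random.Random(seed)
    winners = rng.sample(eligible, min(n_winners, len(eligible)))
    return eligible, winners

# Hypothetical answer key and entries, for illustration only.
answer_key = ["b", "a", "d", "c", "a"]
entries = [
    {"name": "entry-1", "answers": ["b", "a", "d", "c", "a"]},
    {"name": "entry-2", "answers": ["b", "a", "d", "c", "b"]},  # one wrong
    {"name": "entry-3", "answers": ["b", "a", "d", "c", "a"]},
    {"name": "entry-4", "answers": ["a", "a", "d", "c", "a"]},  # one wrong
    {"name": "entry-5", "answers": ["b", "a", "d", "c", "a"]},
]

eligible, winners = draw_winners(entries, answer_key, n_winners=2, seed=2010)
print(len(eligible), "eligible;", [w["name"] for w in winners])
```

Fixing the seed makes the draw reproducible, which helps if anyone asks how the winners were picked.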

Of course, the 2 minutes did not include the time spent exploring and cleaning the data. Some of the little things I found included:

  • There were 9 IP addresses that appeared twice. These people either came back to correct an earlier mistake, or shared a computer with others, or just wanted to double their chances.
  • There were 7 entries with no email addresses, and I'm not sure why because there would be no way to notify them if they won.
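Checks like these two are easy to script. Here is a rough sketch in Python, assuming each entry is a record with hypothetical `id`, `ip` and `email` fields; the sample rows are invented and use reserved documentation addresses.

```python
from collections import Counter

def cleaning_report(entries):
    """Return (IP addresses seen more than once, ids of entries lacking an email)."""
    ip_counts = Counter(e["ip"] for e in entries)
    repeated_ips = sorted(ip for ip, n in ip_counts.items() if n > 1)
    # Treat None, empty strings and whitespace-only strings as missing.
    no_email = [e["id"] for e in entries if not (e.get("email") or "").strip()]
    return repeated_ips, no_email

# Invented rows, for illustration only.
entries = [
    {"id": 1, "ip": "192.0.2.10", "email": "a@example.com"},
    {"id": 2, "ip": "192.0.2.11", "email": ""},
    {"id": 3, "ip": "192.0.2.10", "email": "b@example.com"},
    {"id": 4, "ip": "192.0.2.12", "email": None},
]

repeated_ips, no_email = cleaning_report(entries)
print("Repeated IPs:", repeated_ips)
print("Entries without email:", no_email)
```

Whether a repeated IP address should disqualify an entry is a judgment call; the script only flags the rows for a human to look at.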

The gulf remains wide

The gulf between infographics and statistical graphics, that is.

Stan at Mashable praised "5 Amazing Infographics for the Health Conscious". They belong to the class of "pretty things" that are touted all over the Web but from a statistical graphics perspective, they are dull.

Reader Mike L. poked me about the snake oil chart (right) while I was writing up this post. The snake oil chart is by David McCandless whose Twitter chart I liked quite a bit.

This one, not very much.

If the location and cluster membership of the substances depicted had some meaning, I might even feel ok about the effervescence. But I don't think they do.

I continue to love his pithy text labels though; the "worth it line", truly.

The data (if verified) is pretty useful though since there are so many health supplements out there, and as a consumer, it's impossible to know which ones are sham.  (Ben Goldacre's site may help.)


Now, let's run through the lowlights of the rest:

I'm still trying to figure out what plus-minus means in the Dirty Water graphic.

The fact that the four buildings are not considered one complete unit also trips me up. The Truckee Meadows is depicted as 7 buildings, not divisible by 4. In addition, if 2 short buildings + 1 tall + 1 medium = 200,000 people, how many people live in 2 tall + 1 medium + 4 short buildings?


The obesity charts are pinatas.

The cost of health care chart is boring, just a prettied up data table. Why are life expectancy statistics expressed in 2 decimal places, and not in years and months?


Why 78.11 years and not 78 years (or 78 years, 1 month)?
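The conversion from decimal years to years and months is simple arithmetic. Here is a small Python sketch, with a made-up helper name, applied to the 78.11 figure above.

```python
def years_and_months(decimal_years):
    """Convert decimal years to (whole years, months), rounded to the nearest month."""
    total_months = round(decimal_years * 12)
    # divmod carries correctly when the fraction rounds up to a full 12 months.
    return divmod(total_months, 12)

years, months = years_and_months(78.11)
print(f"{years} years, {months} month(s)")
```

Working in total months first avoids the awkward "78 years, 12 months" that appears if the fractional part is rounded on its own.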

The scatter chart relating survival rates of people with various ailments and the survival rates of viruses/bacteria left outside our bodies is alright but do we care about this correlation?


I hate to be so negative but I can't believe these are examples of good infographics.

My appeal for readers to send in positive examples still stands!

Explaining the appeal of certain graphs

Andrew points to the following graph by Adam Bonica as Exhibit A for a chart attracting popular attention but breaking Tufte-ish rules galore:

I have a slightly different view about this chart. I don't think it's popular because people find it "pretty" or "eye-catching". I think it's popular because it addresses a topic people are fascinated by -- as Adam describes it, it's the "ideological ranking of occupations".

Andrew has provided a few pointers on his post, which I won't repeat here.

Instead, I'll point out some questions that come to mind when I look at this chart -- without help from the rest of the article Erik Voeten (who used Adam's chart) wrote, or any domain knowledge of this area of study.

  • Non-U.S. readers may not understand blue vs. red. (could guess from liberal-conservative label or left-right orientation but why leave it to guessing?)
  • Curious about the horizontal scale. It looks like a standard-deviation type scale but the distributions of probability mass are far from normal (not that they need to be normal)
  • There are no occupations to the right of "Oil and Gas" and yet most of the red mass sits to its right. Who are those unnamed rightists?
  • Similarly, who are to the left of people in the movie business?
  • The data is screaming for a Gapminder style treatment! Would love to be able to click on "professors" or "investment banking" to look at the dispersion within each occupation.
  • Not sure how to read the overlapped areas... for example, should I read Auto-dealers as the right tail of the blue distribution or the near-mode of the red distribution?

How do you feel about this chart? Are you attracted to the visual or to the topic?


PS. Adam has put up a new version of the chart, reflecting some of Andrew's suggestions.


He also explains that the vertical axis is related to campaign contributions. Note that the bars plotted are the 40th and 60th percentiles, so the dispersion for some of these occupations is gigantic.

Mystery index

Economists have their misery index; dentists, it seems, have a mystery index.

Laird Harrison, senior editor at DrBicuspid.com, an online newsletter for the dental community, pointed me to this chart when he interviewed me about how to interpret the findings in the latest Quarterly Survey of Economic Confidence, conducted by the American Dental Association. (Note: you have to register to read his article. Registration is free.)

When faced with an index, the first thing to do is to find out what the reference level (here, the zero level) means. Although the report is littered with dozens of similar graphs showing all kinds of indices, I cannot find any definition of the reference level, not even in the methodology appendix.  The closest is the following directive for reading the chart I printed above:

For example, [this figure] illustrates that the Net Income Index improved by approximately 10% between 3rd and 4th quarters in 2009, an increase that was driven by 6% fewer dentists responding that net income had declined, approximately 2% more dentists indicating that net income was about the same, and 5% more dentists reporting that net income had increased.

For this survey question, respondents could answer that their Net Income increased, stayed about the same or decreased, and correspondingly, these answers were scored +1, 0, -1. But we still do not know what zero means in the Net Income Index.

Fortunately, the raw data was also provided. I plotted the net score differential, essentially the difference in proportion between those who reported income increase and those who reported income decrease:


The shape of this line looks eerily familiar. But what is the zero level?

After some investigation, I found the answer. The reference level is the net score differential, averaged over the six quarters shown on the chart. In essence, the blue line from this chart, if shifted up by the average net score differential, becomes the green line from the first chart.

How would we interpret such an index? The current quarter's differential was about -40%, which was 3 percentage points below the average net score differential between 2008Q3 and 2009Q4 (which was -37%).
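The construction described above can be written out in a few lines of Python. This is a sketch of the index as reverse-engineered in this post, not the ADA's published method. The 2008Q3, 2009Q3 and 2009Q4 values follow the approximate figures quoted in this post; the three middle quarters are placeholders chosen only so that the average comes out to -37, and the response shares in the first example are hypothetical.

```python
def net_score_differential(pct_increased, pct_decreased):
    """Mean of the +1 / 0 / -1 scores, in percentage points.

    "About the same" answers score 0, so they drop out of the sum.
    """
    return pct_increased - pct_decreased

def index_against_average(differentials):
    """Subtract the unweighted average of all quarters from each quarter."""
    baseline = sum(differentials.values()) / len(differentials)
    index = {quarter: d - baseline for quarter, d in differentials.items()}
    return baseline, index

# Hypothetical shares: 15% reported an increase, 55% a decrease.
print(net_score_differential(15, 55))

# Net score differentials in percentage points.
# The 2008Q4, 2009Q1 and 2009Q2 values are PLACEHOLDERS, not survey data.
differentials = {
    "2008Q3": -35, "2008Q4": -30, "2009Q1": -36,
    "2009Q2": -31, "2009Q3": -50, "2009Q4": -40,
}

baseline, index = index_against_average(differentials)
print("Reference level:", baseline)
for quarter, value in index.items():
    print(quarter, value)
```

The zero line of the index is just the six-quarter average, so an index value of -3 for 2009Q4 means "3 points worse than the average of these particular six quarters" and nothing more.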

This index is very problematic. The choice of the past six quarters seems completely arbitrary and ignores any seasonality effect. The use of an unweighted average of the score differentials assumes that there are no quarterly variations in the data.

But the biggest problem surfaces if one focuses attention on, say, 2008Q3. The top chart says that the net score differential for 2008Q3 was 2% above the average differential from 2008Q3 to 2009Q4. But this is a forward-looking number because in 2008Q3, it was not yet known what the net score differentials would be in the next 5 quarters. Usually, indices are constructed using historical data to establish the reference level.
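For contrast, here is a sketch of a reference level that uses only historical data: each quarter is compared with the average of the quarters before it. The function name, the window length and the six values are illustrative assumptions, not anything taken from the ADA report.

```python
def trailing_index(values, window=4):
    """Index each value against the mean of up to `window` EARLIER values.

    The first value has no history, so its index is None.
    """
    result = []
    for i, value in enumerate(values):
        history = values[max(0, i - window):i]  # strictly before position i
        if not history:
            result.append(None)
        else:
            result.append(value - sum(history) / len(history))
    return result

# Illustrative net score differentials (percentage points), oldest first.
differentials = [-35, -30, -36, -31, -50, -40]

for position, value in enumerate(trailing_index(differentials)):
    print(position, value)
```

Under this scheme the value for any quarter never changes when later quarters arrive, which is the property the forward-looking average lacks.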

The mystery is why indexing is even needed. What's wrong with plotting the change in net score differentials?


Reference: "Quarterly Survey of Economic Confidence, Fourth Quarter 2009", American Dental Association, Jan 29 2010.

Leave good alone

In Cousin misfit, we looked at a problematic area chart in which the areas on the chart contain no useful information. The lines in a line chart should carry some meaning, and so too should areas in an area chart.


The Wall Street Journal recently printed something that looked like a cross between a column chart, an area chart, and a flow chart.  Whatever it is, the areas of the pieces do not match the data.

The data describes how the TV market is split between the top 5 brands (comprising over 50% of the total unit sales) and all other brands -- basically the six numbers printed on the chart.

The graphical construct can be broken up into three parts: a stacked column (on the left), a stacked column with gaps (on the right), and some connecting areas (which are parallelograms).

The last two parts are unnecessary, and in particular, the parallelograms distort the total areas.

It can be baffling to the reader why the left column is shorter than the right column when both show the identical data.

At first, I thought this was some kind of flow chart illustrating the change in market share over time but that's not the case.

What's wrong with the standard stacked column?

Reference: "Samsung Edges out TV Rivals", Wall Street Journal, Feb 17 2010.

Beer-stained T-shirt

Reader Jeff G. sent us to this post from Floating Sheep, which walks through an analysis showing which states have the highest beer consumption in the United States. Jeff is not amused by several of their maps.

The first one utilizes overlapping bubbles, which is generally a bad idea but especially bad when the data is as dense as depicted here:

This is a great example to illustrate why the default use of maps for geographical data is sometimes misplaced. The most striking feature of this map (and many others) is the scarcity of data in the middle and the density around major city centers. This just tells us about the overall population density!

When we plot data on maps, we usually want to highlight something other than population density.



The second map, called the "beer belly of America", has circulated a bit on the Web. This is a case of throwing out too much data. It appears that the original data set contains the number of times bars and grocery stores were searched by location of Google Map users (two numbers per location). The plotted data consists only of whether bars or grocery stores were searched more, thus one bit (binary datum) per location.

Because of excessive data reduction, it appears that most of the country is a vast expanse of yellow. I'm sure if one goes back to the frequency of bar or grocery store searches, one will find that yellow comes in many shades.
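A toy example shows how much the one-bit reduction hides. The sketch below is in Python; the three towns and their counts are invented, and the two-counts-per-location layout is only my reading of the original data.

```python
def more_bars(bar_count, grocery_count):
    """The one-bit summary: were bars searched more than grocery stores?"""
    return bar_count > grocery_count

def bar_share(bar_count, grocery_count):
    """The continuous summary: bars as a share of all searches at a location."""
    total = bar_count + grocery_count
    if total == 0:
        return None  # no searches at all, so no share to report
    return bar_count / total

# Invented counts: (bar searches, grocery searches).
locations = {"Town A": (5, 95), "Town B": (48, 52), "Town C": (60, 40)}

for name, (bars, groceries) in locations.items():
    print(name, more_bars(bars, groceries), bar_share(bars, groceries))
```

Town A and Town B would get the same color on the binary map even though bars make up 5% of searches in one and 48% in the other; shading by the share would separate them.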

While the maps are quite ugly, I like the way the website walks through their analysis process. I would say that their maps are intended less for final presentation than to aid exploration during the analysis. Indeed, at one point, they computed the number of bars per 10,000 residents (starting with North Dakota, 6.54), which is really the best way to summarize this information. It would be interesting to see this data plotted at the state level.
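The per-capita calculation is worth spelling out because it removes the population effect that dominates the maps. A minimal Python sketch follows; the three states and their counts are hypothetical, not figures from the Floating Sheep post.

```python
def bars_per_10k(bar_count, population):
    """Number of bars per 10,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return bar_count * 10_000 / population

# Hypothetical (bar count, population) pairs.
states = {
    "State A": (1200, 3_000_000),
    "State B": (450, 600_000),
    "State C": (9000, 30_000_000),
}

ranked = sorted(
    ((bars_per_10k(bars, pop), name) for name, (bars, pop) in states.items()),
    reverse=True,
)
for rate, name in ranked:
    print(f"{name}: {rate:.2f} bars per 10,000 residents")
```

In this made-up data, State C has by far the most bars but the lowest rate, which is exactly the kind of reversal a raw-count map cannot show.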

One technical note: the "beer belly" map contains a hidden assumption that the distribution of Google Map searches is the same as the distribution of population. If not, what we are looking at is the combined effect of the popularity of Google Map searches and beer consumption.