
Buy one get one free costs you one way or another

This chart by Nielsen's mobile unit, explaining the variability of cell phone usage by age group in the U.S., looks innocently standard. It's a grouped bar chart.

I have long argued that this type of chart should be used sparingly. Notice that the key point of the chart requires comparing voice (or text) usage across age groups; this means readers are asked to compare the relative lengths of the blue (or green) bars. With grouped bars, readers are asked to literally jump over bars to perform the comparisons. The more groups, the more bars to jump. I don't like to exercise my readers in this way.

Something else more damaging is lurking beneath. We can see this by splitting up the chart. Here, I'm reproducing the two charts as is, but in a small multiples format.

Recall that the key practical question here is to compare usage across age groups. Now look at the massive amount of empty space on the Voice usage chart: the space crowds out the data, meaning that the lengths of the voice bars are compressed, affecting our perception of differences.

What is causing this problem? It's the use of one axis for disparate data. Voice is measured in minutes and texts in number of messages; in trying to avoid double axes, the designer manages to make things worse by lumping them together.
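For readers who want to try the fix at home, here is a minimal matplotlib sketch of the small-multiples approach. The numbers are made up for illustration (not the actual Nielsen figures); the point is that each panel gets its own horizontal scale, so minutes and message counts never share an axis:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Illustrative numbers only -- not the actual Nielsen figures
ages = ["13-17", "18-24", "25-34", "35-44", "45-54", "55-64", "65+"]
voice_minutes = [650, 980, 950, 820, 760, 590, 360]
texts = [3300, 1600, 930, 590, 430, 210, 80]

# Small multiples: one panel per metric, each with its own scale
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.barh(ages, voice_minutes, color="steelblue")
ax1.set_title("Voice (minutes per month)")
ax2.barh(ages, texts, color="seagreen")
ax2.set_title("Texts (messages per month)")
fig.tight_layout()
fig.savefig("nielsen_redo.png")
```

Because the two panels are scaled independently, neither metric's bars get compressed by the other's range.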

In terms of the Trifecta checkup: they outlined an interesting question and collected the right data to address it, but the execution of the graphic was wanting.

It's a case of buy one get one free, except that when you go home, you realize you would never buy that second item on its own.


PS. I have more to say about the statistical thinking behind this Nielsen press release at the sister blog.

Another contest

Here's another infographics contest.  It's organized by the White House (kids.gov). The theme is "How do I become President?" It's billed as an infographics contest although I'm not sure how much data can be leveraged to answer this question. Also, as a naturalized citizen, I can't help but think that certain American kids are less excited about entering this contest. (Paging Arnie in California ...)

Trying too hard

Daniel L. submitted this infographic with some positive comments:

-There's quite a bit of information.  The objects on the chart have some depth beyond dots on a page. 
-I'm kind of willing to overlook the size of the infographic because I think there's pretty good use of the page.  Yeah....I know...a lot of blank space in there, but I think that's the price of doing the 3 column schtick - and I think it works in this case. 

One weird thing:
Top to bottom it goes from 2010 back to 2001.

I am less impressed with this chart. (The full version is here. It's big.) There are many problems:

It has all the signs of having tried too hard. There is indeed a trove of information. We are presented with each of Google's acquisitions, the time of each deal, the value of each deal, whether the deal happened in a busy or quiet period, the type of deal by synergy with Google, the type of business the acquired company is in, and the impact on Google's financials. As if this were not enough, the chart includes the months without any acquisitions in small gray letters.

But the designer seems to have no idea what the plot is. I can't figure out what I am supposed to read from this poster. While I often dislike the graphical details of this genre of posters, I can usually enjoy the attempt to tell a story using the data; on this one, I just don't know what to make of it.

Also notice that the foremost dimension on this chart is the chronology. To me, it is the least important dimension. If one wants to understand what Google's acquisition strategy is, for example, the chronology is not important - it would have been much more informative to group the acquisitions by type of business, or impact on financials, or any number of other dimensions.

Two RSS feeds

I learnt at the JMP Conference that many of you who are reading Junk Charts through an RSS feed are not aware of my sister blog, which discusses written (rather than visual) communications of statistical topics. So to clear up the confusion, there are in fact two separate RSS feeds:

To subscribe to Numbers Rule Your World (sister blog), click here

To subscribe to Junk Charts, click here

The Twitter feed combines both in one place, if that's what you want.


Hope you'll enjoy both blogs!

Failed university education

The Times Higher Education magazine fancies itself an arbiter of good universities, and yet it appears not to have heard of Tufte, or to know why we should never use 3-D pie charts.

Reader Cedric K. sent in this chart, with a note of dismay. Quick, which is most important: the pink, the blue or the green?


Something like the stacked bar chart shown below delivers the information more effectively. The section showing sub-categories can be omitted.


If, in fact, it is crucial for readers to know each weight to two decimal places, then why not just print a data table? The beauty of a data table is that it can accommodate long text strings, which are needed in this case to explain clearly what the subcategories actually mean.

If one wants bells and whistles, one can add little bars to the right of the proportions to visualize the weights.
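The table-with-bars idea takes only a few lines of code. Here is a Python sketch; the category names and weights are illustrative stand-ins, not necessarily the magazine's exact figures:

```python
# Illustrative weights (percent) -- stand-ins for the magazine's figures
weights = [
    ("Citations", 32.50),
    ("Teaching", 30.00),
    ("Research", 30.00),
    ("International mix", 5.00),
    ("Industry income", 2.50),
]

# A plain data table, with a small bar to the right of each weight
for name, w in weights:
    bar = "#" * int(round(w / 2))  # one mark per two percentage points
    print(f"{name:<20}{w:6.2f}%  {bar}")
```

The exact decimals survive in the table, and the bars give the at-a-glance comparison that the 3-D pie fails to deliver.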



Pies fail to deliver

The Wall Street Journal reported that the Ritz-Carlton brand of hotels has been hit harder by the slump than other brands in the Marriott family, and, after holding out for a long time, has recently launched a loyalty program as a result.

The following serving of pie charts shows the occupancy rates over the past three years. In the second layer of charts, I removed the data from the chart to show why this chart is not self-sufficient. Without the data printed directly on the chart, it is difficult to read the individual occupancy rates, and even harder to figure out that the decline was worse for the Ritz-Carlton brand.


A line chart brings out the message clearly and directly.
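A minimal matplotlib sketch of such a line chart, with made-up occupancy numbers standing in for the WSJ's actual figures:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Made-up occupancy rates (percent); the WSJ's actual figures differ
years = [2007, 2008, 2009]
ritz = [74, 68, 57]
other_marriott = [73, 70, 63]

# Two lines show both the levels and the steeper decline at a glance
fig, ax = plt.subplots()
ax.plot(years, ritz, marker="o", label="Ritz-Carlton")
ax.plot(years, other_marriott, marker="o", label="Other Marriott brands")
ax.set_xticks(years)
ax.set_ylabel("Occupancy rate (%)")
ax.legend()
fig.savefig("occupancy_lines.png")
```

The steeper slope of one line is exactly the "decline was worse" message that six pie slices bury.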


Some links

Daniel L. points us to the visualization of the 2010 elections by the New York Times. These are pretty good, and do a good job highlighting the most important question: how many seats are up for grabs, and which party they are leaning toward at the moment. I wonder if these predictions are continuously being updated -- if so, some kind of time-series chart showing the state of competition in the toss-up states would be interesting to look at.


Chris P. takes the folks at Quantcast to task for an innocent-looking typo. Lest you think we're nit-picking, the entire chart only contains 12 data points, so the error rate is almost 10%. And, by the way, they commit a much less forgivable error, which is to use different scales in a small multiples setting: the 17% growth on the Yearly chart is the same height as the 5.5% growth on the Quarterly chart. In any case, I don't understand why there are charts for monthly, quarterly, and yearly data.




The headline writers at Business Insider continue to play fast and loose. Yes, the map of bubbles is classic chartjunk but how does this chart lead to their conclusion that "Americans are more caring than 99% of the world"? Pray tell.

For one thing, our neighbor has a bubble that is slightly larger than ours.



The chart itself is a shocker. Instead of boring you with a term paper, I just want to tell you the most counter-intuitive insight I gleaned from this chart: since the more countries there are in a continent, the more caring its people, we should break the U.S. up into 50 entities tomorrow.

Answering an open call

Dan Goldstein, who writes the Decision Science News blog, relays an internal debate occurring at Yahoo! about the relative merits of some simple charts. From what I can tell, they used three methods (known as "Search", "Baseline", and "Combined") on four sample data sets with different subjects ("flu", "movies", "music", "games") and compared the performance of the methods. I imagine the underlying practical question to be: does having search data improve the performance of some kind of predictive model that can be applied to the different data sets? There is an existing baseline model that does not use search data.

(I noticed that Dan has since put up the final version of the chart they decided to use for publication. I will ignore that for the moment, and put up my response. Their final version is similar to my revised version.)


I'd like to use this data to reiterate a couple of principles that I have championed here over the years.

First, we must start a bar chart at zero. There was some back and forth on Dan's blog about whether this should be an iron-clad rule, and some comments that it is not a big deal. It is a big deal; just take a look:

The left chart uses the full scale while the right chart chops off everything below 0.4. The result of the chopping is that the relative lengths of the bars are distorted. For the music data, the search method appears (on the right) to be half as effective as the baseline, which is far from reality, as shown on the left chart.
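The distortion is easy to quantify. A quick Python sketch, using stand-in numbers (I'm assuming a baseline score of 0.8 and a search score of 0.6, not the study's actual values):

```python
# Stand-in scores, not the actual values from the study
baseline = 0.8
search = 0.6
axis_start = 0.4  # where the truncated chart begins

# True ratio of the two values (what a zero-based bar chart shows)
true_ratio = search / baseline  # ~0.75

# Apparent ratio of the truncated bars (what the chopped chart shows)
apparent_ratio = (search - axis_start) / (baseline - axis_start)  # ~0.5

print(round(true_ratio, 2), round(apparent_ratio, 2))
```

Chopping at 0.4 shrinks a 75% ratio to an apparent 50%: the bars lie by exactly the amount that was chopped.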

I used R to generate these charts, and was pleasantly surprised that the barplot function automatically assumes that bars start at zero. If you try to start the vertical axis above zero, the bars would literally walk off the chart, making it extremely ugly! (I had to pull some tricks to create the version shown above.)


Andrew Gelman suggested using a line chart. He also recently wrote that he has become a fan of line charts. Long-time readers know I am a fan of line charts, too... and I have tifosi who come here to complain about my over-use of line charts, especially when we have categorical data (as here!).

In particular, I have written about grouped bar charts before; most of the time, they can be converted into line charts, and made clearer and better. (See here, or here.)

Some of the readers of Dan's blog complained that the dot plot makes it difficult to compare the performance of say the search method across different subjects (data sets). They think the bar charts do this better.

If comparing across subjects is the key activity for the reader of this chart, then a line chart is even better. Imagine you are reading the bar chart and comparing across subjects. Follow your eyes. You are essentially tracing lines across the top of the bars. The line chart makes this explicit. That, to me, is the key argument for using line charts in place of grouped bar charts.

I have made this argument before. Here is an illustration of the argument. The broken red lines are the same as the lines in the line chart.
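To make the tracing concrete, here is a small matplotlib sketch that plots one line per method across the four subjects. The scores are invented for illustration, chosen so that Combined dominates and Search lags on music, as in the discussion:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Invented scores for illustration -- not the actual study results
subjects = ["flu", "movies", "music", "games"]
scores = {
    "Combined": [0.80, 0.75, 0.82, 0.78],
    "Baseline": [0.76, 0.72, 0.81, 0.74],
    "Search":   [0.74, 0.70, 0.62, 0.73],
}

# One line per method: each line traces what the eye would otherwise
# do across the tops of grouped bars
fig, ax = plt.subplots()
for method, vals in scores.items():
    ax.plot(subjects, vals, marker="o", label=method)
ax.set_ylabel("Performance")
ax.legend()
fig.savefig("methods_lines.png")
```

With one line per method, comparing a method across subjects is a matter of following a single line rather than jumping over bars.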



On the line chart shown above, it's easy to see that the Combined method has the best average performance, and is never worse than the other two methods for any of the subjects. It also shows that the music subject differentiates the three methods most, primarily because the search data was not adding much to the effort. There is also no need to add colors, which can quickly make the bar charts unwieldy and disorienting.

In the final chart, shown on the right below, I flipped the two axes, changed the plot characters, used colors, shifted emphasis slightly to dots rather than lines, and started the chart at 0.5 (!)


Line charts are more flexible in that they can make sense even when the axis does not start at zero. In particular, when the point of the chart is to make comparisons, that is, to look at the gaps between dots or lines, rather than the absolute values, then it is fine to start the axis at some place other than zero.

Take again the example of the performance on music by the three methods (red line). The drop in performance between combined and baseline and that between baseline and search are indeed roughly equal. The vertical distances to the bottom of the chart are still distorted as in the bar chart, but in a line chart, readers are less likely to be distracted by those distances because the bars are not there.