
Nuke this bubble chart

This unfortunate chartjunk appeared in NYT Magazine this weekend.  Once again, bubbles prove to degrade, not enhance, our ability to interpret the data.

How to explain the overlapping circles?  The solid versus empty bubbles?  Those with numbers inside, and to the left or to the right?  Those bubbles showing a precise number and those that show a range?  Pakistan ranking below India?

The chart fails our self-sufficiency test: the chart does not lose any power if we remove all the bubbles because every piece of data has been printed on it.

A two-sided dot chart may be appropriate here, shown next.  The relative scale of Russia and U.S. warheads to those of other nuclear powers is starkly revealed.


Review: Curve Ball

A kind reader sent me a Christmas gift, which accompanied me on my vacation.  The book is Curve Ball by Jim Albert and Jay Bennett, and I'm completely fascinated by it.  It presents a statistical perspective on baseball data, a soothing antidote to the nonsense spouted by the typical sportscaster.  Even more impressively, the book is liberally sprinkled with charts, and these charts are generally of a very high standard.

Their first feat was to debunk the myth of the batting average BA (hits divided by at-bats).  They accomplish this using an innovative chart.
Each vertical bar is the estimated range of the batter's BA after he has had a given number of at-bats.  The bars get shorter as the number of at-bats increases because over the course of the season, we can be more and more certain of the batter's true hitting ability.

Notice that the bar is very tall in the first 100 at-bats, roughly ranging from 0.35 to 0.50.  This illustrates why statisticians love data quantity: without sufficient samples, any estimation is highly unreliable.

Also notice that the rate of shortening is very slow after, say, 250 at-bats; after 700 at-bats (roughly a full season), the bar is still about 0.06 tall, roughly between 0.385 and 0.459.  This shows why BA is not as definitive as usually thought.  Looking up 2005 batting statistics, one finds that Derrek Lee, the top hitter, hit 0.335.  This means his true batting average is roughly between 0.305 and 0.365.  There were 20 other hitters who hit at least 0.305.

Further, because the 2005 league BA was 0.264, any player with BA between 0.234 and 0.294 may be a league-average hitter.  Looking up the statistics, one finds that this range includes hitters ranked 37 through 150 (which is the end of the list).
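To get a feel for these ranges, here is a minimal sketch that treats each at-bat as an independent trial with a fixed hit probability and applies a normal approximation to the binomial.  This is an assumption made for illustration and not necessarily the method Albert and Bennett use; the function name `ba_interval` and the multiplier `z=1.645` (roughly a 90% range) are my own choices.

```python
import math

def ba_interval(avg, at_bats, z=1.645):
    """Approximate range for a hitter's true batting average.

    Assumes at-bats are independent trials (normal approximation to the
    binomial); z=1.645 gives roughly a 90% range. Both are illustrative
    assumptions, not taken from the book.
    """
    se = math.sqrt(avg * (1 - avg) / at_bats)  # standard error of the average
    return avg - z * se, avg + z * se

# How the range narrows as at-bats accumulate for a 0.335 hitter.
for n in (100, 250, 700):
    low, high = ba_interval(0.335, n)
    print(f"{n:>4} at-bats: {low:.3f} to {high:.3f} (height {high - low:.3f})")
```

Under these assumptions the range for a 0.335 hitter is about 0.06 tall at 700 at-bats, and quadrupling the at-bats only halves the height, which is why the bars shrink so slowly late in the season.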

More to come...

Reference: Albert and Bennett, Curve Ball, pp. 67-8

Concordance, or tag clouds

I noticed that Amazon has adopted the tag cloud metaphor in its newest feature known as "concordance".  Clicking on concordance gives you a list of the top 100 most frequently occurring words in the book; mousing over each word provides the exact number of mentions in the book; clicking on the word brings up pages on which the word is mentioned.
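For readers who want to build something similar for their own text, here is a minimal sketch of a word-frequency concordance.  It is not Amazon's implementation: the tokenizing rule, the `concordance` function name and the tiny stop-word list are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "of", "has", "and", "to", "in"}

def concordance(text, top_n=100, stop_words=STOP_WORDS):
    """Return the top_n most frequent words in text with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(top_n)

sample = ("Probability is the study of random events. "
          "A random variable has a probability distribution.")
for word, count in concordance(sample, top_n=5):
    print(word, count)
```

The counts play the role of the mouse-over numbers; scaling font size by count would turn the list into a tag cloud.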

They are using the simple and elegant presentation that I praised here, the same as Flickr.  Beautiful as it is, it took me a little while to come up with a use case for this feature.  But I did!

Imagine someone wanting to buy a book on probability for self-study.  It is a cardinal rule of book publishing that every textbook must be labelled "introduction" or "elementary", regardless of content.  But Amazon's concordance is here to help.  Here are four books of increasing difficulty (Aczel's Chance, Ross' Probability Models, Resnick's Probability Path and Dudley's Real Analysis and Probability):
Looking at the tag clouds, one can roughly judge the level of sophistication of these books.  Below I present them in mixed order.

[1] appears to be an elementary book that emphasizes the key concepts ("probability", "random", "distribution", "independent") while "customers" is the most interesting word indicating it is perhaps an applied book.  [2] is even more novice as we don't find words like "suppose", "system" and "function" that showed up in [1].  Words like "martingale", "sequence", "convergence" give away [3] as reaching another level of sophistication.  I should click on "oc" and "oo" to find out what these mean.  [4] is the only book on probability where "probability" is not in the top 10; it is evidently entirely theoretical with oodles of measure theory.  (So, [1] Ross [2] Aczel [3] Resnick [4] Dudley.)

How else have you used this concordance feature?  Let us know!

Dissecting two axes

This chart shows why statisticians don't like seeing two vertical scales.  The top chart on the right is roughly the same as the original.  The bottom right chart compresses the vertical scale so each unit of length represents twice as many dollars as before.  This small manipulation effectively halves the slope of the line and so the growth appears less pronounced in the bottom chart.  But there is no reason to pick one over the other: they plot the same data.
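The halving of the slope can be checked with a small sketch.  The numbers below are hypothetical (they are not read off the McKinsey chart), and `apparent_slope` is a name I made up; the point is only that the drawn slope depends on the axis range chosen, not on the data alone.

```python
def apparent_slope(y0, y1, x0, x1, y_range, x_range, height=1.0, width=1.0):
    """Slope of a line segment as drawn on the page.

    y_range and x_range are the spans of data covered by each axis;
    height and width are the plot dimensions in page units.
    """
    dy = (y1 - y0) / y_range * height  # vertical rise on the page
    dx = (x1 - x0) / x_range * width   # horizontal run on the page
    return dy / dx

# Hypothetical series: a value rising from 20 to 60 over 5 years.
original = apparent_slope(20, 60, 2000, 2005, y_range=100, x_range=5)
compressed = apparent_slope(20, 60, 2000, 2005, y_range=200, x_range=5)
print(original, compressed)
```

Doubling the range covered by the vertical axis halves the drawn slope even though the data are unchanged, so with two independent vertical scales the relative steepness of the two lines is a matter of the designer's choice.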


Observe also that the criss-crossings between the blue and red lines are artificial, as they disappear from the bottom chart which plots the same data.

Finally, a scatter plot is better able to show the inter-relationship between oil price and capital expenditure.

Reference: "Capital discipline for Big Oil", McKinsey Quarterly, December 2005

The redundant dimension eye-trick

My friend Patrick was particularly incensed by this chart, from the Economist publication "The World in 2006", which has been discussed here and here.  It employs a typical trick to make charts more "entertaining", that is, introducing an extra dimension, region of the world in this case.  As the right-side junkart version shows, collapsing this dimension results in a much clearer graph.  Disagree?  Try figuring out which columns to contrast in the left chart, and you might get dizzy as if reading an Escher "impossible trident" (more Escher goodies here).


Reference: "Wider but not deeper", The World in 2006.

Two easy pieces


The life-expectancy chart (top right) gives the false impression that men live half as long as women.  The problem is easily fixed using the start-at-zero rule, as shown in the bottom chart.
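A quick sketch shows how a truncated axis produces this kind of impression.  The life expectancies below are hypothetical round numbers, not the BBC's data, and `bar_length_ratio` is an illustrative helper.

```python
def bar_length_ratio(a, b, baseline=0.0):
    """Ratio of the drawn lengths of two bars that both start at baseline."""
    return (a - baseline) / (b - baseline)

# Hypothetical life expectancies in years (not taken from the BBC chart).
men, women = 70.0, 80.0

truncated = bar_length_ratio(men, women, baseline=60.0)  # axis starts at 60
from_zero = bar_length_ratio(men, women, baseline=0.0)   # axis starts at 0
print(f"axis starting at 60: men's bar is {truncated:.3f} of women's")
print(f"axis starting at 0:  men's bar is {from_zero:.3f} of women's")
```

With the axis starting at 60, the men's bar is half the length of the women's; starting at zero, it is 0.875 of the length, which matches the actual ratio of the two values.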

Demographers use a side-by-side bar chart (known as a population pyramid) to plot such data.  A variation is shown.  This construct facilitates inter-country comparisons but is less than effective here because the ages are bunched together.

Of interest in this data is whether the female/male life expectancy gap is constant across countries.  The last graph in this series shows which countries are above or below the group average; in each country, women live on average at least ~6% longer.

Reference: BBC News website (thanks to Tom for the tip-off)

Here is a great use of a line chart.  The clear message is that Toyota (and to a lesser extent, Japanese auto-makers) are relying less on price-cutting to attract customers than the Americans.  Particular praise should be given to the judicious and spare choice of axis markers.


Reference: "Toyota Shows the Big 3 ...", New York Times, Jan 13 2006.

Can good charts be entertaining?

A response to Jack's comments on the Economist charts.

Junk Charts pleads guilty to the charge that this blog's attitude is seriously serious, except on rare occasions.  That is because we believe data analysis to be a serious subject.  That said, we do wonder how entertainment value can co-exist with data integrity; and thus far, we have not found the happy medium.

Tufte's favorite chart of Napoleon's Russian campaign is one example of an entertaining and informative chart.  For anyone who knows or follows the Bumps Race, the Bumps chart is highly expressive.  We believe that entertainment can be a by-product of graph-making, but deliberately seeking it is folly.

Case in point: the palm-tree hedge-fund plot Jack thought to be funny.

At the least, when adding entertainment, the designer must be careful not to distort the data.  Even minor chartjunk can insidiously ruin an otherwise competent chart, as happened here.

Getting rid of the chartjunk, we would revert to a standard time-series chart on a rectangular grid.  The palm-tree axis, being curved, is a curious little feature.  Its presence means that the rectangular grid interpretation no longer applies!  When reading the data for 2000, for instance, one must trace a curved line upwards, not the usual vertical line.

The right chart illustrates this.  If the designer switches to a curved grid, then the trend line must be transformed from the black line to the red line.  (This may remind some of Jacobian transformations in multi-variable calculus.)  The error in the Economist chart is akin to showing the black trend line on the red grid.

Also, when the designer focuses on beautifying the chart, she may become careless.  For instance, why on earth should the vertical axis start at negative $25 billion in assets?  One would think that hedge funds with negative assets do not, and cannot, exist.  Perhaps it's truly "far from expected" in the Caymans!

I encourage other readers to comment if they have ideas as to how to integrate entertainment into data graphics.

Feast for the eyes?

Readers of this site will know that the otherwise venerable British business magazine, the Economist, could use some help with its data graphics.  I have commented on two obsessions in particular: the awful donut chart and the appending of an additional data series to a line chart.

Readers familiar with the USA Today newspaper will know about their one-a-day graphic on the lower left corner of the front page.  I have avoided commenting on them because they usually violate every rule in the book.  Here are two from the pile:


It is with some sadness that I must report that the Economist has joined the race to the bottom.  Its recent publication called "The World in 2006" contains a score of exasperating, over-adorned graphics of the USA Today variety.  Consider these specimens (thanks to Patrick for alerting me).





Stress in chart-reading

This chart, aptly titled, imposes significant stress on readers.  The key message appears to be that "misery" as defined by Merrill Lynch has increased sharply in the United States in the last decade, vaulting it to first place amongst G7 nations.

That conclusion is an unfortunate misreading of the data.  Keen readers will notice that the absolute U.S. index increased only by 3 points yet the U.S. rank dropped from 2nd best to worst.

To add to the confusion, the two sides of the bar chart use different scales.  For example, Germany's bar (18 on the left) is about 1/3 the length of the U.S. bar (18 on the right).  And the fact that Japan's left bar (4) appears longer than its right bar (6), and about the same length as the U.S. left bar (15), indicates sloppy editing.

In other words, this is a chart not worthy of a fine publication.

As usual, the Bumps chart (here, here, here, etc.) is the best way to highlight all the important learning, namely:

  • Misery has declined in all countries except the U.S. and Japan
  • Misery has increased in the U.S. but its drop to the bottom should be attributed mostly to sharp improvements in other G7 nations
  • The spread in the misery index has halved from 24 = 28-4 to 12 = 18-6

Further insights can be obtained if we have data for the intervening years.

A key characteristic of a good chart is a low stress factor, and the published chart fails this test.

Reference: "Why Investors Don't Live by Current Bets Alone", New York Times, Dec 25, 2005.

P.S. I'll gradually resume regular posting this week as I fight off jet lag.