Bar in a bar

I have been meaning to comment on the bar-in-a-bar chart for a while, and have finally found a good example.  This type of chart figures prominently in NYT but is generally inferior to a dot plot or an interval plot.

The article dealt with the alarming finding that superintendents in New Jersey school districts may have under-reported their total compensation to the Department of Education.  Depicted in the chart are a set of 12 "paired" differences, each comprising a pair of numbers, the reported salary and the actual salary.

In the bar-in-a-bar plot, one data series is drawn as fat gray bars while the other data series uses thin black bars superimposed on the gray bars.  Aside from its ugliness, this chart also distorts our perception of the data as the area of a bar is no longer proportional to the salary number.

Worse still, this form takes our attention away from the key statistic, that being the gap between reported and actual salary.  The interval plot below remedies these problems; it also adopts a more reasonable ordering, by the size of the salary gap.


Presented in this way, the chart draws attention to the phenomenon that the higher the reported pay, the larger the pay gap.  The following scatter plot takes up this topic by plotting the salary gap as a percentage of the reported salary against the reported salary.
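For readers who want to reproduce the two quantities behind these charts, here is a minimal sketch.  The district names and salary figures are made up for illustration (they are not the numbers in the NYT chart), and the function name `salary_gaps` is my own.

```python
# Hypothetical figures for illustration only; NOT the data from the NYT chart.
salaries = {
    "District A": (150_000, 180_000),  # (reported, actual)
    "District B": (200_000, 260_000),
    "District C": (120_000, 126_000),
}

def salary_gaps(data):
    """Return rows of (name, reported, actual, gap, gap as % of reported),
    sorted from the largest gap to the smallest."""
    rows = []
    for name, (reported, actual) in data.items():
        gap = actual - reported
        rows.append((name, reported, actual, gap, 100.0 * gap / reported))
    return sorted(rows, key=lambda row: row[3], reverse=True)

rows = salary_gaps(salaries)
for name, reported, actual, gap, pct in rows:
    print(f"{name}: reported {reported:,}, actual {actual:,}, gap {gap:,} ({pct:.1f}%)")
```

The sorted rows give the ordering for the interval plot; the (reported, percentage gap) pairs are the points of the scatter plot.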


Reference: "Leading New Jersey's Schools Has Its Price: High", New York Times, March 14, 2006.

Tukey's Box Plots

Happy to see a fantastic graphic in NYT this past Saturday.  The chart is a variant of Tukey's box plot, which essentially summarizes the distribution of a data series, displaying especially its dispersion.

Typically, a box plot contains a five-number summary.  The version used here has three numbers: max, min and the most current cash incentive.
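To make the difference concrete, here is a rough sketch of both summaries.  The twelve monthly incentive values are invented, and the quartile convention (the "inclusive" method in Python's statistics module) is just one of several in common use, so treat the quartiles as approximate.

```python
import statistics

# Hypothetical 12 months of cash incentives for one brand, oldest first.
# These numbers are made up; they are not read off the NYT graphic.
monthly_incentives = [1000, 1200, 900, 1500, 1100, 1300,
                      800, 1400, 1600, 1000, 1700, 1800]

def five_number_summary(values):
    """Tukey-style summary: (min, lower quartile, median, upper quartile, max)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def three_number_summary(values):
    """The reduced version described above: (min, max, most current value).
    Assumes the values are in chronological order."""
    return min(values), max(values), values[-1]

five = five_number_summary(monthly_incentives)
three = three_number_summary(monthly_incentives)
print(five)
print(three)
```

In this made-up series the current value coincides with the 12-month maximum, which is the Mercedes situation described below.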

Having the ten box plots side by side is a powerful way to compare different groups of objects.  Even better is the care taken to sort the car type from largest current incentive to smallest.  The chart is really powerful as the reader can glean many insights at a glance, for instance:

  • Lincoln and Cadillac generally have the best incentives while Lexus and Acura offer much less
  • Mercedes, Saab and BMW have changed their incentive structure the most in terms of the range of incentives
  • February was a good month to buy Saab, Mercedes, Volvo, Infiniti or even Lincoln as the incentive levels for these brands are close to the 12-month maxima
  • It is a particularly great time to get a Mercedes because the current incentive is the highest in the past 12 months among a huge range
  • On the other hand, it may not be wise to buy Cadillac or Lexus

Some minor improvements can be made to the chart.  The lines linking the left edges of the boxes to the vertical axes are redundant.

More seriously, the "average incentive" row at the bottom tends to confuse rather than enlighten.  The minimum "average incentive" represents the average incentive across the 9 brands in some specific month.  Say that month is August.  Then the minimum = [X1(8) + ... + X9(8)]/9 where (8) means August and X is the incentive.  The reader is asked to compare this number to the minimum of each of the other boxes, but this is apples to oranges.  For example, if Lexus offered its minimum incentive in January, then the left end of the Lexus box = X9(1) where (1) means January and X9 indicates the Lexus incentive.  (Notice that X9(8), not X9(1), was used in the minimum "average" incentive calculation.)
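A toy calculation shows the mismatch.  I use three imaginary brands over four months rather than the real nine brands over twelve months, and all the figures are invented.

```python
# Hypothetical incentives by brand, months 0..3; invented for illustration.
incentives = {
    "Brand A": [3000, 2000, 2500, 4000],
    "Brand B": [1000, 1500, 500, 1200],
    "Brand C": [800, 600, 900, 300],
}
n_months = 4

# Average across brands, month by month (what the "average incentive" row uses).
monthly_avg = [
    sum(series[m] for series in incentives.values()) / len(incentives)
    for m in range(n_months)
]

# Month in which the cross-brand average is lowest.
min_avg_month = min(range(n_months), key=lambda m: monthly_avg[m])

# Month in which each individual brand hits its own minimum
# (the left edge of that brand's box).
brand_min_month = {
    brand: min(range(n_months), key=lambda m: series[m])
    for brand, series in incentives.items()
}

print("lowest monthly average:", monthly_avg[min_avg_month], "in month", min_avg_month)
print("month of each brand's own minimum:", brand_min_month)
```

The lowest average falls in one month while the three brands bottom out in three different months, so the left end of the "average" box is not comparable to the left ends of the brand boxes.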

Therefore, the only useful number in the last row is the current month's average incentive across all 9 brands.  This average can easily be eye-balled by looking over the first 9 rows.  The last row should be removed.

A further variant of this chart would be a dot plot.  So instead of using just the max and min, print all 12 data points, perhaps using smaller dots for everything other than the current month.  Such a treatment would, for instance, allow us to judge whether Mercedes had many months of low incentives or just one month of low incentives (causing the box to become so wide).

In summary, this graphic is much more informative and occupies much less space than most newspaper charts, and is totally worthy of this newspaper.

Reference: New York Times, March 18 2006.

PS. Can't let this post appear without a rant... when will Excel include Tukey's box plot as one of the key chart types?

Readers speak up

I'm going to start printing reader submissions.  Here's one from Jen and Peter, of Library House, a consulting outfit in Cambridge UK.

" We think they tried to say something simple, but ended up saying very little.

What is the message of that chart?

Investment went up and down, number of deals has varied, what about the average disclosed deal size? What is it they try to get across?

We would have drawn this chart differently, probably like the picture attached [Ed: see below]. Our guess would be that the key message in this chart was about trends. Well the key conclusion is that the total number of VC investments in China has remained relatively stable over the last five years, but that a dramatic increase in deal size [Ed: my italics] has increased the total amount of money deployed quite significantly. Alternatively, one might conclude that a lot of money looking for a new home has increased the deal size... "

It appears that they have indexed each variable so that 2001 = 100, and hence all three vertical scales are percentages of the 2001 level.  This is a rare instance where superimposing the first two lines on the same chart would prove insightful (the third line is a derivative of the other two).  Indexing harmonizes the scales, so the superimposed chart works with only one vertical axis.
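Here is a small sketch of the indexing, under my assumption that average deal size is simply total investment divided by number of deals.  The yearly figures are invented and are not Library House's data.

```python
def index_to_base(series, base_year):
    """Rescale a {year: value} series so that the base year equals 100."""
    base = series[base_year]
    return {year: 100.0 * value / base for year, value in series.items()}

# Hypothetical yearly totals; invented for illustration.
investment = {2001: 400.0, 2002: 300.0, 2003: 600.0}  # e.g. $ millions
deals = {2001: 100, 2002: 90, 2003: 110}              # number of deals

# Assumed definition: average deal size = investment / number of deals.
avg_deal_size = {year: investment[year] / deals[year] for year in investment}

idx_investment = index_to_base(investment, 2001)
idx_deals = index_to_base(deals, 2001)
idx_avg_size = index_to_base(avg_deal_size, 2001)

for year in sorted(investment):
    print(year, round(idx_investment[year], 1),
          round(idx_deals[year], 1), round(idx_avg_size[year], 1))
```

Under that definition the indexed deal size is just 100 times the ratio of the other two indexed lines, which is why the third line carries no new information.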

Subtraction by addition

Here I excerpted the left half of an advertisement found in Institutional Investor.  It is one of those ubiquitous ads touting one mutual fund or another.  (Click on the ad to see a larger version of it.)

It illustrates the principle of subtracting by adding.  Jam a chart with more data, particularly repetitive data, and confuse the hell out of the reader.

My eyes aren't sure where to focus.  Are the bars important?  With the blue ovals perched atop the bars, most of them appear to be the same height.  Are the percentages in the blue ovals significant?  What are they measuring?  Annual returns?  How is it that the bar heights are not correlated to the percentages?

If you're still paying attention, you might ask if the quartile ranking is the key message?  What about the actual ranking shown at the bottom of the charts?

The unlucky few who scan all the fine print at the bottom will not get any of these questions answered either.

This is a kaleidoscopic chart.  There is only one data series (ranking of each fund in its fund group) but this data is shown four times in four different guises:

  • as an actual ranking listed at the bottom
  • as percentiles written in the blue ovals
  • as percentiles coded in the bar heights
  • as quartiles requiring referencing the vertical axis

Indeed, by adding multiple representations, the original data graphic loses clarity.
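To see how the four guises collapse into one number, consider this sketch.  The ad does not spell out its definitions, so the conventions here (rank 1 is best, percentile = rank as a share of funds in the group, quartile 1 = top 25%) are my assumptions, and the ranks are made up.

```python
import math

def percentile_from_rank(rank, n_funds):
    """Assumed convention: rank 1 is best; percentile = rank / group size, in %."""
    return 100.0 * rank / n_funds

def quartile_from_percentile(percentile):
    """Assumed convention: quartile 1 covers percentiles up to 25, and so on."""
    return max(1, math.ceil(percentile / 25.0))

# Hypothetical (rank, number of funds in the group) pairs.
examples = [(12, 120), (45, 120), (120, 120)]
for rank, n_funds in examples:
    pct = percentile_from_rank(rank, n_funds)
    print(f"rank {rank} of {n_funds}: percentile {pct:.1f}, "
          f"quartile {quartile_from_percentile(pct)}")
```

Every other quantity on the chart follows from the rank and the group size, so printing all of them adds ink but no information.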

Nutrients and colors

From Information Aesthetics comes this entry about how McDonald's in Europe visualized nutrition data.  A few people have already made insightful comments over there.  I find the use of colors a bit gratuitous.  The choice of mapping colors to nutrient groups rather than to coverage of the daily amounts is odd.  This is a good attempt, a definite improvement upon the usual data table, but certainly not the last word.

Information Aesthetics is a great blog that discusses data graphics that are much higher-end and more complex than the ones seen here.  Well worth a read!

Large numbers

It has been reported that Google's CFO invoked the (gasp) "law of large numbers" to explain why the company's growth will inevitably slow.  (See e.g. John Battelle's blog.) Such is the state of statistical education.

I think he meant to say "regression to the mean".

A typical use of the law of large numbers is to justify using random samples to make generalized statements about some larger population.  It says that (under some assumptions) the uncertainty of these statements decreases with increasing sample size.  I can't see the connection with declining revenues.
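A quick simulation illustrates what the law actually says.  This is only a sketch: the population is Uniform(0, 1), chosen for convenience, and the function name and the sample sizes are my own choices.

```python
import random
import statistics

def spread_of_sample_means(sample_size, n_samples=200, seed=0):
    """Draw n_samples random samples of the given size from Uniform(0, 1)
    and return the standard deviation of their sample means."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        sample = [rng.random() for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)

# Compare how much the sample mean wobbles for small versus large samples.
spreads = {n: spread_of_sample_means(n) for n in (10, 1000)}
for n, spread in spreads.items():
    print(f"sample size {n}: spread of sample means = {spread:.4f}")
```

The sample means from the larger samples cluster much more tightly around the population mean of 0.5.  That is a statement about sampling, not about a company's growth rate.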

The danger of dual trending

Statisticians have long advised against forcing two data series with different scales onto the same chart, using dual vertical axes.  The use of color to match data to axis reduces the confusion but in no way prevents the reader from coming to mischievous conclusions, such as this from Forbes magazine:  "If history is any guide, the recent run-up in assets [i.e. capital] ... bodes ill for future returns."

The "run-up" in assets evidently refers to the slight uptick in 2002.  The huge gap between returns and capital from around 1993 to 2001 gives the impression that high returns are correlated with low capital commitment. But this gap is completely an artifact of the chosen scales, as I have discussed before.

For two data series, the scatter plot provides the most illuminating insights.  In this format, one can hardly claim any strong association between the two variables!


Indeed, we notice that there have only been two observations in the last 20 years or so in which the capital commitment has exceeded 0.23% of the stock market.  While those two observations were both paired with low returns, by themselves they do not indicate any trend.  Referring back to the trend chart, we further note that both those observations occurred about 20 years ago.  Besides, we have several instances of the same low returns when the capital commitment was low.

Moreover, at the most likely levels of capital commitment (i.e. between 0 and 0.2%), returns have historically ranged anywhere from 10% to over 50%.

Thus, the association between capital and return is weak, if it exists at all.  The line chart with dual axes presents the false impression of a strong association while the scatter plot shows a different story.

Reference: "Private Equities", Forbes, March 13 2006.


Many times by adding an extra dimension to a chart, the designer unwittingly confuses his audience.  The rectangle plot is useful in very specialized situations, but I still haven't figured out the one used here.

Part of the problem is my own ignorance of this subject matter.  I cannot understand how 0-100 constitutes "poverty level", and if it represents percentiles, then the reader needs to know percentiles of what.

The biggest problem is that the widths of the rectangles are not labeled.  The labels on the y-axis are categories and most definitely not widths.

These rectangles draw our attention to the areas and yet it isn't clear what is being measured.  It appears to be the number of dead people in thousands who belong to some poverty-defined demographic.  In any case, the human eye is also not trained to compare the area of a fat, short rectangle with that of a tall, slim rectangle.

The junkart version is a simple scatter plot.  Since I don't fully understand this chart, I'm sure some readers will have better suggestions.

Reference: "The People's Epidemiologists", Harvard Magazine, March-April 2006.