
Tukey's Box Plots

Happy to see a fantastic graphic in the NYT this past Saturday.  The chart is a variant of Tukey's box plot, which summarizes the distribution of a data series, with particular emphasis on its dispersion.

Typically, a box plot contains a five-number summary.  The version used here has three numbers: max, min and the most current cash incentive.
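To make the contrast concrete, here is a minimal sketch of the two summaries. The incentive figures are made up for illustration, and the quartiles use one common interpolation convention rather than Tukey's original hinges, so treat the middle numbers as approximate.

```python
from statistics import quantiles

def five_number_summary(values):
    """Min, lower quartile, median, upper quartile, max.

    Assumption: quartiles use the 'inclusive' interpolation method,
    which can differ slightly from Tukey's hinges.
    """
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

def three_number_summary(values):
    """Min, max and most recent value, as in the chart's variant."""
    return (min(values), max(values), values[-1])

# Hypothetical 12 months of incentives for one brand, oldest first.
monthly_incentives = [1000, 1500, 1200, 800, 2000, 2500,
                      1800, 1600, 900, 1100, 2200, 2400]

five = five_number_summary(monthly_incentives)
three = three_number_summary(monthly_incentives)
print(five)
print(three)
```

The three-number version drops the quartiles and median, so it says nothing about where most months fell inside the range.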

Having the ten box plots side by side is a powerful way to compare different groups of objects.  Even better is the care taken to sort the car brands from largest current incentive to smallest.  The reader can glean many insights at a glance, for instance:

  • Lincoln and Cadillac generally have the best incentives while Lexus and Acura offer much less
  • Mercedes, Saab and BMW have changed their incentive structure the most in terms of the range of incentives
  • February was a good month to buy Saab, Mercedes, Volvo, Infiniti or even Lincoln as the incentive levels for these brands are close to the 12-month maxima
  • It is a particularly great time to get a Mercedes because the current incentive is the highest of the past 12 months, within a very wide range
  • On the other hand, it may not be wise to buy Cadillac or Lexus

Some minor improvements can be made to the chart.  The lines linking the left edges of the boxes to the vertical axes are redundant.

More seriously, the "average incentive" row at the bottom tends to confuse rather than enlighten.  The minimum "average incentive" is the average incentive across the 9 brands in the one month when that average was lowest.  Say that month is August.  Then the minimum = [X1(8) + ... + X9(8)]/9, where (8) means August and Xi is the incentive of brand i.  The reader is asked to compare this number to the minima of each of the other boxes, but this is apples to oranges.  For example, if Lexus offered its minimum incentive in January, then the left end of the Lexus box = X9(1), where (1) means January and X9 is the Lexus incentive.  (Notice that X9(8), not X9(1), was used in the minimum "average" incentive calculation.)

Therefore, the only useful number in the last row is the current month's average incentive across all 9 brands.  This average can easily be eye-balled by looking over the first 9 rows.  The last row should be removed.
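The mismatch can be checked on a small made-up example. The sketch below uses three hypothetical brands over four months (not the chart's data) and compares the minimum of the monthly averages with the minima of the individual brands.

```python
# Hypothetical incentives in dollars; each list runs oldest month first.
incentives = {
    "Brand A": [500, 3000, 2000, 2500],
    "Brand B": [2500, 400, 2200, 2600],
    "Brand C": [2400, 2800, 300, 2700],
}
n_months = 4
n_brands = len(incentives)

# Average across brands within each month (what the "average" row is built on).
monthly_avg = [
    sum(series[m] for series in incentives.values()) / n_brands
    for m in range(n_months)
]

# Left end of the "average incentive" box: the lowest monthly average.
min_of_avg = min(monthly_avg)

# Left end of each brand's box: that brand's own lowest month.
brand_minima = {brand: min(series) for brand, series in incentives.items()}
avg_of_minima = sum(brand_minima.values()) / n_brands

# The one directly comparable number: the current month's average.
current_avg = monthly_avg[-1]

print(min_of_avg, brand_minima, avg_of_minima, current_avg)
```

Each brand hits its low in a different month, so the lowest monthly average sits well above every brand's own minimum. The two kinds of minima are not measuring the same thing.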

A further variant of this chart would be a dot plot.  So instead of using just the max and min, print all 12 data points, perhaps using smaller dots for everything other than the current month.  Such a treatment would, for instance, allow us to judge whether Mercedes had many months of low incentives or just one month of low incentives (causing the box to become so wide).

In summary, this graphic is much more informative and occupies much less space than most newspaper charts, and is totally worthy of this newspaper.

Reference: New York Times, March 18 2006.

PS. Can't let this post appear without a rant... when will Excel include Tukey's box plot as one of the key chart types?


Martin Theus

I agree, a very effective display. Tukey's boxplot has a more or less robust estimate for the dispersion, which can't be said for the min/max ranges (apart from being averaged values ...)

Still an extraordinary chart!

Tommy McCall

I was delighted to see my chart complimented, for the most part, here. I like the dot-plot idea and agree the lines linking the left edge are redundant. I'm not sure I understand the problem with showing the range of average incentives. If the current average was a good number, then why not the other ones?


Tommy, the NYT should be putting your name to graphs like this, the same as they would for the writer of a story or the photographer who provided a picture.

Jon Peltier

Excel won't do it, so I did. The following page links to procedures that will produce decent box and whisker charts, and describes a utility to do it for you (the utility is still in development, but works nicely):


Hi Tommy, great to see you here. Derek is right: it'd be great if the newspaper acknowledged your work, and the time and care needed to produce good charts like this.

On the question of the averages, I could be misinterpreting what you're plotting. What's the data used to plot that "average incentive" row?

