Bell Curves: Not on charts please

The bell curve has become such a fixture in both research and everyday life that it is often overused and misused.  I will save that topic for another day; here, I want to suggest that bell curves should rarely be shown in a chart, and that no chart should ever show more than one.

I thank the Truck and Barter blog for bringing to my attention the paired bell curves in the World Bank Malnutrition Report.  Professor Gelman's comments started me thinking about this.

There is serious distortion in this presentation.  If you recall your first stats class, it is the tail probabilities, not the height of the curves, that matter.  Unfortunately, curves like these tend to draw our attention to the heights.

You might also recall that the mean and the dispersion together completely define any normal distribution curve, so the only salient features of each curve are its "center" (in this case, 0 vs -1.8) and its "width" (here, 6 units vs 8 units).  Sadly, while the labels are numerous, they do not point out these salient features.
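
To make the tail-probability point concrete, here is a minimal Python sketch that computes the area to the left of a cutoff for two normal curves.  The centers (0 and -1.8) come from the chart; the standard deviations are my assumption, obtained by reading the 6-unit and 8-unit widths as spanning three standard deviations on either side of the center, and the cutoff of -3 is chosen only for illustration.  The function name tail_below is hypothetical.

```python
from statistics import NormalDist  # standard library, Python 3.8+


def tail_below(mean, sd, cutoff):
    """Area under a normal curve to the left of `cutoff`."""
    return NormalDist(mu=mean, sigma=sd).cdf(cutoff)


# Centers are taken from the chart; the standard deviations are ASSUMED
# by treating each stated width as six standard deviations.
reference_sd = 6 / 6
india_sd = 8 / 6

cutoff = -3.0  # illustrative cutoff, in the units of the horizontal axis

reference_tail = tail_below(0.0, reference_sd, cutoff)
india_tail = tail_below(-1.8, india_sd, cutoff)

print(f"reference tail: {reference_tail:.5f}")
print(f"india tail:     {india_tail:.5f}")
print(f"ratio:          {india_tail / reference_tail:.1f}")
```

The comparison that matters is between the two tail areas, which depend only on the center and the spread; nothing in the calculation uses the height of either curve.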

I wouldn't be so insistent were it not for the fact that Tukey had long ago invented a far superior way to show distributions.  Here is a boxplot style representation of the same information:

Redoindiaweight

The chart is not quite to scale, and the vertical axis dimension is missing.  Plus the length of the box is not the usual interquartile range.  But the center and width of each curve are clearly shown, and their relative sizes easily read off the chart.
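
For comparison, a box drawn with the usual interquartile range could be computed directly from the mean and standard deviation, since those two numbers fix the quartiles of a normal curve.  The sketch below is illustrative only: box_summary is a hypothetical helper, and the standard deviations are assumptions (the 6-unit and 8-unit widths read as six standard deviations each), not figures from the report.

```python
from statistics import NormalDist  # standard library, Python 3.8+


def box_summary(mean, sd):
    """Quartiles and interquartile range of a normal distribution."""
    dist = NormalDist(mu=mean, sigma=sd)
    q1, median, q3 = (dist.inv_cdf(p) for p in (0.25, 0.5, 0.75))
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}


# Centers come from the chart; standard deviations are ASSUMED.
reference_box = box_summary(0.0, 6 / 6)
india_box = box_summary(-1.8, 8 / 6)

for name, box in (("reference", reference_box), ("india", india_box)):
    print(name, {key: round(value, 2) for key, value in box.items()})
```

The box edges are the 25th and 75th percentiles, so the middle half of each distribution lies inside its box.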

Comments

Sean Devine

Great post. I sat through a presentation from Booz Allen last week on the globalization of the service sector of the world economy that had a number of charts that I'd love to see you comment on. Should I send you the charts?

Zuil Serip

Could the scale be "# of standard deviations away from the mean of the international reference"? I agree that whatever it is, it should be indicated explicitly.

While I also agree that a boxplot (or one of its many variants) can convey the same information more economically and precisely, you really have to take the intended audience into account.

I would guess that a substantial number of intelligent and otherwise well-educated readers are not familiar with the conventions of a boxplot. And even if they were, people don't have well-formed intuitions about the relative magnitude of standard deviations unless they work with such concepts on a day-to-day basis. I believe non-statisticians would be much better able to understand and compare the distributions by looking at the actual curves. (I do agree that bell curves are cliches, but they can be very useful in allowing non-specialists to understand distributions.)

John S.

A couple of weeks ago I was making a presentation and wanted to show the effect of a certain rule change on the price of a commodity. The before and after data were both normal, and I made a box plot, including the whiskers and the outliers. It was a beautiful plot, and clearly showed how this rule change had not only lowered prices, but reduced the variance as well. Nevertheless, I was told to take it out of the presentation because "our clients would not understand it".

Patrick O'Shei

While you make some good comments about the misuse and abuse of bell curves, you are wrong when you say the heights do not matter.

Bell curves represent probability through the area under the curve. To compare area you need height.
The markings on the x-axis are std. dev. units for the INTL REF.

If you want to compare the probability of an Indian child being underweight to the international reference, you would compare how much area is under the blue (INTL) line to the comparable area under the red (INDIA) line. For SEVERE underweight, you can see the area under the blue line in this region (left of -3.0) is minuscule while the area under the red line is about 40x as large. An Indian child is about 40x more likely to be severely underweight than the INTL reference.

I would not expect the average person to be able to properly interpret the curves and that is a problem.

Kaiser

Patrick, that is precisely my point, which is that it is impossible for even trained people to visually compare areas under two curves with different widths and heights.

In addition to my original comments, another reason why curves do not solve the problem is that these curves stretch to infinity on both tails.

The reasons why a boxplot suffices are that (1) we are assuming normal distributions thereby fixing the areas under the curve as a function of standard deviation from the mean; and (2) we are assuming a statistically literate audience.

This also explains why John had trouble using the boxplot with a lay audience. I'd try to annotate the chart by pointing out that the middle 50% of the data is contained in the box (assuming the sides are the 25th and 75th percentiles).

Why on earth would you assume a statistically literate audience? Presumably the authors of this report on malnutrition wanted to reach a broad audience, not the 0.2% of the population who can read a boxplot.

Kaiser

Anon - the glib response is: most college educated people whether their degree was in engineering, economics, or psychology would have at least taken one statistics course, which qualifies them as statistically literate.

A more serious response is: as I outlined, for the uninitiated, a boxplot with some text explaining how to read it is sufficient.

I also do not believe overlapping probability curves can be properly interpreted by statistical "illiterates" either.
