## Wannabell

##### Jul 25, 2008

This graph, called the Bell Curve, is a wannabe.

The first hint is its asymmetry, the right tail being longer than the left tail.

Further, the helpful labelling of the "average" does not coincide with the peak of the curve.

The author of the annotation seemed to understand, calling the distribution "skewed".  A Bell Curve is not skewed.

This is a pity because the designer might have selected a different chart type if she weren't so enamored of the bell curve object.

The data tells us about users of 30-day unlimited passes in the New York City subway system: how many trips do they typically make?  The card costs \$81 while each trip costs \$2 so anyone taking fewer than 40 trips in those 30 days would have been better off buying individual tickets.  The "average" user took 56 trips.  The range of trips taken was very wide, perhaps surprisingly so.
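The break-even arithmetic can be sketched in a few lines. This is only an illustration using the two prices quoted above (\$81 for the pass, \$2 per trip); the function names are made up for this sketch, and it assumes no bulk discount on single fares. Strictly, a rider who takes exactly 40 trips pays \$80 in single fares, still a dollar less than the pass, so the pass only comes out ahead from the 41st trip.

```python
import math

PASS_PRICE = 81.0  # 30-day unlimited pass, as quoted in the post
FARE = 2.0         # single trip, as quoted in the post


def break_even_trips(pass_price: float, fare: float) -> int:
    """Smallest number of trips at which the pass costs no more than single fares."""
    return math.ceil(pass_price / fare)


def pass_savings(trips: int, pass_price: float = PASS_PRICE, fare: float = FARE) -> float:
    """Dollars saved by holding the pass; negative means single fares were cheaper."""
    return trips * fare - pass_price


if __name__ == "__main__":
    print("break-even trips:", break_even_trips(PASS_PRICE, FARE))
    for trips in (40, 41, 56, 100):
        print(trips, "trips -> savings of", pass_savings(trips))
```

By this reckoning the "average" user at 56 trips is \$31 ahead of paying per ride.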

Several key pieces of information have been left off the chart.  What is the total number of riders?  Without this, there is no way for readers to understand 15,185.  What is the smallest (and largest) number of trips taken by any rider?  Visually, it appears that the horizontal axis does not start at zero.

It would have been better to show a cumulative distribution with percentages of riders on the vertical axis.  On such a chart, we can read off the median and any percentiles.  In other words, it would be much more informative.
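Here is a rough sketch of how such a cumulative chart could be built from binned rider counts and then read for the median or any percentile. The bin edges and counts are invented for illustration and are NOT the MetroCard data; the function names are likewise hypothetical.

```python
from itertools import accumulate


def cumulative_percent(counts):
    """Running percentage of riders at or below each bin's upper edge."""
    total = sum(counts)
    return [100.0 * running / total for running in accumulate(counts)]


def percentile_bin(bin_upper_edges, counts, pct):
    """Upper edge of the first bin where the cumulative share reaches pct percent."""
    for edge, cum in zip(bin_upper_edges, cumulative_percent(counts)):
        if cum >= pct:
            return edge
    return bin_upper_edges[-1]


if __name__ == "__main__":
    # Made-up example: trips binned in steps of 20, riders per bin.
    edges = [20, 40, 60, 80, 100, 120]
    counts = [50, 150, 400, 250, 100, 50]

    for edge, cum in zip(edges, cumulative_percent(counts)):
        print(f"up to {edge:3d} trips: {cum:5.1f}% of riders")
    print("median falls in the bin ending at", percentile_bin(edges, counts, 50))
```

With real data, the value plotted at the 40-trip edge would answer directly what share of pass holders failed to break even.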

As it stands, I like very much the annotation of the 56 trip and the 100 trip points: they are great aids to help decipher the chart.  It would be great to indicate the 40 trip point too.

For those more technically inclined: the graph also begs the question of whether it is an actual or modelled curve.  It looks too smooth to be actual data.  If it is a model, then it is definitely not a normal distribution.  What could it be?  A spline?

Reference: "In Decade of Unlimited Rides, MetroCard Has Transformed How the City Travels", New York Times, July 16 2008.


Variability in the numbers will be low due to the high counts, so it wouldn't have needed much, if any, smoothing. Even in the tails the counts are probably about 400, so the 95% CIs are about 10%, hardly noticeable, and the graphic artist probably just smoothed it out.
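The "about 10%" figure can be checked with a quick calculation, assuming each bin count behaves like a Poisson count (so its standard deviation is the square root of the count). That Poisson assumption and the function name are mine, not something stated in the comment; the 400 is the commenter's guess at the tail counts.

```python
import math


def relative_ci_halfwidth(count: float, z: float = 1.96) -> float:
    """Approximate 95% CI half-width as a fraction of the count, assuming Poisson noise."""
    return z * math.sqrt(count) / count


if __name__ == "__main__":
    for count in (400, 15185):
        print(f"count {count}: +/- {100 * relative_ci_halfwidth(count):.1f}%")
```

For a count of 400 this gives 1.96 × 20 / 400, or just under 10%. If the 15,185 labelled on the chart is a rider count, the same formula gives under 2%, which is consistent with a curve that looks smooth near the peak.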

Unless there is another option many people may feel it is worthwhile to buy a ticket for less than 40 trips simply to be able to buy only one ticket. I've done a similar thing with day tickets in non-English speaking countries to avoid the problems due to not understanding currency. Others may have a ticket supplied by an employer and there are always sudden holidays and illness that prevent fully using a ticket.

It looks to me like an exponential drop off to the left side of the break-even 40-trip mark. Then the right hand side looks like it could be fit by a piece of a Gaussian.

To me, this says that people are pretty good at deciding whether or not to buy the 30-day card. Although, I think some credit should go to NYC in selling the right product (30-days = about 20 work days = 40 one-way trips to and from work) for a price that is easily computed in your head (\$81/\$2 = about 40).

In addition to the 40-trip mark noted on the graph, a better x-axis, and a right-hand vertical axis which has been normalized, I'd like to know the total number of riders below the 40-trip mark and the total above.

As for the technical note, I think the curve is real data. There are about a million data points in the graph, so I can believe that it ends up being fairly smooth.

looking at the long smooth stretches interspersed with tiny jagged sections, the distribution of these jagged sections, combined with the smoothness of the tails, it looks like something someone drew in photoshop...but i guess i could be wrong.

it makes its point, though.

It should have been a cumulative curve of percentage against trips. The median, quartiles, and percentiles could be read off by anyone who cared to, or alternatively the percentage of users who took up to 40, up to 100 trips etc. By reading the percentage scale backwards from 100% the reader could say who took more than 40, more than 100 etc. Who's really interested in who took exactly 40 trips, no more and no less, or exactly 100?

The mean of 56 trips could be marked, as it would not be trivial to read it by inspection alone, and the total number of users represented by the "100%" would be given in nearby explanatory text. Multiplying the read-off percentages by the total number would give anyone inspecting the graph the absolute number of users taking more than 40 trips and so on.

Oops, I just wrote exactly what Kaiser already did in the article! So much for my reading comprehension.

It looks a bit like an extreme-value distribution (EVD), which is a distribution that looks like Normal, but with an asymmetrically "fat" tail on one side or the other. Given what the numbers represent (the total/max number of trips taken on a given card), an EVD might make perfect sense.

Lognormal curve...logarithms of data are normally distributed...

Standard form of normal distribution when range is from zero to infinity, instead of negative to positive infinity.
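If the lognormal suggestion were right, the gap between the peak and the labelled "average" would follow directly: for a lognormal with log-scale parameters mu and sigma, the mode is exp(mu - sigma²), the median is exp(mu), and the mean is exp(mu + sigma²/2), so the mean always sits to the right of the peak. A small sketch, with parameter values chosen arbitrarily for illustration and not fitted to the chart:

```python
import math


def lognormal_summary(mu: float, sigma: float) -> dict:
    """Mode, median and mean of a lognormal with log-scale parameters mu and sigma."""
    return {
        "mode": math.exp(mu - sigma ** 2),
        "median": math.exp(mu),
        "mean": math.exp(mu + sigma ** 2 / 2),
    }


if __name__ == "__main__":
    # Hypothetical parameters, NOT estimated from the MetroCard data.
    summary = lognormal_summary(mu=4.0, sigma=0.5)
    for name, value in summary.items():
        print(f"{name}: {value:.1f} trips")
```

Whatever the parameters, the ordering mode < median < mean holds, which matches the right-skewed shape described in the post.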

We can all speculate on what the distribution of the data is relative to standard algorithmic distributions. But the only thing we will agree on without data is that it is not a normal distribution.

In any case, I agree with the folks who say that a cumulative distribution would be much more useful in this case.

Cheers,
T

Folks, cut the title some slack. It says "bell curve", not "normal distribution", and I think that term is perfectly appropriate for informal usage. (I mean, it does look like a bell!) The difference between a true normal distribution and this curve is mildly interesting, but does not affect the point of the story.

For the people advocating a cumulative distribution chart: I think you'd find that only about 1% of the readers of that article would understand that diagram. A cumulative distribution graph is highly unfamiliar to most people, and if you disagree, I challenge you to find even one example of such a chart in a mainstream press article.

That would be a very odd sounding bell indeed. I think that it would have been more informative to at least include a line showing the median. On a more technical note, I would wager that this could actually be approximated well as a mixture of two or three Poisson distributions.

I agree with the EVD comment. It looks like a classic Fisher-Tippett distribution.
