The nature of variation 1

May 14, 2006

I refer readers to Andrew's comments on a graph purporting to demonstrate the existence of a month-of-year selection bias in the NHL, cited on the Freakonomics blog as an example of "overwhelming" evidence of such effects in sports.  (The original graph may have come from here.)

In particular, note the Professor's point #4.  It is always necessary to ask whether perceived "trends" are real before attempting to explain them.  What Andrew computed can be interpreted to mean that approximately 30% of the time, we expect to see percentages larger than 9% or smaller than 7%.  Thus, out of 12 months, we'd expect to see about 12 x 0.3 = 3.6 months with those "extreme" values (even if players were randomly picked from the population so that their birthdays would have been evenly spread out).  The NHL line contains 4 such values, so while there is some evidence of bias, it is certainly not "overwhelming" as Freakonomics suggested.
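As a rough check on that arithmetic, here is a minimal sketch of the calculation.  It is my own reconstruction, not necessarily the way Andrew did it: it treats each month's share as a binomial proportion, uses a normal approximation, and rounds the mean to 8% as the text above does.  The function names are made up for illustration.

```python
import math

def monthly_share_se(n_players, p=0.08):
    """Standard error of one month's share, treating it as a binomial proportion."""
    return math.sqrt(p * (1 - p) / n_players)

def prob_outside(lo, hi, n_players, p=0.08):
    """Normal-approximation chance that a month's share falls outside [lo, hi]."""
    se = monthly_share_se(n_players, p)

    def cdf(x):
        # Normal CDF with mean p and standard deviation se
        return 0.5 * (1 + math.erf((x - p) / (se * math.sqrt(2))))

    return cdf(lo) + (1 - cdf(hi))

# 761 players comes from the chart; p = 0.08 is the rounded mean (assumption).
se = monthly_share_se(761)
p_extreme = prob_outside(0.07, 0.09, 761)
expected_extreme_months = 12 * p_extreme

print(f"standard error of a monthly share: {100 * se:.2f} percentage points")
print(f"chance a month is below 7% or above 9%: {p_extreme:.2f}")
print(f"expected number of such months out of 12: {expected_extreme_months:.1f}")
```

Using the exact mean of 1/12 instead of the rounded 8% (pass `p=1/12`) shifts the figures a little, but the conclusion is the same: several "extreme" months are expected by chance alone.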

The chart itself is, sadly, misleading by its very choice of comparing NHL players to the populations of Canada and the USA.  To cite the original website, the key message of this chart was:

The 761 NHL players show a distinctly different pattern than that for Canada or the United States with the highest percentage of births in January and February and the lowest in September and November.

This "pattern" is the larger observed dispersion of NHL monthly percentages around the mean percentage of 8%, as compared to Canada or the USA.  In other words, the NHL line fluctuates more wildly.

Too bad there is a statistical law that guarantees this "pattern": the law says that in looking at sample averages, the larger the sample size, the smaller the dispersion (it shrinks roughly in proportion to one over the square root of the sample size).  (This is why Andrew used the sample size 761/12 in his calculation.)  Because the Canada and USA lines represent averages of millions of people while the NHL line represents only 761 people, it is absolutely no surprise to find the NHL line fluctuating more wildly!
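To put numbers on that law, here is a small sketch comparing the expected wobble of a monthly percentage at different sample sizes.  Only the 761 comes from the chart; the two larger sample sizes are round numbers I picked for illustration, not actual population counts, and the binomial model with equal months is an assumption.

```python
import math

def share_se(n, p=1/12):
    """Standard error of a monthly share when n people are sampled (binomial model)."""
    return math.sqrt(p * (1 - p) / n)

# 761 is the NHL sample; the other two sizes are hypothetical round numbers.
sample_sizes = (761, 100_000, 10_000_000)

for n in sample_sizes:
    print(f"n = {n:>10,}: monthly share wobbles by about "
          f"{100 * share_se(n):.3f} percentage points")

# How many times noisier the 761-player line is than a 10-million-person line
ratio = share_se(761) / share_se(10_000_000)
print(f"ratio of dispersions: about {ratio:.0f} to 1")
```

On these assumptions, a line built from millions of births should look nearly flat next to one built from 761 players, whether or not any selection bias exists.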

Thus, the comparison is not valid.  It'd have been more useful to have drawn the NHL line for various historical periods.  If all the lines show a downward slope, then it would be time to examine why this is occurring.

To further fix ideas, look at the following set of lines.  Each line represents an alternative universe in which 761 people were randomly selected to be NHL players from the US and Canadian populations.  While in theory the line connecting monthly percentages should be flat (at 1/12, or about 8%, i.e. the green lines below), in reality, because of random selection, the lines fluctuate quite a bit.
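For readers who want to generate their own alternative universes, here is a minimal simulation sketch.  It assumes every month is equally likely (ignoring unequal month lengths and any real seasonality in births), and the function name, seed, and number of universes are arbitrary choices of mine rather than the settings used for the lines shown here.

```python
import random
from collections import Counter

def simulate_universe(n_players=761, rng=None):
    """Draw n_players birth months uniformly at random; return the 12 monthly percentages."""
    if rng is None:
        rng = random.Random()
    counts = Counter(rng.randrange(12) for _ in range(n_players))
    return [100 * counts[m] / n_players for m in range(12)]

rng = random.Random(2006)  # fixed seed so the run is repeatable (arbitrary value)
universes = [simulate_universe(rng=rng) for _ in range(8)]

for i, shares in enumerate(universes, start=1):
    spread = max(shares) - min(shares)
    line = " ".join(f"{s:4.1f}" for s in shares)
    print(f"universe {i}: {line}   (max - min = {spread:.1f} points)")
```

Each printed row is one line of the chart, January through December.  None of them should come out flat at 8.3%, even though no bias was built in.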

While the amount of dispersion is not "overwhelming", perhaps the observed trend of decreasing percentage with increasing month is unusual enough to warrant further study.  I'll take a closer look next time.

References: Andrew Gelman's blog, Freakonomics blog, Freakonomics NYT column