
Happy holidays!

Happy holidays to everyone!  Thanks for your many visits during the past year and I look forward to an exciting 2006.

I'll be traveling for the next two weeks.  There may be dispatches from foreign places but I can't promise.

Peace and happiness to all...

Big data

To me, statistics is about searching for beauty in simplicity.  Much of our discipline is concerned with data reduction, or finding creative representations that consume less space than the raw data.  That's why I have mixed feelings about complicated, multi-dimensional, dynamic, user-controlled, gee-whiz displays of data.  I have no issue with these as works of art but they tend not to enhance our understanding of the data.

I take a marginally relevant example from Google's well-illustrated 2005 Zeitgeist report.

The annotator nailed the key insights from this data, especially the flatness of "surfing" versus the seasonality of "snowboarding".  Something is not right with the week-by-week fluctuations: these represent noise that interferes with our perception of the underlying seasonal trend.  An easy remedy is to "smooth" the data using a moving average, exponential smoothing, etc.  The smoothed data will not contain such jaggedness, making it easier to read.
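For readers who want to try the remedy, here is a minimal sketch of both smoothers in plain Python.  The weekly numbers, the window of 4 and the alpha of 0.3 are made up for illustration; they are not the Zeitgeist data, and this is not necessarily how anyone at Google would do it.

```python
def moving_average(values, window):
    """Trailing moving average. Early points use however many values exist so far."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def exponential_smoothing(values, alpha):
    """Simple exponential smoothing: each point blends the new value with the previous smoothed value."""
    smoothed = [values[0]]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


def roughness(values):
    """Sum of absolute week-to-week changes, a crude measure of jaggedness."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))


# Hypothetical weekly search volumes (made up, not from the report).
weekly = [10, 14, 9, 15, 11, 16, 12, 18]

ma = moving_average(weekly, window=4)
es = exponential_smoothing(weekly, alpha=0.3)

print(roughness(weekly), roughness(ma), roughness(es))
```

A wider window (or a smaller alpha) smooths more aggressively, at the cost of lagging behind genuine turns in the seasonal pattern.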


A bit of USA Today crept into the New York Times this weekend, as evidenced by this gift-wrapped chart shown on the right.

Because the categories (more, same, less) have a logical ordering and the number of categories is small, I prefer to show this data in its natural order.  Here the data was shown in order of decreasing frequency.

How this data supports the statement about action and intention implicit in the article's title shall also remain a mystery.

Reference: "Yes, We Shop, but We Don't Really Want To", New York Times, Dec 18, 2005.

Typepad problems

I just realized that my posts this week had disappeared.  Typepad was having problems.  I have now restored the text but the images are currently unavailable and will apparently be restored soon.  Thanks for your patience.

Update (12/17, 11 am): Feel free to browse the archives.  The only images not available are those from the current week.

Where bubbles lead

This chart reminds us, yet again, of the issue with bubble charts.  I have deliberately blocked out some of the data.  If I blocked everything out, there would be no reference point to estimate the size of any bubble.  Even with the unblocked data, it is not easy to estimate the blocked data.

Take a guess before you click to reveal the answer.

Reference: New York Times Magazine, Dec 11, 2005.

The racetrack graph

A reader, Nick, sent in this fantastic example of chartjunk.  It is an absolute classic.  He points out - as have we here - that a simple table works better.

We shall call this genre "racetrack graphs".  If Italy and Japan both traversed 20 metres in distance, it'd appear as if Italy's curve was longer than Japan's because Italy has the inside track!  This bias is well known to anyone who has run track.

The way this chart is constructed is to have length of curve proportional to angle, not distance (see below left); however, our eyes are naturally drawn to the circumference!  When we see Japan's green line, we perceive the length of the line, not its angle.  In particular, notice that the large separation between Canada and Japan represents a 19% difference, which should be double that between Italy and the vertical line (below right).
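The geometry behind the distortion can be sketched in a few lines.  Arc length is radius times angle (in radians), so the same angle drawn on an outer lane produces a longer line.  The radii below are hypothetical; I have not measured the lanes in the original chart.

```python
import math


def arc_length(radius, angle_degrees):
    """Length of a circular arc: radius times the angle in radians."""
    return radius * math.radians(angle_degrees)


def angle_for_length(radius, length):
    """Angle in degrees needed to cover a given arc length on a given lane."""
    return math.degrees(length / radius)


# Hypothetical lane radii: an inside lane and an outside lane.
inner_radius = 1.0
outer_radius = 2.0

# Same data value (same angle), very different visual length.
inner = arc_length(inner_radius, 90)
outer = arc_length(outer_radius, 90)
print(outer / inner)

# Same distance run, very different angle swept: the inside lane looks "further along".
print(angle_for_length(inner_radius, 1.0), angle_for_length(outer_radius, 1.0))
```

With these made-up radii the outer arc is twice as long as the inner one even though both encode the same value, which is exactly the comparison our eyes get wrong.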


Nick has a few other examples at his site.  Well worth a look!

And thanks to Jef (see comments) for pointing out my subtraction error.  I will correct the graphic later. (12/14/05, 10 am)  Graphic corrected. (9 pm)

Does run defense win Super Bowls?

Rush defense is a strong barometer of championship potential. More than three-quarters of Super Bowl teams had a top-10 rush defense, according to the Elias Sports Bureau. Nine of the past 14 participants were in the top five.

A Times reporter thus fell victim to the seductive power of data in analyzing the New York Giants' chance of getting to the Super Bowl of American football.  In the past seven years, 64% (9/14) of Super Bowl finalists ranked among the top 5 in run defense; since the Giants currently rank 6th, their chance of getting to the Super Bowl must be about 64% (+/- statistical error).  Thus, run defense "is" (not "can be" or "was" or "has been") a strong barometer or predictor.

Sadly, Giants fans, I bear bad news.  The reporter has put the cart before the horse.  He asked the right question (are the Giants good enough?) but used the wrong data.  Consider the following two questions:

  • Of those teams that made it to the Super Bowl, how many were ranked top 6 in run defense?
  • Of those teams that were ranked top 6 in run defense, how many made it to the Super Bowl?

Since the Giants are currently ranked 6th, I'm using top 6, which is more appropriate than top 5 or top 10.  These two questions have different answers, as the following pair of histograms shows.

What is our data and what is our prediction?  We want to predict the Giants' chance of becoming a Super Bowl team based on our knowledge of their run defense rank.  So putting the horse before the cart, we should use the second chart rather than the first.  In other words, given that their run defense ranks #6, they have about a 24% chance of getting into the Super Bowl (about one third of the reported estimate)!

To appreciate the difference, one has to realize that in those seven years, 32 teams that were ranked top 6 in run defense did not reach the Super Bowl (as opposed to 10 teams which did).

The reporter's assertion, however, may still hold.  A team has a 6% chance (2/32) of getting into the Super Bowl, completely at random.  Even assuming that one team in each division is a bottom-dweller with no chance at all, the remaining teams still only have an 8% chance (2/24) of getting there.  Thus, knowing that run defense is in the top 6 has tripled our estimate of the chance of getting to the Super Bowl.
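The arithmetic can be checked with a short sketch that uses only the counts quoted in this post (10 top-6 teams that reached the Super Bowl, 32 that did not, 14 Super Bowl slots over seven years).  The variable names are mine.

```python
from fractions import Fraction

# Counts quoted in the post, covering seven seasons.
top6_reached = 10        # top-6 run defenses that made the Super Bowl
top6_missed = 32         # top-6 run defenses that did not
super_bowl_slots = 14    # 2 finalists per year x 7 years

# Question 1: of the Super Bowl teams, what share were top 6 in run defense?
p_top6_given_sb = Fraction(top6_reached, super_bowl_slots)

# Question 2: of the top-6 run defenses, what share made the Super Bowl?
p_sb_given_top6 = Fraction(top6_reached, top6_reached + top6_missed)

# Baselines from the post: pure chance, and chance after excluding 8 bottom-dwellers.
base_rate = Fraction(2, 32)
base_rate_excluding_bottom = Fraction(2, 24)

lift = p_sb_given_top6 / base_rate_excluding_bottom

print(float(p_top6_given_sb))   # the reporter's kind of number
print(float(p_sb_given_top6))   # the number the Giants actually care about
print(float(lift))              # how much knowing "top 6" improves on the baseline
```

The two conditional shares share a numerator but have very different denominators, which is the whole point of the two questions.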

Finally, the data came from this table, which does little to help our comprehension.

Reference: "Giants' Defense Jells, Then Hardens", New York Times, Dec 11, 2005.

For formula crunchers, the above is Bayes' rule in practice: P(SB | 1<=RD<=6) = P(1<=RD<=6 | SB) * P(SB) / P(1<=RD<=6).  But P(SB) = 2/n while P(1<=RD<=6) = 6/n, where n = total number of NFL teams.  So the correct estimate is 1/3 of the reported estimate.  I think the histograms above demonstrate the intuition a lot better.
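As a sanity check, here is the same rule written out with exact fractions.  I am assuming n = 32 teams (the figure used earlier in this post) and taking the top-6 share of Super Bowl teams as 10/14 from the counts above; n cancels out in any case.

```python
from fractions import Fraction

n_teams = 32  # assumed league size, as used earlier in the post

p_sb = Fraction(2, n_teams)        # P(SB): two Super Bowl slots per season
p_rd = Fraction(6, n_teams)        # P(1<=RD<=6): six top-6 slots per season
p_rd_given_sb = Fraction(10, 14)   # share of Super Bowl teams with a top-6 run defense

# Bayes' rule: flip the conditioning.
p_sb_given_rd = p_rd_given_sb * p_sb / p_rd

print(p_sb / p_rd)            # the 1/3 correction factor
print(float(p_sb_given_rd))   # matches the direct count of 10 out of 42
```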

Light entertainment

A reader sent in this precious piece of pop-up pop art from lowermybills.com, an omnipresent advertiser on the Web.  Talk about mixed metaphors: astronomy and real estate and mortgage rates and planet Earth and reflecting pool.


What caught our eye are the line segments joining the houses.  At Junk Charts, we are baffled by the association of mortgage rates with houses, and in particular, the juxtaposition of dipping lines and rising mortgage rates.  While we have often praised line charts, they too can be abused.

(Thanks to Scott.  Visit his website.)

PS. Please visit the Junk Charts holiday wish list.  If you have enjoyed the postings here, and would like to express your support, I'd be much gratified.

Statistical propaganda

Andrew Gelman has a great post on a piece of government propaganda.  Be sure to take a look.  His commentary makes much sense.

I'd add that the jobs number refers to "job creation" not outstanding jobs, which explains why it is in the thousands.  Propaganda manifests itself in both what it includes and what it omits.  While this chart includes a comparison of monthly unemployment rates against an average rate going back to 1960, it does not compare the job creation number to the level required to keep up with population growth.

Financial myopia

The following table/chart in the Times contains a wealth of information, much of it hard to get at.


From this, the author concludes that "superior performance of ... a deep-pocketed fund is more the rule than the exception", in other words, that large funds that focus on small cap stocks tend to have better performance than smaller funds.

The proximity of the fund size and stock return data immediately calls for scatter plots.  Below, we discover the author's myopia as the purported relationship between fund size and return only applies to one-year returns.  The further back we look, the less plausible is this conclusion.  In the right chart, I annualized the 3- and 5-year returns, which helps show this point.
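For anyone repeating the exercise, annualizing can be sketched as below.  I am assuming the multi-year figures are cumulative returns and that compounding is geometric; the sample figure is made up and is not from the Times table.

```python
def annualize(cumulative_return, years):
    """Convert a cumulative return over `years` into an equivalent compound annual return."""
    return (1 + cumulative_return) ** (1 / years) - 1


# Hypothetical fund: up 33.1% in total over three years.
three_year_total = 0.331
print(annualize(three_year_total, 3))
```

Putting the 1-, 3- and 5-year figures on the same annual scale is what makes them comparable on a single axis.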


The left chart contains connected dots, which is not usually done for scatter plots but I find that the lines help me judge whether a positively sloped relationship exists or not.

In addition, whenever we talk of asset returns, we must also consider risk.  Here, I use a proxy for risk by computing the range of annualized returns.  In the right chart from above, we can roughly see that the separation between the dots is smaller for small funds and larger for large funds.  The left chart below shows this relationship more clearly.  It appears that the large funds may have gotten bigger returns by taking higher risk.
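The risk proxy is nothing more than the spread between the best and worst annualized figure for each fund.  A minimal sketch follows, with two invented funds standing in for the real table; the range is a crude stand-in for volatility, not a standard risk measure.

```python
def return_range(annualized_returns):
    """Range of a fund's annualized returns: highest minus lowest."""
    return max(annualized_returns) - min(annualized_returns)


# Hypothetical annualized 1-, 3- and 5-year returns (made up for illustration).
funds = {
    "Hypothetical small fund": [0.08, 0.07, 0.09],
    "Hypothetical large fund": [0.20, 0.10, 0.04],
}

risk = {name: return_range(returns) for name, returns in funds.items()}

for name, spread in risk.items():
    print(name, round(spread, 4))
```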

Finally, a typical risk-return chart (above right) seems to say that higher risk has not uniformly been rewarded with higher annualized returns over the past five years.

Reference: "Big Doesn't Always Mean Bad for Some Mutual Funds", New York Times, Dec 4, 2005.