
Illusory disparity

The WSJ published a chart with the cheeky title of "Rich Get Richer" (reminiscent of the Economist).  The underlying data concerned one-, three- and ten-year returns for the buyout fund category.  For each return class, the overall mean and the means for the top and bottom 25% funds were depicted.

I won't go into the relevance of the title as I simply could not figure out how it connected with the data.  The following shows the original chart side by side with the junkart version.


Improvements include:

  • Lines show the comparisons with a minimum of fuss compared with colored bars
  • The overall mean return is placed in the middle of each line segment where it belongs, instead of being the first column
  • The axis label, "annualized return", tells readers what the performance measure is
  • Adding the word "funds" to "top quartile" and "bottom quartile" removes the possible confusion that those represent individual returns of the funds ranked at 25th and 75th percentiles, rather than the average returns of the bottom 25% and top 25% of funds
  • The linear construct paints the correct picture that individual fund returns fall into a continuum
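
To see why the word "funds" matters, here is a small Python sketch. The eight returns are made up for illustration and are not the WSJ data; the point is only that the average return of the top 25% of funds differs from the return of the fund sitting at the 75th percentile.

```python
from statistics import mean, quantiles

# Hypothetical annualized returns (%) for eight funds; NOT the WSJ data.
returns = [-5.0, 0.0, 4.0, 8.0, 12.0, 16.0, 24.0, 40.0]

ranked = sorted(returns)
k = len(ranked) // 4  # number of funds in each quartile group

# What the chart plots: average returns of the worst and best 25% of funds.
bottom_quartile_mean = mean(ranked[:k])
top_quartile_mean = mean(ranked[-k:])
overall_mean = mean(ranked)

# What a reader might assume instead: the 25th and 75th percentile cutoffs.
q1, _median, q3 = quantiles(ranked, n=4)

print("mean of bottom 25% of funds:", bottom_quartile_mean, "vs 25th percentile:", q1)
print("mean of top 25% of funds:", top_quartile_mean, "vs 75th percentile:", q3)
print("overall mean:", overall_mean)
```

With these invented numbers the group averages sit outside the percentile cutoffs, which is exactly the confusion the relabeling avoids.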

(Thanks to my students for some of these points.)

Reference: Wall Street Journal, Mar 3-4 2007.

Dot com bubbles

Thanks to Dustin J for the pointer as well as the title of this post.  Dotcom bubbles is the most appropriate name for this overblown chart (featured as the "chart of the day" here).

The chart has no title or axis labels so only the diligent reader will figure out that the data consist of acquisition value of several high-profile Internet companies in the past three years.

There is less data than it seems.  Both the heights and the areas of the bubbles indicate the same thing, the deal values.  If we are supposed to see a trend, we are not finding it.

Most of these deals are not directly comparable anyway.  Webex and Ironport are infrastructure type companies with real business models.  Skype is a phone service.  Ask Jeeves is not a leader in its own space. Myspace and YouTube are traffic sites.

Reference: "Chart of the Day: Web deals", Valleywag, Mar 15 2007.

March mildness

The Times published this great graphic to show that 2007 has been an upset-starved year in the recent history of the NCAA Basketball tournament, which is ongoing.

Each box contains the number of upsets in a given year for a given pairing, e.g. in 1998, there was one case of a 9-seed beating an 8-seed.  An upset is defined as a lower seed beating a higher seed, although the editorial comment argued that 9 beating 8 is "rarely considered an upset".

The rightmost column (which sums across a row) tells us that the number of upsets fluctuates wildly between the years, ranging from 3 to 13.  (That's why people bet on NCAA pools.)

A couple of improvements will make this chart even more effective:

  • Include a row showing the average number of upsets for each pairing;
  • Include a column of zeroes for 16-1 pairings.

This second point cannot be emphasized enough.  The fact that no 1-seed has ever lost to a 16-seed should not be relegated to a footnote.  Think of it this way: if the results for 15-2 and 16-1 were reversed so that no 15-seed had ever beaten a 2-seed but one 1-seed had lost to a 16-seed, nobody would omit the 15-2 column!

In his book Visual Explanations, Tufte discussed the Challenger disaster at considerable length.  A key lesson was that non-events (things not happening) contain important information, and should never be dropped from an analysis without unassailable logic.

The mildly improved chart would look like this.

What then to make of the comment that "9 beating 8 is rarely an upset"?  For one thing, 9-8 upsets happen about as frequently as 10-7 upsets, so if the comment refers to the surprise factor, then even 10-7 upsets should be excluded.

But the comment also underlines a deeper issue, which is hindsight.  Obviously, the seeding committee felt, and predicted, that the 8 seed would beat the 9 seed.  It was only after the fact that we found out 9 had beaten 8.  Instead of denying the 9-8 upset, would it make more sense to ask if there was a seeding error?

Reference: "March Mildness", New York Times, March 17, 2007, p.D2.

Picking up the right file

The Institutional Investor advises its readers:

Going public may just be the most important -- and nerve racking -- decision any company will make.  Managing and pricing an IPO is tricky, so picking the right underwriter is crucial.  Bankers often boast of their league table prowess to win mandates, but quantity does not necessarily mean quality.

By quantity, they meant the amount of underwriting fees (revenues) earned; and by quality, the average stock performance of the newly-public companies, as of Feb 16, 2007.

Ten banks were compared on the two Qs using this chart, which is best described as the "file folder chart".


Amusingly, its creator sized the height of each file according to the quality metric, which is the return % listed at the top right corner of each file.  The files were sorted by decreasing quality.  Since each file is a parallelogram, its area is proportional to quality.

However, the files overlap, preventing us from comparing the areas of the files.  Besides, the point made in the article about the importance of both Qs is lost since this chart stressed quality over quantity.  Quantity showed up as a low dot on the tallest file and a high dot on the shortest file.

The junkart version restores the balance.  The blue lines highlight several banks that scored high on one metric but low on the other.  The construct is a profile chart, with only two variables.

Curious readers may wonder if there were only 10 banks in the IPO underwriting market.  Far from it.  The chart designer introduced a selection bias because banks were included based on Quantity, and only then rated on Quality.  This means there may well be a boutique firm with small revenues but higher quality than any of the 10 in the plot.
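
A toy illustration of this selection effect, in Python.  The five banks, their fees and their returns are all invented, and the helper names are mine; nothing here comes from the Institutional Investor data.

```python
# Hypothetical underwriters: (name, fees in $ millions, average IPO return in %).
# All values are invented for illustration.
banks = [
    ("A", 500, 12.0),
    ("B", 420, 8.0),
    ("C", 300, 15.0),
    ("D", 40, 31.0),   # a small boutique with the best returns
    ("E", 25, -4.0),
]

def top_by_fees(banks, n):
    """Select the n banks with the largest fees (the Quantity filter)."""
    return sorted(banks, key=lambda b: b[1], reverse=True)[:n]

def best_return(banks):
    """Return the bank with the highest average IPO return (Quality)."""
    return max(banks, key=lambda b: b[2])

charted = top_by_fees(banks, 3)
best_charted = best_return(charted)   # best among the banks that made the chart
best_overall = best_return(banks)     # best among all banks

print("best on the chart:", best_charted[0], best_charted[2])
print("best in the market:", best_overall[0], best_overall[2])
```

Because the filter runs on fees first, the boutique "D" never reaches the chart even though it leads on returns.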

Furthermore, much useful information is missing, including the dispersion of returns, the number of deals, etc.

Reference: "Grading the IPO Underwriters", Institutional Investor, March 2007.

Lines of death

I've been reading my friend's anti-smoking tome, and traced this "infographic" back to its source (World Health Organization). 

I was very intrigued by the "lines of death" which seemed to make the point that the risk of death had a spatial correlation: specifically, that the death risk for male smokers was higher in the northern hemisphere (above the line), primarily developed countries, as compared to the southern hemisphere, mostly developing nations.

I find that somewhat counter-intuitive but in a fascinating book like this, one that brings together scientific, psychological and societal commentary, I was expecting to learn new things.

Looking at the legend, the red areas were regions in which deaths from tobacco use accounted for over 25% of "total deaths among men and women over 35".  This explained some of it, as perhaps there were more reasons to die (warfare, other diseases, mine accidents, etc.) in developing nations than in developed nations, or that they had larger populations (so more deaths even at lower rates).

However, the description of the "lines of death" raised my eyebrows.  It is now claimed that more than 25% of middle-aged people (35-69 years old) die from tobacco use in the red regions.

Did they mean 25% of the dead middle-aged people die from smoking?  Or 25% of all middle-aged folks die from smoking?  A gigantic difference!

Percentages are very tricky things to use.  Every time I see a percentage, the first thing I ask is what is the base population.  Here, the baseline appeared to have gotten lost in translation.
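
A quick worked example shows how much the base matters.  The counts below are invented round numbers, not WHO figures, and the function name is mine.

```python
def share(part, base):
    """Return part as a percentage of base."""
    return 100.0 * part / base

# Hypothetical region, one year (invented round numbers, not WHO data).
middle_aged_population = 100_000   # everyone aged 35-69
middle_aged_deaths = 1_000         # deaths from all causes in that group
tobacco_deaths = 250               # deaths attributed to tobacco use

# Base = people who died: "25% of the dead middle-aged people"
pct_of_deaths = share(tobacco_deaths, middle_aged_deaths)

# Base = everyone alive in the age group: "25% of all middle-aged folks"
pct_of_population = share(tobacco_deaths, middle_aged_population)

print(pct_of_deaths, "% of deaths")
print(pct_of_population, "% of the population")
```

The same 250 deaths read as 25% against one base and 0.25% against the other, a hundredfold gap in this made-up case.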

This set of maps also shows the peril of focusing too much on entertainment value, and losing the plot.

For those concerned about the effect of smoking on our society and our children, I recommend Dr. Rabinoff's highly readable new book, "Ending the tobacco holocaust".  It contains lots of interesting tidbits and brings together every cogent argument that exists, including the common ones you've heard and others you haven't.

Reference: "Ending the tobacco holocaust" by Michael Rabinoff; The Tobacco Atlas by the World Health Organization

Criminal chart

The Times found a sharp surge in violent crimes.



The legend for the columns is missing.

The two charts use very different vertical scales.  The maximum murder rate of about 45 per 100,000 in the top chart is depicted by a column 9x as tall as the one showing the minimum aggravated assault rate of about 60 per 100,000 in the bottom chart.
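
To put a rough number on that mismatch, here is a back-of-envelope calculation in Python.  The function name is mine, and the inputs are the approximate readings quoted above, so treat the result as an estimate.

```python
def scale_mismatch(value_a, value_b, height_ratio):
    """How many times more column height per unit chart A uses than chart B.

    value_a, value_b: the rates shown by one column in each chart.
    height_ratio: height of column A divided by height of column B.
    """
    # On a shared scale the heights would be in the ratio value_a / value_b.
    honest_ratio = value_a / value_b
    return height_ratio / honest_ratio

# Murder column (about 45 per 100,000) vs assault column (about 60 per 100,000),
# with the murder column drawn roughly 9x as tall.
mismatch = scale_mismatch(45, 60, 9)
print(mismatch)
```

On a common scale the murder column would be three-quarters the height of the assault column; drawn 9x as tall, each murder gets about 12 times the ink of each assault.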

Sorting by murder rate does a disservice to the bottom chart, rendering it essentially unreadable.

Reference: "Violent Crime in Cities Shows Sharp Surge", New York Times, March 9 2007.

Disparity and distortion

I am of two minds about "cartograms", which are world maps in which the area of each country is made proportional to some measurement such as population, wealth, consumption and so on.  I have liked them since I was young and they typically produce spectacular effects, but the distortion is wilfully introduced.

Perhaps the saving grace is that there exists such extreme disparity in our world.  Because of these vast differences, the distortion does not distract us from perceiving the meaning of these maps.

Thanks to Eric C. for alerting me to this set of cartograms, including this one on military spending.  I'm surprised by the size of Europe as compared to the former Soviet Union.


Information gain and loss

The previous two posts indicated that CNN, TWC and Intellicast had the best on-line weather forecasting accuracy by looking at the median and mean error in predicting daily low and high temperatures over 41 days.  Is it possible to differentiate between those three?

For that, we need more data so I switched from summary statistics back to the raw data.  In this new chart, the day-by-day errors are plotted.  The gridlines mark errors within 5 degrees, an arbitrary threshold separating acceptable from unacceptable.  The three scatters look remarkably similar, although CNN appears to hit the bull's eye (the middle square) with less bias (errors more evenly distributed) but not much better accuracy overall (similar number of unacceptable errors).
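
The distinction between bias and accuracy can be made concrete with a short Python sketch.  The two error series are invented (they are not the CNN, TWC or Intellicast data), and the 5-degree tolerance is the same arbitrary guideline as in the chart.

```python
from statistics import mean

def summarize(errors, tolerance=5):
    """Return (bias, share of errors within +/- tolerance degrees)."""
    bias = mean(errors)  # average signed error: does the forecast lean warm or cold?
    acceptable = sum(1 for e in errors if abs(e) <= tolerance)
    return bias, acceptable / len(errors)

# Hypothetical forecast errors in degrees (forecast minus actual); invented data.
site_a = [-6, -2, 0, 1, 3, 4]   # errors spread evenly around zero
site_b = [1, 2, 3, 4, 4, 10]    # errors lean consistently warm

bias_a, share_a = summarize(site_a)
bias_b, share_b = summarize(site_b)

print("site A: bias", bias_a, "share within 5 degrees", share_a)
print("site B: bias", bias_b, "share within 5 degrees", share_b)
```

Both invented sites miss the 5-degree band equally often, yet one is centered on zero and the other runs warm, which is the kind of difference summary counts alone would hide.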