
Microsoft and innovation

In the area of innovation, Microsoft may be remembered for debuting the "spider plot" with the forthcoming Excel 2007 release.  John S., a long-time reader, tipped me to this link, which contains an awe-inspiring display of engineering talent...
John said he had "never seen anything quite as ugly" and I have to concur.  I keep thinking of a gigantic spider with myriad legs.  Maybe I'm just being fantastical.  Three-dimensional plots are frequently untenable; this one is impossible.

Further, lurking behind the spider plot is a multi-colored table with numbers to 5 decimal places, each data graphic rivalling the other in terms of incomprehensibility.

Not-so-hot news boxes

One of the lovely Web-enabled technologies is the real-time "Hot News" ticker, which shows us stories in order of their popularity; an example is shown on the right.

Like tables of numbers, lists of text are not the easiest to read, so we look for enhancements that can aid the scanning process.  News.com has taken an unusual approach: clicking on "graphic" brings the reader to a "box chart".  We're told that the bigger the box, the "hotter" the story, and the redder the box, the more recent the story.
This graphic has a few problems:

  • Our eyes naturally flow to the darker colors, which in this case represent the oldest - some might say stalest - stories.  It'd help to reverse that color scheme.
  • The arrangement of the boxes forces our eyes to criss-cross the page width, while the original list involves only one direction.
  • It is unclear how the size of the box is related to the "heat" of the story.  The several examples I saw all have two stories on the left and multiple stories cramped in the right column.  Am I observing a natural law of news story popularity, or an arbitrary sizing decision?
  • "Heat" and recency are correlated concepts: the older the story, the more time it has to accumulate readers.  Interestingly, the News.com crew noticed this problem, and their remedy is to include all new stories in the past 72 hours as "hot".  That represents a source of injected correlation!

The junkart version is a less ambitious "enhanced" list: the list is ranked by how hot the story is, contains bolded keywords to help readers pick out themes, and uses a different bullet for stories published in the last 72 hours.


The choice of keywords is a delicate balance between being a reading aid and being a nuisance.  Here, I just picked out "proper nouns".  I can imagine many other possibilities, not least of which are computer-generated keywords.

I'm sure there are better ideas out there.  If you send me your charts, I'll post them here.

Update 1: Peter Forret sent in this alternative.  He prefers to emphasize the "heat" index rather than recency, using font size and color as the cue.  I have to say I'm not a fan of different shades of colors, as the lighter shades often strain my eyes.

Boxplots to the rescue

Phil over at the Gelman blog wondered how to improve this bewildering (but pretty) data display.  Data-rich it certainly is.  The table collects the returns of 12 categories of funds over a 15-year period.  The fund returns are specified, as are the rankings of each fund within the dozen for each year.

If the purpose is to confuse customers, i.e. to claim that fund class does not matter, then this chart succeeds.  However, on closer look, one might observe that three of the 12 classes showed up disproportionately at the top rank so the chart is somewhat misleading.  Directional evidence is buried in the palette of colors; but how to separate the noise from the signal?

As usual, there is no single "best" graphic.  The "best" graphic is one where form matches function.  If my goal is to help customers understand their expected return and risk for different fund classes over the last 15 years, then the side-by-side boxplot works wonders.

CPI and T-bill stand out for relatively low median returns but exceedingly low dispersion of returns.  International was off the charts in terms of fluctuations and also had the third-lowest median return during this period.  And so on.

If ranks were more of a concern than returns, the same chart can be reproduced using ranks on the vertical axis.
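To sketch how such a side-by-side boxplot is built, here is a minimal example. The fund names, means, and spreads below are invented for illustration; they are not the numbers from the chart.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical annual returns (%) for three fund classes over 15 years;
# the location and spread parameters are made up for illustration only.
returns = {
    "T-bill":        rng.normal(4, 1, 15),    # low median, low dispersion
    "Large blend":   rng.normal(9, 12, 15),
    "International": rng.normal(7, 20, 15),   # wide dispersion
}

def five_number(x):
    """The five numbers a boxplot draws: min, Q1, median, Q3, max."""
    return np.percentile(x, [0, 25, 50, 75, 100])

for name, r in returns.items():
    lo, q1, med, q3, hi = five_number(r)
    print(f"{name:>13}: median {med:6.1f}%, IQR {q3 - q1:5.1f}")
```

Passing the same series to matplotlib's `plt.boxplot(returns.values())` draws the side-by-side version; replacing each year's return with its rank among the fund classes gives the rank-based variant.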

The Crossover Law of Petropolitics


Truck and Barter attributed this chart to Thomas Friedman (he who proved the flat earth), who apparently made the following comment:

And what you basically see is this relationship where as the price of oil goes down the pace of freedom goes up in countries like Nigeria, Iran and Russia, and as the price of oil goes up the pace of freedom goes down, and the lines actually cross in all of these graphs

Sadly, his most amazing finding is the least interesting: given two lines drawn against two different axes, measured in two different units (one in dollars, the other in index numbers), the crossovers are merely an artifact created by the chart designer.  Simply changing the scale, or shifting one line vertically, will cause this amazing feature to vanish.
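The point is easy to demonstrate: with two free axes, where the lines cross is entirely the designer's choice. A toy sketch (all numbers invented):

```python
import numpy as np

oil = np.array([20, 30, 40, 50, 60])            # hypothetical oil prices ($)
freedom = np.array([7.0, 6.0, 5.0, 4.5, 4.0])   # hypothetical freedom index

def rescale(x, lo, hi):
    """Map a series onto an axis spanning [lo, hi] - the designer's free choice."""
    return lo + (x - x.min()) * (hi - lo) / (x.max() - x.min())

def crossings(a, b):
    """Indices where two plotted curves cross (sign change of the gap)."""
    return np.where(np.diff(np.sign(a - b)) != 0)[0]

# Axis choice 1: both series plotted on axes spanning the same band
c1 = crossings(rescale(oil, 0, 1), rescale(freedom, 0, 1))
# Axis choice 2: the freedom axis shifted upward, equally "valid"
c2 = crossings(rescale(oil, 0, 1), rescale(freedom, 0.5, 1.5))
print(c1, c2)  # the crossover moves with the axis choice
```

Same data, different "crossover law" - which is the whole objection.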

Also see a related post here.


Google trends

Google Trends is both fascinating science and a dangerous tool.  The following example is lifted from Andrew Sullivan's blog.

First off, this is a supreme example of turning volumes of data into useful information.  (If you data-mine, you'll understand the amount of work needed to generate something like this, in automated fashion.)  The chart provides a comparison of the volume of traffic for different search keywords over time.  The lines are sharp, and some well-chosen amount of smoothing is applied so that some spikes are seen but not too many.  The concept of flagging certain "special" points is also admirable.  No wonder this caught the attention of lots of marketers!

However, user beware!  For unexplained reasons, much of the information required to interpret this chart is missing.  The vertical scale is absent, which means we do not know how many searches include the word "blog".  While the relative gap between the lines is large, the absolute difference may in fact be tiny.

Also, what sample size was used?  How were the samples selected?  This gets even more tricky because Google then categorizes the results by cities, regions and languages.  Do they have enough samples to make meaningful statements at that level of detail?  Similarly, on the time scale, what kind of smoothing was employed?

The special flags, while a wonderful concept, fall flat in practice, highlighting the limitations of machine intelligence.  On the right, I copied the headlines for the flags.  You may also be bewildered by the choice: not one has anything to do with comparing NYT and blogs.

Such half-baked tools are very dangerous indeed, as demonstrated by Andrew's comment.  Andrew is one of the pioneers of news blogging who departed mainstream media, so his bias is well known.  Using this chart, he proclaimed: "They're [NYT] doomed."

Not so fast.  It is unfair to "spread their votes" by using "new york times", "nytimes" and "ny times" as three separate entries.  Besides, NYT is only one publication; pitting it against a world of blogs is absurd.  Especially when the top 8 regions searching for "blog" are outside North America!  (See the light blue bars on the right.)

Meanwhile, this bar chart is also impossible to interpret.  By "normalization", one assumes they are removing the effect of the total number of searches, or else the US would always end up at the top.  Normalization is forever a double-edged sword: even if you see Peru as having the highest percentage of searches using "blog", you can't conclude that Peru is the market to go after, since you may worry about how widespread Internet/Google penetration is in Peru.  By hiding the scale (again), Google Trends stubbornly remains just a toy.
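To see what the normalization likely does, consider a toy sketch with invented counts (the real numbers are exactly what Google hides): a small region can top the normalized ranking even though the US dwarfs it in raw volume.

```python
# Hypothetical (region -> (searches containing "blog", total searches));
# these counts are made up purely for illustration.
searches = {
    "US":   (9_000, 50_000_000),
    "Peru": (1_200, 2_000_000),
}

for region, (blog, total) in searches.items():
    # "normalized" here means: share of that region's searches using "blog"
    print(f"{region:>4}: raw {blog:>6,}, normalized {blog / total:.4%}")
```

With these numbers, Peru wins on share while the US wins on raw count, which is why neither ranking alone tells a marketer where to go.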

The nature of variation 2

In a previous post, we saw a statistical reason why the observed distribution of birth-months of NHL players may be remarkably more variable than that of the population at large, purely due to the process of randomly sampling 761 people from millions.  It is not at all surprising that certain months would account for, say, 10% of the births of NHL players (but it would be surprising if this happened in the US population).

Next, is it unusual to see higher-than-8% values in the spring months and lower values in the winter months?  Again, we want to know if the pattern we observed may have happened just by chance.  The answer is contained in the following histogram.

Here, I did 1000 random selections of 761 people.  For each selection, I fitted a line through the monthly percentages.  If the slope of the line is significantly different from 0, then the line is not flat, which provides evidence that a month-of-year effect exists.  By convention, a p-value of 0.05 or smaller (for the t-test of the month coefficient) indicates the slope is not flat.

The histogram collects the p-values for the 1000 regression lines.  We note that a great proportion of the 1000 p-values are greater than 0.05 (in fact, only 49 out of 1000 p-values were 0.05 or smaller).  Thus, we conclude that it is exceedingly unlikely to see a significant downward trend from spring to winter if indeed 761 people were randomly selected from the at-large population.
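The simulation described above can be re-created in a few lines. This is my own minimal sketch, not the original code: it draws 761 uniform birth months, regresses the monthly percentages on month-of-year, and flags slopes whose t-statistic exceeds 2.228 (the two-sided 5% cutoff with 10 degrees of freedom), which is equivalent to counting p-values at or below 0.05.

```python
import numpy as np

rng = np.random.default_rng(1)
months = np.arange(1, 13)

def slope_t(n=761):
    """Draw n birth months uniformly, regress monthly % on month-of-year,
    and return the t-statistic of the fitted slope."""
    counts = np.bincount(rng.integers(0, 12, n), minlength=12)
    y = 100 * counts / n                      # monthly percentages
    x = months - months.mean()                # centered month index
    b = (x * y).sum() / (x * x).sum()         # least-squares slope
    resid = y - y.mean() - b * x
    se_b = np.sqrt((resid ** 2).sum() / 10 / (x * x).sum())  # 12 - 2 = 10 df
    return b / se_b

tstats = np.array([slope_t() for _ in range(1000)])
sig = int((np.abs(tstats) > 2.228).sum())
print(sig, "of 1000 slopes 'significant' at the 5% level")
```

Under pure random selection, roughly 5% of the simulated slopes come out "significant" - in the same ballpark as the 49 out of 1000 reported above.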

"Exceedingly unlikely" however does not mean impossible.  Below are the data and the regression lines for the first 25 simulations.  The one labelled p-value = 0.03 is one of the 49 non-flat scenarios (shown by red lines) and closely resembles the observed data!  In this case, statistics gives us that the probability of observing this is about 0.049 (= 49/1000) and we'd elect to believe that the assumption of random selection (no month-of-year effect) is incorrect, rather than accept that we saw an exceedingly rare event.


To sum up, the fact that the NHL line fluctuates much more wildly than the population lines is not surprising and is easily explained by sample size.  However, the fact that there is a temporal downward trend deserves attention, as it is highly unlikely to occur if the 761 players were randomly selected.  (To get an even better picture, it may be worthwhile to figure out the likelihood of a downward trend conditional on having a trend.)

The charting process

Once in a while, I get reader emails asking me to discuss the charting process.  I still haven't put pen to paper on that, but I came across this article, which is a good start.  Suffice it to say that the responsible "chartist" would have generated all of the figures in this article during the charting process and then picked the one (or more) that most ably displays the key message.

The nature of variation 1

I refer readers to Andrew's comments on a graph purporting to demonstrate the existence of a month-of-year selection bias in the NHL, cited on the Freakonomics blog as an example of "overwhelming" evidence of such effects in sports.  (The original graph may have come from here.)

In particular, note the Professor's point #4.  It is always necessary to ask oneself if perceived "trends" are real or not before attempting to provide an explanation.  What Andrew computed can be interpreted to mean that approximately 30% of the time, we expect to see percentages larger than 9% or smaller than 7%.  Thus, out of 12 months, we'd expect to see about 3.6 months with those "extreme" values (even if players were randomly picked from the population so that their birthdays would have been evenly spread out).  The NHL line contains 4 such values and so while there is some evidence of bias, it is certainly not "overwhelming" as Freakonomics suggested.

The chart itself is, sadly, misleading by its very choice of comparing NHL players to the populations of Canada and USA.  To cite the original website, the key message of this chart was:

The 761 NHL players show a distinctly different pattern than that for Canada or the United States with the highest percentage of births in January and February and the lowest in September and November.

This "pattern" is the larger observed dispersion of NHL monthly percentages from the mean percentage of 8%, as compared to Canada or USA.  In other words, the NHL line fluctuates more wildly. 

Too bad there is a statistical law that guarantees this "pattern": the law says that in looking at sample averages, the larger the sample size, the smaller the dispersion.  (This is why Andrew used the sample size 761/12 in his calculation.)  Because the Canada and USA lines represent averages of millions of people while the NHL line represents only 761 people, it is absolutely no surprise to find the NHL line fluctuating more wildly!
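This law is a two-line computation. The sketch below (my own, using the standard binomial formula for the standard error of a monthly share, and 4,000,000 as a stand-in national sample size) also reproduces the ballpark of Andrew's ~30% figure:

```python
import math
from statistics import NormalDist

def monthly_se(n, p=1/12):
    """Standard error of one month's birth share in a random sample of n."""
    return math.sqrt(p * (1 - p) / n)

# The NHL line averages 761 people; national lines average millions.
print(f"n = 761:       se = {100 * monthly_se(761):.2f} pct points")
print(f"n = 4,000,000: se = {100 * monthly_se(4_000_000):.3f} pct points")

# Chance a month's share falls below 7% or above 9% when n = 761
share = NormalDist(mu=1/12, sigma=monthly_se(761))
outside = share.cdf(0.07) + (1 - share.cdf(0.09))
print(f"P(share < 7% or > 9%) = {outside:.0%}")
```

The NHL standard error is about one percentage point per month, versus a hundredth of that for the national lines, so the wilder NHL fluctuation is guaranteed before any hockey is played.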

Thus, the comparison is not valid.  It'd have been more useful to have drawn the NHL line for various historical periods.  If all the lines show a downward slope, then it would be time to examine why this is occurring.

To further fix ideas, look at the following set of lines.  Each line represents an alternative universe in which 761 people were randomly selected to be NHL players from the US and Canadian populations.  While in theory the line connecting monthly percentages should be flat (at 1/12 or 8%, i.e. the green lines below), in reality, because of random selection, the lines fluctuate quite a bit.


While the amount of dispersion is not "overwhelming", perhaps the observed trend of decreasing percentage with increasing month is unusual enough to warrant further study.  I'll take a closer look next time.

References: Andrew Gelman's blog, Freakonomics blog, Freakonomics NYT column

Bell Curves: Not on charts please

The Bell Curve has become such a fixture in both research and everyday situations that it is often overused and misused.  I will wait for another day to talk about that topic specifically; here, I want to suggest that bell curves should rarely be shown on a chart, and never more than one bell curve on one chart.

I thank the Truck and Barter blog for bringing my attention to the paired bell curves in the World Bank Malnutrition Report.  Professor Gelman's comments started me thinking about this.

There is serious distortion in this presentation.  If you recall your first stats class, it is the tail probabilities, not the heights of the curves, that matter.  Unfortunately, curves like these tend to draw our attention to the heights.

You might also recall that the mean and the dispersion together completely define any normal distribution curve, so really, the only salient features of each curve are the "center" of the curve (in this case, 0 vs -1.8) and the "width" of the curve (here, 6 units vs 8 units).  Sadly, while the labels are numerous, they do not point out these salient features.
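The tail probabilities follow directly from those two salient features. A minimal sketch, assuming the "width" of each curve is its ±3-standard-deviation span (my reading of the chart, not stated in the report):

```python
from statistics import NormalDist

# Center and width read off the chart: 0 vs -1.8, 6 units vs 8 units.
# Treating each "width" as the +/- 3-sd span is my assumption.
reference = NormalDist(mu=0.0,  sigma=6 / 6)
india     = NormalDist(mu=-1.8, sigma=8 / 6)

cutoff = -2.0  # the conventional z-score threshold for underweight
print(f"reference below {cutoff}: {reference.cdf(cutoff):.1%}")
print(f"India     below {cutoff}: {india.cdf(cutoff):.1%}")
```

The reference curve puts about 2% of children below the cutoff while the shifted curve puts a large fraction there, and none of that is visible from the heights of the two humps.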

I wouldn't be so insistent were it not for the fact that Tukey had long ago invented a far superior way to show distributions.  Here is a boxplot style representation of the same information:


The chart is not quite to scale, and the vertical axis dimension is missing.  Plus, the length of the box is not the usual interquartile range.  But the center and width of each curve are clearly shown, and their relative sizes are easily read off the chart.

Using data tables

Charts are supposed to elucidate data.  We love charts here but sometimes the love is misplaced.  I noticed the following Economist chart by way of the Truck and Barter blog.


It's a very simple chart, with only 6 pieces of data.  And yet, presenting the data in a table would have been clearer.  One measure of the effectiveness of charts is the amount of time the reader needs to locate the data.  In the table, everything the reader needs requires two steps: looking up the right row and the right column.  On the bar chart, however, the reader must first find the right chart, then the right bar, and then estimate the length of the bar by referencing the axis; if the reader wants the totals, s/he must estimate three lengths and mentally add them up.

Reference: "Into the Fold", Economist, May 4 2006.