
Sociology of numbers

I picked up a copy of AM New York (free newspaper given out in the subway stations) yesterday morning.  So we are told: tanning beds are killers.  See here for example.

How bad?

The article began:

Using a tanning bed regularly is as deadly as taking arsenic, a shocking study to be published today says.
The report found that the risk of skin cancer jumped by 75 percent when tanners started regularly using the beds before the age of 30.

A few paragraphs later, an "occasional tanning bed user" made a confession:

"I always knew it wasn't good for your skin.  Definitely 75 percent risk of cancer is not that encouraging and not worth the risk."


What started out as a relative increase in risk of 75 percent (those using tanning beds compared to those who don't) ended up as a 3 in 4 chance of getting cancer!
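To see how far apart the two readings are, here is a minimal sketch. The baseline risk is a made-up number chosen purely for illustration (the article excerpt does not give one); only the 75 percent relative increase comes from the report as quoted.

```python
def absolute_risk(baseline_risk: float, relative_increase: float) -> float:
    """Apply a relative increase (0.75 means +75%) to a baseline risk."""
    return baseline_risk * (1.0 + relative_increase)


# HYPOTHETICAL baseline: 2 in 1,000 non-users get the cancer (illustration only).
baseline = 0.002
with_tanning = absolute_risk(baseline, 0.75)

print(f"baseline risk:            {baseline:.2%}")
print(f"risk with 75% increase:   {with_tanning:.2%}")
print(f"the misreading (3 in 4):  {0.75:.2%}")
```

Under that assumed baseline, a 75 percent relative increase moves the risk from 0.20% to 0.35%, which is nowhere near a 3 in 4 chance.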

This reminds me of Joel Best's books, in which he explores how data gets "adulterated" as it moves through society.  I find his perspective fascinating and his books well worth a read.  They are not your typical statistics books, for sure.


Space and time

When it comes to space or time in graphics, old habits die hard.  When we have spatial data, the default is to put it on a map.  When we have a time series, the default is to plot time along the horizontal axis.  Sometimes these defaults work; other times, breaking up the map or the straight timeline works better.

Thanks to a reader, I noticed that Google put up a "Flu Trends" website to help us track the flu season.  They use two main charts to plot the data, as shown below.


On the right side is the time series, showing the severity of flu cases from month to month.  There are many great things about this chart and one serious flaw.  I love the fact that they did not simply run time along the horizontal axis as one long line; they recognize the seasonality and overlay the years as overlapping lines.  They make good use of foreground and background; it's easy for us to compare year-to-year differences.

The serious flaw: no vertical scale.  This was a problem with Google Trends from day one (see my post here).  They still haven't fixed it.  Because of this, we don't know if the peak shown was 5 cases or 5000 cases.  For Google keyword searches, one can excuse them for trying to protect commercial secrets, but I would imagine that this public health data is, well, public.  Since the apparent purpose of this chart is to allow citizens to declare a flu epidemic (say, when they see the current trend depart from the historical norm), not having the scale is a huge problem.

I also disagree with shifting the months around for the Northern Hemisphere so that the peaks of the graphs are aligned towards the middle.  It is better to let the peaks appear on the left and keep the months in the order we expect.  (With the months in calendar order, the "peak" would be split between the two sides and the chart would look like a valley, which presumably is why they did it this way.)

The charts on the left side plot the spatial data, not surprisingly on maps.  Sadly, the standard exhibited on the time-series charts is nowhere to be found on these maps.

First, the legend is seriously deficient.

Second, the gradation of the colors is not fine enough, or put differently, the aggregation to the state/province level is much too coarse for any interesting pattern to be seen.

This poorly aggregated map becomes a farce when applied to the U.S.  There is not much left to be said, is there? 


Visual analogies

Many a startling faux pas in graphics starts with the desire to use visual analogies: the recent shopping mall chart is a good example.  There is something inviting about making a chart about shopping malls look like a shopping mall, or a chart about breakfast foods look like a donut, and so on.


The New York Times (or maybe OECD) makes a chart about economic cycles look like, you guessed it, cycles.  I don't have much to say about this one, except these two points:

  • Much of the validity of the chart depends on the theory behind it.  Do the two variables being tracked (amount produced relative to trend; change in production in last 6 months) capture everything?
  • The focus on generating cycles visually left out one of the most important dimensions: the time scale.  For this chart to work, we would like to know not just where we are in the cycle but how much time we spend at each point in the cycle.  The interactive charts (graphs 4, 5, 6, 7) show us that the cycles are not traversed at constant speed.  A proper rendering of this data needs to play up the time scale.

If you have ideas of how to improve this, feel free to comment.

Reference: "Turning a corner?", New York Times, July 2, 2009.

A shocking failure to communicate

So said a reader, Stephen B., of the following graphic (note: pdf) in the London Times concerning Andy Murray's recent tennis triumphs.


How can we disagree?  Shocking?  Yes.  Failure?  Definitely.  Failing to communicate?  No doubt.

Let's start with the five tennis balls at the bottom.  They fail the self-sufficiency test.  It makes no difference whether the balls (bubbles) are the same size or different sizes.  Readers will look at the data and ignore the bubbles.

Amazingly, the caption said that "Murray has one of the best returns of serve in the game."  And yet, the graphic showed the five players who were better than Murray, and nobody worse!  For those unfamiliar with tennis statistics, it does not provide any helpful statistics like averages, medians, etc. to help us understand the data.

But that is only the beginning.

Take a look at these two donuts.

(The color scheme from light to dark: first, second, third, fourth round of tournament)

So we're told: the 75% of first-serve points won in the fourth round was 25.6% of the sum of the percentages of first-serve points won from first to fourth rounds (75%+70%+71%+76%).  What does this mean?  Why should we care?
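The arithmetic behind the donut can be reproduced in a few lines. This is only a sketch of what the chart appears to be doing, using the four percentages quoted above; the variable names are mine.

```python
# First-serve points won (%), the four values quoted from the graphic.
pct_won = [75, 70, 71, 76]

total = sum(pct_won)                   # 292, a sum of percentages with no natural meaning
shares = [p / total for p in pct_won]  # what the donut slices encode

for p, s in zip(pct_won, shares):
    print(f"{p}% won -> slice of {s:.1%}")
```

Every slice comes out between about 24% and 26%, so the donut is always going to look like four nearly equal quarters.  Any four similar percentages would produce essentially the same picture, which is why the slices tell us nothing.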

The challenge with these two statistics is that they are correlated and have to be interpreted together.  If the first serve goes in, there is no second serve, so the second-serve numbers only cover the points where the first serve missed.  Here's one attempt at it, using statistics from the Soderling-Federer match.  It's clear that Federer was better on both serves.
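One way to read the two serve statistics together is to fold them into a single share of service points won, weighting each by how often that serve is actually played.  The sketch below uses two made-up players with hypothetical numbers (not the match statistics), and it assumes that "second-serve points won" is measured over all points where the first serve missed.

```python
def service_points_won(first_in: float, won_on_first: float, won_on_second: float) -> float:
    """Overall share of service points won.

    first_in:      share of points where the first serve lands in
    won_on_first:  share of those points the server wins
    won_on_second: share of the remaining points (first serve missed) the server wins
    """
    return first_in * won_on_first + (1.0 - first_in) * won_on_second


# HYPOTHETICAL players, for illustration only.
player_a = service_points_won(first_in=0.65, won_on_first=0.75, won_on_second=0.50)
player_b = service_points_won(first_in=0.55, won_on_first=0.78, won_on_second=0.50)

print(f"Player A wins {player_a:.1%} of service points")
print(f"Player B wins {player_b:.1%} of service points")
```

In this made-up case Player B wins a higher percentage of first-serve points but fewer service points overall, because B lands fewer first serves.  Looking at either percentage alone would miss that.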


Reference: "Murray's march to the last eight", London Times.