Jul 01, 2009

A shocking failure to communicate

So said a reader, Stephen B., of the following graphic (note: pdf) in the London Times concerning Andy Murray's recent tennis triumphs.


Lt_murray

How can we disagree?  Shocking?  Yes.  Failure?  Definitely.  Failing to communicate?  No doubt.


Lt_murray_a

Let's start with the five tennis balls at the bottom.  They fail the self-sufficiency test: it makes no difference whether the balls (bubbles) are the same size or different sizes.  Readers will look at the data and ignore the bubbles.

Amazingly, the caption said that "Murray has one of the best returns of serve in the game."  And yet, the graphic showed the five players who were better than Murray, and nobody worse!  For those unfamiliar with tennis statistics, it provides no reference points such as averages or medians to help us understand the data.


But that is only the beginning.

Take a look at these two donuts.

Lt_murray_b
(The color scheme from light to dark: first, second, third, fourth round of tournament)

So we're told: the 75% of first-serve points won in the fourth round was 25.6% of the sum of the percentages of first-serve points won from first to fourth rounds (75%+70%+71%+76%).  What does this mean?  Why should we care?
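To see why the donut tells us so little, here is a minimal sketch of the arithmetic behind it, using only the four percentages quoted above. Each slice is one percentage divided by the sum of all four, a total that has no natural interpretation.

```python
# Percentages of first-serve points won, as quoted in the text above.
first_serve_won = [75, 70, 71, 76]

# Adding percentages from different rounds gives 292, a number with no real meaning.
total = sum(first_serve_won)

# Each donut slice is simply one percentage divided by that total.
shares = [p / total for p in first_serve_won]

for p, share in zip(first_serve_won, shares):
    print(f"{p}% -> {share:.1%} of the donut")
```

Every slice comes out close to a quarter of the ring, so the donut hides the very differences between rounds that a reader would want to see.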

The challenge with these two statistics is that they are correlated and have to be interpreted together.  If the first serve lands in, there is no second serve, etc.  Here's one attempt at it, using statistics from the Soderling-Federer match.  It's clear that Federer was better on both serves.
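One way to read the two serve statistics together is to fold them into a single share of service points won. The sketch below is an assumption about how one might do this, not the method used in the redone chart; the function name and all the numbers are hypothetical, and it assumes the second-serve figure counts double faults as points lost.

```python
def service_points_won(first_in, first_won, second_won):
    """Overall share of service points won by the server.

    first_in   : share of first serves that land in
    first_won  : share of points won when the first serve lands in
    second_won : share of points won when a second serve is needed
                 (assumed to count double faults as points lost)
    """
    via_first = first_in * first_won
    via_second = (1 - first_in) * second_won
    return via_first + via_second

# Hypothetical players, not taken from any real match.
player_a = service_points_won(first_in=0.65, first_won=0.75, second_won=0.55)
player_b = service_points_won(first_in=0.60, first_won=0.70, second_won=0.50)

print(f"Player A wins {player_a:.0%} of service points")
print(f"Player B wins {player_b:.0%} of service points")
```

The second-serve percentage only matters for the fraction of points in which the first serve misses, which is why the two numbers cannot be compared in isolation.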

Redo_murray


Reference: "Murray's march to the last eight", London Times.
 





Jun 29, 2009

Round up

Here is some interesting reading from other places:


Blog_foodtag

Tag clouds have caught on since we approved of them a while ago.  One interesting use was at the Life Vicarious blog, which uses them to compare the inclinations of three New York-based restaurant reviewers.  What they should have done is to remove irrelevant words like "one", "also", "many", "make"/"made", etc.  In statistics, this is called removing "noise", which helps bring out the "signal".
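Removing the noise words before counting is a small step. Here is a minimal sketch; the stop list contains only the words named above, and the sample sentence and the function name `tag_counts` are made up for illustration.

```python
import re
from collections import Counter

# Illustrative stop list built from the words named in the text.
# A real list would be much longer (and would include "with", "and", etc.).
STOP_WORDS = {"one", "also", "many", "make", "made"}

def tag_counts(text, stop_words=STOP_WORDS):
    """Count words in text, skipping anything on the stop list."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words)

# Made-up review text, for illustration only.
review = "One also finds many dishes made with butter. Butter sauce and duck make one happy."
counts = tag_counts(review)

print(counts.most_common(3))
```

The word sizes in the cloud would then be driven by `counts`, so filler words no longer crowd out the terms that distinguish one reviewer from another.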








Nyt_babyimbalance

Andrew Gelman discussed the NYT article that reported the finding of unexpected male bias in the children of Asian American families.  He can be counted on to make useful comments on any accompanying graphics.  He rightly pointed out that this is one example of not starting at zero: the relevant baseline is 100 since the metric is essentially the excess of males relative to females.  I also agree that a line chart with a longer time series, plotting percentages rather than the excess, would work better.
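For readers who want to move between the two scales, the conversion is simple. This sketch assumes the metric is expressed as boys per 100 girls, which is why 100 is the natural baseline; the function names and input values are hypothetical, not figures from the article.

```python
def excess_over_parity(boys_per_100_girls):
    """Excess of boys over the baseline of 100 (equal numbers)."""
    return boys_per_100_girls - 100

def percent_male(boys_per_100_girls):
    """Convert a boys-per-100-girls ratio into the percentage of births that are male."""
    return 100 * boys_per_100_girls / (100 + boys_per_100_girls)

# Hypothetical ratios, for illustration only.
for ratio in (100, 105, 117):
    print(f"{ratio} per 100 -> excess {excess_over_parity(ratio)}, "
          f"{percent_male(ratio):.1f}% male")
```

On the percentage scale the baseline is 50 rather than 100, which many readers find easier to interpret.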


















Fd_calorie

The racetrack chart made an appearance at Flowing Data.  This one is even busier and just as impossible to decipher.


Jun 12, 2009

Reading comprehension

Note: I am in the middle of a holiday and so posting will be limited.

Andrew posted a pretty chart that caught my attention.  This is the sort of sophisticated chart that rewards careful reading. 

Vouchermaps2000

Below is a guide to reading the chart:

  • It is a small multiples chart with the components arranged in two dimensions (income levels, and a race-religion hybrid category).  The top row is a summary of voters of all race-religion groups, broken down by income.  Note that there is no corresponding summary column for voters of all incomes grouped by race-religion.
  • Source of data: 2000 poll but applied to 2008 demographic patterns.  In other words, there is an underlying assumption that opinions have stayed stable within the demographic groups.
  • The chart is in fact three dimensional because each map gives us the geographical (state by state) breakdown.
  • It is useful to figure out the smallest unit of data: in this case, this is the percentage support of federal school vouchers by voters of a given race-religion-income-and-state category.
  • The color scheme is such that red represents highest support and blue lowest support, with pink and purple in the middle.
  • It's almost always better to start from the aggregate (that is, the average) and then study variations along different dimensions, and this is how the chart is arranged from top to bottom.
  • On the top row, the higher income groups tended to favor vouchers more than lower income groups, with a break point around $75k; even here, the regional differences are significant, with the northeast and southwest hotter for vouchers at all income levels.
  • As we move from row to row, we realize that the aggregate data hides many disparities.  For example, white Catholics (second row) are more likely to support vouchers regardless of income level while white non-evangelical Protestants (fourth row) are much less likely than average to support vouchers at all income levels.
  • Notice that the statistician (Andrew) has carefully defined the race-religion categories to strike a balance between collapsing subgroups that are distinct and showing so many subgroups that the patterns are clouded.  That is why there are many more race-religion subgroups that are not shown.  The ones shown are of special interest.  Consider the white Protestants, evangelical vs. non-evangelical (third and fourth rows).  If one were to fix the race, geography and income dimensions, and even fix half of the religion dimension, we still find the two subgroups to be on different ends of the spectrum relating to the voucher issue.  This is why the evangelical-or-not dimension has been included.
  • The white space is interesting.  Here, the issue faced by the statistician is sparse data when one gets down to multi-dimensional subgroups.  Andrew chose to ignore all such data, which is the wise thing to do.  With so few samples, it is particularly easy to draw bad conclusions.
  • Because of the white space, we get additional information on the spatial distribution of the demographic subgroups.  The black population (at least the voters) is predominantly found in the southeast while Hispanics are in the southwest.  The subgroup with income higher than $150k is essentially all white.  Admittedly, this is a very crude read because we only have two levels (below 2% of state population and above).  Of the colored states, we cannot differentiate between densely populated and not.
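The white-space rule described in the last two points can be sketched in a few lines. This is only an illustration of the idea, assuming the 2% cutoff mentioned above decides whether a state is colored; it is not Andrew's actual code, and the data structure, names, and numbers are made up.

```python
# Share of state population below which a subgroup is left blank (assumed from the text).
THRESHOLD = 0.02

def displayed_support(cells, threshold=THRESHOLD):
    """Return support estimates, with None (white space) for sparse subgroups."""
    return {
        key: (cell["support"] if cell["pop_share"] >= threshold else None)
        for key, cell in cells.items()
    }

# Hypothetical cells keyed by (state, subgroup); all numbers are invented.
cells = {
    ("State A", "group 1"): {"pop_share": 0.30, "support": 0.52},
    ("State A", "group 2"): {"pop_share": 0.01, "support": 0.90},
}
shown = displayed_support(cells)

for key, value in shown.items():
    print(key, "blank" if value is None else f"{value:.0%}")
```

The extreme 90% estimate for the tiny subgroup is exactly the kind of number that sparse data produces, and blanking it out keeps it from drawing the eye.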

  

Such rich graphics deserve careful reading.  Enjoy!
  

Jun 03, 2009

Spinning the climate

Mike L. pointed us to this pair of "climate change model pie charts", with the brief comment "Yuck".

Sd_climate
They are using the spinning wheel analogy to present probabilities (odds).  Not a good use of pies either.  Histograms do the job with minimal fuss:

Redo_climate 


I collapsed the 2-2.5 and 2.5-3 degrees sectors since every other one is a one-degree interval.  We see immediately that the effect of the policy is to shift the probability distribution to changes of fewer degrees.
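Collapsing two adjacent sectors just means adding their probabilities so that every bar covers the same width. A minimal sketch, with made-up probabilities and a hypothetical helper `merge_bins`; the real values come from the original pie charts.

```python
# Hypothetical probabilities keyed by (low, high) temperature-change bins in degrees.
probs = {(2.0, 2.5): 0.10, (2.5, 3.0): 0.20, (3.0, 4.0): 0.40, (4.0, 5.0): 0.30}

def merge_bins(probs, first, second):
    """Combine two adjacent bins into one by summing their probabilities."""
    if first[1] != second[0]:
        raise ValueError("bins must be adjacent")
    merged = dict(probs)
    merged[(first[0], second[1])] = merged.pop(first) + merged.pop(second)
    return dict(sorted(merged.items()))

merged = merge_bins(probs, (2.0, 2.5), (2.5, 3.0))

for (low, high), p in merged.items():
    print(f"{low}-{high} degrees: {p:.0%}")
```

With equal-width bins, bar heights can be compared directly; leaving the two half-degree bins in place would make that part of the distribution look smaller than it is.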


Reference: "Climate change odds much worse than thought", Science Daily, May 20 2009.
