Nov 18, 2007

The absolutely meaningless pie chart

Simon J., from New Zealand, sent this in during the recent Rugby Cup but I didn't notice it till now.  As he stated, "they do a good job confirming our views of pie charts!"  Dropkicks is a site about rugby, and other sports popular in the south Pacific.

So here is our light entertainment for Thanksgiving week:
Dropkicks_pie_chart


This chart accompanied a very serious statistical analysis to address the monumental question of whether some countries were borrowing strength from foreign players.  If this is your cup of tea, follow this link.

P.S. Today I started the Junk Charts Core Collection, which includes books I recommend on graphics, statistics, data mining and related topics (top right).  Some categories are sparse right now as I build out the collection.  If you have favorites, let me know and I will include them.  (I am using the Amazon interface to organize the list; if you buy books, you are buying from them.  I am not becoming a bookstore.)

11/19: Amazon seems to be having problems serving up the images.  I have turned off the image for now.  You can follow the text link above to see the book collection.

11/20: the image is up again

Aug 28, 2007

Cheers

Nyt_mets07


This is an exemplary chart from the NYT Sports page.  It provides a clear, informative and exciting way to visualize how the baseball season has gone for the Mets this year and last.  It's been mostly up and not much down.

We can observe the more subtle differences: last season was a steady rise with only two prolonged down periods; this season's curve is driven by two up periods (including right now), outside of which the record has hovered around two levels (0, +3).

Especially commendable is the judicious use of axis labels.  However, I'm not clear on how some of the labels were chosen.  For example, 14 games ahead seems to me a rather arbitrary choice.

All in all, a job well done.

Source: "Not Only Yankee Fans Cheering for Week 22", New York Times, Aug 27, 2007

Mar 21, 2007

March mildness

The Times published this great graphic to show 2007 was an upset-starved year in the recent history of the NCAA Basketball tournament, which is on-going.

Nyt_mildness

Each box contains the number of upsets in a given year of a given pairing, e.g. in 1998, there was one case of a 9-seed beating an 8-seed.  An upset is defined as a lower seed beating a higher seed although the editorial comment argued that 9 beating 8 is "rarely considered an upset".

The rightmost column (which sums across a row) tells us that the number of upsets fluctuates wildly between the years, ranging from 3 to 13.  (That's why people bet on NCAA pools.)

A couple of improvements would make this chart even more effective:

  • Include a row showing the average number of upsets for each pairing;
  • Include a column of zeroes for 16-1 pairings.

This second point cannot be emphasized enough.  The fact that no 1-seed has ever lost to a 16-seed should not be relegated to a footnote.  Think of it this way: if the results for 15-2 and 16-1 were reversed so that no 15-seed had ever beaten a 2-seed but one 1-seed had lost to a 16-seed, nobody would omit the 15-2 column!

In his book Visual Explanations, Tufte discussed the Challenger disaster at considerable length.  A key lesson was that non-events (things not happening) contain important information, and should never be dropped from an analysis without unassailable logic.

The mildly improved chart would look like this.

Redo_mildness

What then to make of the comment that "9 beating 8 is rarely an upset"?  For one thing, 9-8 upsets happen about as frequently as 10-7 upsets so if the comment refers to the surprise factor, then even 10-7 upsets should be excluded.

But the comment also underlines a deeper issue, which is hindsight.  Obviously, the seeding committee felt, and predicted, that the 8 seed would beat the 9 seed.  It was only after the fact that we found out 9 had beaten 8.  Instead of denying the 9-8 upset, would it make more sense to ask if there was a seeding error?

Reference: "March Mildness", New York Times, March 17, 2007, p.D2.

Oct 23, 2006

Tracking tigers

Nyt_tigers_1


This chart is fantastic work from Amanda Cox and Joe Ward at the NYT.  It tracks the baseball Tigers' season, showing how they peaked in early August (with a 10-game lead) and limped into the playoffs, five days after losing the division title.  That slide, beginning in mid-September, set them back 4.5 months.  (It would help to label the line marking 5 games behind the leader.)

The shading to show which team(s) were chasing them is a stroke of genius.

Further, the dot plots on the right very cleanly bring out their advantage in pitching.  The hitting numbers are mixed.

The following chart is for the Cardinals:

Nyt_cardinals

 

Reference: "World Series Preview", New York Times, Oct 21 2006.

Sep 17, 2006

Much data, zero info

The number crunching college football fans at the Wall Street Journal wondered out loud:

One of the biggest developments in college football in recent years was the decision by Virginia Tech and Miami -- perennial top-20 teams -- to leave the Big East conference and join the Atlantic Coast Conference.  How much has that strengthened the ACC?

Wsj_acc

The data table on the right was ostensibly the answer.  Readers were drawn to the bolded numbers, the almost identical winning percentages of ACC and SEC (averaged over the last decade, as the text explained).

The question is a classic one of cause and effect: did the addition of two strong teams cause the ACC to become stronger?  Startlingly, the data cited was useless, and the analysis conducted irrelevant.

First, the difference in winning percentages between ACC and SEC is the wrong metric.  Something more pertinent is, for example, the change in winning percentage of ACC before and after the team additions.

Second, the observation period is seriously mistaken.  The ACC expansion occurred in 2004 so average winning percentages from 1995-2005 have zilch to say about its effect.

Third, a Web search uncovers that major realignment occurred again in the ACC in 2005, making it very difficult to isolate the effect of adding Virginia Tech and Miami in 2004.

Thus, the data table contains zero information for addressing the stated problem.  How to measure the effect properly seems to me a tall order, and a good discussion topic.

Besides the iffy statistics, the table itself is nearly impossible to read.  The data in the lower left triangle mirror those in the upper right triangle and contain no new information.  Head-to-head conference comparisons seem to serve no purpose.  Actual win-loss numbers create clutter while adding no insight.  (Theoretically, the larger the number of contests between any two conferences, the more reliable are the winning percentages.  Confidence intervals are a much better way to present such information but even those would be overkill for our purpose.)
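
To make the parenthetical point concrete, here is a minimal Python sketch of a confidence interval for a winning percentage.  It uses the Wilson score interval, which is one common choice and not something the Journal or this post computed.  The win-loss records are made up for illustration; the function name is hypothetical.

```python
import math

def wilson_interval(wins, games, z=1.96):
    """Approximate 95% Wilson score interval for a winning percentage."""
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    denom = 1.0 + z * z / games
    centre = (p + z * z / (2.0 * games)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / games + z * z / (4.0 * games * games)) / denom
    return centre - half_width, centre + half_width

# Hypothetical records: the same 60% winning percentage from few and from many contests.
few_games = wilson_interval(6, 10)
many_games = wilson_interval(60, 100)
print("6-4 record:   %.3f to %.3f" % few_games)
print("60-40 record: %.3f to %.3f" % many_games)
```

The two records share the same winning percentage, but the interval built from 100 contests is much narrower than the one built from 10, which is the sense in which more contests make the percentage more reliable.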

Reference: "College Football's Power Struggle", Wall Street Journal, Sept 16-17, 2006.

Jul 31, 2006

Enigma of the big-buck pitcher

A data table accompanied a recent NYT article pointing out that big-buck pitchers were far from sure wins for those clubs who have taken Scott Boras' pitches.  The table contains a wealth of data but very little information is immediately revealed to the reader.

Nyt_bigcontracts


Sorting by size of contract makes no sense, especially since the key metric of success, i.e. change in winning percentage pre- and post-contract, cannot be discerned without pulling out a calculator.  Further, once the contract size is expressed by dollars per season, it is clear that all these contracts fall into the same range (about $10-13 million per year).

Bigcontracts

One graphical alternative is shown on the right.  It brings out the desired message, that big-buck pitchers may or may not perform after signing big-buck contracts.  Several pitchers are annotated because their teams' winning percentages improved or declined by more than 200 points.

A graph cannot hope to achieve the data density of a data table.  But the process of making a graph forces the designer to focus on the most important data, which itself has great benefits.

Reference: "Big-buck pitchers are often big busts", New York Times, July 16, 2006.

Jul 23, 2006

Visual brilliance

Finally I found some brilliant visualizations in this collection by a Parsons MFA grad (warning: very graphic-intensive site!).   Here is an example (created by FAS.research) which analyzed the ball movement in the World Cup final between Italy and France:

Worldcupfinal
It is immediately clear which were the key players in the match.  If you saw the match, check your memory of what happened against what was plotted here.

Notice that every player was placed on his side of the pitch.  The chart did not use any data on where passes occurred, only who passed the ball to whom.  Such distortion is unavoidable in multi-dimensional charts, as the designer must choose some dimensions to display while hiding others.

May 17, 2006

The nature of variation 2

In a previous post, we saw a statistical reason why the observed distribution of birth-months of NHL players may be remarkably more variable than that of the population at large, purely due to the process of randomly sampling 761 people from millions.  It is not at all surprising that certain months would account for, say, 10% of the births of NHL players (but it would be surprising if this happened in the US population).

Next, is it unusual to have higher-than-8% values in the spring months and lower values in the winter months?  Again, we want to know if the pattern we observed may just happen by chance.  The answer is contained in the following histogram.

Bdaypvalhist

Here, I did 1000 random selections of 761 people.  For each selection, I fitted a line through the monthly percentages.  If the slope of the line is significantly different from 0, then the line is not flat, which provides evidence that a month-of-year effect exists.  By convention, a p-value of 0.05 or smaller (for the t-test of the month coefficient) indicates the slope is not flat.

The histogram collects all the p-values for the 1000 regression lines.  We note that the great majority of the 1000 p-values are greater than 0.05 (in fact, only 49 of the 1000 p-values are <= 0.05).  Thus, we conclude that it is exceedingly unlikely to see a significant downward trend from spring to winter if indeed 761 people were randomly selected from the at-large population.

"Exceedingly unlikely" however does not mean impossible.  Below are the data and the regression lines for the first 25 simulations.  The one labelled p-value = 0.03 is one of the 49 non-flat scenarios (shown by red lines) and closely resembles the observed data!  In this case, statistics gives us that the probability of observing this is about 0.049 (= 49/1000) and we'd elect to believe that the assumption of random selection (no month-of-year effect) is incorrect, rather than accept that we saw an exceedingly rare event.

Bdaylmmatrix
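
The procedure described in this post can be sketched in a few lines of Python.  This is a minimal sketch and not the code behind the charts.  It assumes births are spread evenly over the 12 months (the charts may have used actual population percentages).  Instead of computing each p-value, it compares the t-statistic of the slope to 2.228, the two-sided 5% critical value of a t-distribution with 10 degrees of freedom, which flags the same simulations as a p-value of 0.05 or smaller.  The function names are illustrative.

```python
import math
import random

N_PLAYERS = 761
N_MONTHS = 12
T_CRIT = 2.228  # two-sided 5% critical value, Student's t with 12 - 2 = 10 df

def monthly_percentages(rng):
    """Assign each player a birth month at random (assumed uniform) and return monthly %."""
    counts = [0] * N_MONTHS
    for _ in range(N_PLAYERS):
        counts[rng.randrange(N_MONTHS)] += 1
    return [100.0 * c / N_PLAYERS for c in counts]

def slope_t_statistic(y):
    """t-statistic for the slope of a least-squares line of y on month number 1..12."""
    n = len(y)
    x = list(range(1, n + 1))
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope / std_err

def count_significant(n_sims=1000, seed=0):
    """Count simulated selections whose fitted line is significantly non-flat."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        if abs(slope_t_statistic(monthly_percentages(rng))) > T_CRIT:
            hits += 1
    return hits

hits = count_significant()
print("non-flat lines out of 1000:", hits)
```

When there is truly no month-of-year effect, a test at the 0.05 level should flag roughly 5% of the simulations, so a count in the neighborhood of 50 out of 1000 is what this kind of experiment is expected to produce.  Note that the count includes upward as well as downward slopes.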

To sum up, the fact that the NHL line fluctuates much more wildly than the population lines is not surprising and is easily explained by sample size.  However, the fact that there is a temporal downward trend deserves attention as it is highly unlikely to occur if the 761 players were randomly selected.  (To get an even better picture, it may be worthwhile to figure out the likelihood of a downward trend conditional on having a trend.)

May 14, 2006

The nature of variation 1

Birthsbymonth

I refer readers to Andrew's comments on a graph purporting to demonstrate the existence of a month-of-year selection bias in the NHL, cited on the Freakonomics blog as an example of "overwhelming" evidence of such effects in sports.  (The original graph may have come from here.)

In particular, note the Professor's point #4.  It is always necessary to ask oneself if perceived "trends" are real or not before attempting to provide an explanation.  What Andrew computed can be interpreted to mean that approximately 30% of the time, we expect to see percentages larger than 9% or smaller than 7%.  Thus, out of 12 months, we'd expect to see about 3.6 months with those "extreme" values (even if players were randomly picked from the population so that their birthdays would have been evenly spread out).  The NHL line contains 4 such values and so while there is some evidence of bias, it is certainly not "overwhelming" as Freakonomics suggested.
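
The arithmetic behind the "about 30%" and "about 3.6 months" figures can be checked with a short Python sketch.  This is a rough reconstruction under stated assumptions, not Andrew's actual calculation: it treats a month's share as a binomial proportion from 761 players, takes the rounded 8% as the true share for every month, and uses the normal approximation.  The function names are illustrative.

```python
import math

N_PLAYERS = 761
N_MONTHS = 12

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_outside(n, p, low, high):
    """Normal approximation to P(sample share < low or sample share > high)."""
    sd = math.sqrt(p * (1.0 - p) / n)
    return normal_cdf((low - p) / sd) + (1.0 - normal_cdf((high - p) / sd))

# Assumption: every month's true share is the rounded 8% used in the text.
p_extreme = prob_outside(N_PLAYERS, 0.08, 0.07, 0.09)
expected_extreme_months = N_MONTHS * p_extreme
print("P(a month falls below 7%% or above 9%%): %.2f" % p_extreme)
print("Expected number of such months out of 12: %.1f" % expected_extreme_months)
```

With these inputs the standard deviation of a monthly share is close to one percentage point, so the 7% to 9% band is roughly plus or minus one standard deviation, and the probability of falling outside it comes out close to 30%.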

The chart itself is, sadly, misleading by its very choice of comparing NHL players to the populations of Canada and USA.  To cite the original website, the key message of this chart was:

The 761 NHL players show a distinctly different pattern than that for Canada or the United States with the highest percentage of births in January and February and the lowest in September and November.

This "pattern" is the larger observed dispersion of NHL monthly percentages from the mean percentage of 8%, as compared to Canada or USA.  In other words, the NHL line fluctuates more wildly. 

Too bad there is a statistical law that guarantees this "pattern": the law says that in looking at sample averages, the larger the sample size, the smaller the dispersion.  (This is why Andrew used the sample size 761/12 in his calculation.)  Because the Canada and USA lines represent averages of millions of people while the NHL line represents only 761 people, it is absolutely no surprise to find the NHL line fluctuating more wildly!

Thus, the comparison is not valid.  It'd have been more useful to have drawn the NHL line for various historical periods.  If all the lines show a downward slope, then it would be time to examine why this is occurring.

To further fix ideas, look at the following set of lines.  Each line represents an alternative universe in which 761 people were randomly selected to be NHL players from the US and Canadian populations.  While in theory the line connecting monthly percentages should be flat (at 1/12 or 8%, i.e. the green lines below), in reality, because of random selection, the lines fluctuate quite a bit.

Bdaylinematrix2
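
For readers who want to generate their own alternative universes, here is a minimal Python sketch.  It is not the code behind the chart above.  It assumes every month is equally likely (ignoring month lengths and real seasonal birth patterns), and the larger sample of 76,100 is an arbitrary stand-in for a population, chosen only to keep the run time short.

```python
import random

N_MONTHS = 12

def monthly_share_spread(n_people, rng):
    """Draw birth months at random and return (max - min) of the monthly percentages."""
    counts = [0] * N_MONTHS
    for _ in range(n_people):
        counts[rng.randrange(N_MONTHS)] += 1
    shares = [100.0 * c / n_people for c in counts]
    return max(shares) - min(shares)

def average_spread(n_people, n_universes=20, seed=1):
    """Average spread of monthly percentages over several simulated universes."""
    rng = random.Random(seed)
    spreads = [monthly_share_spread(n_people, rng) for _ in range(n_universes)]
    return sum(spreads) / len(spreads)

nhl_sized = average_spread(761)        # samples the size of the NHL group
larger = average_spread(76100)         # hypothetical sample 100 times larger
print("average spread, 761 people:    %.2f percentage points" % nhl_sized)
print("average spread, 76,100 people: %.2f percentage points" % larger)
```

The gap between the highest and lowest month is several percentage points for samples of 761 but only a fraction of a point for the larger samples, even though both are drawn from the same flat distribution.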

While the amount of dispersion is not "overwhelming", perhaps the observed trend of decreasing percentage with increasing month is unusual enough to warrant further study.  I'll take a closer look next time.

References: Andrew Gelman's blog, Freakonomics blog, Freakonomics NYT column

Feb 24, 2006

Noble attempt

One can scarcely find a media outlet more committed to printing interesting data graphics than the New York Times.  By printing the following scatter plot, it shows a level of sophistication as yet unmatched.
Nyt_skate

The graphic has a few problems:

  • confusing labels: it appears that athleticism is equivalent to "elements" so that "better elements" and "more athletic" are the same
  • redundant labels: there shouldn't be a need to print "more artistic" inside the chart and "better artistry" along the axis; it's even more confusing as they point in different directions
  • mysterious line: it's not clear how the line was created; is it a linear regression line?  I'd have thought that a 45 degree line would be appropriate if the scoring scales for artistry and elements are identical
  • bad choice of statistic: even though they have access to two years of data, instead of plotting average or median scores, they showed "highest" score.  Thus, a poor skater may show up on the top right corner on the strength of just one performance during those two years
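
The last point can be illustrated with a small Python sketch.  The scores below are invented for two imaginary skaters and have nothing to do with the data in the Times chart; the function name is illustrative.

```python
from statistics import mean, median

def summarize(scores):
    """Return the highest, mean and median of a list of scores."""
    return {"highest": max(scores), "mean": mean(scores), "median": median(scores)}

# Hypothetical scores over two seasons (not real data).
erratic_skater = [45.0, 47.5, 44.0, 68.0, 46.5]   # one outstanding performance
steady_skater = [60.0, 61.5, 59.0, 62.0, 60.5]    # consistently good

erratic = summarize(erratic_skater)
steady = summarize(steady_skater)
print("erratic:", erratic)
print("steady: ", steady)
```

Ranked by highest score, the erratic skater comes out ahead on the strength of a single performance; ranked by median or mean, the steady skater is clearly better.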

Reference: "Cohen Cultivates Sublime Status: Quiet Contender", New York Times, Feb 21 2006.
