
A splitting headache

Todd B didn't like this chart showing the correlation between baseball team salaries and their win-loss records.

A few problems are in plain sight:

  • Most importantly, putting a second set of logos next to the salaries column would really help
  • Unclear why the lines should be of varying widths
  • Winning percentage is more telling than win-loss, especially in the middle of a season when there is a slight imbalance in total games played
  • The spread of salaries is so wide (10 times) that reducing the numerical scale to a rank scale meant a big loss of information
  • Each column is sorted by its own metric while the most important sorting variable should be the slope of the lines (i.e. the cost per win)

The interactive feature of individual plots for each day (control bar at the top) of the baseball season is something of a gimmick.  Props though for realizing that the first few days of the season don't tell us anything.  There really is little use for investigating this correlation on a day-by-day basis.  Particularly when the salaries are given in aggregate.

On the diagram, the blue lines represent teams such as the Devil Rays and Arizona that had better winning records than their salaries would suggest.  Red lines display those teams spending more money than their records would suggest.  The steeper the line, the better/worse the team's cost efficiency.

With so many long steep lines in both colors (directions), one might posit that a negative correlation may exist between salary level and winning record. 

The following scatter plot suggests otherwise:

The correlation between salary and winning is very weak.  If one were to fit a linear model, it would show that the higher-salaried teams generally were doing slightly better (black line).  The Yankees were sufficiently outside the range in salaries that I didn't include them in estimating the line.  (However, as the chart shows, the line in fact estimated the Yankees' winning percentage really well.)

Teams above the line are performing better than their salaries would lead us to believe. 
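
For readers who want to try this kind of check themselves, here is a minimal sketch: it fits an ordinary least-squares line while holding one outlier out of the estimate, then reports each team's gap from the line (positive means above it).  The payroll and winning-percentage values are invented placeholders, not the 2008 figures, and the names `fit_line` and `residuals` are illustrative only.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(teams, slope, intercept):
    """Actual minus predicted winning percentage for every team."""
    return {name: y - (intercept + slope * x) for name, (x, y) in teams.items()}


# Placeholder data: (payroll in $ millions, winning percentage).
# These values are hypothetical, not the real 2008 numbers.
teams = {
    "Team A": (45, 0.540),
    "Team B": (70, 0.470),
    "Team C": (95, 0.520),
    "Team D": (120, 0.500),
    "Big Spender": (210, 0.550),
}

# Estimate the line without the outlier, then score every team against it.
held_out = {"Big Spender"}
slope, intercept = fit_line([v for k, v in teams.items() if k not in held_out])
gaps = residuals(teams, slope, intercept)
for name, gap in sorted(gaps.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {gap:+.3f}")
```

Scoring the held-out team against a line it did not help estimate is what makes the "the line predicted the Yankees well" observation meaningful.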

Reference: Ben Fry's baseball salary page

Graphs as catalogs

Junk Charts typically concerns itself with statistical graphics a la Tufte and Cleveland, treating charts as a means to summarize, elucidate and highlight aspects of data.  We haven't been too kind to so-called infographics, often finding these cluttered and confusing.  Recently, I have had a small change of heart.

I now see infographics as innovative in one way, and a complement to traditional graphics.  This is the idea of graphs as catalogs.  What many of these graphs try to do is to present a structured way for users to explore massive amounts of data.  They don't serve the traditional purpose of summarization and that's why they are innovations.

The following chart from NYT tracing Serena Williams' tennis ranking prompted this post.


As a traditional statistical graphic, this chart leaves much to be desired.  The general outline of her career could be described in one sentence without the need for any graphic.  The colorful vertical lines serve little purpose, nor do the short line segments on the other side of the axis.

However, as a catalog of data on Serena's career, this graphic is fascinating.  Mousing on the vertical lines changes the information on the top right corner, including the tournament being played and the media event she participated in, as well as photos and her rankings.  Similarly, the left and right arrows on the top left allow readers to browse through the list of events chronologically. (You need to click on the link to use the interactive features.)  Without this chart, it would have been very difficult to learn about Serena's record at a particular tournament or point in time.  It acts like a data table but presents the information in a much more accessible way.

Thus, relying on interactivity, this compact graphic enables any of us to browse to a user-defined depth a reservoir of data.  Bravo!

Reference: "Serena William's Professional Career", New York Times, June 2008.

Close races

Perhaps harkening to the close race between Obama and Clinton, the designer chose to illustrate this with what we have called the "racetrack" graph.  We have previously discussed the problems here and here.

In this rendition, a pie chart was divided into three race tracks with "cities" getting the inside track and "rural/small cities" getting the outside track.  (As the Clinton supporters might say, elitism was in the air.)  There were two great choices: the courage to not print the data and let the chart speak for itself, and the wisdom to white out the votes for "others".

Nevertheless, as we discussed before, the data is coded into the angles rather than the lengths of the strips, which presents a real problem in comparing vote shares.  For example, try figuring out if there were more Obama supporters in rural Tennessee than there were Clinton supporters in cities in Tennessee (bottom right).

Also note where the white "others" spaces were, and the impossibility of comparing them.

The arrangement for Wisconsin, meanwhile, posed a challenge for anyone who wanted to estimate how many rural Wisconsin voters went for "others".

In the junkart version, we go with the two-sided bar chart, typically found in population pyramids.  The information presented jumps out at you.

This chart is essentially the same as the racetrack; one just needs to straighten out the strips from the original chart, and pull the Clinton ones clockwise, and Obama ones anti-clockwise.

Reference: some recent issue of New York Times magazine.

Whither complexity?

The ever-interesting Gelman blog ("Too clever by half") ponders this enterprising NYT chart.  Whatever its merits, this is one that requires close study.


Reception is generally positive.  Andrew himself learnt an important fact, that there are still more white people than other races in America!  In statistics, we distinguish between two types of errors, the significant kind and the ignorable kind.  From this perspective, using admissions count is a gigantic problem; it renders the rest of the chart useless.  So I agree with Andrew.  As ever, picking the right scale is the beginning of making a nice chart.

We can also use this example to discuss the concept of "interactions".  When we go about presenting small multiples, i.e. comparisons of subgroups within a population, it's because we have observed differences between those subgroups; otherwise, it is both simpler and clearer to present the aggregate results.  The present chart presents subgroups defined by race, gender, age and substance abused; that is quite a lot of subgroups.

Focusing on the first row (Alcohol), we note that the colored mass has shifted to the right, indicating more older people abused alcohol.  This trend appeared for all races.  Now scanning the other rows, we discover that only heroin abuse showed a distinctly different pattern, but only among whites.  For every other row, it seemed that the change from 1996 to 2005 was similar across races.

By breaking out substance abused, the designer added 21 little charts (7 sets of 3).  Only one set (heroin) added information to what was true in aggregate, i.e. that substance abusers got older.  The incremental gain in information does not justify the added complexity.

Nevertheless, the chart had many positive things such as judicious use of axis and gridlines and letting the graphical constructs speak for themselves (without accompanying data labels).


Reference: "Why is Mum in Rehab?",  New York Times, Jun 14 2008.

A budding field

Avinash has an interesting piece about some examples of visualization of Web data.  That's a very rich area since there is so much data.  I agree with his observation that there are precious few truly great charts that have thus far appeared.  (Note, though, that typically the more data, the more noise.  See this post.)

He discussed a tag cloud display of the top cities from which website visitors hail.  We like tag clouds too. See here, here and here.

He praised a particular pie chart because "the pie ... is just a stage prop".  It worked because all the data was printed on the chart itself.  This violates our self-sufficiency principle: if all the data is printed on the chart, and the only way to read it is to look at the data, then the chart serves no purpose.  More here.

He liked Amazon's feature of customer ratings distributions.  Me too.  A powerful example of small graphics that make a huge impact.  Here is the typical Web rating display:
Almost everyone uses the statistical average. This hides information about how dispersed (or not) customers' reactions were.  The current Amazon display gives us this information:
Notice that 108 customers actually gave this book the lowest rating even though the average was four stars.
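
To see how an average can mask dispersion, here is a small sketch that computes both the mean rating and the full distribution from star counts.  Only the 108 one-star reviews come from the example above; every other count is a made-up placeholder, and `mean_rating` and `shares` are illustrative names.

```python
def mean_rating(counts):
    """Average star rating from a {stars: number_of_reviews} mapping."""
    total = sum(counts.values())
    return sum(stars * n for stars, n in counts.items()) / total


def shares(counts):
    """Fraction of reviews at each star level."""
    total = sum(counts.values())
    return {stars: n / total for stars, n in counts.items()}


# Hypothetical counts, except the 108 one-star reviews mentioned in the text.
counts = {5: 700, 4: 150, 3: 60, 2: 40, 1: 108}

print(f"mean: {mean_rating(counts):.1f} stars")
dist = shares(counts)
for stars in sorted(counts, reverse=True):
    bar = "#" * round(40 * dist[stars])  # crude text histogram
    print(f"{stars} star  {bar} {counts[stars]}")
```

The single mean line says nothing about the lump of one-star reviews; the five-row histogram makes it obvious.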

The most intriguing example was Google's comparison of keyword performance to the site average.  It's a good idea but the execution is wanting.


Firstly, I believe the percentages are much better presented as index values, with 100 being the site average.  Secondly, it is unnerving to have red associated with positive values, green with negative values, or to have negative values on the right of positive values.  I think they realize green and to the right should represent "good" (bounce rate of visitors lower than average) but this just doesn't work.  Thirdly, are the data labels really necessary?  They impede our sight lines when comparing bars.  And do we need to know to two decimal places?
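
As a rough illustration of the index idea, the sketch below converts either a percentage difference from the site average, or a raw value plus the site average, into an index where 100 means "same as the site".  The function names and the keyword figures are hypothetical; rounding to whole numbers reflects the point about unnecessary decimal places.

```python
def index_from_pct_diff(pct_diff_from_average):
    """Index (100 = site average) from a percentage difference such as -25.37."""
    return round(100 + pct_diff_from_average)


def index_from_values(value, site_average):
    """Index (100 = site average) from a raw metric and the site-wide average."""
    return round(100 * value / site_average)


# Hypothetical bounce-rate differences (in percent) for three keywords.
keywords = {"keyword one": -25.37, "keyword two": 0.0, "keyword three": 40.2}
for name, diff in keywords.items():
    print(f"{name:14s} index {index_from_pct_diff(diff)}")
```

For a metric like bounce rate, where lower is better, an index below 100 is the good outcome; the index itself carries no "good" or "bad" direction.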

PS. Apologies for the inconsistent font.  Typepad continues its mischief: I couldn't change the font size after adding a hyperlink.  Apparently I have to fix the font size before adding a link.  You also might notice the changing font size as I write this paragraph.  Don't know why there was a switch; I didn't ask for it.

Rise and fall

Via Adam came this "colorful" chart of the rise and fall of house prices since 2000, as measured by the Case-Shiller index.  He commented that this showed the old saw "the taller they are, the harder they fall".


A different chart allows us to test this theory directly.  From the above, we noted that each curve was composed of two phases, a long rise from 2000 to roughly mid-2000s followed by a steep decline.  We computed two data series: the average monthly growth rate during the inflation phase and the average monthly decline during the deflation phase.  The scatter plot showed the correlation. 


The dots displayed pretty strong correlation, confirming that on average, the faster they rise, the steeper they fall.

The diagonal line indicated equal rates of growth and subsequent decline.  The cities above the line, especially Boston and New York, have witnessed declines that were much slower than the earlier rises.  On the other end, cities like Detroit, Cleveland, Atlanta and Dallas suffered price deflation much faster than earlier inflation.  Indeed, the ratio of decline to rise rates is given by the slope from the origin to the dot.
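
One way to derive the two series is sketched below.  It assumes a monthly index series per city, splits it at the peak, and uses geometric (compound) average monthly rates; the post does not say which kind of average was actually used, so treat that choice, the function name `rise_and_fall`, and the sample numbers as assumptions.

```python
def rise_and_fall(index_values):
    """Return (avg monthly rise, avg monthly decline, decline/rise ratio).

    index_values is a list of monthly index levels, oldest first.
    Rates are compound averages: rise from the first month to the peak,
    decline from the peak to the last month (both as positive fractions).
    """
    peak = max(range(len(index_values)), key=index_values.__getitem__)
    months_up = peak
    months_down = len(index_values) - 1 - peak
    if months_up == 0 or months_down == 0:
        raise ValueError("series needs both a rising and a falling phase")
    rise = (index_values[peak] / index_values[0]) ** (1 / months_up) - 1
    fall = 1 - (index_values[-1] / index_values[peak]) ** (1 / months_down)
    return rise, fall, fall / rise


# Made-up index levels for illustration only (not Case-Shiller data).
sample = [100, 104, 109, 115, 112, 108]
rise, fall, ratio = rise_and_fall(sample)
print(f"rise {rise:.2%}/month, decline {fall:.2%}/month, ratio {ratio:.2f}")
```

A ratio of 1 puts a city on the diagonal; the further the ratio is from 1, the more lopsided the rise and the fall.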

As for the original chart, it showed all the signs of Excel defaults.  It just does not make sense for a charting program to pick a different color for each time series, no matter how many there are.  Beyond four or five colors, it is impossible for readers to tell the lines apart.  In these situations, we should adopt a foreground / background strategy: decide on the key lines, highlight those with color, gray out the remaining lines.
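
The foreground / background strategy can be sketched independently of any charting package: pick the few key series, give only those a distinct color, and send everything else to gray.  The palette, the gray value and the name `assign_colors` are arbitrary choices for illustration; the resulting mapping can be handed to whatever plotting tool is in use.

```python
FOREGROUND = ["#d62728", "#1f77b4", "#2ca02c", "#ff7f0e"]  # at most four accents
BACKGROUND = "#cccccc"  # light gray for everything else


def assign_colors(series_names, highlighted):
    """Map each series to a color: accents for key lines, gray for the rest."""
    if len(highlighted) > len(FOREGROUND):
        raise ValueError("too many highlighted series to tell apart")
    accents = dict(zip(highlighted, FOREGROUND))
    return {name: accents.get(name, BACKGROUND) for name in series_names}


cities = ["Boston", "New York", "Detroit", "Cleveland", "Atlanta", "Dallas"]
colors = assign_colors(cities, highlighted=["Boston", "Detroit"])
for city, color in colors.items():
    print(f"{city:10s} {color}")
```

Refusing more than four highlighted series is deliberate: it forces the chart maker to decide which lines matter.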

Reference: Standard & Poor's

The right scale

Oftentimes, picking the right scale for a chart makes all the difference.  The following chart showed up in the New York Times Magazine some time ago.  Readers will immediately recognize this as "infotainment" rather than a serious attempt to convey the data.


The data came from a study by the Center on Education Policy which counted the amount of instruction time spent on various subjects at a sample of elementary schools in the U.S.

A simple bar chart would make a nice graphic, as shown on the right.  Instead of sorting by decreasing minutes, we pulled out "lunch" and "recess" since they belong to a separate category.

Our main focus, though, is on the scale.  The original report - and thus the original graphic - used minutes per week.  We contend minutes per day (or even hours per day) to be more user-friendly.  This is because any number makes sense only in comparison to other numbers.  There is no easy reference to a number such as 500 minutes per week.  However, being told it's 100 minutes per day (or 1 hr 40 min per day) means a lot because everyone knows there are 24 hours in a day.
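
The rescaling is simple arithmetic, sketched here under the assumption of a five-day school week (which is what turns 500 minutes per week into 100 minutes per day).  The function names are illustrative.

```python
def per_school_day(minutes_per_week, school_days_per_week=5):
    """Convert weekly instruction minutes to minutes per school day."""
    return minutes_per_week / school_days_per_week


def as_hours_minutes(minutes):
    """Split a number of minutes into (hours, minutes)."""
    return divmod(round(minutes), 60)


weekly = 500
daily = per_school_day(weekly)
hours, mins = as_hours_minutes(daily)
print(f"{weekly} min/week = {daily:.0f} min/day = {hours} hr {mins} min/day")
```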

This is a small example of a larger problem with using averages.  The media loves to give out statistics like six people are dying of diabetes every minute (e.g. here).  This is typically done by dividing the total number of diabetes-related deaths in a year by the number of minutes in a year.   Why divide by total number of minutes in a year?  The fallacy of such a calculation is evident if one applies this logic to natural deaths (since we all have to die some day).  As the world population grows, there will just be more and more people dying every minute!

Choosing the appropriate reference point -- just like picking the right scale -- is the beginning of any good analysis.

Reference: New York Times magazine, April 27 2008; Center on Education Policy.


The following charts showed up in Internet Retailer, and are ripe for some reconstruction.

There are few situations in which a grouped bar or column chart is the best choice.  In such charts, readers frequently have to examine the tips of the bars and yet the bodies of the bars obstruct comparisons.  Placing data labels instead of an axis is a nice touch; lining the labels up would be even better.   The junkart version below uses a dot plot which allows for comparisons within each payment type, and comparisons between payment types, to reveal themselves.


The second chart is also unnecessarily complex.  The use of double axes announces trouble; so too does the superposition of lines and columns.  The data-to-ink ratio of the chart is low because the data in the columns adds up to the numbers in the line.  Crucially, it is always important to clearly point out projected values (versus actual values).  Here is a junkart version.
The first revision focuses on dollar volume, showing that despite faster growth, alternative payments are merely catching up to traditional payment growth.   The higher growth rate is applied to a much smaller base!

The second revision focuses on growth rates.  Notice that all values here are projections.

Reference: "New way ahead for online payment methods", Internet Retailer, Nov 2007.