##### Jun 30, 2008

Todd B didn't like this chart showing the correlation between baseball team salaries and their win-loss records.

A few problems are in plain sight:

• Most importantly, putting a second set of logos next to the salaries column would really help
• Unclear why the lines should be of varying widths
• Winning percentage is more telling than the win-loss record, especially in the middle of a season when there is a slight imbalance in total games played
• The spread of salaries is so wide (a factor of 10) that reducing the numerical scale to a rank scale means a big loss of information
• Each column is sorted by its own metric while the most important sorting variable should be the slope of the lines (i.e. the cost per win)
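
To make the last few points concrete, here is a minimal sketch of computing winning percentage and cost per win, then sorting by the latter. The team names and figures are made up purely for illustration (they are not the data behind the chart), and "salary divided by wins" is just one plausible reading of cost per win.

```python
from dataclasses import dataclass


@dataclass
class Team:
    name: str
    wins: int
    losses: int
    salary: float  # total payroll in dollars (hypothetical values below)

    @property
    def win_pct(self) -> float:
        # Winning percentage corrects for unequal numbers of games played.
        games = self.wins + self.losses
        return self.wins / games if games else 0.0

    @property
    def cost_per_win(self) -> float:
        # One possible definition: payroll divided by wins so far.
        return self.salary / self.wins if self.wins else float("inf")


# Made-up teams and numbers, for illustration only.
teams = [
    Team("Team A", 50, 32, 40_000_000),
    Team("Team B", 45, 36, 200_000_000),
    Team("Team C", 41, 41, 80_000_000),
]

# Sort by cost efficiency instead of sorting each column by its own metric.
by_cost = sorted(teams, key=lambda t: t.cost_per_win)

for t in by_cost:
    print(f"{t.name}: {t.win_pct:.3f} win pct, ${t.cost_per_win:,.0f} per win")
```

Sorting on the derived metric puts the most cost-efficient teams at the top, which is the ordering the chart's slopes are trying to convey.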

The interactive feature of individual plots for each day (control bar at the top) of the baseball season is something of a gimmick.  Props though for realizing that the first few days of the season don't tell us anything.  There really is little use for investigating this correlation on a day-by-day basis.  Particularly when the salaries are given in aggregate.

On the diagram, the blue lines represent teams such as the Devil Rays and Arizona that had better winning records than their salaries would suggest.  Red lines display those teams spending more money than their records would suggest.  The steeper the line, the better/worse the team's cost efficiency.

With so many long steep lines in both colors (directions), one might posit that a negative correlation may exist between salary level and winning record.

The following scatter plot suggests otherwise:

The correlation between salary and winning is very weak.  If one were to fit a linear model, it would show that the higher-salaried teams generally were doing slightly better (black line).  The Yankees were sufficiently outside the range in salaries that I didn't include them in estimating the line.  (However, as the chart shows, the line in fact estimated the Yankees' winning percentage really well.)

Teams above the line are performing better than their salaries would lead us to believe.
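
For readers who want to try this kind of analysis, here is a rough sketch of the idea: fit an ordinary least-squares line while leaving out one salary outlier, then use the residuals to see which teams sit above the line. The numbers are invented for illustration and are not the data in the scatter plot; "Big Spender" is a hypothetical stand-in for an outlier, and I am assuming plain least squares since the exact fitting method isn't the point here.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: name -> (payroll in $ millions, winning percentage).
teams = {
    "Team A": (40, 0.480),
    "Team B": (80, 0.540),
    "Team C": (100, 0.490),
    "Team D": (120, 0.530),
    "Big Spender": (210, 0.560),
}
OUTLIER = "Big Spender"  # left out of the fit, but still scored against it

fit_points = [v for name, v in teams.items() if name != OUTLIER]
slope, intercept = fit_line(
    [pay for pay, _ in fit_points],
    [pct for _, pct in fit_points],
)

# Residual = actual winning percentage minus what the line predicts.
residuals = {
    name: pct - (intercept + slope * pay)
    for name, (pay, pct) in teams.items()
}

# Positive residual: the team is doing better than its payroll suggests.
above = sorted(name for name, r in residuals.items() if r > 0)
print(f"slope per $1M: {slope:.6f}")
print("above the line:", above)
```

Excluding the outlier from the fit but still computing its residual mirrors the check described above: a small residual for the excluded team means the line extrapolates well to it.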

Reference: Ben Fry's baseball salary page

You can follow this conversation by subscribing to the comment feed for this post.

I come to the defense of this chart. Ben Fry put it up to illustrate some functions of Processing, like how to draw lines of different colors and thickness, write with different fonts, let the user select a date, etc. So the chart is not an example in itself, but rather a demo of what it is possible to do. That being said, it is far from being bad.

You also have to imagine the graph as an interactive object rather than a still image. In its finished version you can slide the date from start to end of season. So the win-loss record which is incremented is really much more telling than a percentage which won't change. Likewise the lines change in slope with the season which carries visual impact, more than a scatterplot which as we all know is not understood at a glance by all users.

The one thing that I can't excuse is the deliberate choice of non-lining numerals for the budget and the precision to the dollar of those 9-digit numbers.

What's the correlation between winning and having an arched team logo versus an upward-swooping one? :-)

Your critique contains some relevant points. Percent win would probably help streamline the point quite a bit, and it is quite true that the graph doesn't accurately scale the difference between the highest and lowest salaries. But it's important to mention that the graph was created for chapter five of Fry's Visualizing Data: Exploring and Explaining Data with the Processing Environment, published by the technical publisher O'Reilly. Just as most graphs that appear in Nature would be inappropriate for the general audience of the NY Times, the criteria by which we judge this graph should be relative to the context it appeared in. The intended audience of this graph is NOT anyone who is interested in baseball, but anyone who is interested in data visualization as a topic. Each graph in the book is not a finished product, but an illustration of how to explore data with Processing. In one chapter he presents fifteen different ways to visualize the same data set. In the chapter where this graph appears, Fry explains that in this version he thickened "the stroke weight based on the team's salary." I think the stroke weight would probably have been more meaningful if it represented the magnitude of the slope of each line, that is, if it indicated the degree of disparity between team rank and salary, for it seems like the whole point of the graph is what you say: there is almost a negative correlation between salary and winning statistics.

The chart is really confusing (I have to agree with you). The whole purpose of making a chart is to make something clearly visible and identifiable, and this chart does nothing of the kind. There is a ton of unexplained stuff (all of which the author of this post already listed). All in all, this is one of the worst charts I have ever seen.
