March mildness
Mar 21, 2007
The Times published this great graphic to show that 2007 was an upset-starved year in the recent history of the NCAA Basketball tournament, which is ongoing.
Each box contains the number of upsets in a given year of a given pairing, e.g. in 1998, there was one case of a 9-seed beating an 8-seed. An upset is defined as a lower seed beating a higher seed, although the editorial comment argued that 9 beating 8 is "rarely considered an upset".
The rightmost column (which sums across a row) tells us that the number of upsets fluctuates wildly between the years, ranging from 3 to 13. (That's why people bet on NCAA pools.)
A couple of improvements will make this chart even more effective:
- Include a row showing the average number of upsets for each pairing;
- Include a column of zeroes for 16-1 pairings.
This second point cannot be emphasized enough. The fact that no 1-seed has ever lost to a 16-seed should not be relegated to a footnote. Think of it this way: if the results for 15-2 and 16-1 were reversed so that no 15-seed had ever beaten a 2-seed but one 1-seed had lost to a 16-seed, nobody would omit the 15-2 column!
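The two suggested additions (an average row and an explicit 16-1 column of zeroes) can be sketched in a few lines. The counts below are made up for illustration and are not the Times' data; the names `PAIRINGS`, `upsets_by_year`, `row_totals` and `column_averages` are likewise hypothetical.

```python
# Hypothetical upset counts: one row per year, one column per seed pairing.
# These numbers are illustrative only, NOT the data from the Times graphic.
PAIRINGS = ["9-8", "10-7", "11-6", "12-5", "13-4", "14-3", "15-2", "16-1"]

upsets_by_year = {
    "year 1": [2, 1, 1, 2, 1, 1, 0, 0],
    "year 2": [2, 2, 2, 2, 1, 1, 0, 0],
    "year 3": [2, 0, 1, 0, 0, 0, 0, 0],
}


def row_totals(table):
    """Total upsets per year (the rightmost column of the chart)."""
    return {year: sum(counts) for year, counts in table.items()}


def column_averages(table):
    """Average upsets per year for each pairing (the suggested extra row)."""
    n_years = len(table)
    return [sum(column) / n_years for column in zip(*table.values())]


totals = row_totals(upsets_by_year)
averages = dict(zip(PAIRINGS, column_averages(upsets_by_year)))

# The 16-1 column stays in the table even though every entry is zero,
# so its average is reported as 0.0 rather than silently dropped.
for pairing in PAIRINGS:
    print(f"{pairing:>5}: {averages[pairing]:.1f}")
```

Keeping the 16-1 pairing as an ordinary column means the non-event appears in the output like any other pairing.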
In his seminal work, The Visual Display of Quantitative Information, Tufte discussed the Challenger disaster at considerable length. A key learning was that non-events (things not happening) contain important information, and should never be dropped from an analysis without unassailable logic.
The mildly improved chart would look like this.
What then to make of the comment that "9 beating 8 is rarely an upset"? For one thing, 9-8 upsets happen about as frequently as 10-7 upsets, so if the comment refers to the surprise factor, then even 10-7 upsets should be excluded.
But the comment also underlines a deeper issue, which is hindsight. Obviously, the seeding committee felt, and predicted, that the 8-seed would beat the 9-seed. It was only after the fact that we found out 9 had beaten 8. Instead of denying the 9-8 upset, would it make more sense to ask if there was a seeding error?
Reference: "March Mildness", New York Times, March 17, 2007, p.D2.
I'm no Tufte or "data visualization" expert, but I would go as far as adding empty circles or something so it clearly shows that in some years ALL of the matchups were won by the underdogs. This would be much in the same vein as "non-events (things not happening) contain important information".
Posted by: Brent | Mar 21, 2007 at 01:01 PM
1) The left justification in each cell throws me off. Would having the dot(s) centered give a better view? As the eye flows down each column, the column would widen and thin like a river...
2) In the "mildly improved" chart, the Average row is very helpful. But this should have been done using dots, for example, 1 dot plus an eighth slice of a dot. It is a bit jarring to go from visual quantity (the space taken up by dots) to numeric (1.8 takes up as much visual space as 0.0). If the number is needed, it could go flush right in each cell.
3) There are a lot of cells in the background. I see vibrating intersections, just like in one of Tufte's illustrations. In a more detailed improved version, this should be fixed.
PS: Thanks for this blog -- it is very fun and informative!
Posted by: Patrick Murphy | Mar 21, 2007 at 03:38 PM
I like your improvements. One piece of information that would be worthwhile for less informed readers is that the maximum possible dots in a square is 4.
Cheers,
Tony
Posted by: Tony Kenck | Mar 22, 2007 at 03:48 AM
I think Patrick's point 1) is unnecessary; left justification ought to be enough. I think his point 3) is the real source of his inability to read down the column. Instead of a grid of cell gaps both horizontal and vertical, maybe eliminate the horizontals? This should preserve the vertical gaps between years, eliminate the visual vibration, and smooth the path of the eye down the column so that center justification is no longer necessary.
Ultimately, what we have here is a simple small multiple of eight bar charts. The NYT presentation, although fun, has obscured that simplicity a little. I don't criticise the dots for not being bars, as I approve of integer bar and column charts emphasising their discrete nature by using separate objects with a 1:1 aspect ratio, or at least, if solid bars, making the bar equal in width to a length of 1.0.
(I haven't seen this idea in the literature, but it's my corollary to Cleveland's thing about line graphs having an average slope of 45°)
Posted by: derek | Mar 22, 2007 at 06:38 AM
The correct reference for Tufte's Challenger analysis is not his first book, The Visual Display of Quantitative Information, but his third book, Visual Explanations (1997).
Posted by: Rosie Redfield | Mar 23, 2007 at 09:21 PM
When deciding whether a dot is an upset based on the number of dots in an average year, would the median dots per column be better than the mean, as means are skewed upward by exceptional years, when we want exceptional years to stand out from the average?
On the other hand, the mean does answer precisely the question "what are the chances of a game upsetting the seed order?". For 9-8 the answer is 50%, so an upset is a total non-surprise. For 10-7 the answer is 45%, so that's an almost total non-surprise. For 11-6 it's about 33%, so now we're getting into surprise territory.
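The arithmetic behind those percentages can be sketched as follows. This assumes four games per pairing per year (the maximum of 4 dots per square noted by Tony), and the mean values 2.0 and 1.8 are back-computed from the 50% and 45% quoted above rather than read off the chart; `upset_rate` is a hypothetical helper name.

```python
# Assumption: each seed pairing is played four times per tournament,
# once in each region, so 4 is the maximum number of upsets per cell.
GAMES_PER_PAIRING_PER_YEAR = 4


def upset_rate(mean_upsets_per_year, games=GAMES_PER_PAIRING_PER_YEAR):
    """Share of games in a pairing won by the lower-ranked seed."""
    return mean_upsets_per_year / games


# Means back-computed from the percentages in the comment (assumed values).
assumed_means = {"9-8": 2.0, "10-7": 1.8}

for pairing, mean_upsets in assumed_means.items():
    print(f"{pairing}: {upset_rate(mean_upsets):.0%}")
```

Because the denominator is fixed at 4, the mean and the percentage carry the same information; the percentage simply makes the "out of four" explicit.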
I usually deprecate percentages, as they obscure variable sums by normalising over sum, but here the sum (4.0) is not variable: should we therefore go for a percentage? It would also sneak in the meme that the total is always 4.0, as pointed out by Tony.
I agree that dots would be better than digits for the average (it's not like it will bust the column width, as totals might). That's a plus point for medians, as the fractional symbol, if needed, will always be a half, and not some more complicated object.
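A small sketch of the mean-versus-median point, using invented counts rather than the chart's data: one exceptional year pulls the mean up while the median stays put, and the median of whole-number counts is always a whole number or a half. The helper `is_whole_or_half` is a hypothetical name.

```python
from statistics import mean, median

# Hypothetical upset counts for one pairing across eight years:
# seven ordinary years plus one exceptional year. Illustrative only.
ordinary_years = [1, 1, 1, 1, 1, 1, 1]
with_exceptional_year = ordinary_years + [4]

print("mean:  ", mean(with_exceptional_year))    # pulled upward by the 4
print("median:", median(with_exceptional_year))  # unaffected by the 4


def is_whole_or_half(value):
    """True if value is an integer or an integer plus one half."""
    return (2 * value) == int(2 * value)


# With an even number of years, the median averages the two middle counts,
# so for integer counts it can only land on a whole number or a half.
samples = [[0, 1, 2, 3], [0, 0, 1, 4], [2, 2, 2, 2], [1, 2, 3]]
print(all(is_whole_or_half(median(s)) for s in samples))
```

This is why a median row could be drawn with whole dots and, at most, a half dot.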
Here's my version incorporating some of the suggestions in this comment thread: uses median; uses dots; removes horizontal gaps; includes empty circles. I think the empty circles make the chart too busy, so I made this version without them.
Posted by: derek | Mar 24, 2007 at 08:23 AM
I'd use the more conventional orientation and put time on the x-axis.
Posted by: Andrew Gelman | Mar 26, 2007 at 09:22 PM
derek:
Thanks for making the updated versions. I must rescind my suggestion. I agree it does look too busy.
Posted by: brent | Mar 27, 2007 at 07:48 PM
My problem with Derek's charts is that there is insufficient separation between years, and I have to think too much to consider year as a variable. I think I'd remove the gray background, space the rows more vertically, use solid dark circles for upsets and solid faint circles for non-upsets, and perhaps alternate between dark blue and dark gray for adjacent columns.
Posted by: Jon Peltier | Mar 27, 2007 at 09:19 PM
I agree with Jon about the dark and faint circles as a way of avoiding the cluttering that those loud open circles caused, but my technical skills failed when I thought about how to do that in an Excel table. That's not an excuse, as good graph design shouldn't be dictated by the technology immediately at hand.
Posted by: derek | Mar 29, 2007 at 07:07 AM