The Guardian (via Graphic News) has put out some fantastic infographics posters, so we can't say they are all bad. This is a big collection created in anticipation of the London Olympics. Here's one illustrating the 10,000m race: (link)
It's nice that they give an overview of the race, plus the calendar. The evolution of men's and women's times is shown on the same scale. To stress the improvement over time, they omitted the years in which the times did not improve (I think, although there are some mysterious omissions of data labels).
They have charts for all the different events, and also for water sports, gymnastics, etc.
PS. I do not know why the women's times were omitted from some of the charts (100m, 200m, etc.). In those charts, the lines for men would be better colored blue to align with the dots on the calendar.
Conventionally, the bracket in a sports tournament is presented like this (link):
In the Euro 2012 that's happening right now, the group stage is followed by the knockout stage (quarter-, semi- and final).
The knockout stage is pretty straightforward. The group stage presents some challenges because it's difficult to present the chronology together with the team standing at the same time.
The official site of Euro 2012 has an innovative "Tournament Map" that is an attempt to improve upon the traditional design. (link)
I have mixed feelings about this presentation. It's easier to get a sense of how each team performed chronologically over the course of the competition. But then, I can't figure out what day the winner of a quarterfinal would play in the semifinal.
ESPN Magazine issued a special analytics edition to ride the Moneyball bandwagon. In an article talking about the disappearing midrange jump shot from college basketball, they put out this chart:
In the caption of the chart, the key conclusion is: "As you can see, threes reign outside the lane." Well, we must be blind, since that conclusion is very difficult to draw from what we see. A number of reasons contribute to this failure:
In a chart like this, the reader is cued to the length of the arcs. But the arcs related to three-pointers are all of medium length -- they don't stand out, exactly the opposite of what the caption is saying.
It's impossible to interpret the scale of the chart. Compare the blue line on the left (Missouri around-the-basket attempts) and the yellow line in the middle (Kentucky three-point attempts). They both say 357, but the lines are clearly of different lengths.
The analyst is attempting to make a general statement about "college hoops" while the data being presented are from six specific teams. This means that readers are spending time digesting the variability between schools rather than understanding the commonality across schools.
The problem of this type of "racetrack graph" has been discussed here before (see here or here). By using ellipses rather than circles, this chart makes things worse. Now, we can't even imagine where the center of the circle is to judge the angles.
The six schools are not all the same in terms of their shot selection. In particular, California is the exception to the rule. Also, Missouri and to some extent Syracuse are extreme examples where their players attempt about the same number of three-pointers as around-the-basket shots. In our Trifecta checkup (explanation), this means the data used on the chart is out of sync with the key question being addressed. No amount of graphical wizardry can fix this problem.
The new chart uses much more sensible units: attempts per game. The original chart shows total attempts in games played up to the day the chart was prepared. To make matters worse, the designer did not disclose anywhere what day that was, or how many games were included. By comparing against season-end statistics (34 total games), it appears to me that the data being plotted are the total attempts in the first 22 games (up to the end of January). No reader can interpret total attempts in the first 22 games. I just divided each number by 22; for anyone who follows basketball, this unit is much more interpretable.
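The conversion is trivial but worth making explicit. A minimal sketch, using the 357 figure mentioned above and placeholder totals for the other counts (these are not the published data):

```python
# Convert season-to-date attempt totals into per-game attempts.
# GAMES_PLAYED reflects the estimate above (22 games through end of January).
GAMES_PLAYED = 22

# Illustrative totals only; 357 appears on the original chart, the rest are made up.
totals = {"Missouri": 357, "Kentucky": 357, "Syracuse": 340}

per_game = {team: round(n / GAMES_PLAYED, 1) for team, n in totals.items()}
print(per_game)  # e.g. Missouri: 16.2 attempts per game
```

A figure like "16.2 attempts per game" is immediately meaningful to a basketball fan in a way that "357 attempts through some undisclosed date" is not.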
What determined the order of the six schools being plotted? Your guess is as good as mine. In our version, I sorted the schools by the ratio of three-pointers to midrange jump shots. So Missouri and Syracuse came out top because they focus so heavily on three-pointers at the expense of midrange shots. At the other extreme, California uses both types of shots in about equal proportions.
When we call something a "pretty picture", what do we mean?
Based on the evidence out there, it would seem like "pretty" means one or more of the following:
unusual: not your Grandma's bar chart or line chart
visually appealing: say, have irregular shapes, lots of colors, curved lines and so on
complex: if you don't get the point right away, the chart must be smart, and must contain a lot of information
data-rich: a variant of complex
I pondered that question while staring at this chart, reprinted in the NYT Magazine, in which they pitched a new book by Craig Robinson called "Flip Flop Fly Ball". According to the editors, the book is a "beautiful, number-crunched (sic) combination of statistical and graphic-design geekery". So here's Exhibit A:
This chart is supposed to tell us whether big payroll equals success in Major League Baseball, and success is measured variously by making the playoffs, making the championship series or winning the championship. It nicely uses a relatively long time horizon of 15 years.
The problem: how are we supposed to learn the answer to the question?
To learn it, we have to go through these steps:
Read the fine print under the title that tells us the vertical scale is the rank by payroll, so within each season, the top spender is at the top, and the bottom spender at the bottom. (Strictly speaking, there are 15 different scales, see discussion below.)
Figure out that the black row has all of the championship teams aligned at the same vertical level.
Realize that the more teams that are listed below the black line, the bigger the payroll of the championship team in that season.
Alternatively, the more teams that are found above the black line, the smaller the payroll of the winning team that year.
From that, we see that for almost every season in the last 15 years, the winner comes from a relatively free-spending team. Florida in 2003 is a big outlier.
Maybe that isn't too bad. Now, try to interpret the blue boxes, which mark all the playoff teams in every season. Are playoff teams also bigger spenders than non-playoff teams?
To learn this, try the following step:
Ignore the relative height of the columns from season to season, and focus only on the relative positions of the blue slots within each column.
Are these blue slots more likely to be crowded towards the top of the column than the bottom?
The answer should be obvious but why does it feel so hard?
You may be confused by the vertical scale. Is it the case that in 2003, the entire league decided to splurge on spending? Does the protruding tower in 2003 indicate especially high payrolls?
No, it doesn't. It turns out there are really 15 separate vertical scales on this one chart; each column has to be viewed separately. There is a ranking within each column but the relative height from one column to the next means nothing. Each column is hinged to the black row which is the rank by payroll of the championship team in that season.
The decision to anchor the columns in this way is what dooms this chart. In the junkart version below, I reversed this decision and ended up with a much clearer picture:
It's now clear that almost all the playoff teams come from the top quartile or top third of the table in terms of payroll. In more recent years, the correlation between spending and success seems less assured, perhaps partly a result of the analytics revolution, as nicely portrayed in Moneyball. It is still true that any team in the bottom third of the payroll scale has little chance of making the playoffs; however, once a smaller-payroll team makes the playoffs, it appears to do well: in three of the last four seasons, a small-payroll team has made the finals.
Note that I grayed out the four cells at the bottom left. There were only 28 teams before 1997. I also removed the names of the teams that didn't make the playoffs, which serve no purpose in a chart like this.
Those are the descriptive statistics; it's really hard to draw robust conclusions from such data. You can say it's harder for small-payroll teams to perform consistently well in the regular season but easier in a short playoff series, so in a sense we are looking at luck, not skill.
But could it be that those small-payroll teams, given that they made the playoffs, must have had some unusual success that season, perhaps because they discovered some young talent that cost next to nothing? If so, the fact that they made the playoffs despite the smaller payroll is a good predictor that they would do well in the playoffs.
The other important issue to realize is that by plotting the rank of payroll, rather than true payroll, the scale of payroll differences has been taken out of the picture. The team listed at the median rank most likely spent much less than half of the team listed at the top of the table. If you grab the actual payroll amounts, there is much more you can do to display this data.
Too much art, not enough science. (See this post.)
I wish the designer had lost some of the data. The graphic would stand a better chance of succeeding if the unimportant bits were not shown, or were faded out. Giving every piece of information equal status, whether it's a shot on goal or a dribble, is another way to distort information: it downplays the important information while overstressing the filler material.
Colors shouldn't be assigned at random. They should surface patterns. Make the Barca data visually distinct from Man U's data. Similarly, unify the numerous statistics on goal-keeping.
A more subtle misstep is mixing up whites and blanks. According to the legend, blank means the ball was out of play while white means offside (for either team), but readers can't tell the two apart: the whitespace looks like gaps.
For me, this is a lost opportunity. Visual exploration of data is a very powerful concept; it can guide further analysis and even guide the construction of mathematical models. But the visual has to help organize the information. Here it didn't.
The Trifecta checkup requires us to align all three aspects to make a great chart. It is sometimes the case that a wise choice has been made regarding the type of chart, but the other elements are missing. Reader Parker S. sent in an example of such a chart.
This chart created by ESPN illustrates the evolution of the "power ranking" of the San Diego Chargers football team within each 18-week-long season and across multiple years.
The bumps chart was invented for exactly this type of ranking-over-time data, and in fact we are looking at a bumps chart.
But with lots of distractions: the multiple colors (instead of year labels), the dots, the legends, the year selector, and no foregrounding of the current season.
***
Parker couldn't figure out the practical question this chart is supposed to answer (the top corner of the Trifecta).
It seems to me that the more interesting question is how different teams fare from week to week within a given season, rather than how one team fared from week to week over consecutive seasons.
In fact, one of the secrets of the Bumps chart -- the reason why it feels far less cluttered than it has the right to be -- is that no two data points will overlap, that is, for any given week, only one team occupies any particular rank. This simple rule is violated when the same team's rank across multiple seasons is plotted, and thus the chart feels very busy.
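That no-overlap property is easy to state precisely: within any single week, the ranks form a permutation, so no two lines can occupy the same point. A minimal sketch with invented rank data:

```python
# In a single-season bumps chart, each week's ranks are a permutation:
# no two teams share a rank, so lines never collide at any time point.
# Teams and ranks below are invented for illustration.
weekly_ranks = {
    "week1": {"SD": 1, "OAK": 2, "DEN": 3, "KC": 4},
    "week2": {"SD": 2, "OAK": 1, "DEN": 4, "KC": 3},
}

def ranks_are_unique(week):
    """True if no rank is shared by two teams in this week."""
    ranks = list(week.values())
    return len(ranks) == len(set(ranks))

print(all(ranks_are_unique(w) for w in weekly_ranks.values()))
```

Plot one team across multiple seasons instead, and the guarantee disappears: the same rank can recur in the same week of different years, and the lines tangle.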
It proved impossible to find a source of ESPN power rankings that includes all teams for a given season. However, I found something similar at CBS Sportsline, a competitor. Here is their version of the ranking chart:
They got the practical question right but severely under-utilized the form. We can see how the Chargers season is going but have no ability to compare them to other teams.
We can start with the question of visualizing how the Chargers and their AFC West compatriots are doing relative to the rest of the league:
The AFC West is a mediocre division this season, with all four teams in the middle of the pack and none in the top quarter of the table. The Chargers started high, plunged, and are recovering, while the Oakland Raiders have improved over the course of the season.
The Bumps chart is more powerful when the full set of data is plotted, and when the lines are highlighted with reference to the question being answered. Are AFC teams or NFC teams doing better?
The next one highlights the teams that earned the largest change in ranking from week 1 to week 10. The background (gray lines) consists of those teams whose rankings in Week 10 were within 5 places of their initial rankings.
The practical question might be whether Week 1 rankings are a good predictor of Week 10 rankings. The following chart shows that most teams in the top quartile remain there (except San Diego which is coming back, and Dallas which could be coming back too), the bottom-quartile teams also tend to remain there, while not surprisingly, the middle teams don't tend to stay in the middle. The color scheme should be reversed if one wants to highlight the dispersion of the rankings of these middle teams by Week 10.
I look at a fair number of online videos, especially those embedded on blogs. But I haven't seen this feature implemented broadly. It is a wow feature.
Look at the dots above the progress bar: they tell you what topic is being discussed and allow you to jump back and forth between segments. (the particular dot I moused over said "Randy Moss") The video I saw came from this link.
This simple-looking feature is immensely useful to users. You can efficiently search through the audio file and find the segments you're interested in. It's like bookmarks students might put on pages of a textbook for easy reference, except these are audio bookmarks.
Why isn't this feature more prevalent? I think it's because of the amount of manual effort needed to set it up. Imagine how the data has to be processed. In the digital age, the audio file is a bunch of bits (ones and zeroes), so no computer or human will be able to identify topics from data stored in that way. Someone would need to listen to the audio file, mark off the segments manually, and tag them. Then the audio bookmarks can be plotted on the progress bar: basically a dot plot with time on the horizontal axis.
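Once the manual tagging is done, the data structure behind the feature is simple: a list of (start time, topic) pairs, with a lookup to find the segment containing any playback position. A sketch, with invented timestamps and topics (the "Randy Moss" label comes from the video above):

```python
# Audio bookmarks as (start_second, topic) pairs -- effectively a dot plot
# along the progress bar. Times and segment boundaries are invented.
import bisect

bookmarks = [(0, "Intro"), (95, "Randy Moss"), (210, "Playoff picture")]
starts = [t for t, _ in bookmarks]  # sorted segment start times

def topic_at(second):
    """Return the topic being discussed at a given playback time."""
    i = bisect.bisect_right(starts, second) - 1
    return bookmarks[i][1]

print(topic_at(100))
```

Hovering over a dot is just `topic_at` run against the dot's position; clicking it seeks the player to the corresponding start time.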
In theory, you can train a computer to listen to an audio file and approximate this task. The challenge is to attain the required accuracy so you don't need to hire an army of people to correct mistakes.
A very simple concept but immensely functional. Great job!
Reader Joran recalled our feature of Tour de France bumps charts, made then by Kraig, and he decided to make his own for this year's tour. (He typically blogs about Nordic skiing.)
Here are some highlights:
You'd notice a similar pattern in 2010 as in 2007. The yellow jersey pretty much stays at the front of the pack throughout... the green jersey (sprints) eventually fades away, while the polka-dot jersey (mountains) improves as the tour continues.
From the design perspective, one decision concerns whether the colored lines track the jersey or the current owner of the jersey. Over the course of the tour, jerseys change owners, possibly multiple times. What to do?
Notice that the top of the chart slopes downwards, and that is due to withdrawals of riders during the course of the race.
In the second chart, Joran brings this out by tracking each withdrawn rider until the stage at which they dropped out, and we can see the ranks they held when they faltered.
This shows good use of foreground/background to bring out aspects of the data. In the original post, when you mouse on the red dots, a label appears showing the name of the rider.
In this next chart, a small multiples format is adopted, with the riders from each team plotted together and each team in a separate plot. This allows us to see the relative performance easily. Joran tried using one plot, and many colors -- and not surprisingly, discovered that the resulting chart is unreadable. The small multiples format is a solution to this problem.
As someone not too familiar with the race, I find the high variance of the ranking within each team to be unexpected. Can't explain why this would be. In particular, even when a team (Saxobank) has a highly ranked cyclist, it's interesting that the other members of the team are much lower ranked. I thought that team members try to cluster together and protect the team leader. Well, you may be able to make more sense out of this than I can.
I think these charts are ordered alphabetically by the name of the team; I'd order them by the rank of each team's leading cyclist.
Another improvement is to label the stages as Mountain vs. Sprint. This can be done by coloring the column for the respective stage... sort of like those economic charts where they color the periods of recession. This helps explain what we are seeing, why some riders achieve drastic improvements (or reductions) in ranks over some stages.
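The recession-band idea amounts to computing which consecutive stages share a type and shading those spans. A sketch with a hypothetical stage list (the helper name and data are mine, not Joran's):

```python
# Find the stage spans to shade, like recession bands on an economics chart.
# Stage classifications below are hypothetical.
stage_types = ["flat", "flat", "mountain", "mountain", "sprint", "mountain"]

def shaded_spans(types, kind="mountain"):
    """Return (start, end) index pairs for consecutive runs of one stage type."""
    spans, start = [], None
    for i, t in enumerate(types):
        if t == kind and start is None:
            start = i                      # a run of `kind` begins
        elif t != kind and start is not None:
            spans.append((start, i))       # the run just ended
            start = None
    if start is not None:
        spans.append((start, len(types)))  # run extends to the last stage
    return spans

print(shaded_spans(stage_types))
```

Each pair could then be passed to something like matplotlib's `axvspan` to tint the background behind those stages, making the mountain-stage rank swings jump out.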
What is clear is that having domain knowledge is an important asset to making good charts. Research is key. This is something Joran also realized, and it's useful to read his commentary about the issues of interpreting the data, being able to recognize typos, etc.