The New York Times graphics team shows us how to do an infographics poster the right way. They recently put up a feature showing how the repeal of helmet laws is linked to rising motorcycle fatalities. The graphic is here.
One of the key charts is this one (second to last screen):
The graphic tells the story; no additional words are needed. (Actually, you'd have to come from the prior page to know that the white vertical line represents the year in which Florida repealed its helmet law.)
Of course, one state does not prove a trend. It appears that other states face the same situation. It would be nicer if they could start this next chart at an earlier time.
I'm surprised by how much these lines fluctuate given that the raw counts are in the hundreds.
I wonder if there is any active debate in Florida or elsewhere as it would appear that the helmet law repeal may have caused hundreds of unnecessary deaths. Have people been coming up with other explanations for the sharp rise in motorcycle fatalities involving those not wearing helmets?
On Twitter, Andy C. (@AnkoNako) asked me to look at this pretty creation at NFL.com (link).
There is a reason why you don't read much about spider charts (web charts, radar charts, etc.) here. While this chart is beautifully constructed, and fun to play with, it just doesn't work as a vehicle for communication.
The example above allows us to compare four players (here, quarterbacks) on eight metrics. Each white polygon represents one player, and the orange outline represents the league average quarterback.
What are some of the questions one might have about comparing quarterbacks?
Who is the best quarterback, and who is the worst?
Who is the better passer? (ignoring other skills, like rushing ability)
Is each quarterback better or worse than the average quarterback?
How will you figure these out from the spider chart?
Not sure. The relative value of the quarterbacks is definitely not encoded in the shape of the polygon, nor the area. To really figure this out, you'd need to look at each of the eight spokes independently, and then aggregate the comparisons in your head. Unless... you are willing to ignore seven of the eight metrics, and just look at passer rating (below right).
Focusing on passing only means focusing on five of the eight metrics, from pass attempts to interceptions. How you combine five metrics into one evaluation is anyone's guess.
One can tell that Joe Flacco is basically the average quarterback as his contour is almost exactly that of the average (orange outline). Are the others better or worse than average? Hard to tell at first glance.
First, the chart invites users to place equal emphasis on each of the eight dimensions. (There is a control to remove dimensions.) But the metrics are clearly not equally important. You certainly should value passing yards more than rushing yards, for example.
Second, the chart ignores the correlation between these eight metrics. The easiest way to see this is the "Passer Rating", which is a formula comprising the Passing Attempts, Passing Completions, Interceptions, Touchdown Passes, and Passing Yards. Yes, all those five components have been separately plotted. Another easy way to see the problem is that Passing Yards are highly correlated with Passing Attempts or Passing Completions.
Third, the chart fails to account for different types of quarterbacks. I deliberately chose these four because Joe Flacco was a starter, Tyrod Taylor was a backup who almost never played, while at San Francisco, Alex Smith and Colin Kaepernick shared the starting duties. So for Passing Yards, the numbers were 3817, 179, 1737 and 1814 respectively. Those numbers should not be directly compared. Better statistics would be something like yards per minute played, yards per offensive series, yards per play executed, etc. The way that this data is used here, all the second- and third-string quarterbacks will be below average and most of the starters will be above average.
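To make the redundancy in the second point concrete, here is a minimal sketch of the passer rating calculation, following the NFL formula as I understand it. The function name is mine and the inputs in the usage note are made up; nothing here comes from the NFL.com chart itself.

```python
def nfl_passer_rating(attempts, completions, yards, touchdowns, interceptions):
    """Passer rating computed from five season totals.

    Each of the four components is clamped to the range [0, 2.375],
    so the result always falls between 0 and roughly 158.3.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")

    def clamp(x):
        return max(0.0, min(x, 2.375))

    completion_part = clamp((completions / attempts - 0.3) * 5)
    yardage_part = clamp((yards / attempts - 3) * 0.25)
    touchdown_part = clamp((touchdowns / attempts) * 20)
    interception_part = clamp(2.375 - (interceptions / attempts) * 25)

    total = completion_part + yardage_part + touchdown_part + interception_part
    return total / 6 * 100
```

Calling `nfl_passer_rating(500, 300, 3500, 20, 10)` with invented totals returns a single number, and that number is fully determined by the five inputs. Plotting the rating as its own spoke next to those five adds no new information to the polygon.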
From a design perspective, there are a small number of misses.
Mysteriously, the legend always has only two colors no matter how many players are being compared. The orange is labeled Average while the white is labeled "Leader". I have no idea why any of the players should be considered the "Leader".
The only way to know which white polygon represents which player is to hover on the polygon itself. You'll notice that in my example, several of those polygons overlap substantially so sometimes, hovering is not a task easily accomplished.
The last issue is scale. It turns out that some of the metrics, such as interceptions, touchdown passes and rushing yards, can be zero. Take a look at this subset of the chart where I hovered on Tyrod Taylor.
Do you see the problem? The zero point is definitely not the center of the circle. This problem exists for any circular chart, such as bubble charts.
Now look at Interceptions. Because the scale is reversed (lower is better), the zero point of this metric will lie on the outer edge of the circle. This is a vexing issue because the radius is open-ended on the outside but closed-ended on the inside.
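I don't know how NFL.com actually maps values to radii, so the following is only a sketch under my own assumptions: a linear scale, an inner offset so that small values don't collapse into the center, and a `reverse` flag for lower-is-better metrics. All names and defaults are hypothetical.

```python
def radial_position(value, lo, hi, inner=0.2, outer=1.0, reverse=False):
    """Map a metric value in [lo, hi] to a radius in [inner, outer].

    inner > 0 means the lowest value sits away from the center.
    reverse=True flips the scale, for metrics where lower is better.
    """
    if hi == lo:
        raise ValueError("lo and hi must differ")
    fraction = (value - lo) / (hi - lo)
    if reverse:
        fraction = 1.0 - fraction
    return inner + fraction * (outer - inner)
```

Under these assumptions, zero touchdown passes lands at radius 0.2 rather than at the center, while zero interceptions with `reverse=True` lands on the outer rim at radius 1.0. The same value of zero ends up at opposite ends of two neighboring spokes.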
In the next post, I will discuss some alternative presentations of this data.
Dona Wong asked me to comment on a project by the New York Fed visualizing funding and expenditure at NY and NJ schools. The link to the charts is here. You have to click through to see the animation.
Here are my comments:
I like the "Takeaways" section up front, which uses words to tell readers what to look for in the charts to follow.
I like the stutter steps that are inserted into the animation. This gives me time to process the data. The point of these dynamic maps is to showcase the changes in the data over time.
I really, really want to click on the green boxes (the legend) and have the corresponding school districts highlighted. In other words, turning the legend into something functional. Tool developers, please take notes!
The other options on the map are federal, state and local shares of funding, given in proportions. These are controlled by the three buttons above. This is a design decision that privileges showing how federal funds are distributed across districts and across time. The tradeoff is that it's harder to comprehend the mix of sources of funds within each district over time.
I usually like to flip back and forth between actual values and relative values. I find that both perspectives provide information. Here, I'd like to see dollars and proportions.
I also find the line charts to be much clearer but the maps are more engaging. Here is an example of the line chart: (the blue dashed line is the New York state average)
After looking at these charts, I also want to see a bivariate analysis. How is funding per student and expenditure per student related?
At the NY Tech Meetup, Andrei Scheinkman showed off some work his team at Huffington Post did relating to gun violence in America.
Interactive version is here. The animation shows day by day, where the victims of gun violence were located. The table below contains the details of each victim, and links to the news story covering the event.
What is not seen on the chart is even more impressive. Andrei described how they looked around for databases that would provide them the raw materials for creating this chart but no timely source exists. This means that a team of 15 (if I heard correctly) spent a month or so manually collecting all the data on a spreadsheet.
It's also the reason why they cannot continue the map indefinitely, as people have other things to do.
Andrei also contrasted this visualization with a text article that describes the state of gun violence in words. You guessed it, the visual presentation is hands-down more compelling.
There is a tendency when producing dashboards to go for the cutesy-cutesy. Reader Daniel L. came across an attempt by Facebook to document its data center metrics (link). They chose this circular, spiraling design:
Notice that the lines of equal distance on a circular plot are the concentric circles. Thus, when they connect different points in a continuous way, as if it were a standard line chart, the line segments between data points are distorted. The diagram below shows the problem:
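The size of the distortion can also be checked numerically. This is a small sketch with function names of my own choosing; it compares the radius at the midpoint of a straight segment with the radius the data would imply if the value changed linearly between the two readings.

```python
import math

def chord_midpoint_radius(r1, theta1, r2, theta2):
    """Distance from the origin to the midpoint of the straight segment
    joining two points given in polar coordinates (angles in radians)."""
    x = (r1 * math.cos(theta1) + r2 * math.cos(theta2)) / 2
    y = (r1 * math.sin(theta1) + r2 * math.sin(theta2)) / 2
    return math.hypot(x, y)

def interpolated_radius(r1, r2):
    """Radius halfway between two readings if the value changes linearly."""
    return (r1 + r2) / 2

# Two equal readings of 1.0, a quarter turn apart.
drawn = chord_midpoint_radius(1.0, 0.0, 1.0, math.pi / 2)   # about 0.707
implied = interpolated_radius(1.0, 1.0)                     # exactly 1.0
```

With two identical readings a quarter turn apart, the straight segment dips to about 71% of the true value at its midpoint, even though the data never moved. The wider the angular gap between data points, the worse the dip.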
One potential advantage (although not worthwhile) of wrapping the data into a circle is that the 24 hours become a continuous line. Except that it isn't the case here! Weirdly, the purple and blue lines show a huge discontinuity at the ray that points vertically upwards from the origin. This leads to an even more fascinating find.
The circle actually rotates! It's like a rotating restaurant. The time shown vertically pointing upwards keeps changing as I write this post. This makes the discontinuity even more baffling. You'd think the previous data point just shifts anti-clockwise but apparently not. If any of you can figure this out, please leave a comment.
As Daniel pointed out, the traditional line charts shown in the bottom half of the page would have done the job with less fuss. Not as eye-catching, but not as baffling either.
One innovation of on-line charts is the replacement of axis labels with mouse-over effects. Mousing over the chart here produces the underlying data values. This is elegance.
One horrible trend with on-line charts is the horrendous choice of scale. Look at the top two charts, especially the orange line chart about power usage. It makes no sense to choose a scale that completely annihilates the underlying fluctuations.
I have found the same problems with many Google charts. It looks as if nothing is happening except when you look more closely, you learn that a tiny distance represents a big percentage shift in the underlying data.
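Here is a rough sketch of the arithmetic behind this complaint. The readings are invented, and the helper names are mine; the point is only to show how little of the plot height the data occupies when the axis is anchored far from the data, and what a data-driven range would look like instead.

```python
def share_of_axis_used(values):
    """Fraction of the plot height the data spans when the axis runs
    from zero to the maximum value (assumes positive values)."""
    return (max(values) - min(values)) / max(values)

def padded_limits(values, pad=0.1):
    """Axis limits that hug the data, with a margin of `pad` times the
    data range on each side."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Flat series: fall back to a margin based on the level itself.
        span = abs(hi) or 1.0
    return lo - pad * span, hi + pad * span

readings = [100.0, 102.0, 101.0]          # hypothetical readings
used = share_of_axis_used(readings)       # under 2% of the plot height
limits = padded_limits(readings)          # roughly (99.8, 102.2)
```

Whether to anchor an axis at zero is a judgment call that depends on the chart type, but for a line chart meant to show fluctuations, squeezing them into a sliver of the plot defeats the purpose.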
Josh hated this "dataless visualization" from ABC. (link; warning: ads). Here are his comments:
The report has planes leaving China, landing across the globe and
instantly infecting us all with bird flu. It doesn't do a good job
explaining how and the rate pandemics actually spread. However, it does
do a good job scaring us all.
The entire flu pandemic theater is unscientific. It is based on the 100-year flood type of argument, with scientists claiming that we are "overdue" for some catastrophe. Reminds me of earthquake forecasting, covered by Nate Silver in his book. It is possible to predict the average frequency of rare natural disasters, but virtually impossible to predict their timing.
The 100-year flood type of calculation is based on averaging a small number of events over a very long time scale. There is no reason why these events should be spread out evenly over time (i.e. one event every 100 years).
This is the fallacy of the "law of small numbers": if one throws a fair coin 10 times, one shouldn't expect exactly 5 heads, as the distribution of heads should look like the chart on the right. The chance of exactly 5 heads is only about 25%.
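Both calculations are easy to check. The coin-toss figure is a plain binomial probability. For the 100-year event, I am assuming, purely for illustration, that events arrive independently at a constant average rate (a Poisson model); real floods or pandemics need not behave that way. The function names are mine.

```python
import math

def prob_heads(n_tosses, n_heads, p=0.5):
    """Binomial probability of exactly n_heads heads in n_tosses tosses."""
    return math.comb(n_tosses, n_heads) * p**n_heads * (1 - p)**(n_tosses - n_heads)

def prob_event_count(k, rate_per_year, years):
    """Poisson probability of exactly k events in a window of `years`,
    assuming independent events at a constant average rate."""
    mean = rate_per_year * years
    return math.exp(-mean) * mean**k / math.factorial(k)

five_heads = prob_heads(10, 5)                 # 252/1024, about 0.246
none = prob_event_count(0, 1 / 100, 100)       # about 0.37
one = prob_event_count(1, 1 / 100, 100)        # about 0.37
two_or_more = 1 - none - one                   # about 0.26
```

Under that Poisson assumption, a "once in 100 years" event happens exactly once in a given century only about 37% of the time. Just as often it does not happen at all, and about a quarter of the time it happens twice or more. Nothing in the model makes an event "overdue".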
Also, doctors keep me honest but I believe only one type of mutation, i.e. the one that makes the virus able to pass from human to human, has a chance of causing a pandemic. So it is wrong to say that "if the virus mutates," a pandemic will result. In addition, in the past, some viruses were able to pass from human to human but the rate of infection was not fast enough, and they failed to lead to a pandemic.
Daniel L. did not like the map shown below, from a research article on female mortality rate in the U.S., via Jezebel.
I was amused by what the blogger at Jezebel was able to take in from the map. Her post started with a huge version of the map, under which she said:
Mortality rates are rising in 43% of U.S. counties, as illustrated by this map from health researcher Bill Gardner.
Mortality rate is a statistic about the population. The map is an illustration of geographical area (distorted by the map projection). The map carries no information about population at all. Thus, it is not the right chart to display population data.
The statistic itself is poorly chosen. What does 43% of counties mean? Some counties have few people while others are very densely populated. New York County is barely visible on this map yet it has the heaviest weight on the average.
According to the CDC data, the death rate, age-adjusted, for women has been decreasing over time. So, the backward motion in those 43% of counties is somehow compensated for by forward progress in the other 57% of the counties, it appears.
Maybe the average for the whole country masks some local patterns. The cited map doesn't help because it assumes that the importance of the mortality rate is proportional to the geographical size of the county, when the right comparison should be the population of women in the county.
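A toy example shows how far the two ways of counting can diverge. The seven counties below are entirely made up, as are the field names; the sketch only contrasts "share of counties" with "share of women living in those counties".

```python
def share_of_counties_rising(counties):
    """Fraction of counties where the mortality rate went up."""
    rising = sum(1 for c in counties if c["rate_change"] > 0)
    return rising / len(counties)

def share_of_women_in_rising_counties(counties):
    """Fraction of all women who live in a county where the rate went up."""
    total = sum(c["women"] for c in counties)
    rising = sum(c["women"] for c in counties if c["rate_change"] > 0)
    return rising / total

# Hypothetical data: three small rural counties getting worse,
# four larger counties getting better.
counties = [
    {"women": 10_000, "rate_change": +0.02},
    {"women": 10_000, "rate_change": +0.01},
    {"women": 10_000, "rate_change": +0.03},
    {"women": 50_000, "rate_change": -0.01},
    {"women": 100_000, "rate_change": -0.02},
    {"women": 200_000, "rate_change": -0.01},
    {"women": 800_000, "rate_change": -0.03},
]

by_county = share_of_counties_rising(counties)              # 3 of 7
by_women = share_of_women_in_rising_counties(counties)      # 30,000 of 1,180,000
```

In this invented case, 43% of counties show a rising rate, yet fewer than 3% of women live in them. The real figures will differ, but the county count alone cannot tell us which situation we are in.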
Just as we don't always need a map to do justice to geographic data, we don't always need animation to convey time-varying data.
Some examples of good visualizations of time-series information without moving parts include the "horse-race" charts on the Presidential election (link), and the NY Times plot of Olympic race times (link).
Reader Jeff Cole did some nice Web charts (link) that show NFL matches as horse races. The look of these charts is exceptionally clean and easy to understand. They tell us which matches were blow-outs, and which ones were close, and which ones were tales of two halves.
This Green Bay-New England match looked like a thriller, with five lead changes, and a final score gap of only 4 points. Green Bay scored pretty much regularly through the four periods while New England had two relatively long droughts in scoring.
I'm not sure if the Margin of Victory section is worthwhile. It seems redundant to me. The labeling can be better, showing that New England leads are above the zero reference line while Green Bay leads are below the line. I'd consider making this a cumulative amount of time in the lead up to a specific point of the match. That would give an extra piece of information that is difficult to grasp from the Game Scoring chart.
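The cumulative time-in-lead idea might be computed along these lines. This is a sketch of my suggestion, not of Jeff's code: the event format `(minute, team, points)`, the function name, and the sample game are all assumptions.

```python
def time_in_lead(events, home, away, up_to=60.0):
    """Minutes each team has spent ahead, and minutes tied, from kickoff
    up to the game-clock minute `up_to`.

    events: iterable of (minute, team, points) scoring plays.
    """
    score = {home: 0, away: 0}
    minutes = {home: 0.0, away: 0.0, "tied": 0.0}
    clock = 0.0
    timeline = sorted(e for e in events if e[0] <= up_to)
    # A final sentinel closes out the interval ending at `up_to`.
    for minute, team, points in timeline + [(up_to, None, 0)]:
        if score[home] > score[away]:
            leader = home
        elif score[away] > score[home]:
            leader = away
        else:
            leader = "tied"
        # Credit the time since the previous event to whoever led during it.
        minutes[leader] += minute - clock
        clock = minute
        if team is not None:
            score[team] += points
    return minutes

# Hypothetical game between teams "A" and "B".
plays = [(5, "A", 7), (20, "B", 3), (40, "B", 7), (55, "A", 3)]
full_game = time_in_lead(plays, "A", "B")            # A: 35, B: 15, tied: 10
at_minute_30 = time_in_lead(plays, "A", "B", 30.0)   # A: 25, B: 0, tied: 5
```

Evaluating the function at each minute of the game would give the running totals to plot, in place of the instantaneous margin.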
This was an exciting match for a different reason. Dallas was always ahead and eventually won by three points. But it was a shootout in which each team scored regularly. Dallas had a scoring drought for most of the second half which allowed Washington to get even but then scored a field goal before the final whistle to come out on top. I didn't watch the game but this chart tells me all of that.
On the Game Scoring chart, I'd add vertical tickmarks to help read the intermediate scores. Also maybe put dots to highlight the crossover points, which are where lead changes occur.
NFL is a complex game that is difficult to fully capture in a simple chart like this. It would be nice to add some extra event indicators, such as interceptions, fumbles, and other turning points. That's easier said than done, especially when trying to automate the graph production.
The accomplished graphics team at NYT outdid themselves with this feature on the 100m dash through Olympic history (link). You should really go and check out the full presentation.
They start with a data table like the one shown on the right. It's a boring list of names and winning times by year and by medal type. What can one do to animate this data? The NYT team found many ways.
The presentation consists of a static dot plot plus a short movie.
They found many ways to convey the meaning of the tenths and hundredths of a second that separate the top performers. In the dot plot, for example, they did not draw the actual winning times. Instead, they converted the differences in winning times into distances. Here is the right section of the chart:
We are drawn into compressing time and place, having Usain Bolt race all of the former winners and assuming everyone ran the same race they did in real life. The dot plot tells us how far ahead of each past winner Bolt is.
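I don't know exactly how the NYT team did the conversion, but one plausible version assumes each runner moves at a constant average speed over the whole race. The sketch below uses that assumption; the function name is mine, and the two times in the usage note are illustrative inputs that should be checked against the data table.

```python
def meters_behind(winner_time, other_time, race_length=100.0):
    """How far the slower runner is from the finish line at the moment
    the winner crosses it, assuming constant average speed throughout."""
    average_speed = race_length / other_time
    return race_length - average_speed * winner_time

# Illustrative inputs: a 9.63-second winner against a 12.0-second runner.
gap = meters_behind(9.63, 12.0)   # 19.75 meters
```

A gap of 2.37 seconds is hard to picture, while a gap of almost 20 meters on a 100-meter track is not. Sprinters do not in fact run at constant speed, so a more careful conversion would use split times if they were available.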
Some time ago, I wrote about the "audiolization" of duration data, in another piece about a NYT chart (link). They deployed this strategy beautifully at the end of the short film. The runners were aligned like keys on a piano, and the resulting sound is like playing a scale across the keyboard. Lovely, that is to say.
The authors bring in a number of other data points to create reference points for understanding this data. For example, if you blink, you might miss the national jerseys worn by each winner in the hypothetical competition:
Later, the dominance of American runners is plainly shown via white lanes:
The perspective hides the relative impotency of American sprinters in recent Olympics. This view of the surge of Caribbean runners makes up for it:
Next, they compared the times for U.S. age group record holders to Olympic winning times. This is a fun way to look at the data. (Pardon the strutting Play button.)
They play with foreground/background here in an effective way. The 15- and 16-year-old age-group record holder is said to be "good enough for a bronze as recently as 1980".
Fun aside, think twice before you repeat this "insight". It falls into the category of those things that sound impressive but are quite meaningless. For one thing, the gap between the two runners is affected by a multitude of factors: the age of the runner (which is elevated here over and above other factors), the nationality of the runner, and the time of the run. This last point is key: if we compare the 15-to-16-year-old 100m record time from 1980 to the winning times of Olympic medalists from that year, the gap would be much wider.
Also, pay attention to the distribution of runners. It gets very crowded very quickly near the top end of the scale. In other words, while the gap as measured in part-seconds may seem small, the gap as measured in individual athletes would be very wide -- we'd find loads of athletes whose times fit into the gap illustrated here.
According to the dot plot, in some years, like the 1950s, there were no gold medalists. Looking at the data here, I think this is an overplotting effect, where two times were so close that the dots were literally on top of each other. This creates the situation where one of the dots will be on top of the other, and which one is on top is a feature of the software you're using. Jittering is one common strategy to deal with this problem, or we can just place the gold, silver and bronze dots on their own levels. The latter strategy would look exactly like the over-the-top view used in the short film:
(We'll also note that this view has time running left to right, which is perhaps more natural than time running bottom up, as in the dot plot. However, we are used to seeing runners cross the finish line from left to right on a TV screen so this is a case of eight ounces and half a pound.)
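Both remedies for overplotting take only a few lines. This is a sketch with hypothetical names and made-up results, not a description of the NYT's software: `jitter` nudges each value by a small random amount, while `place_on_levels` gives each medal type its own row so that identical times can never collide.

```python
import random

def jitter(values, width=0.02, seed=0):
    """Add a small uniform random offset to each value so that
    coincident dots separate. The seed keeps the result reproducible."""
    rng = random.Random(seed)
    return [v + rng.uniform(-width, width) for v in values]

MEDAL_LEVEL = {"gold": 0, "silver": 1, "bronze": 2}

def place_on_levels(results):
    """Turn (year, medal, time) records into (time, level) plotting
    coordinates, one level per medal type."""
    return [(time, MEDAL_LEVEL[medal]) for _year, medal, time in results]

# Made-up records in which gold and silver share the same time.
results = [(1952, "gold", 10.4), (1952, "silver", 10.4), (1952, "bronze", 10.5)]
coords = place_on_levels(results)   # [(10.4, 0), (10.4, 1), (10.5, 2)]
```

Jittering distorts the values slightly, which matters when hundredths of a second are the whole story, so separate levels seem the safer choice for this data.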
In the short film, I find the gigantic play/pause button at the center of the screen an annoyance, ruining my enjoyment. (I'm using Firefox and a Mac.)
Now, go check out the entire feature (link), and applaud the effort.
@TheChadd submitted the following chart via Twitter.
I don't know if "fun fairs" mean the same thing to me as to you, but that's where I was introduced to spinning wheel games. You stand 10 feet away from a multi-colored pie chart and throw darts (or other objects) at the circle; you win a gigantic teddy bear if you hit the narrow wedge, and maybe a sweet if you hit the big wedge.
To add to the fun, the pie chart is made to spin around slowly.
Well, we are at the fun fair and here is the spinning pie chart: