Reader Aaron W. came across this "Facts and Figures" infographic about Boise State University that is seemingly aimed at alumni of the school. Given that Boise State has a good reputation for analytics, Aaron found it disconcerting to see such a low-quality data graphic. (Click on the image to see it in full size.)
There are numerous little things to grumble about in each section of the chart. The larger issue though is the overall composition. When assembling a chart like this, it is important to provide a navigation path for readers, whether explicitly or through cues.
It's difficult to discern the organizing principles of this chart. Aaron felt this way: "the total information flow is haphazard, if not entirely incoherent. There is some valuable information here, but at best it gets lost in the shuffle."
For example, some statistics are for undergraduate students only, some are for graduate students, and some are offered in aggregate.
Confusion reigns. We learn that the school has total enrollment of 22K students but it's a little math quiz to learn how many are undergraduates. In certain sections, data about faculty members are mixed with those about students.
Not breaking out undergraduates from graduates is a particular problem when presenting demographics, such as age distributions, ethnicity, etc.
It's odd to present this distribution of age without remarking that the undergrads are shown on the left and the graduate students are shown on the right.
Then, the sections presenting counts of students, faculty, degrees, etc. overlap with sections presenting financial data.
A rethinking of this page should start with identifying the key questions readers would be interested in learning, and then organizing the data to suit those needs.
One of my students analyzed the following Economist chart for her homework.
I was looking for it online, and found an interactive version that is a bit different (link). Here are three screen shots from the online version for years 2009, 2013 and 2018. The first and last snapshots correspond to the years depicted in the print version.
The online version is the self-sufficiency test for the print version. In testing self-sufficiency, we want to see if the visual elements (i.e. the circular sectors on the print version) pull their own weight. The quick answer is no. The reader can't tell how much in sales each sector represents, nor can they reliably estimate the relative scales of print versus ebook (pink/red vs yellow/orange) or year-to-year growth rates.
As usual, when we see the entire data set printed on the chart itself, it is a giveaway that the visual elements are mere ornaments.
The online version does not have labels unless you hover over the hemispheres. But again it is a challenge to learn anything from the picture.
This NYT graphic published on the eve of the Senate elections represents the best of data visualization: it carries its message with a punch.
The link to the web page is here. The graphic proudly occupied the front page of the print edition on Tuesday.
This graphic is not cliched. The typical consequence of such a statement is that it has to come with a reader's manual. The beauty of this beauty is that the required manual is compact:
The rectangular areas indicate the lack of competitiveness in each race. The extremes are: the entirely filled rectangle is a lock from start to finish; and the completely blank rectangle is a 50/50 tossup from start to finish. The more color, the less competitive the race.
Red implies the Republican candidate is projected to be leading at that moment; Blue, the Democrat; and Green, an independent. (The juxtaposition of red and green is one of the few mis-steps here.)
If you stick to the above, you will do fine.
If you start thinking the height of the area is the chance of winning, you run into trouble.
***
Here is a more conventional way to show time-series projections. It is a mirrored line chart, in which one of the two lines is redundant. (This chart shows up elsewhere on the NYT site.)
To turn this into the other style, draw a line through the 50-percent level, erase everything below 50, and then switch from line to area.
On the far right, where it says 75%, you can see that it is precisely half-way between 50 and 100 percent. So the new chart breaks the start-at-zero rule for area charts.
Except... this is an ingenious violation of that rule. Like I said, if you are able to get your head around to thinking that the area maps to lack of competitiveness (or, the amount of lead the leader has, regardless of who's leading), and suppress the urge to interpret the areas as the chance of winning, then the axis starting at 50-percent is not a problem. (I'm assuming that most of these races are in essence two-horse races. If there are more than two viable candidates, this particular chart form doesn't work.)
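The transformation described above can be sketched in a few lines of Python. This is only a minimal sketch, assuming a two-candidate race; the function name `lead_above_even` and the sample probabilities are made up for illustration and are not taken from the NYT data.

```python
def lead_above_even(dem_win_prob):
    """Convert a Democratic win probability (0 to 1) into (leader, lead).

    The lead is the leader's probability minus 0.5, so it runs from
    0 (a 50/50 tossup) to 0.5 (a lock). Assumes a two-candidate race.
    """
    if not 0.0 <= dem_win_prob <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if dem_win_prob > 0.5:
        return ("D", dem_win_prob - 0.5)
    if dem_win_prob < 0.5:
        return ("R", 0.5 - dem_win_prob)
    return (None, 0.0)  # exact tossup: nothing to color in


# Hypothetical series for one race (not real forecast data).
series = [1.0, 0.9, 0.75, 0.5, 0.4]
areas = [lead_above_even(p) for p in series]
```

A 75 percent chance maps to a lead of 0.25, which is half of the maximum lead of 0.5. That is why the 75% label sits half-way up the rectangle.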
The payoff is a very compact chart that shows a lot of data in a small space. The NH race was a lock for the Democrats at the start but the lead kept dwindling, so that by the eve of the election, the lead had been cut in half. But the halved lead still corresponds to a 75 percent chance in favor of the Dems.
Iowa and Colorado both flipped from a Democratic to a Republican lead around the middle of September.
When the visualization is driven well, the readers have an effortless ride.
Alberto Cairo just gave a wonderful talk to my workshop, in which he complained about the state of dataviz teaching. So, it's quite opportune that reader Maja Z. sent in a couple of examples from a recent course on data visualization for academics. She was surprised to see these held up as examples of good work. I'll discuss one chart today, and the other one some other day.
The instructor for the course praised this chart for this principle: "always try to find a graphic that relates to your subject, like the bullets here representing military spending, and use it in the chart."
Students who take my class learn the opposite lesson: I like to say imagery often backfires. I do like charts with imagery that makes the data come alive but more often than not, the designer falls in love with the imagery and lets the data down.
This chart presumably shows the top 10 military spenders in the world by total amount spent in 2013. You'd think that the Chinese spent a bit more than half what the Americans did. But the data labels say $640 billion vs $188 billion, only about 30%. Next, the Russian spend is 46% of the Chinese according to the data, etc. So, is this really a data visualization or just some pictures with numbers printed next to them?
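The mismatch is easy to check with the two data labels quoted above. This is a minimal sketch using only those two figures; the variable names are my own.

```python
us_spend = 640     # billions of dollars, from the chart's data label
china_spend = 188  # billions of dollars, from the chart's data label

# Ratio implied by the printed numbers, to compare against the drawn heights.
ratio = china_spend / us_spend
print(f"China / U.S. = {ratio:.0%}")  # prints "China / U.S. = 29%"
```

If the drawn heights encoded the data, the Chinese warhead would be under a third as tall as the American one, not more than half.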
It's possible that the data is encoded in the surface areas or the volumes of these warheads but in reality, this is a glorified column chart, so most readers will respond to the heights of the columns.
Perhaps the shadows are there to demonstrate shadow spending.
The designer seems to appreciate that total spending is not necessarily a great metric. Spending as a proportion of GDP is provided as a secondary metric. I'm not so sure what to make of this though: should we expect richer nations to need/want to spend more building bombs and such? It just doesn't seem very logical to me.
Instead, a more meaningful metric might be military spending per capita. Controlling for population seems somewhat logical; the more people you have to protect, the more money you have to spend.
In the end, I made this scatter plot that tries to have it both ways:
(The percentages are of GDP.)
Here, we can see that Saudi Arabia and the U.S. are particularly aggressive spenders, spending over $2000 per person per year. The respective two dots are way above the average line (for the top 10 spenders). At the richer end of the scale, the American spending is way above the international average. On the other hand, Japan and Germany both spend significantly less than would be predicted by their GDP per capita levels.
Of note, readers more easily relate to the per-capita numbers than the aggregate figures in the original chart. They learn, for instance, that Saudi Arabia's average GDP was $27,000 per head, of which $2,500 went to arming itself up.
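The link between the per-capita view and the share-of-GDP metric can be sketched as follows. This is a rough illustration using the rounded Saudi figures quoted above; the function name `share_of_gdp` is hypothetical.

```python
def share_of_gdp(spend_per_capita, gdp_per_capita):
    """Military spending as a fraction of GDP, from per-capita figures."""
    if gdp_per_capita <= 0:
        raise ValueError("GDP per capita must be positive")
    return spend_per_capita / gdp_per_capita


# Rounded figures from the text: $2,500 of a $27,000 GDP per head.
saudi_share = share_of_gdp(2500, 27000)
print(f"Saudi Arabia: {saudi_share:.1%} of GDP")
```

Because population cancels out, dividing the two per-capita numbers gives the same percentage as dividing total spending by total GDP.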
Note: I'm traveling a lot lately and it is affecting my ability to post on a regular basis.
It's three weeks into my chart-building workshop (link) at NYU and we are starting to discuss individual projects. One of the major discussion points this week is the quality of the underlying data being visualized.
One student is visualizing movie data from IMDB. He showed a chart comparing the year of a movie's release and the number of votes it has received. Do people talk more about new or old movies? Not surprisingly, the distribution is highly skewed with recent movies getting a lot more votes. The consensus in the room is that you never just want to see the pattern; the natural question to ask is why are we seeing such a pattern.
The easiest response is that people tend to vote on recent movies. This is the availability heuristic. You tend to talk about things that are top of mind. But there is a lot more to it than that. Perhaps movies of specific genres get discussed more often. Perhaps movies with larger marketing budgets get more buzz. etc. etc. If any of these factors are important, a good data visualization should bring them out.
Another factor that isn't obvious is that IMDB only started recently relative to the history of movies. The start date of data collection is highly informative here. Imagine a database that was created five years ago versus one that was created five decades ago. The former dataset is not a random sample of the latter, far from it. The availability heuristic matters here. Also, the movie industry has been growing in the meantime so the number of movies is changing. Internet access is also growing so the number of votes is changing. Finally, all students agreed that anyone caring to comment on older movies is probably someone who likes those movies, and thus expected the average rating on older movies to be higher than on more recent ones... we'd have to verify this hypothesis using the data.
A lot of Big Data have these characteristics. The starting date of data collection matters a lot. Averaging data without accounting for these timing issues leads to wrong conclusions.
The dynamics of people rating/commenting on movies is a topic I'm interested in. If you go to Amazon and pull up Freakonomics, published 6 years ago, it has over 1800 reviews, of which over 800 are five stars, and 1300 are four or five stars, and yet the most recent reviews submitted are dated 3, 5, 6, etc. days ago. Why do people keep writing reviews? For example, two of the reviews written this week just said "great!" and "great book". Another said "Outstanding take on the odd correlations between things in our culture. Definitley makes you think outside the lines." That comment has probably been repeated hundreds of times already by the preceding reviewers. Has anyone studied this yet?
I had the pleasure of visiting the Facebook data science team last week, and we spent some time chatting about visual communication, something they care as much about as I do. Solomon reported on our conversation in this blog post. One topic is stacked bar charts, which are useful in limited situations, such as when the categorical variable has two or three levels.
Solomon used stacked bars in his fascinating post about how candidates from the two political parties are using Facebook messages in the run-up to mid-term elections. Be sure to read about it here. This is an example of good data journalism in which the outcomes of the analysis are presented simply, hiding the amount of technical work that went into its production.
This stacked bar chart is effective at pointing out the differences in the types of messages being sent out by party:
I do have one question, which is the placement of the 50-percent line. The line is very important to this chart, and I like the way it looks. When the line sits at 50 percent, it implies that the Republican and Democratic candidates were issuing about equal numbers of Facebook messages. If the share of total messages is not 50/50, then the reference line should sit elsewhere.
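One way to place the reference line is at the overall share of messages rather than at a fixed 50 percent. The sketch below assumes made-up message counts by topic; the function name `overall_share` and all the numbers are hypothetical and are not Solomon's data.

```python
def overall_share(counts_a, counts_b):
    """Share of all messages, across every category, sent by group A."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    total = total_a + total_b
    if total == 0:
        raise ValueError("no messages to compare")
    return total_a / total


# Hypothetical message counts by topic (illustration only).
dem = {"economy": 300, "health": 200, "education": 100}
rep = {"economy": 250, "health": 100, "education": 50}

# 600 of 1,000 messages come from group A, so the line sits at 60 percent.
reference = overall_share(dem, rep)
```

With the line at the overall share, a bar that crosses it signals a topic that one party emphasizes more than its overall message volume would predict.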
They later split the races by toss-ups versus uncompetitive, and use confidence intervals to communicate both the expected rate and the uncertainty of the estimate. The uncertainty bars in effect tell readers that there are many more uncompetitive elections than toss-ups.
The choice of the chart form is fine. But it makes me pull out my Tufte book. The data-ink ratio on this chart needs a little help. The gridlines can go. Even the 250 label on the x-axis can go. I might even go with just labelling the midpoints.
Lastly, this next chart is enlightening. Seems like older adults are much more likely to comment and/or like such political messages; men are more likely to comment while women are more likely to like. The small-multiples format helps us grasp the three-way analysis without much suffering.
What makes this work is that the picture of the running back serves a purpose here, in organizing the data. Contrast this to the airplane from Consumer Reports (link), which did a poor job of providing structure. An alternative of using a bar chart is clearly inferior and much less engaging.
I went ahead and experimented with it:
I fixed the self-sufficiency issue, always present when using bubble charts. In this case, I don't think it matters whether the readers know the exact number of injuries so I removed all of the data from the chart.
Here are three temptations that I did not implement:
Not include the legend
Not include the text labels, which are rendered redundant by the brilliant idea of using the running guy
Alberto Cairo left a comment about "data decorations". This is a name he's using to describe something like the windshield-wiper chart I discussed the other day. It seems like the visual elements were purely ornamental and added nothing to the experience--one might argue that the experience was worse than just staring at the data table.
It just happens that I have another example of such a chart, submitted by Xan. This one is from Consumer Reports, and illustrates some findings from a recent survey on what things air travellers hate most. Good luck figuring all this out!
A few of these ideas work, such as the complaints about leg room being tied to the seated passengers inside the plane. But then, the data about people hating middle seats is placed in the upper left corner between the left wing and the tail. All of the atypically shaped charts (the cloud, the triangle, the octagon) seem to use the oft-criticized convention of coding the data onto just one dimension of these multi-dimensional objects. I just find the organization of the text confusing and poorly structured.
Xan pulled something from a much older Consumer Reports. And they dared to use a boring bar chart:
A nice compromise would be to create some subsections under Airlines to group different types of complaints (stuff relating to seating, stuff about service, stuff about punctuality, etc.). Ask a designer to draw some icons (remember the NYT dog graphic!)
I am mystified by the intention behind this chart, published in NYT Magazine (Sept 14, 2014).
It is not a data visualization since the circles were not placed to scale. The 650 and 660 should have been further to the right on a horizontal time scale. And if we were to take the radial time axis literally, the 390 circle would be closest to the center.
It is not a work of art. It doesn’t look particularly appealing. Sometimes, designers are inspired by imagery. The accompanying article concerns windshield wipers, and I’m not seeing the imagery.
The arrangement of the circles actually interferes with the reader’s comprehension. Here is a straightforward version of the data as a column chart.
Now, let’s turn it on its side, with time running vertically instead of horizontally (the convention).
Then, we need to invert convention once again by making the vertical axis run in reverse so that time runs from up to down, instead of down to up.
Finally, distort the frequency axis, replace the bars with circles, and you have essentially replicated the original.
The point is each step obscures the pattern more. In this case, following conventions makes a better chart.
I have a pet peeve about presenting partial data next to complete data, even if it is labeled correctly. On this chart, the number 390 cannot be compared against any of the other numbers because we are not even half way into the decade of the 2010s. Instead of plotting total number of patents per decade, it would have been more useful to plot number of patents per year in each decade. 43, 26, 65, 41, etc. For the 2010s, I am assuming they have data for 3.5 years.
A simple column chart looks like this:
The per-year view shows that the 2010s are unusual. Of course, I should add a footnote to the chart to make it clear that we only have partial data for the 2010s, and that the assumption behind the averaging is that the pace of patents will remain the same on average for the remainder of the decade.
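The per-year conversion is simple arithmetic, sketched below. The function name `patents_per_year` is my own, and the 3.5 years of data for the 2010s is the assumption stated above, not a confirmed figure.

```python
def patents_per_year(total, years_observed):
    """Average number of patents per year over the observed period."""
    if years_observed <= 0:
        raise ValueError("years_observed must be positive")
    return total / years_observed


# A complete decade with 650 patents averages 65 per year.
full_decade = patents_per_year(650, 10)

# The 2010s so far: 390 patents over an assumed 3.5 years of data.
partial_2010s = patents_per_year(390, 3.5)

print(f"{full_decade:.0f} vs {partial_2010s:.0f} patents per year")
```

On a per-year basis the partial decade comes out around 111, well above the 65 of a full decade that looked larger in the original chart.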