This chart cited by ZeroHedge feels like a parody. It's a bar chart that doesn't utilize the length of bars. It's a dot plot that doesn't utilize the position of dots. The range of commute times (between city centers and airports) from 18 to 111 minutes is compressed into red/yellow/green levels.
ZeroHedge got this from Bloomberg Businessweek, which has a dedicated data visualization group, so this seems strange. The project, called "The Airport Frustration Index," is here.
It turns out the above chart is a byproduct of interactivity. The designer illustrates the passage of time by letting lines run across the page. The imagery is that of a horse race. This experiment reminds me of the audible chart by New York Times (link).
The trick works better when the scale is in seconds, thus real time, as in the NYT chart. On the Businessweek chart, three different scales are simultaneously in motion: real time, elapsed time of the interactive element, and length of the line. Take any two airports: the amount of elapsed time between one "horse" and the other reaching the right side is not the extra commute time but a fraction of it--obviously, the designer can't make readers wait, say, 10 minutes if that were the real difference in commute times!
Besides, the interactive component is responsible for the uninformative end state shown above.
Now, let's take a spin around the Trifecta Checkup. The question being asked is how "painful" is the commute from the city center to the airport. The data used:
Here are some issues about the data worth a moment of your time:
In Chapter 1 of Numbers Rule Your World (link), I review some key concepts in analyzing waiting times. The most important concept is the psychology of waiting time. Specifically, not all waiting time is created equal. Some minutes are just more painful than others.
As a simple example, there are two main reasons why Google Maps says it takes longer to get to Airport A than Airport B--distance between the city center and the airport, and congestion on the roads. If in getting to A, the car is constantly moving while in getting to B, half of the time is spent stuck in jams, then the average commuter considers the commute to B much more painful even if the two trips take the same number of physical minutes.
Thus, it is not clear that Google driving time is the right way to measure pain. One quick but incomplete fix is to introduce distance into the metric, which means looking at speed rather than time.
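The speed fix mentioned above is just distance divided by time. A minimal sketch, in which both the distances and the times are made up for illustration:

```python
# Two hypothetical airports with the same drive time but different distances
# (all numbers are made up for illustration).
commutes = {
    "Airport A": {"km": 35, "minutes": 30},  # free-flowing traffic
    "Airport B": {"km": 15, "minutes": 30},  # half the time stuck in jams
}

for airport, d in commutes.items():
    speed_kmh = d["km"] / (d["minutes"] / 60)
    print(f"{airport}: {d['minutes']} min at {speed_kmh:.0f} km/h average")
```

Same minutes, very different speeds--which is the point: the time metric alone hides the congestion that makes a commute painful.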
Another consideration is whether the "center" of all business trips coincides with the city center. In New York, for instance, I'm not sure what should be considered the "city center". If all five boroughs are counted, I've heard that the geographical center is in Brooklyn. If I type "New York, NY" into Google Maps, it drops the pin at the World Trade Center. During rush hour, the 111 minutes to JFK would be an underestimate for most commuters located above Canal Street.
Reader Aaron W. came across this "Facts and Figures" infographic about Boise State University that seems aimed at alumni of the school. Given that Boise State has a good reputation for analytics, Aaron found it disconcerting to see such a low-quality data graphic. (Click on the image to see it in full size.)
There are numerous little things to grumble about in each section of the chart. The larger issue though is the overall composition. When assembling a chart like this, it is important to provide a navigation path for readers, whether explicitly or through cues.
It's difficult to discern the organizing principles of this chart. Aaron felt this way: "the total information flow is haphazard, if not entirely incoherent. There is some valuable information here, but at best it gets lost in the shuffle."
For example, some statistics are for undergraduate students only, some are for graduate students, and some are offered in aggregate.
Confusion reigns. We learn that the school has a total enrollment of 22K students, but it takes a little math quiz to figure out how many are undergraduates. In certain sections, data about faculty members are mixed in with data about students.
Not breaking out undergraduates from graduates is a particular problem when presenting demographics, such as age distributions, ethnicity, etc.
It's odd to present this distribution of age without remarking that the undergrads are shown on the left and the graduate students are shown on the right.
Then, the sections presenting counts of students, faculty, degrees, etc. overlap with sections presenting financial data.
A rethinking of this page should start with identifying the key questions readers would be interested in learning, and then organizing the data to suit those needs.
One of my students analyzed the following Economist chart for her homework.
I was looking for it online, and found an interactive version that is a bit different (link). Here are three screenshots from the online version for the years 2009, 2013 and 2018. The first and last snapshots correspond to the years depicted in the print version.
The online version serves as the self-sufficiency test for the print version. In testing self-sufficiency, we want to see if the visual elements (i.e. the circular sectors on the print version) pull their own weight. The quick answer is no. The reader can't tell how much in sales each sector represents, nor can they reliably estimate the relative scales of print versus ebook (pink/red vs yellow/orange) or year-to-year growth rates.
As usual, when we see the entire data set printed on the chart itself, it is a giveaway that the visual elements are mere ornaments.
The online version does not have labels unless you hover over the hemispheres. But again it is a challenge to learn anything from the picture.
This NYT graphic published on the eve of the Senate elections represents the best of data visualization: it carries its message with a punch.
The link to the web page is here. The graphic proudly occupied the front page of the print edition on Tuesday.
This graphic is not clichéd. The typical price of such originality is that the chart has to come with a reader's manual. The beauty of this beauty is that the required manual is compact:
The rectangular areas indicate the lack of competitiveness in each race. The extremes are: an entirely filled rectangle is a lock from start to finish; a completely blank rectangle is a 50/50 tossup from start to finish. The more color, the less competitive the race.
Red implies the Republican candidate is projected to be leading at that moment; Blue, the Democrat; and Green, an independent. (The juxtaposition of red and green is one of the few missteps here.)
If you stick to the above, you will do fine.
If you start thinking the height of the area is the chance of winning, you run into trouble.
*** Here is a more conventional way to show time-series projections. It is a mirrored line chart, in which one of the two lines is redundant. (This chart shows up elsewhere on the NYT site.)
To turn this into the other style, draw a line through the 50-percent level, erase everything below 50, and then switch from line to area.
On the far right, where it says 75%, you can see that it sits precisely halfway between 50 and 100 percent. So the new chart breaks the start-at-zero rule for area charts.
Except... this is an ingenious violation of that rule. Like I said, if you are able to get your head around thinking that the area maps to lack of competitiveness (or, the amount of lead the leader has, regardless of who's leading), and suppress the urge to interpret the areas as the chance of winning, then the axis starting at 50 percent is not a problem. (I'm assuming that most of these races are in essence two-horse races. If there are more than two viable candidates, this particular chart form doesn't work.)
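The line-to-area transformation described above amounts to a small calculation. A sketch, using a hypothetical series of win probabilities for the Democrat:

```python
# Hypothetical win probabilities (percent) for the Democrat over time.
dem = [90, 85, 70, 40, 25]

# The area encodes who leads and how far the leader's chance exceeds 50%,
# i.e. the lack of competitiveness -- not the chance of winning itself.
encoded = [("D" if p >= 50 else "R", abs(p - 50)) for p in dem]

print(encoded)  # [('D', 40), ('D', 35), ('D', 20), ('R', 10), ('R', 25)]
```

Note that the mirrored line below 50 percent is discarded without losing information, which is why one of the two lines in the conventional chart is redundant.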
The payoff is a very compact chart that shows a lot of data in a small space. The NH race was a lock for the Democrats at the start, but the lead kept dwindling, so that on the eve of the election it had been cut in half. Even halved, the chance is still 75 percent in favor of the Dems.
Iowa and Colorado both flipped from a Democratic to a Republican lead around the middle of September.
When the visualization is driven well, the readers have an effortless ride.
Alberto Cairo just gave a wonderful talk to my workshop, in which he complains about the state of dataviz teaching. So, it's quite opportune that reader Maja Z. sent in a couple of examples from a recent course on data visualization for academics. She was surprised to see these held out as examples of good work. I'll discuss one chart today, and the other one some other day.
The instructor for the course praised this chart for this principle: "always try to find a graphic that relates to your subject, like the bullets here representing military spending, and use it in the chart."
Students who take my class learn the opposite lesson: I like to say imagery often backfires. I do like charts with imagery that makes the data come alive, but more often than not, the designer falls in love with the imagery and lets the data down.
This chart presumably shows the top 10 military spenders in the world by total amount spent in 2013. You'd think that the Chinese spent a bit more than half what the Americans did. But the data labels say $640 billion vs $188 billion, only about 30%. Next, the Russian spend is 46% of the Chinese according to the data, etc. So, is this really a data visualization or just some pictures with numbers printed next to them?
It's possible that the data is encoded in the surface areas or the volumes of these warheads but in reality, this is a glorified column chart, so most readers will respond to the heights of the columns.
Perhaps the shadows are there to demonstrate shadow spending.
The designer seems to appreciate that total spending is not necessarily a great metric. Spending as a proportion of GDP is provided as a secondary metric. I'm not so sure what to make of this though: should we expect richer nations to need/want to spend more building bombs and such? It just doesn't seem very logical to me.
Instead, a more meaningful metric might be military spending per capita. Controlling for population seems somewhat logical; the more people you have to protect, the more money you have to spend.
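The per-capita metric is a simple division of the chart's totals by population. A sketch, in which the spending figures come from the chart's labels but the 2013 population figures are my own rough assumptions:

```python
# Military spending totals (USD billions) from the chart's labels;
# populations (millions) are rough 2013 figures, assumed for illustration.
spend_bn = {"United States": 640, "China": 188}
pop_mn = {"United States": 316, "China": 1357}

for country in spend_bn:
    per_capita = spend_bn[country] * 1e9 / (pop_mn[country] * 1e6)
    print(f"{country}: ${per_capita:,.0f} per person")
# The U.S. comes out above $2,000 per person; China below $150.
```

Controlling for population reshuffles the ranking dramatically compared with the totals shown in the original chart.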
In the end, I made this scatter plot that tries to have it both ways:
(The percentages are of GDP.)
Here, we can see that Saudi Arabia and the U.S. are particularly aggressive spenders, spending over $2000 per person per year. Both dots sit way above the average line (for the top 10 spenders). At the richer end of the scale, the American spending is way above the international average. On the other hand, Japan and Germany both spend significantly less than would be predicted by their GDP per capita levels.
Of note, readers relate more easily to the per-capita numbers than to the aggregate figures in the original chart. They learn, for instance, that Saudi Arabia's average GDP was $27,000 per head, of which $2,500 went to arming itself.
Note: I'm traveling a lot lately and it is affecting my ability to post on a regular basis.
It's three weeks into my chart-building workshop (link) at NYU and we are starting to discuss individual projects. One of the major discussion points this week is the quality of the underlying data being visualized.
One student is visualizing movie data from IMDB. He showed a chart comparing the year of a movie's release with the number of votes it has received. Do people talk more about new or old movies? Not surprisingly, the distribution is highly skewed, with recent movies getting a lot more votes. The consensus in the room was that you never want just to see the pattern; the natural question to ask is why we are seeing such a pattern.
The easiest response is that people tend to vote on recent movies. This is the availability heuristic: you tend to talk about things that are top of mind. But there is a lot more to it. Perhaps movies of specific genres get discussed more often. Perhaps movies with larger marketing budgets get more buzz. And so on. If any of these factors are important, a good data visualization should bring them out.
Another factor that isn't obvious is that IMDB started only recently relative to the history of movies. The start date of data collection is highly informative here. Imagine a database created five years ago versus one created five decades ago. The former dataset is not a random sample of the latter, far from it. The availability heuristic matters here. Also, the movie industry has been growing in the meantime, so the number of movies is changing. Internet access is also growing, so the number of votes is changing. Finally, all the students agreed that anyone caring to comment on older movies is probably someone who likes those movies, so we'd expect the average rating of older movies to be higher than that of recent ones... we'd have to verify this hypothesis using the data.
A lot of Big Data have these characteristics. The starting date of data collection matters a lot. Averaging data without accounting for these timing issues leads to wrong conclusions.
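The left-truncation effect can be illustrated with a toy model in which every number is an assumption: suppose the database opened in 1990 and every movie gathers votes at the same constant rate while the database exists.

```python
DB_START, NOW = 1990, 2014   # assumed database lifetime
VOTES_PER_YEAR = 100         # assume a constant voting rate, for simplicity

def votes(release_year):
    # A movie collects votes only from the later of its release year
    # and the database's start date.
    collecting_since = max(release_year, DB_START)
    return (NOW - collecting_since) * VOTES_PER_YEAR

# A 1950 film and a 1990 film show identical vote counts, while a
# 2010 film shows far fewer -- purely an artifact of collection timing.
print(votes(1950), votes(1990), votes(2010))  # 2400 2400 400
```

Even with identical popularity, the raw counts say nothing about the movies and everything about when the data collection began.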
The dynamics of people rating and commenting on movies is a topic I'm interested in. If you go to Amazon and pull up Freakonomics, published six years ago, it has over 1,800 reviews, of which over 800 are five stars and 1,300 are four or five stars, and yet the most recent reviews are dated 3, 5, 6, etc. days ago. Why do people keep writing reviews? For example, two of the reviews written this week just said "great!" and "great book". Another said "Outstanding take on the odd correlations between things in our culture. Definitley makes you think outside the lines." That comment has probably been repeated hundreds of times already by the preceding reviewers. Has anyone studied this yet?
I had the pleasure of visiting the Facebook data science team last week, and we spent some time chatting about visual communication, something they care as much about as I do. Solomon reported about our conversation in this blog post. One topic is stacked bar charts, which are useful in limited situations, such as when the categorical variable has two or three levels.
Solomon used stacked bars in his fascinating post about how candidates from the two political parties are using Facebook messages in the run-up to the mid-term elections. Be sure to read about it here. This is an example of good data journalism, in which the outcomes of the analysis are presented simply, hiding the amount of technical work that went into its production.
This stacked bar chart is effective at pointing out the differences in the types of messages being sent out by party:
I do have one question, about the placement of the 50-percent line. The line is very important to this chart, and I like the way it looks. When the line sits at 50 percent, it implies that the Republican and Democratic candidates were issuing about equal numbers of Facebook messages. If the share of total messages is not 50/50, then the reference line should sit elsewhere.
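The adjusted baseline is simply one party's share of all messages. A sketch, using made-up message counts:

```python
# Made-up totals: suppose Republicans sent 6,000 messages, Democrats 4,000.
rep_total, dem_total = 6000, 4000

# The neutral reference line belongs at the Republicans' overall share
# of messages, not automatically at 50 percent.
baseline_pct = rep_total / (rep_total + dem_total) * 100
print(baseline_pct)  # 60.0
```

With these assumed totals, a category in which Republicans account for 55 percent of messages would actually be leaning Democratic relative to the baseline.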
They later split the races into tossups versus uncompetitive, and use confidence intervals to communicate both the expected rate and the uncertainty of the estimate. The uncertainty bars in effect tell readers that there are many more uncompetitive elections than tossups.
The choice of the chart form is fine. But it makes me pull out my Tufte book. The data-ink ratio on this chart needs a little help. The gridlines can go. Even the 250 label on the x-axis can go. I might even go with just labelling the midpoints.
Lastly, this next chart is enlightening. Seems like older adults are much more likely to comment and/or like such political messages; men are more likely to comment while women are more likely to like. The small-multiples format helps us grasp the three-way analysis without much suffering.
What makes this work is that the picture of the running back serves a purpose here, organizing the data. Contrast this with the airplane from Consumer Reports (link), which did a poor job of providing structure. A bar-chart alternative would be clearly inferior and much less engaging.
I went ahead and experimented with it:
I fixed the self-sufficiency issue, which is always present when using bubble charts. In this case, I don't think it matters whether readers know the exact number of injuries, so I removed all of the data from the chart.
Here are three temptations that I resisted:
Not including the legend
Not including the text labels, which are rendered redundant by the brilliant idea of using the running guy