In a Trifecta Checkup (link), the Vulture chart falls into Type DV. The question might be the relationship between running time and box office, and between Rotten Tomatoes Score and box office. These are very difficult to answer.
The box office number here refers to the lifetime gross ticket receipts from theaters. The movie industry insists on publishing these unadjusted numbers, which are completely useless. At the minimum, these numbers should be adjusted for inflation (ticket prices) and for population growth, if we are to use them to measure commercial success.
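To make the adjustment concrete, here is a minimal sketch. All figures below (ticket prices, populations) are round, illustrative placeholders, not sourced data:

```python
# Adjusting a nominal lifetime gross for ticket-price inflation and
# population growth. All input figures are illustrative placeholders.

def adjusted_gross(nominal_gross, ticket_price_then, ticket_price_now,
                   population_then, population_now):
    """Restate an old gross at today's ticket prices, scaled to today's audience size."""
    # Inflation adjustment: restate receipts at today's ticket prices.
    inflation_adjusted = nominal_gross * (ticket_price_now / ticket_price_then)
    # Population adjustment: scale to today's potential audience.
    return inflation_adjusted * (population_now / population_then)

# A movie that grossed $100M when tickets averaged $2.69, versus $8 today,
# with the population growing from 227M to 320M (illustrative numbers):
print(adjusted_gross(100e6, 2.69, 8.00, 227e6, 320e6))  # roughly $419M
```

Any cross-era comparison of "commercial success" should at least run the raw grosses through something like this.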
The box office number is also suspect because it ignores streaming, digital, syndication, and other forms of revenues. This is a problem because we are comparing movies across time.
You might have noticed that both running time and box office numbers have gone up over time. (That is to say, running time and box office numbers are highly correlated.) Do you think that is because moviegoers are motivated to see longer films, or because movies are just getting longer?
PS. [12/15/2014] I will have a related discussion on the statistics behind this data on my sister blog. Link will be active Monday afternoon.
Found this chart in the magazine that Charles Schwab sends to customers:
When there are two variables, and their correlation is of interest, a scatter plot is usually recommended. But not here!
The text labels completely dominate this chart. The designer clearly tried hard to place them, but a careful look reveals that some boxes sit above their dots while others sit to the right, and the label for "Short Treasuries" takes refuge quite a distance from its dot. The locations of the text boxes therefore cannot substitute for the dots.
Here is a different view of this data:
I am using a bumps-style chart, which allows the labels to be written horizontally outside the canvas. Instead of plotting all categories on the same chart, I use a small-multiples setup to differentiate three types of risk-return relationships.
This chart cited by ZeroHedge feels like a parody. It's a bar chart that doesn't utilize the length of bars. It's a dot plot that doesn't utilize the position of dots. The range of commute times (between city centers and airports) from 18 to 111 minutes is compressed into red/yellow/green levels.
ZeroHedge got this from Bloomberg Businessweek, which has a data visualization group, so this seems strange. The project, called "The Airport Frustration Index," is here.
It turns out the above chart is a byproduct of interactivity. The designer illustrates the passage of time by letting lines run across the page. The imagery is that of a horse race. This experiment reminds me of the audible chart by New York Times (link).
The trick works better when the scale is in seconds, thus real time, as in the NYT chart. On the Businessweek chart, three different scales are simultaneously in motion: real time, elapsed time of the interactive element, and length of the line. Take any two airports: the amount of elapsed time between one "horse" and the other "horse" reaching the right side is not equal to the extra time needed but a fraction of it--obviously, the designer can't have readers wait, say, 10 minutes if that was the real difference in commute times!
Besides, the interactive component is responsible for the uninformative end state shown above.
Now, let's take a spin around the Trifecta Checkup. The question being asked is how "painful" is the commute from the city center to the airport. The data used:
Here are some data issues worth a moment of your time:
In Chapter 1 of Numbers Rule Your World (link), I review some key concepts in analyzing waiting times. The most important concept is the psychology of waiting time. Specifically, not all waiting time is created equal. Some minutes are just more painful than others.
As a simple example, there are two main reasons why Google Maps says it takes longer to get to Airport A than Airport B: distance between the city center and the airport, and congestion on the roads. If in getting to A the car is constantly moving, while in getting to B half the time is spent stuck in jams, then the average commuter considers the commute to B much more painful even if the two trips take the same number of physical minutes.
Thus, it is not clear that Google driving time is the right way to measure pain. One quick but incomplete fix is to introduce distance into the metric, which means looking at speed rather than time.
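The speed fix can be sketched in a few lines. The two airports and their distances and times below are hypothetical, for illustration only:

```python
# Comparing airports by average speed rather than raw driving time.
# Distances and drive times below are hypothetical, for illustration only.

def avg_speed_mph(distance_miles, drive_minutes):
    """Average speed over the trip, a crude proxy for how congested it felt."""
    return distance_miles / (drive_minutes / 60)

# Two hypothetical airports with the same drive time but different distances:
airport_a = avg_speed_mph(30, 60)  # free-flowing: 30 miles in 60 minutes
airport_b = avg_speed_mph(15, 60)  # congested: 15 miles in the same hour
print(airport_a, airport_b)  # 30.0 vs 15.0 mph -- B's hour feels more painful
```

Same number of minutes, very different experience; speed at least separates the two.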
Another consideration is whether the "center" of all business trips coincides with the city center. In New York, for instance, I'm not sure what should be considered the "city center". If all five boroughs are considered, I've heard that the geographical center is in Brooklyn. If I type "New York, NY" into Google Maps, it shows up at the World Trade Center. During rush hour, the 111 minutes for JFK would be an underestimate for most commuters located above Canal Street.
Reader Aaron W. came across this "Facts and Figures" infographic about Boise State University that seemingly is aimed at alumni of the school. Given that Boise State has a good reputation for analytics, Aaron found it disconcerting to see such a low-quality data graphic. (click on the image to see it in full size).
There are numerous little things to grumble about in each section of the chart. The larger issue though is the overall composition. When assembling a chart like this, it is important to provide a navigation path for readers, whether explicitly or through cues.
It's difficult to discern the organizing principles of this chart. Aaron felt this way: "the total information flow is haphazard, if not entirely incoherent. There is some valuable information here, but at best it gets lost in the shuffle."
For example, some statistics are for undergraduate students only, some are for graduate students, and some are offered in aggregate.
Confusion reigns. We learn that the school has total enrollment of 22K students but it's a little math quiz to learn how many are undergraduates. In certain sections, data about faculty members are mixed with those about students.
Not breaking out undergraduates from graduates is a particular problem when presenting demographics, such as age distributions, ethnicity, etc.
It's odd to present this distribution of age without remarking that the undergrads are shown on the left and the graduate students are shown on the right.
Then, the sections presenting counts of students, faculty, degrees, etc. overlap with sections presenting financial data.
A rethinking of this page should start with identifying the key questions readers would be interested in learning, and then organizing the data to suit those needs.
One of my students analyzed the following Economist chart for her homework.
I was looking for it online, and found an interactive version that is a bit different (link). Here are three screen shots from the online version for years 2009, 2013 and 2018. The first and last snapshots correspond to the years depicted in the print version.
The online version is the self-sufficiency test for the print version. In testing self-sufficiency, we want to see if the visual elements (i.e. the circular sectors on the print version) pull their own weight. The quick answer is no. Readers can't tell how much in sales each sector represents, nor can they reliably estimate the relative scales of print versus ebook (pink/red vs. yellow/orange) or year-to-year growth rates.
As usual, when we see the entire data set printed on the chart itself, it is a giveaway that the visual elements are mere ornaments.
The online version does not have labels unless you hover over the hemispheres. But again it is a challenge to learn anything from the picture.
This NYT graphic published on the eve of the Senate elections represents the best of data visualization: it carries its message with a punch.
The link to the web page is here. The graphic proudly occupied the front page of the print edition on Tuesday.
This graphic is not clichéd. The typical cost of originality is that the chart has to come with a reader's manual. The beauty of this one is that the required manual is compact:
The rectangular areas indicate the lack of competitiveness in each race. The extremes are: the entirely filled rectangle is a lock from start to finish; and the completely blank rectangle is a 50/50 tossup from start to finish. The more color, the less competitive the race.
Red implies the Republican candidate is projected to be leading at that moment; Blue, the Democrat; and Green, an independent. (The juxtaposition of red and green is one of the few mis-steps here.)
If you stick to the above, you will do fine.
If you start thinking the height of the area is the chance of winning, you run into trouble.
***

Here is a more conventional way to show time-series projections. It is a mirrored line chart, in which one of the two lines is redundant. (This chart shows up elsewhere on the NYT site.)
To turn this into the other style, draw a line through the 50-percent level, erase everything below 50, and then switch from line to area.
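That transformation can be sketched in a few lines: clip the win probability at 50, and record which side leads. The probabilities below are hypothetical:

```python
# Convert a series of Democratic win probabilities (in percent) into the
# one-sided area chart's values: the leader's probability, plus a color
# indicating which party leads. Probabilities below are hypothetical.

def to_area_series(dem_win_pct):
    series = []
    for p in dem_win_pct:
        leader_pct = max(p, 100 - p)          # erase everything below 50
        color = "blue" if p >= 50 else "red"  # who is ahead at that moment
        series.append((leader_pct, color))
    return series

print(to_area_series([80, 60, 45]))
# [(80, 'blue'), (60, 'blue'), (55, 'red')]
```

The filled area then runs from the 50-percent line up to the leader's probability, in the leader's color.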
On the far right, where it says 75%, you can see that it is precisely half-way between 50 and 100 percent. So the new chart breaks the start-at-zero rule for area charts.
Except... this is an ingenious violation of that rule. Like I said, if you can get your head around the idea that the area maps to lack of competitiveness (or the size of the leader's lead, regardless of who's leading), and suppress the urge to interpret the areas as the chance of winning, then the axis starting at 50 percent is not a problem. (I'm assuming that most of these races are in essence two-horse races. If there are more than two viable candidates, this particular chart form doesn't work.)
The payoff is a very compact chart that shows a lot of data in a small space. The NH race was a lock for the Democrats at the start, but the lead kept dwindling, so that on the eve of the election, it had been cut in half. Even halved, the lead still gives the Dems a 75 percent chance.
Iowa and Colorado both flipped from a Democratic to a Republican lead around the middle of September.
When the visualization is driven well, the readers have an effortless ride.
Alberto Cairo just gave a wonderful talk to my workshop, in which he complains about the state of dataviz teaching. So, it's quite opportune that reader Maja Z. sent in a couple of examples from a recent course on data visualization for academics. She was surprised to see these held out as examples of good work. I'll discuss one chart today, and the other one some other day.
The instructor for the course praised this chart for this principle: "always try to find a graphic that relates to your subject, like the bullets here representing military spending, and use it in the chart."
Students who take my class learn the opposite lesson: I like to say imagery often backfires. I do like charts with imagery that makes the data come alive, but more often than not, the designer falls in love with the imagery and lets the data down.
This chart presumably shows the top 10 military spenders in the world by total amount spent in 2013. You'd think that the Chinese spent a bit more than half what the Americans did. But the data labels say $640 billion vs $188 billion, only about 30%. Next, the Russian spend is 46% of the Chinese according to the data, etc. So, is this really a data visualization or just some pictures with numbers printed next to them?
It's possible that the data is encoded in the surface areas or the volumes of these warheads but in reality, this is a glorified column chart, so most readers will respond to the heights of the columns.
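A quick check lends some support to the area hypothesis: if the dollar amounts were mapped to the surface areas of similarly shaped warheads, the height ratio would be the square root of the value ratio, and sqrt(188/640) is indeed "a bit more than half":

```python
import math

# Labeled values from the chart, in billions of dollars.
us, china = 640, 188

value_ratio = china / us                 # what the data say: about 0.29
height_if_area = math.sqrt(value_ratio)  # height ratio under area encoding
height_if_volume = value_ratio ** (1/3)  # height ratio under volume encoding

print(round(value_ratio, 2), round(height_if_area, 2), round(height_if_volume, 2))
# 0.29 0.54 0.66
```

Either way, most readers judge by column height, so the chart still misleads.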
Perhaps the shadows are there to demonstrate shadow spending.
The designer seems to appreciate that total spending is not necessarily a great metric. Spending as a proportion of GDP is provided as a secondary metric. I'm not so sure what to make of this though: should we expect richer nations to need/want to spend more building bombs and such? It just doesn't seem very logical to me.
Instead, a more meaningful metric might be military spending per capita. Controlling for population seems somewhat logical; the more people you have to protect, the more money you have to spend.
In the end, I made this scatter plot that tries to have it both ways:
(The percentages are of GDP.)
Here, we can see that Saudi Arabia and the U.S. are particularly aggressive spenders, spending over $2000 per person per year. Their two dots sit way above the average line (for the top 10 spenders). At the richer end of the scale, the American spending is way above the international average. On the other hand, Japan and Germany both spend significantly less than would be predicted by their GDP per capita levels.
Of note, readers more easily relate to the per-capita numbers than the aggregate figures in the original chart. They learn, for instance, that Saudi Arabia's average GDP was $27,000 per head, of which $2,500 went to arming itself up.
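The per-capita figures are easy to derive from totals, as in this sketch. The total-spending and population values below are rough, illustrative assumptions, not the chart's source data:

```python
# Deriving per-capita military spend and its share of income.
# Totals below are rough, illustrative values for Saudi Arabia in 2013.

spend_total = 75e9       # ~$75 billion total military spending (assumed)
population = 30e6        # ~30 million people (assumed)
gdp_per_capita = 27_000  # ~$27,000 GDP per head, as quoted in the text

spend_per_capita = spend_total / population
share_of_income = spend_per_capita / gdp_per_capita

print(round(spend_per_capita), round(share_of_income * 100, 1))
# 2500 9.3  -- i.e. about $2,500 per head, roughly 9% of income
```

Numbers on a per-head scale like these are far easier for readers to grasp than a $75-billion aggregate.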