I'd like to start 2015 on a happy note. I enjoyed reading the piece by Steven Rattner in the New York Times called "The Year in Charts". (link)
I particularly like the crisp headers, and unfussy language, placing the charts at the center. The components of the story flow nicely.
Here are my notes on some of the charts:
This chart is missing context, which is performance against population growth or potential. Changing the context also changes the implicit yardstick. The implied metric here is more-than-zero growth or continued growth.
It took me a while to find the titles to know what each section depicts. I'd prefer to put the titles back to the top or the top left corner. The "information in my head" is making me look at the "wrong" places. But otherwise, this is Tufte goodness.
This innocent thing prompts a host of questions. First, how could a "median" be found to have so many values within one population? It would appear that this is an exercise in isolating each quintile (decile in the case of the top 20%) and computing the median within each segment. In other words, the data represent these income percentiles: 95th, 85th, 75th, 50th, 30th and 10th. Given that the income data have already been grouped, computing group averages makes more sense than calculating group medians. This is especially so when comparing changes over time. The robust median suppresses changes.
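The mechanics can be sketched with a made-up income sample (all numbers here are hypothetical, not the Fed's data): the median within each bounded bucket is just a fixed percentile of the full distribution, while the mean of the top bucket gets pulled up by the long tail that the median ignores.

```python
import random
import statistics

random.seed(0)
# Hypothetical right-skewed income sample (units: $1,000s)
incomes = sorted(random.lognormvariate(4.0, 0.8) for _ in range(10_000))

# Quintiles, with the top quintile split into two deciles, as in the chart
cuts = [0, 20, 40, 60, 80, 90, 100]
n = len(incomes)
groups = [incomes[n * lo // 100 : n * hi // 100]
          for lo, hi in zip(cuts, cuts[1:])]

group_medians = [statistics.median(g) for g in groups]
group_means = [statistics.fmean(g) for g in groups]
```

In a sample like this, the top bucket's mean sits well above its median, which is the feature that gets suppressed when medians are reported.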
The bucketing of income presents another challenge. All buckets except the one at the very top are essentially bounded. The central buckets all have minimum and maximum values, and the bottom bucket is bounded below by zero. The top bucket, however, is basically unbounded, so important features of the data could be lost by summarizing the top bucket by its median.
A third problem surfaces if one were to inquire how the survey collects its data. According to the Federal Reserve description, the data concern "usual income" as opposed to "actual income". Respondents are told to ignore "temporary" conditions in describing their "usual incomes". It is likely that people regard income increases as permanent but getting laid off as temporary; so while usual income solves one problem (the long-term planner's problem), it creates a different problem (short-term bias). In particular, I don't think it is a good metric for assessing changes around a recession/recovery.
I also wonder about the imputation of missing data. I'd guess there is a preponderance of missing values among unemployed people. If the imputation cannot predict the employment status of those people, then it would surely have inflated their incomes.
I wonder if any of my readers knows details about some of these potential problems. Would love to hear how the Fed's statisticians deal with these issues.
On this chart, the author has found an excellent story, and the graphic is effective. I prefer to see the horizontal axis labelled "More Unequal" as opposed to "Less Equal" because of the convention that "more" is usually placed to the right of "less" on the horizontal axis. Here is a scatter plot version of the data:
It shows the U.S. is a bit more extreme than all others.
This is another great chart. I like the imagery of the emptying middle. I find the labels a bit too long and requiring too much interpreting. I prefer this:
I have yet to understand why the vertical axis of the top chart keeps changing scales over time. The white dot labelled "Peak 1982" (70 million) is barely above the other white dot for "2007" (38 million). This chart hides a clear trend: the population of sheep in New Zealand has plunged by 45% over 25 years.
To address the question of sheep versus humans, one should plot the ratio of sheep to humans directly. In this case, the designer probably faced a problem: because of the plunging population of sheep, the ratio has plunged steeply in 25 years. To make the point that "people are outnumbered more than 9 to 1", the designer didn't want to show a plunging trend. (Could this be the reason why the human population in 1982 was not printed?)
This is a case of too many details. Instead of manipulating the scale to distort the data, one can simply show the current ratio, or the average ratio in the last five years.
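The arithmetic behind my objection can be laid out in a few lines. The sheep counts come from the chart's white dots; the human populations are my own approximate figures for New Zealand, not numbers from the chart:

```python
# Sheep counts (millions) read off the chart's white dots;
# human populations (millions) are approximate NZ figures I supplied.
sheep = {1982: 70, 2007: 38}
humans = {1982: 3.2, 2007: 4.2}  # assumption, not printed on the chart

# The sheep-to-human ratio the chart could have shown directly
ratio = {yr: sheep[yr] / humans[yr] for yr in sheep}
```

The ratio falls from roughly 22-to-1 in 1982 to just over 9-to-1 in 2007, which is the plunging trend the curved scale obscures.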
As the reader scans down to the bottom set of charts, a cognitive wedge appears: the curved scale of the New Zealand chart gives way to the usual uniform scale. These smaller charts are no less confusing, however.
The two lines on these two charts look almost identical, and yet the Australian chart (on the left) shows a ratio of 4 to 1 while the Icelandic chart (on the right) shows a ratio of 1.5 to 1. Makes you wonder whether each of the small multiples has its own axis scale.
Again, I'm not convinced that the time series adds anything to the message.
In a Trifecta Checkup (link), the Vulture chart falls into Type DV. The question might be the relationship between running time and box office, and between Rotten Tomatoes Score and box office. These are very difficult to answer.
The box office number here refers to the lifetime gross ticket receipts from theaters. The movie industry insists on publishing these unadjusted numbers, which are completely useless. At the minimum, these numbers should be adjusted for inflation (ticket prices) and for population growth, if we are to use them to measure commercial success.
The box office number is also suspect because it ignores streaming, digital, syndication, and other forms of revenues. This is a problem because we are comparing movies across time.
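The minimal adjustments I describe above can be sketched as a small computation. All the numbers below are illustrative, not actual grosses or ticket prices:

```python
# Sketch of the minimal adjustment: deflate nominal gross by the
# average ticket price (inflation), then normalize by population,
# giving estimated tickets sold per capita.
def adjusted_gross(nominal_gross, avg_ticket_price, population):
    tickets_sold = nominal_gross / avg_ticket_price
    return tickets_sold / population

# Made-up figures for two movies decades apart:
old = adjusted_gross(200e6, 2.50, 210e6)  # 1970s-era release
new = adjusted_gross(400e6, 9.00, 320e6)  # recent release
```

With these made-up numbers, the recent movie grosses twice as many nominal dollars yet sells far fewer tickets per capita, which is exactly why unadjusted grosses are useless for comparing commercial success across eras.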
You might have noticed that both running time and box office numbers have gone up over time. (That is to say, running time and box office numbers are highly correlated.) Do you think that is because moviegoers are motivated to see longer films, or because movies are just getting longer?
PS. [12/15/2014] I will have a related discussion on the statistics behind this data on my sister blog. Link will be active Monday afternoon.
This NYT graphic published on the eve of the Senate elections represents the best of data visualization: it carries its message with a punch.
The link to the web page is here. The graphic proudly occupied the front page of the print edition on Tuesday.
This graphic is not cliched. The typical price of such novelty is that the chart has to come with a reader's manual. The beauty of this one is that the required manual is compact:
The rectangular areas indicate the lack of competitiveness in each race. The extremes are: the entirely filled rectangle is a lock from start to finish; and the completely blank rectangle is a 50/50 tossup from start to finish. The more color, the less competitive the race.
Red implies the Republican candidate is projected to be leading at that moment; Blue, the Democrat; and Green, an independent. (The juxtaposition of red and green is one of the few mis-steps here.)
If you stick to the above, you will do fine.
If you start thinking the height of the area is the chance of winning, you run into trouble.
*** Here is a more conventional way to show time-series projections. It is a mirrored line chart, in which one of the two lines is redundant. (This chart shows up elsewhere on the NYT site.)
To turn this into the other style, draw a line through the 50-percent level, erase everything below 50, and then switch from line to area.
On the far right, where it says 75%, you can see that it is precisely half-way between 50 and 100 percent. So the new chart breaks the start-at-zero rule for area charts.
Except... this is an ingenious violation of that rule. Like I said, if you are able to get your head around thinking that the area maps to lack of competitiveness (or, the size of the leader's lead, regardless of who's leading), and suppress the urge to interpret the areas as the chance of winning, then the axis starting at 50 percent is not a problem. (I'm assuming that most of these races are in essence two-horse races. If there are more than two viable candidates, this particular chart form doesn't work.)
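My reading of the transformation can be written down as a small data step. This is a sketch of how I interpret the chart, not the NYT's actual code:

```python
# Map a win probability to the height of the colored area.
# In a two-horse race the leader's probability is always >= 50%,
# so the area measures the lead above the 50-percent baseline.
def competitiveness_height(dem_win_prob):
    leader_prob = max(dem_win_prob, 1 - dem_win_prob)
    return leader_prob - 0.5  # 0 = dead heat, 0.5 = total lock

# A lock, a 75/25 race, and a tossup:
heights = [competitiveness_height(p) for p in (0.95, 0.75, 0.50)]
```

Under this reading, a 95 percent favorite fills 90 percent of the rectangle's height, while a tossup fills none of it, which matches the manual above.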
The payoff is a very compact chart that shows a lot of data in a small space. The NH race was a lock for the Democrats at the start, but the lead kept dwindling, so that on the eve of the election it had been cut in half. Even halved, the chance still stood at 75 percent in favor of the Dems.
Iowa and Colorado both flipped from a Democratic to a Republican lead around the middle of September.
When the visualization is driven well, the readers have an effortless ride.
I had the pleasure of visiting the Facebook data science team last week, and we spent some time chatting about visual communication, something they care about as much as I do. Solomon reported on our conversation in this blog post. One topic is stacked bar charts, which are useful in limited situations, such as when the categorical variable has two or three levels.
Solomon used stacked bars in his fascinating post about how candidates from the two political parties are using Facebook messages in the run-up to mid-term elections. Be sure to read about it here. This is an example of good data journalism in which the outcomes of the analysis are presented simply, hiding the amount of technical work that went into its production.
This stacked bar chart is effective at pointing out the differences in the types of messages being sent out by party:
I do have one question, which is the placement of the 50-percent line. The line is very important to this chart, and I like the way it looks. When the line sits at 50 percent, it implies that the Republican and Democratic candidates were issuing about equal numbers of Facebook messages. If the share of total messages is not 50/50, then the reference line should sit elsewhere.
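The correct placement of the reference line is a one-line computation. The message counts below are made up for illustration; the point is that the line should sit at the overall share, wherever that falls:

```python
# Hypothetical totals of Facebook messages sent by each party's candidates
rep_msgs, dem_msgs = 6200, 5400  # made-up counts

# The reference line belongs at the Republican share of all messages,
# which is 50% only if the two parties sent equal numbers.
ref_line = rep_msgs / (rep_msgs + dem_msgs)
```

With these made-up counts, the line belongs at about 53 percent rather than 50.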
They later split the races into toss-ups versus uncompetitive, and use confidence intervals to communicate both the expected rate and the uncertainty of the estimate. The uncertainty bars in effect tell readers that there are many more uncompetitive elections than toss-ups.
The choice of the chart form is fine. But it makes me pull out my Tufte book. The data-ink ratio on this chart needs a little help. The gridlines can go. Even the 250 label on the x-axis can go. I might even go with just labelling the midpoints.
Lastly, this next chart is enlightening. Seems like older adults are much more likely to comment and/or like such political messages; men are more likely to comment while women are more likely to like. The small-multiples format helps us grasp the three-way analysis without much suffering.
The New York Times Upshot team came up with a dataviz that is worth your time. This is a set of maps that gives a perspective on migration patterns within the US. The metric being portrayed is the birthplace of current residents of each state.
Here is the chart for California:
I see a few smart ideas, starting with the little map on the bottom left. It serves multiple functions. It is a legend mapping colors to four regions of the US. It serves as a visual guide to the definition of regions. It serves as an interactive tool to select states. Readers might remember the use of a pie chart as a legend in my remake of one of the Wikipedia pie charts (link).
The aggregation up to regions is what really makes this chart work. This aggregation reduces the number of pieces from about 50 to about 10.
They also did a great job with the axes and gridlines. Most of the data labels are hidden, but the most important numbers are retained. These include the proportion of residents who were born in their home state, the proportion of residents who were born outside the U.S., and any state(s) that contribute a significant portion of residents. In the California example, we see that the proportion of Midwest-born people living in California has declined by a lot over time.
Users can interactively hover over the gridlines to uncover the data labels.
As you scroll through the states, there are some recurring patterns.
Some states clearly have become more desirable over time. Georgia, for instance, has seen strong in-migration (colored pieces) especially from non-Southern states:
This pattern is repeated in other southeastern states, including Virginia, North Carolina and Tennessee.
By contrast, some states are not getting the migrants. As a result, the share of residents born in the home state has increased over time. The Midwestern states have this problem. For instance, Minnesota:
I also find a few states with special features. Nevada has always been a state of migrants:
Wyoming, on the other hand, has become popular with migrants over time, but the composition has shifted away from Midwest states.
I'd have preferred presenting the charts in clusters based on patterns.
I haven't been able to figure out the multi-color spaghetti. I think the undulations are purely for aesthetic reasons.
One way to read the chart, then, is to first see three big patches (light grey for born in current state; white patch for born in other U.S. states; dark gray for born outside the U.S.). Within the white patch, we are looking for the shift between the colors (i.e. regions).
Matthew Yglesias, writing for Vox, cited the following chart from a World Bank project:
His comment was: "We can see that while China has overtaken Germany and Japan to become the world's second-largest economy (i.e., total area of the rectangle) its citizens are nowhere near being as rich as those of those countries or even Mexico."
Yes, the chart encodes the size of the economy in a rectangular area, with one side being the per-capita GDP and the other being the population. I am not sure about the "we can see". I am not confident that the short and wide rectangle for China is larger than the thin and tall ones for Japan and for Germany. Perhaps Matthew is relying on knowledge in his head, rather than knowledge on the chart, to come to this conclusion.
This is the trouble with rectangular area charts: they have a nerdy appeal (side × side = area), but as a communication device, they fail.
Here are some problems with the chart:
it's difficult to compare rectangular areas
the columns can only be sorted in one way (I'd have chosen to order it by population)
colors are necessitated by the chart type not the data
the cumulative horizontal axis makes no sense unless the vertical axis is cumulative GDP (or cumulative GDP per capita)
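The encoding itself is simple enough to state as arithmetic. The figures below are rough, illustrative values I supplied, not the World Bank's exact data:

```python
# The rectangle's area is GDP = per-capita GDP x population.
# Rough, illustrative figures (circa 2014, my approximations):
economies = {
    "China":   {"per_capita": 7_000,  "population": 1_360e6},
    "Japan":   {"per_capita": 38_000, "population": 127e6},
    "Germany": {"per_capita": 46_000, "population": 81e6},
}
gdp = {name: v["per_capita"] * v["population"]
       for name, v in economies.items()}
```

The computation confirms China's total area exceeds Japan's and Germany's, but the point stands: the reader has to do this multiplication mentally, because the eye cannot reliably compare a short, wide rectangle against a thin, tall one.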
Matthew should also have mentioned PPP (Purchasing Power Parity). If GDP is used as a measure of "wellbeing", then costs of living should be taken into account in addition to incomes. The cost of living in China is much lower than in Japan or Germany and using the prevailing exchange rates disguises this point.
Try your hand at fixing this one. There are no easy solutions. Does interactivity help? How about multiple charts? You will learn why I classify it as QDV instead of just DV.
[Update, 8/18/2014:] Xan Gregg created a scatter plot version of the chart. He also added, "There is still the issue of what the question is, but I'm assuming it's along the lines of "How do economies compare regarding GDP, population, and GDP/capita?" I'm using the PPP-based GDP, but I didn't read the report carefully enough to figure out if another measure was better."
Is data visualization worth paying for? In some quarters, this may be a controversial question.
If you are having doubts, just look at some examples of great visualization. This week, the NYT team brings us a wonderful example. The story is about whether dogs feel jealousy. Researchers had dog owners play with (a) a stuffed toy shaped like a dog, (b) a Jack-o'-lantern, and (c) a book; and they measured several behaviors that are suggestive of jealousy, such as barking or pushing/touching the owner.
This is how the researchers presented their findings in PLOS:
And this is how the same chart showed up in NYT:
Same data. Same grouped column format. Completely different effect on the readers.
Let's see what the NYT team did to the original, roughly in order of impact:
Added a line above the legend, explaining that the colors represent different experimental conditions
Re-ordered the behavior by their average prevalence from left to right
Added little cartoons to make the chart more fun to look at
Added colors and removed moire patterns (a Tufte pet peeve)
Changed the vertical scale from 0–1 (proportions, as scientists prefer) to 0–100 (percentages)
Reduced the number of tick marks on the vertical scale (this is smart because the researchers observed only about 30 dogs so only very large differences are of practical value)
Clarified certain category details, e.g. Snapping became "bite or snap at object"
Removed technical details of p-values, not important to NYT readers
Even simple charts illustrating simple data can be done well or done poorly.
I was traveling quite a lot recently, and last week, read the Wall Street Journal cover to cover for the first time in a while. I am happy to report that there are many more data graphics than I remember of past editions.
The following chart illustrating findings of an FCC report on broadband speeds has a number of issues (a related blog post containing this chart can be found here):
The biggest problem with the visual elements is the lack of linkage between the two components. The two charts should be connected: the one on the right presents ISP averages by the broadband technology while the one on the left presents individual ISP results. Evidently, the designer treats the two parts as separate.
If that was the intention, there are two decisions that create confusion for readers. First, the charts use two different but related scales. Just add 100% to the scale of the left chart and you get the scale of the right chart. There really is no need for two different scales.
Secondly, orange and blue are used in both charts but for different purposes. In the left chart, orange denotes all ISPs whose actual speeds were below their advertised speeds. In the right chart, orange denotes ISPs using DSL technology.
I also do not understand why some ISP names are bolded. The bolded companies include several cable providers (but not all), several DSL providers (but not all), one fiber provider and no satellite.
Lastly, I'd prefer they stick to either "advertised" or "promised". I do like the axis labels, saying "faster than" and "slower".
One challenge of the data is that the FCC report (here) does not provide a mathematical linkage between the technology averages and the ISP data. We know that 91% for DSL is the average of the ISPs that use DSL as shown on the left of the chart, but we don't know the weights (relative popularity) of each ISP so we can't check the computation.
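The missing piece can be stated precisely: the technology average should be a weighted average of its ISPs' scores, with weights equal to relative subscriber counts. With made-up scores and weights (none of these are the FCC's numbers), the check would look like this:

```python
# Hypothetical DSL ISP scores (actual / advertised speed) and
# hypothetical subscriber weights -- the FCC report gives the scores
# but not the weights, so this check cannot be run on the real data.
dsl_scores = {"ISP A": 0.85, "ISP B": 0.93, "ISP C": 0.96}
weights = {"ISP A": 0.5, "ISP B": 0.3, "ISP C": 0.2}

# The published technology average should equal this weighted sum
tech_average = sum(dsl_scores[isp] * weights[isp] for isp in dsl_scores)
```

Without the weights, any combination of ISP scores averaging to 91% is consistent with the report.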
But if we think of the average by technology as a reference point against which to measure individual ISPs, we can still put the data to use, and more efficiently, such as in the following dot plot, where the vertical lines indicate the appropriate technology average:
(The cable section should have come before the DSL section but you get the idea.)
The key message of the chart, in my mind, is that DSL providers as a class over-promise and under-deliver.