I had the pleasure of visiting the Facebook data science team last week, and we spent some time chatting about visual communication, something they care as much about as I do. Solomon reported on our conversation in this blog post. One topic was stacked bar charts, which are useful in limited situations, such as when the categorical variable has two or three levels.
Solomon used stacked bars in his fascinating post about how candidates from the two political parties are using Facebook messages in the run-up to mid-term elections. Be sure to read about it here. This is an example of good data journalism in which the outcomes of the analysis are presented simply, hiding the amount of technical work that went into its production.
This stacked bar chart is effective at pointing out the differences in the types of messages being sent out by party:
I do have one question, which concerns the placement of the 50-percent line. The line is very important to this chart, and I like the way it looks. When the line sits at 50 percent, it implies that the Republican and Democratic candidates were issuing about equal numbers of Facebook messages. If the share of total messages is not 50/50, then the reference line should sit elsewhere, at the overall share.
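To make the point concrete, here is a minimal sketch of where the reference line would sit. The message counts are hypothetical, made up for illustration; they are not taken from Solomon's data, and the function name is my own.

```python
def reference_share(rep_messages: int, dem_messages: int) -> float:
    """Return the Republican share of all messages, as a percentage.

    If both parties had the same mix of message types, every stacked bar
    would split at this share, so it is a natural spot for the reference line.
    """
    total = rep_messages + dem_messages
    if total == 0:
        raise ValueError("no messages to compare")
    return 100.0 * rep_messages / total


# Hypothetical message counts (assumptions, not from the original analysis)
balanced = reference_share(5000, 5000)
lopsided = reference_share(6000, 4000)

print(f"equal volume: reference line at {balanced:.0f}%")
print(f"60/40 volume: reference line at {lopsided:.0f}%")
```

With equal volumes the line belongs at 50 percent; with a 60/40 split, a bar that divides 60/40 signals no difference between the parties, so the line should move there.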
They later split the races into toss-ups versus uncompetitive ones, and use confidence intervals to communicate both the expected rate and the uncertainty of the estimate. The uncertainty bars in effect tell readers that there are many more uncompetitive elections than toss-ups.
The choice of the chart form is fine. But it makes me pull out my Tufte book. The data-ink ratio on this chart needs a little help. The gridlines can go. Even the 250 label on the x-axis can go. I might even go with just labelling the midpoints.
Lastly, this next chart is enlightening. Seems like older adults are much more likely to comment and/or like such political messages; men are more likely to comment while women are more likely to like. The small-multiples format helps us grasp the three-way analysis without much suffering.
This is a case of the chart telling a different story from the data. Let's look at one of the charts, piece by piece.
The first pie(ce) suggests that methane and carbon dioxide (CO2) add up to some total. That is the only way to read a pie chart. A pie chart shows components of a whole.
What is the whole? It's hard to interpret without some explanation. The title at the bottom says "Radiative Forcing change over the last 30 years" with a footnote disclosing... hold your breath... "Radiative forcings from other gases and human impact are not shown."
In other words, the visual object says that Radiative forcing from CO2 is about 5 times larger than that of Methane. A column chart would have displayed this relative scale more clearly.
But that chart is only one of a pair. Here is the whole picture:
This pair tells a particular story: Methane was a much larger share of something in the past and is predicted to become an almost irrelevant share of something in the future.
But such an interpretation would almost surely be wrong. The designer left a misleading cue here, which is to show two pies of equal size. There is just no conceivable way that the total "radiative forcing change" is identical in the last 30 years to that in the next 30 years.
The second pie chart also has a footnote. A better person can help me interpret what the following sentence means:
The radiative forcing that our current emissions have committed us to, 20 years from now, is based on a 300-year initial drawdown time scale for carbon dioxide, and 12 years for methane
I'm sure these words say something to a climate expert but this attempt stinks as a piece of public communication.
Returning to the equal-size pies for a moment. Since all other factors are removed, the chart only shows us the relative impact of Methane versus Carbon dioxide. If the data are to be believed, then the scale of the impact of Methane is expected to become much smaller relative to that of CO2 in the next 30 years. This does not imply that the absolute impact of Methane will be lower in the future than in the past.
There are three possible stories, all consistent with the above chart:
1) the absolute impact of Methane declines while the absolute impact of CO2 increases, and thus the relative impact of Methane decreases drastically
2) the absolute impacts of both decline but the impact of Methane declines a lot more
3) the absolute impacts of both increase but the impact of Methane grows a lot more slowly
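A small sketch shows how all three stories can produce the very same pair of pies. Every number below is hypothetical, in arbitrary units; the only thing borrowed from the chart is that CO2 is about five times Methane in the first pie, and the 5 percent future share is my assumption for "almost irrelevant".

```python
def methane_share(methane: float, co2: float) -> float:
    """Share of the two-slice pie that belongs to methane."""
    return methane / (methane + co2)


# Hypothetical absolute impacts, arbitrary units (assumptions for illustration)
past = {"methane": 20.0, "co2": 100.0}
futures = {
    "1) methane down, CO2 up": {"methane": 6.0, "co2": 114.0},
    "2) both down, methane down more": {"methane": 3.0, "co2": 57.0},
    "3) both up, methane up more slowly": {"methane": 21.0, "co2": 399.0},
}

print(f"past pie: methane share = {methane_share(**past):.1%}")
for story, values in futures.items():
    share = methane_share(**values)
    print(f"{story}: methane = {values['methane']}, share = {share:.1%}")
```

The three futures draw identical pies even though Methane's absolute impact falls in two of them and rises in the third.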
It is the designer's job to make the story of the data clear to readers.
The fact that the entire blog post contains a PDF image and no words is either laziness or arrogance. The title of the piece is "the story of methane, in five pie charts". I don't know what the story of methane is. I doubt that the intention of the author was to tell us that methane is extremely unimportant relative to CO2.
PS. Steven below linked to a response from RealClimate.org. They confirm that the "story of methane" is that it is unimportant relative to CO2. Perhaps they should have called it the "non-story of methane". They see no problem with these pie charts.
An anonymous reader sent in a Type V critique of the following map of July unemployment rates by state. The map was published by the Bureau of Labor Statistics (BLS), and used in a recent article in Vox.
Matt @ Vox took the BLS's bait, and singled out Mississippi as the worst in the nation. Our reader-contributor is none too pleased with this conclusion.
He noted that the red state stands out only because of the high "out of sample" top range of the legend. Three out of the seven colors are not found on the map at all! This is kind of like the white space problem when doing a line plot with large values and an axis starting at zero (for example, here), but the opposite. All the states are compressed into four colors, three of which are shades of orange.
The reader investigated, and reported back:
The top end of the legend seems to be set by Puerto Rico's 13.1%. Puerto Rico is omitted from the Vox map as well as from the BLS publication (link to PDF).
Mississippi only has the bare minimum, 8.0%, to qualify for the red color. Georgia is a 7.8; Michigan, Nevada, and Rhode Island are all 7.7.
24 (of the 50 States plus DC) are in the 6-8% band, and 21 are in the 4-6% band, with the remaining 5 under 4%.
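The reader's counts can be tallied in a few lines. This is only a sketch of the arithmetic: the band labels are my shorthand, and the single entry in the red band is Mississippi, per the note above.

```python
# Counts per unemployment band as reported by the reader (50 states plus DC)
band_counts = {
    "under 4%": 5,
    "4-6%": 21,
    "6-8%": 24,
    "red band (starts at 8.0%)": 1,  # Mississippi only
}
LEGEND_COLORS = 7  # the legend defines seven colors

total_areas = sum(band_counts.values())
used_colors = sum(1 for count in band_counts.values() if count > 0)
unused_colors = LEGEND_COLORS - used_colors
middle_share = (band_counts["4-6%"] + band_counts["6-8%"]) / total_areas

print(f"areas on the map: {total_areas}")
print(f"legend colors used: {used_colors}, unused: {unused_colors}")
print(f"share of areas in the two middle bands: {middle_share:.0%}")
```

The tally accounts for all 51 areas, confirms that three legend colors never appear, and shows that the bulk of the map is squeezed into two adjacent bands.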
None of the above is obvious when looking at the map.
In the Trifecta Checkup, this is a Type V chart. The data is accurate. The question being asked is clear but the visual construction is problematic.
[I'm seizing back the mike.] While the map is often not the best choice for showing geographic data, something we frequently cover on this blog, in this particular case, there is a strong regional pattern. Of course, with the compressed choice of colors, this regional pattern is not easily observed in the original.
Matthew Yglesias, writing for Vox, cited the following chart from a World Bank project:
His comment was: "We can see that while China has overtaken Germany and Japan to become the world's second-largest economy (i.e., total area of the rectangle) its citizens are nowhere near being as rich as those of those countries or even Mexico."
Yes, the chart encodes the size of the economy in a rectangular area, with one side being the per-capita GDP and the other being the population. I am not sure about the "we can see". I am not confident that the short and wide rectangle for China is larger than the thin and tall ones for Japan and for Germany. Perhaps Matthew is relying on knowledge in his head, rather than knowledge on the chart, to come to this conclusion.
This is the trouble with rectangular area charts: they have a nerdy appeal since side x side = area but as a communications device, they fail.
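Here is a toy illustration of why the comparison is hard. The two rectangles below use made-up dimensions, not figures for any actual country: one is short and wide, the other thin and tall.

```python
def area(population: float, gdp_per_capita: float) -> float:
    """Total GDP is encoded as width (population) times height (per-capita GDP)."""
    return population * gdp_per_capita


# Hypothetical rectangles in arbitrary units (assumptions, not World Bank data)
wide_short = {"population": 10.0, "gdp_per_capita": 1.0}
thin_tall = {"population": 2.0, "gdp_per_capita": 4.5}

area_wide = area(**wide_short)
area_tall = area(**thin_tall)

print(f"wide and short: {area_wide}")
print(f"thin and tall:  {area_tall}")
print(f"ratio of areas: {area_wide / area_tall:.2f}")
```

The areas differ by only about a tenth while the shapes look nothing alike, which is the kind of judgment the eye makes poorly.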
Here are some problems with the chart:
it's difficult to compare rectangular areas
the columns can only be sorted in one way (I'd have chosen to order it by population)
colors are necessitated by the chart type, not by the data
the cumulative horizontal axis makes no sense unless the vertical axis is cumulative GDP (or cumulative GDP per capita)
Matthew should also have mentioned PPP (Purchasing Power Parity). If GDP is used as a measure of "wellbeing", then costs of living should be taken into account in addition to incomes. The cost of living in China is much lower than in Japan or Germany and using the prevailing exchange rates disguises this point.
Try your hand at fixing this one. There are no easy solutions. Does interactivity help? How about multiple charts? You will learn why I classify it as QDV instead of just DV.
[Update, 8/18/2014:] Xan Gregg created a scatter plot version of the chart. He also added, "There is still the issue of what the question is, but I'm assuming it's along the lines of 'How do economies compare regarding GDP, population, and GDP/capita?' I'm using the PPP-based GDP, but I didn't read the report carefully enough to figure out if another measure was better."
This sort of chart is, unfortunately, quite common in business circles. Just about the only thing one can read readily from this chart is the overall growth in the plug-in vehicle market (the heights of the columns).
To fix this chart, start subtracting. First, we can condense the monthly data to quarterly:
This version is a bit less busy but there are still too many colors, and too many things to look at.
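For readers who want to try the monthly-to-quarterly step themselves, a minimal sketch follows. The 'YYYY-MM' key format and the sales figures are assumptions for illustration, not the actual plug-in vehicle data.

```python
from collections import defaultdict


def to_quarterly(monthly: dict[str, int]) -> dict[str, int]:
    """Sum monthly values keyed by 'YYYY-MM' into quarters keyed by 'YYYY-Qn'."""
    quarterly: dict[str, int] = defaultdict(int)
    for month_key, units in monthly.items():
        year, month = month_key.split("-")
        quarter = (int(month) - 1) // 3 + 1  # months 1-3 -> Q1, 4-6 -> Q2, ...
        quarterly[f"{year}-Q{quarter}"] += units
    return dict(quarterly)


# Hypothetical monthly unit sales (made up for illustration)
monthly_sales = {
    "2013-01": 100, "2013-02": 120, "2013-03": 150,
    "2013-04": 160, "2013-05": 170, "2013-06": 200,
}
quarterly_sales = to_quarterly(monthly_sales)
print(quarterly_sales)
```

Six columns become two, and the total is unchanged.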
Next, we can condense the makes of the vehicles and focus on the manufacturers:
This version is even less busy and more readable. We can now see that Chevrolet, Nissan, Toyota, Ford and Tesla are the five biggest manufacturers in this category. All the small brands have been aggregated into the "Others" category. The stacked column chart still makes it hard to know what's going on with each individual brand's share, other than the one brand situated at the bottom of the stack.
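The condensing step can be sketched as a roll-up from models to brands, followed by lumping everything outside the top five into "Others". The model names and unit counts below are placeholders I made up; only the five brand names come from the chart.

```python
from collections import defaultdict


def condense(model_sales: list[tuple[str, str, int]], top_n: int = 5) -> dict[str, int]:
    """Roll model-level sales up to brands, then keep the top_n brands plus 'Others'."""
    by_brand: dict[str, int] = defaultdict(int)
    for brand, _model, units in model_sales:
        by_brand[brand] += units
    ranked = sorted(by_brand.items(), key=lambda item: item[1], reverse=True)
    condensed = dict(ranked[:top_n])
    others = sum(units for _brand, units in ranked[top_n:])
    if others:
        condensed["Others"] = others
    return condensed


# Hypothetical (brand, model, units) rows; model names and numbers are placeholders
model_sales = [
    ("Chevrolet", "Model A", 300), ("Chevrolet", "Model B", 200),
    ("Nissan", "Model C", 450), ("Toyota", "Model D", 300),
    ("Ford", "Model E", 150), ("Ford", "Model F", 100),
    ("Tesla", "Model G", 200),
    ("Brand X", "Model H", 40), ("Brand Y", "Model I", 25),
]
brand_sales = condense(model_sales)
print(brand_sales)
```

Nine series shrink to six, and the column heights are untouched because the total is preserved.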
This shows the growth in the overall market, as well as several interesting developments:
The growth in the number of competitors in the market especially since 2012
The fragmentation of the market. Before mid-2012, Chevrolet was dominating the market. Since then, five or six brands have been splitting the market
The first-to-market brands have not been able to sustain their advantage
A smoothed version of the line chart is even more readable:
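There are many ways to smooth a line. As one simple possibility, and not necessarily the method behind the chart here, a centered moving average can be sketched as follows; the input series is made up.

```python
def moving_average(values: list[float], window: int = 3) -> list[float]:
    """Centered moving average; the window shrinks at the two ends of the series."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        segment = values[lo:hi]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


# Hypothetical, deliberately jagged series (made up for illustration)
raw = [10.0, 40.0, 10.0, 40.0, 10.0]
smooth = moving_average(raw, window=3)
print(smooth)
```

The smoothed series swings over a much narrower range than the raw one, which is what makes the trend easier to follow.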
Graphics is a discipline that often rewards subtracting. Less is more.
In the above discussion, I focused on the Visual aspect of the Trifecta Checkup. This dataset is really difficult to interpret, and I'd not want to visualize it directly.
The real question we are after is to assess which manufacturer is leading the pack in plug-in vehicles.
There are a number of obstacles in our path. Different makes are being launched at different times, and it takes many months for a new make to establish itself in the market. Thus, comparing one make that just launched with another that has been in the market for twelve months is a problem.
Also, makes are of different vehicle types: compacts, SUVs, sedans, etc. More expensive vehicles will have fewer sales whether they are plug-ins or not.
Thirdly, population grows over time. The analyst would need to establish growth that is above the level of population growth.
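The population adjustment amounts to looking at growth in sales per person rather than in raw sales. A minimal sketch, using invented numbers:

```python
def growth_rate(start: float, end: float) -> float:
    """Simple growth rate between two periods."""
    return end / start - 1.0


def per_capita_growth(sales_start: float, sales_end: float,
                      pop_start: float, pop_end: float) -> float:
    """Growth in sales per person, which nets out population growth."""
    return growth_rate(sales_start / pop_start, sales_end / pop_end)


# Hypothetical figures (assumptions): sales up 10%, population up 2%
raw = growth_rate(1000.0, 1100.0)
adjusted = per_capita_growth(1000.0, 1100.0, 100.0, 102.0)

print(f"raw sales growth:  {raw:.1%}")
print(f"per-capita growth: {adjusted:.1%}")
```

The adjusted rate is lower than the raw rate whenever the population is growing; only the adjusted part counts as growth above the population trend.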
Note to New York metro readers: I'm an invited speaker at NYU's "Art and Science of Brand Storytelling" summer course which starts tomorrow. I will be speaking on Thursday, 12-1 pm. You can still register here.
The home run data set, compiled by ESPN and visualized by Mode Analytics, is pretty rich. I took a quick look at one aspect of the data. The question I ask is what differences exist among the 10 hitters that are highlighted in the previous visualization. (I am not quite sure how those 10 were picked because they are not the Top 10 home run hitters in the dataset for the current season.)
The following chart focuses on two metrics: the total number of home runs by this point in the season; and the "true" distances of those home runs. I split the data by whether the home run was hit on a home field or an away stadium, on the hunch that we'd need to correct for such differences.
The hitters are sorted by total number of home runs. Because I am using a single season, my chart doesn't suffer from a cohort bias. If you go back to the original visualization, it is clear that some of these hitters are veterans with many seasons of baseball in them while others are newbies. This cohort bias explains the difference in dot densities of those plots.
Not having followed baseball recently, I don't know many of the names on the list. I had to look up Todd Frazier: does he play in a hitter-friendly ballpark? His home-to-away ratio is massive. Frazier plays for Cincinnati, at Great American Ball Park. That ballpark has the third-highest number of home runs of all ballparks this season, although up till now opponents have hit more home runs there than home players. For reference, Troy Tulowitzki's home field is Colorado's Coors Field, which is a hitter's paradise. Giancarlo Stanton, who also hits quite a few more home runs at home, plays for Miami at Marlins Park, which is below the median in terms of home run production; thus his achievement is probably the most impressive amongst those three.
Josh Donaldson is the odd man out, as he has hit more away home runs than home runs at home. His O.co Coliseum is middle-of-the-road in terms of home runs.
In terms of how far the home runs travel (bottom part of the chart), there are some interesting tidbits. Brian Dozier's home runs are generally the shortest, regardless of home or away. Yasiel Puig and Giancarlo Stanton generate deep home runs. Adam Jones, Josh Donaldson, and Yoenis Cespedes have hit the ball quite a bit deeper away from home. Giancarlo Stanton is one of the few who have hit the home-run ball deeper at their home stadium.
The baseball season is still young, and the sample sizes at the individual-hitter level are small (~15-30 home runs in total), thus the observed differences at the home/away level are mostly statistically insignificant.
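To see how little a lopsided split means at these sample sizes, here is a rough check using an exact binomial test. It assumes that, absent any real home/away effect, each home run is equally likely to come at home or away; the 12-versus-5 split is a hypothetical hitter, not one from the chart.

```python
from math import comb


def two_sided_binomial_p(home: int, away: int) -> float:
    """Exact two-sided p-value for a home/away split under a 50/50 null."""
    n = home + away
    k = max(home, away)
    # Probability of a split at least this lopsided in one direction
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # The null is symmetric, so double the tail (capped at 1)
    return min(1.0, 2 * tail)


# Hypothetical hitter: 12 home runs at home, 5 away (made-up counts)
p_value = two_sided_binomial_p(12, 5)
print(f"p-value for a 12/5 split: {p_value:.3f}")
```

Even a split that looks dramatic on the chart does not clear the conventional 5 percent threshold with only 17 home runs.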
The prior post on the original graphic can be found here.