This is a case of the chart telling a different story from the data. Let's look at one of the charts, piece by piece.
The first pie(ce) suggests that methane and carbon dioxide (CO2) add up to some total. That is the only way to read a pie chart. A pie chart shows components of a whole.
What is the whole? It's hard to interpret without some explanation. The title at the bottom says "Radiative Forcing change over the last 30 years" with a footnote disclosing... hold your breath... "Radiative forcings from other gases and human impact are not shown."
In other words, the visual object says that Radiative forcing from CO2 is about 5 times larger than that of Methane. A column chart would have displayed this relative scale more clearly.
But that chart is only one of a pair. Here is the whole picture:
This pair tells a particular story: Methane was a much larger share of something in the past and is predicted to become an almost irrelevant share of something in the future.
But such an interpretation would almost surely be wrong. The designer left a misleading cue here, which is to show two pies of equal size. There is just no conceivable way that the total "radiative forcing change" is identical in the last 30 years to that in the next 30 years.
The second pie chart also has a footnote. A better person can help me interpret what the following sentence means:
The radiative forcing that our current emissions have committed us to, 20 years from now, is based on a 300-year initial drawdown time scale for carbon dioxide, and 12 years for methane
I'm sure these words say something to a climate expert but this attempt stinks as a piece of public communication.
Returning to the equal-size pies for a moment. Since all other factors are removed, the chart only shows us the relative impact of Methane versus Carbon dioxide. If the data are to be believed, then the scale of the impact of Methane is expected to become much smaller relative to that of CO2 in the next 30 years. This does not imply that the absolute impact of Methane will be lower in the future than in the past.
There are three possible stories, all consistent with the above chart:
1) the absolute impact of Methane declines while the absolute impact of CO2 increases, and thus the relative impact of Methane decreases drastically
2) the absolute impacts of both decline but the impact of Methane declines a lot more
3) the absolute impacts of both increase but Methane's impact grows a lot more slowly
It is the designer's job to make the story of the data clear to readers.
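To see why all three stories fit the same pair of pies, here is a minimal sketch in Python. Every number in it is made up for illustration (arbitrary units, not real radiative forcing data); the only thing taken from the chart is that CO2 was about 5 times Methane in the first pie. The function name shares and the 5% future share are my own assumptions.

```python
def shares(methane, co2):
    """Return (methane share, CO2 share) of their combined total."""
    total = methane + co2
    return methane / total, co2 / total


# Hypothetical absolute impacts for the last 30 years (CO2 about 5x Methane).
past = {"methane": 1.0, "co2": 5.0}

# Three hypothetical futures, one per story. All values are invented.
scenarios = {
    "1) methane down, CO2 up": {"methane": 0.5, "co2": 9.5},
    "2) both down, methane much more": {"methane": 0.1, "co2": 1.9},
    "3) both up, methane much more slowly": {"methane": 1.1, "co2": 20.9},
}

past_share, _ = shares(past["methane"], past["co2"])
print(f"past methane share: {past_share:.1%}")

for story, impact in scenarios.items():
    future_share, _ = shares(impact["methane"], impact["co2"])
    print(f"{story}: methane share {future_share:.1%}")
```

The three futures have very different absolute levels, yet each produces the same second pie. Equal-size pies throw away exactly the information needed to tell the stories apart.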
The fact that the entire blog post contains a PDF image and no words is either laziness or arrogance. The title of the piece is "the story of methane, in five pie charts". I don't know what the story of methane is. I doubt that the intention of the author was to tell us that methane is extremely unimportant relative to CO2.
PS. Steven below linked to a response from RealClimate.org. They confirm that the "story of methane" is that it is unimportant relative to CO2. Perhaps they should have called it the "non-story of methane". They see no problem with these pie charts.
Back in 2008, I wrote about this unfortunate chart by the Guardian (link):
The barrel imagery interferes with communicating the data. The green portion looks about the same size as the red portion when the number is four times smaller.
This week, the staff at WSJ published a similar chart in this article about North Dakota fracking.
They kind of recognize the distortion and utilize a horizontal cutup instead of following the edge of the barrel. But it doesn't really fix the problem if you look at how 3 percent at the top and at the bottom of the barrel are portrayed.
It's not clear to me why they don't use a simple stacked column chart with horizontal text labels.
Reader and tipster Chris P. found this "death spiral" chart dizzying (link).
It's one of those charts that has conceptual appeal but does not do the data justice. As the name implies, the designer has a strong message, that the arctic sea ice volume has dramatically declined over time. This message is there in the chart but the reader has to work hard to find it.
Why doesn't this spider chart work? We can be more precise.
A big problem is the lack of scalability. This chart looks different every year. If you add an extra year to the chart, you either have to increase the density of the years or you have to drop the earliest year.
Years are not circular or periodic so the metaphor doesn't quite work.
Axis labeling is also awkward. Because of the polar coordinates, the axes are radiating so the numbers run up toward the top but run down toward the bottom.
This specific instance of spider chart benefits from the well-behaved data: the between-year variability is much lower than the within-year variability. As a result, the lines don't cross each other much. If the variability from year to year fluctuates a lot, we would have seen a bunch of noodles.
This is a pity because the designer did very well in aligning two corners of the Trifecta Checkup, namely what is the question and what does the data show? It is a great idea to control for month of year, and look at year to year changes. (A more typical view would be to look at month to month changes and plot one line per year.)
This is an example of a chart that does well on one side of the checkup but fails because the graph isn't in tune with the data or the question being addressed.
Whenever I see a spider chart, I want to unroll the spiral and see if a line chart is better. Thus:
The dramatic decrease in Arctic ice volume (no matter the month) is clear as day. You can actually read off the magnitude of the drop. (Try doing that in the spider chart, say between 1978 and 1995.)
This chart still has issues, namely too many colors. One can color the lines by season of the year, like this:
Or switch to a small-multiples set up with three lines per chart and one chart per season.
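For readers who want to try the unrolling themselves, here is a rough sketch in Python of the regrouping involved: one series per month, each running across years, then bundled three months to a panel. The records are invented, the function names are mine, and the calendar-style season grouping is only a placeholder assumption (the grouping used in the charts here may differ).

```python
from collections import defaultdict

# Hypothetical (year, month, ice volume) records. Values are made up.
records = [
    (1979, 3, 33.0), (1979, 9, 17.0),
    (1980, 3, 32.5), (1980, 9, 16.5),
    (1981, 3, 32.0), (1981, 9, 15.0),
]


def unroll_by_month(records):
    """Return {month: [(year, value), ...]} with each list sorted by year.

    Each month becomes one line on an ordinary year-on-the-x-axis chart.
    """
    series = defaultdict(list)
    for year, month, value in records:
        series[month].append((year, value))
    return {month: sorted(points) for month, points in sorted(series.items())}


# Assumed season grouping (placeholder): three months per panel.
SEASONS = {
    "winter": (12, 1, 2),
    "spring": (3, 4, 5),
    "summer": (6, 7, 8),
    "fall": (9, 10, 11),
}


def group_by_season(series, seasons=SEASONS):
    """Return {season: {month: series}} for a small-multiples layout."""
    return {
        name: {m: series[m] for m in months if m in series}
        for name, months in seasons.items()
    }


monthly = unroll_by_month(records)
panels = group_by_season(monthly)
```

Each entry of panels is one small chart, and each month inside it is one line. Passing a different seasons mapping changes the panels without touching the unrolling step.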
The seasonal arrangement is not arbitrary. You can see the effect of season by looking at side by side boxplots:
The pattern is UP-DOWN-DOWN-UP.
In fact, a side-by-side boxplot of the data provides a very informative look:
The monthly series is obscured in this view, built into the vertical variability, which we can see is quite stable. The idea of controlling for month is to make it irrelevant. This view emphasizes the year on year decline of the entire distribution.
If you're worried about dropping too much information, the data can be grouped by season as before in a small-multiples setup like this:
Regardless of season, the trend is down.
PS. Alberto reminds me of his post about one example of a spider chart (radar chart) that works. Here's the link. It works because the graphical element is more in tune with the data. While the ice cap data has a linear trend over time, the voting data is all about differences in distribution. Also, the designer is expecting readers to care about the high-level pattern, not about the specifics.
Back in 2007, the New York Times graphics team produced a fabulous chart explaining the rise in prices at the pump (link).
Let's start with the tab labeled "Regional Price" which contains a well-executed map of the average gas prices by county:
The colorscale is wonderful. It's just one color and yet the gradations are easily discerned. The general spatial pattern jumps out at you, with prices being higher on the Pacific coast, and lower from New England all the way down south. The Lakes region also has higher prices, as do New Mexico, Colorado and Hawaii.
What sets this legend apart is varying lengths of the segments. In particular, the darkest blue also corresponds to a wide range of prices (3.45-3.94). One can also easily figure out the lowest and highest price in the nation--the designers located exactly in which counties those prices were recorded, which is another nice touch.
To determine the breakpoints on the legend, one can use a statistical methodology: a standardized scale anchored on both sides of the national average price (from the other chart, the average price was $3.22). Then, we have each color mapping to the length of one standard deviation of prices in both directions. What this does is to put counties into standardized groups: for example, all counties whose prices were within one standard deviation above the average are given one tint while those that were one to two standard deviations above the average are given a darker blue, and so on. In effect, we would have created a contour map.
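Here is a minimal sketch of that standardized binning in Python. The $3.22 average comes from the chart; the standard deviation of $0.20 and the county prices are hypothetical values I made up, and the function name sd_band is my own.

```python
import math


def sd_band(price, average, sd):
    """Return a signed band index for a price.

    0 means within one standard deviation above the average,
    -1 means within one standard deviation below it,
    1 means one to two standard deviations above, and so on.
    """
    return math.floor((price - average) / sd)


NATIONAL_AVERAGE = 3.22  # from the other chart
ASSUMED_SD = 0.20        # hypothetical, for illustration only

# Hypothetical county prices.
county_prices = {
    "County A": 3.30,
    "County B": 3.50,
    "County C": 3.10,
    "County D": 2.90,
}

bands = {
    county: sd_band(price, NATIONAL_AVERAGE, ASSUMED_SD)
    for county, price in county_prices.items()
}
print(bands)
```

Each band index would then map to one tint, darker as the index rises. Unlike the Times legend, every segment here spans the same width in standard deviations.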
I see the designers' intention in clearly labeling the areas where they do not have data, with the diagonal stripes on white. My own preference is to put those areas in a mild gray, in effect blending them into the surroundings. In this way, the missing data do not distract the average reader, while the fastidious reader can still figure out where the data holes are.
This is a key learning for most research scientists. We have a tendency to train our eyes on the outliers and the data holes because they are like imperfections in diamonds. This leads us to the tendency of highlighting the least important message up front. And it's a bad habit.
In the following, I put the county and state level views side by side. The NYT graphic allows users to switch between the two views via a tab.
Much like the recent post on the age of buildings in Brooklyn, the state aggregates tell a simpler story but still capture almost all of the spatial pattern. The average prices per state are now printed directly on the chart. The question the designer should ask is what the readers want to learn from such a chart, and which one delivers more of such requirements. It's possible the Times is catering to two types of readers. Perhaps one can strike a middle ground, which is to break out certain states like Texas into contiguous "regions".
Thanks to you for continuing to make this blog a success. Writing it has given me much enjoyment over the years, and I have learned much from your comments as well as from the visualization projects of many colleagues. 2013 also saw the publication of my new book Numbersense: How to Use Big Data to Your Advantage (link). I thank those of you who have purchased the book, and supported my writing. For those who haven't, please check it out. I have also been speaking at various events, mostly about interpreting data analyses published in the mass media, and building effective data analytics teams. In addition, I am heavily involved in the new Certificate in Analytics and Data Visualization at New York University (link). While the frequency of posting has suffered a little due to my other projects, I hope you found the contents as engaging, fun, and constructive as before.
Looking forward to 2014, I have as usual a basket of projects. Besides the two blogs, I will be expanding my teaching at NYU, including a visualization workshop that I'll be writing about here soon; taking on consulting projects; evangelizing better communications of data and analytics; and prospecting several book projects. I continue to spend most of the week at Vimeo, where my team analyzes data.
This will be my last post in 2013. It is an extra-long post to tide you over to the New Year. Happy New Year!
A short while ago, I was in correspondence with Thomas Rhiel who created a lovely map depicting the age of buildings in Brooklyn (link). In this case, it's the data that piques my interest. I haven't seen this type of data visualized before. The map type is exquisitely aligned to the data: buildings are geographically located and the age is a third, non-geographical dimension which is encoded in the colors. Red-orange is the most recent while green-blue is the oldest.
The data is at the level of individual buildings. If you hover over a building, you find the raw data including the address and the year of construction. The details seem to show that even the shape of each building is depicted. This really impressed me since a lot of manual labor must have been applied (according to Rhiel, there is a source for this type of data). Here is the map at its most magnified:
I came across this starry patch near the Manhattan Bridge, in which the buildings show up as red asterisks. (Rhiel said the shape came from the data. I am not sure I believe the data. Anyone live near Sands Street?)
The map is useful if you are interested in questions such as "where are the new developments" (look for the deep red buildings) or "what's the average age of the buildings in a specific block" or "what's the age distribution of the buildings in a set of blocks". At the magnified level shown above, the street names are available to help readers orient themselves. The light gray color keeps the roads and the names safely in the background.
Now, zoomed to the other extreme, we get the image of the whole of Brooklyn:
I have a couple of suggestions for Rhiel. I am not familiar with the geography of Brooklyn, and this view presumes knowledge that I don't have. Unlike the magnified view, there are no text labels to help us decipher the different sections of Brooklyn. It would be nice if there were a background map to indicate the better-known areas like Williamsburg or Brooklyn Heights or Red Hook, etc.
The other concern is the apparent lack of pattern shown here. At this level, an appropriate question is which sections of Brooklyn are being redeveloped and which sections have older buildings. I see sprinkles of colors everywhere, giving the impression that everything is average. I suggested to Rhiel that aggregating the data would help bring out the pattern.
In data visualization, there is an obsession with plotting the "raw data" at its most granular level. Sometimes, this strategy backfires. It's the classic signal versus noise problem. Aggregation is a noise removal procedure. If, for example, Rhiel gives up the data for individual buildings, including those beloved building shapes, and looks at the average age of buildings within each block, or even each Census tract, I suspect that the resulting map would be more informative.
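The aggregation I have in mind is nothing fancy. A rough sketch in Python, using invented building records and hypothetical field names (block, year_built) rather than anything from Rhiel's actual data source:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical building records; block IDs and years are made up.
buildings = [
    {"block": "A", "year_built": 1900},
    {"block": "A", "year_built": 1920},
    {"block": "B", "year_built": 2005},
    {"block": "B", "year_built": 1995},
    {"block": "B", "year_built": 2012},
]


def average_year_by_block(buildings):
    """Collapse building-level data to one average construction year per block."""
    years = defaultdict(list)
    for building in buildings:
        years[building["block"]].append(building["year_built"])
    return {block: mean(values) for block, values in years.items()}


block_averages = average_year_by_block(buildings)
print(block_averages)
```

The map would then color each block by its average instead of coloring each building. Swapping the block key for a Census tract ID gives the coarser version, and the median may be a safer choice than the mean when a single new tower sits on a block of old houses.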
It turns out that the Graphics team at the New York Times just published an interactive map that illustrates exactly what I suggested to Rhiel. Since this post is getting long, please go to the next post to continue reading.
On Friday, I'm attending and speaking at the Leaders in Software and Art Conference, organized by Isabel Draves. LISA is an amazing gathering of artists interested in technology and software. For example, there is a panel on 3D printing and hardware hacking, and one on "creative coding, art and advertising". Check out videos from past years, and click here to register. My talk is at around 3:30 in a tightly packed day of activities.
Andrew Sullivan highlighted a chart showing the public attitude toward climate change globally:
Andrew summarized the above chart thus: "Sadly, America is home to far more climate skeptics than the global average."
This conclusion may be correct but the chart is less convincing than it appears.
Let's pull out the Junk Charts Trifecta Checkup. Recall that there are three sides to the triangle. The question is well-posed, and the bar chart is an adequate choice for this data. We thank the designer for not printing the entire data set in the tight space, and for starting the vertical axis at zero.
There are a few improvements one can still make to the bar chart. Start with turning it around so that the reader doesn't have to turn his/her head. Also, extending the axis to 100% helps the interpretation a little bit.
If you have keen eyes, you notice that Greece showed up at the top of the revamped chart. The bar for Korea is also a tad too short in the original chart; it should be at 85%.
To what extent is the set of countries "global"? Take a look:
It missed all of Scandinavia, most of Indochina, India, much of Africa, and all of Central America.
In the Trifecta checkup, we note that the data may not be complete for the posed question. Given this flaw, the map is perhaps a better choice to show us where the holes are.
Abhinav asks me to check out his blog post on a chart on global warming (I prefer the term climate change) featured on Wonkblog. The chart is sourced to a report by the World Meteorological Organization (link to PDF).
Hello, start the axis at zero whenever you are plotting columns. That's as fundamental as only plotting proportions on a pie chart.
There is a reason why the designer didn't like to start the axis at zero. It is this (Abhinav helpfully made all these charts):
The trouble is that for this data set (on global average temperature), the area below 13 is completely useless. It's like plotting body temperature on a scale of 0 - 100 Celsius when all feasible values fall into a tight range, maybe 35-38 Celsius. I recount a similar situation that led to a college president saying something stupid in Chapter 1 of my new book, Numbersense. (Information on the book is here.)
So we understand the desire to get rid of the irrelevant white space. This is accomplished by using a line chart. (I'd prefer to omit the data values, and rely on the axis.)
Abhinav then created various versions of this by compressing and expanding the vertical scales. I don't think there is anything wrong with the above scale. As I mentioned, the scale should focus on the range of values that are feasible.
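As a rough illustration of focusing a line chart's scale, here is a small Python sketch that pads the observed range by a fraction on each side. It uses the observed data range as a stand-in for the feasible range, which is an assumption on my part, and the temperatures, the function name and the 10% padding are all made up.

```python
def focused_axis_limits(values, pad_fraction=0.1):
    """Return (low, high) axis limits: the data range plus a little padding."""
    low, high = min(values), max(values)
    if low == high:
        # No spread at all: fall back to an arbitrary half-unit margin.
        return low - 0.5, high + 0.5
    pad = (high - low) * pad_fraction
    return low - pad, high + pad


# Hypothetical global average temperatures in Celsius.
temperatures = [13.8, 14.0, 14.2, 14.4]
low, high = focused_axis_limits(temperatures)
print(f"axis runs from {low:.2f} to {high:.2f}")
```

This suits a line chart only. A column chart encodes the value in the height of the column, so it still has to start at zero.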