Omegatron recycled some Wiki charts and we are happy to report that they are great improvements over the originals. I welcome other readers to alert us when you have done your bit of community outreach by ridding the world of chartjunk. The email address is the name of the blog at gmail for any submissions.
Why are pie charts such poor tools for communication? Think about where we can place the message, and you'll find that this chart type is far too rigid. In a pie chart, the key resource is the relative size of the sectors, followed by the number of sectors, and sometimes the size of the total pie. Other than those, there is little else useful in a pie chart.
The histograms used by Omegatron are much more flexible. There can be information encoded in the height of the bars, the width of the bars, the total area, the relative distribution of bar areas, the existence and location of peaks and troughs, and so on.
For comparing data collected in slightly different formats, pie charts are hopeless. Notice that the lowest category on the left (pink) corresponds to 8 weeks or less, which would include two and a half sectors on the right plus potentially a missing sector for 4 weeks or less. The histograms below handle this easily.
Omegatron asked for some feedback. I think the new ones are significantly better. A few minor points:
Instead of coloring the background to the chart, I'd color the bars themselves into green/yellow/orange according to the trimester
I'd put the trimester labels under the horizontal axis, close to the "week" labels
The charts obviously need to identify the country and year of the data (which I added). Omegatron pointed me to an inexplicable Wiki convention of not putting text inside charts (see here). I must disagree with this convention. Annotations on charts are some of the most useful things.
If these two charts are to be placed side by side for comparison, then we need to sort out the vertical scale. It cannot be the absolute number of abortions but some kind of relative scale in proportion to the population size, or some similar metric.
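The conversion from absolute counts to a population-adjusted scale is simple arithmetic; here is a minimal sketch. All figures below are hypothetical, purely to illustrate why two countries with very different populations become comparable once counts are expressed as rates.

```python
# Convert absolute counts to a rate so two charts can share one
# comparable vertical scale. All figures here are hypothetical.

def rate_per_thousand(count, population):
    """Events per 1,000 people in the reference population."""
    return 1000 * count / population

# Hypothetical totals for two countries of very different sizes
country_a = rate_per_thousand(count=150_000, population=12_000_000)
country_b = rate_per_thousand(count=600_000, population=60_000_000)

print(country_a)  # 12.5 per 1,000
print(country_b)  # 10.0 per 1,000
```

On the raw counts, country B dwarfs country A fourfold; on the adjusted scale, the comparison reverses.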
When it comes to space or time in graphics, old habits die hard. When we have spatial data, the default is to put it on a map. When we have a time series, the default is to plot time along the horizontal axis. Sometimes, these defaults work; other times, breaking away from the map or the straight timeline works better.
Thanks to a reader, I noticed that Google put up a "Flu Trends" website to help us track the flu season. They use two main charts to plot the data, as shown below.
On the right side is the time series, showing the severity of flu cases from month to month. There are many great things about this chart and one serious flaw. I love the fact that they did not plot time on the horizontal axis; they recognized the seasonality and created overlapping lines. They make good use of foreground and background; it's easy for us to compare year-to-year differences.
The serious flaw: no vertical scale. This was a problem with Google Trends from day one (see my post here). They still haven't fixed it. Because of this, we don't know if the peak shown was 5 cases or 5,000 cases. For Google keyword searches, one can excuse them for trying to protect commercial secrets; but I would imagine that this public health data is, well, public. Since the apparent purpose of this chart is to allow citizens to declare a flu epidemic (say, when they see the current trend depart from the historical norm), not having the scale is a huge problem.
I also disagree with shifting the months around for the Northern Hemisphere so that the peaks of the graphs are aligned towards the middle. It is better for the peaks to appear on the left and let the order of the months conform to our expectation. (The "peak" would be split on the sides and the chart would look like a valley, which presumably is why they did it this way.)
The charts on the left side plot the spatial data, not surprisingly on maps. Sadly, the standard exhibited in the time-series charts is nowhere to be found on these maps.
There are few situations in which a grouped bar or column chart is the best choice. In such charts, readers frequently have to examine the tips of the bars, and yet the bodies of the bars obstruct comparisons. Placing data labels instead of an axis is a nice touch; lining the labels up would be even better. The junkart version below uses a dot plot, which allows comparisons within each payment type, and between payment types, to reveal themselves.
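The key data step behind a dot plot is putting each category on its own row and sorting rows by value, so the ranking can be scanned directly; grouped bars cannot offer that without reshuffling whole clusters. A minimal sketch, using made-up payment-share numbers (rendered as a text table rather than a drawn chart):

```python
# Dot-plot data prep: one row per category, one value per series,
# rows sorted by the later value. All numbers are hypothetical.

payment_data = {
    "Credit card": {"now": 34, "projected": 31},
    "Debit card":  {"now": 27, "projected": 33},
    "Alternative": {"now": 5,  "projected": 11},
}

# Sort rows so the eye reads the ranking top to bottom
rows = sorted(payment_data.items(),
              key=lambda kv: kv[1]["projected"], reverse=True)

for name, vals in rows:
    print(f"{name:12s} {vals['now']:3d} -> {vals['projected']:3d}")
```

Each printed row corresponds to one horizontal line of the dot plot, with two dots read off a shared axis.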
The second chart is also unnecessarily complex. The use of double axes announces trouble, as does the superposition of lines and columns. The data-to-ink ratio of the chart is low because the data in the columns add up to the numbers in the line. Crucially, it is always important to clearly distinguish projected values from actual values. Here is a junkart version. The first revision focuses on dollar volume, showing that despite faster growth, alternative payments are merely catching up to traditional payments. The higher growth rate is applied to a much smaller base!
The second revision focuses on growth rates. Notice that all values here are projections.
In the course of business and governing, a lot of charts are generated. An anonymous tipster pointed us to a set created by the "Communities and Local Government" division in the UK government. Judging from the content, this division has responsibility for economic development in local neighborhoods.
Below are a pair of exhibits. Truly they are trying too hard! What we see is a hybrid scatter-bubble chart. Between the jargon, the acronyms (LAD, LSOA), the boxed text, the multi-color circles, the colored axis labels and the lack of a title, the reader is plunged into a state of confusion.
The chart can be unraveled. Each district was evaluated based on two measures of "gaps in worklessness". The vertical axis compares each district to the national average; positive numbers indicate an above-average district relative to the nation. The horizontal axis compares the most deprived 10% of neighborhoods within each district to the local average; positive numbers indicate that the worst neighborhoods are improving.
Thus, the policy goal would be to move all districts into the upper right quadrant. The multi-color bubbles were designed to show us the state of the nation. On the left chart, 41% of the districts (or population?) reside in the improving districts while 19% live in deteriorating areas.
The following strategies can help improve readability:
use English on the axis
relegate technical definitions to the legend
add succinct title to tell the story
use color on the data rather than on axis or data labels
use color to draw attention to the upper right quadrant
I share reader Bernard L.'s enthusiasm for this very imaginative chart, courtesy of the graphics people at NYT. The chart captures the ebb and flow of weekly movie receipts over the last two decades. The details that particularly interest me include:
The addition of area colors (on top of lines) serves to highlight box office successes; this really helps readers sort out the massive amount of data
Nicely spaced text (and dots) does not interfere with our reading of the chart
The hiding of text for less important films, plus taking advantage of interactivity to show their titles if the reader mouses over the respective areas
All of the above indicate a keen sense of foreground versus background. In addition, the authors had the good sense to use inflation-adjusted box office sales; I'm tired of the movie industry proclaiming higher sales each year when ticket prices are rising and the population is growing.
This is another chart where more data do not easily translate into better communication (see my guest post at Flowing Data). While I like the playful nature of the interactive chart, it is left to the reader to discover the information buried in the data, such as the assertion in the header that Oscar-winning films typically take time to attain box-office success while many blockbusters never win Oscars.
In this presentation, it is challenging to compare the total receipts of one film versus another (this requires comparing oddly shaped, partially obscured areas). It is also hard to compare across years since the data is spread out over a lot of space.
There may really be two types of graphics: one like the example here, which is a dictionary of sorts, designed for exploration; and the other kind, where the designer has selected a subset of the data to make a specific point.
Reference: "The ebb and flow of movies", New York Times, Feb 23, 2008.
It's pretty hard to decree hard-and-fast rules for graphical design; every rule seems to admit its exception. This reinforces Tufte's contribution as he has successfully organized the rules in his collection of books.
Dustin J sent in this chart from the Economist. At first glance, it looks ugly and overly complex.
Stephen Few says not to use stacked bar charts because you cannot compare individual values very easily, and as a rule I avoid stacked bars with more than six or seven divisions. What do you think of this stacked bar? I think it is quite effective in telling the story.
On this blog, I have also re-done some stacked bar charts, but this one is truly an exception to the rule. The reason why this one works is that it's not about the individual components; it's showing that the US consumes more than all those countries combined.
If only it had a proper caption! The Economist is uncharacteristically detached here: "Petrol consumption per day", "Litres bn, 2003". How about "Goliath v. Davids"? "US v. the World"? "Dream Team USA"?
It'd help if they toned down the colors; also, simply annotating the total litres for the US and the total for the other countries would have made the point clearer without gridlines. But these are minor glitches in an otherwise effective chart.
An anonymous reader dropped a comment pointing us to Martin Wattenberg's gallery at Business Week. Martin's work falls into the category of information visualization, which typically concerns cramming as much high-dimensional data as possible onto 2D or 3D displays, augmented heavily by colors, shapes, interactivity, superpositioning and other tricks. Often pleasing to the eye, these graphics usually take time to warm up to. Sites like Infosthetics and Visual Complexity cover them well.
Derek C. points us to this effort by a science journalist to use graphs to help "clarify the concept of climate change". The graph on the left shows that actual greenhouse gas emissions have exceeded the level predicted by the most pessimistic climate models. The 3D bar chart on the right examines which countries have increased emissions the most since 1990.
While the bar chart contains many of Tufte's "ducks" (not sorted by percent change, 3D, color, gridlines, sufficiency, etc.), it's the left chart that can be made more powerful.
The casual observer does not need to know which model led to which trajectory of predictions; the graph is vastly simplified, and the message much clearer in the junkart version. (I only included the CDIAC data because I didn't locate the EIA numbers.)
The general point here is recognizing what is foreground, and what is background. Aside from gridlines, data labels, axis labels and so on, some of the data usually constitute background material, often as in this case being used to establish comparability.
One message I got out of this chart is that these climate models have done a good job! (Now, I have no idea if part of the curve included the training period. It is curious that the predictions were very narrowly contained in the early 1990s.)
In the comments of the last post on on-line weather forecasts, Hadley raised the evergreen statistical question of mean vs median. In this context, median error is unaffected by particular days in which the forecaster makes extreme errors while mean error takes into account the magnitude of every forecasting error in the sample.
Which one to use depends on the situation. Brandon, who did the original analysis, was motivated by planning a trip to an unfamiliar location. In this case, he might be better served by a lower mean error, which would imply few extremely bad forecasts.
On the other hand, if I am interested in my local weather, then I'd likely be less concerned about a few extremely bad forecasts, and more concerned that the forecast is on the money on most days. Then perhaps the median error would come into play.
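The distinction is easy to see with a toy series of daily forecast errors containing one extreme miss (the numbers below are hypothetical, not from the weather data):

```python
from statistics import mean, median

# Hypothetical daily forecast errors in degrees; one extreme miss at the end.
errors = [1, 2, 1, 0, 2, 1, 15]

print(mean(errors))    # about 3.14: pulled up by the single bad day
print(median(errors))  # 1: unaffected by the outlier
```

One bad forecast triples the mean error while leaving the median untouched, which is exactly why the trip planner and the local reader may prefer different summaries.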
It turns out it doesn't much matter for our weather forecast data. In this new chart, I superimposed the mean error data (in black). The scatter of points was exactly as it was for median error (in red). (MSN had a particularly bad forecast for a low temperature one day, which pulled its location to the left.)
This shows further that the difference between CNN, Intellicast and The Weather Channel is negligible.