As hinted in the previous post, there are rare situations in which pie charts are acceptable; typically, these charts must show proportions that add up to 100%. If column charts (or line charts) are used instead, readers who aren't careful may assume incorrectly that the columns add up to the whole.
Pie charts show distributions. How should one state the key message of the following pie chart?
I. Type A is the majority.
II. The most frequent type is Type A.
III. Type A is a minority.
IV. Every type other than A, taken together, forms the majority.
I would pick statement II, followed by statement I. Statement I is the only false statement out of the four if one uses a strict definition of "majority" (more than half). If one goes by the spirit rather than the letter of the law, statement I does pick up the key message, albeit imprecisely. Statement III is true but particularly misleading in the context of this pie chart, for every type is a minority type if we define "minority" as less than half. Statement IV is a tortuous way to define a "majority" where there is none.
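The difference between "most frequent" and "majority" can be checked mechanically. Here is a minimal sketch; the four shares are made up for illustration and are not the data behind the pie chart.

```python
# Hypothetical shares for illustration only; they are not the chart's data.
shares = {"A": 0.46, "B": 0.30, "C": 0.15, "D": 0.09}

def describe(shares):
    """Return the largest type, whether it is a strict majority, and all minorities."""
    top = max(shares, key=shares.get)
    return {
        "largest": top,                       # statement II
        "is_majority": shares[top] > 0.5,     # statement I, strict definition
        "minorities": sorted(t for t, s in shares.items() if s < 0.5),  # statement III
    }

result = describe(shares)
print(result)
```

With shares like these, Type A is the most frequent type without being a majority, and every type, A included, counts as a minority.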
Neither III nor IV points to a key feature of the data. It seems ridiculous to even include them. Let's reveal the underlying data.
Last week, a story coursed through the mainstream media, relating to the above projections published by the Census Bureau. (Projections were created for 2050 but mention was made of the fact that the largest racial group would account for less than half the population by 2042.) Here were some of the headlines:
"Minorities fixed to become new majority" (Daily Vidette, Illinois State University, 8/20/2008) -- IV
"US set for dramatic change as white America becomes minority by 2042" (Guardian, 8/15/2008) -- III
"...minorities collectively will make up the majority of people in America by 2042..." (Detroit Free Press, 8/21/2008) -- IV
As I said, statement III is, strictly speaking, true, but by 2042 every race is projected to be a minority. Statement IV is just odd: of course, if one adds up enough "minority" types, one will eventually attain a majority.
Not all is lost, however. The following headlines painted a more vivid image:
Elsewhere, a Boston Globe column makes an important observation: that Hispanic whites should probably be grouped with whites rather than Hispanics. Technically, the columnist argued that Hispanic is not a race. From his point of view, the pie chart looks like this:
The New York Times continued to push the envelope by printing super-complicated data graphics (while the Economist regrettably seemed to have picked the USA Today route... more on that in a future post). The following graphic was used to illustrate the relationship between CEO compensation and their company's stock performance.
The two dotplot lookalikes depicted the percent change in CEO pay and the change in the company's stock price, in both cases from 2006 to 2007. The size of the dots indicates the relative value of the CEO's pay. The gray dots depict "similarly sized" companies for comparability.
In this post, I will focus on the comparison between change in pay and change in stock price for a given CEO. In particular, the calibration of the axis/scale is problematic. The scale is automatically determined by an algorithm; as one switches from one CEO to another, the graphs take on different ranges, use different axis labels, and the zero-percent points shift.
This means that the two charts have different scales. In this example, each tick mark advances 6% in the top chart but 12% in the bottom chart.
Since the zero points do not line up, the distance between the zero and the orange dot loses meaning: the 2.5x longer distance in the top chart actually represented roughly the same percentage change as in the bottom chart (31% versus 28%).
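The effect is easy to reproduce. In the sketch below, the axis ranges are hypothetical, chosen only so that the two charts have different ranges and zero points; only the 31% and 28% changes come from the graphic.

```python
def position(value, axis_min, axis_max, width=100.0):
    """Map a data value to a horizontal position on an axis of the given width."""
    return (value - axis_min) / (axis_max - axis_min) * width

# Hypothetical axis ranges; each chart picks its own, as in the graphic.
top_distance = position(31, -12, 48) - position(0, -12, 48)
bottom_distance = position(28, 0, 120) - position(0, 0, 120)

# On one shared (also hypothetical) range, the two dots land close together.
shared_gap = abs(position(31, -12, 120) - position(28, -12, 120))

print(round(top_distance, 1), round(bottom_distance, 1), round(shared_gap, 1))
```

Two nearly equal percentage changes sit at very different distances from zero when each chart sets its own range, and almost on top of each other when the range is shared.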
In order to respect the grid-lines (white lines), the tick marks fall onto stray percentages (24%, 36%, 48%, etc.). That's unfortunate.
What's the culprit? This chart is "bound to extremes". In other words, the range of the depicted data is used to determine the plot area. The bottom chart had zero on the left edge because all the stocks depicted rose between 2006 and 2007. It is often better to use domain knowledge to determine the plot area. Extreme values should be omitted if they don't add to the message. Oftentimes, by leaving extreme values in the picture, we squash the rest of the data.
This is also why programs like Excel do a poor job picking a scale.
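Here is a rough sketch of how one extreme value squashes everything else when the range is taken from the data. The percent changes and the "domain knowledge" limits are both assumptions made up for this example.

```python
# Hypothetical percent changes; the 400 stands in for one extreme value.
changes = [5, 12, 18, 31, 400]

def spread_fraction(values, axis_min, axis_max):
    """Fraction of the axis length spanned by the values that fall inside the axis."""
    inside = [v for v in values if axis_min <= v <= axis_max]
    return (max(inside) - min(inside)) / (axis_max - axis_min)

# "Bound to extremes": the limits come straight from the data.
auto_limits = (min(changes), max(changes))
# Limits picked by the designer (an assumption here); the extreme value is left off.
fixed_limits = (-50, 100)

typical = changes[:-1]
auto_spread = spread_fraction(typical, *auto_limits)
fixed_spread = spread_fraction(typical, *fixed_limits)
print(f"{auto_spread:.1%} of the axis versus {fixed_spread:.1%}")
```

Under the data-driven limits, the four typical values are crammed into a sliver of the axis and zero does not even appear on it; the fixed limits give them more room and keep zero in view.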
As an aside, the use of bubbles is almost always troubling. Bubbles do not have a scale so the only information we get is relative size. However, we can't estimate areas properly so we get the relative size wrong. Sometimes, even the chart designer may get stumped. In the chart of Steve Jobs, you would think his bubble (total compensation $1) would be dwarfed by all the other bubbles, as in the WSJ chart we showed the other day. Not so.
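To see why a $1 bubble should all but vanish, here is a sketch of area-proportional sizing. The $10 million comparison figure and the 50-pixel maximum radius are assumptions for illustration, not values read off the chart.

```python
import math

def radius_for(value, max_value, max_radius=50.0):
    """Radius that makes a bubble's area proportional to its value."""
    return max_radius * math.sqrt(value / max_value)

# Hypothetical compensation in dollars; only the $1 comes from the post.
big, small = 10_000_000, 1
r_big = radius_for(big, big)
r_small = radius_for(small, big)
area_ratio = (r_small / r_big) ** 2

print(r_big, r_small, area_ratio)
```

Under these assumptions the $1 bubble would have a radius of a small fraction of a pixel, so any bubble large enough to see is no longer drawn to scale.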
There are few situations in which a grouped bar or column chart is the best choice. In such charts, readers frequently have to examine the tips of the bars and yet the bodies of the bars obstruct comparisons. Placing data labels instead of an axis is a nice touch; lining the labels up would be even better. The junkart version below uses a dot plot which allows for comparisons within each payment type, and comparisons between payment types, to reveal themselves.
The second chart is also unnecessarily complex. The use of double axes announces trouble; so too does the superposition of lines and columns. The data-to-ink ratio of the chart is low because the data in the columns adds up to the numbers in the line. Crucially, it is always important to clearly point out projected values (versus actual values). Here is a junkart version. The first revision focuses on dollar volume, showing that despite faster growth, alternative payments are merely catching up to traditional payment growth. The higher growth rate is applied to a much smaller base!
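The arithmetic behind that last point, as a small sketch. The volumes and growth rates are invented for illustration and are not the figures in the chart.

```python
# Hypothetical volumes (in billions of dollars) and annual growth rates.
traditional_base, traditional_growth = 1000.0, 0.05
alternative_base, alternative_growth = 100.0, 0.20

# Dollars added in one year: base times growth rate.
traditional_added = traditional_base * traditional_growth
alternative_added = alternative_base * alternative_growth

print(traditional_added, alternative_added)
```

Even at four times the growth rate, the smaller category adds fewer dollars in the year than the larger one does.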
The second revision focuses on growth rates. Notice that all values here are projections.
This chart from the NYT was intended to show how the EPA has moved the bar on vehicle mileage ratings: 2008 estimates were lower than 2007 estimates across the board, regardless of manufacturer, model and city/highway.
The chart was built from one basic component, repeated for each model. I like the discreet gridlines (the white ticks) which enable readers to count off the mileage ratings.
The data is rich: ratings were given along three dimensions (model, year of estimate and city/highway). Readers would benefit from stronger guidance on where to look for the most pertinent information. As the chart stands, it is merely a container for the data. It fails our self-sufficiency test: all the data were printed on the chart, and the bars add little.
In the junkart version, I use knowledge of the data to structure the chart. First, noting that sedans, hybrids and trucks/SUVs/minivans have different levels of mileage ratings, I clustered the models into three groups. Second, the city and highway ratings were separated into two columns as I consider the between-model comparisons more important than city-highway comparisons. The chart is a dot plot, with a vertical tick for 2007 estimates and a dot for 2008 estimates. It's easy to see that all dots sit to the left of the vertical ticks.
More subtly, we can also see that the hybrids appeared to have been penalized more. Or perhaps, the higher the rating, the larger the downward adjustment...
Source: "Mileage Ratings Are Still Estimates, Though Closer to Reality", New York Times, Sept 16 2007.
A couple of you noticed this table of bubbles in the Times, and asked what I think of it. Dustin J suggested that this could be considered a decent application of bubble charts. I agree, with some reservations.
The data set is the best thing about this chart. The riches that lie beneath! Many questions can be addressed, including:
Which Presidential candidates are getting the most face time?
Are candidates seen equally often across the stations?
Are there differences between network and cable stations in terms of total face time? In terms of individual face time?
Are there Democratic/Republican leanings by station? by type of station?
The intrepid can even build a regression out of it.
The bubble chart contains answers to all those questions but nothing jumps out. Okay, it's easy to see the station that gives each candidate the most face time. Anything else takes moderate to considerable effort. Here's the junkart version.
The list of things done to the data is long:
Candidates are grouped together by party
Candidates within each party are arranged in order of decreasing maximum face time
Stations are arranged by increasing total face time; this order happens to retain the network vs cable divide
A heat map construct is used instead of bubbles: the legend is missing but there are four shades for each color: darkest = top 10%; medium = next 40% (50th to 90th percentile); light = bottom 50% excepting zeroes; white = no face time. In raw numbers, 90th percentile = 81 minutes, 50th percentile = 19 minutes.
The only data shown are the totals by candidate and totals by station.
On the right margin are little bar charts that show the distribution of network/cable for each candidate.
On the bottom margin are little column charts showing the distribution of party affiliation by station.
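The shading rule can be written down in a few lines. This is a sketch of one reading of the legend: it treats "medium" as the band from the 50th to the 90th percentile, and it assigns values exactly on a cutoff to the darker shade, both of which are assumptions. The 19- and 81-minute cutoffs are the ones quoted above; the sample face times are made up.

```python
def shade(minutes, p50=19, p90=81):
    """Assign a heat-map shade from face time in minutes."""
    if minutes == 0:
        return "white"      # no face time
    if minutes >= p90:
        return "darkest"    # top 10%
    if minutes >= p50:
        return "medium"     # 50th to 90th percentile
    return "light"          # bottom half, excluding zeroes

# Hypothetical face times in minutes.
shades = [shade(m) for m in (0, 5, 19, 80, 81, 200)]
print(shades)
```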
A few observations follow:
Cable stations gave much more face time to the candidates in general. Fox, no surprise, gives Republicans 85% of its time while all the others were roughly equal.
The more mainstream the candidate, the more balanced was the time spent on networks versus cable. John McCain (R), Hillary Clinton (D) and John Edwards (D) had the highest proportion of network time.
More time is not necessarily good: McCain was the clear winner but his campaign is struggling.
Source: "Tracking Face Time", New York Times, August 1, 2007.
One of the many gratifications of blogging is to connect with others who have similar interests; so it has been fantastic to receive user submissions (though admittedly I don't check my inbox frequently enough). The thoughtfulness of these nominations continues to impress me.
Evan sent in 254 charts he created after looking at the post on baby names. An example is shown on the right.
He is particularly interested in the question of names that are given to both males and females.
For example, the bottom chart shows that Jordan is primarily a male name, and saw a period of growth followed by decline, although the decline has been more severe on the male side than the female side.
It's a nice touch to label the most recent year. I'd also label the values for the most recent year on the axes.
Evan also offers the following solution to the scaling problem we identified in the original WSJ chart:
My solution was just to put two charts on each chart. One at a fixed scale for every chart to give a sense of size and one at a variable scale to better show the shape of the plot.
In other words, for less popular names, the top chart would look much more compressed.
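A sketch of the two-scale idea in terms of axis limits. The names and yearly counts are hypothetical and are not taken from Evan's charts; the helper names are mine.

```python
# Hypothetical yearly counts for two names.
counts = {"Jordan": [120, 900, 4000, 2500], "Avery": [10, 40, 90, 150]}

# Fixed scale: one upper limit shared by every name, to convey size.
fixed_top = max(max(series) for series in counts.values())

def panel_limits(name):
    """Return the (fixed, variable) y-axis upper limits for one name's pair of charts."""
    return fixed_top, max(counts[name])

def fixed_fill(name):
    """Fraction of the fixed-scale chart's height that the name's peak reaches."""
    fixed, variable = panel_limits(name)
    return variable / fixed

print(panel_limits("Avery"), fixed_fill("Avery"))
```

The less popular name peaks at a few percent of the fixed-scale chart's height, which is the compression described above, while its variable-scale chart still shows the shape.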
There are many more charts to sift through on his site. Evan welcomes suggestions.
Aleks pointed to an interesting Business Week chart used to explain what people in different age groups are doing on-line. This is a pretty chart that does an admirable job with a difficult data set.
The key to this chart, unfortunately missing, is that the percentages must be read as vertical columns to make sense. So the top left square says 34% of "Young Teens" who answered the survey said they create web pages on-line. In addition, the total of each column can be much more than 100% because multiple responses were allowed.
Realizing the above, we should interpret the bottom (grey) row as saying: "Older boomers" and "seniors" are more likely to be "Inactives" than younger people. A tempting interpretation is: "Inactives" are more likely to be "seniors" and "older boomers". But this is wrong because the chart hides the age distribution. While 70% of "Seniors" are inactive, "Seniors" may represent a small proportion of the population, and thus they may not account for a large proportion of "Inactives". This is the difference between prevalence and incidence rate. (Another way to grasp this is to add the percentages across a row and try and fail to understand what the row sum could mean.)
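A worked example makes the trap concrete. Only the 70% inactive rate for "Seniors" comes from the chart; the three collapsed age groups, the other inactive rates and the age distribution are all invented for illustration.

```python
# Share of each age group that is "inactive" (only the seniors' 70% is from the chart).
inactive_rate = {"young": 0.20, "middle": 0.40, "seniors": 0.70}
# Hypothetical age distribution, which the chart does not show.
population_share = {"young": 0.50, "middle": 0.40, "seniors": 0.10}

# Overall share of the population that is inactive.
inactive_total = sum(inactive_rate[g] * population_share[g] for g in inactive_rate)

# Flip the conditioning: among inactives, what share belongs to each age group?
share_of_inactives = {
    g: inactive_rate[g] * population_share[g] / inactive_total for g in inactive_rate
}
print({g: round(s, 3) for g, s in share_of_inactives.items()})
```

Under this made-up age distribution, seniors are by far the most likely to be inactive and yet make up only about a fifth of all inactives.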
The construct of the square grids is less damaging than it seems. In effect, the data has been rescaled by dividing by 10. The reader is then forced to apply "rounding". If you are someone who sees $19.95 as $19, then you'd round down the partial rows. If you see $19.95 as $20, you'd round up the partial rows. So the designer has pushed you to think in terms of whole numbers between 0 and 10, in other words, in units of 10%, rather than units of 1% or, horror of horrors, 0.1% or at some other unrealistic precision.
Here's another example where the profile chart shines. Because the percentages don't sum up to 100%, alternatives like stacked bar charts and "Marimekkos"/mosaic charts don't work. (Prior discussion of this issue here.)
This version gives a column view of the data, the lines linking percentages of each age group performing on-line activities. The profiles nicely cluster into three groups: the younger people are more likely to say they are "joiners", "spectators" or "creators" but less likely to be "inactives". We also see that the likelihood of being "Collectors" has little to do with age.
An anonymous reader dropped a comment pointing us to Martin Wattenberg's gallery at Business Week. Martin's work falls into the category of information visualization, which typically concerns cramming as much high-dimensional data as possible onto 2D or 3D displays, augmented heavily by colors, shapes, interactivity, superpositioning and other tricks. Often pleasing to the eye, these graphics usually take time to warm up to. Sites like Infosthetics and Visual Complexity cover them well.