Donuts: Still inedible

Concerning my post on the Economist's use of donut charts, Phillip on Blogcritics raised two issues worth further study.

First, are pie and donut charts "natural" for representing percentages?  In my opinion, if only one set of percentages is involved, then either a table of numbers or a "decile chart" works much better.

Redosingledonut_1

The "decile chart" would look better if I had used 10 human figures, one for each 10% of the population.

This "decile chart" addresses the visual estimation issue Phillip brought up.  Because each dot / human figure represents 10%, even if the percentages are not annotated, the reader can gauge them visually.  Not so for pie slices: no one will be able to tell a 15% slice from a 10% slice.
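As a rough text mock-up of the decile idea (not the chart shown above), each percentage can be rounded to the nearest 10% and drawn as ten symbols. The function name, the symbols and the row labels are assumptions made for illustration; 67% and 58% are the two shares quoted in this post.

```python
def decile_row(label, percent, filled="●", empty="○"):
    """Render a percentage as ten symbols, each standing for 10% of the total."""
    # Round to the nearest whole symbol (for example, 67% becomes 7 symbols).
    n_filled = int(percent / 10 + 0.5)
    # Guard against inputs outside the 0-100 range.
    n_filled = max(0, min(10, n_filled))
    return f"{label:<6}{filled * n_filled}{empty * (10 - n_filled)}"


# Labels are assumed; the two percentages are the ones discussed in the text.
print(decile_row("2004", 67))
print(decile_row("2030", 58))
```

Because every symbol is worth exactly 10%, the reader can count rather than estimate, which is the whole point of the design.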

Besides, the point of the original graphic was to compare percentages.  The message is that the white population is expected to decline while the Hispanic, Black and Asian populations would increase.  When two pies (or donuts) are used, the reader is tasked with differentiating a 67% slice from a 58% slice situated an inch apart.  That, I submit, is a tall order.  By contrast, the growth rate is explicitly coded into the gradient of the lines in my junkart: the steeper the line, the higher the growth (or decay).
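To make the "gradient encodes growth" point concrete, here is a minimal sketch that computes the slope of a line segment on such a chart. The 67% and 58% shares and the years 2004 and 2030 come from this post; pairing them as one segment's start and end values is my reading of the text, and the second segment's numbers are purely hypothetical.

```python
def slope(start_value, end_value, start_year, end_year):
    """Change in share (percentage points) per year between two time points."""
    return (end_value - start_value) / (end_year - start_year)


# Assumed pairing: a segment whose share falls from 67% to 58%.
declining = slope(67, 58, 2004, 2030)

# Hypothetical segment, included only to contrast a rising line.
rising = slope(14, 20, 2004, 2030)

print(f"declining segment: {declining:+.2f} points per year")
print(f"rising segment:    {rising:+.2f} points per year")
```

The sign of the slope gives the direction of change and its magnitude gives the pace, so a reader comparing steepness is comparing growth rates directly.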

Second, what if growth rates are chaotic and lines criss-cross each other?  This presents no problem at all:
Redocrisscrosslines
The first line chart shows two segments increasing at the same rate and one segment declining fast.  The second chart shows two segments dropping at different rates and one segment skyrocketing.  Because the growth rate is explicitly plotted, the reader has no problem picking it up.

The astute reader will note this chart looks like the marvellous Bumps chart.

Phillip also noticed another feature of the Economist chart that escaped me: the two donuts were sized proportional to the total populations in 2004 and 2030 respectively.  Ouch!  Now, the area of a slice depends on both angle and radius, making it nigh impossible to compare them.
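A small sketch shows why this hurts. It assumes the donut areas scale with total population, and the radii and the 25% population increase are hypothetical; only the 67% and 58% shares come from this post.

```python
import math


def donut_slice_area(share, outer_radius, inner_radius):
    """Area of a donut slice that covers `share` (0 to 1) of the ring."""
    ring_area = math.pi * (outer_radius ** 2 - inner_radius ** 2)
    return share * ring_area


# Hypothetical radii for the first donut.
small_slice = donut_slice_area(0.67, outer_radius=1.0, inner_radius=0.5)

# Assume the second total is 25% larger, so both radii grow by sqrt(1.25)
# to make the ring area proportional to the total.
scale = math.sqrt(1.25)
large_slice = donut_slice_area(0.58, outer_radius=1.0 * scale, inner_radius=0.5 * scale)

print(f"67% slice of the smaller donut: {small_slice:.3f}")
print(f"58% slice of the larger donut:  {large_slice:.3f}")
```

Under these assumptions the 58% slice covers more ink than the 67% slice, so the eye is pulled in the opposite direction from the message.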

Thanks, Phillip, for pushing my thinking on this.


Big Revenues or Big Profits?

Bigcompanies072805_1

The chart on the right displays revenues and profits of the 10 largest companies in the world in 2004 (as ranked by revenues).  The graphic is yet another container of data: it elevates revenues, treating profits as an afterthought.  This table is clearer:

Redobigco0_1

The next two charts attempt to put revenues and profits on an equal footing.  I include the left chart to illustrate the folly of typical bar charts: the height of the bars serves no purpose except to distort our judgement of the lengths.  The right chart is preferred.

Redobigco1_1

Finally, the following statistical graphic brings out insights from the data. 

Redobigco2_1

  • The 10 companies fall into three groups: Giants (Exxon Mobil, Royal Dutch/Shell, BP, Walmart), Big and Mean (GE, Total, Toyota) and Big and Fat (Ford, DaimlerChrysler, GM).
  • Giants have revenues over $270 billion; among them, the oil companies are more profitable than Walmart, even though Walmart has the highest revenues.
  • Big and Fat consists of auto companies with big revenues but low profit margins.
  • The ranking and relative size of revenues can be read off the horizontal axis.  Similarly for profits on the vertical axis.  So both bivariate and univariate distributions are available on one chart.

This chart, a favorite of statisticians, is not without problems.  It is difficult to place the data labels without hurting readability.  I placed each label to the northeast of its dot, where it does not interfere with a reader tracing dots to the axes.  I omitted the country labels so as not to clutter the space.  I used color to further separate the dots from the labels.  I also gave up printing every data value for the sake of readability, assuming that the reader's interest in magnitude is not absolute.

Reference: "The World's Biggest Companies", Economist, July 28 2005.


The value of sampling

In a previous post, I talked about "wasted motion" in stock return charts.  This wastage is another form of embarrassment of riches: a graphic should not be merely a container of data; a good graphics designer removes redundant data that do not amplify her message.

Samplingstocks_1

Sampling is one useful technique.  On the right I created three stock return charts displaying the same data.  The top one plots daily stock returns, following the NYT example.  The bottom two charts plot monthly and quarterly stock returns for the same period.  In essence, they show smaller portions (samples) of the data from the first chart.

The NYT article was concerned with growth stocks (Intel, Cisco) versus value stocks (GE, Pfizer).  The data show that value stocks outperformed growth stocks during the period up to July 17 2005, but both groups of stocks underperformed the S&P market index.  The key features, then, should have been the end-of-period returns, followed by general trends.

The quarterly (or monthly) chart brings this message out much more clearly because it contains less clutter.  The end-of-period returns and the trends show up on all three charts.
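The sampling step itself is simple. Below is a minimal sketch, assuming the returns are already cumulative and evenly spaced; the function name and the numbers are hypothetical and are not the data behind the charts. It keeps every n-th observation and always retains the final point, since the end-of-period return is the key feature.

```python
def sample_returns(returns, step):
    """Keep every `step`-th observation, always including the last one."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    if not returns:
        return []
    sampled = returns[::step]
    # The slice may stop short of the final observation; add it back.
    if (len(returns) - 1) % step != 0:
        sampled.append(returns[-1])
    return sampled


# Hypothetical cumulative returns in percent, one value per period.
daily = [0.0, 1.2, -0.5, 0.8, 2.1, 1.5, 3.0, 2.2, 4.1, 3.6]
coarse = sample_returns(daily, step=4)
print(coarse)
```

The sampled series starts and ends at the same values as the full series, so the headline comparison survives while the zig-zags in between are dropped.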

The astute reader will notice that I moved the axis to the right side of the graph because the end-of-period return is the key feature (the beginning-of-period return is zero for everything).  Also, color-coding the two groups of stocks and the S&P index helps bring out the message.  The start-of-period date should be annotated (which I haven't done here); in a later post, we will explore its importance.  A time axis is optional; if present, it should carry only a small number of ticks, say half-yearly.


Another Heatwave Chart

Here's another take on the heatwave chart, from AP.  How does AP compare to NYT?

Ap_heatwave_3

The Good

  • AP uses more relevant metrics, namely "extreme maximum temperature" and "above average temperature from normal".  The NYT chart picks out any city that reached the 100 F threshold, even for just one minute in a given week.

  • Both use the small multiples design, but AP's heatwave-metric dimension makes more sense than NYT's week-of-the-month dimension.

  • Instead of scatters of dots, AP gives us density maps, telling us more with less clutter.  We immediately see which regions were most affected.

  • The titles and annotations indicate that this chart has a message while the NYT chart could just be a container of data.

The Bad

  • One imagines that two graphic designers produced a map each; their manager preferred the bottom one but decided to give the other map a supporting role, hiding it in the corner.  The staff is merry but the reader is dizzy.  The metric and the scale of the map shift simultaneously.  A small multiples design works only if one and only one dimension changes from chart to chart.

  • One designer preferred an explicit scale legend; the other annotated patches of color in lieu of a legend.  Different scales on one graphic confuse the reader, particularly when the two maps use the same colors to represent different metrics!

  • I couldn't bring myself to comment on the convoluted, ungrammatical language of "extreme maximum temperature" and "above average temperature from normal".

  • Where is the time dimension?  Without time, "maximum temperature" is meaningless.

  • The text should always align with the picture: show us where Las Vegas and Death Valley are.

  • As with the NYT map, state boundaries are superfluous to the message.

Here, the manager had the right instinct.  The large map showing regions in the country that were experiencing abnormally high temperatures presents a strong, clear message.  The small map is excessive.


More on heat wave charts

How rich is the NYT heat wave map?  Let's do a quick calculation.

There are roughly 200 cities that showed up on at least one of the three maps.  For each city, they must have the maximum daily temperatures over about 20 days in order to figure out if it should appear on each map.  That would be 4,000 data points without counting other cities that did not reach 100 F during July.

Separately, drawing in state boundaries requires many data points.

Now compare with the donut chart which plotted 12 data points and you see what I mean by "data-rich".  [Correction: In fact, only 10 data points because the total population is just the sum of the individual segments.]
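The back-of-envelope comparison above, written out with the rough figures from this post (about 200 cities, about 20 days, and 10 data points for the donut chart after the correction):

```python
# Rough figures taken from the text above; both are approximations.
cities = 200
days = 20
heat_map_points = cities * days

# The donut chart after the correction (totals are sums of the segments).
donut_points = 10

ratio = heat_map_points // donut_points
print(f"heat wave map: about {heat_map_points:,} data points")
print(f"donut chart:   {donut_points} data points")
print(f"ratio:         about {ratio} to 1")
```

This leaves out the state boundaries and the cities that never reached 100 F, so the true gap is larger still.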


Embarrassment of Riches

The computing age has revealed our embarrassing ineptness at handling data richness.  Examples abound of data-rich but information-poor charts.  A sure symptom is upper-class guilt: I've got so much data ($$$) I know not what to do.

Now, ask yourself what information is conveyed by the following data-rich chart, then compare your thoughts with my comments.
(The graphic accompanied an article titled "Ferocious Heat Maintains Grip Across the West")  (temperatures in F)

23heat_5

  • This is an example of what Tufte calls "small multiples", a series of charts with a basic design replicated over changes on some dimension (here, week of the month). For "small multiples" to succeed, we must feel at home with the basic design.  Can we make sense of the scatter of dots (cities)?

  • You certainly noted that most points sit on the left (west).  If that were the whole point, why show city-level data?  Did you wonder how many cities reached over 100 F?  This information could have been denoted on the map.

  • The state boundaries led us to wonder which states were most affected.  Unfortunately, they presume we know the U.S. map well, which we don't.

  • Next, how should we relate the first chart to the second, and to the third?  The dots appeared to have shifted north-westerly, and then towards the middle.  But the article focused on how the heat wave affected the West.

  • Perhaps the point was not physical movement but the number of cities/states affected.  Then show us the counts because the naked eye cannot judge the relative size of scatters.

  • Further, why show the three weeks of July, with "week" starting on Sunday and ending on Saturday?  This highlights only those changes occurring between Saturday and Sunday.  In the article, a meteorologist identified July 12 as the start of the heat wave.  Two charts showing before and after July 12 would have confirmed the existence/effect of the heat wave.

  • Instead of comparing before/after July 12, we can also compare July 12-21, 2005 with July 12-21, 2004.

When dealing with rich data, be picky.  Know your message before plotting.  Add details only if the reader can make sense of them.

In this example, we still aren't sure what the key message was, and including city-level detail without giving us counts and state-level detail without annotation frustrates rather than illuminates!

If you know where I can get temperature data, please leave a comment. I'd like to create a junkart version.

Reference: "Ferocious Heat Maintains Grip Across the West", New York Times, July 23 2005.


Random as doublespeak

I'll write a proper post on graphics later but for now, a short gripe on the misuse of the concept of randomness.

What prompted me was a gaggle of recent headlines such as:

  • New York subway riders face random searches
  • Police start random bag searches on NY subways
  • Random checks underway on some subways and commuter trains

Random implies everyone has an equal chance of getting searched.  Putting myself in the shoes of a subway policeman, I wonder how I can enforce random.  Flipping a fair coin would do it but it's not practical.
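For what it is worth, here is what the coin-flip version would look like as a minimal sketch. The function name, the rider count and the seed are hypothetical; the point is only that every rider faces the same probability, independent of who they are.

```python
import random


def select_for_search(riders, probability, rng):
    """Select each rider independently with the same probability."""
    return [rider for rider in riders if rng.random() < probability]


# Seeded generator so the sketch is repeatable; the seed is arbitrary.
rng = random.Random(2005)
riders = list(range(10_000))

# probability=0.5 corresponds to flipping a fair coin for every rider.
selected = select_for_search(riders, 0.5, rng)
share = len(selected) / len(riders)
print(f"searched {len(selected)} of {len(riders)} riders ({share:.1%})")
```

A fair coin means stopping roughly half of all riders, which shows why a truly random scheme is impractical at a turnstile.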

More likely, random is doublespeak.  It means exactly its opposite.

Sometimes, random is an excuse for not thinking.  In market research, we often hear results are unbiased because of random sampling.  One pauses and asks: would the results have been more meaningful were the sample to be biased towards the specific segments of people of greatest interest?


Donuts and pies: which tastes worse?

I have never come across a situation that calls for a pie chart.  The human mind thinks linearly: we can compare lengths of line segments but when it comes to angles most of us can't judge them well.

The donut chart is a pie chart with a hole punched in the middle.  Alas, the missing middle contains the angles that help us size up the slices.  The donut chart is a useless chart made worse. 
Never ever use a donut chart.

Each publication gravitates to certain "pet" charts: the Economist happens to like donut charts.  Hopefully their editors will read this and stop using them.  Here is a recent example:

Redoeconomistpop_7


We might as well point out three additional crimes: firstly, having one donut as a mirror image of the other denies us any chance of comparing like-colored slices properly; secondly, the lines linking labels to slices positively make us dizzy; finally, the least important detail, i.e. the total population size, stares us in the eye.

Reference: "The Americano Dream", The Economist, July 14, 2005


Stock return charts

Data abundance breeds lazy, unfocused graphics: more is not always better.  I do not completely agree with Tufte on the matter of maximizing the data-ink ratio.  He loves data-rich designs.  I believe that there are diminishing returns: too much data frequently crowd out the message.

A case in point is the typical stock return chart.  Here is an example from NYT:

20050717valugraphic_1

Notice the wasted motion in the zig-zagging lines, tracing weekly stock returns over 5 years.  Do they need so much data to make their point?

In fact, what is the point of this graphic?  The sub-title does not inform.

The infamous grid-lines make another appearance but are helpless against such granular data.  Similarly, the horizontal scale showing every other year is incongruous with weekly data.

Finally, this chart can be very misleading: altering the time period or how returns are measured would change the graphic drastically.
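A tiny sketch illustrates the sensitivity to the starting date. It assumes returns are measured as percentage change from the first price in the window; the prices are hypothetical and have nothing to do with the stocks in the NYT chart.

```python
def cumulative_returns(prices, base_index=0):
    """Percentage change of each price relative to the price at `base_index`."""
    base = prices[base_index]
    return [(price / base - 1) * 100 for price in prices[base_index:]]


# Hypothetical price series for a single stock.
prices = [100, 50, 60, 75]

from_start = cumulative_returns(prices)
from_second = cumulative_returns(prices, base_index=1)

print(f"measured from the first date:  {from_start[-1]:+.0f}%")
print(f"measured from the second date: {from_second[-1]:+.0f}%")
```

The same stock ends the period as a loser or as a winner depending only on where the chart begins.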

I am working on various ideas for improvement.  Will post them later.

Reference: "The Blurry Boundaries of Growth and Value", Sunday New York Times, July 17, 2005.


The right emphasis

Junk artists have no shortage of raw materials; those living in NYC will understand what I mean.  So today I saw this graphic in the Economist.

Redoeconomistaids

It supports an article that describes the "progress and problems" with WHO's 3x5 campaign to fight AIDS in poorer countries.

In my junkchart version, I have switched the emphasis from the absolute number of people in need to the percentage of those who have (or have not) received ARV therapy.

I dislike the common practice of the cut-off (see the sub-Sahara bar): our brain just isn't capable of extrapolating and understanding how far the bar would have stretched off the page.

The grid-lines are avoided by providing data labels.

As usual, the sub-title gives the main point of the graphic, and anything minor (namely, the date) is put to the periphery.

Reference: "Moving Targets", Economist, June 30 2005