Web publishing

Jerome C., a reader and blogger, wrote up a wonderful piece on different ways to publish charts on the web.  Highly, highly recommended. 

*** Rant ***

One of the points he made was that images (jpegs, gifs, etc.) are often published with poor quality.  I feel the pain.  Ever since Typepad switched to its new and "improved" editor, this blog has been suffering from low-quality thumbnails.  I know, I know... I need to move to Wordpress.  But from a 15-minute online research effort, I realized that moving a blog with lots of images is all but impossible!  All of the images would have to be uploaded again, and a lot of links would need to be fixed.  Maybe the next time I am on holiday, I will get around to it.

*** Excel ***

Take a look at his comparisons of four ways to forklift an Excel chart onto a blog.  (The image on the right showed one of the four ways.) The difference in image sharpness is marked.

Resizing Excel charts is a common source of headache.  Always right-size the chart inside Excel before exporting!

*** Swivel, Google, etc.***

I also share Jerome's point of view on these on-line graphics creators.  Good idea, wishing for more.  In his words:

to make a point, you absolutely need to be able to control every aspect of your graph, even if its form remains familiar: combine series, group or highlight some datapoints, format axis, and so on.

I would like to explore the other options he cited, such as Processing.

*** Great example ***

Jerome's blog has a promising beginning.  The following chart is both informative and beautifully crafted.  It brings out the clear message that OECD countries have done admirably well in life expectancy, and have been particularly impressive in reducing the variance among member countries by lifting the expected age of the worse-off, relative to the better-off, with most of the gain happening during the 1980s.  (Adding quartiles may also be meaningful.  And I prefer to put the labels outside the plot area.)  The graph does not explain what caused the shift in the 1980s but this is a great starting point for the curious.



This graph, called the Bell Curve, is a wannabe.

The first hint is its asymmetry, the right tail being longer than the left tail.

Further, the helpful labelling of the "average" does not coincide with the peak of the curve.

The author of the annotation seemed to understand, calling the distribution "skewed".  A Bell Curve is not skewed.

This is a pity because the designer might have selected a different chart type if she weren't so enamored of the bell curve object.

The data tells us about users of 30-day unlimited passes in the New York City subway system: how many trips do they typically make?  The card costs $81 while each trip costs $2 so anyone taking 40 or fewer trips in those 30 days would have been better off buying individual tickets.  The "average" user took 56 trips.  The range of trips taken was very wide, perhaps surprisingly so.
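For readers who want to check the break-even arithmetic, here is a minimal sketch.  It uses only the two prices quoted above and assumes a flat $2 fare with no discounts or bonuses; the function and variable names are mine, chosen for illustration.

```python
import math

PASS_PRICE = 81.0  # 30-day unlimited pass, as quoted in the post
FARE = 2.0         # single-ride fare, as quoted in the post (flat fare assumed)

def cheaper_option(trips: int) -> str:
    """Return which option costs less for a given number of trips in 30 days."""
    pay_per_ride = trips * FARE
    if pay_per_ride < PASS_PRICE:
        return "pay-per-ride"
    if pay_per_ride > PASS_PRICE:
        return "unlimited pass"
    return "tie"

# Smallest whole number of trips at which the pass is strictly cheaper.
break_even_trips = math.floor(PASS_PRICE / FARE) + 1

print(break_even_trips)    # 41
print(cheaper_option(40))  # pay-per-ride
print(cheaper_option(56))  # unlimited pass
```

So the "average" rider at 56 trips comes out ahead with the pass, while a rider at the 40-trip mark does not.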

Several key pieces of information have been left off the chart.  What is the total number of riders?  Without this, there is no way for readers to understand 15,185.  What is the smallest (and largest) number of trips taken by any rider?  Visually, it appears that the horizontal axis does not start at zero.

It would have been better to show a cumulative distribution with percentages of riders on the vertical axis.  On such a chart, we can read off the median and any percentiles.  In other words, it would be much more informative.
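To make the idea concrete, here is a rough sketch of how a cumulative distribution lets us read off percentiles.  The trip counts below are made up for illustration (they are not the NYT data), and the helper names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical trip counts for a handful of pass holders (NOT the NYT data).
trips = [12, 25, 38, 41, 47, 52, 56, 60, 63, 71, 88, 104]

def cumulative_share(data, x):
    """Fraction of riders who took at most x trips."""
    ordered = sorted(data)
    return bisect_right(ordered, x) / len(ordered)

def percentile(data, p):
    """Smallest observed value whose cumulative share is at least p (0 < p <= 1)."""
    ordered = sorted(data)
    for value in ordered:
        if bisect_right(ordered, value) / len(ordered) >= p:
            return value
    return ordered[-1]

# Share of riders at or below the 40-trip mark, and the median rider.
share_at_40 = cumulative_share(trips, 40)
median_trips = percentile(trips, 0.5)
print(share_at_40, median_trips)
```

Plotting `cumulative_share` against the number of trips gives the chart suggested above: the vertical axis is the percentage of riders, and any percentile can be read straight off the curve.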

As it stands, I like very much the annotation of the 56 trip and the 100 trip points: they are great aids to help decipher the chart.  It would be great to indicate the 40 trip point too.

For those more technically inclined: the graph also raises the question of whether it is an actual or modelled curve.  It looks too smooth to be actual data.  If it is a model, then it is definitely not a normal distribution.  What could it be?  A spline?

Reference: "In Decade of Unlimited Rides, MetroCard Has Transformed How the City Travels", New York Times, July 16 2008.

Joining the fun

We hope this is an indication that the British paper the Guardian (with one of the best websites out there) is joining the fun.  It appears that they have quietly debuted an interactive graphics feature.  The first edition addressed the oil price crisis.

This time-series chart has much to be commended:


The use of inflation-adjusted figures seems obvious but we don't see much of these in the press.  Highlighting the peaks and providing annotation (when moused over) is an excellent touch.  The gridlines and axis labels (especially the year axis) are thankfully restrained.  We don't see the need for the unadjusted series (blue line), however.  The fact that the gap grew larger the further back in time we went told us little, as it invited readers to read more into it than what it truly was: the time value of money.
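A small sketch shows why the gap between the two lines carries no information about oil.  The price-index values and nominal prices below are invented for illustration (they are not the Guardian's figures), and the standard index-ratio adjustment is assumed.

```python
def to_real_price(nominal, index_then, index_base):
    """Convert a nominal price into base-period dollars using a price index."""
    return nominal * index_base / index_then

# Hypothetical price-index values and nominal oil prices, for illustration only.
price_index = {1980: 80.0, 2000: 170.0, 2008: 215.0}
nominal_oil = {1980: 37.0, 2000: 30.0, 2008: 130.0}
BASE_YEAR = 2008

real_oil = {
    year: to_real_price(price, price_index[year], price_index[BASE_YEAR])
    for year, price in nominal_oil.items()
}

# The ratio of adjusted to unadjusted price depends only on the price index,
# not on the oil price itself, and it grows the further back we go.
ratio = {year: real_oil[year] / nominal_oil[year] for year in nominal_oil}
for year in sorted(ratio):
    print(year, round(real_oil[year], 1), round(ratio[year], 2))
```

Whatever oil prices did, the ratio between the two series is just the index ratio, which is why plotting both lines adds so little.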

Later on, they used an oil barrel object to illustrate the components of retail oil price.  The height of the cylinder is indeed proportional to the data plotted.  If only they colored the end of the cylinder gray instead of green!  As it stands, the green portion has about the same area as the red.


Reference: "Interactive: oil price", Guardian, July 14 2008.

Seth on bar charts

Seth followed up his post about graphics with a specific post about pie charts versus bar charts.  He prefers pie charts.  We happen to agree with his unhappiness with grouped bar charts.  Unfortunately he compared a univariate pie chart (depicting point-in-time data) with a multivariate bar chart (illustrating time-series data).

Here we present a different example, derived from a NYT article on diabetes in America.  The original chart is a series of pie charts, one for each age group, and one for the aggregate data.


The junkart version uses a bar chart.  Readers can get a more precise comparison of the prevalence rates across age groups because it is easier to judge lengths than areas.  This has been scientifically proven by the likes of Cleveland.

Dirty trick, you might say, because the original chart actually prints the data in each pie.


So now there is no mistaking the data.  This raises a philosophical question: why bother graphing the data if the reader needs to read the data in order to understand the chart?  We call this the self-sufficiency test.  The graphical elements of a pie chart can't stand on their own.

Reference: "Diabetes - underrated, insidious, and deadly", New York Times, July 18 2008.

Right metrics

Yesterday's post focused on the purely graphical aspects of NYT's very rich graphic on CEO compensation.  Today, we take a look at the data being plotted.  Aleks already jumped the gun, pointing out one deficiency of the stock price metric.

Recall the metrics were percent change of total CEO compensation (2006-2007), and percent change in company stock price (2006-2007).


The graphic attempted to simultaneously address two sets of comparisons: the relationship between compensation and stock price changes within one company; and the relationship of each company against "similarly sized" companies on compensation, and on stock price separately.

This graphic violates Godin's (Golden?) Rule #1, Only One Message, with which we generally agree.  In trying to accommodate both comparisons, it managed to confuse readers.  In particular, as pointed out yesterday, the primary comparison of compensation against stock price is hard to discern as the scale was determined by the second comparison (between companies).

The issue Aleks pointed out is that some CEOs are paid by stock; thus, their compensation would rise and ebb as the stock price rises and ebbs.  The correlation would then indicate the structure of the pay package, rather than the (presumed) pay for performance.

Stock price, in fact, is a poor indicator of company performance, especially short-term price changes such as the one-year changes used here.  Further, we have a problem of mismatched timing: pay (excluding the stock component) moves much much more slowly than stock prices; besides, while stock prices experience positive and negative changes, pay changes are skewed positive!  All these make direct comparison of these two metrics ill-advised.

If shareholder value is still the desired metric, then one should use a longer time-series.  This will crowd out the comparison with similarly sized companies but make the graphic more useful.

One final curiosity: according to this data set, Steve Jobs did charity work for Apple during that year; he received no stock or option grants and a nominal $1 salary.  Is this real?


Bound to extremes

The New York Times continued to push the envelope by printing super-complicated data graphics (while the Economist regrettably seemed to have picked the USA Today route... more on that in a future post).  The following graphic was used to illustrate the relationship between CEO compensation and their company's stock performance.


The two dotplot lookalikes depicted the percent change in CEO pay and the percent change in the companies' stock prices, in both cases, from 2006 to 2007.  The size of the dots indicates the relative value of the CEO's pay.  The gray dots depict "similarly sized" companies for comparability.

In this post, I will focus on the comparison between change in pay and change in stock price for a given CEO.  In particular, the calibration of the axis/scale is problematic.  The scale is automatically determined by an algorithm; as one switches from one CEO to another,  the graphs take on different ranges, use different axis labels, and the zero-percent points shift.

This means that the two charts have different scales.  In this example, each tick mark advances 6% in the top chart but 12% in the bottom chart.

Since the zero points do not line up, the distance between the zero and the orange dot loses meaning: the 2.5x longer distance in the top chart actually represented roughly the same percentage change as in the bottom chart (31% versus 28%).

In order to respect the grid-lines (white lines), the tick marks fall onto stray percentages (24%, 36%, 48%, etc.).  That's unfortunate.

What's the culprit?  This chart is "bound to extremes".  In other words, the range of the depicted data is used to determine the plot area.  The bottom chart had zero on the left edge because all the stocks depicted rose between 2006 and 2007.  It is often better to use domain knowledge to determine the plot area.  Extreme values should be omitted if they don't add to the message.  Oftentimes, by leaving extreme values in the picture, we squash the rest of the data.

This is also why programs like Excel do a poor job picking a scale.
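Here is a minimal sketch of the problem, assuming axis limits are taken straight from the smallest and largest values plotted.  The percent changes are made up (only the 31% and 28% come from the example above), the shared range is an arbitrary choice for illustration, and none of this is meant to describe the NYT's actual algorithm.

```python
def data_bound_limits(values):
    """Axis limits taken straight from the data extremes ("bound to extremes")."""
    return min(values), max(values)

def position_on_axis(value, limits):
    """Position of a value as a fraction of the axis length (0 = left edge)."""
    lo, hi = limits
    return (value - lo) / (hi - lo)

# Hypothetical percent changes for one CEO's peer group (not the NYT data).
pay_changes = [-20, -5, 10, 31, 40]
stock_changes = [0, 15, 28, 60, 100]

pay_limits = data_bound_limits(pay_changes)      # (-20, 40)
stock_limits = data_bound_limits(stock_changes)  # (0, 100)

# Zero lands in a different place on each data-bound axis.
zero_pay = position_on_axis(0, pay_limits)
zero_stock = position_on_axis(0, stock_limits)

# Distance from zero to the highlighted dot, as a share of axis length.
dist_pay = position_on_axis(31, pay_limits) - zero_pay
dist_stock = position_on_axis(28, stock_limits) - zero_stock

# A shared range chosen from domain knowledge keeps the two charts comparable.
SHARED_LIMITS = (-50, 100)  # assumed range, for illustration only
zero_shared = position_on_axis(0, SHARED_LIMITS)
shared_pay = position_on_axis(31, SHARED_LIMITS) - zero_shared
shared_stock = position_on_axis(28, SHARED_LIMITS) - zero_shared

print(round(dist_pay / dist_stock, 2), round(shared_pay / shared_stock, 2))
```

With data-bound limits, two nearly equal percentage changes are drawn at very different lengths; with a shared range, the drawn lengths stand in the same ratio as the numbers themselves.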

As an aside, the use of bubbles is almost always troubling. Bubbles do not have a scale so the only information we get is relative size.  However, we can't estimate areas properly so we get the relative size wrong.  Sometimes, even the chart designer may get stumped.  In the chart of Steve Jobs, you would think his bubble (total compensation $1) would be dwarfed by all the other bubbles, as in the WSJ chart we showed the other day.  Not so.
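For the curious, here is a rough sketch of how bubble sizes are commonly computed, assuming the designer wants area (rather than radius) to be proportional to the data.  The pay figures and the maximum radius are hypothetical, and this is not a claim about how the NYT sized its bubbles.

```python
import math

def radius_for_area_scaling(value, max_value, max_radius):
    """Radius such that bubble AREA is proportional to value."""
    return max_radius * math.sqrt(value / max_value)

def radius_for_linear_scaling(value, max_value, max_radius):
    """Radius proportional to value; this inflates AREA differences quadratically."""
    return max_radius * value / max_value

# Hypothetical pay figures in dollars (illustrative only).
big, small = 40_000_000, 10_000_000
MAX_RADIUS = 20.0  # assumed radius of the largest bubble, in arbitrary units

r_big = radius_for_area_scaling(big, big, MAX_RADIUS)
r_small = radius_for_area_scaling(small, big, MAX_RADIUS)
area_ratio = (r_big / r_small) ** 2  # matches the 4x pay ratio

r_small_linear = radius_for_linear_scaling(small, big, MAX_RADIUS)
linear_area_ratio = (r_big / r_small_linear) ** 2  # exaggerates it to 16x

# A $1 pay packet drawn to the same area scale is vanishingly small.
r_one_dollar = radius_for_area_scaling(1, big, MAX_RADIUS)

print(area_ratio, linear_area_ratio, r_one_dollar)
```

A bubble for $1 drawn faithfully to scale would be invisible, so any visible bubble for that value cannot be to scale, which may explain how a chart designer ends up stumped.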


Thanks to Todd B. for submitting this chart.

Reference: "Executive Pay: the bottom line for those at the top", New York Times, April 5 2008.

Seth Godin on charts

Long-time reader John S. alerted us to three charting tips given by marketing guru Seth Godin.

  1. One Story
  2. No Bar Charts
  3. Motion

Like John, I agree with One Story most of the time.  However, we don't agree completely with Seth's rationale:

If the facts demand nuance, don't use a graph, because you won't get nuance, you'll get confusion.

It is true that there are a great many confusing charts; it is even more true that more complexity leads to more confusion.  The more data is plotted, the more difficult it is to control the message.  That's why we advocate simplicity.  Recently, we considered complex charts used for exploration or as catalogs.  This sort of "infographics" is not intended for sales and marketing.  I wonder if Seth had these in mind...

However, a well-designed chart need not cause confusion, even if it is nuanced.  Gelman's chart of social and economic tendencies (here) is a great example of a nuanced chart with one main story but many subsidiary stories, if the reader chooses to look deeper.

The advice of No Bar Charts is misguided.  Seth said:

The correct use of a bar chart is to show how several items change over a period of time. This, of course, demands nuance.

No, and no.  If we want to show items changing over time, use a line chart.  The slope of the line gives additional information, that is, the growth rate.  (For example, here.)  It is a tough audience indeed who consider a single time series to be "nuanced", i.e. confusing according to tip #1.

There are indeed situations where bar charts work poorly: see here.  I particularly dislike grouped bar charts, much used in market research.  For many such situations, line charts or dot charts do a better job.

Motion can indeed be powerful.  We have shown some examples of great dynamic graphics, for example, the obesity map.  Our early review of Gapminder pointed to its use of motion as well.

But motion is difficult to execute well.  Motion is a type of nuance, and true to Seth's words, nuance can be distracting if not done properly.

Reference: "The three laws of great graphics", Seth Godin, July 10 2008.

It's tiny

We get it:

Trillion = very very big!!    Billion = tiny?

Reference: "Commodities regulator under fire", Wall Street Journal, July 2008.

Divided nation

Professor Gelman generally believes the red state, blue state paradigm is too simplistic to describe the American electorate.  He has been sharing some of his work on his blog, and has just published a book about this topic.  Recently he produced the following chart, which is gimmicky-looking but crystal clear in its message.


Here, economic and social ideology are plotted on a scatter chart, with positive values indicating conservatism and negative values liberalism.  Further, each state is represented twice on the chart, the red point for the Republicans and the blue for Democrats within the state.

This is a cluster analyst's dream data set.  The absolute separation of the Republican cluster and the Democrat cluster is astounding: imagine a diagonal line perfectly classifying all points.

We should not miss a host of details:

  • as Andrew pointed out, "the big thing we see from the graph ... is that Democrats are much more liberal than Republicans on the economic dimension: Democrats in the most conservative states are still much more liberal than Republicans in even the most liberal states."  This is clear from the wide gap on the horizontal axis.
  • there is a small degree of overlap on the social ideology axis so the nation is closer together on that front.
  • but wait a minute, the scale on the social axis is not the same as that on the economic axis.  This means that the extremes are more extreme on the social axis: the difference between MS and VT is roughly 0.8 on the social scale while the largest difference on the economic scale is roughly 0.5.  (here, I am assuming that the scales are comparable to each other)
  • there is high correlation between social and economic ideologies: the points are well-aligned along the 45-degree line
  • especially on social issues, the Democrats are divided within (the elongated shape of the blue cluster).

Reference: Gelman, "Ranking states by conservatism/liberalism of their voters", June 30 2008.