A matter of timing
May 11, 2008
A reader, Carly C. from Streetsblog, created the following chart and wanted to know if there are better ways to present the data. She already disliked the double axes and had thought of various options, including using a relative scale.
Generally speaking, a chart with dual axes, in which each axis takes its own scale, is like a football team with two "good" quarterbacks rotating under center, or two "great" CEOs sharing power. We have never seen those situations work out.
When we have two quantities under comparison, we like to put them on the same scale. In this case, converting the scale from absolute numbers to relative would do the trick.
The data paint a powerful story: as bike volume increased over time, bike accidents decreased. The stitching together of two lines at year 1999 was an artifact of manipulating the scales. What Carly had in mind can be accomplished using an index set at 100 in 1999. This would lead to the chart shown left. The substance of this chart and Carly's original is the same but the revised one has a single axis.
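For readers who want to try the indexing step themselves, here is a minimal sketch in Python. The numbers are made up purely for illustration (they are not Carly's data), and the function name `index_to_base` is my own invention.

```python
def index_to_base(series, base_year):
    """Rescale a {year: value} series so the base year equals 100."""
    base = series[base_year]
    return {year: 100.0 * value / base for year, value in series.items()}


# Hypothetical numbers, for illustration only (not the Streetsblog data).
volume = {1998: 8000, 1999: 10000, 2000: 12000, 2001: 13000}
crashes = {1998: 5500, 1999: 5000, 2000: 4500, 2001: 4000}

# Both series now share one axis: each value is a percentage of its 1999 level.
volume_idx = index_to_base(volume, 1999)
crashes_idx = index_to_base(crashes, 1999)

for year in sorted(volume):
    print(year, round(volume_idx[year], 1), round(crashes_idx[year], 1))
```

Because each series is divided by its own 1999 value, both lines pass through 100 in 1999 by construction, which is why the two lines always meet at the baseline year.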
Indexing time series data is a widely used technique. Each issue of the Economist, for example, contains many such charts. This type of chart, however, suffers from a critical and under-appreciated problem: the visible pattern frequently and critically depends on timing. Specifically, it makes a huge difference which year is selected as the baseline (index=100).
A lot of mischief is possible by picking a special baseline. For example, I created the same chart three times, using 1998, 1999 and 2000 respectively as baselines. When 1999 was 100 (middle chart), a criss-cross pattern showed up between 2001 and 2002, leading readers to conclude that the gap between growth in volume and growth in accidents developed during 2001. In the other two charts, the gap appeared around 2000. Also, the bottom chart exhibited a clear growing gap (after dumping the disagreeable data before 2000).
Unfortunately, this is a feature of such charts; whether or not timing distorts the information presented depends on how rugged the underlying data is. Put another way, these charts can be affected by outliers. (In this example, there were sharp changes in bike volumes in 1998-2000.)
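The baseline effect is easy to reproduce. The sketch below is an illustration under stated assumptions: the data are invented (with a deliberately jumpy volume series around 1998-2000), and the helper `years_volume_leads` is a hypothetical name, not anything used to make the charts in this post. It lists the years in which the volume index sits above the crash index for a given baseline.

```python
def index_to_base(series, base_year):
    """Rescale a {year: value} series so the base year equals 100."""
    base = series[base_year]
    return {year: 100.0 * value / base for year, value in series.items()}


def years_volume_leads(volume, crashes, base_year):
    """Years in which the volume index is strictly above the crash index."""
    v = index_to_base(volume, base_year)
    c = index_to_base(crashes, base_year)
    return [year for year in sorted(volume) if v[year] > c[year]]


# Invented data, for illustration only: volume is erratic in 1998-2000,
# while crashes decline smoothly.
volume = {1997: 100, 1998: 80, 1999: 120, 2000: 95,
          2001: 100, 2002: 115, 2003: 125}
crashes = {1997: 100, 1998: 98, 1999: 96, 2000: 94,
           2001: 90, 2002: 85, 2003: 80}

for base_year in (1998, 1999, 2000):
    print(base_year, years_volume_leads(volume, crashes, base_year))
```

With these made-up numbers, a 1999 baseline shows volume pulling ahead only from 2002, while a 2000 baseline shows it ahead from 2001, and a 1998 baseline shows it ahead from 1999. The underlying data are identical in all three cases; only the choice of the year set to 100 differs.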
Reference: Streets Blog
PS. [5/12/2008] How opportune was Andrew's post on R graphics default headaches. I was too lazy to figure out the defaults and let R figure out the dimensions (poorly); with Jake's suggestions, the new set of charts looked much better.
Any time I see a chart like this where the bottom of the y-axis is not zero, I distrust it.
One needs to be able to see the difference in the final two values with respect to zero. In other words, the opening might be only 0.02% on one graph but 70% on another, yet look exactly the same due to the selection of the y-axis range.
Posted by: Grant Hutchins | May 12, 2008 at 12:39 AM
Or chart crashes per 100 bike trips?
The two axis chart has an important advantage of tangibility. Some people find it much easier to trust a chart if they can find tangible numbers such as 6000 bike crashes, as compared to an intangible number such as bike crash index of 120.
Posted by: Michael | May 12, 2008 at 02:38 AM
Grant, I think in this post all the charts that need to have y-axis beginning from zero actually have it there (the original graph).
The "zero-level" of index charts here is 100, and that's centered in the middle of the graph. Value 0 has no meaning in these charts and displaying it would just be confusing.
I agree with Michael that displaying a metric calculated from these two time series could be a good option, although that hides the interesting info that absolute values of both variables are changing, rather than just one of them.
Posted by: TH | May 12, 2008 at 04:39 AM
To avoid the baselining and tangibility issues, one could use a panel chart, where the two series occupy parallel panels in the chart. Each panel has its own scale, without normalizing, so the reader can see actual values, and in separate panels there's no way the lines will cross and lead to spurious conclusions.
Panel charts in Microsoft Excel
http://peltiertech.com/Excel/ChartsHowTo/PanelUnevenScales.html
Posted by: Jon Peltier | May 12, 2008 at 07:08 AM
Jon: or two charts?
Posted by: Hadley | May 12, 2008 at 10:28 AM
How about a scatterplot? The points could be labeled with their year since there are only 8 data points. This makes the conclusion that accidents go down as volume goes up even more clear.
Posted by: Michael Galloy | May 12, 2008 at 01:22 PM
Hadley -
I prefer the panel chart over multiple charts, because the panel keeps the different parts of the chart together in a single object.
Posted by: Jon Peltier | May 12, 2008 at 11:23 PM
I rarely used charts of indexed timeseries, but tried one recently after reading this post and learned that they can be treacherous! With all the media coverage of rising fuel prices, I got hold of some data for Sydney retail petrol prices and wholesale crude oil and gasoline prices. Rather than doing the sensible thing and converting to an equivalent unit ($ per L), I thought indexed timeseries would be a shortcut. I didn't think it through. The chart showed wholesale prices increasing much more than retail, suggesting that retail prices could increase further. Of course, since retail prices are wholesale + margin, without significant increases in the margin, the retail growth rate should only be a fraction of wholesale and the proper chart in common units showed no divergence. Next time I'll be more careful!
Posted by: Sean Carmody | May 28, 2008 at 06:58 AM
Any thoughts on this chart? Here is what should be a close approximation to the underlying data, although it uses spot crude oil prices and I suspect the chart in question uses futures prices.
Posted by: Sean Carmody | Jun 10, 2008 at 07:59 PM
These things take advantage of the human psychological weakness for pareidolia, or "seeing the Virgin Mary in a tortilla". Or, a related weakness, which is our story brain, that makes a narrative out of random events, or privileges poor explanations that fit a story, over better explanations that diss the story.
Ironically, a technique which may be thought of as cheating-- finding the least-squares fit between the two curves and adapting the scales to use that-- actually reproduces the "approved" way of demonstrating correlation, which is to find the least squares straight line through a scatter plot of the data. I don't know what to think about that :-)
Kaiser has discussed similar issues in "The eyeball test".
Posted by: derek | Jun 11, 2008 at 06:02 AM
@derek: thanks for the excellent word, pareidolia, I'll have to remember that one. I must admit I am always highly suspicious of these kinds of shifted/scaled charts although they are extremely popular in finance.
Posted by: Sean Carmody | Jun 12, 2008 at 09:56 PM