When to use the start-at-zero rule
Apr 24, 2014
This is a response to a tweet that was forwarded to me. The person tweeting complained that FiveThirtyEight uses charts that don’t start the vertical axis at zero. The example given was this:
In this post, I want to clear up some confusion about the "start-at-zero" rule.
This rule is an absolute must only for column (or bar) charts but is not intended for line charts. Here is a bar chart with the axis starting at 60% instead of 0:
I highlighted the columns for 1993 and 1996. Visually, the height of one column is twice that of the other column. And yet the axis labels tell us that the difference is 65% versus 62.5%.
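To put numbers on that distortion, here is a minimal Python sketch. The function names are mine, chosen only for illustration; the values 65 and 62.5 and the 60% baseline are the ones read off the chart above.

```python
def drawn_height(value, baseline):
    """Height of a column as drawn when the axis starts at `baseline`."""
    return value - baseline


def height_ratio(a, b, baseline=0.0):
    """Ratio of the drawn heights of two columns for a given axis baseline."""
    return drawn_height(a, baseline) / drawn_height(b, baseline)


# Axis truncated at 60%: the drawn heights are 5 and 2.5.
truncated = height_ratio(65.0, 62.5, baseline=60.0)

# Axis starting at zero: the drawn heights are the values themselves.
from_zero = height_ratio(65.0, 62.5, baseline=0.0)

print(f"truncated axis: {truncated:.2f}x, zero-based axis: {from_zero:.2f}x")
```

With the axis starting at 60, one column is drawn twice as tall as the other, although the underlying values differ by only about 4 percent.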
***
The reason for the start-at-zero rule is to avoid exaggerating meaningless differences.
To judge whether a change is meaningful or not, in time-series data like this, we have to use history to understand the general variability in college enrollment rates. Based on what we can see in this data (about 20 years), the college enrollment rate hovers between 60 and 70 percent. There is no data between 0 and 60 percent. Those are irrelevant values for this data series. This is why starting at zero is counterproductive.
Here is the line chart starting at zero:
This display has the unintended effect of squashing meaningful changes over time by inserting a lot of empty space below the line.
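A rough way to quantify the squashing is to ask what share of the vertical axis the data actually occupy. The sketch below uses the 60 to 70 percent range mentioned above; the helper name and the choice of 70 as the top of the axis are my assumptions, not something taken from the chart.

```python
def share_of_axis_used(data_min, data_max, axis_min, axis_max):
    """Fraction of the axis height spanned by the data range."""
    return (data_max - data_min) / (axis_max - axis_min)


# Enrollment rate hovering between 60 and 70 percent.
zero_based = share_of_axis_used(60, 70, axis_min=0, axis_max=70)
fitted = share_of_axis_used(60, 70, axis_min=60, axis_max=70)

print(f"zero-based axis: {zero_based:.0%} of the height shows variation")
print(f"fitted axis: {fitted:.0%} of the height shows variation")
```

Under these assumptions, a zero-based axis leaves only about one seventh of the chart height for the changes we care about; the rest is empty space below the line.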
A column chart starting at zero looks like this:
This fixes the truncated column chart from above, but it also squashes meaningful changes over time. A column chart is just a poor choice to illustrate this dataset.
For those who don't like the line chart, consider using a dot plot:
I'm mostly in agreement here but with one caveat: I think there may be cases where the "base" y-axis value is not 0.
Immediately, I'm thinking of metrics plotted on a percentage scale where we are interested in how far below 100% they fall. The supply-chain domain has examples in forecast accuracy and order fill rates, which would ideally hit 100%; it's the degree to which they fall short of 100% that is of interest, not the distance from 0. In these cases I think the y-axis should "end at 100%" and I'm not concerned at all with where it starts.
Or how about temperature scales, where the 0 is essentially arbitrary (unless you are using degrees Kelvin)?
I imagine there are other examples where the base/reference/key value for the y-axis (what should this be called?) is non-zero too.
Posted by: Andrew Gibson | Apr 24, 2014 at 09:08 AM
Where the 0 is essentially arbitrary, you shouldn't be using a bar chart at all.
Posted by: derek | Apr 24, 2014 at 12:53 PM
In the first example you've cited, Andrew, would it not be best to plot the variance using 100% as the baseline value, with your data points appearing above or below the baseline as needed? This could be accomplished with equal effect using a column or line chart, although a line is probably preferable for plotting time-series values.
Posted by: Chad Smith | Apr 24, 2014 at 07:30 PM
Chad - I could plot it with 100% as the baseline, but then I'm not really plotting the metric I started with; I'm (at least visually) representing its difference from 100%.
Posted by: Andrew Gibson | Apr 25, 2014 at 09:28 AM
Derek - you're right of course, perhaps that's a bad example. Let's stick with the percentage scale idea.
Forecast Accuracy is a VERY common business metric (actually defined as 1 - PercentageError). It's defined that way, I think, so that bigger is better, but what I really care about is the absolute and relative size of the error, NOT the relative size of "accuracy". I can't really change the metric; it's deeply embedded in business usage.
Sometimes this is for time-series data, where I use a line chart and let both limits float. Where I'm comparing results across categories, my approach has been to force 100% as the maximum of the y-axis and let the minimum float.
Surely this isn't the only example of a non-0 base?
Posted by: Andrew Gibson | Apr 25, 2014 at 09:38 AM
The whole point of a zero base for bar/column (and area) charts is that the length of the bar encodes the totality of a value, starting from nothing. The length of the bar tells you nothing at all if it doesn't start from zero.
If you can't start the bar at (a meaningful) zero, then a bar chart is not the appropriate way to chart your data.
In the case of a variance chart, regardless of what label you put on the base line, you are still in reality encoding that data from a zero base, where the value against which you are measuring variance is the zero mark.
There are plenty of cases for non-zero bases for most other chart types - most often you will *not* want a zero base for a line chart, scatter plot or box plot (though obviously if zero is, incidentally, appropriate to the data set it should be used).
Posted by: jlbriggs | Apr 25, 2014 at 01:12 PM
Appreciate the comments from everyone as it's helping me clarify my own thinking. Mainly that this example that jumped to the front of my mind is probably not a widespread issue.
This particular metric, forecast accuracy, defined as 1 - [ForecastErrorRate], varies in the range (-inf, 100%], with typical values in the (20%, 100%) range. I did not invent the metric; it's a business construction designed so that bigger numbers are better.
As a stats guy I would prefer to see an error rate, which I could plot against a 0 base without causing anyone offense and get a reasonably accurate perception of absolute and relative values in the plot.
However, most business users do not recognize/understand the error metric and really want to see the accuracy values. I most definitely do NOT want to force a 0 base for this on any chart because it is meaningless.
By forcing 100% as the top of the axis, the gap between the top-axis and the plotted data (whatever chart construct I use) now becomes meaningful in absolute/relative terms. This seems better to me than leaving the axis to float at both ends, but I'm certainly open to other suggestions.
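A small sketch of why the gap from the top of the axis carries the information here. The two accuracy values (90% and 80%) are hypothetical, picked only to make the arithmetic easy to follow, and the helper name is likewise illustrative.

```python
def shortfall(accuracy_pct, target_pct=100.0):
    """Distance between an accuracy value and the 100% target, in points."""
    return target_pct - accuracy_pct


# Hypothetical accuracy values for two categories.
category_a = 90.0
category_b = 80.0

# Comparing the accuracy values directly understates the difference...
accuracy_ratio = category_a / category_b

# ...while comparing the shortfalls from 100% shows B has twice A's error.
shortfall_ratio = shortfall(category_b) / shortfall(category_a)

print(f"accuracy ratio A/B: {accuracy_ratio:.3f}")
print(f"shortfall ratio B/A: {shortfall_ratio:.1f}")
```

Measured from 0, the two categories look almost alike (a ratio of 1.125); measured down from 100%, one has twice the error of the other, which is the comparison an axis pinned at 100% makes visible.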
Posted by: Andrew Gibson | Apr 29, 2014 at 09:19 AM
@andrew - perhaps you can link to an example of a chart such as what you are describing.
Posted by: jlbriggs | Apr 29, 2014 at 12:38 PM
I have most often seen budget accuracy comparisons plotted as either
a) a variance chart, showing % above/below budget for each period (usually bars but also lines or areas)
b) a line plot with a line for actual and a line for budget
c) a bullet graph, or similar style bar chart
Posted by: jlbriggs | Apr 29, 2014 at 12:42 PM
@ Andrew - I'd also be curious to see an example if possible. I'm also working with budget data and would like to see the presentation of the metric you've described.
Posted by: Chad Smith | Apr 30, 2014 at 10:53 AM
Sorry for the delay. I think the problem may be in my communication. While I know it sounds like it, "Forecast accuracy" is not a typical variance measure.
We can define it as:
1 - [(weighted) Mean Absolute Percentage Error]
= 1 - SUM(ABS(Forecast-Actual))/SUM(Actual)
There are many variations on this, and it's the subject of heated debate on forecasting forums, but they all share similar characteristics:
- perfect forecasts would have no error and return 100%
- there is no effective lower bound on the metric, negative numbers of any scale are possible
- it's really the relative size of the error that we are interested in, even though business users insist on looking at "accuracy".
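As a numeric illustration of the definition and the characteristics listed above (separate from the chart example), here is a minimal Python sketch of the 1 - SUM(ABS(Forecast-Actual))/SUM(Actual) form. The function name and all the sample numbers are made up for illustration, and other variants of the metric would be computed differently.

```python
def forecast_accuracy(forecasts, actuals):
    """Return 1 - SUM(ABS(Forecast - Actual)) / SUM(Actual) as a fraction."""
    if len(forecasts) != len(actuals):
        raise ValueError("forecasts and actuals must have the same length")
    total_actual = sum(actuals)
    if total_actual == 0:
        raise ValueError("sum of actuals must be non-zero")
    total_abs_error = sum(abs(f - a) for f, a in zip(forecasts, actuals))
    return 1 - total_abs_error / total_actual


# Made-up actual demand for three periods.
actuals = [100, 200, 100]

perfect = forecast_accuracy([100, 200, 100], actuals)  # no error at all
decent = forecast_accuracy([110, 180, 110], actuals)   # total error 40 on 400
poor = forecast_accuracy([400, 600, 400], actuals)     # total error 1000 on 400

print(f"perfect: {perfect:.0%}, decent: {decent:.0%}, poor: {poor:.0%}")
```

The perfect forecast returns 100%, and the badly over-forecast series returns a negative value, which shows why 0 is not a natural floor for this metric.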
I'll post an example next.
Posted by: Andrew Gibson | May 08, 2014 at 09:37 AM
I can't seem to add graphics into the comments section so I'll post it on my blog and drop a link in here.
Posted by: Andrew Gibson | May 08, 2014 at 10:02 AM
OK - I added a post on my blog so I can cover this in more detail. In particular I expanded on the characteristics of the forecast accuracy metric as I think that is key to my problem. (I also thought of a handful of other supply chain metrics with the same issue)
Link below.
Visualizing Forecast Accuracy. When not to use the "start at zero" rule ?
I appreciate the feedback from all the contributors here. Comments are welcome.
Posted by: Andrew Gibson | May 08, 2014 at 11:36 AM
All, I apologize for posting in an old comment thread, but I am quite interested in the following comments:
"Or how about temperature scales, where the 0 is essentially arbitrary (unless you are using degrees Kelvin)?"
and, from derek:
"Where the 0 is essentially arbitrary, you shouldn't be using a bar chart at all."
I have exactly this problem with temperature scales.
Does Derek mean that temperature should never be used as a Y-axis of a bar chart unless it is displayed in Kelvin? I can understand why that might be but I've never had it put to me.
As a scientist/engineer I have charts of temperature, typically in degrees Celsius, usually at a magnitude of ~180 degC, across different categories. I have measurements with 95% confidence error bars of ~1 degree. Being able to see variations of a few degrees is important. Setting the zero of the graph to 0 degrees Celsius (or worse, 0 kelvin) hides all the differences.
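To put rough numbers on the arbitrary-zero point, the sketch below compares two hypothetical readings, 180 and 183 degC, on both scales. The readings and the function name are illustrative only; the Celsius-to-kelvin offset of 273.15 is the standard one.

```python
def celsius_to_kelvin(temp_c):
    """Convert a temperature from degrees Celsius to kelvin."""
    return temp_c + 273.15


# Two hypothetical category readings a few degrees apart.
a_c, b_c = 180.0, 183.0
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Differences are the same on both scales...
diff_c = b_c - a_c
diff_k = b_k - a_k

# ...but ratios (what bar lengths encode) depend on where zero sits.
ratio_c = b_c / a_c
ratio_k = b_k / a_k

print(f"difference: {diff_c:.2f} degC vs {diff_k:.2f} K")
print(f"bar-length ratio: {ratio_c:.4f} (Celsius) vs {ratio_k:.4f} (kelvin)")
```

The 3-degree difference survives the change of scale, but the ratio of bar lengths does not, so a bar drawn from 0 degC encodes a ratio that has no physical meaning. That points toward charts that encode position rather than length for this kind of data.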
I guess ideally I should be using box plots or something but they are not really in common usage within the sort of audiences I typically present to and probably would cause me to spend more time explaining them rather than talking about the results.
I could use a dot plot/scatter plot I suppose but on some of the larger charts (20+ un-ordered categories) these can be really hard to read.
Posted by: Ell | Sep 30, 2016 at 05:10 AM