Thoughts on Daniel's fix for dual-axes charts
Mar 14, 2023
I've taken a little time to ponder Daniel Z's proposed "fix" for dual-axes charts (link). The example he used is this:
In that long post, Daniel explained why he preferred to mix a line with columns, rather than using the more common dual lines construction: to prevent readers from falsely attributing meaning to crisscrossing lines. There are many issues with dual-axes charts, which I won't repeat in this post; one of their most dissatisfying features is the lack of connection between the two vertical scales, and thus, it's pretty easy to manufacture an image of correlation when it doesn't exist. As shown in this old post, one can expand or restrict one of the vertical axes and shift the line up and down to "match" the other vertical axis.
Daniel's proposed fix retains the dual axes, and he even restores the dual lines construction.
How is this chart different from the typical dual-axes chart, like the first graph in this post?
Recall that the problem with using two axes is that the designer could squeeze, expand or shift one of the axes in any number of ways to manufacture many realities. What Daniel effectively did here is selecting one specific way to transform the "New Customers" axis (shown in gray).
His idea is to run a simple linear regression between the two time series. Think of fitting a "trendline" in Excel between Revenues and New Customers. Then, use the resulting regression equation to compute an "estimated" revenues based on the New Customers series. The coefficients of this regression equation then determines the degree of squeezing/expansion and shifting applied to the New Customers axis.
The main advantage of this "fix" is to eliminate the freedom to manufacture multiple realities. There is exactly one way to transform the New Customers axis.
The chart itself takes a bit of time to get used to. The actual values plotted in the gray line are "estimated revenues" from the regression model, thus the blue axis values on the left apply to the gray line as well. The gray axis shows the respective customer values. Because we performed a linear fit, each value of estimated revenues correspond to a particular customer value. The gray line is thus a squeezed/expanded/shifted replica of the New Customers line (shown in orange in the first graph). The gray line can then be interpreted on two connected scales, and both the blue and gray labels are relevant.
***
What are we staring at?
The blue line shows the observed revenues while the gray line displays the estimated revenues (predicted by the regression line). Thus, the vertical gaps between the two lines are the "residuals" of the regression model, i.e. the estimation errors. If you have studied Statistics 101, you may remember that the residuals are the components that make up the R-squared, which measures the quality of fit of the regression model. R-squared is the square of r, which stands for the correlation between Customers and the observed revenues. Thus the higher the (linear) correlation between the two time series, the higher the R-squared, the better the regression fit, the smaller the gaps between the two lines.
***
There is some value to this chart, although it'd be challenging to explain to someone who has not taken Statistics 101.
While I like that this linear regression approach is "principled", I wonder why this transformation should be preferred to all others. I don't have an answer to this question yet.
***
Daniel's fix reminds me of a different, but very common, chart.
This chart shows actual vs forecasted inflation rates. This chart has two lines but only needs one axis since both lines represent inflation rates in the same range.
We can think of the "estimated revenues" line above as forecasted or expected revenues, based on the actual number of new customers. In particular, this forecast is based on a specific model: one that assumes that revenues is linearly related to the number of new customers. The "residuals" are forecasting errors.
In this sense, I think Daniel's solution amounts to rephrasing the question of the chart from "how closely are revenues and new customers correlated?" to "given the trend in new customers, are we over- or under-performing on revenues?"
Instead of using the dual-axes chart with two different scales, I'd prefer to answer the question by showing this expected vs actual revenues chart with one scale.
This does not eliminate the question about the "principle" behind the estimated revenues, but it makes clear that the challenge is to justify why revenues is a linear function of new customers, and no other variables.
Unlike the dual-axes chart, the actual vs forecasted chart is independent of the forecasting method. One can produce forecasted revenues based on a complicated function of new customers, existing customers, and any other factors. A different model just changes the shape of the forecasted revenues line. We still have two comparable lines on one scale.