One of the consequences of the open-data and big-data movements is that everyone is an amateur data analyst. To wit, Andrew Gelman earlier linked to a Kaggle competition to forecast coronavirus cases.

This is both unnerving and exciting; it's unclear if the benefits of more eyes outweigh the costs of misinformation. But there isn't a way to push the genie back in the bottle.

I don't plan on spending much time with the currently available data about the coronavirus cases, mainly because there is too much missing context, such as the amount of testing, what types of people get tested, how cases and deaths are defined (as related to the novel coronavirus), what measures are taken that impact the counts, age and other covariates, etc.

The one analysis I did the other day addresses the narrow question of whether there are signs yet that the containment measures have yielded results in the region of Lombardia. Zooming in to a single region is helpful as the variation due to definitions and policies is limited. **The key takeaway from that analysis is that the growth curve of reported cases is not exponential.**

This note is written for those attempting to fit exponential growth curves.

***

A typical process of fitting exponential curves is to **plot the data with a log y-axis**. The supposed benefit of looking at a log plot is that the implied growth rate can be eyeballed from this chart. The hockey-stick curve of case counts looks like a straight line in the log scale. The slope of this best-fit line leads to the growth rate of the exponential model.
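To make the mechanics concrete, here is a minimal sketch of that log-linear fit in Python with NumPy. The series is synthetic (100 cases growing at exactly 20% per day), not the Lombardia data, and the helper name `fit_log_linear` is my own invention for illustration. It should behave much like an Excel trend line drawn on log10-transformed counts, though I have not compared the two directly.

```python
import numpy as np

def fit_log_linear(counts):
    """Least-squares fit of log10(count) = a + b * day.

    Returns the intercept a, the slope b, and the R-squared
    computed on the log10 scale (not on the original counts).
    """
    counts = np.asarray(counts, dtype=float)
    days = np.arange(len(counts))
    y = np.log10(counts)
    b, a = np.polyfit(days, y, 1)          # polyfit returns the slope first
    fitted = a + b * days
    ss_res = np.sum((y - fitted) ** 2)     # residuals measured in log units
    ss_tot = np.sum((y - y.mean()) ** 2)
    return a, b, 1 - ss_res / ss_tot

# Synthetic series (an assumption for illustration): 20% growth per day.
synthetic = 100 * 1.2 ** np.arange(15)
a, b, r2 = fit_log_linear(synthetic)

# A slope of b on the log10 scale implies a daily growth rate of 10**b - 1.
daily_growth = 10 ** b - 1
print(f"slope={b:.4f}, implied daily growth={daily_growth:.1%}, R^2={r2:.4f}")
```

Note that the R-squared here, like the one Excel reports for the transformed data, summarizes errors measured in log units, not in cases.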

Here is what happened when I obtained a "trend line" in Excel after transforming the case counts to a log10 scale.

For this illustration, I used the entire data series (Feb 25 - Mar 19, 2020) to fit the model. Excel reports that this is a tight-fitting model, with R-squared of 98%. [See my previous post for why using the entire data series to fit a model, as I did above, isn't recommended.]

The measure of goodness of fit (R-squared) comes from aggregating individual daily errors. In the following chart, I highlighted two such errors, the gaps between the model estimate and the actual case count on two selected days. I picked those two days because the lengths of the two red lines are almost identical.

For an ordinary linear regression fit, the standard practice is to plot these errors; looking at such a plot here, one would conclude that they are pretty uneventful.

The log scale makes huge errors look small.

It's extremely easy to forget that we are looking at a log plot here. Log transforming the data has the effect of pulling big numbers closer to zero, and pushing small numbers further from zero. If the two red lines were plotted in the original scale, you would see that the error on March 19 is much, much larger than the error on Feb 26!
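A quick back-of-envelope calculation shows how large the distortion can be. In the sketch below, the two case counts (300 and 20,000) and the residual of 0.1 log10 units are made-up round numbers, not values read off the chart above; the helper `absolute_error` is likewise hypothetical.

```python
def absolute_error(actual, log10_residual):
    """Convert a residual on the log10 scale into an error in cases.

    The residual is log10(actual) - log10(model), so the model
    estimate is actual / 10**residual.
    """
    model = actual / 10 ** log10_residual
    return actual - model

# Two red lines of identical length on the log plot (hypothetical values).
residual = 0.1
early = absolute_error(300, residual)      # early in the series: roughly 62 cases
late = absolute_error(20_000, residual)    # late in the series: roughly 4,100 cases

print(f"early error: {early:.0f} cases, late error: {late:.0f} cases")
print(f"ratio: {late / early:.1f}x")
```

The two errors are the same length on the log plot, yet in the original scale the later one is larger by the ratio of the two case counts.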

This misreading of the error sizes is the same visual misperception that makes any log scale prone to misinterpretation.

***

The above problem is not limited to the phase of model fitting. It is common today to draw two (or more) growth curves in log scale and compare them. This is a typical visualization:

This one from Spanish outlet El Pais (link) was featured by Alberto Cairo (link) recently so I have it handy - but my comment applies to all variants of this chart which can be found *everywhere*. The El Pais design is distinguished by a side-by-side presentation of the growth curve in log scale and in linear (everyday) scale.

The log scale complicates comparing two growth curves.

When comparing two growth curves, say Spain and Italy, comparing the straight-line slopes is fine (with the caveat that we are then assuming exponential models for both countries).

We get in trouble when we interpret differences in slope over time. Using the above graph, one might say that in the first ten days, the Spanish case counts were lagging behind the Italians but after that, there was a cross-over and the Spanish cases grew faster than the Italian cases.

All those words represent our interpretation of the **gaps between two growth curves**. The optical illusion described in section 1 of this post applies here equally. Gaps to the left side of the time-line are artificially magnified by the log transform while gaps to the right side are artificially compressed.

The further out you go on the time axis, the bigger the compression. You have to train your head to think in log scale to reverse the visual distortion.
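The same arithmetic can be sketched for a pair of curves. The two series below are hypothetical (they are not the Spanish or Italian data): both grow 30% per day, and one is always 1.5 times the other, so on a log plot they appear as parallel lines with a constant gap.

```python
import numpy as np

# Hypothetical curves: same growth rate, constant ratio of 1.5.
days = np.arange(0, 21, 5)
curve_a = 100 * 1.3 ** days
curve_b = 150 * 1.3 ** days

log_gap = np.log10(curve_b) - np.log10(curve_a)   # what the eye sees on a log plot
abs_gap = curve_b - curve_a                        # the gap in actual cases

for d, lg, ag in zip(days, log_gap, abs_gap):
    print(f"day {d:2d}: gap on log10 scale = {lg:.3f}, gap in cases = {ag:,.0f}")
```

The gap on the log scale never changes, while the gap in cases starts at 50 and keeps growing; the visual gap only tells you about the ratio between the curves, never the difference.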

**The reality is that only experts who have been trained to think in log scale can properly interpret this type of chart.**
