One of the consequences of the open data, big data movement is that everyone is an amateur data analyst. To wit, Andrew Gelman earlier linked to a Kaggle competition to forecast coronavirus cases.
This is both unnerving and exciting; it's unclear if the benefits of more eyes outweigh the costs of misinformation. But there is no way to put the genie back in the bottle.
I don't plan on spending much time with the currently available data about the coronavirus cases, mainly because there is too much missing context, such as the amount of testing, what types of people get tested, how cases and deaths are defined (as related to the novel coronavirus), what measures are taken that impact the counts, age and other covariates, etc.
The one analysis I did the other day addresses the narrow question of whether there are signs yet that the containment measures have yielded results in the region of Lombardia. Zooming in to a single region is helpful because the variation due to definitions and policies is limited. The key takeaway from that analysis is that the growth curve of reported cases is not exponential.
This note is written for those attempting to fit exponential growth curves.
***
A typical process for fitting exponential curves is to plot the data with a log y-axis. The supposed benefit of looking at a log plot is that the implied growth rate can be eyeballed from the chart. The hockey-stick curve of case counts looks like a straight line on the log scale, and the slope of the best-fit line gives the growth rate of the exponential model.
Here is what happens when I obtained a "trend line" in Excel after transforming the case counts to a log10 scale.
For this illustration, I used the entire data series (Feb 25 - Mar 19, 2020) to fit the model. Excel reports that this is a tight-fitting model, with R-squared of 98%. [See my previous post for why using the entire data series to fit a model, as I did above, isn't recommended.]
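For readers who want to reproduce this kind of fit outside of Excel, here is a minimal Python sketch of the same procedure. The numbers below are made up for illustration, not the actual Lombardia counts; the point is only that the slope of the log-scale trend line is what implies the daily growth rate, and that the R-squared is computed on the log scale.

```python
# Minimal sketch of the Excel "trend line" fit, using made-up numbers
# (not the actual Lombardia series).
import numpy as np

days = np.arange(24)                              # Feb 25 - Mar 19 spans 24 days
rng = np.random.default_rng(0)
counts = 250 * np.exp(0.18 * days) * rng.lognormal(0, 0.05, size=days.size)

# Least-squares line on the log10 scale: log10(count) = intercept + slope * day
log_counts = np.log10(counts)
slope, intercept = np.polyfit(days, log_counts, 1)

# R-squared, as Excel reports it, is computed on the *log* scale
fitted = intercept + slope * days
ss_res = np.sum((log_counts - fitted) ** 2)
ss_tot = np.sum((log_counts - log_counts.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# The slope implies a daily growth rate for the exponential model
growth_rate = 10 ** slope - 1
print(f"R-squared (log scale): {r_squared:.3f}, implied daily growth: {growth_rate:.1%}")
```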
The measure of goodness of fit (R-squared) comes from aggregating individual daily errors. In the following chart, I highlighted two such errors, the gaps between the model estimate and the actual case count on two selected days. I picked those two days because the lengths of the two red lines are almost identical.
For a normal linear regression fit, it's standard practice to plot these errors (residuals); here, such a plot would look pretty uneventful.
The log scale makes huge errors look small.
It's extremely easy to forget that we are looking at a log plot here. Log transforming the data has the effect of pulling big numbers closer to zero, and pushing small numbers further from zero. If the two red lines were plotted in the original scale, you'd see that the error on March 19 is much, much larger than the error on Feb 26!
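A quick back-of-the-envelope calculation shows the size of the distortion. The fitted values below are hypothetical, chosen only to mimic an early day and a late day on the chart; the two red lines are assumed to have exactly the same length on the log10 scale.

```python
# Two residuals of identical length on the log10 chart, converted back
# to the original scale. All numbers here are hypothetical.
log_gap = 0.10                  # same vertical distance on the log10 axis

early_fit = 300                 # assumed fitted count early on (around Feb 26)
late_fit = 30_000               # assumed fitted count later (around Mar 19)

early_error = early_fit * (10 ** log_gap - 1)   # roughly 78 cases
late_error = late_fit * (10 ** log_gap - 1)     # roughly 7,800 cases

print(early_error, late_error)  # equal gaps on the log chart, 100x apart in counts
```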
This misreading of the error sizes is the same visual misperception that makes any log scale prone to misinterpretation.
***
The above problem is not limited to the phase of model fitting. It is common today to draw two (or more) growth curves in log scale and compare them. This is a typical visualization:
This one from Spanish outlet El Pais (link) was featured by Alberto Cairo (link) recently so I have it handy - but my comment applies to all variants of this chart which can be found everywhere. The El Pais design is distinguished by a side-by-side presentation of the growth curve in log scale and in linear (everyday) scale.
The log scale complicates comparing two growth curves.
When comparing two growth curves, say Spain and Italy, comparing the straight-line slopes is fine (with the caveat that we are then assuming exponential models for both countries).
We get in trouble when we interpret differences in slope over time. Using the above graph, one might say that in the first ten days, the Spanish case counts were lagging behind the Italian counts, but after that, there was a crossover and the Spanish cases grew faster than the Italian cases.
All those words represent our interpretation of the gaps between two growth curves. The optical illusion described in section 1 of this post applies here equally. Gaps to the left side of the time-line are artificially magnified by the log transform while gaps to the right side are artificially compressed.
The further out we go on the time axis, the bigger the compression. You have to train your head to think in log scale to reverse the visual distortion.
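A small numerical example makes the distortion concrete. The counts below are hypothetical, not taken from the El Pais chart: a fixed vertical gap on a log chart is a fixed ratio between the two countries, so the same-looking gap corresponds to very different differences in case counts early versus late in the outbreak.

```python
# On a log chart, the vertical gap between two curves is a ratio, not a
# difference. Hypothetical counts, not taken from the El Pais chart.
import numpy as np

italy_early, spain_early = 400, 200          # early in the outbreak
italy_late, spain_late = 40_000, 20_000      # later in the outbreak

# The gap on the log10 chart looks identical on both days ...
print(np.log10(italy_early) - np.log10(spain_early))   # 0.301
print(np.log10(italy_late) - np.log10(spain_late))     # 0.301

# ... but the gap in actual case counts is 100 times larger later on
print(italy_early - spain_early)   # 200
print(italy_late - spain_late)     # 20000
```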
The reality is that only experts who have been trained to think in log scale can properly interpret this type of chart.
I was surprised you didn't include a discussion of proportional change as opposed to absolute change here. Both types of change can be important. If you are looking to measure the human toll of the virus, the absolute change (e.g. 651 deaths in Italy yesterday) should reasonably be what you want to model most accurately - the difference between 10,000 and 12,000 is more important than the difference between 100 and 120. But if you want to understand the nature of the system, proportional change may be a more useful way of looking at the data. Due to the nature of disease spread, it is reasonable to conceptualize the system as one where exponential growth is expected, and deviation from that exponential growth might reasonably be interpreted as something meaningful - a model/observation mismatch of 20% is similarly important, whether that's 100 vs. 120, or 10,000 vs. 12,000.
Posted by: Bretwood Higman | 03/23/2020 at 02:25 PM
BH: In the last part, you anticipated a post that currently sits in my head. It will probably appear this week or next. When there is a discrepancy between a model and the observed data, the modeler has to make a judgment call: how much of the gap is due to a mis-specified model and how much of it is due to poorly measured data? It can be some of each. Nevertheless, it's important to recognize that the exponential curve is an analytical solution to a theoretical setup so there is some basis for it to be "true".
Posted by: Kaiser | 03/23/2020 at 02:55 PM