The hidden bad assumption behind most dual-axis time-series charts

DC sent me the following chart over Twitter. It supposedly showcases one sector that has bucked the economic collapse, and has conversely been boosted by the stay-at-home orders around the world.

Covid19-pornhubtraffic


At first glance, I was drawn to the yellow line and the axis title on the right side. I understood the line to depict the growth rate in traffic "vs a normal day". The trend is clear as day. Since March 10 or so, the website has become more popular by the week.

For a moment, I thought the thin black line was a trendline that fits the rather ragged traffic growth data. But looking at the last few data points, I was afraid it was a glove that didn't fit. That's when I realized this is a dual-axis chart. The black line shows the worldwide total Covid-19 cases, with the axis shown on the left side.

As with any dual-axis chart, you can change the relationship between the two scales to paint a different picture.

This next chart says that the site traffic growth lagged Covid-19 growth until around March 14.

Junkcharts_ph_dualaxis1

This one gives an ambiguous picture. One can't really say there is a strong correlation between the two time series.

Junkcharts_ph_dualaxis2
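
To see why the axis ranges matter so much, here is a minimal Python sketch of how a value gets its vertical position on a chart. The numbers and the function name `to_plot_height` are made up for illustration; they are not the data behind the charts above.

```python
def to_plot_height(values, axis_min, axis_max):
    """Map data values to a 0-1 vertical position for a given axis range."""
    span = axis_max - axis_min
    return [(v - axis_min) / span for v in values]

# Hypothetical numbers, for illustration only.
cases = [100, 200, 400, 800]     # left axis: counts
traffic_growth = [2, 4, 6, 8]    # right axis: percent

cases_pos = to_plot_height(cases, 0, 800)

# Same traffic data, two different choices of right-axis range.
tight = to_plot_height(traffic_growth, 0, 8)    # line climbs to the top of the plot
loose = to_plot_height(traffic_growth, 0, 32)   # line stays in the bottom quarter

print(cases_pos)
print(tight)
print(loose)
```

The data never change. Only the right-axis range does, and with it whether the traffic line appears to lead, track or trail the cases line.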

***

Now, let's look at the chart from the DATA corner of the Trifecta Checkup (link). The analyst selected definitions that are as far apart as possible. So this chart gives a good case study of the intricacy of data definitions.

First, notice the smoothness of the line of Covid-19 cases. This data series is naturally "smoothed" because it is an aggregate of country-level counts, which themselves are aggregates of regional counts.

By contrast, the line of traffic growth rates has not been smoothed. That's why we see sharp ups and downs. This series should be smoothed as well.

Junkcharts_ph_smoothedtrafficgrowth

The seven-day moving average line indicates a steady growth in traffic. The day-to-day fluctuations represent noise that distracts us from seeing the trendline.
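
A seven-day moving average takes only a few lines of Python. The sketch below uses a trailing window, which is an assumption (a centered window is another common choice), and the daily numbers are invented for illustration.

```python
def moving_average(values, window=7):
    """Trailing moving average; the first window-1 days have no value."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out

# Hypothetical daily traffic growth, in percent.
daily_growth = [5, 9, 4, 10, 6, 12, 3, 12, 9, 11]
smoothed = moving_average(daily_growth)
print(smoothed)
```

The raw series jumps between 3 and 12, while the smoothed values move gently from 7 to 9.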

Second, the Covid-19 series is a cumulative count, which means it can only head upward over time (on rare days, it may go flat, but it never decreases). The traffic series represents change, is not cumulative, and so can go up or down over time. To make the two series comparable, the cumulative Covid-19 counts can be converted into daily new cases, which also measure change.

Junkcharts_ph_smoothedcovidnewcases
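
Converting a cumulative series into new cases is just differencing consecutive days. A minimal sketch, with made-up counts:

```python
def new_cases(cumulative):
    """Daily new cases from a cumulative series (one fewer value than the input)."""
    return [today - yesterday
            for yesterday, today in zip(cumulative, cumulative[1:])]

# Hypothetical cumulative counts, for illustration only.
cumulative = [100, 120, 150, 150, 210]
daily = new_cases(cumulative)
print(daily)
```

The flat day in the cumulative series shows up as a zero in the new-cases series.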

Third, the traffic series consists of growth rates, expressed as percentages, while the Covid-19 series consists of counts. It is possible to turn Covid-19 counts into growth rates as well. Like this:

Junkcharts_ph_smoothedcovidcasegrowth
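
A day-on-day growth rate can be sketched as below. The counts are invented, and applying the function to the cumulative series (rather than to new cases) is an assumption made only for this example.

```python
def day_on_day_growth(series):
    """Percent change from the previous day; None if the previous value is 0."""
    out = []
    for prev, cur in zip(series, series[1:]):
        out.append(None if prev == 0 else 100.0 * (cur - prev) / prev)
    return out

# Hypothetical counts, for illustration only.
counts = [100, 120, 150, 150, 210]
growth = day_on_day_growth(counts)
print(growth)
```

Once both series are in percent, they share a unit and can sit on a single axis.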

By standardizing the units of measurement, both time series can be plotted on the same axis. Here is the new plot:

Redo_junkcharts_ph_trafficgrowthcasegrowth

Fourth, the two growth rates have different reference levels. The Covid-19 growth rate I computed is day-on-day growth. This is appropriate since we don't presume there is a seasonal effect - it doesn't seem plausible that, say, new cases on Mondays are typically larger than new cases on Tuesdays.

Thanks to this helpful explainer (link), I learned what the data analyst meant by a "normal day". The growth rate of traffic is not day-on-day change. It is the change in traffic relative to the average traffic in the last four weeks on the same day of week. If it's a Monday, the change in traffic is relative to the average traffic of the last four Mondays.

This type of seasonal adjustment is used if there is a strong day-of-week effect. For example, if the website reliably gets higher traffic during weekends than weekdays, then the Saturday traffic may always exceed the Friday traffic; instead of comparing Saturday to the day before, we index Saturday to previous Saturdays, Friday to previous Fridays, and then compare those two indexed values.
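
Here is a rough sketch of the "normal day" comparison as described above: today's traffic versus the average of the same day of week over the previous four weeks. The function name, the traffic numbers and the handling of the first four weeks are my assumptions, not details taken from the explainer.

```python
def growth_vs_normal_day(traffic, weeks=4):
    """Percent change vs the mean of the same weekday over the prior `weeks` weeks."""
    out = []
    for i, today in enumerate(traffic):
        if i < 7 * weeks:
            out.append(None)  # not enough history for a baseline
            continue
        baseline = sum(traffic[i - 7 * k] for k in range(1, weeks + 1)) / weeks
        out.append(100.0 * (today - baseline) / baseline)
    return out

# Hypothetical traffic: five weekdays at 100 and two weekend days at 150,
# repeated for four weeks, then a fifth week that is 10% higher every day.
traffic = ([100] * 5 + [150] * 2) * 4 + [110] * 5 + [165] * 2
vs_normal = growth_vs_normal_day(traffic)
print(vs_normal[28:])
```

In the fifth week, every day shows 10% growth against its own weekday baseline. A day-on-day comparison would instead show a 50% jump from Friday (110) to Saturday (165), which is the weekend effect and not real growth.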

***

Let's consider the last chart above, the one where I got rid of the dual axes.

A major problem with trying to establish a correlation between two time series is time lag. Most charts like this make a critical and unspoken assumption - that the effect of X on Y is immediate. This chart assumes that the higher the number of Covid-19 cases, the more people stay home that day, and the more people swarm the site that day. Said that way, you might see it's ridiculous.

What is true of any correlation in the wild is that there is always some amount of time lag. It is usually hard to know how much.
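
One way to probe for lag is to shift one series against the other and recompute the correlation at each shift. The sketch below uses invented data in which `y` is simply `x` delayed by two days; the function names are hypothetical, and real series will not be this clean.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def lagged_correlation(x, y, lag):
    """Correlate x[t] with y[t + lag], for lag >= 0."""
    if lag == 0:
        return pearson(x, y)
    return pearson(x[:-lag], y[lag:])

# Hypothetical data: y repeats x two days later.
x = [1, 3, 2, 5, 4, 6, 3, 7, 5, 8]
y = [0, 0] + x[:-2]

r0 = lagged_correlation(x, y, 0)
r2 = lagged_correlation(x, y, 2)
print(r0, r2)
```

The same-day correlation understates a relationship that is perfect at a two-day lag. With real data, scanning many lags and keeping the best one invites spurious findings, so the choice of lag should be justified by how the effect is supposed to work.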

***

Finally, the chart omitted a huge factor driving the growth in traffic. At various times, depending on the country, the website rolled out a free premium service offer. This is the primary reason for the spike around mid-March. How much of the traffic growth is due to the popular marketing campaign, and how much is due to stay-at-home orders - that's the real question.
