Dampened by Google
De-noising data

Once more, superimposing time series creates silly theories

After I wrote the post about superimposing two time series to generate fake correlations, there was a lively discussion in the comments about whether a scatter plot would have done better. Here is the promised follow-up post.

The contentious issue is that X and Y might appear correlated but in fact, what we are observing is that both data series are strongly correlated with time (e.g. population almost always grows with time), and X and Y may not be correlated with each other.

Indeed, the first thing a statistician would do when encountering two data series is to create a scatter plot. Economists, by contrast, seem to prefer two line charts, superimposed.

The reason for looking at the scatter plot is to remove the time component. If X and Y are correlated systematically (and not individually with the time component), then even if we disturb the temporal order, we should still be able to see that correlation. If the correlation goes away in an x-y plot, then we know that the two variables are not correlated, and that the superimposed line charts created an illusion.

Redo_milesdriven_1The catch is that the scatter plot analysis is necessary but not sufficient. In many cases, we will find strong correlation in the scatter plot. But that does not prove there is X-Y correlation beyond each data series being correlated with time. By plotting X and Y and ignoring time, we introduce time as an omitted variable, which can still be controlling both X and Y series.

The scatter plot (right) shows the per capita miles driven against the civilian labor force participation rate. Having hidden the time dimension, we still see a very strong correlation between the two data series.

This is because time is still the invisible hand. Time is running from left to right on the chart still. This pattern is visible if we have line segments connecting the data in temporal order, as in the chart below.




One solution to this problem is to de-trend the data. We want to remove the effect of time from each of the two data series individually, then we plot the residual signals against each other.

Redo_milesdriven_3Here is the result (right). We now have a random scatter of points that average about zero. If anything, there may be a slightly negative correlation, meaning that when the labor force participation rate is above trend, the per-capita miles driven tend to be slightly below trend; this effect if it exists is small.

What I have done here is to establish the trend for each of the two time series. The actual data being plotted is what is above/below trend. What this chart is saying is that when one value is above trend, it gives us little information about whether the other value is above or below trend.




Feed You can follow this conversation by subscribing to the comment feed for this post.


Can you expand on what you did to the data for the final example?


sure, I'll put up more details tomorrow.

The comments to this entry are closed.