## De-noising data

##### Jun 12, 2013

One of the most important steps in analyzing data is to remove noise. First, we have to identify where the noise is; then we find ways to reduce it, which has the effect of surfacing the signal. The labor force participation rate data, discussed here and here, can be decomposed into two components, known as the trend and the residuals. (See right.) The residuals are the raw data minus the trend; in other words, they are what is left of the data after removing the trend.

If the purpose of the analysis is to describe the evolution of the labor force participation rate over time, then the trend is the signal we're after.

My purpose here is the opposite. I want to remove the trend in order to surface correlations that are unrelated to time evolution. Thus, the residuals are where the signal is.

Another way to think about the residuals (bottom chart) is that positive values imply the actual data was above trend while negative values imply the actual data was below trend.
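As a rough illustration of the decomposition described above, the sketch below splits a series into a trend and residuals. Everything in it is an assumption rather than the method used for the charts: the trend is a centered moving average, the names `moving_average_trend` and `decompose` and the 13-point window are hypothetical, and the data are synthetic.

```python
import numpy as np

def moving_average_trend(series, window=13):
    """Estimate a trend with a centered moving average (window assumed odd)."""
    series = np.asarray(series, dtype=float)
    half = window // 2
    # Repeat the edge values so the trend has the same length as the input.
    padded = np.pad(series, half, mode="edge")
    kernel = np.ones(window) / window
    return np.convolve(padded, kernel, mode="valid")

def decompose(series, window=13):
    """Split a series into (trend, residuals), where residuals = raw - trend."""
    series = np.asarray(series, dtype=float)
    trend = moving_average_trend(series, window)
    return trend, series - trend

# Synthetic monthly data (made up for illustration): a slow rise plus noise.
rng = np.random.default_rng(0)
t = np.arange(240)
raw = 60 + 0.02 * t + rng.normal(0, 0.3, size=t.size)

trend, residuals = decompose(raw)
# Positive residuals mark months above trend, negative ones months below.
above_trend = residuals > 0
```

By construction, `trend + residuals` reproduces the raw series, so nothing is lost in the split; the choice of smoother only changes how the data are divided between the two parts.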

***

After decomposing the miles-driven data in the same way, I obtain two sets of residuals. These were plotted in the last post in a scatter plot.
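To make the comparison of two sets of residuals concrete, here is a minimal sketch on synthetic data. It is not the labor force or miles-driven data: the two series are invented so that they share an upward trend but have independent noise, and the `detrend` helper with its moving-average trend is a hypothetical stand-in for whatever trend estimate is used.

```python
import numpy as np

def detrend(series, window=25):
    """Return residuals after subtracting a centered moving-average trend."""
    series = np.asarray(series, dtype=float)
    half = window // 2  # window is assumed to be odd
    padded = np.pad(series, half, mode="edge")
    trend = np.convolve(padded, np.ones(window) / window, mode="valid")
    return series - trend

# Two made-up series: same upward drift, independent noise.
rng = np.random.default_rng(1)
t = np.arange(240)
series_a = 0.05 * t + rng.normal(0, 1.0, size=t.size)
series_b = 0.05 * t + rng.normal(0, 1.0, size=t.size)

# Correlation of the raw series is dominated by the shared trend.
raw_corr = np.corrcoef(series_a, series_b)[0, 1]
# Correlation of the residuals reflects only what is left after detrending.
resid_corr = np.corrcoef(detrend(series_a), detrend(series_b))[0, 1]

print(f"raw correlation:      {raw_corr:.2f}")
print(f"residual correlation: {resid_corr:.2f}")
```

In this constructed case the raw correlation is high and the residual correlation is close to zero, which is the pattern one would expect when the two series are related only through their common drift over time.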

The lack of correlation is also obvious in the plot below. You can see that the periods when one series of residuals went above trend were not well correlated with the periods when the other series was above trend (or below trend).

***

That's a fantastic technique. Does the trend have to be a LOESS or moving average or similar fit, or can it be a log or linear trend?

Which of the two variables supplies the trend, or do they each get a trend (each of which may be different)?

How do you know which part is the trend and which part is the noise?

derek: I de-trended each data series separately. This method, like any other, assumes a model: that there is residual correlation between X and Y that isn't related to time. If I were doing this seriously, I'd look at different methods as well to corroborate the findings. It doesn't matter what method you use to estimate the trend.

anon: not sure what you are asking exactly. For this problem, the model is that there is correlation between X and Y and that we aren't interested in each series' correlation with time, so the correlation with time is noise.

I'm confused by what you're doing here. Aren't you a priori assuming that the trend is being driven by time and there's no causal link? If labour force participation is driving mileage, how would this appear different from the two being time driven, assuming that time is a major factor in labour force participation? To put it another way, let us suppose we have a thing X that is dependent on time and a thing Y that depends on X, so Y(t) = aX(t) + e, where e is error or another thing that influences Y and a is the coefficient linking Y(t) to X(t). Under your method, wouldn't you simply strip out the connection between X and Y and conclude, erroneously, that X and Y are simply time driven and there's no real link between X and Y?

Jack, here's how I see what Kaiser's doing: you see mileage going broadly up with time and participation going broadly up with time, and when you plot them against each other the correlation looks good. But that could be just due to the broad trend. If they weren't just effects of time, you would still see a correlation when you take the big broad trend out.

The fact that the correlation stops looking so good when you take the broad trend out suggests they weren't coupled to each other, so you can reject the hypothesis that the data shows they are (maybe you can't reject the hypothesis that they are, but you can't use this as evidence that they are).
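The model Y(t) = aX(t) + e raised above can be worked through under some added assumptions, none of which come from the post: X is a smooth trend T(t) plus short-term deviations u(t) with variance σ_u² > 0, the error e has variance σ_e² and is uncorrelated with u, and the detrending step recovers the trends T(t) and aT(t) exactly. With r_X and r_Y denoting the two sets of residuals:

```latex
\begin{align*}
X(t) &= T(t) + u(t), \\
Y(t) &= aX(t) + e(t) = aT(t) + a\,u(t) + e(t), \\
r_X(t) &= X(t) - T(t) = u(t), \\
r_Y(t) &= Y(t) - aT(t) = a\,u(t) + e(t), \\
\operatorname{Corr}(r_X, r_Y)
  &= \frac{a\,\sigma_u^2}{\sigma_u \sqrt{a^2\sigma_u^2 + \sigma_e^2}}
   = \frac{a\,\sigma_u}{\sqrt{a^2\sigma_u^2 + \sigma_e^2}}.
\end{align*}
```

Under these assumptions the residual correlation is nonzero whenever a is nonzero, so a direct link of this form would still show up after detrending. It would be hidden only if X barely deviated from its trend, if e swamped a·u, or if the trend estimate were flexible enough to absorb u itself.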

I would suggest that you're not yet out of the woods if you still see a correlation after you've taken the secular trend out. Maybe they're being driven by a periodic cycle in time, such as an annual cycle. If you think that might be the case, you can try to remove the periodic time signal as well, and see if that makes the correlation go away.
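One simple way to try the suggestion above is to subtract the average residual for each position in the cycle, for example each calendar month. The sketch below assumes a 12-month period and uses synthetic, already detrended residuals that share an annual cycle; the function name `remove_seasonal_means` is hypothetical.

```python
import numpy as np

def remove_seasonal_means(residuals, period=12):
    """Subtract the mean residual at each phase of a cycle of length `period`."""
    residuals = np.asarray(residuals, dtype=float)
    phase = np.arange(residuals.size) % period
    # Average of all observations that fall on the same phase (e.g. all Januaries).
    seasonal = np.array([residuals[phase == p].mean() for p in range(period)])
    return residuals - seasonal[phase]

# Made-up detrended residuals: a shared annual cycle plus independent noise.
rng = np.random.default_rng(2)
t = np.arange(240)
cycle = np.sin(2 * np.pi * t / 12)
resid_a = cycle + rng.normal(0, 0.5, size=t.size)
resid_b = cycle + rng.normal(0, 0.5, size=t.size)

corr_before = np.corrcoef(resid_a, resid_b)[0, 1]
corr_after = np.corrcoef(remove_seasonal_means(resid_a),
                         remove_seasonal_means(resid_b))[0, 1]

print(f"with annual cycle:    {corr_before:.2f}")
print(f"annual cycle removed: {corr_after:.2f}")
```

In this constructed example the residuals look correlated only because both follow the same annual cycle, and the correlation largely disappears once that cycle is taken out.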

I think this is related to what statisticians do when they adjust participation to take out the annual cycle of employment (I can't remember the correct phrase). It's just that we're doing it to test whether two variables are correlated, by eliminating time, instead of doing it to test whether one variable is changing in secular time, by eliminating periodic time.

So, the logic here is:

- If two series are actually correlated, their residuals should be correlated.

- If only the two trends are correlated but not the residuals, we can assume that the series are both correlated to the same set of factors, but not to each other.

Is it?

@jack: "Aren't you a priori assuming that the trend is being driven by time and there's no causal link?"

Not at all.

In fact, the hypothesis being tested in this method is exactly the opposite: that there *is* a correlation independent of the aspect of time.

But most importantly, the analysis is done to find out what the relationship is, not to reinforce an assumption: if time were not the significant factor, you would see the correlation in the plot above.

While I applaud the sentiment here, it's worth noting that things can be more complicated than this, and that there are circumstances where lack of correlation between the detrended residuals can be misleading too. Although it's not likely to be relevant in this specific instance, looking just at detrended residuals can cause you to miss a relationship between non-stationary variables that are integrated of order one. (There's a decent intuitive explanation of what that means here.)

conchis: Thanks for the article. I agree that there is no simple rule that applies to all cases.
