## Two lines dropping

##### Nov 01, 2011

Reader Ron D. was not pleased to see this dual-axis chart purporting to show a cause-effect relationship between the decline in union membership and the drop in the proportion of income earned by middle-class households (defined as the middle 60% of households). Click here to read the original article. The article credits CAP's David Madland and Karla Waters for this chart.

Using dual axes is a well-tested way of creating correlation where there may be none. Playing with the scales will do that for you. I wrote about this issue here.

However, the correlation in this data cannot be denied, as the scatter plot below shows. Note that the scatter plot is much better at revealing correlational patterns than a chart with multiple time-series lines. (Here's an example of two lines that display a spurious correlation.)

If one were to ask for a linear regression line, one would obtain a very high R-squared indeed (over 0.9). The problem is with the interpretation of this correlation. Any two data series that trend steadily with time will be highly correlated with each other, just because each series is highly correlated with time. Despite what you might believe after reading Freakonomics, regression -- especially in social science data -- cannot prove causation.

The writers at Think Progress show no such restraint, from the title "The American middle class was built by unions and will decline without them." to the sentences "these assaults have successfully decreased union membership over time... this has had a detrimental effect on the American middle class."

Note: these statements may in fact be true; I'm just pointing out that the chart does not buttress the assertions.
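The point about time trends is easy to check with a small simulation. Below is a minimal Python sketch, assuming two made-up series that are generated independently of each other and share nothing except a steady downward drift. The starting values, slopes and noise levels are arbitrary choices for illustration; they are not the union or income data in the chart.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def trending_series(rng, start, slope, noise_sd, n_years):
    """A straight-line time trend plus independent Gaussian noise."""
    return [start + slope * t + rng.gauss(0, noise_sd) for t in range(n_years)]


rng = random.Random(0)  # fixed seed so the run is repeatable
n_years = 40            # hypothetical number of annual observations
years = list(range(n_years))

# Two series drawn independently: neither one influences the other.
series_a = trending_series(rng, start=30.0, slope=-0.40, noise_sd=0.5, n_years=n_years)
series_b = trending_series(rng, start=50.0, slope=-0.15, noise_sd=0.2, n_years=n_years)

# For a simple linear regression, R-squared is the squared Pearson correlation.
r_squared_levels = pearson_r(series_a, series_b) ** 2
r_a_with_time = pearson_r(years, series_a)
r_b_with_time = pearson_r(years, series_b)

print(f"R-squared, series A vs series B: {r_squared_levels:.3f}")
print(f"Correlation of A with time:      {r_a_with_time:.3f}")
print(f"Correlation of B with time:      {r_b_with_time:.3f}")
```

Because each series is almost a straight line in time, the R-squared between them comes out high even though, by construction, there is no causal link at all.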

***

It's often hard to elevate a correlation to a causal effect. We have to try different tests. One such test for this data set is: if a change in union membership causes a change in middle-class incomes, then we'd expect the annual changes of one to be correlated with the annual changes of the other (at least in direction, better in magnitude).

So, in a year in which union membership declined a lot, one should expect to see middle-class incomes also drop substantially.

The next scatter plot contrasting these annual differences suggests that causation is probably absent. At this smaller time scale, one just doesn't see any correlation at all. Annual declines in the proportion of union membership have been around 2-4% for most of this period, but shifts in middle-class incomes have ranged widely in direction as well as magnitude.
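The annual-change test described above can be sketched in a few lines of Python. This is an illustration on simulated numbers, not the chart's data. It assumes two hypothetical scenarios: a pair of unrelated series that merely share a downward trend, and a pair in which the second series is built directly from the first (so a same-year link exists by construction). All parameter values are arbitrary.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def annual_changes(series):
    """Year-over-year differences: value in year t+1 minus value in year t."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


rng = random.Random(42)  # fixed seed so the run is repeatable
n_years = 40             # hypothetical number of annual observations

# Scenario 1: two unrelated series that only share a downward time trend.
unrelated_a = [30.0 - 0.40 * t + rng.gauss(0, 0.5) for t in range(n_years)]
unrelated_b = [50.0 - 0.15 * t + rng.gauss(0, 0.2) for t in range(n_years)]

# Scenario 2: the follower is computed from the driver in the same year,
# so year-to-year moves in the driver carry over to the follower.
driver = [30.0]
for _ in range(n_years - 1):
    driver.append(driver[-1] - 0.40 + rng.gauss(0, 0.3))
follower = [35.0 + 0.5 * d + rng.gauss(0, 0.05) for d in driver]

results = {}
for label, a, b in [("unrelated", unrelated_a, unrelated_b),
                    ("linked", driver, follower)]:
    r_levels = pearson_r(a, b)
    r_changes = pearson_r(annual_changes(a), annual_changes(b))
    results[label] = (r_levels, r_changes)
    print(f"{label:>9}: r of levels = {r_levels:.2f}, "
          f"r of annual changes = {r_changes:.2f}")
```

Both scenarios produce a strong correlation in the levels, so the levels alone cannot tell them apart. Only the linked pair should keep a clear correlation once the trend is differenced away. This sketch covers only a same-year link; a lagged effect would need a different construction.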

P.S.  Andrew suggested connecting the lines. Here are the charts with the lines:

What appears to be a very strong correlation on the left chart does not look that well-coordinated on the right chart!  (The lines connect the dots in chronological order.)


"So, in a year in which union membership declined a lot, one should expect to see middle-class incomes also drop substantially."

I wouldn't expect to see that at all. The connection wouldn't be nearly as intimate as a year.

Derek: that's why I said it is one test. You're welcome to test for multi-year effects and let us know what you find. One has to be cautious about doing too many tests (due to multiple comparisons) plus one should have a plausible theory for why the effect takes 3 years not 4 or 6.
Further, if we establish that there is a 3-year lag in (auto)correlation, what does that tell us? To me, that is a weak case to argue for causality. The longer one has to wait to see correlation, the more time there is for all kinds of other factors to influence the target variable.

Does the lagged test actually demonstrate causation? I can see how it would demonstrate the causal direction (if there were one). But I don't see that this test rules out the existence of a third, hidden factor that causes union membership to decline first, then incomes at a lag. Say, for instance, decline of the manufacturing sector, or decline in manufacturing employment.

Gary: No it can't prove causation. The idea is you want testable hypotheses to come out of the *assumption* of causality. One test or even a bunch of tests cannot be conclusive but if several such hypotheses fit the data, then we have stronger belief in the assumed causality.
In much the same way, if the first scatter plot above were to show little correlation, most people would conclude there cannot be causality but if one nitpicks, one can argue it only reduces the chance of causality.
The point I'm making is that the people behind the report have no right to make causal statements based on the chart of two lines.
Your standard of "ruling out the existence of causation" is much too stringent for any kind of statistical analysis.

I believe, if one were to play with the scales, that one could show the drop in percentage of children born to married mothers is causing the decline in middle class income, using the same set of fancy statistics and complete lack of science to justify causality.

http://www.heritage.org/research/reports/2010/09/marriage-america-s-greatest-weapon-against-child-poverty

Kaiser:

Just a small comment: I think the scatterplot would be improved if you connect the dots sequentially with light lines and then label the start and end points. This doesn't always work but given the steady trend, I think it will work well in this example.

I have to disagree with you entirely here on the use of the annual change. Income is a complicated thing and no one would suggest that it's entirely dependent on one factor, nor would you expect drops in union membership to result in income changes within a year every year.

Since there is an obvious relationship between unions and salary, and that relationship ought to be in exactly the direction shown in the data, this looks like a pretty good argument for declining union membership playing a role in declining middle-class income.

I get your point about two things potentially being more correlated with time than with each other. But I would argue that your analysis also has a major flaw: it assumes that a change in union membership should produce a change in income in the same year. What is that assumption based on? Your gut? Well, that's no better than the other graph. In fact, it very likely could be worse.

John and Ryan: It appears that you haven't read the comments above and my response to them. The point of the blog is to show how serious analysts should seriously interrogate their data and avoid making causal statements based on a simple x-y correlation plot. This post is not a scholarly article proving or disproving the relationship. If you think the change in union membership takes X years to have an effect, you can replicate my analysis looking at the data with X years of lag. I encourage you to do that work and post your results here.

Generally, if you look at lags of 1, 2, 3, 4, 5,...,infinity years, it is not unusual to be able to find one lag that will "prove" a preconceived notion of correlation. Thus, a solid theory is needed to support any such analysis. Therefore, I also encourage you to post your theory as to how many years it takes for the effect to take hold, and why that many years are needed, and not more or less.
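The danger of scanning many lags can be illustrated with a rough simulation. The Python sketch below assumes two series of pure, independent noise (so no lag is "real"), tries lags of 0 through 10 years, and counts how often at least one lag clears a crude significance cutoff. The cutoff of 2 divided by the square root of the number of overlapping pairs is a rule-of-thumb approximation to a 5% test, and the series length, lag range and trial count are arbitrary.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def lagged_r(xs, ys, lag):
    """Correlation between xs in year t and ys in year t + lag."""
    if lag == 0:
        return pearson_r(xs, ys)
    return pearson_r(xs[:-lag], ys[lag:])


def any_lag_clears_cutoff(xs, ys, lags):
    """True if at least one lag gives |r| above the rough 5% cutoff."""
    for lag in lags:
        n_pairs = len(xs) - lag
        if abs(lagged_r(xs, ys, lag)) > 2 / n_pairs ** 0.5:
            return True
    return False


rng = random.Random(1)  # fixed seed so the run is repeatable
n_years = 40            # hypothetical length of each series
n_trials = 500          # number of simulated pairs of series
lags = range(0, 11)     # lags of 0 through 10 years

hits_lag0 = 0
hits_any_lag = 0
for _ in range(n_trials):
    # Independent noise: there is no true relationship at any lag.
    xs = [rng.gauss(0, 1) for _ in range(n_years)]
    ys = [rng.gauss(0, 1) for _ in range(n_years)]
    hits_lag0 += any_lag_clears_cutoff(xs, ys, [0])
    hits_any_lag += any_lag_clears_cutoff(xs, ys, lags)

rate_lag0 = hits_lag0 / n_trials
rate_any_lag = hits_any_lag / n_trials
print(f"Share of trials 'significant' at lag 0 only: {rate_lag0:.2f}")
print(f"Share 'significant' at some lag from 0 to 10: {rate_any_lag:.2f}")
```

Testing a single pre-specified lag keeps the false-alarm rate near the nominal level, while picking the best of eleven lags raises it substantially. That is the multiple-comparisons problem, and it is why the lag should come from a theory stated in advance rather than from the search itself.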

What an excellent dissection of this correlation! I'm curious as to how this correlation can be proved or falsified. For example, what if it can be shown that countries where unionization hasn't been battered the way it has in the US haven't seen the same rise in income disparity between average workers and the wealthy? I know correlation is not causation, but what if one can draw multiple cross correlations in such a manner? Does that strengthen the case?