Andrew points out an error in a new book authored by Daniel Kahneman, Cass Sunstein, et al. The authors said that while correlation does not imply causation, "wherever there is causality, there is correlation." In other words, if X causes Y, then X and Y must be correlated.

I like to say statistics is common sense systematized, but here is an example that shows statistical analysis requires deep thinking.

That statement is true only if there is nothing else in the system but X and Y. But in any real-world system, there are more than two relevant factors.

Since the summer is heating up much of the U.S., let's talk about room temperature. One should expect room temperature (Y) to be positively correlated with outside temperature (X): as it heats up outside, the inside temperature rises.

Not so if we introduce an air conditioner (C) to this system. The air conditioner regulates the room temperature (Y) to some pre-set number, say, 70 F (21 C). In other words, the effect of C is to keep Y around 70F *no matter what the outside temperature (X) is*.

The air conditioner destroys the correlation between X and Y. If we only collect data on outside and inside temperatures, we shall see no correlation! Even though outside temperature clearly affects room temperature, the two data series will be uncorrelated.

To learn the true relationship between X and Y, we must measure the lurking variable (C). If we have all three data series, then a regression will assign the proper effects of X and C on Y.
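The thermostat story can be sketched in a quick simulation. (This is a minimal illustration; the temperatures, coefficients, and noise levels below are made-up numbers, not measurements from any real system.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.uniform(60, 100, n)            # outside temperature (F)
C = X - 70 + rng.normal(0, 1, n)       # AC effort: counteracts heat above 70F
# Inside temperature: pushed up by X, pushed down by C, held near 70F
Y = 70 + 0.8 * (X - 70) - 0.8 * C + rng.normal(0, 0.5, n)

# With only X and Y in hand, the correlation looks like noise
r = np.corrcoef(X, Y)[0, 1]
print(round(r, 3))                     # near zero: C masks the X-Y link

# Throw C into a regression and the true effects reappear
A = np.column_stack([np.ones(n), X, C])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(coef[1], coef[2])                # slope on X near +0.8, on C near -0.8
```

The first printout shows why a two-column data set would mislead: the correlation between X and Y is essentially zero, even though X drives Y by construction. The regression on both X and C recovers the positive effect of outside temperature and the negative effect of the air conditioner.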

***

If you are comfortable with algebra, read Cosma's post here for a formal discussion.

If you're looking for intuition, keep reading.

If we receive a data set with X and Y (outside and inside temperature), we can draw a scatter plot or run a regression, and we'll discover no correlation between X and Y. Using the erroneous thinking in the new book, we'd conclude that X and Y could not have a cause-and-effect relationship since they are not correlated.

And you've just learned that is a wrong conclusion. If we also have data on air conditioning (C), and we throw that into the regression, we will discover the correct relationships, that inside temperature is positively correlated with outside temperature, and negatively correlated with air conditioning.

In practice, whether the analyst draws the right conclusion comes down to whether the analyst knows about C, and whether data on C can be obtained. What separates a good analyst from an average one is the "numbersense" of anticipating what missing data might be useful to shed light on the data you already have.

Also, in any practical system, X and Y are not zero-correlated! Rather, the correlation of X and Y is artificially attenuated because of the lurking variable C. The alarm bell is very likely to ring loudly if the analyst sees zero correlation. If the expected correlation exists, though, that's when the analyst may neglect to consider lurking variables.

P.S. [5-27-2021] A number of correspondents have brought up the issue of linearity, so I'm adding some comments here. The strict statistical definition of "correlation" is linear, which may be different from our folk understanding of the word. The left chart below shows a positive linear correlation between X and Y:

When the correlation is linear, we can fit a straight line through the scatter plot of X and Y. The chart on the right shows a specific example of non-linear correlation: X may be the amount of sunlight and Y the plant growth.

If the standard correlation formula is applied (such as the one included in Excel), you get zero correlation between X and Y on the right chart. Roughly speaking, the two sides of the chart cancel each other out.

The lesson is that zero correlation means no *linear* relationship, not no relationship at all. The same advice is frequently dished out for regression analysis: if the fit of a linear regression is poor, it doesn't follow that there is no relationship between X and Y.

So one can say that if X is a cause of Y, and the effect of X on Y is shaped like an inverted U, then X and Y will show zero correlation (as defined by the linear correlation formula). Nonetheless, this particular issue is present whether or not we impose a causal interpretation on the X-Y relationship.

What I said in the last paragraph of the main post still applies. In real life, we rarely see a perfect inverted U, or perfect other shapes that would result in zero correlation.

"In other words, if X causes Y, then X and Y must be correlated.

...

That statement is true if there is nothing else in the system but X and Y."

This only holds if by correlation you/they mean more generally any relationship or association. If you are using the narrower statistical meaning of linear relationship, then it is trivially falsifiable with a variety of nonlinear relationships, even in the absence of other variables.

Posted by: Mike | 05/27/2021 at 09:59 AM

Mike: Good point. I generally think of "correlation" as admitting any functional relationship revealed in a scatter plot of X and Y, in parallel with thinking "regression" is more general than fitting a straight line. It's a good reminder that the typical correlation printed by software measures linear correlation, just like the basic regression assumes a linear model.

Posted by: Kaiser | 05/27/2021 at 10:25 AM

I have been following your blog for a while now (and also Andrew's) but I have never written a comment... until now. I am doing it to let you know that I am very happy you took the time to explain in simple words what Andrew talked about. I don't have an advanced education in math or stats but I enjoy learning more about those topics. But sometimes it's hard to follow Andrew's posts and the comments there. So your post, written in simple terms with simple and more in-depth explanations, is very welcome. Thanks again.

Posted by: Clur | 05/28/2021 at 07:47 AM

Clur: thank you for leaving the kind note. It's gratifying to hear I'm having the desired effect on my readers.

Posted by: Kaiser | 05/28/2021 at 10:55 AM