Andrew Gelman, Columbia professor, wrote an important post about causal thinking (link) that I highly recommend reading. While he approaches the topic from a researcher's perspective, his framing of the issue is very practical, as I will demonstrate in this post.
Gelman's main point is the contrast between two modes of causal thinking:
- Forward causality is asking the question, if we change X, how does that change Y? This is typically answered by setting up a scientific experiment, varying X and observing Y while keeping other factors constant. When an experiment is not possible, we try to mimic it.
- Reverse causality is asking the question, if Y happens, what factor(s) X may have caused it? This is a very difficult problem to address. Gelman favors an approach where we first generate hypotheses of what X might be, then verify if the hypotheses hold.
In the past, forward has been called "effect of causes" while reverse has been called "causes of effect." Those terms are cute but quite confusing so I prefer the forward/reverse language.
As Gelman explains, many statisticians consider reverse causality an unworthy question: unscientific, indeterminate, unserious. One issue is that even if one or more hypotheses hold up, it is not clear that the real cause isn't still lurking elsewhere. But Gelman is right that a lot of real-world problems are of the reverse-causality type, and it is an embarrassment for us to ignore them.
In this post, I describe two examples of causal reasoning in the real world.
Most business problems are reverse causal. Take for example P&G who spends a huge amount of money on marketing and advertising activities. The money is spread out over many vehicles, such as television, radio, newspaper, supermarket coupons, events, emails, display ads, search engine, etc. The key performance metric is sales amount.
If the sales amount suddenly drops, the executives will want to know what caused the drop. This is the classic reverse-causality question. Of course, lots of possible hypotheses could be generated: the TV ad was not liked, coupons weren't distributed on time, the email campaign suffered a deliverability problem, and so on. By a process of elimination, one can drill down to a small set of plausible causes. This is all complex work that gives approximate answers.
The same question can be posed as a forward causal problem. We now start with a list of treatments. We will divide the country into regions, and vary the so-called marketing mix, that is, the distribution of spend across the many vehicles. This generates variations in the spend patterns by vehicle, which allows us to estimate the effect of each of the constituent vehicles on the overall sales performance.
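To make the forward version concrete, here is a minimal sketch of how the vehicle-level effects could be estimated from such a regional experiment. Everything here is hypothetical: the vehicle names, the number of regions, the spend ranges, and the "true" effects are all simulated for illustration, not taken from any real P&G data.

```python
import numpy as np

rng = np.random.default_rng(0)

vehicles = ["tv", "radio", "coupons", "email", "display"]
n_regions = 50

# Hypothetical experiment: each region receives a different marketing mix,
# i.e. a different allocation of spend across the vehicles.
spend = rng.uniform(0, 100, size=(n_regions, len(vehicles)))

# Simulated "true" per-dollar effects (unknown in practice) plus noise.
true_effect = np.array([0.8, 0.3, 1.2, 0.5, 0.2])
sales = spend @ true_effect + rng.normal(0, 5, size=n_regions)

# Estimate each vehicle's effect on sales by least squares,
# with an intercept column for baseline sales.
X = np.column_stack([np.ones(n_regions), spend])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)

for name, est in zip(vehicles, coef[1:]):
    print(f"{name}: estimated effect per dollar ~ {est:.2f}")
```

Because the spend patterns vary across regions, the regression can separate the vehicles' contributions; with a static, uniform mix the effects would be confounded and unidentifiable.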
Now, consider law enforcement or spying agencies building vast databases. The forward question is asking whether the habit of searching for the word "bomb" in a search engine increases the chance of someone later committing terrorist acts. This predictive question is highly challenging and the problem of false positives is very real.
The reverse problem arises after a terrorist strikes: trying to understand why he or she did it. This question is aided by having a massive database of all interactions, and is made much, much easier by knowing the identity of the terrorist after the fact.
Something else is going on behind the scenes, which might explain why scientists are more interested in the forward question. It seems to me reverse questions are usually not "statistical" in nature; they are questions of trying to explain something that happened. This is the case in the example of the sales drop as well as the example of traceback investigations.
Chapters 2, 4 and 8 of Numbersense explore how data analysts think about causation although causal thinking is everywhere in the book. You can get the book here. Or try your luck in the Book Quiz (link).