*Warning: this post is statistics-**heavy**.*

Science fiction is faction (i.e. fact + fiction) before faction exists. It's taking pieces from science textbooks and mixing in figments of the imagination. That is what came to mind when I read a recent article in Target Marketing magazine.

They started with the business problem: if a customer goes directly to the retailer's website to place an order, the retailer cannot know whether said customer read its catalog. A lot of money is spent creating and mailing glossy catalogs to households. Marketers believe that catalogs drive such "unmatched" Web orders, but how does one prove such an assertion?

Then they offered a solution:

To see the effects of your catalog mailings on online ordering, run a correlation analysis using Microsoft Excel's Data Analysis Toolpak.

Okay, what variables are to be correlated?

You'll need two data sets: order counts by day for the catalog and unaccounted-for Web orders by day for the same period.

Now what?

What results is a modest table with a handful of numbers, the most important of which is the correlation coefficient, a number between zero and one that indicates the degree to which two variables are linearly related.

Just what the textbook ordered, plus bonus points for noting *linear* correlation. The figments of the imagination started creeping in:

To get the real answer to the question: "How much does my catalog drive Web orders?" you must square the correlation coefficient to produce the coefficient of determination -- a measure of the proportion of each other's variability that two variables share.

If, for example, a correlation coefficient of 0.9 say there's a high level of linear relation, squaring the coefficient says that 81 percent of the variability is shared between phone and Web orders. So, in this example, 81 percent of Web orders are directly related to phone orders. And if phone orders are driven by the catalog, so must 81 percent of Web orders.

These two paragraphs are complete nonsense. Allow us to briefly recap key ideas on simple linear regression while we separate fact from fiction.

__Fact 1__: squaring the correlation coefficient produces the coefficient of determination (more commonly called R-squared).

__Fiction 1__: squaring this particular correlation coefficient produces nothing of this sort.

__Takeaway 1__: R-squared measures how well the linear model fits the observed data. A better-fitting model should produce predictions that are more correlated with the observed values. In this case, we want the predicted catalog orders to be close to the actual catalog orders. This correlation is what should be squared, not the correlation between catalog orders and unmatched Web orders.

__Fact 2__: R-squared measures how much of the variability in catalog orders is explained by unmatched Web orders.

__Fiction 2__: R-squared measures the proportion of "each other's variability that two variables share".

__Takeaway 2__: In regression analysis, we distinguish between the response variable (catalog orders) and the predictor (unmatched Web orders). The predictor is used to explain the variability in the response. There is no such thing as "shared variability" between two variables. In correlation analysis, the two variables are put on equal footing. In other words, one cannot start with a correlation analysis and end with a regression output -- only in science fiction.

__Fiction 3__: R-squared allows us to split the sample into the proportion with a direct relationship and the proportion without. In this example, it allows us to conclude that 81% of (unmatched) Web orders are related to phone orders while the remaining 19% are not.

__Takeaway 3__: As noted under Fact 2, R-squared splits the variance in phone orders into two parts. It does not split the orders themselves. R-squared measures the model, not the data.

__Fact 4__: It is important to specify the underlying logical relationships between the variables under study, and every effort must be made to ensure their validity.
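To make Takeaway 3 concrete, here is a minimal Python sketch of the variance split that R-squared describes. The daily order counts are made up purely for illustration (they are not from the article), and the helper names `fit_line` and `variance_split` are hypothetical. The point is only that R-squared is a ratio of sums of squared deviations; nothing in the calculation assigns individual orders to one group or another.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def variance_split(x, y):
    """Return (total, explained, residual) sums of squares for the fitted line."""
    slope, intercept = fit_line(x, y)
    mean_y = sum(y) / len(y)
    fitted = [intercept + slope * xi for xi in x]
    total = sum((yi - mean_y) ** 2 for yi in y)
    explained = sum((fi - mean_y) ** 2 for fi in fitted)
    residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return total, explained, residual


# Made-up daily counts, for illustration only.
web_orders = [12, 15, 11, 20, 18, 25, 22]     # predictor
phone_orders = [30, 34, 29, 45, 38, 52, 47]   # response

total, explained, residual = variance_split(web_orders, phone_orders)
r_squared = explained / total

print(f"total SS     = {total:.2f}")
print(f"explained SS = {explained:.2f}")
print(f"residual SS  = {residual:.2f}")
print(f"R-squared    = {r_squared:.3f}")
```

The explained and residual sums of squares add up to the total, and R-squared is the explained share of that total. It says how well the line fits, not which orders "belong" to the catalog.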

__Fiction 4__: At the end, we learnt the following logic: a) phone orders are highly correlated with catalog orders (since "your phones ring because you mail catalogs") so phone orders are the same as catalog orders. b) unmatched Web orders are highly correlated with phone orders so unmatched Web orders are the same as phone orders. c) Catalogs drive phone orders and so catalogs drive unmatched Web orders.

This mind-bending logic we address in order:

__Takeaway 4a__: They use "phone orders" as a proxy for "catalog orders" since "your phones ring because you mail catalogs". If that is so, then there won't be any Web orders and what's the point of looking for catalogs driving Web orders? Even worse, an order that came in online is an order that did not come through the call center. So what exactly is Excel correlating?

__Takeaway 4b__: Completely unrelated things can have high correlation; a famous example is burglaries and full moons. High correlation certainly does not imply equivalence.

__Takeaway 4c__: Correlations are not usually transitive: I am like Alan because we are both impatient; I am like Alice because we are both talkative; now, Alan is like Alice?
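The point in Takeaway 4c can be checked with a tiny Python sketch. The three series below are contrived for illustration (they are not order data), and `pearson` is a hypothetical helper written out by hand: `a` and `c` are each correlated with `b`, yet they are uncorrelated with each other.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


# Contrived series: b is the sum of two unrelated components, a and c.
a = [1, -1, 1, -1]
c = [1, 1, -1, -1]
b = [ai + ci for ai, ci in zip(a, c)]

r_ab = pearson(a, b)
r_bc = pearson(b, c)
r_ac = pearson(a, c)

print(round(r_ab, 3))  # 0.707
print(round(r_bc, 3))  # 0.707
print(round(r_ac, 3))  # 0.0
```

So "A correlates with B" and "B correlates with C" together tell us very little about how A relates to C.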

In short, this is a great example of "knowing just enough to be dangerous".

Reference: "Making a match", Target Marketing Magazine, March 2008.

you forgot this one...

Fiction: Excel is a great tool for serious data analysis.

(see http://www.practicalstats.com/xlsstats/excelstats.html for an overview)

Posted by: SilentD | Apr 25, 2008 at 10:42 AM

Really great post. Thanks for walking through that. I run into this type of thing quite often when I work with my firm's marketing department. Their analysis often sets off my "faction sniffer" but sometimes it takes some real thinking to really pinpoint what the errors are.

I really appreciate the thoughtful and more statistical nature of this post. Well done.

Posted by: JD Long | May 08, 2008 at 11:34 AM