Apr 23, 2008

Statistical science fiction

Warning: this post is statistics-heavy.

Science fiction is faction (i.e. fact + fiction) before faction exists.  It's taking pieces from science textbooks and mixing in figments of the imagination.  That is what I had in mind when reading a recent article in Target Marketing magazine.

They started with the business problem: if a customer goes directly to the retailer's website to make an order, the retailer could not know if said customer read its catalog or not.  A lot of money is spent creating and mailing glossy catalogs to households.  Marketers believe that catalogs drive such "unmatched" Web orders but how does one prove such an assertion?

Then they offered a solution:

To see the effects of your catalog mailings on online ordering, run a correlation analysis using Microsoft Excel's Data Analysis Toolpak.

Okay, what variables are to be correlated?

You'll need two data sets: order counts by day for the catalog and unaccounted-for Web orders by day for the same period.

Now what?

What results is a modest table with a handful of numbers, the most important of which is the correlation coefficient, a number between zero and one that indicates the degree to which two variables are linearly related.

Just what the textbook ordered, plus bonus points for noting linear correlation.  The figments of the imagination started creeping in:

To get the real answer to the question: "How much does my catalog drive Web orders?" you must square the correlation coefficient to produce the coefficient of determination -- a measure of the proportion of each other's variability that two variables share.

If, for example, a correlation coefficient of 0.9 say there's a high level of linear relation, squaring the coefficient says that 81 percent of the variability is shared between phone and Web orders.  So, in this example, 81 percent of Web orders are directly related to phone orders.  And if phone orders are driven by the catalog, so must 81 percent of Web orders.

These two paragraphs are complete nonsense.  Allow us to briefly recap key ideas on simple linear regression while we separate fact from fiction.

Fact 1: squaring the correlation coefficient produces the coefficient of determination (more commonly called r-squared).

Fiction 1: squaring this particular correlation coefficient produces nothing of this sort.

Takeaway 1: R-squared measures how well the linear model fits the observed data.  A better-fitting model should produce predictions that are more correlated with the observed values.  In this case, we want the predicted catalog orders to be close to the actual catalog orders.  This correlation is what should be squared, not the correlation between catalog orders and unmatched Web orders.


Fact 2: R-squared measures how much of the variability in catalog orders is explained by unmatched Web orders.

Fiction 2: R-squared measures the proportion of "each other's variability that two variables share".

Takeaway 2: In regression analysis, we distinguish between the response variable (catalog orders) and the predictor (unmatched Web orders).  The predictor is used to explain the variability in the response.  There is no such thing as "shared variability" between two variables.  In correlation analysis, the two variables are put on equal footing.   In other words, one cannot start with a correlation analysis and end with a regression output -- only in science fiction.

Fiction 3:  R-squared allows us to split the sample into the proportion with a direct relationship and the proportion without.  In this example, it allows us to conclude that 81% of (unmatched) Web orders are related to phone orders while the remaining 19% are not.

Takeaway 3: As noted under Fact 2, R-squared splits the variance in phone orders into two parts.  It does not split the orders themselves.  R-squared measures the model, not the data.
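To make Takeaways 1 and 3 concrete, here is a minimal sketch in Python.  The daily counts are made up for illustration (they are not from the magazine article), and the helper names are my own.  It fits a least-squares line with unmatched Web orders as the predictor and catalog orders as the response, then checks that R-squared is a share of the response's variance, and that it equals the squared correlation between predicted and actual values.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - slope * mx, slope


# Hypothetical daily order counts, for illustration only.
web_orders = [12, 15, 9, 20, 17, 11, 14]        # predictor
catalog_orders = [30, 41, 25, 48, 39, 33, 35]   # response

intercept, slope = fit_line(web_orders, catalog_orders)
predicted = [intercept + slope * x for x in web_orders]

# Split the variance of the response (not the orders themselves).
y_bar = mean(catalog_orders)
ss_total = sum((y - y_bar) ** 2 for y in catalog_orders)
ss_resid = sum((y - p) ** 2 for y, p in zip(catalog_orders, predicted))
ss_model = sum((p - y_bar) ** 2 for p in predicted)

r_squared = 1 - ss_resid / ss_total
print(f"R-squared: {r_squared:.3f}")
print(f"corr(predicted, actual) squared: {pearson(predicted, catalog_orders) ** 2:.3f}")
```

Nothing in this calculation assigns individual orders to one bucket or another; the two pieces that add up are sums of squared deviations.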

Fact 4: It is important to specify the underlying logical relationships between variables under study, and every effort must be made to ensure its validity.

Fiction 4: At the end, we learnt the following logic: a) phone orders are highly correlated with catalog orders (since "your phones ring because you mail catalogs") so phone orders are the same as catalog orders.  b) unmatched Web orders are highly correlated with phone orders so unmatched Web orders are the same as phone orders.  c) Catalogs drive phone orders and so catalogs drive unmatched Web orders.

This mind-bending logic we address in order:

Takeaway 4a: They use "phone orders" as a proxy for "catalog orders" since "phones ring because you mail catalogs".   If that is so, then there won't be any Web orders and what's the point of looking for catalogs driving Web orders?  Even worse, an order that came on-line is an order that did not come through the call center.  So what exactly is Excel correlating?

Takeaway 4b: Completely unrelated things can have high correlation; a famous example is burglaries and full moons. High correlation certainly does not imply equivalence.

Takeaway 4c: Correlations are not usually transitive: I am like Alan because we are both impatient; I am like Alice because we are both talkative; now, Alan is like Alice?
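For the skeptics, Takeaway 4c can be checked with a tiny constructed example.  This is a sketch with made-up numbers: "me" is built as the sum of two series that are uncorrelated with each other, so I correlate well with each of them while they do not correlate with each other at all.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up scores, chosen so that alan and alice are exactly uncorrelated.
alan = [1, -1, 1, -1]
alice = [1, 1, -1, -1]
me = [a + b for a, b in zip(alan, alice)]   # [2, 0, 0, -2]

r_me_alan = pearson(me, alan)         # about 0.71
r_me_alice = pearson(me, alice)       # about 0.71
r_alan_alice = pearson(alan, alice)   # 0.0

print(r_me_alan, r_me_alice, r_alan_alice)
```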


In short, this is a great example of "knowing just enough to be dangerous".


Reference: "Making a match", Target Marketing Magazine, March 2008.

 

Apr 09, 2008

An embarrassment

I find it embarrassing for the Economist to print an article like this one.  (Do they have a statistics editor?)

Econ_smoking

The subtitle asserting "causality" is offensive.  It is alleged that smoking bans in bars have "caused" more road accidents because people are forced to drive longer distances to find those bars that still allow smoking.

To assert causality so starkly for an undesigned observational study is unprofessional.  I doubt that the authors of the study they cited even went so far.  At best, they probably found a correlation.

Another problem is the practical significance of the finding.  There is a 13% increase in fatal accident rate in a "typical county containing 680,000 people".  There are two problems with this statement:

  • When I check the Census data, there are only about 85 counties in the entire U.S. with at least 680,000 people.  What do they mean by "typical"?
  • 13% is said to be an increment of 2.5 fatal accidents, presumably per year.  The crane accident in Manhattan a few weeks ago killed at least five people.  I just don't believe that one can prove definitively that such a tiny difference is not due to chance so even the correlation, let alone the causality, is suspect.
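A back-of-envelope check of the second point, as a rough sketch only.  It assumes the 2.5 is per year (as I presumed above), that yearly fatal accident counts in a single county behave like a Poisson count, and that we look at one county in one year.  A study pooling many counties and years would face less noise than this, so treat the numbers as an illustration of scale, not a verdict on the paper.

```python
from math import sqrt

increase = 2.5        # extra fatal accidents, from the article
pct_increase = 0.13   # 13%, from the article

# Implied baseline count before the ban.
baseline = increase / pct_increase   # about 19.2

# For a Poisson count, the standard deviation is the square root of the mean.
poisson_sd = sqrt(baseline)          # about 4.4

# The claimed effect, measured in units of ordinary year-to-year noise.
effect_in_sds = increase / poisson_sd   # about 0.57

print(f"baseline: {baseline:.1f}, sd: {poisson_sd:.1f}, effect: {effect_in_sds:.2f} sd")
```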

It appears that the paper is locked up in pre-publication.  If you have seen it, let us know if the authors actually asserted causality.

Reference: "Unlucky Strikes", The Economist, April 3 2008.

Apr 04, 2008

Believe it or not

Via Social Science Statistics blog, I found this article in the Times about baseball's longest hitting streaks.  The authors ran 10,000 simulations of baseball seasons using historical data to come up with a probability distribution of the longest hitting streak in each season.  They showed the following chart.

Nyt_streaks

The record was 56 consecutive games with hits in a season, which in some circles is seen as unbeatable.  These authors -- "in a fit of scientific skepticism" -- found that in any season, the simulated longest streak ranged from 39 to 109, with the median at 53 games.  They concluded that "the unlikely becomes likely".


That is sure to turn some heads.  I have a question for them as I can't make sense of these numbers.  A median of 53 means that 50% (or 5,000 out of 10,000) of the simulated seasons ended up with a hitting streak exceeding 53 games.  Empirically, according to here, DiMaggio's was the only one to go over 53.  Using the authors' time line of 1871 to 2005, that would be 134 seasons.  One out of 134 is 0.75% probability.  0.75 versus 50... sounds like something has gone wrong.

The article doesn't give enough details on the simulation so it is hard to understand what is going on.  I hope I am not misinterpreting their analysis.


 

Source: "A Journey to Baseball's Alternate Universe", Samuel Arbesman and Steven Strogatz, Mar 30 2008.


PS. As readers pointed out, each simulation is of all the seasons.  So the histogram is saying that the particular sequence of 134 seasons that we lived to see is not a rarity considering all the possibilities.  I'm not sure this is telling us much.  It doesn't address the question of how likely it is that the 56-game record will be beaten in the future.  It can't address this question because the particular sequence is now already set; the alternative universes are irrelevant because we can't jump from one universe to another mid-stream.

Also, readers want to have each hitter's probability be modeled rather than using the historical average; in other words, factor in opposing pitcher, home/away, etc.

I'll throw in another... there must have been an assumption of independence from one game to the next.  One would think the pressure would be so much higher on the hitter once he gets to 45, 50, 53 etc. games and it would be inappropriate to assume the hitting probability would remain the same.

Along those lines, why should the hitting probability be treated as fixed, rather than modeled as a probability distribution, which would account for variance as one of the readers suggested?
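To see how much the independence assumption can matter, here is a toy simulation.  It is a sketch, not a reconstruction of the authors' method: the per-game hit probability, the season length, and the "pressure" rule (the hit probability drops once a streak reaches a given length) are all invented for illustration.

```python
import random


def longest_streak(games):
    """Length of the longest run of consecutive hits (True values)."""
    best = current = 0
    for hit in games:
        current = current + 1 if hit else 0
        best = max(best, current)
    return best


def simulate_season(p_hit, n_games, rng, pressure_after=None, pressure_drop=0.0):
    """Simulate one hitter's season as a list of True/False per game.

    If pressure_after is set, the hit probability falls by pressure_drop
    whenever the current streak is at least pressure_after games long.
    """
    games = []
    streak = 0
    for _ in range(n_games):
        p = p_hit
        if pressure_after is not None and streak >= pressure_after:
            p = max(0.0, p_hit - pressure_drop)
        hit = rng.random() < p
        streak = streak + 1 if hit else 0
        games.append(hit)
    return games


rng = random.Random(2008)
N_SEASONS, N_GAMES, P_HIT = 1000, 150, 0.75   # hypothetical values

independent = [
    longest_streak(simulate_season(P_HIT, N_GAMES, rng))
    for _ in range(N_SEASONS)
]
pressured = [
    longest_streak(
        simulate_season(P_HIT, N_GAMES, rng, pressure_after=10, pressure_drop=0.25)
    )
    for _ in range(N_SEASONS)
]

mean_independent = sum(independent) / N_SEASONS
mean_pressured = sum(pressured) / N_SEASONS
print(mean_independent, mean_pressured)
```

Under these invented settings, the long streaks get shorter once pressure is allowed to bite, which is why the assumption deserves to be stated.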

For more discussion, see this Wall Street Journal discussion.
 

Dec 25, 2007

Doctoring charts

Reader Chris P. alerted us to a fascinating post from Errol Morris' blog, which presents results in graphical form from a readers' poll related to this other post.  This other post deals with a pair of photographs taken during wartime, previously discussed by Susan Sontag and others.  Sontag believed the pair documented a before-and-after setting: it was alleged that the photojournalist shifted some cannon balls from their natural position between takes. 

Morris polled his readers asking them in which order they thought the photos were taken ("on before off", "off before on", "undecided"), and which factors were used to make the decision.  He presented results in two formats, first plotting frequencies in bar charts and then plotting proportions in pie charts.  He preferred the pie chart construct.

Nyt_sontag

Most here would share Chris' reaction: "Oh my.  What people do with Excel."

The biggest problem with these pie charts is the unreasonable baseline.  This is one of those polls that allow respondents to pick any number of factors and clearly, the pie chart creator used the 1,151 responses as the baseline, as opposed to 910 people who voted.  Consider these two statements:

  • 52% of respondents who decided "on before off" listed "sun shadow" as a decision factor
  • 30% of the decision factors submitted by respondents who decided "on before off" were "sun shadow"

It is tough to figure out what the second statement means.  It is as if the respondent who selects more than one factor gets more than one vote in the final tally.  To put it differently, the 30% is meaningless unless one also knows how many decision factors were selected by each respondent, on average and in distribution.  The 52% is independent of such consideration.
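The two baselines are easy to confuse, so here is a small sketch of the difference.  The five respondents and their chosen factors are invented for illustration; they are not Morris's poll data.

```python
# Each set holds the decision factors one hypothetical respondent selected.
responses = [
    {"sun shadow", "number of balls"},
    {"sun shadow"},
    {"position of balls", "rocks", "shelling"},
    {"sun shadow", "rocks"},
    {"number of balls"},
]

factor = "sun shadow"
mentions_of_factor = sum(factor in chosen for chosen in responses)   # 3

# Baseline 1: people.  "What share of respondents listed this factor?"
share_of_respondents = mentions_of_factor / len(responses)           # 3 / 5

# Baseline 2: responses.  "What share of all factor mentions were this factor?"
total_mentions = sum(len(chosen) for chosen in responses)            # 9
share_of_mentions = mentions_of_factor / total_mentions              # 3 / 9

print(f"{share_of_respondents:.0%} of respondents, {share_of_mentions:.0%} of mentions")
```

The second figure moves whenever respondents tick more or fewer boxes, even if the number of people citing "sun shadow" stays the same.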

Combining the data given in the bar charts and pie charts, one discovers that 469 out of 910 respondents could not decide which photo was taken before the other; besides, these respondents on average expressed 0.9 opinions on the decision factors whereas the respondents who made a decision expressed 1.6 opinions.


A simple illustration to show the key decision variables by type of respondents is shown below.

Redo_sontag_2

From this chart, one sees that the number and position of the cannon balls were crucial to at least 50% of those who came to a conclusion.  Sun shadow was much more important to those who decided "on before off" while those who decided "off before on" noticed character artistic, shelling and rocks.  Most other factors did not differentiate the three groups.

Source: "Not Your Mum's Apple Pie Chart", Errol Morris, Dec 18, 2007.


 

Dec 18, 2007

Hits and misses 2

In the previous post, we discussed how charts need to address the key question posed by the data.  In this case, the journalist was trying to show that police shots often go errant, and are largely unpredictable even when the distance of the target is given.

Redo_bullets2

In the comments, there is interest in seeing the hit rate v. distance chart.  Because the data came to us in buckets, we do not have enough to continue the analysis.  If one were to guess, the real curve would start out with 100% accuracy at distance 0, fall sharply to a plateau in the 20-40% range at modest distances, and then drop again at large distances, decaying to zero.

Andrew Gelman has conducted this analysis for a similar problem, that of predicting accuracy of golf putts based on distance from the hole.  Here are two key charts from his paper (joint with Deborah Nolan):

Redo_bullets3

The left chart is our hit rate chart above, except the golf data set is larger, allowing a curve fitting.  The right chart is the fitted curve which is a "model" for the true relationship between accuracy and distance from the hole.  The model fitted the data well.

Redo_bullets4

Gelman and Nolan didn't just find any best fitting line through the data.  They started out with a trigonometric model (shown on the right), with the angle of the putt as a random variable.  With this setup, they wrote down the formula for computing the probability that the putt will fall in, that is, the proportion of success.  The angle is assumed to follow a normal distribution with the standard deviation being an unknown parameter.  The standard deviation is estimated from the available data.
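For readers who want to play with the idea, here is a minimal sketch of a geometric model of this kind.  Treat the details as my assumptions and check them against the paper: I take the putt to succeed when the angle error is smaller than arcsin((R - r) / x), where R is the hole radius, r the ball radius and x the distance, and I plug in a placeholder value for the standard deviation of the angle, which the authors estimate from data.

```python
from math import asin, erf, sqrt

HOLE_RADIUS_IN = 4.25 / 2   # standard hole diameter of 4.25 inches (assumed)
BALL_RADIUS_IN = 1.68 / 2   # standard ball diameter of 1.68 inches (assumed)


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def putt_success_probability(distance_ft, sigma_rad):
    """P(|angle error| < threshold) when the error is Normal(0, sigma_rad)."""
    distance_in = distance_ft * 12.0
    ratio = (HOLE_RADIUS_IN - BALL_RADIUS_IN) / distance_in
    if ratio >= 1.0:
        return 1.0   # ball is essentially at the hole already
    threshold = asin(ratio)
    return 2.0 * normal_cdf(threshold / sigma_rad) - 1.0


SIGMA = 0.026   # placeholder, in radians; not the paper's fitted estimate

for feet in (2, 5, 10, 20):
    print(feet, round(putt_success_probability(feet, SIGMA), 3))
```

The appeal of this setup is that a single parameter, the spread of the angle, traces out the whole accuracy-versus-distance curve.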

Of course, the human body is a bit harder to model than the hole in the ground but this procedure could very well apply.

For more details, check out the paper (PDF).  This example is also found in their book on teaching statistics.

Source: Gelman and Nolan, "A Probability Model for Golf Putting".

Nov 30, 2007

Digging deeper

Two items from other places caught my eye this week as they directly relate to some things we discussed on this blog.

First, I second Andrew's suggestion of a recent NYT article for teaching the concept of margin of error, or how to read political poll coverage intelligently.  Towards the end of this piece is a small gem:

Some pundits began by saying the horse race numbers were close but then tried to marshal evidence that they were not. On ABC's own Web site, Chris Cillizza, wrote: "Among women in the Post poll, Obama actually leads Clinton 32 percent to 31 percent among women. Voters 45 years of age or older are similarly divided, choosing Clinton by a 27 percent to 26 percent margin over Obama. Ditto for those who earn $50,000 or less a year; 29 percent for Clinton, 29 percent for Obama."

Mr. Cillizza failed to mention that if the margin of sampling error is plus or minus five percentage points for all of the likely Democratic caucus goers, then it is even higher for subgroups like women.

In a recent post, I call this the "oft-used device of subgroup support of a hypothesis".  This example illustrates the fallacy more clearly.  It's the "let's dig deeper since we haven't found the gold yet" phenomenon.  Such analysis suffers from two serious statistical problems.  The article deals with the sample size problem: the margin of error at the subgroup level is by definition larger; what this means is the bar for statistical significance has been raised; and rare is the case where such analysis could lead to any further insights.  (Of course, I am assuming the original poll was not designed to be analyzed at the subgroup level.)
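The sample size problem is simple arithmetic.  The sketch below uses the usual textbook formula for the margin of error of a proportion under simple random sampling; the sample sizes are hypothetical, chosen only so that the full sample lands near plus or minus five points.

```python
from math import sqrt


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p from n respondents."""
    return z * sqrt(p * (1.0 - p) / n)


full_sample = margin_of_error(0.5, 400)   # about 0.049, i.e. +/- 5 points
subgroup = margin_of_error(0.5, 200)      # about 0.069, i.e. +/- 7 points

print(f"all respondents: +/- {full_sample:.1%}")
print(f"half-size subgroup: +/- {subgroup:.1%}")
```

With a margin of roughly seven points on each candidate's share, a 32 to 31 "lead" within a subgroup says next to nothing.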

The other issue -- more difficult to explain and omitted in the article -- is the multiple hypothesis problem.  It is well known that if we dig around long enough, we may get so dizzy that anything that glitters will look like gold.  In other words, false positives.  Like the sample size problem, the remedy is to raise the bar for statistical significance even higher.  In practice, this frequently wipes out the rationale for such analysis.
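A quick sketch of why digging around produces fool's gold.  It assumes the subgroup tests are independent and each uses the conventional 5% significance level, which is a simplification; the Bonferroni adjustment shown is one common way to raise the bar, not the only one.

```python
def prob_any_false_positive(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance level that keeps the overall rate near alpha."""
    return alpha / n_tests


for k in (1, 5, 10, 20):
    print(k, round(prob_any_false_positive(k), 3), bonferroni_threshold(k))
```

Slice the poll ten ways and there is about a 40% chance that something "significant" turns up by luck alone.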

I will address the other interesting item in a new post.

Nov 25, 2007

A dangerous equation

Graduation rates at 47 new small public high schools that have opened since 2002 are substantially higher than the citywide average, an indication that the Bloomberg administration’s decision to break up many large failing high schools has achieved some early success.

Most of the schools have made considerable advances over the low-performing large high schools they replaced. Eight schools out of the 47 small schools graduated more than 90 percent of their students.

Nyt_smallsch

This graphic included in the NYT article lent support to the "small schools movement".  In particular, note the last sentence of the above quotation: it incorporates the oft-used device of subgroup support of a hypothesis, in this case, the subgroup of eight top-performing schools.

Such analysis is "dangerous", according to Howard Wainer, who discusses this and other examples of misapplication in a recent article in American Scientist, entitled "The Most Dangerous Equation".  He alleged that billions have been wasted in the pursuit of small schools.

The issue concerns sample size.  Dr. Wainer and associates analyzed math scores from Pennsylvania public schools.

Wainer_mathscores

Average scores for smaller schools are based on smaller numbers of students, and are therefore less stable (more variable).  More variability means more extremes.  Thus, by chance alone, we expect to find more smaller schools among the top performers.  Similarly, by chance alone, we also expect to find more smaller schools among the worst performers.

The scatter plot lays out their argument. Focusing only on the top performers (blue dots), one might conclude that smaller schools do better.  However, when the bottom performers (green) are also considered, the story no longer holds.  Indeed, the regression line is essentially flat, indicating that scores are not correlated with school size.

This is all nicely explained via the standard error formula (De Moivre's equation) in Dr. Wainer's article.  Here is a NYT article from the mid 1990s describing this same phenomenon.

File this as another comparability problem.  Because estimates based on smaller samples are less reliable, one must take extra care when comparing small samples to large samples.
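The effect is easy to reproduce.  In the sketch below, every student in every school is drawn from the same score distribution, so school size has no real effect at all; the school sizes, the score scale and the number of schools are invented for illustration.  De Moivre's equation says the standard error of a school's average is sigma divided by the square root of the number of students.

```python
import random
from math import sqrt
from statistics import pstdev


def standard_error(sigma, n):
    """De Moivre's equation: standard error of the mean of n observations."""
    return sigma / sqrt(n)


def school_means(n_schools, n_students, rng, true_mean=500.0, sigma=100.0):
    """Average score per school when every student has the same distribution."""
    means = []
    for _ in range(n_schools):
        scores = [rng.gauss(true_mean, sigma) for _ in range(n_students)]
        means.append(sum(scores) / n_students)
    return means


rng = random.Random(0)
small = school_means(200, 25, rng)    # hypothetical small schools
large = school_means(200, 400, rng)   # hypothetical large schools

ranked = sorted([(m, "small") for m in small] + [(m, "large") for m in large])
small_in_bottom = sum(label == "small" for _, label in ranked[:20])
small_in_top = sum(label == "small" for _, label in ranked[-20:])

print("theory:", standard_error(100.0, 25), standard_error(100.0, 400))
print("simulated spread:", round(pstdev(small), 1), round(pstdev(large), 1))
print("small schools in top 20 and bottom 20:", small_in_top, small_in_bottom)
```

Small schools crowd both ends of the ranking even though, by construction, size does nothing here.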

Dr. Wainer is publishing a new book next year, called "The Second Watch: navigating the uncertain world".  I'm eagerly looking forward to it.  His previous books, such as Graphic Discovery and Visual Revelations, are both part of the Junk Charts collection.

Sources: "The Most Dangerous Equation", American Scientist, November 2007; "Small Schools Are Ahead in Graduation", New York Times, June 30 2007.


P.S. Referring back to the NYT chart above, one might wonder at the impossible feat of raising graduation rates across the board simply by breaking up large schools into smaller ones.  This topic was taken up here, here and here.  When evaluating the "small schools" policy, it is a mistake to discuss only the performance of small schools; any responsible analysis must look at improvement over all schools.  Otherwise, it's a simple matter of letting small schools skim off the cream from larger schools.

 

Nov 17, 2007

Wordsmiths

I know there are more than a few wordsmiths among our readers and this entry is for you.

Data is a troublesome word.  It is the plural of datum.  And yet, I find it unnatural to say "here are my data" instead of "here is my data".  Similarly, "datum point" is more grammatical than "data point" -- and perhaps both are redundant words -- because we should use a singular noun to modify another noun (e.g. company ranking, not companies ranking; potato chip, not potatoes chip; etc.)


In a recent piece on RSS News, Ian Schagen convinces me to treat data as singular, without remorse.  Among his many arguments, he points out that agenda is the plural of agendum and yet we have no qualms using agenda as singular.  So from now on, data is singular.

Ian has the last word:

In fact an amusing pastime is to read papers and articles by people, trying to use 'data' as a plural because they believe this to be 'correct', who slip into the proper English usage when their attention wavers.  I've often seen both usages in a single sentence!

Source: "Why 'data' is singular", RSS News.  (Unfortunately, this seems to be only available to members in paper copies.)

Nov 06, 2007

The eyeball test

This set of graphs was used by the NYT to discuss changes in U.S.  spending patterns over time.  For this post, I am focusing on the bottom left and bottom right graphs.  One shows spending on energy as a percent of GDP; the other, on "nonresidential structures" (aka, commercial buildings).

Nyt_spending

At first glance, spending on energy and that on commercial buildings look very similar in shape (see above or below left).  Alas, this "eyeball test" doesn't work very well with time series data.  Let's investigate further.

Redospend1_2

"Standardizing" the data (above right) tells us whether the swings are unusual or not in the history of the data.  So in the 1980s, commercial building spend spiked to more than three times the standard deviation above the historical average.  Generally speaking, the standardized unit of 3 is taken to mean highly unusual.

Notice that the peaks of the left graph had equal heights but on the right graph, energy spending peaked only above two while commercial building spend rose above three.  This is because energy spending has been more volatile historically so it takes larger jumps (or plunges) to count as "unusual" movements.  This information is hidden in the unstandardized version.

Further, since we are concerned with long-term trends, let's take a look at five-year moving averages (below right): in other words, each time point is the average of the preceding five years' worth of data.
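Both transformations take only a few lines.  This is a sketch under two assumptions of mine: the standard deviation is computed over the whole history as a population standard deviation, and the five-year window ends at the current year.  The spending series is made up.

```python
from statistics import mean, pstdev


def standardize(series):
    """Express each value as standard deviations from the historical average."""
    m, s = mean(series), pstdev(series)
    return [(x - m) / s for x in series]


def trailing_moving_average(series, window=5):
    """Average of each value and the window - 1 values before it."""
    return [
        mean(series[i - window + 1 : i + 1])
        for i in range(window - 1, len(series))
    ]


# Hypothetical spending as a percent of GDP, one value per year.
spending = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

standardized = standardize(spending)
smoothed = trailing_moving_average(standardized, window=5)

print([round(z, 2) for z in standardized])
print([round(z, 2) for z in smoothed])
```

Note that the smoothed series is shorter than the original by four points, since the first four years do not have a full window behind them.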

Redospend2

The fluctuations have been smoothed out and the peaks are no longer as high.  Glancing at this chart, we may still conclude that the spending patterns are quite similar -- especially in the period prior to 1995.

But is that really the case?  Zooming in on the 1980s, we may mistakenly think the two lines are "close together" if our eyes read the horizontal distance and/or area between the curves, rather than focusing on the vertical distance.  The arrows on the bottom left chart depict this difference.  To make things clearer, the bottom right chart plots the vertical distances between the two lines.

Redospend3

Observe that the difference expanded to above 1 unit in the late 1980s.  A difference of one unit is very large in the standardized scale (of "unusualness") since 0 is business as usual and 3 is "highly unusual".

Eyeballing the two time series would lead us to believe that the two series are similar but we run the risk of underestimating the differences as illustrated here.


Source: "Auto Sector's role Dwindles, and Spending Suffers", New York Times, Nov 3 2007.

Oct 30, 2007

Super Crunchers

Supercrunchers

Here's something different, a mini book review of Ian Ayres's "Super Crunchers".  This book can be recommended to anyone interested in what statisticians and data analysts do for a living.  Ian is to be congratulated for making an abstruse subject lively.

His main thesis is that data analysis beats intuition and expertise in many decision-making processes; and therefore it is important for everyone to have a basic notion of the two powerful tools of regression and randomization.  He correctly points out that the ready availability of large amounts of data in recent times has empowered data analysts.

Regression is a statistical workhorse often used for prediction based on historical data.  Randomization refers to assigning subjects at random to multiple groups, and then examining if differential treatment by group leads to differential response.  (In particular, the chapter on randomization covers the topic well.)  Using regression to analyze data collected from randomized experiments allows one to establish cause-effect. 

In the following, I offer a second helping for those who have tasted Ian's first course:

  • Randomized experiments represent an ideal and are not typically possible, especially in social science settings.  (Think about assigning a group of patients at random to be "cigarette smokers".)  When these are not possible, regression uncovers only correlations, and does not say anything about causation.
  • Most large data sets amenable to "super crunching" (e.g. public records, web logs, sales transactions) are not collected from randomized experiments.
  • Regression is only one tool in the toolbox.  It is fair to say that most "data miners" prefer other techniques such as classification trees, cluster analysis, neural networks, support vector machines and association rules.  Regression has the strongest theoretical underpinning but some of the others are catching up.  (Ian did describe neural networks in a later chapter.  It must be said that many forms of neural networks have been shown to be equivalent to more sophisticated forms of regression.)
  • If used on large data sets with hundreds or thousands of predictors, regression must be used with great care, and regression weights (coefficients) interpreted with even more care.  The size of the data may even overwhelm the computation.  Particularly when the data was collected casually, as in most super crunching applications, the predictors may be highly correlated with each other, causing many problems.
  • One of the biggest challenges of data mining is to design new methods that can process huge amounts of data quickly, deal with much missing or irrelevant data, deal with new types of data such as text strings, uncover and correct for hidden biases, and produce accurate predictions consistently.
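The first point in the list above, on correlation versus causation, can be illustrated with a toy simulation.  Everything here is invented: a hidden trait drives both who ends up "treated" and the outcome, while the treatment itself does nothing.  The difference in group means is what a regression on a treated/untreated indicator would report as its coefficient.

```python
import random


def difference_in_means(treated_flags, outcomes):
    """Mean outcome of the treated group minus that of the untreated group."""
    treated = [y for t, y in zip(treated_flags, outcomes) if t]
    control = [y for t, y in zip(treated_flags, outcomes) if not t]
    return sum(treated) / len(treated) - sum(control) / len(control)


def simulate_gap(n, randomized, rng):
    """Estimated 'treatment effect' when the true effect is exactly zero."""
    flags, outcomes = [], []
    for _ in range(n):
        hidden = rng.gauss(0, 1)   # unobserved trait
        if randomized:
            treated = rng.random() < 0.5             # coin flip
        else:
            treated = hidden + rng.gauss(0, 1) > 0   # self-selection
        outcome = 2 * hidden + rng.gauss(0, 1)       # treatment plays no role
        flags.append(treated)
        outcomes.append(outcome)
    return difference_in_means(flags, outcomes)


rng = random.Random(42)
observational_gap = simulate_gap(5000, randomized=False, rng=rng)
randomized_gap = simulate_gap(5000, randomized=True, rng=rng)

print(f"observational data: {observational_gap:.2f}")
print(f"randomized experiment: {randomized_gap:.2f}")
```

The observational data show a large "effect" that is entirely due to the hidden trait; random assignment makes it vanish.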
