Super Crunchers
Oct 30, 2007
Here's something different: a mini book review of Ian Ayres's "Super Crunchers". The book can be recommended to anyone interested in what statisticians and data analysts do for a living. Ian is to be congratulated for making an abstruse subject lively.
His main thesis is that data analysis beats intuition and expertise in many decision-making processes, and that it is therefore important for everyone to have a basic grasp of two powerful tools: regression and randomization. He correctly points out that the ready availability of large amounts of data in recent times has empowered data analysts.
Regression is a statistical workhorse often used for prediction based on historical data. Randomization refers to assigning subjects at random to multiple groups, and then examining whether differential treatment by group leads to differential response; the book's chapter on randomization covers this topic especially well. Using regression to analyze data collected from a randomized experiment allows one to establish cause and effect.
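To make the pairing concrete, here is a minimal sketch in Python; the experiment, the sample size and the treatment effect of +2 are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible

# Hypothetical experiment: 200 subjects, half assigned at random to treatment.
n = 200
treatment = rng.permutation(np.repeat([0, 1], n // 2))

# Simulated response: baseline of 10, a true treatment effect of +2, plus noise.
response = 10 + 2 * treatment + rng.normal(0, 1, n)

# Regressing response on treatment (with an intercept) recovers the effect;
# because assignment was random, the coefficient has a causal reading.
X = np.column_stack([np.ones(n), treatment])
coef, *_ = np.linalg.lstsq(X, response, rcond=None)
print(f"estimated treatment effect: {coef[1]:.2f}")  # prints a value near 2
```

Without the random assignment in the second step, the same regression would estimate only an association, not a causal effect.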
In the following, I offer a second helping for those who have tasted Ian's first course:
- Randomized experiments represent an ideal and are often not possible, especially in social science settings. (Think about assigning a group of patients at random to be "cigarette smokers".) Without randomization, regression uncovers only correlations and says nothing about causation.
- Most large data sets amenable to "super crunching" (e.g. public records, web logs, sales transactions) are not collected from randomized experiments.
- Regression is only one tool in the toolbox. It is fair to say that most "data miners" prefer other techniques, such as classification trees, cluster analysis, neural networks, support vector machines and association rules. Regression has the strongest theoretical underpinning, but some of the others are catching up. (Ian does describe neural networks in a later chapter. It must be said that many forms of neural networks have been shown to be equivalent to more sophisticated forms of regression.)
- If used on large data sets with hundreds or thousands of predictors, regression must be applied with great care, and the regression weights (coefficients) interpreted with even more care. The sheer size of the data may even overwhelm the computation. Particularly when the data was collected casually, as in most super crunching applications, the predictors may be highly correlated with each other, causing many problems; see the simulation sketched after this list.
- One of the biggest challenges of data mining is to design new methods that can process huge amounts of data quickly, deal with much missing or irrelevant data, deal with new types of data such as text strings, uncover and correct for hidden biases, and produce accurate predictions consistently.
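On the correlated-predictors point in the fourth bullet, a small simulation of my own (not from the book) shows how collinearity destabilizes the individual regression weights even while predictions stay sensible:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Two nearly collinear predictors, as often arises in casually collected data.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # almost a copy of x1
y = x1 + x2 + rng.normal(size=n)          # true weights are 1 and 1

# Refit on bootstrap resamples: the individual weights swing wildly,
# even though their sum (what the predictions depend on) stays near 2.
for _ in range(3):
    idx = rng.integers(0, n, size=n)
    X = np.column_stack([x1[idx], x2[idx]])
    coef, *_ = np.linalg.lstsq(X, y[idx], rcond=None)
    print(f"w1 = {coef[0]:7.2f}   w2 = {coef[1]:7.2f}   w1 + w2 = {coef.sum():.2f}")
```

The data here are fabricated so that x2 is nearly a copy of x1; the two weights trade off against each other from resample to resample, which is exactly why they must be interpreted with care.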
"It must be said that many forms of neural networks have been shown to be equivalent to more sophisticated forms of regression.)"
All predictive techniques in data mining are "regressions" of a sort. What varies is the functional form of the output and the performance function being optimized. For instance, the most commonly used neural networks (multi-layer perceptrons) are more or less collections of overlapping logistic functions, which attempt to minimize squared error.
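For the curious, here is a minimal sketch of that claim; the function names and parameter shapes are my own, and training is omitted:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp(x, w_in, b_in, w_out, b_out):
    """One-hidden-layer perceptron for a 1-D input: the output is a
    weighted sum of overlapping logistic curves, one per hidden unit."""
    hidden = logistic(np.outer(x, w_in) + b_in)  # shape (n_points, n_hidden)
    return hidden @ w_out + b_out

def squared_error(y_true, y_pred):
    """The performance function such a network typically minimizes."""
    return np.mean((y_true - y_pred) ** 2)

# Two hidden units with hand-picked weights, evaluated on a grid.
x = np.linspace(-3, 3, 50)
y_hat = mlp(x, w_in=np.array([2.0, -1.5]), b_in=np.array([0.5, 1.0]),
            w_out=np.array([1.0, 0.8]), b_out=0.1)
```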
-Will
Posted by: Will Dwinnell | Oct 30, 2007 at 08:53 AM
Regression and other "traditional" statistics were developed for situations where data was relatively scarce, and the goal was to extract as much meaningful information as possible from the hard-won data. Data mining seems to be oriented more towards the opposite situation, where the volume of data is so overwhelming that special techniques are needed to make sense of it.
Posted by: SilentD | Oct 30, 2007 at 10:25 AM
Russ Roberts had a very interesting interview with Ayres on the EconTalk podcast. Surprisingly, Roberts questions the value of correlations, regressions, etc. He argues that there are very few examples where statistical analysis provided a definitive answer and changed minds in the process: most of the phenomena studied in economics and the social sciences are the result of processes so complex that it is very difficult to isolate single causes. A lot of analysis just winds up confirming the researcher's biases.
Posted by: John S. | Oct 30, 2007 at 06:51 PM
For the counterpoint, there's Ehrenberg's article "Regression is useless" (can't find cite at the moment) and Ted Goertzel's "Myths of Murder and Multiple Regression". http://crab.rutgers.edu/~goertzel/mythsofmurder.htm
Regression is a powerful statistical tool and I am not going to contend it should not be used, but in the wrong hands -- and there are lots of wrong hands -- it's like giving a teenager a Corvette and a credit card.
Posted by: zbicyclist | Oct 30, 2007 at 11:55 PM
John: thanks for the link. Roberts certainly made his disdain for statistical analysis quite clear, didn't he? I have two problems with this critique: first, what is the better alternative? Second, if researchers are intellectually dishonest - ignoring decades of statistical wisdom to run thousands of regressions looking for structure - the problem is not with the tool but with the one wielding it.
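To make the "thousands of regressions" worry concrete, here is a quick simulation of my own (not Roberts's): regress pure noise against a thousand pure-noise predictors, and the conventional 5% significance screen will flag roughly fifty of them.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_predictors = 100, 1000

# Pure noise: the response has no relationship to any predictor.
y = rng.normal(size=n)
X = rng.normal(size=(n, n_predictors))

# Correlation of y with each predictor, then the usual |t| > 1.96 screen.
r = (X - X.mean(axis=0)).T @ (y - y.mean()) / (n * X.std(axis=0) * y.std())
t = r * np.sqrt((n - 2) / (1 - r**2))
print(f"'significant' predictors found: {(np.abs(t) > 1.96).sum()}")  # about 50
```

None of those "discoveries" is real; the tool did exactly what it was told to do.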
The podcast and the other citations do provide a more balanced view of the topic. I also recommend the writings of David Freedman at Berkeley.
For a different kind of counterpoint, I am reading "Gut Feelings" by Gigerenzer. He presents "evidence" showing that intuition beats regression or logical analysis all the time. Reading these books back to back points out the trouble with popular science publishing today: these authors each present only one side of the story, and unknowing readers may think it's the only side.
Posted by: Kaiser | Oct 31, 2007 at 11:22 PM
Describing data analysis as "number crunching" rubs me the wrong way, as I noted in a blog post last year. The suggestion of a mindless, mechanical approach that can somehow yield useful answers seems so wrong-headed.
Posted by: Nick Barrowman | Nov 11, 2007 at 09:04 PM
The critics of Ayres' book -- perhaps one might consider calling them "Luddite critics" -- fail to grasp one thing:
The "supercrunching" techniques WORK. Randomization WORKS. Regression analysis WORKS.
Properly organized and structured, of course.
It was not surprising to me, nor should it be to anyone else, that many of the examples Ayres uses are from the business world. We should all recall the great line from Dan Aykroyd in Ghostbusters, when Bill Murray and his fellow paranormal "scholars" are booted from the university and wondering about their future. Says Aykroyd: "I've worked in the private sector. They expect results."
These techniques deliver results.
Posted by: Karl K | Nov 14, 2007 at 10:42 PM