Oct 30, 2007
Here's something different: a mini book review of Ian Ayres's "Super Crunchers". I recommend this book to anyone interested in what statisticians and data analysts do for a living. Ian is to be congratulated for making an abstruse subject lively.
His main thesis is that data analysis beats intuition and expertise in many decision-making processes, and that everyone should therefore have a basic notion of two powerful tools: regression and randomization. He correctly points out that the ready availability of large amounts of data in recent times has empowered data analysts.
Regression is a statistical workhorse often used for prediction based on historical data. Randomization refers to assigning subjects at random to multiple groups, and then examining whether treating the groups differently leads to different responses. (In particular, the chapter on randomization covers the topic well.) Using regression to analyze data collected from randomized experiments allows one to establish cause and effect.
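For readers who like to see this in action, here is a minimal simulation sketch in Python. Everything in it is an assumption made up for illustration (the function name `simulate`, the hidden "health" trait, the true effect of 2.0); none of it comes from the book. It compares the treated and control group means, which is what a regression on a single treatment indicator would report as its slope.

```python
import random
import statistics

def simulate(n=20000, true_effect=2.0, randomized=True, seed=0):
    """Return the estimated treatment effect (difference in group means).

    All numbers here are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        health = rng.gauss(0, 1)  # hidden trait that also drives the outcome
        if randomized:
            t = rng.random() < 0.5   # coin flip: assignment ignores health
        else:
            t = health > 0           # self-selection: healthier subjects opt in
        outcome = 5.0 + true_effect * t + 3.0 * health + rng.gauss(0, 1)
        (treated if t else control).append(outcome)
    return statistics.mean(treated) - statistics.mean(control)

print("randomized:    ", round(simulate(randomized=True), 2))
print("self-selected: ", round(simulate(randomized=False), 2))
```

Under random assignment the estimate should land close to the true effect of 2.0. Under self-selection the hidden trait is mixed into the comparison, so the estimate comes out far too large even though the same analysis was run.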
In the following, I offer a second helping for those who have tasted Ian's first course:
- Randomized experiments represent an ideal and are not typically possible, especially in social science settings. (Think about assigning a group of patients at random to be "cigarette smokers".) When these are not possible, regression uncovers only correlations, and does not say anything about causation.
- Most large data sets amenable to "super crunching" (e.g. public records, web logs, sales transactions) are not collected from randomized experiments.
- Regression is only one tool in the toolbox. It is fair to say that most "data miners" prefer other techniques such as classification trees, cluster analysis, neural networks, support vector machines and association rules. Regression has the strongest theoretical underpinning but some of the others are catching up. (Ian did describe neural networks in a later chapter. It must be said that many forms of neural networks have been shown to be equivalent to more sophisticated forms of regression.)
- If used on large data sets with hundreds or thousands of predictors, regression must be used with great care, and regression weights (coefficients) interpreted with even more care. The size of the data may even overwhelm the computation. Particularly when the data was collected casually, as in most super crunching applications, the predictors may be highly correlated with each other, which makes the individual coefficients unstable and hard to interpret.
- One of the biggest challenges of data mining is to design new methods that can process huge amounts of data quickly, deal with much missing or irrelevant data, deal with new types of data such as text strings, uncover and correct for hidden biases, and produce accurate predictions consistently.
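To illustrate the point above about correlated predictors, here is a rough sketch in plain Python. The helper names (`fit_two_predictors`, `coefficient_spread`) and all the numbers are hypothetical, chosen only for this example. It fits a two-predictor regression many times on freshly simulated data and measures how much the first coefficient bounces around from one sample to the next.

```python
import math
import random
import statistics

def fit_two_predictors(x1, x2, y):
    """Least-squares slopes for y ~ x1 + x2 (with intercept),
    solved from the 2x2 normal equations on centered data."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    a = [v - m1 for v in x1]
    b = [v - m2 for v in x2]
    c = [v - my for v in y]
    s11 = sum(u * u for u in a)
    s22 = sum(v * v for v in b)
    s12 = sum(u * v for u, v in zip(a, b))
    s1y = sum(u * w for u, w in zip(a, c))
    s2y = sum(v * w for v, w in zip(b, c))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

def coefficient_spread(rho, trials=200, n=200, seed=1):
    """Standard deviation of the first slope across repeated samples,
    where x1 and x2 have correlation rho and both true slopes are 1."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(trials):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        x2 = [rho * v + math.sqrt(1 - rho * rho) * rng.gauss(0, 1) for v in x1]
        y = [u + v + rng.gauss(0, 1) for u, v in zip(x1, x2)]
        slopes.append(fit_two_predictors(x1, x2, y)[0])
    return statistics.stdev(slopes)

print("spread, uncorrelated predictors:", coefficient_spread(0.0))
print("spread, correlation 0.99:       ", coefficient_spread(0.99))
```

With nearly collinear predictors the estimated weight on `x1` should vary several times more across samples than in the uncorrelated case, even though the true weight is identical in both. This is why a single fitted coefficient deserves caution when the predictors move together.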