Predictive Analytics by Eric Siegel (link) was published earlier this year. Siegel is a consultant and organizer of a series of popular industry conferences, which I attend with some regularity. I recommend this book for readers who want to understand the current state of “data science” at a deeper level than the New York Times offers, while staying nonmathematical. If you want to measure it against my own writing, Siegel spends more time addressing “how” than I typically do on this blog. He also has a fondness for lists, tables, quotations, pictures, and turns of phrase (I did mention lists, and lists within sentences).
Reading this book is like worming through the crowd at a conference center in New York, or Boston, or San Francisco, where Siegel’s meetings are usually held. Siegel is like your Kevin Bacon, your link to practitioners of the art/science of data-driven business decision-making. (I just gave the definition, rather than the name, of the field, which has as many names as cheongsams worn by Maggie Cheung in In the Mood for Love. Siegel selected “predictive analytics,” while you may also see “data science,” “data mining,” “machine learning,” “statistical learning,” “knowledge discovery,” “statistical modeling,” “business analytics,” etc.)
Predictive Analytics paints an accurate picture of the applications and discourse circa the 2000s. Chapter 2 describes a model used by the retailer Target to find potential customers who are pregnant women. This application was later picked up by Charles Duhigg in The Power of Habit (link) and in a New York Times Magazine article, which I examined in a previous post. (Disclosure: I have further comments on pregnancy targeting in my forthcoming book.) Also in Chapter 2 is a description of how Hewlett-Packard analyzes data on its employees, a nice companion piece to the recent New York Times profile of Google’s SVP of “people operations.” See my post here. The third example in Chapter 2 concerns local police using data to find criminals.
Social media analytics, possibly the hippest corner of the industry, gets star billing in Chapter 3. Two researchers summarized LiveJournal blog posts into an “Anxiety Index,” which they claimed predicted the S&P 500. Chapter 4 contains an extensive description of a popular technique known as “decision trees,” applied to financial risk management. For those who read Chapter 2 of Numbers Rule Your World, this section provides more technical details on risk scoring. Some keywords to look out for are overfitting (which Siegel calls overlearning), using test datasets to evaluate accuracy, and Occam’s razor.
Chapter 5 covers ensemble models, a relatively new technique with broad applications. Instead of taking the traditional route of developing one “best” predictor, the analyst conducts a poll of a set of predictors, a sort of wisdom-of-crowds approach. The winning team in the Netflix Prize, in which teams competed to improve the “accuracy” of Netflix’s movie recommendations, used an ensemble.
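To make the polling idea concrete, here is a minimal Python sketch of a majority-vote ensemble. Everything in it is an assumption made for illustration: the three rules, the field names, and the thresholds are hypothetical, and this is not the method described in the book or used by the Netflix Prize winners (who blended many numeric rating predictions rather than counting votes).

```python
from collections import Counter
from typing import Callable, Sequence

# A predictor maps a customer record to a label ("cancel" or "stay").
Predictor = Callable[[dict], str]

def majority_vote(predictors: Sequence[Predictor], record: dict) -> str:
    """Poll every predictor and return the most common answer."""
    votes = Counter(p(record) for p in predictors)
    return votes.most_common(1)[0][0]

# Three deliberately crude, hypothetical rules. Each one alone is a weak
# predictor; the ensemble simply lets them vote.
def by_tenure(r: dict) -> str:
    return "cancel" if r["months_as_customer"] < 6 else "stay"

def by_complaints(r: dict) -> str:
    return "cancel" if r["complaints"] >= 2 else "stay"

def by_usage(r: dict) -> str:
    return "cancel" if r["hours_per_week"] < 1 else "stay"

# An odd number of voters avoids ties between the two labels.
ensemble = [by_tenure, by_complaints, by_usage]

customer = {"months_as_customer": 3, "complaints": 0, "hours_per_week": 0.5}
print(majority_vote(ensemble, customer))
```

Two of the three rules vote “cancel” for this made-up customer, so the ensemble predicts a cancellation even though one rule disagrees.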
Chapter 7 introduces net lift models, an unresolved but important area of business analytics. Take the example of Time Warner Cable wanting to send special offers to “vulnerable” customers in the hope of retaining them. Traditional predictive models find those customers who are most likely to cancel their service. The trouble is that special offers are very expensive, and some of those customers would not require this incentive in order to renew. A “net lift” model is more precise: it targets only those customers who are both likely to cancel and likely to renew only if Time Warner makes a special offer. Technically, the latter problem is much harder to solve.
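One simple way to see what “net lift” measures is to compare renewal rates with and without the offer inside each customer segment. The Python sketch below assumes data from a past randomized test in which some customers received the offer and others did not. The segments, records, and function names are hypothetical, and real net lift models are considerably more involved than this two-rate comparison.

```python
from collections import defaultdict

# Hypothetical records from a past randomized test:
# (segment, got_offer, renewed)
history = [
    ("new", True, True), ("new", True, True), ("new", True, False),
    ("new", False, False), ("new", False, False), ("new", False, True),
    ("loyal", True, True), ("loyal", True, True),
    ("loyal", False, True), ("loyal", False, True),
]

def renewal_rates(records):
    """Return the renewal rate for each (segment, got_offer) group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [renewals, total]
    for segment, got_offer, renewed in records:
        counts[(segment, got_offer)][0] += int(renewed)
        counts[(segment, got_offer)][1] += 1
    return {group: r / n for group, (r, n) in counts.items()}

def net_lift(records):
    """Renewal rate with the offer minus renewal rate without it, per segment.

    Assumes every segment appears in both the offer and no-offer groups.
    """
    rates = renewal_rates(records)
    segments = {segment for segment, _ in rates}
    return {s: rates[(s, True)] - rates[(s, False)] for s in segments}

lift = net_lift(history)
for segment in sorted(lift):
    print(segment, round(lift[segment], 2))
```

In this made-up data the “loyal” segment renews at the same rate with or without the offer, so its net lift is zero and the offer would be wasted there. The “new” segment renews more often when given the offer, which makes it the better target.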
Needless to say, the price of the book is a fraction of the conference fee. While the overall tone is optimistic, Siegel does not shy away from discussing the limitations of data analytics. This I find to be a virtue, a relief from the relentless hype that has enveloped this field of work. I’d like to end this review with a quote (p. 201):
Commanding a computer to learn is like teaching a blindfolded monkey to design a fashion diva’s gown. The computer knows nothing. It has no notion of the meaning behind the data, the concept of what a mortgage, salary or even a house is. The numbers are just numbers. Even clues like “$” and “%” don’t mean anything to the machine. It’s a blind, mindless automation stuck in a box during its first day on the job.