Predictive Analytics by Eric Siegel was published earlier this year. Siegel is a consultant and
organizer of a series of popular industry conferences, which I attend with some
regularity. I recommend this book for readers who want to understand the
current state of “data science” at a level deeper than the New York Times’s coverage but still nonmathematical. If you want to measure
against my own writing, then Siegel spends more time addressing “how” than I
typically do on this blog. He also has a fondness for lists, tables, quotations,
pictures, and turns of phrase; I did mention lists, and lists within sentences.
Reading this book is like worming through the crowd at a
conference center in New York, or Boston, or San Francisco, where Siegel’s
meetings are usually held. Siegel is like your Kevin Bacon, your link to
practitioners of the art/science of data-driven business decision-making. (I
just gave the definition, rather than the name, of the field, which has as many
names as cheongsams worn by Maggie
Cheung in In the Mood for Love.
Siegel selected “predictive analytics,” while you may also see “data science,”
“data mining,” “machine learning,” “statistical learning,” “knowledge
discovery,” “statistical modeling,” “business analytics,” etc.)
***
Predictive Analytics paints an accurate picture of the applications and discourse of the 2000s.
Chapter 2 describes a model used by the retailer Target to find potential
customers who are pregnant women—this application was later picked up by Charles
Duhigg in The Power of Habit, and a New York Times Magazine article, which I
examined in a previous post. (Disclosure: I have further comments on pregnancy
targeting in my forthcoming book.) Also in Chapter 2 is a description of how
Hewlett Packard analyzes data on its employees, a nice companion piece to the
recent New York Times profile of Google’s SVP of “people operations.” See my
post here. The third example in Chapter 2 concerns local police using data to
find criminals.
Social media analytics, possibly the hippest corner of the
industry, gets star billing in Chapter 3. Two researchers summarized
LiveJournal blog posts into an “Anxiety Index,” which they claimed predicted
the S&P 500. Chapter 4 contains an extensive description of a popular technique
known as “decision trees” applied to financial risk management. For those who
read Chapter 2 of Numbers Rule Your
World, this section provides more technical details on risk scoring. Some
keywords to look out for are overfitting (called overlearning), using test
datasets to evaluate accuracy, and Occam’s razor.
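To make those keywords concrete, here is a minimal Python sketch of why accuracy is measured on a test dataset. It is not taken from the book: the risk scores, the labels, and the names `fit_memorizer` and `fit_stump` are all invented for illustration. The "stump" is a one-question decision tree, the simplest possible version of the technique, and the "memorizer" stands in for a model that overlearns.

```python
# Toy data, invented for illustration: (risk_score, defaulted).
# The underlying pattern is "score >= 6 defaults"; two training labels
# (scores 2 and 9) are flipped to mimic noise.
TRAIN = [(1, 0), (2, 1), (3, 0), (4, 0), (5, 0),
         (6, 1), (7, 1), (8, 1), (9, 0), (10, 1)]
# Fresh cases that follow the underlying pattern without the noise.
TEST = [(score, int(score >= 6)) for score in range(1, 11)]


def accuracy(predict, rows):
    """Share of rows where the prediction matches the label."""
    return sum(predict(x) == y for x, y in rows) / len(rows)


def fit_memorizer(rows):
    """Overlearning in its purest form: remember every training label."""
    table = dict(rows)
    return lambda x: table.get(x, 0)


def fit_stump(rows):
    """One-question tree: pick the cutoff that scores best on the training rows."""
    cutoff = max(range(1, 12),
                 key=lambda t: accuracy(lambda x: int(x >= t), rows))
    return cutoff, (lambda x: int(x >= cutoff))


memorizer = fit_memorizer(TRAIN)
cutoff, stump = fit_stump(TRAIN)

for name, model in [("memorizer", memorizer), ("stump", stump)]:
    print(name, "train:", accuracy(model, TRAIN), "test:", accuracy(model, TEST))
```

On this made-up data the memorizer looks perfect on the rows it has already seen and loses ground on the fresh ones, while the simpler rule does the opposite. That is Occam's razor in miniature, though a toy this small proves nothing about real portfolios.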
Chapter 5 covers ensemble models, a relatively new technique
with broad applications. Instead of taking the traditional route of
developing one “best” predictor, the analyst conducts a poll of a set of predictors. Sort of
a wisdom of crowds approach. The winning team in the Netflix Prize—in which
teams competed to improve the “accuracy” of Netflix queue recommendations—used
an ensemble.
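The polling idea can be sketched in a few lines of Python. This is an illustration under loud assumptions: the cases are invented and deliberately built so that each weak predictor errs on different cases, and `majority_vote` is a hypothetical helper. A plain majority vote is only one way to combine predictors, and I am not suggesting it is what the Netflix Prize winners did.

```python
from collections import Counter

# Toy cases, invented for illustration: three yes/no signals and the true label.
# They are deliberately built so that each signal is wrong on different cases.
CASES = [
    ((1, 1, 1), 1), ((0, 1, 1), 1), ((1, 0, 1), 1), ((1, 1, 0), 1),
    ((0, 0, 0), 0), ((1, 0, 0), 0), ((0, 1, 0), 0), ((0, 0, 1), 0),
]

# Three deliberately weak predictors: each one trusts a single signal.
PREDICTORS = [
    lambda x: x[0],
    lambda x: x[1],
    lambda x: x[2],
]


def majority_vote(predictors, x):
    """Poll every predictor and return the most common answer."""
    votes = Counter(predict(x) for predict in predictors)
    return votes.most_common(1)[0][0]


def accuracy(predict, cases):
    """Share of cases where the prediction matches the true label."""
    return sum(predict(x) == y for x, y in cases) / len(cases)


solo_scores = [accuracy(p, CASES) for p in PREDICTORS]
ensemble_score = accuracy(lambda x: majority_vote(PREDICTORS, x), CASES)

print("individual predictors:", solo_scores)
print("majority vote:", ensemble_score)
```

The poll only helps because the predictors make different mistakes. Three copies of the same predictor would vote the same way every time and gain nothing.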
Chapter 7 introduces net lift models, an unresolved
but important area of business analytics. Take an example of Time Warner Cable
wanting to send special offers to “vulnerable” customers hoping to retain them.
Traditional predictive models find those customers who are most likely to
cancel their service. The trouble is that special offers are very expensive,
and some of those customers would not require this incentive in order to renew.
A “net lift” model is more precise: it targets only those customers who are
likely to cancel and likely to renew only if Time Warner makes a special
offer. Technically, the latter problem is much harder to solve.
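The difference between the two targeting rules shows up in a small Python sketch. Everything in it is assumed for illustration: the three segments, the counts from an imagined randomized test of the offer, and the function names. Real net lift models work customer by customer rather than on three tidy segments, which is part of what makes the problem hard.

```python
# Hypothetical results of a randomized retention test (all counts invented).
# For each segment: (customers, renewals) with the offer and without it.
EXPERIMENT = {
    "loyal":       {"offer": (100, 96), "no_offer": (100, 95)},
    "persuadable": {"offer": (100, 70), "no_offer": (100, 40)},
    "lost_cause":  {"offer": (100, 12), "no_offer": (100, 10)},
}


def renewal_rate(group):
    """Renewals divided by customers for one arm of the test."""
    customers, renewals = group
    return renewals / customers


def churn_risk(segment):
    """Traditional score: chance of cancelling when no offer is made."""
    return 1 - renewal_rate(EXPERIMENT[segment]["no_offer"])


def net_lift(segment):
    """Net lift score: extra renewals attributable to the offer."""
    arms = EXPERIMENT[segment]
    return renewal_rate(arms["offer"]) - renewal_rate(arms["no_offer"])


churn_target = max(EXPERIMENT, key=churn_risk)
lift_target = max(EXPERIMENT, key=net_lift)

print("highest churn risk:", churn_target)
print("highest net lift:", lift_target)
```

In this made-up test the churn score points at the customers who leave almost regardless of the offer, while the net lift score points at the segment where the offer actually changes the outcome.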
***
Needless to say, the price of the book is a fraction of the
conference fee. While the overall tone is optimistic, Siegel does not shy away
from discussing the limitations of data analytics. This I find to be a virtue,
a relief from the relentless hype that has enveloped this field of work. I’d
like to end this review with a quote (p. 201):
Commanding a computer to learn is like teaching a
blindfolded monkey to design a fashion diva’s gown. The computer knows nothing.
It has no notion of the meaning behind the data, the concept of what a
mortgage, salary or even a house is. The numbers are just numbers. Even clues
like “$” and “%” don’t mean anything to the machine. It’s a blind, mindless
automation stuck in a box during its first day on the job.