Just finished reading The Undoing Project by Michael Lewis, his bio of the Kahneman and Tversky duo who made many of the seminal discoveries in behavioral economics.
In Chapter 7, Lewis recounts one of their most celebrated experiments which demonstrated the “base rate fallacy.”
Here is one version of the experiment. The test subjects are asked to make judgments based on a vignette.
Psychologists have administered tests to 100 people, 70 of whom are lawyers and 30 of whom are engineers.
(A) If one person is selected at random from this group, what is the chance that the selected person is a lawyer?
(B) Dick is selected at random from this group. Here is a description of him: “Dick is a 30 year old man. He is married with no children. A man of high ability and high motivation, he promises to be quite successful in his field. He is well liked by his colleagues.” What is the chance that Dick is a lawyer?
Those subjects who answered (A) made the right judgment, in accordance with the base rate of 70 percent.
The answer to (B) should be the same: it shouldn't matter that the random person is named Dick, and the generic description provides no useful information about his occupation. However, subjects who answered (B) revised the chance down to about 50-50. The experiment showed that access to Dick’s description led people astray, causing them to ignore the base rate. Note that the base rate here is the prior probability.
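A minimal sketch in Python (my own illustration, not from the book) makes the same point in Bayesian terms: an uninformative description has a likelihood ratio of 1, so Bayes' rule leaves the answer at the 70% prior; to justify 50-50, the description would have to be genuine evidence against being a lawyer.

```python
# Bayes' rule in odds form: posterior odds = prior odds x likelihood ratio.
def posterior_prob(prior, likelihood_ratio):
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# Dick's description fits lawyers and engineers equally well (LR = 1),
# so the answer to (B) should stay at the 70% base rate.
print(posterior_prob(0.70, 1.0))    # 0.7

# To reach the subjects' 50-50 answer, the description would need to be
# real evidence against "lawyer" (a likelihood ratio of 3/7, about 0.43).
print(posterior_prob(0.70, 3 / 7))  # 0.5
```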
***
What are the practical applications of the KT experiment for business data analysts?
tl;dr
Before throwing the kitchen sink of variables (features) into your statistical (machine learning) models, review the literature on the base rate fallacy, starting with the Kahneman-Tversky experiments.
1. Adding more variables can make your predictions worse
Let's start with what kind of additional information is provided by Dick’s description. The sample size has not changed – it’s still one. The data expanded only in the number of variables (or features). Specifically, these eight additional variables:
X1 = age
X2 = gender
X3 = marital status
X4 = number of children
X5 = ability level
X6 = motivation level
X7 = expected level of success in field
X8 = popularity among colleagues
In today’s age of surveillance data, it is all too easy for any analyst to assemble more variables. The KT experiment shows that having more variables does not mean you have more useful information. Worse, those extra variables may distract you from the base rate and degrade your predictions.
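Here is a rough simulation of that point (entirely synthetic data and my own setup, using scikit-learn; nothing here comes from the KT paper). A model trained on eight uninformative "Dick" variables does no better, and sometimes slightly worse, out of sample than simply predicting the 70% base rate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000                                   # hypothetical pool of people
y = (rng.random(n) < 0.70).astype(int)     # 1 = lawyer, base rate 70%
X = rng.normal(size=(n, 8))                # eight "Dick" variables: pure noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)
print("accuracy using the 8 noise variables:", model.score(X_te, y_te))
print("accuracy from the base rate alone:   ", y_te.mean())  # always guess "lawyer"
```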
2. Machines are even more susceptible than humans
If humans are prone to such mistakes, should we use machines instead? Sadly, machines will perform worse.
Machines allow us to process even more variables with even greater efficiency. Instead of eight useless variables, you can now add 800 or even 8,000 useless variables about Dick. The machines will then inform you which subset of these variables “pop.” The more useless data you add, the higher the chance you will encounter an accidental correlation.
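To see how easily accidental correlations show up, here is another small simulation (again synthetic and my own, not from any source cited here): screen 800 pure-noise variables against a random outcome, and you should expect roughly 5% of them, about 40, to look "significant" at the usual p < 0.05 threshold.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 100, 800
is_lawyer = rng.random(n) < 0.70          # outcome with a 70% base rate
X = rng.normal(size=(n, p))               # 800 variables of pure noise

# Compare each variable across the two groups and count how many "pop".
pvalues = [stats.ttest_ind(X[is_lawyer, j], X[~is_lawyer, j]).pvalue
           for j in range(p)]
hits = sum(pv < 0.05 for pv in pvalues)
print(f"{hits} of {p} noise variables 'pop' at p < 0.05 by chance alone")
```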
Before throwing the kitchen sink of variables (features) into your statistical (machine learning) models, review the literature on the base rate fallacy, starting with the ground-breaking Kahneman-Tversky experiments.
Thank you for this post; I am also struggling with this issue. You mention "big data" here, perhaps just for the click value of the term, but this problem predates the rise of "big data" as a buzzword. The problem is not with machines or machine learning, but rather with the human expectation that complicated explanations are better. There is a book from the 1970s, How Real Is Real? by Paul Watzlawick, that described some interesting experiments in which people classified slides of biopsies of cancer cells. Simple and correct theories were discarded in favor of complicated but erroneous theories. I will refer you to the citation for more details when I find it.
BTW, the beginning of your post is somewhat difficult to understand. Just after citing the Lewis book (which I have open in front of me), your intro to the "vignette" is confusing. I would suggest making clear that test subjects are presented with either (A) or (B).
Posted by: Jose Antonio Sobrino Reineke | 04/04/2019 at 03:02 PM
JASR: Tackling your last point first, I presented one version of the experiment in which two different groups were shown either (A) or (B). What is even more shocking is that when both (A) and (B) were presented to the same subjects, they still adjusted their predictions in the same way! In Lewis's book, I believe he disclosed this additional insight while describing a different experiment, and cited Kahneman as saying that it was Tversky's idea to present both statements to the same subjects; Kahneman didn't believe people would still make the error, but they did!
The Big Data connection is intentional and very real. In that world, adding more data frequently means adding more variables, not increasing the sample size. And machines are supposed to overcome human errors. The point of the title is that this type of error has been recognized since the 1970s, long before Big Data, but it is just as relevant today.
Posted by: Kaiser | 04/04/2019 at 04:28 PM