The third chapter of SuperFreakonomics is simpler in structure than the other two chapters, containing just two parts: one dealing with the Kitty Genovese murder, and the other with the research of Chicago economist John List, a colleague of Levitt's.
The chapter touches upon a number of keystone experimental results from psychology, familiar to anyone who has taken PSY 101. The Kitty Genovese case is a real-life example of the "bystander effect": the tendency of human beings to offer no help to strangers because they expect others to do so.
The Ultimatum and Dictator games are used to study human rationality and the attitude toward inequity.
The Stanford prison experiment and the Milgram experiment at Yale are also briefly touched upon. These are studies of "obedience to authority", used to show that people have a capacity to do really bad things when given roles as authority figures (Stanford), or when subjected to authority figures (Yale).
Here are, again, my thoughts on statistical topics as I read the chapter:
pp.100-1: Raises an important point: randomized controlled experiments in criminology that are also ethical and politically acceptable are very difficult to design. The randomized experiment they concocted as a thought experiment, to "know whether putting more people in prison really lowers the crime rate", doesn't make much sense to me:
Pretend you could randomly select a group of states and command each of them to release 10,000 prisoners. At the same time, you could randomly select a different group of states and have them lock up 10,000 people, misdemeanor offenders perhaps, who otherwise wouldn't have gone to prison. Now sit back, wait a few years, and measure the crime rate in those two sets of states. Voila!
In case you are interested in running randomized experiments, here are a few pointers:
p.115: L&D wrote: "In Virginia, List cruised the trading floor and randomly recruited customers and dealers, asking them to step into a back room for an economics experiment."
This hits on one of my pet peeves. Whenever I see an assertion of "random selection" in an observational study, I want to ask what mechanism was used to ensure randomness. Consider, for example, the NYC subway police telling us they "randomly" inspect bags: how precisely do they enforce such "randomness"? Do they pick every Nth passenger? Do they carry a pseudo-random number generator?
Not knowing the precise selection rules is not an excuse for assuming random selection. In this example, I'd like to know how List "randomized" the recruiting process.
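To make the point concrete, here is a minimal sketch (my own illustration, not List's procedure) of two mechanisms that would actually make a claim of "random selection" verifiable: systematic every-Nth sampling, and a seeded pseudo-random draw whose selections can be audited after the fact.

```python
import random

def systematic_sample(stream, interval):
    """Select every Nth item -- a checkable mechanism, not a vague claim."""
    return [item for i, item in enumerate(stream) if i % interval == 0]

def bernoulli_sample(stream, p, seed=42):
    """Give each item an independent inclusion probability p; the seed makes
    the selection reproducible, so it can be audited later."""
    rng = random.Random(seed)
    return [item for item in stream if rng.random() < p]

passengers = list(range(100))
print(len(systematic_sample(passengers, 10)))   # 10: every 10th passenger
print(len(bernoulli_sample(passengers, 0.1)))   # around 10, varies with seed
```

Either rule would answer the "what mechanism?" question; the trouble is that field studies rarely disclose one.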
pp. 118, 122: I would love to see sample sizes cited alongside the results from the Ultimatum/Dictator experiments, all of which report fairly large differences between groups. I'm sure the samples were large enough, but statisticians always worry about the so-called "law of small numbers" (a phrase attributed to Kahneman and Tversky). When the sample size is too small, even very large differences may occur by chance.
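A quick simulation illustrates the worry (numbers are illustrative, not from the book): draw two groups from the same distribution, and the largest gap you observe by pure chance shrinks as the sample grows.

```python
import random

random.seed(1)

def mean_gap(n):
    """Difference in means of two groups drawn from the SAME distribution:
    any gap observed is pure chance."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    return abs(sum(a) / n - sum(b) / n)

for n in (5, 50, 500):
    largest = max(mean_gap(n) for _ in range(2000))
    print(n, round(largest, 2))   # the largest chance gap shrinks as n grows
```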
pp.120-3: Discusses the shortcomings of the experimental economics field, which has become quite influential (see Dan Ariely's book). Many of these issues are exactly the things statisticians worry about... selection bias, nonresponse bias, generalizability, observer effect, etc. Well worth reading and pondering.
What I don't get about this section is how they can be so negative on these "lab studies" while simultaneously so admiring of John List's research. As far as I can tell, List also ran "lab studies". The main difference is that in some cases, his subjects were not students but traders taken from baseball-card trading floors. These subjects knew they were part of an experiment and were instructed to do certain things.
But I agree with their general points about lab research, and also think List's research is useful. I hope List and other researchers will reach out to the market research and political polling communities because many of the problems they face are not new problems; they have been studied for a long time, and these communities can help each other.
pp.122-3: They continue the takedown of lab research, citing a researcher saying "lab experiments have the power to turn a person into a 'stupid automaton' who may exhibit a 'cheerful willingness to assist the investigator in every possible way by reporting to him those very things which he is most eager to find'." This is where they introduce the Stanford and Yale experiments as proof of "forced cooperation".
When I was taught about these experiments (admittedly many years ago), they were evidence of the human capacity to do bad things, but L&D's point here is contrarian: they say these bad things happen in labs but not in real life. This interpretation is new to me.
p.124: am glad to hear about "warm-glow altruism" research, would love to learn more.
pp.125-: these pages are a take-down of the New York Times reporting on the Genovese murder. Interesting story, looking for a rejoinder.
The third section of Chapter 2 of SuperFreakonomics relays the apparent success of a British bank analyst in predicting suspected terrorists using bank data. My overall reaction to this section can be read on Eric McNulty's blog here.
Eric is the editorial director of the International Institute for Analytics, which has the ambition to connect together practitioners and researchers in the field of business analytics. Tom Davenport, who has done some great work documenting the burgeoning field of business analytics, is its research lead. (See book 1 and book 2).
I used L&D's example to illustrate the "secrets" of predictive modeling, which is widely used in businesses to perform tasks ranging from credit scoring to targeting marketing offers.
A few highlights, in the order they appear in the piece:
To keep to a respectable length, I had to cut out several sections from the original draft. They appear below:
Gaming the algorithms
Why should suicide bombers buy life insurance? asked Levitt and Dubner in the chapter's title. Horsley discovered that people who own life insurance policies are less likely to be suspected terrorists than the average bank customer, all else being equal. Noting that life insurers do not cover claims in case of suicide, the authors speculated that future suicide bombers could proactively buy life insurance to confuse Horsley's predictive model.
Indeed, if they took such an action, the model would (mis-)classify these crafty criminals as regular customers. This is why the details of predictive algorithms should be treated as trade secrets.
Consumer advocates misfire when they exhort banks to open up the "black boxes" of credit scoring. When it became known that the authorized-user status was a positive indicator, EBay-style marketplaces sprang up in which people with high credit scores sold their desirable status to the highest bidders, thus distorting the entire system. Eventually, the modelers dropped this feature from the credit scoring algorithms.
Levitt and Dubner refrained from revealing the best indicator of suspected terrorists "in the interest of national security". This "Variable X" has predictably led to rampant speculation. There are two reasons to keep a straight face while the commotion ebbs and flows.
As explained, Horsley's model is far from "accurate" if accuracy includes identifying most, if not all, of the suspected terrorists.
Moreover, a key feature of "X" is its specificity: as Levitt and Dubner pointed out, very few bank customers exhibit this behavior, few enough to narrow the suspicious list to 30 names out of 50 million. It is highly probable that any correlation that exists between "X" and being a suspected terrorist is "spurious", to use a statistical jargon. All predictive models attempt to make generalizations from history, and any algorithm which targets an extremely specific trait risks not being general enough. When this happens, the model will be ineffective in predicting the future, even if it performs well on back-testing.
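To see why even a highly specific indicator produces mostly false alarms, here is a back-of-the-envelope calculation. The 50 million customers figure is from the book; the sensitivity, specificity, and number of true suspects are my own hypothetical assumptions.

```python
# 50 million customers is from the book; everything else is hypothetical.
customers = 50_000_000
true_suspects = 500            # assumed number of actual suspects in the base
sensitivity = 0.5              # assumed: model flags half of the true suspects
specificity = 0.9999           # assumed: 1 false alarm per 10,000 innocents

flagged_guilty = true_suspects * sensitivity
flagged_innocent = (customers - true_suspects) * (1 - specificity)
precision = flagged_guilty / (flagged_guilty + flagged_innocent)
print(int(flagged_guilty), int(flagged_innocent), round(precision, 3))
# Even at 99.99% specificity, innocents flagged outnumber suspects ~20 to 1.
```

This base-rate arithmetic, not any particular "Variable X", is what dooms screening for extremely rare traits.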
Back to basics
In the discussion thus far, I entertained the conceit that Horsley was attempting to predict "suspected terrorists". In building his model, he started with a list of suspects, rather than proven terrorists. This decision stemmed from the dearth of known terrorists, which, as I noted before, is what makes this problem hard. Levitt and Dubner justified this choice: "Granted, none of these men were proven terrorists; most of them would never be convicted of anything. But if they resembled a terrorist closely enough to get arrested, perhaps their banking habits could be mined." (Later, they went from talking about suspected terrorists to suicide bombers; surely not all terrorists blow themselves up.)
The analysis of batting averages clearly shows that almost all terrorist suspects are probably innocent. Thus, an algorithm tailored to looking for suspected terrorists will yield suspects who are mostly innocent people falsely accused of extremely serious crimes. Horsley effectively modified the objective of his predictive model when he switched from predicting terrorists to predicting suspects.
At the earliest stage of the development of predictive models, you must be crystal-clear in defining the business problem and setting the relevant target(s) of prediction. Confusion on this most basic issue will almost certainly result in disappointing models that solve the wrong problem.
The second part (pp.66-87) of Chapter 2 is concerned with how to use hospital data to compare the "skill" of doctors. A doctor who has "skill" creates greater than average improvement in the outcomes of his or her patients, after controlling for other factors, such as the type of patients. The technical hurdle comes from the non-random assignment of patients to doctors (by triage nurses) so that, for example, the best doctors may get the patients with lower-than-average chances of survival. If we compare the average survival of patients by doctor, the difference could reflect the "skill" of the doctors, or the survivability of the assigned patients, or some combination of both. This, as L&D point out, is a form of selection bias. We would like to control for patient assignment, and isolate the "skill" factor.
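As a toy illustration of the selection-bias problem (my own simulation, NOT the Duggan-Levitt method), suppose the skilled doctor is assigned sicker patients: raw survival averages then rank the doctors backwards, while an observed-minus-expected adjustment for patient severity recovers the skill difference. Note that this sketch cheats by using the true severity model for the "expected" rate; in practice that would itself have to be estimated.

```python
import random

random.seed(0)

def simulate(doctor_skill, severity_mean, n=20000):
    """Patients as (severity, survived) pairs: sicker patients die more often,
    and a skilled doctor adds a constant bump to survival probability."""
    rows = []
    for _ in range(n):
        sev = min(max(random.gauss(severity_mean, 0.15), 0.0), 1.0)
        p = min(max(0.9 - 0.6 * sev + doctor_skill, 0.0), 1.0)
        rows.append((sev, 1 if random.random() < p else 0))
    return rows

good = simulate(doctor_skill=0.05, severity_mean=0.7)  # skilled, sicker patients
avg = simulate(doctor_skill=0.00, severity_mean=0.3)   # average, healthier patients

def raw(rows):
    return sum(s for _, s in rows) / len(rows)

print(round(raw(good), 2), round(raw(avg), 2))  # raw averages rank them backwards

def expected(rows):
    """Survival expected from severity alone (using the true model here;
    in practice it would be estimated from all patients pooled)."""
    return sum(min(max(0.9 - 0.6 * sev, 0.0), 1.0) for sev, _ in rows) / len(rows)

print(round(raw(good) - expected(good), 2))  # ~ +0.05: the skill, recovered
print(round(raw(avg) - expected(avg), 2))    # ~ 0.00
```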
The structure of this section closely resembles a statistical analysis process, is intelligently laid out, and so makes for enjoyable reading. The steps in this process include:
I have trouble understanding exactly what their methodology is. (This shows why equations are much better than words when it comes to describing methodology, but if we wanted to learn methods, we wouldn't be reading Freakonomics.) The Notes reference a working paper by Mark Duggan and Levitt, which I could not locate online, nor at either of their home pages. So I will summarize what they indicated in the book, and then comment on what I think the other key issues are for this problem.
The Methodology (primarily pp.78-79)
They have clearly thought a lot about this problem, and I am not disagreeing with the method. I just don't comprehend it completely. It would be interesting to see some exploratory data on the doctor-patient matching, which could give color to the nonrandom assignment issue.
I now list some other issues that L&D don't address directly but are worthy of attention:
Other thoughts on the rest of Part 2:
p.62 -- What Alan Krueger conducted sounds like a matched "case-control" study. Cases were the martyrs, and controls were men matched by age. This type of study is usually analyzed using odds ratios, e.g. for the poor family factor, this would be ((0.28)/(0.72))/((0.33)/(0.67)) = 0.79 (reciprocal is 1.27). The odds of a martyr coming from a poor family is about 80% the odds of a non-martyr coming from a poor family. Based on this calculation, I suspect the factor is not statistically significant. Good study but not the results we would hope for, and it confirms that there is no easy way to predict martyrdom.
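Spelling out the odds-ratio arithmetic above (proportions from the text; without the group sample sizes, no confidence interval can be attached):

```python
# Proportions from the text: 28% of "martyrs" vs 33% of matched controls
# came from poor families.
p_case, p_control = 0.28, 0.33
odds_ratio = (p_case / (1 - p_case)) / (p_control / (1 - p_control))
print(round(odds_ratio, 2), round(1 / odds_ratio, 2))   # 0.79 1.27
```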
p.63 -- L&D express surprise at finding that terrorists have above-average education and social status. Note that people who perpetrate credit card scams are often PhDs.
pp.65-6 -- Useful walkthrough of how to compute the total cost of terrorist attacks, beyond just counting dead bodies.
p.68 -- They argue that 9/11 exposed the lack of "surge capacity" in our emergency rooms in hospitals. "If there had been a thousand victims, would they even have gotten inside?" I'm not sure about the wisdom of designing emergency rooms to accommodate extremely rare events like 9/11. Would like to see an economic analysis of cost and benefit.
They follow with some eye-opening information on the design of emergency rooms, narrated by Craig Feied. The bit on air recirculation inside a hospital is startling. And patients do die from ailments they pick up after entering the hospital.
pp.70-72 -- Hospitals had poor data back in those days, and Feied, the ER modernizer, had to get his hands dirty collecting the data, which is true of most data projects. The following sentence begs to be translated back into technical jargon: "Their system would deconstruct each piece of data from every department and store it in a way that allowed it to interact with any other piece of data, or any other 1 billion pieces."
p.73 -- I am not sure I want to meet Mr. Feied. They say "when challenged, he wouldn't rest until he found a way to charm, or, if need be, threaten his way to victory."
In looking at the details of Chapter 1, I neglected to discuss its theme. The part of Freakonomics that appeals to me concerns how data is harnessed to answer interesting questions. Beneath the stories, Chapter 1 is primarily concerned with the collection of data, rather than the analysis of data. Indeed, what counts as analysis consists of a few sample averages (e.g., how much does the "typical prostitute" earn?) and a few subgroup comparisons (e.g., the relative costs of different sex acts).
Turning now to Chapter 2 (the "terrorism" chapter): I find the material here much richer for the statistically minded reader, and well worth my time.
The chapter has a tri-partite structure: the first section deals with a dazzling assortment of statistical factoids, in a presentation that will either infuriate or engage the statistician, as I will explain below; the second section looks at how ER doctors can be compared even though the assignment of patients is not randomly determined; and the third section describes how one British mystery person uses bank data to find suspected terrorists.
As I indicated, Chapter 2 Part 1 (pp. 57-62) will either infuriate or engage you. A large variety of statistical factoids are examined, from which I list three representatives:
If this were a statistics book, the author would use these examples to illustrate the notion of "spurious correlations". Within the Muslim community, there is a correlation between certain birthdays and a higher incidence of disabilities. However, this is a spurious correlation because the day of birth does not cause disabilities; what is happening is that those birthdays are correlated with fasting mothers, and fasting causes some babies to grow up with disabilities.
L&D take a different approach; they play up the correlations for effect. They say things like "it is no exaggeration to say that a person's entire life can be greatly influenced by the fluke of his or her birth." (p.58) In the case of soccer leagues, they say "birth timing may push a marginal child over the edge." (p.62).
For this situation, I would stress that birthday is a useful indicator of a child's likelihood of making the league, but it is not a cause. The reason the birth-month distribution is skewed is that kids born in Jan, Feb or March are older and stronger than those born in Oct, Nov or Dec, and therefore are more likely to earn the coach's favor.
The discussion of economics Nobelists is stranger still. L&D cite the researchers' conclusion that "one of us is currently contemplating dropping the first letter of her surname", adding that the "offending" name was Yariv. Why would any economist want to change his or her name to begin with "A"? The only reason I can think of is the belief that having a last name beginning with "A" causes one to have a greater chance of winning a Nobel.
It is clear that L&D know the difference between causation and correlation, so I think this is an attempt to make the material interesting. This presentation forces me to delve into what is a cause and what is not; therefore, I find it engaging. Others may find it infuriating.
Other thoughts on Part 1:
p.59 -- If the women who survived the Spanish flu pandemic then suffered "terrible luck" "over their whole lives", are L&D saying it would have been better for them to have died from the flu?
p.61 -- I'm not sure how this sentence escaped Levitt's attention; this is an egregious error:
Most youth [baseball] leagues in the U.S. have a July 31 cutoff date. A U.S.-born boy is roughly 50 percent more likely to make the majors if he is born in August instead of July. Unless you are a big, big believer in astrology, it is hard to argue that someone is 50 percent better at hitting a big-league curveball simply because he is a Leo rather than a Cancer.
Likelihood of making the majors is not the same as likelihood of hitting a big-league curveball! Indeed, in such a competitive field, the difference in batting averages between a kid who makes the majors and one who narrowly misses out is likely to be a matter of hundredths or even thousandths. While on average the August class may have a 50 percent higher likelihood of making the majors, the batting average of the August class is extremely unlikely to be 50 percent higher than that of the July class.
(The last sentence also shows that they realize date of birth is not a cause. That's why I think the presentation style is deliberate.)
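A toy simulation (all numbers hypothetical) shows how a tiny age-related edge can translate into a roughly 50 percent higher chance of clearing a far-tail selection threshold, with no implication that the selected kids hit 50 percent better:

```python
import random

random.seed(2)

def p_make_majors(age_edge, n=200000, cutoff=3.0):
    """Toy model: making the majors means skill + edge clears a far-tail cutoff."""
    made = sum(1 for _ in range(n) if random.gauss(0, 1) + age_edge > cutoff)
    return made / n

p_july = p_make_majors(0.00)   # youngest in the cohort: no edge
p_aug = p_make_majors(0.13)    # oldest: a 0.13-sd edge, purely hypothetical
print(round(p_aug / p_july, 2))  # around 1.5: a ~50% higher chance of making it
# ...even though the assumed skill edge is only 0.13 standard deviations.
```

The far-tail cutoff is what amplifies a small edge into a large selection ratio.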
p.62 -- In a reference to the above baseball example, L&D make the side comment that in determining a boy's chance of making the majors, other factors may be "infinitely more important than timing an August delivery date". Are they thinking about the birthday as a cause or a correlation? I can't decide. (Trying to time the delivery would correspond to believing that being born a Leo rather than a Cancer would help, which seems to contradict the bit on p.61.)
p.61-2 -- On p.61, they talk enthusiastically about Anders Ericsson who argues that stars are made, not born. L&D even wrote an article called "A Star is Made". On p.62, they disclose two almighty factors that are much more important than "birth effects" for being able to play in the majors: being born a male, and having a father who played in MLB. But aren't both those factors born, not made?
p.62 -- They end with this assertion: "So if your son doesn't make the majors, you have no one to blame but yourself; you should have practiced harder when you were a kid." I learn a couple of things from this: (1) their readers are men; (2) training harder causes me to have a higher chance of making the majors, which causes my son to have a higher chance of making the majors.
Will write about the rest of Chapter 2 in a future post.
Many readers will, or have, read SuperFreakonomics. I'm making my way through the book, and keeping a log of my thoughts. Here is how one statistician takes in Chapter 1 (the "sex" chapter).
p.20 -- I was surprised to learn that women used to have shorter life expectancy than men; I have always thought women live longer. This factoid is used to show that throughout history, "women have had it rougher than men" but "women have finally overtaken men in life expectancy". I'm immediately intrigued by when this overtaking occurred. L&D do not give a date, so I googled "female longevity": the first hit said "it appears that women have out survived men at least since the 1500s, when the first reliable mortality data were kept"; the most recent hit cited CDC data showing that U.S. females have outlived males since 1900, the first year of reporting. In the Notes, L&D cite a 1980 article in the journal Speculum, published by the Medieval Academy. In any case, the cross-over probably occurred prior to any systematic collection of data, so I find this minor section less than convincing.
p.20 -- L&D tell us "In China,... females are still far more likely than males to be abandoned after birth, to be illiterate, and to commit suicide." How should one interpret such statistics? My hunch is that among countries with similar literacy rates as China, it is probably true that females are more likely to be illiterate than males. If so, is the gap in China significantly larger than in other countries? The UN data is easy to find: overall, male, female adult literacy -- China: 91, 95, 87; Singapore: 93, 97, 87; Malaysia: 89, 92, 85; Philippines: 93, 93, 93; Thailand: 93, 95, 91; Mexico: 91, 92, 90; Indonesia: 90, 94, 87; etc. In no way is the inequity in adult literacy in China special. The comment on suicides makes more sense, as in most countries men are more likely to kill themselves, but it's the reverse in China.
p.21 -- L&D cite "For American women twenty-five and older who hold at least a bachelor's degree and work full-time, the national median income is about $47,000. Similar men, meanwhile, make more than $66,000, a premium of 40 percent." I'm assuming $66,000 is a median income as well. A ratio of two median incomes is not very useful; it tells us nothing about the distributions of the male and female incomes (which are very skewed). A more useful statistic is the percentile of $47,000 in the male income distribution: in other words, the mid-rank female earns less than X% of her male counterparts.
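For example, if male incomes were lognormal with median $66,000 (the spread below is an assumption on my part), the mid-rank female income of $47,000 would sit at roughly the 29th percentile of the male distribution:

```python
import math
import random

random.seed(3)

# Hypothetical: male incomes lognormal with median $66,000; sigma is assumed.
sigma = 0.6
male = [66000 * math.exp(random.gauss(0, sigma)) for _ in range(100000)]
pct = sum(1 for m in male if m < 47000) / len(male)
print(round(pct, 2))  # about 0.29: the mid-rank female out-earns only ~29% of men
```

The percentile depends heavily on the assumed spread, which is exactly why the ratio of medians alone is uninformative.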
p.21 -- They are chatting about causes of the male-female wage gap. "Even within high-paying occupations like medicine and law, women tend to choose specialties that pay less (general practitioner, for instance, or in-house counsel). And there is likely still a good amount of discrimination. This may range from the overt -- denying a woman a promotion purely because she is not a man -- to the insidious." I wish they made the duality of the cause-effect linkage clearer. The first factor claims women select low-paying jobs, while the second factor says high-paying jobs (their hiring managers) select men. This is a common hiccup in causal inference research: which direction does the arrow of causality point?
p.22 -- They make the argument that Title IX boosted the appeal of coaching jobs for women's sports teams. To prove this, they say only 6 out of 13 WNBA teams had female head coaches as of 2009. For some reason, they next tell us that ten years ago, only 3 of 14 WNBA teams had female head coaches. Are they saying the prestige of WNBA coaching jobs has declined in appeal over time? I'm confused.
pp.23-4 -- They cite several statistics of the weekly wages of prostitutes in Chicago, in historical dollars as well as in current dollars. First there was a girl who took in $25 a week in old dollars, and $25,000 a year in current dollars. This girl was described as "at the very low end of what Chicago prostitutes earned". So I'm expecting to learn the higher wages others make. The next sentence reads: "a woman working in a 'dollar house' (some brothels charged as little as 50 cents; others charged $5 or $10) took home an average weekly salary of $70, or the modern equivalent of about $76,000 annually." I just couldn't figure out how the words inside the parentheses relate to the rest of the sentence. A "dollar house" doesn't sound like a place where a lot of money is made.
p.23 -- A study estimated that "1 out of every 110 women in that age range [15-44] was a prostitute". This type of statistic is designed to make us think someone in this restaurant (or train, etc.) is a prostitute. But most often, it is misleading. The number is computed by dividing the number of prostitutes by the number of women. It assumes that every woman has the same chance of being a prostitute which is obviously not true. L&D realize this and add: "1 out of every 50 American women [in their twenties] was a prostitute." This doesn't go far enough. Later, on p.32, they inform us that "prostitution is more geographically concentrated than other criminal activity", which means that the chance that a twentysomething is a prostitute is highly dependent on where she lives.
pp.27-8 -- Has a very nice description of why survey research has many limitations, especially when it comes to asking questions about sensitive subjects, like sex, stealing, racism and so on. A cautionary tale for reading polling and market research data.
pp.28-9 -- Pondering how, and why, Venkatesh's method is better. Are former prostitutes more likely to elicit the truth about prostitution than others? If one wants to learn about male chauvinism, would male workers be more likely to get to the truth than female workers? (It's unclear whether the former prostitutes were paid; L&D use the word "hired". The prostitutes being studied were paid.) This highlights the importance of understanding the motivations (and resulting biases) of data collectors. The bias introduced by paying participants is well known in the survey arena, but tolerated in order to achieve an acceptable response rate.
p.29 -- They cite statistics about "the typical prostitute in Chicago." In what ways are the subjects of the study "typical" and in what ways are they not typical? The sample size was 160. They don't say much about the selection process of the subjects, except that they all came from three South Side neighborhoods. Would like to know more about the selection.
p.29 -- "At least 3 of the 160 prostitutes who participated died during the course of the study." Don't use the phrase "at least"! It sounds sloppy, and it is sloppy, as "at least 3" technically includes "everyone". This is a documented study with a small sample; they should know exactly how many died.
p.30 -- After much buildup, we get to their surprise: "Why has the prostitute's wage fallen so far?" I'm looking for the data: what does "so far" mean? All we have is the assertion "the women's wage premium pales in comparison to the one enjoyed by even the low-rent prostitutes from a hundred years ago." On the previous page, we learn that modern "street prostitutes" earn $350 per week. On p.24, we learn that in the past, Chicago prostitutes took in $25 a week, "the modern equivalent of more than $25,000 a year". Unfortunately, neither of these two numbers is comparable to $350. Dividing $25,000 by 50 weeks (approx.) gives $500 per week. So the drop is $150 off $500, or 30%. But... this is a comparison of wages from prostitution, not of "wage premium". On p.29, the modern study found "prostitution paid about four times more than [non-prostitution] jobs." On p.23, they say "a tempted girl who receives only $6 per week working with her hands sells her body for $25 per week", so we can compute the historical ratio as $25/$6 = 4.17 times. So, I must have gotten the wrong data.
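Working through the arithmetic in the paragraph above (all figures from the text):

```python
# Wages from prostitution, then and now:
historical_weekly = 25000 / 50          # "$25,000 a year" at ~50 weeks/year
modern_weekly = 350                     # street prostitutes today, p.29
drop = (historical_weekly - modern_weekly) / historical_weekly
print(historical_weekly, round(drop, 2))        # 500.0 0.3

# But the book's claim concerns the wage *premium* over alternative work:
historical_premium = 25 / 6             # $25/week vs $6/week, p.23
modern_premium = 4                      # "about four times more", p.29
print(round(historical_premium, 2), modern_premium)   # 4.17 4
```

On the premium measure, then and now are nearly identical, which is why the "fallen so far" claim is hard to pin down.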
pp.30-31 -- Some interesting comparisons: only 5 percent of men today lose their virginity to a prostitute, versus 20 percent of those born in the 1930s. Just bear in mind their earlier warning about truthfulness in research studies involving sensitive topics.
p.32 -- They assert "prostitution is more geographically concentrated than other criminal activity: nearly half of all Chicago prostitution arrests occur in less than one-third of 1 percent of the city's blocks." I have several problems with this sentence. What is the concentration of other criminal activities? Arrests are not the same as prevalence. And, a few pages later (p. 41), they will make the startling claim that "a Chicago street prostitute is more likely to have sex with a cop than to be arrested by one."
p.33 -- A table of sex acts and their average prices. It's important to establish the sample sizes underlying the average prices. The researcher documented 2,200 sex acts, and the least frequent act accounted for 9% of those, so about 200 acts. To establish the margin of error around those averages, I'd also need the spread of the individual prices.
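A rough margin-of-error calculation under stated assumptions (the ~9% share and 2,200 total acts are from the text; the $20 price spread is hypothetical):

```python
import math

n_total = 2200                 # documented sex acts, from the text
n_least = int(n_total * 0.09)  # least frequent act: about 9% of the total
assumed_sd = 20                # hypothetical spread of individual prices, in $
se = assumed_sd / math.sqrt(n_least)
print(n_least, round(1.96 * se, 2))   # 198 acts; 95% margin of about +/- $2.79
```

With roughly 200 acts per category, the averages would be fairly precise unless individual prices vary far more than the $20 assumed here.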
p.40 -- They compare a real estate agent to a pimp. Some data is used to justify the claim that the Internet has reduced the power of real estate agents while the Internet "isn't very good -- not yet, at least -- at matching sellers to buyers". Therefore, the impact of a pimp is larger than that of a real estate agent. I would like to see a study of the Internet substituting for pimps. As it stands, this is an assertion without proof.
p.46 -- Some of the language is overdone. They say the men "blew away" the women in a version of an SAT-style math test with twenty questions. What does "blowing away" mean? Scoring 2 more correct questions out of 20.
pp.47-8 -- Tackles a study on the wage changes of men and women who underwent sex-change operations. As they point out, this study really doesn't answer the question of what might happen if men were randomly made into women, or vice versa; the problem is that the subjects were not randomly selected. The study found men who became women lost a third of their previous wages. This would imply they did not keep their prior jobs. But does this job change show that women gravitate to poorer-paying jobs, or that higher-paying jobs select men? The direction of causation crops up again, and we are no closer to the answer.
The rest of the chapter -- They discuss Allie, a high-end prostitute. This section has little interest for a statistician since it is a sample of one.
Please do let me know if this sort of review is useful or not.
PS. Andrew has some thoughts here.