If you're reading this blog, you have probably heard that correlation does not imply causation. Apparently, health-beat reporters and medical researchers publishing in peer-reviewed journals know this too. They often explicitly warn us that their studies only show correlation, and that they cannot explain why the results are as they are. All is well if they stop there. But they don't.
All too often, they succumb to "causation creep". All of a sudden, they interpret the results in a way that presumes causation.
Here are two particularly egregious examples that appeared on the Vitals blog at MSNBC.
In the first example (link), we are told:
Researchers followed 397 children from pregnancy (sic) through their first year of life, and found that those living with dogs developed 31 percent fewer respiratory tract symptoms or infections, 44 percent fewer ear infections and received 29 percent fewer antibiotic prescriptions.
And then comes the disclaimer:
the researchers acknowledged that [they] couldn't account for all such factors [other than living with dogs that can explain the finding], and noted that they found a correlation, not a cause-and-effect relationship.
Finally, causation creep happens:
For healthier kids, get a cat or dog, study suggests.
This is a cause-effect statement, there is no getting around it. It's saying that if you get a cat or a dog, your kids will be healthier. If the researchers truly believe that they found a correlation, then drawing this causal conclusion is unconscionable. Chances are, they just thought this was a fun thing to say (or at least the headline writer did), allowing causation to creep in.
PS. The first person who commented on the article suggests that perhaps families that own pets also spend more time outdoors (playing with pets) and it could be the outdoor exposure that causes the observed effect. We don't know why but one thing is for sure: families who keep pets are not at all the same as families who don't keep pets.
PPS. It may be cruel at this point to note that the so-called "cat" effect is 2 percent fewer antibiotics, which amounts to nothing. Also, no urban families were included in this analysis.
The second example (link) is, if anything, worse.
We are told:
Katzmarzyk and colleagues analyzed information from five earlier studies involving more than 167,000 adults that looked at the link between sitting and risk of dying from any cause over the next four to 14 years... About 27 percent of deaths in the studies could be attributed to sitting, and 19 percent to television viewing, the researchers said.
The researchers noted their study assumed a cause-effect link between sedentary behavior and risk of dying, which further research should validate, they said.
Finally, a blossom of colorful statements, all of which presume the causation that was assumed without proof:
Sit less than 3 hours a day, add 2 years to your life (sic)
Reducing the daily average time that people spend sitting to less than three hours would increase the U.S. life expectancy by two years (sic)
The study adds to a growing body of evidence suggesting that sitting itself is deadly. (sic)
These researchers should realize they are not doing Freakonomics. It's true Levitt and Dubner also sometimes succumb to causation creep, especially in the more casual pieces. But in the original book, where they described how abortion policy could have caused crime to fall, they went through a rigorous explanation to rule out many other possible explanations. None of the studies cited here (or many others like them) shows the care that is required to claim causation. We should note that the abortion finding has been debunked, which just goes to show how hard it is to prove causation with data that is conveniently collected.
Carl Bialik, a.k.a. the Numbers Guy at WSJ, wrote a nice piece (link) trying to explain something that is very difficult to explain to a general audience... the notion of statistical significance. He discusses this in relation to the experiments that have supposedly proved the existence of the Higgs boson.
I won't repeat his entire piece here. I had these thoughts while reading it:
The physicists talk about the LEE - the look-elsewhere effect. In statistics, we call this "multiple comparisons". We are typically looking for something out of the ordinary, say the agent that causes an illness. But patients do fall ill on their own. So when the event occurs, we have to determine whether it is explained by something extraordinary or is just a normal occurrence. We want to reduce the chance of a false-positive finding. The harder we look, the more likely we are to discover something that turns out to be false.
In Chapter 2 of Numbers Rule Your World, I talked about epidemiologists using questionnaires to help unlock the source of E. coli infections. The theory is that if food X is the cause of the outbreak, then a much higher proportion of those patients who fell ill (the cases) should have consumed X than those people who did not fall ill (the controls). The key issue is identifying food X. The "multiple comparisons" issue is that if the epidemiologists asked about every possible variation of all food items, to the level of, say, Brand Y green-red spinach with 2-inch stems packed in 10-ounce clear plastic bags, then they run the risk of identifying the wrong culprit.
Imagine finding the average case and the average control and seating them on the same stage. We ask how many eggs each ate last week. The case says two, the control three. That's not a big enough difference to say eggs caused the case to get ill and not the control. We then ask about hamburgers. The case ate one, the control none. Imagine going from food to food to food. Eventually we will chance upon some food item in which one side ate a lot more or less than the other side. Does this difference prove that that food item caused the E. coli infection? Or is it that these two groups of people have certain differences in eating habits regardless of their E. coli status?
Still not convinced? Imagine finding the average person wearing a bracelet/wristband and the average person who isn't wearing one. Ask them about what food they ate and go through all the same food items. If the list is long enough, we will surely find something that differentiates them. Is the difference caused by the wristband? That's the danger of "look elsewhere".
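To see how easily "look elsewhere" manufactures a finding, here is a minimal simulation with made-up numbers: both cases and controls answer a long food questionnaire, and in truth every food is consumed at the same rate by both groups, so any "significant" difference is a false positive.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_cases, n_controls = 30, 30   # a small outbreak investigation
n_foods = 100                  # length of the questionnaire

false_hits = 0
for _ in range(n_foods):
    p_eat = rng.uniform(0.1, 0.9)             # how common this food is
    cases = rng.binomial(1, p_eat, n_cases)    # 1 = ate the food
    controls = rng.binomial(1, p_eat, n_controls)
    # 2x2 table: ate / didn't eat, by case / control
    table = [[cases.sum(), n_cases - cases.sum()],
             [controls.sum(), n_controls - controls.sum()]]
    _, p_value = stats.fisher_exact(table)
    if p_value < 0.05:
        false_hits += 1

print(f"'Significant' foods out of {n_foods}: {false_hits}")
# A handful of foods typically clear the 5% bar purely by chance.
```

With a hundred questions and a 5 percent threshold, a few false "culprits" are almost guaranteed, which is why investigators either pre-specify hypotheses or demand much stronger evidence.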
You will notice that we don't have a "solution" for this problem. All we can do is make the standard of evidence tougher, and accept that this is an occupational hazard.
As Carl pointed out, the medical context is in some ways diametrically opposite to the particle physics context. In epidemiology, data is extremely scarce, and we must look for very big differences to be comfortable about the result. In particle physics, we have lots of data that are generated in controlled experiments, and we are looking for tiny differences. This explains why the standard for accepting a finding is so much higher in physics than in medicine.
The other reason is that we believe that the laws of physics are immutable so even tiny deviations can disprove a theory. Biology (economics, psychology, etc.) does not have immutable laws - in fact, there are lots of causes acting together for almost anything, and when it comes to psychology (unless you don't believe in free will), even the same person is likely to act differently when presented with the same scenario twice. This is to say, small variations do not destroy such theories.
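For the curious, the two standards can be placed on the same scale. The physicists' "five sigma" discovery threshold corresponds to a far smaller tail probability than the roughly two-sigma (5 percent) convention common in medical studies. A quick sketch:

```python
from scipy import stats

# One-sided tail probability corresponding to an n-sigma excess
for sigma in (2, 3, 5):
    p = stats.norm.sf(sigma)   # P(Z > sigma) for a standard normal
    print(f"{sigma} sigma  ->  p = {p:.2e}")

# 2 sigma ~ 2.3e-02 (close to the 5% convention, one-sided)
# 5 sigma ~ 2.9e-07 (the particle-physics discovery threshold)
```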
Two cautionary tales appeared in press recently, serving notice to all "data scientists" (as statisticians are fancifully called these days). It's hard work to earn the status of a "science".
Via the New York Times comes the story of Dr. Robert Spitzer (link). As a young psychiatrist in the 1970s, he successfully pushed the profession to narrow the definition of homosexuality as a disorder. He observed wryly that many gay people are happy, and therefore only those who are depressed should be diagnosed.
In 2001, he presented new findings claiming to show that homosexuality can be "cured" by reparative therapy. This was his method:
He recruited 200 men and women, from the centers that were performing the therapy, including Exodus International, based in Florida, and Narth. He interviewed each in depth over the phone, asking about their sexual urges, feelings and behaviors before and after having the therapy, rating the answers on a scale.
He then compared the scores on this questionnaire, before and after therapy. “The majority of participants gave reports of change from a predominantly or exclusively homosexual orientation before therapy to a predominantly or exclusively heterosexual orientation in the past year,” his paper concluded.
He strenuously defended the study for years after it was published in a friend's journal without going through the typical peer-review process. (The article was published with commentaries by peers, which, according to the NYT, were "merciless".)
At 80, he is coming forward to apologize and retract the study. Bravo to him for doing this. But one wonders how the industry of science failed to expose this failing much sooner. Is it because of the stature of the researcher? Is it conformity? Is it because he circumvented the usual peer-review process? ...
The reporter said the biggest problem with the study was self-interested subjects lying about sensitive issues like these. Actually, no. The biggest problem is the absence of a control group - gay men and women who did not receive such therapy. It boggles my mind that a study done in 2001 would have only cases and no controls. The case-control methodology has been in use since the 1950s/60s.
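To see why the missing control group matters more than lying per se, consider a toy simulation with invented numbers: suppose every participant's self-reported score drifts upward over time for reasons unrelated to the therapy (social pressure, selective recall, regression to the mean). An uncontrolled before/after comparison attributes all of that drift to the treatment; a control group would expose it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Hypothetical "reported change" score before therapy
before = rng.normal(0, 1, n)

# Assume a reporting drift of +1.0 that affects everyone over time,
# whether or not they received the therapy.
drift = 1.0
after_treated = before + drift + rng.normal(0, 0.5, n)

# Uncontrolled before/after: the drift masquerades as a treatment effect
print("Apparent effect (no controls):", round((after_treated - before).mean(), 2))

# With a control group subject to the same drift, the difference-in-differences
# correctly shows no treatment effect.
before_ctrl = rng.normal(0, 1, n)
after_ctrl = before_ctrl + drift + rng.normal(0, 0.5, n)
did = (after_treated - before).mean() - (after_ctrl - before_ctrl).mean()
print("Effect after subtracting controls:", round(did, 2))
```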
If you think that was bad, hold your nose before you read this Wall Street Journal article about cancer studies (link).
Here is a sample of the stinky sentences (my italics in all cases):
After publishing a paper on a rare head-and-neck cancer, [Dr. Mandic] learned the cells he had been studying were instead cervical cancer...
Dr. Mandic entered a largely secret fellowship of scientists whose work has been undermined by the contamination and misidentification of cancer cell lines...
Cell repositories in the U.S., U.K., Germany and Japan have estimated that 18% to 36% of cancer cell lines are incorrectly identified.
Dr. Tarin has spent 25 years working with that cell line--or so he thinks. A body of research suggests that MDA-MB-435 isn't breast cancer; many scientists now believe...[it's] melanoma... Dr. Tarin disagrees.
The prevailing attitude [among scientists] is that the other lab's cell line may be contaminated but not mine.
Nearly 40 years later, ... found 1,000 citations of the same contaminated cancer lines revealed in Dr. Gartler's 1966 findings, which have since been replicated many times using more advanced techniques. "They [the scientists] are either crooks or stupid."
As data scientists like to say, "garbage in, garbage out". But who among us is courageous enough to voluntarily consign decades of our own research to the dustbin?
Andrew cites statistician Don Berry who explains what "lead time bias" is, and why survival time is always the wrong metric to use in evaluating health outcomes. Survival time is the time from diagnosis to death. By doing more screening and diagnosing earlier, survival time will magically increase even if the patient's life expectancy stays put.
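A quick simulation makes the point (illustrative numbers only): hold each patient's age at death fixed, move the date of diagnosis earlier via screening, and "survival time" improves even though nobody lives a day longer.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Each patient dies at age 70, regardless of when the cancer is detected.
age_at_death = np.full(n, 70.0)

# Without screening, the cancer is found late, at age 67 on average.
age_dx_late = rng.normal(67, 1, n)
# With screening, the same cancer is found three years earlier.
age_dx_early = age_dx_late - 3

survival_late = age_at_death - age_dx_late
survival_early = age_at_death - age_dx_early

print("Mean survival time, late diagnosis :", round(survival_late.mean(), 1), "years")
print("Mean survival time, early diagnosis:", round(survival_early.mean(), 1), "years")
print("Mean age at death in both cases    :", round(age_at_death.mean(), 1))
# Survival time jumps from ~3 to ~6 years, yet life expectancy is unchanged.
```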
I ignored Andrew's warning and spent some time reading the Philipson et al. paper (link). Time I want back but can't have. To save you the trouble, I will discuss a few gaping holes other than the howler already identified by Berry - there are many other less significant issues.
The title of the paper purports to address a "causal" relationship between health care spending and health outcomes overall. The reader immediately discovers that the authors in fact analyzed a different "causal" relationship: between spending on cancer care and survival gains for cancer patients.
It may appear that the substitutions are harmless: spending on cancer care is a proxy for overall healthcare spending; survival gains for cancer patients are a proxy for overall health benefits. The authors hid the useful information in the Appendix (available online). In Table 3, we learn that spending on cancer care is only a single-digit percentage of total health care spending in almost every country. Besides, the total deaths from the 13 types of cancer counted in their study constitute only 31 percent of the total cancer deaths in the U.S. (using the 2011 statistics from this report - PDF). The list of included cancer types excludes the biggest killer (lung cancer, over 150,000 deaths) while it includes testicular cancer, which caused 350 deaths in 2011.
So, even if the analysis is correct, the result cannot be generalized to talk about cost and benefit of all health care spending. This is an instance of "availability bias": even though cancer makes a lot of news, most health care spending has nothing to do with cancer, and so we can't use cancer care spending as a proxy.
In assessing the value of cancer care spending, the authors decided to use a modeled change in death rates, rather than the actual observed data. Neither in the paper nor in the appendix is the actual model reported, nor is there any information on goodness of fit. However, we don't need to know the model to know it doesn't fit.
Take a look at the fourth column of Table 1 in the Appendix. This column shows the predicted deaths avoided or incurred in the U.S. (given the additional spending in the U.S. relative to "Europe").
Let's do a sanity check on these numbers. For colorectal cancer, the model claims that the extra spending has avoided 282,000 deaths over the 23 years (1982-2005), or roughly 12,300 deaths per year. According to the cancer death statistics, about 50,000 deaths from colon cancer actually occurred in the U.S. in 2011. That means the model claims that colon cancer deaths would have been 25% higher were it not for the extra spending. What is the miracle drug that caused this gigantic improvement? What prevents this amazing new treatment from crossing the Atlantic?
Maybe you believe in miracles. Then, take a look at stomach cancer. Here, the negative number seems to imply that the additional spending has induced 225,000 stomach cancer deaths over 23 years. That sounds really horrifying. Given that stomach cancer killed 10,300 Americans in 2011, the model claims that the extra spending has doubled the number of deaths from stomach cancer!
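The sanity checks above are just arithmetic; here they are spelled out, using the figures as quoted in this post (the 23-year window follows the paper's accounting):

```python
# Figures quoted above: the paper's Appendix Table 1 plus 2011 U.S. death counts
years = 23                            # the 1982-2005 window, as counted in the post

colorectal_avoided = 282_000          # deaths the model says extra spending avoided
colorectal_actual_2011 = 50_000       # actual U.S. colon cancer deaths in 2011
per_year = colorectal_avoided / years
print(f"Colorectal: {per_year:,.0f} deaths avoided per year, "
      f"or {per_year / colorectal_actual_2011:.0%} of the actual annual toll")

stomach_induced = 225_000             # deaths the model says extra spending induced
stomach_actual_2011 = 10_300          # actual U.S. stomach cancer deaths in 2011
per_year = stomach_induced / years
print(f"Stomach: {per_year:,.0f} deaths induced per year, "
      f"or {per_year / stomach_actual_2011:.0%} of the actual annual toll")
# The induced deaths are nearly equal to the entire annual death count,
# which is what makes the model's claim so implausible.
```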
Simply put, their model makes no sense.
Now, go back to Table 3 in the Appendix and read the note. It says that missing data for the percentage of health care spending that is cancer related is imputed as 6.5% (30% higher than the U.S. assumption of 5%, which came from a totally different source), and we find that Iceland, Norway, Slovakia and Slovenia (40% of the countries) are all imputed.
The problem here is that the authors are not consistent in their treatment of missing data. In the main paper, they explain again and again that their sample of data is restricted by data availability (i.e. they didn't impute values for missing data). For example, the choice of the 10 European countries is because "only ten reported data consistently over the 1983-99 period". This means no Italy and no Spain, but you have Wales and Scotland (though no England), as well as Slovakia and Slovenia (why are they comparable to the U.S.?).
Why those particular 13 cancers? Because "data were consistently available from both the European and US survival databases". This means including testicular cancer and excluding lung cancer. Instead of imputing values for lung cancer, they just drop the cancer type that causes the most deaths.
Why look at survival differences only for patients diagnosed from 1995 through 1999? You guessed it. It's because only in that period can they find consistent data.
Given that they use models throughout the research, and they imputed values for proportion of spending on cancer treatment, they could have tried to impute values in these other decisions, and then the result could perhaps be generalized.
Dropping data because some variables are missing should be justified clearly. It's too easy to cherry-pick your dataset this way.
How about another nonsensical assumption? The average value of an additional year of life of someone who's dying from cancer is set at $150,000 to $360,000. They describe these as "standard figures for an extra year of life" and call the lower end of the range "conservative". Only 5% of Americans earn over $100,000 per year. The median personal income is less than $40,000. (From Wikipedia, for 2004, I think). Enough said.
It's sad that this paper gets publicity only because it reaches a conclusion that goes against "conventional wisdom". The clear evidence so far has been that while the U.S. spends twice as much on health care as other "wealthy" nations, our life expectancy is lower, at the bottom of the class. (See here, for example.)
The chart shown on the right is as clear as it can be. (I discussed this chart on Junk Charts.) The state of science journalism is dire, in my opinion, when outlets chase clicks and sales by publicizing bad studies with eye-catching headlines.
The New York Times featured a story about customer targeting recently. In particular, it describes an application of predicting which of Target's female customers may be pregnant. Pregnancy is considered a major life event during which customers may be more willing to shift their spending from one retailer to another.
I recommend reading the article to get a sense of what companies do with our data these days. Bear in mind it's written by a journalist who has a good but not firm grip on the details of statistical modeling.
In particular, I'd like to shed some light on the last two paragraphs of the article, which I reprint here:
On my way back to the hotel, I stopped at a Target to pick up some deodorant, then also bought some T-shirts and a fancy hair gel. On a whim, I threw in some pacifiers, to see how the computers would react. Besides, our baby is now 9 months old. You can’t have too many pacifiers.
When I paid, I didn’t receive any sudden deals on diapers or formula, to my slight disappointment. It made sense, though: I was shopping in a city I never previously visited, at 9:45 p.m. on a weeknight, buying a random assortment of items. I was using a corporate credit card, and besides the pacifiers, hadn’t purchased any of the things that a parent needs. It was clear to Target’s computers that I was on a business trip. Pole’s prediction calculator took one look at me, ran the numbers and decided to bide its time. Back home, the offers would eventually come. As Pole told me the last time we spoke: “Just wait. We’ll be sending you coupons for things you want before you even know you want them.”
Charles Duhigg, the author, does the same thing that other reporters do when it comes to writing about predictive models: there is no sense that these models can make any errors. The reason why Duhigg didn't receive "sudden deals on diapers or formula" at the Target store was interpreted as an instance of accurate prediction--that the computer figured out that he was on a business trip. The reason why he later would receive these offers "back home" was also interpreted as an instance of accurate prediction.
Earlier in the piece, when Duhigg discussed Target's decision to dilute the marketing materials sent to women predicted to be pregnant by mixing in non-pregnancy-related products, the tactic was portrayed as a way to cope with the remarkable accuracy of the predictive models. They even tell unsuspecting dads that their daughters are pregnant before the daughters have told their parents!
It's unfortunate that the coverage of statistical modeling has been laced with such hype. I hate to pop the bubble but most predictions made by such models are simply wrong. These models may work on average but that doesn't mean individual predictions will be right. (In Chapter 4 of Numbers Rule Your World, I discuss why businesses may deliberately make certain types of errors if they want to maximize profits.)
What's more embarrassing? To send a brochure filled with pregnancy-related products to women who are not pregnant, or to send the same brochure to women who are indeed pregnant but are surprised that Target knows? You see, mixing in random products serves to hide the inaccuracy of the underlying predictions - the false positives.
Here's a simple way to see this: let's say 10 percent of the female customers are pregnant at any given time. In order to find even 6 of those 10 percentage points, Target's model may have to flag, say, 12 percent of the women as pregnant. Right there, there will be at least 6 percentage points of incorrect predictions. The model I just posited is incredibly accurate: the base rate of pregnancy is 10 percent, while the rate of pregnancy among those targeted is 6/12 = 50 percent.
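Here is the same arithmetic laid out, using the illustrative rates from the paragraph above (not Target's actual figures):

```python
# Illustrative numbers only (not Target's actual figures)
customers = 10_000
pregnant_rate = 0.10          # 10% of female customers are pregnant
flagged_rate = 0.12           # the model flags 12% of customers as pregnant
true_positive_rate = 0.06     # 6% of all customers are flagged AND pregnant

pregnant = customers * pregnant_rate            # 1,000 pregnant customers
flagged = customers * flagged_rate              # 1,200 flagged customers
true_pos = customers * true_positive_rate       # 600 correct flags
false_pos = flagged - true_pos                  # 600 non-pregnant women get the brochure

precision = true_pos / flagged                  # share of targeted women who are pregnant
recall = true_pos / pregnant                    # share of pregnant women who are reached
print(f"False positives: {false_pos:.0f}")
print(f"Precision: {precision:.0%} (vs. a 10% base rate)   Recall: {recall:.0%}")
```

Even this very good model sends half its pregnancy brochures to women who are not pregnant, which is exactly the embarrassment the random filler products are meant to hide.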
Duhigg's experience at the store cannot be explained by the pregnancy prediction model because his baby is already 9 months old. In those confusing sentences cited above, he presumed that Target has several other predictive models running, including one that predicts whether the customer is on a business trip, one that relates the purchase of pacifiers to buying formula or diapers, and one that predicts what the customer would buy at home by analyzing what the customer bought while traveling.
Chances are Target doesn't have all those models. Remember Duhigg himself told us that marketers have identified the second trimester of pregnancy as the moment to target young couples. The implication is that it is very difficult to change their buying habits once the kid is 9 months old.
I just finished Emanuel Derman's new book, "Models Behaving Badly", which is a good introduction to the philosophy of statistical models. The topic has been swirling in my head after also having read this article by economist Dani Rodrik, who reflected on the recent walkout by some Harvard students of their introductory economics course.
In Rodrik's view, the students were right to protest the economics profession because the economic models being taught in the classroom are too simplistic. He paints a particularly eye-opening - and damning - scenario: in the undergrad classroom, as well as in public, the economist admits no doubts about his ideologies (such as "free trade", "free market") but in his "advanced graduate seminar on ... theory", the same professor would debate with skeptics, leading to a "heavily hedged statement" after "a long and tortured exegesis". The statement would begin with "if the long list of conditions I have just described are satisfied, ..."
I can imagine Derman entering that graduate seminar and declaring everything nonsense. (Derman currently teaches in the Financial Engineering program at Columbia, and previously worked on Wall Street as a "quant" building economic models, after spending his graduate career working with models of the physical world.) "Models Behaving Badly" is about how economic models can go off-track, how frequently they do, and why modelers must behave modestly. Derman would argue that Rodrik's "long list of conditions" is almost never satisfied.
There is a crucial difference between the assumptions made by the Black-Scholes Model and the assumptions made by a souffle recipe. Our knowledge about the behavior of the stock markets is much sparser than our knowledge about how egg whites turn fluffy.
He goes on to argue, perhaps unexpectedly, that the Black-Scholes Model is "the best model in all of economics". He aims his criticism squarely at the sacred cow of financial economics, the "Efficient Market Hypothesis".
Rodrik does not believe that the economics profession needs better models. He claims "Macroeconomics and finance did not lack the tools needed to understand how the crisis arose and unfolded." The fault of the profession was to have trusted the wrong models (ones assuming efficient and self-correcting markets). He believes that this bad choice of models is facilitated by "excessive confidence in particular remedies - often those that best accord with their own personal ideologies."
It isn't clear to me how Rodrik proposes to resolve the ideology problem. In fact, his citation of another economist, Carlos Diaz-Alejandro, perfectly captures the heart of the issue: "by now [1970s] any bright graduate student, by choosing his assumptions... carefully, can produce a consistent model yielding just about any policy recommendation he favored at the start."
The disease is more than ideological. Reading between the lines, I think these models are far too complex for their own good. They cannot be falsified with observed data. They can be made to support any ideology. This leads me to two observations:
Returning to the protesting Harvard students, Rodrik describes the discontent of the undergrad economics syllabus: "it is as if introductory physics courses assumed a world without gravity, because everything becomes so much simpler that way."
In making this analogy, Rodrik is giving economic models the status of models in physics. He's saying that there are simplified models in both disciplines which don't fit reality well, but there are complex models in both disciplines which work well.
Derman would beg to differ. Originally trained as a physicist, he now freely admits that "financial modeling is not the physics of markets". He spends a great portion of the book showing why economic models can never aspire to the status of physics models.
Reading Rodrik's analogy, one senses that he has yet to arrive at Derman's port. Rodrik continues to make parallels between physics and economics. But I know of no introductory physics course that assumes a world without gravity - the major omission is Einstein's relativity. There is, in fact, a huge difference between Newton's theory of mechanics and, say, the Capital Asset Pricing Model. Students who learn Newton's laws can explain how the world works without ever knowing any relativity theory. Newton's theory can stand on its own. Not so the simplistic economics models. As Derman points out, simple economics models are easily invalidated by observed data.
My own view, informed by years of building statistical models for businesses, is more sympathetic with Derman than Rodrik. There is no way that economic (by extension, social science) models can ever be similar to physics models. Derman draws the comparison in order to disparage economics models. I prefer to avoid the comparison entirely.
The insurmountable challenge of social science models, which constrains their effectiveness, is that the real drivers of human behavior are not measurable. What causes people to purchase goods, or vote for a particular candidate, or become obese, or trade stocks is some combination of desire, impulse, guilt, greed, gullibility, inattention, curiosity, etc. We can't measure any of those quantities accurately.
What modelers can measure are things like age, income, education, past purchases, objects owned, etc. Nowadays, we can log every keystroke you type on your smartphone (link). That models are even half-accurate is due to the correlation of these measured quantities with the hidden drivers of our behavior, but this correlation is only partial.
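A small simulation illustrates the ceiling this imposes (all numbers invented): let a hidden "impulse" drive the purchase decision, let the modeler observe only a proxy that is partially correlated with it, and see how far short of perfection the predictions fall.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Hidden driver of behavior (desire, impulse, guilt...) -- not observable
impulse = rng.normal(0, 1, n)
buys = (impulse + rng.normal(0, 0.5, n)) > 1.0      # the actual behavior

# What the modeler can measure: a proxy (age, income, past purchases...)
# that is only partially correlated with the hidden driver
proxy = 0.5 * impulse + rng.normal(0, 1, n)

# Target the customers with the highest proxy scores, matching the true buy rate
predicted = proxy > np.quantile(proxy, 1 - buys.mean())

print(f"Base buy rate: {buys.mean():.0%}")
print(f"Hit rate among those targeted: {buys[predicted].mean():.0%}")
# The hit rate beats the base rate handily, yet most individual
# "will buy" predictions are still wrong -- the proxy is too weak.
```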
Now add to that the vagaries of human behavior.
P stands for pandemic. And this article nicely describes the predicament of policymakers as they grapple with the early stages of a possible outbreak of a new strain of influenza. At best, they are basing their policies on educated guesses, with the emphasis on guessing, since data is in short supply.
This situation is akin to the one described in Chapter 2 of Numbers Rule Your World. At this point in the investigation, only one cluster of cases, in the U.S., can be traced back to pigs. "We don't want to overplay or underplay... we're trying to get that right," according to an official at WHO. A nice goal but, unfortunately, an unrealistic one. The reality is that there are winners and losers on either side of this zero-sum decision. Public health advocates have little to lose from false alarms - they have the "better safe than sorry" mentality. Powerful business lobbies have much to lose from false alarms - and their voices are being heard: WHO has been warned not to call this "swine flu", according to the journalist.
Here are three posts by DeLong, who likes the model:
Here are three by Cowen, who dislikes the model:
I will leave aside the macroeconomics (no expertise). What I care about is how one should, and should not, critique "models".
Since a model is an abstraction, a simplification of reality, no model is above critique.
I consider the following types of critique not deserving:
1) The critique that the modeler makes an assumption,
e.g. "it fudges the distinction between real and nominal interest rates". Making assumptions is not inherently bad, making bad assumptions is a problem but not all assumptions are bad.
2) The critique that the modeler makes an assumption for mathematical convenience,
e.g. "Don't assume they are the same, just to squash the two curves onto the same graph." Almost all assumptions are made for mathematical convenience. The inappropriate use of the Gaussian assumption, that most unpardonable of sins according to Taleb, is almost always invented to render the math tractable. But not all assumptions that simplify the math are bad assumptions.
3) The critique that the model omits some feature,
e.g. "those aggregate curves are not invariant to expectations", because this critique is no different from saying the modeler makes an assumption (see #1) More, what is a "bad" assumption? How does one determine which assumptions are bad among the set of all possible assumptions?
4) The critique that the model doesn't fit one's intuition,
e.g. "the model leads you to believe that interest rates are more important than they probably are". The model should fit reality (the data); it doesn't need to fit anyone's intuition.
5) The critique that the model fails to make a specific prediction,
e.g. "this distinction really matters when you're trying to predict the macro effects of 'window breaking'". No model, especially a macroeconomic model that can issue a large number of predictions, will ever predict everything. Not all predictions are equally important. One must agree on which predictions are the most important to get right, and make judgment based on the entire list of predictions.
Above all, a serious critique must include an alternative model that is provably better than the one it criticizes. It is not enough to show that the alternative solves the problems being pointed out; the alternative must do so while preserving the useful aspects of the model being criticized.
This whole debate reminds me of the climate change model controversies. I am not aware of an alternative model from those who dislike the consensus model. Until they offer such an alternative, their critique cannot be taken seriously, I'm afraid.
The underlying belief -- on both sides of the macroeconomic divide, it appears -- that someone's model can be proved "wrong" and thus discarded for all eternity is as dubious as the belief that a model can be proved "right".
One question you might want to ask is: how big a risk is it for you?
It should be clear that the level of risk is different for different people. For example, I don't eat ground turkey so my risk is the risk of CDC identifying the wrong culprit (a false positive), meaning that I might eat something else that would get me sick. If you do consume ground turkey, your risk is that of buying a pack of ground turkey that happens to be contaminated -- plus the risk of a false positive.
According to the official press release, about 10% of Americans consume ground turkey in any given week. This means about one-third of Americans consume it in any given 4-week period (1 - 0.9^4, assuming independence from week to week), and nearly 6 in 10 in any 8-week period.
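The arithmetic behind those figures, under the same independence assumption:

```python
# Roughly 10% of Americans eat ground turkey in any given week (per the press release)
p_week = 0.10

for weeks in (4, 8):
    # Probability of eating ground turkey at least once over the period,
    # assuming each week is independent of the others
    p_any = 1 - (1 - p_week) ** weeks
    print(f"At least once in {weeks} weeks: {p_any:.0%}")

# 4 weeks -> ~34% (about a third); 8 weeks -> ~57% (nearly 6 in 10)
```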
Because fresh food is perishable, the risk depends on time. Assuming a temporary source of contamination, the batches of bad food would make their way through the food cycle in a fixed amount of time. (Permanent sources would be much easier to identify because you can just test samples from each machine to find the culprit.)
The following time-line chart from CDC is instructive:
In the book, I discuss the perils of statistical modeling with so few cases, the chance of false positives, the difficulty of establishing a cause-effect relationship, the incentives of health agencies, the logic of food recalls, and so on.