A GMO labeling law has arrived in the US, albeit one that has no teeth (link). For those who don't want to click on the link, the law was passed in haste to pre-empt a more stringent Vermont law. The federal law defines GMO narrowly, businesses do not need to put word labels on packages (they can, for example, provide an 800-number), and violators will not be punished.
One of the arguments against GMO labeling is that it is unscientific because (some) scientists are 100% certain that GMO foods are safe. (e.g. this Boston Globe editorial)
Any good scientist knows that scientific "truths" are true only until they are proven otherwise. Science is a continuous process of making hypotheses and finding data to confirm or reject them. The Bayesian way of thinking is very useful here: truth comes in degrees of probability, and more confirmatory data increases the probability that a given hypothesis is true.
So why is GMO labeling good science?
In fact, I'd go so far as to say that there is no science without GMO labeling.
How is nutritional science done today? What is the research that tells us coffee is good, butter is good, salt is bad, etc.? Granted, this is a shaky field that has issued lots of false results. But the usual form of analysis goes like this: conduct a large survey of consumers and ask them about their diet (e.g. how much red meat do you eat each week?); obtain information about their health status, either through the same survey, a different survey, or direct measurements if they are part of a research study; then correlate the dietary data and the health data.
Now, imagine you want to study whether eating GMO foods affects your health, either positively or negatively. Your survey question will be something along the lines of "How much GMO food did you eat last week?"
Without GMO labeling, there is no way to conduct such research. This is why GMO labeling is good science. Not labeling GMOs is bad science - in fact, it makes the science impossible.
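To make the point concrete, here is a minimal sketch of that generic workflow - made-up data, hypothetical variables, not any real study:

```python
# A minimal sketch of the generic nutrition-research workflow described above.
# Made-up data and hypothetical variables, not any real study.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000

# Self-reported servings of GMO foods per week: a question that can only be
# asked (and answered) if GMO labels exist.
gmo_servings = rng.poisson(5, size=n)

# A health measure (say, BMI) that, in this simulation, has nothing to do
# with GMO intake.
bmi = rng.normal(26, 4, size=n)

# The usual analysis: regress the health measure on the dietary measure.
fit = sm.OLS(bmi, sm.add_constant(gmo_servings)).fit()
print(fit.params.round(3), fit.pvalues.round(3))
```

Without a labeling requirement, the first column simply cannot be collected, so the study cannot be run at all.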
I am outsourcing this post to Aaron Carroll, whose Upshot column eviscerates the recent claim that eating meat will give you cancer, or that eating meat is the same as smoking cigarettes. While the media is partly culpable for spreading misinformation, the WHO (World Health Organization) is the party ultimately responsible here.
We all know (or so I hope) that the plural of anecdote is not data.
It is time to add to this: The plural of observational study is not randomized experiment.
On Labor Day, our new Statbusters column appeared. This one concerns a popular news story from some weeks ago, saying science has proven that there are four types of drunks. The four refers to four "clusters" formed by running a cluster analysis algorithm. But four is decided by the analyst. Some algorithms won't run unless the analyst specifies the number of clusters; other algorithms generate the best structure for every number of clusters. This method is great for exploring and understanding the data but cannot confirm that there are precisely four types of drunks!
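To illustrate the point about the analyst choosing the number of clusters, here is a minimal sketch (simulated noise, not the drunks study): a clustering algorithm will dutifully return however many clusters you ask for, even when no real groups exist.

```python
# A minimal sketch (simulated noise, not the drunks study): a clustering
# algorithm returns however many clusters the analyst asks for.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))   # pure noise -- no real "types" exist here

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: cluster sizes {np.bincount(labels).tolist()}")
```

Picking the "right" k is a judgment call (ideally validated on fresh data); the algorithm itself cannot confirm that there are exactly four types of anything.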
The media often removes the uncertainty of science in the name of "popularizing." The entire article is here.
I only read nutrition studies in the service of this blog but otherwise, I don't trust them or care. Nevertheless, the health beat of most media outlets is obsessed with printing the latest research on coffee or eggs or fats or alcohol or what have you.
Now, the estimable John Ioannidis has published an editorial in BMJ titled "Implausible Results in Human Nutrition Research". John previously told us about the crisis of false positives in medical research.
Oops, here are some statistics on nutrition "science":
In 52 attempts at using randomized experiments to validate findings from observational studies, the number of times the findings were replicated: 0
In the NHANES questionnaire (the basis of all those findings), two-thirds of the participants provided answers that imply an energy intake that is "incompatible with life". I haven't read this paper; seems like worthwhile reading.
There are at least 34,000 papers on PubMed with keywords "coffee OR caffeine" which means this one nutrient has been associated with almost any interesting outcome.
Almost every single nutrient imaginable has peer reviewed publications associating it with almost any outcome. A statistician should never give the advice "If at first you don't succeed,..."
Many findings are entirely implausible (and still get published in top journals)... for example, the idea that a couple of servings a day of a single nutrient will halve the burden of cancer is clearly "too good to be true," even more so for anyone who is familiar with this literature.
"Big datasets just confer spurious precision status to noise"
Randomized experiments offer hope but are woefully undersized (like requiring 10 times the current sample).
Just to nail home the point, John concludes: "Definitive solutions will not come from another million observational papers or a few small randomized trials."
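The "spurious precision" quote is easy to demonstrate. Here is a toy simulation (made-up data, made-up effect size): with a million observations, a practically meaningless association comes with an impressively tiny p-value.

```python
# A toy illustration of the "spurious precision" quote -- made-up data,
# made-up (and negligible) effect size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 1_000_000
coffee = rng.normal(size=n)
outcome = 0.005 * coffee + rng.normal(size=n)   # a negligible "effect"

r, p = stats.pearsonr(coffee, outcome)
print(f"correlation = {r:.4f}, p-value = {p:.1e}")   # tiny r, much tinier p
```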
[After communicating with Frakt, Humphreys and Dean Eckles, I realize that I was confused about Frakt's description of the Humphreys paper, which does not perform PP analysis. So when reading this post, consider it a discussion of ITT versus PP analysis. I will post about Humphreys's methodology separately.]
The New York Times plugged a study of the effectiveness of Alcoholics Anonymous (AA) (link). The author (Austin Frakt) used this occasion to advocate "per-protocol" (PP) analysis over "intent-to-treat" (ITT) analysis. He does a good job explaining the potential downside of ITT, but got into a mess explaining PP and never properly addressed the downside of PP. It's an opportunity missed because I fear the article confuses readers even more on an important topic.
The key issue at play is non-compliance in a randomized experiment. If some patients are assigned to AA treatment and others are assigned to some other treatment, typically some subset of patients will "cross over" (or drop out altogether), and usually such cross-over is associated with the outcome being measured - for example, a patient assigned to AA treatment felt that AA was not working and unilaterally switched to the other treatment; or vice versa.
ITT and PP differ in how they deal with the subset of non-compliers. In ITT, you analyze everyone in the experiment based on their initial assignment, ignoring non-compliance. In PP, you drop all non-compliers from the study, and analyze the subset of compliers only. (Each analysis is "extreme" in its own way.)
Between these two, I usually prefer ITT. The PP analysis answers the question: "If everyone complied with the treatment, what would be its effect?" I don't find the assumption of zero non-compliance realistic. ITT answers a different question: "Of those who are given the treatment, what would be the expected effect?" This effect is an average of those who complied and those who did not comply, weighted by the proportion of compliers.
Frakt lost me when he said:
In a hypothetical example, imagine that 50 percent of the sample receive treatment regardless of which group they've been assigned to. And likewise imagine that 25 percent are not treated no matter their assignment. In this imaginary experiment, only 25 percent would actually be affected by random assignment.
First of all, the arithmetic does not work. If we ignore assignment as he suggested in the first two sentences, then the patients can either have received treatment or not. But 50 percent plus 25 percent leaves 25 percent of the patients unaccounted for.
Here is an illustration of what I think Frakt wanted to get across:
Of the 50% assigned to the treatment, 90% (45 out of 50) complied and 10% crossed over. Of the other half initially assigned to no treatment, 60% (30 out of 50) crossed over to the treatment. All in all, 75% of the study population received treatment and 25% did not... regardless of their initial assignment.
In an ITT analysis, all patients in the table are analyzed. We compare the top row with the bottom row. By contrast, in a PP analysis, we only analyze the patients along the top-left, bottom-right diagonal, namely, the 65% of the patients who complied with the assigned treatment. So, we compare the top left corner with the bottom right corner.
The important question is whether this 65% subset constitutes a random sample. Frakt implies it is: "only 25 percent [i.e. 65 percent in my example] would actually be affected by random assignment." Maybe when he said "affected by," he didn't really mean randomized; it should be obvious that treatment is no longer randomized within the 65% subset.
If the 65% subset were randomly drawn from the initial population, we should still have equal proportions of treated versus non-treated but in fact, we have 70% treated versus 30% not treated. Said differently, the not-treated patients are more likely to cross over than the treated patients.
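Here is the arithmetic of my hypothetical example spelled out as a tiny script, using the same numbers as above:

```python
# The arithmetic of the hypothetical example, spelled out (same numbers as above).
assigned_treatment_complied = 45   # stayed with the treatment
assigned_treatment_crossed  = 5    # abandoned the treatment
assigned_control_complied   = 20   # stayed untreated
assigned_control_crossed    = 30   # sought out the treatment anyway

treated     = assigned_treatment_complied + assigned_control_crossed   # 75
not_treated = assigned_treatment_crossed + assigned_control_complied   # 25

# ITT compares the two assignment groups of 50 each.
# PP keeps only the 45 + 20 = 65 compliers -- and within that subset,
# treatment is no longer balanced:
compliers = assigned_treatment_complied + assigned_control_complied
print(assigned_treatment_complied / compliers,   # roughly 0.69 treated
      assigned_control_complied / compliers)     # roughly 0.31 not treated
```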
Cross-over isn't something that happens randomly. Patients are assessing their own health during the experiment, and thus, the opting out is frequently related to the observed (albeit incomplete) outcome.
In the article, Frakt states that the study of Humphreys et al. "corrects for crossover by focusing on the subset of participants who do comply with their random assignment". I call this "filtering" rather than "correcting".
Does analyzing this subset lead to an accurate estimate of the treatment effect? I don't think so.
By filtering out the cross-overs, the researchers introduce a survivorship bias. If patients cross over because they are unhappy with their assigned treatment, then these patients, had they been forced to continue the original treatment, would likely have below-par outcomes compared to those who did not cross over. In a PP analysis, this subset is removed. In practice, this means that the treatment effect estimated by the PP analysis is too optimistic.
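A toy simulation makes the optimism visible. The numbers below are mine, not the Humphreys study's; for simplicity, only patients assigned to the treatment cross over, and they do so when they are doing poorly.

```python
# A toy simulation (my own made-up numbers, not the AA study): patients who are
# doing poorly on the assigned treatment abandon it, and we compare the ITT and
# PP estimates against the true effect.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 10_000
assigned = rng.integers(0, 2, size=n)     # 1 = assigned to the treatment
prognosis = rng.normal(size=n)            # unobserved; low values = doing poorly

# Half of the poorly-doing treated patients quit the treatment (cross over)
crossover = (assigned == 1) & (prognosis < -0.5) & (rng.random(n) < 0.5)
received = np.where(crossover, 0, assigned)

true_effect = 1.0
outcome = true_effect * received + prognosis + rng.normal(size=n)
df = pd.DataFrame({"assigned": assigned, "received": received, "outcome": outcome})

# ITT: compare by initial assignment, non-compliers and all
itt = df.groupby("assigned")["outcome"].mean().diff().iloc[-1]

# PP: drop non-compliers, then compare the remaining groups
compliers = df[df["assigned"] == df["received"]]
pp = compliers.groupby("assigned")["outcome"].mean().diff().iloc[-1]

print(f"true effect {true_effect:.2f} | ITT {itt:.2f} | PP {pp:.2f}")
```

In runs like this, the ITT estimate is diluted toward zero while the PP estimate overshoots the true effect - exactly the survivorship bias described above.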
Frakt is careless with his language when it comes to discussing the downside of PP analysis. He says (my italics):
it’s not always the case that the resulting treatment effect is the same as one would obtain from an ideal randomized controlled trial in which every patient complied with assignment and no crossover occurred. Marginal patients may be different from other patients...Despite the limitation, analysis of marginal patients reflects real-world behavior, too.
"Not always" leaves the impression that PP analysis is usually right except for rare situations. Note how he uses the word "limitation" above (paired with "despite"), and below, when discussing ITT analysis:
For a study with crossover, comparing treatment and control outcomes reflects the combined, real-world effects of treatment and the extent to which people comply with it or receive it even when it’s not explicitly offered. (If you want to toss around jargon, this type of analysis is known as “intention to treat.”) A limitation is that the selection effects introduced by crossover can obscure genuine treatment effects.
The choice of words leaves the impression that ITT is more limited than PP when both analyses suffer from problems arising from the same source: patients with worse outcomes are more likely to cross over.
Many readers of the NYT article pointed to a much longer article in The Atlantic. It appears that the scientific evidence on AA is very weak.
Are science journalists required to take one good statistics course? That is the question in my head when I read this Science Times article, titled "One Cup of Coffee Could Offset Three Drinks a Day" (link).
We are used to seeing rather tenuous conclusions such as "Four Cups of Coffee Reduces Your Risk of X". This headline takes it up another notch. A result is claimed about the substitution effect of two beverages. Such a result is highly unlikely to be obtained in the kind of observational studies used in nutrition research. And indeed, a glance at the source materials published by the World Cancer Research Fund (WCRF) confirms that they made no such claim.
The headline effect is pure imagination by the reporter, and a horrible misinterpretation of the report's conclusions. Here is a key table from the report:
The conclusions on alcoholic drinks and on coffee come from different underlying studies. Even if they had come from the same study, you cannot take different regression effects and stack them up. The effect of coffee is estimated for someone who is average on all other variables. The effect of alcohol is estimated for someone who is average on all other variables. The average person in the former case is not identical to the average person in the latter case. So if you add (or multiply, depending on your scale) the effects, the total effect is not well-defined.
In addition, you can only add (or multiply) effects if you first demonstrate that the two factors do not interact. If there is interaction, the effect of alcohol is different for people who drink less coffee relative to those who drink more. The alcohol effect stated in the table above, as I already pointed out, is for an average coffee drinker. Conversely, the protective effect of coffee may well vary with alcohol consumption.
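To see why the interaction matters, here is a small sketch with made-up data and made-up effect sizes (nothing to do with the WCRF numbers): when alcohol and coffee interact, the "effect of coffee" from an additive model is really an average over everyone's alcohol consumption, and stacking the two main effects misstates the combined effect.

```python
# Made-up data and made-up effect sizes -- purely to illustrate the logic,
# not the WCRF findings.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 20_000
df = pd.DataFrame({
    "alcohol": rng.poisson(2, n),   # drinks per day
    "coffee": rng.poisson(1, n),    # cups per day
})
# True model: coffee blunts the harm of alcohol (an interaction term)
df["risk"] = (0.5 * df.alcohol - 0.1 * df.coffee
              - 0.2 * df.alcohol * df.coffee + rng.normal(0, 1, n))

additive = smf.ols("risk ~ alcohol + coffee", data=df).fit()
interact = smf.ols("risk ~ alcohol * coffee", data=df).fit()
print(additive.params.round(2))   # "main effects" averaged over the other factor
print(interact.params.round(2))   # the coffee effect depends on alcohol intake
```

The additive fit returns one coffee coefficient; the interaction fit shows that the coffee effect varies with how much alcohol a person drinks, so the two effects cannot simply be added.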
The reporter also misrepresented the nature of the analysis. We are told: "In the study of 8 million people, cancer risk increased when they consumed three drinks per day. However, the study also found that people who also drank coffee, offset some of the negative effects of alcohol."
The reporter made it sound like a gigantic randomized controlled study was conducted. This is a horrible misjudgment. WCRF did not do any study at all, and certainly no researcher asked anyone to drink specific amounts of alcohol or coffee. The worst is the comment on people who drank coffee as well as alcohol. I can't find a statement in the WCRF report about such people. It's simply made up based on the false logic described above.
At one level, the journalist misquoted a scientific report. At another level, the WCRF report is rather disappointing.
The authors of the executive summary repeatedly use the language of causation. For example, "There is strong evidence that being overweight or obese is a cause of liver cancer." Really? Show me which study shows obesity "causes" liver cancer?
Take one of their most "convincing" findings: "Aflatoxins: Higher exposure to aflatoxins and consumption of aflatoxin-contaminated foods are convincing causes of liver cancer." The causation is purely an assumption of the panel who reviewed prior studies. In Section 7.1, readers learn that this cause-effect conclusion comes from "four nested case-control studies and cohort studies" for which "meta-analyses were not possible". So not a single randomized trial and no estimation of the pooled effect.
What is nicely done in the report is the inclusion of "mechanisms" which are speculative explanations for the claimed causal effects. It's great to have thought carefully about the biological mechanisms. Nevertheless, these sections are basically "story time" unless researchers succeed in establishing those unproven links.
Some behind-the-scenes comments on my recent article on New York's restaurant inspection grades; it appeared on FiveThirtyEight this Tuesday.
The Nature of Ratings
This article is about the ratings of things. I devoted a considerable number of pages to this topic in Numbersense (link) - Chapter 1 is all about the US News ranking of schools. A few key points are:
All rating schemes are completely subjective.
There is no "correct" rating scheme, therefore no one can prove that their rating scheme is better than someone else's rating scheme.
A good rating scheme is one that has popular acceptance. If people don't trust a rating scheme, it won't be used. (This is a variant of George Box's quote: "all models are false but some are useful".)
Think of a rating scheme as a way to impose a structure on unwieldy data. It represents a point of view.
All rating schemes will be gamed to death, assuming the formulae are made public.
Based on that, you can expect that my goal in writing the 538 article is not to praise or damn the city's health rating scheme. My intention is to describe how the rating scheme works based on the outcomes. I want to give readers information to judge whether they like the rating scheme or not.
The restaurant grade dataset is an example of OCCAM data. It is Observational, it has no Controls, it has seemingly all the data (i.e. Complete), it will be Adapted for other uses and will be Merged with other data sets to generate "insights". In my article, I did not do A or M.
Hidden Biases in Observational Data
Each month (or week; I need to check), the department puts up a dataset on the Open Data website. There is only one dataset available, and the most recent copy replaces the previous one. The size of the dataset therefore expands over time.
Anyone who analyzes grade data up to the most recent few months is in for a nasty surprise. As the chart on the right shows, the proportion of grades that are not A, B or C (labeled O and gray) spikes up by 10 times the normal amount during the last two months. This chart is for an August dataset, and is not an anomaly. It's an accurate description of the ongoing reality.
On the first inspection, if a restaurant is given a B or C, the restaurant has the right to go through a reinspection and arbitration process. During this time, the restaurant is allowed to display the "Grade Pending" sign. It appears that it can take up to four months for most of the B- or C-graded restaurants to finish this process. Over this period, many of the pending grades will flip to an A, B or C. The chance that they flip to a B or C is much higher than for the average restaurant (i.e. one not known to have a Grade Pending).
Indeed, the proportion of As in the most recent two months is vastly biased upwards as a result of the lengthy reinspection process.
For this reason, I removed the last two months from my analysis.
How might this bias affect your analysis?
If you drop all Pending grades from your analysis (while retaining the A, B, and C grades), you have created an artificial trend in the last two months.
If you keep the last available grade for each restaurant, you have not escaped the problem at all. In fact, you introduce yet another complication: B- and C-graded restaurants have older inspection dates than A-graded restaurants. Meanwhile, the Pending grades are still dropped.
If you automatically port this data to a mapping tool, or similar, you are displaying the biased data and unknowing users are misled. In fact, the visualization can no longer be interpreted.
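For readers working with this dataset, here is a minimal sketch of the kind of check that surfaces the bias, plus the fix described above (dropping the most recent two months). The file and column names are hypothetical; the real Open Data schema may differ.

```python
# Hypothetical file and column names -- the real Open Data schema may differ.
import pandas as pd

df = pd.read_csv("inspections.csv", parse_dates=["inspection_date"])
df["month"] = df["inspection_date"].dt.to_period("M")

# Share of grades that are neither A, B nor C, by inspection month
other_share = (~df["grade"].isin(["A", "B", "C"])).groupby(df["month"]).mean()
print(other_share.tail(6))   # expect a spike in the most recent two months

# One remedy (as described above): drop the most recent two months
cutoff = df["month"].max() - 2
trimmed = df[df["month"] <= cutoff]
```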
IMPORTANT NOTE: The data is NOT WRONG. Data cleaning/pre-processing does not just mean finding bad data. Much of what statisticians do when they explore the data is to identify biases or other tricky features.
The Nature of Statistical Analysis
[Captain Hindsight here.] Of course, I didn't know or guess that the Grade Pending bias would be a problem. I did the first analysis of the data using a July dataset, and by the time I was drafting the article for FiveThirtyEight, it was already August so I "refreshed" the analysis with the latest dataset. That's when I discovered some discrepancies that led me to the discovery.
This is the norm in statistical analysis. Every time you sit down to write something up, you notice additional nuances or nits. Sometimes, the problem is severe enough that you have to re-run everything. Other times, you just decide to gloss over it and move on.
As others binge watch Netflix TV, I binge read Gelman posts, while riding a train with no wifi and a dying laptop battery. (This entry was written two weeks ago.)
Andrew Gelman is statistics’ most prolific blogger. Gelman-binging has become a necessity since I have not managed to keep up with his accelerated posting schedule. Earlier this year, he began publishing previews of future posts, one week in advance, and one month in advance.
Also, I have been stubbornly waiting for the developers of my former favorite RSS reader to work out an endless parade of the most elementary bugs, after they launched a new site in response to Google Reader shutting down. Not having settled on a new RSS tool has definitely shrunk the volume of my reading.
I only managed to go through about a week’s worth of posts because the recent pieces interest me a lot.
Debunking the cannabis causes brain abnormalities paper (link)
Gelman links to Lior Pachter's review of what he calls "quite possibly the worst paper I've read all year".
This bit deserves further mocking: when the researchers fail to achieve conventional 5% significance, they draw conclusions based on a "trend towards significance". This sleight of hand happens frequently in practice as well, where the phrase "directional result" is used.
When an observed effect, as in this case, is not statistically significant, the implication is that the signal is not large enough to distinguish from background noise. When the researcher then says “but I still see a signal”, said researcher is now ignoring the uncertainty around the point estimate, pretending that the noise doesn’t exist. The researcher is in effect making a decision using the point estimate. Anyone who has taken Stats 101 should know not to use a point estimate.
One great tenet of statistical thinking is the recognition that the observed data sample is merely one of many possible things that could have happened. The confidence interval is an attempt to capture the range of possibilities, and the much-maligned test of significance represents an attempt to reduce such analysis to one statistic. It achieves simplicity at the expense of nuance.
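A tiny simulation shows what looking only at the point estimate hides: the estimate below is rarely exactly zero, so a researcher who ignores the interval can always claim to "see a signal," while the confidence interval makes clear the data are consistent with no effect at all.

```python
# Simulated data with a true effect of exactly zero.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
treated = rng.normal(0.0, 1.0, 40)
control = rng.normal(0.0, 1.0, 40)

diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / 40 + control.var(ddof=1) / 40)
t, p = stats.ttest_ind(treated, control)
print(f"point estimate {diff:.2f}, "
      f"95% CI ({diff - 1.96*se:.2f}, {diff + 1.96*se:.2f}), p = {p:.2f}")
```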
This cannabis study is also a great example of what I've been calling "causation creep". The authors are well aware that they have merely found a correlation (arguably not even that, but let's grant it for the sake of argument), but when they start narrating their finding, they cannot help but use causal language.
The title of the paper is "Cannabis use is quantitatively associated with...", and yet the lead author told USA Today: "Just casual use appears to create changes in the brain in areas you don't want to change."
Causation creep is actually endemic in academic publishing of observational studies, and I don't want to single these authors out.
Gelman has been on this one for a while. The offending paper looked at the correlation between hurricane damage and the gender of the names we give these hurricanes. I didn't find it worth spending my time studying this line of research, but I'm assuming that the problem is considered interesting because they claim to have found a "natural experiment", in that the gender is effectively "randomly assigned" to the hurricanes as they appear.
I have been quite irritated over the years by this type of research, encouraged by the fad of Freakonomics. Even if they did find a natural experiment, what is that experiment about? Instead of spending research hours on correlating damage with naming conventions, why not spend the precious time looking for real causes of hurricane damage? You know, like weather patterns, currents, physical phenomena, human-induced climate changes, human decisions to live in high-risk areas, etc.?
I should note that much of Steven Levitt's original work that launched this field dealt with real problems, like crime rates. It's just that many of his followers have gone astray.
Matt Novak debunks an article in Vox which repeats the assertion by the tech industry that new technologies have been adopted much more quickly in recent years than in the past. Vox is not the only place where you see this assertion. We have all seen variations of the chart shown on the right.
Novak puts on a statistician's hat and asks how the data came about. This type of chart is particularly prone to errors since many different studies across different eras are needed.
What Novak found: the starting dates for older technologies (like TV and radio) were defined by their invention in the laboratory, while those for recent technologies (such as the Internet and mobile phones) were defined by their dates of commercialization. Needless to say, adoption is expected to be slow when a technology is not yet available to consumers!
Needless to say, anyone who cites this chart or its conclusion from here on out should be publicly shamed.
Gelman nicely distills one of the central messages in my Numbersense book (Get it here). All data analyses require assumptions; assumptions are subjective; making assumptions is not a sin; clarifying one's assumptions and vigorously testing them is what makes a good analysis. Go read this post.
Did you buy detergent on your most recent trip to the store? (link)
Gelman was surprised by a recent paper in which the researchers found that 42% of their sample purchased detergent on their most recent trip to the store. This reminds me of the section of Numbersense (Get it here) in which I described a study in which some marketing professors had mystery shoppers follow people in a supermarket and, within seconds of their placing groceries in their trolleys, ask them how much the items cost. The error rate was quite shocking.
There is another big problem with this research design. People's memory of what they purchased depends on how long ago that "most recent" trip was. I also wonder how online purchasing affects this sort of study as I typically don't count going to a website as "a trip to the supermarket". It seems like some sort of prequalification is needed but prequalification always restricts the generalizability of any finding.
Andrew gently mocks two commonly used procedures: outlier detection and stepwise regression. The discussion of outlier detection is buried in the comments section, so if you are interested, you should scroll below the fold. Gelman's annoyance with outlier detection is semantic - but it is important semantics, and it aligns with my own practice. Like Gelman, I don't automatically treat every extreme value as an outlier.
Stepwise is a suboptimal procedure and Gelman prefers modern techniques like lasso. But lots of practitioners use stepwise because the procedure is “intuitive”, that is to say, one can explain it to a non-technical person without rolling their eyes. The discussion below the post is worth reading.
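For the curious, here is a minimal sketch of the lasso alternative (using scikit-learn's LassoCV; this is an illustration, not Gelman's code): the penalty shrinks noise coefficients to exactly zero, doing the variable selection that stepwise attempts, but in one coherent fit.

```python
# An illustration of the lasso with scikit-learn's LassoCV -- not Gelman's code.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(6)
n, p = 500, 30
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]          # only three predictors actually matter
y = X @ beta + rng.normal(size=n)

fit = LassoCV(cv=5).fit(X, y)        # penalty chosen by cross-validation
print("kept predictors:", np.flatnonzero(np.abs(fit.coef_) > 1e-6))
```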
I will be speaking at the Agilone Data Driven Marketing Summit (link) in San Francisco on Thursday. I will be talking about hiring for numbersense. Drop by if you are in the area. Future events are listed on the right column of the blog >>>
I feel bad piling on the "good guys" in the sports doping spectacle but sometimes, you need someone to point you to the mirror.
Here are the breathtaking first sentences from an article in Canada's The Globe and Mail about the scarcity of positive doping results in Sochi 2014:
At the midpoint of the Sochi Games, not yet marred by a single case of doping, the IOC’s top medical official said its efforts to catch drug cheats were so successful they had scared them all away.
A week later, after the disclosure of a fifth doping case on the final day of the games, IOC president Thomas Bach cited the positive tests as the sign of success.
If you have been reading this blog, you already know the people in the anti-doping business set themselves a really low bar. The title of Chapter 4 of Numbers Rule Your World (link) contains the phrase "timid testers" for a reason.
The statement by the unnamed "top medical official" is the more shocking. If there are no positive test results, and such is considered an accurate portrayal of the doping situation, then we must believe that there are no dopers. Apparently, this official believes no athlete that has been tested doped. Not a single one.
Who’s right? To [IOC president] Bach, it doesn’t much matter.
“The number of the cases for me is not really relevant,” Bach said. “What is important is that we see the system works.”
Now, it's Bach's turn to display his ignorance of the statistics of anti-doping. As I explained years ago in the book and also on this blog, the proportion of tests that come back positive is one of the most important numbers to look at when judging the success of an anti-doping program. So far, we know that six out of 2,630 athletes tested positive, meaning the rate of testing positive is 0.23%. (Much less than 1 percent is the norm in all large international events.)
What does that mean? If one percent of athletes doped, then we should expect about 26 positives if the tests were 100% accurate. Since they only caught six, at least 20 of the 26 dopers passed the test. Yes, that means more than three-quarters of dopers passed. (And I'm only assuming one percent doping, and not allowing the possibility of false positives.)
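For those who want to check the arithmetic, here it is under the stated assumptions (a hypothetical 1% doping rate and no false positives):

```python
# The back-of-the-envelope arithmetic, under the stated assumptions.
tested = 2630
positives = 6
assumed_doping_rate = 0.01            # hypothetical, as in the text

dopers = tested * assumed_doping_rate           # about 26 athletes
missed = dopers - positives                     # about 20 passed the test
print(f"expected dopers: {dopers:.0f}, caught: {positives}, "
      f"missed: {missed:.0f} ({missed / dopers:.0%} false negatives)")
```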
This leads me to the as-yet unrecognized scandal. Lance Armstrong, Ryan Braun, Mark McGwire, Alex Rodriguez, etc. etc. None of these confirmed dopers were caught by drug tests. In fact, all of them boasted at one point or another that a long string of negative test findings proved that they were innocent.
Rather than gloating about the "success" of anti-doping measures, they should try explaining how the most notorious dopers in sports were repeatedly given a clean bill of health.
I am a supporter of anti-doping. I just want some discussion of the false negative problem.