I found my way to Mark Liberman's post at Language Log by way of a comment by Kyle on Andrew Gelman's post about Dubner's response to our Freakonomics article. I've always enjoyed Mark's posts and this one is no exception. His first bullet point speaks to one of my chief worries about Freakonomics-style analyses.
For background, Mark raised some doubts about recent academic work that supposedly shows that the left-right asymmetry in the QWERTY keyboard design affects our perception of words. The researchers concluded: "Words with more right-side letters were rated to be more positive, on average, than words with more left-side letters. We call this relationship the QWERTY effect."
***
Mark did some quick analyses which failed to replicate the finding. But his first point has nothing to do with replication. It is valid even if the original research has been done impeccably. Here are the words you must read:
1. The QWERTY effect's size. As far as I'm concerned, and as far as the general public is concerned, the size (and therefore the practical importance) of the QWERTY effect (if it exists) is the key question. This is not an entirely subjective matter — we can ask, as I did, what proportion of the variance in human judgments of the emotional valence of words is explained by the "right side advantage". The answer is "very little", or more precisely, around a tenth of a percent at best (at least in the modeling that I've done).
I focused on the effect-size question because the press release said the following (and the popular press took the hint):
Should parents stick to the positive side of their keyboards when picking baby names – Molly instead of Sara? Jimmy instead of Fred? According to the authors, "People responsible for naming new products, brands, and companies might do well to consider the potential advantages of consulting their keyboards and choosing the 'right' name."
So C&J may not be interested in my subjective evaluation of the effect size, but they promoted their own subjective evaluation by suggesting that the effect is important enough to matter to people choosing names. I felt (and feel) that this represents a serious exaggeration of the strength of the effect; and it seemed (and seems) appropriate to me to say so publicly.
***
Mark's complaint is similar to my response to several results championed by the Freakonomics team, including the "surname effect" as it relates to winners of the Nobel Economics Prize, and the "birthday effect" as it relates to sports leagues.
The common ingredients of such analyses are: published, peer-reviewed scholarly work that identifies an interesting effect meeting the standard of statistical significance, followed by media amplification and popularization of the results in ways that (a) ignore practical significance; and (b) apply a causal interpretation, possibly unwittingly.
(a) Practical significance
Statistical significance is designed to measure one thing only: how likely we would be to observe the effect under investigation if that effect did not actually exist (in other words, the chance of a false positive). We need this concept because many observed effects (especially small ones) can arise by chance and therefore should not be attributed to the factor being studied.
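To see what the significance convention does and does not tell us, here is a minimal simulation sketch in Python (my own toy example with made-up numbers, not anyone's actual analysis): run many studies of an effect that truly does not exist and count how often the data look "significant" anyway.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulate 10,000 studies in which the "effect" truly does not exist:
# both groups are drawn from the same distribution.
n_studies, n_per_group = 10_000, 50
false_positives = 0
for _ in range(n_studies):
    a = rng.normal(size=n_per_group)
    b = rng.normal(size=n_per_group)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

# Roughly 5% of these no-effect studies clear the significance bar by chance alone.
print(f"false positive rate: {false_positives / n_studies:.1%}")
```

About 1 in 20 of these no-effect studies comes up "significant" purely by luck, which is exactly the risk the 5% threshold is meant to cap.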
Statistical significance is necessary but not sufficient for practical value. In other words, a practically meaningful effect must be statistically significant but there are many statistically significant effects that have little to no practical value.
Statistical significance will get one published in a peer-reviewed journal, but it's not the job of a journal editor to discern practical value. Aside from statistical significance, the editor's other standard is contribution to scholarship (i.e., novelty), which tells us nothing about practical value either. Thus, the "peer review" standard cannot defuse this issue.
As Mark pointed out, the practical importance of an effect is given by its effect size, and for the QWERTY effect that is one tenth of one percent at best. If you list all of the factors that may affect "human judgments of the emotional valence of words", you will have a long list. If you then rank those factors by effect size, where would the QWERTY effect fall? Mark's point is that, at a tenth of a percent, other factors are almost surely more important than the keyboard. Why focus everyone's attention on QWERTY when those other factors matter more? (The small sketch below makes this concrete.)
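Here is a toy illustration (entirely invented numbers and factor names, not the researchers' data): with a large enough sample, a factor that explains only about a tenth of a percent of the variance is easily "statistically significant", yet it is dwarfed by a genuinely important factor.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100_000  # a large sample, as in big word-rating datasets

strong = rng.normal(size=n)   # stand-in for a genuinely important factor
weak = rng.normal(size=n)     # stand-in for a QWERTY-like factor
# Coefficients chosen so 'strong' explains roughly 20% of the variance
# and 'weak' explains roughly 0.1% (illustrative assumptions only).
rating = 0.5 * strong + 0.03 * weak + rng.normal(size=n)

for name, factor in [("strong factor", strong), ("QWERTY-like factor", weak)]:
    res = stats.linregress(factor, rating)
    print(f"{name:20s} p = {res.pvalue:.1e}   variance explained = {res.rvalue**2:.3%}")
```

Both factors sail past the 5% significance bar, but one explains hundreds of times more of the variance than the other. Statistical significance alone cannot tell them apart; the effect size can.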
This is clearer in the Freakonomics examples. In the "birthday effect" research, it is acknowledged that genetic factors like gender and having a dad who was a professional athlete are much more powerful than one's birthday. While the birthday effect is statistically significant, it is relatively small, so why focus everyone's attention on it rather than on more powerful factors? As for the Nobel economics prize, one can name a variety of factors, such as educational background, creativity, intelligence, and influence, that are more powerful than the first letter of one's surname.
***
Are Mark and I fussing over little details? No.
Go back to the 5% statistical significance level. By accepting this convention, we accept that 1 in 20 tests of a nonexistent effect will nonetheless come up "significant", and no one knows which of the published results are those false positives. If you had to guess, you would be more suspicious of results in which the effect size is small, or in which significance is barely achieved. Given the 5% convention, you shouldn't be surprised that lots of published results just clear the 5% mark. In addition, Andrew Gelman (link) tells us why the conventional acceptance criterion ensures that if the true effect is small, the estimated effect size will be exaggerated (and thus inaccurate).
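Gelman's exaggeration point can also be seen in a toy simulation (again, numbers I made up for illustration): when the true effect is small and the studies are noisy, the few studies that do cross the 5% threshold will, on average, report an effect several times larger than the truth.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

true_effect = 0.1        # a small true difference between two groups (assumed)
n_per_group = 50         # a modest, noisy study
n_studies = 20_000

significant_estimates = []
for _ in range(n_studies):
    a = rng.normal(loc=true_effect, size=n_per_group)
    b = rng.normal(loc=0.0, size=n_per_group)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        significant_estimates.append(a.mean() - b.mean())

print(f"true effect:                        {true_effect}")
print(f"share of studies reaching p < 0.05: {len(significant_estimates) / n_studies:.1%}")
# The estimates that survive the significance filter average several times the true effect.
print(f"average estimate among those:       {np.mean(significant_estimates):.2f}")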
For all these reasons, statisticians are reluctant to tout small effects that barely achieve 5% statistical significance, especially in observational data.
In some of these examples, there is no way to arrange randomized experiments to verify the reported effects. We can't run experiments on birthdays and sports success, for example. But when such experiments are possible, the results have often been ugly. To see what I'm talking about, I highly recommend reading this blog post by Gary Taubes, a long-time science reporter (thanks to John for his comment on my post on the red meat finding).
Taubes's article is really long. Here are the two key quotes:
this meat-eating association with disease is a tiny association. Tiny. It’s not the 20-fold increased risk of lung cancer that pack-a-day smokers have compared to non-smokers. It’s a 0.2-fold increased risk — 1/100th the size...very few epidemiologists would ever take seriously an association smaller than a 3- or 4-fold increase in risk. These Harvard people [i.e., the researchers behind the red meat study] are discussing, and getting an extraordinary amount of media attention, over a 0.2-fold increased risk.
... every time in the past that these researchers had claimed that an association observed in their observational trials was a causal relationship, and that causal relationship had then been tested in experiment, the experiment had failed to confirm the causal interpretation — i.e., the folks from Harvard got it wrong. Not most times, but every time. No exception. Their batting average circa 2007, at least, was .000.
All those peer-reviewed, statistically-significant results amounted to a huge pile of misinformation... but a lot of press, which by the way does not have the habit of retracting erroneous reporting of this type.
***
To summarize: a lot of published effects are tiny, which means they have no practical value even if they have entertainment value; moreover, when reported effects are tiny, there is a good chance that they are false positives, so trumpeting them for titillation can set you up for later embarrassment.
I'll address the other aspect of this practice, that of unwarranted causal interpretations, in a later post.
I agree overall with the post -- all very important points! But one small note: I think the end part of the quote from Taubes is misleading. He says that the causal effects were disproven by experimental studies, but my understanding is the experimental studies were testing something slightly different (ie, whether a diet - with all the attendant compliance problems and measurement issues - could reduce that elevated risk). The conclusion that the observational data are sketchy (and nutritional epidemiologists should be more cautious in interpreting causality from this observational data) is true, but Taubes starts with that valid criticism and ends up in Atkins-it's-all-carbs-that-are-bad-for-you-schtick land.
That doesn't change the point of this post - that we should be cautious re: small effect sizes - which is spot on.
Posted by: Brett Keller | 03/27/2012 at 01:26 PM
Brett: Thanks for the comment. I plan on reading Taubes's book at some point and can then confirm whether he's being fair. In general, though, I'm not surprised at all that experiments fail to validate tiny effects derived from observational studies.
Posted by: Kaiser | 03/29/2012 at 01:13 AM
One thing people don't seem to have noticed about the "red meat" study was that sex was not a variable in the regression equation. Nor were the results reported separately for males and females.
So if males have a higher mortality rate (and they do, at almost all ages, certainly those within the study) and also eat, on average, more servings of red meat (which seems almost certain, given that they eat, on average, more of everything)...
Posted by: Morgan | 04/07/2012 at 12:03 AM
Morgan -- I think that's incorrect (just glancing at the original paper at http://archinte.ama-assn.org/cgi/content/full/archinternmed.2011.2287 -- let me know if I missed something in my rush). They have data from two sources (the Nurses' Health Study and the Health Professionals Follow-up Study), with one source being all women and the other source being all men. They report results in the form of hazard ratios for each study (and find the results are significant within each data set / gender) and then also a pooled analysis. That's Table 4 in the paper. If the effect was in the pooled analysis and not in the separate ones you'd be right to be skeptical, but in this design I don't think it's possible to separate out differences between the studies and differences between genders because there's perfect correlation between the two.
Posted by: Brett Keller | 04/07/2012 at 05:40 PM
A couple of other QWERTY points: (1) even if it were a large effect, is it the left-right bias of the typist or of the original designer of the keyboard, who may have had a variety of idiosyncratic reasons for the design? (2) The variance across letters within the left and right sides would likely be higher than the variance between left and right.
Posted by: zbicyclist | 05/07/2012 at 02:42 PM