I found my way to Mark Liberman's post at Language Log by way of a comment by Kyle on Andrew Gelman's post about Dubner's response to our Freakonomics article. I've always enjoyed Mark's posts and this one is no exception. His first bullet point speaks to one of my chief worries about Freakonomics-style analyses.
For background, Mark raised some doubts about recent academic work that supposedly shows that the left-right asymmetry in the QWERTY keyboard design affects our perception of words. The researchers concluded: "Words with more right-side letters were rated to be more positive, on average, than words with more left-side letters. We call this relationship the QWERTY effect."
Mark did some quick analyses which failed to replicate the finding. But his first point has nothing to do with replication. It would be valid even if the original research had been done impeccably. Here are the words you must read:
1. The QWERTY effect's size. As far as I'm concerned, and as far as the general public is concerned, the size (and therefore the practical importance) of the QWERTY effect (if it exists) is the key question. This is not an entirely subjective matter — we can ask, as I did, what proportion of the variance in human judgments of the emotional valence of words is explained by the "right side advantage". The answer is "very little", or more precisely, around a tenth of a percent at best (at least in the modeling that I've done).
I focused on the effect-size question because the press release said the following (and the popular press took the hint):
Should parents stick to the positive side of their keyboards when picking baby names – Molly instead of Sara? Jimmy instead of Fred? According to the authors, “People responsible for naming new products, brands, and companies might do well to consider the potential advantages of consulting their keyboards and choosing the 'right' name."
So C&J may not be interested in my subjective evaluation of the effect size, but they promoted their own subjective evaluation by suggesting that the effect is important enough to matter to people choosing names. I felt (and feel) that this represents a serious exaggeration of the strength of the effect; and it seemed (and seems) appropriate to me to say so publicly.
Mark's complaint is similar to my response to several results championed by the Freakonomics team, including the "surname effect" as it relates to winners of the Nobel Economics Prize, and the "birthday effect" as it relates to sports leagues.
The common ingredients of such analyses are: published, peer-reviewed scholarly work that identifies an interesting effect meeting the standard of statistical significance, followed by the media's amplification and popularization of the results in ways that (a) ignore practical significance; and (b) apply a causal interpretation, possibly unknowingly.
(a) Practical significance
Statistical significance is designed to measure one thing only: how likely would we be to observe an effect at least as large as the one being investigated if the effect did not exist (i.e., what's the chance of a false positive)? We need this concept because many observed effects (especially small effects) can happen by chance and therefore should not be attributed to the factor being studied.
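To make the false-positive idea concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the two groups are drawn from the same normal distribution (so no effect exists), the sample size is 50 per group, and significance is judged with a large-sample z cutoff of 1.96 rather than an exact t-test. The function names are made up for this example.

```python
import random
import statistics


def z_stat(sample_a, sample_b):
    """Large-sample z statistic for a difference in two group means."""
    diff = statistics.fmean(sample_a) - statistics.fmean(sample_b)
    std_error = (
        statistics.variance(sample_a) / len(sample_a)
        + statistics.variance(sample_b) / len(sample_b)
    ) ** 0.5
    return diff / std_error


def false_positive_rate(n_studies=2000, n_per_group=50, seed=1):
    """Share of simulated studies that reach |z| > 1.96 when no effect exists."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Both groups come from the same distribution: the true effect is zero.
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if abs(z_stat(group_a, group_b)) > 1.96:
            hits += 1
    return hits / n_studies


rate = false_positive_rate()
print(f"Share of 'significant' studies with no real effect: {rate:.3f}")
```

Under these assumptions the printed share should land in the neighborhood of 5%, which is all that the 5% convention promises.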
Statistical significance is necessary but not sufficient for practical value. In other words, a practically meaningful effect must be statistically significant but there are many statistically significant effects that have little to no practical value.
Statistical significance will get one published in a peer-reviewed journal but it's not the job of a journal editor to discern practical value. Aside from statistical significance, the editor's other standard is contribution to the scholarship (i.e., novelty), which tells us nothing about practical value either. Thus, the "peer review" standard cannot defuse this issue.
As Mark pointed out, the practical importance of an effect is given by its effect size. For the QWERTY effect, it is one tenth of one percent of the variance at best. If you list all of the factors that may affect "human judgments of the emotional valence of words", you will have a long list. If you then rank the factors by their effect sizes, where would the QWERTY effect fall? Mark is implying that other factors matter more than QWERTY here. Why focus everyone's attention on QWERTY when other factors may be even more important?
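The gap between "statistically significant" and "explains almost nothing" can be sketched with simulated data. This is not Mark's analysis or the QWERTY data; it is a toy example that assumes a single predictor built to explain about 0.1% of the variance in the outcome, with a hypothetical sample of 100,000 observations.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation computed from first principles."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
n = 100_000            # assumed sample size
true_r2 = 0.001        # assumed share of variance explained (0.1%)

# Choose the slope so that x accounts for true_r2 of the variance of y
# when both x and the noise have unit variance.
slope = math.sqrt(true_r2 / (1 - true_r2))

xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [slope * x + rng.gauss(0.0, 1.0) for x in xs]

r = pearson_r(xs, ys)
r_squared = r * r
# t statistic for testing whether the correlation is zero.
t_stat = r * math.sqrt((n - 2) / (1 - r_squared))

print(f"variance explained: {100 * r_squared:.2f}%")
print(f"t statistic: {t_stat:.1f} (|t| > 1.96 counts as significant at 5%)")
```

With a sample this large, the test statistic clears the 5% bar comfortably even though the predictor leaves roughly 99.9% of the variance unexplained. The significance test answers "is it zero?", not "does it matter?".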
This is clearer in the Freakonomics examples. In the "birthday effect" research, it is acknowledged that genetic factors like gender and whether one's dad is a professional athlete are much more powerful than one's birthday. While the birthday effect is statistically significant, it is relatively small, so why focus everyone's attention on that rather than on more powerful factors? As for Nobel economics prizes, one can name a variety of factors, such as educational background, creativity, intelligence, and influence, that are more powerful than the first letter of one's surname.
Are Mark and I fussing over little details? No.
Go back to the 5% statistical significance level. By accepting this convention, we accept that when no real effect exists, about 1 in 20 tests will still come out statistically significant. And no one knows which results are the false positives. If you had to take a guess, you would be more suspicious of the results in which the effect size is small, or in which significance is barely achieved. Because of the 5% convention, you shouldn't be surprised that lots of published results just make the 5% mark. In addition, Andrew Gelman (link) tells us why the conventional acceptance criterion ensures that if the true effect is small, the estimated effect size will be exaggerated (and thus inaccurate).
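The exaggeration point can be illustrated with a small simulation. This is a sketch under assumed numbers, not a reproduction of Gelman's argument: the true effect is set to 0.1 standard deviations, each hypothetical study has 50 subjects per group, and a study counts as "publishable" when its z statistic exceeds 1.96 in absolute value.

```python
import random
import statistics

TRUE_EFFECT = 0.1    # assumed true difference in means, in standard deviations
N_PER_GROUP = 50     # assumed sample size per group in each study
N_STUDIES = 3000     # number of simulated studies

rng = random.Random(42)
all_estimates = []
significant_estimates = []

for _ in range(N_STUDIES):
    treated = [rng.gauss(TRUE_EFFECT, 1.0) for _ in range(N_PER_GROUP)]
    control = [rng.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
    estimate = statistics.fmean(treated) - statistics.fmean(control)
    std_error = (
        (statistics.variance(treated) + statistics.variance(control)) / N_PER_GROUP
    ) ** 0.5
    all_estimates.append(estimate)
    # Keep only the studies that pass the 5% significance filter.
    if abs(estimate / std_error) > 1.96:
        significant_estimates.append(estimate)

mean_all = statistics.fmean(all_estimates)
mean_sig_magnitude = statistics.fmean([abs(e) for e in significant_estimates])
exaggeration = mean_sig_magnitude / TRUE_EFFECT

print(f"true effect: {TRUE_EFFECT}")
print(f"average estimate, all studies: {mean_all:.3f}")
print(f"average magnitude, significant studies only: {mean_sig_magnitude:.3f}")
print(f"exaggeration factor among significant studies: {exaggeration:.1f}x")
```

Averaged over all studies, the estimate sits close to the true effect. Among the studies that clear the significance bar, the estimates have to be several times larger than the truth just to pass, so the "published" subset overstates the effect.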
For all these reasons, statisticians are reluctant to tout small effects that just achieve 5% statistical significance, especially in observational data.
In some of these examples, there is no way to arrange randomized experiments to verify the reported effects. We can't run experiments on birthdays and sports success, for example. But when such experiments are possible, the results have often been ugly. To see what I'm talking about, I highly recommend reading this blog post by Gary Taubes, a long-time science reporter (thanks to John for his comment on my post on the red meat finding).
Taubes's article is really long. Here are the two key quotes:
this meat-eating association with disease is a tiny association. Tiny. It’s not the 20-fold increased risk of lung cancer that pack-a-day smokers have compared to non-smokers. It’s a 0.2-fold increased risk — 1/100th the size...very few epidemiologists would ever take seriously an association smaller than a 3- or 4-fold increase in risk. These Harvard people [i.e., the researchers behind the red meat study] are discussing, and getting an extraordinary amount of media attention, over a 0.2-fold increased risk.
... every time in the past that these researchers had claimed that an association observed in their observational trials was a causal relationship, and that causal relationship had then been tested in experiment, the experiment had failed to confirm the causal interpretation — i.e., the folks from Harvard got it wrong. Not most times, but every time. No exception. Their batting average circa 2007, at least, was .000.
All those peer-reviewed, statistically-significant results amounted to a huge pile of misinformation... but a lot of press, which by the way does not have the habit of retracting erroneous reporting of this type.
To summarize: a lot of published effects are tiny, which means they have no practical value even if they have entertainment value; moreover, when reported effects are tiny, there is a good chance that they are false positives, so trumpeting them for titillation can set you up for later embarrassment.
I'll address the other aspect of this practice, that of unwarranted causal interpretations, in a later post.