I recently covered the power pose research controversy, ignited by an inflammatory letter by Susan Fiske (link). Dana Carney, one of the coauthors of the original power pose study, courageously came forward to disown the research and explained why she no longer trusts the result. Here is her mea culpa. Her co-author, Amy Cuddy, then went to New York Magazine to publish her own corrective, claiming that the "primary" finding of power pose is confirmed. Here is the statement Cuddy released through a publicist.
The room is getting crowded. We have Cuddy and the TED Talk crowd on the one side, and co-researcher Carney, Andrew Gelman, and the research replication movement on the other side, each making their respective case. In order to make a judgment, you’re going to need to understand how scientific research and publishing works, and what the statistical critique of the established methodological approach is all about.
In a multi-part post, I provide some of the background information needed to understand what is going on. Here is the plan:
Key Idea 1: Peer Review, Manuscripts, Pop Science and TED Talks (Today)
Key Idea 2: P < 0.05, P-hacking, Replication Studies, Pre-registration (Today)
Key Idea 3: Negative Studies, and the File Drawer (Here)
Key Idea 4: Degrees of Freedom, and the Garden of Forking Paths (Here)
Key Idea 5: Sample Size
***
Key Idea 1: Peer Review, Manuscripts, Pop Science, and TED Talks
Researchers have long recognized the differential quality of research studies. That’s why we distinguish “peer-reviewed” studies from merely published studies: published studies may or may not be peer-reviewed, and peer-reviewed studies generally enjoy a higher status. As the number of journals has multiplied, which publication puts out the study also matters.
Then, there are manuscripts. If a manuscript is “under review,” it has been submitted for peer review. Some proportion of manuscripts will be rejected by journal editors. Of those submissions that eventually get published, most will undergo revisions based on feedback from reviewers. Manuscripts that have been accepted may live in a queue for a while, in which case they are “accepted” and pending publication. Obviously, a published study carries more weight than a manuscript, especially if the latter is still under review.
Next, we have “pop science,” a relatively recent phenomenon. Malcolm Gladwell is among the earliest to achieve success in this genre. On the one hand, such authors are de facto marketers of scientific research. On the other hand, they may not be the best judges of good science. (Recall Gladwell’s “igon value” controversy.) Even more recently, TED talks have emerged as an outlet highly coveted by scientists seeking publicity. Through this channel, science that is not mainstream can attain a mass following, usurping the conventional order of things.
“Pop science” thrives on the pop. In this vein, the team of Steven Levitt and Stephen Dubner, aka Freakonomics, entertained their readers with analyses of out-of-left-field topics, ranging from abortion and crime to geo-engineering solutions to global warming. Seemingly, the only subjects off limits are unemployment, inflation, growth, and other traditional themes in economics. The news media crave Freakonomics-style research. Editors of research journals soon joined the fray, crafting press releases to pitch journal articles with attention-grabbing headlines. Critics worry that “pop science” elevates the titillation factor above the quality of the science, and that journal editors may have relaxed standards for the sake of publicity. If this critique holds, the implication is that peer review has lost its luster.
Key Idea 2: P < 0.05, P-hacking, Replication Studies, Pre-registration
In social science, in order to get research published, the effect must be “statistically significant” with p < 0.05. This convention is intended to weed out findings that do not generalize beyond the one experiment. If a result is generalizable, it should replicate when the experiment is repeated (after an allowance for sampling error).
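For concreteness, here is a minimal sketch of the kind of calculation behind that threshold, using Python and entirely made-up data: a hypothetical treatment-versus-control comparison with 30 subjects per group and an assumed true effect of half a standard deviation, tested with a two-sample t-test.

```python
# Minimal sketch of the p < 0.05 convention. The groups, sample sizes, and
# effect size below are made up for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical experiment: 30 subjects per group, true difference of 0.5 SD.
control = rng.normal(loc=0.0, scale=1.0, size=30)
treatment = rng.normal(loc=0.5, scale=1.0, size=30)

t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("'Statistically significant' by the publishing convention.")
else:
    print("Does not clear the conventional threshold.")
```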
A replication study repeats a published experiment to validate it. Journal editors have long dismissed replication studies, deeming them unoriginal work. They believe that generalizability is guaranteed by ensuring p < 0.05. This line of thinking is currently under attack.
The edifice of p < 0.05 relies on researchers respecting accepted protocol. When such rules of practice are broken, p < 0.05 becomes meaningless. The researchers are either fooling themselves (unintentionally) or fooling others (intentionally). The most important piece of this protocol is Thou Shalt Not P-Hack.
Imagine a researcher who, after investing much time and effort into an experiment, discovers that the effect has a p-value of 0.055, just large enough to earn the journal editor's ire. Because statistics is an inexact science, this situation presents much latitude for mischief. By removing some "outliers" or other "bad" data, picking a different way of computing the p-value, or applying a host of other tricks, it is quite easy to nudge the offending p-value south of 0.05.
It doesn’t matter whether the intention of the researcher is to mislead. As Dana Carney noted, one can easily convince oneself that the aforementioned tricks represent good science rather than poor ethics. The publish-or-perish culture within the academic and research community is an enabler of this “p-hacking” phenomenon.
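To see why this matters, here is a rough simulation of the idea (my own sketch, not a reconstruction of anyone's actual analysis): both groups are drawn from the same distribution, so there is no true effect, yet reporting the best of three "reasonable-looking" analyses pushes the false positive rate well above the nominal 5 percent.

```python
# Sketch: how trying a few analyses and keeping the best p-value inflates
# false positives. The "tricks" below (dropping the most extreme point from
# each end, switching to a nonparametric test) are illustrative stand-ins.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n = 5_000, 30
plain_hits = hacked_hits = 0

for _ in range(n_sims):
    # No true effect: both groups come from the same distribution.
    a = rng.normal(size=n)
    b = rng.normal(size=n)

    p_plain = stats.ttest_ind(a, b).pvalue
    plain_hits += p_plain < 0.05

    # Alternative analyses of the same data.
    p_rank = stats.mannwhitneyu(a, b, alternative="two-sided").pvalue
    p_trim = stats.ttest_ind(np.sort(a)[1:-1], np.sort(b)[1:-1]).pvalue

    hacked_hits += min(p_plain, p_rank, p_trim) < 0.05

print(f"False positives, single pre-specified test: {plain_hits / n_sims:.3f}")
print(f"False positives, best of three analyses:    {hacked_hits / n_sims:.3f}")
```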
The same tricks used to p-hack can be used to solve statistical issues legitimately. That is one of the key ideas in my book Numbersense (link). The same methods can lead to either good or bad analyses. It boils down to a matter of trust and ethics. A replication study performed by a different lab is a great way to check the results for possible p-hacking. Those who trust the existing system will see no need to perform such studies, and may even feel that their ethics are being questioned by those who undertake such studies.
It is challenging, if not impossible, for an outsider (possibly including a co-investigator or a journal reviewer) to detect whether a p-value has been hacked. (In her note, Amy Cuddy claimed no knowledge of Dana Carney's work process.) Sometimes one gets a hint from disclosures of how the data were processed and analyzed, but most journals do not require authors to submit complete programming code, nor do they require authors to detail every step of the data processing and analysis.
Besides, steps taken after peering at the p-values can be portrayed as part of the original design. This is why researchers nowadays are encouraged to "pre-register" their designs. Had her experimental design been pre-registered, Carney might not have had to issue her mea culpa, in which she said she didn't realize at the time that she was p-hacking.
Recently, replication studies have come into fashion. Several careful, large-scale studies have failed to replicate seminal findings in fields including psychology and medicine, shaking the foundations of those sciences. The "replication crisis" means journal editors can no longer argue that p < 0.05 guarantees that the original finding is valid. At long last, some of this research has reached top journals.
Non-replication comes in two varieties. The effect in the replication study may have the same direction as the one in the original paper, albeit at a weaker magnitude that is indistinguishable from noise. This is a major sign of trouble because the replication study typically uses a larger sample than the original. Sometimes, the effect may even run in the opposite direction (which was the case in some of the Ranehill et al. replication studies).
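A small simulation can make both varieties concrete. It assumes, purely for illustration, that the original result was a false positive (the true effect is zero), that an original study gets "published" only when it clears p < 0.05, and that the replication uses a sample ten times larger.

```python
# Sketch of the two varieties of non-replication, assuming the true effect
# is zero and that only "significant" original studies get published.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_orig, n_rep = 20, 200          # replication uses a much larger sample
published = same_sign_weak = opposite_sign = 0

while published < 1_000:
    a = rng.normal(size=n_orig)
    b = rng.normal(size=n_orig)
    if stats.ttest_ind(a, b).pvalue >= 0.05:
        continue                  # not "significant"; never published
    published += 1
    orig_effect = a.mean() - b.mean()

    # Independent replication with the larger sample.
    a2 = rng.normal(size=n_rep)
    b2 = rng.normal(size=n_rep)
    rep_effect = a2.mean() - b2.mean()
    rep_p = stats.ttest_ind(a2, b2).pvalue

    if np.sign(rep_effect) != np.sign(orig_effect):
        opposite_sign += 1        # effect runs the other way
    elif rep_p >= 0.05:
        same_sign_weak += 1       # same direction, indistinguishable from noise

print(f"Same direction but not significant: {same_sign_weak / published:.2f}")
print(f"Opposite direction:                 {opposite_sign / published:.2f}")
```

Under this admittedly extreme assumption, almost none of the "published" results replicate, and roughly half of the replications point the other way.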
***
More to come: file drawers, the garden of forking paths, and more!
Second post is up now (Here)
I enjoyed Freakonomics but also found some of the analyses a bit worrying. Given that I think economists don't understand some things about economies, it is unlikely that they have corrected properly in any longitudinal analysis. It is interesting that there is a rival theory to the one attributing reduced levels of violent crime to changes in abortion laws, namely that the reduction is due to reduced lead exposure. Both have their problems, as there is so much confounding with socio-economic status.
21st century economics seems to be able to model everything except economies. I'm waiting for GFC 2, when economists will have to find another reason why they didn't predict it.
Posted by: Ken | 10/20/2016 at 06:52 PM
Kaiser,
I like your post but think you have used "generalizable" where "replicable" would be better.
Correct me if I'm wrong. My sense of these things is that a replicable experiment is one where, taking the same population, the same effect will be found, with perhaps minor differences because the samples from the population will not be identical.
Generalizability raises the question of whether a same/similar effect will be found in another population. Case in point: effects observed in small samples of psychology students might not be generalizable to a wider, more diverse, perhaps less self-selected population.
An experiment that is replicable in a not very interesting population is not very interesting. A generalizable one crosses age/gender/personality/geography etc to be a genuinely useful insight.
Posted by: CfE | 10/21/2016 at 04:20 AM
CfE: You raise a point that probably merits a different post. Because the current controversy isn't about sampling bias, I have stayed away from that subject. Replicability is necessary but not sufficient for generalizability. In the framework for p-values, we assume that the sample is representative of the population to which one wants to generalize the result. In reality, you're opening a can of worms here: I think another huge concern with any of these social science experiments is the reliance on samples of students when conclusions are drawn about the population.
That said, I don't see why the replication movement can't extend beyond repeating the original experiment. Seems like applying the same experimental design to a different sample (such as non-students) is a worthwhile undertaking as well.
Posted by: Kaiser | 10/21/2016 at 11:12 AM