Andrew Gelman and I have published a piece in Slate, discussing the failure to replicate scientific findings, using the recent example of the so-called power pose. The idea behind the "power pose" is that striking the pose before walking into a business meeting triggers psychological and hormonal changes, which in turn make people more powerful.
As you often read here and on Gelman's blog, the fact that a paper was published in a scientific journal on the strength of a statistically significant result doesn't automatically make the finding believable. In this case, a different group of scientists tried to replicate the finding with a sample size five times larger, and their replication did not come close to statistical significance.
The original researchers wrote a response detailing differences between the two studies, which misses the point. While there are differences, as there would be in any replication attempt, the key issue is whether readers should believe a result that is so fragile. For example, the year in which the study was conducted is listed as a difference, as is the proportion of females (62% in the original study versus 49% in the replication). There are also differences of execution, such as how many minutes the pose was held and what type of regression was used in the analysis. Even if these differences explain the failure to replicate the original finding, they would imply that the conclusion depends on those very conditions, which does not engender trust in its generalizability.
This situation is not unique to the "power pose" study. Over the years, Andrew and I have discussed many other studies with similar problems. It is one of the few for which a replication has been attempted.
One of the points made in the Slate article is important to reiterate:
Through the mechanism called p-hacking or the garden of forking paths, any specific reported claim typically represents only one of many analyses that could have been performed on a dataset. A replication is cleaner: When an outside team is focusing on a particular comparison known ahead of time, there is less wiggle room, and results can be more clearly interpreted at face value.
This is a subtle point often missed by non-statisticians. There is a huge difference between a replication study for which researchers know in advance what is being analyzed, and a typical scientific study for which researchers may have measured an array of metrics, and then selectively reported ones that are "statistically significant."
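To make the contrast concrete, here is a small simulation of my own (illustrative only, not from the Slate article). It compares a pre-registered replication, which tests one comparison fixed in advance, with a flexible study that gets to look at many metrics and report any one that clears the significance bar. In both cases there is no true effect anywhere, so any "significant" result is noise.

```python
import random

random.seed(1)

def run_study(n_metrics, n_per_group=50, trials=2000):
    """Simulate studies in which no real effect exists, and count how
    often at least one of n_metrics comparisons looks 'significant'.
    Uses a simple z-test on group means (known sd = 1); the threshold
    |z| > 1.96 corresponds to p < 0.05, two-sided."""
    hits = 0
    for _ in range(trials):
        for _ in range(n_metrics):
            a = [random.gauss(0, 1) for _ in range(n_per_group)]
            b = [random.gauss(0, 1) for _ in range(n_per_group)]
            diff = sum(a) / n_per_group - sum(b) / n_per_group
            se = (2 / n_per_group) ** 0.5
            if abs(diff / se) > 1.96:
                hits += 1   # at least one metric 'worked'; stop looking
                break
    return hits / trials

# One comparison fixed in advance: false positives near the nominal 5%.
print(run_study(n_metrics=1))
# Free to report the best of 20 metrics: false positives balloon.
print(run_study(n_metrics=20))
```

The replication-style study comes up "significant" about as often as chance allows; the flexible study comes up "significant" most of the time, despite measuring pure noise.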
Here is an analogy that may help in understanding it:
Suppose you wanted to buy a woolen sweater during the after-Christmas sales at Macy's Times Square. At the store, you discovered that woolen sweaters were hard to come by but cashmere scarves were the deal of the decade; in addition, Macy's was running a buy-one-get-two-free promotion on dress pants. So when you checked out, you had purchased one cashmere scarf and three pairs of dress pants (none of which you had intended to buy). Now we can ask whether the shopping trip was successful.
Imagine you have a tradition of going to Macy's every year to get a new sweater. If your metric of success is whether you purchased a sweater, your success rate would not be high. However, if your metric is whether you purchased something, your success rate would be much higher. A replication study has a fixed metric, fixed by the prior study: it is like measuring success based on whether you purchased a sweater.
It is much easier to prove that one of many things could happen than to prove that one specific thing would happen. The trouble is that in the reporting of scientific findings, one of many things is typically presented as one specific thing. This is why replication is important: until we know that the one specific thing can be reliably replicated, we don't really have solid science.
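The arithmetic behind "one of many things" is simple. If each of k independent comparisons has a 5% chance of a false positive when no effect exists, the chance that at least one of them looks significant is 1 − 0.95^k, which grows quickly with k:

```python
# Chance of at least one 'significant' result among k independent
# null comparisons, each tested at the 5% level.
for k in (1, 5, 20, 100):
    print(k, round(1 - 0.95 ** k, 3))
```

With 20 comparisons, the chance of at least one spurious "finding" is already about 64%; with 100, it is a near certainty.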
Here is the link to our Slate article.
PS. This phenomenon is always hard to get across to students. I am not totally satisfied with this analogy. If you know of different ways to explain this, let me know.
PPS. This phenomenon is especially tricky in "big data" style studies. For example, many people run "A/B tests" in which they simultaneously track hundreds if not thousands of metrics, and then selectively report the differences that are statistically significant.
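A sketch of what this looks like in practice (hypothetical numbers, not from any real A/B platform): test 1,000 metrics on two groups drawn from the same distribution, so every "win" is noise, and count how many clear the usual 5% bar, with and without a Bonferroni-style adjustment of the threshold.

```python
import random

random.seed(7)

n_metrics, n = 1000, 200   # hypothetical scale for illustration

significant = 0
significant_bonf = 0
for _ in range(n_metrics):
    # Both arms drawn from the same distribution: no true effect anywhere.
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(a) / n - sum(b) / n) / (2 / n) ** 0.5
    if abs(z) > 1.96:    # uncorrected two-sided 5% threshold
        significant += 1
    if abs(z) > 4.06:    # rough two-sided cutoff for 0.05 / 1000
        significant_bonf += 1

print(significant)       # dozens of 'wins' that are pure noise
print(significant_bonf)  # typically zero after correction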