Andrew and I warned you about "power poses" in Slate some time ago (link).
The breaking news is that Dana Carney, a co-author of the paper that claimed those benefits, has now confirmed that she no longer believes in the power pose. She is actively discouraging researchers from this "waste of time and resources."
Here is her statement (PDF link), which is well worth reading in full. It is a courageous document.
The statement discloses a variety of tricks used to game p-values so that they meet the publishable 0.05 threshold. Everyone suspects that someone else is playing such tricks, but it's rare that someone actually confesses to them.
The highlights are:
Initially, the primary DV of interest was risk taking. We ran subjects in chunks and checked the effect along the way. It was something like 25 subjects run, then 10, then 7, then 5. Back then this did not seem like p-hacking. It seemed like saving money (assuming your effect size was big enough and p-value was the only issue)
Unfortunately, I have witnessed this type of p-hacking in industry all too often. Many, many people run so-called A/B tests until they reach significance. There are several problems with what Carney described above. Imagine a true effect size that is small (close to zero). As the samples accumulate, the measured effect fluctuates around zero; if you peek long enough, the p-value will dip below 0.05 by chance, and then you stop. Further, they reduced the chunk size as the experiment continued: smaller chunks mean more frequent looks at an estimate still subject to large sampling variability, which makes it even more likely that the measure hits an extreme value by chance!
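To make the mechanics concrete, here is a minimal simulation sketch (Python with NumPy and SciPy; the chunk sizes echo Carney's description, everything else is made up). The true effect is exactly zero, yet stopping at the first peek with p < 0.05 inflates the false-positive rate well above the nominal 5%:

```python
# Optional stopping under a true null effect: peek after each chunk of
# subjects and stop as soon as p < 0.05. Chunk sizes follow Carney's
# description (25, 10, 7, 5), hypothetically extended with more looks.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
chunks = [25, 10, 7, 5, 5, 5, 5, 5]   # subjects added per group at each peek
n_sims = 5000

false_positives = 0
for _ in range(n_sims):
    treatment, control = [], []
    for n in chunks:
        # True effect is zero: both groups are draws from N(0, 1).
        treatment.extend(rng.normal(0, 1, n))
        control.extend(rng.normal(0, 1, n))
        _, p = stats.ttest_ind(treatment, control)
        if p < 0.05:                   # "significant" -- stop here
            false_positives += 1
            break

print(f"False-positive rate with peeking: {false_positives / n_sims:.3f}")
# A single test at the full sample size would hold this near 0.05;
# peeking after every chunk pushes it well above that.
```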
It's tough for me to believe that she wasn't aware that stopping when you hit p=0.05 is p-hacking, but that is what she is saying.
For the risk-taking DV: One p-value for a Pearson chi square was 0.052 and for the Likelihood ratio it was 0.05. The smaller of the two was reported... I had found evidence that it is more appropriate to use "Likelihood" when one has smaller samples and this was how I convinced myself it was OK.
She's focused here on the researcher-degrees-of-freedom issue. The larger problem is the magic dust that seems to sprinkle off p=0.05. If that is the chosen threshold for significance, and my result is right on the cusp, I would be very skeptical of the result. I don't think 0.052 is the better number; they are both bad.
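To see how little daylight there can be between the two tests, here is a quick sketch (Python with SciPy; the 2x2 table is invented for illustration). Pearson's chi-square and the likelihood-ratio (G) test ask the same question of the same data but rarely return identical p-values; when a result sits on the cusp, reporting whichever happens to be smaller is exactly the degree of freedom Carney describes:

```python
# Two standard tests of independence on the same 2x2 table. The
# numbers below are made up; the point is that the p-values differ,
# and near 0.05 that difference alone can decide "significance".
from scipy.stats import chi2_contingency

table = [[8, 17],
         [13, 12]]

pearson_p = chi2_contingency(table, correction=False)[1]
g_test_p = chi2_contingency(table, correction=False,
                            lambda_="log-likelihood")[1]

print(f"Pearson chi-square p-value:   {pearson_p:.3f}")
print(f"Likelihood-ratio (G) p-value: {g_test_p:.3f}")
```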
The self-reported DV was p-hacked in that many different power questions were asked and those chosen were the ones that "worked".
Many A/B testing platforms come with a battery of hundreds of metrics automatically computed for each test. No further comment needed.
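OK, one comment after all, in the form of a simulation sketch (Python; all numbers are hypothetical): run an A/B test in which the treatment changes nothing, compute 200 metrics, and count how many come out "significant" anyway.

```python
# An A/B test with zero true effect on any of 200 metrics. At a 0.05
# threshold we should expect about 200 * 0.05 = 10 metrics to clear
# "significance" purely by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_users, n_metrics = 1000, 200

a = rng.normal(0, 1, (n_users, n_metrics))   # control group
b = rng.normal(0, 1, (n_users, n_metrics))   # identical "treatment" group

p_values = stats.ttest_ind(a, b, axis=0).pvalue
print(f"'Significant' metrics out of {n_metrics}: {(p_values < 0.05).sum()}")
```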
As of today, the TED talk on "power poses" is still going strong. It has accumulated 36 million "views" and the official description does not mention Dana Carney's retraction.