Yesterday, I started a series of posts covering the "power pose" research controversy. The plan is as follows:
Key Idea 1: Peer Review, Manuscripts, Pop Science and TED Talks
Key Idea 2: P < 0.05, P-hacking, Replication Studies, Pre-registration
Key Idea 3: Negative Studies, and the File Drawer (Today)
Key Idea 4: Degrees of Freedom, and the Garden of Forking Paths (Today)
Key Idea 5: Sample Size
Here is a quick overview of the key documents:
I recently covered the power pose research controversy, ignited by an inflammatory letter by Susan Fiske (link). Dana Carney, one of the coauthors of the original power pose study, courageously came forward to disown the research, and explained the reasons why she no longer trusts the result. Here is her mea culpa. Her co-author, Amy Cuddy, then went to New York Magazine to publish her own corrective, claiming that the "primary" finding of power pose is confirmed. Here is the statement Cuddy released through a publicist.
Now a quick summary of post #1. There is a hierarchy of research studies based on their status within the publishing process and the prestige of the publication or media outlet. Lower-quality research sometimes gets popularized through non-traditional outlets such as pop science and TED talks. Even top journals may thirst for media attention. The old world of p < 0.05 is slowly unraveling, partly because it relies on researchers adhering to certain protocols, such as not engaging in p-hacking. To some critics, the replication crisis is a warning sign that p-hacking is prevalent. P-hacking may be intentional or unintentional; in either case, the integrity of the science is compromised. Pre-registration and replication studies are potential remedies.
The concepts covered today are at the center of the current controversy.
Key Idea 3: Negative Studies, and the File Drawer
Now that replication studies have been legitimized, one type of study remains missing from the scholarly literature: so-called negative studies, which fail to confirm the researcher’s hypothesis, meaning the experimental findings fail to hit p < 0.05. This publication bias is known colorfully as the “file drawer effect.” If one scours the literature for studies, such as when conducting a meta-analysis, one will find only studies that show p < 0.05. The other studies are left in the proverbial file drawer. The effect size estimated by such an analysis will therefore be exaggerated.
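To make the file-drawer arithmetic concrete, here is a minimal simulation sketch (not from the original post; the true effect size, per-arm sample size, and study count are illustrative assumptions). Many small studies of a weak true effect are run, only those with p < 0.05 are "published," and the published studies alone substantially overstate the effect:

```python
# Hypothetical sketch: how the file drawer inflates meta-analytic effect sizes.
# Simulate many small two-arm studies of a weak true effect, "publish" only
# those with p < 0.05, and compare the published mean effect to the truth.
import random
import math

random.seed(42)

TRUE_EFFECT = 0.2   # small true effect, in standard-deviation units (assumed)
N = 25              # subjects per arm in each study (assumed)

def run_study():
    """Simulate one two-arm study; return (estimated effect, significant?)."""
    treat = [random.gauss(TRUE_EFFECT, 1) for _ in range(N)]
    control = [random.gauss(0, 1) for _ in range(N)]
    diff = sum(treat) / N - sum(control) / N
    se = math.sqrt(2 / N)          # unit variance is known, so a z-test suffices
    z = diff / se
    return diff, abs(z) > 1.96     # two-sided p < 0.05

results = [run_study() for _ in range(2000)]
published = [d for d, sig in results if sig]
all_mean = sum(d for d, _ in results) / len(results)
pub_mean = sum(published) / len(published)

print(f"true effect: {TRUE_EFFECT}")
print(f"mean effect, all studies: {all_mean:.2f}")
print(f"mean effect, published (p < 0.05) studies only: {pub_mean:.2f}")
```

A meta-analyst who can see only the published subset would estimate an effect roughly three times the true one in this setup.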
The file drawer is a form of p-hacking. If only positive results are published, then one can repeatedly perform (substantively) the same experiment until one lands a positive result by chance. If readers of that study with p < 0.05 had known of the file drawer filled with negative studies, they would have treated the published study with appropriate skepticism. “If at first you don’t succeed, try, try and try again” is a positive way to live one’s life but a negative way to conduct science that relies on statistical significance.
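The arithmetic behind "try, try and try again" is simple: if a single experiment on a null effect has a 5% false-positive rate, then the chance that at least one of k independent attempts "succeeds" is 1 - 0.95^k. A tiny illustrative calculation (not from the original post):

```python
# Hypothetical sketch: "try, try again" under a true null hypothesis.
# With a 5% false-positive rate per run, the chance that at least one of
# k independent runs hits p < 0.05 grows quickly with k.
for k in (1, 5, 10, 20, 45):
    p_at_least_one = 1 - 0.95 ** k
    print(f"{k:2d} attempts -> P(at least one p < 0.05) = {p_at_least_one:.2f}")
```

By around 14 attempts, a spurious "success" is more likely than not.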
Key Idea 4: Degrees of Freedom, and the Garden of Forking Paths
Developing a good experimental design is a key step in the scientific method. Table 1 in the Carney-Cuddy-Yap response to Ranehill et al.’s replication study is a useful reference to the types of decisions researchers make when conducting experiments. Whether the “power pose” is held for 2 minutes or 6 minutes is an example of a design choice. The experiment used in a given research project is a single configuration selected from a large set of possibilities.
Not all design decisions are equally important. The entries in Table 1 by Carney-Cuddy-Yap represent their opinion of which elements matter. The true impact of a specific design decision is not known unless the researcher explicitly tests multiple settings of that factor in the experimental design. Only a subset of the design decisions will materially affect the experimental outcome; some, perhaps many, of these decisions will be inconsequential.
Imagine a researcher whose experiment failed to yield the desired p < 0.05. Instead of throwing out the hypothesis, the researcher may re-run the experiment using a different configuration of the design elements (e.g. hold the pose for a longer duration). Again, there are both legitimate and dubious reasons for doing this. Since there are many possible configurations, this process can be repeated until a run achieves p < 0.05.
Assume that the effect being measured is non-existent. In any single run of the experiment, there is still a small chance of observing a spuriously positive signal. If one runs a sequence of experiments, eventually one will hit upon one that achieves p < 0.05. That study will be publishable, and thanks to the file-drawer effect (Key Idea 3 above), will be the only one that readers see. If the researcher isn’t careful, he or she might even conclude that the specific configuration of design choices contributed to the “success” of the experiment.
The current criticism of research practice is directed at this type of p-hacking, which some believe to have become all too common because many of the effects under study are likely to be only marginally positive. In addition, when negative studies are stacked away in the file drawer, the observed “path” of published studies represents a path of positive studies only. The problem is that such a path is a rare event akin to a series of consecutive Heads in a sequence of coin tosses. This is where the Cuddys of the world clash with the Gelmans.
In Cuddy’s world, the appearance of the next positive study in the series is a triumphant event confirming the original result. In Gelman’s world, the observed path of positive studies is only one path in a “garden of forking paths,” a rare event shaped by journal editing decisions, and thus not likely to yield valid generalizations. The garden of forking paths contains all possible configurations of studies, including any configurations that might have been tried and yielded negative (p > 0.05) results.
I have only spoken of one of the “researchers’ degrees of freedom,” namely, the design of the experiment. Another common fertilizer in the “garden of forking paths” is the freedom to choose the target variable. As Cuddy’s note reveals, the “power pose” studies have used a variety of target variables, including self-reported feelings of powerfulness, happiness, mood, hormone levels such as cortisol, potassium levels, and so on.
Proper protocol calls upon researchers to specify their target variable in advance, and not deviate from it even if the experimental results disappoint. In practice, there is a strong temptation to ask whether the treatment affected something else when the original target misses the p < 0.05 threshold. In the world of small, positive effects, the more target variables are inspected, the more likely one of them will show p < 0.05 by chance. In the statistics community, this “multiple testing” problem has long been recognized, and has a vast literature.
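A minimal sketch of the multiple-testing arithmetic (the eight target variables and the trial count are illustrative assumptions, not taken from the power pose studies): under a true null, p-values are uniformly distributed, so inspecting several target variables sharply raises the chance of a spurious finding:

```python
# Hypothetical sketch of the multiple-testing problem: under a true null,
# checking many target variables makes a spurious p < 0.05 likely.
import random

random.seed(1)

def one_hypothesis_test():
    """Simulate a single null test; p-values are uniform under the null."""
    return random.random()  # a uniform draw stands in for the p-value

TRIALS = 5000
K = 8  # number of target variables inspected per experiment (assumed)

hits = 0
for _ in range(TRIALS):
    pvals = [one_hypothesis_test() for _ in range(K)]
    if min(pvals) < 0.05:   # did any variable "show an effect"?
        hits += 1

print("nominal rate per test: 0.05")
print(f"rate of at least one 'finding' across {K} variables: {hits / TRIALS:.2f}")
```

With eight variables the theoretical rate is 1 - 0.95^8, about 0.34: nearly seven times the nominal 5%.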
Another temptation is to ask whether the treatment affected a subset of the test subjects when its effect on the average subject is deemed not significant. Such analysis should also be pre-registered. Otherwise, there are countless ways to slice and dice the test subjects, and in the world of small, positive effects, one can easily find subsets that show p < 0.05 by chance. And again, because of publishing protocol, only the positive studies see the light of day, leaving outsiders none the wiser.
There are too many other such causes to document here. I will mention one more, just because the New York Times recently published an article discussing it. The journalist (link) asked four pollsters to analyze the same underlying poll data, and they produced somewhat different estimates. The divergent analyses came about because of subjective assumptions made during the analytical process. Some of these assumptions may well tip p above or below 0.05.
As before, researchers are either fooling themselves or fooling others (or both). Unless the design is pre-registered, an outsider cannot tell whether negative studies have been abandoned or suppressed. The failure of many seminal studies to replicate is driving the narrative that most published studies are “false positives.”
This discussion ought to throw some doubt on the utility of a systematic review or meta-analysis, which consolidates evidence from multiple studies addressing the same topic. Whether the consolidation is qualitative or quantitative, it starts with identifying all published studies, and then culling the list down to “trustworthy” ones. But the set of published studies is a biased subset of all studies.
In the last part of this series, I will address the role of sample size in this controversy.