This is the third and final post about the controversy over the statistical analysis used in peer-reviewed scholarly research. Most of the new material is covered in post #2 (link). Today's post covers statistical issues related to sample size. This is nothing new, but Amy Cuddy mentioned it in her response to her critics, so I also discuss it here.
In post #2 (link), I offered the following mental picture of the two sides in the boxing ring:
- the traditionalists (e.g. Susan Fiske who wrote the scathing condemnation of the reformists) believe that there is nothing wrong with the long-accepted standards of publication - in their world view, each new published research article showing p<0.05 experiments reinforces prior publications, and strengthens the scientific basis of the research agenda;
- the reformists (e.g. Andrew Gelman) believe that the standards of publication are broken - in their view, the additional published research with p<0.05 experiments creates a false sense of security because consumers cannot see (1) any of the negative studies rejected by journal editors, or not written up because of anticipated rejection, and (2) results of a universe of alternative experiments that the researchers could have done that might have shown negative results.
For more on the concepts behind these arguments, with names such as file drawers and garden of forking paths, read my earlier post.
The full agenda is as follows:
Key Idea 1: Peer Review, Manuscripts, Pop Science and TED Talks (link)
Key Idea 2: P < 0.05, P-hacking, Replication Studies, Pre-registration (link)
Key Idea 3: Negative Studies, and the File Drawer (link)
Key Idea 4: Degrees of Freedom, and the Garden of Forking Paths (link)
Key Idea 5: Sample Size (Today)
Key Idea 5: Sample Size
In the previous post, I ended with a brief discussion of meta-analysis, or systematic review, which attempts to consolidate evidence, and present a summary of the state of the world.
One motivation for conducting a meta-analysis is to pool the data from numerous small studies. Many psychological experiments draw on only 30-50 subjects, typically students enrolled in college classes; by comparison, opinion polls typically have thousands of respondents. In any experiment, the effect under study must be strong enough to rise above the noise, and small samples carry more noise, making a real effect harder to detect. The smaller the anticipated effect, the larger the sample needed to observe it. When a statistician complains that a study is "under-powered," he or she is saying that its sample is too small to reliably detect the effect being studied.
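The notion of power can be made concrete with a quick simulation. Here is a sketch in Python, assuming a two-sample t-test and a hypothetical small effect (Cohen's d = 0.3); the sample sizes and effect size are illustrative, not taken from any actual study:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def estimated_power(n_per_group, effect_size=0.3, trials=2000, alpha=0.05):
    """Estimate power by simulation: the fraction of repeated experiments
    in which a two-sample t-test detects the (real) effect at p < alpha."""
    hits = 0
    for _ in range(trials):
        control = rng.normal(0.0, 1.0, n_per_group)
        treated = rng.normal(effect_size, 1.0, n_per_group)
        _, p = stats.ttest_ind(control, treated)
        if p < alpha:
            hits += 1
    return hits / trials

# Small study: the effect is real, but most runs fail to reach p < 0.05.
power_small = estimated_power(20)
# Larger study: the same effect is detected far more often.
power_large = estimated_power(100)
print(power_small, power_large)
```

With a small true effect and 20 subjects per group, the simulated power sits well below the conventional 80% target; quintupling the sample raises it substantially. An "under-powered" study is one sitting in the first regime.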
The Ranehill et al. power pose replication study used 200 subjects, five times as many as the original study's 40. All else being equal, the standard error drops by a factor of the square root of five, i.e. by more than half. So the power pose effect should be much easier to observe in the replication study than in the original, assuming the effect runs in the same direction.
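The arithmetic behind "drops by more than half" is just the square-root rule for standard errors, which shrink in proportion to the square root of the sample size:

```python
import math

# The standard error of a mean scales as 1/sqrt(n). Going from 40 subjects
# to 200 shrinks it by a factor of sqrt(200/40) = sqrt(5) ~ 2.24,
# i.e. by more than half.
se_ratio = math.sqrt(200 / 40)
print(se_ratio)      # ~ 2.236
print(1 / se_ratio)  # the replication's standard error is ~ 45% of the original's
```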
The concern about sample size or power in a study is well-trodden territory, and not controversial. It is part of the standard conversation during the design phase of any experiment, psychological or otherwise. The current statistical critique of the power pose research has nothing to do with sample sizes.
While this controversy has broken out in academia, it carries lessons for anyone working with "big data."
Nobody wants false-positive results, which lead to wasted time, effort and money. In this set of posts, I discuss a range of enablers of false-positive results:
- ignoring negative studies, or dismissing them as non-informative
- testing too many response variables
- testing too many sub-populations
- tweaking the experiments too much
- samples that are too small
- running your study on one population and drawing conclusions about a different population
It's not that you shouldn't do any of these things - they are standard steps in a data analysis process. But if different settings/variables/sub-populations lead to different conclusions, you have to be careful in interpreting the findings. Most researchers do not intentionally produce false-positive results, yet we are all at risk of making these mistakes unintentionally. It's all too easy to come up with stories that justify the tweaking/testing/ignoring that we do.