One of the most impressive courses I took was the Investment Management class taught by Andre Perold at Harvard. For each session, we had to evaluate some fancy-schmancy, attractively-named, financially-engineered investment product (hint: most of them you don't want to touch). And again and again, Perold led us back to the question: what is the alternative?
I kept coming back to this question as I read Tom Siegfried's wrong-headed article ("Odds are, it's wrong") about the use of statistics by scientists. In it, he makes a mockery of the statistics profession, accentuating the negatives and ignoring the positives while offering no useful alternatives. Unsuspecting readers may take this caricature of statistical analysis and conclude erroneously that it is a waste of time. I am surprised that other bloggers have given him a pass.
Siegfried did offer one alternative but got it all wrong. He seemed to hint that Bayesian statistics will save us. I suspect that even dedicated Bayesians like Andrew Gelman will not make this claim (he made some comments here). Bayesian statistics do solve some problems but not many of those that terrified Siegfried, which include: the existence of false positive results, the inability to "prove" science, lack of practical significance, errors from multiple comparisons, the inability to generate truly random samples, too much aggregation, publication bias, lack of replication, etc.
The potential for misuse of hypothesis tests (and p-values) is common knowledge. Siegfried appeared to be targeting more than misapplication; at times, he argued against the entire concept of a statistical test. For instance, he said "The 'scientific method' of testing hypotheses by statistical analysis stands on a flimsy foundation... the standard methods mix mutually inconsistent philosophies and offer no meaningful basis for making such decisions." (my emphasis)
The fundamental problem in statistics is incomplete information. Because we must generalize from a sample of data, there is always uncertainty in any estimate. Instead of seeking to eliminate said uncertainty, statisticians try to quantify it via a probability distribution. We further acknowledge that, with small but positive probability, the observed data may not yield the "true" answer. Siegfried does not like this, and believes that scientists should look for something that will give them "truth". In his words:
"there's no logical basis for using a P value from a single study to draw any conclusion. If the chance of a fluke is less than 5 percent, two possible conclusions remain: there is a real effect, or the result is an improbable fluke. Fisher's method offers no way to know which is which."
He quotes biostatistician Steven Goodman as saying "A lot of scientists don't understand statistics. And they don't understand statistics because the statistics don't make sense." This quote is used to support an indictment of our field that reads "any single scientific study alone is quite likely to be incorrect, thanks largely to the fact that the standard statistical system for drawing conclusions is, in essence, illogical." But what Goodman meant was that the misuse of statistics is sometimes caused by its counter-intuitiveness; he isn't saying that statistics in general is illogical. (I verified this with Steve.)
What's the alternative? Assume that a method exists which tells us the observed sample is a fluke. This implies we know the true value of the parameter we are estimating. And this implies we have the answer before we take the sample. Then there is no need to infer from a sample. In the Bayesian world, the parameter does not have a value; it has a distribution. In effect, all values are possible but some are more likely. How can that tell us whether the observed sample is a fluke?
Replication is offered as some kind of remedy. This is often impractical, as Andrew pointed out. Besides, two replicates can make two flukes. So it's not guaranteed to solve anything. If there is time and money to repeat the experiment, how much gain does one get from replication compared to running the original experiment at twice the sample size?
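To put rough numbers on the replication question, here is a minimal sketch. Everything in it is an assumption chosen for illustration: a two-sided z-test on a mean with unit variance, a true effect of 0.2 standard deviations, 100 observations per study, and the 5 percent cut-off. The function name `power_two_sided_z` is made up for this example, and "significant" here ignores the direction of the effect, which is a simplification.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def power_two_sided_z(effect, n, alpha=0.05):
    """Chance that a two-sided z-test rejects, given the true mean is `effect`
    (in standard-deviation units) and the sample has n observations."""
    z_crit = _STD.inv_cdf(1 - alpha / 2)
    shift = effect * n ** 0.5
    return (1 - _STD.cdf(z_crit - shift)) + _STD.cdf(-z_crit - shift)


# Illustrative inputs (assumptions, not taken from any real study).
effect, n, alpha = 0.2, 100, 0.05

# With no real effect, each study is a fluke with probability alpha,
# so two independent replicates are both flukes with probability alpha squared.
both_flukes = alpha ** 2

# With a real effect: chance that both replicates of size n are significant...
power_both = power_two_sided_z(effect, n, alpha) ** 2
# ...versus one study of size 2n at the usual cut-off...
power_double = power_two_sided_z(effect, 2 * n, alpha)
# ...versus one study of size 2n held to the stricter alpha-squared cut-off.
power_double_strict = power_two_sided_z(effect, 2 * n, both_flukes)

print(f"P(two flukes | no effect)       = {both_flukes:.4f}")
print(f"P(both replicates significant)  = {power_both:.3f}")
print(f"P(2n study significant)         = {power_double:.3f}")
print(f"P(2n study significant, strict) = {power_double_strict:.3f}")
```

Under these assumed inputs, the single larger study detects the effect more often than the pair of replicates, even when it is held to the same two-fluke error rate. Other effect sizes and sample sizes will shift the numbers, and the sketch says nothing about the non-statistical reasons one might still want an independent replication.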
I discussed the case of the steroid test in Numbers Rule Your World. While the example (Box 4) illustrates Bayes' Theorem properly, it leads to a conclusion of excessive false positives that is at odds with the real world.
The example makes the assumption that 5 percent of MLB players take steroids, and that the test returns positive findings for about 10 percent of the tested samples. In reality, the positive rate of MLB steroid tests is much below 1 percent (in 2008, only 14 positives out of 3,500 tests), and seasoned observers (such as Jose Canseco who has been proven right again and again) estimate that well over 10 percent of players may be doping. So the real problem is false negatives.
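The arithmetic behind both views can be sketched in a few lines. The 5 percent prevalence, the 14-out-of-3,500 figure and the 10 percent doping estimate come from the paragraph above; the 95 percent sensitivity and 5 percent false positive rate are my assumptions, picked because they reproduce a positive rate of about 10 percent. The function names are invented for this illustration.

```python
def positive_rate(prevalence, sensitivity, false_positive_rate):
    """Share of all tests that come back positive (true plus false positives)."""
    return prevalence * sensitivity + (1 - prevalence) * false_positive_rate


def prob_user_given_positive(prevalence, sensitivity, false_positive_rate):
    """Bayes' Theorem: P(user | positive test)."""
    true_positives = prevalence * sensitivity
    return true_positives / positive_rate(prevalence, sensitivity, false_positive_rate)


# Box-style scenario: 5% of players use; test accuracy figures are ASSUMED.
box_rate = positive_rate(0.05, 0.95, 0.05)
box_ppv = prob_user_given_positive(0.05, 0.95, 0.05)

# Real-world figures quoted above: 14 positives out of 3,500 tests in 2008.
observed_rate = 14 / 3500

# If at least 10% of players dope, then even if every positive is a true
# positive, the test catches at most this share of the users.
max_sensitivity = observed_rate / 0.10

print(f"Box scenario positive rate:   {box_rate:.3f}")
print(f"Box scenario P(user | pos):   {box_ppv:.2f}")
print(f"Observed positive rate:       {observed_rate:.4f}")
print(f"Upper bound on users caught:  {max_sensitivity:.2f}")
```

Under the assumed accuracy figures, half of the positives in the Box-style scenario are false. With the observed positive rate and a doping rate of 10 percent or more, the test would be catching no more than about 4 percent of users, which is why false negatives look like the real problem.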
Box 1 deals with the arbitrary nature of a cut-off p-value, and makes the oft-repeated complaint about edge cases that just miss the cut-off. This is a valid criticism but with zero practical significance.
This issue is endemic to any binary decision problem. The top colleges have a fixed number of slots per class, and one can be sure that there is little appreciable difference between the top-ranked rejected candidate and the lowest-ranked accepted candidate. Where parking meters stop charging at 6 pm, cops give tickets to drivers who park without paying at 5:59 pm but not to the ones who park without paying at 6:01 pm. And, as noted in SuperFreakonomics, youth-league baseball teams set a cutoff birthday of July 31, meaning that kids born in some months are disadvantaged. Statisticians are doing something everyone else is also doing.
In real problems where a decision is a required outcome (e.g. the drug has to be approved or not, the marketing campaign has to be launched or not, the ballplayer is drafted or not), a cut-off value has to be established, and any such value will seem arbitrary to some.
In my opinion, Fisher did us all a favor by establishing a convention of 95% confidence. Without this, it would have been difficult to bring conflicting parties together to make decisions. The result would be even more arbitrary; the numbers would become subordinate to the power and influence of the decision-makers.
What's the alternative?
I agree with David (in his response to Andrew) that Box 2's example of the hungry dog is a silly example. It's not even a statistics problem (it's a toy probability problem). Where's the data and what population value are we trying to estimate?
The "gold standard" of randomized clinical trials is subject to the possibility that, by chance, an unknown factor is not properly balanced between the test and control groups. Again, a valid complaint, but what is the alternative?
If the unbalanced factor is a known covariate, then a stratified sample or a post-randomization check solves the problem. Alternatively, one can use post-trial adjustments, such as propensity score methods. If the unbalanced factor is unknown, what else is better than randomization?
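For the known-covariate case, stratifying the randomization is simple to sketch. The example below is a toy under stated assumptions: 40 hypothetical participants identified by number, a made-up "smoker" flag as the known covariate, and an invented helper `stratified_assign`. It randomizes separately within each stratum so that the covariate is balanced between the arms by construction.

```python
import random
from collections import defaultdict


def stratified_assign(units, stratum_of, seed=0):
    """Randomly split each stratum in half between treatment and control.

    units: hashable participant identifiers
    stratum_of: function mapping a unit to its stratum label
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)

    assignment = {}
    for members in strata.values():
        rng.shuffle(members)          # random order within the stratum
        half = len(members) // 2      # first half treated, the rest control
        for i, unit in enumerate(members):
            assignment[unit] = "treatment" if i < half else "control"
    return assignment


# Hypothetical data: every fourth participant is a smoker (10 of 40).
participants = list(range(40))
smokers = {p for p in participants if p % 4 == 0}

assignment = stratified_assign(
    participants, lambda p: "smoker" if p in smokers else "non-smoker"
)

treated_smokers = sum(1 for p in smokers if assignment[p] == "treatment")
treated_total = sum(1 for arm in assignment.values() if arm == "treatment")
print(treated_smokers, "of", len(smokers), "smokers are in the treatment arm")
print(treated_total, "of", len(participants), "participants are in the treatment arm")
```

Within each stratum the assignment is still random, so any unknown factor is handled exactly as before: by chance, with no guarantee of balance in a given trial.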
What is troubling is Siegfried's fondness for sensational language, which ends up painting the profession of statistics in a ridiculous light. A quick sampling includes: "a mutant form of math has deflected science's heart", "seduced by statistics", "like a crapshoot", "dirtiest secret", "flimsy foundation", "inconsistent", "erroneous", "contradictory", "confusing", "spawned countless illegitimate findings"; and these examples come only from the introductory section.
While the article will revive some lively debate among statisticians about the limits of our methods, I think it conveys a lop-sided, misleading impression to the wider audience. It would be far worse for science if scientists abandoned statistical analysis.
It is true that statistical conclusions always have a margin of error, that many published findings may be false, and that we don't have a grand unifying principle that everyone agrees with. But it is also true that if what is being measured is not governed by physical laws (which covers most of social science, medicine, business, education, etc.), there is no hope of a "100% certain" science. False results are part of the process of scientific inquiry, not a sign of its failure.