Via the Social Science Statistics blog, I found this article in the Times about baseball's longest hitting streaks. The authors ran 10,000 simulations of baseball seasons, using historical data, to come up with a probability distribution of the longest hitting streak in each season. They showed the following chart.
The record was 56 consecutive games with hits in a season, which in some circles is seen as unbeatable. These authors -- "in a fit of scientific skepticism" -- found that in any season, the simulated longest streak ranged from 39 to 109 games, with the median at 53. They concluded that "the unlikely becomes likely".
That is sure to turn some heads. I have a question for them, as I can't make sense of these numbers. A median of 53 means that 50% (5,000 out of 10,000) of the simulated seasons produced a hitting streak of at least 53 games. Empirically, according to here, DiMaggio's was the only streak to go over 53. Using the authors' timeline of 1871 to 2005, that covers 134 seasons, and one out of 134 is about a 0.75% probability. 0.75% versus 50%... sounds like something has gone wrong.
The article doesn't give enough details on the simulation so it is hard to understand what is going on. I hope I am not misinterpreting their analysis.
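The article doesn't spell out the mechanics, but a simulation in the spirit it describes -- each game an independent coin flip with a fixed per-game hit probability -- might look like the sketch below. This is my guess at the setup, not the authors' actual code, and the hitter inputs are hypothetical placeholders for whatever historical data they used.

```python
import random

def longest_streak(p_game, n_games, rng):
    """Longest run of consecutive games with at least one hit,
    treating games as independent with a fixed per-game probability."""
    best = cur = 0
    for _ in range(n_games):
        if rng.random() < p_game:
            cur += 1
            best = max(best, cur)
        else:
            cur = 0
    return best

def simulate_histories(hitter_seasons, n_sims=10_000, seed=0):
    """One 'history' = every hitter-season replayed once; record the
    single longest streak anywhere in that history.  `hitter_seasons`
    is a list of (per-game hit probability, games played) pairs --
    hypothetical stand-ins for the 1871-2005 data."""
    rng = random.Random(seed)
    return [max(longest_streak(p, g, rng) for p, g in hitter_seasons)
            for _ in range(n_sims)]
```

For the per-game probability, the usual back-of-envelope conversion is 1 minus the chance of going hitless in every at-bat: a .356 hitter with four at-bats a game gets a hit in roughly 1 - (1 - .356)^4, or about 83% of games.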
Source: "A Journey to Baseball's Alternate Universe", Samuel Arbesman and Steven Strogatz, New York Times, March 30, 2008.
PS. As readers pointed out, each simulation covers all of the seasons. So the histogram is saying that the particular sequence of 134 seasons we lived to see is not a rarity, considering all the possibilities. I'm not sure this is telling us much. It doesn't address the question of how likely the 56-game record is to be beaten in the future. It can't address that question because the particular sequence is already set; the alternate universes are irrelevant because we can't jump from one universe to another mid-stream.
Also, readers want each hitter's probability to be modeled rather than set to the historical average; in other words, factor in the opposing pitcher, home/away splits, and so on.
I'll throw in another... there must have been an assumption of independence from one game to the next. One would think the pressure would be much higher on the hitter once he gets to 45, 50, or 53 games, and it would be inappropriate to assume the hitting probability remains the same.
Along those lines, why should the hitting probability be treated as fixed rather than modeled as a probability distribution, which would account for variance, as one of the readers suggested?
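Both relaxations are easy to graft onto a streak simulation. The sketch below is purely illustrative: the per-streak `penalty` and the Beta parameters are made-up knobs, not estimates, but they show where game-to-game dependence and variance in the hit probability would enter the model.

```python
import random

def longest_streak_relaxed(alpha, beta_, penalty, n_games, rng):
    """Longest hitting streak when (a) each game's baseline hit
    probability is drawn from Beta(alpha, beta_) rather than fixed,
    and (b) that probability is reduced by `penalty` for every game
    already in the current streak -- a crude stand-in for pressure.
    All parameters are illustrative, not fitted to data."""
    best = cur = 0
    for _ in range(n_games):
        p = rng.betavariate(alpha, beta_) - penalty * cur
        if rng.random() < max(0.0, p):
            cur += 1
            best = max(best, cur)
        else:
            cur = 0
    return best
```

With penalty set to zero and a sharply peaked Beta, this collapses back to the fixed-probability model; raising the penalty shortens the long streaks, which is exactly the effect the independence assumption rules out.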
For more, see this Wall Street Journal discussion.