Via the Social Science Statistics blog, I found this article in the Times about baseball's longest hitting streaks. The authors ran 10,000 simulations of baseball seasons, "using historical data to come up with a probability distribution of the longest hitting streak in each season." They showed the following chart.

The record was 56 consecutive games with hits in a season, which in some circles is seen as unbeatable. These authors, "in a fit of scientific skepticism," found that in any season the simulated longest streak ranged from 39 to 109 games, with the median at 53. They concluded that "the unlikely becomes likely."

That is sure to turn some heads. I have a question for them, as I can't make sense of these numbers. A median of 53 means that 50% (or 5,000 out of 10,000) of the simulated seasons ended up with a hitting streak exceeding 53 games. Empirically, according to here, DiMaggio's was the only streak to go over 53. Using the authors' timeline of 1871 to 2005, that would be 134 seasons. One out of 134 is a 0.75% probability. 0.75 versus 50... sounds like something has gone wrong.

The article doesn't give enough details on the simulation, so it is hard to understand what is going on. I hope I am not misinterpreting their analysis.

Source: "A Journey to Baseball's Alternate Universe", Samuel Arbesman and Steven Strogatz, Mar 30 2008.

PS. As readers pointed out, each simulation covers all the seasons. So the histogram is saying that the particular sequence of 134 seasons we lived to see is not a rarity, considering all the possibilities. I'm not sure this is telling us much. It doesn't address the question of how likely the 56-game record is to be beaten in the future. It can't address that question because the particular sequence is already set; the alternative universes are irrelevant because we can't jump from one universe to another mid-stream.
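For concreteness, here is a minimal sketch of what such a simulation might look like. Every ingredient is my own guess, since the article gives no details: one star hitter per season, a fixed probability of getting at least one hit per game, and 134 seasons per simulated history.

```python
import random

def longest_streak(p_game, games=154):
    """Longest run of consecutive games with at least one hit,
    when each game is independently a hit-game with prob p_game."""
    best = cur = 0
    for _ in range(games):
        if random.random() < p_game:
            cur += 1
            best = max(best, cur)
        else:
            cur = 0
    return best

def simulate_history(seasons=134, p_game=0.78):
    """Record streak across one simulated alternate history.
    p_game = 0.78 is roughly a .330 hitter with 4 ABs: 1 - 0.67**4."""
    return max(longest_streak(p_game) for _ in range(seasons))

# Distribution of the record streak across many alternate histories
records = sorted(simulate_history() for _ in range(1000))
print("median record streak:", records[500])
```

The histogram in the article would then be the distribution of `records`; the specific numbers here are purely illustrative.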

Also, readers wanted each hitter's probability to be modeled rather than using the historical average; in other words, factoring in the opposing pitcher, home/away, etc.

I'll throw in another... there must have been an assumption of independence from one game to the next. One would think the pressure would be so much higher on the hitter once he gets to 45, 50, 53, etc. games, and it would be inappropriate to assume the hitting probability remains the same.

Along those lines, why should the hitting probability be treated as fixed, rather than modeled as a probability distribution, which would account for variance as one of the readers suggested?
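One way to see what that could change (a toy sketch of my own, not anything from the article): draw each season's per-game hit probability from a Beta distribution with the same mean as a fixed-probability model, and compare the resulting longest streaks.

```python
import random

def longest_streak(p, games=154):
    """Longest run of games with >=1 hit, at per-game hit probability p."""
    best = cur = 0
    for _ in range(games):
        if random.random() < p:
            cur += 1
            best = max(best, cur)
        else:
            cur = 0
    return best

def avg_record(draw_p, seasons=2000):
    """Average longest streak when each season's p is drawn by draw_p()."""
    return sum(longest_streak(draw_p()) for _ in range(seasons)) / seasons

# Fixed per-game hit probability 0.80, vs. a season-level p ~ Beta(8, 2),
# which also has mean 0.80 but varies from season to season.
fixed = avg_record(lambda: 0.80)
varying = avg_record(lambda: random.betavariate(8, 2))
print(f"fixed: {fixed:.1f} games, varying: {varying:.1f} games")
```

In this sketch, season-level variance actually lengthens the average longest streak (great seasons pay off more than poor seasons cost, since the streak length is a convex function of p), so where the variance enters the model matters.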

For more discussion, see this Wall Street Journal discussion.

The simulation is for the maximum hitting streak over the whole history of baseball. So 50% of simulated histories have at least one season with a hitting streak longer than 53.

It is actually a bit silly, because the higher streaks belong either to DiMaggio or to the couple of guys from an earlier era. So the question becomes: with a bit of luck, could these guys have been better? The answer of course is yes.

Posted by: Ken | Apr 04, 2008 at 01:22 AM

Ken's got it right. This is claiming that if professional baseball were played over its ~1871 to 2007 history 10,000 times, then 53 would be the median record per "alternate history," not per season.

This also solves your last question. It's not .75% compared to 50%, because the 50% applies to all seasons per "alternate history," not to any individual season.

So the record is still impressive, but yes, in some sense it could have been more impressive in an alternate universe :)

Posted by: Zubin | Apr 04, 2008 at 01:36 AM

But "unbeatable", when claimed in 2008, might be a prediction that 53 is unlikely to be exceeded in the space of all future universes 2008-infinity, not all alternative universes 1871-2007.

See Stephen Jay Gould's Full House for why outstanding excellence in baseball gets harder to achieve as time goes on and the average player improves.

Posted by: derek | Apr 04, 2008 at 02:59 AM

I think the above comments are correct; however, there's also a methodological problem with the study, in that it assumes the batter's batting average is the result of a string of ABs against equal pitchers. In fact this is not the case (some pitchers are better than others), and it is also the assumption that maximizes the expected length of a hitting streak.

For simplicity, imagine two scenarios:

(A) A batter gets 4 ABs per game, facing a different pitcher each game; against each pitcher he has a 1/3 chance per AB of getting a hit.

(B) A batter gets 4 ABs per game, facing a different pitcher each game; against 1/3 of pitchers he has a 100% chance of getting a hit, and against the other 2/3 a 0% chance.

In both cases, the batter will hit .333. However, lengthy hitting streaks are far more likely in scenario A.
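A quick simulation bears this out (a sketch: 154-game seasons, using the per-game hit probabilities the two scenarios imply).

```python
import random

def avg_longest_streak(p_game, games=154, trials=2000):
    """Average longest run of games with >=1 hit, at per-game hit prob p_game."""
    total = 0
    for _ in range(trials):
        best = cur = 0
        for _ in range(games):
            if random.random() < p_game:
                cur += 1
                best = max(best, cur)
            else:
                cur = 0
        total += best
    return total / trials

# Scenario A: 1/3 chance per AB over 4 ABs, so P(>=1 hit in a game) = 1 - (2/3)**4
# Scenario B: hits come only against the 1/3 of "sure-thing" pitchers
a = avg_longest_streak(1 - (2 / 3) ** 4)  # about 0.80 per game
b = avg_longest_streak(1 / 3)
print(f"A: {a:.1f}-game streaks on average, B: {b:.1f}")
```

Same batting average, very different streak behavior: concentrating the hits into fewer games (scenario B) slashes the chance that any long run of hit-games occurs.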

-- Eric

Posted by: Eric | Apr 04, 2008 at 08:29 AM

Hear, hear, Derek: this is a classic case of rigor (the 10,000 iterations) masking the overall immaturity of the variable definitions in the simulation. What of the variance in pitcher performance due to pitcher skill level, coaching tactics (not pitching to the streaking hitter), or the overwhelming pressure of intensified public scrutiny (as well evidenced in the first-person narratives of DiMaggio and would-be inheritors of the record)? Assuming a flat probability from ABs 1 through 56 is the inherent flaw in the Times' analysis, be it 10,000 or 10 million simulations.

Posted by: le ped | Apr 04, 2008 at 11:18 PM

This is a fascinating post. I'd never thought that one would need so many variables to make this simulation "at least plausible".

I'm very much interested in statistics, even though my knowledge is very limited. Any idea where I could find some resources to finally understand what you smart guys are talking about? :-P

-- Tim

Posted by: Tim | Apr 06, 2008 at 06:17 AM

I highly recommend "Curve Ball", the book by Albert and Bennett, for a statistician's perspective. It's featured in the Core Collection; click on the link above and then click on popular statistics.

Posted by: Kaiser | Apr 06, 2008 at 01:25 PM

Ordered on Amazon. Thanks for the tip, Kaiser :-)

-- Tim

Posted by: Tim | Apr 06, 2008 at 05:05 PM

Monte Carlo is always going to closely match historical data. Garbage in, garbage out. But its virtue is that it captures features that a model may overlook.

If you wanted to predict the likelihood of a streak in a given future season, you would want to take all the data (probably weighting the recent data more heavily, given changes in the game over time), find a probability distribution that is a good fit (perhaps something like the Poisson distribution), and estimate the parameters of that distribution from the data.
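For the simplest ingredient of such a model, a fixed per-game hit probability p, the chance of a k-game streak somewhere in an n-game season doesn't even need simulation; it can be computed exactly by dynamic programming over the current run length (my own sketch, not the commenter's method):

```python
def prob_streak_at_least(n, k, p):
    """P(some run of >= k straight games with a hit, in n games,
    each game independently a 'hit game' with probability p)."""
    probs = {0: 1.0}   # current run length -> probability (runs shorter than k)
    reached = 0.0      # probability the k-game streak has already occurred
    for _ in range(n):
        new = {0: 0.0}
        for r, q in probs.items():
            if r + 1 >= k:          # a hit completes the streak
                reached += q * p
            else:                   # a hit merely extends the run
                new[r + 1] = new.get(r + 1, 0.0) + q * p
            new[0] += q * (1 - p)   # no hit: the run resets to zero
        probs = new
    return reached

# e.g. a .350 hitter averaging ~4 ABs per game: p = 1 - 0.65**4, about 0.82
print(prob_streak_at_least(154, 56, 1 - 0.65 ** 4))
```

Estimating p from data (and letting it vary by hitter and era) is then the hard statistical part the commenter is pointing at.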

Posted by: ohwilleke | Apr 07, 2008 at 11:47 AM