A kind reader sent me a Christmas gift, which accompanied me on my vacation. The book is Curve Ball by Jim Albert and Jay Bennett, and I'm completely fascinated by it. It presents a statistical perspective on baseball data, a soothing antidote to the nonsense spouted by the typical sportscaster. Even more impressively, the book is liberally sprinkled with charts, and these charts are generally of a very high standard.

Their first feat is to debunk the myth of the batting average, or BA (hits divided by at-bats). They accomplish this with an innovative chart.

Each vertical bar shows the range of plausible estimates of the batter's true BA after a given number of at-bats. The bars get shorter as the number of at-bats increases because, over the course of the season, we can be more and more certain of the batter's true hitting ability.

Notice that the bar is very tall in the first 100 at-bats, roughly ranging from 0.35 to 0.50. This illustrates why statisticians love data *quantity*: without sufficient samples, any estimate is highly unreliable.
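To get a feel for how quickly the bars shrink, here is a minimal sketch that treats each at-bat as an independent coin flip and uses a normal approximation to the binomial. This is my own assumption for illustration, not necessarily how Albert and Bennett computed their bars; the 90% level, the 0.300 average, and the function names `interval` and `bar_height` are all hypothetical choices.

```python
import math

# Two-sided 90% normal quantile. The 90% level is an assumption made for
# illustration; it is not stated as the method behind the book's chart.
Z_90 = 1.645


def interval(p_hat, at_bats, z=Z_90):
    """Normal-approximation interval for a batting average over `at_bats`."""
    se = math.sqrt(p_hat * (1 - p_hat) / at_bats)  # binomial standard error
    return p_hat - z * se, p_hat + z * se


def bar_height(p_hat, at_bats, z=Z_90):
    """Height of the bar, i.e. the width of the interval."""
    low, high = interval(p_hat, at_bats, z)
    return high - low


# A hypothetical 0.300 hitter at three points in the season.
for n in (100, 250, 700):
    print(n, round(bar_height(0.300, n), 3))
```

Under these assumptions the bar is about 0.15 tall at 100 at-bats and still close to 0.06 at 700. Because the height scales with one over the square root of the number of at-bats, halving the bar takes four times as many at-bats.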

Also notice that the bars shorten very slowly after, say, 250 at-bats; even after 700 at-bats (roughly a full season), the bar is still about 0.06 tall, roughly between 0.385 and 0.459. This shows why BA is not as definitive as usually thought. Looking up the 2005 batting statistics, one finds that Derrek Lee, the top hitter, hit 0.335. With a bar about 0.06 tall, that is 0.03 on either side, so his true batting average is roughly between 0.305 and 0.365. There were 20 other hitters who hit at least 0.305.

Further, because the 2005 league BA was 0.264, the same reasoning implies that any player with a BA between 0.234 and 0.294 may be a league-average hitter. Looking up the statistics, one finds that this range includes hitters ranked 37 through 150 (which is the end of the list).
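The arithmetic behind both ranges is just the season-end bar centered on an observed average. A small sketch, assuming the same 0.06 bar applies to every hitter (a simplification, since players differ in at-bats); `HALF_WIDTH`, `plausible_range`, and `could_be` are hypothetical names, and the 0.290 hitter is a made-up example rather than a real player:

```python
# Half of the roughly 0.06-tall bar at about 700 at-bats.
HALF_WIDTH = 0.030
LEAGUE_BA_2005 = 0.264  # 2005 league batting average quoted above


def plausible_range(observed_ba, half_width=HALF_WIDTH):
    """Range of true averages consistent with an observed season BA."""
    return (round(observed_ba - half_width, 3),
            round(observed_ba + half_width, 3))


def could_be(observed_ba, true_ba, half_width=HALF_WIDTH):
    """True if `true_ba` falls inside the plausible range around `observed_ba`."""
    low, high = plausible_range(observed_ba, half_width)
    return low <= true_ba <= high


print(plausible_range(0.335))           # range around the top hitter's 0.335
print(plausible_range(LEAGUE_BA_2005))  # range around the league average
print(could_be(0.290, LEAGUE_BA_2005))  # hypothetical 0.290 hitter
print(could_be(0.335, LEAGUE_BA_2005))  # the top hitter
```

With these inputs, a hypothetical 0.290 hitter cannot be distinguished from a league-average one, while the 0.335 hitter can.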

More to come...

Reference: Albert and Bennett, Curve Ball, pp. 67-68