Jan 27, 2006

Review: Curve Ball

A kind reader sent me a Christmas gift, which accompanied me on my vacation.  The book is Curve Ball by Jim Albert and Jay Bennett, and I'm completely fascinated by it.  It presents a statistical perspective on baseball data, a soothing antidote to the nonsense spouted by the typical sportscaster.  Even more impressively, the book is liberally sprinkled with charts, and these charts are generally of a very high standard.

Their first feat is to debunk the myth of the batting average, or BA (hits divided by at-bats).  They accomplish this using an innovative chart.
Each vertical bar is a range of estimates of the batter's BA after he has a given number of at-bats.  The bars get shorter as the number of at-bats increases because over the course of the season, we can be more and more certain of the batter's true hitting ability.

Notice that the bar is very tall in the first 100 at-bats, roughly ranging from 0.35 to 0.50.  This illustrates why statisticians love large samples: without enough data, any estimate is highly unreliable.

Also notice that the bars shorten very slowly after, say, 250 at-bats; even after 700 at-bats (roughly a full season), the bar is still about 0.06 tall, roughly between 0.385 and 0.459.  This shows why BA is not as definitive as usually thought.  Looking up 2005 batting statistics, one finds that Derrek Lee, the top hitter, hit 0.335.  Allowing 0.03 on either side, his true batting average is roughly between 0.305 and 0.365.  There were 20 other hitters who hit at least 0.305.

Further, because the 2005 league BA was 0.264, any player with BA between 0.234 and 0.294 may be a league-average hitter.  Looking up the statistics, one finds that this range includes hitters ranked 37 through 150 (which is the end of the list).
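
For readers who want to play with the numbers, here is a minimal sketch of why the bars shrink so slowly.  It assumes each at-bat is an independent trial with a fixed chance of a hit (a binomial model) and uses a normal approximation with roughly 90% coverage; that is an assumption made for illustration, not necessarily how Albert and Bennett built their chart.  The function name `ba_interval` and the 0.300 hitter are hypothetical.

```python
import math

def ba_interval(ba, at_bats, z=1.645):
    """Approximate range for a batter's true average.

    Assumption: at-bats are independent trials with a fixed hit
    probability, and the normal approximation applies.
    z=1.645 gives roughly 90% coverage (also an assumption).
    """
    half_width = z * math.sqrt(ba * (1.0 - ba) / at_bats)
    return ba - half_width, ba + half_width

# A hypothetical 0.300 hitter at different points in the season.
for n in (50, 100, 250, 500, 700):
    low, high = ba_interval(0.300, n)
    print(f"{n:4d} at-bats: {low:.3f} to {high:.3f} (width {high - low:.3f})")
```

Under these assumptions the range at 700 at-bats is still a little under 0.06 wide, in line with the bar height read off the chart.  Because the width falls with the square root of the number of at-bats, quadrupling the at-bats only halves it.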

More to come...


Reference: Albert and Bennett, Curve Ball, pp. 67-8

Jan 24, 2006

Concordance, or tag clouds

I noticed that Amazon has adopted the tag cloud metaphor in its newest feature known as "concordance".  Clicking on concordance gives you a list of the top 100 most frequently occurring words in the book; mousing over each word provides the exact number of mentions in the book; clicking on the word brings up pages on which the word is mentioned.
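
The counting behind such a list is simple enough to sketch.  The snippet below is an illustration of the idea only, not Amazon's implementation: the function name `concordance`, the crude letters-only definition of a word, and the sample sentence are all assumptions.

```python
import re
from collections import Counter

def concordance(text, top_n=100):
    """Return the top_n most frequent words with their counts.

    Assumption: a "word" is any run of letters, compared in lower case.
    A real concordance would likely also drop common words like "the".
    """
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(top_n)

# Hypothetical sample text standing in for the contents of a book.
sample = (
    "Suppose the random variable has a distribution. "
    "The distribution of a random sample is random."
)
for word, count in concordance(sample, top_n=5):
    print(f"{word}: {count}")
```

A tag cloud then only needs to map each count to a font size.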

They are using the simple and elegant presentation that I praised here, the same as Flickr.  Beautiful as it is, it took me a little while to come up with a use case for this feature.  But I did!

Imagine someone wanting to buy a book on probability for self-study.  It is a cardinal rule of book publishing that every textbook must be labelled "introduction" or "elementary", regardless of content.  But Amazon's concordance is here to help.  Here are four books of increasing difficulty (Aczel's Chance, Ross' Probability Models, Resnick's Probability Path and Dudley's Real Analysis and Probability):
[Image: the four books]
Looking at the tag clouds, one can roughly judge the level of sophistication of these books.  Below I present them in mixed order.
[Image: the four concordance tag clouds, labelled [1] to [4]]

[1] appears to be an elementary book that emphasizes the key concepts ("probability", "random", "distribution", "independent"); "customers" is the most interesting word, suggesting that it is perhaps an applied book.  [2] is pitched at an even more novice level, as we don't find words like "suppose", "system" and "function" that showed up in [1].  Words like "martingale", "sequence" and "convergence" give away [3] as reaching another level of sophistication.  I should click on "oc" and "oo" to find out what these mean.  [4] is the only book on probability where "probability" is not in the top 10; it is evidently entirely theoretical with oodles of measure theory.  (So, [1] Ross [2] Aczel [3] Resnick [4] Dudley.)

How else have you used this concordance feature?  Let us know!

Aug 20, 2005

How representative is your sample?

Taking a hint from Mahalanobis, I dug into Howard Wainer's other book (Visual Revelations) to find the following gem.  Imagine you're an engineer working for the military.  You have the ingenious idea to inspect planes that returned home and plot the pattern of bullet holes.  The dark regions had a high density of bullet holes.  Your task is to recommend where to put extra armour on the new planes.  What would you recommend?  (Note: the answer appears after the graphic!)
 

[Image: pattern of bullet holes on the returning planes]

Howard credited Abraham Wald for his counter-intuitive insight.  We should put extra armour in the white regions, not the dark regions.  The inference is that the planes that got shot in the dark regions managed to return to the base while others got hit presumably in the white regions and never returned.

What has this to do with sampling?  If we forget about the planes that never came back, we may jump to the conclusion that we should reinforce the dark regions.  The sample we didn't see is as important as the sample we observed.  To wit:

[Image: the planes that never returned]

Statisticians call this "survivorship bias".  We only observe survivors but we must not forget about the non-survivors!
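
A toy simulation makes the point concrete.  Everything in it is made up for illustration: the four regions, the chance that a hit in each region brings the plane down, and the rule that each plane takes exactly one hit placed uniformly at random.  It is a sketch of the bias, not a reconstruction of Wald's analysis.

```python
import random

REGIONS = ["wings", "fuselage", "tail", "engine"]
# Hypothetical chance that a single hit in each region downs the plane.
LOSS_PROB = {"wings": 0.05, "fuselage": 0.05, "tail": 0.10, "engine": 0.60}

def simulate(n_planes=10000, seed=0):
    """Count hits per region for all planes and for survivors only."""
    rng = random.Random(seed)
    all_hits = {r: 0 for r in REGIONS}
    survivor_hits = {r: 0 for r in REGIONS}
    for _ in range(n_planes):
        region = rng.choice(REGIONS)  # one hit per plane, uniformly placed
        all_hits[region] += 1
        if rng.random() >= LOSS_PROB[region]:  # the plane makes it home
            survivor_hits[region] += 1
    return all_hits, survivor_hits

def shares(counts):
    """Convert raw counts into each region's share of the total."""
    total = sum(counts.values())
    return {r: counts[r] / total for r in counts}

all_hits, survivor_hits = simulate()
true_share = shares(all_hits)
seen_share = shares(survivor_hits)
for r in REGIONS:
    print(f"{r:9s} all planes: {true_share[r]:.2f}   survivors: {seen_share[r]:.2f}")
```

Every region is hit about equally often, yet the engine looks nearly untouched among the planes we get to inspect, precisely because hits there are the ones that keep planes from coming home.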

A related page I found on the Web: Steve Simon
