Mar 28, 2008

Two books

Nathan from FlowingData has announced a competition to win Tufte's classic book on the visual representation of data.  There are still a few days left to participate.  While Tufte's more recent books have become repetitive, he has still published one of the most accessible books on this topic.

I also had the pleasure of reading Naomi Robbins' Creating More Effective Graphs.  She adopts a cookbook format, providing hints on graphs in one, two and more dimensions, scales, visual clarity and so on.  Since she has already read Cleveland, Tufte, etc., she manages to put all that learning inside one cover.  The page design - with half of every page blank - is refreshingly easy on the eyes.  The inclusion of examples is generous.

Let's review her views on some of the topics we discuss frequently on Junk Charts:

Starting axis at zero: she thinks "all bar charts must include zero.  However, the answer is not as clear for line charts or other charts for which we judge positions along a common scale." (p.240)

Jittering: she does not provide a clear guideline but gave an example of a strip chart with jittered dots, commenting that "it gives a much better indication of the distributions than would a plot without jittering" (p.85) so I infer that she's generally in favor.

Parallel coordinates plot / profile plot: she provides an example of such a plot on p.141 and describes how to read such a plot.  Again, I infer she's in favor.

Nov 25, 2007

A dangerous equation

Graduation rates at 47 new small public high schools that have opened since 2002 are substantially higher than the citywide average, an indication that the Bloomberg administration’s decision to break up many large failing high schools has achieved some early success.

Most of the schools have made considerable advances over the low-performing large high schools they replaced. Eight schools out of the 47 small schools graduated more than 90 percent of their students.

This graphic, included in the NYT article, lent support to the "small schools movement".  In particular, note the last sentence of the above quotation: it incorporates the oft-used device of subgroup support of a hypothesis, in this case, the subgroup of eight top-performing schools.

Such analysis is "dangerous", according to Howard Wainer, who discusses this and other examples of misapplication in a recent article in American Scientist, entitled "The Most Dangerous Equation".  He alleged that billions have been wasted in the pursuit of small schools.

The issue concerns sample size.  Dr. Wainer and associates analyzed math scores from Pennsylvania public schools.  Average scores for smaller schools are based on a smaller number of students, and are therefore less stable (more variable).  More variability means more extremes.  Thus, by chance alone, we expect to find more smaller schools among the top performers.  Similarly, by chance alone, we also expect to find more smaller schools among the worst performers.

The scatter plot lays out their argument. Focusing only on the top performers (blue dots), one might conclude that smaller schools do better.  However, when the bottom performers (green) are also considered, the story no longer holds.  Indeed, the regression line is essentially flat, indicating that scores are not correlated with school size.

This is all nicely explained via the standard error formula (De Moivre's equation) in Dr. Wainer's article.  Here is a NYT article from the mid 1990s describing this same phenomenon.

File this as another comparability problem.  Because estimates based on smaller samples are less reliable, one must take extra care when comparing small samples to large samples.
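The sample-size effect is easy to reproduce. De Moivre's equation says the standard error of an average is the standard deviation of the individual scores divided by the square root of the sample size. The sketch below is a simulation under made-up assumptions, not Dr. Wainer's data: every school has the same true mean score, and the school sizes (25 and 400 students), the score scale and the function name are all hypothetical.

```python
import random
import statistics

def simulate_school_means(sizes, true_mean=50.0, student_sd=10.0, seed=0):
    """Return (size, observed mean score) per school.

    ASSUMPTION: every school has the same true mean, so any difference
    between observed means is due to chance alone.
    """
    rng = random.Random(seed)
    results = []
    for n in sizes:
        scores = [rng.gauss(true_mean, student_sd) for _ in range(n)]
        results.append((n, statistics.fmean(scores)))
    return results

# Hypothetical mix: 500 small schools (25 students), 500 large ones (400 students).
sizes = [25] * 500 + [400] * 500
schools = simulate_school_means(sizes)

# Rank schools by observed mean and look at both extremes.
ranked = sorted(schools, key=lambda s: s[1])
bottom50 = ranked[:50]
top50 = ranked[-50:]

small_in_top = sum(1 for n, _ in top50 if n == 25)
small_in_bottom = sum(1 for n, _ in bottom50 if n == 25)

print("small schools among top 50:", small_in_top)
print("small schools among bottom 50:", small_in_bottom)
```

With these settings the standard error is 10/5 = 2 points for a small school and 10/20 = 0.5 points for a large one, so small schools should crowd both ends of the ranking even though no school is truly better than any other.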

Dr. Wainer is publishing a new book next year, called "The Second Watch: navigating the uncertain world".  I'm eagerly looking forward to it.  His previous books Graphic Discovery and Visual Revelations are both part of the Junk Charts collection.

Sources: "The Most Dangerous Equation", American Scientist, November 2007; "Small Schools Are Ahead in Graduation", New York Times, June 30 2007.


P.S. Referring back to the NYT chart above, one might wonder at the impossible feat of raising graduation rates across the board simply by breaking up large schools into smaller ones.  This topic was taken up here, here and here.  When evaluating the "small schools" policy, it is a mistake to discuss only the performance of small schools; any responsible analysis must look at improvement over all schools.  Otherwise, it's a simple matter of letting small schools skim off the cream from larger schools.

 

Oct 30, 2007

Super Crunchers

Here's something different, a mini book review of Ian Ayres's "Super Crunchers".  This book can be recommended to anyone interested in what statisticians and data analysts do for a living.  Ian is to be congratulated for making an abstruse subject lively.

His main thesis is that data analysis beats intuition and expertise in many decision-making processes; and therefore it is important for everyone to have a basic notion of the two powerful tools of regression and randomization.  He correctly points out that the ready availability of large amounts of data in recent times has empowered data analysts.

Regression is a statistical workhorse often used for prediction based on historical data.  Randomization refers to assigning subjects at random to multiple groups, and then examining if differential treatment by group leads to differential response.  (In particular, the chapter on randomization covers the topic well.)  Using regression to analyze data collected from randomized experiments allows one to establish cause-effect. 

In the following, I offer a second helping for those who have tasted Ian's first course:

  • Randomized experiments represent an ideal and are not typically possible, especially in social science settings.  (Think about assigning a group of patients at random to be "cigarette smokers".)  When these are not possible, regression uncovers only correlations, and does not say anything about causation.
  • Most large data sets amenable to "super crunching" (e.g. public records, web logs, sales transactions) are not collected from randomized experiments.
  • Regression is only one tool in the toolbox.  It is fair to say that most "data miners" prefer other techniques such as classification trees, cluster analysis, neural networks, support vector machines and association rules.  Regression has the strongest theoretical underpinning but some of the others are catching up.  (Ian did describe neural networks in a later chapter.  It must be said that many forms of neural networks have been shown to be equivalent to more sophisticated forms of regression.)
  • If used on large data sets with hundreds or thousands of predictors, regression must be used with great care, and regression weights (coefficients) interpreted with even more care.  The size of the data may even overwhelm the computation.  Particularly when the data was collected casually, as in most super crunching applications, the predictors may be highly correlated with each other, causing many problems.
  • One of the biggest challenges of data mining is to design new methods that can process huge amounts of data quickly, deal with much missing or irrelevant data, deal with new types of data such as text strings, uncover and correct for hidden biases, and produce accurate predictions consistently.

Mar 12, 2007

Lines of death

I've been reading my friend's anti-smoking tome, and traced this "infographic" back to its source (World Health Organization). 

I was very intrigued by the "lines of death" which seemed to make the point that the risk of death had a spatial correlation: specifically, that the death risk for male smokers was higher in the northern hemisphere (above the line), primarily developed countries, as compared to the southern hemisphere, mostly developing nations.

I found that somewhat counter-intuitive, but in a fascinating book like this, one that brings together scientific, psychological and societal commentary, I was expecting to learn new things.

Looking at the legend, the red areas were regions in which deaths from tobacco use accounted for over 25% of "total deaths among men and women over 35".  This explained some, as perhaps there were more reasons to die (warfare, other diseases, mine accidents, etc.) in developing nations than in developed nations, or that they had larger populations (so more deaths even at lower rates).

However, the description of the "lines of death" raised my eyebrows.  It is now claimed that more than 25% of middle-aged people (35-69 years old) die from tobacco use in the red regions.

Did they mean 25% of the dead middle-aged people die from smoking?  Or 25% of all middle-aged folks die from smoking?  A gigantic difference!

Percentages are very tricky things to use.  Every time I see a percentage, the first thing I ask is what is the base population.  Here, the baseline appeared to have gotten lost in translation.
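To see how far apart the two readings can be, here is a toy calculation. The 25% figure is from the map; the share of middle-aged people who die in that age band is a made-up number chosen only for illustration, as is the function name.

```python
def share_of_all_people(share_of_deaths, death_rate):
    """Convert 'share of deaths due to a cause' into 'share of all people
    who die from that cause', given the overall death rate in the group."""
    return share_of_deaths * death_rate

tobacco_share_of_deaths = 0.25   # from the map legend
middle_age_death_rate = 0.30     # ASSUMPTION: hypothetical, not a WHO figure

print(share_of_all_people(tobacco_share_of_deaths, middle_age_death_rate))
```

Under that assumption, "25% of the dead" would translate to 7.5% of all middle-aged people, less than a third of what the second reading claims.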

This set of maps also shows the peril of focusing too much on  entertainment value, and losing the plot. 

For those concerned about the effect of smoking on our society and our children, I highly recommend Dr. Rabinoff's highly readable new book, "Ending the tobacco holocaust".  It contains lots of interesting tidbits and really brings together every cogent argument that exists, including the common ones you've heard and others you haven't.

Reference: "Ending the tobacco holocaust" by Michael Rabinoff; The Tobacco Atlas by the World Health Organization

Dec 01, 2006

Smoking-Screening


Behind the smokescreen lies the informative conclusion: among households with smokers, about 40% smoke in residence all the time while about half never smoke in residence.

This graphic, unfortunately chosen, contains many distractions from the main message, including:

  • the liberal sprinkling of colors
  • the inclusion of data for 1, 2, 3, 4, 5, 6 days, almost all of which were effectively zero
  • the redundant vertical scale, as all the data already appeared on the chart itself
  • the comparison of smokers to "total sample" (rather than non-smokers)
     

The last point merits special attention.  The total sample contains households with smokers as well as households without smokers. Any data from the total sample is a weighted average of these two types of households.  It is better to directly compare the two household types than to indirectly compare one type to the overall.

Further, households without smokers should be extremely likely to have no smoking in residence all week.  And if most households have no smokers (76% of this sample), then the statistics of the total sample will mimic those of no-smoker households. That is to say, the total sample statistics do not add much to the analysis.  Our junkart version below corrects for this as well as other things.
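A quick sketch of the weighted-average arithmetic makes the point concrete. The 76% share and the "about half" figure come from the survey as quoted above; the rate for no-smoker households is an illustrative guess, not a number from the report, and the function and variable names are hypothetical.

```python
def blended_rate(weight_a, rate_a, rate_b):
    """Weighted average of two group rates; group A carries weight_a."""
    return weight_a * rate_a + (1.0 - weight_a) * rate_b

share_no_smoker = 0.76   # share of households without smokers, from the survey
never_smoker_hh = 0.50   # "about half" of smoker households never smoke in residence
never_no_smoker = 0.98   # ASSUMPTION: illustrative value, not taken from the report

never_total = blended_rate(share_no_smoker, never_no_smoker, never_smoker_hh)

direct_gap = never_no_smoker - never_smoker_hh   # smoker vs. no-smoker households
indirect_gap = never_total - never_smoker_hh     # smoker households vs. total sample

print(round(never_total, 4), round(direct_gap, 4), round(indirect_gap, 4))
```

Whatever value is assumed, the gap against the total sample is always the direct gap multiplied by the no-smoker share (here 0.76), so the indirect comparison understates the difference between the two household types.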

One of the key functions of a graph is data reduction, i.e. to aggregate data in such a way as to expose the information contained within.  Typically, a graph that uses aggregated data is clearer and stronger than one that plots every piece of data.  In this example, by combining 1-6 days into a single category ("smokes in residence part of the week"), we have a graph that is much more readable.

I want to thank Dr. Mike Rabinoff for inspiring me to look up these second-hand smoking statistics.  Mike recently published a book called "Ending the Tobacco Holocaust", which tells you more than you want to know about the tobacco industry.


Reference: "Second Hand Smoke Survey: Final Report", Madison Department of Public Health, Dec 2003.

Aug 01, 2006

Statistical literacy

I finally got around to reading "When Genius Failed", Roger Lowenstein's account of the spectacular collapse of LTCM, the hedge fund fronted by Scholes and Merton, Nobel laureates both.

It is a sobering read for anyone in the business of statistical prediction and modeling for sure.

What also caught my eye, and caused dismay, is how Lowenstein got basic statistical principles wrong in the book.  He used the bully pulpit to sound the usual alarm against the normality assumption and in favor of fat tails.  He began by confusing the law of large numbers (LLN) with the central limit theorem (CLT):

Statisticians have long been aware of the "law of large numbers".  Roughly speaking, if you have enough samples of a random event, they will tend to distribute in the familiar bell curve ...

In the same breath, he then equated two different probability distributions:

This is called the normal distribution, or in mathematical terms, the lognormal distribution.

Doesn't this say something about the state of statistical literacy?
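For the record, the two theorems say different things: the LLN says a sample average converges to the true mean, while the CLT says the distribution of sample averages is approximately bell-shaped.  (And a lognormal variable is one whose logarithm is normal, so the two distributions are not the same thing.)  Here is a minimal simulation sketch of the LLN/CLT distinction; the exponential distribution, the sample sizes and the helper names are my own choices for illustration.

```python
import random
import statistics

rng = random.Random(1)

def sample_mean(n):
    """Mean of n draws from a skewed distribution (exponential with mean 1)."""
    return statistics.fmean([rng.expovariate(1.0) for _ in range(n)])

def skewness(xs):
    """Population skewness: about 0 for a bell curve, 2 for an exponential."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean([((x - m) / s) ** 3 for x in xs])

# LLN: a single sample mean settles near the true mean (1.0) as n grows.
big_mean = sample_mean(100_000)

# CLT: the means of many modest-sized samples pile up in a roughly
# symmetric bell shape, even though the raw draws are strongly skewed.
raw = [rng.expovariate(1.0) for _ in range(2000)]
means = [sample_mean(50) for _ in range(2000)]

print("mean of 100,000 draws:", round(big_mean, 3))
print("skewness of raw draws:", round(skewness(raw), 2))
print("skewness of sample means:", round(skewness(means), 2))
```

Note that the raw draws never become bell-shaped no matter how many there are; it is the averages that do.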


PS. Here is a link to Dunbar's "Inventing Money" (thanks Marc).  It apparently came out before Lowenstein but didn't get as much press. 

Feb 09, 2006

Review: Curve Ball 3

Just want to highlight one more graphic from Curve Ball, one which I consider the most innovative, highly effective and powerful.  Without much ado:
This is one of those charts that paints a vivid story.  Any fan can mentally re-trace the baseball game by reading this chart, without having seen the game itself.  The horizontal axis traces the 9 innings of a baseball game while the vertical axis plots the probability of Toronto winning the game.  This probability is updated over the course of the game as we read from left to right.  (For those asking, this plots Game 6 of the 1993 World Series.)

To quote the authors:

" We see that Toronto's probability of winning rose from the start as they prevented the Phillies from scoring in the first inning.  This trend continued as the [Toronto] Blue Jays scored three times in the first inning... The low point in the [fifth] inning for Toronto occurred just after [Phillie] John Kruk walked to load the bases.  [Phillie Dave] Hollin's big out is shown by the rise ... in Toronto's victory probability from this low point ... in the seventh the Phillies turned the tables ... scoring five runs to take the lead.  The plot of Toronto's probability of winning looks like the Dow Jones Industrial Average in free-fall.  Toronto did not score in its half of the seventh, pushing its probability of winning even further down. ... the plot rises (and the plot thickens) in the eighth inning as a result of a threat with bases loaded and two outs.  In the ninth inning, the Phillies went down quickly.  Toronto came out storming, ... quickly putting runners on base.  The triumphant ... impact of [Toronto's Joe] Carter's home run is evident in the steep rise in the final markings of the plot."

This chart belongs to the same class as the Bumps chart.  In a previous post, I traced how one can re-imagine the Bumps race just by tracing the plot from left to right.

Reference: "Curve Ball", Albert & Bennett.

Feb 03, 2006

Narratives that create questions

Narrative charts, like this one shown on the right, are particularly difficult to master.  The temptation is strong on the part of the designer to mislead by inclusion/exclusion and on the part of the reader to misjudge by reading between the lines.

The accompanying article made the point that James Frey's fraudulent memoir experienced a severe drop in sales since Oprah "sacrificed" him.  Unfortunately, what the chart shows is how choppy sales have been for this book.  It experienced a similar drop after his first appearance on Oprah.

In fact, this chart raised a host of unanswered questions:

  • How to explain the end-of-year rise in sales back to post-first-Oprah-appearance level?
  • What is the purpose of the red line?  Is it fair to compare a hardcover with a paperback?
  • How to explain the "turbulence" of Frey's memoir sales even before the scandal broke?  Why is it that his other book has a much smoother sales trend?  (Is it merely a scale effect hiding its ups and downs?)
  • Oprah's influence appeared to be highly specific to the recommended book and time of the show.  There seemed to be zero impact on the sales of other books by the same author (scale effect?) and a rapidly diminishing impact even for the highlighted book.  Is this a general phenomenon?

This type of time-series chart does not provide direct evidence of cause and effect; however, it is commonly used in the media for that dubious purpose.  At best, we can conclude that those factors are correlated, prompting hypotheses and further analyses.

Reference: "James Frey's Falsehoods Improved His Tale", New York Times, Feb 1 2006.

Feb 01, 2006

Review: Curve Ball 2

Continuing the book review.  The reader who sent me the book noted that the authors used a similar technique to the one I used to study whether suicide spots on the Golden Gate Bridge were random (see here, here, here, here and here).  They used it to study whether Todd Zeile was really a "streaky" hitter or not.

In the following set of charts, Zeile's batting average in the first half of a season was compared against eight fictional hitters.  These fictional hitters were simulated to have two hitting "states" (hot, cold).  At each fictitious game, the hitters were assigned one of the two states with some probability.  The authors asked whether Zeile's batting pattern was similar to those of streaky hitters.  (Conventional sportscaster wisdom says he is streaky.)


Here, we want to know whether the graphs are similar; in my graphs of suicide locations, I asked whether the actual data is different from random.  In general, I find it easier to see differences than similarities.  In both cases, it is not sufficient to visually inspect these charts.  We must use some tests (possibly statistical tests) to help confirm our intuition.  The authors picked these metrics: Max - Min, Number of long streaks (8 or longer), Number of runs, Number of 0-hit games and Number of 3+ hit games.  (A "run" is a string of consecutive 0-hit games or consecutive games with at least 1 hit.)  These are all measures of dispersion or extreme values.
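For readers who want to try this at home, here is a rough sketch of the idea: simulate a two-state (hot/cold) hitter and compute a few of the metrics listed above.  This is not the authors' code.  The hit probabilities, the four at-bats per game, and the rule for switching states (stay in the current state with probability `stay_prob`) are all my assumptions.

```python
import random
from itertools import groupby

def simulate_hits(n_games, at_bats=4, avg_hot=0.35, avg_cold=0.20,
                  stay_prob=0.9, seed=0):
    """Hits per game for a fictional two-state hitter (all parameters assumed)."""
    rng = random.Random(seed)
    hot = rng.random() < 0.5
    hits = []
    for _ in range(n_games):
        if rng.random() >= stay_prob:   # switch state with prob. 1 - stay_prob
            hot = not hot
        p = avg_hot if hot else avg_cold
        hits.append(sum(rng.random() < p for _ in range(at_bats)))
    return hits

def run_lengths(hits):
    """Lengths of runs: consecutive 0-hit games or consecutive games with >= 1 hit."""
    return [len(list(group)) for _, group in groupby(hits, key=lambda h: h > 0)]

def summarize(hits, long_run=8):
    """A few of the streakiness metrics named in the text."""
    runs = run_lengths(hits)
    return {
        "runs": len(runs),
        "long_streaks": sum(1 for r in runs if r >= long_run),
        "zero_hit_games": sum(1 for h in hits if h == 0),
        "three_plus_hit_games": sum(1 for h in hits if h >= 3),
    }

# Tiny hand-checkable example: runs are [0,0], [1,2], [0], [3,1,1].
example = [0, 0, 1, 2, 0, 3, 1, 1]
print(summarize(example, long_run=3))
# {'runs': 4, 'long_streaks': 1, 'zero_hit_games': 3, 'three_plus_hit_games': 1}

# One simulated half-season for a fictional streaky hitter.
print(summarize(simulate_hits(80)))
```

Running `simulate_hits` with many different seeds gives a reference distribution for each metric, against which a real hitter's numbers could then be compared.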

My first review can be found here.


Jan 27, 2006

Review: Curve Ball

A kind reader sent me a Christmas gift, which accompanied me on my vacation.  The book is Curve Ball by Jim Albert and Jay Bennett, and I'm completely fascinated by it.  It presents a statistical perspective on baseball data, a soothing antidote to the nonsense spouted by the typical sportscaster.  Even more impressively, the book is liberally sprinkled with charts, and these charts are generally of a very high standard.

Their first feat was to debunk the myth of the batting average BA (hits divided by at-bats).  They accomplished this using this innovative chart.  Each vertical bar is a range of estimates of the batter's BA after he has had a given number of at-bats.  The bars get shorter as the number of at-bats increases because over the course of the season, we can be more and more certain of the batter's true hitting ability.

Notice that the bar is very tall in the first 100 at-bats, roughly ranging from 0.35 to 0.50.  This illustrates why statisticians love data quantity: without sufficient samples, any estimation is highly unreliable.

Also notice that the rate of shortening is very slow after, say, 250 at-bats, and after 700 at-bats (roughly a full season), the bar is still about 0.06 tall, roughly between 0.385 and 0.459.  This shows why BA is not as definitive as usually thought.  Looking up 2005 batting statistics, one finds that Derrek Lee, the top hitter, hit 0.335.  This means his true batting average is roughly between 0.305 and 0.365.  There were 20 other hitters who hit at least 0.305.

Further, because the 2005 league BA was 0.264, any player with BA between 0.234 and 0.294 may be a league-average hitter.  Looking up the statistics, one finds that this range includes hitters ranked 37 through 150 (which is the end of the list).
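A back-of-envelope version of these ranges can be had from the normal approximation to the binomial.  This is my own sketch, not necessarily the method used in the book: the multiplier z = 1.645 (roughly 90% coverage) and the 600 at-bats are assumptions, chosen because they give a half-width close to the 0.03 used above.

```python
import math

def ba_interval(avg, at_bats, z=1.645):
    """Approximate range for a hitter's true average.

    ASSUMPTION: each at-bat is an independent trial with the same hit
    probability, and z = 1.645 gives roughly 90% coverage.
    """
    half_width = z * math.sqrt(avg * (1.0 - avg) / at_bats)
    return avg - half_width, avg + half_width

low, high = ba_interval(0.335, 600)   # 600 at-bats is a hypothetical round number
print(round(low, 3), round(high, 3))

# The range narrows only with the square root of the number of at-bats.
for n in (100, 250, 700):
    lo_n, hi_n = ba_interval(0.335, n)
    print(n, round(hi_n - lo_n, 3))
```

The square root is why the bars shorten so slowly late in the season: going from 250 to 700 at-bats nearly triples the data but shrinks the range by only about 40%.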

More to come...


Reference: Albert and Bennett, Curve Ball, pp. 67-8
