Oct 30, 2007

Super Crunchers

Supercrunchers

Here's something different, a mini book review of Ian Ayres's "Super Crunchers".  This book can be recommended to anyone interested in what statisticians and data analysts do for a living.  Ian is to be congratulated for making an abstruse subject lively.

His main thesis is that data analysis beats intuition and expertise in many decision-making processes; and therefore it is important for everyone to have a basic notion of the two powerful tools of regression and randomization.
He correctly points out that the ready availability of large amounts of data in recent times has empowered data analysts.

Regression is a statistical workhorse often used for prediction based on historical data.  Randomization refers to assigning subjects at random to multiple groups, and then examining if differential treatment by group leads to differential response.  (In particular, the chapter on randomization covers the topic well.)  Using regression to analyze data collected from randomized experiments allows one to establish cause-effect. 

In the following, I offer a second helping for those who have tasted Ian's first course:

  • Randomized experiments represent an ideal and are not typically possible, especially in social science settings.  (Think about assigning a group of patients at random to be "cigarette smokers".)  When these are not possible, regression uncovers only correlations, and does not say anything about causation.
  • Most large data sets amenable to "super crunching" (e.g. public records, web logs, sales transactions) are not collected from randomized experiments.
  • Regression is only one tool in the toolbox.  It is fair to say that most "data miners" prefer other techniques such as classification trees, cluster analysis, neural networks, support vector machines and association rules.  Regression has the strongest theoretical underpinning but some of the others are catching up.  (Ian did describe neural networks in a later chapter.  It must be said that many forms of neural networks have been shown to be equivalent to more sophisticated forms of regression.)
  • If used on large data sets with hundreds or thousands of predictors, regression must be used with great care, and regression weights (coefficients) interpreted with even more care.  The size of the data may even overwhelm the computation.  Particularly when the data was collected casually, as in most super crunching applications, the predictors may be highly correlated with each other, causing many problems.
  • One of the biggest challenges of data mining is to design new methods that can process huge amounts of data quickly, deal with much missing or irrelevant data, deal with new types of data such as text strings, uncover and correct for hidden biases, and produce accurate predictions consistently.
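To make the point about correlated predictors concrete, here is a minimal sketch using the variance inflation factor, a standard diagnostic that the book review itself does not mention.  With exactly two predictors it reduces to 1 / (1 - r²), where r is their correlation.  The data are simulated and every name in the code is made up for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(xs, ys):
    """Variance inflation factor for a regression with exactly two predictors."""
    r = pearson(xs, ys)
    return 1.0 / (1.0 - r * r)

rng = random.Random(0)  # fixed seed so the simulated data are reproducible
x1 = [rng.gauss(0, 1) for _ in range(500)]
x2_related = [a + rng.gauss(0, 0.1) for a in x1]      # nearly a copy of x1
x2_unrelated = [rng.gauss(0, 1) for _ in range(500)]  # independent of x1

print("VIF with a near-duplicate predictor:", round(vif_two_predictors(x1, x2_related), 1))
print("VIF with an unrelated predictor:", round(vif_two_predictors(x1, x2_unrelated), 2))
```

A factor near 1 means the coefficient is estimated about as precisely as if the predictors were unrelated; a large factor means its variance is inflated by that multiple, which is why the weights become hard to interpret.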

Sep 04, 2007

Read fast, pay the price

At first, this looks like a decent chart despite the donut construct, which I cannot stand (but the Economist loves).

Rockstars

The accompanying text proclaimed: "Rock stars are famous for excess, and some pay the price".  The rest of the paragraph points out drug- and alcohol-related deaths, plus deaths due to "unhealthy lifestyles", which apparently include cancer and cardiovascular disease.

There is a gaping hole between what's on the chart and what's in the text.  They just talk past each other.

  • The chart invites us to compare the European experience to the American experience. Each donut presents the proportion of total deaths by causes of death. The top donut presents American rock-star deaths, the bottom European ones. But this comparison has zilch to do with the key point, which is how rock stars are different from the rest of us.  The chart tells us nothing about the rest of us.  The 20% death by cancer would be entirely unremarkable if 20% of non-rock-star deaths also were attributed to cancer!
  • We must also bear in mind that the base populations are rock stars who died young. This is a very specific demographic segment, and so the only valid point of reference is people who died young.  If we think along those lines, then among unmusical people, if they died young, what might have been the causes of death?  Drugs? Alcohol?  Accidents?  Suicide?  You bet.  I am not sure who is the authoritative source of such data but the CDC reported that among Americans aged 15-34 who died, the leading causes were "unintentional injury", suicides, homicides, cancer and heart disease.  Not much different from the above list...
  • The deaths depicted in the two donuts totaled fewer than 100, and yet percentages are given to one decimal place.  This creates a false sense of precision not justified by the sample size.
  • The deaths occurred over about 50 years.  It is very likely that the causes of premature death have shifted during this time span, making an aggregate analysis questionable.
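The false-precision point can be checked with a little arithmetic.  The sketch below assumes a round 100 deaths (the chart has fewer) and, for the standard error, treats the deaths as if they were a random sample, which is an assumption of mine rather than anything the Economist claimed.

```python
import math

def one_death_step(n):
    """Percentage-point change caused by reclassifying a single death."""
    return 100.0 / n

def standard_error_pct(p, n):
    """Approximate standard error, in percentage points, of a proportion p seen in n cases."""
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

n = 100  # hypothetical round number; the post only says "fewer than 100"
print(one_death_step(n))                      # 1.0
print(round(standard_error_pct(0.20, n), 1))  # 4.0
```

Moving one death from one category to another shifts a slice by a full percentage point, and the sampling noise around a 20% slice is several points, so the digit after the decimal carries no information.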

Charting is much more than just aesthetics.  Some basic statistical common sense goes a long way.  This was observed long ago by Huff.

Source: "Rock stars: live fast, die young", Economist, Sept 4 2007.

Aug 12, 2007

Non-elites

From Mikhail Simkin comes some intriguing analysis of "experts"; in this line of research, experts are compared to the "general public" and often "proved" to be shenanigans. Stock pickers don't do better than apes; economists don't do better than Big Macs; you get the idea.  In a new twist, Simkin puts twelve images of modern art on his website, and asks visitors to distinguish between those by grand masters and those "ridiculous fakes" produced by him apparently on a computer.

Since conventional wisdom says elite universities provide better education, Simkin attempted to find out if there is a difference between "elites" and "the crowd" in their ability to recognize modern art. (Elites, to him, meant the Ivy League and Oxbridge.)  The following pair of histograms clinched his point:

we see that there is not much difference between the elite and the crowd.

Simkin_fakeart


Since the shapes of the histograms are similar, one might be inclined to agree with the statement.  That is, until one notes the wildly different scales, used because only 143 of the 56,020 quiz-takers could be identified as "elites".

The shapes are clarified if we use a relative scale (percentages) rather than absolute scale.  Further, the difference is more easily seen when cumulative percentages are plotted.  In other words, we are interested in comparing the proportion of respondents who score at least X points out of 12.
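For readers who want to try the rescaling themselves, here is a small sketch.  The counts are invented (and use a 0 to 4 scale to keep the lists short), since Simkin's raw data are not reproduced in this post; the sketch accumulates from the bottom, that is, the share scoring X or below.

```python
from itertools import accumulate

def to_percentages(counts):
    """Convert raw counts per score into percentages of the group total."""
    total = sum(counts)
    return [100.0 * c / total for c in counts]

def cumulative_percentages(counts):
    """Percentage of the group scoring at or below each score."""
    total = sum(counts)
    return [100.0 * running / total for running in accumulate(counts)]

# Hypothetical counts of respondents by score (0 to 4), not Simkin's data.
crowd = [100, 300, 400, 150, 50]   # 1,000 respondents
elite = [1, 2, 4, 2, 1]            # 10 respondents

print(to_percentages(crowd), to_percentages(elite))
print(cumulative_percentages(crowd))  # [10.0, 40.0, 80.0, 95.0, 100.0]
print(cumulative_percentages(elite))  # [10.0, 30.0, 70.0, 90.0, 100.0]
```

On the raw counts the two groups cannot share an axis; once both are expressed as percentages of their own totals, they can be compared directly.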

Redo_fakeart

Two features are worth noting:

  • A gap opens up between 4 and 7: specifically, 40% of "non-elites" scored 7 points or below while only 25% of "elites" scored 7 points or below.
  • The curves criss-cross around 11 to 12: this shows that "non-elites" were more likely to have perfect scores (although this difference is small).  Perhaps museum directors don't have .edu addresses.

Notice that I plotted Elite vs Non-Elite rather than Elite vs All Respondents.  While it seems innocuous to use "All Respondents", and in this case, there is no noticeable difference since Elites were a tiny proportion, when the test group accounts for a significant proportion of the total, the value for "All Respondents" will be influenced by that for the test group.  As a general rule, compare A to not A.
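A quick sketch of why this matters.  The 25% and 40% figures are the shares scoring 7 or below quoted above, and 143 out of 56,020 is the actual elite share; the 50% share is a hypothetical I added to show what happens when the test group is large.

```python
def overall_rate(rate_a, rate_not_a, share_a):
    """Rate for all respondents: a weighted blend of group A and everyone else."""
    return share_a * rate_a + (1.0 - share_a) * rate_not_a

rate_a, rate_not_a = 0.25, 0.40  # shares scoring 7 or below, from the post

for share_a in (143 / 56020, 0.5):  # actual elite share, then a hypothetical large one
    all_rate = overall_rate(rate_a, rate_not_a, share_a)
    print(f"share of A = {share_a:.4f}: "
          f"A vs not A gap = {rate_not_a - rate_a:.3f}, "
          f"A vs all gap = {all_rate - rate_a:.3f}")
```

When A is half of the total, comparing A to "All" shows only half of the true gap, because A is partly being compared with itself.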

Simkin's exercise raises many statistical issues of design, which we won't discuss here.

Source: "Properly Prescribed" (via RSS Significance)

Jul 18, 2007

Mid-week entertainment: dogma

Wsj_laff1

This chart from a Wall Street Journal editorial has been making the rounds lately, being ridiculed left and right.  A number of you have been leaving comments here so I'm putting it front and center as our light entertainment for the week.

The chart is being used to justify an economic concept called the "Laffer Curve", which claims that lowering tax rates can increase total tax receipts (for example, because fewer people will cheat the government).  As far as I know, it is dogma, and has never been proven empirically.

I also agree with Prof. Gelman's skepticism about using countries as experimental units to inform domestic policy.

Fire away!



Further reading:

Junk Chart readers

Economist's View
Tufte blog
Gelman blog


And more:

Cosmic Variance
Brad DeLong

Jul 12, 2007

More prevalent versus more likely

Aleks pointed to an interesting Business Week chart used to explain what people in different age groups are doing on-line.  This is a pretty chart that does an admirable job with a difficult data set.

Bw_onlinedata

The key to this chart, unfortunately missing, is that the percentages must be read as vertical columns to make sense.  So the top left square says 34% of "Young Teens" who answered the survey said they create web pages on-line.  In addition, the total of each column can be much more than 100% because multiple responses were allowed.

Realizing the above, we should interpret the bottom (grey) row as saying: "Older boomers" and "seniors" are more likely to be "Inactives" than younger people.  A tempting interpretation is: "Inactives" are more likely to be "seniors" and "older boomers".  But this is wrong because the chart hides the age distribution.  While 70% of "Seniors" are inactive, "Seniors" may represent a small proportion of the population, and thus they may not account for a large proportion of "Inactives".  This is the difference between prevalence and incidence rate.  (Another way to grasp this is to add the percentages across a row and try and fail to understand what the row sum could mean.)
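To see how the two readings can diverge, here is a minimal sketch.  Only the 70% inactive rate for "Seniors" comes from the chart; the population shares and the rate for everyone else are numbers I made up, precisely because the chart hides them.

```python
def share_of_inactives(pop_share, inactive_rate):
    """For each age group, the fraction of all inactives who belong to it."""
    joint = {g: pop_share[g] * inactive_rate[g] for g in pop_share}
    total = sum(joint.values())
    return {g: joint[g] / total for g in joint}

# Hypothetical age distribution (hidden by the chart) and inactive rates.
pop_share = {"younger": 0.90, "seniors": 0.10}
inactive_rate = {"younger": 0.30, "seniors": 0.70}  # 70% for seniors is from the chart

result = share_of_inactives(pop_share, inactive_rate)
for group, share in result.items():
    print(f"{group}: {share:.1%} of all inactives")
```

Under these made-up shares, seniors are far more likely than others to be inactive, yet they make up only about a fifth of the inactives.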

The construct of the square grids is less damaging than it seems.  In effect, the data has been rescaled by dividing by 10.  The reader is then forced to apply "rounding".  If you are someone who sees $19.95 as $19, then you'd round down the partial rows.  If you see $19.95 as $20, you'd round up the partial rows.  So the designer has pushed you to think in terms of whole numbers between 0 and 10, in other words, in units of 10%, rather than units of 1% or, horror of horrors, 0.1% or at some other unrealistic precision.

Here's another example where the profile chart shines.  Because the percentages don't sum up to 100%, the other alternatives like stacked bar charts and Marimekko/mosaic charts don't work.  (Prior discussion of this issue here.)

Redo_onlinedata

This version gives a column view of the data, the lines linking percentages of each age group performing on-line activities.  The profiles nicely cluster into three groups: the younger people are more likely to say they are "joiners", "spectators" or "creators" but less likely to be "inactives".  We also see that the likelihood of being "Collectors" has little to do with age.

Source: "Inside Innovation -- In Data", Business Week, June 11 2007.


Jun 26, 2007

Baby names and success

Wsj_babynames

While we speak of baby names, David F. nominates this set of 6 charts from WSJ.  Compare this with Wattenberg's names voyager, and the benefit of interactive graphics is immediately evident.

In David's words:

They show graphs of six different names, but the two on the bottom use a dramatically different scale (from 1st to ~20th, instead of from 1st to 1000th). The introductory text notes the difference, but it is still a shock.

We like the use of "small multiples" but their impact is compromised if we don't keep the background material constant so that readers can compare across charts.  By having different scales, the message was distorted: Mary has had a much larger drop than David, and it's easily missed in these charts.

Lines should take the place of areas which carry scant meaning in this context.

The use of blue and red is a nice touch but dovetailing the male and female charts strikes us as excessive fun.  It would have been clearer to give the sons and the daughters their own columns.

The article itself relates the anguish of modern parents in naming their babies.  Much of this angst can be traced to serious econometric studies that claim to have found cause-and-effect relationships between someone's name and their eventual success in life.  Some of this research was highlighted in Freakonomics, for example.  My stance is that all such studies are dubious, there being innumerable confounding factors (socio-economic, genetic, cultural, luck, etc.).  In addition, the measured response can range from "happiness" to income to many other metrics.  The danger of finding something because one looks hard enough is very real.  We don't currently have tools powerful enough to substantiate this sort of study.

Source: "The Baby-Name Business", Wall Street Journal, June 22, 2007

Jun 04, 2007

Airline bumps and bump charts

The Harvard Social Science Statistics blog pointed to an NYT article about revenue optimization in the airline industry.  Huge props to the Times for explaining the science (and art and politics) of one of the most successful applications of operations research.

In short, valuable business travellers want refundable tickets.  Because of this and other reasons, about 10% of booked tickets become no shows.  Airlines recoup the loss by over-booking.  Implicitly, they trade off the potential for dissatisfying a few unlucky passengers (who would be bumped from their flights) and the potential for flying with 10% empty seats (in addition to unsold seats).  Optimization algorithms (constantly tuned by entry-level staff) try to strike a balance.
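The trade-off can be sketched with a toy model.  This is my own simplification, not how any airline's system works: I assume a hypothetical 100-seat plane and that each booked passenger shows up independently with probability 0.9 (taken from the 10% no-show figure), so the number of shows is binomial.

```python
from math import comb

def overbooking_tradeoff(seats, bookings, show_prob):
    """Expected bumped passengers and empty seats under a binomial show-up model."""
    exp_bumped = 0.0
    exp_empty = 0.0
    for shows in range(bookings + 1):
        # Probability that exactly `shows` of the booked passengers turn up.
        prob = comb(bookings, shows) * show_prob**shows * (1 - show_prob)**(bookings - shows)
        exp_bumped += prob * max(shows - seats, 0)
        exp_empty += prob * max(seats - shows, 0)
    return exp_bumped, exp_empty

for bookings in (100, 105, 110, 115):
    bumped, empty = overbooking_tradeoff(seats=100, bookings=bookings, show_prob=0.9)
    print(f"{bookings} bookings: {bumped:.2f} bumped, {empty:.2f} empty seats on average")
```

Selling more tickets than seats drives the empty seats down and the bumps up; where to stop depends on what each costs, which is the part the real algorithms have to estimate.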

Recently, because the average percentage of seats sold has been going up, the room for such manoeuvring has been squeezed, leading to higher bump rates, and more travellers being stranded.  There is some variation across airlines due to the level of sophistication of their revenue optimization algorithms, corporate strategy, etc.

The following charts present data by airline of the bump rates in 2005 and 2006.  One would be interested in answering questions such as:

  • Which airlines have the best (or worst) bump rate?
  • Are some airlines consistently better (or worse) at controlling the bump rate?
  • Which airlines have improved (or worsened) from year to year?
  • Are the differences of practical significance?

Redo_airlinebumps

The original chart shown on the left does not reveal the answers readily.  My favourite bumps chart offers them up clearly (well, except on the question of significance).

The biggest problem, though, is the header: number of passengers per 10,000 bumped.  The data plotted appeared to be the reverse: the number of bumps per 10,000 passengers.  Otherwise, there would have been more bumped passengers than passengers!

Source: "Bumped Fliers and No Plan B", New York Times, May 30, 2007.

May 31, 2007

If we report it, it's a fact

David Leonhardt wrote in the NYT of a shocking incident of statistical abuse committed by Lou Dobbs and the CNN crew.

On several recent occasions, while commenting on the red-hot immigration issue, Lou and company remarked that "there had been 7,000 cases of leprosy in this country over the previous three years, far more than in the past".  (Leprosy is a flesh-eating disease prevalent among immigrants, particularly of Asian or Latin American origin.)

Nyt_leprosy

When asked about fact-checking, Lou reportedly said: "If we reported it, it's a fact."  A quick visit to the government's leprosy program web-site immediately reveals the time-series chart, shown on the left.  With annual case counts at about 150 in the last 5 years or so, one is hard pressed to find the 7,000 alleged cases!

Furthermore, because this chart lacks comparability, we fail to see that 150 cases out of a population of 300 million represent a minuscule risk.
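The arithmetic behind both complaints takes only a few lines.  All four inputs are the approximate figures quoted in this post; nothing else is assumed.

```python
annual_cases = 150          # approximate yearly count read off the chart
population = 300_000_000    # U.S. population figure used above
claimed_cases = 7_000       # number quoted on air
claimed_years = 3           # period the claim was said to cover

cases_per_million = annual_cases * 1_000_000 / population
expected_over_claimed_period = annual_cases * claimed_years
overstatement = claimed_cases / expected_over_claimed_period

print(cases_per_million)             # 0.5
print(expected_over_claimed_period)  # 450
print(round(overstatement, 1))       # 15.6
```

That is about one new case per two million people per year, and a claimed total roughly fifteen times what three years of the charted counts would produce.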

A slight downward trend is evident in the last 20 years or so; this record is even more impressive when we realize the population grew during this period.  These points can be made clearer in multivariate plots.

Source: "Truth, Fiction and Lou Dobbs", New York Times, May 30, 2007; U.S. National Hansen's Disease web-site.

 

Mar 12, 2007

Lines of death

I've been reading my friend's anti-smoking tome, and traced this "infographic" back to its source (World Health Organization). 

Who_tobacco

I was very intrigued by the "lines of death", which seemed to make the point that the risk of death had a spatial correlation: specifically, that the death risk for male smokers was higher in the northern hemisphere (above the line), primarily developed countries, than in the southern hemisphere, mostly developing nations.

I find that somewhat counter-intuitive but in a fascinating book like this, which brings together scientific, psychological and societal commentary, I was expecting to learn new things.

Looking at the legend, the red areas were regions in which deaths from tobacco use accounted for over 25% of "total deaths among men and women over 35".  This explained some, as perhaps there were more reasons to die (warfare, other diseases, mine accidents, etc.) in developing nations than in developed nations, or that they had larger populations (so more deaths even at lower rates).

Who_tobacco2

However, the description of the "lines of death" raised my eyebrows.  It is now claimed that more than 25% of middle-aged people (35-69 years old) die from tobacco use in the red regions.

Did they mean 25% of the dead middle-aged people die from smoking?  Or 25% of all middle-aged folks die from smoking?  A gigantic difference!

Percentages are very tricky things to use.  Every time I see a percentage, the first thing I ask is what is the base population.  Here, the baseline appeared to have gotten lost in translation.
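A toy calculation shows how far apart the two readings are.  The cohort size and the share dying in middle age are hypothetical numbers of mine; only the 25% comes from the map.

```python
cohort = 100_000   # hypothetical number of people reaching age 35
death_pct = 20     # hypothetical percentage of them who die before 70
tobacco_pct = 25   # the 25% figure from the map

deaths = cohort * death_pct // 100
reading_a = deaths * tobacco_pct // 100  # 25% of those who died in middle age
reading_b = cohort * tobacco_pct // 100  # 25% of everyone in the age band

print(deaths)     # 20000
print(reading_a)  # 5000
print(reading_b)  # 25000
```

Under these made-up numbers the second reading implies more tobacco deaths than there are deaths from all causes, which is why the base population has to be stated.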

This set of maps also shows the peril of focusing too much on entertainment value, and losing the plot.

For those concerned about the effect of smoking on our society and our children, I highly recommend Dr. Rabinoff's very readable new book, "Ending the tobacco holocaust".  It contains lots of interesting tidbits and really brings together every cogent argument that exists, including the common ones you've heard and others you haven't.

Reference: "Ending the tobacco holocaust" by Michael Rabinoff; The Tobacco Atlas by the World Health Organization

Feb 27, 2007

Mean and median

In the comments of the last post on on-line weather forecasts, Hadley raised the evergreen statistical question of mean vs median.  In this context, median error is unaffected by particular days in which the forecaster makes extreme errors while mean error takes into account the magnitude of every forecasting error in the sample.

Which one to use depends on the situation.  Brandon, who did the original analysis, was motivated by planning a trip to an unfamiliar location.  In this case, he might be better served by lower mean error, which would imply few extremely bad forecasts.

On the other hand, if I am interested in my local weather, then I'd likely be less concerned about a few extremely bad forecasts, and more concerned that the forecast is on the money on most days.  Then perhaps the median error would come into play.
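Here is a tiny illustration of how the two measures can disagree.  The two forecasters and their daily errors (in degrees) are invented for the purpose and have nothing to do with Brandon's data.

```python
from statistics import mean, median

# Hypothetical absolute forecast errors over one week for two made-up forecasters.
steady = [3, 3, 3, 3, 3, 3, 3]     # always a little off
erratic = [1, 1, 1, 1, 1, 1, 22]   # usually spot on, with one terrible day

for name, errors in (("steady", steady), ("erratic", erratic)):
    print(name, "mean:", mean(errors), "median:", median(errors))
```

The erratic forecaster wins on median error and loses on mean error, so the traveller and the local resident could reasonably pick different sites.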

Redoonlineweather2

It turns out it doesn't much matter for our weather forecast data.  In this new chart, I superimposed the mean error data (in black).  The scatter of points was exactly as it was for median error (in red).  (MSN had a particularly bad forecast for a low temperature one day, which pulled its location to the left.)

This shows further that the difference between CNN, Intellicast and The Weather Channel is negligible.
