
Super Crunchers

Here's something different: a mini book review of Ian Ayres's "Super Crunchers".  This book can be recommended to anyone interested in what statisticians and data analysts do for a living.  Ian is to be congratulated for making an abstruse subject lively.

His main thesis is that data analysis beats intuition and expertise in many decision-making processes; it is therefore important for everyone to have a basic notion of two powerful tools: regression and randomization.
He correctly points out that the ready availability of large amounts of data in recent times has empowered data analysts.

Regression is a statistical workhorse often used for prediction based on historical data.  Randomization refers to assigning subjects at random to multiple groups, and then examining if differential treatment by group leads to differential response.  (In particular, the chapter on randomization covers the topic well.)  Using regression to analyze data collected from randomized experiments allows one to establish cause-effect. 
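The pairing of the two tools can be sketched in a few lines. This is a minimal illustration, not anything from the book: the subjects, the baseline of 10, and the treatment effect of 2 are all invented. Subjects are assigned to treatment by coin flip, and regressing the response on the treatment indicator then recovers the causal effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Randomization: assign each subject to treatment (1) or control (0) by coin flip
treatment = rng.integers(0, 2, size=n)

# Simulated response: baseline of 10, true treatment effect of +2, plus noise
response = 10 + 2 * treatment + rng.normal(0, 1, size=n)

# Regression of response on the treatment indicator:
# the slope estimates the causal effect of the treatment
slope, intercept = np.polyfit(treatment, response, 1)
print(slope, intercept)  # slope lands near the true effect of 2
```

Because assignment was randomized, the slope can be read causally; with observational data the same regression would only measure an association.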

In the following, I offer a second helping for those who have tasted Ian's first course:

  • Randomized experiments represent an ideal and are not typically possible, especially in social science settings.  (Think about assigning a group of patients at random to be "cigarette smokers".)  When these are not possible, regression uncovers only correlations, and does not say anything about causation.
  • Most large data sets amenable to "super crunching" (e.g. public records, web logs, sales transactions) are not collected from randomized experiments.
  • Regression is only one tool in the toolbox.  It is fair to say that most "data miners" prefer other techniques such as classification trees, cluster analysis, neural networks, support vector machines and association rules.  Regression has the strongest theoretical underpinning but some of the others are catching up.  (Ian did describe neural networks in a later chapter.  It must be said that many forms of neural networks have been shown to be equivalent to more sophisticated forms of regression.)
  • If used on large data sets with hundreds or thousands of predictors, regression must be used with great care, and regression weights (coefficients) interpreted with even more care.  The size of the data may even overwhelm the computation.  Particularly when the data was collected casually, as in most super crunching applications, the predictors may be highly correlated with each other, causing many problems.
  • One of the biggest challenges of data mining is to design new methods that can process huge amounts of data quickly, deal with much missing or irrelevant data, deal with new types of data such as text strings, uncover and correct for hidden biases, and produce accurate predictions consistently.
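The problem of highly correlated predictors mentioned above is easy to demonstrate. In this toy sketch (all numbers invented), two nearly identical predictors split a real effect between them: the individual regression weights are wildly unstable, even though their sum is well determined.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Two almost perfectly correlated predictors, as in casually collected data
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.01, n)   # x2 is x1 plus a tiny wobble
y = x1 + rng.normal(0, 1, n)       # the response truly depends on x1 only

# Ordinary least squares with an intercept column
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# The individual weights on x1 and x2 swing wildly from sample to sample,
# but their sum reliably recovers the combined effect of about 1
print(coef[1], coef[2], coef[1] + coef[2])
```

This is why regression weights from casually collected data deserve extra care: each weight answers "holding the other predictors fixed," which is nearly meaningless when the predictors move in lockstep.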

Clocks and pies

Keith A submitted this graphical idea from the folks at Ikea (via Boing Boing). 
Based on the comments, it seems like some people really like this presentation!

Consider these for amusement:

  • Does the "9" on Sunday mean 9 am or 9 pm?  (This chart mixes A.M. and P.M. hours in a totally nonchalant way.)
  • If the above is too easy, try the "9" for Saturday!
  • Why was "9" displayed on Sunday anyway?  Meanwhile, why wasn't "7" displayed for Saturday?  (How were the hour labels chosen?)
  • Why was "Closed" written on the chart while "High", "Mid", and "Low" were put into the legend?
  • Since pie charts show proportions, is it possible to describe what proportions were plotted?

Reminds me of this pie chart.

Light entertainment

Christopher P submitted this chart, which is great for our light entertainment series.
Apparently it came from the Netherlands and showed how privileged their citizens are compared to the rest of the world.  It would appear that they need to reverse the color scheme (and font size?) to highlight the privileged.  Comments welcome.

Source: AdsoftheWorld.com

Charts, charts, charts

Jorge Camoes has been a regular reader and sometime commenter for a while.  Little did we know that he has been blogging in Portuguese for the last 10 months.  Recently, he has decided to join the English-speaking world.  His new blog is, simply, Charts.

One post discusses the "population pyramid" chart for comparing advertising spending. 
He suggested the overlapping bar chart; see his comment here.  By folding one side onto the other, this chart is clearly an improvement over the original, and yet it fails to convey the proportional spend, which is the key point being made in the article.

In another post, Jorge created a "screencast" (tutorial) on how to create a population pyramid in Excel.  A lot of this mirrors my own experience using Excel for graphing.  Those of you who have asked for tips in the past should definitely see it.

What you'll find is that creating a nice-looking chart in Excel requires a lot of tedious finger-work.  It is truly incredible how many steps, how much opening and closing of windows, back and forth navigation, etc. users are made to suffer through to make cosmetic changes.

With the advent of AJAX and other interactive technologies, one can only hope that new graphing software will use the "canvas" metaphor.  If we want to reduce the spacing between bars, we should be able to grab the bars and move them together.  If we want to change the ordering, we should be able to mouse over a menu and select a pre-defined ordering scheme, or simply drag bars around as we please.

(I have heard that Apple's spreadsheet software Numbers has some of these features.  I have yet to use it myself.  If any of you have, let us know what you think.)

Points of comparison

In light of the current housing crisis, arising from mortgage defaults, I pulled this graphic from a Jan 2007 opinion piece that plotted historical default rates of mortgages.  Notice the high degree of stretching on the vertical axis that exaggerates the volatility: essentially, the annual delinquency rate ranged from 1.75% to 2.65% during the last six years or so.  One might be forgiven for thinking that a 2% default rate is quite acceptable.

Compare the above chart to the pair that showed up in the NYT in Oct 2007 (see right).  The default rates here are in the 10-20% range, very alarming indeed.

The two graphics illustrate a key issue of "aggregation" in statistical analysis.  The first graphic is super-aggregated: all types of mortgages of all ages are put together to calculate each year's default rate.  The second graphic homes in on subprime mortgages only.

More importantly, the second graphic presents data in "vintages".  Each line represents loans originated during a particular year (a "vintage").  This establishes comparability.  On the first chart, each point in time represents the default rate of mortgages averaged over all ages (some loans may be only a few months old; others may be 15 years old).  Since the default rate is much higher for very young mortgages than for older mortgages, such averaging hides crucial information.
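A toy calculation (all numbers hypothetical, chosen only to mimic the two charts) shows how pooling over loan ages dilutes a troubled new vintage: even a 10% delinquency rate among new loans barely moves the blended figure when seasoned loans dominate the book.

```python
# Hypothetical loan book: (number of loans, delinquency rate) by vintage
book = {
    "seasoned vintages": (900_000, 0.015),  # older loans, low delinquency
    "new vintage":       (100_000, 0.100),  # young loans, alarming delinquency
}

total_loans = sum(n for n, _ in book.values())
blended_rate = sum(n * r for n, r in book.values()) / total_loans

# The blended rate stays near 2% even though new loans are at 10%
print(f"blended delinquency rate: {blended_rate:.2%}")
```

The super-aggregated chart reports only the blended number; the vintage chart reports each line of the dictionary separately, which is exactly where the alarming information lives.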

Overall, the NYT graphic very effectively conveys the alarming trend of new mortgages performing much worse, especially those originated in 2007.

It can benefit from two slight edits: adding a few more years, and using vertical lines (the most critical comparisons are default rates for loans of a given age!)  Something like this...

Sources: "As Defaults Rise, Washington Worries", New York Times, Oct 16 2007; "Mounting Mortgage Credit Problems", economy.com, Jan 23 2007.

Sense of proportion

[I'm back from vacation.  Will provide my reaction to the responses to the Gelman challenge, and for those who have sent me email, I will work through them soon.]

The NYT commented on a trend among marketers to shift their advertising spending from so-called "measured" media like print and TV to so-called "unmeasured" media like product placements, contests, etc. 
The following chart accompanied the article:


This construct is akin to a population pyramid; it's great for comparing two groups along one metric, say age groups between males and females.  Here, the two halves aren't comparable groups but two different metrics.  The main metric, the proportion of unmeasured spending, is not directly depicted: the reader must figure out mentally how much of each bar the black part covers.  Also, the companies are sorted by unmeasured media spending, but this leaves the measured spending with a jagged profile, confusing matters.

As for the little white slits on the gray bars, they are admittedly cute but it is difficult to compare the detailed breakdown between print, TV and other media among companies.

The following dot plot gives the two halves equal weight.  (Pink dots are measured, blue unmeasured.)  It's not a very interesting graphic, though.  The sense of proportion is still missing.

I settled on a scatter plot which relates the proportion spent on unmeasured to the total amount of spending.  It appears that the largest advertisers had the lowest proportional unmeasured spend while the smallest (among the majors) had the highest.  (It's only a weak correlation: a linear fit yields only 16% R-squared.)
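For readers who want to reproduce that kind of number, here is a quick sketch of fitting a line and computing R-squared. The data below are invented stand-ins, not the actual ad-spend figures from the article.

```python
import numpy as np

# Invented stand-in data: total ad spend (x) vs. share of unmeasured spend (y)
rng = np.random.default_rng(7)
x = rng.uniform(1, 5, 30)                      # total spend, say $ billions
y = 0.5 - 0.04 * x + rng.normal(0, 0.1, 30)   # only weakly related share

# Linear fit, then R-squared as the share of variance explained
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
r_squared = 1 - residuals.var() / y.var()

# A weak relationship leaves most of the variance unexplained
print(f"slope={slope:.3f}, R-squared={r_squared:.2f}")
```

An R-squared in the teens, as in the chart, means the downward pattern is visible but total spending explains only a small fraction of the variation in unmeasured share.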

Source: "The New Advertising Outlet: Your Life", New York Times, Oct 14, 2007.