May 06, 2008

Turning in his grave 1

(Thanks to reader Josh R. for the tip.)  The "plucky statisticians" at Urbanspoon decided to tackle the political hot potato: is Barack Obama an elitist?  Scratch that -- what they actually did was to determine whether Obama supporters were elitists (in which case Obama would be one too, through guilt by association).  Scratch that -- what they actually analyzed was whether there tended to be more Starbucks per capita in those states in which Obama won Democratic primaries.

Suffice it to say, even if it could be proven that states with higher densities of Starbucks tend to have more Democratic primary voters who prefer Obama to Clinton, that would be a far cry from proving Obama an elitist.  Nevertheless, we take the leap of faith and look at the evidence presented to us.

Blog_obamaelite

The star witness was this chart plotting the "vote spread" of Obama minus Clinton and the per-capita Starbucks density.  The black line was a linear fit to the Starbucks data as shown in green dots.  Since the black and blue lines both pointed northeast roughly speaking, we were told: "States with more latte-purveying Starbucks stores are more likely to have gone for Obama."  (So Obama is indeed an elitist.)

To cover all bases, the creator of this chart suggested that "my statistics professor might be rolling over in his grave to hear me say it, but there's a mild but real correlation here!".

Mr. Urbanspoon, the statistics professor is here and he disapproves.  As discussed before (and here), plotting two series of data on the same chart and applying two different scales is a recipe for disaster.  Not reaching immediately for the scatter plot when one has two data series is another serious misstep.  (Indeed, Josh sent the link in with a note wondering why "people dislike scatter plots so much".)  So here is the appropriate graphic:

A quick first glance at the left chart indicates that any correlation, if it exists, is very weak indeed.  A simple linear regression analysis shows that Starbucks density explains only 14% of the variability in vote spread.  Note especially the wide dispersion of dots around the line.  Further, for the vast majority of the states (say those with vote spread between -20% and 40%), there appears to be no correlation.  This is seen on the right chart.

Redo_obamaelitist

To the extent that there is a linear correlation, the highlighted points (orange dots) are the most influential.  The top cluster included Alaska, Kansas, DC, Hawaii and Idaho, in which Obama had a large winning margin while the Starbucks density was above average.  The bottom cluster included Arkansas and Oklahoma, where Obama was wiped out and where Starbucks had the lowest density.  These two clusters alone explained the mild relationship; removing them wiped it out.
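The effect of a few influential points on a fit can be checked by refitting without them.  The sketch below is a minimal illustration in plain Python, not the analysis behind the charts: the `r_squared` and `fit` helpers and all coordinates are made up for the example (they are not the Starbucks or primary data).  Four middling points show no relationship at all, and two extreme points at opposite corners manufacture one.

```python
def r_squared(x, y):
    """R-squared of a simple least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


# Hypothetical (x, y) pairs, NOT the real state data.
middle = [(2, 0), (2, 2), (4, 0), (4, 2)]  # no trend among these four
extremes = [(0, -5), (6, 7)]               # one low-low point, one high-high point


def fit(points):
    """Unpack a list of (x, y) pairs and return the R-squared of the fit."""
    xs, ys = zip(*points)
    return r_squared(xs, ys)


with_extremes = fit(middle + extremes)
without_extremes = fit(middle)
print(f"R-sq with extremes:    {with_extremes:.2f}")
print(f"R-sq without extremes: {without_extremes:.2f}")
```

With these made-up numbers the fit looks strong only while the two extreme points are present.  The same refit-and-compare check applies to any subset suspected of driving a regression.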

Redo_obamaelitist2

Following Nyhan, we should remove some obvious outliers, such as Arkansas, Illinois and New York (home states), Michigan and Florida (disputed) and New Hampshire and Iowa (Edwards territory).  The result is also a mild correlation (R-sq = 0.075).


Till next post, when the professor rolls over again ...


 

Notice that I prefer the number of people per Starbucks metric, as opposed to the number of Starbucks per thousand people (See prior discussion on Gelman's blog.)  The reason is that every number on the former metric is reality-based while the latter metric produces imaginary numbers for small states, i.e. the imputed number of Starbucks is smaller than what actually exists!

Also note that I used a renormalized vote spread so that the Obama proportion and the Clinton proportion added up to 100%.  This made the assumption that Edwards and other voters would split among Obama and Clinton in the same proportions as those who explicitly voted for the two frontrunners.
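As a minimal sketch of that renormalization (the function name and the example shares are hypothetical, chosen only to show the arithmetic): dividing the Obama-minus-Clinton difference by their combined share is the same as rescaling the two shares to add up to 100% before subtracting.

```python
def renormalized_spread(obama_share, clinton_share):
    """Obama-minus-Clinton spread after rescaling the two shares to sum to 1.

    Votes for other candidates are implicitly split between the two
    in proportion to their own shares.
    """
    two_way_total = obama_share + clinton_share
    if two_way_total <= 0:
        raise ValueError("combined share must be positive")
    return (obama_share - clinton_share) / two_way_total


# Hypothetical shares for illustration: 45% Obama, 35% Clinton, 20% others.
raw = 0.45 - 0.35
renorm = renormalized_spread(0.45, 0.35)
print(f"raw spread: {raw:.3f}, renormalized spread: {renorm:.3f}")
```

In this made-up case the raw spread of 10 points becomes 12.5 points once the other 20% is allocated proportionally, so the renormalized spread is never smaller in magnitude than the raw one.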

May 05, 2008

Turning the table

Nyt_runningbacks

We recently showed an example of when data tables worked well to clarify the data.  Last week, there was an example from the Times which did the opposite.

The accompanying article boldly claimed that

the 40-yard dash stands above them all as having the strongest correlation to success in the NFL.  The three-cone drill, the shuttle run, the bench press -- none correlate to NFL success.  The 40 is king.

Further, it cited Bill Barnwell from FootballOutsiders.com who created an "index" using both 40 time and body weight that is "an even better predictor than 40 time alone".  In other words, this formula

Nyt_runningback_eqt

does the trick.

The data table, shown above, presumably clinched the case.

Redo_runningback1

We were mystified when we put the data to the test, however.  Among the set of 15 running backs, the Index did not predict the Yards Per Carry at all!  The Index explained only 8% of the variation in Yards Per Carry between the backs.

The data table obscures this bivariate relationship.  As it was sorted by the Index, we would look for the column showing Yards Per Carry to be naturally sorted in the same order.  But it is hard to tell the trend from the noise in a table.
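One way to make the "is this column sorted the same way?" question concrete is a rank-agreement measure.  The sketch below uses Kendall's tau, which is my choice for illustration (the analysis in this post relies on R-squared), and the two columns are hypothetical numbers, not the values from the Times table.  It counts how many pairs of rows are ordered the same way in both columns versus the opposite way.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    Pairs tied in either column count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for (a1, b1), (a2, b2) in combinations(zip(a, b), 2):
        direction = (a1 - a2) * (b1 - b2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical columns: a table sorted by an index, and a second column
# that one hopes is sorted in the same order.
index_values = [120, 115, 110, 105, 100]
yards_per_carry = [4.2, 4.8, 3.9, 4.6, 4.1]
print(f"tau = {kendall_tau(index_values, yards_per_carry):+.2f}")
```

A value near +1 means the second column follows the sort order of the first, a value near 0 means the table's sort order says little about it, and a value near -1 means it runs the opposite way.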

What went wrong?  It turned out neither 40 Time nor Body Weight had any relationship with Yards Per Carry.

Redo_runningback2

These variables did not explain the range of Yards Per Carry attained by this set of running backs.

Redo_runningback3

Finally, we found a strong correlation between 40 Time and Body Weight.  (The heavier you are, the slower you run!)  This meant that both variables contained similar information, so any formula combining the two would be unlikely to perform significantly better than either variable alone.

So we are left to turn the table on the table.  More pertinent evidence is needed to prove the case.

The entire analysis suffers from survivorship bias as only the top running backs are examined, and no adjustment is made to deal with wide-ranging tenures.  Apparently, there is more data available in a book.  There is no indication of how the model shown above was validated.

Reference: "The Race of Truth: 40-Yard Times Can Tell the Future", New York Times, April 27, 2008.

 

Apr 29, 2008

Flows and partitions

Andrew M., a new but loyal reader, didn't like the flow charts used by the EPA to illustrate cleantech.  We had some lively discussion on flow charts before.  The bottom line seems to be that they are difficult beasts to tame, especially when the relationships are complex.  The example shown by Andrew (below) is not particularly horrid in the scheme of things.  It's the abundance of annotations and colors that causes dizziness.

Combinedheat

Here's a view of the same data, using a partitioning approach.  The inputs are fixed at 100 units, which I find easier to comprehend, whereas the original fixed the outputs at 30 units of electricity and 45 units of heat.  And of course, it is a tremendous service to readers not to have to work out the efficiencies.  Tacitness is a vice, not a virtue, in graph-making.

Redo_combinedheat
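The rescaling to 100 units of input is simple arithmetic, sketched here under stated assumptions: the function name is hypothetical, and the fuel input of 150 units is a made-up figure for illustration, not a number taken from the EPA chart.  Only the 30 units of electricity and 45 units of heat come from the original.

```python
def per_100_units_of_input(fuel_in, outputs):
    """Rescale output flows so that the fuel input equals 100 units.

    Returns the rescaled outputs plus the remainder as "losses".
    """
    scaled = {name: amount * 100 / fuel_in for name, amount in outputs.items()}
    scaled["losses"] = 100 - sum(scaled.values())
    return scaled


# fuel_in=150 is an ASSUMED value for illustration only.
flows = per_100_units_of_input(150, {"electricity": 30, "heat": 45})
for name, amount in flows.items():
    print(f"{name:12s}{amount:6.1f}")
```

Once the input is 100, each output can be read directly as a percentage of the fuel, and the overall efficiency is 100 minus the losses, with no mental division required.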


Reference: "Catalog of CHP Technologies", US EPA Combined Heat and Power Partnership.

Apr 27, 2008

Running in the rain

Reader Eduardo is unhappy about the embellishments in this Nikeplus chart of miles run by day; "pretty but misleading", he wrote us to say.  This is a clear case of more is less.

Nikeplus


As a data graphic, it doesn't work.  The reflections don't work.  Perhaps Nike wants to remind all you super-dedicated Nano-wearing runners what it's like to run in mist or rain!  To quote Eduardo: "The bars start at -1! I guess it is motivation."  An extra mile for everyone.  The rounded corners make it harder to read the level.

Startat8

Speaking of bar charts, I want to follow up on an exchange from March.  In that example, we claimed that not starting bars at zero misrepresented the relative lengths of those bars.  The chart showed counts of baseball players implicated in the Mitchell Report by position.

This distortion arises from taking the same length off each bar regardless of the data.  As a result, the ratios of the lengths between the bars have been changed drastically.

For example, the ratio of P/3B in the top chart is 31/9 = 3.4 but in the bottom chart, it is 23/1 = 23!
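The arithmetic generalizes to any baseline.  In this small sketch (the function name is mine; the counts 31 and 9 and the baseline of 8 are the ones implied by the ratios above), the drawn length of each bar is its value minus the axis baseline, so the visual ratio drifts away from the data ratio as the baseline rises.

```python
def bar_length_ratio(a, b, baseline=0):
    """Ratio of drawn bar lengths when the axis starts at `baseline`."""
    if b <= baseline:
        raise ValueError("the second bar would have zero or negative length")
    return (a - baseline) / (b - baseline)


pitchers, third_basemen = 31, 9
print(f"axis starts at 0: {bar_length_ratio(pitchers, third_basemen):.1f}")
print(f"axis starts at 8: {bar_length_ratio(pitchers, third_basemen, baseline=8):.1f}")
```

Subtracting the same amount from both bars preserves their difference (22 in either chart) but not their ratio, and it is the ratio that a reader judges from the bar lengths.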



