May 06, 2008

Turning in his grave 1

(Thanks to reader Josh R. for the tip.)  The "plucky statisticians" at Urbanspoon decided to tackle the political hot potato: is Barack Obama an elitist?  Scratch that -- what they actually did was to determine whether Obama supporters were elitists (in which case Obama would be one too, through guilt by association).  Scratch that -- what they actually analyzed was whether there tended to be more Starbucks per capita in those states in which Obama won Democratic primaries.

Suffice it to say, even if it can be proven that most states with high densities of Starbucks are more likely to have more Democratic primary voters who prefer Obama to Clinton, it is a far cry from proving Obama an elitist.  However, we take the leap of faith and look at the evidence presented to us.

Blog_obamaelite The star witness was this chart plotting the "vote spread" of Obama minus Clinton and the per-capita Starbucks density.  The black line was a linear fit to the Starbucks data as shown in green dots.  Since the black and blue lines both pointed northeast roughly speaking, we were told: "States with more latte-purveying Starbucks stores are more likely to have gone for Obama."  (So Obama is indeed an elitist.)

To cover all bases, the creator of this chart suggested that "my statistics professor might be rolling over in his grave to hear me say it, but there's a mild but real correlation here!".

Mr. Urbanspoon, the statistics professor is here and he disapproves.  As discussed before (and here), plotting two series of data on the same chart and applying two different scales is a recipe for disaster.  Not reaching immediately for the scatter plot when one has two data series is another serious misstep.  (Indeed, Josh sent the link in with a note wondering why "people dislike scatter plots so much".)  So here is the appropriate graphic:

A quick first glance at the left chart indicates that any correlation, if it exists, is very weak indeed.  A simple linear regression analysis shows that Starbucks density explains only 14% of the variability in vote spread.  Note especially the wide dispersion of dots around the line.  Further, for the vast majority of the states (say those with vote spread between -20% and 40%), there appears to be no correlation.  This is seen on the right chart.

Redo_obamaelitist

To the extent that there is a linear correlation, the extreme points (orange dots) would be most influential.  The top cluster included Alaska, Kansas, DC, Hawaii and Idaho, in which Obama had a large winning margin while the Starbucks density was above average.  The bottom cluster included Arkansas and Oklahoma, where Obama was wiped out and where Starbucks had the lowest density.  These two clusters alone explained the mild relationship; removing them wiped it out.
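For readers who want to try this at home, here is a minimal sketch of how an R-squared figure for a simple linear fit can be computed, and how a handful of extreme points can manufacture one.  The function name and all the numbers are made up for illustration; they are not the Urbanspoon data.

```python
def r_squared(xs, ys):
    """Share of the variance in ys explained by a least-squares line on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # For a one-predictor linear regression, R-squared is the squared correlation.
    return (sxy * sxy) / (sxx * syy)


# Hypothetical data: five "core" states with no trend, plus two extreme states.
density = [1, 2, 3, 4, 5, 0, 6]
spread = [5, -5, 5, -5, 5, -40, 40]

core_only = r_squared(density[:5], spread[:5])  # the five core points alone
with_extremes = r_squared(density, spread)      # all seven points

print(f"core only: {core_only:.2f}, with extremes: {with_extremes:.2f}")
```

In this toy example the core points show no linear relationship at all, yet adding one point in each corner produces a respectable-looking fit.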

Redo_obamaelitist2

Following Nyhan, we should remove some obvious outliers, such as Arkansas, Illinois and New York (home states), Michigan and Florida (disputed) and New Hampshire and Iowa (Edwards territory).  The result is again a mild correlation (R-sq = 0.075).


Till next post, when the professor rolls over again ...


 

Notice that I prefer the number of people per Starbucks metric, as opposed to the number of Starbucks per thousand people (See prior discussion on Gelman's blog.)  The reason is that every number on the former metric is reality-based while the latter metric produces imaginary numbers for small states, i.e. the imputed number of Starbucks is smaller than what actually exists!

Also note that I used a renormalized vote spread so that the Obama proportion and the Clinton proportion added up to 100%.  This made the assumption that Edwards and other voters would split among Obama and Clinton in the same proportions as those who explicitly voted for the two frontrunners.
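The renormalization can be sketched in a few lines, under the stated assumption that third-party voters split in the same proportions as the two-way vote.  The function name and the example percentages are hypothetical.

```python
def renormalized_spread(obama_pct, clinton_pct):
    """Obama-minus-Clinton spread after rescaling the two shares to sum to 100%."""
    two_way_total = obama_pct + clinton_pct
    obama_share = 100.0 * obama_pct / two_way_total
    clinton_share = 100.0 * clinton_pct / two_way_total
    return obama_share - clinton_share


# Hypothetical state: Obama 45%, Clinton 35%, everyone else 20%.
raw_spread = 45.0 - 35.0
adjusted_spread = renormalized_spread(45.0, 35.0)

print(raw_spread, adjusted_spread)
```

The raw 10-point gap becomes a 12.5-point gap once the remaining 20% is spread across the two frontrunners.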

May 05, 2008

Turning the table

Nyt_runningbacks We recently showed an example of when data tables worked well to clarify the data.  Last week, there was an example from the Times which did the opposite.

The accompanying article boldly claimed that

the 40-yard dash stands above them all as having the strongest correlation to success in the NFL.  The three-cone drill, the shuttle run, the bench press -- none correlate to NFL success.  The 40 is king.

Further, it cited Bill Barnwell from FootballOutsiders.com who created an "index" using both 40 time and body weight that is "an even better predictor than 40 time alone".  In other words, this formula Nyt_runningback_eqt

does the trick.

The data table, shown above, presumably clinched the case.

Redo_runningback1 We were mystified when we put the data to the test, however.  Among the set of 15 running backs, the Index did not predict the Yards Per Carry at all!  The Index explained only 8% of the variation in Yards Per Carry between the backs.

The data table obscures this bivariate relationship.  As it was sorted by the Index, we would look for the column showing Yards Per Carry to be naturally sorted in the same order.  But it is hard to tell the trend from the noise in a table.

What went wrong?  It turned out neither 40 Time nor Body Weight had any relationship with Yards Per Carry.

Redo_runningback2

These variables did not explain the range of Yards Per Carry attained by this set of running backs.

Redo_runningback3

Finally, we found strong correlation between 40 Time and Body Weight.  (The heavier you are, the slower you run!)  This meant that both variables contained similar information, and any formula combining the two would be unlikely to perform significantly better than either variable alone.

So we are left to turn the table on the table.  More pertinent evidence is needed to prove the case.

The entire analysis suffers from survivorship bias as only the top running backs are examined, and no adjustment is made to deal with wide-ranging tenures.  Apparently, there is more data available in a book.  There is no indication of how the model shown above was validated.

Reference: "The Race of Truth: 40-Yard Times Can Tell the Future", New York Times, April 27, 2008.

 

Apr 08, 2008

Pick-and-choose

Gelman pointed to this Brendan Nyhan post dissecting David Sirota's chart purportedly showing a "race chasm" in the Democratic primaries.  The left chart is David's original and the right is a Nyhan revision.
Sirota

Please see Nyhan for the political interpretation.  Here, I want to note a number of improvements Brendan made to the chart:

  • Sirota plotted the ranks of the percent of black population, which is misleading.  Nyhan plotted the actual percentages on his horizontal axis
  • Sirota connected the dots which highlighted the noise (ups and downs) in the data.  Nyhan fitted a linear model (he also tried other non-linear versions).
  • Sirota plotted Obama's overall margin of win/loss.  Nyhan plotted his margin among white voters only, which more directly addressed the issue.
  • Nyhan exposed the excluded states in a footnote.  Sirota didn't.  For this chart, this piece of information is very important since so many states were excluded.

Nyhan walked us through multiple charts he used to explore the data.  Much of the time was spent picking and choosing states to include or exclude.  We learnt that Sirota excluded states with large Hispanic populations, which Nyhan disagreed with; Nyhan wanted to exclude Florida, which Sirota decided against, even though Sirota excluded Michigan, to which Nyhan consented; Nyhan also wanted to exclude the caucus states; and so on...

Judging from the charts, this picking and choosing appears not to have changed the outcome in this case.  In general, one should exercise great care in such decisions because one might end up seeing what one wants to see.

The following chart is missing from the post, which I think points out something more telling than the negative correlation between Obama's margin with white voters and the proportion of black population.

Sirota2




Mar 22, 2008

Trying too hard

In the course of business and governing, a lot of charts are generated.  An anonymous tipster pointed us to a set created by the "Communities and Local Government" division in the UK government.  Judging from the content, this division has responsibility for economic development in local neighborhoods.

Below are a pair of exhibits.  Truly they are trying too hard!  What we see is a hybrid scatter-bubble chart.  Between the jargon, the acronyms (LAD, LSOA), the boxed text, the multi-color circles, the colored axis labels and lack of title, the reader is plunged into a state of confusion.

Uk_communities3

The chart can be unraveled.  Each district was evaluated based on two measures of "gaps in worklessness".  The vertical axis compares each district to the national average; positive numbers indicate an above-average district relative to the nation.  The horizontal axis compares the most deprived 10% of neighborhoods within each district to the local average; positive numbers indicate that the worst neighborhoods are improving.

Thus, the policy goal would be to move all districts into the upper right quadrant.  The multi-color bubbles were designed to show us the state of the nation.  On the left chart, 41% of the districts (or population?) reside in the improving districts while 19% live in deteriorating areas.

The following strategies can help improve readability:

Redo_communities3

  • use English on the axis
  • relegate technical definitions to the legend
  • add succinct title to tell the story
  • use color on the data rather than on axis or data labels
  • use color to draw attention to the upper right quadrant
  • remove bubbles
  • define acronyms

 

Mar 05, 2008

Mid-week entertainment: Pity grapefruit

Courtesy of Derek.  Hope for the scatter plot?

Grapefruit_scatter

Original link here

Jan 17, 2008

Football rankings 2

Nyt_nfloffense

The above chart is another one in the NYT series on the NFL playoffs.  It evaluates the mix of passing and rushing attempts by offense.  The convoluted way by which the caption strains to tell a story indicates trouble ahead:

Of the three playoff teams that threw the ball the most, two of them come from cities known for cold weather.  Conversely, of the three teams that ran the most, two of them play their home games in milder weather.

The implication is that teams from cold-weather cities are supposed to want to rush more, and vice versa.  And the data (total of six samples) pointed to the opposite.

This presentation suffers from low data-to-ink ratio:  too much ink is spilled over not much data.  The designer arbitrarily picks one of the two variables (passing attempts, rushing attempts) as the primary, sorting variable -- trace the orderly green diamonds on the right chart.  This makes it hard to see a pattern in the brown diamonds.  As usual, a scatter plot works much better with two data series.

Redo_nfloffense

In the junkart version, the raw numbers of attempts are converted into proportion of attempts that were passing versus rushing.  This easy move immediately collapses the two dimensions into one.  Now, we have room to include an extra variable which matters: the average amount of snowfall in these cities.
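The conversion is simple enough to sketch.  The function name and the attempt counts below are hypothetical, not taken from the NYT chart.

```python
def passing_share(pass_attempts, rush_attempts):
    """Fraction of all offensive attempts that were passes."""
    return pass_attempts / (pass_attempts + rush_attempts)


# Hypothetical team: 550 passing attempts and 450 rushing attempts.
share = passing_share(550, 450)

print(f"{share:.0%} passing, {1 - share:.0%} rushing")
```

Since the rushing share is just one minus the passing share, a single number per team carries all the information in the original pair.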

So what does the data say about the relationship between propensity to pass and cold weather?  There appears to be very little relationship as the dots are all over the chart.  In particular, the teams playing in cities with the highest snowfall span the range of passing percents; similarly, those playing in lowest-snowfall cities also span the range of passing percents. 

The caption ignores all the blue dots, focusing only on the gray ones.  A more direct examination of the relationship reveals the folly of the so-called "not so conventional wisdom".

References: "NFL Offences Undergo a Thaw in Thinking", New York Times, Jan 5 2008; government snowfall statistics.

Dec 16, 2007

Hits and misses

In this NYT article, we are told that "the most likely result when a policeman discharges a gun is that he or she will miss the target completely."  That's a shocker for those of us conditioned by Hollywood movies to think anyone who picks up a gun for the first time hits the villain right on the temple.  The following graphic attempts to tell the story.

Nyt_bullets

The one hit here is how the distances are visually presented.  The elliptical lines remind us of the neglected variable of direction; it also means the scale is correct only along one direction.

The dot matrix construct highlights the absolute numbers of shots, hits and misses but barely addresses the key issue of hit rates (accuracy). Nyt_bullets3 Specifically, this data set was presumably collected to explore the relationship between hit rates and distances from the target.  The use of different widths clouds our judgement of proportions.  To wit, it is not obvious that the 10-wide block and the 40-wide block shown left depict roughly equal hit rates (23%, 29%).

Redo_bullets The junkart version adopts a different approach.  This is the Lorenz curve, often used to show income inequality (see also here and here).  Here, the shots were ordered from closest to furthest from target, then summed up by distance segments.  For example, shots from 0 to 6 feet accounted for 60% of all shots but 72% of all hits.

If distance does not affect hit rates, we'd expect 60% of all shots to result in 60% of all hits.  This data point would show up on the 45-degree diagonal on the chart, labelled "totally unpredictable".  Any data appearing above the diagonal indicates that closer shots are more accurate, accounting for more than their fair share of hits.
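Here is a rough sketch of how the points on such a curve can be built from counts by distance band.  The counts are invented for illustration; they were only chosen so that the closest band reproduces the 60% of shots and 72% of hits quoted above, and the function name is my own.

```python
from itertools import accumulate


def lorenz_points(shots, hits):
    """Cumulative (share of shots, share of hits) by distance band, closest first."""
    total_shots = sum(shots)
    total_hits = sum(hits)
    cumulative = zip(accumulate(shots), accumulate(hits))
    return [(s / total_shots, h / total_hits) for s, h in cumulative]


# Hypothetical counts per distance band, ordered from closest to furthest.
shots = [60, 25, 15]
hits = [18, 5, 2]

points = lorenz_points(shots, hits)
for shot_share, hit_share in points:
    # A point above the diagonal has a hit share at least as large as its shot share.
    print(f"{shot_share:.0%} of shots -> {hit_share:.0%} of hits")
```

Each point that sits above the diagonal says the shots taken so far (the closer ones) have produced more than their fair share of hits.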

Comparing the fitted blue line and the diagonal, one sees that distance is a weak predictor of hit rate.  The police commissioner explains this in the article; many other variables also affect accuracy, including "the adrenaline flow, the movement of the target, the movement of the shooter, the officer, the lighting conditions, the weather..."

Note that the shots with "unknown" distances were removed from the analysis.  Also, the categories of 21-45 and 45-above were combined: the rates were similar and with only three hits, it does not make sense to treat these as separate categories.

Of course, this version would not work well in the mass media.  For that, one can just plot hit rates against the distance categories.

Source: "A Hail of Bullets, a Heap of Uncertainty", New York Times, Dec 9 2007; New York Firearms Discharge Report 2006.

Nov 25, 2007

A dangerous equation

Graduation rates at 47 new small public high schools that have opened since 2002 are substantially higher than the citywide average, an indication that the Bloomberg administration’s decision to break up many large failing high schools has achieved some early success.

Most of the schools have made considerable advances over the low-performing large high schools they replaced. Eight schools out of the 47 small schools graduated more than 90 percent of their students.

Nyt_smallsch This graphic included in the NYT article  lent support to the "small schools movement".  In particular, note the last sentence of the above quotation: it incorporates the oft-used device of subgroup support of a hypothesis, in this case, the subgroup of eight top-performing schools.

Such analysis is "dangerous", according to Howard Wainer, who discusses this and other examples of misapplication in a recent article in American Scientist, entitled "The Most Dangerous Equation".  He alleged that billions have been wasted in the pursuit of small schools.

Wainer_mathscores

The issue concerns sample size.  Dr. Wainer and associates analyzed math scores from Pennsylvania public schools.  Average scores for smaller schools are based on a smaller number of students, and are therefore less stable (more variable).  More variability means more extremes.  Thus, by chance alone, we expect to find more smaller schools among the top performers.  Similarly, by chance alone, we also expect to find more smaller schools among the worst performers. 

The scatter plot lays out their argument. Focusing only on the top performers (blue dots), one might conclude that smaller schools do better.  However, when the bottom performers (green) are also considered, the story no longer holds.  Indeed, the regression line is essentially flat, indicating that scores are not correlated with school size.

This is all nicely explained via the standard error formula (De Moivre's equation) in Dr. Wainer's article.  Here is a NYT article from the mid 1990s describing this same phenomenon.
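As a quick illustration of the formula (the standard error of a mean is the standard deviation divided by the square root of the sample size), here is a small sketch.  The score spread and the school sizes are hypothetical, not Dr. Wainer's data.

```python
import math


def standard_error(sigma, n):
    """Standard error of a sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


# Hypothetical: individual scores vary with a standard deviation of 100 points.
small_school = standard_error(100.0, 100)   # school with 100 students
large_school = standard_error(100.0, 2500)  # school with 2,500 students

print(small_school, large_school)
```

With these made-up numbers, the small school's average bounces around five times as much as the large school's, which is why small schools crowd both ends of any ranking.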

File this as another comparability problem.  Because estimates based on smaller samples are less reliable, one must take extra care when comparing small samples to large samples.

Dr. Wainer is publishing a new book next year, called "The Second Watch: navigating the uncertain world".  I'm eagerly looking forward to it.  His previous books, Graphic Discovery and Visual Revelations, are both part of the Junk Charts collection.

Sources: "The Most Dangerous Equation", American Scientist, November 2007; "Small Schools Are Ahead in Graduation", New York Times, June 30 2007.


P.S. Referring back to the NYT chart above, one might wonder at the impossible feat of raising graduation rates across the board simply by breaking up large schools into smaller ones.  This topic was taken up here, here and here.  When evaluating the "small schools" policy, it is a mistake to discuss only the performance of small schools; any responsible analysis must look at improvement over all schools.  Otherwise, it's a simple matter of letting small schools skim off the cream from larger schools.

 

Oct 15, 2007

Sense of proportion

[I'm back from vacation.  Will provide my reaction to the responses to the Gelman challenge, and for those who have sent me email, I will work through them soon.]

The NYT commented on a trend among marketers to shift their advertising spending from so-called "measured" media like print and TV to so-called "unmeasured" media like product placements, contests, etc. 
The following chart accompanied the article:

Nyt_ads_2


This construct is akin to a population pyramid; it's great for comparing two groups along one metric, say age groups between males and females.  Here, the two halves aren't comparable groups but two different metrics.  The main metric, that is, the proportion of unmeasured, is not directly depicted: the reader must figure out mentally how much of each bar the black part covers.  Also, the companies are sorted by unmeasured media spending but this leaves the measured spending with a jagged profile, confusing matters.

As for the little white slits on the gray bars, they are admittedly cute but it is difficult to compare the detailed breakdown between print, TV and other media among companies.

Redoads1

The following dot plot gives the two halves equal weight.  (Pink dots are measured, blue unmeasured.)  It's not a very interesting graphic though.  The sense of proportion is still missing.

I settled on a scatter plot which relates the proportion spent on unmeasured to the total amount of spending.  It appears that the largest advertisers had the lowest proportional unmeasured spend while the smallest (among the majors) had the highest.  (It's only a weak correlation: a linear fit yields only 16% R-squared.)
Redoads2

Source: "The New Advertising Outlet: Your Life", New York Times, Oct 14, 2007.

Jul 29, 2007

Transgender trends

One of the many gratifications of blogging is to connect with others who have similar interests; so it has been fantastic to receive user submissions (though admittedly I don't check my inbox frequently enough).  The thoughtfulness of these nominations continues to impress me.

Jordanv31970200528yrs_2

Evan sent in 254 charts he created after looking at the post on baby names.  An example is shown on the right. 

He is particularly interested in the question of names that are given to both males and females. 

For example, the bottom chart shows that Jordan is primarily a male name, and saw a period of growth followed by decline, although the decline has been more severe on the male side than the female side. 

It's a nice touch to label the most recent year.  I'd also label the values for the most recent year on the axes.

Evan also offers the following solution to the scaling problem we identified in the original WSJ chart:

My solution was just to put two charts on each chart. One at a fixed scale for every chart to give a sense of size and one at a variable scale to better show the shape of the plot.

In other words, for less popular names, the top chart would look much more compressed.

There are many more charts to sift through on his site.  Evan welcomes suggestions.
