May 06, 2008

Turning in his grave 1

(Thanks to reader Josh R. for the tip.)  The "plucky statisticians" at Urbanspoon decided to tackle the political hot potato: is Barack Obama an elitist?  Scratch that -- what they actually did was to determine whether Obama supporters were elitists (in which case Obama would be one too, through guilt by association).  Scratch that -- what they actually analyzed was whether there tended to be more Starbucks per capita in those states in which Obama won Democratic primaries.

Suffice it to say, even if it can be proven that most states with high densities of Starbucks are more likely to have more Democratic primary voters who prefer Obama to Clinton, it is a far cry from proving Obama an elitist.  However, we take the leap of faith and look at the evidence presented to us.

Blog_obamaelite

The star witness was this chart plotting the "vote spread" of Obama minus Clinton and the per-capita Starbucks density.  The black line was a linear fit to the Starbucks data as shown in green dots.  Since the black and blue lines both pointed northeast roughly speaking, we were told: "States with more latte-purveying Starbucks stores are more likely to have gone for Obama."  (So Obama is indeed an elitist.)

To cover all bases, the creator of this chart suggested that "my statistics professor might be rolling over in his grave to hear me say it, but there's a mild but real correlation here!".

Mr. Urbanspoon, the statistics professor is here and he disapproves.  As discussed before (and here), plotting two series of data on the same chart and applying two different scales is a recipe for disaster.  Not reaching immediately for the scatter plot when one has two data series is another serious misstep.  (Indeed, Josh sent the link in with a note wondering why "people dislike scatter plots so much".)  So here is the appropriate graphic:

A quick first glance at the left chart indicates that any correlation, if it exists, is very weak indeed.  A simple linear regression analysis shows that Starbucks density explains only 14% of the variability in vote spread.  Note especially the wide dispersion of dots around the line.  Further, for the vast majority of the states (say those with vote spread between -20% and 40%), there appears to be no correlation.  This is seen on the right chart.
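For readers who want to check a figure like the 14%, here is a minimal sketch of how R-squared falls out of a simple linear regression.  The function name `r_squared` and the toy numbers are illustrative assumptions only; they are not the state-level data behind the charts.

```python
from statistics import mean


def r_squared(x, y):
    """Share of the variability in y explained by a least-squares line on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual variation left after fitting the line, versus total variation.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up toy numbers, NOT the Starbucks or primary data from the post.
toy_density = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_spread = [2.0, -1.0, 4.0, 0.0, 5.0]
print(round(r_squared(toy_density, toy_spread), 3))
```

Dropping a handful of points from both lists and calling the function again is one way to see how much a few influential states can drive the fit.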

Redo_obamaelitist

To the extent that there is a linear correlation, the highlighted points (orange dots) would be most influential.  The top cluster included Alaska, Kansas, DC, Hawaii and Idaho, in which Obama had a large winning margin while the Starbucks density was above average.  The bottom cluster included Arkansas and Oklahoma, where Obama was wiped out and where Starbucks had the lowest density.  These two clusters alone explained the mild relationship; removing them wiped it out.

Redo_obamaelitist2

Following Nyhan, we should remove some obvious outliers, such as Arkansas, Illinois and New York (home states), Michigan and Florida (disputed) and New Hampshire and Iowa (Edwards territory).  The result is again a mild correlation (R-sq = 0.075).


Till next post, when the professor rolls over again ...


 

Notice that I prefer the number of people per Starbucks metric, as opposed to the number of Starbucks per thousand people (See prior discussion on Gelman's blog.)  The reason is that every number on the former metric is reality-based while the latter metric produces imaginary numbers for small states, i.e. the imputed number of Starbucks is smaller than what actually exists!

Also note that I used a renormalized vote spread so that the Obama proportion and the Clinton proportion added up to 100%.  This made the assumption that Edwards and other voters would split among Obama and Clinton in the same proportions as those who explicitly voted for the two frontrunners.
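The renormalization described above can be sketched in a few lines.  The function name `renormalized_spread` and the example percentages are hypothetical; the only thing taken from the post is the idea of rescaling the two frontrunners' shares so that they add up to 100%.

```python
def renormalized_spread(obama_pct, clinton_pct):
    """Obama-minus-Clinton spread after rescaling the two shares to sum to 100%.

    Votes for other candidates are implicitly split between the two
    frontrunners in proportion to the frontrunners' own shares.
    """
    total = obama_pct + clinton_pct
    obama_share = 100 * obama_pct / total
    clinton_share = 100 * clinton_pct / total
    return obama_share - clinton_share


# Hypothetical state: Obama 60%, Clinton 20%, everyone else 20%.
# Raw spread is 40 points; the renormalized shares are 75% and 25%.
print(renormalized_spread(60, 20))
```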

Apr 08, 2008

Pick-and-choose

Gelman pointed to this Brendan Nyhan post dissecting David Sirota's chart purportedly showing a "race chasm" in the Democratic primaries.  The left chart is David's original and the right is a Nyhan revision.
Sirota

Please see Nyhan for the political interpretation.  Here, I want to note a number of improvements Brendan made to the chart:

  • Sirota plotted the ranks of the percent of black population, which is misleading.  Nyhan plotted the actual percentages on his horizontal axis.
  • Sirota connected the dots which highlighted the noise (ups and downs) in the data.  Nyhan fitted a linear model (he also tried other non-linear versions).
  • Sirota plotted Obama's overall margin of win/loss.  Nyhan plotted his margin among white voters only, which more directly addressed the issue.
  • Nyhan exposed the excluded states in a footnote.  Sirota didn't.  For this chart, this piece of information is very important since so many states were excluded.

Nyhan walked us through multiple charts he used to explore the data.  Much of the time was spent picking and choosing states to include or exclude.  We learnt that Sirota excluded states with large Hispanic populations, which Nyhan disagreed with; that Nyhan wanted to exclude Florida, which Sirota decided against; that Sirota excluded Michigan, to which Nyhan consented; that Nyhan also wanted to exclude the caucus states; and so on...

Judging from the charts, this picking and choosing appears not to have changed the outcome in this case.  In general, one should exercise great care in such decisions because one might end up seeing what one wants to see.

The following chart is missing from the post, which I think points out something more telling than the negative correlation between Obama's margin with white voters and the proportion of black population.

Sirota2




Mar 28, 2008

Two books

Nathan from FlowingData announces a competition to win Tufte's classic book on visual representation of data.   There are still a few days left to participate.  While his more recent books start getting repetitive, he still has published one of the most accessible books on this topic.

I also had the pleasure of reading Naomi Robbins' Creating More Effective Graphs.  She adopts a cookbook format providing hints on graphs in one, two and more dimensions, scales, visual clarity and so on.  Since she has already read Cleveland, Tufte, etc., she manages to put all that learning inside one cover.  The page design - with half of every page blank - is refreshingly easy on the eyes.  Inclusion of examples is generous.

Let's review her point of view on some of the topics we discuss frequently on Junk Charts:

Starting axis at zero: she thinks "all bar charts must include zero.  However, the answer is not as clear for line charts or other charts for which we judge positions along a common scale." (p.240)

Jittering: she does not provide a clear guideline but gives an example of a strip chart with jittered dots, commenting that "it gives a much better indication of the distributions than would a plot without jittering" (p.85), so I infer that she's generally in favor.

Parallel coordinates plot / profile plot: she provides an example of such a plot on p.141 and describes how to read such a plot.  Again, I infer she's in favor.

Mar 01, 2008

Don't believe what you see

Mankiw's blog linked to a press release by Congressman Jim Saxton, using CBO data to show "middle income tax burden at lowest level in decades".  The attached graph, as Junk Charts readers will immediately recognize, is classic chartjunk.  Every time the vertical axis does not start at zero, one suspects something is amiss.  And what is with the gridlines and data labels?

Cbo_taxrate

"Don't believe it? Check out the data source yourself."  I followed Mankiw's suggestion and was indeed surprised... but not by the great fortune of the "middle class".  The surprise was how the chart painted a dishonest picture of the CBO data.

The original chart plotted only the tax rate experienced by the middle 20% of the population.  The CBO provided data for all five quintiles; why not plot them all?  In this new chart (right), the "surprise" windfall to the middle 20% proved not to be anything special at all!  All five quintiles, especially the middle three, followed pretty much the same trend over time.  The effect of singling out the middle 20% is to strip away the context in which the data should be interpreted.

Redo_taxrate1

Further, what might be the result of the declining middle income tax burden?  The CBO data painted an unexpected picture.  Paradoxically, as the middle 20% see their tax rate decrease, they also earn a smaller share of the nation's after-tax income (black line at right).  At the same time, the top 1% saw their share of after-tax income double from about 8% to almost 16% (blue line).  The top 20% line is also upward-sloping although less pronounced.  So, the implication that the middle class have had it good is plainly wrong.

Redo_taxrate3

What is going on?  Two factors were at play and the Congressman presented only one side of the story (the tax rate).  What he omitted was that during this period, the nation's wealthy took home larger and larger shares of the pre-tax income.  This shift in pre-tax income more than offset any relative reduction in tax rate for the middle 20%.

This distortion can be traced back to the use of quintiles (or more generally, ranks).  We use them to cope with data having extreme distributions but a by-product is losing information about how extreme the extreme values are.  As demonstrated here, the quintiles of old are really different from the quintiles of today because the underlying distribution has become much more extreme.

Finally, another bit of mystery (to me) is how the middle 20% came to be considered "middle class".  Is there a widely accepted definition?

Reference: "CBO Data Show Middle Income Tax Burden At Lowest Level in Decades", Feb 21 2008.

Feb 25, 2008

Playful and exploratory

I share reader Bernard L.'s enthusiasm for this very imaginative chart, courtesy of the graphics people at NYT.  The chart captures the ebb and flow of weekly movie receipts over the last two decades.
Nyt_films
The details that particularly interest me include:

  • The addition of area colors (on top of lines) serves to highlight box office successes; this really helps readers sort out the massive amount of data
  • Nicely spaced text (and dots) does not interfere with our reading of the chart
  • The hiding of text for less important films, plus taking advantage of interactivity to show their titles if the reader mouses over the respective areas

All of the above indicate a keen sense of foreground versus background.  Besides, the authors had the good sense to speak of inflation-adjusted box office sales; I'm tired of the movie industry proclaiming higher sales each year when ticket prices are rising, and the population is growing.

This is another chart where more data do not easily translate into better communication (see my guest post at Flowing Data).  While I like the playful nature of the interactive chart, it is left to the reader to discover the information buried in the data, such as the assertion in the header that Oscar-winning films typically take time to attain box-office success while many blockbusters do not Oscars make.

In this presentation, it is challenging to compare the total receipts of one film versus another (this requires comparing oddly shaped, partially obscured areas).  It is also hard to compare across years since the data are spread out over a lot of space.

There may really be two types of graphics: the one like the example here which is a dictionary and designed for exploration; and the other kind where the designer has selected a subset of the data to make a specific point.

Reference: "The ebb and flow of movies", New York Times, Feb 23 2008.

Feb 03, 2008

Redundancy

Nick B., who occasionally writes about statistical graphics, found some classic chart junk from a Canadian report on the Afghan army.  Here's one example, together with the junkchart version.

Redoafghan_2

Redundancy is an enemy of good graphics, and incongruous redundancy is worse.  Here, troop level is variously described as "total force size", "strength" and "army growth"; the chart on the right uses only the army concept.  The data labels ("47000 Strength"), the axis labels ("50000 Total Force Size"), and the gridlines all germinate from the five grand data points underlying the entire chart!

Another distorting feature is the use of different-sized time intervals, which we space out appropriately on the right chart.

Ultimately, the key message should be growth in the army size, not the absolute number of troops.  The slopes of the line segments encode this information.  Alternatively, a data table can be rather powerful for simple data like this:

Redoafghan2

By what is called the "end state", there would be 70% more troops than those as of December 2007.

 


Jan 15, 2008

Water and wine

Marketers have always argued that price signals quality; this leads to the startling idea that one should just set a high price. 

If you don't believe it, note how Coca Cola and Pepsi turned tap water into a premium-priced $1.7 billion market.  As we now know, Dasani and Aquafina are just bottled tap water.

Wine_tasting

And if one could turn water into wine, researchers have now discovered that the same rule would apply.  Unlike most scholarly articles, theirs actually included a well-made chart to illustrate the experiment.

Testers were given the same wine but told either that it cost $10 or that it cost $90.  Their brain activity was measured.  The chart showed that those thinking it cost $90 (green line) had a much better sensation of the wine than those thinking it cost $10 (blue).

A standard way to display this information is a data table that spells out every estimate and its standard error, plus some asterisk or bolding scheme to indicate statistical significance.  Visualization is far superior.

For more examples, see Gelman's paper or Kastellec and Leoni's paper.

Reference: "Study: $90 wine tastes better than the same wine at $10", News.com, Jan 14, 2008.

Dec 09, 2007

Lacking buzz

Nielsen, they of the ratings, is roughing it in the information age.  When they announced on-line tracking tools, Wired quipped: "It's looking like online video policing companies will have to make room for another deputy."  Last year, cable companies revolted over a service measuring the effectiveness of commercials.

Via the Data Mining blog, I learnt about yet another new on-line offering, called "Hey! Nielsen" for obscure reasons.  (Perhaps Hey! Nielsen is the new Yahoo! !)

The site is an enigma wrapped in a mystery.  The official description says:

Hey! Nielsen is the place to make a name for yourself while trading opinions on TV, movies, music, personalities, web sites and more.

How does one "trade" opinions?

According to the FAQ, the "Hey! Nielsen" score, the cornerstone of the site, is:

a real-time indicator of a topic's impact and value and you play a major role. As the site evolves and users submit their opinions and commentary, the score will rise or fall based on a number of factors including, but not limited to, user opinions, news coverage, and raw data from our sister sites Billboard.com, HollywoodReporter.com, and BlogPulse.com.

Sounds like a product aimed at marketers to help them track public opinion but offering little control over sampling. 

The "Hey! Nielsen" buzz chart (below) captures the change in "Hey! Nielsen" score over time.

Heynielsen

This chart is an unfortunate case of flipping background into foreground.  What grabs our attention are those hideous white circles with numbers in them.  The legend explains that these are the daily numbers of opinions on the subject, in other words, the daily sample sizes.  As they stand now (with the site still in beta), they serve to expose the low level of participation, leading to small sample sizes, and irrelevance.  But what if the site became super-popular: would the circles say 56234, 19245, 90257, etc.?  Why would visitors care about daily sample sizes anyway?  Mousing over these circles reveals text but in most cases, it is blocked by neighboring white circles.

In the meantime, the circles obscure the line which shows the trend in the "Hey! Nielsen" score over time.  This chart reminds me of that Google toy known as Google Trends.  The Googlers provide no vertical scale so the graphs are unreadable.  "Hey! Nielsen"ers provide a vertical scale -- kind of -- but the graphs are still meaningless: what does a score of 881 mean?  how about 724?  what is the maximum score?  what is the minimum?  Beware numbers without context.

The vertical axis does start from zero but has an odd spacing of tick labels. The gridlines are distracting and serve no purpose.  The orange area under the curve also makes little sense.

We look forward to seeing version 2.0.

 

Nov 06, 2007

The eyeball test

This set of graphs was used by the NYT to discuss changes in U.S.  spending patterns over time.  For this post, I am focusing on the bottom left and bottom right graphs.  One shows spending on energy as a percent of GDP; the other, on "nonresidential structures" (aka, commercial buildings).

Nyt_spending

At first glance, spending on energy and that on commercial buildings look very similar in shape (see above or below left).  Alas, this "eyeball test" doesn't work very well with time series data.  Let's investigate further.

Redospend1_2

"Standardizing" the data (above right) tells us whether the swings are unusual or not in the history of the data.  So in the 1980s, commercial building spend spiked to more than three standard deviations above the historical average.  Generally speaking, a standardized value of 3 is taken to mean highly unusual.

Notice that the peaks of the left graph had equal heights but on the right graph, energy spending peaked only above two while commercial building spend rose above three.  This is because energy spending has been more volatile historically so it takes larger jumps (or plunges) to count as "unusual" movements.  This information is hidden in the unstandardized version.
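Standardizing is a small computation: subtract the historical average and divide by the standard deviation.  The sketch below is one way to do it; the function name `standardize` and the toy series are assumptions, and since the post does not say whether the population or sample standard deviation was used, the population version is assumed here.

```python
from statistics import mean, pstdev


def standardize(series):
    """Convert each value to 'standard deviations away from the average'."""
    mu = mean(series)
    sigma = pstdev(series)  # population standard deviation (an assumption)
    return [(value - mu) / sigma for value in series]


# Toy series, NOT the spending data: average is 5, standard deviation is 2.
toy_series = [2, 4, 4, 4, 5, 5, 7, 9]
print(standardize(toy_series))
```

A more volatile series has a larger standard deviation, so the same raw jump translates into a smaller standardized value.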

Further, since we are concerned with long-term trends, let's take a look at five-year moving averages (below right): in other words, each time point is the average of the preceding five years' worth of data.
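A trailing moving average of this kind can be sketched as follows.  The function name `trailing_average` is hypothetical, and the sketch assumes that each window ends at the current year and includes it; the post does not spell out that detail.

```python
def trailing_average(values, window=5):
    """Average of each value together with the (window - 1) values before it.

    The first (window - 1) points have no full window, so they are dropped.
    """
    averages = []
    for i in range(window - 1, len(values)):
        chunk = values[i - window + 1 : i + 1]
        averages.append(sum(chunk) / window)
    return averages


# Toy yearly values, NOT the spending data from the chart.
print(trailing_average([1, 2, 3, 4, 5, 6]))
```

The output is shorter than the input by four points, which is why a smoothed line starts later than the raw one.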

Redospend2

The fluctuations have been smoothed out and the peaks are no longer as high.  Glancing at this chart, we may still conclude that the spending patterns are quite similar -- especially in the period prior to 1995.

But is that really the case?  Zooming in on the 1980s, we may mistakenly think the two lines are "close together" if our eyes read the horizontal distance and/or area between the curves, rather than focusing on the vertical distance.  The arrows on the bottom left chart depict this difference.  To make things clearer, the bottom right chart plots the vertical distances between the two lines.

Redospend3

Observe that the difference expanded to above 1 unit in the late 1980s.  A difference of one unit is very large in the standardized scale (of "unusualness") since 0 is business as usual and 3 is "highly unusual".

Eyeballing the two time series would lead us to believe that the two series are similar but we run the risk of underestimating the differences as illustrated here.


Source: "Auto Sector's Role Dwindles, and Spending Suffers", New York Times, Nov 3 2007.

Oct 17, 2007

Points of comparison

Econ_mortgage

In light of the current housing crisis, arising from mortgage defaults, I pulled this graphic from a Jan 2007 opinion piece that plotted historical default rates of mortgages.  Notice the high degree of stretching on the vertical axis that exaggerates the volatility: essentially, the annual delinquency rate ranged from 1.75% to 2.65% during the last six years or so.  One might be forgiven for thinking that a 2% default rate is quite acceptable.

Nyt_mortgage_2

Compare the above chart to the pair that showed up in the NYT in Oct 2007 (see right).  The default rates here are in the 10-20% range, very alarming indeed.

The two graphics illustrate a key issue of "aggregation" in statistical analysis.  The first graphic is super-aggregated: all types of mortgages of all ages are put together to calculate each year's default rate.  The second graphic homes in on subprime mortgages only.

More importantly, the second graphic presents data in "vintages".  Each line represents loans originated during a particular year (a "vintage").  This establishes comparability.  On the first chart, each point in time represents the default rate of mortgages averaged over all ages (some loans may be only a few months old; others may be 15 years old).  Since the default rate is much higher for very young mortgages than for older mortgages, such averaging hides crucial information.
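To see how averaging over loan ages can hide what is happening to the newest loans, consider a minimal sketch.  Every number below is invented for illustration, and the function name `pooled_default_rate` is hypothetical; nothing here comes from the two graphics.

```python
def pooled_default_rate(groups):
    """Overall default rate when loans of all ages are lumped together.

    groups: list of (number_of_loans, number_of_defaults), one entry per age band.
    """
    total_loans = sum(loans for loans, _ in groups)
    total_defaults = sum(defaults for _, defaults in groups)
    return total_defaults / total_loans


# Invented portfolio: a small batch of young loans defaulting at 10%,
# and a large batch of seasoned loans defaulting at 1%.
young_loans = (1_000, 100)
old_loans = (9_000, 90)

print(pooled_default_rate([young_loans, old_loans]))  # pooled rate across both batches
print(pooled_default_rate([young_loans]))             # rate for the young batch alone
```

The pooled figure sits close to the seasoned-loan rate because seasoned loans dominate the count, which is the sense in which the aggregated chart can look calm while a recent vintage is in trouble.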

Overall, the NYT graphic very effectively conveys the alarming trend of new mortgages performing much worse, especially those originated in 2007.

Redo_mortgage

The NYT graphic could benefit from two slight edits: adding a few more years, and using vertical lines (the most critical comparisons are default rates for loans of a given age!).  Something like this...


Sources: "As Defaults Rise, Washington Worries", New York Times, Oct 16 2007; "Mounting Mortgage Credit Problems", economy.com, Jan 23 2007.
