Oct 21, 2008

Mind the gap

When comparing two time series, one typically wants to discuss the size of the gap as it changes over time.  This Business Week chart, for example, depicted for readers the expanding gap between intra-day high and low prices of the S&P 500 for 2008.

Bw_SandPHiLow
This chart construct is effective at pointing out large changes but lacks precision in conveying smaller differences, or trends.  It is always a good idea to plot the gap directly, as we will show below.

Redo_SandPHiLow
More importantly, a better choice of scale can help a lot.  By focusing exclusively on variability (extreme values), this chart hides the relevant information of the closing prices of the S&P.  A point spread of 100 points means more when the index is at 800 than at 1200.  To capture this, we can divide the point spread by that day's opening price, so that the gap is expressed as one-eighth or one-twelfth of the opening price.
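The rescaling can be sketched in a few lines.  This is only an illustration: the function name `relative_spread` and the prices are made up to mirror the 800 versus 1200 comparison, and are not taken from the Business Week data.

```python
def relative_spread(high, low, open_price):
    """Return the intra-day high-low gap as a fraction of the opening price."""
    if open_price <= 0:
        raise ValueError("opening price must be positive")
    return (high - low) / open_price


# Hypothetical prices: the same 100-point spread at two index levels.
print(relative_spread(850.0, 750.0, 800.0))     # 0.125, i.e. one-eighth
print(relative_spread(1250.0, 1150.0, 1200.0))  # roughly 0.083, i.e. one-twelfth
```

The same point spread produces a visibly larger relative gap when the index is lower, which is the effect the rescaled chart is meant to bring out.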

The junkart version makes both changes.  The top chart fixes the scale, plotting the point spread as a percentage of daily opening prices.  Relative to the original chart, the variability in the front part of 2008 was muted because the index was at higher levels back then. 

The bottom chart plots the gap sizes (lengths of the high-low lines).  Without doubt, directly plotting the gaps showcases the key message: the current level of volatility is more than double what occurred at the beginning of the year.

If one wants to illuminate the trend as opposed to daily fluctuations, a further improvement would be to use moving averages.
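For readers who want to try the smoothing, here is a minimal sketch of a trailing simple moving average.  The window length and the input numbers are placeholders, not values used in any of the charts.

```python
def moving_average(values, window):
    """Trailing simple moving average.

    Returns len(values) - window + 1 points; the first point averages
    values[0:window], the next averages values[1:window + 1], and so on.
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    total = sum(values[:window])
    smoothed = [total / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest.
        total += values[i] - values[i - window]
        smoothed.append(total / window)
    return smoothed


# Made-up daily gap percentages, for illustration only.
daily_gap_pct = [1.0, 3.0, 2.0, 6.0, 4.0, 8.0]
print(moving_average(daily_gap_pct, 3))
```

A longer window gives a smoother line at the cost of reacting more slowly to a genuine change in volatility.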

For those interested, shown below is a scatter plot that compares the original point spread and the derived point spread, which shows that the change is not trivial.


Redo_SandPHiLow2 


Reference: "The Market: A Daily Roller Coaster", Business Week, Oct 27 2008.

Jul 07, 2008

Divided nation

Professor Gelman generally believes the red state, blue state paradigm is too simplistic to describe the American electorate.  He has been sharing some of his work on his blog, and has just published a book about this topic.  Recently he produced the following chart, which is gimmicky-looking but crystal clear in its message.

Gelman_redblue

Here, economic and social ideology are plotted on a scatter chart, with positive values indicating conservatism and negative values liberalism.  Further, each state is represented twice on the chart, the red point for the Republicans and the blue for Democrats within the state.

This is a cluster analyst's dream data set.  The absolute separation of the Republican cluster and the Democrat cluster is astounding: imagine a diagonal line perfectly classifying all points.

We should not miss a host of details:

  • as Andrew pointed out, "the big thing we see from the graph ... is that Democrats are much more liberal than Republicans on the economic dimension: Democrats in the most conservative states are still much more liberal than Republicans in even the most liberal states."  This is clear from the wide gap on the horizontal axis.
  • there is a small degree of overlap on the social ideology axis so the nation is closer together on that front.
  • but wait a minute, the scale on the social axis is not the same as that on the economic axis.  This means that the extremes are more extreme on the social axis: the difference between MS and VT is roughly 0.8 on the social scale while the largest difference on the economic scale is roughly 0.5.  (here, I am assuming that the scales are comparable to each other)
  • there is high correlation between social and economic ideologies: the points are well-aligned along the 45-degree line
  • especially on social issues, the Democrats are divided within (the elongated shape of the blue cluster).

Reference: Gelman, "Ranking states by conservatism/liberalism of their voters", June 30 2008.

Jun 30, 2008

A splitting headache

Fry_baseballsalary
Todd B didn't like this chart showing the correlation between baseball team salaries and their win-loss records.

A few problems are in plain sight:

  • Most importantly, putting a second set of logos next to the salaries column would really help
  • Unclear why the lines should be of varying widths
  • Winning percentage is more telling than win-loss, especially in the middle of a season when there is a slight imbalance in total games played
  • the spread of salaries is so wide (10 times) that reducing the numerical scale to rank scale meant a big loss of information
  • Each column is sorted by its own metric while the most important sorting variable should be the slope of the lines (i.e. the cost per win)
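The information lost by moving from dollars to ranks can be seen in a small sketch.  The payroll figures below are invented (in $ millions) and the helper `rank_descending` is hypothetical; ties are not handled.

```python
def rank_descending(values):
    """Return the rank of each value, with 1 for the largest (ties not handled)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


# Invented payrolls in $ millions: one team spends ten times the smallest payroll.
payrolls = [20.0, 200.0, 25.0, 22.0]
print(rank_descending(payrolls))  # [4, 1, 2, 3]
```

On the rank scale, the step from first to second looks the same as the step from second to third, even though the first is a gap of 175 and the second a gap of 3 in this made-up example.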


The interactive feature of individual plots for each day (control bar at the top) of the baseball season is something of a gimmick.  Props though for realizing that the first few days of the season don't tell us anything.  There really is little use for investigating this correlation on a day-by-day basis.  Particularly when the salaries are given in aggregate.

On the diagram, the blue lines represent teams such as the Devil Rays and Arizona that had better winning records than their salaries would suggest.  Red lines display those teams spending more money than their records would suggest.  The steeper the line, the better (blue) or worse (red) the team's cost efficiency.

With so many long steep lines in both colors (directions), one might posit that a negative correlation may exist between salary level and winning record. 

The following scatter plot suggests otherwise:

Redo_baseballsalary
The correlation between salary and winning is very weak.  If one were to fit a linear model, it would show that the higher-salaried teams generally were doing slightly better (black line).  The Yankees were sufficiently outside the range in salaries that I didn't include them in estimating the line.  (However, as the chart shows, the line in fact estimated the Yankees' winning percentage really well.)

Teams above the line are performing better than their salaries would lead us to believe. 
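A rough sketch of this kind of analysis follows: fit a least-squares line on every team except one salary outlier, then see what the line predicts for that outlier and which teams sit above the line.  The team labels and numbers are invented for illustration and are not the actual 2008 salaries or records.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented data: (payroll in $ millions, winning percentage).
teams = {"A": (40.0, 0.50), "B": (60.0, 0.48), "C": (80.0, 0.54), "D": (100.0, 0.52)}
outlier = (200.0, 0.60)  # a payroll far outside the range of the others

slope, intercept = fit_line([p for p, _ in teams.values()],
                            [w for _, w in teams.values()])

# Prediction for the team that was left out of the fit.
predicted_outlier = intercept + slope * outlier[0]

# Teams above the line are doing better than their payroll would suggest.
above_line = [name for name, (p, w) in teams.items() if w > intercept + slope * p]
print(round(predicted_outlier, 3), above_line)
```

Leaving the outlier out of the fit keeps a single extreme payroll from dictating the slope; comparing its actual record with the prediction then serves as a sanity check on the line.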



Reference: Ben Fry's baseball salary page

Jun 12, 2008

Rise and fall

Via Adam came this "colorful" chart of the rise and fall of house prices since 2000, as measured by the Case-Shiller index.  He commented that this showed the old saw "the taller they are, the harder they fall".

Houseprices

A different chart allows us to test this theory directly.  From the above, we noted that each curve was composed of two phases, a long rise from 2000 to roughly mid-2000s followed by a steep decline.  We computed two data series: the average monthly growth rate during the inflation phase and the average monthly decline during the deflation phase.  The scatter plot showed the correlation. 

Redo_houseprice

The dots displayed pretty strong correlation, confirming that on average, the faster they rise, the steeper they fall.

The diagonal line indicated equal rates of growth and subsequent decline.  The cities above the line, especially Boston and New York, have witnessed declines that were much slower than the earlier rises.  On the other end, cities like Detroit, Cleveland, Atlanta and Dallas suffered price deflation much faster than earlier inflation.  Indeed, the ratio of decline to rise rates is given by the slope from the origin to the dot.
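For the curious, one way to compute the two rates is sketched below.  It assumes the series is split at its peak and that "average monthly rate" means a geometric (compound) average; the original calculation may have been done differently, and the index values are made up.

```python
def phase_rates(index_values):
    """Split a monthly index series at its peak.

    Returns (rise_rate, decline_rate): the compound average monthly growth
    before the peak and the compound average monthly decline after it,
    both expressed as positive fractions.
    """
    peak = max(range(len(index_values)), key=lambda i: index_values[i])
    if peak == 0 or peak == len(index_values) - 1:
        raise ValueError("series needs both a rise and a decline phase")
    months_up = peak
    months_down = len(index_values) - 1 - peak
    rise = (index_values[peak] / index_values[0]) ** (1.0 / months_up) - 1.0
    decline = 1.0 - (index_values[-1] / index_values[peak]) ** (1.0 / months_down)
    return rise, decline


# Made-up index: up 10% a month for two months, then down 10% in one month.
rise, decline = phase_rates([100.0, 110.0, 121.0, 108.9])
print(round(rise, 4), round(decline, 4), round(decline / rise, 4))
```

The last number printed is the ratio of decline to rise rates, which corresponds to the slope from the origin to a city's dot; a value of 1 puts the city on the diagonal.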

As for the original chart, it showed all the signs of Excel defaults.  It just does not make sense for a charting program to pick a different color for each time series, no matter how many there are.  Beyond four or five colors, it is impossible for readers to tell the lines apart.  In these situations, we should adopt a foreground / background strategy: decide on the key lines, highlight those with color, gray out the remaining lines.
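The foreground / background strategy boils down to a small color-assignment rule.  The sketch below is independent of any charting library; the function name, hex colors and the choice of highlighted city are all illustrative assumptions.

```python
def assign_colors(series_names, key_series,
                  palette=("#d62728", "#1f77b4", "#2ca02c", "#ff7f0e"),
                  background="#bbbbbb"):
    """Give each key series its own color and gray out everything else."""
    if len(key_series) > len(palette):
        raise ValueError("too many highlighted series for readers to tell apart")
    key_colors = dict(zip(key_series, palette))
    return {name: key_colors.get(name, background) for name in series_names}


# Highlight one city and push the rest into the background.
colors = assign_colors(["Boston", "Detroit", "Dallas"], ["Detroit"])
print(colors)
```

Capping the number of highlighted lines at the palette size enforces the four-or-five color limit instead of leaving it to the software's defaults.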


Reference: Standard & Poor's

May 12, 2008

A matter of timing 2

Our last post generated much discussion around double axes.  In this post, we take up Michael's suggestion of a scatter plot, and several suggestions to retain the original units.

The scatter plot in this case did not provide any insight, unfortunately.  See below.  It just highlighted the jerkiness in the data so we ended with much zig-zagging.
Redo_bikecrash3


Retaining the original units is not advisable because those units were not comparable.  In the following caricature, we show how to shape the axis to tell any story we want.
Redo_bikecrash4

Panel plots are slightly better insofar as such mischief could be spotted by the amount of white space.

Another way to make the two data series comparable is to plot percentage change from year to year.  This is similar to indexing; the only difference is that it shows annual change rather than cumulative change.
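The two transformations can be put side by side in a short sketch.  The function names and the three-year series are made up for illustration.

```python
def yoy_pct_change(values):
    """Percentage change from each year to the next (annual change)."""
    return [100.0 * (curr - prev) / prev for prev, curr in zip(values, values[1:])]


def index_to_base(values, base=100.0):
    """Rescale a series so its first value equals `base` (cumulative change)."""
    return [base * v / values[0] for v in values]


# Made-up annual counts in arbitrary units.
series = [200.0, 220.0, 198.0]
print(yoy_pct_change(series))  # [10.0, -10.0]
print(index_to_base(series))   # [100.0, 110.0, 99.0]
```

Both outputs are unit-free, so two series measured in incomparable units can share one axis; the first shows each year's move, the second shows where the series stands relative to its starting year.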


May 06, 2008

Turning in his grave 1

(Thanks to reader Josh R. for the tip.)  The "plucky statisticians" at Urbanspoon decided to tackle the political hot potato: is Barack Obama an elitist?  Scratch that -- what they actually did was to determine if Obama supporters were elitists (of course, Obama would then be, due to guilt by association.)  Scratch that -- what they actually analyzed was if there tended to be more Starbucks per capita in those states in which Obama won Democratic primaries.

Suffice it to say, even if it can be proven that most states with high densities of Starbucks are more likely to have more Democratic primary voters who prefer Obama to Clinton, it is a far cry from proving Obama an elitist.  However, we take the leap of faith and look at the evidence presented to us.

Blog_obamaelite
The star witness was this chart plotting the "vote spread" of Obama minus Clinton and the per-capita Starbucks density.  The black line was a linear fit to the Starbucks data as shown in green dots.  Since the black and blue lines both pointed northeast roughly speaking, we were told: "States with more latte-purveying Starbucks stores are more likely to have gone for Obama."  (So Obama is indeed an elitist.)

To cover all bases, the creator of this chart suggested that "my statistics professor might be rolling over in his grave to hear me say it, but there's a mild but real correlation here!".

Mr. Urbanspoon, the statistics professor is here and he disapproves.  As discussed before (and here), plotting two series of data on the same chart and applying two different scales is a recipe for disaster.  Not reaching immediately for the scatter plot when one has two data series is another serious misstep.  (Indeed, Josh sent the link in with a note wondering why "people dislike scatter plots so much".)  So here is the appropriate graphic:

A quick first glance at the left chart indicates that any correlation, if it exists, is very weak indeed.  A simple linear regression analysis shows that Starbucks density explains only 14% of the variability in vote spread.  Note especially the wide dispersion of dots around the line.  Further, for the vast majority of the states (say those with vote spread between -20% and 40%), there appears to be no correlation.  This is seen on the right chart.

Redo_obamaelitist

To the extent that there is a linear correlation, the points (orange dots) would be most influential.  The top cluster included Alaska, Kansas, DC, Hawaii and Idaho in which Obama had a large winning margin while the Starbucks density was above average.  The bottom cluster included Arkansas and Oklahoma where Obama was wiped out and where Starbucks had the lowest density.  These two clusters alone explained the mild relationship; removing them wiped it out.

Redo_obamaelitist2
Following Nyhan, we should remove some obvious outliers, such as Arkansas, Illinois and New York (home states), Michigan and Florida (disputed) and New Hampshire and Iowa (Edwards territory).  The result is also mild correlation (R-sq = 0.075).
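To see how a handful of points can manufacture a correlation, here is a small sketch that computes R-squared for a simple linear regression with and without one influential point.  The numbers are invented to make the effect obvious; they are not the Starbucks or primary data.

```python
def r_squared(xs, ys):
    """Share of the variance in ys explained by a straight-line fit on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("both variables need some variation")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)


# Invented "core" points with no linear relationship at all.
core_x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
core_y = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

# One invented extreme point, far from the rest on both axes.
all_x = core_x + [10.0]
all_y = core_y + [10.0]

r2_core = r_squared(core_x, core_y)
r2_all = r_squared(all_x, all_y)
print(round(r2_core, 3), round(r2_all, 3))
```

In this toy case the fit looks strong only because of the single extreme point; dropping it leaves nothing, which is the pattern to watch for when a few states sit far from the pack.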


Till next post, when the professor rolls over again ...


 

Notice that I prefer the number of people per Starbucks metric, as opposed to the number of Starbucks per thousand people (See prior discussion on Gelman's blog.)  The reason is that every number on the former metric is reality-based while the latter metric produces imaginary numbers for small states, i.e. the imputed number of Starbucks is smaller than what actually exists!

Also note that I used a renormalized vote spread so that the Obama proportion and the Clinton proportion added up to 100%.  This made the assumption that Edwards and other voters would split among Obama and Clinton in the same proportions as those who explicitly voted for the two frontrunners.
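The renormalization amounts to the following; the function name and the vote shares are hypothetical.

```python
def renormalized_spread(obama_pct, clinton_pct):
    """Obama-minus-Clinton spread after rescaling the two shares to sum to 100%."""
    total = obama_pct + clinton_pct
    if total <= 0:
        raise ValueError("the two candidates need a positive combined share")
    return 100.0 * (obama_pct - clinton_pct) / total


# Hypothetical state: Obama 45%, Clinton 30%, everyone else 25%.
print(renormalized_spread(45.0, 30.0))  # 20.0, versus a raw spread of 15 points
```

Dividing by the two-candidate total is what encodes the assumption that the remaining voters would split in the same proportions as those who voted for the two frontrunners.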

May 05, 2008

Turning the table

Nyt_runningbacks
We recently showed an example of when data tables worked well to clarify the data.  Last week, there was an example from the Times which did the opposite.

The accompanying article boldly claimed that

the 40-yard dash stands above them all as having the strongest correlation to success in the NFL.  The three-cone drill, the shuttle run, the bench press -- none correlate to NFL success.  The 40 is king.

Further, it cited Bill Barnwell from FootballOutsiders.com who created an "index" using both 40 time and body weight that is "an even better predictor than 40 time alone".  In other words, this formula Nyt_runningback_eqt

does the trick.

The data table, shown above, presumably clinched the case.

Redo_runningback1
We were mystified when we put the data to the test, however.  Among the set of 15 running backs, the Index did not predict the Yards Per Carry at all!  The Index explained only 8% of the variation in Yards Per Carry between the backs.

The data table obscures this bivariate relationship.  As it was sorted by the Index, we would look for the column showing Yards Per Carry to be naturally sorted in the same order.  But it is hard to tell the trend from the noise in a table.

What went wrong?  It turned out neither 40 Time nor Body Weight had any relationship with Yards Per Carry.

Redo_runningback2

These variables did not explain the range of Yards Per Carry attained by this set of running backs.

Redo_runningback3
Finally, we found strong correlation between 40 Time and Body Weight.  (The heavier you are, the slower you run!)  This meant that both variables contained similar information, so any formula combining the two would be unlikely to perform significantly better than either variable alone.
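A quick way to check whether two predictors carry overlapping information is to compute their correlation before combining them.  The sketch below uses invented weights and 40 times, not the players in the Times table.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("both variables need some variation")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Invented body weights (lb) and 40-yard times (s), for illustration only.
weights = [190.0, 200.0, 210.0, 220.0, 230.0]
forty_times = [4.35, 4.40, 4.42, 4.50, 4.55]

r_predictors = pearson_r(weights, forty_times)
print(round(r_predictors, 2))
```

When the predictors are this closely related, an index built from both has little new information to add beyond what either one already provides.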

So we are left to turn the table on the table.  More pertinent evidence is needed to prove the case.

The entire analysis suffers from survivorship bias as only the top running backs are examined, and no adjustment is made to deal with wide-ranging tenures.  Apparently, there is more data available in a book.  There is no indication of how the model shown above was validated.

Reference: "The Race of Truth: 40-Yard Times Can Tell the Future", New York Times, April 27, 2008.

 

Apr 08, 2008

Pick-and-choose

Gelman pointed to this Brendan Nyhan post dissecting David Sirota's chart purportedly showing a "race chasm" in the Democratic primaries.  The left chart is David's original and the right is a Nyhan revision.
Sirota

Please see Nyhan for the political interpretation.  Here, I want to note a number of improvements Brendan made to the chart:

  • Sirota plotted the ranks of the percent of black population, which is misleading.  Nyhan plotted the actual percentages on his horizontal axis
  • Sirota connected the dots which highlighted the noise (ups and downs) in the data.  Nyhan fitted a linear model (he also tried other non-linear versions).
  • Sirota plotted Obama's overall margin of win/loss.  Nyhan plotted his margin among white voters only, which more directly addressed the issue.
  • Nyhan exposed the excluded states in a footnote.  Sirota didn't.  For this chart, this piece of information is very important since so many states were excluded.

Nyhan walked us through multiple charts he used to explore the data.  Much of the time was spent picking and choosing states to include or exclude.  We learnt that Sirota excluded states with large Hispanic populations, which Nyhan disagreed with; Nyhan wanted to exclude Florida, which Sirota decided against; Sirota excluded Michigan, to which Nyhan consented; Nyhan also wanted to exclude the caucus states; and so on...

Judging from the charts, this picking and choosing appears not to have changed the outcome in this case.  In general, one should exercise great care in such decisions because one might end up seeing what one wants to see.

The following chart is missing from the post, which I think points out something more telling than the negative correlation between Obama's margin with white voters and the proportion of black population.

Sirota2




Mar 22, 2008

Trying too hard

In the course of business and governing, a lot of charts are generated.  An anonymous tipster pointed us to a set created by the "Communities and Local Government" division in the UK government.  Judging from the content, this division has responsibility for economic development in local neighborhoods.

Below is a pair of exhibits.  Truly they are trying too hard!  What we see is a hybrid scatter-bubble chart.  Between the jargon, the acronyms (LAD, LSOA), the boxed text, the multi-color circles, the colored axis labels and lack of title, the reader is plunged into a state of confusion.

Uk_communities3

The chart can be unraveled.  Each district was evaluated based on two measures of "gaps in worklessness".  The vertical axis compares each district to the national average; positive numbers indicate an above-average district relative to the nation.  The horizontal axis compares the most deprived 10% neighborhood within each district to the local average; positive numbers indicate worst neighborhoods improving. 

Thus, the policy goal would be to move all districts into the upper right quadrant.  The multi-color bubbles were designed to show us the state of the nation.  On the left chart, 41% of the districts (or population?) reside in the improving districts while 19% live in deteriorating areas.

The following strategies can help improve readability:

Redo_communities3
  • use English on the axis
  • relegate technical definitions to the legend
  • add succinct title to tell the story
  • use color on the data rather than on axis or data labels
  • use color to draw attention to the upper right quadrant
  • remove bubbles
  • define acronyms

 

Mar 05, 2008

Mid-week entertainment: Pity grapefruit

Courtesy of Derek.  Hope for the scatter plot?

Grapefruit_scatter

Original link here
