Dec 13, 2006

The trouble with percentages

In the aftermath of the Democratic victory in the 2006 mid-term election, the NYT published a column floating the idea that "it was the economy, stupid".  For statistics buffs, this column provides much food for thought. 

Suffice it to say, if you were my student, you would not want to hand this in as an essay.  To the author's credit, he did backload the article with lots of disclaimers.

The key thesis of the piece is:

if your state wasn’t among the best economic performers in the last six years, judged by the growth of personal income, it appears that you were three times as likely to vote to throw the bums out.

(We'll just assume he didn't mean "you" but "your state".) To help us understand the author's logic, I created a scatter plot, relating the change in state average personal income (2000-2006) to the change in percent of Republican seats.

He first segmented the states into two groups: the red dots had the top 10 income growth rates; the blue dots were the remaining states.  Then for each group, he computed the average drop in % Republican.  For the reds, it was 2%; for the blues, it was 7%.  (These levels are indicated by the horizontal lines.  My data are slightly different from his.)  Case proven -- with disclaimers.

Some of you are already counting the dots.  If you only find 42, you'd have counted correctly.  The following explanation provided by the analyst is classic:

It’s easier to answer this question if you leave out the six states that didn’t elect any Republicans in 2000; after all, they didn’t have any to throw out. If you also remove New Hampshire and South Dakota, where the percentage of Republicans elected dropped to 0 from 100 — New Hampshire only has two seats in the House and South Dakota has one — a pattern starts to appear.

I will leave the emergent pattern thesis to a future post.  For this post, I am interested in the trouble with percentages.  He is right to point out that for those 100% Blue states, the change in %Republican is constrained to be non-negative, from 0% up to 100%.  For most other states, the change can be positive or negative.

Good observation but wrong remedy -- those six states with 0% Republicans in 2000 are not special; removing them from the analysis is wrong-headed.  What about those states with 100% Republicans in 2000?  There, the change in %Republican can only be 0% or negative.  In fact, the possible range for the change in seats for each state is different, and it depends on the Republican proportion in 2000!  For example, if in 2000 the Republicans held 30% of the seats, then in 2006, the change must be between -30% and +70%.
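The constraint can be written compactly. As a sketch, let p_2000 and p_2006 be the Republican share of a state's seats in each year (symbols introduced here for illustration, not taken from the column). Since both shares lie between 0 and 1:

```latex
\[
  \Delta = p_{2006} - p_{2000},
  \qquad
  -p_{2000} \;\le\; \Delta \;\le\; 1 - p_{2000}.
\]
```

With p_2000 = 0.30, this gives the -30% to +70% range in the example above; with p_2000 = 0 or 1, it gives the one-sided ranges for the all-Blue and all-Red states.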

The situation is worse: the range of possible values also depends on the number of seats in each state.  The fewer total seats there are, the fewer possible values that can be taken.  As the author notes, with only 1 seat, you either lose it, gain it or retain it, so that the change will be either -100%, +100% or 0%.  No other values are possible!
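To see how coarse the percentage scale gets, here is a minimal Python sketch that lists every feasible change for a state. The function name and the assumption that a state's total number of seats is the same in both elections are mine, for illustration only.

```python
from fractions import Fraction


def possible_changes(total_seats, rep_seats_2000):
    """Return every feasible change in % Republican, in percentage points.

    Assumes (for simplicity) that the state has the same total number
    of House seats in both elections.
    """
    if total_seats < 1 or not 0 <= rep_seats_2000 <= total_seats:
        raise ValueError("need total_seats >= 1 and 0 <= rep_seats_2000 <= total_seats")
    # Republicans can end up with anywhere from 0 to total_seats seats.
    return [
        float(Fraction(100 * (rep_seats_2006 - rep_seats_2000), total_seats))
        for rep_seats_2006 in range(total_seats + 1)
    ]


print(possible_changes(1, 1))   # one seat, Republican in 2000
print(possible_changes(1, 0))   # one seat, Democratic in 2000
print(possible_changes(10, 3))  # ten seats, 30% Republican in 2000
```

A one-seat state has only two feasible values, and which two depends on who held the seat in 2000. A ten-seat state that was 30% Republican has eleven, running from -30 to +70 in steps of 10.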

Both the above troubles arise because we use percentages to describe something discrete (number of seats).  This is a difficult problem and I don't know of a general solution.  However, in this example, because the change in seats is small across all states, regardless of the total number involved, I recommend that we avoid percentages and stick with positive, zero and negative changes.

The boxplot shows that there is little correlation between income growth and whether Republicans would win or lose House seats in 2006.  Here, the states are divided into three groups depending on whether the Republicans gained, lost or retained seats in the 2006 mid-term election.  The median income growth is similar in all three groups and the boxes overlap heavily.
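The grouping behind such a boxplot can be sketched in a few lines of Python. The records below are made-up placeholders, not the actual state data, and the function names are hypothetical; the point is only to show the gained / lost / retained split and the per-group median.

```python
from statistics import median

# Hypothetical records, NOT real data:
# (state label, income growth in %, change in number of Republican seats)
states = [
    ("A", 22.0, -1),
    ("B", 25.5, 0),
    ("C", 24.0, -2),
    ("D", 23.0, 1),
    ("E", 26.0, 0),
    ("F", 21.5, 1),
    ("G", 27.0, -1),
]


def outcome(seat_change):
    """Reduce a seat change to its sign: gained, lost or retained."""
    if seat_change > 0:
        return "gained"
    if seat_change < 0:
        return "lost"
    return "retained"


def median_growth_by_outcome(records):
    """Group income growth by outcome and return the median of each group."""
    groups = {}
    for _, growth, seat_change in records:
        groups.setdefault(outcome(seat_change), []).append(growth)
    return {label: median(values) for label, values in groups.items()}


summary = median_growth_by_outcome(states)
print(summary)
```

Working with the sign of the seat change sidesteps both troubles above: it does not depend on the 2000 starting share or on how many seats a state has.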

Reference: "Maybe You Did Vote Your Pocketbook", New York Times, Nov 12 2006.

PS. If you like this post, consider sending me a holiday gift.

 


Dec 05, 2006

Time travel

One of my scientific heroes and most influential teachers is Professor Frank Kelly at Cambridge.  What a pleasant surprise to see his involvement in a data visualization project.  To cite his wise words:

The travel-time maps are more than just pretty to look at; they also demonstrate an innovative way to use and present existing data. We are entering a world where we have access to vast quantities of data, and ways of turning that data into information, often involving clever ideas about visualisation, are becoming more and more important in science, government and our daily lives.

The little black dot near the center of the map indicates the Mathematics building at Cambridge.  The contours (vaguely visible at our scale) represent intervals of 10 minutes by public transportation away from the black dot.  The color of a dot on the map indicates the time at which a traveller must leave in order to get to the Math building by 9 am, taking into account traffic conditions, time of day, and decisions.  The hope of such maps is to help commuters (by public transit) plan their travel.

Professor Kelly has a very nice write-up on the intricacy of generating the data for such a map, which includes techniques of sampling, smoothing, extrapolation and so on.  It is rare that we get insights into the chart-making process.  He also carries a larger version of the travel-time map.

A similar article can be found at Plus magazine.

Oct 20, 2006

The elusive catchup

Thanks to Michael S. for sending in this chart from the economists at the IMF (via this blog).

At its heart, this is a scatter plot that displays the correlation between a country's development stage (indicated by its PPP GDP) and the importance of the industrial sector to its economy.

On top of that, the chart adds a third dimension of time by linking the dots together with lines.  The lines trace the evolution in each country or set of countries.  Some countries (mostly developed nations) have a clear trend; others exhibit choppy curves which imply fluctuating economic conditions.

We created this type of chart before, when discussing the fabulous Gapminder site.

The shading in the chart is supposed to draw attention to an inflection point around $15,000 per capita GDP, beyond which the industrial sector starts to decline in importance.

In my view, that conclusion is forced because Korea is the only curve displayed on the chart that bridged the $15,000 divide.  Thus, one can say there exists only one data point supporting this hypothesis.

However, one aspect of this chart jumps out at us: the chasm between developed and developing countries, right at the $15,000 divide.  On the right side, the rich get richer in a relatively steady fashion.  On the left side, the poor remain poor.  These nascent economies suffer from a great deal of volatility.  What's worse, the slopes are much steeper on the left than on the right, meaning that the gains in GDP are much smaller on the left of the divide.  Even more troubling are the cases of Brazil and Mexico, which seem to have endured a decline in the industrial sector without much gain in GDP.

The only bright spot is Korea.  (And China is the outlier.)


 

Aug 28, 2006

The dots don't connect

The New York Times published a bar chart reminiscent of the one discussed here last week.  They added the 50% line and did not cluster the countries into groups of five.

I like this chart for clarity and simplicity.  (Removing the decimal from the data would improve it.)  The U.S. and her special partner stand out as countries with the highest outside ownership of corporate shares. 

So far, so good.

Until I scanned the article itself, which startled and started with:

It turns out that most American investors are not xenophobic... Shareholders in the United States have been criticized as harboring "home bias" -- allocating far less to foreign stocks than they would if they did not let familiarity, patriotism and national loyalties stand in the way.

The dots don't connect, notwithstanding the academic references contained.  The chart shows what share of U.S. stocks is owned by outsiders (which includes some foreigners but also many U.S. investors).  What has this to do with how much money U.S. investors spend on foreign stocks?

Even a good chart can't save a poor story.

Reference: "Investors without Borders", New York Times, Aug 27, 2006

Aug 15, 2006

Bumps charts and NYT

I just cannot resist another post on Bumps charts since the NYT has finally started using them.  Here are two recent examples:



This first chart illustrates the change in property taxes in different municipalities since 1998, as compared to the national average.

A wealth of information is revealed:

  • All these places charge more than the national average today
  • New York City used to charge less than average but that ended in 2003
  • The tax rates are clustered into three groups, about 6%, about 5% and below 4%.  The variance between different places has decreased during these years
  • A sharp rise was recorded in all these places in 2001-3 although New York City lagged slightly.  The sharp rise was not observed nationwide


Reference: "Gain in Income is Offset by Rise in Property Taxes", New York Times, Aug 8 2006.

The second example is much cleaner as it involves only one period.  Bolding the "no one" line is particularly effective, bringing out the author's point well.

However, I'd have put the "no one" label on the right, just like the other labels, but bolded.

One could also argue that the real story is the simultaneous decline of "friend", "co-worker" and "neighbor" and rise of "no one" and "spouse".

Finally, it'd be interesting to see the multi-period version, as the smooth linear trends are rather hard to believe.

Reference: New York Times Magazine, July 16 2006.

Jul 31, 2006

Enigma of the big-buck pitcher

A data table accompanied a recent NYT article pointing out that big-buck pitchers were far from sure wins for those clubs who have taken Scott Boras' pitches.  The table contains a wealth of data but very little information is immediately revealed to the reader.

Sorting by size of contract makes no sense, especially since the key metric of success, i.e. change in winning percentage pre- and post-contract, cannot be discerned without pulling out a calculator.  Further, once the contract size is expressed by dollars per season, it is clear that all these contracts fall into the same range (about $10-13 million per year).
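The two derived columns the table makes readers compute by hand are simple to produce. Here is a rough Python sketch, using placeholder pitchers and made-up figures rather than the numbers in the NYT table; the function names are hypothetical, and winning percentage is assumed to be quoted as a decimal such as .550, with one "point" equal to .001.

```python
def dollars_per_season(total_dollars, seasons):
    """Express a contract as average dollars per season."""
    return total_dollars / seasons


def win_pct_change_points(pre, post):
    """Change in winning percentage, in points (1 point = .001)."""
    return round((post - pre) * 1000)


# Hypothetical contracts, NOT real data:
# (pitcher, total dollars, seasons, win pct before, win pct after)
contracts = [
    ("Pitcher A", 88_000_000, 8, 0.550, 0.480),
    ("Pitcher B", 55_000_000, 5, 0.450, 0.650),
    ("Pitcher C", 91_000_000, 7, 0.600, 0.400),
]

rows = [
    (name, dollars_per_season(total, seasons) / 1e6, win_pct_change_points(pre, post))
    for name, total, seasons, pre, post in contracts
]
# Sort by the metric of interest, not by contract size.
rows.sort(key=lambda row: row[2], reverse=True)

for name, millions_per_season, change in rows:
    print(f"{name}: ${millions_per_season:.1f}M per season, {change:+d} points")
```

Sorting on the change in winning percentage puts the biggest improvers and decliners at the two ends of the list, which is the comparison the article is actually about.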

One graphical alternative is shown on the right.  It brings out the desired message, that big-buck pitchers may or may not perform after signing big-buck contracts.  Several pitchers are annotated because they improved or declined by more than 200 points.

A graph cannot hope to achieve the data density of a data table.  But the process of making a graph forces the designer to focus on the most important data, which itself has great benefits.

Reference: "Big-buck pitchers are often big busts", New York Times, July 16, 2006.
