## The shackle of time 1

##### Jan 16, 2009

I ran across this hugely successful chart on Dean Foster's home page (and noted that he and his Wharton colleagues have a nice blog picking apart statistical errors committed in public).

This is a histogram plotting the historical year-on-year returns of the S&P 500 index, binned into 10%-levels.  It succeeds on two levels: the innovation of printing the years inside little blocks provides extra information without distracting from the overall picture; the key message of this plot, that the negative return of 2008 is a negative outlier in the history of returns, is extremely clear.

This, in my mind, is a superior presentation to the usual time-series line chart that we see in every economics publication.  For some purposes, it is better to unshackle ourselves from the linear time dimension, and this is a good example.

One question/comment: within each 10% level, the years are arranged in reverse chronological order from top to bottom.  This facilitates searching for a particular year.  The obvious alternative is to order by the actual level of return, so that the result is akin to a stem-and-leaf plot.
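To make the two orderings concrete, here is a minimal sketch of the binning step. The function name `bin_years_by_return`, its parameters, and the sample returns are all made up for illustration; this is not the code behind the original chart, and the numbers are not actual S&P 500 data.

```python
import math
from collections import defaultdict

def bin_years_by_return(returns_by_year, width=10, order_by_return=False):
    """Group years into return bins that are `width` percentage points wide.

    Each bin is keyed by its lower edge, so a return of -22.5 lands in the
    -30 bin (covering -30% up to, but not including, -20%).
    """
    bins = defaultdict(list)
    for year, ret in returns_by_year.items():
        lower_edge = math.floor(ret / width) * width
        bins[lower_edge].append(year)
    if order_by_return:
        # Stem-and-leaf style: highest return at the top of each stack.
        sort_key = lambda year: returns_by_year[year]
    else:
        # Reverse chronological: most recent year at the top of each stack.
        sort_key = lambda year: year
    return {edge: sorted(years, key=sort_key, reverse=True)
            for edge, years in sorted(bins.items())}

# Hypothetical annual returns in percent (illustration only).
sample = {2001: -12.0, 2002: -22.5, 2003: 28.0, 2004: 10.5,
          2005: 8.0, 2006: 15.6, 2007: 5.2}

by_year = bin_years_by_return(sample)
by_return = bin_years_by_return(sample, order_by_return=True)

for edge, years in by_year.items():
    print(f"{edge:>4}% to {edge + 10:>3}%: {' '.join(str(y) for y in years)}")
```

In the 0% to 10% bin, the chronological ordering puts 2007 above 2005, while ordering by return puts 2005 (8.0%) above 2007 (5.2%).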

While I like the graphical aspect of the chart, I feel like it has limited function.  This graph appears useful to anyone who has a one-year investment horizon.  If I want to predict next year's S&P 500 return, I might take a random sample from this distribution.  However, as a lazy investor, I never look at a one-year horizon so this creates two problems: if I am looking five years out, I can't take five samples from this distribution because there is serial correlation in this data for sure; even if I could take those five samples, it is difficult to compute the five-year return in my head.

So what I did was to take the data and replicate this histogram for 2-year, 3-year, 5-year, 10-year, etc. returns.  The results are as follows.  I decided to simplify further and use Tukey's boxplot instead of the histogram.  The data are real compounded total returns from S&P 500 from 1910-2008.
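One plausible way to build multi-year returns from annual ones is to compound every run of consecutive years. The sketch below assumes overlapping windows (1910-1912, 1911-1913, and so on); whether the charts here used overlapping or non-overlapping windows is not stated, and the function name and sample numbers are hypothetical.

```python
def rolling_compound_returns(annual_returns, horizon):
    """Cumulative return over every consecutive `horizon`-year window.

    `annual_returns` are fractions in chronological order (0.10 means +10%).
    Returns one cumulative return per window, also as a fraction.
    """
    results = []
    for start in range(len(annual_returns) - horizon + 1):
        growth = 1.0
        for r in annual_returns[start:start + horizon]:
            growth *= 1.0 + r  # compound year by year
        results.append(growth - 1.0)
    return results

# Hypothetical annual returns (illustration only, not S&P 500 data).
sample = [0.10, -0.20, 0.25, 0.05]
two_year = rolling_compound_returns(sample, 2)
print([round(x, 4) for x in two_year])
```

Overlapping windows share years with their neighbors, so the resulting values are not independent draws; that is worth keeping in mind when reading probabilities off the boxplots.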

The boxplot on the top right shows that there is about a 25% chance that an investment in the S&P 500 will return negative in real terms in any three-year period (below the green line).  At the other end, there is a 25% chance of earning more than 50% on the principal during those three years.

The next set of boxplots compares 5-year returns to 10-year returns and 10-year returns to 20-year returns.  If we have a 10-year horizon, there is still a positive chance of reaching the end of the decade and finding the investment under water!  The median 10-year return is approximately doubling the principal (about 8% per annum compounded).

In a twenty-year period, there is hardly any chance of not making money on the S&P.  There were two positive outliers of over 1000% (about 13% per annum compounded over 20 years).
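The per-annum figures quoted in these paragraphs come from the standard compounding relation, (1 + cumulative return) = (1 + annual rate)^years. A small sketch of the conversion in both directions, with function names of my own choosing:

```python
def annualized_rate(cumulative_return, years):
    """Annual compounded rate implied by a cumulative return (both fractions)."""
    return (1.0 + cumulative_return) ** (1.0 / years) - 1.0

def cumulative_return(annual_rate, years):
    """Cumulative return from compounding `annual_rate` for `years` years."""
    return (1.0 + annual_rate) ** years - 1.0

# A 1000% return over 20 years means ending with 11 times the principal.
print(f"{annualized_rate(10.0, 20):.1%}")

# An exact doubling (100% return) over 10 years.
print(f"{annualized_rate(1.0, 10):.1%}")

# 8% per annum compounded for 10 years.
print(f"{cumulative_return(0.08, 10):.0%}")
```

By this arithmetic, 1000% over 20 years is a little under 13% per annum, an exact doubling over 10 years is closer to 7% per annum, and 8% per annum corresponds to somewhat more than doubling (roughly 116%).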

Reference: Data from Global Financial Data

## The trouble with percentages

##### Dec 13, 2006

In the aftermath of the Democratic victory in the 2006 mid-term election, the NYT published a column floating the idea that "it was the economy, stupid".  For statistics buffs, this column provides much food for thought.

Suffice it to say, if you were my student, you would not want to hand this in as an essay.  To the author's credit, he did backload the article with lots of disclaimers.

The key thesis of the piece is:

> if your state wasn’t among the best economic performers in the last six years, judged by the growth of personal income, it appears that you were three times as likely to vote to throw the bums out.

(We'll just assume he didn't mean "you" but "your state".) To help us understand the author's logic, I created a scatter plot, relating the change in state average personal income (2000-2006) to the change in percent of Republican seats.

He first segmented the states into two groups: the red dots had the top 10 income growth rates; the blue dots were the remaining states.  Then for each group, he computed the average drop in % Republican.  For the reds, it was 2%; for the blues, it was 7%.  (These levels are indicated by the horizontal lines.  My data are slightly different from his.)  Case proven -- with disclaimers.
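The columnist's procedure, as described above, amounts to ranking states by income growth, splitting off the top group, and averaging the change in % Republican within each group. Here is a rough sketch; the state names and numbers are invented for illustration and the function name is hypothetical.

```python
from statistics import mean

def compare_top_group(states, top_n=10):
    """Average change in % Republican for the top-`top_n` states by income
    growth versus all remaining states.

    `states` is a list of (name, income_growth_pct, change_in_pct_republican).
    """
    ranked = sorted(states, key=lambda s: s[1], reverse=True)
    top, rest = ranked[:top_n], ranked[top_n:]
    return mean(s[2] for s in top), mean(s[2] for s in rest)

# Made-up data: five fictional states, top group of two.
sample = [
    ("A", 30.0, -2.0),
    ("B", 28.0, 0.0),
    ("C", 20.0, -10.0),
    ("D", 18.0, -5.0),
    ("E", 15.0, -6.0),
]
top_avg, rest_avg = compare_top_group(sample, top_n=2)
print(top_avg, rest_avg)
```

Note that the result depends on the arbitrary cutoff (`top_n`) and on which states are excluded beforehand, which is exactly where the trouble starts.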

Some of you are already counting the dots.  If you find only 42, you have counted correctly.  The following explanation provided by the analyst is classic:

> It’s easier to answer this question if you leave out the six states that didn’t elect any Republicans in 2000; after all, they didn’t have any to throw out. If you also remove New Hampshire and South Dakota, where the percentage of Republicans elected dropped to 0 from 100 — New Hampshire only has two seats in the House and South Dakota has one — a pattern starts to appear.

I will leave the emergent pattern thesis to a future post.  For this post, I am interested in the trouble with percentages.  He is right to point out that for those 100% Blue states, the change in %Republican is constrained to be non-negative, from 0% up to 100%.  For most other states, the change can be positive or negative.

Good observation but wrong remedy -- those six states with 0% Republicans in 2000 are not special; removing them from the analysis is wrong-headed.  What about those states with 100% Republicans in 2000?  There, the change in %Republican can only be 0% or negative.  In fact, the possible range for the change in seats for each state is different, and it depends on the Republican proportion in 2000!  For example, if in 2000 the Republicans held 30% of the seats, then in 2006, the change must be between -30% and +70%.

The situation is worse: the range of possible values also depends on the number of seats in each state.  The fewer total seats there are, the fewer values the change can take.  As the author notes, with only 1 seat, you either lose it, gain it or retain it, so that the change will be either -100%, +100% or 0%.  No other values are possible!
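Both constraints can be seen by enumerating every attainable change for a given state. The sketch below assumes the state's total number of seats is the same in both elections (a simplification) and uses a hypothetical function name.

```python
from fractions import Fraction

def possible_changes(total_seats, republican_seats_before):
    """All attainable changes in % Republican, in percentage points,
    assuming the total number of seats does not change."""
    before = Fraction(republican_seats_before, total_seats)
    return [float((Fraction(after, total_seats) - before) * 100)
            for after in range(total_seats + 1)]

# 10 seats, 3 held by Republicans (30%): changes run from -30 to +70.
print(possible_changes(10, 3))

# A single seat held by a Republican: only -100 or 0 is possible.
print(possible_changes(1, 1))

# A single seat held by a Democrat: only 0 or +100 is possible.
print(possible_changes(1, 0))
```

For a one-seat state, the three values -100%, 0% and +100% are never all available at once: which two apply depends on who held the seat in 2000.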

Both the above troubles arise because we use percentages to describe something discrete (number of seats).  This is a difficult problem and I don't know of a general solution. However, in this example, because the change in seats is small across all states, regardless of the total number involved, I recommend that we avoid percentages and stick with positive, zero and negative changes.

The boxplot shows that there is little correlation between income growth and whether Republicans would win or lose House seats in 2006.  Here, the states are divided into three groups depending on whether the Republicans gained, lost or retained seats in the 2006 mid-term election.  The median income growth is similar in all three groups and the boxes overlap heavily.
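The grouping behind this boxplot can be sketched as follows: reduce each state's seat change to its sign, then compare income growth across the three groups. The data below are invented, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import median

def median_growth_by_outcome(states):
    """Median income growth for states where Republicans gained, lost,
    or retained seats.

    `states` is an iterable of (income_growth_pct, seat_change) pairs,
    where seat_change is a signed count of seats.
    """
    groups = defaultdict(list)
    for growth, seat_change in states:
        if seat_change > 0:
            label = "gained"
        elif seat_change < 0:
            label = "lost"
        else:
            label = "retained"
        groups[label].append(growth)
    return {label: median(values) for label, values in groups.items()}

# Made-up (income growth %, seat change) pairs for illustration.
sample = [(20.0, -1), (25.0, -2), (22.0, 0), (24.0, 0), (18.0, 0), (23.0, 1)]
medians = median_growth_by_outcome(sample)
print(medians)
```

Working with the sign of the change sidesteps both problems with percentages, at the cost of ignoring how many seats changed hands.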

Reference: "Maybe You Did Vote Your Pocketbook", New York Times, Nov 12 2006.

PS. If you like this post, consider sending me a holiday gift.

## Calming the rip tide

##### Nov 10, 2006

Xan Gregg at Forth Go helpfully scraped the auto market share data off the NYT chart discussed here before.  He even created an improved chart based on histograms.

I have created another view of the data, using boxplots.  Tukey's boxplot is one of the most spectacular graphical inventions, as I have said before (see here, for example).  Its power is evident again for this data set.

This chart is in fact two boxplots superimposed on the same surface.  I forgot to put on the legend: the green boxes represent U.S. market shares, and the blue boxes Europe shares.

The automakers are ordered by decreasing U.S. market shares (with apologies to European readers).

Lots of information can be immediately read off this chart:

• The European market is much more fragmented than the U.S. market.
• The Big 2 (GM, Ford) have had mixed fortunes over this period (as indicated by the large variance).
• The Big 2 are competitive in Europe although they are definitely not dominant there.
• Several key players in Europe (Peugeot, Renault, Fiat, BMW) have negligible shares in the U.S.

Most importantly, there is little evidence that the U.S. market is "looking more like Europe".

One weakness of the above chart is the suppression of temporal information: there is no indication whether the recent shares are moving to the left or the right of the medians (center of each box).

In the next chart, with the Europe data removed, I highlighted the data for the most recent 5 years in red.  I can make the general statement that there is a small movement towards less concentration and more parity in the U.S. market but one has to conclude that the U.S. market shares in 2000-2006 look more similar to the U.S. market shares in 1990-1999 than to Europe market shares.

P.S. I added legends to the charts.

## The sad tally 4: the boxplot

##### Nov 15, 2005

Given several datasets (scatter plots), how does one tell random from non-random?  We plot features of the data structure and hope to see abnormal behavior, which we take to indicate the presence of non-randomness.

In the last post, we noticed that the key distinguishing feature is the degree of variability in the data: for purely random data, we expect a certain degree of variability; non-random data betray their character because they would show more or less variability at different points in the distribution (see right).  So each scatter plot reveals variability around the mean (0) but the variability has apparently a different shape in Dataset 9 (the real data) than in the other datasets.  We saw this in a "cumulative distribution" plot and less strongly in a plot of the frequency of extreme values.
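To get a feel for the "certain degree of variability" that pure randomness produces, one can simulate events landing on locations uniformly at random and look at the spread of the counts. This is only a sketch: the 30 locations and 600 events are hypothetical round numbers, not the real data, and the random datasets in this series may have been generated differently.

```python
import random
from statistics import pstdev

def simulate_random_counts(n_locations, n_events, rng):
    """Assign each event to a location uniformly at random; return the
    number of events at each location."""
    counts = [0] * n_locations
    for _ in range(n_events):
        counts[rng.randrange(n_locations)] += 1
    return counts

# Hypothetical sizes; a fixed seed makes the run repeatable.
N_LOCATIONS, N_EVENTS = 30, 600
rng = random.Random(0)
simulated = [simulate_random_counts(N_LOCATIONS, N_EVENTS, rng) for _ in range(8)]

expected_per_location = N_EVENTS / N_LOCATIONS
spreads = [pstdev(counts) for counts in simulated]
print("expected count per location:", expected_per_location)
print("spread of each simulated dataset:", [round(s, 1) for s in spreads])
```

A real dataset whose spread sits far outside the range seen in the simulated ones is a candidate for non-randomness.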

Given what we know now, a much simpler plot will suffice.  The boxplot, an invention of John Tukey, a giant among statisticians, succinctly summarizes the key features of a distribution, such as the median (bold line), the 25th and 75th percentiles (edges of the box), the spread (height of the box), and outliers (individual dots above or below the box).
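For readers who want the numbers behind a boxplot, here is a minimal sketch using Python's standard library. It assumes the common convention of flagging points more than 1.5 box-heights beyond the box edges as outliers; the plots in this post may use a different rule, and the sample counts are made up.

```python
from statistics import quantiles

def boxplot_summary(values, whisker=1.5):
    """Median, quartiles, box height (IQR), and outliers under the usual
    1.5 x IQR fence convention."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": med, "q3": q3, "iqr": iqr, "outliers": outliers}

# Made-up counts with one extreme value (illustration only).
sample = [8, 10, 12, 14, 15, 17, 19, 21, 61]
summary = boxplot_summary(sample)
print(summary)
```

Different software computes quartiles in slightly different ways, so the box edges can shift a little from one tool to another.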

The evidence of non-randomness in Dataset 9 now starkly confronts us.  Its box is much wider; the median line is significantly lower than in the other 8 datasets; the extreme value of 61 is clearly out of whack.

Now, let's take a step back and look at what this non-randomness means.  We are concluding that the choice of suicide location is not random.  Location 69/70 (the outlier, with 61 deaths) is definitely the most popular.  Partly because of this, many of the other locations have had fewer than 20 deaths, which is fewer than expected if people had randomly selected locations.

Next time, I will describe another way to compare distributions; this method is more advanced but also more direct.