Eye heart this

Dan at Eye Heart New York has a fantastic post relating to the recent release of restaurant health inspection data by New York City. This has caused a furor among the restaurant owners because they are now required to wear their A/B/C badges front and center. Dan collected some data (which he also posted), made some charts, and reported some interesting insights.

Here is an overview chart that shows the distribution of scores (the higher the score, the lower the grade). He called it a "scatter plot" but it is really a histogram where the bucket size is 1 except for the rightmost bucket.

Chart-scores-colored-nycfood
 

I like the use of green, yellow and red colors to indicate (without words) the conversion scale from scores (violation points) to grades (A/B/C). The legend "Count" is an Excel monstrosity. I'd have used a bucket size of at least 5, which would smooth out the gyrations in the green zone.
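To illustrate the effect of bucket size, here is a minimal Python sketch that counts scores in buckets of width 1 and width 5. The scores are made up for illustration (they are not Dan's data), and the function name `bucket_counts` is my own.

```python
from collections import Counter

def bucket_counts(scores, width=5):
    """Count scores per bucket; a bucket labelled b covers [b, b + width)."""
    counts = Counter((s // width) * width for s in scores)
    return dict(sorted(counts.items()))

# Made-up violation-point scores, for illustration only (not Dan's data).
scores = [2, 4, 7, 7, 9, 10, 12, 13, 13, 14, 19, 27, 28, 35]

print(bucket_counts(scores, width=1))  # one bar per distinct score
print(bucket_counts(scores, width=5))  # wider buckets, fewer and smoother bars
```

With the wider buckets, neighboring scores are pooled, so small score-to-score fluctuations no longer show up as separate bars.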

A more typical way to summarize numeric data in groups is Tukey's boxplot, as shown below.

Tukey_boxplot 

I use Dan's raw data on this chart. 1 = A, 2 = B, 3 = C. What is group 4?

It turns out Dan has removed this group from all of his analysis. A little research shows that group 4 consists of restaurants that have been closed by the Dept of Health. Interestingly, the scores of these restaurants are spread widely so the DOH appears to be closing restaurants not just for health violations. (In the rest of this post, I have removed group 4.)

For those not familiar with box plots, the box contains the middle 50% of the data (in this case, the scores of the middle half of the restaurants in the respective group); the line inside the box is the median score; the dots above (or below, though nonexistent here) the vertical lines are outliers. As Dan pointed out, group C has lots of outliers on the high end of the score.
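For readers who want to see the pieces of a boxplot as numbers, here is a minimal Python sketch. It assumes the common convention of flagging points more than 1.5 times the box height beyond the box as outliers, and it uses the standard library's "inclusive" quartile method; charting software may compute quartiles slightly differently. The scores are made up (not Dan's data).

```python
import statistics

def boxplot_summary(values):
    """Return the quartiles and outliers that a Tukey-style boxplot displays."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # height of the box: spread of the middle 50% of the data
    # Assumed convention: points beyond 1.5 * IQR from the box are outliers.
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}

# Made-up scores for a single grade group, for illustration only.
group_scores = [28, 29, 30, 31, 33, 35, 38, 40, 111]
print(boxplot_summary(group_scores))
```

In this toy group, the box runs from 30 to 38 with the median at 33, and the score of 111 is drawn as a lone dot above the box.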

Score111

Just for fun, I pulled the violations of the highest scoring restaurant (111 violation points). What I find intriguing is the huge fluctuation in scores over the last 5 inspections. Does this happen to other restaurants too? What does that say about the grading system?

 


***

Next, Dan attempted to address two questions: did scores vary across the 5 boroughs, and did scores vary across cuisine groups? This is the concept covered in Chapter 1 of my book: always look at the variation around averages, that's where the most interesting stuff is.

He calculated the means and standard deviations of different subgroups. It is simpler to visualize the data, again using boxplots.

Here's one dealing with boroughs, and it is clear that there is not much to pick between them. You could possibly say Staten Island is better than the other 4 boroughs.

Redo_scorebyborough

Here's one dealing with cuisine groups, using Dan's definitions.

Redo_scorebycuisinggroups

The order of the cuisine groups is by median score from lowest on the left to highest on the right. Again, there is no drastic difference. It is certainly not the case that Asian/Latin American restaurants are worse than say European or American ones.

About half of the restaurants under desserts, drinks, misc., african, and others received As while a bit less than half of the other cuisine groups got As. Some of the cuisine groups had few egregious violators (African, Middle East) - but this data is perhaps skewed by the removal of the "closed" restaurants.

One shortcoming of the traditional boxplot is the omission of how large each group is. For groups that are too small, it is difficult to draw any statistical conclusions. We know from Dan's table, for instance, that there were only 17 restaurants classified as "African".
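One simple remedy is to report the group size next to each group's summary. The sketch below is a minimal Python illustration; the records are made up (not Dan's data) and the function name `group_summary` is my own.

```python
import statistics
from collections import defaultdict

def group_summary(records):
    """Map each group to (number of restaurants, median score)."""
    groups = defaultdict(list)
    for group, score in records:
        groups[group].append(score)
    return {g: (len(v), statistics.median(v)) for g, v in groups.items()}

# Made-up (cuisine group, score) pairs, for illustration only.
records = [
    ("African", 10), ("African", 14),
    ("American", 12), ("American", 20), ("American", 13),
]
for group, (n, median) in group_summary(records).items():
    print(f"{group}: n={n}, median={median}")
```

Printing n alongside the median makes it obvious which boxes rest on only a handful of restaurants.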

(Unfortunately, Excel does not have built-in capability for generating boxplots.)


Self-sufficient charts

A good example showed up in the New York Times recently of a chart that fails the self-sufficiency test that I often speak about here. First, the doctored chart (with the data removed):

Redo_hometeampies
And for comparison, the chart as originally printed (the chart appeared only in the print edition, not online):

Nyt_homefield_sm
There is little doubt that the second version, with the data -- all four numbers -- printed on the chart, is much more effective, and that is why the designer thought to include them.

This shows that readers are gravitating to the data rather than the graphical constructs, and thus I consider these types of charts not self-sufficient. The graphical constructs can't stand on their own.

***

The choice of pie charts in a small-multiples arrangement is a mistake for this data set. While indeed in theory the winning percentage could range from 0 to 100%, in practice the winning percentages are rather narrowly dispersed (with the exception of the NFL which has a 16-game regular season).

Just quickly looking up the 2009 regular seasons: MLB teams ranged from 36% (Nationals) to 65% (Yankees); NHL ranged from 32% (Islanders) to 65% (Bruins); NBA from 21% (Sacramento) to 81% (Cleveland).

In order to judge whether 60% or 52% is a large or small number, readers need to have a sense of how teams are dispersed around those averages. A side-by-side boxplot brings this out pretty well (the data is for 2009 seasons).

Redo_homewins

The "box" in a boxplot contains the middle 50% of the teams in each league while the line inside the box depicts the median team (in terms of winning percentage).

The NBA teams showed much higher variability in winning percentages than the NHL or the MLB. The difference in average winning percentage of say, 2% or 5%, from one league to the next is not remarkable, given this fact.

(The original article did not really pertain to such a comparison so the reason for this chart is not clear.)


The shackle of time 1

SP_from_1825

I ran across this hugely successful chart on Dean Foster's home page (and noted that he and his Wharton colleagues have a nice blog picking apart statistical errors committed in public).

This is a histogram plotting the historical year-on-year returns of the S&P 500 index, binned into 10%-levels.  It succeeds on two levels: the innovation of printing the years inside little blocks provides extra information without distracting the overall picture; the key message of this plot, that the negative return of 2008 is a negative outlier in the history of returns, is extremely clear.

This, in my mind, is a presentation superior to the usual time-series line chart that we see in every economics publication.  For some purposes, it is better to unshackle ourselves from the linear time dimension, and this is a good example.

One question/comment: within each 10% level, the years are arranged in reverse chronological order from top to bottom.  This facilitates searching for a particular year.  The obvious alternative is to order by the actual level of return, so that the result is akin to a stem-and-leaf plot.
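The two orderings can be compared in a small Python sketch. The returns below are placeholder values in whole percentage points, chosen for illustration and not read off the chart; the function name `bin_years` and its parameters are my own.

```python
from collections import defaultdict

def bin_years(returns_pct, width=10, by_return=False):
    """Group years into return bins; a bin labelled b covers [b, b + width) percent."""
    bins = defaultdict(list)
    for year, pct in returns_pct.items():
        lower_edge = (pct // width) * width  # floor division also handles negatives
        bins[lower_edge].append((year, pct))
    ordered = {}
    for lower_edge in sorted(bins):
        if by_return:
            # Stem-and-leaf style: order years by the size of the return.
            members = sorted(bins[lower_edge], key=lambda item: item[1])
        else:
            # As in the chart: most recent year first within each bin.
            members = sorted(bins[lower_edge], key=lambda item: item[0], reverse=True)
        ordered[lower_edge] = [year for year, _ in members]
    return ordered

# Placeholder annual returns in percent, for illustration only.
returns_pct = {2001: -12, 2002: -22, 2003: 28, 2004: 11,
               2005: 4, 2006: 15, 2007: 5, 2008: -37}
print(bin_years(returns_pct))                  # reverse chronological within bins
print(bin_years(returns_pct, by_return=True))  # ordered by return within bins
```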

While I like the graphical aspect of the chart, I feel like it has limited function.  This graph appears useful to anyone who has a one-year investment horizon.  If I want to predict what next year's S&P 500 return is, I might take a random sample from this distribution.  However, as a lazy investor, I never look at a one-year horizon so this creates two problems: if I am looking five years out, I can't take five samples from this distribution because there is serial correlation in this data for sure; even if I could take those five samples, it is difficult to compute the five-year return in my head.

So what I did was to take the data and replicate this histogram for 2-year, 3-year, 5-year, 10-year, etc. returns.  The results are as follows.  I decided to simplify further and use Tukey's boxplot instead of the histogram.  The data are real compounded total returns from S&P 500 from 1910-2008.
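For anyone wanting to repeat this exercise, multi-year returns can be derived from annual returns by compounding over a sliding window. The Python sketch below assumes overlapping windows (one starting in every year), which is my assumption about the setup; the annual returns are made up, and the function name `rolling_returns` is my own.

```python
import math

def rolling_returns(annual_returns, horizon):
    """Compounded return over every run of `horizon` consecutive years."""
    last_start = len(annual_returns) - horizon
    windows = (annual_returns[i:i + horizon] for i in range(last_start + 1))
    return [math.prod(1.0 + r for r in window) - 1.0 for window in windows]

# Made-up annual returns as fractions (0.10 means +10%), for illustration only.
annual = [0.10, -0.20, 0.50]
print(rolling_returns(annual, 2))  # two overlapping 2-year windows
```

Because overlapping windows share years, neighboring multi-year returns are not independent of each other.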

Redo_sandp123

The boxplot on the top right shows that there is about a 25% chance that an investment in the S&P 500 will have a negative real return in any three-year period (below the green line).  At the other end, there is a 25% chance of earning more than 50% on the principal during those three years.

The next set of boxplots compared 5-year returns to 10-year returns and 10-year returns to 20-year returns.  If we have a 10-year horizon, there is still a positive chance of reaching the end of the decade and finding the investment under water!  The median 10-year return is approximately doubling the principal (about 8% per annum compounded).

In a twenty-year period, there is hardly any chance of not making money on the S&P.  There were two positive outliers of over 1000% (about 13% per annum compounded over 20 years).
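The conversion between a cumulative return and a per-annum compounded rate is a one-line formula: the annual rate is (1 + R)^(1/n) - 1 for a total return R over n years. A minimal Python sketch follows; the function name `annualized` is my own.

```python
def annualized(total_return, years):
    """Per-annum compounded rate equivalent to `total_return` over `years`.

    total_return is a fraction: 1.0 means +100%, i.e. doubling the principal.
    """
    return (1.0 + total_return) ** (1.0 / years) - 1.0

print(f"{annualized(1.0, 10):.1%}")   # doubling the principal over 10 years
print(f"{annualized(10.0, 20):.1%}")  # a 1000% return over 20 years
```

By this arithmetic, an exact doubling over 10 years works out to roughly 7.2% per annum, so 8% per annum corresponds to a bit more than doubling; a 1000% return over 20 years is a little under 13% per annum.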






Reference: Data from Global Financial Data





The trouble with percentages

In the aftermath of the Democratic victory in the 2006 mid-term election, the NYT published a column floating the idea that "it was the economy, stupid".  For statistics buffs, this column provides much food for thought. 

Suffice it to say, if you were my student, you would not want to hand this in as an essay.  To the author's credit, he did backload the article with lots of disclaimers.

The key thesis of the piece is:

if your state wasn’t among the best economic performers in the last six years, judged by the growth of personal income, it appears that you were three times as likely to vote to throw the bums out.

Redo_election06b_1

(We'll just assume he didn't mean "you" but "your state".) To help us understand the author's logic, I created a scatter plot, relating the change in state average personal income (2000-2006) to the change in percent of Republican seats.

He first segmented the states into two groups: the red dots had the top 10 income growth rates; the blue dots were the remaining states.  Then for each group, he computed the average drop in % Republican.  For the reds, it was 2%; for the blues, it was 7%.  (These levels are indicated by the horizontal lines.  My data are slightly different from his.)  Case proven -- with disclaimers.

Some of you are already counting the dots.  If you only find 42, you'd have counted correctly.  The following explanation provided by the analyst is classic:

It’s easier to answer this question if you leave out the six states that didn’t elect any Republicans in 2000; after all, they didn’t have any to throw out. If you also remove New Hampshire and South Dakota, where the percentage of Republicans elected dropped to 0 from 100 — New Hampshire only has two seats in the House and South Dakota has one — a pattern starts to appear.

I will leave the emergent pattern thesis to a future post.  For this post, I am interested in the trouble with percentages.  He is right to point out that for those 100% Blue states, the change in %Republican is constrained to be positive, from 0% up to 100%.  For most other states, the change can be positive or negative.

Good observation but wrong remedy -- those six states with 0% Republicans in 2000 are not special; removing them from the analysis is wrong-headed.  What about those states with 100% Republicans in 2000?  There, the change in %Republican can only be 0% or negative.  In fact, the possible range for the change in seats for each state is different, and it depends on the Republican proportion in 2000!  For example, if in 2000 the Republicans held 30% of the seats, then in 2006, the change must be between -30% and +70%.

The situation is worse: the range of possible values also depends on the number of seats in each state.  The fewer total seats there are, the fewer possible values that can be taken.  As the author notes, with only 1 seat, you either lose it, gain it or retain it, so that the change will be either -100%, +100% or 0%.  No other values are possible!
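Both constraints can be made concrete by enumerating every possible outcome for a state. The Python sketch below is a minimal illustration; the function name `possible_changes` is my own, and the seat counts in the examples are hypothetical.

```python
from fractions import Fraction

def possible_changes(total_seats, rep_seats_2000):
    """All possible changes in the Republican share, in percentage points."""
    start = Fraction(rep_seats_2000, total_seats)
    return [
        float((Fraction(k, total_seats) - start) * 100)
        for k in range(total_seats + 1)  # k = Republican seats after the election
    ]

print(possible_changes(1, 1))   # one seat, held by a Republican in 2000
print(possible_changes(1, 0))   # one seat, not held by a Republican in 2000
print(possible_changes(10, 3))  # 30% Republican in 2000: from -30 to +70
```

The range shifts with the starting share, and the number of attainable values is always the number of seats plus one.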

Both the above troubles arise because we use percentages to describe something discrete (number of seats).  This is a difficult problem and I don't know of a general solution.  However, in this example, because the change in seats is small across all states, regardless of the total number involved, I recommend that we avoid percentages and stick with positive, zero and negative changes.

Redo_election06c

The boxplot shows that there is little correlation between income growth and whether Republicans would win or lose House seats in 2006.  Here, the states are divided into three groups depending on whether the Republicans gained, lost or retained seats in the 2006 mid-term election.  The median income growth is similar in all three groups and the boxes overlap heavily.
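The grouping behind this boxplot uses only the sign of the seat change. A minimal Python sketch of that step follows; the state records are made up (not the actual 2006 data), and the function name `median_growth_by_outcome` is my own.

```python
import statistics

def median_growth_by_outcome(states):
    """Group states by the sign of the seat change; return median income growth."""
    groups = {"lost": [], "retained": [], "gained": []}
    for seat_change, income_growth in states:
        if seat_change < 0:
            groups["lost"].append(income_growth)
        elif seat_change > 0:
            groups["gained"].append(income_growth)
        else:
            groups["retained"].append(income_growth)
    # Skip empty groups, which have no median.
    return {k: statistics.median(v) for k, v in groups.items() if v}

# Made-up (change in Republican seats, income growth %) pairs, illustration only.
states = [(-1, 4.0), (-2, 5.0), (0, 4.5), (0, 5.5), (1, 5.0), (-1, 6.0)]
print(median_growth_by_outcome(states))
```

Using the sign alone sidesteps both percentage problems, since it does not depend on the number of seats or the starting share.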

Reference: "Maybe You Did Vote Your Pocketbook", New York Times, Nov 12 2006.


 



Calming the rip tide

Xan Gregg at Forth Go helpfully scraped the auto market share data off the NYT chart discussed here before.  He even created an improved chart based on histograms.

I have created another view of the data, using boxplots.  Tukey's boxplot is one of the most spectacular graphical inventions, as I have said before (see here, for example).  Its power is evident again for this data set.

Redo_autoshares_1

This chart is in fact two boxplots superimposed on the same surface.  I forgot to put on the legend: the green boxes represent U.S. market shares, and the blue boxes Europe shares.

The automakers are ordered by decreasing U.S. market shares (with apologies to European readers).

Lots of information can be immediately read off this chart:

  • The European market is much more fragmented than the U.S. market.
  • The Big 2 (GM, Ford) have had mixed fortunes over this period (as indicated by the large variance)
  • The Big 2 are competitive in Europe although they are definitely not dominant there
  • Several key players in Europe (Peugeot, Renault, Fiat, BMW) have negligible shares in the U.S.

Most importantly, there is little evidence that the U.S. market is "looking more like Europe".

One weakness of the above chart is the suppression of temporal information: there is no indication whether the recent shares are moving to the left or the right of the medians (center of each box). 

In the next chart, with the Europe data removed, I highlighted the data for the most recent 5 years in red.  I can make the general statement that there is a small movement towards less concentration and more parity in the U.S. market but one has to conclude that the U.S. market shares in 2000-2006 look more similar to the U.S. market shares in 1990-1999 than to Europe market shares.

Redo_autoshares2000

P.S. I added legends to the charts.



The sad tally 4: the boxplot

Given several datasets (scatter plots), how does one tell random from non-random?  We plot features of the data structure and hope to see abnormal behavior, which we take to indicate the presence of non-randomness.

Realsuidkeys_1

In the last post, we noticed that the key distinguishing feature is the degree of variability in the data: for purely random data, we expect a certain degree of variability; non-random data betray their character because they show more or less variability at different points in the distribution (see right).  So each scatter plot reveals variability around the mean (0) but the variability apparently has a different shape in Dataset 9 (the real data) than in the other datasets.  We saw this in a "cumulative distribution" plot and less strongly in a plot of the frequency of extreme values.

Given what we know now, a much simpler plot will suffice.  The boxplot, an invention of John Tukey, a giant among statisticians, succinctly summarizes the key features of a distribution, such as the median (bold line), the 25th and 75th percentiles (edges of the box), the spread (height of the box), and outliers (individual dots above or below the box).


Suicidesbox

The evidence of non-randomness in Dataset 9 now starkly confronts us.  Its box is much wider; its median is significantly lower than those of the other 8 datasets; the extreme value of 61 is clearly out of whack.

 
 

 
 
Now, let's take a step back and look at what this non-randomness means.  We are concluding that the choice of suicide location is not random.  Location 69/70 (the outlier, with 61 deaths) is definitely the most popular.  Partly because of this, many of the other locations have had fewer than 20 deaths, which is fewer than expected if people had randomly selected locations.

Next time, I will describe another way to compare distributions; this method is more advanced but also more direct.