Over at Junk Charts, I examined Nate Silver's ranking of New York neighborhoods (first published in *New York* magazine): Which factors affected the rankings? How did the factors correlate amongst themselves?

While analyzing the data (which I hand-transferred from the printed pages), I found a moderate number of typos: scores and ranks that don't make much sense. Now, I am not here to criticize their editors because, as anyone who makes a living analyzing data knows, typos and other data issues are the norm, not the exception, in this business. What I want to do here is describe how I uncovered the typos and, more importantly, why statistical analyses are often immune to them.

***

On the right are plots of the scores against the ranks for each category (factor) being evaluated. We expect to see a monotonically decreasing function: as rank increases (moving from left to right), the score must decrease or stay put; it should never increase.

The sharp valleys and peaks in almost every one of these charts are typos. For example, the sharp valley in the "Creative Capital" chart corresponds to Parkchester, ranked 29th in this category, whose reported score of 63 is much lower than those of Harlem (75, rank 28) and Astoria (74, rank 30).
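This kind of inversion is easy to flag mechanically. Here is a minimal sketch of the check; the three neighborhoods and their figures come from the example above, but the data structure and the flagging logic are my own illustration, not the magazine's table or my actual transcription.

```python
# Each row: (neighborhood, rank in category, reported score).
ranked = [
    ("Harlem", 28, 75),
    ("Parkchester", 29, 63),  # the suspicious reported score
    ("Astoria", 30, 74),
]

# Walk down the ranking and flag any adjacent pair where the score
# rises as the rank worsens -- impossible if ranks follow from scores.
ranked.sort(key=lambda row: row[1])
flags = [
    (a[0], b[0])
    for a, b in zip(ranked, ranked[1:])
    if b[2] > a[2]
]
print(flags)  # -> [('Parkchester', 'Astoria')]
```

The check tells you *where* an inversion sits, but not which of the two values (or which of rank and score) is the typo, which is exactly the ambiguity described below.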

I spent quite a bit of time trying to fix these errors, using the surrounding data to reason about whether the rank or the score was mistyped. It was a fruitless exercise.

(Look at "Green Space," for example. The line goes up and down repeatedly, indicating many typos, and any fix would have involved a whole series of changes.)

***

In practice, data analysts do not fix typos unless they are extremely egregious and unambiguous -- and even then, the fix may just be to restate the value as "unknown". One reason is that one doesn't want to make a bad situation worse. Another is that statistical techniques by definition generalize the data, and thus are not very sensitive to individual values.

To illustrate this point, I ran a linear regression of the overall scores on the category scores. According to Silver's ranking formula, the overall score should be a weighted average of the category scores; housing affordability, for example, had a weight of 25% in the formula.

The regression answers the question of how much of the overall score is explained by the individual category scores. If there were no typos, the answer would be 100%: knowing the category scores, you could derive the overall score without any uncertainty. Because there are typos, the correlation is slightly off.

The chart on the right shows that the correlation is almost, but not quite, perfect. It compares the actual overall score as reported in New York magazine with the "predicted" overall score from the regression analysis.

The regression in effect "recovers" the weights used by Nate Silver in his algorithm (shown to the right). Despite the "noise" introduced by the typos, the weights found by the regression (shown in the column labelled "Estimate") are almost exactly those used by Silver.
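The weight-recovery idea can be demonstrated on made-up data. The sketch below does not use Silver's data or his actual weights; the neighborhood count, score ranges, weight vector, and the positions and sizes of the simulated "typos" are all assumptions chosen for illustration. It builds overall scores as a known weighted average, corrupts a few entries, and then fits ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50                                                     # neighborhoods (assumed)
weights = np.array([0.25, 0.20, 0.20, 0.15, 0.10, 0.10])   # assumed true weights, sum to 1

X = rng.uniform(40, 100, size=(n, len(weights)))           # category scores
y = X @ weights                                            # clean overall scores

# Introduce a handful of "typos" in the overall scores.
y_typo = y.copy()
y_typo[[3, 17, 41]] += rng.uniform(-10, 10, size=3)

# Least-squares fit: the estimates land close to the true weights
# because the 47 uncorrupted rows dominate the fit.
est, *_ = np.linalg.lstsq(X, y_typo, rcond=None)
print(np.round(est, 2))
```

This is the sense in which the technique "generalizes" the data: a few corrupted values barely move the estimated coefficients when the bulk of the observations are clean (though, as a commenter notes below, least squares is not robust to large or numerous outliers).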

This is why many statisticians are not overly concerned with small errors in the data. We expect that data is not clean, and we know many of our techniques can overcome those errors.

***

PS. Here is my post on Junk Charts on Nate Silver's rankings.

Interesting post, and I agree with the conclusion. However, this would not work if the information is presented with a drill-down functionality. Drilling down will effectively reduce the sample size and will magnify the influence of errors/typos.

I frequently have to explain to customers that a report is produced from dirty data and they should be aware of it, to which they would typically remark that a few errors will not make a huge difference, and they are right. But a lot of BI tools allow easy drill-down, and that is where they have to be careful.

Posted by: Dimitri | 06/18/2010 at 12:07 AM

Dimitri: Absolutely, drilling down often presents problems, usually because the original research design did not anticipate analysis at those levels. This post does not deal with drilling down, however.

Posted by: Kaiser | 06/18/2010 at 11:17 PM

"Another is that statistical techniques by definition generalize the data, and thus are not very sensitive to individual values."

This statement is a little over the top. The robustness of a statistical technique to outliers varies a good deal between methods and across sample sizes. Least squares regression like you're doing here is actually quite sensitive to anomalous data points in many circumstances.

Posted by: J | 06/20/2010 at 11:55 PM