Over at Junk Charts, I examined Nate Silver's ranking of New York neighborhoods (first published in New York magazine): Which factors affected the rankings? How did the factors correlate amongst themselves?
While analyzing the data (which I hand-transferred from the printed pages), I found a moderate number of typos, scores and ranks that don't make much sense. Now, I am not here to criticize their editors because as anyone who makes a living analyzing data knows, typos and other data issues are the norm, not the exception in this business. What I want to do here is to describe how I uncovered the typos, and more importantly, why statistical analyses are often immune to such typos.
On the right are plots of the scores against the ranks for each category (factor) being evaluated. We expect to see a monotonically non-increasing function: as rank increases (moving from left to right), the score must decrease or stay put; it should never increase.
The sharp valleys and peaks in almost every one of these charts are typos. For example, the sharp valley in the "Creative Capital" chart corresponds to Parkchester, ranked 29th in this category, but its reported score of 63 is much lower than those of Harlem (75, rank 28) and Astoria (74, rank 30).
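For readers who want to run this kind of check on their own data, here is a minimal sketch in Python. The function name and the tuple layout are illustrative assumptions; the three rows are the Creative Capital values quoted above. It flags every pair of adjacent ranks where the score goes up as the rank gets worse.

```python
def find_rank_violations(rows):
    """Return (better_ranked, worse_ranked) name pairs where the score increases.

    rows: iterable of (rank, name, score) tuples for a single category.
    """
    ordered = sorted(rows, key=lambda r: r[0])  # best rank first
    violations = []
    for (_, name_a, score_a), (_, name_b, score_b) in zip(ordered, ordered[1:]):
        # Moving to a worse rank, the score should fall or stay put.
        if score_b > score_a:
            violations.append((name_a, name_b))
    return violations


# Creative Capital values quoted in the post.
creative_capital = [
    (28, "Harlem", 75),
    (29, "Parkchester", 63),
    (30, "Astoria", 74),
]

violations = find_rank_violations(creative_capital)
print(violations)
```

Note that the check only says that something is wrong between two neighbors. It cannot tell whether the rank or the score was mistyped, nor which of the two rows is at fault.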
I spent quite a bit of time trying to fix these errors, using the surrounding data to reason whether the rank or the score was mistyped. It was a fruitless exercise.
(Look at "Green Space" for example. The line went up and down, indicating that there were many typos, and any fix would have involved a whole series of changes.)
In practice, data analysts do not fix typos unless they are extremely egregious and unambiguous -- and even then, the fix may just be to restate the value as "unknown". One reason is that one doesn't want to make a bad situation worse. Another is that statistical techniques by definition generalize the data, and thus are not very sensitive to individual values.
To illustrate this point, I ran a linear regression of the overall scores on the category scores. According to Silver's ranking formula, the overall score should be a weighted average of the category scores, e.g. housing affordability had a weight of 25% in the formula.
The regression answers the question of how much of the overall ranking is explained by the individual category rankings. It should be 100% if there were no typos -- if you know the category scores, you should be able to derive the overall score without uncertainty. Because there are typos, the correlation will be slightly off.
The chart on the right shows that the correlation is almost, but not quite, perfect. The chart compares the actual overall score as reported in New York magazine with the "predicted" overall score as per the regression analysis.
The regression in effect "recovers" the weights used by Nate Silver in his algorithm (shown to the right). Despite the "noise" introduced by the typos, the weights found by the regression (shown in the column labelled "Estimate") are almost exactly those used by Silver.
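The same effect can be reproduced on made-up data. The sketch below is not the analysis behind the chart: the scores are randomly generated, the number of neighborhoods (50), the set of four weights and the size of the "typos" are all assumptions chosen for illustration (only the 25% figure echoes the formula mentioned above), and the least-squares solver is a bare-bones one written with the standard library.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def least_squares(rows, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


random.seed(0)
true_weights = [0.25, 0.30, 0.25, 0.20]  # hypothetical weights
n_neighborhoods = 50                     # hypothetical sample size

# "True" category scores and the overall score computed from them.
clean = [[random.uniform(50, 100) for _ in true_weights]
         for _ in range(n_neighborhoods)]
overall = [sum(w * s for w, s in zip(true_weights, row)) for row in clean]

# Simulated transcription typos: shift a handful of category scores by 10.
noisy = [row[:] for row in clean]
for _ in range(6):
    i = random.randrange(n_neighborhoods)
    j = random.randrange(len(true_weights))
    noisy[i][j] += random.choice([-10, 10])

coefficients = least_squares(noisy, overall)
estimated_weights = coefficients[1:]  # drop the intercept
for w, est in zip(true_weights, estimated_weights):
    print(f"true weight {w:.2f}  estimated {est:.3f}")
```

With only a few corrupted cells among 200, the estimated weights should land close to the true ones, which is the point being made here: the regression averages over all neighborhoods, so a handful of bad values barely moves it.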
This is why many statisticians are not overly concerned with small errors in the data. We expect that data is not clean, and we know many of our techniques can overcome those errors.
PS. Here is my post on Junk Charts on Nate Silver's rankings.
Vitamin A is commonly added to sunscreens because of its supposed anti-aging effect but an FDA study from ten years ago showed that Vitamin A accelerates the growth of cancerous tumors in rats.
Moral hazard: people who buy high-SPF sunblocks tend to stay out in the sun longer because they think they are better protected.
Lab conditions versus reality: people who buy high-SPF sunblocks fool themselves in a different way; they apply only a quarter of the recommended amount, which means that the protective effect reported by the manufacturer is vastly overstated.
Using a Freakonomics-style argument, one can say that Dr. Andrew Wakefield may have endangered lots of children. He was the one who published discredited research that purportedly linked autism to the combined vaccine for measles, mumps, rubella (MMR).
As a result, vaccination rates have dropped (roughly from 90% to 80% in the U.K.), and measles has made a comeback in Western countries, with worrisome consequences (from under 100 cases to 1400 cases). But note that the 10-fold increase most likely came from the 10% who switched from the vaccinated to the unvaccinated category. There have, thankfully, only been a few deaths.
In the wake of the controversy, Dr. Wakefield moved to Texas but has recently left the clinic he founded.
Several attempts to replicate his research have failed. He also was found guilty of various counts of unethical conduct, including testing a new vaccine on a kid without permission, and taking blood samples from unsuspecting kids attending his son's birthday party (by offering 5 quid each).
The original Wakefield study had a sample size of 12.
Ben Goldacre of the Guardian did exemplary work in bringing attention to the MMR scare in the UK. He believed that the blame should be placed squarely on the media for promulgating Dr. Wakefield's "research" for years while ignoring available evidence to the contrary.
Martin Gardner, 1914-2010
Brian Hayes remembers a man who entertained many with mathematical puzzles.
Jacques Bertin, 1918-2010
Reader from France Bernard L sent in this note:
It is with great sadness I've learnt of the recent departure, early May at the age of 92, of Jacques Bertin, author of the Semiology of graphics
Through his work he laid down the foundation of information visualization.
I'll keep the fond memories of the time I've spent with him when he
accepted to preface my book, of his wits and ever amused child gaze
when we discussed the data visualisation topics. He left us for a new
territory to charts and maps...
The small sample size used in the "useful chartjunk" paper is a major downer. Typically, small samples contain much "noise", making it difficult to find the "signal". (Recall the fallacy discussed by Howard Wainer concerning the small-schools movement.)
The authors, however, found several statistically significant differences. For example, participants were found to have greater ability to describe the "value message" of USA-Today-type (Holmes) charts relative to Tuftian (plain) charts showing the same dataset of 5 numbers. The chart below displays this result:
Even more shocking: the significance threshold was not merely passed but demolished. According to the paper, the p-values for the above tests were 0.003, 0.026 and 0.020 respectively. These are incredibly small p-values, especially when the sample size was only 20. (The p-value of 0.003 or 0.3% means that if both types of chart are equally effective, there is only a 0.3% chance that the 20 participants did as well as they did on the USA-Today charts relative to the Tuftian charts. Thus, the observed result presented an almost bulletproof case that chartjunk was better. For more on how this works, see Chapter 5 of Numbers Rule Your World.)
How did the researchers overcome the small sample size? The short answer is: it appeared that the experimenter consistently scored the Holmes charts higher than the plain charts for all participants, thus the "signal" was very strong, and able to rise above the noise.
It is hard to believe that Tuftian charts are so awful that everyone performs worse on those relative to Holmes charts. I'm more inclined to believe that this result is due to too much subjectivity in the design of the experiment.
Warning: the rest of the post is technical.
Fortunately, the authors provided just enough data in the paper to unravel this mystery. I'll focus attention on the description task (the first set of columns in the figure above). Since the sample size is so small, we may suspect that significance is a result of participants being very similar to one another.
The figure above tells us that the metric being evaluated is the difference in sum of scores between the Holmes charts and the plain charts. Recall that each participant saw 6 Holmes charts, 6 plain charts, and 2 training
charts (dropped from the analysis). Each chart is given a score by the
experimenter between 0 and 3. Thus, the sum of scores for any one
participant and one chart type could range from 0 to 18. The maximum difference in sum of scores would be 18-0 = 18.
Amazingly, the observed difference in sum of scores, averaged across 20 participants, was 1, since on average, the participants scored 5 on the Holmes charts and 4 on the plain charts. Put differently, on average, they scored 0.83 per Holmes chart, and 0.67 per plain chart. According to the scoring criteria, this means they were "mostly" to "all" incorrect for pretty much every chart.
Based on the t statistic (t=3.37) provided in the paper, we can also estimate the variability across participants. Since the difference was 1.0, the "standard error" (of the difference) is 0.3. This means the standard deviation of each chart type's sum of scores was approx. 0.21. As a first-order approximation, if we assume the sums were normally distributed and use the 3-sigma rule, this implies that for the Holmes charts, the participants scored between 4.4 and 5.6 while for the plain charts, between 3.4 and 4.6. (This estimation appeared to match the SE intervals shown in the figure above.)
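The arithmetic in the last two paragraphs can be checked in a few lines of Python. This is only a sketch of the back-of-the-envelope reasoning above: the variable names are mine, and dividing the standard error of the difference by the square root of 2 assumes the two chart types contribute equal and independent variability, which is an approximation.

```python
import math

n_charts = 6        # charts of each type seen by a participant
mean_holmes = 5.0   # average sum of scores, Holmes charts
mean_plain = 4.0    # average sum of scores, plain charts
t_stat = 3.37       # t statistic reported in the paper

# Average score per chart (each chart is scored 0 to 3).
per_chart_holmes = mean_holmes / n_charts
per_chart_plain = mean_plain / n_charts

# t = difference / standard error of the difference.
diff = mean_holmes - mean_plain
se_diff = diff / t_stat

# Assumption: equal, independent variability for the two chart types.
se_each = se_diff / math.sqrt(2)


def three_sigma(mean, sigma):
    """Return the (low, high) range covered by mean +/- 3 sigma."""
    return (mean - 3 * sigma, mean + 3 * sigma)


holmes_range = three_sigma(mean_holmes, se_each)
plain_range = three_sigma(mean_plain, se_each)

print(f"per chart: Holmes {per_chart_holmes:.2f}, plain {per_chart_plain:.2f}")
print(f"SE of difference: {se_diff:.2f}, per chart type: {se_each:.2f}")
print(f"Holmes range: {holmes_range[0]:.1f} to {holmes_range[1]:.1f}")
print(f"plain range:  {plain_range[0]:.1f} to {plain_range[1]:.1f}")
```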
So, incredibly, pretty much everyone did more poorly on the plain charts than on the Holmes charts. Since the difference is so consistent, there is no need to have a large number of participants to prove the case!
The question is whether we believe in the scoring mechanism.
This post is a companion to my Junk Charts post on why we can't trust the research which purportedly showed that USA-Today chartjunk is "more useful" than Tuftian plain graphics. Here is an example of the two chart types they compared:
In this post, I discuss how to read a paper such as this that describes a statistical experiment, and evaluate its validity.
First, note the sample size. They only interviewed 20 participants.
This is the first big sign of trouble. Daniel Kahneman calls this the "law
of small numbers", the fallacy of generalizing limited information from
small samples. For a "painless" experiment of this sort in which
subjects are just asked to read a bunch of charts, there is no excuse
to use such a small sample.
Next, tally up the research questions. At the minimum, the researchers claimed to have answered the following questions:
Which chart type led to a better description of subject?
Which chart type led to a better description of categories?
Which chart type led to a better description of trend?
Which chart type led to a better description of value message?
Did chart type affect the total completion time of the description tasks?
Which chart type led to a better immediate recall of subject?
Which chart type led to a better immediate recall of categories?
Which chart type led to a better immediate recall of trend?
Which chart type led to a better immediate recall of value message?
Which chart type led to a better long-term recall of subject?
Which chart type led to a better long-term recall of categories?
Which chart type led to a better long-term recall of trend?
Which chart type led to a better long-term recall of value message?
Which chart type led to more prompting during immediate recall of subject?
Which chart type led to more prompting during immediate recall of categories?
Which chart type led to more prompting during immediate recall of trend?
Which chart type led to more prompting during immediate recall of value message?
Which chart type led to more prompting during long-term recall of subject?
Which chart type led to more prompting during long-term recall of categories?
Which chart type led to more prompting during long-term recall of trend?
Which chart type led to more prompting during long-term recall of value message?
Which chart type did subjects prefer more?
Which chart type did subjects most enjoy?
Which chart type did subjects find most attractive?
Which chart type did subjects find easiest to describe?
Which chart type did subjects find easiest to remember?
Which chart type did subjects find easiest to remember details?
Which chart type did subjects find most accurate to describe?
Which chart type did subjects find most accurate to remember?
Which chart type did subjects find fastest to describe?
Which chart type did subjects find fastest to remember?
I think I made my point. There were more research questions than participants. Why is this bad?
Let's do a back-of-the-envelope calculation. First, think about any one
of these research questions. For a statistically significant result, we
would need roughly 15 of the 20 participants to pick one chart type
over the other. Now, if the subjects had no preference for one chart
type over the other, what is the chance that at least one of the 31
questions above will yield a statistically significant difference? The
answer is about 50%! Ouch. In other words, the probability of one or
more false positive results in this experiment is 50%.
For those wanting to see some math:
Let's say I give you a fair coin for each of the 31 questions. Then, I
ask you to flip each coin 20 times. What is the chance that at least
one of these coins will show heads 15 or more times out of 20 flips? For
any one fair coin, the chance of getting 15 or more heads in 20 flips is
very small (about 2%). But if you repeat this with 31 coins, then there
is a 47% chance that you will see at least one of the coins showing 15 or
more heads out of 20 flips! The probability of at least one 2% event is
1 minus the probability of zero 2% events; the probability of zero 2%
events is the probability of any given coin showing fewer than 15 heads
in 20 flips (= 98%), multiplied by itself 31 times.
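The coin-flip calculation is easy to reproduce exactly. The sketch below uses only the Python standard library; it assumes, as the coin analogy does, that the 31 tests are independent and that "significant" means 15 or more heads out of 20.

```python
from math import comb

n_flips = 20       # participants per question
threshold = 15     # heads needed to call the result "significant"
n_questions = 31   # research questions tallied above

# Chance that one fair coin shows 15 or more heads in 20 flips.
p_single = sum(comb(n_flips, k) for k in range(threshold, n_flips + 1)) / 2 ** n_flips

# Chance that at least one of 31 independent coins does so.
p_at_least_one = 1 - (1 - p_single) ** n_questions

print(f"one coin: {p_single:.1%}")
print(f"at least one of {n_questions} coins: {p_at_least_one:.1%}")
```

The exact single-coin probability comes out slightly above 2%, so the combined figure is a touch higher than what you get by plugging in a rounded 2%; either way it is close to a coin toss.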
Technically, this is known as the "multiple comparisons" problem,
and is particularly bad when a small sample size is juxtaposed with a large
number of hypotheses.
Another check is needed on the nature of the significance, which I defer to a future post.
On Junk Charts this past week, I posted the slides for a talk given at New York University, jointly with Dona Wong, which summarized five years of blogging about charts.
As part of the research for the above talk, I found that U.S. readers accounted for about half of my page views, followed by Europe. So it was only fitting that the other three posts had an international, especially European, flavor. Many readers contributed to a discussion of the "spinometer" used in British elections. I offered an alternative visualization of the web of debt among the PIIGS countries. And I posted a McCandless infographic on multiculturalism, which may or may not be tongue-in-cheek.