Blogging will resume after the holiday weekend. In the meantime, check out these photos from the book signing at Book Expo America (BEA) last week. The publisher set up a fun spinning-wheel game as people lined up for autographs. The books were snapped up in about half an hour, to our pleasant surprise. Some love numbers, while many others know people who love numbers.
Vitamin A is commonly added to sunscreens because of its supposed anti-aging effect, but an FDA study from ten years ago showed that Vitamin A accelerates the growth of cancerous tumors in rats.
Moral hazard: people who buy high-SPF sunblocks tend to stay out in the sun longer because they think they are better protected.
Lab conditions versus reality: people who buy high-SPF sunblocks fool themselves in a different way; they apply only a quarter of the recommended amount, which means that the protective effect reported by the manufacturer is vastly overstated.
Using a Freakonomics-style argument, one can say that Dr. Andrew Wakefield may have endangered lots of children. He was the one who published discredited research that purportedly linked autism to the combined vaccine for measles, mumps, rubella (MMR).
As a result, vaccination rates dropped (roughly from 90% to 80% in the U.K.), and measles has made a comeback in Western countries, with worrisome consequences (from under 100 cases to 1,400 cases). Note that the 10-fold increase most likely came from the 10% who switched from the vaccinated to the unvaccinated category. There have, thankfully, been only a few deaths.
In the wake of the controversy, Dr. Wakefield moved to Texas but has recently left the clinic he founded.
Several attempts to replicate his research have failed. He was also found guilty of various counts of unethical conduct, including testing a new vaccine on a child without permission, and taking blood samples from unsuspecting kids attending his son's birthday party (by offering 5 quid each).
The original Wakefield study had a sample size of 12.
Ben Goldacre of the Guardian did exemplary work in bringing attention to the MMR scare in the UK. He believed that the blame should be placed squarely on the media for promulgating Dr. Wakefield's "research" for years while ignoring available evidence to the contrary.
Martin Gardner, 1914-2010
Brian Hayes remembers a man who entertained many with mathematical puzzles.
Jacques Bertin, 1918-2010
Bernard L, a reader from France, sent in this note:
It is with great sadness that I learnt of the recent passing, in early May at the age of 92, of Jacques Bertin, author of the Semiology of Graphics. Through his work he laid down the foundations of information visualization. I'll keep fond memories of the time I spent with him when he agreed to write the preface to my book, of his wit and ever-amused, childlike gaze when we discussed data visualisation topics. He has left us for a new territory of charts and maps...
The last section of Chapter 5 may feel a little out of place: after examining the ins and outs of statistical testing, using flight safety and lotteries as a backdrop, I tagged on a coda on the laudable but poorly-executed attempt by the government to make flight safety data available to the public. I remarked: "a few well-chosen numbers paint a far richer picture than hundreds of thousands of disorganized data." (p.154)
At the time, I debated whether to drop this section because it has little to do with the key concept of the chapter (statistical testing). In hindsight, the decision to leave it intact proved wise. An exact parallel has developed in the case of the Fed making credit card terms and conditions available to the public.
As reported in the New York Times, the Fed merely dug a hole in the ground and filled it with piles of PDF files. It provides a simple search engine, so if you know what you want to know, you may be in luck; if you want to understand the big picture, you are on your own.
Lest you think this interface was designed for experts only, the left margin proclaimed this the "Consumer's Guide".
Maybe Ed Tufte will get around to fixing this (and a myriad other government databases).
On to political culture. I found two headlines for the NYT article, one more favorable to the administration and one more descriptive of reality. I leave you to decide. The paper version I looked at has the headline shown on the right.
Reference: Sewell Chan and Andrew Martin, "Credit Card Database Is Heroic and Mystifying," New York Times, May 24, 2010.
One fundamental tenet in statistical thinking concerns the signal-to-noise ratio.
Given a mass of data, some portion of it (perhaps most of it) is "noise". Noise is a nuisance. Imagine an ear-splitting bar, where you're trying to hear what your friend is saying while standing only a foot away. Noise covers up the signal; noise is the greatest enemy of the statistician.
Noise is everywhere. I didn't explicitly mention this in the book, but noise is also everywhere in its pages. When I spoke of "sporadic" E. coli cases making it difficult to know whether an outbreak is occurring in Chapter 2, those sporadic cases are noise. When I explained in Chapter 5 why we ought not fear plane crashes happening in bunches, it's because statisticians showed the coincidence is noise, nothing to be alarmed about. (One way to think about statistical tests is that they evaluate the signal-to-noise ratio.)
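To make that parenthetical concrete: a one-sample t-statistic is literally a signal-to-noise ratio, the observed effect divided by its standard error. Here is a minimal sketch in Python; all the numbers are simulated and made up for illustration, not taken from the book.

```python
# Sketch: a t-statistic as a signal-to-noise ratio.
# All numbers below are simulated for illustration only.
import math
import random

random.seed(1)

# Simulate 50 noisy measurements around a true effect of 2.0
data = [2.0 + random.gauss(0, 5) for _ in range(50)]

n = len(data)
mean = sum(data) / n                                   # the "signal"
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
se = sd / math.sqrt(n)                                 # the "noise"

t = mean / se                                          # signal-to-noise ratio
print(f"signal = {mean:.2f}, noise (SE) = {se:.2f}, t = {t:.2f}")
```

A large t means the signal rises well above the noise; a t near zero means the effect, if any, is drowned out.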
What motivated this post is Twitter. The Library of Congress plans to archive "every tweet". This generated a lot of buzz; the tech community saw this as a legitimization of Twitter.
I say blah, because there is really very little new knowledge to be found in Twitter. Almost everything is regurgitation, much of it would not survive scrutiny, and a lot of the tweets are "re-tweets". Here's Jeff Miller describing this issue, which he calls "Twitter's Garbage Problem".
In particular, because you either subscribe to someone's Twitter stream or you don't, you have to take in the nonsense along with the brilliant stuff. Jeff considers how one might implement noise-filtering schemes.
But he pretty much nailed the impossibility of noise filtering earlier in the post when he stated: "One person's garbage is another person's gold."
Think about "spam" for a second. A marketing email from Macy's is not always "spam" even if unsolicited: if you've been shopping for a sofa and Macy's sends you an email offering 50% off its furniture collection, you will most likely not regard the email as spam. This is why spam filters are far from perfect, and Twitter filters face a similar challenge.
For Junk Charts readers, the "signal-to-noise" ratio is manifested in Tufte's data-ink ratio. The data is the signal; the non-data ink is the noise, roughly speaking.
The big news in the sports world right now is about Floyd Landis (again). He was the U.S. cyclist, a former confidant of Lance Armstrong, who won the Tour de France with a historic comeback but was then caught doping and eventually stripped of his title. After the positive test, he waged an expensive legal campaign to discredit the steroid-testing regime; many people believed him, and their fervor can still be experienced at the Trust but Verify blog (no new posts since Dec 08). I wonder how they feel now.
He thrust himself into the news again by admitting years of doping, essentially telling us he lied repeatedly during the legal fight. He detailed his own doping history, and in addition, implicated a lot of American cyclists, including Armstrong.
The initial reaction to Landis's bombshell from his peers is similar to the ice-cold reception Jose Canseco received when he exposed the underworld of baseball doping. Canseco has been proven right in pretty much every case he brought to the public, including his "guess" about Alex Rodriguez. Will Landis eventually be vindicated?
If you agree with my discussion in Chapter 4 of Numbers Rule Your World, you will not be the least bit surprised by this development. I cover a lot in the chapter, but the most relevant points are:
A "negative" test result has little informational value: many disgraced athletes (track star Marion Jones, cyclist Tyler Hamilton, cyclist Bjarne Riis, etc.) pointed to hundreds of negative results but were eventually exposed as dopers. See my previous post on "negative predictive value" for the statistics behind this.
The media typically obsesses only over the "false positive" problem, i.e. star athletes being falsely accused of doping, but misses the much more serious problem of the "false negative", i.e. drug testers failing to uncover dopers.
The "false negative" error is typically hidden: unless people like Landis voluntarily implicate themselves, we will never find out. That's why testing labs are less afraid of this sort of error than the "false positive".
Unfortunately, we don't have good lie detectors; if we did, we would hook these athletes up to them.
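The "negative predictive value" point can be made concrete with a quick Bayes-style calculation. All the numbers below are assumptions chosen purely for illustration, not real figures from any testing program.

```python
# Hedged illustration: why a negative drug test carries little information.
# Every number here is an assumption for the sake of the arithmetic.
prevalence = 0.20   # assume 20% of athletes dope
sensitivity = 0.40  # assume a single test catches only 40% of dopers
specificity = 0.99  # assume a 1% false-positive rate for clean athletes

dopers = prevalence
clean = 1 - prevalence

# Among those who test negative, what fraction actually dope?
neg_dopers = dopers * (1 - sensitivity)   # false negatives
neg_clean = clean * specificity           # true negatives
p_doper_given_negative = neg_dopers / (neg_dopers + neg_clean)

print(f"P(doper | negative test) = {p_doper_given_negative:.1%}")
```

Under these assumed numbers, about 13% of athletes who test negative are still dopers, barely below the 20% prior: the negative result has hardly moved the needle.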
Even within cycling, Landis was not the first to confess to years and years of doping. Danish Tour de France champion Bjarne Riis also confessed after hiding behind negative test results for his entire career. Landis will be attacked relentlessly in the coming days. I'd not be surprised if he were eventually proven right, but of course, we may never know.
PS. This post was mentioned on the Cyclocosm blog, which has much more about pro cycling and the steroids/EPO/etc. scandals, as well as coverage of the on-going Giro d'Italia.
The small sample size used in the "useful chartjunk" paper is a major downer. Typically, small samples contain much "noise", making it difficult to find the "signal". (Recall the fallacy discussed by Howard Wainer concerning the small-schools movement.)
The authors, however, found several statistically significant differences. For example, participants were found to have greater ability to describe the "value message" of USA-Today-type (Holmes) charts relative to Tuftian (plain) charts showing the same dataset of 5 numbers. The chart below displays this result:
Even more shocking: the significance threshold was not merely passed but demolished. According to the paper, the p-values for the above tests were 0.003, 0.026 and 0.020 respectively. These are incredibly small p-values, especially when the sample size was only 20. (The p-value of 0.003 or 0.3% means that if both types of chart are equally effective, there is only a 0.3% chance that the 20 participants did as well as they did on the USA-Today charts relative to the Tuftian charts. Thus, the observed result presented an almost bulletproof case that chartjunk was better. For more on how this works, see Chapter 5 of Numbers Rule Your World.)
How did the researchers overcome the small sample size? The short answer: it appeared that the experimenter consistently scored the Holmes charts higher than the plain charts for all participants; thus the "signal" was very strong and able to rise above the noise.
It is hard to believe that Tuftian charts are so awful that everyone performs worse on those relative to Holmes charts. I'm more inclined to believe that this result is due to too much subjectivity in the design of the experiment.
Warning: the rest of the post is technical.
Fortunately, the authors provided just enough data in the paper to unravel this mystery. I'll focus attention on the description task (the first set of columns in the figure above). Since the sample size is so small, we may suspect that significance is a result of participants being very similar to one another.
The figure above tells us that the metric being evaluated is the difference in sum of scores between the Holmes charts and the plain charts. Recall that each participant saw 6 Holmes charts, 6 plain charts, and 2 training charts (the training charts were dropped from the analysis). Each chart was given a score by the experimenter between 0 and 3. Thus, the sum of scores for any one participant and one chart type could range from 0 to 18, and the maximum difference in sum of scores would be 18 - 0 = 18.
Amazingly, the observed difference in sum of scores, averaged across the 20 participants, was only 1: on average, the participants scored 5 on the Holmes charts and 4 on the plain charts. Put differently, they averaged 0.83 per Holmes chart and 0.67 per plain chart. According to the scoring criteria, this means they were "mostly" to "all" incorrect on pretty much every chart.
Based on the t statistic (t = 3.37) provided in the paper, we can also estimate the variability across participants. Since the difference was 1.0, the "standard error" (of the difference) is 1.0/3.37, or about 0.3. Assuming the two chart types contribute equally and independently, the standard error for each chart type's sum of scores is approx. 0.21 (= 0.3/√2). As a first-order approximation, if we assume the sums were normally distributed and use the 3-sigma rule, this implies that for the Holmes charts, the participants scored between 4.4 and 5.6, while for the plain charts, between 3.4 and 4.6. (This estimation appeared to match the SE intervals shown in the figure above.)
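This back-of-the-envelope arithmetic is easy to reproduce. The only inputs are the difference (1.0) and t statistic (3.37) reported in the paper, plus my own assumption that the two chart types' sums vary equally and independently.

```python
# Reproducing the back-of-the-envelope numbers from the paper's t statistic.
# Assumes the two chart types' sums vary equally and independently.
import math

diff = 1.0   # average difference in sum of scores (Holmes - plain)
t = 3.37     # t statistic reported in the paper

se_diff = diff / t                  # standard error of the difference (~0.3)
se_each = se_diff / math.sqrt(2)    # per-chart-type spread (~0.21)

# Approximate 3-sigma ranges around the average sums (5 for Holmes, 4 for plain)
holmes_range = (5 - 3 * se_each, 5 + 3 * se_each)
plain_range = (4 - 3 * se_each, 4 + 3 * se_each)
print(f"SE(diff) = {se_diff:.2f}, per-type spread = {se_each:.2f}")
print(f"Holmes: {holmes_range[0]:.1f} to {holmes_range[1]:.1f}")
print(f"Plain:  {plain_range[0]:.1f} to {plain_range[1]:.1f}")
```

The ranges come out to roughly 4.4-5.6 for the Holmes charts and 3.4-4.6 for the plain charts.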
So, incredibly, pretty much everyone did worse on the plain charts than on the Holmes charts. Since the difference is so consistent, there is no need for a large number of participants to prove the case!
The question is whether we believe in the scoring mechanism.
This post is a companion to my Junk Charts post on why we can't trust the research which purportedly showed that USA-Today chartjunk is "more useful" than Tuftian plain graphics. Here is an example of the two chart types they compared:
In this post, I discuss how to read a paper such as this that describes a statistical experiment, and evaluate its validity.
First, note the sample size: they interviewed only 20 participants. This is the first big sign of trouble. Daniel Kahneman calls this the "law of small numbers": the fallacy of generalizing from the limited information in small samples. For a "painless" experiment of this sort, in which subjects are just asked to read a bunch of charts, there is no excuse for such a small sample.
Next, tally up the research questions. At the minimum, the researchers claimed to have answered the following questions:
Which chart type led to a better description of subject?
Which chart type led to a better description of categories?
Which chart type led to a better description of trend?
Which chart type led to a better description of value message?
Did chart type affect the total completion time of the description tasks?
Which chart type led to a better immediate recall of subject?
Which chart type led to a better immediate recall of categories?
Which chart type led to a better immediate recall of trend?
Which chart type led to a better immediate recall of value message?
Which chart type led to a better long-term recall of subject?
Which chart type led to a better long-term recall of categories?
Which chart type led to a better long-term recall of trend?
Which chart type led to a better long-term recall of value message?
Which chart type led to more prompting during immediate recall of subject?
Which chart type led to more prompting during immediate recall of categories?
Which chart type led to more prompting during immediate recall of trend?
Which chart type led to more prompting during immediate recall of value message?
Which chart type led to more prompting during long-term recall of subject?
Which chart type led to more prompting during long-term recall of categories?
Which chart type led to more prompting during long-term recall of trend?
Which chart type led to more prompting during long-term recall of value message?
Which chart type did subjects prefer more?
Which chart type did subjects most enjoy?
Which chart type did subjects find most attractive?
Which chart type did subjects find easiest to describe?
Which chart type did subjects find easiest to remember?
Which chart type did subjects find easiest to remember details?
Which chart type did subjects find most accurate to describe?
Which chart type did subjects find most accurate to remember?
Which chart type did subjects find fastest to describe?
Which chart type did subjects find fastest to remember?
I think I made my point. There were more research questions than participants. Why is this bad?
Let's do a back-of-the-envelope calculation. First, think about any one of these research questions. For a statistically significant result, we would need roughly 15 of the 20 participants to pick one chart type over the other. Now, if the subjects had no preference for one chart type over the other, what is the chance that at least one of the 31 questions above will yield a statistically significant difference? The answer is about 50%! Ouch. In other words, the probability of one or more false-positive results in this experiment is about 50%.
For those wanting to see some math: say I give you a fair coin for each of the 31 questions, and ask you to flip each coin 20 times. What is the chance that at least one of these coins shows heads at least 15 times out of 20? For any one fair coin, the chance of getting 15 or more heads in 20 flips is very small (about 2%). But repeat this with 31 coins, and there is a 47% chance that at least one of them shows 15 heads out of 20 flips! The probability of at least one 2% event is 1 minus the probability of zero 2% events; the probability of zero 2% events is the product (31 times) of the probability that any given coin shows fewer than 15 heads in 20 flips (= 98%).
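The coin-flip arithmetic can be checked exactly with the binomial distribution:

```python
# Exact check of the coin-flip arithmetic: with 31 fair coins each flipped
# 20 times, how likely is at least one coin to show 15+ heads?
from math import comb

n, k, coins = 20, 15, 31

# P(a single fair coin shows at least 15 heads in 20 flips)
p_one = sum(comb(n, h) for h in range(k, n + 1)) / 2 ** n

# P(at least one of the 31 coins does so)
p_any = 1 - (1 - p_one) ** coins
print(f"single coin: {p_one:.1%}, at least one of 31: {p_any:.1%}")
```

The exact figures are about 2.1% for a single coin and 47.7% for at least one of the 31 coins, confirming the rough numbers above.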
Technically, this is known as the "multiple comparisons" problem, and it is particularly bad when a small sample size is juxtaposed with a large number of hypotheses.
Another check is needed on the nature of the significance, which I defer to a future post.
A few more reviews of the book by bloggers have trickled in, thankfully all positive.
From Claus at planetwater, a blog about "ground water, engineering, science, geo-statistics":
I really loved reading those stories. They are well written, I think well understandable for somebody who is not experienced or even trained in “statistical thinking”. Finally, a big plus is a longer than normal “conclusions” section, where Kaiser Fung tries to put the underlying basic thoughts of each story into almost all the other stories’ context.
See also Claus's post on "Magnitudes of Extreme Weather Events", which is his response to a topic in my book.
I really don’t do book reviews, but this is an exception. And I’m still in the middle of reading it, too... For folks who have inquisitive minds about why stuff is there and what happens, I suggest reading Fung’s book, which was recommended by a friend who also seems to be into understanding innocuous bits of ...
This is one of the best books I have ever read next to Freakonomics by Steven Levitt and Stephen Dubner... This book has opened my eyes to many more ideas of what may be behind my thoughts, and it will help me think rationally according to statistics when making a decision in the future.
Originally heard of this from reading Tom Peters' Twitter feed and it is well worth your time. Everyone instinctively knows the role of numbers in your life, but here you can delve deeper and get a much greater understanding which could change the way you live. Seriously. Check it out.
In addition to the Japanese version, Numbers Rule Your World will be coming out in Chinese and Korean.
Since I have many European readers, I hope it will also be translated into French, German, Spanish, Italian, etc.
Many people still believe that cold weather or cold air causes the common cold. Even the name, 'cold' in English, 'raffreddore' in Italian, implies that it's somehow a result of getting cold. But it's not true. There is no such thing as the common cold in Antarctica because it's caused by a virus...
So why do we associate the common cold with winter? Because there is a connection: in winter, people spend more time indoors, in close contact with each other, in places where there is little air circulation, and the viruses spread.
Where else would I find this perfect little illustration of a hidden indirect explanation for an observed correlation? It's in The Italian Job, an intelligent book comparing the English and Italian football (soccer) cultures by Gianluca Vialli (former Azzurri striker and Champions League star) and Gabriele Marcotti. (Yes, I'm gearing up for the World Cup.)
On Junk Charts this past week, I posted the slides for a talk given at New York University, jointly with Dona Wong, which summarized five years of blogging about charts.
As part of the research for the above talk, I found that U.S. readers accounted for about half of my page views, followed by Europe. So it was only fitting that the other three posts had an international, especially European, flavor. Many readers contributed to a discussion of the "spinometer" used in British elections. I offered an alternative visualization of the web of debt among the PIIGS countries. And I posted a McCandless infographic on multiculturalism, which may or may not be tongue-in-cheek.