I was a guest on the Analytically Speaking series, organized by JMP. In this webcast (link, registration required), I talk about the coexistence of data science and statistics, why my blog is called "Junk Charts", what I look for in an analytics team, the tension between visualization and machine algorithms, two modes of statistical modeling, and other things analytical.
Andrew Gelman linked to this great reporting by Reuters on U.S. healthcare economics. It's a must-read. Be patient, and read through to the end even though it's a long piece.
Andrew cites statistician Don Berry who explains what "lead time bias" is, and why survival time is always the wrong metric to use in evaluating health outcomes. Survival time is the time from diagnosis to death. By doing more screening and diagnosing earlier, survival time will magically increase even if the patient's life expectancy stays put.
I ignored Andrew's warning and spent some time reading the Philipson et al. paper (link). Time I wish I could get back. To save you the trouble, I will discuss a few gaping holes beyond the howler already identified by Berry - there are many other, less significant issues as well.
The title of the paper purports to address the following "causal" relationship:
The reader immediately discovers that the authors analyzed a different "causal" relationship:
It may appear that the substitutions are harmless: spending on cancer care is a proxy for overall healthcare spending; survival gains for cancer patients are a proxy for overall health benefits. The authors hid the useful information in the Appendix (available online). In Table 3, we learn that spending on cancer care is only a single-digit percentage of total health care spending in almost every country. Besides, the total deaths from the 13 types of cancer counted in their study constitute only 31 percent of total cancer deaths in the U.S. (using the 2011 statistics from this report - PDF). This list of included cancer types excludes the biggest killer (lung cancer, over 150,000 deaths) while it includes testicular cancer, which caused 350 deaths in 2011.
So, even if the analysis is correct, the result cannot be generalized to talk about cost and benefit of all health care spending. This is an instance of "availability bias": even though cancer makes a lot of news, most health care spending has nothing to do with cancer, and so we can't use cancer care spending as a proxy.
In assessing the value of cancer care spending, the authors decided to use a modeled change in death rates, rather than the actual observed data. Neither in the paper nor in the appendix is the actual model reported, nor is there any information on goodness of fit. However, we don't need to know the model to know it doesn't fit.
Take a look at the fourth column of Table 1 in the Appendix. This column shows the predicted deaths avoided or incurred in the U.S. (given the additional spending in the U.S. relative to "Europe").
Let's do a sanity check on these numbers. For colorectal cancer, the model claims that the extra spending has avoided 282,000 deaths over the 23 years (1982-2005), or roughly 12,300 deaths per year. According to the cancer death statistics, about 50,000 deaths from colon cancer actually occurred in the U.S. in 2011. That means the model claims that colon cancer deaths would have been 25% higher were it not for the extra spending. What is the miracle drug that caused this gigantic improvement? What prevents this amazing new treatment from crossing the Atlantic?
Maybe you believe in miracles. Then, take a look at stomach cancer. Here, the negative number seems to imply that the additional spending has induced 225,000 stomach cancer deaths over 23 years. That sounds really horrifying. Given that stomach cancer killed 10,300 Americans in 2011, the model claims that the extra spending has doubled the number of deaths from stomach cancer!
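For readers who want to follow the arithmetic, here is a quick sanity check in code; all figures are the rounded numbers quoted above, so treat the outputs as approximations.

```python
# Sanity check of the model's claimed deaths avoided/induced, using the
# rounded figures quoted above (all numbers are approximate).
years = 23  # the 1982-2005 study window

# Colorectal cancer: 282,000 deaths claimed to be avoided over the window
avoided_per_year = 282_000 / years
print(f"Colorectal deaths avoided per year: {avoided_per_year:,.0f}")                 # ~12,300
print(f"Relative to ~50,000 actual deaths in 2011: {avoided_per_year / 50_000:.0%}")  # ~25%

# Stomach cancer: 225,000 deaths implied to be *induced* over the window
induced_per_year = 225_000 / years
print(f"Stomach deaths induced per year: {induced_per_year:,.0f}")                    # ~9,800
print(f"Relative to ~10,300 actual deaths in 2011: {induced_per_year / 10_300:.0%}")  # ~95%, i.e. nearly doubling
```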
Now, go back to Table 3 in the Appendix and read the note. It says that missing data for the percentage of health care spending that is cancer-related are imputed as 6.5% (30% higher than the 5% assumed for the U.S., which came from a totally different source), and we find that Iceland, Norway, Slovakia and Slovenia (40% of the countries) are all imputed.
The problem here is that the authors are not consistent in their treatment of missing data. In the main paper, they explain again and again that their sample of data is restricted by data availability (i.e. they didn't impute values for missing data). For example, the 10 European countries were chosen because "only ten reported data consistently over the 1983-99 period". This means no Italy and no Spain, but Wales and Scotland (though no England), plus Slovakia and Slovenia (why are they comparable to the U.S.?).
Why those particular 13 cancers? Because "data were consistently available from both the European and US survival databases". This means including testicular cancer and excluding lung cancer. Instead of imputing values for lung cancer, they just dropped the cancer type that causes the most deaths.
Why look at survival differences only for patients diagnosed from 1995 through 1999? You guessed it. It's because only in those periods can they find consistent data.
Given that they used models throughout the research, and that they imputed values for the proportion of spending on cancer treatment, they could have tried imputation for these other missing data as well, and then the result could perhaps be generalized.
Dropping data because some variables are missing should be justified clearly. It's too easy to cherry-pick your dataset this way.
How about another nonsensical assumption? The average value of an additional year of life for someone dying of cancer is set at $150,000 to $360,000. They describe these as "standard figures for an extra year of life" and call the lower end of the range "conservative". Only 5% of Americans earn over $100,000 per year, and the median personal income is less than $40,000 (from Wikipedia, for 2004, I believe). Enough said.
It's sad that this paper gets publicity only because it reaches a conclusion that goes against "conventional wisdom". The clear evidence so far has been that while the U.S. spends twice as much on health care as other "wealthy" nations, our life expectancy is lower, at the bottom of the class. (See here, for example.)
The chart shown on the right is as clear as it can be. (I discussed this chart on Junk Charts.) The state of science journalism is dire indeed, in my opinion, when outlets chase clicks and sales by publicizing bad studies with eye-catching headlines.
Over at Junk Charts, I examined Nate Silver's ranking of New York neighborhoods (first published in New York magazine): Which factors affected the rankings? How did the factors correlate amongst themselves?
While analyzing the data (which I hand-transferred from the printed pages), I found a moderate number of typos: scores and ranks that don't make much sense. Now, I am not here to criticize their editors because, as anyone who makes a living analyzing data knows, typos and other data issues are the norm, not the exception, in this business. What I want to do here is describe how I uncovered the typos, and more importantly, why statistical analyses are often immune to such typos.
On the right are plots of the scores against the ranks for each category (factor) being evaluated. We expect to see a monotonically decreasing function: as rank increases (moving from left to right), the score must decrease or stay put; it should never increase.
The sharp valleys and peaks in almost every one of these charts are typos. For example, the sharp valley in "Creative Capital" corresponds to Parkchester, ranked 29th in this category, whose reported score of 63 is much lower than Harlem's (75, rank 28) and Astoria's (74, rank 30).
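Here is a minimal sketch of this monotonicity check, using a hypothetical three-row slice of the hand-transcribed data; the ranks and scores are the ones cited above, and the table layout is my own.

```python
import pandas as pd

# A slice of the Creative Capital category, ordered by published rank.
# (Illustrative data, reconstructed from the three values cited above.)
df = pd.DataFrame({
    "neighborhood": ["Harlem", "Parkchester", "Astoria"],
    "rank":         [28, 29, 30],
    "score":        [75, 63, 74],
}).sort_values("rank")

# In a clean ranking, scores never increase as rank worsens. Any row whose
# score exceeds that of the immediately better-ranked row marks a peak or
# valley and flags a likely typo nearby.
df["suspect"] = df["score"] > df["score"].shift(1)
print(df[df["suspect"]])   # flags Astoria's 74 > 63, pointing at the Parkchester/Astoria pair
```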
I spent quite a bit of time trying to fix these errors, using the surrounding data to reason out whether the rank or the score was mistyped. It was a fruitless exercise.
(Look at "Green Space" for example. The line went up and down, indicating that there were many typos, and any fix would have involved a whole series of changes.)
In practice, data analysts do not fix typos unless they are extremely egregious and unambiguous -- and even then, the fix may just be to restate the value as "unknown". One reason is that one doesn't want to make a bad situation worse. Another is that statistical techniques by definition generalize the data, and thus are not very sensitive to individual values.
To illustrate this point, I ran a linear regression of the overall scores against the category scores. According to Silver's ranking formula, the overall score should be a weighted average of the category scores; for example, housing affordability had a weight of 25% in the formula.
The regression answers the question of how much of the overall score is explained by the individual category scores. It should be 100% if there were no typos: if you know the category scores, you should be able to derive the overall score exactly. Because there are typos, the fit will be slightly off.
The chart on the right shows that the fit is almost, but not quite, perfect. It compares the actual overall score as reported in New York magazine with the "predicted" overall score from the regression analysis.
The regression in effect "recovers" the weights used by Nate Silver in his algorithm (shown to the right). Despite the "noise" introduced by the typos, the weights found by the regression (shown in the column labelled "Estimate") are almost exactly those used by Silver.
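To see why the weights can be recovered despite the typos, here is a minimal simulation sketch. Only the 25% weight on housing affordability comes from the article; the other weights and all the data below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ranking-formula weights. Only the 25% on housing affordability
# is from the article; the rest are invented and sum to 1.
weights = np.array([0.25, 0.20, 0.20, 0.15, 0.10, 0.10])

n = 50                                             # pretend "neighborhoods"
X = rng.uniform(0, 100, size=(n, len(weights)))    # category scores
y = X @ weights                                    # overall score = weighted average

y_noisy = y.copy()
y_noisy[:3] += rng.normal(0, 5, size=3)            # inject a few "typos"

# Least-squares regression (no intercept) recovers the weights almost exactly,
# despite the corrupted values.
est, *_ = np.linalg.lstsq(X, y_noisy, rcond=None)
print(np.round(est, 3))   # close to [0.25, 0.20, 0.20, 0.15, 0.10, 0.10]
```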
This is why many statisticians are not overly concerned with small errors in the data. We expect that data is not clean, and we know many of our techniques can overcome those errors.
PS. Here is my post on Junk Charts on Nate Silver's rankings.
Vitamin A is commonly added to sunscreens because of its supposed anti-aging effect, but an FDA study from ten years ago showed that Vitamin A accelerates the growth of cancerous tumors in rats.
Moral hazard: people who buy high-SPF sunblocks tend to stay out in the sun longer because they think they are better protected.
Lab conditions versus reality: people who buy high-SPF sunblocks fool themselves in a different way; they apply only a quarter of the recommended amount, which means that the protective effect reported by the manufacturer is vastly overstated.
Using a Freakonomics-style argument, one can say that Dr. Andrew Wakefield may have endangered lots of children. He was the one who published discredited research that purportedly linked autism to the combined vaccine for measles, mumps, rubella (MMR).
As a result, vaccination rates dropped (roughly from 90% to 80% in the U.K.), and measles has made a comeback in Western countries, with worrisome consequences (from under 100 cases to 1,400 cases). But note that the more than 10-fold increase most likely came from the 10% who switched from the vaccinated to the unvaccinated category. There have, thankfully, been only a few deaths.
In the wake of the controversy, Dr. Wakefield moved to Texas but has recently left the clinic he founded.
Several attempts to replicate his research have failed. He was also found guilty of various counts of unethical conduct, including testing a new vaccine on a child without permission, and taking blood samples from unsuspecting kids attending his son's birthday party (offering 5 quid each).
The original Wakefield study had a sample size of 12.
Ben Goldacre of the Guardian did exemplary work in bringing attention to the MMR scare in the UK. He believed that the blame should be placed squarely on the media for promulgating Dr. Wakefield's "research" for years while ignoring available evidence to the contrary.
Martin Gardner, 1914-2010
Brian Hayes remembers a man who entertained many with mathematical puzzles.
Jacques Bertin, 1918-2010
Reader from France Bernard L sent in this note:
It is with great sadness that I've learnt of the recent passing, in early May at the age of 92, of Jacques Bertin, author of the Semiology of Graphics. Through his work he laid down the foundation of information visualization. I'll keep fond memories of the time I spent with him when he agreed to write the preface for my book, of his wit and his ever-amused, childlike gaze when we discussed data visualisation topics. He has left us for a new territory of charts and maps...
The small sample size used in the "useful chartjunk" paper is a major downer. Typically, small samples contain much "noise", making it difficult to find the "signal". (Recall the fallacy discussed by Howard Wainer concerning the small-schools movement.)
The authors, however, found several statistically significant differences. For example, participants were found to have greater ability to describe the "value message" of USA-Today-type (Holmes) charts relative to Tuftian (plain) charts showing the same dataset of 5 numbers. The chart below displays this result:
Even more shocking: the significance threshold was not merely passed but demolished. According to the paper, the p-values for the above tests were 0.003, 0.026 and 0.020 respectively. These are incredibly small p-values, especially when the sample size was only 20. (The p-value of 0.003 or 0.3% means that if both types of chart are equally effective, there is only a 0.3% chance that the 20 participants did as well as they did on the USA-Today charts relative to the Tuftian charts. Thus, the observed result presented an almost bulletproof case that chartjunk was better. For more on how this works, see Chapter 5 of Numbers Rule Your World.)
How did the researchers overcome the small sample size? The short answer is: it appeared that the experimenter consistently scored the Holmes charts higher than the plain charts for all participants, thus the "signal" was very strong, and able to rise above the noise.
It is hard to believe that Tuftian charts are so awful that everyone performs worse on those relative to Holmes charts. I'm more inclined to believe that this result is due to too much subjectivity in the design of the experiment.
Warning: the rest of the post is technical.
Fortunately, the authors provided just enough data in the paper to unravel this mystery. I'll focus attention on the description task (the first set of columns in the figure above). Since the sample size is so small, we may suspect that significance is a result of participants being very similar to one another.
The figure above tells us that the metric being evaluated is the difference in sum of scores between the Holmes charts and the plain charts. Recall that each participant saw 6 Holmes charts, 6 plain charts, and 2 training charts (dropped from the analysis). Each chart is given a score by the experimenter between 0 and 3. Thus, the sum of scores for any one participant and one chart type could range from 0 to 18, and the maximum difference in sum of scores would be 18 - 0 = 18.
Amazingly, the observed difference in sum of scores, averaged across the 20 participants, was just 1: on average, the participants scored 5 on the Holmes charts and 4 on the plain charts. Put differently, they scored 0.83 per Holmes chart and 0.67 per plain chart. According to the scoring criteria, this means they were "mostly" to "all" incorrect on pretty much every chart.
Based on the t statistic (t = 3.37) provided in the paper, we can also estimate the variability across participants. Since the mean difference was 1.0, the "standard error" of the difference is about 0.3; this implies a standard error of approximately 0.21 for each chart type's sum of scores. As a first-order approximation, if we assume the sums were normally distributed and use the 3-sigma rule, the sum of scores for the Holmes charts falls between 4.4 and 5.6, and for the plain charts, between 3.4 and 4.6. (This estimation appeared to match the SE intervals shown in the figure above.)
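For those who want to reproduce these figures, here is a minimal sketch. It assumes a paired t test (19 degrees of freedom) and equal, independent variability for the two chart types; neither assumption is spelled out in the paper.

```python
from scipy import stats

t_stat = 3.37     # as reported in the paper
mean_diff = 1.0   # Holmes sum (5) minus plain sum (4), on average

# Two-sided p-value implied by a paired t test on 20 participants (df = 19)
p_two_sided = 2 * stats.t.sf(t_stat, 19)
print(f"Implied p-value: {p_two_sided:.3f}")        # ~0.003, as reported

se_diff = mean_diff / t_stat                        # SE of the difference, ~0.30
se_each = se_diff / 2 ** 0.5                        # per chart type, ~0.21 (assumes independence)
for label, mean in [("Holmes", 5.0), ("plain", 4.0)]:
    low, high = mean - 3 * se_each, mean + 3 * se_each
    print(f"{label}: {low:.1f} to {high:.1f}")      # 4.4-5.6 and 3.4-4.6
```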
So, incredibly, pretty much everyone did more poorly on the plain charts than on the Holmes charts. Since the difference is so consistent, there is no need for a large number of participants to prove the case!
The question is whether we believe in the scoring mechanism.
This post is a companion to my Junk Charts post on why we can't trust the research which purportedly showed that USA-Today chartjunk is "more useful" than Tuftian plain graphics. Here is an example of the two chart types they compared:
In this post, I discuss how to read a paper such as this that describes a statistical experiment, and evaluate its validity.
First, note the sample size. They only interviewed 20 participants.
This is the first big sign of trouble. Daniel Kahneman calls this the "law of small numbers": the fallacy of generalizing from the limited information in small samples. For a "painless" experiment of this sort, in which subjects are just asked to read a bunch of charts, there is no excuse for using such a small sample.
Next, tally up the research questions. At the minimum, the researchers claimed to have answered the following questions:
Which chart type led to a better description of subject?
Which chart type led to a better description of categories?
Which chart type led to a better description of trend?
Which chart type led to a better description of value message?
Did chart type affect the total completion time of the description tasks?
Which chart type led to a better immediate recall of subject?
Which chart type led to a better immediate recall of categories?
Which chart type led to a better immediate recall of trend?
Which chart type led to a better immediate recall of value message?
Which chart type led to a better long-term recall of subject?
Which chart type led to a better long-term recall of categories?
Which chart type led to a better long-term recall of trend?
Which chart type led to a better long-term recall of value message?
Which chart type led to more prompting during immediate recall of subject?
Which chart type led to more prompting during immediate recall of categories?
Which chart type led to more prompting during immediate recall of trend?
Which chart type led to more prompting during immediate recall of value message?
Which chart type led to more prompting during long-term recall of subject?
Which chart type led to more prompting during long-term recall of categories?
Which chart type led to more prompting during long-term recall of trend?
Which chart type led to more prompting during long-term recall of value message?
Which chart type did subjects prefer more?
Which chart type did subjects most enjoy?
Which chart type did subjects find most attractive?
Which chart type did subjects find easiest to describe?
Which chart type did subjects find easiest to remember?
Which chart type did subjects find easiest to remember details?
Which chart type did subjects find most accurate to describe?
Which chart type did subjects find most accurate to remember?
Which chart type did subjects find fastest to describe?
Which chart type did subjects find fastest to remember?
I think I made my point. There were more research questions than participants. Why is this bad?
Let's do a back-of-the-envelope calculation. First, think about any one of these research questions. For a statistically significant result, we would need roughly 15 of the 20 participants to pick one chart type over the other. Now, if the subjects had no preference for one chart type over the other, what is the chance that at least one of the 31 questions above will yield a statistically significant difference? The answer is about 50%! Ouch. In other words, the probability of one or more false positive results in this experiment is 50%.
For those wanting to see some math:
Let's say I give you a fair coin for each of the 31 questions. Then, I ask you to flip each coin 20 times. What is the chance that at least one of these coins will show heads at least 15 times out of 20 flips? For any one fair coin, the chance of getting 15 or more heads in 20 flips is very small (about 2%). But if you repeat this with 31 coins, then there is a 47% chance that you will see at least one coin with 15 or more heads out of 20 flips! The probability of at least one 2% event is 1 minus the probability of zero 2% events; and the probability of zero 2% events is the product (31 times) of the probability that any given coin shows fewer than 15 heads in 20 flips (= 98%).
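Here is the same calculation in a few lines of code, using exact binomial arithmetic; the 15-out-of-20 threshold is the rough significance bar used above.

```python
from math import comb

n_flips, n_questions = 20, 31

# P(at least 15 heads in 20 flips of a fair coin)
p_one = sum(comb(n_flips, k) for k in range(15, n_flips + 1)) / 2 ** n_flips
print(f"One coin/question: {p_one:.1%}")          # about 2%

# P(at least one of the 31 coins/questions clears that bar)
p_any = 1 - (1 - p_one) ** n_questions
print(f"At least one of 31: {p_any:.1%}")         # about 47%, as stated above
```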
Technically, this is known as the "multiple comparisons" problem, and it is particularly bad when a small sample size is juxtaposed with a large number of hypotheses.
Another check is needed on the nature of the significance, which I defer to a future post.
On Junk Charts this past week, I posted the slides for a talk given at New York University, jointly with Dona Wong, which summarized five years of blogging about charts.
As part of the research for the above talk, I found that U.S. readers accounted for about half of my page views, followed by Europe. So it was only fitting that the other three posts had an international, especially European, flavor. Many readers contributed to a discussion of the "spinometer" used in British elections. I offered an alternative visualization of the web of debt among the PIIGS countries. And I posted a McCandless infographic on multiculturalism, which may or may not be tongue-in-cheek.