The previous post has elicited protests of "it's not that bad" from some corners. Well, it's bad. Let's look at it from another angle.
We start with the Economist chart, and ask what is the message.
The chart says that in 2006-8, ten Indian states had female-to-male birth ratios below the world average (the so-called "natural ratio"). For those who know their Indian geography, the chart names these anomalous states. It also tells us that among these ten states, some gained and others lost ground compared with the 2001-3 period. There is no obvious pattern as to which states are gaining and which are losing.
That's pretty much everything that one can discern from this chart.
The problem is, the average Economist reader already knows that India, like China and many other Asian countries, has more male babies relative to female babies than the rest of the world. If he or she doesn't know this fundamental statistic, the chart does not help, because it says nothing about the other 24 states that make up the Indian average.
Worse, the chart raises the suspicion of voodoo statistics. It suggests that the other 24 states have a gender ratio that is at least equal to, if not above, the natural ratio. One would then have to believe that either the overall Indian average is higher than the natural ratio, or the negative deviations from the world average (as shown on the chart) are quite a bit larger than the positive deviations (not shown), or that the states with positive deviations (not shown) are generally less populous than the ones shown.
Either of the last two conclusions, if true, would be interesting because it implies that the cultural norms, typically claimed to explain this anomaly, are entrenched only within certain geographies. Then, it is inappropriate to speak of India's sex ratio, given this variability between states.
As I pointed out in the prior post, with two data series (two observation dates of the same statistic) at their disposal, the Economist chart focuses on the more recent data. This self-imposed restriction obscures meaningful differences between states over time.
The junkcharts version shows that the current 11 "worst" states could be clustered into two groups: the first group (black lines) has gained ground over the last decade, while the second group (gray) has stagnated, and in some cases, lost ground.
What's more, we learn that every one of the states in the gray group is ranked higher than those in the black group at the start of the decade.
Further, while the distance between the black and gray groups has narrowed over the decade, the gray group, despite its slight decline, is still ranked above the black group, except for Kerala, which has seen dramatic improvement.
Those who know their Indian geography might have further insights as to why the states cluster in this way.
In my view, these findings are much more interesting than the things one can learn from the original chart.
The headline of this Business Insider item reads: "MAP OF THE DAY: There's a 'Superbug' spreading around America killing 40% of the people who come into contact". The only thing missing is the 10 exclamation points that could have been added to the end of the sentence.
Unfortunately, in the mass media, this sort of sentence is quite typical.
Let's dissect the claim.
Indeed, a disease with a fatality rate of 40% is very serious, but one must stop for a second and ask: 40% of what? Accidental falls are sometimes fatal, but they just don't happen often enough for anyone to be worried. In the case of the new superbug, the article tells us there are 350 recent cases in Los Angeles county, which, last I checked, has 10 million residents. At a 40% fatality rate, that works out to about 140 deaths out of 10 million, so the chance of dying from this "superbug" is 0.0014% (roughly 1 in 72,000), compared with 1 in 14,000 for accidental falls.
If you have the bug, you have a 40% chance of dying. But the chance of catching the bug is minuscule. (They say "come into contact". Presumably, more than mere contact is needed to catch the bug.)
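To make the arithmetic concrete, here is a quick sketch. The 350 cases and 10 million residents come from the article; everything else follows from those two numbers.

```python
cases = 350                  # recent superbug cases in L.A. county
fatality_rate = 0.40         # 40% of those who have the bug die
population = 10_000_000      # L.A. county residents

deaths = cases * fatality_rate       # about 140 deaths
risk = deaths / population           # risk for an average resident
print(f"{risk:.4%}")                 # 0.0014%
print(f"1 in {round(population / deaths):,}")
```

The per-resident risk, not the 40% fatality rate among the infected, is the number that tells you how worried to be.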
They then show a map illustrating how this bug is "spreading around America".
If you mentally tally up the yellow area as a proportion of the whole country, you might think 2/3 of the country is an emergency zone. But this map is incredibly misleading. It is still the case that the average American would only have a 0.0014% chance of dying from the superbug. (Strictly speaking, the rate would be a tad higher in the yellow area but this distinction will go away as cases pop up in the rest of the states.)
If one were to plot a similar map for "2010 location of deaths due to accidental falls", the entire map would be yellow. The only thing missing would be the 10 exclamation points.
When you go to the library, you expect to find the books in an organized fashion, typically sorted first by subject matter, then by author, then by title, and so on. Imagine the frustration when you walk in and discover that books are spread out everywhere with no discernible order. We are very particular about tidiness: it would still be terrible if the books were arranged by author and title without first splitting by subject matter. We are annoyed because it would take too long to find a book.
I did run into such an exasperating bookstore -- in Brooklyn, I believe. The (used) books in this store are arranged by the date on which the owner acquired them. Fiction, I recall, is ordered alphabetically by the author's last name, and then, within, say, the 'A' authors, the books are sorted by date of acquisition. What a headache!
Reader Pat L. had a big headache trying to figure out this chart, found on Wikipedia: (I'm just excerpting a small part of it; the full chart is here).
To quote Pat:
I was overwhelmed by the information -- so many chemicals and so many units of measure. I quickly gave up and opened up the image in a picture editor. One by one, I erased the blood chemicals I wasn't interested in. Maybe if I were a doctor, the chart might have been useful.
One way to simplify this is using small multiples. Recognize that few if any users would need to directly compare every one of these chemicals. I'm guessing that groups of chemicals can go on separate charts. This is no different from a bookseller organizing shelves to help readers find books.
As you may have noticed, posting has been sluggish lately, and this means I have a backlog of submissions. So please be patient if you sent something to me.
Alex C., at NUS, wasn't so patient so he wrote up his own post on this gaping chart from Accenture reporting on a survey of people's attitudes towards health care. In particular, Alex does a great job taking down the questions in the survey: for appetizers, he noted "So, the interviewees were asked 'Do you think it’s important to focus on delivering real improvements in the overall health of the nation?' I wonder who answered no?"
(Amusing aside: I fibbed. Alex wasn't impatient. He just couldn't believe how rude I was. After spending some unsavory time manually eyeballing the data from these charts, I accidentally emailed the spreadsheet to Alex, thinking that I was emailing myself. So Alex received a reply from me, with the spreadsheet attached but with no comments, and he figured I couldn't be bothered. Perhaps I "gave minimalist, Ripleyesque replies", he wondered.)
Well, I digress. There are a few other problems with this "radar" chart that Alex was too kind to overlook:
All of the information is in the radius from the center of the circle to the individual dots, and yet lines were drawn to connect the dots into a ragged circle, drawing attention away from the information
The scale of the radii was provided on only one quadrant, which frustrated me to no end when I tried to "eyelift" the data off the charts
There are slips in craftsmanship, as some of the dots seem to fall out of place: e.g. on this particular chart, the two dots for "issue 16" do not seem to be aligned with the label 16; similarly, something seems off with the dots for "issue 13". (These minor slips become very obvious when you are lifting data off the charts.)
Each category contains four questions and occupies one quadrant, but the way the information is arranged on this chart type, one cannot visually aggregate the four individual scores into a category score.
Even more fundamentally, what data are being plotted? Turns out this is convoluted:
For each of the 16 "issues", respondents are asked to rate the importance and the government's performance on separate 5-point scales.
The top 2 points are considered "favorable", and the data depicted are the top-2-point proportions.
So the "gap" that Accenture consultants have stuck their fingers into is the proportion of respondents rating the government's role in an issue as "very important" or "essential" minus the proportion of respondents rating the government performance on that same issue as "fairly well" or "very well".
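A sketch of the metric, using made-up ratings for one hypothetical issue. The top-2-box scoring follows the description above; the individual ratings are invented purely for illustration.

```python
# Hypothetical 5-point ratings from eight respondents for one issue.
importance  = [5, 4, 3, 5, 2, 4, 1, 5]   # importance of the government's role
performance = [3, 2, 4, 3, 2, 5, 1, 2]   # the government's performance

def top2_share(ratings):
    """Proportion of respondents giving one of the top two points (4 or 5)."""
    return sum(r >= 4 for r in ratings) / len(ratings)

# The "gap": favorable-importance share minus favorable-performance share.
gap = top2_share(importance) - top2_share(performance)
print(gap)   # 0.625 - 0.25 = 0.375
```

Note that the gap throws away the middle of both scales, so two countries with very different rating distributions can produce the same gap.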
I just don't share the enthusiasm for this metric, and not merely because we are bound to think an issue is more important if the government has done a poor job at it.
Just to surface another problem: from these charts, it's clear that people in different countries approach 5-point scales differently. Perhaps in some countries (see right), they just fill out a "straight-ticket" vote for all issues?!
Here's my version (pretty much any such grouped column charts can be replaced by line charts):
(Chart purists: I like profile charts which means I like to connect categorical data with lines.)
Anyhow, this data supposedly came from an FDA study, which the FDA has apparently now disowned, according to this AOL News report. Rats were used in the study, and the rate at which they developed significant tumors or lesions was measured. The graph illustrated a clear trend: the higher the dose of Vitamin A, the faster the rats developed cancer; this correlation held whether they were exposed to high or low levels of UV rays.
Notice that I switched the primary categorical axis to Vitamin A doses rather than high/low UV because the study concerned Vitamin A primarily, and levels of UV secondarily.
Using the Trifecta Checkup, we can see that they have the right question, and the right data but a suboptimal chart. Also, the original chart fails the self-sufficiency test: no point in printing the data on top of the columns when there is a vertical scale.
How will this save your life?
Vitamin A is widely added to sunblocks -- not because it has any screening value -- but because it may slow aging of the skin. But the study found that Vitamin A actually partially nullifies the screening ability of sunblocks.
About half of the 500 most popular sunblocks sold in the U.S. contain Vitamin A and only 39 out of the 500 are deemed safe by the Environmental Working Group, which has compiled a database of these products. (There are several other potentially harmful ingredients.)
The FDA denied that such a study existed although the reporter as well as EWG have copies of it. If this study is authentic, the FDA knew about this perhaps ten years ago.
Reference: "Study: Many Sunscreens May Be Accelerating Cancer", Andrew Schneider, AOL News, May 24 2010.
PS. I should explain to my non-U.S. readers that the U.S. is celebrating Memorial Day, the beginning of summer, on Monday, so lots of people are heading to beaches and other vacation spots.
Frequent contributor Bernard L. pointed me to this National Geographic "infographic". It surely belongs to the Art section of the infographics gallery, which I discussed in the "Whither Infographics" post. This is acknowledged by the editors, who labeled it "Art: Fish Pharm".
It's a very pretty picture. And I'm willing to turn a blind eye to:
the uneven sizes of the pills
the dislocated, non-contiguous areas (diphenhydramine)
the dual-colored area (green-yellow), especially as the same green represented a different pill
the water bubbles treated as part of the fish
but I'm still debating:
Is it artistic license taken too far to imply that pharma chemicals have completely stuffed the fish (so much so as to infect even the exhaled bubbles), when the text actually said the fish contained "traces of pharmaceuticals and toiletries"?
The footnote apologizes for the percentages not adding up to 100 percent, but 100 percent of what?
And by the way, this is the first time I have seen the word "pharmaceutical" used as a noun to represent medicines manufactured by pharmaceutical companies. As a noun, I understand "pharmaceutical" to mean a company that designs and makes medicines.
The gulf between infographics and statistical graphics, that is.
Stan at Mashable praised "5 Amazing Infographics for the Health Conscious". They belong to the class of "pretty things" that are touted all over the Web but from a statistical graphics perspective, they are dull.
Reader Mike L. poked me about the snake oil chart (right) while I was writing up this post. The snake oil chart is by David McCandless whose Twitter chart I liked quite a bit.
This one, not very much.
If the location and cluster membership of the substances depicted have some meaning, I might even feel ok about the effervescence. But I don't think so.
I continue to love his pithy text labels though; the "worth it line", truly.
The data (if verified) are pretty useful, though, since there are so many health supplements out there, and as a consumer, it's impossible to know which ones are shams. (Ben Goldacre's site may help.)
Now, let's run through the lowlights of the rest:
I'm still trying to figure out what plus-minus means in the Dirty Water graphic.
The fact that the four buildings are not treated as one complete unit also trips me up: Truckee Meadows is depicted as 7 buildings, which is not divisible by 4. In addition, if 2 short buildings + 1 tall + 1 medium = 200,000 people, how many people live in 2 tall + 1 medium + 4 short buildings?
The obesity charts are piñatas.
The cost of health care chart is boring, just a prettied-up data table. Why are life expectancy statistics expressed to 2 decimal places, and not in years and months?
Why 78.11 years and not 78 years (or 78 years, 1 month)?
The scatter chart relating survival rates of people with various ailments to the survival rates of viruses/bacteria left outside our bodies is alright, but do we care about this correlation?
I hate to be so negative but I can't believe these are examples of good infographics.
My appeal for readers to send in positive examples still stands!
Economists have their misery index; dentists, it seems, have a mystery index.
Laird Harrison, senior editor at DrBicuspid.com, an online newsletter for the dental community, pointed me to this chart when he interviewed me about how to interpret the findings in the latest Quarterly Survey of Economic Confidence, conducted by the American Dental Association. (Note: you have to register to read his article. Registration is free.)
When faced with an index, the first thing to do is to find out what the reference level (here, the zero level) means. Although the report is littered with dozens of similar graphs showing all kinds of indices, I cannot find any definition of the reference level, not even in the methodology appendix. The closest is the following directive for reading the chart I printed above:
For example, [this figure] illustrates that the Net Income Index improved by approximately 10% between 3rd and 4th quarters in 2009, an increase that was driven by 6% fewer dentists responding that net income had declined, approximately 2% more dentists indicating that net income was about the same, and 5% more dentists reporting that net income had increased.
For this survey question, respondents could answer that their Net Income increased, stayed about the same or decreased, and correspondingly, these answers were scored +1, 0, -1. But we still do not know what zero means in the Net Income Index.
Fortunately, the raw data were also provided. I plotted the net score differential, essentially the difference between the proportion who reported an income increase and the proportion who reported a decrease:
The shape of this line looks eerily familiar. But what is the zero level?
After some investigation, I found the answer. The reference level is the net score differential, averaged over the six quarters shown on the chart. In essence, the blue line from this chart, if shifted up by the average net score differential, becomes the green line from the first chart.
How would we interpret such an index? The current quarter's differential was about -40%, which was 3% below the average net score differential between 2008Q3 and 2009Q4 (which was -37%).
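As I understand the construction, each quarter's index value is simply its net score differential minus the six-quarter average of those differentials. A sketch with hypothetical quarterly numbers: only the -40% current-quarter differential and the -37% average come from the report; the other four quarters are invented to be consistent with that average.

```python
# Hypothetical net score differentials (% reporting an income increase
# minus % reporting a decrease). Only 2009Q4 and the -37 six-quarter
# average match the report; the rest are made up for illustration.
diffs = {"2008Q3": -35, "2008Q4": -42, "2009Q1": -38,
         "2009Q2": -34, "2009Q3": -33, "2009Q4": -40}

avg = sum(diffs.values()) / len(diffs)          # -37.0, the reference level
index = {q: d - avg for q, d in diffs.items()}  # differential minus average
print(index["2009Q4"])   # -3.0: three points below the six-quarter average
```

Notice that every quarter's index value, including 2008Q3's, is computed against the same average, which is why the reference level depends on quarters that had not yet happened.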
This index is very problematic. The choice of the past six quarters seems completely arbitrary and ignores any seasonality effect. The use of an unweighted average to average the score differentials assumes that there are no quarterly variations in the data.
But the biggest problem surfaces if one focuses attention on, say, 2008Q3. The top chart says that the net score differential for 2008Q3 was 2% above the average differential from 2008Q3 to 2009Q4. But this is a forward-looking number because in 2008Q3, it was not yet known what the net score differentials would be in the next 5 quarters. Usually, indices are constructed using historical data to establish the reference level.
The mystery is why indexing is even needed. What's wrong with plotting the change in net score differentials?
Reference: "Quarterly Survey of Economic Confidence, Fourth Quarter 2009", American Dental Association, Jan 29 2010.