How to fail three tests in one chart

The November issue of Bloomberg Markets published the following pair of pyramid charts:


This chart fails a number of tests:

Tufte's data-ink ratio test

There are a total of six data points in the entire graphic. A mathematician would say only four, since the "no opinion" category is just the remainder. The designer lavishes this tiny data set with a variety of effects: colors, triangles, fonts of different tints, fonts of different sizes, solid and striped backgrounds, and legends, making something simple much more complex than necessary. The extra stuff impedes rather than improves understanding. In fact, there were so many parts that the designer even forgot to add the little squares beside the category labels on the right panel.

Junk Charts's Self-sufficiency test

The data are encoded in the heights of the pyramids, not the areas. Worse, the shapes are inconsistent: the way it is set up, one must compare the green, striped triangle with two trapezoids, which makes the areas impossible to decipher. This is when a designer realizes that he or she must print the data labels onto the chart as well. That's when self-sufficiency is violated: cover up the data labels, and the graphical elements themselves no longer convey the data to the readers. More posts about self-sufficiency here.

Junk Charts's Trifecta checkup

The juxtaposition of two candidates' positions on two entirely different issues does not yield much insight. One is an economic issue; the other is military in nature. Is this a commentary on the general credibility of the candidates? On their credibility on specific issues? On the investors' attitudes toward the issues? Once the pertinent question is clarified, the journalist needs to find the right data to address it. More posts about the Trifecta checkup here.

Minimum Reporting Requirements for polls

Any pollster who doesn't report the sample size and/or the margin of error is not to be taken seriously. In addition, we should want to know how the sample was selected. What does "global investors" mean? Did the journalist randomly sample some investors? Or did investors happen to fill out a survey that was served up to them somehow?


The following bar charts, while not innovative, speak louder.


The "data" corner of the Trifecta

In the JunkCharts Trifecta checkup, we reserve a corner for "data". The data used in a chart must be in harmony with the question being addressed, as well as with the chart type being selected. When people think about data, they often think of cleaning and processing the data, but what comes before that is collecting the data -- specifically, collecting data that directly address the question at hand.

Our previous post on the smartphone app crashes focused on why the data was not trustworthy. The same problem plagues this "spider chart", submitted by Marcus R. (link to chart here)


Despite the title, it is impossible to tell how QlikView is "first" among these brands. In fact, with several shades of blue, I find it hard to even figure out which part refers to QlikView.

The (radial) axis is also a great mystery because it has labels (0, 0.5, 1, 1.5). I have never seen surveys with such a scale.

The symmetry of this chart is its downfall. These "business intelligence" software products are ranked along 10 dimensions, yet there may not be a single decision-maker who would assign equal weight to each of these criteria. It's hard to imagine that "project length" is as important as "product quality", for example.

Take one step back. This data came from respondents to a survey (link). There is very little information about the composition of the respondents. Were they asked to rate all 10 products along 10 dimensions? Did they rate only the products they are familiar with? Or only the products they actively use? If the latter, how are responses for different products calibrated so that a 1 rating from QlikView users equals a 1 rating from MicroStrategy users? Given that each of these products has broad but not completely overlapping coverage, and users typically deploy only a part of the solution, how does the analysis address the selection bias?


The "spider chart" is, unfortunately, most often associated with Florence Nightingale, who created the following chart:


This chart isn't my cup of tea either.


Also note that the spider chart has so much over-plotting that it is impossible to retrieve the underlying data.



A data mess outduels the pie-chart disaster for our attention

Reader Daniel L. sends us to a truly horrifying pie chart. This:


Link to the original here.

The background: Crittercism, a smartphone monitoring company, compiled data on the frequency of app crashes by version of mobile operating system (Android or Apple iOS). The data is converted into proportions adding to 100%.

If we spend our time trying to figure out the logic behind the ordering and placing of the data (e.g. why is iOS split on both sides? why are pieces not sorted by size?), we will miss the graver problem with this chart - the underlying data.


Here is a long list of potential issues:

  • Crittercism sells app monitoring tools for app developers. Presumably this is how it is able to count app crashes. But who are their customers? Are they a representative set of the universe of apps? Do we even know the proportion of Android/iOS apps being monitored?
  • There is reason to believe that the customer set is not representative. One would guess that more crash-prone apps are more likely to have a need for monitoring. Also, is Apple a customer? Given that Apple has many highly popular apps on iOS, omission of these will make the data useless.
  • The data wasn't adjusted for the popularity of apps. It's very misleading to count app crashes without understanding how many times the app has been opened. This is the same fallacy as making conclusions about flight safety based on the list of fatal plane accidents; the millions of flights that complete without incident provide lots of information! (See Chapter 5 of my book for a discussion of this.)
  • The data has severe survivorship bias. The blog poster even mentions this problem but adopts the attitude that such disclosure somehow suffices to render useless data acceptable. More recent releases are more prone to crashes just because they are newer. If a particular OS release is particularly prone to app crashes, then we expect a higher proportion of users to have upgraded to newer releases. Thus, older releases will always look less crash-prone, partly because more bugs have been fixed, and partly because of decisions by users to switch out. iOS is the older operating system, and so there are more versions of it being used.
  • How is a "crash" defined?  I don't know anything about Android crashes. But my experience with PC operating systems is that each one has different crash characteristics. I suspect that an Android crash may not be the same as an iOS crash.
  • How many apps and how many users were included in these statistics? Specifying the sample size is fundamental to any such presentation.
  • Given the many problems related to timing as described above, one has to be careful when generalizing with data that only span two weeks in December.
  • There are other smartphone operating systems in use out there. If those are omitted, then we can't have proportions that add up to 100% unless those other operating systems never have app crashes.


How to fix this mess? One should start with the right metric, which is the crash rate, that is, the number of crashes divided by the number of app starts. Then, make sure the set of apps being tracked is representative of the universe of apps out there (in terms of popularity).

Some sort of time matching is needed. Perhaps trace the change in crash rate over time for each version of each OS. Superimpose these curves, with the time axis measuring time since first release. Most likely, this is the kind of problem that requires building a statistical model because multiple factors are at play.
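To make the fix concrete, here is a minimal sketch of how the crash-rate metric changes the picture. The crash and usage counts below are entirely made up for illustration; Crittercism's actual data is not public.

```python
# Hypothetical counts -- for illustration only, not Crittercism's data.
crashes = {"iOS 5.0.1": 24_000, "Android 2.3": 6_000}
app_starts = {"iOS 5.0.1": 12_000_000, "Android 2.3": 3_000_000}

# Raw crash counts make iOS look four times worse...
crash_rate = {os: crashes[os] / app_starts[os] for os in crashes}
# ...but the crash rates are identical (0.2%), because iOS apps were
# started four times as often in this made-up sample.
```

With numbers like these, a chart of raw crash counts and a chart of crash rates would tell opposite stories.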

Finally, I'd argue that the question being posed is better answered using good old-fashioned customer surveys collecting subjective opinion ("how many crashes occurred this past week?" or "rate crash performance"). Yes, this is a shocker: a properly-designed small-scale survey will beat a massive-scale observational data set with known and unknown biases. You may agree with me if you agree that we should care about the perception of crash severity by users, not the "true" number of crashes. (That's covered in Chapter 1 of my book.)




Motion-sick, or just sick?

Reader Irene R. was asked by a client to emulate this infographic movie, made by UNIQLO, the Japanese clothing store.

Here is one screen shot of the movie:


This is the first screen of a section; from this moment, the globes dissolve into clusters of photographs representing the survey respondents, which then parade across the screen. Irene complains of motion sickness, and I can see why she feels that way.

Here is another screen shot:


Surprisingly, I don't find this effort completely wasteful. This is because I have read my fair share of bore-them-to-tears compilations of survey research results - you know, those presentations with one multi-colored, stacked or grouped bar chart after another, extending for dozens of pages.

There are some interesting ideas in this movie. They have buttons on the lower left that allow users to look at subgroups. You'll quickly find the limitations of such studies by clicking on one or more of those buttons... the sample sizes shrink drastically.

The use of faces animates the survey, reminding viewers that the statistics represent real people. I wonder how they chose which faces to highlight, and in particular, whether the answers thus highlighted represent the average respondent. There is a danger that viewers will remember individual faces and their answers more than they recall the average statistics.


If the choice is between a thick presentation gathering dust on the CEO's desk and this vertigo of a movie that perhaps might get viewed, which one would you pick?


Nothing is as simple as it seems

Thanks to reader Chris P. (again) for pointing us to this infographic about teacher pay. This one is much better than your run-of-the-mill infographic poster. The designer has set out to answer specific questions like "how much do teachers make?", and has organized the chart in this way.

This post is about the very first chart because I couldn't get past it. It's a simple bar chart, with one data series indexed by country, showing the relative starting salary of a primary-school teacher with minimal training. This one:


The chart tells us that the range of salaries goes from about $12,000 at the low end (Poland) to over $65,000 at the high end (Luxembourg), with the U.S. roughly at the 67th percentile, at $42,000 per year. The footnote says that the source was the OECD.

The chart is clean and simple, as a routine chart like this should be. One might complain that it would be easier to read if flipped 90 degrees, with country labels on the left and bars instead of columns. But that's not where I got stuck... mentally.

I couldn't get past this chart because it generated so many unanswered questions. The point of the chart is to compare U.S. teacher pay against the rest of the world (apologies to readers outside the U.S., I'm just going with the designer's intention). And yet, it doesn't answer that question satisfactorily.

Our perception of the percentile ranking of the U.S. is fully determined by the choice of countries depicted. One wonders how that choice was made. Do the countries provide a nice sampling of the range of incomes from around the world? Is Poland truly representative of low pay and Luxembourg of high pay? Why are Korea and Japan the only two Asian countries shown and not, say, China or India? Why is there a need to plot Belgium (Fl.) separately from Belgium (Fr.), especially since the difference between the two parts of Belgium is dwarfed by the difference between Belgium and any other country? This last one may seem unimportant but a small detail like this changes the perceived ranks.

Further, why is the starting salary used for this comparison? Why not average salary? Median salary? Salary with x years of experience? Perhaps starting salary is highly correlated to these other metrics, perhaps not.

Have there been sharp changes in the salaries over time in any of these countries? It's quite possible that salaries are in flux in less developed countries, and more stable in more developed countries.

Also, given the gap in cost of living between, say, Luxembourg and Mexico, it's not clear that the Mexican teacher earning about $20,000 is worse off than the Luxembourger taking home about $65,000. I was curious enough to do a little homework: the PPP GDP per capita in Luxembourg was about $80,000, compared to $15,000 in Mexico, according to IMF (source: Wikipedia), so after accounting for cost of living, the Mexican earns an above-average salary while the Luxembourger takes home a below-average salary. Thus, the chart completely misses the point.


Using the Trifecta checkup, one would address this type of issue when selecting the appropriate data series to address the meaningful question.

Too often, we pick up any data set we can lay our hands on, and the data fails to answer the question, and may even mislead readers.




PS. On a second look, I realized that the PPP analysis shown above was not strictly accurate as I compared an unadjusted salary to an adjusted salary. A better analysis is as follows: take the per-capita PPP GDP of each country, and the per-capita unadjusted GDP to form the adjustment factor. Using IMF numbers, for Luxembourg, this is 0.74 and for Mexico, this is 1.57. Now, adjust the average teacher salary by those factors. For Luxembourg, the salary adjusted for cost of living is $48,000 (note that this is an adjustment downwards due to higher cost of living in that country), and for Mexico, the adjusted salary was inflated to $31,000. Now, these numbers can be appropriately compared to the $80,000 and $15,000 respectively. The story stays the same.
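The arithmetic in the PS can be written out in a few lines, using the IMF figures quoted above. The adjustment factors are rounded, so the results are approximate.

```python
# PPP adjustment factor = PPP GDP per capita / unadjusted GDP per capita
# (0.74 and 1.57 are the rounded factors cited in the post).
factor = {"Luxembourg": 0.74, "Mexico": 1.57}
starting_salary = {"Luxembourg": 65_000, "Mexico": 20_000}  # unadjusted
ppp_gdp_per_capita = {"Luxembourg": 80_000, "Mexico": 15_000}

adjusted_salary = {c: starting_salary[c] * factor[c] for c in factor}
# Luxembourg: ~$48,000 against $80,000 PPP GDP per capita (below average)
# Mexico:     ~$31,000 against $15,000 PPP GDP per capita (above average)
```

The cost-of-living adjustment pulls the Luxembourg salary down and inflates the Mexican one, which is exactly why the unadjusted comparison misleads.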



Unscientific American 1: misreadings

Chris P. sent me to this set of charts / infographics with the subject line "all sorts of colors and graphs."  I let the email languish in my inbox, and I now regret it. For three reasons: one, the topic of how scientists can communicate better with, and thus exert stronger influence on, the public is very close to my heart (as you can tell from my blogs and book), and this article presents results from a poll on this topic conducted among online readers of Scientific American and Nature magazines; two, some of the charts are frankly quite embarrassing to have appeared in venerable publications of a scientific nature (sigh); three, these charts provide a convenient platform to review some of the main themes on Junk Charts over the years.

Since the post is so long, I have split it into two parts. In part 1, I explore one chart in detail. In part 2, I use several other charts to illustrate some concepts that have been frequently deployed on Junk Charts.


Exhibit A is this chart:


First, take a look at the top left corner. At first glance, I took the inset to mean: among scientists, how much do they trust scientists (i.e., their peers) on various topics?  That seemed curious, as that wouldn't be a question I'd have thought to ask, certainly not as the second question in the poll.

On further inspection, that is a misreading of this chart. The "scientists" represented above are objects, not subjects, in the first question. As the caption tells us, the respondents rated scientists at 3.98 overall, which is an average rating across many topics. The bar chart below tells us how the respondents rated scientists on individual topics, thus providing us information on the spread of ratings.

Unfortunately, this chart raises more questions than it answers. For one, you're left working out how the average could be 3.98 (at the 4.0 white line) when all but three of the topic ratings were below 3.98. Did they use a weighted average without letting on?

Oops, I misread the chart, again. I think what I stumbled on here is the design of the poll itself. The overall rating is probably a separate question, and not at all related to the individual topic ratings. In theory, each person can assign a subjective importance as well as a rating to each topic; the average of the ratings weighted by their respective importance would form his or her overall rating of scientists. That would impose consistency on the two levels of ratings. In practice, that assumes the topics span the space of what each person considers when rating the scientists overall.
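To make the consistency idea concrete, here is a toy calculation for one hypothetical respondent. The topics, ratings, and importance weights are all invented for illustration; the poll did not collect importance weights.

```python
# One hypothetical respondent: a rating per topic, plus a subjective
# importance weight per topic (weights sum to 1).
ratings = {"climate change": 3.6, "vaccines": 4.2, "nuclear power": 3.5}
importance = {"climate change": 0.5, "vaccines": 0.3, "nuclear power": 0.2}

# The importance-weighted average of the topic ratings would be this
# person's overall rating of scientists, tying the two levels together.
overall = sum(ratings[t] * importance[t] for t in ratings)  # 3.76
```

Under this model, an overall rating above most topic ratings is possible only if the highly rated topics carry most of the weight.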


The bar chart has a major problem... it does not start at zero.  Since some bars are half as long as the longest, you might think the level of trust associated with nuclear power or climate change would be around 2 (negative). But it's not; it's in the 3.6 range. This is a lack of self-sufficiency: the reader cannot understand the chart without fishing out the data.

Now, ask this question: in a poll in which respondents are asked to rate things on a scale of 1, 2, 3, 4, 5, do you care about the average rating to 2 decimal places?  The designer of the graphic seems to think not, as the rating was rounded to the nearest 0.5, and presented using the iconic five-star motif. I think this is a great decision!

But then, the designer fell for loss aversion: having converted the decimals to half-stars, he should have dropped the decimals; instead, he tucked them at the bottom of each picture. This is no mere trivia. Now, the reader is forced to process two different scales showing the same information. Instead of achieving simplification by adopting the star system, the reader is left examining the cracks: is the trust given to citizens groups the same as that given to journalists (both 2.5 stars), or do "people" trust citizens groups more (higher decimal rating)?


The biggest issues with this chart concern the identification of the key questions and how to collect data to address those questions. This is the top corner of the Trifecta checkup.

1) The writer keeps telling us "people" trust this and that, but the poll only covered online readers of Scientific American and Nature magazines. One simply cannot generalize from that segment of the population to "people" in general.

2) Insufficient attention has been paid to the wording of the questions. For example, in Exhibit A, while the overall trust question was phrased as trusting the "accuracy" of the information provided by scientists vs. other groups, the trust questions on individual topics mentioned only a generic "trust".  Unless one thinks "trust" is a synonym of "accuracy", the differential choice of words makes these two sets of responses hard to compare. And comparing them is precisely what they chose to do.


In part 2, I examine several other charts, taking stops at several concepts we use on Junk Charts a lot.


Lessons from propaganda

Political wordsmith (euphemism) Frank Luntz's presentation is all over the Web. I saw it on Business Insider. In the debate between words and numbers, Luntz obviously takes the side of words.

He used a few simple charts in the presentation, which is interesting by itself since he fundamentally is a words guy, not a numbers guy.

The charts, while simple, are very instructive:

This bar chart sent me running to the maligned pie chart!  (Almost - read on.)

While the total responses were almost evenly split between the three choices, the bar chart drew our attention to the first bar, which is inapt.

If plotted as a pie chart, I thought, the reader would see three almost equal slices. This effect occurs because we are much less precise at judging the areas of slices than the lengths of bars.  Wouldn't that turn our usual advice on its head?


How the Bar Chart is Saved

The one thing that the pie chart has as a default that this bar chart doesn't is the upper bound.  Everything must add up to 100% in a circle but nothing forces the lengths of the bars to add up to anything.

We save the bar chart by making the horizontal axis stretch to 100% for each bar.  This new scaling makes the three bars appear almost equal in length, which is as it should be.


Another Unforgivable Pie Chart

On the very next page, Luntz threw this pie at our faces:

Make sure you read the sentence at the bottom.

It appears that he removed the largest group of responses, and then reweighted the CEO and Companies responses to add to 100%.

This procedure is always ill-advised - respondents responded to the full set of choices, and if they had been given only these two responses, they very well might have answered differently.

It also elevated secondary responses while dispensing with the primary response.
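A toy example shows how the reweighting distorts the picture. The percentages below are invented to mimic the procedure; they are not Luntz's actual numbers.

```python
# Invented percentages mimicking the procedure, not Luntz's actual data.
responses = {"Government": 45, "CEOs": 30, "Companies": 25}

# Drop the largest category, then reweight the rest to add to 100%.
kept = {k: v for k, v in responses.items() if k != "Government"}
total = sum(kept.values())
reweighted = {k: round(100 * v / total, 1) for k, v in kept.items()}
# "CEOs" now shows 54.5% -- an apparent majority, even though most
# respondents picked neither of the two remaining choices.
```

The arithmetic is innocent; the deception lies in presenting the reweighted shares as if they described the full set of respondents.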

Leaving ink traces

Stefan S. at the UNEP GEO Data Portal sent me some intriguing charts, made from data about the environment.  The following shows the amount of CO2 emissions by country, both in aggregate and per capita.  We looked at some of their other charts before.


These "inkblot" charts are visually appealing, and have some similarities with word clouds.  It's pretty easy to find the important pieces of data; and while in general we should not sort things alphabetically, here, as in word clouds, the alphabetical order is actually superior as it spaces out the important bits.  If these were sorted by size, we'd end up with all the big blots on top, and a bunch of narrow lines at the bottom - and it would look very ugly.

The chart also breaks another rule. Each inkblot is a mirror image about a horizontal line. This arrangement is akin to arranging a bar chart with the bars centered (this has been done before, here).  It works here because there is no meaningful zero point (put differently, many zero points) on the vertical scale, and the data is encoded in the height of each inkblot at any given time.

Breaking such a rule has an unintended negative.  The change over time within each country is obscured: the slope of the upper envelope now only contains half of the change, the other half exists in the lower envelope's slope.  Given that the more important goal is cross-country comparison, I think the tradeoff is reasonable.


Colors are chosen to help readers shift left and right between the per capita data and the aggregate data.  Gridlines and labels are judicious.

As with other infographics, this chart does well to organize and expose interesting bits of data but doesn't address the next level of questions, such as why some countries contribute more pollution than others.

One suggestion: restrict the countries depicted to satisfy both rules (per capita emissions > 1000 kg AND total emissions > 10 million tonnes).  In this version, a country like Albania is found only on one chart but not the other.  This disrupts the shifting back and forth between the two charts.

Following one's nose 2

This is the second post on the immigration paradox study, first discussed on the Gelman blog.  My prior post on the graphing aspect is here; this post focuses on the statistical aspects. I am working backwards on Andrew's discussion points.

Which difference is most interesting?

5. Agree with Andrew; they should publish similar analyses on other minority groups as soon as possible.  One thing that strikes me when looking at the interaction plot is that the U.S.-born non-Latino whites have a much higher incidence of mental illness.  The difference between subgroups of Latinos pales in comparison to the difference between non-Latinos and Latinos.  This latter difference is particularly acute among the U.S. born compared to the immigrants. The importance of the Latino analysis hinges upon whether the "paradox" is also found among other minority groups.

(Chris P also pointed this out in his comment on the previous post.)

Disaggregation, Practical Significance, and the Meaning of Not Significant

2. Andrew is also right in expressing moderate skepticism about this sort of disaggregation exercise.  He connects this to the subtle statistical point that "the difference between significant and not significant is not significant."  A related but less abstruse issue is that as one disaggregates any data, the chance of seeing variations that stray from the average gets higher and higher.  This is because the sample size is decreasing, and so the statistical estimates are less reliable.

(To give a flavor of the scale, there were a total of 2500 Latinos in the sample, with 500 Puerto Rican Latinos. The analysis drilled down to the level of different types of mental disorders, subgroups of Latinos, and also adjusted for demographics.  The details of the demographic adjustment are not available but in any case, one should be concerned about whether there were sufficient numbers of say, male immigrant Puerto Rican Latinos age 18-25 with income < $10,000 living in a rental apartment, for such an elaborate exercise.)

Expanding on this point further, one observes that the measured gap between U.S. born and immigrant Puerto Rican Latinos was about 5%.  But this 5% is probably of considerable practical significance since the base rate of incidence is about 30% (I say probably since I am not an expert in mental illness).  The current statistical analysis judged this to be insignificant -- if the sample size were larger, this difference could conceivably be statistically significant, and also practically significant.

But doesn't the significance test deal with the small sample size problem?  It would, if the authors had merely described the Puerto Rico result as inconclusive.  Here, as is done very commonly, insignificance is equated to "no difference": they said

No differences were found in lifetime prevalence rates between migrant and U.S.-born Puerto Rican subjects.

In reality, a difference of 5% was found in the sample that was analyzed.  The statistical procedure found that this difference could have been a result of chance -- notice "could", not "must".  If the measured difference was 0.5% on 30%, then I might be willing to accept a finding of "no difference"; when it was 5% on 30%, I would like to see a larger sample analyzed.
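A quick sketch shows how the same 5-point gap flips from "not significant" to significant as the sample grows. The group sizes below are guesses for illustration; the paper's actual subgroup counts are not reproduced here.

```python
import math

def two_prop_z(p1, n1, p2, n2):
    """Pooled two-proportion z statistic for comparing two sample rates."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)          # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# A 35% vs 30% gap with ~250 per group (roughly the scale of the
# Puerto Rican subsample) falls short of the 1.96 threshold...
z_small = two_prop_z(0.35, 250, 0.30, 250)    # about 1.19
# ...but the identical gap with 2,000 per group clears it easily.
z_large = two_prop_z(0.35, 2000, 0.30, 2000)  # about 3.38
```

Same measured difference, opposite verdicts: "not significant" here describes the sample size as much as the effect.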

The Meaning of Paradox

1. Andrew was perplexed by why the phenomenon is known as a "paradox". I had the same issue until I read the paper. The authors were a bit sloppy in the abstract. In the paper itself, they explained that the conventional wisdom has it that immigrants should be more likely to have mental illness because of the stress from the immigration process, and yet the statistics showed the exact opposite. That is the paradox.

Publication Bias

I was a little shocked to see the data tables that gave all the estimates of the various effects at the various subgroup levels: shocked because the authors were allowed (or asked) to include only the p-values that were below some unspecified level (which I surmised is 10% although a 5% significance level is used to judge significance as per convention). This is publication bias within publication bias. P-values that are not significant still provide valuable information and should not be omitted. They did provide confidence intervals but for each subgroup separately, rather than for the difference -- and as they noted, such intervals by themselves are inconclusive when they overlap moderately.


Life-enabling charts

In response to my call for positive examples, reader Merle H. sent in an example of how good charts can make our lives simpler and easier.

All of us have seen the following presentation of air travel data.

Not trying to pick on Travelocity - it's the same format whether you use Expedia or any of the airline sites.  For those customers who are looking to decide what dates to travel so as to minimize their air fare, this format is very cumbersome to use.

What about this fare chart at

As you mouse along the line chart, the average fare for each day is visible.  Clicking on a particular day will fix the departure or return dates.

So much easier, isn't it?

A few caveats, though:

  • Instead of just providing the historical averages, they should consider including information on variability, such as bars that indicate the middle 50% or 75% of prices.  Also, what about a sliding control for customers to decide which period of past history the averages should use?  More recent data may be more representative.
  • This particular feature appeals to the price-sensitive, date-flexible customer segment.  Not everyone will pick itineraries based on those criteria.  There is an easy fix. If some controls are available for customers to indicate other preferences, e.g. exclude all British Airways flights, include only evening flights, etc., and the chart can update itself based on such selections, then the chart becomes a lot more flexible, and useful to many more customers.
  • As with many automatically generated charts, the chosen labels on the vertical axis are laughable.  That should be relatively easy to fix, you'd think.

A great start.  I happen to notice that Travelocity has a beta feature that shows a similar chart.  A revolution in how travel sites present data to us is long overdue.