A data mess outduels the pie-chart disaster for our attention

Reader Daniel L. sends us to a truly horrifying pie chart. This:

[Image: Bgr-crashes-ios-android-1]

Link to the original here.

The background: Crittercism, a smartphone app-monitoring company, compiled data on the frequency of app crashes by version of mobile operating system (Android or Apple iOS). The data were converted into proportions adding to 100%.

If we spend our time trying to figure out the logic behind the ordering and placement of the data (e.g., why is iOS split on both sides? why aren't the pieces sorted by size?), we will miss the graver problem with this chart: the underlying data.

***

Here is a long list of potential issues:

  • Crittercism sells app monitoring tools to app developers. Presumably this is how it is able to count app crashes. But who are its customers? Are they a representative set of the universe of apps? Do we even know the proportion of Android/iOS apps being monitored?
  • There is reason to believe that the customer set is not representative. One would guess that more crash-prone apps are more likely to have a need for monitoring. Also, is Apple a customer? Given that Apple has many highly popular apps on iOS, omission of these will make the data useless.
  • The data wasn't adjusted for the popularity of apps. It's very misleading to count app crashes without understanding how many times the app has been opened. This is the same fallacy as making conclusions about flight safety based on the list of fatal plane accidents; the millions of flights that complete without incident provide lots of information! (See Chapter 5 of my book for a discussion of this.)
  • The data has severe survivorship bias. The blog poster even mentions this problem but adopts the attitude that such disclosure somehow suffices to render useless data acceptable. More recent releases are more prone to crashes just because they are newer. If a particular OS release is particularly prone to app crashes, then we expect a higher proportion of users to have upgraded to newer releases. Thus, older releases will always look less crash-prone, partly because more bugs have been fixed, and partly because of decisions by users to switch out. iOS is the older operating system, and so there are more versions of it being used.
  • How is a "crash" defined?  I don't know anything about Android crashes. But my experience with PC operating systems is that each one has different crash characteristics. I suspect that an Android crash may not be the same as an iOS crash.
  • How many apps and how many users were included in these statistics? Specifying the sample size is fundamental to any such presentation.
  • Given the many problems related to timing as described above, one has to be careful when generalizing with data that only span two weeks in December.
  • There are other smartphone operating systems in use out there. If those are omitted, then we can't have a proportion that adds up to 100% unless those other operating systems never have app crashes.

***

How to fix this mess? One should start with the right metric, which is the crash rate, that is, the number of crashes divided by the number of app starts. Then, make sure the set of apps being tracked is representative of the universe of apps out there (in terms of popularity).

Some sort of time matching is needed. Perhaps trace the change in crash rate over time for each version of each OS. Superimpose these curves, with the time axis measuring time since first release. Most likely, this is the kind of problem that requires building a statistical model because multiple factors are at play.
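To make the crash-rate metric concrete, here is a minimal sketch in Python; the data frame, the column names, and the numbers are all invented for illustration, not Crittercism's actual figures.

```python
import pandas as pd

# Hypothetical monitoring data: one row per OS version over a fixed period
df = pd.DataFrame({
    "os_version": ["iOS 5.0.1", "iOS 4.3.3", "Android 2.3.3"],
    "crashes":    [12_000, 4_500, 7_200],
    "app_starts": [9_000_000, 6_000_000, 8_000_000],
})

# The right metric: crashes per app start, not each version's share of all crashes
df["crash_rate"] = df["crashes"] / df["app_starts"]
print(df.sort_values("crash_rate", ascending=False))
```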

Finally, I'd argue that the question being posed is better answered using good old-fashioned customer surveys collecting subjective opinion ("how many crashes occurred this past week?" or "rate crash performance"). Yes, this is a shocker: a properly-designed small-scale survey will beat a massive-scale observational data set with known and unknown biases. You may agree with me if you agree that we should care about the perception of crash severity by users, not the "true" number of crashes. (That's covered in Chapter 1 of my book.)

Motion-sick, or just sick?

Reader Irene R. was asked by a client to emulate this infographic movie, made by UNIQLO, the Japanese clothing store.

Here is one screen shot of the movie:

[Image: Uniqlo]

This is the first screen of a section; from this moment, the globes dissolve into clusters of photographs representing the survey respondents, which then parade across the screen. Irene complains of motion sickness, and I can see why she feels that way.

Here is another screen shot:

[Image: Uniqlo2]

Surprisingly, I don't find this effort completely wasteful. This is because I have read my fair share of bore-them-to-tears compilations of survey research results - you know, those presentations with one multi-colored, stacked or grouped bar chart after another, extending for dozens of pages.

There are some interesting ideas in this movie. They have buttons on the lower left that allow users to look at subgroups. You'll quickly find the limitations of such studies by clicking on one or more of those buttons... the sample sizes shrink drastically.

The use of faces animates the survey, reminding viewers that the statistics represent real people. I wonder how they chose which faces to highlight, and in particular, whether the answers thus highlighted represent the average respondent. There is a danger that viewers will remember individual faces and their answers more than they recall the average statistics.

***

If the choice is between a thick presentation gathering dust on the CEO's desk and this vertigo of a movie that perhaps might get viewed, which one would you pick?


Nothing is as simple as it seems

Thanks to reader Chris P. (again) for pointing us to this infographic about teacher pay. This one is much better than your run-of-the-mill infographics poster. The designer has set out to answer specific questions like "how much do teachers make?", and has organized the charts accordingly.

This post is about the very first chart because I couldn't get past it. It's a simple bar chart, with one data series indexed by country, showing the relative starting salary of a primary-school teacher with minimal training. This one:

[Image: Sosashable_teacherpay]

The chart tells us that the range of salaries goes from about $12,000 at the low end (Poland) to over $65,000 at the high end (Luxembourg), with the U.S. at roughly the 67th percentile, at $42,000 per year. The footnote says that the source was OECD.

The chart is clean and simple, as a routine chart like this should be. One might complain that it would be easier to read if flipped 90 degrees, with country labels on the left and bars instead of columns. But that's not where I got stuck... mentally.

I couldn't get past this chart because it generated so many unanswered questions. The point of the chart is to compare U.S. teacher pay against the rest of the world (apologies to readers outside the U.S., I'm just going with the designer's intention). And yet, it doesn't answer that question satisfactorily.

Our perception of the percentile ranking of the U.S. is fully determined by the choice of countries depicted. One wonders how that choice was made. Do the countries provide a nice sampling of the range of incomes from around the world? Is Poland truly representative of low pay and Luxembourg of high pay? Why are Korea and Japan the only two Asian countries shown and not, say, China or India? Why is there a need to plot Belgium (Fl.) separately from Belgium (Fr.), especially since the difference between the two parts of Belgium is dwarfed by the difference between Belgium and any other country? This last one may seem unimportant but a small detail like this changes the perceived ranks.

Further, why is the starting salary used for this comparison? Why not average salary? Median salary? Salary with x years of experience? Perhaps starting salary is highly correlated to these other metrics, perhaps not.

Have there been sharp changes in the salaries over time in any of these countries? It's quite possible that salaries are in flux in less developed countries, and more stable in more developed countries.

Also, given the gap in cost of living between, say, Luxembourg and Mexico, it's not clear that the Mexican teacher earning about $20,000 is worse off than the Luxembourger taking home about $65,000. I was curious enough to do a little homework: the PPP GDP per capita in Luxembourg was about $80,000, compared to $15,000 in Mexico, according to IMF (source: Wikipedia), so after accounting for cost of living, the Mexican earns an above-average salary while the Luxembourger takes home a below-average salary. Thus, the chart completely misses the point.

***

[Image: Jc_trifecta]

Using the Trifecta checkup, one would address this type of issue when selecting the appropriate data series to address the meaningful question.

Too often, we pick up any data set we can lay our hands on, and the data fails to answer the question, and may even mislead readers.

PS. On a second look, I realized that the PPP analysis shown above was not strictly accurate as I compared an unadjusted salary to an adjusted salary. A better analysis is as follows: take the per-capita PPP GDP of each country, and the per-capita unadjusted GDP to form the adjustment factor. Using IMF numbers, for Luxembourg, this is 0.74 and for Mexico, this is 1.57. Now, adjust the average teacher salary by those factors. For Luxembourg, the salary adjusted for cost of living is $48,000 (note that this is an adjustment downwards due to higher cost of living in that country), and for Mexico, the adjusted salary was inflated to $31,000. Now, these numbers can be appropriately compared to the $80,000 and $15,000 respectively. The story stays the same.
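For readers who want to retrace the arithmetic, here is the adjustment laid out as a short Python snippet; the salaries, factors, and GDP figures are the rounded numbers quoted above, so the output is only approximate.

```python
# Adjustment factor = PPP GDP per capita / unadjusted GDP per capita (rounded IMF figures)
data = {
    #             (unadjusted salary, adjustment factor, PPP GDP per capita)
    "Luxembourg": (65_000, 0.74, 80_000),
    "Mexico":     (20_000, 1.57, 15_000),
}

for country, (salary, factor, ppp_gdp_pc) in data.items():
    adjusted = salary * factor  # salary adjusted for cost of living
    print(f"{country}: adjusted salary ~${adjusted:,.0f} "
          f"vs PPP GDP per capita ${ppp_gdp_pc:,}")

# Luxembourg: adjusted salary ~$48,100 vs PPP GDP per capita $80,000 (below average)
# Mexico:     adjusted salary ~$31,400 vs PPP GDP per capita $15,000 (above average)
```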


Unscientific American 1: misreadings

Chris P. sent me to this set of charts / infographics with the subject line "all sorts of colors and graphs."  I let the email languish in my inbox, and I now regret it. For three reasons: one, the topic of how scientists can communicate better with, and thus exert stronger influence on, the public is very close to my heart (as you can tell from my blogs and book), and this article presents results from a poll on this topic taken among the online readers of Scientific American and Nature magazines; two, some of the charts are frankly quite embarrassing to have appeared in venerable publications of a scientific nature (sigh); three, these charts provide a convenient platform to review some of the main themes on Junk Charts over the years.

Since the post is so long, I have split it into two parts. In part 1, I explore one chart in detail. In part 2, I use several other charts to illustrate some concepts that have been frequently deployed on Junk Charts.

***

Exhibit A is this chart:

[Image: Sa_howmuchdopeople]

First, take a look at the top left corner. At first glance, I took the inset to mean: among scientists, how much do they trust scientists (i.e., their peers) on various topics?  That seemed curious, as that wouldn't be a question I'd have thought to ask, certainly not as the second question in the poll.

[Image: Sa_howmuchdopeople1]

On further inspection, that is a misreading of this chart. The "scientists" represented above are objects, not subjects, in the first question. As the caption tells us, the respondents rated scientists at 3.98 overall, which is an average rating across many topics. The bar chart below tells us how the respondents rated scientists on individual topics, thus providing us information on the spread of ratings.

Unfortunately, this chart raises more questions than it answers. For one, you're left working out how the average could be 3.98 (at that 4.0 white line) when all but three of the topic ratings were below 3.98. Did they use a weighted average but not let on?

Oops, I misread the chart, again. I think what I stumbled on here is the design of the poll itself. The overall rating is probably a separate question, and not at all related to the individual topic ratings. In theory, each person can assign a subjective importance as well as a rating to each topic; the average of the ratings weighted by their respective importance would form his or her overall rating of scientists. That would impose consistency on the two levels of ratings. In practice, that makes an assumption that the topics span the space of what topics each person considers when rating the scientists overall.
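As a toy illustration of that consistency requirement (the topics, ratings, and weights below are invented, not taken from the poll), an importance-weighted average can indeed sit above most of the individual topic ratings:

```python
# One hypothetical respondent's topic ratings and the subjective importance of each topic
ratings    = {"climate change": 3.6, "vaccines": 4.2, "evolution": 4.5}
importance = {"climate change": 0.2, "vaccines": 0.3, "evolution": 0.5}

# Overall rating as an importance-weighted average of the topic ratings
overall = sum(ratings[t] * importance[t] for t in ratings) / sum(importance.values())
print(round(overall, 2))  # 4.23 -- higher than two of the three topic ratings
```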

***

The bar chart has a major problem... it does not start at zero.  Since some of the bars are half as long as the longest, you might think the level of trust associated with nuclear power or climate change would be around 2 (negative). But it's not; it's in the 3.6 range. This is a lack of self-sufficiency. The reader cannot understand the chart without fishing out the data.

Now, ask this question: in a poll in which respondents are asked to rate things on a scale of 1, 2, 3, 4, 5, do you care about the average rating to 2 decimal places?  The designer of the graphic seems to think not, as the rating was rounded up to the nearest 0.5, and presented using the iconic 5-star motif. I think this is a great decision!

[Image: Citizensvsjournalists]

But then, the designer fell for loss aversion: having converted the decimals to half-stars, he should have dropped the decimals; instead, he tucked them at the bottom of each picture. This is not mere trivia. Now, the reader is forced to process two different scales showing the same information. Instead of achieving simplification by adopting the star system, now the reader is examining the cracks: is the trust given to citizens groups the same as that given to journalists (both 2.5 stars), or do "people" trust citizens groups more (higher decimal rating)?
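For what it's worth, the conversion described above (rounding up to the nearest half star) is a one-liner; this is my guess at the rule, not necessarily the exact one the designer used:

```python
import math

def to_half_stars(rating):
    """Round a 1-5 decimal rating up to the nearest half star."""
    return math.ceil(rating * 2) / 2

print(to_half_stars(3.98))  # 4.0
print(to_half_stars(2.3))   # 2.5
```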

***

The biggest issues with this chart concern the identification of the key questions and how to collect data to address those questions. This is the top corner of the Trifecta checkup.

1) The writer keeps telling us "people" trust this and that, but the poll only covered online readers of Scientific American and Nature magazines. One simply cannot generalize from that segment of the population to the common "people".

2) Insufficient attention has been paid to selecting the right wording in the questions. For example, in Exhibit A, while the overall trust question was phrased as trusting the "accuracy" of the information provided by scientists vs. other groups, the trust questions on individual topics mentioned only a generic "trust".  Unless one thinks "trust" is a synonym of "accuracy", the differential choice of words makes these two sets of responses hard to compare. And comparing them is precisely what they chose to do.

***

In part 2, I examine several other charts, taking stops at several concepts we use on Junk Charts a lot.


Lessons from propaganda

Political wordsmith (euphemism) Frank Luntz's presentation is all over the Web. I saw it on Business Insider. In the debate between words and numbers, Luntz obviously takes the side of words.

He used a few simple charts in the presentation, which is interesting by itself since he fundamentally is a words guy, not a numbers guy.

The charts, while simple, are very instructive:

[Image: Luntz_bar]

This bar chart sent me running to the maligned pie chart!  (Almost; read on.)

While the total responses were almost evenly split between the three choices, the bar chart drew our attention to the first bar, which is inapt.

If plotted as a pie chart, I thought, the reader would see three almost equal slices. This effect occurs because we are much less precise at judging the areas of slices than the lengths of bars.  Wouldn't that turn our usual advice on its head?

How the Bar Chart is Saved

The one thing that the pie chart has by default that this bar chart doesn't is an upper bound.  Everything must add up to 100% in a circle, but nothing forces the lengths of the bars to add up to anything.

We save the bar chart by making the horizontal axis stretch to 100% for each bar.  This new scaling makes the three bars appear almost equal in length, which is as it should be.

[Image: Redo_luntz]
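For anyone who wants to reproduce this kind of fix, here is a minimal matplotlib sketch; the three shares are made up to stand in for Luntz's roughly even split.

```python
import matplotlib.pyplot as plt

labels = ["Response A", "Response B", "Response C"]   # placeholder response labels
shares = [36, 33, 31]                                 # hypothetical near-even split (%)

fig, ax = plt.subplots(figsize=(6, 2))
ax.barh(labels, shares)
ax.set_xlim(0, 100)   # the key step: stretch the axis to 100% so the bars look nearly equal
ax.set_xlabel("Share of responses (%)")
plt.tight_layout()
plt.show()
```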

Another Unforgivable Pie Chart

On the very next page, Luntz threw this pie at our faces:

[Image: Luntz_pie]

Make sure you read the sentence at the bottom.

It appears that he removed the largest group of responses, and then reweighted the CEO and Companies responses to add to 100%.

This procedure is always ill-advised: respondents responded to the full set of choices, and had they been given only these two responses, they might well have answered differently.

It also elevated secondary responses while dispensing with the primary response.
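A toy calculation shows how much this reweighting can distort the picture; the shares below are invented, since the slide does not give the full breakdown.

```python
# Hypothetical full set of responses, in percent
full = {"Government": 40, "CEOs": 35, "Companies": 25}

# Drop the largest group and renormalize the remaining responses to 100%
kept = {k: v for k, v in full.items() if k != "Government"}
total = sum(kept.values())
reweighted = {k: 100 * v / total for k, v in kept.items()}

print(reweighted)  # CEOs ~58%, Companies ~42%: a secondary response now looks dominant
```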


Leaving ink traces

Stefan S. at the UNEP GEO Data Portal sent me some intriguing charts, made from data about the environment.  The following shows the amount of CO2 emissions by country, both in aggregate and per capita.  We looked at some of their other charts before.

[Image: Co2emission]

These "inkblot" charts are visually appealing, and have some similarities with word clouds.  It's pretty easy to find the important pieces of data; and while in general we should not sort things alphabetically, here, as in word clouds, the alphabetical order is actually superior as it spaces out the important bits.  If these were sorted by size, we'd end up with all the big blots on top, and a bunch of narrow lines at the bottom - and it would look very ugly.

The chart also breaks another rule. Each inkblot is a mirror image about a horizontal line. This arrangement is akin to arranging a bar chart with the bars centered (this has been done before, here).  It works here because there is no meaningful zero point (put differently, many zero points) on the vertical scale, and the data is encoded in the height of each inkblot at any given time.

Breaking such a rule has an unintended negative consequence.  The change over time within each country is obscured: the slope of the upper envelope now contains only half of the change; the other half resides in the lower envelope's slope.  Given that the more important goal is cross-country comparison, I think the tradeoff is reasonable.
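The mirrored construction itself is easy to reproduce. Here is a minimal matplotlib sketch with a fabricated series, where the value for each year is encoded in the total height of the blot:

```python
import numpy as np
import matplotlib.pyplot as plt

years = np.arange(1990, 2011)
emissions = 100 + 5 * (years - 1990) + 10 * np.sin(years / 3.0)  # fake emissions series

fig, ax = plt.subplots(figsize=(8, 1.5))
ax.fill_between(years, -emissions / 2, emissions / 2)  # mirror about a horizontal centerline
ax.set_yticks([])      # the vertical position carries no meaning; only the height does
ax.set_xlabel("Year")
ax.set_title("One country's 'inkblot' (illustrative data)")
plt.tight_layout()
plt.show()
```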

[Image: Co2emission2]

Colors are chosen to help readers shift left and right between the per capita data and the aggregate data.  Gridlines and labels are judicious.

As with other infographics, this chart does well to organize and expose interesting bits of data but doesn't address the next level of questions, such as why some countries contribute more pollution than others.

One suggestion: restrict the countries depicted to those satisfying both rules (per capita emissions > 1000 kg AND total emissions > 10 million tonnes).  In the current version, a country like Albania is found on one chart but not the other.  This disrupts the shifting back and forth between the two charts.




Following one's nose 2

This is the second post on the immigration paradox study, first discussed on the Gelman blog.  My prior post on the graphing aspect is here; this post focuses on the statistical aspects. I am working backwards through Andrew's discussion points.


Which difference is most interesting?

[Image: Interaction]

5. Agree with Andrew; they should publish similar analyses on other minority groups as soon as possible.  One thing that strikes me when looking at the interaction plot is that the U.S.-born non-Latino whites have a much higher incidence of mental illness.  The difference between subgroups of Latinos paled in comparison to the difference between non-Latinos and Latinos.  This latter difference is more pronounced among the U.S. born than among the immigrants. The importance of the Latino analysis hinges upon whether the "paradox" is also found among other minority groups.

(Chris P also pointed this out in his comment on the previous post.)


Disaggregation, Practical Significance, and the Meaning of Not Significant

2. Andrew is also right in expressing moderate skepticism about this sort of disaggregation exercise.  He connects this to the subtle statistical point that "the difference between significant and not significant is not significant."  A related but less abstruse issue is that as one disaggregates any data, the chance of seeing variations that stray from the average gets higher and higher.  This is because the sample size is decreasing, and so the statistical estimates are less reliable.

(To give a flavor of the scale, there were a total of 2500 Latinos in the sample, with 500 Puerto Rican Latinos. The analysis drilled down to the level of different types of mental disorders, subgroups of Latinos, and also adjusted for demographics.  The details of the demographic adjustment are not available, but in any case, one should be concerned about whether there were sufficient numbers of, say, male immigrant Puerto Rican Latinos age 18-25 with income < $10,000 living in a rental apartment, for such an elaborate exercise.)

Expanding on this point further, one observes that the measured gap between U.S. born and immigrant Puerto Rican Latinos was about 5%.  But this 5% is probably of considerable practical significance since the base rate of incidence is about 30% (I say probably since I am not an expert in mental illness).  The current statistical analysis judged this to be insignificant -- if the sample size were larger, this difference could conceivably be statistically significant, and also practically significant.

But doesn't the significance test deal with the small sample size problem?  It would, if the authors had merely described the Puerto Rican result as inconclusive.  Here, as is done very commonly, insignificance is equated with "no difference": they said

No differences were found in lifetime prevalence rates between migrant and U.S.-born Puerto Rican subjects.

In reality, a difference of 5% was found in the sample that was analyzed.  The statistical procedure found that this difference could have been a result of chance -- notice "could", not "must".  If the measured difference was 0.5% on 30%, then I might be willing to accept a finding of "no difference"; when it was 5% on 30%, I would like to see a larger sample analyzed.
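A rough calculation illustrates why the sample size matters. The group sizes and rates below are stand-ins (roughly 500 Puerto Rican respondents split evenly, a 30% base rate, and a 5-point gap), not the study's actual breakdown:

```python
import math

def two_prop_pvalue(p1, n1, p2, n2):
    """Two-sided p-value for a two-proportion z-test."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = abs(p1 - p2) / se
    return math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

# A 30% vs 35% gap with ~250 per group: not statistically significant
print(two_prop_pvalue(0.30, 250, 0.35, 250))    # ~0.23
# The same gap with 2,000 per group would be highly significant
print(two_prop_pvalue(0.30, 2000, 0.35, 2000))  # ~0.0007
```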


The Meaning of Paradox

1. Andrew was perplexed by why the phenomenon is known as a "paradox". I had the same issue until I read the paper. The authors were a bit sloppy in the abstract. In the paper itself, they explained that the conventional wisdom has it that immigrants should be more likely to have mental illness because of the stress from the immigration process, and yet the statistics showed the exact opposite. That is the paradox.


Publication Bias


I was a little shocked to see the data tables that gave all the estimates of the various effects at the various subgroup levels: shocked because the authors were allowed (or asked) to include only the p-values that were below some unspecified level (which I surmised is 10% although a 5% significance level is used to judge significance as per convention). This is publication bias within publication bias. P-values that are not significant still provide valuable information and should not be omitted. They did provide confidence intervals but for each subgroup separately, rather than for the difference -- and as they noted, such intervals by themselves are inconclusive when they overlap moderately.


[Image: Publication_bias]




Life-enabling charts

In response to my call for positive examples, reader Merle H. sent in an example of how good charts can make our lives simpler and easier.

All of us have seen the following presentation of air travel data.

[Image: Travelocity]

Not trying to pick on Travelocity: it's the same format whether you use Expedia or any of the airline sites.  For those customers who are looking to decide what dates to travel so as to minimize their air fare, this format is very cumbersome to use.


[Image: Flight_chart]

What about this fare chart at FuncTravel.com?

As you mouse along the line chart, the average fare for each day is visible.  Clicking on a particular day will fix the departure or return dates.

So much easier, isn't it?


A few caveats, though:

  • Instead of just providing the historical averages, they should consider including information on variability, such as bars that indicate the middle 50% or 75% of prices (see the sketch after this list).  Also, what about a sliding control for customers to decide which period of past history the averages should use?  More recent data may be more representative.
  • This particular feature appeals to the price-sensitive, date-flexible customer segment.  Not everyone will pick itineraries based on those criteria.  There is an easy fix. If some controls are available for customers to indicate other preferences, e.g. exclude all British Airways flights, include only evening flights, etc., and the chart can update itself based on such selections, then the chart becomes a lot more flexible, and useful to many more customers.
  • As with many automatically generated charts, the chosen labels on the vertical axis are laughable.  That should be relatively easy to fix, you'd think.
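Here is a rough sketch of the first suggestion in the list above: a fare line with a band marking the middle 50% of historical prices. All the numbers are invented.

```python
import numpy as np
import matplotlib.pyplot as plt

days = np.arange(1, 31)                        # departure dates over the next month
median_fare = 300 + 40 * np.sin(days / 4)      # fake median fare by departure date
q25, q75 = median_fare - 30, median_fare + 45  # fake middle-50% band of past prices

fig, ax = plt.subplots()
ax.fill_between(days, q25, q75, alpha=0.3, label="middle 50% of past fares")
ax.plot(days, median_fare, label="median fare")
ax.set_xlabel("Departure date (day of month)")
ax.set_ylabel("Fare ($)")
ax.legend()
plt.show()
```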
A great start.  I happened to notice that Travelocity has a beta feature that shows a similar chart.  A revolution in how travel sites present data to us is long overdue.






A tale of four charts

Speaking of rules for making charts, I think the most important is "if at first you don't succeed, try, try and try again."  It's absolutely essential to produce multiple looks before settling on the one that helps tell the story.

While researching U.S. consumer credit this week, I came across these four views of presumably the same data set.

[Image: Credit_charts]

The chart on the top left (via Business Insider) shows a downward sloping line, with a steep decline on the right edge of the chart.  The authors clearly wanted to show us that consumer credit in the U.S. was collapsing.  The bottom was falling out.  To aid in this effort, they chose to:

  • start the plot at the peak of the time series (2000);
  • set the minimum value of the vertical scale to coincide with the lowest value attained thus far (2009) -- we have reached rock bottom, ladies and gentlemen; and
  • pay no particular attention to the 0% line, which is placed in the lower half of the chart, rather than the middle, thus obscuring the fact that credit grew at faster rates during the height of the boom than it has been declining during the recession.
(Thanks to Excel's default setting, we are treated to a centipede effect on the horizontal axis.  Also thick lines and line shadows.)

Readers who previously complained about my willingness to draw a line through things that shouldn't be connected will be none too amused by the use of a line chart to plot year-on-year changes.  Thus, the zig-zagging of the line represented the change of change, which has been dubbed the "second derivative" during a period when optimists used technical wizardry to show us "green shoots".  To read this chart properly, one should focus on the actual annual decline (the dots) rather than the line.
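A sketch of the two fixes implied here: plot the year-on-year changes as discrete marks rather than a connected line, and center the vertical axis on 0% so growth and decline get equal visual weight. The series below is simulated, not the actual consumer credit data.

```python
import numpy as np
import matplotlib.pyplot as plt

years = np.arange(2000, 2010)
yoy_change = np.array([8.5, 7.0, 5.5, 5.0, 4.5, 4.0, 3.5, 2.5, 1.0, -4.0])  # fake % changes

fig, ax = plt.subplots()
ax.bar(years, yoy_change)          # discrete annual changes, not a connected line
ax.axhline(0, linewidth=1)         # make the zero line explicit
limit = np.abs(yoy_change).max() * 1.1
ax.set_ylim(-limit, limit)         # put 0% in the middle of the chart
ax.set_ylabel("Year-on-year change in consumer credit (%)")
plt.show()
```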

If one takes a longer-term view, as in the chart on the top right (via Rolfe Winkler), the recent drop in consumer credit can be put in perspective.  Since the 1960s, consumer credit has been positively exploding with few years of decline.  Note, again, that the falling line between 2000 and 2008 represented a slowing rate of growth, not a decline.  The question to ask is: after so many years of almost continuous growth, is the current correction such a big cause for alarm?

This situation of exploding growth followed by a slight correction is better visualized in the third chart (lower left, via chartingtheeconomy.com).  The author also answered a question asked by Rolfe, which is to compare the total consumer credit to the population, as he plotted the per-capita consumer credit.  An eyeball estimate tells us that consumer credit jumped by more than 800 percent since the 1970s, and the current retrenchment is a blip if we take a long view.  The explosion in consumer credit has no doubt enhanced wellbeing in the U.S. for decades; even though credit might well have been over-extended in the recent past, it is far from something evil.

The final chart (bottom right, via The Big Picture) should carry a health warning.  It really should not be shown by itself, or without comment.  Both charts on the right, in fact, came from the same presentation, by David Rosenberg.  This chart, not surprisingly, is the most dramatic -- the chart designer is going for the jugular.  The decline on the right side is much more exaggerated than in any of the other three.  The trick here is:

  • to plot the dollar value of credit change, rather than the percentage value, thus ensuring that the further back in time, the more insignificant the change appears;
  • to connect dots which bring out the steep decline when in fact, the steep drop reflected the second derivative; and
  • to put the 0% line below the middle of the chart, which causes the bottom "half" of the chart to be smaller than the top "half", which plays with our perception so that we may not realize that there were multiple years of growth above the absolute level of decline recently experienced.

If this chart were to be believed, our focus should not be on the cliff-diving at the far right -- instead, the chart has the hallmark of a system getting completely out of control, and oscillating to oblivion.  One hopes that is not the message intended by its creator.

For anyone creating charts, it would be a great idea to have attempted all of these versions, and more.










Food art

Adam, who is the designer behind the Wired graphics special on "The Future of Food", asked about the rest of the series.  We previously made some comments on a set of mini donut charts.

The first thought that came to mind after browsing through all the charts was: what a great job they have done to generate interest in food data, which has no right to be entertaining.  Specifically, this is a list of things I appreciated:

  • An obvious effort was undertaken to extract the most thought-provoking data out of a massive amount of statistics collected by various international agencies.  There wasn't any chart that was overstuffed, which is a common problem.
  • It would be somewhat inappropriate to use our standard tools to critique these charts.  Clearly, the purpose of the designer was to draw readers into statistics that they might otherwise not care for.   Moreover, the Wired culture has long traded off efficiency for aesthetics, and this showed in a graph such as this, which is basically a line chart with two lines, and a lot of mysterious meaningless ornaments:
  • [Image: Wired_feedtheworld]
  • A nice use of a dual line chart, though.  It works because both data series share the same scale and only one vertical axis is necessary, which is very subtly annotated here.
  • The maintenance of the same motifs across several charts is well done.  (See the pages on corn, beef, catfish)


Further suggestions:

  • [Image: Wired_bar] It would be nice if Wired were brave enough to adopt the self-sufficiency principle, i.e. graphs should not contain a copy of the entire data set being depicted; otherwise, a data table would suffice.  The graphical construct should be self-sufficient.  This rule is not often followed because of "loss aversion": there is the fear that a graph without all the data is like an orphan separated from the parents.  Since, as I noted, these graphs are mostly made for awe, there is really no need to print all the underlying data.  For instance, these "column"-type charts can stand on their own without the data (adding a scale would help).
  • Not sure if sorting the categories alphabetically in the column chart is preferred to sorting by size of the category.  The side effect of sorting alphabetically is that it spreads out the long and the short chunks, which simplifies labelling and thus reading.
  • Not a fan of area charts (see below).  Although it is labelled properly, it is easy at first glance to focus on the orange line rather than the orange area.  That would be a grave mistake.  The orange line actually plots the total of the two types of fish rearing, not the aquaculture component.  The chart is somewhat misleading because it is difficult to assess the growth rate of aquaculture.  Much better to plot the size of both markets as two lines (either indexed or not); see the sketch after this list.
  • [Image: Wired_aquaculture]
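Here is a sketch of the alternative suggested in the area-chart bullet above: the two fish-farming series plotted as separate lines, indexed to a base year so their growth rates can be compared directly. The figures are invented, not Wired's data.

```python
import numpy as np
import matplotlib.pyplot as plt

years = np.arange(1980, 2011, 5)
wild = np.array([65, 70, 75, 78, 80, 79, 80])       # fake wild-catch volumes
aquaculture = np.array([5, 8, 13, 21, 32, 45, 60])  # fake aquaculture volumes

fig, ax = plt.subplots()
# Index each series to its 1980 value (= 100) so growth rates are directly comparable
ax.plot(years, 100 * wild / wild[0], label="Wild catch (1980 = 100)")
ax.plot(years, 100 * aquaculture / aquaculture[0], label="Aquaculture (1980 = 100)")
ax.set_xlabel("Year")
ax.legend()
plt.show()
```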


Reference: "The Future of Food", Wired, Oct 20 2008.