Cat and dog food, for thought

My friend Rhonda (@RKDrake) sends me to this pair of charts (in BusinessWeek). They are fun to look at and to ponder.

Here's the first chart:

Bw_catdog

 Should the countries be colored according to the distance from the Equator?

Is this implying that cats and dogs have different preferential habitats?

Is there a lurking variable that is correlated with distance from equator?

What is the relationship between cat and dog owners?

Is there any significance to countries sitting on that diagonal, whereby the proportion of households owning dogs is the same as the proportion owning cats?

In particular, what proportion of these households have both dogs and cats?

If 20% of households have cats, and 20% of households have dogs, how many of these households are the same ones? (A quick bound on this overlap appears after these questions.)

How are the countries selected?

Where does the data come from?

The data provider is named, but does the data come from surveys? Are those randomized surveys?

Are the criteria used to collect data the same across all these countries?
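As a side note on the overlap question above: basic inclusion-exclusion bounds the answer, but the chart alone cannot pin it down.

$$\max(0,\ 0.20 + 0.20 - 1) \;\le\; P(\text{cat and dog}) \;\le\; \min(0.20,\ 0.20) = 0.20$$

In other words, the overlap could be anywhere from 0% to 20% of households.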

***
The other chart is about cat and dog food. Again, nice aesthetics, clean execution. Lots of questions but worth looking at. Enjoy.



Breaking every limb is very painful

This Financial Times chart is a big failure:

Ft_hb1_locations

Look at the axis. Usually a break in the axis is reserved for outliers. If there is one bar in a bar chart that extends way beyond the rest of the data, then you would sever that bar to let readers know that the scale is broken. Here, the designer broke every bar in the entire chart. It's as if the designer knows we'll complain about not starting the chart at zero -- so the bars all start at zero except they jump from zero to 70 right away.

***

Trifecta_checkup

The biggest issue with this chart is not its graphical element. It's the other two corners of the Trifecta checkup: what is the question being asked? And what data should be used to address that question?

The accompanying article complains about the dearth of H-1B visas for technical talent at businesses. But it never references the data being plotted.

It's hard for me to even understand what the chart is saying. I think it is saying that in Bloomington-Normal, IL, 94.8 percent of its H-1B visa requests are science related. There is no way to interpret this number without knowing the corresponding percentage for the entire country. It is most likely true that H-1B visas are primarily used to recruit technical talent from overseas, and the proportion of such requests that are STEM related is high everywhere. In this sense, it's not clear that the proportion of H-1B requests that are science related is a useful indicator of the dearth of technical talent.

Secondly, it is highly unlikely that the decimal point is meaningful. Given the highly variable total number of requests across different locations, the decimal point would represent widely varying numbers of requests.
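To make this concrete with hypothetical counts: in a location that filed 200 requests, a single request moves the proportion by

$$\frac{1}{200} = 0.5 \text{ percentage points},$$

so the tenths digit is pure noise; in a location with 5,000 requests, that same 0.1-point digit corresponds to five requests.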

I'd prefer to look at the absolute number of requests for this type of analysis, given that Silicon Valley has orders of magnitude more technical jobs than most of the other listed locations. Requests aren't even a good indicator of labor shortage. Typically, H-1B visas run up against the quota sometime during the year, and companies then stop requesting new visas since there is no chance of getting approved. This is a form of survivorship bias. Wouldn't it be easier to collect data on the number of vacant technical jobs in each location?



How to fail three tests in one chart

The November issue of Bloomberg Markets published the following pair of pyramid charts:

Bb_pyramids

This chart fails a number of tests:

Tufte's data-ink ratio test

There are a total of six data points in the entire graphic. A mathematician would say only four data points, since the "no opinion" category is just the remainder. The designer lavishes this tiny data set with a variety of effects: colors, triangles, fonts of different tints, fonts of different sizes, solid and striped backgrounds, and legends, making something that is simple much more complex than necessary. The extra stuff impedes rather than improves understanding. In fact, there were so many parts that the designer even forgot to add little squares on the right panel beside the category labels.

Junk Charts' Self-sufficiency test

The data are encoded in the heights of the pyramids, not the areas. The shapes are also inconsistent, which makes the areas impossible to compare: the way it is set up, one must compare the green, striped triangle with two trapezoids. This is when a designer realizes that he or she must print the data labels onto the chart as well, and that is when self-sufficiency is violated. Cover up the data labels, and the graphical elements themselves no longer convey the data to the readers. More posts about self-sufficiency here.
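To see how far the areas stray from the heights, treat the pyramid as a plain triangle (a simplifying assumption): a bottom band covering a fraction $p$ of the total height occupies a fraction

$$1-(1-p)^2 = 2p - p^2$$

of the total area. For $p = 0.30$, that is $1 - 0.7^2 = 0.51$, so a segment encoding 30% of the data fills about half the picture.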

Junk Charts' Trifecta checkup

The juxtaposition of two candidates' positions on two entirely different issues does not yield much insight. One is an economic issue; the other is military in nature. Is this a commentary on the general credibility of the candidates? On their credibility on specific issues? On the investors' attitudes toward the issues? Once the pertinent question is clarified, the journalist needs to find the right data to address it. More posts about the Trifecta checkup here.

Minimum Reporting Requirements for polls

Any pollster who doesn't report the sample size and/or the margin of error is not to be taken seriously. In addition, we should want to know how the sample was selected. What does "global investors" mean? Did the journalist randomly sample some investors? Or did investors happen to fill out a survey that was served up somehow?

***

The following bar charts, while not innovative, speak louder.

Redo_pyramid1
Redo_pyramid2


The "data" corner of the Trifecta

Trifecta

In the JunkCharts Trifecta checkup, we reserve a corner for "data". The data used in a chart must be in harmony with the question being addressed, as well as with the chart type being selected. When people think about data, they often think of cleaning the data and processing the data, but what comes before that is collecting the data -- specifically, collecting data that directly address the question at hand.

Our previous post on the smartphone app crashes focused on why the data was not trustworthy. The same problem plagues this "spider chart", submitted by Marcus R. (link to chart here)

Qlikview_Performance

Despite the title, it is impossible to tell how QlikView is "first" among these brands. In fact, with several shades of blue, I find it hard to even figure out which part refers to QlikView.

The (radial) axis is also a great mystery because it has labels (0, 0.5, 1, 1.5). I have never seen surveys with such a scale.

The symmetry of this chart is its downfall. These "business intelligence" software products are ranked along 10 dimensions. There may not be a single decision-maker who would assign equal weight to each of these criteria. It's hard to imagine that "project length" is as important as "product quality", for example.

Take one step back. This data came from respondents to a survey (link). There is very little information about the composition of the respondents. Were they asked to rate all 10 products along 10 dimensions? Do they only rate the products they are familiar with? Or only the products they actively use? If the latter, how are responses for different products calibrated so that a 1 rating from QlikView users equals a 1 rating from MicroStrategy users? Given that each of these products has broad but not completely overlapping coverage, and users typically deploy only a part of the solution, how does the analysis address the selection bias?

***

The "spider chart" is, unfortunately, most often associated with Florence Nightingale, who created the following chart:

Nightingale

This chart isn't my cup of tea either.

***

Also note that the spider chart has so much over-plotting that it is impossible to retrieve the underlying data.



A data mess outduels the pie-chart disaster for our attention

Reader Daniel L. sends us to a truly horrifying pie chart. This:

Bgr-crashes-ios-android-1

Link to the original here.

The background: a smartphone monitoring company, Crittercism, compiled data on the frequency of app crashes by version of mobile operating system (Android or Apple iOS). The data is converted into proportions adding to 100%.

If we spend our time trying to figure out the logic behind the ordering and placement of the data (e.g., why is iOS split across both sides? why are the pieces not sorted by size?), we will miss the graver problem with this chart: the underlying data.

***

Here is a long list of potential issues:

  • Crittercism sells app monitoring tools for app developers. Presumably this is how it is able to count app crashes. But who are their customers? Are they a representative set of the universe of apps? Do we even know the proportion of Android/iOS apps being monitored?
  • There is reason to believe that the customer set is not representative. One would guess that more crash-prone apps are more likely to have a need for monitoring. Also, is Apple a customer? Given that Apple has many highly popular apps on iOS, omission of these would make the data useless.
  • The data wasn't adjusted for the popularity of apps. It's very misleading to count app crashes without knowing how many times each app has been opened. This is the same fallacy as drawing conclusions about flight safety from the list of fatal plane accidents; the millions of flights that complete without incident provide lots of information! (See Chapter 5 of my book for a discussion of this.)
  • The data has severe survivorship bias. The blog poster even mentions this problem but adopts the attitude that such disclosure somehow suffices to render useless data acceptable. More recent releases are more prone to crashes just because they are newer. If a particular OS release is particularly prone to app crashes, then we expect a higher proportion of users to have upgraded to newer releases. Thus, older releases will always look less crash-prone, partly because more bugs have been fixed, and partly because of decisions by users to switch out. iOS is the older operating system, and so there are more versions of it being used.
  • How is a "crash" defined?  I don't know anything about Android crashes. But my experience with PC operating systems is that each one has different crash characteristics. I suspect that an Android crash may not be the same as an iOS crash.
  • How many apps and how many users were included in these statistics? Specifying the sample size is fundamental to any such presentation.
  • Given the many problems related to timing as described above, one has to be careful when generalizing with data that only span two weeks in December.
  • There are other smartphone operating systems in use out there. If those are omitted, then the proportions shown cannot honestly add up to 100% unless those other operating systems never have app crashes.

***

How to fix this mess? One should start with the right metric, which is the crash rate, that is, the number of crashes divided by the number of app starts. Then, make sure the set of apps being tracked is representative of the universe of apps out there (in terms of popularity).

Some sort of time matching is needed. Perhaps trace the change in crash rate over time for each version of each OS. Superimpose these curves, with the time axis measuring time since first release. Most likely, this is the kind of problem that requires building a statistical model because multiple factors are at play.
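Here is a minimal sketch of that calculation, assuming a hypothetical log with columns os, version, date, release_date, app_starts and crashes (none of this reflects Crittercism's actual feed):

```python
import pandas as pd

# Hypothetical input: one row per (os, version, date) with the number of
# app launches and app crashes observed that day.
df = pd.read_csv("crash_log.csv", parse_dates=["date", "release_date"])

# Time matching: measure the age of each OS version in days since its release,
# so old and new releases are compared at the same point in their life cycle.
df["days_since_release"] = (df["date"] - df["release_date"]).dt.days

# The metric of interest is the crash rate, not the share of total crashes.
curves = (df.groupby(["os", "version", "days_since_release"], as_index=False)
            .agg(crashes=("crashes", "sum"), app_starts=("app_starts", "sum")))
curves["crash_rate"] = curves["crashes"] / curves["app_starts"]

# Superimpose one line per version: days_since_release on the horizontal axis,
# crash_rate on the vertical axis.
```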

Finally, I'd argue that the question being posed is better answered using good old-fashioned customer surveys collecting subjective opinion ("how many crashes occurred this past week?" or "rate crash performance"). Yes, this is a shocker: a properly-designed small-scale survey will beat a massive-scale observational data set with known and unknown biases. You may agree with me if you agree that we should care about the perception of crash severity by users, not the "true" number of crashes. (That's covered in Chapter 1 of my book.)



Motion-sick, or just sick?

Reader Irene R. was asked by a client to emulate this infographic movie, made by UNIQLO, the Japanese clothing store.

Here is one screen shot of the movie:

Uniqlo

This is the first screen of a section; from this moment, the globes dissolve into clusters of photographs representing the survey respondents, which then parade across the screen. Irene complains of motion sickness, and I can see why she feels that way.

Here is another screen shot:

Uniqlo2

Surprisingly, I don't find this effort completely wasteful. This is because I have read my fair share of bore-them-to-tears compilations of survey research results -- you know, those presentations with one multi-colored, stacked or grouped bar chart after another, extending for dozens of pages.

There are some interesting ideas in this movie. They have buttons on the lower left that allow users to look at subgroups. You'll quickly find the limitations of such studies by clicking on one or more of those buttons... the sample sizes shrink drastically.

The use of faces animates the survey, reminding viewers that the statistics represent real people. I wonder how they chose which faces to highlight, and in particular, whether the answers thus highlighted represent the average respondent. There is a danger that viewers will remember individual faces and their answers more than they recall the average statistics.

***

If the choice is between a thick presentation gathering dust on the CEO's desk and this vertigo of a movie that perhaps might get viewed, which one would you pick?



Nothing is as simple as it seems

Thanks to reader Chris P. (again) for pointing us to this infographic about teacher pay. This one is much better than your run-of-the-mill infographics poster. The designer has set out to answer specific questions like "how much do teachers make?", and has organized the chart in this way.

This post is about the very first chart because I couldn't get past it. It's a simple bar chart, with one data series indexed by country, showing the relative starting salary of a primary-school teacher with minimal training. This one:

Sosashable_teacherpay

The chart tells us that the range of salaries goes from about $12,000 at the low end (Poland) to over $65,000 at the high end (Luxembourg), with the U.S. roughly at the 67th percentile, running at $42,000 per year. The footnote says that the source was the OECD.

The chart is clean and simple, as a routine chart like this should be. One might complain that it would be easier to read if flipped 90 degrees, with country labels on the left and bars instead of columns. But that's not where I got stuck... mentally.

I couldn't get past this chart because it generated so many unanswered questions. The point of the chart is to compare U.S. teacher pay against the rest of the world (apologies to readers outside the U.S., I'm just going with the designer's intention). And yet, it doesn't answer that question satisfactorily.

Our perception of the percentile ranking of the U.S. is fully determined by the choice of countries depicted. One wonders how that choice was made. Do the countries provide a nice sampling of the range of incomes from around the world? Is Poland truly representative of low pay and Luxembourg of high pay? Why are Korea and Japan the only two Asian countries shown and not, say, China or India? Why is there a need to plot Belgium (Fl.) separately from Belgium (Fr.), especially since the difference between the two parts of Belgium is dwarfed by the difference between Belgium and any other country? This last one may seem unimportant but a small detail like this changes the perceived ranks.

Further, why is the starting salary used for this comparison? Why not average salary? Median salary? Salary with x years of experience? Perhaps starting salary is highly correlated to these other metrics, perhaps not.

Have there been sharp changes in the salaries over time in any of these countries? It's quite possible that salaries are in flux in less developed countries, and more stable in more developed countries.

Also, given the gap in cost of living between, say, Luxembourg and Mexico, it's not clear that the Mexican teacher earning about $20,000 is worse off than the Luxembourger taking home about $65,000. I was curious enough to do a little homework: the PPP GDP per capita in Luxembourg was about $80,000, compared to $15,000 in Mexico, according to IMF (source: Wikipedia), so after accounting for cost of living, the Mexican earns an above-average salary while the Luxembourger takes home a below-average salary. Thus, the chart completely misses the point.

***

Jc_trifecta

Using the Trifecta checkup, one would address this type of issue when selecting the appropriate data series to address the meaningful question.

Too often, we pick up any data set we can lay our hands on, and the data fails to answer the question, and may even mislead readers.


PS. On a second look, I realized that the PPP analysis shown above was not strictly accurate, as I compared an unadjusted salary to an adjusted figure. A better analysis is as follows: divide the per-capita PPP GDP of each country by its per-capita unadjusted GDP to form the adjustment factor. Using IMF numbers, this is 0.74 for Luxembourg and 1.57 for Mexico. Now, adjust the average teacher salary by those factors. For Luxembourg, the salary adjusted for cost of living is $48,000 (note that this is an adjustment downwards due to the higher cost of living in that country), and for Mexico, the adjusted salary is inflated to $31,000. Now, these numbers can be appropriately compared to the $80,000 and $15,000 respectively. The story stays the same.
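A quick arithmetic check of this PS, using only the figures quoted above (the salary figures are my reading of the chart, so treat them as approximate):

```python
# adjustment factor = per-capita PPP GDP / per-capita unadjusted (nominal) GDP
salaries = {"Luxembourg": 65_000, "Mexico": 20_000}  # starting salaries read off the chart
factors  = {"Luxembourg": 0.74,   "Mexico": 1.57}    # IMF-based factors quoted in the PS
ppp_gdp  = {"Luxembourg": 80_000, "Mexico": 15_000}  # per-capita PPP GDP quoted above

for country in salaries:
    adjusted = salaries[country] * factors[country]
    print(f"{country}: adjusted salary ~{adjusted:,.0f} vs per-capita PPP GDP {ppp_gdp[country]:,}")

# Luxembourg: ~48,100 (below its 80,000); Mexico: ~31,400 (above its 15,000) -- same story.
```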



Unscientific American 1: misreadings

Chris P. sent me to this set of charts / infographics with the subject line "all sorts of colors and graphs."  I let the email languish in my inbox, and I now regret it, for several reasons: one, the topic of how scientists can communicate better with, and thus exert stronger influence on, the public is very close to my heart (as you can tell from my blogs and book), and this article presents results from a poll on this topic of the on-line readers of Scientific American and Nature magazines; two, some of the charts are frankly quite embarrassing to have appeared in venerable publications of a scientific nature (sigh); three, these charts provide a convenient platform to review some of the main themes on Junk Charts over the years.

Since the post is so long, I have split it into two parts. In part 1, I explore one chart in detail. In part 2, I use several other charts to illustrate some concepts that have been frequently deployed on Junk Charts.

***

Exhibit A is this chart:

Sa_howmuchdopeople

First, take a look at the top left corner. At first glance, I took the inset to mean: among scientists, how much do they trust scientists (i.e., their peers) on various topics?  That seemed curious, as that wouldn't be a question I'd have thought to ask, certainly not as the second question in the poll.

Sa_howmuchdopeople1

On further inspection, that is a misreading of this chart. The "scientists" represented above are objects, not subjects, in the first question. As the caption tells us, the respondents rated scientists at 3.98 overall, which is an average rating across many topics. The bar chart below tells us how the respondents rated scientists on individual topics, thus providing us with information on the spread of ratings.

Unfortunately, this chart raises more questions than it answers. For one, you find yourself working out how the average could be 3.98 (at that white line near 4.0) when all but three of the topic ratings were below 3.98. Did they use a weighted average and not let on?

Oops, I misread the chart again. I think what I stumbled on here is the design of the poll itself. The overall rating is probably a separate question, not at all related to the individual topic ratings. In theory, each person can assign a subjective importance as well as a rating to each topic; the average of the ratings weighted by their respective importance would form his or her overall rating of scientists. That would impose consistency on the two levels of ratings. In practice, it assumes that the listed topics span the space of topics each person considers when rating scientists overall.
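In symbols, the consistent version would be a weighted average, where $r_i$ is the respondent's rating of scientists on topic $i$ and $w_i$ is the subjective importance he or she attaches to that topic:

$$R_{\text{overall}} = \frac{\sum_i w_i\, r_i}{\sum_i w_i}$$

An overall score of 3.98 sitting above all but three topic ratings would then require those few higher-rated topics to carry a disproportionate share of the weight.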

***

The bar chart has a major problem... it does not start at zero. Since some bars are about half as long as the longest, you might think the level of trust associated with nuclear power or climate change would be around 2 (negative). But it's not; it's in the 3.6 range. This is a lack of self-sufficiency: the reader cannot understand the chart without fishing out the data.

Now, ask this question: in a poll in which respondents are asked to rate things on a scale of 1, 2, 3, 4, 5, do you care about the average rating to 2 decimal places? The designer of the graphic seems to think not, as the rating was rounded to the nearest 0.5 and presented using the iconic 5-star motif. I think this is a great decision!
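For the record, the rounding itself is a one-liner; here is a sketch (the 2.6 input is just an illustration, not a number from the poll):

```python
def to_half_stars(rating):
    """Round an average rating on a 1-to-5 scale to the nearest half star."""
    return round(rating * 2) / 2

print(to_half_stars(3.98))  # 4.0 -> four full stars
print(to_half_stars(2.6))   # 2.5 -> two and a half stars
```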

Citizensvsjournalists

But then, the designer fell for loss aversion: having converted the decimals to half-stars, he should have dropped the decimals; instead, he tucked them at the bottom of each picture. This is no mere trivia. Now, the reader is forced to process two different scales showing the same information. Instead of achieving simplification by adopting the star system, the reader ends up examining the cracks: is the trust given to citizens groups the same as that given to journalists (both 2.5 stars), or do "people" trust citizens groups more (higher decimal rating)?

***

The biggest issues with this chart concern the identification of the key questions and how to collect data to address those questions. This is the top corner of the Trifecta checkup.

1) The writer keeps telling us "people" trust this and that, but the poll only covered on-line readers of Scientific American and Nature magazines. One simply cannot generalize from that segment of the population to "people" at large.

2) Insufficient attention has been paid to selecting the right wording in the questions. For example, in Exhibit A, while the overall trust question was phrased as trusting the "accuracy" of the information provided by scientists vs. other groups, the trust questions on individual topics mentioned only a generic "trust". Unless one thinks "trust" is a synonym of "accuracy", the differential choice of words makes these two sets of responses hard to compare. And comparing them is precisely what they chose to do.

***

In part 2, I examine several other charts, taking stops at several concepts we use on Junk Charts a lot.



Lessons from propaganda

Political wordsmith (euphemism) Frank Luntz's presentation is all over the Web. I saw it on Business Insider. In the debate between words and numbers, Luntz obviously takes the side of words.

He used a few simple charts in the presentation, which is interesting by itself since he fundamentally is a words guy, not a numbers guy.

The charts, while simple, are very instructive:

Luntz_bar

This bar chart sent me running to the maligned pie chart! (Almost, read on.)

While the total responses were almost evenly split among the three choices, the bar chart drew our attention to the first bar, which is inapt.

If plotted as a pie chart, I thought, the reader would see three almost equal slices. This effect occurs because we are much less precise at judging the areas of slices than the lengths of bars. Wouldn't that turn our usual advice on its head?


How the Bar Chart is Saved

The one thing that the pie chart has as a default that this bar chart doesn't is the upper bound.  Everything must add up to 100% in a circle but nothing forces the lengths of the bars to add up to anything.

We save the bar chart by making the horizontal axis stretch to 100% for each bar.  This new scaling makes the three bars appear almost equal in length, which is as it should be.

Redo_luntz
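Here is a sketch of how the rescaled version could be produced, with made-up shares standing in for Luntz's numbers (the real percentages are on his slide):

```python
import matplotlib.pyplot as plt

labels = ["Response A", "Response B", "Response C"]  # placeholder labels
shares = [35, 33, 32]                                # hypothetical, roughly even split

fig, ax = plt.subplots(figsize=(6, 2))
ax.barh(labels, shares, color="steelblue")
ax.set_xlim(0, 100)       # the fix: every bar sits on an axis that runs to 100%
ax.set_xlabel("Percent of responses")
ax.invert_yaxis()         # keep the first response on top
plt.tight_layout()
plt.show()
```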

Another Unforgivable Pie Chart

On the very next page, Luntz threw this pie at our faces:

Luntz_pie

Make sure you read the sentence at the bottom.

It appears that he removed the largest group of responses, and then reweighted the CEO and Companies responses to add to 100%.

This procedure is always ill-advised: respondents responded to the full set of choices, and if they had been given only these two choices, they might well have answered differently.

It also elevated secondary responses while dispensing with the primary response.
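To see how much the renormalization inflates the surviving categories, take hypothetical shares of 40% / 35% / 25% (not Luntz's actual numbers). Dropping the 40% group and reweighting turns the other two into

$$\frac{35}{35+25} \approx 58\% \qquad \text{and} \qquad \frac{25}{35+25} \approx 42\%,$$

so a response chosen by roughly a third of the sample is now presented as the majority view.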


Leaving ink traces

Stefan S. at the UNEP GEO Data Portal sent me some intriguing charts, made from data about the environment.  The following shows the amount of CO2 emissions by country, both in aggregate and per capita.  We looked at some of their other charts before.

Co2emission

These "inkblot" charts are visually appealing, and have some similarities with word clouds.  It's pretty easy to find the important pieces of data; and while in general we should not sort things alphabetically, here, as in word clouds, the alphabetical order is actually superior as it spaces out the important bits.  If these were sorted by size, we'll end up with all the big blots on top, and a bunch of narrow lines at the bottom - and it will look very ugly.

The chart also breaks another rule. Each inkblot is a mirror image about a horizontal line. This arrangement is akin to arranging a bar chart with the bars centered (this has been done before, here).  It works here because there is no meaningful zero point (put differently, many zero points) on the vertical scale, and the data is encoded in the height of each inkblot at any given time.

Breaking such a rule has an unintended negative: the change over time within each country is obscured. The slope of the upper envelope now contains only half of the change; the other half sits in the lower envelope's slope. Given that the more important goal is cross-country comparison, I think the tradeoff is reasonable.
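Here is a minimal sketch of one such "inkblot", with made-up data; the value at each year sets the total thickness of the band, which is then mirrored about its own center line:

```python
import numpy as np
import matplotlib.pyplot as plt

years = np.arange(1990, 2011)
values = 100 + 40 * np.sin((years - 1990) / 5.0) + np.linspace(0, 60, len(years))  # made-up series

center = 0.0  # arbitrary baseline: the vertical position itself carries no meaning
plt.fill_between(years, center - values / 2, center + values / 2,
                 color="seagreen", alpha=0.8)
plt.yticks([])  # no vertical scale; only the thickness matters
plt.xlabel("Year")
plt.title("Mirrored band: thickness at each year encodes the value")
plt.show()
```

A full chart would stack one such band per country, each mirrored around its own center line.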

Co2emission2

Colors are chosen to help readers shift left and right between the per capita data and the aggregate data.  Gridlines and labels are judicious.

As with other infographics, this chart does well to organize and expose interesting bits of data but doesn't address the next level of questions, such as why some countries contribute more pollution than others.

One suggestion: restrict the countries depicted to those satisfying both rules (per capita emissions > 1,000 kg AND total emissions > 10 million tonnes). In the current version, a country like Albania is found on one chart but not the other, which disrupts the shifting back and forth between the two charts.
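The filtering itself is trivial once the data sit in a table; a sketch with assumed file and column names (country, per_capita_kg, total_tonnes):

```python
import pandas as pd

emissions = pd.read_csv("co2_emissions.csv")  # assumed columns: country, per_capita_kg, total_tonnes

# Keep only countries that clear BOTH thresholds, so the same set of
# countries appears on the per-capita chart and the aggregate chart.
both = emissions[(emissions["per_capita_kg"] > 1_000) &
                 (emissions["total_tonnes"] > 10_000_000)]
```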