
Game over, Tetris

This is the image in my head at the moment (link here):

White-tetris-game-over5-sweatshirts_design

That's when @yawnathan pointed me to the following infographic via Twitter:

Social-Networking-Demographic-Statistics-Infographic-1

Here are some issues with the choice of graphics:

  • Not sure how Tetris-shaped pieces are better than a standard stacked bar chart, or a line chart
  • Adding a one-liner for each analysis summarizing the key insight is essential, and much more engaging than dry titles like "by gender"
  • Ordering each section of the poster in a sensible way would help bring out the message; maintaining the same order in all four sections has little benefit but adds to the confusion
  • Many of the corporate logos are not popular enough to yield recognition; they do not resemble their company names enough to elicit free association

But the chart also fails to ask the right question. In thinking about "who uses which sites?", it would be much more informative to cut the data in a different way -- tell us among males, what proportion uses Digg v. Stumbleupon v. Facebook, etc. The problem with the current graphic is that it offers no information about scale. For example, Ning may have 1/1000-th of the total traffic compared to Facebook (I made this up) but you wouldn't know since everything is expressed as a proportion of each site's user base.

Besides, what is the objective behind asking the question, who uses which sites? Are readers asked to draw conclusions about the relative viability of the business models of these companies? Is there some significance associated with an elderly skew or female skew?

Finally, the chart hits the trifecta! It also fails from the data collection perspective. While it discloses the source of the data as "Google Ad Planner", it is impossible for readers to make sense of the data. How reliable is this data? Did the income levels come from surveys of users (self-reported and probably biased)? Or from users associated with a specific advertising campaign? Did they come from matching users' IP addresses to Census data? If so, how much actual household-level data were used? Or perhaps a statistical model was built to predict income levels? Of which period is the data representative? Does that period generalize to other periods? Were there any (or many) missing values? Were these values imputed or set to the average? If a sample was used, how do we know that it is unbiased?

***

In this form, the infographics poster is nothing more than a done-up data dump.


Yellow fever rolling over America

The headline of this Business Insider item reads: "MAP OF THE DAY: There's a 'Superbug' spreading around America killing 40% of the people who come into contact". The only things missing are the 10 exclamation points that could have been added to the end of the sentence.

Unfortunately, in the mass media, this sort of sentence is quite typical.

Let's dissect the claim.

Indeed, a disease with a fatality rate of 40% is very serious, but one must stop for a second and ask: 40% of what? Accidental falls are sometimes fatal but they just don't happen often enough for anyone to be worried. In the case of the new superbug, the article tells us there are 350 recent cases in Los Angeles county, which, last I checked, has 10 million residents. At a 40% fatality rate, 350 cases means 140 deaths. So, the chance of dying from this "superbug" is 140 out of 10 million, which is 0.0014% (about 1 in 71,000), compared to 1 in 14,000 for accidental falls.

If you have the bug, you have a 40% chance of dying. But the chance of catching the bug is minuscule. (They say "come into contact". Presumably, more than contact is needed to have the bug.)
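For anyone who wants to check the arithmetic, here is a minimal sketch. It uses only the figures quoted above (350 cases, a 40% fatality rate, and my round number of 10 million residents); the variable names are mine, and treating recorded cases as the full count of infections is an assumption.

```python
# Figures quoted in the post; the 10 million population is a round number.
cases = 350
fatality_rate = 0.40
population = 10_000_000

deaths = cases * fatality_rate   # expected deaths among the recorded cases
p_catch = cases / population     # chance of being one of the recorded cases
p_die = deaths / population      # unconditional chance of dying from the bug

print(f"expected deaths: {deaths:.0f}")
print(f"P(catch) = {p_catch:.4%}")
print(f"P(die)   = {p_die:.4%}, about 1 in {round(1 / p_die):,}")
```

The 40% in the headline is the conditional rate (dying given that you have the bug); the number that matters to the average resident is the last one printed.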

***

They then show a map illustrating how this bug is "spreading around America".

Us-bug

If you mentally tally up the yellow area as a proportion of the whole country, you might think 2/3 of the country is an emergency zone. But this map is incredibly misleading. It is still the case that the average American would only have a 0.0014% chance of dying from the superbug. (Strictly speaking, the rate would be a tad higher in the yellow area but this distinction will go away as cases pop up in the rest of the states.)

If one were to plot a similar map for "2010 location of deaths due to accidental falls", the entire map would be yellow. The only thing missing would be the 10 exclamation points.

 

 

 

 


An achievable target. And how?

The Wall Street Journal tells us that GM car buyers may react to the "volatility" of gas prices by demanding higher miles-per-gallon from their vehicles. They commissioned an analysis which finds that new GM cars sold today on average have an MPG of only about 21, and suggested that 30 would be a "challenging but achievable target". (Article here).

There are many problems with the analysis, such as no specification of when such a target should be met, and whose target this is, nor any comment on the potential impact on car sales (since higher-priced vehicles tend to have lower MPG) or the existence of government subsidies for larger (lower-MPG) vehicles. Complete silence too on reducing pollution or lowering our dependence on gasoline.

In any case, let's focus on the chart that comes with the article. First, take a look at the caption:

Wsj_mpg3

Then, the chart itself:

Wsj_mpg2

***
Readers are presented with this puzzle: how could a minor shuffling in the mix of cars raise the average MPG to 30 when today, the least gas-guzzling vehicle class (subcompact) only averages 30.6 MPG, barely above the target?

Oops, the chart portrays (less than) half of the solution. Tucked into the caption, the analyst tells us that she has assumed an across-the-board increase of 25% in MPG for every class of vehicle. Think this information is important? Perhaps so. A 25% improvement is about 5 MPG, bringing the average MPG (at the current mix of sales) to 26.5, so the shift in mix of vehicles accounts for about 3.5 MPG of the targeted improvement.
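The split between the two effects can be sketched in a few lines. This is a rough illustration, not the analyst's method: the function name is hypothetical, and I plug in a baseline of exactly 21 MPG, whereas the article only says "about 21".

```python
def decompose_mpg_gain(baseline_mpg, target_mpg, efficiency_gain):
    """Split a targeted MPG gain into an efficiency part and a sales-mix part."""
    # A uniform percentage gain in every class scales the average by the same factor
    # when the sales mix is held fixed.
    after_efficiency = baseline_mpg * (1 + efficiency_gain)
    from_efficiency = after_efficiency - baseline_mpg
    from_mix = target_mpg - after_efficiency
    return after_efficiency, from_efficiency, from_mix


# Assumed inputs: baseline of exactly 21 MPG, target of 30, across-the-board 25% gain.
after, eff, mix = decompose_mpg_gain(baseline_mpg=21.0, target_mpg=30.0, efficiency_gain=0.25)
print(f"average after efficiency gain: {after:.2f} MPG")
print(f"from efficiency: {eff:.2f} MPG, from mix shift: {mix:.2f} MPG")
```

With a baseline of exactly 21, the efficiency assumption alone gets you to 26.25 MPG and leaves 3.75 MPG for the mix shift. The 26.5 and 3.5 quoted above are consistent with a baseline slightly above 21.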

While the chart designer very sensibly ordered the vehicle classes from highest to lowest MPG, it is baffling why the row of MPG data is not labelled directly but given a dark background so as to justify adding a third item to the legend.

The use of stacked columns to represent data at two points in time is confusing. This type of data is much better presented in a Bumps-style chart (left chart below):

Redo_wsj_mpg

The chart on the right shows an across-the-board increase in MPG and gives a sense of how the different vehicle classes stack up along this dimension. (I should've put a marker on the current average of about 21 and the targeted average of 30 but didn't.)

There is a data error in the current sales data as the proportions add up to about 115% rather than 100%. (The last three categories alone add up to about 50%.)

***

This analysis has the flavor of the Facebook valuation projection I discussed on the sister blog a while ago. Both require several assumptions to all come true in order to be realized. Not only must the MPG for every vehicle grow by 25% but a large proportion of new-car buyers must also choose to purchase higher-MPG vehicles. From the chart above, one sees that the proportion buying subcompacts must increase 7-fold from about 2% to 14% while the proportion buying large vans must drop from 15% to 3%, cut by four-fifths!

According to the analyst, this is an "achievable" target.  

 


The best way to handle two dimensions may be to not use two dimensions

Guess what the designer at Nielsen wanted to tell you with this chart:

Smartphone-age-os
Reader Steven S. couldn't figure it out, and chances are neither can you.

What about...

  • The smartphone (OS) market is dominated by three top players (Android, Apple and Blackberry) each having roughly 30% share, while others split the remaining 10%.
  • The age-group mix for each competitor is similar (or is it?)

Maybe those are the messages; if so, there is no need to present a bivariate plot (the so-called "mosaic" plot, or in consulting circles, the Marimekko). Having two charts carrying one message each would accomplish the job cleanly.

***

Trying to do too much in one chart is a disease; witness the side effects.

Smartphone_sm1

The two columns, counting from the right, contain rectangles that appear to be of different sizes, and yet the data labels claim each piece represents 1%, and in some cases "< 1%".  The simultaneous manipulation of both the height and the width plays mind tricks.

Also, while one would ordinarily applaud the dropping of decimals from a chart like this, doing so actually creates the ugly problem that the five pieces of 1% (on the left column shown here) have the same width but clearly varying heights!
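A quick sketch of how this happens. The unrounded shares below are hypothetical (I don't have the underlying Nielsen data); they just show how pieces of quite different sizes all end up labeled "1%".

```python
# Hypothetical unrounded shares, in percent. NOT taken from the Nielsen chart.
shares = [0.6, 0.8, 1.0, 1.2, 1.4]

# Dropping the decimals gives every piece the same label.
labels = [f"{s:.0f}%" for s in shares]
print(labels)

# Yet the largest piece is more than twice the size of the smallest.
ratio = max(shares) / min(shares)
print(f"largest / smallest = {ratio:.2f}")
```

The labels agree while the drawn areas do not, which is exactly the mismatch the eye picks up.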

Smartphone_sm2

What about this section of the plot shown on the left? Does the smaller green box look like it's less than 1/3 the size of the longer green box? This chart is clearly not self-sufficient, and as such one might prefer a simple data table.

The downfall of the mosaic plot is that it gives the illusion of having two dimensions but only an illusion: in fact, the chart is dominated by one dimension, as all proportions are relative to the grand total.

For instance, the chart says that 6% of all smartphone users are between the ages of 18 and 24 AND use an Android phone. It also tells us that 2% of all smartphone users are between 35 and 44 AND use a Palm phone. Those are not two numbers anyone would desire to compare. There are hardly any practical questions that require comparing them.

Sometimes, the best way to handle two dimensions is not to use two dimensions.

***

The original article notes that "Of the three most popular smartphone operating systems, Android seems to attract more young consumers." In the chart shown below, we assume that the business question is the relative popularity of phone operating systems across age groups.

Redo_phoneos

The right metric for comparison is the market share of each OS within an age group.

For example, tracing the black line labeled "Android", this chart tells us that Android has 37% of the 18-24 market while it has about 20% of the 65 and up market. 
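Switching denominators is a small computation: divide each cell by its age group's total rather than by the grand total. The sketch below uses a hypothetical two-row table; only the 6% cell for Android among 18-24 year-olds echoes a number from the mosaic chart, and the remaining values are made up so that the result lands near the shares quoted above.

```python
# Joint shares: percent of ALL smartphone users in each (age group, OS) cell.
# Hypothetical values for illustration; only the 6% Android / 18-24 cell echoes the post.
joint = {
    "18-24": {"Android": 6.0, "iPhone": 5.0, "BlackBerry": 4.0, "Other": 1.0},
    "65+": {"Android": 1.0, "iPhone": 2.0, "BlackBerry": 1.0, "Other": 1.0},
}


def share_within_age_group(joint_shares):
    """Re-express each cell as a percentage of its own age group's total."""
    within = {}
    for age, cells in joint_shares.items():
        group_total = sum(cells.values())
        within[age] = {os_name: 100 * v / group_total for os_name, v in cells.items()}
    return within


within = share_within_age_group(joint)
for age, cells in within.items():
    print(age, {os_name: f"{pct:.1f}%" for os_name, pct in cells.items()})
```

In this made-up table, the Android cells of 6% and 1% of the total market become 37.5% and 20% of their respective age groups, which is the comparison the business question calls for.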

Android has an overall market share of about 30%, and that average obscures a youth bias that is linear with age.

On the other hand, the iPhone (green line) also has an average market share of about 30% but its profile is pretty flat in all age groups except 65 and up where it has considerable strength.

Further, the gap between Android and iPhone at the older age group actually opens up at 55 years and up. In the 55-64 age group, the iPhone holds a market share that is similar to its overall average while the Android performs quite a bit worse than its average. We note that Palm OS has some strength in the older age groups as well while the Blackberry also significantly underperforms in 65 and over.

Why aren't all these insights visible in the mosaic chart? It's all because the chosen denominator of the entire market (as opposed to each age group) makes a lot of segments very small, and then the differences between small segments become invisible when placed beside much larger segments.

Now, the reconstituted chart gives no information about the relative sizes of the age groups. The market size for the older groups is quite a bit smaller than the younger groups. This information should be provided in a separate chart, or as a little histogram tucked under the age-group axis.

 

 


Light entertainment: enjoy the shower!

Reader Chris P. sends us to the shower:

Showerchart

I'm not sure who created this chart. But great work! Love it.

***

When I first came to the States, it puzzled me that two dimensions (flow, temperature) could be condensed onto one control. I still don't understand why.


Have data graphics progressed in the last century?

Received a wonderful link via reader Lonnie P. to this website that presents a historical reconstruction of W.E.B. DuBois's exhibit of the "American negro" at the 1900 Paris Expo. Amusingly, DuBois presented a large series of data graphics to educate the world on the state (plight) of blacks in America over a century ago.

You can really spend a whole afternoon examining these charts (and more); too bad the charts have poor resolution and it is often hard to make out the details.

***

Judging from this evidence, we must face up to the fact that data graphics have made little progress during these eleven decades. Ideas, good or bad, get reinvented. Disappointingly, we haven't learned from the worst ones.

Exhibit A

Dubois_a

(See discussion here.)

Exhibit B

Dubois_b

(See discussion here.)

Exhibit C

Dubois_c

(See discussion here.)

Exhibit D

Dubois_dd

(See the Vampire chart here.)

Exhibit E

Dubois_e

(See discussion here.)

Exhibit F

Dubois_f

(See discussion here.)