When simple is too simple

Thanks to reader Don M, I came across this fascinating chart published in the New York Times Review recently (link). The main article, about gender segregation in job categories, is found here.

This is one of those charts that require a reader's guide.

The chart shows the proportion of women in each job category in 1980 and in 2010 (and nothing in between). The jobs are divided into three large chunks: the top chunk (shaded) consists of jobs in which women account for more than 70 percent of the total; the middle chunk (white background) contains jobs with 30 to 70 percent women; the bottom chunk (also shaded) contains jobs with more than 70 percent men.

The designer then uses red, green and gray (apologies to the color-blind folks) to group the jobs into three clusters. This is usually a great idea, except that it is poorly executed here. Don is very annoyed because these colors lead readers to the wrong conclusion, and I agree.


The color scheme is unnecessarily convoluted. Here is an alternative I prefer:

  • if the change is 5 percentage points or less, color the line gray no matter where it is. (It is insane to color the line for housekeepers red for going from 87 to 89 percent in 30 years.)
  • if the change is more than 5 points in the female direction, color it red to indicate the occupation is becoming more female. (There would be many red lines, such as for managers in education, HR staff, social workers, architects, etc.)
  • if the change is more than 5 points in the male direction, color it blue to indicate the occupation is becoming more male. (There would be only one blue line, and that is for welfare service aides.)
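The proposed rule is simple enough to sketch in code. The function name and the threshold parameter below are hypothetical; the logic follows the three bullets above:

```python
# A sketch of the proposed coloring rule. The function name and threshold
# parameter are hypothetical; the logic follows the three bullets above.
def line_color(pct_female_1980, pct_female_2010, threshold=5):
    """Classify an occupation's trend line by its 30-year shift."""
    change = pct_female_2010 - pct_female_1980
    if abs(change) <= threshold:
        return "gray"                       # essentially no change
    return "red" if change > 0 else "blue"  # more female / more male

# Housekeepers moved only from 87 to 89 percent, so the line stays gray.
print(line_color(87, 89))  # gray
```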

This would mean the lines for dentists and architects would be labelled as progress. So too would most of the jobs that were predominantly male in 1980. In fact, there really isn't any occupation that went backwards--all those red lines in the bottom shaded chunk indicate shifts of only 1 to 4 percentage points over 30 years!

This conclusion undercuts the premise of the column, in which the author claims that the conventional wisdom is wrong.


The other precaution in reading this chart is to realize that every occupation is put on an equal footing even though some job categories employ many more people than others. Also confounded with these data is the differential growth/decline of job categories over the 30-year period. Further, the changing proportion of women in the labor force must be accounted for.

This is a case in which less is less. The structure of the problem is complex, and it requires a more sophisticated approach.

Hedge-fund bubbles are not nice

Reader Sushil B. offers this chart from Business Week on hedge fund returns. (link)


Unmoored bubbles, slanted text, positive and negative returns undifferentiated, bubble within bubble, paired data scattered apart, and it's not even that attractive.


Here is a Bumps-chart style version of this data:


The author never explained how the five funds were chosen, so it's hard to know the point of the chart. It appears that Harbinger Capital Partners had a similar experience to Paulson. In addition, given the potentially huge gyrations from year to year, it's very odd that we are not shown the annual returns between 2007 and 2011... we can't rule out that some of the three other funds suffered a particularly bad year between the end points shown here.


Is that my third leg?

We look at another idea from the visualization project "Gaps in the US Healthcare System" (link). This was a tip from reader Jordan G. (link). One of the bright spots of this project is the conscious attempt to try something different, although the end result is not always successful.

A tree-like branching chart was used to represent cancer death rates, broken down by racial group, gender and type of cancer, in that order.


The tree structure loses its logic after the race and gender splits. Why link different types of cancers (the gray squares) together in a sequence? Stranger still is the existence of a third branch coming out of every race node (the four closest to the center). One branch is male, the other is female; what's the third leg? It appears to be prostate cancer, which is male only--why doesn't it go with the male branch?

It's not easy to find the connection between what's depicted here and the idea of "gaps" in the US healthcare system. I think the question is ill-posed to begin with. The rate of death reflects both the possibly differential quality of healthcare between groups and the differential incidence of cancers between groups, so no visualization trick can produce reliable answers to the question being posed.

The chart fails the first corner of the Trifecta checkup. The chart type also does not fit the data.


The following chart plots the same data in a Bumps style.


I separated the male and female data since certain cancers are limited to one gender, and the gender difference is not likely to be the primary interest. That difference, incidentally, is clearly observed: male death rates are generally about twice as high as female rates for the same type of cancer, except for colorectal.

In terms of the "race gap", we find that black death rates are generally quite a bit higher than white death rates, especially for prostate cancer but except for lung cancer in females.

Asians and American Indians have practically the same death rates, but in both cases the sample sizes are small.

The raw data can be found at the CDC website here.


Showing off the world in charts

Stefan S., who works for the UN data project and is a regular contributor to this blog, points us to a new report they have issued that contains a host of charts. The report is an update on what has happened to our Earth since 1992 (the Earth Summit). Link to the PDF file here.


This life expectancy chart (shown on left) uses a Bumps-type chart, and is very nicely done, clean and informative. 


This age distribution chart shown on the right is unusual. It's a case of the data defeating the chart type. The magnitude of the 5-year changes is just not large enough, as a percentage of the total, to register. On a different data set, I can see this chart type being more effective.


Now, this criss-cross chart (bottom left) reminds me of Friedman's foolish attempt from some time ago. It has various issues: dual axes, excessive labels, and an inattentive title (which fails to indicate that the base population consists only of developing countries).


Instead, I attempted an area chart, using population size as the primary metric. Perhaps a more direct way to illustrate this point is to plot the growth rate of the slum population versus that of the total population.


This map is excellent, showing the spatial distribution of countries with above-average and below-average GDP per capita. It would be even better if smaller geographic units could be used so that the distribution within each country could also be seen.



I'd like to salute all the people around the world who work at statistical agencies, collecting and making sense of all of this data, without which none of these charts would have been possible.

Rebirth of the twin towers

Perhaps it's this week's anniversary of the WTC disaster. Perhaps it's the New York-centric viewpoint of Citibank. One wonders what inspired Citibank analysts to make this absurdity.


(Via Business Insider.)

First, we must fix the vertical scale. Column charts must start at zero, without exception. The effect of not starting at zero is to chop an equal length off the bottom of each column, distorting the relative lengths/areas of the columns. The distortion can be very severe. For example, look at the fourth set of columns as shown below:


In both charts, I made the length of the first column the same so that the two charts are comparable. The data plotted are exactly the same; the only difference is that the left chart starts the axis at zero. Notice that the huge difference seen in the right chart for the 4th pair of columns is not so extraordinary when the proper scale is used.
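A quick calculation shows how severe the distortion can be. The values below are hypothetical, chosen only to illustrate the mechanism:

```python
# How truncating the vertical axis inflates the apparent ratio between
# two columns. The values here are hypothetical.
def apparent_ratio(a, b, baseline=0):
    """Ratio of drawn column lengths when the axis starts at `baseline`."""
    return (b - baseline) / (a - baseline)

a, b = 100, 110  # true values: b is only 10% larger than a
print(apparent_ratio(a, b, baseline=0))   # 1.1 (honest)
print(apparent_ratio(a, b, baseline=95))  # 3.0 (b looks three times a)
```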

A multitude of other problems exists, not least that the chart is highly redundant. The same data (10 numbers) show up three times: as data labels, as column lengths (distorted), and as levels on the vertical scale.


An alternative way to look at this data is the Bumps chart. Like this:


What this chart brings out is the variability of the estimated vehicle densities. In theory, the density estimate should be quite accurate for the "today" numbers. You'd think that in surveying 2,000+ people about how many vehicles they currently own, most people should be able to provide accurate counts.

The data paint a different picture. From quarter to quarter, the estimated "today" density ranges from 1.90 to 2.00 vehicles per household in the 5 periods analyzed, a spread of roughly 5 percent, which, according to the analyst, equates to 5 million vehicles! Given current vehicle sales of about 13 million per year, 5 million is almost 40% of the market.
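The back-of-envelope arithmetic checks out (all figures taken from the paragraph above):

```python
# Back-of-envelope check of the figures quoted above.
low, high = 1.90, 2.00           # estimated vehicles per household, 5 quarters
swing = (high - low) / low       # ~5% spread between quarterly estimates
implied_vehicles = 5             # millions of vehicles, per the analyst
annual_sales = 13                # millions of new vehicles sold per year
market_share = implied_vehicles / annual_sales  # ~38%, i.e. "almost 40%"
print(round(swing * 100, 1), round(market_share * 100))  # 5.3 38
```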

So, one wonders how this survey was done, and how large the margin of error of this estimate is. I also want to know whether the survey produces estimates of the number of households as well, since the vehicles-per-household metric has two variable components.

Nielsen's cross-platform crossing diagram crosses up readers

My friend Augustine F., who's a data-savvy guy, couldn't figure out what's going on with this chart in Nielsen's cross-platform report.


It's a case of a Bumps chart done poorly.

The reader must first read the beginning pages of the report to get one's bearings. The two charts are supposed to investigate the correlation between streaming video and regular TV. What causes the confusion is that the populations being analyzed differ between the two charts.

In the left chart, they exclude anyone who does not watch streaming video (35% of the sample), and then divide those who do into five equal-sized segments based on how much they watch. Then, they look at how much regular TV each segment watches on average.

In the right chart, they exclude anyone who does not watch regular TV (just 0.5% of the sample), and then divide those who do into five equal-sized segments based on how much they watch. Then, they look at how much online streaming video each segment watches on average.


What crosses us up is the relative scales. The scale for regular TV viewing is tightly clustered between 212 and 247 daily minutes in the left chart but spans a wide range, from 24 to 522, in the right chart. The impression given by the designer is that the same population (18-34 year olds) is divided into five groups (quintiles) in each chart, albeit using different criteria. It just doesn't make sense that the group averages do not match.

The reason for this mismatch is the hugely divergent rates of exclusion as described above. What the chart seems to be saying is that the 65% who use streaming video have very similar TV viewing behavior (about 220 daily minutes). In other words, we surmise that most of those people on the left chart map to groups 2 and 3 on the right chart.

Who are the people in groups 1, 4 and 5 on the right chart? It appears that they are the 35% who don't watch streaming video. Thus, the real insight of this chart is that there are two types of people who don't watch streaming video: those who watch very little regular TV at all, and those who watch twice the average amount of regular TV.


Here's another puzzle: Nielsen claims that high streaming = low TV and low streaming = high TV. Is it really true that high streaming = low TV? Take the segment of highest streaming (#1 on the left chart). This group, which is 13% of the survey population, accounts for 83% of the streaming minutes -- almost 71,000 out of 86,000 minutes. Now look at the right chart. It turns out that the streaming minutes are quite evenly distributed among those TV-based quintiles, ranging from 15,000 minutes to 23,000 minutes each.

So, it is impossible to fit all of the top streaming quintile into any one TV quintile - they have too many streaming minutes. In fact, the top streaming quintile must be quite spread out among the TV quintiles since each of the TV quintiles is 1.5 times the size of a streaming quintile!
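The capacity argument can be verified directly with the numbers quoted above:

```python
# Why the top streaming quintile cannot fit inside any single TV quintile,
# using the figures quoted in the text.
top_stream_minutes = 71_000       # streaming minutes of top streaming quintile
max_tv_quintile_minutes = 23_000  # largest streaming total in any TV quintile
print(top_stream_minutes > max_tv_quintile_minutes)  # True: must be spread out

# Relative quintile sizes: TV quintiles come from 99.5% of the sample,
# streaming quintiles from only 65% of it.
tv_quintile = 99.5 / 5      # 19.9% of the sample each
stream_quintile = 65 / 5    # 13.0% of the sample each
print(round(tv_quintile / stream_quintile, 2))  # ~1.5
```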

So, we must conclude that customers who stream a lot include both fervent TV fans as well as those who watch little TV.


In a return-on-effort analysis, this is a high-effort, low-reward chart.


A skewed view of ten Indian states

The Economist published this chart to illustrate the problem of the "missing girls" in Indian society.

The girls-to-boys ratio (ages 0-6) should be about 952 girls per 1,000 boys, but in India it is 914. That's an average across 35 territories, and the most skewed ratio was 830, in Punjab.

Curiously, the Economist chose to show only 11 states instead of all 35. The first 10 of these had sex ratios below the natural number of 952, while the last one was above the average. Nowhere on the chart or in the article is it explained whether the 24 unmentioned states all had above-average sex ratios: unlikely, unless certain states have much larger youth populations than others.

In fact, the reference line of 952 is misplaced. Readers will find that there are two metrics, depending on which survey one is looking at: the sex ratio at birth, or the sex ratio for children aged 0-6. The natural ratio of 952 applies to the 0-6 measure, but the data by territory are all for the at-birth measure. Instead, the dotted red line should be at 904, which is the national average sex ratio at birth for India in the 2006-8 period.


The lethal error in this chart is not starting the horizontal axis at zero. By cutting off the same amount from each bar, the chart messes up the ratio of lengths, and presents a misleading picture of the relative sex ratios between territories. We may think Punjab's sex ratio is half that of Gujarat (in the original chart) but, as the chart on the right shows, that is far from the truth!


The other unfortunate practice, typical of the Economist, is to stick a second set of data on the right of the chart as an afterthought. In fact, that data representing the change in the sex ratio over time is more interesting than what the exact sex ratio was in each territory in 2006-8.

A much better way to present the data, without favoring one series over the other, is the Bumps chart, as shown below. We can clearly see that the improvement in sex ratio is concentrated in those states that started the decade in worse shape.



An achievable target. And how?

The Wall Street Journal tells us that GM car buyers may react to the "volatility" of gas prices by demanding higher miles-per-gallon from their vehicles. They commissioned an analysis which finds that new GM cars sold today on average have an MPG of only about 21, and suggested that 30 would be a "challenging but achievable target". (Article here).

There are many problems with the analysis: no specification of when such a target should be met or whose target it is, no comment on the potential impact on car sales (since higher-priced vehicles tend to have lower MPG) or on the existence of government subsidies for larger (lower-MPG) vehicles, and complete silence on reducing pollution or lowering our dependence on gasoline.

In any case, let's focus on the chart that comes with the article. First, take a look at the caption:


Then, the chart itself:


The reader is presented with this puzzle: how could a minor shuffling of the mix of cars raise the average MPG to 30 when today the least gas-guzzling vehicle class (subcompact) averages only 30.6 MPG, barely above the target?

Oops, the chart portrays (less than) half of the solution. Tucked into the caption, the analyst tells us that she has assumed an across-the-board increase of 25% in MPG for every class of vehicle. Think this information is important? Perhaps so. A 25% improvement is about 5 MPG, bringing the average MPG (at the current mix of sales) to 26.5, so the shift in mix of vehicles accounts for about 3.5 MPG of the targeted improvement.
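The decomposition can be checked with quick arithmetic. Starting from the round figure of 21 MPG, the two components come out to 26.25 and 3.75; the article's slightly different figures of 26.5 and 3.5 presumably derive from an unrounded base:

```python
# Splitting the 21 -> 30 MPG target into its two assumed components.
current_avg = 21    # today's average MPG (rounded, per the article)
target = 30         # the "challenging but achievable" target

after_efficiency = current_avg * 1.25       # 25% across-the-board gain
from_mix_shift = target - after_efficiency  # remainder from buyers switching
print(after_efficiency, from_mix_shift)  # 26.25 3.75
```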

While the chart designer very sensibly ordered the vehicle classes from highest to lowest MPG, it is baffling why the row of MPG data is not labelled directly but given a dark background so as to justify adding a third item to the legend.

The use of stacked columns to represent data at two points in time is confusing. This type of data is much better presented in a Bumps-style chart (left chart below):


The chart on the right shows an across-the-board increase in MPG and gives a sense of how the different vehicle classes stack up along this dimension. (I should've put a marker on the current average of about 21 and the targeted average of 30 but didn't.)

There is a data error in the current sales data as the proportions add up to about 115% rather than 100%. (The last three categories alone add up to about 50%.)


This analysis has the flavor of the Facebook valuation projection I discussed on the sister blog a while ago. Both require several assumptions to come true simultaneously in order to be realized. Not only must the MPG of every vehicle class grow by 25%, but a large proportion of new-car buyers must also choose to purchase higher-MPG vehicles. From the chart above, one sees that the proportion buying subcompacts must increase 7-fold, from about 2% to 14%, while the proportion buying large vans must drop from 15% to 3%, a cut of four-fifths!

According to the analyst, this is an "achievable" target.  


Be guided by the questions

Information graphics is one of many terms used to describe charts showing data -- and a very ambitious one at that. It promises the delivery of "information". Too often, readers are disappointed, sometimes because the "information" cannot be found on the chart, and sometimes because the "information" is resolutely hidden behind thickets.

Statistical techniques are useful to expose the hidden information. They work by getting rid of the extraneous or misleading bits of data, and by accentuating the most informative parts. A statistical graphic distinguishes itself by not showing all the raw data.

Here is the Guardian's take on the OECD PISA scores that were released recently. (Perhaps some of you are playing around with this data, which I featured in the Open Call... alas, no takers so far.) I only excerpted the top part of the chart.

This graphic is not bad -- it could have been much worse, and I'm sure there are much worse out there.

But think about this for a moment: what question did the designer hope to address with this chart? The headline says comparing UK against other OECD countries, which is a simple objective that does not justify such a complex chart.

The most noticeable feature is the set of line segments showing the correlation of ranks among the three subject areas within each country. So, South Korea is ranked first in reading and math, and third in science. Equally prominent is the ranking of countries on the left-hand side of the chart (which, on inspection, is the ranking of reading scores); this ranking also determines the colors used, another eye-catching part of the chart. (The thick black UK line is, of course, important too.)

In my opinion, those are not the three or four most interesting questions about this data set. In such a rich data set, there could be dozens of interesting questions. I'm not arguing that we have to agree on which ones are the most prominent. I'm saying the designer should be clear in his or her own mind what questions are being answered -- prior to digging around the data.

With that in mind, I decided that a popular question concerns the comparison of scores between any pair of countries. From there, I worked on how to simplify the data to bring out the "information". Specifically, I used a little statistics to classify countries into 7 groups; countries within each group are judged to have performed equally well in the test and any difference could be considered statistical noise. (I will discuss how I put countries into these groups in a future post, just focusing on the chart here.)

Here is the result: (PS. Just realized the axis should be labelled "PISA Reading Score Differentials from the Reference Country Group" as they show pairwise differences, not scores.)


Each row uses one of the country groups as the reference level. For example, the first row shows that Finland and South Korea, the two best performing countries, did significantly better than all other country groups, except those in A2. The relative distance of each set of countries from the reference level is meaningful, and gives information about how much worse they did. 

(The standard error seems to be about 3-6, based on some table I found on the web, which may or may not be correct. This value leads to very high standardized score differentials, indicating that the spread between countries is very wide.

I have done this for the reading test only. The test scores were standardized, which is not necessary if we are only concerned about the reading test. But since I was also looking at correlations between the three subjects, I chose to standardize the scores, which is another way of saying putting them on an identical scale.)

Before settling on the above chart, I produced this version:


This post is getting too long so I'll be brief on this next point. You may wonder whether having all 7 rows is redundant. The reason they are all there is that statistical significance lacks "transitivity": e.g., Finland may not differ significantly from Sweden, and Sweden may not differ significantly from the UK, yet Finland can still differ significantly from the UK. The right way to read the chart is to anchor on the reference country group, and only look at the differences between the reference group and each of the other groups. The differences between two country groups, neither of which is the reference, should be ignored in this chart: instead look up the rows in which those countries are the reference group.
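A toy numeric example (the scores and noise band below are made up) shows why significance does not chain:

```python
# Significance is not transitive: A vs B and B vs C can each fall within
# the noise band while A vs C does not. Scores and band are made up.
noise_band = 10
scores = {"A": 520, "B": 513, "C": 506}

def significantly_different(x, y):
    return abs(scores[x] - scores[y]) >= noise_band

print(significantly_different("A", "B"))  # False (diff = 7)
print(significantly_different("B", "C"))  # False (diff = 7)
print(significantly_different("A", "C"))  # True  (diff = 14)
```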

Before that, I tried a more typical network graph. It looks "sophisticated" and is much more compact but it contains less information than the previous chart, and gets murkier as the number of entities increases. Readers have to work hard to dig out the interesting bits.