
A second lease on life

I hinted at it in the last post, and some readers made similar suggestions: what happens if we plot the U.S. life expectancy data in relative terms (indexed) rather than in absolute terms?

The result is highly revealing, which is why we should always look at the data in several ways.  In the original chart, the differences among the race/gender segments were essentially obscured by the overall slowly-growing trend; in the new chart, we took out the trend, isolating the growth rates.
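For readers who want to try the same transformation, here is a minimal sketch of indexing a series to a base year.  The function name `index_to_base` and the numbers are made up for illustration; they are not the CDC figures behind the chart.

```python
def index_to_base(series, base_year):
    """Rescale a {year: value} series so that the base year equals 100."""
    base = series[base_year]
    return {year: 100.0 * value / base for year, value in series.items()}


# Made-up illustrative values, NOT the CDC life expectancy data.
group_a = {1970: 60.0, 1980: 63.0, 1990: 66.0}
group_b = {1970: 75.0, 1980: 76.5, 1990: 78.0}

indexed_a = index_to_base(group_a, 1970)
indexed_b = index_to_base(group_b, 1970)

# In absolute terms group_b is always higher; in indexed terms
# group_a shows the faster growth since the base year.
for year in sorted(indexed_a):
    print(year, round(indexed_a[year], 1), round(indexed_b[year], 1))
```

Every line starts at 100 in the base year, so the level differences drop out and only the relative growth remains.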


Redo_lifeexpect

The reconstructed chart showed that:

  • Between 1970 and roughly 1990, blacks of both genders gained in life expectancy at a rate higher than the national average, while white females lagged behind white males.
  • However, almost all of the gain by blacks was attained between 1970 and 1984, and in the 10-15 years following, this excess gain was wiped out so that by 1992 or so, the black male, black female and white male lines again converged.
  • Starting in 1995, black males again achieved significant improvement in life expectancy.  This time, black females did not follow their male counterparts.  Meanwhile, white females continued to lag behind.

Not being a health care specialist, I can't say what happened to the cohorts of the 1970s, the 1980s and 1995.  One thing is for sure: these insights are hard to glean from the original.

Ap_lifeexpect 


Reference: "CDC says life expectancy in the US is up, deaths not", Miami Herald, Aug 19 2009.  CDC Life expectancy data.


Extra reading


The New York Fed has put up some impressive infographics, depicting the credit conditions around the country.

Nyfed_credit
(I was going to praise the producers for this effort, a rarity among government sites, but when I went back to the site today, I found that the graphic loads only occasionally, and try as I might, I could not get rid of the big brown patch that obscures the West Coast.  Hope you have better luck!)

As with similar maps, the unresolved problem is that the pattern is strongly dominated by the relative density of people around the country.  As we click through different maps, they all look similar.

But this map -- as do most infographics -- does well to put structure onto unwieldy, large data sets.  It acts as a table of contents to the data set.



Nick Rapp's team at AP produced this great chart on US life expectancy:

Ap_lifeexpect

There is much to like: very sensible axis labels, line labels placed inside the chart (rather than in a legend box far away), only the current data shown rather than every data point, subtle coloring to bring out the decades, etc.

I'm not sure about varying the thickness and tinge of the lines.  Is it necessary?  It hinders a bit as we try to compare the slopes of the lines.  Adding the initial data labels (for 1975) would also help us judge the net change over the period depicted.  If the change in life expectancy over time is the main story, consider indexing each line to its respective 1975 value.


Reference: "CDC says life expectancy in the US is up, deaths not", Miami Herald, Aug 19 2009. 

PS. This is really an awful title for the article.  It reads as if something bad has happened to the death rate while something good happened to life expectancy, as if there were a paradox.  But the article makes clear that the death rate declined, which makes the point redundant.

A world full of bubbles

As promised, we stick to bubbles.  Like the street artist blowing soap bubbles at passers-by, this map -- published in the Guardian (UK) -- is a gift of bubbles.

Guk_carbon_world


And our reader Frederic M. is not amused.  "A tremendous failure", he said.


In terms of conveying the data, a simple bar chart would do a better job in exposing the biggest polluters, as well as the relative magnitude between the biggies and the small fish.

The chart reveals more problems if one clicks on, say, Europe, and sees the following:

Guk_carbon_eur


For starters, compare the bubbles labeled 858, 468, 586, 418 with those labeled 23, 20, 18.  And look at the little ones in the periphery labeled 133, 174, 128.  Baffling, isn't it?

What they did was to print the ranks for every country, except the top four in Europe for which the ranks are placed next to the country name (in small font), and the actual amounts are placed in the middle of the bubbles.  The ranks, of course, are pretty useless, and they obliterate the scale of the differences between countries.

Besides, the bigger the polluter, the smaller the rank but the larger the bubble.  This built-in disconnect can also be disorienting.

Every bubble chart typically contains lots of data labels, and the reason is that the bubble form lacks self-sufficiency.  Without the data labels, the reader has trouble comparing the areas.
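A little arithmetic shows why.  The sketch below assumes the bubbles are scaled so that area is proportional to the value (I have not verified how the Guardian sized them), and uses two of the amounts printed on the European map.

```python
import math


def diameter_ratio(value_a, value_b):
    """Ratio of bubble diameters when bubble AREA is proportional to value."""
    return math.sqrt(value_a / value_b)


# Two of the amounts printed inside the European bubbles.
value_ratio = 858 / 418
ratio = diameter_ratio(858, 418)

print(f"values differ by {value_ratio:.2f}x")
print(f"diameters differ by {ratio:.2f}x")
```

Under that assumption, a value roughly twice as large yields a bubble only about 1.4 times as wide, so the eye tends to underestimate the gap unless the number is printed.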


Reference: "The Carbon Atlas", Guardian, Dec 9 2008.




Pie Cubed

Cascading_Pie_charts

Omegatron also did away with a set of cascading pie charts on Wikipedia, a particularly ineffective use of this chart type.  Whenever there are more than two or three categories, the necessary use of many colors can really make one's head spin.

Here, the cascade is being used like a log scale, to artificially elevate the small pieces, which unfortunately are also the least significant pieces of the energy pie.  There is no reason for nuclear, bio-mass, hydro and "others" to add to 100% except that the author decides to group them together.  The 41% nuclear or the 41% solar heating in the second and third charts, respectively, have no meaning in the larger context.











In deference to the original author, Omegatron's new version preserves the arbitrary three-level cascade.  He converts to stacked bar charts, which brings out the differences better.

734px-World_energy_usage_width_chart.svg
He also sensibly exposes the original proportions rather than the arbitrary relative proportions.  For example, nuclear energy accounts for 6% of the total, not 41% of the arbitrary "others" bucket which in turn contains 14% of the total.
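The conversion from a cascaded share back to a share of the total is a simple multiplication.  A quick sketch, using the two percentages quoted above (the function name is my own):

```python
def share_of_total(share_within_bucket, bucket_share_of_total):
    """Convert a share within a sub-pie into a share of the whole pie."""
    return share_within_bucket * bucket_share_of_total


# Nuclear: 41% of the "others" bucket, which is 14% of the total.
nuclear = share_of_total(0.41, 0.14)
print(f"{nuclear:.1%}")
```

This comes to a little under 6% of the total, which is the figure the stacked bars expose directly.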

I'd prefer an even cleaner presentation with unstacked bar charts.  This can be done in either one chart with all eleven categories, or in two charts, as shown below.  The two-chart version assumes that the reader has two key questions: alternative energy sources as a proportion of the total, and the mix of different sources within the alternative category.

Redo_energy

With the ordinary bar chart, far fewer colors are needed, and there is no need to print out each data point, nor to use guides to point to labels and data.  The trouble with the latter is its tendency to draw attention to the least important aspects of the data.

With this further example, I continue to find the Wikipedia rule to discourage text annotations on graphs bewildering.   Such a rule apparently does not apply to data labels, as can be seen here.  Of course, a graph without any labeling of categories is robbed of meaning but if labels can be saved, so should annotations!





Community outreach

Omegatron re-cycled some Wiki charts and we are happy to report that they are great improvements over the originals.  I welcome other readers to alert us when you have done your bit of community outreach, by ridding the world of chartjunk.  The email address is the name of the blog at gmail for any submissions.


Not surprisingly, the originals were pie charts.

Wiki_abort
The recycled art: (can be seen here and here)

Omegatron_abort

Why are pie charts such poor tools for communication?  Think about where we can place the message, and you'll find that this chart type is far too rigid.  In a pie chart, the key resource is the relative size of the sectors, followed by the number of sectors, and sometimes the size of the total pie.  Other than those, there is little else useful in a pie chart.

The histograms used by Omegatron are much more flexible.  There can be information encoded in the height of the bars, the width of the bars, the total area, the relative distribution of bar areas, the existence and location of peaks and troughs, etc. etc.

For comparing data collected in slightly different formats, the pie charts are hopeless.  Notice that the lowest category on the left (pink) corresponds to 8 weeks or less, which would include two and a half sectors on the right plus potentially a missing sector for 4 weeks or less.  The histograms below handle this easily.

Omegatron asked for some feedback.  I think the new ones are significantly better.  A few minor points:
  • Instead of coloring the background to the chart, I'd color the bars themselves into green/yellow/orange according to the trimester
  • I'd put the trimester labels under the horizontal axis, close to the "week" labels 
  • The charts obviously need to identify the country and year of the data (which I added).  Omegatron pointed me to an inexplicable Wiki convention of not putting text inside charts (see here).  I must disagree with this convention.  Annotations on charts are some of the most useful things.
  • If these two charts are to be placed side by side for comparison, then we need to sort out the vertical scale.  It cannot be the absolute number of abortions but some kind of relative scale in proportion to the population size, or some similar metric. 
  • In addition, if comparison is the point, I'd suggest an overlapping histogram with bars having no fill.  
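On the vertical scale, one simple option (my choice here, not the only one) is to plot each bar as a percentage of that chart's own total.  A minimal sketch with hypothetical counts, not the data behind these charts:

```python
def to_percent(counts):
    """Convert raw counts per bin into percentages of that chart's total."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Hypothetical counts for two places of very different size (not real data).
small_country = {"weeks 1-8": 600, "weeks 9-12": 300, "weeks 13+": 100}
large_country = {"weeks 1-8": 50000, "weeks 9-12": 40000, "weeks 13+": 10000}

small_pct = to_percent(small_country)
large_pct = to_percent(large_country)

for label in small_country:
    print(label, small_pct[label], large_pct[label])
```

On the raw scale the smaller country's bars would be invisible next to the larger one's; on the percentage scale the two distributions can be compared bin by bin.  Dividing by population instead would answer a different question (incidence rather than timing).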

Great work!  And I'd love to see more of it!

Bubble after bubble

I finally checked the Junk Charts mailbox again, and I found an uprising against bubble charts and pie charts.   It appears that despite their shortcomings amply demonstrated here and elsewhere, editors everywhere continue to believe that the public has a lovefest with these creatures.

I will start off the parade with this one from the Wall Street Journal, purportedly showing that the Bank of England has continued to inject cash into the economy, and at ever increasing rates.  The headline said Bank of England to expand bond-buy plan.

Wsj_bofe
This chart has a variety of problems, in addition to the use of overlapping bubbles.  As has been documented, it is almost impossible to gauge the relative sizes of circular areas, especially when they are overlapping.

If we remove all but one of the data labels, the chart is non-functional.  This is what we mean by not self-sufficient: the interpretation of this chart requires, indeed demands, that all the underlying data be printed on the same chart.  The only way readers can understand what is going on is by reading the data itself!

Wsj_bofe_inset2

The horizontal axis (indicating time) is also nonsensical.  The separation from month to month is variable.  Besides, and this is the key flaw of the chart, the projected number is a three-month cumulative total being treated like a monthly figure.

Wsj_bofe_2 

Since the Bank is projected to inject 50 billion extra pounds in the next three months, that would work out to be roughly 16 billion per month.  That would turn the story upside down: one would conclude that the Bank is gradually slowing the rate of injection.  The following bar chart points this out with little fuss:

Redo_wsj_bofe_2  

When bars are used, there is no need to print every single data point.  The relative lengths of the bars can be estimated easily.  The months are equally spaced.
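The adjustment behind the bar chart is just a division of the three-month total by the number of months.  A sketch, in billions of pounds (the function name is my own):

```python
def monthly_rate(total, months):
    """Average per-month amount implied by a multi-month total."""
    return total / months


# Projected extra injection: 50 billion pounds over the next three months.
projected = monthly_rate(50, 3)
print(f"about {projected:.1f} billion pounds per month")
```

Only after this conversion is the projected figure comparable with the monthly amounts that precede it.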

One final point: the exchange rate cited is not very helpful.  What would have been more useful for readers would be the scale of the cash injection with respect to each nation's GDP.


Reference: "Bank of England Expands Bond-Buy Plan", Wall Street Journal, Aug 7 2009.


PS. Per Andrew's comment, here is a line chart, where the growth/decline in the injection is encoded in the slope of the line segments:

Redo_bofe_ag




Degrees of likeness 2

We left off the other day with an interactive graphic with the ability to peer into subgroups.  This feature assumes implicitly that the overall average obscures differences within subgroups.  What statisticians do with this type of data is to compare the subgroups, and identify the factors that make someone different from the average.


For example, there is a clear distinction between the employed and the unemployed in how they spend the day (not surprising).

Nyt_timeuse_employement
This happens to be what NYT printed in the paper edition that day.  (Note, though, that the graphic loses quite a bit without the interactivity.)

On the other hand, there appears to be little differentiation between men and women.

Nyt_timeuse_gender
Nor is there much difference between blacks and whites.

Nyt_timeuse_race
One factor that matters is age.  Older people are not exactly like the young.  A lot of these factors (for example, age and employment status) are correlated, by the way.

Nyt_timeuse_age

I showed all these in order to talk about the statistical concept of "aggregation".  We noted that the distribution of time use of the employed is different from that of the unemployed.  Thus, we cannot use the "average" distribution to describe both groups, and so we show the data in disaggregated form.  Similarly for time use and age.  

But there is not much gain in disaggregating race and gender: the "average" is representative of the subgroups for these two factors.  This is one distinction I see between information graphics and statistical graphics: the former typically shows all possible subgroups while in the latter, the designer zooms in on the factors that matter.
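One way to put a number on "is the average representative?" is to measure how far each subgroup's distribution sits from the overall one.  The sketch below uses total variation distance, which is my choice of measure rather than anything the NYT or the survey uses, and the shares are made up for illustration.

```python
def total_variation(p, q):
    """Total variation distance between two distributions given as dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Made-up shares of the day by activity, NOT the survey figures.
overall = {"sleep": 0.36, "work": 0.15, "leisure": 0.49}
employed = {"sleep": 0.34, "work": 0.25, "leisure": 0.41}
men = {"sleep": 0.36, "work": 0.16, "leisure": 0.48}

gap_employed = total_variation(employed, overall)
gap_men = total_variation(men, overall)

print(f"employed vs overall: {gap_employed:.2f}")
print(f"men vs overall:      {gap_men:.2f}")
```

A factor whose subgroups all sit close to the overall distribution (like gender in this toy example) adds little when disaggregated; a factor with a large gap (like employment) is one worth showing.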





Degrees of likeness 1

The NYT team just put up a fantastic visualization of the American Time Use Survey data, which purports to measure how the average American spends the time of a day.  (Apparently, thousands of people recalled what they did every minute of an average day.)  The amount of data collected is massive, and this graphic allows readers to explore the data in intuitive ways.

Nyt_timeuse_all


The chart shows for each minute of the day (horizontal axis) the proportion of people doing specific activities.  Not surprisingly, we spend more time sleeping than any other type of activity.  The axis and data labeling as well as gridlines are very restrained.

Normally, I am not a big fan of these proportional area charts.  The only relevant dimension to look at is the vertical distance from one curve to the next, but the focus on areas puts equal weight on the horizontal and vertical distances.  The horizontal distance is meaningless, and thus the area is meaningless.

These designers found a solution to the problem, and good for them!  Because of the mouse-over effect, I could not save the actual appearance -- here, I show what it looks like.

Nyt_timeuse_mouseover

By mousing over different parts of the graph (say, moving vertically), we can compare the actual proportions.  Terrific!


The key interest of this graphic is the following legend.

Nyt_timeuse_factors

While the above graphic shows the use of time by all Americans in aggregate, this panel allows us to zoom in on specific groups of Americans.  How alike are Americans?


Reference: "How Different Groups Spend Their Day", New York Times, July 31 2009.