
Review: Gapminder 3

The next chapters of Gapminder take the scatter plot of income and child mortality further.

V. Income and health of countries

Concepts used: standard deviation, measures of location and of dispersion

Highlight: this chapter is an amazing illustration of why it is dangerous to look at only averages but not dispersion.  The screen shot on the left shows that Mauritius is nothing like the rest of Africa in terms of income or of child survival rate.

Alert readers will notice that Gapminder has switched the y-axis from child mortality to child survival, which is significantly easier to grasp even though the data is the same.  (Did they read Hadley's comment?)

Food for thought: 1) The labeling of the log scale for child survival rate may confuse some.  2) The population size dimension as rendered in bubbles interferes with our understanding of the correlation between GDP per capita and child survival while adding little if any value.

Presumably, population size is shown so that the reader can observe the correlation between population and GDP per capita, and that between population and child survival.  The reader can judge for themselves whether the bubble chart is effective in presenting such correlations (see charts below).

3) The log-log scale can easily mislead us in judging the magnitude of dispersion.  Even though the countries in the OECD (aquamarine bubble) look relatively less dispersed, in reality this may not be so, because small distances on the right side of the page must be translated conceptually to large distances (to reverse the log scale).
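To make point 3 concrete, here is a minimal sketch of how equal visual distances on a log axis can hide very different raw distances.  The income figures are hypothetical, picked only for illustration; they are not taken from the Human Development Report.

```python
import math

def log_distance(a, b):
    """Visual distance between a and b on a log10 axis, in decades."""
    return abs(math.log10(b) - math.log10(a))

# Hypothetical GDP-per-capita values, chosen only to illustrate the point.
low_pair = (1_000, 2_000)      # a pair on the left side of the axis
high_pair = (20_000, 40_000)   # a pair on the right side of the axis

low_visual = log_distance(*low_pair)
high_visual = log_distance(*high_pair)
low_raw = low_pair[1] - low_pair[0]
high_raw = high_pair[1] - high_pair[0]

# Both pairs are the same distance apart on the page...
print(f"visual distance: {low_visual:.3f} vs {high_visual:.3f} decades")
# ...but the raw gaps differ by a factor of twenty.
print(f"raw distance:    {low_raw} vs {high_raw}")
```

Each pair is one doubling apart, so the log axis draws them the same width, even though the right-hand gap is twenty times larger in dollars.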

VI. Same Income, Different Health

Concepts used: scatter plot

Highlight: This chapter is a tour de force in explaining how to read scatter plots.  Besides, it proves how animation can significantly improve instruction.  The screen shot on the left is but one example.

VII. Development directions
VIII.  Differences within countries

We discussed development paths last time.  Chapter 8 drills down further into distributions within countries; its only disappointment is the lack of data, especially for OECD countries (wanting to hide social inequality?)

This is an all-around fantastic effort to bring color to the voluminous data in the Human Development Report.  Many important statistical concepts are included and carefully explained (histograms, means and dispersion, different levels of analysis, scatter plots, etc.). In some cases, the choice of graphical construct exposes its limitation.  What's more, the producers apparently are open to feedback; I have detected some improvements already and a Chapter 9 has appeared after I completed my review.  Here are my reviews of earlier chapters.

Chapter 1-3
Chapter 4 (touching on 7)

When not to use bars

Hope everyone had a great Thanksgiving.  This weekend, I came across two examples of poorly-executed bar charts, both from the Economist.  (More on bar charts here, here, here, here.)

In both cases, an additional symbol (a range, a dot) was superimposed on the bar chart, which is an act both obfuscating and ugly.  It is made painfully clear that each bar contains only one piece of data, completely indicated by its top edge; in other words, one can replace any bar with just its top edge, which is what I have done in each case.



In the first example, the baseline estimates of people living with HIV show up more clearly.  (I'm not sure why upper and lower estimates are included for years past as they should have official counts.)

In the second example, the focus on the gap between official and actual retirement ages is restored and emphasized.

It would not be proper to sign off without revisiting the start-at-zero rule (start here or here).  In both the above charts, I have chosen not to start at zero.  I assume that the point of these charts is to illustrate recent changes in the depicted variables (Andrew will want to see longer time series, I'm sure.)  If I start these charts at zero, I run into difficulty deciding the separation of the tick labels: in order to capture the differences which are squeezed into a small range (due to the narrow date range), I'd have to use a lot of ticks, most of which are useless outside the range of the data!

Reference: "Spin Doctors" and "Must Try Harder", Economist, Nov 26, 2005.

Managing the gap

Sophisticated ideas are difficult to get across in a chart.  For instance, the NYT recently described the gender gap in the workplace by comparing the proportion of men versus women in managerial positions relative to the overall proportion.  Two simultaneous comparisons are taking place, one between men and women, and the other between managerial positions and overall employment.

The published chart (below left) used eight pie charts.  To my eyes, this graphic is confusing, not least because the two sides of the primary comparison, managers and overall employment, are set far apart.  The junkchart version (right) tries to fix this by graphically showing the gender gap using a horizontal line segment.  Also, the 50% gray dotted line allows the reader to see quickly that in the three industries where men make up a minority of overall employment, they take up the majority of managerial positions.


Reference: "Stuck at the Edges of the Ad Game", New York Times, Nov 22, 2005.

Review: Gapminder 2

I continue to review Gapminder's animations of the Human Development Report 2005.

IV. Regional Differences in Health & Income

Concepts used: bubble chart, scatter plot, three dimensions on one plot, log-log scale, bivariate correlation, linear and non-linear correlation


The scatter plot construct allows easy understanding of correlation between two variables, the average annual income and child mortality.  Moreover, the "sliding timebar" shows this correlation was quadratic/non-linear in 1975 but linear in 2000 (but note log-log scale).

Food for thought:

a) Readers of this blog won't want to hear my rant on bubble charts again (here, here and here).  The most important message of this chart concerns regional differences in per capita income and yet the visual message is dominated by the overlapping bubbles (i.e. population size).


b) The size of bubbles also distracts when used in a scatter plot (see right), where the reader must identify the center of a bubble to figure out its x- and y- dimensions.  This becomes difficult when bubbles of differing sizes overlap and obscure one another.

c) The log-log scale requires careful interpretation.  For example, the statement cited on the right is erroneous because the child mortality rate in Africa actually dropped by about 18% in relative terms (from 22% to 18%), rather than remaining "almost the same".  The confusion arises because Africa appears at the larger end of the log scale (for child mortality), where even small visual distances represent large separations in the raw data.

d) A clearer way to investigate changes over time is shown on the left.  Each line traces the development path from 1975 through 1990 to 2003.  Africa and Eastern Europe experienced negative per capita GDP growth.  Meanwhile, Latin America and the Arab States had stagnant economies but rapidly declining child mortality.  Asia, on the other hand, experienced fast growth but modest gain in child mortality.  Finally, the distance between the "high-income" OECD countries and the rest of the world is vast, especially when we note the use of a log scale.

Again, Gapminder anticipated this line of thinking and addressed this in Chapter 7.  However, the insistent use of bubbles makes their development path chart more muddled than our version here.
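The arithmetic behind point (c) can be checked in a few lines.  This is only a sketch using the two figures quoted above (22% and 18%); the base-10 log is my assumption about how the axis is drawn.

```python
import math

# Figures quoted in the post: child mortality in Africa, 22% then 18%.
before, after = 22.0, 18.0

absolute_drop = before - after                     # in percentage points
relative_drop = (before - after) / before          # share of the starting rate
log_gap = math.log10(before) - math.log10(after)   # distance on a log10 axis

print(f"absolute drop:  {absolute_drop:.0f} percentage points")
print(f"relative drop:  {relative_drop:.1%}")
print(f"log10 distance: {log_gap:.3f} decades")
```

A drop of roughly 18% in relative terms covers less than a tenth of a decade on the axis, which is why it can look like "almost the same" on the chart.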

Overall, Chapter 4 is another valuable contribution to our knowledge of this data; the result would have been even better if they had omitted the population size dimension.

PS. Thanks to Hadley for pointing out the need to indicate the directions of the paths.  I have updated the chart now.

Review: Gapminder 1

Speaking of promoting statistical thinking among students, I am mightily impressed by the Gapminder site, which visualizes data from the United Nations' Human Development Report 2005.  In a couple of posts, I will review the nine chapters of animation from their site.

This first post concerns primarily the first three chapters which examine the distribution of income across the world.

Gapminder is a Swedish non-profit dedicated to utilizing visualization software to help enliven and disseminate social science data.  The quality of these animations is impeccable, taking full advantage of innovative web-based technologies to explain graphical constructs visually.  The site is invaluable for policy analysts needing to interpret the voluminous data, and for students of data visualization, although seasoned statisticians will find the pacing too sluggish.

I. World Income Distribution

Concepts used:

histogram, empirical probability density function, logarithmic scaling, purchasing-power parity (PPP), skewness, uncertainty of projections


Highlight: a "sliding timebar" (see right) allows visitors to explore the change in the shape of income distribution over time

Food for thought:

a) The log scale for $ per day was not explained.  In general, using the log scale involves a tradeoff: visual clarity is gained with better spacing between data but distances are distorted so that an inch on the left side of the distribution is not the same (in terms of $ per day) as an inch on the right side of the distribution.  In fact, the log transformation artificially changes the visual shape of the distribution.
b) The important concept of PPP, which holds the key to comparability, is used but not explained.  It is scary and sad to know that large portions of the world's population live on less than $10 per day (around US$3,600 per year), after already adjusting the number upwards to account for cheaper costs of living.
c) The tails of any income distribution get short shrift in this presentation even though the tails are crucial to understanding income inequality.

An immediate question comes to mind, which is where more and less developed nations fall in this distribution.  Part of the amazing experience with this site is that the designers anticipated our questions and address them in later chapters.

II. Regional Income Distribution

Concepts used: stacked area chart, distribution by segment

This was not my favorite chapter, and I note these issues:

a) The stacked area chart does not do the data justice!  The only distribution that is clearly visible is that of Africa (in pink) because it sits at the bottom.  All other distributions are layered on top of one another, totally distorting their shapes.  For example, the Latin America & East Europe distributions (orange, green) look like flat pancakes on this chart.
b) The above problem is multiplied during the "sliding timebar" animation. Changes in shapes over time as visualized are highly problematic.

III. The Changing Face of Poverty

Concepts used: stacked area chart, distribution by segment, analysis within segment

This chapter continues the graphical construct from the previous one but homes in on the segment living below the "poverty line".

The same problems apply.  It is easy enough to note the temporal changes of those living at the poverty line but much harder to visualize these changes for people earning less than $1 per day.

In conclusion, Chapter 1 is chock-full of important concepts, and clever visual explanation but the graphical construct chosen for Chapters 2 & 3, that is, the stacked area chart, leaves something to be desired.  I'm eager to look at the other six chapters and will let you know what I think.

Via: Mahalanobis

The sad tally 5: comparing quantiles

Today I return to analysis of the sad tally, or are suicide locations on the Golden Gate Bridge random, or how does one determine if a sequence of numbers is random?  The visual evidence, from cumulative distributions and box plots, tells us that the shape of distribution matters.  One way to directly compare two distributions is by comparing quantiles.

The following chart shows the (smoothed) cumulative distribution of some non-random data (Dataset 9) on the left, and randomly generated data on the right.  It is clear that the two lines are not the same shape; is there a systematic way to compare them?

The orange line identifies the point at which the number of suicides equals 40% of the total.  On the left, this means the number of suicides committed between locations 41 and 72 is 40% of the total.  On the right, the same number occurred between locations 41 and 70.  The pink line similarly compares the point at which the suicides equal 20% of the total.  Notice that at this point on the distribution, the locations are significantly different: 41-65 on the left versus 41-58 on the right.

Such comparisons can be made at different points on the distribution, 10%, 20%, 30%, etc.  The result is a qqplot (quantile-quantile plot) as shown below.  Each distribution is compared to an ideal "uniform" distribution (i.e. random) which is the straight line.  Not surprisingly, the data on the right, generated randomly, is much more likely to be random.  The left line is consistently above the straight line, which indicates systematic difference from random.
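For readers who want to try the quantile comparison themselves, here is a rough sketch of the idea.  The counts are hypothetical (they are NOT the Golden Gate Bridge data), the helper names are mine, and the rule of taking the first location where the running total reaches the target share is just one simple convention; the charts in this post were not necessarily built this way.

```python
from itertools import accumulate

def quantile_location(locations, counts, pct):
    """First location where the running count reaches pct percent of the total."""
    total = sum(counts)
    for loc, running in zip(locations, accumulate(counts)):
        # Integer arithmetic avoids floating-point surprises at the boundary.
        if running * 100 >= pct * total:
            return loc
    return locations[-1]

def qq_points(locations, counts, pcts):
    """Pair each uniform ("random") quantile with the observed quantile."""
    uniform = [1] * len(locations)   # same count at every location
    return [(quantile_location(locations, uniform, p),
             quantile_location(locations, counts, p)) for p in pcts]

# Hypothetical counts for ten locations, concentrated at the far end.
locations = list(range(1, 11))
skewed = [1, 1, 1, 1, 2, 2, 4, 8, 10, 10]
pcts = [20, 40, 60, 80]

points = qq_points(locations, skewed, pcts)
for p, (expected, observed) in zip(pcts, points):
    print(f"{p}%: uniform reaches it at location {expected}, data at {observed}")
```

Plotting the observed quantile against the uniform one gives the qqplot; with these made-up counts every point sits above the diagonal, the same kind of systematic departure described above.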


P.S.  I have neglected the tricky issue of how much difference from random is required to pronounce the visual evidence conclusive.  Usually, after inspecting graphs, we have to resort to mathematics by running statistical tests.  But statistical tests, with the omnipresent p-values, often give a false sense of security, particularly where the theory is incomplete, as is the case in tests of randomness.  Running statistical tests without visualizing the data is dangerous.

Light entertainment IV

Our friends at Northern Trust have created another classic:


My favorite feature here is the little stripes within each column.  Notice that the column with 5 stripes denotes 5 managers, not 5 hours worked... so in essence, both upwards and sideways, the data shift from 1 manager to 19 managers.  Even the kids look incredulous.  Classic!

Other gems from Northern Trust:

two points make a line
memories are double the fun

Science in the news

Recently I've read several great pieces on science, or its role in education, and I highly recommend them.

Here is Princeton Prof. Appiah on two gaping holes in our humanities curriculum.  (Thanks to Sam Cook for the link.)

Here is Washington Post columnist Charles Krauthammer on "intelligent design". (You may need to register to read it.)

Here is Edward O. Wilson on "Charles Darwin's Difficult Legacy", writing in the NYT.  (The link only provides an abstract of this article; you need TimeSelect to read the full version.)  You can also get Wilson's new edition of Darwin's seminal books.

Standardization by default

Of the myriad abuses of statistics, standardization of variables is one not discussed enough, so I address it here.

This cautionary tale is inspired by a misleading comment in my first post on the suicide data.  In presenting the nine scatter plots, I had said:

The x-axis represent locations; the y-axis represent the number of suicides at that location -- but on a standardized scale.  The standardized scale allows us to compare across graphs.

Later, I realized I applied standardization by default.  When faced with several variables with different scales, we often standardize, i.e. put them onto the same scale, before making comparisons.  Sometimes this strategy backfires...

A typical standardization procedure involves two steps: 1) centering i.e. shift the average to 0; 2) scaling i.e. express the scale in units of standard deviation.  The following table refers to Dataset 9 in the suicide data, followed by centered data and standardized data.

            Raw     Centered    Standardized
Std Dev     11.2    11.2        1.0

The effect of centering (second column) is to move the average from 21.6 to 0.0 while keeping the spread of the data (see standard deviation or range) the same.  Scaling squeezes the spread, shortening the range from 57.0 to 5.1 (third column).  In effect, standardization puts the data onto a new scale:  0 in the new scale is 21.6 (the average) in the old scale; 1 in the new scale is the old average + 1 unit of standard deviation, or 21.6 + 1x11.2 = 32.8; 2 indicates 21.6 + 2x11.2 = 44.0, etc.
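The mapping between the two scales can be sketched as follows, using only the summary figures quoted above (average 21.6, standard deviation 11.2, range 57.0).  The function names are mine, for illustration.

```python
# Summary figures quoted in the post for Dataset 9.
mean, std_dev, data_range = 21.6, 11.2, 57.0

def to_standardized(x):
    """Center on the mean, then express in units of standard deviation."""
    return (x - mean) / std_dev

def to_raw(z):
    """Map a value on the standardized scale back to the raw scale."""
    return mean + z * std_dev

for z in (0, 1, 2):
    print(f"{z} on the new scale = {to_raw(z):.1f} on the old scale")

# Scaling divides every distance by the standard deviation, range included.
print(f"standardized range = {data_range / std_dev:.1f}")
```

This reproduces the 32.8 and 44.0 in the text, and shows where the 5.1 comes from: 57.0 divided by 11.2.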

Observe that standardizing (or more precisely, scaling) compresses the data spread, which means artificially changing the shape of the distribution.  It turns out that to show that the suicide locations are not random, we need to discover that Dataset 9 has significantly larger spread than the other datasets (see here and here).  By standardizing the variables, I had inadvertently thrown away this key feature!
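A small sketch shows how the spread difference vanishes.  The two lists are hypothetical, not the suicide data, and I use the population standard deviation purely for simplicity.

```python
from statistics import mean, pstdev

def standardize(values):
    """Center on the mean and scale by the population standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

# Hypothetical counts, NOT the suicide data: same mean, very different spread.
narrow = [18, 20, 22, 20, 20]
wide = [0, 10, 20, 30, 40]

print("raw spread:         ", round(pstdev(narrow), 2), round(pstdev(wide), 2))
print("standardized spread:", round(pstdev(standardize(narrow)), 2),
      round(pstdev(standardize(wide)), 2))
```

Before standardizing, the second list is more than ten times as spread out as the first; afterwards both have a standard deviation of 1, so a plot of the standardized values can no longer tell them apart on spread.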

Proof 1: (cumulative distribution)  Recall the red staircase line is Dataset 9 (real data); all other lines are random data.

Proof 2: (boxplot) 
The rightmost boxplot summarizes Dataset 9.

In both cases, Dataset 9 did not stand out in the standardized data plot.

So my advice is: never standardize by default; always understand how standardizing changes the distribution.

A recap of the series exploring the nature of randomness:

The Problem
The Data
First Analysis
The Boxplot

You can uncover the trail of my mischief by looking at the axes of the various graphs in this series.  When the scale is from -3 to 3, I was using standardized data; when it is from 0 to 60, I was using raw data.

Early Christmas

Frequent readers may have noticed I jumped the gun on Christmas and put my wish list in the right column.  If you feel charitable this holiday season, please help me build my library of books.  (This lucky woman received a gift of 1,082 books; I wished only eight!)  In any case, do browse the list because these are important books for all serious data sleuths.