Ten parts don't make a whole

Reader Sigve I. thinks we should clean up Wikipedia. This is a good idea but would take up a lot of time. Some of our previous contributions include these entries.

In making this suggestion, Sigve sends us to the following chart about population growth (related to this entry):

World_Population_by_Continent_and_10_Most_Populated_Countries

 

The problems here are many. Start with the detached chart titles. It takes a little while to realize that the graphical elements depict the share of population from 1950 to 2010, that the population growth is written in parentheses next to the legend, and that the third series of numbers displays the ranking (not of growth, but of share of population) among the continents or countries depicted.

That's quite a mouthful.

A forensic scientist is on call to tell us which software might have generated these charts. The telltale clue would be the padded "00.8%". This one can't be blamed on Excel since Excel always banishes the padding (even if you deliberately put it there).

I won't mention the variety of chartjunk that serves no purpose. But I do want to point out that setting the year labels 15 years apart is wacky.

***
Now, let's zoom in on the bottom chart. "10 most populated countries" is the title. Why does the vertical axis display proportions that add up to 100%? Surely, these 10 parts don't add up to a whole!

This is not a pie chart, for which this sort of confusion is (unfortunately) fairly routine; we've even stumbled on examples in teaching materials for (gasp) numeracy. But the same error can show up in stacked column or area charts.
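
To make the distinction concrete, here is a minimal sketch in Python. The ten population figures and the world total are made-up placeholders, not UN data. Rescaling ten values so that they sum to 100% yields shares of the top-ten subtotal, which are not the same thing as shares of the world.

```python
def shares(values, total):
    """Return each value as a fraction of `total`."""
    return [v / total for v in values]

# Hypothetical populations (millions) for ten countries; NOT real data.
top_ten = [1300, 1200, 300, 240, 190, 170, 160, 150, 140, 130]
# Hypothetical world population (millions); NOT real data.
world_total = 6800

# What a 100% stacked chart plots: shares of the top-ten subtotal.
within_top_ten = shares(top_ten, sum(top_ten))
# Shares of the actual whole.
of_world = shares(top_ten, world_total)

print("sum of shares within top ten:", round(sum(within_top_ten), 3))
print("sum of shares of the world:  ", round(sum(of_world), 3))
print("largest country, two ways:   ",
      round(within_top_ten[0], 3), "vs", round(of_world[0], 3))
```

With these placeholder numbers, the first sum is 1 by construction while the second falls well short of 1, and every individual country looks bigger on the rescaled axis than it is relative to the world.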

Take a step back. Apart from the obvious fact that China followed by India are the two most populous countries by far, what insight is being conveyed by this chart?

Next, consider the following version:

Redo_UNPopGrowCountry2

On this one, we notice that the top 10 countries fall into roughly three types in terms of their growth trajectory since 1950. The green group has a parabolic growth pattern, with a growth rate that reaches an apex in the front part of this period; these countries all have slowing growth in the most recent decades.

The black group, which includes biggies like China, Russia, Japan and Brazil, has by and large experienced slowing growth throughout the time window. They are still growing but the growth rate has been declining.

Finally, USA stands alone as a country where the growth rate has been generally stable over much of this period.

The other thing to notice is that while most countries had similar growth rates back in the 50s, by 2010 these countries showed a much wider range of growth rates.

One of the tricks that help surface these trends is the smoothing applied to the data. The real data, as you may suspect, would not fall neatly into parabolas. Just for comparison, below is the same chart without smoothing. Nothing is lost by smoothing while the result is significantly cleaner.

Redo_UNPopGrowCountry
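
For readers who want to experiment with smoothing on their own data, here is a minimal sketch in Python. It uses a centered moving average, which is an assumption chosen for simplicity and not necessarily the smoother behind the chart above. The growth-rate series is made up for illustration. The residuals (raw minus smoothed) show how far each smoothed value sits from the underlying data point.

```python
def moving_average(series, window=3):
    """Centered moving average; the window shrinks at the ends of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        smoothed.append(sum(series[lo:hi]) / (hi - lo))
    return smoothed

# Hypothetical growth rates (% per year) at five-year intervals; NOT real data.
raw = [2.0, 2.2, 2.4, 2.5, 1.2, 2.3, 2.1, 1.9, 1.7, 1.5]

smooth = moving_average(raw, window=5)
residuals = [r - s for r, s in zip(raw, smooth)]

# Index of the point where the smoothed curve departs most from the raw value.
worst = max(range(len(raw)), key=lambda i: abs(residuals[i]))
print("largest gap at index", worst,
      "raw =", round(raw[worst], 2),
      "smoothed =", round(smooth[worst], 2))
```

The `window` parameter is the knob to play with: a wider window gives a cleaner curve and larger residuals at any abrupt change.
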
***

Growth rate is not the only thing of note. By focusing on growth rates, one loses the important fact that countries with larger populations contribute more to the growth of world population. The following chart displays this trend. Risking the ire of some, I elected to lump almost all the countries into one group -- there are indeed differences among these countries in terms of their growth trajectories but one cannot escape the conclusion that these differences are only drops in a large bucket.

Redo_UNPopByCountry
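
The arithmetic behind this point is simple: the number of people a country adds in a year is its population times its growth rate. A minimal sketch in Python, using two made-up countries (the names and figures are hypothetical, not real data), shows how a large country with a low growth rate can still add more people than a small country with a high one.

```python
def absolute_growth(population, growth_rate_pct):
    """People added per year, given a population and an annual growth rate in percent."""
    return population * growth_rate_pct / 100.0

# Hypothetical countries: (population in millions, annual growth rate in %); NOT real data.
countries = {
    "Large, slow-growing": (1300, 0.5),
    "Small, fast-growing": (150, 2.5),
}

added = {name: absolute_growth(pop, rate) for name, (pop, rate) in countries.items()}
total = sum(added.values())

for name, people in added.items():
    print(f"{name}: {people:.2f} million/year, "
          f"{people / total:.0%} of the combined increase")
```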

 

Looks like Wikipedia needs some cleaning up. Who's pitching in? 

Comments

Rhett

If you think it needs cleaning up, clean it up! Errors on Wikipedia are like litter on a sidewalk. Clean it up.

Rick Wicklin

I disagree that "nothing is lost by smoothing." The smoothed chart, as you say, is superior for showing trends, but is inferior for faithfully representing the data. With the (unsmoothed) line chart, I can accurately determine the population growth for India in 1985. The smoothed chart underestimates that value, but I have no way of knowing that fact. Also, the smoothed version gives the erroneous impression that the population growth for Indonesia in 2010 was less than Brazil's, when in fact the opposite is true.

In general, I don't think that it is a good idea to replace a line plot with smoothers. I think that a graph should show the true data unless it explicitly indicates otherwise. When you substitute a model (=smoother) for the data, it can misrepresent the data. For example, a sharp drop off in population growth (due to a war or natural disaster) would not show in the smoothers. If you are going to show smoothers, superimpose them on a scatter plot of the data.

Lastly, it is important to note that there is not a unique smoother for these data. Each smoother depends not only on the data but also on smoothing parameters. An unscrupulous analyst might choose a value for the smoothing parameter which shows aspects of the data that do not truly exist. Therefore, when you use smoothers it is best to specify the smoothing technique (cubic spline, loess,...) and how the fit was constructed.

Aaron

I'm not terribly experienced with smoothing. Perhaps it's my ignorance, but my first thought was that it can't be true in this case that "nothing is lost by smoothing." While the point of the graph may not be to compare two countries, the chart does lend itself to doing so, and could result in false conclusions.

For example, try comparing Pakistan and Nigeria. In the smoothed chart, it looks like for a good 30 years, Nigeria's growth rate outstripped Pakistan's. The other chart suggests there was actually quite a bit more variability. You also might get the impression from the smoothed chart that over the past 15 years Nigeria's growth rate has slowed much more dramatically than Pakistan's. The second chart suggests that is probably not the case.

The little smoothing I've done in the past has made me a little nervous, so it's a topic I'd be interested in seeing discussed.

Chris Johnson

I also disagree that "nothing is lost by smoothing." By smoothing the data so heavily, you reduce the amount of information that can be read from the graph from 120 pieces of real data (the growth rate in 10 countries times 12 years) to 10 pieces of inferred data (whether each country is in the green, black or blue categories).

What actually does the y-axis represent in these plots? I assumed it was actual population growth per year, but the values should be around 1.5% in this case, not 0.15%. Perhaps it is growth per month?

Google will display this data (for every year, rather than every five) in a rather nice interactive format: http://www.google.com/publicdata?ds=wb-wdi&ctype=l&strail=false&nselm=h&met_y=sp_pop_grow&scale_y=lin&ind_y=false&rdim=country&idim=country:USA:IND:IDN:BRA:PAK:BGD:NGA:JPN:RUS:CHN&tdim=true&tstart=-315619200000&tunit=Y&tlen=49&hl=en&dl=en

which also indicates where the data is from (http://data.worldbank.org/indicator/SP.POP.GROW?cid=GPD_2 - with sparklines!)