
Why should charts exist?

That sounds like a silly question. Isn't the answer self-evident? Am I suggesting that we banish the discipline of charting?

Maybe I won't go so far. But it's difficult not to have such a destructive thought when one stares at charts like this:


Now, compare the above with this version shown on the right, and it's clear that all the squares and bubbles and colors gave us nothing. Readers have to read the fine print in order to take in the unequal distribution of income. This chart violates the notion of self-sufficiency we often speak about.

Peering back at the original chart, we find that the entire square grid edifice only serves to explain that 0.01% is one-tenth of 0.1%, which is one-tenth of 1%, etc. On the other hand, the part that has a chance of conveying the main message -- the relative size of the biggest bubble versus the smallest bubble -- is shoved off the screen. The gigantic yellow bubble being mostly off the chart, readers are essentially asked to read the data labels.


The same article (via Yahoo!) contains other charts that are well executed. 

This one, for instance, shows the increasing inequality very well. (The legend is on the left panel, which I did not include here: the top red line is the top 1%, and the other five lines are the quintiles, or 20% buckets.) At least four-fifths of the country is worse off now than in 1980 in terms of their share of after-tax income.


Charts should not be used as map lessons

Like Australia-based reader Ken B., I don't understand why many chart designers insist on using charts to deliver lessons to the public on map geography. Here is a recent example from Down Under, on earthquakes: (click on this link for the interactive version)


Was there a quake that shook the middle of the Pacific? Did a new geological formation give New Zealand a Pinocchio nose? No and no. The ugly presentation of the 2010 and 2011 Christchurch earthquakes -- as two ends of a dumbbell -- makes clear the straitjacket that maps are when it comes to delivering quantitative information.

Besides, the bubbles represent the relative magnitude of the quakes when one would hope that their sizes represent the geographical extent of the damage; at least, that would be information that has a spatial dimension.

The location of the quake is the only data with a spatial dimension surfaced on this plot. The only purpose of the map background is to tell us where Christchurch, Sichuan, etc. are on a map. In order to deliver this map lesson, the designer has to hide all of the more interesting data, like the relative magnitudes, the time-lines, the extent of the damage, the mortality rates, etc. In my mind, that is a very poor tradeoff.

The chart that reveals a mysterious death

I agree with Business Insider that the following chart is attractively drawn. It nicely illustrates the rise and fall of various music media over time.


Area charts are more visually appealing than line charts, largely because line charts frequently leave large patches of white space. But one should be aware of some shortcomings of area charts.

Notice that the outer envelope of the area chart represents the growth in music sales across all media, not to be mistaken for the growth of any particular medium. However, the primary message of this chart relates to the change in mix among different media, not the growth of the total market. Because the different areas are stacked on top of each other, it is not an easy task to read the growth of any individual piece, such as CDs.


Unlike Business Insider, which found some answers on this chart, I find that this chart raises a mysterious -- and important -- question: what happened around 2001 to damage CD sales? According to this chart, digital sales didn't really show up till 2004-ish, so there is a gap of two years or so when CD sales dropped drastically, seemingly of their own volition.

From text documents to our eyes

Today we look at an example of a powerful visualization of some unstructured data. The data team at the Guardian (UK) organized the Wikileaks data concerning reported IED incidents in Afghanistan.

A scatter plot on a map provides an overview of the intensity of attacks from a spatial perspective. (A part of this map is shown on the right.) The background data -- the relief map of Afghanistan, and the major thoroughfares -- add to our understanding of why attacks were concentrated in certain parts of the country. It is always a great idea to add (con)textual data to help readers grasp the information shown on the chart.

Readers may want to understand the temporal pattern of attacks as well. The designer chose a small-multiples format to show this data, disaggregated by year of occurrence. This graphical construct is very versatile, and illustrates this data well... even though there has been little change over time, apart from a general increase in the number of reported attacks across the country.

It is a good idea to track the total number of attacks over time -- but not with those bubbles! The bubble chart almost always fails the self-sufficiency test; our eyes are not equipped to read relative areas of circles, and so any information we obtain about the aggregate number of attacks comes from reading the data directly. Switching to a bar chart, or removing the bubbles, leaving just the data, is recommended.
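To see why bubbles understate differences, here is a minimal Python sketch. It assumes the usual convention that a bubble's area (not its radius) is proportional to the value; I don't know how the Guardian chart was actually scaled, and the function name is just for illustration.

```python
import math


def bubble_radius(value, reference_value, reference_radius=1.0):
    """Radius of a bubble whose AREA is proportional to `value`.

    Assumption: area scales linearly with the value, so the radius
    scales with the square root of the value.
    """
    return reference_radius * math.sqrt(value / reference_value)


# A count four times as large gets a bubble only twice as wide.
narrow = bubble_radius(1, 1)
wide = bubble_radius(4, 1)
print(wide / narrow)
```

On a bar chart, the same fourfold difference would show up as a bar four times as long, which is the comparison our eyes handle well.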

The major problem with a dataset like this is reporting bias: only attacks that were reported by U.S. personnel were included. The following chart helps close the gap a little by also showing the number of defused attacks, reported in the U.S. database. I'd have preferred a stacked column chart here since the total of defused (gray) and detonated (red) IEDs is an interesting statistic.


A stock trading volume type chart would also be nice, something like this:




Reading behind the chart

I could have filed this one under Light Entertainment but it's too good a chart to let go to waste:

(The chart is from Internet Retailer.)

Let's focus on the (mis)match between the question being addressed and the data collected to address it. The intention of the analyst is fully divulged in the title of the accompanying article: "Don't sell Twitter short: those 140-character messages reach an affluent and engaged audience". The chart supports this claim by showing that Twitter users disproportionately represent the types of consumers that marketers most covet, i.e. those with advanced education, and those earning higher incomes.


Up top, the concept "user" is very pliable. Is it someone who has a Twitter feed, someone who reads Twitter feeds, or someone who subscribes to Twitter feeds? Is it someone who is registered or everyone who visits? Is it someone who has visited the site in the last x months? Or posted a tweet in the last x months? If a writer, does it include someone with no subscribers? No page views? What about people who simultaneously publish multiple feeds (like John Cook, who writes one of my recommended blogs)?

Now, we don't expect the analyst to describe fully how a "user" is defined but while interpreting this chart, we should ask appropriate questions of the data.

Next, the analyst establishes a reference level called "general population". Is this the right metric? This depends on what the chart is used for. If you are choosing between spending money on Twitter, and say spending the money on national TV advertising, then perhaps this comparison is valid. If you are selecting between Twitter and say Google, then absolutely not. For most readers, I think a more relevant point of comparison would be the general Web user, rather than the general population. This is an important distinction because the general Web user also earns higher incomes and has higher educational attainment than the general population, thus the current set of data exaggerates the "value" of Twitter exposure.

Finally, if you are a marketer looking to spend with Twitter, you are also worried about "reach". Say, 50% of website ABC's users are rich people compared to 10% of your reference population. That sounds like an amazing opportunity. Well, only if website ABC has a sufficient number of users! If ABC is a niche website serving only 1% of your reference population, then despite the benefit of targeting, its scale is too small for your needs.
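The arithmetic behind the website ABC example can be sketched in a few lines of Python. The function and parameter names are hypothetical, and the sketch assumes that all of ABC's users belong to the reference population.

```python
def reach_summary(site_share_of_population, rich_share_on_site,
                  rich_share_in_population):
    """Compare how well a site targets rich people with how many it reaches.

    All inputs are fractions between 0 and 1.
    Assumption: the site's users are a subset of the reference population.
    """
    # How over-represented rich people are among the site's users.
    targeting_lift = rich_share_on_site / rich_share_in_population
    # Rich site users, expressed as a fraction of the whole population.
    rich_reached = site_share_of_population * rich_share_on_site
    # Fraction of all rich people in the population that the site reaches.
    coverage_of_rich = rich_reached / rich_share_in_population
    return targeting_lift, rich_reached, coverage_of_rich


lift, reached, coverage = reach_summary(0.01, 0.50, 0.10)
print(f"lift: {lift:.1f}x, reached: {reached:.2%}, coverage: {coverage:.0%}")
```

With the numbers from the example, the site is five times as rich-heavy as the reference population, yet it reaches only about one in twenty of the rich people a marketer is after.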

Don't take a chart at face value. Read behind the chart!



Light entertainment: volume visuals

Frequent contributor Julien D. sent me this link to a video, with the intriguing title "Beer visualization". Since I don't speak French, maybe some of you will tell me what's going on?


The video immediately reminds me of this visualization of the size of the new Starbucks "trenta" cup. (Health warning: a primary cause of obesity is the portion sizes served to Americans.)



Ten parts don't make a whole

Reader Sigve I. thinks we should clean up Wikipedia. This is a good idea but would take up a lot of time. Some of our previous contributions include these entries.

In making this suggestion, Sigve sends us to the following chart about population growth (related to this entry):



The problems here are many, starting with the detached chart titles. It takes a little while to realize that the graphical elements depict the share of population from 1950 to 2010, that the population growth is written in parentheses next to the legend, and that the third series of numbers displays the ranking -- not of growth, but of share of population -- among the continents or countries depicted.

That's quite a mouthful.

A forensic scientist is on call to tell us which software might have generated these charts. The telltale clue would be the padded "00.8%". This one can't be blamed on Excel since Excel always banishes the padding (even if you deliberately put it there).

I won't mention the variety of chartjunk that serves no purpose. But I do want to point out that setting the year labels 15 years apart is wacky.

Now, let's zoom in on the bottom chart. "10 most populated countries" is the title. Why does the vertical axis display proportions that add up to 100%? Surely, these 10 parts don't add up to a whole!
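A small Python sketch shows what forcing ten parts into a whole does to the numbers. The shares below are made up for illustration, not taken from the Wikipedia chart, and the rescaling is my assumption about how such a chart gets its 100% axis.

```python
def rescale_to_whole(shares):
    """Rescale shares so they add up to 100, as a 100%-stacked chart would.

    Assumption: the chart divides each share by the total of the plotted
    shares, silently dropping everyone who is not plotted.
    """
    total = sum(shares)
    return [100.0 * share / total for share in shares]


# Hypothetical shares of world population (in percent) for three countries.
true_shares = [20.0, 15.0, 5.0]  # adds up to 40, not 100
plotted = rescale_to_whole(true_shares)
print(plotted)
```

After rescaling, the first hypothetical country appears to hold half of the "whole" even though its true share is one-fifth, and the distortion changes every year as the unplotted remainder changes.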

This is not a pie chart, where this kind of confusion is (unfortunately) fairly routine; we've even stumbled on examples in teaching materials for (gasp) numeracy. But the same error can show up in stacked column or area charts.

Take a step back. Apart from the obvious fact that China followed by India are the two most populous countries by far, what insight is being conveyed by this chart?

Next, consider the following version:

On this one, we notice that the top 10 countries fall into roughly three types in terms of their growth trajectory since 1950. The green group has a parabolic growth pattern, with a growth rate that reaches an apex in the front part of this period; these countries all have slowing growth in the most recent decades.

The black group, which includes biggies like China, Russia, Japan and Brazil, has by and large experienced slowing growth throughout the time window. They are still growing but the growth rate has been declining.

Finally, USA stands alone as a country where the growth rate has been generally stable over much of this period.

The other thing to notice is that while most countries had similar growth rates back in the 50s, by 2010 these countries experience a much wider range of growth.

One of the tricks that help surface these trends is the smoothing applied to the data. The real data, as you may suspect, would not fall neatly into parabolas. Just for comparison, below is the same chart without smoothing. Nothing is lost by smoothing while the result is significantly cleaner.
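For readers who want to try this at home, here is a minimal Python sketch of one simple smoother, a centered moving average. This is an assumption for illustration: it is not necessarily the smoothing method used for the chart above, and the function name is hypothetical.

```python
def moving_average(values, window=3):
    """Centered moving average with an odd `window`.

    Near the two ends, the window shrinks to whatever neighbors exist,
    so the output has the same length as the input.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - half)
        end = min(len(values), i + half + 1)
        neighborhood = values[start:end]
        smoothed.append(sum(neighborhood) / len(neighborhood))
    return smoothed


# Made-up, jagged growth rates (percent per year), for illustration only.
raw = [2.0, 2.6, 2.1, 2.9, 2.4, 1.8, 2.0, 1.3]
print(moving_average(raw, window=3))
```

A wider window gives a cleaner line at the cost of flattening genuine turning points, so it pays to compare the smoothed and unsmoothed versions as we do here.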


Growth rate is not the only thing of note. By focusing on growth rates, one loses the important fact that countries with larger populations contribute more to the growth of world population. The following chart displays this trend. Risking the ire of some, I elected to lump almost all the countries into one group -- there are indeed differences among these countries in terms of their growth trajectories but one cannot escape the conclusion that these differences are only drops in a large bucket.
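The point about large populations is simple arithmetic, sketched below in Python with made-up numbers (neither country is real, and the function name is hypothetical).

```python
def people_added(population_millions, annual_growth_rate):
    """People added in one year, in millions: population times growth rate."""
    return population_millions * annual_growth_rate


# Hypothetical countries, for illustration only.
big_slow = people_added(1000.0, 0.01)    # 1,000 million people growing at 1%
small_fast = people_added(50.0, 0.03)    # 50 million people growing at 3%
print(big_slow, small_fast)
```

The small country grows three times as fast, yet the big one adds several times as many people to the world total, which is why a chart of growth rates alone can mislead.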



Looks like Wikipedia needs some cleaning up. Who's pitching in? 

Perhaps the Economist doesn't take its own advice

Given the recent post questioning the value of the MBA degree, one would think the Economist powers-that-be would not be staffing up with MBAs. But then, if not useless MBAs, how would the Economist explain this chart they printed next to said article?

This chart appears to tell us that all the top MBA programs succeed in reducing their students' earning potential. In each case, the "pre-MBA salary" exceeds the "salary on graduation".

More likely, the red part is the incremental salary, possibly explained by the value of the degree, while the gray part is the pre-MBA salary.

However, since the author has few nice words to say about business schools, one can never be 100% sure if he is presenting some counter-intuitive data.


In the Trifecta checkup, one would find nothing wrong with the chart type, nor is there anything wrong with asking about the return on investment of an MBA degree.

The third component -- having the right data -- is what renders this effort a failure. It is too simplistic to measure return on investment by the salary upon graduation. Surely, one must also include future career paths, intangible benefits from network relationships, personal development, etc.


New is not always better, and some Indians are in fact wealthier than Americans

This chart highlighted by the Economix blog at the New York Times caught a bit of attention.


Catherine Rampell wrote "awesome chart" on the margin of Branko Milanovic's book which first published this, also conceding that it is a chart that can "take a few minutes" to understand, "but trust me, it's worth it".

The question for me is: is the reward worth the effort?


The answer is no. This chart does not address an interesting question, and it tempts readers to infer things that the chart doesn't say.

The message of this data is that there are rich people (by world standards) in poor countries. For me, this isn't very interesting but I can understand if others find it shocking, edifying or even satisfying.

I'd point you to a different visualization, done by the now-famous Hans Rosling, years ago (I discussed his team's work here). If he used lines instead of areas for the distributions, the chart would be even better. 

I much prefer this chart.

Comparing the two also surfaces another difference. The four countries chosen in the Milanovic chart are highly selective. (I snicker at the title which announces "Inequality in the world".) It's comparing the U.S. against three developing nations with high income inequality. What about showing us also a few lines of nations with lower inequality, like Scandinavian countries?

Rampell's conclusions, in particular, are not well supported by her beloved chart. First, she said:

All people born in rich countries thus receive a location premium or a location rent; all those born in poor countries get a location penalty.

I don't disagree that it's better to be born with money. But my takeaway from the chart is the opposite of hers: that you can't generalize entire countries; that you can live in a poor country, and you can be extremely rich. As Rampell pointed out earlier in the article, "[Brazil] this one country covers a very broad span of income groups". So, if anything, the chart undermines the point that "all people" in any one country receive a location premium/rent.

She also said the following:

How can there be so many people in the world who make less than America's poorest, many of whom make nothing each year? Remember that we're looking at the entire bottom chunk of Americans, some of whom make as much as $6700; that may be extremely poor by American standards, but that amounts to a relatively good standard of living in India, where about a quarter of the population lives on $1 a day.

The data has been adjusted for PPP (in Rampell's words, for different costs of living around the world; really, for different standards of living). So it makes little sense to explain a difference in the adjusted amounts based on "standard of living". In fact, my understanding (unless something changed recently) is that the PPP adjustment uses the US living standard as the reference level.

The $6700 that she describes as the maximum income of the bottom chunk of Americans, if earned by an Indian, would put him or her in the very top bucket of Indians, according to the Milanovic chart. I'd call that a super high standard of living in India, not merely "relatively good".


A few comments on the statistics.

The last quotation above shows a confusion between averages and extreme values. The $6,700 is the maximum income of the bottom chunk of Americans; it cannot be compared to the $1 a day. That amount, which by the way should be written as $365 a year, is the average income of the bottom 25% of Indians. One can't compare an average to a maximum, nor an annual number to a daily number.

A number of readers conclude from the chart that the income inequality problem in the U.S. is overblown. You just can't see it on that chart. That's because the chart literally hides this information. As we know, the top 20% of the U.S. population holds 84% of the wealth, and it gets worse with the top 1%, top 0.1%, etc. The horizontal axis of the chart is only as precise as the "ventile", which is a 5% bucket.
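Here is a minimal Python sketch of how 5% buckets can hide the very top. The incomes are invented for illustration, and summarizing each ventile by its mean is my assumption, not necessarily how the Milanovic chart was built.

```python
def bucket_means(incomes, n_buckets=20):
    """Sort incomes, split them into equal-sized buckets, average each one.

    With n_buckets=20 the buckets are ventiles (5% each). Any leftover
    people go into the top bucket.
    """
    ordered = sorted(incomes)
    size = len(ordered) // n_buckets
    if size == 0:
        raise ValueError("need at least one income per bucket")
    means = []
    for b in range(n_buckets):
        start = b * size
        end = (b + 1) * size if b < n_buckets - 1 else len(ordered)
        bucket = ordered[start:end]
        means.append(sum(bucket) / len(bucket))
    return means


# Made-up population: 99 people earning 1 to 99, plus one earning 10,000.
incomes = list(range(1, 100)) + [10_000]
ventiles = bucket_means(incomes)
print(ventiles[-1], max(incomes))
```

The top ventile averages the one very rich person together with four ordinary neighbors, so the chart would show a top value far below the actual maximum.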

Also, notice that this type of chart is used to compare one distribution against another distribution. The notion of currencies has been entirely removed. It's similar to converting data from absolute units to rankings. You lose that sense of scale. (This is the reason why it appears as if no one in India makes more than anyone in the States. If a finer scale were used at the upper end of the Indian income distribution, I'm sure you would find otherwise.)