
Spacing out on the space race

Jordan G. sent us to this Wikipedia image. (link)


It's an intriguing concept to try to show the tit-for-tat in the space race between the US and the USSR. But it's almost impossible to fish any information out of it. While the voluminous text turned sideways is annoying enough, I find the color scheme to be the most offensive. I would like to know who the intended audience is.

U.S. exceptionalism and billionaires

Ryan McCarthy linked to a post by Ruchir Sharma running on Ezra Klein's blog analyzing global billionaires.

It has an accompanying chart, which fails our self-sufficiency test. That test involves erasing raw data from a chart, and figuring out how much information the graphical elements themselves convey.


The primary metric used by Sharma is the billionaires' total net worth as a percentage of the country's GDP. This metric is embedded in double concentric circles. Unfortunately, without mental gymnastics, readers can't tell what the proportion is. This means we must look at the raw data, which is supplied as a column on the right of the graphic. If readers are taking the information from the column of raw data, then why draw a chart?


The actual data is revealed on the left. Don't tell anyone you read it here but pie charts would work well with this dataset. You might complain that there is a conceptual problem - that if we sum up the net worth of everyone in a country, it would not equal GDP. I think the sum doesn't work - economists can chime in about this. Sharma seems to imply that the total would sum to 1. Anyone's net worth is accumulated over a number of years during which GDP fluctuates, while the GDP figure is reported as of the end of a specific quarter. Does it make sense to divide one by the other?

Also, the fact that some people may have negative net worth creates problems with the pie-chart format and it's not much better in a concentric-circle format either.

A maddening decision puts the United States, which is the biggest circle, at the bottom of the chart. Notice that the countries are sorted from larger billionaires' share to smaller. The U.S. belongs to the top 5 nations with the worst inequality by this metric and yet a cheeky little bookmark sends us to the bottom of the list together with the more-equal nations.

Not only is the location of the U.S. privileged, but the location of the text, the number of decimal places given in the net worth amount, and the presence of the GDP value all set the U.S. apart from the other countries plotted.


The most interesting piece of information is waiting to be reconstructed. In Malaysia, nine citizens own as much as 18.3% of the country's GDP. In Mexico, 11 people own 10.9% of the country's GDP.

To make the number even more telling, we have to incorporate the population size. For Malaysia it is 28 million. This means that the top 0.000032% of the population owns 18.3%. In the case of perfect equality, this proportion would own 0.000032%. We can say the inequality index is 570,000. In Mexico, the index is 1.1 million. So in fact, the concentration of wealth at the time is worse in Mexico than in Malaysia. For reference, the U.S. comes in at 78,000.
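The arithmetic behind this index can be sketched in a few lines of Python. The function name `inequality_index` is my own label, not anything from Sharma's post, and only the Malaysia inputs quoted above (9 billionaires, 28 million people, 18.3% of GDP) are used.

```python
def inequality_index(n_people, population, wealth_share):
    """Ratio of a group's actual share to its share under perfect equality.

    Under perfect equality, a group holds a share equal to its share of
    the population, so the index is wealth_share / population_share.
    """
    population_share = n_people / population
    return wealth_share / population_share


# Malaysia, using only the figures quoted in the post.
malaysia_population_share = 9 / 28_000_000
malaysia_index = inequality_index(9, 28_000_000, 0.183)

print(f"Population share: {malaysia_population_share:.6%}")
print(f"Inequality index: {malaysia_index:,.0f}")
```

The result lands a little above 569,000, which rounds to the 570,000 quoted above. The same function applies to any other country once its population is supplied.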

Of course, the use of billionaires as a filtering device to determine whom to count is completely arbitrary. In measuring inequality, one should look at what proportion of the population controls 50% of the wealth, for example.
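Here is a rough sketch of that alternative measure: the smallest share of the population that holds a given fraction of total wealth. The wealth figures are invented purely for illustration, and the function assumes nobody has negative net worth.

```python
def population_share_holding(wealths, target=0.5):
    """Smallest fraction of people (richest first) holding `target` of total wealth.

    Assumes every value in `wealths` is non-negative.
    """
    ranked = sorted(wealths, reverse=True)
    total = sum(ranked)
    running = 0
    for count, wealth in enumerate(ranked, start=1):
        running += wealth
        if running >= target * total:
            return count / len(ranked)
    return 1.0


# Hypothetical six-person economy (made-up numbers).
hypothetical_wealths = [40, 25, 15, 10, 5, 5]
share = population_share_holding(hypothetical_wealths)
print(f"{share:.0%} of the population holds half of the wealth")
```

Unlike the billionaire count, this measure does not depend on an arbitrary wealth cutoff, so it can be compared across countries of any size.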


There is no explanation for the choice of countries. The U.S. is the only developed nation in the entire chart.




New but is it better?

Conventionally, the bracket in a sports tournament is presented like this (link):


In the Euro 2012 that's happening right now, the group stage is followed by the knockout stage (quarter-, semi- and final).

The knockout stage is pretty straightforward. The group stage presents some challenges because it's difficult to present the chronology together with the team standings at the same time.


The official site of Euro 2012 has an innovative "Tournament Map" that is an attempt to improve upon the traditional design. (link)


I have mixed feelings about this presentation. It's easier to get a sense of how each team performed chronologically over the course of the competition. But then, I can't figure out what day the winner of a quarterfinal would play in the semifinal.

Simple rendering of complex data

Andrew Gelman likes this line chart showing the day-by-day trend in childbirth:



Andrew makes a number of good points about this chart. Make sure you read the whole post.

One of his points concerns making the line smoother by removing the within-week fluctuations. Doing so removes the weekday/weekend effect. By removing effects that are not of interest, we can focus on effects that are interesting. The number of births on any given day is a confluence of many factors, weekday/weekend being one of them. If we don't remove some of the contributing factors, we'd have no idea which factors are more important and which are less so.
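One simple way to remove a within-week cycle is a 7-day moving average. This is my own choice of smoother for illustration, not necessarily what Gelman had in mind, and the birth counts below are made up.

```python
def weekly_moving_average(daily_counts, window=7):
    """Average every run of `window` consecutive days.

    With window=7, each average covers exactly one of each day of the
    week, so a repeating weekday/weekend pattern cancels out.
    """
    return [
        sum(daily_counts[i:i + window]) / window
        for i in range(len(daily_counts) - window + 1)
    ]


# Hypothetical data: four weeks with 100 births per weekday, 70 per weekend day.
one_week = [100] * 5 + [70] * 2
daily = one_week * 4
smoothed = weekly_moving_average(daily)

print(min(daily), max(daily))                          # raw series swings between 70 and 100
print(round(min(smoothed), 2), round(max(smoothed), 2))  # smoothed series is flat
```

Once the weekly swing is gone, whatever variation remains in the smoothed line must come from other factors, such as seasons or holidays.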


The problem of "confounding" in complex datasets is demonstrated in the heat map, which Gelman also cited, without approval:

Heat maps are great for certain datasets. This isn't one of them.

The weakness of heat maps is the reliance on color scales. Most software does not allow precise mapping of numbers to colors. The color pattern is automatically generated, which is often not to our liking. Even if the colors are acceptable, it is impossible to learn anything from a heat map other than the big-picture patterns.

The big-picture pattern we find here includes summer months being most popular for births while springtime is less popular. I fail to find any consistent patterns in the rows. If this is the key message, then we can collapse the rows, and even collapse the columns into seasons.

But what is the color scale? The colors correspond to ranks. Ranks ignore the actual difference between two data points. In other words, all the drastic troughs and peaks in the line chart disappear from this heat map. There are much better ways to turn count data into discrete bins.

Picking the right ranking scheme is the most pressing issue here. The designer ranks all 366 days in one overall ranking. This ranking serves to play up the summer bulge in births but obscures other patterns.

Alternatively, days can be ranked within each month. That would remove the month-to-month effect and highlight the day-of-the-month effect.
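The difference between the two ranking schemes can be sketched with a toy dataset. The counts and the three-day "months" below are invented; the real chart covers 366 days.

```python
from collections import defaultdict


def overall_ranks(counts):
    """Rank every day against all other days (1 = most births)."""
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {day: rank for rank, day in enumerate(ordered, start=1)}


def within_month_ranks(counts):
    """Rank each day only against other days in the same month."""
    by_month = defaultdict(dict)
    for (month, day), n in counts.items():
        by_month[month][(month, day)] = n
    ranks = {}
    for month_counts in by_month.values():
        ranks.update(overall_ranks(month_counts))
    return ranks


# Hypothetical counts: July is busier overall, but both months share
# the same day-of-month pattern (day 2 highest, day 1 lowest).
births = {
    ("Apr", 1): 90, ("Apr", 2): 110, ("Apr", 3): 100,
    ("Jul", 1): 130, ("Jul", 2): 150, ("Jul", 3): 140,
}

overall = overall_ranks(births)
monthly = within_month_ranks(births)
for day in births:
    print(day, "overall:", overall[day], "within month:", monthly[day])
```

In the overall ranking every April day sits below every July day, so the month effect dominates. Ranked within month, April 2 and July 2 both come out on top, which exposes the shared day-of-the-month pattern.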



Bloomberg issues a health warning dressed up as a fast-food menu

NYC mayor Michael Bloomberg is getting mixed reviews for his proposal to ban super-sized sugary drinks. Reader John O. wasn't impressed with this graphical effort (link):



The key problem: this picture is not scary at all. The reason it's not horrifying is that there is no context. People who have knowledge about healthy eating habits will get the message but that's preaching to the choir.

If you know that the recommended daily sugar intake for adults is roughly 20-36 grams, then you can see that one sugary drink of 12 ounces or more would take you over the daily limit. A 64-ounce drink would give you more than 7 times what you need in a day. That's a powerful message but you won't know it from this chart, and certainly not from the sugar cubes doubling as shadows, cute and creative as that concept is.

Also, make use of the chart-title real estate! Instead of "Sugar & Calories per Fountain Drink", say something memorable. "Fountain drinks make you fat and sick".


There is something else fishy about this graphic. What are the most prominent data being displayed?

You got it. They're 7, 12, 16, 32, 64. Where have we seen this type of data display?

Yup. This format is lifted from a menu in a Starbucks or a McDonald's (without prices).

Is this a health warning? Or a restaurant menu?


John wrote:

Also slightly confused about the slightly non-linear relationship between calories and drink size.  Maybe volume of ice is held constant...

It is in fact a proportional relationship. The confusion arises from the non-linear increase in cup size from 7 to 64 ounces. The math is roughly 11 calories per ounce, and 3g of sugar per ounce. I wonder if it is better to show those two numbers instead of the ten not-very-memorable numbers shown on the chart itself.
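To see the proportionality, here is a small sketch that rebuilds the numbers from the two per-ounce rates. The rates are the rough figures quoted above, so the output approximates the chart's values rather than reproducing them exactly.

```python
CALORIES_PER_OUNCE = 11      # rough rate from the post
SUGAR_GRAMS_PER_OUNCE = 3    # rough rate from the post
CUP_SIZES_OZ = [7, 12, 16, 32, 64]


def drink_table(sizes_oz):
    """Return (ounces, calories, grams of sugar) for each cup size."""
    return [
        (oz, oz * CALORIES_PER_OUNCE, oz * SUGAR_GRAMS_PER_OUNCE)
        for oz in sizes_oz
    ]


table = drink_table(CUP_SIZES_OZ)
for oz, calories, sugar in table:
    print(f"{oz:>2} oz: {calories:>3} calories, {sugar:>3} g sugar")

# The gaps between consecutive cup sizes are uneven, which is what makes
# the calorie steps look non-linear on the chart.
size_gaps = [b - a for a, b in zip(CUP_SIZES_OZ, CUP_SIZES_OZ[1:])]
print("Gaps between cup sizes (oz):", size_gaps)
```

Every row has the same calories-per-ounce ratio. Only the spacing of the cup sizes changes from one step to the next.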


In case you're wondering, the heights (thus areas) of the cups have no relationship with any of the data, not calories, not sugars, and not the cup size.


PS. John also wrote: "The soda cup graph reminds me of the chart from Pravda that Tufte cites in 'Cognitive Style of Powerpoint'." If you know what he's talking about, please post a link to the chart. Thanks.

Geographical data charted right

The following chart by the Financial Times reminds me of the famous Napoleon Russian Campaign map:


I also love it when geographical data, in this case average house price data by region, are plotted without a map. If plotted on a map, the relative prices are typically differentiated by color. On this chart, they are encoded in the heights of the columns. Our brains are just not wired to translate color differences into numeric differences so every time we can avoid color scales, we should.

Like Minard's chart, multiple dimensions are comfortably accommodated. The location along the river bank. The north/south orientation of the location. The "width" of a neighborhood.

A minor quibble is with the choice of data series. I wonder if price per square foot would be a better metric. One can also try a relative scale (indexed to the average).
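Indexing to the average is a one-line transformation. The sketch below uses placeholder area names and invented prices, not the FT's data, and sets the average to 100.

```python
def index_to_average(prices):
    """Express each price relative to the mean of all prices (mean = 100)."""
    mean_price = sum(prices.values()) / len(prices)
    return {area: 100 * price / mean_price for area, price in prices.items()}


# Hypothetical average house prices for three made-up areas.
hypothetical_prices = {"Area A": 200_000, "Area B": 300_000, "Area C": 400_000}
indexed = index_to_average(hypothetical_prices)

for area, value in indexed.items():
    print(f"{area}: {value:.0f}")
```

On this scale, a reader can tell at a glance which areas sit above or below the typical price, and by how much in percentage terms.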

Poking at the data behind a chart

Reader Jamie D. wasn't very amused by the following chart, from the Freakonomics blog (link):


Jamie summarized his view as follows:

First of all, a quick look of the graph makes you think you're comparing states with helmet laws vs. those w/out helmet laws.  But, upon closer reading, it's actually just a comparison between states that have repealed their helmet laws between 1994 and 2007 and ALL OTHER STATES.  Reading further down it appears that even in the heyday of helmet laws, only 26 states had them.  Thus, the graph is really a comparison between 7 states that repealed the helmet laws in that time period and the other 43 states, 24 of which have never had helmet laws at all.

In the Trifecta checkup, this problem surfaces as a disconnect between the question being investigated, and the data used to address the question. (For an explanation of the Trifecta checkup, see this post.)

Further, Jamie asked:

More importantly from a graphical perception point of view: the horizontal axis identifies itself as "years relative to repeal."  While that time horizon makes sense with respect to the repeal states (in light green), it is not clear at all what "year relative to repeal" means in the 43 states that did not have a helmet law repeal during the time at issue (the dark green).  This might be further explained in the book (which I don't have), but even if does, the chart is misleading and not helpful in explaining data (which is its raison d'être.)

Aligning the data to a particular event (like the repeal of a particular law) is typically a very smart thing to do... and it is one of many statistical adjustments that make perfect sense, like the seasonal adjustments of economic data (link). But here, as Jamie pointed out, in the "control" group in which states did not repeal the helmet laws, it isn't clear what should be the "anchor" year (time = 0).


At a more abstract level, the designer is working with a dataset with four dimensions: the state, the year, the status of the helmet law within a state, and the organ donation rate. The data can be arranged as a four-column table with one row per state per year.

The first issue has to do with the values in the third column (status of the helmet law). It would be a mistake to positively identify the states that have repealed the law as "repeal states", and then by default label all the rest as "non-repeal states". Instead, there should be three levels: repeal states, non-repeal states, and no-helmet-law states. I'd then plot three lines instead of two.

The second issue arises when the designer tries to transform the second column, from actual years (2000, 2001, etc.) to relative years (anchor year = 0, and other years go +1, +2 and -1, -2, etc.). At some point, she would need to make an explicit decision about how to create "relative years" for the non-repeal and no-helmet-law states.
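A small sketch makes the gap concrete. The state names, years, and statuses below are invented, and the function deliberately returns `None` when a state has no repeal year, since the chart gives no rule for that case.

```python
def relative_year(year, repeal_year):
    """Years since repeal, or None when the state has no repeal to anchor on."""
    if repeal_year is None:
        return None
    return year - repeal_year


# Hypothetical rows: (state, calendar year, helmet-law status, repeal year).
rows = [
    ("State A", 1996, "repeal", 1997),
    ("State A", 1997, "repeal", 1997),
    ("State A", 1999, "repeal", 1997),
    ("State B", 1997, "non-repeal", None),
    ("State C", 1997, "no-helmet-law", None),
]

aligned = [
    (state, status, relative_year(year, repeal_year))
    for state, year, status, repeal_year in rows
]
for state, status, rel in aligned:
    print(f"{state} ({status}): relative year = {rel}")
```

The repeal state gets -1, 0 and +2, while the other two groups get nothing. Any number plotted for them on a "years relative to repeal" axis reflects a choice the designer made and should state explicitly.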


One other problem with this chart is that the vertical axis does not start at zero even though the design draws attention to the area under the lines, and not the levels of the lines themselves. If they used a line chart instead, the start-at-zero rule would not be as important.

I'll skip the critique of the overall plan of this Freakonomics analysis as I already wrote much about that (with Andrew Gelman). See our article here.