
How to mess up a bar chart

Karen Lopez at InfoAdvisors sent in what she calls "the stupidest bar chart of 2011" (link), showing us that one can mess up bar charts too.

This chart originally came from an infographic (link) published by Klout, which is a venture-funded startup that is creating an "influence" rating of online entities.


If we use the Junkcharts Trifecta Checkup (link), we find that the design failed on at least two of the three criteria.

What is the interesting question being addressed? It's the relative "influence" of the top websites in 2011. Readers would probably want to learn more, such as why it's the Top 11 rather than Top 10. Also, as they pointed out in the write-up, a tiny company called "SoundCloud" appeared in this list, which calls into question the ranking methodology. In addition, I wonder what people go to McDonald's website to do, and why Walmart instead of Amazon.

The choice of data is the key failure here. While Klout is a rating system, for some reason they chose to display the ranking on this bar chart. Ranks invite readers to presume that the difference between Facebook and YouTube is the same as between Walmart and Netflix, or that the difference between Facebook and Skype is double that between Facebook and YouTube, or that the difference between Facebook and Apple is the same as the difference between YouTube and McDonald's, etc.

In reality, and especially for this sort of data (influence), one is likely to encounter highly skewed distributions.
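To make the point concrete, here is a minimal sketch of how ranks flatten a skewed set of ratings. The scores and the names `A` to `D` are made up for illustration; they are not Klout's actual ratings, which the chart does not show.

```python
# Hypothetical influence scores (NOT Klout's data), deliberately skewed.
scores = {"A": 100.0, "B": 60.0, "C": 58.0, "D": 20.0}

# Order the entities from highest to lowest score; rank 1 is the top score.
ordered_names = sorted(scores, key=scores.get, reverse=True)
ranks = {name: position + 1 for position, name in enumerate(ordered_names)}


def gaps(values):
    """Absolute differences between consecutive values."""
    return [abs(b - a) for a, b in zip(values, values[1:])]


# Gaps between neighbours, measured in ranks and in the underlying scores.
rank_gaps = gaps([ranks[name] for name in ordered_names])
score_gaps = gaps([scores[name] for name in ordered_names])

print("rank gaps: ", rank_gaps)
print("score gaps:", score_gaps)
```

Every gap between adjacent ranks is 1 by construction, while the gaps between the underlying scores can differ by an order of magnitude. A bar chart of ranks can only ever show the former.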

The graph itself is executed poorly. Karen asked:

Notice how 11th place Facebook has more bar?  But it’s in the worst place in the list.  Or is it?

The bar chart would make sense if ratings were being plotted. With rank data, the bar chart is completely redundant.

Ron Paul confuses the charts

Andrew Sullivan (link) re-printed this grouped column chart showing the result of a Washington Post-ABC poll on how voters say they would react to Ron Paul running as an independent candidate in next year's U.S. presidential election.


One aspect of this chart bothers me: depending on one's familiarity with election politics, one needs to read carefully the titles at the bottom of the chart, the legend, and possibly also the title of the chart (or know that the Republican wears red and the Democrat blue) in order to orient oneself. You can experiment by blocking out one or two of these three items.

Here's the same chart with a small number of fixes. Printing the legend onto the bars themselves makes the data more readable. This change necessitates flipping the columns over to horizontal bars. There are pros and cons to using a stacked chart versus a grouped chart.


Neither of these charts answers the burning question in the reader's mind, which is likely to be: from whom would Paul take his votes? The key message from above is that the insertion of Paul is projected to make the identity of the Republican candidate irrelevant. The following flow chart emphasizes the shift in votes as opposed to the vote totals.


It appears that the Others/Undecided voters, who can still swing the election, do not consider Ron Paul a desirable alternative. Most of Ron Paul's supporters would come from voters who would otherwise have cast their votes for the Republican or Democratic candidate (by a ratio of 3 Republican votes to 1 Democratic vote if Romney is running, or 3 to 2 if Gingrich is running).

Colors and scales break the seam between charts

When reader Chris P. sent in the Mint infographic (discussed here), he pointed out a flaw in the poster.

Look at the legend for the first chart, which is a map of median household incomes.


Now, look at the legend for the second chart, which is the multiple bar chart we discussed before.


Six colors in one, seven colors in the other.

Lowest bucket is under $40,000 in one, under $25,000 in the other. But both are given the same light yellow color. 

By varying the color scale, the designer completely severs the relationship between the two charts. Worse, a non-existent correlation is invented via common colors that signify nothing!


If you haven't already, see my other post about this infographic. 

Someone submits a good infographic

Reader Chris P. sent me to this Mint infographic showing the income distribution in the U.S. (link). I found the second section more interesting so this post will focus on that one chart. But I want to let Chris have his word also, so we have a double post. To see Chris's comment on the chart, see here.

Here is the chart from the second section:


What do I like about this chart?

It tells a story without appealing directly to the data.  I see only 7x2 = 14 numbers on the chart, all embedded into the legend/scale. So many charts of this type send readers immediately into a twister by bombarding our eyes with data.

In the middle of the chart, for instance, states like MD and MA contrast with states like MI and MS. Poorer people are in the yellow segments while richer people are in the greener segments. So we can see that in MD and MA, the green part extends below the first horizontal gridline while in MI and MS, that gridline cuts into the orange. The implication is that there are more rich people in MD and MA than in MI and MS.

The horizontal gridlines are subtle but surprisingly functional, allowing readers to pick out the information. The gridlines divide each column into 4 equal parts so each part is a quarter (quartile) of the state population. In MD and MA, at least the top 25% of their populations are considered rich by national standards. Here, "rich" corresponds to the green segments, which the legend defines as household incomes greater than $75,000. In both those states, the top 25% earn at least $100,000.

Similarly, by looking at the color of the segment that crosses the lowest horizontal gridline, we know how much the bottom 25% earn in each state. The poorest segment seems to be smaller in AK than in other states.
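The gridline reading described above amounts to finding which income bracket contains a given cumulative fraction of households. Here is a minimal sketch of that lookup. The bracket labels and the shares for the made-up state are assumptions for illustration (only some of the boundaries are mentioned in this post), and `bracket_at` is a hypothetical helper, not anything taken from the Mint infographic.

```python
# Hypothetical bracket labels, ordered from poorest to richest.
BRACKETS = ["<$25k", "$25k-$40k", "$40k-$60k", "$60k-$75k",
            "$75k-$100k", "$100k-$200k", "$200k+"]

# Hypothetical share of households in each bracket for one state (sums to 1).
example_state = [0.18, 0.14, 0.16, 0.10, 0.14, 0.22, 0.06]


def bracket_at(shares, q):
    """Index of the bracket that contains cumulative fraction q (0 < q < 1).

    `shares` must be ordered from the poorest bracket to the richest,
    mirroring a column stacked from the bottom up.
    """
    total = sum(shares)
    running = 0.0
    for index, share in enumerate(shares):
        running += share / total
        if q <= running + 1e-12:  # small tolerance for float rounding
            return index
    return len(shares) - 1


# The lowest gridline (25%) and the highest gridline (75%).
bottom_quartile = BRACKETS[bracket_at(example_state, 0.25)]
top_quartile = BRACKETS[bracket_at(example_state, 0.75)]

print("25% gridline falls in:", bottom_quartile)
print("75% gridline falls in:", top_quartile)
```

In this made-up state the 75% gridline falls inside the $100k-$200k bracket, which is the situation described for MD and MA: the top quarter of households earn at least $100,000.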

The row of state boundaries at the bottom of the chart is very cute. And it encodes information, which is a wonderful touch. I believe (though haven't verified) the color of the state map tells us the mean household income within the state.


A few improvements would make this column chart better. One shouldn't place the national average above the chart horizontally using a different scale. Just place it as an additional column next to the other 50+ columns, with a slight offset and proper labeling. This allows direct lookup of how a state compares to the national average.

Also, try ordering by income inequality. The alphabetical order does the reader no favors. The ordering is particularly important because the main finding of the chart is that income distribution exhibits only moderate variability by state - most states look alike.


Given the low variability, the challenge is how to bring out the mild differences: which parts of the income distribution of which state show variance against the national average?

In the following attempt, we plot the "excess" proportion relative to the national average by state. 

For example, in the most "unequal" "state", District of Columbia (first chart), we find that it has a shortage (negative excess) of people earning below $75,000, and an excess of people earning above $75,000 when compared to the national income distribution. The proportion of "excess" increases with each higher income bracket (moving from left to right of the chart).
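For readers who want to reproduce the "excess" calculation, here is a minimal sketch, assuming we have the percentage of households in each of the seven income brackets for both the state and the nation. All numbers below are invented to resemble the DC pattern described above; they are not the actual figures behind the chart.

```python
# Hypothetical percentages of households per bracket, poorest to richest.
# The first four brackets are below $75,000, the last three above it.
national = [25, 16, 18, 10, 12, 15, 4]
dc_like = [22, 13, 16, 9, 13, 19, 8]


def excess_shares(state, nation):
    """Percentage-point difference between state and national shares.

    Positive values mean the state has an excess of households in that
    bracket; negative values mean a shortage.
    """
    if len(state) != len(nation):
        raise ValueError("state and nation must use the same brackets")
    return [s - n for s, n in zip(state, nation)]


excess = excess_shares(dc_like, national)
print(excess)
```

Because both distributions add up to 100%, the excesses always sum to zero: any surplus in the high brackets has to be balanced by a shortage somewhere else.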


I have grouped and ordered the states by the orientation of the line plots. The first group of states, boxed in red, are all similar to DC, in the sense that they have a shortage of low earners and an excess of high earners.

Some states, like Texas, Pennsylvania and Georgia, have an income distribution that almost exactly mirrors the national average. Then, those states boxed in aquamarine have a small excess of poor people and a shortage of rich people compared to the national average. Not unexpectedly, Puerto Rico is on its own.


One has to be careful with this type of data because the income distributions are highly skewed. How are the income brackets determined?

Lumping everyone in the top 4% or so (earning $200,000 or more) into one bracket obscures the tremendous income inequality even within that bracket. In fact, for my chart above, I have to decide where to put the last data point, i.e. the people earning $200,000 or more, because $200,000 or more is not a point on the horizontal axis but an open-ended range. I just used $300,000 but the better thing to do is to find out the average income within that top bracket and place the point there.


Statistical adjustment in charts

On the book blog, I often talk about the reasons why statisticians adjust data, and why it is necessary in order to paint a proper picture of what the data is saying. (See here or here.)

On this blog, I have frequently complained about how the "prior information" on maps is too strong - large regions dominate our perception regardless of the data. In the U.S., large but sparsely populated states attain disproportionate attention.

So, why not bring "statistical adjustment" to maps?


That's exactly what cartograms do. For example, look at the following pair of maps created by the people at Leicestershire County Council. (PDF link here)


The map on the left and the cartogram on the right plot identical data. The only difference is that each hexagon on the cartogram represents an equal number of people. The two views give very different impressions: the big dark green patch on the middle-right of the map -- representing a relatively sparse neighborhood -- is shrunk to a single dark green hexagon on the cartogram. Meanwhile, the most deprived areas (dark purple) which look relatively small on the map are expanded to quite a few hexagons.

According to the map, most of the county live in areas ranked in the half considered less deprived (green), and that is good news. But wait... there is a lot of purple in the cartogram!

The real piece of news is that the majority of people live in the half of the neighborhoods considered more deprived (purple) but this uncomfortable fact is well-hidden in the mostly green map on the left.
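The difference between the two views comes down to what each neighbourhood is weighted by: land area on the map, people on the cartogram. Here is a minimal sketch with made-up neighbourhoods. The areas, populations and deprivation flags are all assumptions for illustration and are not the Leicestershire data.

```python
# Hypothetical neighbourhoods: (area in km^2, population, in more-deprived half?)
# Populations are equal, mimicking units built to have similar headcounts.
neighbourhoods = [
    (40.0, 1500, False),  # large, sparse, less deprived
    (25.0, 1500, False),
    (2.0, 1500, True),    # small, dense, more deprived
    (1.5, 1500, True),
    (1.0, 1500, True),
]

AREA, POPULATION, DEPRIVED = 0, 1, 2


def deprived_share(rows, weight):
    """Fraction of the total weight (area or population) in deprived rows."""
    total = sum(row[weight] for row in rows)
    deprived = sum(row[weight] for row in rows if row[DEPRIVED])
    return deprived / total


area_share = deprived_share(neighbourhoods, AREA)        # what the map shows
pop_share = deprived_share(neighbourhoods, POPULATION)   # what the cartogram shows

print(f"share of land area in deprived neighbourhoods: {area_share:.1%}")
print(f"share of people in deprived neighbourhoods:    {pop_share:.1%}")
```

In this toy county the deprived neighbourhoods cover under a tenth of the land but hold most of the people, which is the same kind of reversal the cartogram reveals.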

Given that the measures of "deprivation" are about people, not geographical neighborhoods, the cartogram is much closer to the real world experience... notwithstanding the obvious geographical distortion introduced by the statistical adjustment.

According to Alex L., who is part of the team producing these graphics:

LSOAs were created for the 2001 [UK] Census to disseminate the data and are generally considered to represent 'neighbourhoods'. They are created to have a broadly consistent population (approx 1500 people in 2001) and socio-economic traits.


Question: Is there any reason to show the map at all?