
Someone submits a good infographic

Reader Chris P. sent me to this Mint infographic showing the income distribution in the U.S. (link). I found the second section more interesting so this post will focus on that one chart. But I want to let Chris have his word also, so we have a double post. To see Chris's comment on the chart, see here.

Here is the chart from the second section:


What do I like about this chart?

It tells a story without appealing directly to the data.  I see only 7x2 = 14 numbers on the chart, all embedded into the legend/scale. So many charts of this type send readers immediately into a twister by bombarding our eyes with data.

In the middle of the chart, for instance, states like MD and MA contrast with states like MI and MS. Poorer people are in the yellow segments while richer people are in the greener segments. So we can see that in MD and MA, the green part extends below the first horizontal gridline while in MI and MS, that gridline cuts into the orange. The implication is that there are more rich people in MD and MA than in MI and MS.

The horizontal gridlines are subtle but surprisingly functional, allowing readers to pick out the information. The gridlines divide each column into 4 equal parts, so each part is a quarter of the state population and each gridline marks a quartile. In MD and MA, at least the top 25% of the population is considered rich by national standards. Rich, as marked by the green segments in the legend, means household incomes greater than $75,000. In both those states, the top 25% earn at least $100,000.

Similarly, by looking at the color of the segment that crosses the lowest horizontal gridline, we know how much the bottom 25% earn in each state. The poorest segment seems to be smaller in AK than in other states.
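
Reading the gridlines this way amounts to asking which income bracket contains a given percentile of a state's households. Here is a minimal sketch of that lookup. The bracket labels and the shares are made up for illustration; they are not read off the Mint chart.

```python
# Hypothetical bracket labels, ordered from poorest to richest.
BRACKETS = [
    "<$25k", "$25k-$50k", "$50k-$75k", "$75k-$100k",
    "$100k-$150k", "$150k-$200k", "$200k+",
]

def bracket_at(shares, quantile):
    """Return the bracket containing the given quantile (0 to 1, counted from the poorest)."""
    cumulative = 0.0
    for label, share in zip(BRACKETS, shares):
        cumulative += share
        # Small tolerance guards against floating-point drift in the running sum.
        if quantile <= cumulative + 1e-9:
            return label
    return BRACKETS[-1]

# Made-up shares of households per bracket for one state (they sum to 1).
example_state = [0.18, 0.22, 0.18, 0.14, 0.15, 0.07, 0.06]

print(bracket_at(example_state, 0.25))  # bracket crossed by the lowest gridline
print(bracket_at(example_state, 0.75))  # bracket crossed by the highest gridline
```

With these made-up shares, the lowest gridline falls in the $25k-$50k bracket and the highest in the $100k-$150k bracket.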

The row of state boundaries at the bottom of the chart is very cute. And it encodes information, which is a wonderful touch. I believe (though haven't verified) the color of the state map tells us the mean household income within the state.


A few improvements would make this column chart better. One shouldn't place the national average above the chart horizontally using a different scale. Just place it as an additional column next to the other 50+ columns, with a slight offset and proper labeling. This allows direct lookup of how a state compares to the national average.

Also, try ordering by income inequality. The alphabetical order does the reader no favors. The ordering is particularly important because the main finding of the chart is that income distribution exhibits only moderate variability by state - most states look alike.


Given the low variability, the challenge is how to bring out the mild differences: which parts of the income distribution of which state show variance against the national average?

In the following attempt, we plot the "excess" proportion relative to the national average by state. 

For example, in the most "unequal" "state", District of Columbia (first chart), we find that it has a shortage (negative excess) of people earning below $75,000, and an excess of people earning above $75,000 when compared to the national income distribution. The proportion of "excess" increases with each higher income bracket (moving from left to right of the chart).
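
The "excess" is just a subtraction, bracket by bracket: the state's share of households minus the national share. A rough sketch follows, with invented numbers standing in for the real data (the shares below are assumptions chosen to mimic the DC pattern, not the published figures).

```python
def excess(state_shares, national_shares):
    """Per-bracket excess: state share minus national share (negative means shortage)."""
    return [round(s - n, 4) for s, n in zip(state_shares, national_shares)]

# Invented shares for seven brackets, ordered from poorest to richest.
# The first three brackets are meant to represent incomes below $75,000.
national = [0.25, 0.25, 0.18, 0.12, 0.12, 0.04, 0.04]
dc_like = [0.20, 0.21, 0.16, 0.13, 0.15, 0.07, 0.08]

result = excess(dc_like, national)
for value in result:
    print(f"{value:+.2f}")
```

Because both sets of shares add up to 100%, the excesses always sum to zero: a shortage in some brackets must be offset by an excess in others.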


I have grouped and ordered the states by the orientation of the line plots. The first group of states, boxed in red, are all similar to DC, in the sense that they have a shortage of low earners and an excess of high earners.

Some states, like Texas, Pennsylvania and Georgia, have an income distribution that almost exactly mirrors the national average. Then, those states boxed in aquamarine have a small excess of poor people and a shortage of rich people compared to the national average. Not unexpectedly, Puerto Rico is on its own.
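
I grouped the states by eye, but the same idea could be approximated with a simple rule. The sketch below is one possible rule, not the one used for the chart: it adds up the excess in the brackets above $75,000 and compares the total with a tolerance. The split after the third bracket, the 0.02 tolerance and the sample numbers are all assumptions.

```python
def classify(excess_by_bracket, first_high_bracket=3, tolerance=0.02):
    """Label a state by the total excess in its higher-income brackets."""
    high_excess = sum(excess_by_bracket[first_high_bracket:])
    if high_excess > tolerance:
        return "shortage of low earners, excess of high earners"
    if high_excess < -tolerance:
        return "excess of low earners, shortage of high earners"
    return "close to the national average"

# Invented excess values (state share minus national share) for seven brackets.
examples = {
    "DC-like": [-0.05, -0.04, -0.02, 0.01, 0.03, 0.03, 0.04],
    "average-like": [0.0, 0.005, -0.005, 0.0, 0.0, 0.0, 0.0],
    "poorer-than-average": [0.04, 0.03, 0.01, -0.02, -0.03, -0.02, -0.01],
}

for name, values in examples.items():
    print(name, "->", classify(values))
```

A single total like this ignores the shape of the line within each half, so it is a coarser grouping than looking at the small multiples directly.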


One has to be careful with this type of data because the income distributions are highly skewed. How are the income brackets determined?

Lumping everyone in the top 4% or so (earning $200,000 or more) into one bracket obscures the tremendous income inequality even within that bracket. In fact, for my chart above, I have to decide where to put the last data point, i.e. the people earning $200,000 or more, because $200,000 or more is not a point on the horizontal axis but an open-ended range. I just used $300,000, but the better thing to do is to find out the average income within that top bracket and place the point there.



Disinterested Observer

Wouldn't such data be better represented with a Lorenz curve?


A real Lorenz curve can't be derived from the published data precisely because the outer brackets were defined as "less than X" and "more than Y". We can't then recover the total income for those brackets, and for obvious reasons, not being able to get the total income for the top bracket is highly problematic.

I first tried the next best thing, which is to keep the income brackets as the horizontal axis and plot the cumulative proportion of population as the vertical axis. With that chart, readers are forced to examine the vertical distance between two curves in order to understand which part of the distribution diverges from the national average. Thus, I came to this chart where the vertical distances are plotted directly.
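
To make the cumulative version concrete, here is a small sketch that builds the two cumulative curves and the vertical gap between them at each bracket boundary. The shares are invented for illustration, and this shows only the cumulative comparison described in this reply; it is not the code behind the chart in the post.

```python
from itertools import accumulate

def cumulative_gap(state_shares, national_shares):
    """Vertical distance between the state and national cumulative curves."""
    state_cum = list(accumulate(state_shares))
    national_cum = list(accumulate(national_shares))
    return [round(s - n, 4) for s, n in zip(state_cum, national_cum)]

# Invented shares for seven brackets, ordered from poorest to richest.
national = [0.25, 0.25, 0.18, 0.12, 0.12, 0.04, 0.04]
dc_like = [0.20, 0.21, 0.16, 0.13, 0.15, 0.07, 0.08]

gaps = cumulative_gap(dc_like, national)
for gap in gaps:
    print(f"{gap:+.2f}")
```

A state with fewer low earners than the nation has a cumulative curve that sits below the national one, so the gaps are negative until both curves reach 100% at the last bracket.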

Eric Obermühlner

While going through the full history of this great blog I stumbled over this entry, and to my surprise I find that I don't agree with you.

While the chart is visually beautiful, I find several details quite annoying.

The black "NATIONAL" bar looks more like a subtitle to me. Took a while to realize it was referring to the horizontal stacked bar below it.
I agree that the national bar should be vertical and shown next to the others.

Sorting by the size of the richest group would probably greatly enhance the readability.

I found the bottom line especially confusing since I could not figure out what was encoded here (it looks too dark in many cases to be the mean - but I could be wrong). Why not label it?
