Dealing with skew

Bernard L pointed us to this income distribution chart printed in the Economist.

Economist_incomedist

The accompanying paragraph points to the range of the bars, that is, the gap between the top decile average and the bottom decile average, as evidence of income disparity, concluding that the US and Britain are among the worst.

Bernard likes the use of vertical sections to represent the average incomes by decile and dislikes the USA Today-style background image. Agreed. But why plot the middle deciles at all when the only data the argument relies on are the endpoints of the bars?

A close examination of the spacing of the middle deciles leads to more befuddlement.  There does not appear to be much difference between the countries.

The answer to this is that decile statistics are not appropriate for data as skewed as incomes.  At the high end, the 10% intervals are too coarse.

One clue to this is that the top 10% in the US earns only $90,000 on average, yet we have all heard of the billion-dollar hedge fund managers, the Wall Street bankers and the $30-million-a-movie celebrities. The problem is that within the top decile, the income distribution is also tremendously skewed.

The neat idea of plotting the vertical sections indicates an awareness that the red dots (average income) are insufficient because of the skew. Alas, a lot of skew remains within the top decile, and the designer inadvertently falls back into the same trap by plotting the average income of the top 10%. Thus, the amount of disparity on the right side of the chart is grossly underestimated. Roughly speaking, we are looking at 10 samples of the distribution, nine of which sit at the low end of the range and only one at the top end (the long tail). Here is the idea:

Redo_disparity
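
To make the point about averaging within the top decile concrete, here is a minimal sketch using synthetic incomes. It assumes a Pareto distribution with an arbitrary scale and shape; these numbers are illustrative assumptions, not estimates from the Economist or OECD data. The idea is only to show how far the top-decile average can sit below the incomes at the very top.

```python
# Synthetic illustration only: the Pareto scale and shape below are
# assumptions chosen for the example, not fitted to any real income data.

def pareto_incomes(n=10_000, scale=10_000.0, shape=2.0):
    """Return n incomes at evenly spaced quantiles of a Pareto distribution, ascending."""
    return [scale / (1.0 - (i + 0.5) / n) ** (1.0 / shape) for i in range(n)]

def mean(values):
    return sum(values) / len(values)

incomes = pareto_incomes()
n = len(incomes)
size = n // 10

# Average income within each of the ten deciles (what the chart appears to plot).
decile_means = [mean(incomes[k * size:(k + 1) * size]) for k in range(10)]
top_decile_mean = decile_means[-1]

# Finer slices of the tail that the decile average hides.
top_1pct_mean = mean(incomes[-(n // 100):])
top_income = incomes[-1]

for k, m in enumerate(decile_means, start=1):
    print(f"decile {k:2d}: average {m:10,.0f}")
print(f"top 1% average: {top_1pct_mean:,.0f}")
print(f"highest income: {top_income:,.0f}")
```

In this toy distribution, the average of the top 1% is several times the average of the top 10%, and the single highest income is larger still, which is the disparity that a decile average smooths away.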


Reference: "Spreading the wealth", Economist, Oct 21 2008.

Comments

Andreas

The skewness is the reason why the Gini coefficient is the only really good measure of income inequality. I therefore suggest two small multiples: average income and Gini coefficients.

derek

It doesn't say on the graphic that the ticks in the bar represent average within decile. If The Economist has done it right, the ticks should be the decile boundaries, i.e. the maximum within the decile below or, alternatively speaking, the minimum within the decile above. That way, the bar segments would represent the range within each decile.

"Average within decile" would be silly, although it would at least let you graph meaningful statistics for the zeroth and tenth deciles, which otherwise amount to zero and infinity respectively.

Kaiser

Derek: that's exactly how I found out that the ticks were not decile boundaries. It just could not be possible that the highest-income person in the US made 90k. One big problem with this graph is the lack of explanation of what it means by "decile".

Dave T

derek,
As you say, I expect a bar to represent a range. The lowest would start at zero and the highest would extend to infinity (or near enough). You could eliminate the two extreme bars with no loss of information and just plot the middle eight. Less clutter.
But the Economist chart has nine bars, none of which go to zero or infinity.
The only sense I can make of it is the ten boundaries represent the average within decile. In which case the presence of the bars is very confusing. They represent the range between adjacent averages?
Better to plot just dots at the boundaries (or vertical ticks, since dots would be cluttered at the low end for Turkey and Mexico).
One could do cumulative distribution curves with the ten dots, but then it is harder to get 19 series on one chart.

Chris Jackson

Looks like an ideal dataset to illustrate with "density strips": shaded strips with darkness related to probability density. These show the whole distribution with no need for tick marks, black in the middle and fading from grey to white in the tails. See my paper that's just come out in The American Statistician here.

derek

Dave, Kaiser, you're right, there are nine bars and ten ticks, when there should be at most eight bars and nine ticks. I ought to have counted them before posting.

How bizarre. I wonder if this was done by someone on the Economist's payroll who'd seen such things but was "unclear on the concept"?

ZBicyclist

This is too hard to compare. I think Andreas is right -- they should at least have Gini coefficients.

jcukier

http://dx.doi.org/10.1787/420721018310

that's the original graph from us.
It's certainly not the most obvious graph ever, but

a) it comes with an explanatory paragraph, and
b) it comes from a book (http://www.oecdbookshop.org/oecd/display.asp?sf1=identifiers&st1=9789264044180) which has tons of figures on income inequality, such as http://dx.doi.org/10.1787/420515624534 gini coefficients, and tons of others.

Anyway. Economist, please give proper citation information. You know you should.

ZBicyclist

jcukier, I am SO LOST. I must be doing something simple wrong.

I looked at the spreadsheet you referenced above
http://dx.doi.org/10.1787/420721018310

But I do not understand the numbers. I've transposed Turkey so I can post it here:

TUR
P10 1,312
P20 1,004
P30 776
P40 752
P50 777
P60 936
P70 1,115
P80 1,425
P90 2,353
P100 12,211

I'm expecting the income at the first decile (P10) to be the lowest, the income at the second decile (P20) to be the second lowest, etc. but these numbers aren't rising monotonically.

Obviously I'm misunderstanding SOMETHING, but I don't know what. If you see this message, and are feeling helpful, straighten me out.

Kaiser

ZB: Good catch. I cannot figure this out either.

Bill Towne

I'm pretty sure that the numbers are the differences between the deciles.

So the 20th %-tile would be 1312+1004 and

the 30th %-tile would be 1312+1004+776 and so on.

Bill

Jon Peltier

I agree that there must be some kind of confusion with the data and its treatment in the Economist chart. Average of each decile, that's a bit tricky.

Chris J - Your paper is interesting, and I'll have to read it more closely later. "Displaying Uncertainty" is an interesting phrase. As you approach the ends of the shaded bars, it becomes more and more uncertain where the ends actually end. This is a complaint about the data bars in Excel 2007, which are not trying to show uncertainty, but in fact are supposed to represent known values.

The shading does seem pretty effective, though. I generally make cumulative distribution charts for my own consumption, but I'm reluctant to hand them to someone else. I usually have to explain them more than once, and then wonder if the other person really gets it.

derek

Jon, have you ever thought of producing cumulative distribution charts as a waterfall chart, with a simple histogram bar graph underneath it for comparison? I have no idea if this will work well, as I just thought of it myself, but it seems as if the connection between a bar in the waterfall and a bar in the histogram would be intuitively obvious at a glance.

Jon Peltier

Derek -

I did some cumulative distribution charts (I'll probably post them tomorrow). At first I made a histogram type chart, where the height of each bar was the cumulative total of the decile values in that country's data. This didn't allow useful comparisons. I imagine using floating bars to represent each decile's value would look fine for one set of data, but it also would not allow for effective comparisons.

I then assumed the decile value represented the mid percentile's value (i.e., I used the first decile's value for the 5th percentile) and continued my analysis with XY charts.

Jon Peltier

I've written up my own analysis of this data and posted it in "How do you display a lopsided distribution?" I welcome any comments on my approach.

Fabrice

I posted a Boxplot version of the chart here:
http://sparklines-excel.blogspot.com/

Some info on the different deciles is lost in the process but the chart is easier to read.

jcukier

ZBicyclist, I got it.
The table on the 2nd tab is what has been used to make the graph. Since it is a "stacked column" chart, the values are not absolute values but correspond to the height of each bar segment, i.e. the value called P20 is really "P20-P10".

So Turkey would be:
TUR
P10 1,312
P20 2,316
P30 3,092
P40 3,844
P50 4,621
P60 5,557
P70 6,673
P80 8,098
P90 10,451
P100 22,662
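
The conversion from stacked-column heights to absolute values is just a running total. A minimal sketch, using the Turkey figures as posted earlier in this thread (the variable names are made up for the example):

```python
from itertools import accumulate

# Stacked-column segment heights for Turkey, as posted in the thread.
labels = ["P10", "P20", "P30", "P40", "P50", "P60", "P70", "P80", "P90", "P100"]
stacked = [1312, 1004, 776, 752, 777, 936, 1115, 1425, 2353, 12211]

# Running total: each absolute value is the sum of all segments up to that point.
cumulative = list(accumulate(stacked))

for label, value in zip(labels, cumulative):
    print(f"{label:>4} {value:,}")
```

Summing the rounded segments this way matches the list above through P60 and comes out 1 lower from P70 onward (6,672 rather than 6,673, for example), presumably because the spreadsheet values were rounded before being posted.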
