
Listed below are links to weblogs that reference Dealing with skew:

Comments

Andreas

The skewness is the reason why the Gini coefficient is the only really good measure of income inequality. I therefore suggest two small multiples: average income and Gini coefficients.
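For anyone who wants to try the Gini suggestion, here is a minimal Python sketch. It uses one common closed-form expression for a sorted sample; the function name `gini` and the sample incomes are illustrative assumptions, not figures from the chart.

```python
def gini(values):
    """Gini coefficient of a list of non-negative incomes.

    Uses G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,
    where x_1 <= ... <= x_n are the sorted incomes.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Made-up incomes, for illustration only.
equal_incomes = [30, 30, 30, 30]
one_takes_all = [0, 0, 0, 120]

print(gini(equal_incomes))   # perfectly equal incomes give 0
print(gini(one_takes_all))   # one person holding everything gives (n - 1) / n
```

With this discrete formula the maximum is (n - 1) / n rather than exactly 1, so very small samples understate extreme inequality.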

derek

It doesn't say on the graphic that the ticks in the bar represent average within decile. If The Economist has done it right, the ticks should be the decile boundaries, i.e. the maximum within the decile below or, alternatively speaking, the minimum within the decile above. That way, the bar segments would represent the range within each decile.

"Average within decile" would be silly, although it would at least let you graph meaningful statistics for the zeroth and tenth deciles, which otherwise amount to zero and infinity respectively.

Kaiser

Derek: that's exactly how I found out that the ticks were not decile boundaries. It just could not be possible that the highest income person in the US made 90k. One big problem with this graph is the lack of explanation of what it means by "decile".
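The difference between decile boundaries and within-decile averages can be sketched in a few lines of Python. The incomes below are made up for illustration, and splitting the sorted list into ten equal groups is an assumed, simplified way of forming deciles.

```python
import statistics

# Made-up incomes for 20 people, already sorted (illustration only).
incomes = [12, 15, 18, 20, 22, 24, 26, 28, 30, 33,
           36, 40, 44, 48, 53, 60, 70, 85, 120, 400]

# Decile boundaries: the nine cut points that split the data into ten groups.
boundaries = statistics.quantiles(incomes, n=10)

# Averages within each decile: ten numbers, one per group of equal size.
group_size = len(incomes) // 10
decile_means = [
    statistics.mean(incomes[i * group_size:(i + 1) * group_size])
    for i in range(10)
]

print(len(boundaries), "boundaries:", boundaries)
print(len(decile_means), "averages:", decile_means)

# The top-decile average (260 here) is well below the highest income (400),
# so a top tick at an average says nothing about the maximum.
print(decile_means[-1], max(incomes))
```

Ten ticks per country is therefore consistent with averages, while true boundaries would give only nine.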

Dave T

derek,
As you say, I expect a bar to represent a range. The lowest would start at zero and the highest would extend to infinity (or near enough). You could eliminate the two extreme bars with no loss of information and just plot the middle eight. Less clutter.
But the Economist chart has nine bars, none of which go to zero or infinity.
The only sense I can make of it is the ten boundaries represent the average within decile. In which case the presence of the bars is very confusing. They represent the range between adjacent averages?
Better to plot just dots at the boundaries (or vertical ticks, since dots would be cluttered at the low end for Turkey and Mexico).
One could do cumulative distribution curves with the ten dots, but then it is harder to get 19 series on one chart.

Chris Jackson

Looks like an ideal dataset to illustrate with "density strips": shaded strips with darkness related to probability density. These show the whole distribution with no need for tick marks, black in the middle and fading from grey to white in the tails. See my paper that's just come out in The American Statistician.

derek

Dave, Kaiser, you're right, there are nine bars and ten ticks, when there should be at most eight bars and nine ticks. I ought to have counted them before posting.

How bizarre. I wonder if this was done by someone on the Economist's payroll who'd seen such things but was "unclear on the concept"?

ZBicyclist

This is too hard to compare. I think Andreas is right -- they should at least have Gini coefficients.

jcukier

http://dx.doi.org/10.1787/420721018310

that's the original graph from us.
It's certainly not the most obvious graph ever, but

a) it comes with an explanatory paragraph, and
b) it comes from a book (http://www.oecdbookshop.org/oecd/display.asp?sf1=identifiers&st1=9789264044180) which has tons of figures on income inequality, such as http://dx.doi.org/10.1787/420515624534 gini coefficients, and tons of others.

Anyway. Economist, please give proper citation information. You know you should.

ZBicyclist

jcukier, I am SO LOST. I must be doing something simple wrong.

I looked at the spreadsheet you referenced above
http://dx.doi.org/10.1787/420721018310

But I do not understand the numbers. I've transposed Turkey so I can post it here:

TUR
P10 1,312
P20 1,004
P30 776
P40 752
P50 777
P60 936
P70 1,115
P80 1,425
P90 2,353
P100 12,211

I'm expecting the income at the first decile (P10) to be the lowest, the income at the second decile (P20) to be the second lowest, etc. but these numbers aren't rising monotonically.

Obviously I'm misunderstanding SOMETHING, but I don't know what. If you see this message, and are feeling helpful, straighten me out.

Kaiser

ZB: Good catch. I cannot figure this out either.

Bill Towne

I'm pretty sure that the numbers are the differences between the deciles.

So the 20th percentile would be 1312+1004, and

the 30th percentile would be 1312+1004+776, and so on.

Bill

Jon Peltier

I agree that there must be some kind of confusion with the data and its treatment in the Economist chart. Average of each decile, that's a bit tricky.

Chris J - Your paper is interesting, and I'll have to read it more closely later. "Displaying Uncertainty" is an interesting phrase. As you approach the ends of the shaded bars, it becomes more and more uncertain where the ends actually end. This is a complaint about the data bars in Excel 2007, which are not trying to show uncertainty, but in fact are supposed to represent known values.

The shading does seem pretty effective, though. I generally make cumulative distribution charts for my own consumption, but I'm reluctant to hand them to someone else. I usually have to explain them more than once, and then wonder if the other person really gets it.

derek

Jon, have you ever thought of producing cumulative distribution charts as a waterfall chart, with a simple histogram bar graph underneath it for comparison? I have no idea if this will work well, as I just thought of it myself, but it seems as if the connection between a bar in the waterfall and a bar in the histogram would be intuitively obvious at a glance.

Jon Peltier

Derek -

I did some cumulative distribution charts (I'll probably post them tomorrow). At first I made a histogram type chart, where the height of each bar was the cumulative total of the decile values in that country's data. This didn't allow useful comparisons. I imagine using floating bars to represent each decile's value would look fine for one set of data, but it also would not allow for effective comparisons.

I then assumed the decile value represented the mid percentile's value (i.e., I used the first decile's value for the 5th percentile) and continued my analysis with XY charts.
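The midpoint assumption can be sketched as follows in Python. The decile values here are hypothetical placeholders, not the OECD figures, and the variable names are only for illustration.

```python
# Hypothetical decile values for one country (illustration only).
decile_values = [10, 14, 17, 20, 23, 27, 32, 39, 50, 90]

# Place the k-th decile's value at the middle of its percentile range:
# 1st decile -> 5th percentile, 2nd -> 15th, ..., 10th -> 95th.
mid_percentiles = [10 * k + 5 for k in range(len(decile_values))]

# (percentile, value) pairs ready to plot as an XY series.
xy_points = list(zip(mid_percentiles, decile_values))

for percentile, value in xy_points:
    print(percentile, value)
```

Each country then becomes one XY series, which makes it easier to overlay several countries than stacked or floating bars.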

Jon Peltier

I've written up my own analysis of this data and posted it in "How do you display a lopsided distribution?" I welcome any comments on my approach.

Fabrice

I posted a boxplot version of the chart here:
http://sparklines-excel.blogspot.com/

Some info on the different deciles is lost in the process but the chart is easier to read.

jcukier

ZBicyclist, I got it.
The table on the second tab is what was used to make the graph. Since it is a "stacked column" chart, the values are not absolute values but correspond to the height of each bar segment, i.e. the value called P20 is really "P20-P10".

So Turkey would be:
TUR
P10 1,312
P20 2,316
P30 3,092
P40 3,844
P50 4,621
P60 5,557
P70 6,673
P80 8,098
P90 10,451
P100 22,662
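To check this reading, a short Python sketch can accumulate the stacked values posted earlier in the thread for Turkey. The variable names are only for illustration, and the inputs are the rounded numbers as they appear in that comment.

```python
from itertools import accumulate

# Stacked-column heights for Turkey, as posted earlier in the thread
# (rounded to whole units in that comment).
labels = ["P10", "P20", "P30", "P40", "P50",
          "P60", "P70", "P80", "P90", "P100"]
increments = [1312, 1004, 776, 752, 777, 936, 1115, 1425, 2353, 12211]

# Running total: each level is the sum of all bar segments up to it.
levels = list(accumulate(increments))

for label, level in zip(labels, levels):
    print(f"{label} {level:,}")

# Unlike the raw increments, the cumulative levels rise monotonically.
assert all(a < b for a, b in zip(levels, levels[1:]))
```

Summing the rounded increments gives totals that are one lower than the figures above from P70 onward (6,672 rather than 6,673, for example), presumably because the spreadsheet holds unrounded values.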


Marketing analytics and data visualization expert. Author and Speaker. Currently at Vimeo and NYU. See my full bio.
