## Visualizing uneven distributions

##### Jan 30, 2014

Jeff, a reader of the blog, asks for comment on this blog post of his (link).

The highlight of the post is this chart, which shows an uneven distribution.

The message of the chart is that a large share of donations (about 25%) came from the top 3 percent of donors. This is a long-tailed distribution, and quite typical of data that have to do with financial matters. Many of us encounter this type of data, so how to present it is a general problem.
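For readers who want to check a statement like "the top X% of donors gave Y% of the total" on their own data, here is a minimal Python sketch. The function name `top_share` and the donation amounts are made up for illustration; they are not Jeff's data.

```python
def top_share(amounts, top_fraction):
    """Share of the total contributed by the largest `top_fraction` of donors."""
    ranked = sorted(amounts, reverse=True)
    # Count at least one donor as "the top", even for a tiny fraction.
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)


# Hypothetical donations: two large gifts, a few medium ones, many small ones.
donations = [5000, 3000] + [400] * 8 + [100] * 90

print(f"Top 2% of donors give {top_share(donations, 0.02):.0%} of the total")
```

Sorting from largest to smallest and accumulating is all there is to it; the long tail shows up as a small fraction of donors accounting for a large share.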

One of the insights from Jeff's post is that with some tricks, one can generate a chart that looks like the above using Excel. This is pretty impressive, and he credits Peltier for the pointer.

***

Now, let's see if there are other ways to present this data. One issue I have with the chart is that the most important statistics are found in the text labels. These are of the form: "X% of customers contribute Y% of revenues". So, in effect, there are two relevant data series: the share of people and the share of revenues.

The following is a stacked column chart:

Here, the information is primarily encoded in the dotted guide lines between the two columns. It has the advantage of showing both the absolute share of people as well as of revenues, plus showing the uneven distribution between the two data series.

But it is also less fun to look at. The advantage of the original chart is that one can imagine that all the donors are being lined up along the horizontal axis from those who gave the least to those who gave the most. That's a pretty powerful mental picture. The weakness of the original is that few of us can mentally tally up the strangely shaped areas to learn the share of revenues.

***

The next version is a kind of profile chart:

I like this one because it places the two data series on equal footing, and allows for efficient comparison of the two sets of proportions. It also has the feature of showing all the shares, just like the stacked columns.

PS. Jeff has taken some of his readers' comments into account, and has evolved his original design to this one:

I can see these changes:

• Customers are now ordered with the most important on the left and the least important on the right. To me, a neutral change.
• The vertical axis is labelled "subscription value" instead of "How much do we get for each subscription". This is a slight improvement, using fewer words to convey the same point.
• The breakpoints have been set differently to split the revenues into five, so that each segment now accounts for exactly 20% of the revenues. I actually prefer the original segmentation: that one visually picks out the breakpoints in the data, so it is empirical rather than canonical. Look at the split between the gray and the yellow segments in the new chart. Does it make sense to split customers with the same subscription value into two groups?


Hi Kaiser. Great post. I think your redesigns are insightful. But I think that the audience that my original chart was for – an executive leadership team – might struggle to intrinsically understand what they show – particularly the bottom one.

To be sure, I’ve included quite a few different metrics in the data labels of my original (which I’ve subsequently redesigned to flow from largest customer to smallest, as you can see at the original blog post). In fact, there are four different bits of information in each data label:
* Size of revenue bucket;
* Share and absolute amount of customers in each bucket;
* Share and absolute amount of revenue from each bucket (which is only relevant if you’ve grouped the customers into different sized buckets, like I did in one of the versions on the blog); and
* average sub

I could have put that extra information in a table below the chart. But putting it on the chart – in my opinion – was a much better design choice: they don’t have to move their eyes around, and this approach clearly illustrates some very important commercial aspects of their business. Conversely, putting less information on the chart would have required putting more information in the text. And that in my opinion would have slowed down the time it took to absorb this stuff, because they'd have to play chart tennis...moving their eyes back and forth, back and forth, back and forth...

I could have used different chart types – or even multiple charts – to encapsulate that information graphically, like you have. But I’m not sure my audience would have to work any less hard to fathom it as a result. I doubt they’d truly ‘get’ your redesigns if I wasn’t there to guide them through it. (And I wasn’t). Even if I was there, I’m not sure they would really fathom the insights that your charts contain.

If I were in the room presenting this stuff to the leadership team in person, what I’d probably do is have a Powerpoint slide that plots one series at a time, and as each series is added, the presentation would go something like this:

[Chart with no series]:
“Let’s take a look at just how diverse our customer base is”.

[Chart with just the key account customers, with no data labels]
“Now this group is our key customers. We really want to look after them, because even though there are only 6 of them, they contribute 25% of our revenue. Notice that there’s actually a heck of a lot of variation within even these 6…the lowest sub is around \$15k and the highest is close to double that, with an average of around \$21k.”

[Chart now shows the key account series AND the Large Customer series]
“Now these 27 customers account for the next 29% of our revenue. And check out just how much…”

And so forth.

The audience would really get that, I think. My data labels do the exact same thing.

One approach (may not be practical in Excel) would be to stick with the original chart, but discretize both the x and y axis, so instead of looking at areas you're looking at stacks of blocks. Each block would represent a certain amount of revenue, and each column of blocks would be a certain number of subscribers. Some advantages:
* It may be more clear that you should count blocks than that you should estimate areas.
* By sticking to solid blocks (i.e. rounding the data), you get the reader thinking in quantities that are small enough to be memorable. Like 19 blocks that are \$10,000 each, rather than \$192k.
* It would look pretty neat - especially if instead of blocks they were appropriate icons of some sort.
* It's basically the same chart, so the presentation of it would be little altered.
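A rough Python sketch of the block idea, outside Excel: group the ranked subscribers into columns of a fixed size, then round each column's revenue to whole blocks. The function name, the column size, the block value, and the sample figures are all assumptions for illustration.

```python
def block_columns(values, per_column, block_value):
    """Number of whole blocks in each column of `per_column` subscribers.

    Assumes integer amounts; each block stands for `block_value` of revenue.
    """
    ranked = sorted(values, reverse=True)
    columns = []
    for start in range(0, len(ranked), per_column):
        column_total = sum(ranked[start:start + per_column])
        # Round to the nearest whole block (halves round up).
        columns.append((column_total + block_value // 2) // block_value)
    return columns


# Made-up subscription values, two subscribers per column, $10,000 per block.
subs = [30_000, 21_000, 15_000, 9_000, 4_000, 2_000]
for blocks in block_columns(subs, per_column=2, block_value=10_000):
    print("#" * blocks)
```

A single column worth \$192k comes out as 19 blocks under this rounding, matching the example in the list above.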

That said, in a presentation, it might work better to generate several charts, each clearly emphasizing one point, rather than one complex one that does everything.

Jeff: thanks for the comments. I have included your latest version with my additional comments in a postscript above. I think you should stick to the original segmentation.

If you are talking about business managers, the stacked column chart is likely to do the best job. This is a familiar chart type to many, and it gets to your key message directly.

Your strategy works but the actual chart becomes a sideshow because all the information is delivered in the data labels and the verbal communications.

Good point re the segmentation. These graphs are actually produced from a template I've developed in Excel that allows you to change the breakpoints between groupings, or even add more or fewer groupings depending on what your data shows. It only takes seconds to do this, and all the labels, leader lines, and series update instantly. It's a fun wee tool to play with, and very handy for trying out different groupings.

It would have been a better demo had I kept those original groupings, rather than using equal groupings of 20% revenue contribution.

In fact, the original post used 25% groupings, then went on to say:

Using revenue ‘buckets’ of 25% was a fairly arbitrary choice. What if we designed a chart template that let you dynamically choose different sized revenue buckets, as well as let you use more buckets if you wanted to? This template allows you to do just that.

God knows why I jumped back to quintiles. Was so busy trying to get the chart template looking as crisp as I could that I forgot about using the best example possible to demo the concept. Doh!

Thanks again for the critique.
