Bar-density and pie-density plots for showing relative proportions

In my last post, I described a bar-density chart for showing paired proportions that follow an 80/20-type rule. The following example illustrates that a small proportion of YouTubers generate a large proportion of views.

Jc_redo_youtube_bar_2col

Other examples of this type of data include:

  • The top 10% of families own 75% of U.S. household wealth (link)
  • The top 1% of artists earn 77% of recorded music income (link)
  • Five percent of AT&T customers consume 46% of the bandwidth (link)

In all these examples, the message of the data is the importance of a small number of people (top earners, superstars, bandwidth hogs). A good visual should call out this message.

The bar-density plot consists of two components:

  • the bar chart, which shows the distribution of the data (views, wealth, income, bandwidth) among segments of people;
  • the Voronoi diagram embedded within each bar, which encodes the relative importance of each segment as the (inverse) density of people within it. A segment is more important if each individual accounts for more of the data or, in other words, if the density of people within the segment is lower.
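To make the "inverse density" idea concrete, here is a minimal R sketch. It assumes the per-thousand shares of creators and views that appear in the nc and wide vectors of the code later in this post; the variable names are mine, for illustration only. It divides each segment's share of views by its share of creators, so a larger ratio means fewer people behind each unit of views.

```r
# per-thousand shares, copied from the nc and wide vectors in the code below
creators <- c(red = 3, yellow = 33, gray = 965)
views    <- c(red = 378, yellow = 276, gray = 346)

# convert both to proportions so they are comparable
creator_share <- creators / sum(creators)
view_share    <- views / sum(views)

# share of views per unit share of creators:
# above 1 means the segment accounts for more views than its headcount suggests
importance <- view_share / creator_share
print(round(importance, 2))
```

The ratio falls steeply from the red segment to the gray one, which is the contrast the point densities in the bars are meant to show.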

The bar chart can adopt a more conventional horizontal layout.

Jc_redo_youtube_bar_h_2col

Voronoi tessellation

To understand the Voronoi diagram, think of a fixed number (say, 100) of randomly placed points inside a bar. Every location inside the bar has a nearest neighbor among those 100 fixed points, so assign each location to its nearest fixed point. Then draw a boundary around each of the 100 points that encloses all the locations assigned to it. The resulting tessellation is the Voronoi diagram. (The following illustration comes from this AMS column.)

Ams_voronoi
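The nearest-neighbor rule can also be tried out directly. The following is a brute-force sketch in base R, not the spatstat approach used later in this post: it lays a grid over a bar and assigns every grid location to its nearest seed point. The bar dimensions, the number of seeds, and the seed value are arbitrary assumptions chosen for illustration.

```r
set.seed(1)                      # arbitrary seed, for a repeatable layout
bar_w <- 100; bar_h <- 50        # hypothetical bar dimensions
n_seeds <- 10                    # hypothetical number of fixed points

# randomly placed seed points inside the bar
seeds <- data.frame(x = runif(n_seeds, 0, bar_w),
                    y = runif(n_seeds, 0, bar_h))

# a regular grid of locations covering the bar (one per unit square)
grid <- expand.grid(x = seq(0.5, bar_w - 0.5, by = 1),
                    y = seq(0.5, bar_h - 0.5, by = 1))

# squared distance from every grid location (rows) to every seed (columns)
d2 <- outer(grid$x, seeds$x, "-")^2 + outer(grid$y, seeds$y, "-")^2

# index of the nearest seed for each grid location
grid$cell <- max.col(-d2, ties.method = "first")

# the count of grid locations per seed approximates the area of its Voronoi cell
cell_area <- table(factor(grid$cell, levels = seq_len(n_seeds)))
print(cell_area)
```

Plotting the grid colored by grid$cell, for instance with plot(grid$x, grid$y, col = grid$cell, pch = 15, asp = 1), gives a pixelated version of the tessellation. An exact tessellation function such as dirichlet in spatstat replaces the grid with true polygon boundaries.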

 

The density of points in the respective bars encodes the relative proportions of people within those groups. For my example, I placed 6 points in the red bar, 66 points in the yellow bar, and about 2,000 points (1,930, following the nc vector in the code below) in the gray bar, which closely matches the relative proportions of creators in the three segments.

Density is represented statistically

Notice that the density is represented statistically, not empirically. According to the annotation on the original chart, the red bar represents 14,000 super-creators, while the gray bar represents 4.5 million creators. Any attempt to plot each creator as an individual piece would result in a much less impactful graphic. If the representation is interpreted statistically, as relative densities within each people segment, the message of relative importance of the units within each group is appropriately conveyed.

A more sophisticated way of deciding how many points to place in the red bar remains to be developed. Here, I just used the convenient number of 6.
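One simple rule, sketched here as an assumption rather than as the method behind the charts, is to fix the number of points in the smallest segment and scale the other segments in proportion. The helper scale_points and its arguments are hypothetical names; the creator shares are the per-thousand values from the code later in this post.

```r
# hypothetical helper: give the smallest segment `min_points` points
# and scale the other segments so the proportions are preserved
scale_points <- function(segment_sizes, min_points) {
  round(segment_sizes / min(segment_sizes) * min_points)
}

# per-thousand creator shares, copied from the nc vector in the code below
creators <- c(red = 3, yellow = 33, gray = 965)

n_points <- scale_points(creators, min_points = 6)
print(n_points)
```

Raising min_points makes the density in the smallest segment less dependent on where a handful of random points happen to land, at the cost of many more pieces in the largest segment.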

The color shades are applied at random to the tessellation pieces, and are there to make the densities easier to read.

***

In this section, I provide R code for those who want to explore this some more. This is prototype code, and you're welcome to improve it. The general strategy is as follows:

  • Set the rectangular area (bar) in which the Voronoi diagram is to be embedded. The length of the bar is set to the proportion of views, appropriately scaled. The spatstat package represents this rectangle as a window object, created with the owin function.
  • Set the number of points (n) to be embedded in the bar, determined by the relative proportion of creators, appropriately scaled. Generate a data frame containing the x-y coordinates of n randomly placed points within the rectangle defined above.
  • Use the as.ppp function to turn the points and the window into a point pattern, then the dirichlet function to compute the Voronoi tessellation.
  • Set up a colormap (the colourmap function) for plotting the Voronoi diagram.
  • Plot the Voronoi diagram, assigning shades at random to the pieces. (In production code, these random numbers should be stored as marks, as in the commented-out line in the code, but it's easier to play around with the shades if they are set in the plot call.)

The code generates a separate chart for each bar segment. A post-processing step is currently required to align the bars so that they have equal height. I haven't figured out whether the multiplot option helps here.

library(spatstat)

# enter the scaled proportions of creators and views
# the Youtube example has three creator segments

# number of randomly generated points should be proportional to proportion of creators. Multiply nc by a scaling factor if desired

nc = c(3, 33, 965)*2

# bar widths should be proportional to proportion of views
# total width should be set based on the width of your page

wide = c(378, 276, 346)/2

# set bar height, to attain a particular aspect ratio
bar_h = 50

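# define function to generate the tessellation for one bar:
# n random points inside a wide-by-height rectangular window

makepoints = function (n, wide, height) {
    # x-y coordinates of n uniformly random points inside the bar
    df <- data.frame(x = runif(n, 0, wide), y = runif(n, 0, height))
    W <- owin(c(0, wide), c(0, height)) # rectangular window
    pp1 <- as.ppp(df, W) # point pattern on that window
    y <- dirichlet(pp1) # Voronoi (Dirichlet) tessellation of the points
    # y$marks <- sample(0:wide, n, replace=TRUE) # marks are for colors
    return (y)
}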

y_red = makepoints(nc[1], wide[1], bar_h) # height of each bar fixed
y_yel = makepoints(nc[2], wide[2], bar_h)
y_gry = makepoints(nc[3], wide[3], bar_h)

# setting colors (4 shades per bar, one color per bar)

cr_red = colourmap(c("lightsalmon","lightsalmon2", "lightsalmon4", "brown"), breaks=round(seq(0, wide[1],length.out=5)))

cr_yel = colourmap(c("burlywood1", "burlywood2", "burlywood3", "burlywood4"), breaks=round(seq(0, wide[2],length.out=5)))

cr_gry = colourmap(c("gray80", "gray60", "gray40", "gray20"), breaks=round(seq(0, wide[3],length.out=5)))

# plotting

par(mar=c(0,0,0,0))


# to save the image, wrap the plot calls in png(...) and dev.off()

# remove values= if the colors are set as marks in makepoints

plot(y_red, main="", border="pink3", do.col=TRUE, values=sample(0:wide[1], nc[1], replace=TRUE), col=cr_red, xlim=c(0, wide[1]), ylim=c(0,bar_h), ribbon=FALSE)

plot(y_yel, main="", border="darkgoldenrod4", do.col=TRUE, values=sample(0:wide[2], nc[2], replace=TRUE), col=cr_yel, xlim=c(0, wide[2]), ylim=c(0,bar_h), ribbon=FALSE)

plot(y_gry, main="", border="darkgray", do.col=TRUE, values=sample(0:wide[3], nc[3], replace=TRUE), col=cr_gry, xlim=c(0, wide[3]), ylim=c(0,bar_h), ribbon=FALSE)

# because of random points, the tessellation looks different each time
# post-processing: make each bar the same height when aligned side by side

***

A cousin of the bar-density plot is the pie-density plot. Since I'm using only three creator segments, each of which accounts for about 30-40% of the total views, it is natural to use a pie chart. In this case, we embed the Voronoi diagrams into the pie sectors.

Jc_redo_youtube_pie_lobsided

If the distribution were more even, that is to say, if the creators were more or less equally important, the pie-density plot would look like this:

Redo_jc_youtube_pie_even

***

Something that is more like 80/20

The original chart shows the top 0.3 percent generating almost 40 percent of the views. A more typical insight is that the top X percent generate 80 percent of the data. For the YouTube data, X is 11 percent. What does the pie-density chart look like if the top 11 percent account for 80 percent, the middle 33 percent for 11 percent, and the bottom 56 percent for 8 percent?

Jc_youtube_8020_barh_pie

Roughly speaking, the second segment includes 3 times as many people as the top segment (33 versus 11 percent), and the third includes 5 times as many (56 versus 11 percent).

P.S.

1) Check out my first LinkedIn "article" on this topic.

2) The first post on bar-density charts is here.

Comments

Peter H

Creative solution, though I think many readers would have trouble interpreting the graph at first glance.

For example, the second pie chart ("if the distribution were more even...") really helps clarify how to read the first. Which begs the question, if the second pie chart is necessary, is the first really clear by itself?

I think I like the mirrored bar charts better, which are more easily understood. Though neither solution helps me quickly get a "feel" for the data. I think this is just a classically hard relationship to convey graphically.

Peter H

Perhaps a cumulative density plot is the best way to show this?

Something like: http://support.sas.com/documentation/cdl/en/procstat/66703/HTML/default/images/ex35out.png

Kaiser

PH: That's a "Lorenz" curve, which I address in the LinkedIn article as one of the more popular options. It's another option in which the large group of least important people gets the most attention, so I'm not too happy with it either. I liked it until I realized how hard it is to explain to non-technical people.

Jamie Briggs

I think this is very interesting, conceptually.

Looking at the first bar, I find it very difficult to think anything other than "one person made a *lot* of content, and five other people made quite a bit too".

I am really struggling to imagine how this can move from academic exercise to useful real-world visualization in a way that's more effective than the simpler methods explored.

Jon Peltier

I had the same impression as Jamie. How about if you cheat with the Voronoi graphic and space the points more uniformly before drawing the cell boundaries?

Kaiser

JB & JP: Thanks for the valuable comments.

Let me provide some more behind-the-scenes thinking that didn't make it to the blog post.

1) The fact that you are drawn to the large creators tells me that the chart is succeeding in its key objective. The problem with all of the common ways of visualizing 80/20 data is that they fail to bring this message out. Instead, those charts say "look, here is a very small group of creators" and "look, in aggregate, this group makes a lot of content". In all the real-world cases in which I or my analysts are presenting 80/20 data, we want to draw attention to the big key accounts.

2) The common visualizations are not "simpler." They appear simpler to those of us who have learned how to read the Lorenz curve or the stacked bar charts. On the occasions when I had to explain those to someone who does not have the background, it was clear that they are not easy to understand. Part of the reason, I suspect, is that those charts do not visually tell the full story - the reader has to infer in his/her head that there are a few really big creators because (a) there are a small number of them and (b) in aggregate, they make up a big chunk of views.

3) It initially bothered me that a reader may be tempted to read the size of each piece of tessellation literally, even though the encoding of size into density is a deliberate decision to visualize with less precision, much like encoding anything into color gradation. But I believe it is better than the alternatives.

JP suggested dividing each bar/sector into equal areas. This change does not solve the problem if the reader insists on interpreting the size of each piece of tessellation. In fact, in real datasets, it is quite likely that the sizes are unevenly distributed with a heavy skew, especially when the number of pieces is small. So the real critique is that the relative sizes of the pieces do not reflect the true relative sizes of the individual creators.

4) This last question was mentioned briefly in the original post: the designer can choose to make the tessellation pieces reflect the exact relative sizes of individual creators. I believe the extra work is not merited. The appearance of the graphic will not change substantially. Further, when the total number of units is large, as in the case of YouTube, the added precision is destructive.

When the number of units is large, smoothing is a common solution. (This is the extra work.) You'd produce a smoothed version of the views distribution within each segment. Then divide each bar/sector into a specific number of pieces. Next, you have to solve the problem of creating a tessellation that fits the pieces into the bar/sector.

5) One reason why I ended up liking the Voronoi approach is that its lack of precision forces the reader to think conceptually about the existence of some very important creators. The irregular shapes and sizes make it impossible to dwell on the specific individual comparisons.

If the goal of the data visualization is to highlight the specific individual contributions of the top creators, then a simple bar chart showing Top 10 Creators is clearly superior. If the goal of the data visualization is to provide specific, precise data on the proportions of creators and views, then a data table is clearly superior.

My goal here is to come up with a visual design that directly tells the story of YouTube stars, the super-rich, superstars, etc.; that is more intuitive to a non-technical audience; and that is engaging.

