Watching a valiant effort to rescue the pie chart

Today we return to the basics. In a Twitter exchange with Dean E., I found the following pie chart in an Atlantic article about who's buying San Francisco real estate:

Atlantic_sfrealestatepie

The pie chart is great at one thing, showing how workers in the software industry accounted for half of the real estate purchases. (Dean and I both want to see more details of the analysis as we have many questions about the underlying data. In this post, I ignore these questions.)

After that, if we want to learn anything else from the pie chart, we have to read the data labels. This calls for one of my key recommendations: make your charts self-sufficient. The principle of self-sufficiency is that the visual elements of the data graphic should by themselves say something about the data. The test of self-sufficiency is executed by removing the data printed on the chart so that one can assess how much work the visual elements are performing. If the visual elements require data labels to work, then the data graphic is effectively a lookup table.

This is the same pie chart, minus the data:

Redo_atlanticsfrealestate_sufficiency

Almost all pie charts with a large number of slices are packed with data labels. Think of the labeling as a corrective action to fix the shortcoming of the form.

Here is a bar chart showing the same data:

Junkcharts_redo_atlanticsfrealestatebar

***

Let's look at all the efforts made to overcome the lack of self-sufficiency.

Here is a zoom-in on the left side of the chart:

Redo_atlanticsfrealestate_labeling_1

Data labels are necessary to help readers perceive the sizes of the slices. But as the slices get smaller, the labels become too dense, so the guiding lines have to be stretched.

Eventually, the designer gave up on labeling every slice. You can see that some slices are missing labels:

Redo_atlanticsfrealestate_labeling_3

The designer also had to give up on sequencing the slices by the data. For example, Hardware, with a value of 2.4%, should be placed between Education and Law. Instead, it is shifted to the top left side to make the labeling easier.

Redo_atlanticsfrealestate_labeling_2

Fitting all the data labels to the slices becomes the singular task at hand.


The unreasonable effect of chart labels

While I was discussing the bar-density and pie-density charts with a buddy (thanks LB!), he pointed out that the labeling is a challenge. And he's right.

Here is the pie-density chart for the YouTube views with the labels as originally conceived.

Kaiserfung_piedensity_youtube_orig_labels

These labels are trying too hard to provide precise data to the reader.

Here are some simplified labels that get at the message rather than the data:

Kaiserfung_piedensity_youtube_labels_2b


Here is a slightly different version:

Kaiserfung_piedensity_youtube_labels_3b



Bar-density and pie-density plots for showing relative proportions

In my last post, I described a bar-density chart to show paired data of proportions with an 80/20-type rule. The following example illustrates that a small proportion of YouTubers generate a large proportion of views.

Jc_redo_youtube_bar_2col

Other examples of this type of data include:

  • the top 10% of families own 75% of U.S. household wealth (link)
  • the top 1% of artists earn 77% of recorded music income (link)
  • 5% of AT&T customers consume 46% of the bandwidth (link)

In all these examples, the message of the data is the importance of a small number of people (top earners, superstars, bandwidth hogs). A good visual should call out this message.

The bar-density plot consists of two components:

  • the bar chart which shows the distribution of the data (views, wealth, income, bandwidth) among segments of people;
  • the embedded Voronoi diagram within each bar, which encodes the relative importance of each people segment, as measured by the (inverse) density of the population within the segment. A people segment is more important if each individual accounts for more of the data, or in other words, if the density of people within the group is lower.

The bar chart can adopt a more conventional horizontal layout.

Jc_redo_youtube_bar_h_2col

Voronoi tessellation

To understand the Voronoi diagram, think of a fixed number (say, 100) of randomly placed points inside a bar. Every location inside the bar area has a nearest neighbor among those 100 fixed points. Assign every location on the surface to its nearest fixed point. From this, one can draw a boundary around each of the 100 points that encloses all the locations assigned to it. The resulting tessellation is the Voronoi diagram. (The following illustration comes from this AMS column.)

Ams_voronoi
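
The nearest-neighbor rule can be sketched in a few lines of base R. This is only an illustration of the definition, not the code used for the charts: the bar dimensions, the seed, and the helper name nearest_generator are assumptions made up for this example, and a one-unit grid stands in for "every location on the surface".

```r
# Sketch: the nearest-neighbor rule that defines a Voronoi diagram.
set.seed(1)                    # hypothetical seed, for reproducibility only
bar_w <- 189                   # hypothetical bar width
bar_h <- 50                    # hypothetical bar height
n_gen <- 100                   # number of fixed, randomly placed points

# the fixed points, scattered uniformly inside the bar
gen <- data.frame(x = runif(n_gen, 0, bar_w), y = runif(n_gen, 0, bar_h))

# a one-unit grid of locations standing in for the whole bar surface
grid <- expand.grid(x = seq(0.5, bar_w - 0.5, by = 1),
                    y = seq(0.5, bar_h - 0.5, by = 1))

# index of the fixed point closest to the location (px, py)
nearest_generator <- function(px, py, gen) {
  which.min((gen$x - px)^2 + (gen$y - py)^2)
}

# assign every grid location to its nearest fixed point
cell <- mapply(nearest_generator, grid$x, grid$y, MoreArgs = list(gen = gen))

# each grid location covers one unit of area, so the counts approximate
# the area of each Voronoi cell
cell_area <- table(factor(cell, levels = seq_len(n_gen)))
```

All grid locations that share the same value of cell form one piece of the tessellation; the boundaries between pieces are where the nearest fixed point changes.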

The density of points in the respective bars encodes the relative proportions of people within those groups. For my example, I placed 6 points in the red bar, 66 points in the yellow bar, and ~2000 points in the gray bar, which represents the relative proportions of creators in the three segments.
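
As a rough check of the arithmetic, the point counts and the resulting densities can be computed from the scaled proportions used in the R code in this post (nc, wide and bar_h). The variable names n_points, bar_area, density and area_per_point are introduced here for illustration only.

```r
# Sketch: how many points go into each bar, and how dense each bar ends up.
creators <- c(red = 3, yellow = 33, gray = 965)     # scaled proportions of creators
scale_factor <- 2                                   # same multiplier as in the post's code
n_points <- creators * scale_factor                 # 6, 66, 1930

wide  <- c(red = 378, yellow = 276, gray = 346) / 2 # bar widths (scaled proportions of views)
bar_h <- 50                                         # common bar height
bar_area <- wide * bar_h

density        <- n_points / bar_area   # points per unit area in each bar
area_per_point <- bar_area / n_points   # inverse density: larger means each creator matters more

round(area_per_point, 1)
```

The inverse density falls steeply from the red bar to the gray bar, which is the visual message: each creator in the red segment accounts for far more of the views.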

Density is represented statistically

Notice that the density is represented statistically, not empirically. According to the annotation on the original chart, the red bar represents 14,000 super-creators. Correspondingly, there are 4.5 million creators in the gray bar. Any attempt to plot those as individual pieces will result in a much less impactful graphic. If the representation is interpreted statistically, as relative densities within each people segment, the message of relative importance of the units within each group is appropriately conveyed.

A more sophisticated way of deciding how many points to place in the red bar is to be developed. Here, I just used the convenient number of 6.

The color shades are randomly applied to the tessellation pieces, and used to facilitate reading of densities.

***

In this section, I provide R code for those who want to explore this some more. This is code used for prototyping, and you're welcome to improve it. The general strategy is as follows:

  • Set the rectangular area (bar) in which the Voronoi diagram is to be embedded. The length of the bar is set to the proportion of views, appropriately scaled. This requires setting up an owin object (from the spatstat package) to represent the rectangle.
  • Set the number of points (n) to be embedded in the bar, determined by the relative proportion of creators, appropriately scaled. Generate a data frame containing the x-y coordinates of n randomly placed points within the rectangle defined above.
  • Use the as.ppp function to turn the points into a point pattern, then the dirichlet function to generate the Voronoi data
  • Set up a colormap for plotting the Voronoi diagram
  • Plot the Voronoi diagram; assign shades at random to the pieces (in production code, these random numbers should be set as marks in the ppp, but it's easier to play around with the shades if placed here)

The code generates separate charts for each bar segment. A post-processing step is currently required to align the bars to attain equal height. I haven't figured out whether the multiplot option helps here.

library(spatstat)

# enter the scaled proportions of creators and views
# the Youtube example has three creator segments

# number of randomly generated points should be proportional to proportion of creators. Multiply nc by a scaling factor if desired

nc = c(3, 33, 965)*2

# bar widths should be proportional to proportion of views
# total width should be set based on the width of your page

wide = c(378, 276, 346)/2

# set bar height, to attain a particular aspect ratio
bar_h = 50

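# define a function that scatters n random points inside a bar
# and returns the Voronoi (Dirichlet) tessellation of that bar

makepoints = function (n, wide, height) {
    # n uniformly random points inside the wide-by-height rectangle
    df <- data.frame(x = runif(n, 0, wide), y = runif(n, 0, height))
    W <- owin( c(0, wide), c(0, height) ) # rectangular window
    pp1 <- as.ppp( df, W ) # point pattern within the window
    y <- dirichlet(pp1) # Voronoi tessellation of the point pattern
    # y$marks <- sample(0:wide, n, replace=T) # marks are for colors
    return (y)
}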

y_red = makepoints(nc[1], wide[1], bar_h) # height of each bar fixed
y_yel = makepoints(nc[2], wide[2], bar_h)
y_gry = makepoints(nc[3], wide[3], bar_h)

# setting colors (4 shades per bar, one color per bar)

cr_red = colourmap(c("lightsalmon","lightsalmon2", "lightsalmon4", "brown"), breaks=round(seq(0, wide[1],length.out=5)))

cr_yel = colourmap(c("burlywood1", "burlywood2", "burlywood3", "burlywood4"), breaks=round(seq(0, wide[2],length.out=5)))

cr_gry = colourmap(c("gray80", "gray60", "gray40", "gray20"), breaks=round(seq(0, wide[3],length.out=5)))

# plotting

par(mar=c(0,0,0,0))


# add png to save image to png

# remove values= if colors set in ppp

plot.tess(y_red, main="", border="pink3", do.col=T, values = sample(0:wide[1], nc[1], replace=T), col=cr_red, xlim=c(0, wide[1]), ylim=c(0,bar_h), ribbon=F)

plot.tess(y_yel, main="", border="darkgoldenrod4", do.col=T, values=sample(0:wide[2], nc[2], replace=T), col=cr_yel, xlim=c(0, wide[2]), ylim=c(0,bar_h), ribbon=F)

plot.tess(y_gry, main="", border="darkgray", do.col=T, values=sample(0:wide[3], nc[3], replace=T), col=cr_gry, xlim=c(0, wide[3]), ylim=c(0,bar_h), ribbon=F)

# because of random points, the tessellation looks different each time
# post-processing: make each bar the same height when aligned side by side
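
One possible way to avoid the manual alignment step is to draw all three bars in a single device using base R's layout(), with panel widths proportional to the bar widths. The sketch below uses plain rectangles as stand-ins for the tessellations; whether the tessellation plots can be dropped into the same panels (for instance by adding them to an existing panel) is an assumption I have not verified. The names fill and n_panels are made up for this example.

```r
# Sketch: three bars side by side in one device, with equal heights.
wide  <- c(378, 276, 346) / 2                      # bar widths, as in the code above
bar_h <- 50                                        # common bar height
fill  <- c("lightsalmon", "burlywood2", "gray60")  # placeholder fills, one per bar

op <- par(mar = c(0, 0, 0, 0))

# one row of three panels; panel widths are proportional to the bar widths
n_panels <- layout(matrix(1:3, nrow = 1), widths = wide)

for (i in seq_along(wide)) {
  # asp = 1 gives one data unit the same physical size in x and y; since
  # panel width divided by bar width is the same for every panel, the
  # three bars come out at the same physical height
  plot(NA, xlim = c(0, wide[i]), ylim = c(0, bar_h), asp = 1,
       axes = FALSE, xlab = "", ylab = "", xaxs = "i", yaxs = "i")
  # stand-in for the tessellation of bar i (hypothetical substitution point)
  rect(0, 0, wide[i], bar_h, col = fill[i], border = NA)
}

layout(1)   # back to a single panel
par(op)     # restore the previous margins
```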

***

A cousin of the bar-density plot is the pie-density plot. Since I'm using only three creator segments, which each account for about 30-40% of the total views, it is natural to use a pie chart. In this case, we embed the Voronoi diagrams into the pie sectors.

Jc_redo_youtube_pie_lobsided

If the distribution were more even, that is to say, if the creators were more or less equally important, the pie-density plot would look like this:

Redo_jc_youtube_pie_even

***

Something that is more like 80/20

The original chart shows the top 0.3 percent generating almost 40 percent of the views. A more typical insight is that the top X percent generates 80 percent of the data. For the YouTube data, X is 11 percent. What does the pie-density chart look like if top 11 percent <-> 80 percent, middle 33 percent <-> 11 percent, bottom 56 percent <-> 8 percent?

Jc_youtube_8020_barh_pie

Roughly speaking, the second segment includes 3 times as many people as the top segment, and the third includes 5 times as many.
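
The ratios can be checked with a couple of lines of R, using the percentages quoted above. The variable names people_ratio and views_per_person are introduced here for illustration.

```r
# Sketch: the 80/20-style split, using the percentages quoted in the text.
people <- c(top = 11, middle = 33, bottom = 56)  # percent of creators
views  <- c(top = 80, middle = 11, bottom = 8)   # percent of views (sums to 99 after rounding)

# size of each segment relative to the top segment
people_ratio <- people / people[["top"]]

# share of views per percent of creators: a rough measure of how much
# each individual in the segment matters
views_per_person <- views / people

round(people_ratio, 1)
round(views_per_person, 2)
```

The second quantity is what the Voronoi pieces encode: the larger the share of views per creator, the larger the pieces inside that segment.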

P.S.

1) Check out my first LinkedIn "article" on this topic.

2) The first post on bar-density charts is here.


Finding simple ways to explain complicated data and concepts, using some Pew data

A reader submitted the following chart from Pew Research for discussion.

Pew_ST-2014-09-24-never-married-08

The reader complained that this chart was difficult to comprehend. What are some of the reasons?

The use of color is superfluous. Each line is a "cohort" of people being tracked over time. Each cohort is given its own color or hue. But the color or hue does not signify much.

The dotted lines. This design element requires a footnote to explain. The reader learns that some of the numbers on the chart are projections because those numbers pertain to time well into the future. The chart was published in 2014 using historical data, so any numbers dated 2014 or after (and even some data before 2014) will be projections. The data are in fact encoded in the dots, not the slopes. Look at the cohort that has one solid line segment and one dotted line segment - it's unclear which of those three data points are projections, and which are observed.

The focus on within-cohort trends. The line segments indicate the desire of the designer to emphasize trends within each cohort. However, it's not clear what the underlying message is. It may be that more and more people are not getting married (i.e. fewer people are getting married). That trend affects each of the three age groups - and it's easier to paint that message by focusing on between-cohort trends.

***
Here is a chart that emphasizes the between-cohort trends.

Redo_jc_pewmarriagebyage

A key decision is to not mix oil and water. The within-cohort analysis is presented in its own chart, next to the between-cohort analysis. It turns out that some of the gap between cohorts can be explained by people deferring marriage to later in life. The steep line on the right indicates that a bigger proportion of people now get married between 35 and 44 than in previous cohorts.

I experimented a bit with the axes here. Several pie charts are used in lieu of axis labels. I also plotted a dual axis with the proportion of unmarried on the one side, and the corresponding proportion of married on the other side.


Education deserts: places without schools still serve pies and story time

I very much enjoyed reading The Chronicle's article on "education deserts" in the U.S., defined as places where there are no public colleges within reach of potential students.

In particular, the data visualization deployed to illustrate the story is superb. For example, this map shows 1,500 colleges and their "catchment areas" defined as places within 60 minutes' drive.

Screenshot-2018-8-22 Who Lives in Education Deserts More People Than You Might Think 2

It does a great job walking through the logic of the analysis (even if the logic may not totally convince - more below). The areas not within reach of these 1,500 colleges are labeled "deserts". They then take Census data and look at the adult population in those deserts:

Screenshot-2018-8-22 Who Lives in Education Deserts More People Than You Might Think 4

This leads to an analysis of the racial composition of the people living in these "deserts". We now arrive at the only chart in the sequence that disappoints. It is a pair of pie charts:

Chronicle_edudesserts_pie

The color scheme makes it hard to pair up the pie slices. The focus of the chart should be on the over- or under-representation of races in education deserts relative to the U.S. average. The challenge of this dataset is the coexistence of one large number and many small numbers.

Here is one solution:

Redo_jc_chronedudesserts

***

The Chronicle made a commendable effort to describe this social issue. But the analysis has a lot of built-in assumptions. Look at the following list and see if you agree with the assumptions:

  • Only public colleges are considered. This restriction requires the assumption that the private colleges pretty much serve the same areas as public colleges.
  • Only non-competitive colleges are included. Precisely, the acceptance rate must be higher than 30 percent. The underlying assumption is that the "local students" won't be interested in selective colleges. It's not clear how the 30 percent threshold was decided.
  • Colleges that are more than 60 minutes' driving distance away are considered unreachable. So the assumption is that "local students" are unwilling to drive more than 60 minutes to attend college. This raises a couple of other questions: are we only looking at commuter colleges with no dormitories? Is the 60 minutes' driving distance based on actual roads and traffic speeds, or some kind of simple model with stylized geometries and fixed speeds?
  • The demographic analysis is based on all adults living in the Census "blocks" that are not within 60 minutes' drive of one of those colleges. But if we are calling them "education deserts" focusing on the availability of colleges, why consider all adults, and not just adults in the college age group? One further hidden assumption here is that the lack of colleges in those regions has not caused young generations to move to areas closer to colleges. I think a map of the age distribution in the "education deserts" will be quite telling.
  • Not surprisingly, the areas classified as "education deserts" lag the rest of the nation on several key socio-economic metrics, like median income, and proportion living under the poverty line. This means those same areas could be labeled income deserts, or job deserts.

At the end of the piece, the author creates a "story time" moment. Story time is when you are served a bunch of data or analyses, and then when you are about to doze off, the analyst calls story time, and starts making conclusions that stray from the data just served!

Story time starts with the following sentence: "What would it take to make sure that distance doesn’t prevent students from obtaining a college degree? "

The analysis provided has nowhere shown that distance has prevented students from obtaining a college degree. We haven't seen anything that says that people living in the "education deserts" have fewer college degrees. We don't know that distance is the reason why people in those areas don't go to college (if true) - what about poverty? We don't know if 60 minutes is the hurdle that causes people not to go to college (if true). We know the number of adults living in those neighborhoods but not the number of potential students.

The data only showed two things: 1) which areas of the country are not within 60 minutes' driving of the subset of public colleges under consideration, 2) the number of adults living in those Census blocks.

***

So we have a case where the analysis is incomplete but the visualization of the analysis is superb. In our Trifecta analysis, this chart poses a nice question and has nice graphics but the use of data can be improved. (Type QV)


The downside of discouraging pie charts

It's no secret most dataviz experts do not like pie charts.

Our disdain for pie charts causes people to look for alternatives.

Sometimes, the alternative is worse. Witness:

Schwab_bloombergaggregatebondindex

This chart comes from the Spring 2018 issue of On Investing, the magazine for Charles Schwab customers.

It's not a pie chart.

Redo_jc_bondindex

I'm forced to say the pie chart is preferred.

The original chart fails the self-sufficiency test. Here is the 2007 chart with the data removed.

Bloombergbondindex_sufficiency

It's very hard to figure out how large those pieces are, so any reader trying to understand this chart will resort to reading the data, which means the visual representation does no work!

Or, you can use a dot plot.

Redo_jc_bondindex2

This version emphasizes the change over time.


Beauty is in the eyes of the fishes

Reader Patrick S. sent in this old gem from Germany.

Swimmingpoolsvisitors_ger

He said:

It displays the change in numbers of visitors to public pools in the German city of Hanover. The invisible y-axis seems to be, um, nonlinear, but at least it's monotonic, in contrast to the invisible x-axis.

There's a nice touch, though: The eyes of the fish are pie charts. Black: outdoor pools, white: indoor pools (as explained in the bottom left corner).

It's taken from a 1960 publication of the city of Hanover called *Hannover: Die Stadt in der wir leben*.

This is the kind of chart that Ed Tufte made (in)famous. The visual elements do not serve the data at all, except for the eyeballs. The design becomes a mere vessel for the data table. The reader who wants to know the growth rate of swimmers has to do a tank of work.

The eyeballs though.

I like the fact that these pie charts do not come with data labels. This part of the chart passes the self-sufficiency test. In fact, the eyeballs contain the most interesting story in this chart. In those four years, the visitors to public pools switched from mostly indoor pools to mostly outdoor pools. These eyeballs show that pie charts can be effective in specific situations.

Now, Hanover fishes are quite lucky to have free admission to the public pools!


Doing my duty on Pi Day #onelesspie

Xan Gregg and I started a #onelesspie campaign a few years ago. On Pi Day each year, we find a pie chart, and remake it. On Wikipedia, you can find all manner of pie charts. Try this search, and see for yourself.

Here's one found on the Wiki page about the town of Ogema, in Canada:

Ogema_Stats_canada_pie_chart

This chart has 20 age groups, each given a different color. That's way too many!

I was able to find data on 10-year age groups, not five. But the "shape" of the distribution is much more easily seen on a column chart (a histogram).

Redo_ogema_age_distribution

Only a single color is needed.

The reason why I gravitated to this chart was the highly unusual age distribution... this town has an almost uniform distribution of age groups, with each of the 10-year ranges accounting for about 11% of the population. Given that there are 9 groups, a perfectly even distribution would be 11% for each column. (Well, the last group of 80+ is cheating a bit as it spans more than 10 years.)

I don't know about Ogema. Maybe a reader can explain this unusual age distribution!


Steel tariffs, and my new dataviz seminar

I am developing a new seminar aimed at business professionals who want to improve their ability to communicate using charts. I want any guidance to be tool-agnostic, so that attendees can implement it using Excel if that’s their main charting software. Over the 12+ years that I’ve been blogging, certain ideas keep popping up; I have collected these motifs and organized them for the seminar. This post is about a recent chart that brings up a few of these motifs.

This chart has been making the rounds in articles about the steel tariffs.

2018.03.08steel_1

The chart shows the Top 10 nations that sell steel to the U.S., which together account for 78% of all imports. 

The chart shows a few signs of design. These things caught my eye:

  1. the pie chart on the left delivers the top-line message that 10 countries account for almost 80% of all U.S. steel imports
  2. the callout gives further information about which 10 countries and how much each nation sells to the U.S. This is a nice use of layering
  3. on the right side, progressive tints of blue indicate the respective volumes of imports

On the negative side of the ledger, the chart is marred by three small problems. Each of these problems concerns inconsistency, which creates confusion for readers.

  1. Inconsistent use of color: on the left side, the darker blue indicates lower volume while on the right side, the darker blue indicates higher volume
  2. Inconsistent coding of pie slices: on the right side, the percentages add up to 78% while the total area of the pie is 100%
  3. Inconsistent scales: the left chart carrying the top-line message is notably smaller than the right chart depicting the secondary message. Readers’ first impression is drawn to the right chart.

Easy fixes lead to the following chart:

Redo_steelimports_1

***

The central idea of the new dataviz seminar is that there are many easy fixes that are often missed by the vast majority of people making Excel charts. I will present a stack of these motifs. If you're in the St. Louis area, you get to experience the seminar first. Register for a spot here.

Send this message to your friends and coworkers in the area. Also, contact me if you'd like to bring this seminar to your area.

***

I also tried the following design, which brings out some other interesting tidbits, such as that Canada and Brazil together sell the U.S. about 30% of its imported steel, the top 4 countries account for about 50% of all steel imports, etc. Color is introduced on the chart via a stylized flag coloring.

Redo_steelimports_2