Making better pie charts if you must

I saw this chart on an NYU marketing twitter account:

LATAMstartupCEO_covidimpact

The graphical design is not easy on our eyes. It's just hard to read for various reasons.

The headline sounds like a subject line from an email.

The subheaders are long, and differ only by a single word.

Even if one prefers pie charts, they can be improved by following a few guidelines.

First, start the first sector at the 12-oclock direction. Like this:

Redo_junkcharts_latamceo_orientation

The survey uses a 5-point scale from "Very Good" to "Very Bad". Instead of using five different colors, it's better to use two extreme colors and shading. Like this:

Redo_junkcharts_latamceo_color

I also try hard to keep all text horizontal.

Redo_junkcharts_latamceo_labels

For those who prefers not to use pie charts, a side-by-side bar chart works well.

Redo_junkcharts_latamceo_bars

In my article for DataJournalism.com, I outlined "unspoken rules" for making various charts, including pie charts.

 

 

 


Why you should expunge the defaults from Excel or (insert your favorite graphing program)

Yesterday, I posted the following chart in the post about Cornell's Covid-19 case rate after re-opening for in-person instruction.

Redo_junkchats_fraziercornellreopeningsuccess2

This is an edited version of the chart used in Peter Frazier's presentation.

Pfrazier_cornellreopeningupdate

The original chart carries with it the burden of Excel defaults.

What did I change and why?

I switched away from the default color scheme, which ignores the relationships between the two lines. In particular, the key comparison on this chart should be the actual case rate versus the nominal case rate. In addition, the three lines at the top are related as they all come from the same underlying mathematical model. I used the same color but different shades.

Also, instead of placing the legend as far away from the data labels as possible, I moved the line labels next to the data labels.

Instead of daily date labels, I moved to weekly labels, and set the month names on a separate level than the day names.

The dots were removed from the top three lines but I'd have retained them, perhaps with some level of transparency, if I spent more time making the edits. I'd definitely keep the last dot to make it clear that the blue lines contain one extra dot.

***

Every graphing program has defaults, typically computed by some algorithm tuned to the average chart. Don't settle for the average chart. Get rid of any default setting that slows down understanding.

 

 


Election visual 3: a strange, mash-up visualization

Continuing our review of FiveThirtyEight's election forecasting model visualization (link), I now look at their headline data visualization. (The previous posts in this series are here, and here.)

538_topchartofmaps

It's a set of 22 maps, each showing one election scenario, with one candidate winning. What chart form is this?

Small multiples may come to mind. A small-multiples chart is a grid in which every component graphic has the same form - same chart type, same color scheme, same scale, etc. The only variation from graphic to graphic is the data. The data are typically varied along a dimension of interest, for example, age groups, geographic regions, years. The following small-multiples chart, which I praised in the past (link), shows liquor consumption across the world.

image from junkcharts.typepad.com

Each component graphic changes according to the data specific to a country. When we scan across the grid, we draw conclusions about country-to-country variations. As with convention, there are as many graphics as there are countries in the dataset. Sometimes, the designer includes only countries that are directly relevant to the chart's topic.

***

What is the variable FiveThirtyEight chose to vary from map to map? It's the scenario used in the election forecasting model.

This choice is unconventional. The 22 scenarios is a subset of the 40,000 scenarios from the simulation - we are left wondering how those 22 are chosen.

Returning to our question: what chart form is this?

Perhaps you're reminded of the dot plot from the previous post. On that dot plot, the designer summarized the results of 40,000 scenarios using 100 dots. Since Biden is the winner in 75 percent of all scenarios, the dot plot shows 75 blue dots (and 25 red).

The map is the new dot. The 75 blue dots become 16 blue maps (rounded down) while the 25 red dots become 6 red maps.

Is it a pictogram of maps? If we ignore the details on the maps, and focus on the counts of colors, then yes. It's just a bit challenging because of the hole in the middle, and the atypical number of maps.

As with the dot plot, the map details are a nice touch. It connects readers with the simulation model which can feel very abstract.

Oddly, if you're someone familiar with probabilities, this presentation is quite confusing.

With 40,000 scenarios reduced to 22 maps, each map should represent 1818 scenarios. On the dot plot, each dot should represent 400 scenarios. This follows the rule for creating pictograms. Each object in a pictogram - dot, map, figurine, etc. - should encode an equal amount of the data. For the 538 visualization, is it true that each of the six red maps represents 1818 scenarios? This may be the case but not likely.

Recall the dot plot where the most extreme red dot shows a scenario in which Trump wins 376 out of 538 electoral votes (margin = 214). Each dot should represent 400 scenarios. The visualization implies that there are 400 scenarios similar to the one on display. For the grid of maps, the following red map from the top left corner should, in theory, represent 1,818 similar scenarios. Could be, but I'm not sure.

538_electoralvotemap_topleft

Mathematically, each of the depicted scenario, including the blowout win above, occurs with 1/40,000 chance in the simulation. However, one expects few scenarios that look like the extreme scenario, and ample scenarios that look like the median scenario.  

So, the right way to read the 538 chart is to ignore the map details when reading the embedded pictogram, and then look at the small multiples of detailed maps bearing in mind that extreme scenarios are unique while median scenarios have many lookalikes.

(Come to think about it, the analogous situation in the liquor consumption chart is the relative population size of different countries. When comparing country to country, we tend to forget that the data apply to large numbers of people in populous countries, and small numbers in tiny countries.)

***

There's a small improvement that can be made to the detailed maps. As I compare one map to the next, I'm trying to pick out which states that have changed to change the vote margin. Conceptually, the number of states painted red should decrease as the winning margin decreases, and the states that shift colors should be the toss-up states.

So I'd draw the solid Republican (Democratic) states with a lighter shade, forming an easily identifiable bloc on all maps, while the toss-up states are shown with a heavier shade.

Redo_junkcharts_538electoralmap_shading

Here, I just added a darker shade to the states that disappear from the first red map to the second.


Ask how you can give

A reader and colleague Georgette A was frustrated with the following graphic that appeared in the otherwise commendable article in National Geographic (link). The NatGeo article provides a history lesson on past pandemics that killed millions.

Natgeo_pandemichistory

What does the design want to convey to readers?

Our attention is drawn to the larger objects, the red triangle on the left or the green triangle on the right. Regarding the red triangle, we learn that the base is the duration of the pandemic while the height of the black bar represents the total deaths.

An immediate curiosity is why a green triangle is lodged in the middle of the red triangle. Answering this question requires figuring out the horizontal layout. Where we expect axis labels we find an unexpected series of numbers (0, 16, 48, 5, 2, 4, ...). These are durations that measure the widths of the triangular bases.

To solve this puzzle, imagine the chart with the triangles removed, leaving just the black columns. Now replace the durations with index numbers, 1 to 13, corresponding to the time order of the ending years of these epidemics. In other words, there is a time axis hidden behind the chart. [As Ken reminded me on Twitter, I forgot to mention that details of each pandemic are revealed by hovering over each triangle.]

This explains why the green triangle (Antonine Plague) is sitting inside the large red triangle (Plague of Justinian). The latter's duration is 3 times that of the former, and the Antonine Plague ended before the Plague of Justinian. In fact, the Antonine occurred during 165-180 while the Justinian happened during 541-588. The overlap is an invention of the design. To receive what the design gives, we have to think of time as a sequence, not of dates.

***

Now, compare the first and second red triangles. Their black columns both encode 50 million deaths. The Justinian Plague however was spread out over 48 years while the Black Death lasted just 5 years. This suggests that the Black Death was more fearsome than the Justinian Plague. And yet, the graphic presents the opposite imagery.

This is a pretty tough dataset to visualize. Here is a side-by-side bar chart that lets readers first compare deaths, and then compare durations.

Redo_natgeo_pandemichistory

In the meantime, I highly recommend the NatGeo article.


Working with multiple dimensions, an example from Germany

An anonymous reader submitted this mirrored bar chart about violent acts by extremists in the 16 German states.

Germanextremists_bars

At first glance, this looks like a standard design. On a second look, you might notice what the reader discovered- the chart used two different scales, one for each side. The left side (red) depicting left-wing extremism is artificially compressed relative to the right side (blue). Not sure if this reflects the political bias of the publication - but in any case, this distortion means the only way to consume this chart is to read the numbers.

Even after fixing the scales, this design is challenging for the reader. It's unnatural to compare two years by looking first below then above. It's not simple to compare across states, and even harder to compare left- and right-wing extremism (due to mirroring).

The chart feels busy because the entire dataset is printed on it. I appreciate not including a redundant horizontal axis. (I wonder if the designer first removed the axis, then edited the scale on one side, not realizing the distortion.) Another nice touch, hidden in the legend, is the country totals.

I present two alternatives.

The first is a small-multiples "bumps chart".

Redo_junkcharts_germanextremists_sidebysidelines

Each plot presents the entire picture within a state. You can see the general level of violence, the level of left- and right-wing extremism, and their year-on-year change. States can be compared holistically.

Several German state names are rather long, so I explored a horizontal orientation. In this case, a connected dot plot may be more appropriate.

Redo_junkcharts_germanextremists_dots

The sign of a good multi-dimensional visual display is whether readers can easily learn complex relationships. Depending on the question of interest, the reader can mentally elevate parts of this chart. One can compare the set of blue arrows to the set of red arrows, or focus on just blue arrows pointing right, or red arrows pointing left, or all arrows for Berlin, etc.

 

[P.S. Anonymous reader said the original chart came from the Augsburger newspaper. This link in German contains more information.]


The discontent of circular designs

You have two numbers +84% and -25%.

The textbook method to visualize this pair is to plot two bars. One bar in the positive direction, the other in the negative direction. The chart is clear (more on the analysis later).

Redo_pbs_mask1

But some find this graphic ugly. They don’t like straight lines, right angles and such. They prefer circles, and bends. Like PBS, who put out the following graphic that was forwarded to me by Fletcher D. on twitter:

Maskwearing_racetrack

Bending the columns is not as simple as it seems. Notice that the designer adds red arrows pointing up and down. Because the circle rounds onto itself, the sense of direction is lost. Now, readers must pick up the magnitude and the direction separately. It doesn’t help that zero is placed at the bottom of the circle.

Can we treat direction like we would on a bar chart? Make counter-clockwise the negative direction. This is what it looks like:

Redo_pbsmaskwearing

But it’s confusing. I made the PBS design worse because now, the value of each position on the circle depends on knowing whether the arrow points up or down. So, we couldn’t remove those red arrows.

The limitations of the “racetrack” design reveal themselves in similar data that are just a shade different. Here are a couple of scenarios to ponder:

  1. You have growth exceeding 100%. This is a hard problem.
  2. You have three or more rates to compare. Making one circle for each rate quickly becomes cluttered. You may make a course with multiple racetracks. But anyone who runs track can tell you the outside lanes are not the same distance as the inside. I wrote about this issue in a long-ago post (see here).

***

For a Trifecta Checkup (link), I'd also have concerns about the analytics. There are so many differences between the states that have required masks and states that haven't - the implied causality is far from proven by this simple comparison. For example, it would be interesting to see the variability around these averages - by state or even by county.


Bubble charts, ratios and proportionality

A recent article in the Wall Street Journal about a challenger to the dominant weedkiller, Roundup, contains a nice selection of graphics. (Dicamba is the up-and-comer.)

Wsj_roundup_img1


The change in usage of three brands of weedkillers is rendered as a small-multiples of choropleth maps. This graphic displays geographical and time changes simultaneously.

The staircase chart shows weeds have become resistant to Roundup over time. This is considered a weakness in the Roundup business.

***

In this post, my focus is on the chart at the bottom, which shows complaints about Dicamba by state in 2019. This is a bubble chart, with the bubbles sorted along the horizontal axis by the acreage of farmland by state.

Wsj_roundup_img2

Below left is a more standard version of such a chart, in which the bubbles are allowed to overlap. (I only included the bubbles that were labeled in the original chart).

Redo_roundupwsj0

The WSJ’s twist is to use the vertical spacing to avoid overlapping bubbles. The vertical axis serves a design perogative and does not encode data.  

I’m going to stick with the more traditional overlapping bubbles here – I’m getting to a different matter.

***

The question being addressed by this chart is: which states have the most serious Dicamba problem, as revealed by the frequency of complaints? The designer recognizes that the amount of farmland matters. One should expect the more acres, the more complaints.

Let's consider computing directly the number of complaints per million acres.

The resulting chart (shown below right) – while retaining the design – gives a wholly different feeling. Arkansas now owns the largest bubble even though it has the least acreage among the included states. The huge Illinois bubble is still large but is no longer a loner.

Redo_dicambacomplaints1

Now return to the original design for a moment (the chart on the left). In theory, this should work in the following manner: if complaints grow purely as a function of acreage, then the bubbles should grow proportionally from left to right. The trouble is that proportional areas are not as easily detected as proportional lengths.

The pair of charts below depict made-up data in which all states have 30 complaints for each million acres of farmland. It’s not intuitive that the bubbles on the left chart are growing proportionally.

Redo_dicambacomplaints2

Now if you look at the right chart, which shows the relative metric of complaints per million acres, it’s impossible not to notice that all bubbles are the same size.


All these charts lament the high prices charged by U.S. hospitals

Nyt_medicalprocedureprices

A former student asked me about this chart from the New York Times that highlights much higher prices of hospital procedures in the U.S. relative to a comparison group of seven countries.

The dot plot is clearly thought through. It is not a default chart that pops out of software.

Based on its design, we surmise that the designer has the following intentions:

  1. The names of the medical procedures are printed to be read, thus the long text is placed horizontally.

  2. The actual price is not as important as the relative price, expressed as an index with the U.S. price at 100%. These reference values are printed in glaring red, unignorable.

  3. Notwithstanding the above point, the actual price is still of secondary importance, and the values are provided as a supplement to the row labels. Getting to the actual prices in the comparison countries requires further effort, and a calculator.

  4. The primary comparison is between the U.S. and the rest of the world (or the group of seven countries included). It is less important to distinguish specific countries in the comparison group, and thus the non-U.S. dots are given pastels that take some effort to differentiate.

  5. Probably due to reader feedback, the font size is subject to a minimum so that some labels are split into two lines to prevent the text from dominating the plotting region.

***

In the Trifecta Checkup view of the world, there is no single best design. The best design depends on the intended message and what’s in the available data.

To illustate this, I will present a few variants of the above design, and discuss how these alternative designs reflect the designer's intentions.

Note that in all my charts, I expressed the relative price in terms of discounts, which is the mirror image of premiums. Instead of saying Country A's price is 80% of the U.S. price, I prefer to say Country A's price is a 20% saving (or discount) off the U.S. price.

First up is the following chart that emphasizes countries instead of hospital procedures:

Redo_medicalprice_hor_dot

This chart encourages readers to draw conclusions such as "Hospital prices are 60-80 percent cheaper in Holland relative to the U.S." But it is more taxing to compare the cost of a specific procedure across countries.

The indexing strategy already creates a barrier to understanding relative costs of a specific procedure. For example, the value for angioplasty in Australia is about 55% and in Switzerland, about 75%. The difference 75%-55% is meaningless because both numbers are relative savings from the U.S. baseline. Comparing Australia and Switzerland requires a ratio (0.75/0.55 = 1.36): Australia's prices are 36% above Swiss prices, or alternatively, Swiss prices are a 64% 26% discount off Australia's prices.

The following design takes it even further, excluding details of individual procedures:

Redo_medicalprice_hor_bar

For some readers, less is more. It’s even easier to get a rough estimate of how much cheaper prices are in the comparison countries, for now, except for two “outliers”, the chart does not display individual values.

The widths of these bars reveal that in some countries, the amount of savings depends on the specific procedures.

The bar design releases the designer from a horizontal orientation. The country labels are shorter and can be placed at the bottom in a vertical design:

Redo_medicalprice_vert_bar

It's not that one design is obviously superior to the others. Each version does some things better. A good designer recognizes the strengths and weaknesses of each design, and selects one to fulfil his/her intentions.

 

P.S. [1/3/20] Corrected a computation, explained in Ken's comment.


Revisiting global car sales

We looked at the following chart in the previous blog. The data concern the growth rates of car sales in different regions of the world over time.

Cnbc zh global car sales

Here is a different visualization of the same data.

Redo_cnbc_globalcarsales

Well, it's not quite the same data. I divided the global average growth rate by four to yield an approximation of the true global average. (The reason for this is explained in the other day's post.)

The chart emphasizes how each region was helping or hurting the global growth. It also features the trend in growth within each region.

 


This Excel chart looks standard but gets everything wrong

The following CNBC chart (link) shows the trend of global car sales by region (or so we think).

Cnbc zh global car sales

This type of chart is quite common in finance/business circles, and has the fingerprint of Excel. After examining it, I nominate it for the Hall of Shame.

***

The chart has three major components vying for our attention: (1) the stacked columns, (2) the yellow line, and (3) the big red dashed arrow.

The easiest to interpret is the yellow line, which is labeled "Total" in the legend. It displays the annual growth rate of car sales around the globe. The data consist of annual percentage changes in car sales, so the slope of the yellow line represents a change of change, which is not particularly useful.

The big red arrow is making the point that the projected decline in global car sales in 2019 will return the world to the slowdown of 2008-9 after almost a decade of growth.

The stacked columns appear to provide a breakdown of the global growth rate by region. Looked at carefully, you'll soon learn that the visual form has hopelessly mangled the data.

Cnbc_globalcarsales_2006

What is the growth rate for Chinese car sales in 2006? Is it 2.5%, the top edge of China's part of the column? Between 1.5% and 2.5%, the extant of China's section? The answer is neither. Because of the stacking, China's growth rate is actually the height of the relevant section, that is to say, 1 percent. So the labels on the vertical axis are not directly useful to learning regional growth rates for most sections of the chart.

Can we read the vertical axis as global growth rate? That's not proper either. The different markets are not equal in size so growth rates cannot be aggregated by simple summing - they must be weighted by relative size.

The negative growth rates present another problem. Even if we agree to sum growth rates ignoring relative market sizes, we still can't get directly to the global growth rate. We would have to take the total of the positive rates and subtract the total of the negative rates.  

***

At this point, you may begin to question everything you thought you knew about this chart. Remember the yellow line, which we thought measures the global growth rate. Take a look at the 2006 column again.

The global growth rate is depicted as 2 percent. And yet every region experienced growth rates below 2 percent! No matter how you aggregate the regions, it's not possible for the world average to be larger than the value of each region.

For 2006, the regional growth rates are: China, 1%; Rest of the World, 1%; Western Europe, 0.1%; United States, -0.25%. A simple sum of those four rates yields 2%, which is shown on the yellow line.

But this number must be divided by four. If we give the four regions equal weight, each is worth a quarter of the total. So the overall average is the sum of each growth rate weighted by 1/4, which is 0.5%. [In reality, the weights of each region should be scaled to reflect its market size.]

***

tldr; The stacked column chart with a line overlay not only fails to communicate the contents of the car sales data but it also leads to misinterpretation.

I discussed several serious problems of this chart form: 

  • stacking the columns make it hard to learn the regional data

  • the trend by region takes a super effort to decipher

  • column stacking promotes reading meaning into the height of the column but the total height is meaningless (because of the negative section) while the net height (positive minus negative) also misleads due to presumptive equal weighting

  • the yellow line shows the sum of the regional data, which is four times the global growth rate that it purports to represent

 

***

PS. [12/4/2019: New post up with a different visualization.]