« January 2020 | Main

Pie chart conventions

I came across this pie chart from a presentation at an industry meeting some weeks ago:

Mediaconversations_orig

This example breaks a number of the unspoken conventions on making pie charts and so it is harder to read than usual.

Notice that the biggest slice starts around 8 o'clock, and the slices are ordered alphabetically by the label, rather than numerically by size of the slice.

The following is the same chart ordered in a more conventional way. The largest slice is placed along the top vertical, and the other slices are arranged in a clock-wise manner from larger to smaller.

Redo_junkcharts_mediaconversations1

This version is easier to read because the reader does not need to think about the order of the slices. The expectation of decreasing size is met.

The above pie chart, though, reveals breaking of another convention. The colors on this chart signify nothing! The general rule is color differences should encode data differences. Here, the colors should go from deepest to lightest. (One can even argue that different tinges is redundant.)

Redo_junkcharts_mediaconversations2

You see how this version is even better. In the previous version, the colors are distracting. You're wondering what they mean, and then you realize they signify nothing.

***

As designers of graphics, we follow a bunch of conventions silently. When a design deviates from it, it's harder to understand.

Recently, I wrote a long article for DataJournalism.com, setting out many of these unspoken conventions. Read it here.

 


Too many colors on a chart is bad, but why?

The following chart is bad, but how so?

Junkcharts_colors_columnchart

The chart is annoying because of the misuse of colors.

What is the purpose of the multiple colors used in this chart? It's not encoding any data. Colors are used here to differentiate one bar from its two neighbors. Or perhaps to make the chart more "appealing".

The reason why the coloring scheme backfires is that readers may look for meaning in the colors. What's common between Iceland, United States and Germany for them to be assigned green? What about Japan, New Zealand, Spain and France, all of which shown yellow?

The readers' instinct is driven by a set of unspoken rules that govern the production of data visualization. Specifically, the rule here is: color differences reflect data differences. When such a rule is violated, the reader is misled and confused.

***

For more about this rule, other rules related to making bar charts, and other other rules for making data graphics, please read my Long Read article, here.

 


How to read this chart about coronavirus risk

In my just-published Long Read article at DataJournalism.com, I touched upon the subject of "How to Read this Chart".

Most data graphics do not come with directions of use because dataviz designers follow certain conventions. We do not need to tell you, for example, that time runs left to right on the horizontal axis (substitute right to left for those living in right-to-left countries). It's when we deviate from the norms that calls for a "How to Read this Chart" box.

***
A discussion over Twitter during the weekend on the following New York Times chart perfectly illustrates this issue. (The article is well worth reading to educate oneself on this red-hot public-health issue. I made some comments on the sister blog about the data a few days ago.)

Nyt_coronavirus_scatter

Reading this chart, I quickly grasp that the horizontal axis is the speed of infection and the vertical axis represents the deadliness. Without being told, I used the axis labels (and some of you might notice the annotations with the arrows on the top right.) But most people will likely miss - at a glance - that the vertical axis utilizes a log scale while the horizontal axis is linear (regular).

The effect of a log scale is to pull the large numbers toward the average while spreading the smaller numbers apart - when compared to a linear scale. So when we look at the top of the coronavirus box, it appears that this virus could be as deadly as SARS.

The height of the pink box is 3.9, while the gap between the top edge of the box and the SARS dot is 6. Yet our eyes tell us the top edge is closer to the SARS dot than it is to the bottom edge!

There is nothing inaccurate about this chart - the log scale introduces such distortion. The designer has to make a choice.

Indeed, there were two camps on Twitter, arguing for and against the log scale.

***

I use log scales a lot in analyzing data, but tend not to use log scales in a graph. It's almost a given that using the log scale requires a "How to Read this Chart" message. And the NY Times crew delivers!

Right below the chart is a paragraph:

Nyt_coronavirus_howtoreadthis

To make this even more interesting, the horizontal axis is a hidden "log" scale. That's because infections spread exponentially. Even though the scale is not labeled "log", think as if the large values have been pulled toward the middle.

Here is an over-simplified way to see this. A disease that spreads at a rate of fifteen people at a time is not 3 times worse than one that spreads five at a time. In the former case, the first sick person transmits it to 15, and then each of the 15 transmits the flu to 15 others, thus after two steps, 241 people have been infected (225 + 15 + 1). In latter case, it's 5x5 + 5 + 1 = 31 infections after two steps. So at this point, the number of infected is already 8 times worse, not 3 times. And the gap keeps widening with each step.

P.S. See also my post on the sister blog that digs deeper into the metrics.