Come si dice donut in italiano

One of my Italian readers sent me the following "horror chart". (Last I checked, it's not Halloween.)


I mean, people are selling these rainbow sunglasses.


The dataset behind the chart is the market share of steel production by country in 1992 and in 2014. The presumed story is how steel production has shifted from country to country over those 22 years.

Before anything else, readers must decipher the colors. This takes their eyes off the data and onto the color legend placed in the right column. The order of the color legend is different from that found in the nearest object, the 2014 donut. The following shows how our eyes roll while making sense of the donut chart.


It's easier to read the 1992 donut because of the order but now, our eyes must leapfrog the 2014 donut.


This is another example of a visualization that fails the self-sufficiency test. The entire dataset is actually printed around the two circles. If we delete the data labels, it becomes clear that readers are consuming the data labels, not the visual elements of the chart.


The chart is aimed at an Italian audience so they may have a patriotic interest in the data for Italia. What they find is disappointing. Italy apparently completely dropped out of steel production. It produced 3% of the world's steel in 1992 but zero in 2014.

Now I don't know if that is true because while reproducing the chart, I noticed that in the 2014 donut, there is a dark orange color that is not found in the legend. Is that Italy or a mysterious new entrant to steel production?

One alternative is a dot plot. This design accommodates arrows between the dots indicating growth versus decline.



Pies, bars and self-sufficiency

Andy Cotgreave asked Twitter followers to pick between pie charts and bar charts:


The underlying data are proportions of people who say they won't get the coronavirus vaccine.

I noticed two somewhat unusual features: the use of pies to show single proportions, and the aspect ratio of the bars (taller than typical). Which version is easier to understand?

To answer this question, I like to apply a self-sufficiency test. This test determines whether readers are using the visual elements of the chart to understand the data, or bypassing the visual elements and just reading the data labels. So, let's remove the printed data from the chart and take another look:


For me, these charts are comparable. Each is moderately hard to read. That's because the percentages fall into a narrow band at one end of the scale. For both charts, many readers are likely to be looking for the data labels.

Here's a sketch of a design that is self-sufficient.


The data do not appear on this chart.


My first reaction to Andy's tweet turned out to be a misreading of the charts. I thought he was disaggregating the pie chart, like we can unstack a stacked bar chart.


Looking at the data more carefully, I realize that the "proportions" are not parts of a whole. Or rather, the whole isn't "all races" or "all education levels". The whole is all respondents of a particular type.



Circular areas offer misleading cues of their underlying data

John M. pointed me on Twitter to this chart about the progress of the U.S. vaccination campaign:


This looks like a White House production, retweeted by WHO. John is unhappy about this nested bubble format, which I'll come back to later.

Let's zoom in on what matters:


An even bigger problem with this chart is the Q corner in our Trifecta Checkup. What is the question they are trying to address? It would appear to be the proportion of population that has "already received [one or more doses of] vaccine". And the big words tell us the answer is 8 percent.

But is that really the question? Check out the dark blue circle. It is labeled "population that has already received vaccine" and thus we infer this bubble represents 8 percent. Now look at the outer bubble. Its annotation is "new population that received vaccine since January 27, 2021". The only interpretation that makes sense is that 8 percent is not the most current number. If that is the case, why would the headline highlight an older statistic, and not the most up-to-date one?

Perhaps the real question is how fast is the progress in vaccination. Perhaps it took weeks to get to the dark circle and then days to get beyond. In order to improve this data visualization, we must first decide what the question really is.


Now let's get to those nested bubbles. The bubble chart is a format that is not "sufficient," by which I mean the visual by itself does not convey the data without the help of aids such as labels. Try to answer the following questions:


In my view, if your answer to the last question is anything more than 5 seconds, the dataviz has failed. A successful data visualization should not make readers solve puzzles.

The first two questions depict the confusing nature of concentric circle diagrams. The first data point is coded to the inner circle. Where is the second data point? Is it encoded to the outer circle, or just the outer ring?

In either case, human brains are not trained to compare circular areas. For question 1, the outer circle is 70% larger than the smaller circle. For question 2, the ring is 70% of the area of the dark blue circle. If you're thinking those numbers seem unreasonable, I can tell you that was my first reaction too! So I made the following to convince myself that the calculation was correct:


Circular areas offer misleading visual cues, and should be used sparingly.
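The arithmetic behind those two percentages can be checked with a few lines of Python. This is only a sketch: the chart's actual radii are not given in the post, so the 1.7 area multiple is taken from the "70% larger" figure above, and the function names are made up for illustration.

```python
import math


def area_ratio(inner_radius: float, outer_radius: float) -> float:
    """Area of the outer circle as a multiple of the inner circle's area."""
    return (outer_radius / inner_radius) ** 2


def ring_share(inner_radius: float, outer_radius: float) -> float:
    """Area of the ring between the two circles, as a multiple of the inner circle's area."""
    return area_ratio(inner_radius, outer_radius) - 1.0


def radius_ratio_for(area_multiple: float) -> float:
    """Radius multiple needed to produce a given area multiple."""
    return math.sqrt(area_multiple)


# Assumption: the outer circle has 1.7 times the area of the inner circle.
r = radius_ratio_for(1.7)
print(round(r, 3))                   # 1.304
print(round(ring_share(1.0, r), 2))  # 0.7
```

Under that assumption, the outer radius is only about 30% longer than the inner one even though the area is 70% larger, which may be why the numbers feel unreasonable at first glance.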

[P.S. 2/10/2021. In the next post, I sketch out an alternative dataviz for this dataset.]

Atypical time order and bubble labeling

This chart appeared in a Charles Schwab magazine in Summer, 2019.


This bubble chart does not print any data labels. The bubbles take our attention but the designer realizes that the actual values of the volatility are not intuitive numbers. The same is true of any standard deviation numbers. If you're told SD of a data series is 3, it doesn't tell you much by itself.

I first transformed this chart into the equivalent column chart:


Two problems surface on the axes.

For the time axis, the years are jumbled. Readers experience vertigo, as we try to figure out how to read the chart. Our expectation that time moves left to right is thwarted. This ordering also requires every single year label to be present.

For the vertical axis, I could have left out the numbers completely. They are not really meaningful. These represent the areas of the bubbles but only relative to how I measured them.


In the next version, I sorted time in the conventional manner. Following Tufte's classic advice, only the tops of the columns are plotted.


What you see is that this ordering is much easier to comprehend. Figuring out that 2018 is an average year in terms of volatility is not any harder than in the original. In fact, we can reproduce the order of the previous chart just by letting our eyes sweep top to bottom.

To make it even easier to read the vertical axis, I converted the numbers into an index, with the average volatility as 100 (assigned to 0% on the chart).


Now, you can see that 2018 is roughly at the average while 2008 is 400% above the average level. (How should we interpret this statement? That's a question I pose to my statistics students. It's not intuitive how one should interpret the statement that the standard deviation is 5 times higher.)
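For readers who want to try the index conversion themselves, here is a minimal Python sketch. The years and volatility values are made up for illustration (the chart prints no data), and the function names are hypothetical.

```python
def volatility_index(values: dict) -> dict:
    """Rescale each value so that the average of the series equals 100."""
    mean = sum(values.values()) / len(values)
    return {year: 100.0 * v / mean for year, v in values.items()}


def percent_vs_average(values: dict) -> dict:
    """Express each value as a percent above (+) or below (-) the average."""
    return {year: idx - 100.0 for year, idx in volatility_index(values).items()}


# Made-up numbers, in arbitrary units (NOT read off the Schwab chart)
measured = {2016: 2.0, 2017: 1.0, 2018: 3.0, 2019: 6.0}

print(volatility_index(measured)[2018])    # 100.0, i.e. an average year
print(percent_vs_average(measured)[2019])  # 100.0, i.e. 100% above average
```

On this scale, an index of 500 is plotted at +400%, which is the same thing as five times the average level.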



I made a streamgraph

The folks at FiveThirtyEight were excited about the following dataviz they published two weeks ago, illustrating the progression of vote-counting by state. (link) That was indeed the unique and confusing feature of the 2020 Presidential election in the States. For those outside the U.S., what happened (by and large) was that many Americans, skewing Biden supporters, voted by mail before Election Day but their votes were sometimes counted after the same-day votes were tallied.



A number of us kept staring at these charts, hoping for a how-to-read-it explanation. Here is a zoom-in for the state of Michigan:


To save you the trouble, here is how.

The key is to fight your urge to look at the brown area. I know, it's pretty hard to ignore the biggest areas of every chart. But try to make them disappear.

Focus on the top edge of the chart. This line gives the total number of votes counted so far. In Michigan, by hour 12, about 2.4 million votes were counted, and by hour 72, 2.8 million votes were on the books. This line gives the sum of the two major parties' vote totals [since third parties got negligible votes in this election, I'm ignoring them so as to simplify the discussion].

Next, look at the red and blue areas. These represent the gap in the number of votes between the two parties' current vote totals. If the area is red, Trump was leading; if blue, Biden was leading. Each color flip represents a lead change. Suppress the urge to interpret red as the number or share of Trump votes.


What have we learned about the vote counting in Michigan?

Counting slowed significantly after the 12th hour. Trump raced to a lead on Election Day; around hour 20, the race was dead even; after that, Biden overtook Trump and never looked back. Throughout most of this period, the vote lead was small compared to the total votes cast, although by the end the Biden lead was noticeable.

If you insist on interpreting the brown area, it is equal to twice the vote total of the second-place candidate, so it really isn't something you want to look at.
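The reading instructions above boil down to a small decomposition, sketched here in Python. The vote counts are invented (only the roughly 2.4 million total at hour 12 comes from the chart reading above), the function name is hypothetical, and third parties are ignored as in the discussion.

```python
def decompose(trump_votes: int, biden_votes: int) -> dict:
    """Split two vote totals into the pieces the chart draws."""
    total = trump_votes + biden_votes     # top edge of the chart
    gap = abs(trump_votes - biden_votes)  # height of the red or blue area
    if trump_votes > biden_votes:
        lead_color = "red"
    elif biden_votes > trump_votes:
        lead_color = "blue"
    else:
        lead_color = "none"               # dead even: no colored area
    brown = total - gap                   # whatever is left over
    return {"total": total, "gap": gap, "lead_color": lead_color, "brown": brown}


# Invented split of about 2.4 million counted votes
parts = decompose(trump_votes=1_300_000, biden_votes=1_100_000)
print(parts["total"])       # 2400000
print(parts["gap"])         # 200000
print(parts["lead_color"])  # red
print(parts["brown"])       # 2200000, twice the trailing candidate's 1,100,000
```

Since the total minus the gap always equals twice the smaller of the two counts, the brown area carries no information that the top edge and the colored area don't already provide.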

Just for contrast, here is the chart for Iowa:


Trump led from beginning to end, with his lead widening slightly as more votes were counted.


As I was stewing over this chart, an ominous thought overcame me. Would a streamgraph work for this data? You don't hear much about streamgraphs here because I rarely favor them (see this long-ago post) but let's just try one and see.


(These streamgraphs were made in R using the streamgraph package. Post-processing was applied to customize the labeling.)

This chart conveys all the key points listed before. You can see how the gap evolved over time, the lead flips, which candidate was in the lead, and the total mass of votes counted at different times. The gap is shown in the middle.

I can't say I'm completely happy with the streamgraph - I hope readers don't care about the numbers because it's hard to evaluate a difference when it's split two ways on either side of the middle axis!


If you come up with a better idea, make sure to leave a comment.





This holiday retailers hope it will snow dollars

According to the Conference Board, the pandemic will not deter U.S. consumers from emptying their wallets this holiday season. Here's a chart that shows their expectation (link):



A few little things make this chart work:

The "More" category is placed on the left, as English is read left to right, and it is also given the deepest green, drawing our attention.

Only the "More" segments have data labels. I'd have omitted the decimals. I suspect they are added because financial analysts may be multiplying these percentages to yield dollar amounts, in which case the extra precision helps.

The categories are ordered by the decreasing propensity of increased spending this year relative to last year. (The business community has an optimism bias.)

The choice of three shades of one color instead of three different colors keeps the chart clean.


The use of snowflakes surely infuriates a hardcore Tufte fan although I like that they add a festive note to the presentation. The large snowflake isn't randomly positioned but placed exactly where it causes the least interference with the bar chart.


Using comparison to enrich a visual story

Just found this beauty deep in my submission pile (from Howie H.):


What's great about this pie chart is the story it's trying to tell. Almost half of the electorate did not vote in Texas in the 2016 Presidential election. The designer successfully draws my attention to the white sector that makes the point.

There are a few problems.

Showing two decimals is too much precision.

The purple sector is not labeled.

The white area seems exaggerated. The four sectors do not appear to meet at the center of the circle. The distortion is not too much but it's schizophrenic: the pie slices are drawn with low precision while the data labels have high precision.


The following fixes those problems, and also adds a second chart to contrast the two ways of thinking:


Book Review: Visualizing with Text by Richard Brath

The creative process is sometimes described in terms of diverge-converge cycles. The diverge step involves experimentation and rewards suspending disbelief, while excesses are curbed and concepts refined during the converge step. Richard Brath's just-released book Visualizing with Text is an important resource that expands our appreciation for the place of text in visual displays.

Books on data visualization fall into recognizable types, of which two popular ones are the style guide, such as Edward Tufte, Dona Wong, and Alberto Cairo, and the coding manual, such as Ben Fry (processing) and Hadley Wickham (ggplot, Shiny). Brath's volume belongs to neither of those - it reads more like an encyclopedic catalog of how text can be incorporated into charts and graphs. He challenges us to blow up our imaginative space for characters, words, sentences, paragraphs and prose. It is a valuable aid for the diverge step of our creative process.

In modern data visualization, text is treated as an accessory, frequently found in titles, labels, legends, footnotes or surrounding text. Brath wants us to elevate text to the starring attraction. Starting with baby steps, such as direct labeling of lines and objects, and coordinating colors between chart elements and words, he experiments with inserting text into unlikely crannies, not shying away from ideas that even he admits may be somewhat of a dead-end.

One of the more immediately useful examples is the use of text labels that hug the lines on a line chart, similar to how roads and rivers are labeled on maps. I wish all software developers would implement this function without delay.


A more esoteric example is to replace these lines with small-size text, as Brath makes an analogy between sentences and lines.


I am still deciding if this is a gold mine or a minefield. It is thought-provoking nonetheless.

Finally, the book includes some flights of fancy, like this one:


The red superscripts are numeric codes for French departments (provinces), arranged in ascending order of a given metric, and placed in proportional distance within the prose!

The converge step is left to the reader, as Brath refrains from bullhorning his opinions about chart types, which is why readers should not expect a style guide. He includes many experimental graphics, and may provide the pros and cons of a form without registering a judgement.

Because many of these ideas have yet to enter the mainstream, we'd need to implement these ideas on our own, which is why readers will not find a coding manual. As mentioned above, even the simplest and least controversial tactic of directly labeling lines is not available in Excel, let alone text that hugs or replaces lines. (This proves Brath's point that our community has done text a disservice.) Other ideas explored in later chapters require such features as italicizing numeric proportions of a word, rather than the entire word.

Recently, text has become a mainstay of Big Data. Visualizing with Text is timely, relevant and provocative. It is also clearly written, and tightly organized. Chapter 13 neatly summarizes the key concepts that have appeared along the way. There are plenty of use cases, primarily derived from research or business. After reading this book, you'll revel in the new sandbox of text, and long to free yourself from the constraints of your tool.


I recommend that you get the paper copy of the book. I reviewed the electronic version, and what irony! As you may have guessed, the electronic version ruins the typesetting. On every page, certain paragraphs show up in a tiny font that resists all attempts to magnify, making Brath's case that legibility is an important metric for text visualization. Some of the more unusual fonts are dropped. The images are too small, even when popped up.

[P.S. Richard has a webpage where he included larger images and some code.]

Making better pie charts if you must

I saw this chart on an NYU marketing twitter account:


The graphical design is not easy on our eyes. It's just hard to read for various reasons.

The headline sounds like a subject line from an email.

The subheaders are long, and differ only by a single word.

Even if one prefers pie charts, they can be improved by following a few guidelines.

First, start the first sector at the 12 o'clock position. Like this:


The survey uses a 5-point scale from "Very Good" to "Very Bad". Instead of using five different colors, it's better to use two extreme colors and shading. Like this:


I also try hard to keep all text horizontal.


For those who prefer not to use pie charts, a side-by-side bar chart works well.


In my article for, I outlined "unspoken rules" for making various charts, including pie charts.




A testing mess: one chart, four numbers, four colors, three titles, wrong units, wrong lengths, wrong data

Twitterstan wanted to vote the following infographic off the island:


(The publisher's website is here but I can't find a direct link to this graphic.)

The mishap is particularly galling given the controversy swirling around this year's A-Level results in the U.K. For U.S. readers, you can think of A-Levels as SAT Subject Tests, which in the U.K. are required of all university applicants, and represent the most important, if not the sole, determinant of admissions decisions. Please see the upcoming post on my book blog for coverage of the brouhaha surrounding the statistical adjustments (to be posted sometime this week, it's here.).

The first issue you may notice about the chart is that the bar lengths have no relationship with the numbers printed on them. Here is a scatter plot correlating the bar lengths and the data.


As you can see, nothing.

Then, you may wonder what the numbers mean. The annotation at the bottom right says "Average number of A level qualifications per student". Wow, the British (in this case, English) education system is a genius factory - with the average student mastering close to three thousand subjects in secondary (high) school!

TES is the cool name for what used to be the Times Educational Supplement. I traced the data back to Ofqual, which is the British regulator for these examinations. This is the Ofqual version of the above chart:


The data match. You may see that the header of the data table reads "Number of students in England getting 3 x A*". This is a completely different metric than number of qualifications - in fact, this metric measures geniuses. "A*" is the U.K. equivalent of "A+". When I studied under the British system, there was no such grade. I guess grade inflation is happening all over the world. What used to be A is now A+, and what used to be B is now A. Scoring three A*s is tops - I wonder if this should say 3 or more because I recall that you can take as many subjects as you desire but most students max out at three (may have been four).

The number of students attaining the highest achievement has increased in the last two years compared to the two years before. We can't interpret these data unless we know if the number of students also grew at similar rates.

The units are students while the units we expect from the TES graphic should be subjects. The cutoff for the data defines top students while the TES graphic should connote minimum qualification, i.e. a passing grade.

Now, the next section of the Ofqual infographic resolves the mystery. Here is the chart:


This dataset has the right units and measurement. There is almost no meaningful shift in the last four years. The average number of qualifications per student is only different at the second decimal place. Replacing the original data with this set removes the confusion.


While I was re-making this chart, I also cleaned out the headers and sub-headers. This is an example of software hegemony: the designer wouldn't have repeated the same information three times on a chart with four numbers if s/he wasn't prompted by software defaults.


The corrected chart violates one of the conventions I described in my tutorial for color difference should reflect data difference.

In the following side-by-side comparison, you see that the use of multiple colors on the left chart signals different data - note especially the top and bottom bars which carry the same number, but our expectation is frustrated.



[P.S. 8/25/2020. Dan V. pointed out another problem with these bar charts: the bars were truncated so that the bar lengths are not proportional to the data. The corrected chart is shown on the right below:


8/26/2020: added link to the related post on my book blog.]