## Ringing in the data

##### Feb 15, 2022

There is a lot of great stuff at Visual Capitalist.

This circular design isn't one of their best.

***

A self-sufficiency test helps diagnose the problem. Notice that every data point is printed on the diagram. If the data labels were removed, there wouldn't be much one could learn from the chart other than the ranking of countries from most indebted to least. It would be impossible to know the difference in debt levels between any pair of countries.

In other words, the data labels rather than visual elements are doing most of the work. In a good dataviz, we like the visual elements to carry the weight.

***

The concentric rings embed a visual hierarchy: Japan is singled out, then the next tier of countries includes Sudan, Greece, Eritrea, Cape Verde, Italy, Suriname, and Barbados; and so on.

What is the clustering algorithm? What determines which countries fall into the same group?

It's implicitly determined by how many countries can fit inside the next ring. The designer carefully computed the number of rings, the widths of the rings, the density of the circles, etc. in such a way that there is no unsightly white space on the outer ring. Score a 10/10 for effort!

So the clustering of countries is not data-driven but constrained by the chart form. This limitation is similar to that found on maps used to illustrate spatial data.

## Asymmetry and orientation

##### Oct 20, 2021

An author in Significance claims that a single season of Premier League football without live spectators is enough to prove that the so-called home field advantage is really a live-spectator advantage.

The following chart depicts the data going back many seasons:

I find this bar chart challenging.

It plots the ratio of home wins to away wins using an odds scale, which is not intuitive. The odds scale (probability of success divided by probability of failure) runs from 0 to positive infinity, with 1 being a special value indicating equal odds. But all the values for which away wins exceed home wins are squeezed into the interval between 0 and 1 while the values for which home wins exceed away wins are laid out between 1 and infinity. So it's an inherently asymmetric graphic for a symmetric formula.
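To see the asymmetry in numbers, here is a minimal sketch using made-up win counts (they are assumptions for illustration, not the Premier League data). Two seasons that are mirror images of each other sit at different distances from 1 on the odds scale, whereas their log odds are equidistant from 0. The function names `odds` and `log_odds` are hypothetical helpers, not anything from the original chart.

```python
import math

def odds(home_wins: int, away_wins: int) -> float:
    """Ratio of home wins to away wins (the odds scale)."""
    return home_wins / away_wins

def log_odds(home_wins: int, away_wins: int) -> float:
    """Natural log of the odds; 0 means home and away wins are equal."""
    return math.log(home_wins / away_wins)

# Hypothetical win counts, chosen only to illustrate the asymmetry.
home_edge = odds(150, 100)   # 1.5
away_edge = odds(100, 150)   # about 0.667

# On the odds scale, the two mirror-image seasons sit at different
# distances from the "equal odds" value of 1.
print(home_edge - 1, 1 - away_edge)

# On the log scale, they are equidistant from 0, with opposite signs.
print(log_odds(150, 100), log_odds(100, 150))
```

A log scale is one common way to restore the symmetry; whether it is any more intuitive for a general audience is a separate question.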

The section labeled "more away wins than home wins" is filled with red bars for all those seasons with positive home field advantage while the most recent season, the outlier, has a shorter bar in that section than the rest.

Here's an alternative view:

I have incorporated dual axes here, but the two axes differ only by scaling. There are 380 games in a Premier League season so the percentage scale is just a re-expression of the counts.
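The re-expression is a simple linear rescaling, which is why the two axes can share the same gridlines. A minimal sketch, assuming the 380-game season mentioned above (the helper names are hypothetical):

```python
GAMES_PER_SEASON = 380  # from the text: games in a Premier League season

def count_to_percent(games: float) -> float:
    """Convert a number of games into a percentage of the season."""
    return 100 * games / GAMES_PER_SEASON

def percent_to_count(percent: float) -> float:
    """Inverse mapping: percentage of the season back to a game count."""
    return percent * GAMES_PER_SEASON / 100

# Tick marks on the count axis and their equivalents on the percent axis.
for games in (95, 190, 285, 380):
    print(games, count_to_percent(games))
```

Because one axis is a constant multiple of the other, this kind of dual axis does not carry the usual risk of two unrelated scales being aligned arbitrarily.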

## One of the most frequently produced maps is also one of the worst

##### Jul 08, 2021

Summer is here, many Americans are putting the pandemic in their rear-view mirrors, and gas prices are soaring. Business Insider told the story using this map:

What do we want to learn about gas prices this summer?

Which region has the highest / lowest prices?

How much higher / lower than the national average are the regional prices?

How much have prices risen, compared to last year, or compared to the last few weeks?

***

How much work did you have to do to get answers to those questions from the above map?

Unfortunately, this type of map continues to dominate the popular press. It merely delivers a geography lesson and not much else. Its dominant feature tells readers how to classify the 50 states into regions. Its color encodes no data.

Not surprisingly, this map fails the self-sufficiency test (link). The entire dataset is printed on the map, and if those numbers were removed, we would be left with a map of the regions of the U.S. The graphical elements of the chart are not doing much work.

***

In the following chart, I used the map as a color legend. Also, an additional plot shows each region's price level against the national average.

One can certainly ditch the map altogether, which makes having seven colors unnecessary. To address other questions, just stack on other charts, for example, showing the price increase versus last year.

***

From a Trifecta Checkup perspective, we find that the trouble starts with the Q corner. There are several important questions not addressed by the graphic. In the D corner, no context is provided to interpret the data. Are these prices abnormal? How do they compare to the national average or to a year ago? In the V corner, the chart takes too much effort to comprehend a basic fact, such as which region has the highest average price.

For more on the Trifecta Checkup, see this guide.

## Come si dice donut in italiano

##### Apr 15, 2021

One of my Italian readers sent me the following "horror chart". (Last I checked, it's not Halloween.)

I mean, people are selling these rainbow sunglasses.

The dataset behind the chart is the market share of steel production by country in 1992 and in 2014. The presumed story is how steel production has shifted from country to country over those 22 years.

Before anything else, readers must decipher the colors. This takes their eyes off the data and on to the color legend placed on the right column. The order of the color legend is different from that found in the nearest object, the 2014 donut. The following shows how our eyes roll while making sense of the donut chart.

It's easier to read the 1992 donut because of the order but now, our eyes must leapfrog the 2014 donut.

This is another example of a visualization that fails the self-sufficiency test. The entire dataset is actually printed around the two circles. If we delete the data labels, it becomes clear that readers are consuming the data labels, not the visual elements of the chart.

The chart is aimed at an Italian audience so they may have a patriotic interest in the data for Italia. What they find is disappointing. Italy apparently completely dropped out of steel production. It produced 3% of the world's steel in 1992 but zero in 2014.

Now I don't know if that is true because while reproducing the chart, I noticed that in the 2014 donut, there is a dark orange color that is not found in the legend. Is that Italy or a mysterious new entrant to steel production?

One alternative is a dot plot. This design accommodates arrows between the dots indicating growth versus decline.

## Pies, bars and self-sufficiency

##### Mar 29, 2021

Andy Cotgreave asked Twitter followers to pick between pie charts and bar charts:

The underlying data are proportions of people who say they won't get the coronavirus vaccine.

I noticed two somewhat unusual features: the use of pies to show single proportions, and the aspect ratio of the bars (taller than typical). Which version is easier to understand?

To answer this question, I like to apply a self-sufficiency test. This test is used to determine whether readers are using the visual elements of the chart to understand the data, or bypassing the visual elements and just reading the data labels. So, let's remove the printed data from the chart and take another look:

For me, these charts are comparable. Each is moderately hard to read. That's because the percentages fall into a narrow range at one end of the scale. For both charts, many readers are likely to be looking for the data labels.

Here's a sketch of a design that is self-sufficient.

The data do not appear on this chart.

***

My first reaction to Andy's tweet turned out to be a misreading of the charts. I thought he was disaggregating the pie chart, like we can unstack a stacked bar chart.

Looking at the data more carefully, I realize that the "proportions" are not parts of a whole. Or rather, the whole isn't "all races" or "all education levels". The whole is all respondents of a particular type.

## And you thought that pie chart was bad...

##### Mar 19, 2021

Vying for some of the worst charts of the year, Adobe came up with a few gems in its Digital Trends Survey. This was a tip from Nolan H. on Twitter.

There are many charts that should be featured; I'll focus on this one.

This is one of those survey questions that allow each respondent to select multiple responses so that adding up the percentages exceeds 100%. The survey asks people which of these futuristic products they think is most important. There were two separate groups of respondents, consumers (lighter red) and businesses (darker red).

If, like me, you are a left-to-right, top-to-bottom reader, you'd have consumed this graphic in the following way:

The most important item is found in the bottom corner while the least important is placed first.

Here is a more sensible order of these objects:

To follow this order, our eyes must do this:

Now, let me say I like what they did with the top of the chart:

Put the legend above the chart because no one can understand it without first reading the legend.

***

Data are embedded into part-circles (i.e. sectors)... but where do we find the data? The most obvious place to look for them is the areas of the sectors. But that's the wrong place. As I show in the explainer, the designer placed the data in the "height" - the distance from the peak point of the object to the horizontal baseline.

As a result of this choice, the areas of the sectors distort the data - they are proportional to the square of the data.
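A rough sketch of why the areas end up proportional to the square of the data. It assumes the sectors are geometrically similar (same angle) and that the "height" scales linearly with the data value, so the radius stands in for the height. The 90-degree angle and the two values are assumptions for illustration, not measurements from the Adobe chart.

```python
import math

def sector_area(radius: float, angle_rad: float = math.pi / 2) -> float:
    """Area of a circular sector with the given radius and angle."""
    return 0.5 * angle_rad * radius ** 2

# Hypothetical survey percentages; the second is twice the first.
values = [20, 40]

# If the data value sets the linear size, the area grows with its square.
areas = [sector_area(v) for v in values]

print(values[1] / values[0])  # the data doubles: 2.0
print(areas[1] / areas[0])    # the area quadruples: 4.0
```

A reader who judges the objects by area would therefore see a fourfold difference where the data only doubled.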

One simple way to figure out that your graphical objects have obscured the data is the self-sufficiency test. Remove all data labels from the chart, and ask if you still have something understandable.

With these unusual shapes, it's not easy to judge how much larger one object is than the next. That's why the data labels were included - the readers are looking at the data values, rather than the graphical objects. That's sad, if you are the designer.

***

One last mystery. What decides the layering of the light vs dark red sectors?

This design always places the smaller object in front of the larger object. Recall that the light red is for consumers and dark red for businesses. The comparison between these disjoint segments is not as interesting as the comparison of different ratings of technologies within each segment. So it's unfortunate that this aspect may get more attention than it deserves. It's also a consequence of the chart form. If the light red is always placed in front, then in some panels (such as the middle one shown above), the light red completely blocks the dark red.

## Circular areas offer misleading cues of their underlying data

##### Feb 09, 2021

John M. pointed me on Twitter to this chart about the progress of the U.S. vaccination campaign:

This looks like a White House production, retweeted by WHO. John is unhappy about this nested bubble format, which I'll come back to later.

Let's zoom in on what matters:

An even bigger problem with this chart is the Q corner in our Trifecta Checkup. What is the question they are trying to address? It would appear to be the proportion of population that has "already received [one or more doses of] vaccine". And the big words tell us the answer is 8 percent.

But is that really the question? Check out the dark blue circle. It is labeled "population that has already received vaccine" and thus we infer this bubble represents 8 percent. Now look at the outer bubble. Its annotation is "new population that received vaccine since January 27, 2021". The only interpretation that makes sense is that 8 percent is not the most current number. If that is the case, why would the headline highlight an older statistic, and not the most up-to-date one?

Perhaps the real question is how fast is the progress in vaccination. Perhaps it took weeks to get to the dark circle and then days to get beyond. In order to improve this data visualization, we must first decide what the question really is.

***

Now let's get to those nested bubbles. The bubble chart is a format that is not "sufficient," by which I mean the visual by itself does not convey the data without the help of aids such as labels. Try to answer the following questions:

In my view, if your answer to the last question is anything more than 5 seconds, the dataviz has failed. A successful data visualization should not make readers solve puzzles.

The first two questions depict the confusing nature of concentric circle diagrams. The first data point is coded to the inner circle. Where is the second data point? Is it encoded to the outer circle, or just the outer ring?

In either case, human brains are not trained to compare circular areas. For question 1, the outer circle is 70% larger than the smaller circle. For question 2, the ring is 70% of the area of the dark blue circle. If you're thinking those numbers seem unreasonable, I can tell you that was my first reaction too! So I made the following to convince myself that the calculation was correct:

Circular areas offer misleading visual cues, and should be used sparingly.
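The arithmetic behind those 70% figures can be sketched in a few lines. The radii below are assumptions picked for illustration (an outer radius about 30% longer than the inner one), not measurements taken from the chart; the point is only that a modest difference in radius produces a much larger difference in area.

```python
import math

def circle_area(radius: float) -> float:
    """Area of a circle with the given radius."""
    return math.pi * radius ** 2

# Hypothetical radii in arbitrary units, not measured from the chart.
inner_radius = 1.0
outer_radius = 1.3

inner_area = circle_area(inner_radius)
outer_area = circle_area(outer_radius)
ring_area = outer_area - inner_area  # the band between the two circles

# Reading 1: the second data point is the whole outer circle.
print(round(outer_area / inner_area, 2))  # 1.69, roughly 70% larger

# Reading 2: the second data point is only the outer ring.
print(round(ring_area / inner_area, 2))   # 0.69, roughly 70% of the inner circle
```

The two readings give very different answers (about 1.7 versus 0.7 times the inner circle), which is the ambiguity of the concentric design.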

[P.S. 2/10/2021. In the next post, I sketch out an alternative dataviz for this dataset.]

## Atypical time order and bubble labeling

##### Dec 29, 2020

This chart appeared in a Charles Schwab magazine in summer 2019.

This bubble chart does not print any data labels. The bubbles take our attention but the designer realizes that the actual values of the volatility are not intuitive numbers. The same is true of any standard deviation numbers. If you're told the SD of a data series is 3, it doesn't tell you much by itself.

I first transformed this chart into the equivalent column chart:

Two problems surface on the axes.

For the time axis, the years are jumbled. Readers experience vertigo as they try to figure out how to read the chart. Our expectation that time moves left to right is thwarted. This ordering also requires every single year label to be present.

For the vertical axis, I could have left out the numbers completely. They are not really meaningful. These represent the areas of the bubbles but only relative to how I measured them.

***

In the next version, I sorted time in the conventional manner. Following Tufte's classic advice, only the tops of the columns are plotted.

What you see is that this ordering is much easier to comprehend. Figuring out that 2018 is an average year in terms of volatility is not any harder than in the original. In fact, we can reproduce the order of the previous chart just by letting our eyes sweep top to bottom.

To make it even easier to read the vertical axis, I converted the numbers into an index, with the average volatility as 100 (assigned to 0% on the chart).

Now, you can see that 2018 is roughly at the average while 2008 is 400% above the average level. (How should we interpret this statement? That's a question I pose to my statistics students. It's not intuitive how one should interpret the statement that the standard deviation is 5 times the average.)
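The index conversion can be sketched as follows. The volatility values here are hypothetical placeholders (the chart's underlying numbers are not printed), and the function names are made up for this example. The sketch shows how "400% above average" corresponds to a value that is 5 times the average.

```python
def volatility_index(value: float, average: float) -> float:
    """Index the value so that the average volatility equals 100."""
    return 100 * value / average

def percent_above_average(value: float, average: float) -> float:
    """Shift the index so that the average sits at 0% on the chart."""
    return volatility_index(value, average) - 100

# Hypothetical numbers: an average of 2.0 and two illustrative years.
average = 2.0
typical_year = 2.0   # equal to the average
extreme_year = 10.0  # 5 times the average

print(percent_above_average(typical_year, average))  # 0.0
print(percent_above_average(extreme_year, average))  # 400.0
```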

## Putting vaccine trials in boxes

##### Sep 08, 2020

Bloomberg Businessweek has a special edition about vaccines, and I found this chart in the print edition:

The chart's got a lot of white space. Its structure is a series of simple "treemaps," one for each type of vaccine. Though simple, such a chart burns a few brain cells.

Here, I've extracted the largest block, which corresponds to vaccines that work with the virus's RNA/DNA. I applied a self-sufficiency test, removing the data from the boxes.

What proportion of these projects have moved from pre-clinical to Phase I? To answer this question, we have to understand the relative areas of boxes, since that's how the data are encoded. How many yellow boxes can fit into the gray box?

It's not intuitive. We'd need a ruler to do this task properly.

Then, we learn that the gray box is exactly 8 times the size of the yellow box (72 projects are pre-clinical while 9 are in Phase I). We can cram eight yellows into the gray box. Imagine doing that, and it's pretty clear the visual elements fail to convey the meaning of the data.
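One way to see why the 8-to-1 ratio is hard to read off the boxes: area grows with the square of linear size. The sketch below uses the project counts quoted above and assumes, purely for illustration, that both boxes are squares (the boxes in the actual chart need not be).

```python
import math

# Project counts quoted in the text.
pre_clinical = 72
phase_one = 9

# Areas encode the data, so the area ratio equals the data ratio.
area_ratio = pre_clinical / phase_one

# Assumption: both boxes are squares. Then the side lengths differ by
# only the square root of the area ratio.
side_ratio = math.sqrt(area_ratio)

print(area_ratio)            # 8.0
print(round(side_ratio, 2))  # 2.83
```

Under that assumption the gray box is less than three times as wide as the yellow one, even though it represents eight times as many projects.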

Self-sufficiency is the idea that a data graphic should not rely on printed data to convey its meaning; the visual elements of a data graphic should bear much of the burden. Otherwise, use a data table. To test for self-sufficiency, cover up the printed data and see if the chart still works.

***

A key decision for the designer is the relative importance of (a) the number of projects reaching Phase III, versus (b) the number of projects utilizing specific vaccine strategies.

This next chart emphasizes the clinical phases:

Contrast this with the version shown in the online edition of Bloomberg (link), which emphasizes the vaccine strategies.

## Working with multiple dimensions, an example from Germany

##### Jul 15, 2020

An anonymous reader submitted this mirrored bar chart about violent acts by extremists in the 16 German states.

At first glance, this looks like a standard design. On a second look, you might notice what the reader discovered: the chart used two different scales, one for each side. The left side (red) depicting left-wing extremism is artificially compressed relative to the right side (blue). Not sure if this reflects the political bias of the publication - but in any case, this distortion means the only way to consume this chart is to read the numbers.

Even after fixing the scales, this design is challenging for the reader. It's unnatural to compare two years by looking first below then above. It's not simple to compare across states, and even harder to compare left- and right-wing extremism (due to mirroring).

The chart feels busy because the entire dataset is printed on it. I appreciate not including a redundant horizontal axis. (I wonder if the designer first removed the axis, then edited the scale on one side, not realizing the distortion.) Another nice touch, hidden in the legend, is the country totals.

I present two alternatives.

The first is a small-multiples "bumps chart".

Each plot presents the entire picture within a state. You can see the general level of violence, the level of left- and right-wing extremism, and their year-on-year change. States can be compared holistically.

Several German state names are rather long, so I explored a horizontal orientation. In this case, a connected dot plot may be more appropriate.

The sign of a good multi-dimensional visual display is whether readers can easily learn complex relationships. Depending on the question of interest, the reader can mentally elevate parts of this chart. One can compare the set of blue arrows to the set of red arrows, or focus on just blue arrows pointing right, or red arrows pointing left, or all arrows for Berlin, etc.