Labels, scales, controls, aggregation all in play

JB @barclaysdevries sent me the following BBC production over Twitter.


He was not amused.

This chart pushes a number of my hot buttons.

First, I like to assume that readers don't need to be taught that 2007 and 2018 are examples of "Year".

Second, starting an area chart away from zero is just as bad as starting a bar chart away from zero! The area is distorted and does not reflect the relative values of the data.

Third, I suspect the 2007 high point is a local peak, which they chose in order to forward a sky-is-falling narrative related to China's growth.

So I went to a search engine and looked up China's growth rate, and it helpfully automatically generated the following chart:


Just wow! This chart does a number of things right.

First, it confirms my hunch above. 2007 is a clear local peak and it is concerning that the designer chose that as a starting point.

Second, this chart understands that the zero-growth line has special meaning.

Third, there are more year labels.

Fourth, and very importantly, the chart offers two "controls". We can look at China's growth relative to India's and relative to the U.S.'s. Those two other lines bring context.

JB's biggest complaint is that the downward-sloping line confuses the issue, which is that slowing growth is still growth. The following chart conveys a completely different message but the underlying raw data are the same:
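To see how the same raw data can support both messages, here is a minimal sketch that compounds a falling series of growth rates into a level. The rates are made-up numbers for illustration, not China's actual figures, and the function name `compound` is a hypothetical helper.

```python
def compound(rates_pct, start=100.0):
    """Turn yearly growth rates (in percent) into an index of levels."""
    levels = [start]
    for rate in rates_pct:
        levels.append(levels[-1] * (1 + rate / 100))
    return levels

# Hypothetical, steadily declining growth rates (percent per year)
rates = [14.0, 10.0, 8.0, 7.0]
levels = compound(rates)

# The rates slope downward, yet every level is higher than the one before
rates_falling = all(b < a for a, b in zip(rates, rates[1:]))
levels_rising = all(b > a for a, b in zip(levels, levels[1:]))
```

Plotting `rates` gives the downward-sloping line; plotting `levels` gives an upward-sloping one.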



The French take back cinema but can you see it?

I like independent cinema, and here are three French films that come to mind as I write this post: Delicatessen, The Class (Entre les murs), and 8 Women (8 femmes). 

The French people are taking back cinema. Even though they purchased more tickets to U.S. movies than French movies, the gap has been narrowing in the last two decades. How do I know? It's the subject of this infographic:


How do I know? That's not easy to say, given how complicated this infographic is. Here is a zoomed-in view of the top of the chart:



You've got the slice of orange, which doubles as the imagery of a film roll. The chart uses five legend items to explain the two layers of data. The solid donut chart presents the mix of ticket sales by country of origin, comparing U.S. movies, French movies, and "others". Then, there are two thin arcs showing the mix of movies by country of origin. 

The donut chart has an unusual feature. Typically, the data are coded in the angles at the donut's center. Here, the data are coded twice: once at the center, and again in the width of the ring. This is a self-defeating feature because it draws even more attention to the areas of the donut slices, yet those areas are highly distorted. If the ratios of the areas are accurate when all three pieces have the same width, then varying those widths causes the ratios to shift from the correct ones!
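A small sketch of that distortion, using the standard area formula for a donut slice. The shares and ring radii are hypothetical, chosen only to make the arithmetic easy; they are not read off the infographic.

```python
import math

def slice_area(share, inner, outer):
    """Area of a donut slice that spans `share` of the full circle."""
    return share * math.pi * (outer ** 2 - inner ** 2)

# Hypothetical shares of ticket sales and hypothetical ring radii
us_share, french_share = 0.6, 0.3
inner = 1.0

# Same ring width for both slices: the area ratio equals the data ratio (2 to 1)
ratio_equal = slice_area(us_share, inner, 2.0) / slice_area(french_share, inner, 2.0)

# Wider ring for the larger slice: the area ratio drifts away from the data ratio
ratio_varied = slice_area(us_share, inner, 2.5) / slice_area(french_share, inner, 2.0)
```

With these numbers the data ratio is 2, but the varied-width version draws areas in a ratio of 3.5.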

The best thing about this chart is found in the little blue star, which adds context to the statistics. The 61% number is unusually high, which demands an explanation. The designer tells us it's due to the popularity of The Lion King.


The one donut is for the year 1994. The infographic actually shows an entire time series from 1994 to 2014.

The design is most unusual. The years 1994, 1999, 2004, 2009, 2014 receive special attention. The in-between years are split into two pairs, shrunk, and placed alternately to the right and left of the highlighted years. So your eyes are asked to zig-zag down the page in order to understand the trend. 

To see the change of U.S. movie ticket sales over time, you have to estimate the sizes of the red-orange donut slices from one pie chart to another. 

Here is an alternative visual design that brings out the two messages in this data: that French movie-goers are increasingly preferring French movies, and that U.S. movies no longer account for the majority of ticket sales.


A long-term linear trend exists for both U.S. and French ticket sales. The "outlier" values are highlighted and explained by the blockbuster that drove them.



1. You can register for the free seminar in Lyon here. To register for live streaming, go here.
2. Thanks Carla Paquet at JMP for help translating from French.

No Latin honors for graphic design

This chart appeared in a recent issue of Princeton Alumni Weekly.

If you read the sister blog, you'll be aware that at most universities in the United States, every student is above average! At Princeton, 47% of the graduating class earned "Latin" honors. The median student just missed graduating with honors so the honors graduate is just above average! The 47% number is actually lower than at some other peer schools - at one point, Harvard was giving 90% of its graduates Latin honors.

Side note: In researching this post, I also learned that in the Senior Survey for Harvard's Class of 2018, two-thirds of the respondents (response rate was about 50%) reported GPA to be 3.71 or above, and half reported 3.80 or above, which means their grade average is higher than A-.  Since Harvard does not give out A+, half of the graduates received As in almost every course they took, assuming no non-response bias.


Back to the chart. It's a simple chart but it's not getting a Latin honor.

Most readers of the magazine will not care about the decimal point. Just write 18.9% as 19%. Or even 20%.

The sequencing of the honor levels is backwards. Summa should be on top.


Warning: the remainder of this post is written for graphics die-hards. I go through a bunch of different charts, exploring some fine points.

People often complain that bar charts are boring. A trendy alternative when it comes to count or percentage data is the "pictogram."

Here are two versions of the pictogram. On the left, each percent point is shown as a dot. Then imagine each dot turned into a square, then remove all padding and lines, and you get the chart on the right, which is basically an area chart.


The area chart is actually worse than the original column chart. It's now much harder to judge the areas of irregularly-shaped pieces. You'd have to add data labels to assist the reader.

The 100-dot version is appealing because the reader can count out the number of each type of honors. But I don't like visual designs that turn readers into bean-counters.

So I experimented with ways to simplify the counting. If counting is easier, then making comparisons is also easier.

Start with this observation: When asked to count a large number of objects, we group by 10s and 5s.

So, on the left chart below, I made connectors to form groups of 5 or 10 dots. I wonder if I should use different line widths to differentiate groups of five and groups of ten. But the human brain is very powerful: even when I use the same connector style, it's easy to see which is a 5 and which is a 10.


On the left chart, the organizing principles are to keep each connector to its own row, and within each category, to start with 10-group, then 5-group, then singletons. The anti-principle is to allow same-color dots to be separated. The reader should be able to figure out Summa = 10+3, Magna = 10+5+1, Cum Laude = 10+5+4.
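The grouping rule for the left chart (10-group first, then 5-group, then singletons) can be sketched in a few lines. The function name `group_dots` is a hypothetical helper, and the counts are the ones quoted above.

```python
def group_dots(n):
    """Split n dots into (groups of ten, groups of five, singletons)."""
    tens, rest = divmod(n, 10)
    fives, ones = divmod(rest, 5)
    return tens, fives, ones

summa = group_dots(13)      # (1, 0, 3): 10 + 3
magna = group_dots(16)      # (1, 1, 1): 10 + 5 + 1
cum_laude = group_dots(19)  # (1, 1, 4): 10 + 5 + 4
```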

The right chart is even more experimental. The anti-principle is to allow bending of the connectors. I also give up on using both 5- and 10-groups. By only using 5-groups, readers can rely on their instinct that anything connected (whether straight or bent) is a 5-group. This is powerful. It relieves the effort of counting while permitting the dots to be packed more tightly by respective color.

I also exploited symmetry to further reduce the counting effort. Symmetry is powerful as it removes duplicate effort. In the above chart, once the reader has figured out how to read Magna, reading Cum Laude is simplified because the two categories share two straight connectors, and two bent connectors that are mirror images, so it's clear that Cum Laude is more than Magna by exactly three dots (percentage points).


Of course, if the message you want to convey is that roughly half the graduates earn honors, and those honors are split almost evenly into thirds, then the column chart is sufficient. If you do want to use a pictogram, spend some time thinking about how you can reduce the effort of the counting!






Made in France stereotypes

France is on my mind lately, as I prepare to bring my dataviz seminar to Lyon in a couple of weeks.  (You can still register for the free seminar here.)

The following Made in France poster brings out all the stereotypes of the French.


(You can download the original PDF here.)

It's a sankey diagram with so many flows that it screams "it's complicated!" This is an example of a graphic for want of a story. In a Trifecta Checkup, it's failing in the Q(uestion) corner.

It's also failing in the D(ata) corner. Take a look at the top of the chart.


France exported $572 billion worth of goods. The diagram then plots eight categories of exports, ranging from wines to cheeses:


Wine exports totaled $9 billion which is about 1.6% of total exports. That's the largest category of the eight shown on the page. Clearly the vast majority of exports are excluded from the sankey diagram.

Are the 8 the largest categories of exports for France? According to this site, those are (1) machinery (2) aircraft (3) vehicles (4) electrical machinery (5) pharmaceuticals (6) plastics (7) beverages, spirits, vinegar (8) perfumes, cosmetics.

Compare: (1) wines (2) jewellery (3) perfume (4) clothing (5) cheese (6) baked goods (7) chocolate (8) paintings.

It's stereotype central. Name 8 things associated with the French brand and cherry-pick those.

Within each category, the diagram does not show all of the exports either. It discloses that the bars for wines show only $7 of the $9 billion worth of wines exported. This is because the data only capture the "Top 10 Importers." (See below for why the designer did this... France exports wine to more than 180 countries.)

Finally, look at the parade of key importers of French products, as shown at the bottom of the sankey:


The problem with interpreting this list of countries is best felt by attempting to describe which countries ended up on this list! It's the list of countries that belong to the top 10 importers of one or more of the eight chosen products, ordered by the total value of imports in those 8 categories only but only including the value in any category if it rises to the top 10 of the respective category.

In short, with all those qualifications, the size or rank of the black bars does not convey any useful information.
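The selection rule described above can be sketched as follows. This is only an illustration of the logic as I read it from the poster, with made-up categories, countries and values; `importer_totals` is a hypothetical helper, not the designer's code.

```python
from collections import defaultdict

def importer_totals(exports, top_n=10):
    """exports maps category -> {country: value}.

    A country's total counts a category only when the country is among
    that category's top_n importers; everything else is silently dropped.
    """
    totals = defaultdict(float)
    for by_country in exports.values():
        top = sorted(by_country.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        for country, value in top:
            totals[country] += value
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical data: two categories, three importing countries
exports = {
    "wine":   {"A": 5.0, "B": 4.0, "C": 3.0},
    "cheese": {"C": 7.0, "A": 1.0, "B": 0.5},
}
ranked = importer_totals(exports, top_n=2)
```

With `top_n=2`, country C is credited with 7.0 even though it imported 10.0 across the two categories, because its wine imports fall outside the wine top 2. The bar lengths therefore depend on the cutoff as much as on the trade itself.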


One feature of the chart that surprised me was the absence of flows in the Wine category from France to Italy or Spain. (Based on the above discussion, you should realize that no flows does not mean no exports.) So I went to the Comtrade database that is referenced in the poster, and pulled out all the wine export data.

How does one visualize where French wines are going? After fiddling around with the numbers, I came up with the following diagram:


I like this type of block diagram which brings out the structure of the dataset. The key features are:

  • Total wine exports to the rest of the world were $1.4 billion in 2016
  • Half of it went to five European neighbors, the other half to the rest of the world
  • On the left half, Germany took a third of those exports; the UK and Switzerland together took another third; and the final third went to Belgium and the Netherlands
  • On the right half, the countries in the blue zone accounted for three-fifths with the unspecified countries taking two-fifths.
  • As indicated, the two-fifths (in gray) represent 20% of total wine exports, and were spread out among over 180 countries.
  • The three-fifths of the blue zone were split in half, with the first half going to North America (about 2/3 to USA and 1/3 to Canada) and the second half going to Asia (2/3 to China and 1/3 to Japan)
  • As the title indicates, the top 9 importers of French wine covered 80% of the total volume (in litres) while the other 180+ countries took 20% of the volume
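The nested fractions in the list above can be checked with a little arithmetic. The sketch below treats the rounded fractions quoted in the bullets (a half, a third, three-fifths, "about 2/3") as if they were exact, which is an assumption; the variable names are mine.

```python
from fractions import Fraction as F

europe = F(1, 2)          # five European neighbors
rest = 1 - europe         # everyone else

germany = europe * F(1, 3)
uk_switzerland = europe * F(1, 3)
belgium_netherlands = europe * F(1, 3)

blue = rest * F(3, 5)     # named countries outside Europe
gray = rest * F(2, 5)     # the 180+ unspecified countries

north_america, asia = blue / 2, blue / 2
usa, canada = north_america * F(2, 3), north_america * F(1, 3)
china, japan = asia * F(2, 3), asia * F(1, 3)

top9 = europe + blue      # nine named importers
```

Under these assumptions the gray zone comes to 1/5 (20%) and the top nine importers to 4/5 (80%), consistent with the title of the diagram.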

The most time-consuming part of this exercise was finding the appropriate structure which can be easily explained in a visual manner.



Why line charts are better than area charts

I saw this chart on Business Insider recently:


This links to Market Insider, where there is a tool to make different types of charts. Despite the huge drop depicted above, by last week, the Dow Jones index had recovered to the level at the start of 2018:


The same chart can be made as an area chart (called a "mountain chart" by Market Insider).


The painting of the area serves no purpose here because the area doesn't mean anything.

Imagine adding an inch of space to the bottom of each chart. The area chart is sensitive to the choice of the minimum value of the vertical axis while the line chart isn't. Since the data did not change, it's not a good idea for the display to shift perception. That's why I prefer the line chart.
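Here is a rough sketch of that sensitivity, using a made-up index series (not the actual Dow Jones values) and hypothetical helper names. It compares what each chart type draws when the bottom of the vertical axis moves.

```python
def shaded_area(values, axis_min):
    """Trapezoid-rule area between the series and the bottom of the plot (unit spacing)."""
    heights = [v - axis_min for v in values]
    return sum((a + b) / 2 for a, b in zip(heights, heights[1:]))

def point_gaps(values, axis_min):
    """Vertical distances between consecutive points as drawn; this is all a line shows."""
    heights = [v - axis_min for v in values]
    return [b - a for a, b in zip(heights, heights[1:])]

# Hypothetical index levels
index = [25000, 26000, 24000, 25000]

areas = {m: shaded_area(index, m) for m in (0, 23000)}
gaps = {m: point_gaps(index, m) for m in (0, 23000)}
```

The painted area shrinks from 75,000 to 6,000 units when the axis minimum moves from 0 to 23,000, while the rises and falls traced by the line are identical in both cases.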

A chart Hans Rosling would have loved

I came across this chart from the OurWorldinData website, and this one would make the late Hans Rosling very happy.


If you went to one of Professor Rosling's talks, you would know that he was bitter that the amazing gains in public health worldwide (but particularly in less developed nations) during the last few decades have been little noticed. This chart makes it clear: note especially the dramatic plunge in extreme poverty, rise in vaccinations, drop in child mortality, and improvement in education and literacy, mostly achieved in the last few decades.

This set of charts has a simple but powerful message. It's the simplicity of execution that really helps readers get that powerful message.

The text labels on the left and right side of the charts are just perfect.


Little things that irk me:

I am not convinced by the liberal use of colors - I would make the "other" category of each chart consistently gray so 6 colors total. Having different colors does make the chart more interesting to look at.

Even though the gridlines are muted, I still find them excessive.

There is a coding bug in the Vaccination chart right around 1960.


Choosing the right metric reveals the story behind the subway mess in NYC

I forgot who sent this chart to me - it may have been a Twitter follower. The person complained that the following chart exaggerated how much trouble the New York mass transit system (MTA) has been facing in 2017, because of the choice of the vertical axis limits.


This chart is vintage Excel, using Excel defaults. I find this style ugly and uninviting. But the chart does contain some good analysis. The analyst made two smart moves: the chart controls for month-to-month seasonality by plotting the data for the same month over successive years; and the designation "12 month averages" really means moving averages with a window size of 12 months - this has the effect of smoothing out the short-term fluctuations to reveal the longer-term trend.
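A minimal sketch of a 12-month moving average, assuming a trailing window (the chart does not say whether the window is trailing or centered). The seasonal series is invented to show the smoothing effect.

```python
def moving_average(values, window=12):
    """Trailing moving average; returns one value per full window."""
    if window <= 0 or window > len(values):
        return []
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Hypothetical ridership with a purely seasonal pattern repeated over two years
seasonal = list(range(1, 13)) * 2
smoothed = moving_average(seasonal)
```

Because every 12-month window contains each month exactly once, the seasonal pattern cancels and `smoothed` is flat; any remaining slope in real data would reflect the longer-term trend.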

The red line is very alarming as it depicts a sustained negative trend over the entire year of 2017, even though the actual decline is a small percentage.

If this chart showed up on a business dashboard, the CEO would have been extremely unhappy. Slow but steady declines are the most difficult trends to deal with because they cannot be explained by one-time impacts. Until the analytics department figures out what the underlying cause is, it's very difficult to curtail, and with each monthly report, the sense of despair grows.

Because the base number of passengers in the New York transit system is so high, using percentages to think about the shift in volume underplays the message. It's better to use actual millions of passengers lost. That's what I did in my version of this chart:


The quantity depicted is the unexpected loss of revenue passengers, measured against a forecast. The forecast I used is the average of the past two years' passenger counts. Above the zero line means out-performing the forecast but of course, in this case, since October 2016, the performance has dipped ever farther below the forecast. By April 2017, the gap had widened to over 5 million passengers. That's a lot of lost customers and lost revenues, regardless of percent!
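The metric can be sketched as below. It assumes the forecast averages the same quantity from each of the two prior years, which is one reading of the description above; the passenger counts are invented, and `forecast_gap` is a hypothetical name.

```python
def forecast_gap(actual, prior_year, two_years_ago):
    """Actual minus forecast, where the forecast is the mean of the two prior years."""
    forecast = (prior_year + two_years_ago) / 2
    return actual - forecast

# Hypothetical passenger counts, in millions
gap = forecast_gap(actual=140.0, prior_year=146.0, two_years_ago=144.0)
```

A negative `gap` (here -5.0 million) means ridership fell short of the forecast, expressed in passengers rather than percent.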

The biggest headache is to investigate what is the cause of this decline. Most likely, it is a combination of factors.

Let's not mix these polarized voters as the medians run away from one another

Long-time follower Daniel L. sent in a gem, by the Washington Post. This is a multi-part story about the polarization of American voters, nicely laid out, with superior analyses and some interesting graphics. Click here to see the entire article.

Today's post focuses on the first graphic. This one:


The key messages are written out on the 2017 charts: namely, 95% of Republicans are more conservative than the median Democrat, and 97% of Democrats are more liberal than the median Republican.

This is a nice statistical way of laying out the polarization. There are a number of additional insights one can draw from the population distributions: for example, in the bottom row, the Democrats have been moving left consistently, and decisively in 2017. By contrast, Republicans moved decisively to the right from 2004 to 2017. I recall reading about polarization in past elections but it is really shocking to see the extreme in 2017.
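The headline statistic (the share of one party lying beyond the other party's median) is simple to compute. The sketch below uses a made-up ideology scale, where lower means more liberal, and made-up respondents; it is not the Washington Post's data or method.

```python
from statistics import median

def share_beyond_median(group, other, direction):
    """Fraction of `group` strictly above or below the median of `other`."""
    m = median(other)
    if direction == "above":
        hits = sum(1 for x in group if x > m)
    else:
        hits = sum(1 for x in group if x < m)
    return hits / len(group)

# Hypothetical scores: negative is liberal, positive is conservative
democrats = [-8, -6, -5, -3, 1]
republicans = [-6, 2, 4, 6, 8]

reps_right_of_median_dem = share_beyond_median(republicans, democrats, "above")
dems_left_of_median_rep = share_beyond_median(democrats, republicans, "below")
```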

A really astounding but hidden feature is that the median Democrat and the median Republican were not too far apart in 1994 and 2004 but the gap exploded in 2017.


I'd like to solve a few minor problems on this graphic. It's a bit confusing to have each chart display information on both Republican and Democratic distributions. The reader has to understand that in the top row, the red area represents Republican voters but the blue line shows the median Democrat.

Also, I want to surface two key insights: the huge divide that developed in 2017, and the exploding gap between the two medians.

Here is the revised graphic:


On the left side, each chart focuses on one party, and the trend over the three elections. The reader can cross charts to discover that the median voter in one party is more extreme than essentially all of the voters of the other party. This same conclusion can be drawn from the exploding gap between the median voters in either party, which is explicitly plotted in the lower right chart. The top right chart is a pretty visualization of how polarized the country was in the 2017 election.


Some like it packed, some like it piled, and some like it wrapped

In addition to Xan's "packed bars" (which I discussed here), there are some related efforts to improve upon the treemap. To recap, the treemap is a design to show parts against the whole, and it works by packing rectangles into the bounding box. Frequently, this leads to odd-shaped rectangles, e.g. really thin and really tall ones, and it asks readers to estimate relative areas of differently-scaled boxes. We often make mistakes in this task.

The packed bar chart approaches this challenge by allowing only the width of the box to vary with the data. The height of every box is identical, so readers only have to compare lengths.

Via Twitter, Adil pointed me to this article by him and his collaborators that describes a few alternatives.

One of the options is the "wrapped bar chart" introduced by Stephen Few. Like Xan, he also restricts the variation to lengths of bars while keeping the heights fixed. But he goes further, and abandons packing completely. Instead of packing, Few wraps the bars. Start with a large bar chart with many categories filling up a tall plotting area. He then divides the bars into different blocks and places them side by side. Here is an example showing 50 states, ranked by total electoral votes:


You can see the white space because there is no packing. This version makes it easier to see the relative importance of the different blocks of states but it is tough to tell how much the first block of 13 states accounts for. The wrapped bar chart is organized like a set of small multiples, except that the scale in each panel is allowed to vary.
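The wrapping step itself can be sketched as sorting the bars and cutting the ranked list into consecutive blocks. The choice of four blocks and the placeholder values are assumptions for illustration; `wrap_bars` is a hypothetical helper, not Few's implementation.

```python
def wrap_bars(items, n_blocks):
    """Rank (label, value) pairs in descending order and cut them into consecutive blocks."""
    ranked = sorted(items, key=lambda kv: kv[1], reverse=True)
    size = -(-len(ranked) // n_blocks)  # ceiling division
    return [ranked[i : i + size] for i in range(0, len(ranked), size)]

# 50 placeholder categories with made-up values 1..50
items = [(f"S{i}", i) for i in range(1, 51)]
blocks = wrap_bars(items, 4)

block_sizes = [len(b) for b in blocks]
# Each panel gets its own scale, set here by the largest bar in the block
panel_max = [b[0][1] for b in blocks]
```

Fifty categories in four blocks gives blocks of 13, 13, 13 and 11 bars, and `panel_max` shows how differently each panel would be scaled.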

Another option is the "piled bars." This option, presented by Yalçın, Elmqvist, and Bederson, brings packing back. But unlike the packed bars or the treemap, the outside envelope no longer represents the total amount. In the "piled bars" design, the top X categories act as the canvas, and the smaller categories are packed inside these bars rather than around them. Take a look at this example, which plots GDP growth of different countries:


The inset on the left column is instructive. The green (smallest) and red (medium) bars are packed inside the blue (largest) bars. In this example, it doesn't make sense to add up GDP growth rates, so it doesn't matter that the outer envelope does not equal the total. It would not work as well with the electoral vote data in the previous example.

I wonder whether a piled dot plot works better than a piled bar chart. This piled bar chart shares a problem with the stacked area chart, which is that other than the first piece, all the other pieces represent the differences between the respective data and the next lower category, rather than the value of the data point. Readers are led to compare the green, red and blue pieces but the corresponding values are not truly comparable, or of primary interest.

This problem goes away if the bars are represented by dots.
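A small sketch of the issue with the piled bars: when bars are drawn inside one another, only the first piece shows a value in full, and every other visible piece is a difference. The three values are invented, and `visible_pieces` is a hypothetical name.

```python
def visible_pieces(values):
    """Lengths of the visible segments when bars are piled inside one another."""
    ordered = sorted(values)
    return [ordered[0]] + [b - a for a, b in zip(ordered, ordered[1:])]

# Hypothetical values for the smallest, medium and largest bar in one row
values = [9.0, 2.0, 5.0]
pieces = visible_pieces(values)  # [2.0, 3.0, 4.0]
```

The visible pieces are 2, 3 and 4 while the data are 2, 5 and 9. A dot placed at each value marks the data directly, so there is no segment to misread.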


What strikes me as the key paragraph in the article by Yalçın et al. is the following:

To understand graphical perception performance, we studied three basic tasks:

1) How accurately can we estimate the difference between two data points?
2) How accurately can we estimate the rank of a data point among all the rest?
3) How accurately can we guess the distribution characteristic of the whole dataset?

As chart designers, we have to prioritize these tasks. There is unlikely to be a single chart form that will prevail on all three tasks. So if the designer starts with the question that he or she wants to address, that leads to the key task that the visualization should enable, which leads to the chart form that facilitates that task the best.




Unintentional deception of area expansion #bigdata #piechart

Someone sent me this chart via Twitter, as an example of yet another terrible pie chart. (I couldn't find that tweet anymore but thank you to the reader for submitting this.)


At first glance, this looks like a pie chart with the radius as a second dimension. But that is the wrong interpretation.

In a pie chart, we typically encode the data in the angles of the pie sectors, or equivalently, the areas of the sectors. In this special case, the angle is invariant across the slices, and the data are encoded in the radius.

Since the data are found in the radii, let's deconstruct this chart by reducing each sector to its left-side edge.

This leads to a different interpretation of the chart: it’s actually a simple bar chart, manipulated.


The process of the manipulation runs against what data visualization should be. It takes the bar chart (bottom right) that is easy to read, introduces slants so it becomes harder to digest (top right), and finally absorbs a distortion to go from inefficient to incompetent (left).

What is this distortion I just mentioned? When readers look at the original chart, they are not focusing on the left-side edge of each sector but they are seeing the area of each sector. The ratio of areas is not the same as the ratio of lengths. Adding purple areas to the chart seems harmless but in fact, despite applying the same angles, the designer added disproportionately more area to the larger data points compared to the smaller ones.


In order to remedy this situation, the designer has to set the length of each edge to the square root of the data value, because the area of a sector grows with the square of its radius. But of course, the simple bar chart is more effective.
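The effect of the square root is easy to verify with the formula for the area of a circular sector. The two data values and the shared angle below are hypothetical.

```python
import math

def sector_area(radius, angle):
    """Area of a circular sector with the given radius and angle (in radians)."""
    return 0.5 * angle * radius ** 2

# Hypothetical data values and a shared angle for every slice
values = [1.0, 4.0]
angle = math.pi / 6

# Radius set to the value itself: areas exaggerate the difference
naive = [sector_area(v, angle) for v in values]

# Radius set to the square root of the value: areas match the data
corrected = [sector_area(math.sqrt(v), angle) for v in values]

naive_ratio = naive[1] / naive[0]
corrected_ratio = corrected[1] / corrected[0]
```

A 4-to-1 difference in the data is drawn as a 16-to-1 difference in area when the radius carries the value, and as 4-to-1 after the square-root correction.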