Hanging things on your charts

The Financial Times published the following chart that shows the rollout of vaccines in the U.K.


(I can't find the online link to the article. The article is titled "AstraZeneca and Oxford face setbacks and success as battle enters next phase", May 29/30 2021.)

This chart form is known as a "streamgraph", and it is a stacked area chart in disguise. 

The same trick can be applied to a column chart. See the "hanging" column chart below:


The two charts show exactly the same data. The left one roots the columns at the bottom. The right one aligns the middle of the columns. 

I have rarely found these hanging charts useful. The realignment makes it harder to compare the sizes of the different column segments. On the normal stacked column chart, the yellow segments are the easiest to compare because they share the same base level. Even this is taken away from the reader on the right side.

Note also that the hanging version does not admit a vertical axis

The same comments apply to the streamgraph.


Nevertheless, I was surprised that the FT chart shown above actually works. The main message I learned was that initially U.K. primarily rolled out AstraZeneca and, to a lesser extent, Pfizer, shots while later, they introduced other vaccines, including Johnson & Johnson, Novavax, CureVac, Moderna, and "Other". 

I can also see that the supply of AstraZeneca has not changed much through the entire time window. Pfizer has grown to roughly the same scale as AstraZeneca. Moderna remains a small fraction of total shots. 

I can even roughly see that the total number of vaccinations has grown about six times from start to finish. 

That's quite a lot for one chart, so job well done!

There is one problem with the FT chart. It should have labelled end of May as "today". Half the chart is history, and the other half is the future.


For those following Covid-19 news, the FT chart is informative in a different way.

There is a misleading statement going around blaming the U.K.'s recent surge in cases on the Astrazeneca vaccine, claiming that the U.K. mostly uses AZ. This chart shows that from the start, about a third of the shots administered in the U.K. are Pfizer, and Pfizer's share has been growing over time. 

U.K. compared to some countries mostly using mRNA vaccines


U.K. is almost back to the winter peak. That's because the U.K. is serious about counting cases. Look at the state of testing in these countries:


What's clear about the U.S. case count is that it is kept low by cutting the number of tests by two-thirds, thus, our data now is once again severely biased towards severe cases. 

We can do a back-of-the-envelope calculation. The drop in testing may directly lead to a proportional drop in reported cases, thus removing 500 (asymptomatic, or mild) cases per million from the case count. The case count goes below 250 per million so the additional 200 or so reduction is due to other reasons such as vaccinations.

Did prices go up or down? Depends on how one looks at the data

The U.S. media have been flooded with reports of runaway inflation recently, and it's refreshing to see a nice article in the Wall Street Journal that takes a second look at the data. Because as my readers know, raw data can be incredibly deceptive.

Inflation typically describes the change in price level relative to the prior year. The month-on-month change in price levels is a simple seasonal adjustment used to remove the effect of seasonality that masks the true change in price levels. (See this explainer of seasonal adjustment.)

As the pandemic enters the second year, this methodology is comparing 2021 price levels to pandemic-impacted price levels of 2020. This produces a very confusing picture. As the WSJ article explains, prices can be lower than they were in 2019 (pre-pandemic) and yet substantially higher than they were in 2020 (during the pandemic). This happens in industry sectors that were heavily affected by the economic shutdown, e.g. hotels, travel, entertainment.

Wsj_pricechangehotels_20192021Here is how they visualized this phenomenon. Amusingly, some algorithm estimated that it should take 5 minutes to read the entire article. It may take that much time to understand properly what this chart is showing.

Let me save you some time.

The chart shows monthly inflation rates of hotel price levels.

The pink horizontal stripes represent the official inflation numbers, which compare each month's hotel prices to those of a year prior. The most recent value for May of 2021 says hotel prices rose by 9% compared to May of 2020.

The blue horizontal stripes show an alternative calculation which compares each month's hotel prices to those of two years prior. Think of 2018-9 as "normal" years, pre-pandemic. Using this measure, we find that hotel prices for May of 2021 are about 4% lower than for May of 2019.

(This situation affects all of our economic statistics. We may see an expansion in employment levels from a year ago which still leaves us behind where we were before the pandemic.)

What confused me on the WSJ chart are the blocks of color. In a previous chart, the readers learn that solid colors mean inflation rose while diagonal lines mean inflation decreased. It turns out that these are month-over-month changes in inflation rates (notice that one end of the column for the previous month touches one end of the column of the next month).

The color patterns become the most dominant feature of this chart, and yet the month-over-month change in inflation rates isn't the crux of the story. The real star of the story should be the difference in inflation rates - for any given month - between two reference years.


In the following chart, I focus attention on the within-month, between-reference-years comparisons.


Because hotel prices dropped drastically during the pandemic, and have recovered quite well in recent months as the U.S. reopens the economy, the inflation rate of hotel prices is almost 10%. Nevertheless, the current price level is still 7% below the pre-pandemic level.



Plotting the signal or the noise

Antonio alerted me to the following graphic that appeared in the Economist. This is a playful (?) attempt to draw attention to racism in the game of football (soccer).

The analyst proposed that non-white players have played better in stadiums without fans due to Covid19 in 2020 because they have not been distracted by racist abuse from fans, using Italy's Serie A as the case study.


The chart struggles to bring out this finding. There are many lines that criss-cross. The conclusion is primarily based on the two thick lines - which show the average performance with and without fans of white and non-white players. The blue line (non-white) inched to the right (better performance) while the red line (white) shifted slightly to the left.

If the reader wants to understand the chart fully, there's a lot to take in. All (presumably) players are ranked by the performance score from lowest to highest into ten equally sized tiers (known as "deciles"). They are sorted by the 2019 performance when fans were in the stadiums. Each tier is represented by the average performance score of its members. These are the values shown on the top axis labeled "with fans".

Then, with the tiers fixed, the players are rated in 2020 when stadiums were empty. For each tier, an average 2020 performance score is computed, and compared to the 2019 performance score.

The following chart reveals the structure of the data:


The players are lined up from left to right, from the worst performers to the best. Each decile is one tenth of the players, and is represented by the average score within the tier. The vertical axis is the actual score while the horizontal axis is a relative ranking - so we expect a positive correlation.

The blue line shows the 2019 (with fans) data, which are used to determine tier membership. The gray dotted line is the 2020 (no fans) data - because they don't decide the ranking, it's possible that the average score of a lower tier (e.g. tier 3 for non-whites) is higher than the average score of a higher tier (e.g. tier 4 for non-whites).

What do we learn from the graphic?

It's very hard to know if the blue and gray lines are different by chance or by whether fans were in the stadium. The maximum gap between the lines is not quite 0.2 on the raw score scale, which is roughly a one-decile shift. It'd be interesting to know the variability of the score of a given player across say 5 seasons prior to 2019. I suspect it could be more than 0.2. In any case, the tiny shifts in the averages (around 0.05) can't be distinguished from noise.


This type of analysis is tough to do. Like other observational studies, there are multiple problems of biases and confounding. Fan attendance was not the only thing that changed between 2019 and 2020. The score used to rank players is a "Fantacalcio algorithmic match-level fantasy-football score." It's odd that real-life players should be judged by their fantasy scores rather than their on-the-field performance.

The causal model appears to assume that every non-white player gets racially abused. At least, the analyst didn't look at the curves above and conclude, post-hoc, that players in the third decile are most affected by racial abuse - which is exactly what has happened with the observational studies I have featured on the book blog recently.

Being a Serie A fan, I happen to know non-white players are a small minority so the error bars are wider, which is another issue to think about. I wonder if this factor by itself explains the shifts in those curves. The curve for white players has a much higher sample size thus season-to-season fluctuations are much smaller (regardless of fans or no fans).





Tip of the day: transform data before plotting

The Financial Times called out a twitter user for some graphical mischief. Here are the two charts illustrating the plunge in Bitcoin's price last week : (Hat tip to Mark P.)


There are some big differences between the two charts. The left chart depicts this month's price actions, drawing attention to the last week while the right chart shows a longer period of time, starting from 2012. The author of the tweet apparently wanted to say that the recent drop is nothing to worry about. 

The Financial Times reporter noted another subtle difference - the right chart uses a log scale while the left chart is linear. Specifically, it's a log 2 scale, which means that each step up is double the previous number (1, 2, 4, 8, etc.). The effect is to make large changes look smaller. Presumably most readers fail to notice the scale. Even if they do, it's not natural to assign different differences to the same physical distances.



These price charts always miss the mark. That's because the current price is insufficient to capture whether a Bitcoin investor made money or lost money. If you purchased Bitcoins this month, you lost money. If your purchase was a year ago, you still made quite a bit of money despite the recent price plunge.

The following chart should not be read as a time series, even though the horizontal axis is time. Think date of Bitcoin purchase. This chart tells you how much $1 of Bitcoin is worth last week, based on what day the purchase was made.


People who bought this year have mostly been in the red. Those who purchased before October 2020 and held on are still very pleased with their decision.

This example illustrates that simple transformations of the raw data yield graphics that are much more informative.


Did the pandemic drive mass migration?

The Wall Street Journal ran this nice compact piece about migration patterns during the pandemic in the U.S. (link to article)


I'd look at the chart on the right first. It shows the greatest net flow of people out of the Northeast to the South. This sankey diagram is nicely done. The designer shows restraint in not printing the entire dataset on the chart. If a reader really cares about the net migration from one region to a specific other region, it's easy to estimate the number even though it's not printed.

The maps succinctly provide readers the definition of the regions.

To keep things in perspective, we are talking around 100,000 when the death toll of Covid-19 is nearing 600,000. Some people have moved but almost everyone else haven't.


The chart on the left breaks down the data in a different way - by urbanicity. This is a variant of the stacked column chart. It is a chart form that fits the particular instance of the dataset. It works only because in every month of the last three years, there was a net outflow from "large metro cores". Thus, the entire series for large metro cores can be pointed downwards.

The fact that this design is sensitive to the dataset is revealed in the footnote, which said that the May 2018 data for "small/medium metro" was omitted from the chart. Why didn't they plot that number?

It's the one datum that sticks out like a sore thumb. It's the only negative number in the entire dataset that is not associated with "large metro cores". I suppose they could have inserted a tiny medium green slither in the bottom half of that chart for May 2018. I don't think it hurts the interpretation of the chart. Maybe the designer thinks it might draw unnecessary attention to one data point that really doesn't warrant it.


See my collection of posts about Wall Street Journal graphics.

Two commendable student projects, showing different standards of beauty

A few weeks ago, I did a guest lecture for Ray Vella's dataviz class at NYU, and discussed a particularly hairy dataset that he assigns to students.

I'm happy to see the work of the students, and there are two pieces in particular that show promise.

The following dot plot by Christina Barretto shows the disparities between the richest and poorest nations increasing between 2000 and 2015.

BARRETTO  Christina - RIch Gets Richer Homework - 2021-04-14

The underlying dataset has the average GDP per capita for the richest and the poor regions in each of nine countries, for two years (2000 and 2015). With each year, the data are indiced to the national average income (100). In the U.K., the gap increased from around 800 to 1,100 in the 15 years. It's evidence that the richer regions are getting richer, and the poorer regions are getting poorer.

(For those into interpreting data, you should notice that I didn't say the rich getting richer. During the lecture, I explain how to interpret regional averages.)

Christina's chart reflects the tidy, minimalist style advocated by Tufte. The countries are sorted by the 2000-to-2015 difference, with Britain showing up as an extreme outlier.


The next chart by Adrienne Umali is more infographic than Tufte.

Adrienne Umali_v2

It's great story-telling. The top graphic explains the underlying data. It shows the four numbers and how the gap between the richest and poorest regions is computed. Then, it summarizes these four numbers into a single metric, "gap increase". She chooses to measure the change as a ratio while Christina's chart uses the difference, encoded as a vertical line.

Adrienne's chart is successful because she filters our attention to a single country - the U.S. It's much too hard to drink data from nine countries in one gulp.

This then sets her up for the second graphic. Now, she presents the other eight countries. Because of the work she did in the first graphic, the reader understands what those red and green arrows mean, without having to know the underlying index values.

Two small suggestions: a) order the countries from greatest to smallest change; b) leave off the decimals. These are minor flaws in a brilliant piece of work.



Come si dice donut in italiano

One of my Italian readers sent me the following "horror chart". (Last I checked, it's not Halloween.)


I mean, people are selling these rainbow sunglasses.


The dataset behind the chart is the market share of steel production by country in 1992 and in 2014. The presumed story is how steel production has shifted from country to country over those 22 years.

Before anything else, readers must decipher the colors. This takes their eyes off the data and on to the color legend placed on the right column. The order of the color legend is different from that found in the nearest object, the 2014 donut. The following shows how our eyes roll while making sense of the donut chart.


It's easier to read the 1992 donut because of the order but now, our eyes must leapfrog the 2014 donut.


This is another example of a visualization that fails the self-sufficiency test. The entire dataset is actually printed around the two circles. If we delete the data labels, it becomes clear that readers are consuming the data labels, not the visual elements of the chart.


The chart is aimed at an Italian audience so they may have a patriotic interest in the data for Italia. What they find is disappointing. Italy apparently completely dropped out of steel production. It produced 3% of the world's steel in 1992 but zero in 2014.

Now I don't know if that is true because while reproducing the chart, I noticed that in the 2014 donut, there is a dark orange color that is not found in the legend. Is that Italy or a mysterious new entrant to steel production?

One alternative is a dot plot. This design accommodates arrows between the dots indicating growth versus decline.



Losses trickle down while gains trickle up

In a rich dataset, it's hard to convey all the interesting insights on a single chart. Following up on the previous post, I looked further at the wealth distribution dataset. In the previous post, I showed this chart, which indicated that the relative wealth of the super-rich (top 1%) rose dramatically around 2011.


As a couple of commenters noticed, that's relative wealth. I indiced everything to the Bottom 50%.

In this next chart, I apply a different index. Each income segment is set to 100 at the start of the time period under study (2000), and I track how each segment evolved in the last two decades.


This chart offers many insights.

The Bottom 50% have been left far, far behind in the last 20 years. In fact, from 2000-2018, this segment's wealth never once reached the 2000 level. At its worst, around 2010, the Bottom 50% found themselves 80% poorer than they were 10 years ago!

In the meantime, the other half of the population has seen their wealth climb continuously through the 20 years. This is particularly odd because the major crisis of these two decades was the Too Big to Fail implosion of financial instruments, which the Bottom 50% almost surely did not play a part in. During that crisis, the top 50% were 30-60% better off than they were in 2000. Is this the "trickle-down" economy in which losses are passed down (but gains are passed up)?

The chart also shows how the recession hit the bottom 50% much deeper, and how the recovery took more than a decade. For the top half, the recovery came between 2-4 years.

It also appears that top 10% are further peeling off from the rest of the population. Since 2009, the top 11-49% have been steadily losing ground relative to the top 10%, while the gap between them and the Bottom 50% has narrowed.


This second chart is not nearly as dramatic as the first one but it reveals much more about the data.


Finding the hidden information behind nice-looking charts

This chart from Business Insider caught my attention recently. (link)


There are various things they did which I like. The use of color to draw a distinction between the top 3 lines and the line at the bottom - which tells the story that the bottom 50% has been left far behind. Lines being labelled directly is another nice touch. I usually like legends that sit atop the chart; in this case, I'd have just written the income groups into the line labels.

Take a closer look at the legend text, and you'd notice they struggled with describing the income percentiles.


This is a common problem with this type of data. The top and bottom categories are easy, as it's most natural to say "top x%" and "bottom y%". By doing so, we establish two scales, one running from the top, and the other counting from the bottom - and it's a head scratcher which scale to use for the middle categories.

The designer decided to lose the "top" and "bottom" descriptors, and went with "50-90%" and "90-99%". Effectively, these follow the "bottom" scale. "50-90%" is the bottom 50 to 90 percent, which corresponds to the top 10 to 50 percent. "90-99%" is the bottom 90-99%, which corresponds to the top 1 to 10%. On this chart, since we're lumping the top three income groups, I'd go with "top 1-10%" and "top 10-50%".


The Business Insider chart is easy to mis-read. It appears that the second group from the top is the most well-off, and the wealth of the top group is almost 20 times that of the bottom group. Both of those statements are false. What's confusing us is that each line represents very different numbers of people. The yellow line is 50% of the population while the "top 1%" line is 1% of the population. To see what's really going on, I look at a chart showing per-capita wealth. (Just divide the data of the yellow line by 50, etc.)


For this chart, I switched to a relative scale, using the per-capita wealth of the Bottom 50% as the reference level (100). Also, I applied a 4-period moving average to smooth the line. The data actually show that the top 1% holds much more wealth per capita than all other income segments. Around 2011, the gap between the top 1% and the rest was at its widest - the average person in the top 1% is about 3,000 times wealthier than someone in the bottom 50%.

This chart raises another question. What caused the sharp rise in the late 2000s and the subsequent decline? By 2020, the gap between the top and bottom groups is still double the size of the gap from 20 years ago. We'd need additional analyses and charts to answer this question.


If you are familiar with our Trifecta Checkup, the Business Insider chart is a Type D chart. The problem with it is in how the data was analyzed.

The time has arrived for cumulative charts

Long-time reader Scott S. asked me about this Washington Post chart that shows the disappearance of pediatric flu deaths in the U.S. this season:


The dataset behind this chart is highly favorable to the designer, because the signal in the data is so strong. This is a good chart. The key point is shown clearly right at the top, with an informative title. Gridlines are very restrained. I'd draw attention to the horizontal axis. The master stroke here is omitting the week labels, which are likely confusing to all but the people familiar with this dataset.

Scott suggested using a line chart. I agree. And especially if we plot cumulative counts, rather than weekly deaths. Here's a quick sketch of such a chart:


(On second thought, I'd remove the week numbers from the horizontal axis, and just go with the month labels. The Washington Post designer is right in realizing that those week numbers are meaningless to most readers.)

The vaccine trials have brought this cumulative count chart form to the mainstream. For anyone who have seen the vaccine efficacy charts, the interpretation of the panel of line charts should come naturally.

Instead of four plots, I prefer one plot with four superimposed lines. Like this: