Hanging things on your charts

The Financial Times published the following chart that shows the rollout of vaccines in the U.K.

Ft_astrazeneca_uk_rollout

(I can't find the online link to the article. The article is titled "AstraZeneca and Oxford face setbacks and success as battle enters next phase", May 29/30 2021.)

This chart form is known as a "streamgraph", and it is a stacked area chart in disguise. 

The same trick can be applied to a column chart. See the "hanging" column chart below:

Junkcharts_hangingcolumns

The two charts show exactly the same data. The left one roots the columns at the bottom. The right one aligns the middle of the columns. 

I have rarely found these hanging charts useful. The realignment makes it harder to compare the sizes of the different column segments. On the normal stacked column chart, the yellow segments are the easiest to compare because they share the same base level. Even this is taken away from the reader on the right side.

Note also that the hanging version does not admit a vertical axis.

The same comments apply to the streamgraph.

***

Nevertheless, I was surprised that the FT chart shown above actually works. The main message I learned was that the U.K. initially rolled out mainly AstraZeneca and, to a lesser extent, Pfizer shots, while later it introduced other vaccines, including Johnson & Johnson, Novavax, CureVac, Moderna, and "Other".

I can also see that the supply of AstraZeneca has not changed much through the entire time window. Pfizer has grown to roughly the same scale as AstraZeneca. Moderna remains a small fraction of total shots. 

I can even roughly see that the total number of vaccinations has grown about six times from start to finish. 

That's quite a lot for one chart, so job well done!

There is one problem with the FT chart. It should have labelled the end of May as "today". Half the chart is history, and the other half is the future.

***

For those following Covid-19 news, the FT chart is informative in a different way.

There is a misleading statement going around blaming the U.K.'s recent surge in cases on the AstraZeneca vaccine, claiming that the U.K. mostly uses AZ. This chart shows that from the start, about a third of the shots administered in the U.K. are Pfizer, and Pfizer's share has been growing over time.

U.K. compared to some countries mostly using mRNA vaccines

Ourworldindata_cases

U.K. is almost back to the winter peak. That's because the U.K. is serious about counting cases. Look at the state of testing in these countries:

Ourworldindata_tests

What's clear about the U.S. case count is that it is kept low by cutting the number of tests by two-thirds. Thus, our data are once again severely biased towards severe cases.

We can do a back-of-the-envelope calculation. The drop in testing may directly lead to a proportional drop in reported cases, thus removing 500 (asymptomatic, or mild) cases per million from the case count. The case count goes below 250 per million so the additional 200 or so reduction is due to other reasons such as vaccinations.
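Here is a rough sketch of that arithmetic. The peak of 750 cases per million and the current level of 50 are my assumptions, back-solved from the 500 and 200 figures above; they are not read off the chart.

```python
# Back-of-the-envelope split of the drop in reported cases.
# ASSUMPTIONS: peak_cases and current_cases are back-solved from the
# figures in the text (500 removed, about 200 left over), not chart readings.
from fractions import Fraction

peak_cases = 750               # cases per million at the peak (assumed)
current_cases = 50             # cases per million now (assumed)
testing_cut = Fraction(2, 3)   # testing cut by two-thirds (from the text)

# If reported cases fall in proportion to tests, this many cases vanish.
removed_by_testing = peak_cases * testing_cut
# The case count we would see from the testing cut alone.
after_testing_cut = peak_cases - removed_by_testing
# Whatever drop remains is attributed to other causes, e.g. vaccinations.
other_reduction = after_testing_cut - current_cases

print(removed_by_testing, after_testing_cut, other_reduction)
```

The split is only as good as the proportionality assumption: it treats every test not run as a fixed fraction of a case not found.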


One of the most frequently produced maps is also one of the worst

Summer is here, many Americans are putting the pandemic in their rear-view mirrors, and gas prices are soaring. Business Insider told the story using this map:

Businessinsider_gasprices_1

What do we want to learn about gas prices this summer?

Which region has the highest / lowest prices?

How much higher / lower than the national average are the regional prices?

How much have prices risen, compared to last year, or compared to the last few weeks?

***

How much work did you have to do to get answers to those questions from the above map?

Unfortunately, this type of map continues to dominate the popular press. It merely delivers a geography lesson and not much else. Its dominant feature tells readers how to classify the 50 states into regions. Its color encodes no data.

Not surprisingly, this map fails the self-sufficiency test (link). The entire dataset is printed on the map, and if those numbers were removed, we would be left with a map of the regions of the U.S. The graphical elements of the chart are not doing much work.

***

In the following chart, I used the map as a color legend. Also, an additional plot shows each region's price level against the national average.

Junkcharts_redo_businessinsider_gasprices2021

One can certainly ditch the map altogether, which makes having seven colors unnecessary. To address other questions, just stack on other charts, for example, showing the price increase versus last year.

***

_trifectacheckup_image

From a Trifecta Checkup perspective, we find that the trouble starts with the Q corner. There are several important questions not addressed by the graphic. In the D corner, no context is provided to interpret the data. Are these prices abnormal? How do they compare to the national average or to a year ago? In the V corner, the chart takes too much effort to comprehend a basic fact, such as which region has the highest average price.

For more on the Trifecta Checkup, see this guide.

 


Did prices go up or down? Depends on how one looks at the data

The U.S. media have been flooded with reports of runaway inflation recently, and it's refreshing to see a nice article in the Wall Street Journal that takes a second look at the data. Because as my readers know, raw data can be incredibly deceptive.

Inflation typically describes the change in price level relative to the prior year. The year-on-year change in price levels is a simple seasonal adjustment used to remove the effect of seasonality that masks the true change in price levels. (See this explainer of seasonal adjustment.)

As the pandemic enters the second year, this methodology is comparing 2021 price levels to pandemic-impacted price levels of 2020. This produces a very confusing picture. As the WSJ article explains, prices can be lower than they were in 2019 (pre-pandemic) and yet substantially higher than they were in 2020 (during the pandemic). This happens in industry sectors that were heavily affected by the economic shutdown, e.g. hotels, travel, entertainment.

Wsj_pricechangehotels_20192021

Here is how they visualized this phenomenon. Amusingly, some algorithm estimated that it should take 5 minutes to read the entire article. It may take that much time to understand properly what this chart is showing.

Let me save you some time.

The chart shows monthly inflation rates of hotel price levels.

The pink horizontal stripes represent the official inflation numbers, which compare each month's hotel prices to those of a year prior. The most recent value for May of 2021 says hotel prices rose by 9% compared to May of 2020.

The blue horizontal stripes show an alternative calculation which compares each month's hotel prices to those of two years prior. Think of 2018-19 as "normal" years, pre-pandemic. Using this measure, we find that hotel prices for May of 2021 are about 7% lower than for May of 2019.

(This situation affects all of our economic statistics. We may see an expansion in employment levels from a year ago which still leaves us behind where we were before the pandemic.)
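To make the two calculations concrete, here is a minimal sketch. The index values are made up for illustration and are not the WSJ's hotel price data; `pct_change` is just a helper name I picked.

```python
def pct_change(current, base):
    """Percent change of `current` relative to `base`."""
    return (current - base) / base * 100

# Made-up price index for the same month in three consecutive years.
index = {2019: 100.0, 2020: 80.0, 2021: 90.0}

# Official-style inflation: compare with the same month a year prior.
one_year = pct_change(index[2021], index[2020])
# Alternative: compare with the same month two years prior (pre-pandemic).
two_year = pct_change(index[2021], index[2019])

print(f"vs. one year prior:  {one_year:+.1f}%")
print(f"vs. two years prior: {two_year:+.1f}%")
```

With these made-up numbers, the same 2021 price level shows up as a double-digit rise against 2020 and a drop against 2019. Only the base year differs.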

What confused me on the WSJ chart are the blocks of color. In a previous chart, the readers learn that solid colors mean inflation rose while diagonal lines mean inflation decreased. It turns out that these are month-over-month changes in inflation rates (notice that one end of the column for the previous month touches one end of the column of the next month).

The color patterns become the most dominant feature of this chart, and yet the month-over-month change in inflation rates isn't the crux of the story. The real star of the story should be the difference in inflation rates - for any given month - between two reference years.

***

In the following chart, I focus attention on the within-month, between-reference-years comparisons.

Junkcharts_redo_wsj_inflationbaserate

Because hotel prices dropped drastically during the pandemic, and have recovered quite well in recent months as the U.S. reopens the economy, the inflation rate of hotel prices is almost 10%. Nevertheless, the current price level is still 7% below the pre-pandemic level.

 



 


Start at zero improves this chart but only slightly

The following chart was forwarded to me recently:

Average_female_height

It's a good illustration of why the "start at zero" rule exists for column charts. The poor Indian lady looks extremely short in this women's club. Is the average Indian woman really half as tall as the average South African woman? (Surely not!)

Junkcharts_redo_womenheight_column

The problem is only superficially fixed by starting the vertical axis at zero. Doing so highlights the fact that the difference in average heights is but a fraction of the average heights themselves. The inter-country differences are squashed in such a representation - which works against the primary goal of the data visualization itself.

Recall the Trifecta Checkup. At the top of the trifecta is the Question. The designer obviously wants to focus our attention on the difference of the averages. A column chart showing average heights fails the job!

This "proper" column chart sends the message that the difference in average heights is noise, unworthy of our attention. But this is a bad take on the underlying data. The range of average heights across countries isn't that wide, by virtue of large population sizes.

According to Wikipedia, they range from 4 feet 10.5 to 5 feet 6. (I'm ignoring several entries in the table based on non-representative small samples.) How do we know that the difference of 2 inches between averages of South Africa and India is actually a sizable difference? The Wikipedia table has the average heights for most of the world's countries. There are perhaps 200 values. These values are sprinkled inside the range of about 8 inches top to bottom. If we divide the full range into 10 equal bins, that's roughly 0.8 inches per bin. So if we have two numbers that are 2 inches apart, they span between two and three bins. If the data were evenly distributed, that's a huge shift.
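The bin arithmetic can be checked with a few lines. This is a sketch using only the endpoints quoted above; the variable names are mine.

```python
# Range of country-level average female heights quoted from Wikipedia.
low_in = 4 * 12 + 10.5    # 4 ft 10.5 in, in inches
high_in = 5 * 12 + 6      # 5 ft 6 in, in inches
n_bins = 10               # split the full range into 10 equal bins
gap_in = 2                # South Africa minus India, in inches

full_range = high_in - low_in       # about 8 inches, as stated above
bin_width = full_range / n_bins     # a bit under 0.8 inches per bin
bins_spanned = gap_in / bin_width   # how many bins a 2-inch gap covers

print(f"range = {full_range} in, bin width = {bin_width} in, "
      f"gap spans {bins_spanned:.1f} bins")
```

Whether one rounds the range to 8 inches or not, a 2-inch gap covers more than two of the ten bins.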

(In reality, the data should be normally distributed, bell-shaped, with much more at the center than on the edges. That makes a difference of 2 inches even more significant if these are normal values near the center but less significant if these are extreme values on the tails. Stats students should be able to articulate why we are sure the data are normally distributed without having to plot the data.)

***

The original chart has further problems.

Another source of distortion comes from the scaling of the stick figures. The aspect ratio is being preserved, which means the area is being scaled. Given that the heights are scaled as per the data, the data are encoded twice, the second time in the widths. This means that the sizes of these figures grow at the rate of the square of the heights. (Contrast this with the scaling discussed in my earlier post this week which preserves the relative areas.)
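A quick sketch of the double encoding: when a figure is resized with its aspect ratio preserved, the width scales by the same factor as the height, so the area scales by the square. The function name and the example ratios are mine, chosen for illustration.

```python
def drawn_area_ratio(height_ratio):
    """Area ratio of two figures when the aspect ratio is preserved.

    The width is scaled by the same factor as the height, so the data
    are encoded twice and the area goes up with the square.
    """
    width_ratio = height_ratio
    return height_ratio * width_ratio

# A figure drawn half as tall occupies only a quarter of the area.
print(drawn_area_ratio(0.5))
# A figure drawn 10% shorter occupies about 19% less area.
print(drawn_area_ratio(0.9))
```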

At the end of that last post, I discuss why adding colors to a chart when the colors do not encode any data is a distraction to the reader. And this average height chart is an example.

From the Data corner of the Trifecta Checkup, I'm intrigued by the choice of countries. Why is Scotland highlighted instead of the U.K.? Why Latvia? According to Wikipedia, the Latvia estimate is based on a 1% sample of only 19-year-olds.

Some of the data appear to be incorrect (or the designer used a different data source). Wikipedia lists the average height of Latvian women as 5 ft 6.5 while the chart shows 5 ft 5 in. Peru's average height of females is listed as 4 ft 11.5 and of males as 5 ft 4.5. The chart shows 5 ft 4 in.

***

Lest we think only amateurs make this type of chart, here is an example of a similar chart in a scientific research journal:

Fnhum-14-00338-g007

(link to original)

I have seen many versions of the above column charts with error bars, and the vertical axes not starting at zero. In every case, the heights (and areas) of these columns do not scale with the underlying data.

***

I tried a variant of the stem-and-leaf plot:

Junkcharts_redo_womenheight_stemleaf

The scale is chosen to reflect the full range of average heights given in Wikipedia. The chart works better with more countries to fill out the distribution. It shows India is on the short end of the scale but not quite the lowest. (As mentioned above, Peru actually should be placed close to the lower edge.)

 


Distorting perception versus distorting the data

This chart appears in the latest issue (the "last print issue") of Schwab's On Investing magazine:

Schwab_oninvesting_returnlandscape

I know I don't like triangular charts, and in this post, I attempt to verbalize why.

It's not the usual complaint of distorting the data. When the base of the triangle is fixed, and only the height is varied, then the area is proportional to the height and thus nothing is distorted.

Nevertheless, my ability to compare those triangles pales in comparison to the following columns.

Junkcharts_triangles_rectangles

This phenomenon is not limited to triangles. One can take columns and start varying the width, and achieve a similar effect:

Junkcharts_changing_base

It's really the aspect ratio - the relationship between the height and the width - that's the issue.

***

Interestingly, with an appropriately narrow base, even the triangular shape can be saved.

Junkcharts_narrower_base

In a sense, we can think of the width of these shapes as noise, a distraction - because the width is constant, and not encoding any data.

It's like varying colors for no reason at all. It introduces a pointless dimension.

Junkcharts_color_notdata

It may be prettier but the colors also interfere with our perception of the changing heights.


Further exploration of tessellation density

Last year, I explored using bar-density (and pie-density) charts to illustrate 80/20-type distributions, which are very common in real life (link).

Kaiserfung_youtube_bardensity

The key advantage of this design is that the most important units (i.e. the biggest stars/creators) are represented by larger pieces while the long tail is shown by little pieces. The skewness is encoded in the density of the tessellation.

So when the following chart showed up on my Twitter feed, I returned to the idea of using tessellation density as a visual cue.

Harvard_income_students

This wbur chart is a good statistical chart - efficient at communicating the data, but "boring". The only changes I'd make are to remove the vertical axis, gridlines, and the decimals.

In concept, the underlying data is similar to the YouTube data. Less than 0.5 percent of YouTubers produced 38% of the views on the platform. The richest 1% of the population took 15% of Harvard's spots; the richest 20% took 70%.

As I explore this further, the analogy falls apart. In the YouTube scenario, the stars should naturally occupy bigger spaces. In the Harvard scenario, letting the children of the top 1% take up more space on the chart doesn't really make sense since each incoming Harvard student has equal status.

Instead of going down that potential dead end, I investigated how tessellation density can be used for visualization. For one thing, tessellations are pretty things and appealing.

Here is something I created:

Junkcharts_redo_wbur_harvard_rich

The chart is read vertically by comparing Harvard's selection of students with the hypothetical "ideal" of equal selection. (I don't agree that this type of equality is the right thing but let me focus on the visualization here.) Thus, selectivity is encoded in the density. Selectivity is defined here as the over/under representation. Harvard is more "selective" in lower-income groups.

In the first and second columns, we see that Harvard's densities are lower than the densities as expected in the general population, indicating that the poorest 20%, and the middle 20% of the population are under-represented in Harvard's student body. Then in the third column, the comparison flips. The density in the top box is about 3-4 times as high as the bottom box. You may have to expand the graphic to see the 1% sliver, which also shows a much higher density in the top box.

I was surprised by how well I was able to eyeball the relative densities. You can try it and let me know how you fare.

(There is even a trick to do this. From the diagram with larger pieces, pick a representative piece. Then, roughly estimate how many smaller pieces from the other tessellation can fit into that representative piece. Using this guideline, I estimate the ratios of the densities to be 1:6, 1:2, 3:1, 10:1. The actual ratios are 1:6.7, 1:2.5, 3:1, 15:1. I find that my intuition gets me most of the way there even if I don't use this trick.)
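For reference, here is a small sketch of how an over/under-representation ratio can be computed, using only the two shares quoted earlier (the richest 1% took 15% of the spots; the richest 20% took 70%). The function name is mine, and the shares for the other income groups are not given in this post, so they are left out.

```python
def representation_ratio(share_of_students, share_of_population):
    """Over/under-representation: above 1 means over-represented."""
    return share_of_students / share_of_population

# Shares in percent, as quoted in the text above.
top_1_pct = representation_ratio(15, 1)
top_20_pct = representation_ratio(70, 20)

print(f"richest 1%:  {top_1_pct:.1f} : 1")
print(f"richest 20%: {top_20_pct:.1f} : 1")
```

A group whose ratio is below 1 would be written the other way round, e.g. a ratio of 0.15 reads as roughly 1:6.7.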

Density encoding is under-used as a visual cue. I think our ability to compare densities is surprisingly good (when the units are not overlapping). Of course, you wouldn't use density if you need to be precise, just as you wouldn't use color, or circular areas. Nevertheless, there are many occasions where you can afford to be less precise, and you'd like to spice up your charts.


Plotting the signal or the noise

Antonio alerted me to the following graphic that appeared in the Economist. This is a playful (?) attempt to draw attention to racism in the game of football (soccer).

The analyst proposed that non-white players have played better in stadiums without fans due to Covid-19 in 2020 because they have not been distracted by racist abuse from fans, using Italy's Serie A as the case study.

Econ_seriea_racism

The chart struggles to bring out this finding. There are many lines that criss-cross. The conclusion is primarily based on the two thick lines - which show the average performance with and without fans of white and non-white players. The blue line (non-white) inched to the right (better performance) while the red line (white) shifted slightly to the left.

If the reader wants to understand the chart fully, there's a lot to take in. (Presumably) all players are ranked by performance score from lowest to highest and split into ten equally sized tiers (known as "deciles"). They are sorted by the 2019 performance when fans were in the stadiums. Each tier is represented by the average performance score of its members. These are the values shown on the top axis labeled "with fans".

Then, with the tiers fixed, the players are rated in 2020 when stadiums were empty. For each tier, an average 2020 performance score is computed, and compared to the 2019 performance score.
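The tiering procedure described above can be sketched as follows. This is my reading of the method, not the Economist's code, and the scores are synthetic: each simulated player has a fixed skill plus independent season noise, with no fan effect built in.

```python
import random
import statistics

def decile_averages(scores_2019, scores_2020, n_tiers=10):
    """Tier players by 2019 score, then average both seasons within each tier.

    Tier membership is fixed by the 2019 ranking only. Any players left
    over when the count is not divisible by n_tiers are dropped.
    """
    order = sorted(range(len(scores_2019)), key=lambda i: scores_2019[i])
    tier_size = len(order) // n_tiers
    result = []
    for t in range(n_tiers):
        members = order[t * tier_size:(t + 1) * tier_size]
        avg_2019 = statistics.mean(scores_2019[i] for i in members)
        avg_2020 = statistics.mean(scores_2020[i] for i in members)
        result.append((avg_2019, avg_2020))
    return result

# Synthetic data (hypothetical): skill plus independent noise per season.
random.seed(0)
skill = [random.gauss(6.0, 0.3) for _ in range(200)]
s2019 = [s + random.gauss(0, 0.3) for s in skill]
s2020 = [s + random.gauss(0, 0.3) for s in skill]

tiers = decile_averages(s2019, s2020)
for t, (a19, a20) in enumerate(tiers, start=1):
    print(f"tier {t:2d}: 2019 avg {a19:.2f}, 2020 avg {a20:.2f}")
```

Because the tiers are defined by the 2019 ranking, the 2019 averages must rise from tier to tier, while the 2020 averages are free to cross. In this simulation any gap between the two columns is pure noise, since nothing about fans went into the data.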

The following chart reveals the structure of the data:

Junkcharts_redo_seriea_racism

The players are lined up from left to right, from the worst performers to the best. Each decile is one tenth of the players, and is represented by the average score within the tier. The vertical axis is the actual score while the horizontal axis is a relative ranking - so we expect a positive correlation.

The blue line shows the 2019 (with fans) data, which are used to determine tier membership. The gray dotted line is the 2020 (no fans) data - because they don't decide the ranking, it's possible that the average score of a lower tier (e.g. tier 3 for non-whites) is higher than the average score of a higher tier (e.g. tier 4 for non-whites).

What do we learn from the graphic?

It's very hard to know if the blue and gray lines are different by chance or by whether fans were in the stadium. The maximum gap between the lines is not quite 0.2 on the raw score scale, which is roughly a one-decile shift. It'd be interesting to know the variability of the score of a given player across say 5 seasons prior to 2019. I suspect it could be more than 0.2. In any case, the tiny shifts in the averages (around 0.05) can't be distinguished from noise.

***

This type of analysis is tough to do. Like other observational studies, there are multiple problems of biases and confounding. Fan attendance was not the only thing that changed between 2019 and 2020. The score used to rank players is a "Fantacalcio algorithmic match-level fantasy-football score." It's odd that real-life players should be judged by their fantasy scores rather than their on-the-field performance.

The causal model appears to assume that every non-white player gets racially abused. At least, the analyst didn't look at the curves above and conclude, post-hoc, that players in the third decile are most affected by racial abuse - which is exactly what has happened with the observational studies I have featured on the book blog recently.

Being a Serie A fan, I happen to know non-white players are a small minority so the error bars are wider, which is another issue to think about. I wonder if this factor by itself explains the shifts in those curves. The curve for white players has a much higher sample size thus season-to-season fluctuations are much smaller (regardless of fans or no fans).
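The sample-size point can be illustrated with the usual standard error of a mean. The spread and the player counts below are hypothetical; I don't have the actual numbers of white and non-white players in the dataset.

```python
import math

def standard_error(sd, n):
    """Standard error of an average of n independent scores with spread sd."""
    return sd / math.sqrt(n)

# Hypothetical values for illustration only.
score_sd = 0.5        # assumed spread of individual player scores
n_minority = 30       # assumed size of the smaller group
n_majority = 300      # assumed size of the larger group

se_minority = standard_error(score_sd, n_minority)
se_majority = standard_error(score_sd, n_majority)

print(f"smaller group: +/- {se_minority:.3f}")
print(f"larger group:  +/- {se_majority:.3f}")
```

With a tenfold difference in group size, the smaller group's average is a bit more than three times as noisy, before any fan effect is considered.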

 

 

 

 


Stumped by the ATM

The neighborhood bank recently installed brand new ATMs, with tablet monitors and all that jazz. Then, I found myself staring at this screen:

Banknote_picker_us

I wanted to withdraw $100. I ordinarily love this banknote picker because I can get the $5, $10, $20 notes, instead of $50 and $100 that come out the slot when I don't specify my preference.

Something changed this time. I found myself wondering which row represents which note. For my non-U.S. readers, you may not know that all our notes are the same size and color. The screen resolution wasn't great and I had to squint really hard to see the numbers on those banknote images.

I suppose if I grew up here, I might be able to tell the note values from the portraits. This is an example of a visualization that makes my life harder!

***

I imagine that the software developer might be a foreigner, perhaps living in Europe. In that case, the developer might have this image in his/her head:

Banknote_picker_euro

Euro banknotes are heavily differentiated - by color, by image, by height and by width. The numeric value also occupies a larger proportion of the area. This makes a lot of sense.

I like designs to be adaptable. Switching data from one country to another should not alter the design. Switching data at different time scales should not affect the design. This banknote picker UI is not adaptable across countries.

***

Once I figured out the note values, I learned another reason why I couldn't tell which row is which note. It's because one note is absent.

Banknote_us_2

Where is the $10 note? That and the twenty are probably the most frequently used. I am also surprised people want $1 notes from an ATM. But I assume the bank knows something I don't.


Tip of the day: transform data before plotting

The Financial Times called out a Twitter user for some graphical mischief. Here are the two charts illustrating the plunge in Bitcoin's price last week: (Hat tip to Mark P.)

Ft_tradingview_btcprices

There are some big differences between the two charts. The left chart depicts this month's price actions, drawing attention to the last week while the right chart shows a longer period of time, starting from 2012. The author of the tweet apparently wanted to say that the recent drop is nothing to worry about. 

The Financial Times reporter noted another subtle difference - the right chart uses a log scale while the left chart is linear. Specifically, it's a log 2 scale, which means that each step up is double the previous number (1, 2, 4, 8, etc.). The effect is to make large changes look smaller. Presumably most readers fail to notice the scale. Even if they do, it's not natural to assign different differences to the same physical distances.
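Here is a minimal sketch of what a log 2 scale does to distances on the axis. The prices are round numbers picked for illustration, not values from either chart.

```python
import math

# On a log 2 axis, the plotted position of a price is log2(price).
prices = [1, 2, 4, 8, 16]
positions = [math.log2(p) for p in prices]

# Each doubling moves the same distance up the axis.
steps = [b - a for a, b in zip(positions, positions[1:])]
print(steps)

# Illustrative prices: halving from 60,000 to 30,000 covers the same
# physical distance as halving from 2 to 1.
big_drop = math.log2(60_000) - math.log2(30_000)
small_drop = math.log2(2) - math.log2(1)
print(big_drop, small_drop)
```

Equal distances on the axis stand for equal ratios, not equal dollar amounts, which is why a 30,000-dollar fall can look as modest as a 1-dollar fall.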

***

Junkcharts_redo_fttradingviewbitcoinpricechart

These price charts always miss the mark. That's because the current price is insufficient to capture whether a Bitcoin investor made money or lost money. If you purchased Bitcoins this month, you lost money. If your purchase was a year ago, you still made quite a bit of money despite the recent price plunge.

The following chart should not be read as a time series, even though the horizontal axis is time. Think date of Bitcoin purchase. This chart tells you how much $1 of Bitcoin is worth last week, based on what day the purchase was made.

Junkcharts_redo_fttradingviewbitcoinpricechart_2

People who bought this year have mostly been in the red. Those who purchased before October 2020 and held on are still very pleased with their decision.

This example illustrates that simple transformations of the raw data yield graphics that are much more informative.
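The transformation is a one-liner: divide the price on the valuation date by the price on each purchase date. Below is a minimal sketch with made-up prices (these are not actual Bitcoin quotes), and the function name is mine.

```python
def value_of_one_dollar(price_by_purchase_date, valuation_price):
    """Value at the valuation date of $1 of Bitcoin bought on each date."""
    return {
        date: valuation_price / price
        for date, price in price_by_purchase_date.items()
    }

# Made-up prices for illustration only.
purchase_prices = {
    "2020-05": 9_000.0,
    "2020-10": 12_000.0,
    "2021-01": 36_000.0,
    "2021-04": 60_000.0,
}
valuation_price = 36_000.0   # assumed price on the valuation date

values = value_of_one_dollar(purchase_prices, valuation_price)
for date, value in values.items():
    status = "in the red" if value < 1 else "ahead or even"
    print(f"bought {date}: $1 is now ${value:.2f} ({status})")
```

Anything below $1 means the buyer on that date is sitting on a loss. The horizontal axis of the resulting chart is the purchase date, which is why it should not be read as a time series.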

 


Did the pandemic drive mass migration?

The Wall Street Journal ran this nice compact piece about migration patterns during the pandemic in the U.S. (link to article)

Wsj_migration

I'd look at the chart on the right first. It shows the greatest net flow of people out of the Northeast to the South. This sankey diagram is nicely done. The designer shows restraint in not printing the entire dataset on the chart. If a reader really cares about the net migration from one region to a specific other region, it's easy to estimate the number even though it's not printed.

The maps succinctly provide readers the definition of the regions.

To keep things in perspective, we are talking around 100,000 when the death toll of Covid-19 is nearing 600,000. Some people have moved but almost everyone else hasn't.

***

The chart on the left breaks down the data in a different way - by urbanicity. This is a variant of the stacked column chart. It is a chart form that fits the particular instance of the dataset. It works only because in every month of the last three years, there was a net outflow from "large metro cores". Thus, the entire series for large metro cores can be pointed downwards.

The fact that this design is sensitive to the dataset is revealed in the footnote, which said that the May 2018 data for "small/medium metro" was omitted from the chart. Why didn't they plot that number?

It's the one datum that sticks out like a sore thumb. It's the only negative number in the entire dataset that is not associated with "large metro cores". I suppose they could have inserted a tiny medium green sliver in the bottom half of that chart for May 2018. I don't think it hurts the interpretation of the chart. Maybe the designer thinks it might draw unnecessary attention to one data point that really doesn't warrant it.

***

See my collection of posts about Wall Street Journal graphics.