Ranking data provide context but can also confuse

This dataviz from the Economist had me spending a lot of time clicking around - which means it is a success.

Econ_usaexcept_hispanic

The graphic presents four measures of wellbeing in society - life expectancy, infant mortality rate, murder rate and prison population. The primary goal is to compare nations across those metrics. The focus is on comparing how certain nations (or subgroups) rank against each other, as indicated by the relative vertical position.

The Economist staff has a particular story to tell about racial division in the U.S. The dotted bars represent the U.S. average. The colored bars are the averages for Hispanic, white and black Americans. The wider the gap between the colored bars, the more the experiences of America's racial groups diverge.

The chart shows that the racial gap of life expectancy is the widest. For prison population, the U.S. and its racial subgroups occupy many of the lowest (i.e. least desirable) ranks, with the smallest gap in ranking.

***

The primary element of interactivity is hovering on a bar, which then highlights the four bars corresponding to the particular nation selected. Here is the picture for Thailand:

Econ_usaexcept_thailand

According to this view of the world, Thailand is a close cousin of the U.S. On each metric, the Thai value clings pretty near the U.S. average and sits within the range by racial groups. I'm surprised to learn that the prison population in Thailand is among the highest in the world.

Unfortunately, this chart form doesn't facilitate comparing Thailand to a country other than the U.S., as one can highlight only one country at a time.

***

While the main focus of the chart is on relative comparison through ranking, the reader can extract absolute difference by reading the lengths of the bars.

This is a close-up of the bottom of the prison population metric:

Econ_useexcept_prisonpop_bottom

The length of each bar displays the numeric data. The red line is an outlier in this dataset. Black Americans suffer an incarceration rate that is almost three times the national average. Even white Americans (blue line) are imprisoned at a rate higher than that of most countries around the world.

As noted above, the prison population metric exhibits the smallest gap between racial subgroups. This chart is a great example of why ranking data frequently hide important information. The small gap in ranking masks the extraordinary absolute difference in incarceration rates between white and black America.

The difference between rank #1 and rank #2 is enormous.

Econ_useexcept_lifeexpect_top

The opposite situation appears for life expectancy. The life expectancy values are bunched up especially at the top of the scale. The absolute difference between Hispanic and black America is 82 - 75 = 7 years, which looks small because the axis starts at zero. On a ranking scale, Hispanic is roughly in the top 15% while black America is just above the median. The relative difference is huge.

For life expectancy, ranking conveys the view that even a 7-year difference is a big deal because the countries are tightly bunched together. For prison population, ranking shows the view that a multiple fold difference is "unimportant" because a 20-0 blowout and a 10-0 blowout are both heavy defeats.

***

Whenever you transform numeric data to ranks, remember that you are artificially treating the gap between each value and the next value as a constant, even when the underlying numeric gaps show wide variance.
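To make the point concrete, here is a minimal sketch in Python. The four values are made up for illustration (they are not the Economist's data), and `to_ranks` is a hypothetical helper that ignores ties.

```python
def to_ranks(values):
    """Rank values from largest (rank 1) to smallest. Ties are not handled."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

# Illustrative values only: one extreme outlier followed by a tight cluster
values = [2200, 700, 650, 600]
ranks = to_ranks(values)

# Gaps between neighbours, on the original scale and on the rank scale
value_gaps = [values[i] - values[i + 1] for i in range(len(values) - 1)]
rank_gaps = [ranks[i + 1] - ranks[i] for i in range(len(ranks) - 1)]

print(value_gaps)  # the numeric gaps differ widely
print(rank_gaps)   # the rank gaps are all the same
```

The outlier sits 1,500 units above its neighbour while the others sit 50 apart, yet every step on the rank scale is exactly one.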


Hanging things on your charts

The Financial Times published the following chart that shows the rollout of vaccines in the U.K.

Ft_astrazeneca_uk_rollout

(I can't find the online link to the article. The article is titled "AstraZeneca and Oxford face setbacks and success as battle enters next phase", May 29/30 2021.)

This chart form is known as a "streamgraph", and it is a stacked area chart in disguise. 

The same trick can be applied to a column chart. See the "hanging" column chart below:

Junkcharts_hangingcolumns

The two charts show exactly the same data. The left one roots the columns at the bottom. The right one aligns the middle of the columns. 

I have rarely found these hanging charts useful. The realignment makes it harder to compare the sizes of the different column segments. On the normal stacked column chart, the yellow segments are the easiest to compare because they share the same base level. Even this is taken away from the reader on the right side.

Note also that the hanging version does not admit a vertical axis.

The same comments apply to the streamgraph.

***

Nevertheless, I was surprised that the FT chart shown above actually works. The main message I learned was that initially the U.K. primarily rolled out AstraZeneca and, to a lesser extent, Pfizer shots, while later it introduced other vaccines, including Johnson & Johnson, Novavax, CureVac, Moderna, and "Other".

I can also see that the supply of AstraZeneca has not changed much through the entire time window. Pfizer has grown to roughly the same scale as AstraZeneca. Moderna remains a small fraction of total shots. 

I can even roughly see that the total number of vaccinations has grown about six times from start to finish. 

That's quite a lot for one chart, so job well done!

There is one problem with the FT chart. It should have labelled the end of May as "today". Half the chart is history, and the other half is the future.

***

For those following Covid-19 news, the FT chart is informative in a different way.

There is a misleading statement going around blaming the U.K.'s recent surge in cases on the AstraZeneca vaccine, claiming that the U.K. mostly uses AZ. This chart shows that from the start, about a third of the shots administered in the U.K. have been Pfizer, and Pfizer's share has been growing over time.

U.K. compared to some countries mostly using mRNA vaccines

Ourworldindata_cases

U.K. is almost back to the winter peak. That's because the U.K. is serious about counting cases. Look at the state of testing in these countries:

Ourworldindata_tests

What's clear about the U.S. case count is that it is kept low by cutting the number of tests by two-thirds. As a result, our data are once again heavily biased towards severe cases.

We can do a back-of-the-envelope calculation. The drop in testing may directly lead to a proportional drop in reported cases, thus removing 500 (asymptomatic, or mild) cases per million from the case count. The case count goes below 250 per million so the additional 200 or so reduction is due to other reasons such as vaccinations.
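The arithmetic can be laid out step by step. This is only a sketch. The starting level of about 750 cases per million is not stated above but is implied by a two-thirds cut removing 500, and the current level is implied by the "200 or so". Both are assumptions of this sketch.

```python
from fractions import Fraction

# All figures are reported cases per million
tests_cut = Fraction(2, 3)       # share of tests removed
removed_by_testing = 500         # drop attributed to less testing (from the text)

# Assumption: reported cases fall in proportion to tests
implied_start = removed_by_testing / tests_cut            # starting level
expected_after_cut = implied_start - removed_by_testing   # level if testing were the only factor

other_reduction = 200            # "200 or so" attributed to vaccinations and other reasons
implied_current = expected_after_cut - other_reduction

print(implied_start, expected_after_cut, implied_current)
```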


Did prices go up or down? Depends on how one looks at the data

The U.S. media have been flooded with reports of runaway inflation recently, and it's refreshing to see a nice article in the Wall Street Journal that takes a second look at the data. As my readers know, raw data can be incredibly deceptive.

Inflation typically describes the change in price level relative to the prior year. Comparing each month's price level to that of the same month a year earlier is a simple seasonal adjustment, used to remove the effect of seasonality that masks the true change in price levels. (See this explainer of seasonal adjustment.)

As the pandemic enters the second year, this methodology is comparing 2021 price levels to pandemic-impacted price levels of 2020. This produces a very confusing picture. As the WSJ article explains, prices can be lower than they were in 2019 (pre-pandemic) and yet substantially higher than they were in 2020 (during the pandemic). This happens in industry sectors that were heavily affected by the economic shutdown, e.g. hotels, travel, entertainment.

Wsj_pricechangehotels_20192021

Here is how they visualized this phenomenon. Amusingly, some algorithm estimated that it should take 5 minutes to read the entire article. It may take that much time to understand properly what this chart is showing.

Let me save you some time.

The chart shows monthly inflation rates of hotel price levels.

The pink horizontal stripes represent the official inflation numbers, which compare each month's hotel prices to those of a year prior. The most recent value for May of 2021 says hotel prices rose by 9% compared to May of 2020.

The blue horizontal stripes show an alternative calculation which compares each month's hotel prices to those of two years prior. Think of 2018-9 as "normal" years, pre-pandemic. Using this measure, we find that hotel prices for May of 2021 are about 4% lower than for May of 2019.
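The two calculations differ only in the reference year. Here is a minimal sketch, using a made-up price index (not the WSJ's data) chosen so that the results land near the 9% and 4% figures quoted above. `pct_change` is a hypothetical helper.

```python
def pct_change(new, old):
    """Percentage change from old to new."""
    return (new - old) / old * 100

# Hypothetical hotel price index for the same month in three consecutive years
index = {2019: 100.0, 2020: 88.0, 2021: 96.0}

one_year = pct_change(index[2021], index[2020])   # official-style comparison (pink stripes)
two_year = pct_change(index[2021], index[2019])   # pre-pandemic comparison (blue stripes)

print(round(one_year, 1), round(two_year, 1))
```

Prices can be up sharply against a depressed 2020 base while still sitting below the 2019 level.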

(This situation affects all of our economic statistics. We may see an expansion in employment levels from a year ago which still leaves us behind where we were before the pandemic.)

What confused me on the WSJ chart were the blocks of color. In a previous chart, the readers learn that solid colors mean inflation rose while diagonal lines mean inflation decreased. It turns out that these are month-over-month changes in inflation rates (notice that one end of the column for the previous month touches one end of the column of the next month).

The color patterns become the most dominant feature of this chart, and yet the month-over-month change in inflation rates isn't the crux of the story. The real star of the story should be the difference in inflation rates - for any given month - between two reference years.

***

In the following chart, I focus attention on the within-month, between-reference-years comparisons.

Junkcharts_redo_wsj_inflationbaserate

Because hotel prices dropped drastically during the pandemic, and have recovered quite well in recent months as the U.S. reopens the economy, the inflation rate of hotel prices is almost 10%. Nevertheless, the current price level is still 7% below the pre-pandemic level.


Start at zero improves this chart but only slightly

The following chart was forwarded to me recently:

Average_female_height

It's a good illustration of why the "start at zero" rule exists for column charts. The poor Indian lady looks extremely short in this women's club. Is the average Indian woman really half as tall as the average South African woman? (Surely not!)

Junkcharts_redo_womenheight_column

The problem is only superficially fixed by starting the vertical axis at zero. Doing so highlights the fact that the difference in average heights is but a fraction of the average heights themselves. The inter-country differences are squashed in such a representation - which works against the primary goal of the data visualization itself.

Recall the Trifecta Checkup. At the top of the trifecta is the Question. The designer obviously wants to focus our attention on the difference of the averages. A column chart showing average heights fails the job!

This "proper" column chart sends the message that the difference in average heights is noise, unworthy of our attention. But this is a bad take on the underlying data. The range of average heights across countries isn't that wide, by virtue of large population sizes.

According to Wikipedia, they range from 4 feet 10.5 to 5 feet 6. (I'm ignoring several entries in the table based on non-representative small samples.) How do we know that the difference of 2 inches between averages of South Africa and India is actually a sizable difference? The Wikipedia table has the average heights for most of the world's countries. There are perhaps 200 values. These values are sprinkled inside the range of about 8 inches top to bottom. If we divide the full range into 10 equal bins, that's roughly 0.8 inches per bin. So if we have two numbers that are 2 inches apart, they sit about 2.5 bins apart. If the data were evenly distributed, that's a huge shift.
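The binning arithmetic is easy to verify. This sketch uses only the numbers quoted above. `to_inches` is a hypothetical helper, and the "about 8 inches" is a rounding of the exact range.

```python
def to_inches(feet, inches):
    """Convert a feet-and-inches height to inches."""
    return feet * 12 + inches

low = to_inches(4, 10.5)     # shortest national average quoted
high = to_inches(5, 6)       # tallest national average quoted
exact_range = high - low     # 7.5 inches, rounded to "about 8" in the text

rounded_range = 8.0
bin_width = rounded_range / 10    # inches per bin with 10 equal bins
gap = 2.0                         # South Africa minus India, in inches
bins_apart = gap / bin_width      # how many bin widths the gap covers

print(exact_range, bin_width, bins_apart)
```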

(In reality, the data should be normally distributed, bell-shaped, with much more at the center than on the edges. That makes a difference of 2 inches even more significant if these are normal values near the center but less significant if these are extreme values on the tails. Stats students should be able to articulate why we are sure the data are normally distributed without having to plot the data.)

***

The original chart has further problems.

Another source of distortion comes from the scaling of the stick figures. The aspect ratio is being preserved, which means the area is being scaled. Given that the heights are scaled as per the data, the data are encoded twice, the second time in the widths. This means that the sizes of these figures grow at the rate of the square of the heights. (Contrast this with the scaling discussed in my earlier post this week which preserves the relative areas.)
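A quick sketch shows the double encoding. The figure dimensions are arbitrary units chosen for illustration, and `scale_figure` is a hypothetical helper that resizes a shape while preserving its aspect ratio.

```python
def scale_figure(width, height, new_height):
    """Resize a figure to new_height, keeping the aspect ratio fixed."""
    factor = new_height / height
    return width * factor, new_height

base_w, base_h = 1.0, 4.0                    # reference stick figure (arbitrary units)
w, h = scale_figure(base_w, base_h, 5.0)     # a figure 25% taller

height_ratio = h / base_h
area_ratio = (w * h) / (base_w * base_h)

print(height_ratio, area_ratio)
```

A figure that is 25% taller ends up with about 56% more area, since the area ratio is the square of the height ratio.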

At the end of that last post, I discuss why adding colors to a chart when the colors do not encode any data is a distraction to the reader. And this average height chart is an example.

From the Data corner of the Trifecta Checkup, I'm intrigued by the choice of countries. Why is Scotland highlighted instead of the U.K.? Why Latvia? According to Wikipedia, the Latvia estimate is based on a 1% sample of only 19-year-olds.

Some of the data appear to be incorrect (or the designer used a different data source). Wikipedia lists the average height of Latvian women as 5 ft 6.5 while the chart shows 5 ft 5 in. Peru's average height of females is listed as 4 ft 11.5 and of males as 5 ft 4.5. The chart shows 5 ft 4 in.

***

Lest we think only amateurs make this type of chart, here is an example of a similar chart in a scientific research journal:

Fnhum-14-00338-g007

(link to original)

I have seen many versions of the above column charts with error bars, and the vertical axes not starting at zero. In every case, the heights (and areas) of these columns do not scale with the underlying data.

***

I tried a variant of the stem-and-leaf plot:

Junkcharts_redo_womenheight_stemleaf

The scale is chosen to reflect the full range of average heights given in Wikipedia. The chart works better with more countries to fill out the distribution. It shows India is on the short end of the scale but not quite the lowest. (As mentioned above, Peru actually should be placed close to the lower edge.)

 


Distorting perception versus distorting the data

This chart appears in the latest issue ("last print issue") of Schwab's On Investing magazine:

Schwab_oninvesting_returnlandscape

I know I don't like triangular charts, and in this post, I attempt to verbalize why.

It's not the usual complaint of distorting the data. When the base of the triangle is fixed, and only the height is varied, then the area is proportional to the height and thus nothing is distorted.
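The claim follows directly from the area formula. In this short derivation, b stands for the fixed base and h_1, h_2 for the heights of any two triangles on the chart (the symbols are introduced here for illustration):

```latex
A = \tfrac{1}{2}\, b\, h
\quad\Longrightarrow\quad
\frac{A_1}{A_2} = \frac{\tfrac{1}{2}\, b\, h_1}{\tfrac{1}{2}\, b\, h_2} = \frac{h_1}{h_2}
```

With the base held constant, the ratio of areas equals the ratio of heights, so the data are not distorted.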

Nevertheless, my ability to compare those triangles pales in comparison to the following columns.

Junkcharts_triangles_rectangles

This phenomenon is not limited to triangles. One can take columns and start varying the width, and achieve a similar effect:

Junkcharts_changing_base

It's really the aspect ratio - the relationship between the height and the width - that's the issue.

***

Interestingly, with an appropriately narrow base, even the triangular shape can be saved.

Junkcharts_narrower_base

In a sense, we can think of the width of these shapes as noise, a distraction - because the width is constant, and not encoding any data.

It's like varying colors for no reason at all. It introduces a pointless dimension.

Junkcharts_color_notdata

It may be prettier but the colors also interfere with our perception of the changing heights.


Did the pandemic drive mass migration?

The Wall Street Journal ran this nice compact piece about migration patterns during the pandemic in the U.S. (link to article)

Wsj_migration

I'd look at the chart on the right first. It shows the greatest net flow of people out of the Northeast to the South. This sankey diagram is nicely done. The designer shows restraint in not printing the entire dataset on the chart. If a reader really cares about the net migration from one region to a specific other region, it's easy to estimate the number even though it's not printed.

The maps succinctly provide readers the definition of the regions.

To keep things in perspective, we are talking about around 100,000 people when the death toll of Covid-19 is nearing 600,000. Some people have moved but almost everyone else hasn't.

***

The chart on the left breaks down the data in a different way - by urbanicity. This is a variant of the stacked column chart. It is a chart form that fits the particular instance of the dataset. It works only because in every month of the last three years, there was a net outflow from "large metro cores". Thus, the entire series for large metro cores can be pointed downwards.

The fact that this design is sensitive to the dataset is revealed in the footnote, which said that the May 2018 data for "small/medium metro" was omitted from the chart. Why didn't they plot that number?

It's the one datum that sticks out like a sore thumb. It's the only negative number in the entire dataset that is not associated with "large metro cores". I suppose they could have inserted a tiny medium green sliver in the bottom half of that chart for May 2018. I don't think it hurts the interpretation of the chart. Maybe the designer thinks it might draw unnecessary attention to one data point that really doesn't warrant it.

***

See my collection of posts about Wall Street Journal graphics.


Probabilities and proportions: which one is the chart showing

The New York Times showed this chart (link):

Nyt_unvaccinated_undeterred

My first read: oh my gosh, 40-50% of the unvaccinated Americans are living their normal lives - dining at restaurants, assembling with more than 10 people, going to religious gatherings.

After reading the text around this chart, I realize I have misinterpreted it.

The chart should be read by columns. Each column is a "pie chart". For example, the first column shows that half the restaurant diners are not vaccinated, a third are fully vaccinated, and the remainder are partially vaccinated. The other columns have roughly the same proportions.

The author says "The rates of vaccination among people doing these activities largely reflect the rates in the population." This line is perhaps more confusing than intended. What she's saying is that in the general population, half of us are unvaccinated, a third are fully vaccinated, and the remainder are partially vaccinated.

Here's a picture:

Junkcharts_redo_nyt_unvaccinatedundeterred

What this chart is saying is that the people dining out are like a random sample from all Americans. So too the other groups depicted. What Americans are choosing to do is independent of their vaccination status.

Unvaccinated people are no less likely to be doing all these activities than the fully vaccinated. This raises the question: are half of the people not wearing masks outdoors unvaccinated?

***

Why did I read the chart wrongly in the first place? It has to do with expectations.

Most survey charts plot probabilities not proportions. I haphazardly grabbed the following Pew Research chart as an example:

Pew_kids_socialmedia

From this chart, we learn that 30% of kids 9-11 years old use TikTok compared to 11% of kids 5-8. The percentages down a column do not sum to 100%.
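The difference between the two readings comes down to the denominator. The sketch below uses invented counts (not the NYT's or Pew's data), set up so that the population splits roughly half, a third, and the remainder by vaccination status, and dining out is independent of status.

```python
# Hypothetical counts of people by vaccination status
population = {"unvaccinated": 500, "fully vaccinated": 330, "partially vaccinated": 170}
# Hypothetical counts of those who dined at a restaurant, by status
diners = {"unvaccinated": 200, "fully vaccinated": 132, "partially vaccinated": 68}

total_diners = sum(diners.values())

# Proportions: of all diners, what share falls in each group? (sums to 100%)
share_of_diners = {g: n / total_diners for g, n in diners.items()}

# Probabilities: within each group, what share dined out? (need not sum to 100%)
prob_dining = {g: diners[g] / population[g] for g in diners}

print(share_of_diners)
print(prob_dining)
```

The NYT chart plots the first kind of number, and the Pew chart plots the second. In this made-up example, half the diners are unvaccinated, yet only 40% of any group dined out.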

 


Reading this chart won't take as long as withdrawing troops from Afghanistan

Art sent me the following Economist chart, noting how hard it is to understand. I took a look, and agreed. It's an example of a visual representation that takes more time to comprehend than the underlying data.

Econ_theendisnear

The chart presents responses to 3 questions on a survey. For each question, the choices are Approve, Disapprove, and "Neither" (just picking a word since I haven't seen the actual survey question). The overall approval/disapproval rates are presented, and then broken into two subgroups (Democrats and Republicans).

The first hurdle is reading the scale. Because the section from 75% to 100% has been removed, we are left with labels 0, 25, 50, 75, which do not say percentages unless we've consumed the title and subtitle. The Economist style guide places the units of data in the subtitle instead of on the axis itself.

Our attention is drawn to the thick lines, which represent the differences between approval and disapproval rates. These differences are signed: it matters whether the proportion approving is higher or lower than the proportion disapproving. This means the data are encoded in the order of the dots plus the length of the line segment between them.

The two bottom rows of the Afghanistan question demonstrate this mental challenge. Our brains have to process the following visual cues:

1) the two lines are about the same lengths

2) the Republican dots are shifted to the right by a little

3) the colors of the dots are flipped

What do they all mean?

Econ_theendofforever_subset

A chart runs into trouble when you need a paragraph to explain how to read it.

It's sometimes alright to make a complicated data visualization that illustrates complicated concepts. What justifies it is the payoff. I wrote about the concept of return on effort in data visualization here.

The payoff for this chart escaped me. Take the Democratic response to troop withdrawal. About 3/4 of Democrats approve while 15% disapprove. The thick line says the share of Democrats who approve is 60 percentage points higher than the share who disapprove.
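For reference, here is the arithmetic behind the thick line, taking "about 3/4" as 75% (an approximation read off the chart, not an exact survey figure):

```python
approve = 75       # percent of Democrats approving (approximate)
disapprove = 15    # percent of Democrats disapproving

net_approval = approve - disapprove     # length and sign of the thick line, in percentage points
neither = 100 - approve - disapprove    # the remainder, which the dot plot does not show directly

print(net_approval, neither)
```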

***

Here, I show the full axis, and add a 50% reference line:

Junkcharts_redo_econ_theendofforever_1

Small edits but they help visualize "half of", "three quarters of".

***

Next, I switch to the more conventional stacked bars.

Junkcharts_redo_econ_theendofforever_stackedbars

This format reveals some of the hidden data on the chart - the proportion answering neither approve/disapprove, and neither yes/no.

On the stacked bars visual, the proportions are counted from both ends while in the dot plot above, the proportions are measured from the left end only.

***

Read all my posts about Economist charts here

 


Two commendable student projects, showing different standards of beauty

A few weeks ago, I did a guest lecture for Ray Vella's dataviz class at NYU, and discussed a particularly hairy dataset that he assigns to students.

I'm happy to see the work of the students, and there are two pieces in particular that show promise.

The following dot plot by Christina Barretto shows the disparities between the richest and poorest nations increasing between 2000 and 2015.

BARRETTO  Christina - RIch Gets Richer Homework - 2021-04-14

The underlying dataset has the average GDP per capita for the richest and the poorest regions in each of nine countries, for two years (2000 and 2015). Within each year, the data are indexed to the national average income (100). In the U.K., the gap increased from around 800 to 1,100 in the 15 years. It's evidence that the richer regions are getting richer, and the poorer regions are getting poorer.

(For those into interpreting data, you should notice that I didn't say the rich are getting richer. During the lecture, I explain how to interpret regional averages.)

Christina's chart reflects the tidy, minimalist style advocated by Tufte. The countries are sorted by the 2000-to-2015 difference, with Britain showing up as an extreme outlier.

***

The next chart by Adrienne Umali is more infographic than Tufte.

Adrienne Umali_v2

It's great story-telling. The top graphic explains the underlying data. It shows the four numbers and how the gap between the richest and poorest regions is computed. Then, it summarizes these four numbers into a single metric, "gap increase". She chooses to measure the change as a ratio while Christina's chart uses the difference, encoded as a vertical line.
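The two students summarize the same pair of gaps in different ways. As a rough sketch, using the approximate U.K. gaps mentioned earlier (around 800 in 2000 and 1,100 in 2015), and assuming a ratio is computed as the later gap divided by the earlier gap (the exact formula on Adrienne's chart may differ):

```python
gap_2000 = 800     # richest minus poorest region, index points (approximate)
gap_2015 = 1100    # same gap 15 years later (approximate)

difference = gap_2015 - gap_2000    # change in the gap, in index points
ratio = gap_2015 / gap_2000         # change in the gap, as a multiple

print(difference, ratio)
```

A difference keeps the units of the index, while a ratio is unit-free and easier to compare across countries with very different starting gaps.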

Adrienne's chart is successful because she filters our attention to a single country - the U.S. It's much too hard to drink data from nine countries in one gulp.

This then sets her up for the second graphic. Now, she presents the other eight countries. Because of the work she did in the first graphic, the reader understands what those red and green arrows mean, without having to know the underlying index values.

Two small suggestions: a) order the countries from greatest to smallest change; b) leave off the decimals. These are minor flaws in a brilliant piece of work.

 

 


Pies, bars and self-sufficiency

Andy Cotgreave asked Twitter followers to pick between pie charts and bar charts:

Ac_pie_or_bar

The underlying data are proportions of people who say they won't get the coronavirus vaccine.

I noticed two somewhat unusual features: the use of pies to show single proportions, and the aspect ratio of the bars (taller than typical). Which version is easier to understand?

To answer this question, I like to apply a self-sufficiency test. This test is used to determine whether the readers are using the visual elements of the chart to understand the data, or bypassing the visual elements and just reading the data labels. So, let's remove the printed data from the chart and take another look:

Junkcharts_selfsufficiency_pieorbar

For me, these charts are comparable. Each is moderately hard to read. That's because the percentages fall into a narrow range at one end of the scale. For both charts, many readers are likely to be looking for the data labels.

Here's a sketch of a design that is self-sufficient.

Junkcharts_selfsufficientdesign

The data do not appear on this chart.

***

My first reaction to Andy's tweet turned out to be a misreading of the charts. I thought he was disaggregating the pie chart, like we can unstack a stacked bar chart.

Junkcharts_probabilities_proportions

Looking at the data more carefully, I realize that the "proportions" are not parts of a whole. Or rather, the whole isn't "all races" or "all education levels". The whole is all respondents of a particular type.