The gift of small edits and subtraction

While making the chart on fertility rates (link), I came across a problem that pops up quite often and is ignored by most software programs.

Here is an earlier version of the chart I later discarded:

Junkcharts_redofertilitychart_2

Compare this to the version I published in the blog post:

Junkcharts_redofertilitychart_1

Aside from adding the chart title, there is one major change. I removed the empty plots from the grid. This is a visualization trick that should be called adding by subtracting. The empty scaffolding on the first chart increases our cognitive load without yielding any benefit. The whitespace brings out the message that only countries in Asia and Africa have fertility rates above 5.0. 

This is a small edit. But small edits accumulate and deliver a big impact. Bear this in mind the next time you make a chart.

 

P.S.

(1) You'd have to use a lower-level coding language to execute this small edit. Most software programs are quite rigid when it comes to making small-multiples (facet) charts.

(2) If there is a next iteration, I'd reverse the Asia and Oceania rows.
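Here's a minimal sketch of the logic behind note (1): before laying out the grid, work out which cells actually contain data, and only draw those. The region names, fertility bands and counts below are made up for illustration, and `panels_to_draw` is a hypothetical helper, not part of any charting package.

```python
def panels_to_draw(counts, rows, cols):
    """Return (row_index, col_index, row, col) for every non-empty grid cell."""
    keep = []
    for i, row in enumerate(rows):
        for j, col in enumerate(cols):
            # A cell is drawn only if at least one country falls in it.
            if counts.get((row, col), 0) > 0:
                keep.append((i, j, row, col))
    return keep


# Hypothetical grid: regions down the side, fertility bands across the top.
rows = ["Africa", "Asia", "Europe"]
cols = ["below 2", "2 to 5", "above 5"]

# Hypothetical number of countries in each cell; missing cells are empty.
counts = {
    ("Africa", "2 to 5"): 30, ("Africa", "above 5"): 10,
    ("Asia", "below 2"): 15, ("Asia", "2 to 5"): 25, ("Asia", "above 5"): 2,
    ("Europe", "below 2"): 40,
}

panels = panels_to_draw(counts, rows, cols)
for i, j, row, col in panels:
    print(f"draw panel at grid position ({i}, {j}): {row}, {col}")
```

The grid positions are kept even for the panels that are skipped, so the surviving plots stay aligned and the whitespace falls where the empty cells used to be.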

 


Working hard at clarity

As I was preparing another blog post about the pandemic, I came across the following data graphic, recently produced by the CDC for a vaccine advisory board meeting:

CDC_positivevaccineintent

This is not an example of effective visual communications.

***

For one thing, readers are directed to scour the footnotes to figure out what's going on. If we ignore those for the moment, we see clusters of bubbles that have remained pretty stable from December 2020 to August 2021. The data concern some measure of Americans' intent to take the COVID-19 vaccine. That much we know.

There may have been a bit of an upward trend between January and May, although if you were shown the clusters for December, February and April, you'd think the trend's been pretty flat. 

***

But those colors? What could they represent? You'd surely have to fish this one out of the footnotes. Specifically, this obtuse sentence: "Surveys with multiple time points are shown with the same color bubble for each time point." I had to read it several times. I think it simply means "Color represents the pollster." 

Then it adds: "Surveys with only one time point are shown in gray." which simply means "All pollsters who have only one entry in the dataset are grouped together and shown in gray."

Another problem with this chart is over-plotting. Look at the July cluster. It's impossible to tell how many polls were conducted in July because the circles pile on top of one another. 

***

The appearance of the flat trend is a result of two unfortunate decisions made by the designer. If I retained the chart form, I'd have produced something that looks like this:

Junkcharts_redo_cdcvaccineintent_sameform

The first design choice is to expand the vertical axis to range from 0% to 100%. This effectively squeezes all the bubbles into a small range.

Junkcharts_redo_cdcvaccineintent_startatzero

The second design choice is to enlarge the bubbles, causing a copious amount of overlap.

Junkcharts_redo_cdcvaccineintent_startatzero_bigdots

In particular, this decision blows up the Pew poll (big pink bubble) that contained 10 times the sample size of most of the other polls. The Pew outcome actually came in at 70% but the top of the pink bubble extends to over 80%. Because of this, the outlier poll of December 2020 - which surprisingly printed the highest number of all polls in the entire time window - no longer looks special. 

***

Now, let's see what else we can do to enhance this chart. 

I don't like how bubble size is used to encode the sample size. It creates a weird sensation for anyone who's familiar with sampling errors and confidence regions. The Pew poll with 10 times the sample size is the most reliable poll of them all. Reliability means the error bars around the Pew poll outcome are the smallest of them all. I tend to think of the area around a point estimate as showing the sampling error so the Pew poll would be a dot, showing the high precision of that estimate.

But that won't work because larger bubbles catch more of the reader's attention. So, in the following version, all dots have the same size. I encode reliability in the opacity of the color. The darker dots are polls that are more reliable, that have larger sample sizes.

Junkcharts_redo_cdcvaccineintent_opacity
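To make the link between sample size and reliability concrete, here's a rough sketch. It uses the textbook margin of error for a proportion, and then maps sample size to opacity. The sample sizes are hypothetical (the chart does not list them), and the square-root opacity scale is my assumption for this sketch, not necessarily the mapping used in the chart above.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Half-width of the normal-approximation 95% interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


def opacity(n, n_min, n_max, alpha_min=0.2, alpha_max=1.0):
    """Map sample size to opacity; larger samples get darker dots.

    Assumption: scale on sqrt(n), because precision improves with sqrt(n).
    """
    if n_max == n_min:
        return alpha_max
    frac = (math.sqrt(n) - math.sqrt(n_min)) / (math.sqrt(n_max) - math.sqrt(n_min))
    frac = min(1.0, max(0.0, frac))
    return alpha_min + frac * (alpha_max - alpha_min)


# Hypothetical sample sizes: one poll is 10 times the size of a typical one.
n_typical = 1_000
n_large = 10 * n_typical

moe_typical = margin_of_error(0.70, n_typical)
moe_large = margin_of_error(0.70, n_large)
ratio = moe_typical / moe_large  # equals sqrt(10), about 3.2

print(f"typical poll: +/- {moe_typical:.1%}, opacity {opacity(n_typical, n_typical, n_large):.2f}")
print(f"large poll:   +/- {moe_large:.1%}, opacity {opacity(n_large, n_typical, n_large):.2f}")
```

Under these assumptions, a tenfold sample size shrinks the margin of error by a factor of about 3.2, not 10. That's another reason a bubble whose area grows with the sample size overstates the difference.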

Two of the pollsters have more frequent polling than others. In this next version, I highlighted those two, which reveals the trend better.

Junkcharts_redo_cdcvaccineintent_opacitywithlines

 

 

 


Simple charts are the hardest to do right

The CDC website has a variety of data graphics about many topics, one of which is U.S. vaccinations. I was looking for information about Covid-19 data broken down by age groups, and that's when I landed on these charts (link).

Cdc_vaccinations_by_age_small

The left panel shows people with at least one dose, and the right panel shows those who are "fully vaccinated." This simple chart takes an unreasonable amount of time to comprehend.

***

The analyst introduces three metrics, all of which are described as "percentages". Upon reflection, they are proportions of the people in specific age ranges.

Readers are thus invited to compare these proportions. It's not clear, however, which comparisons are intended. The first item listed in the legend states "Percent among Persons who completed all recommended doses in last 14 days". For most readers, including me, this introduces an unexpected concept. The 14 days here do not refer to the (in)famous 14-day case-counting window but literally the most recent two weeks relative to when the chart was produced.

It would have been clearer if the concept of Proportions were introduced in the chart title or axis title, while the color legend explains the concept of the base population. From the lighter shade to the darker shade (of red and blue) to the gray color, the base population shifts from "Among Those Who Completed/Initiated Vaccinations Within Last 14 Days" to "Among Those Who Completed/Initiated Vaccinations Any Time" to "Among the U.S. Population (regardless of vaccination status)".

Also, a reverse order helps our comprehension. Each subsequent category is a subset of the one above. First, the whole population, then those who are fully vaccinated, and finally those who recently completed vaccinations.

The next hurdle concerns the Q corner of our Trifecta Checkup. The design leaves few hints as to what question(s) its creator intended to address. The age distribution of the U.S. population is useless unless it is compared to something.

One apparently informative comparison is the age distribution of those fully vaccinated versus the age distribution of all Americans. This is revealed by comparing the lengths of the dark blue bar and the gray bar. But is this comparison informative? It's telling me that people aged 50 to 64 account for ~25% of those who are fully vaccinated, and ~20% of all Americans. Because proportions necessarily add to 100%, this implies that other age groups have been less vaccinated. Duh! Isn't that the result of an age-based vaccination prioritization? During the first week of the vaccination campaign, one might expect close to 100% of all vaccinations to be in the highest age group while it was 0% for the other age groups.

This is a chart in search of a question. The 25% vs 20% comparison does not assist readers in making a judgement. Does this mean the vaccination campaign is working as expected, worse than expected or better than expected? The problem is the wrong baseline. The designer of this chart implies that the expected proportions should conform to the overall age distribution - but that clearly stands in the way of CDC's initial prioritization of higher-risk age groups.

***

In my version of the chart, I illustrate the proportion of people in each age group who have been fully vaccinated.

Junkcharts_cdcvaccinationsbyage_1
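The two kinds of numbers are related by simple arithmetic. A minimal sketch follows, using the approximate 25% and 20% figures read off the CDC chart; the overall vaccination rate of 40% is a made-up placeholder, since the chart itself does not supply it.

```python
def group_vaccination_rate(share_of_vaccinated, share_of_population, overall_rate):
    """Proportion of an age group that is fully vaccinated.

    vaccinated in group / people in group
      = (share_of_vaccinated * overall_rate * N) / (share_of_population * N)
    """
    return share_of_vaccinated * overall_rate / share_of_population


share_of_vaccinated = 0.25  # ages 50-64 as a share of the fully vaccinated (approx., from the chart)
share_of_population = 0.20  # ages 50-64 as a share of all Americans (approx., from the chart)
overall_rate = 0.40         # hypothetical: share of all Americans fully vaccinated

rate_50_64 = group_vaccination_rate(share_of_vaccinated, share_of_population, overall_rate)
print(f"Fully vaccinated among ages 50-64: {rate_50_64:.0%}")
```

The ratio 25/20 only says this age group is vaccinated at 1.25 times the overall rate. Without the overall rate, the CDC chart cannot tell readers what proportion of the age group has been vaccinated.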

Among those fully vaccinated, some did it within the most recent two weeks:

Junkcharts_cdcvaccinationsbyage_2

***

Elsewhere on the CDC site, one learns that on these charts, "fully vaccinated" means one shot of J&J or 2 shots of Pfizer or Moderna, without dealing with the 14-day window or other complications. Why do we think different definitions are used in different analyses? Story-first thinking, as I have explained here. When it comes to telling the story about vaccinations, the story is about the number of shots in arms. They want as big a number as possible, and abandon any criterion that decreases the count. When it comes to reporting on vaccine effectiveness, they want as small a number of cases as possible.

 

 

 

 

 


What metaphors give, they take away

Aleks pointed me to the following graphic making the rounds on Twitter:

Whyaxis_covid_men

It's being passed around as an example of great dataviz.

The entire attraction rests on a risque metaphor. The designer is illustrating a claim that Covid-19 causes erectile dysfunction in men.

That's a well-formed question so in using the Trifecta Checkup, that's a pass on the Q corner.

What about the visual metaphor? I advise people to think twice before using metaphors because these devices can give as they can take. This example is no exception. Some readers may pay attention to the orientation but other readers may focus on the size.

I pulled out the tape measure. Here's what I found.

Junkcharts_covid_eds

The angle is accurate on the first chart but the diameter has been exaggerated relative to the other. The angle is slightly magnified in the bottom chart which has a smaller circumference.

***

Let's look at the Data to round out our analysis. They come from a study from Italy (link), utilizing survey responses. There were 25 male respondents in the survey who self-reported having had Covid-19. Seven of these submitted answers to a set of five questions that were "suggestive of erectile dysfunction". (This isn't as arbitrary as it sounds - apparently it is an internationally accepted way of conducting research.) Seven out of 25 is 28 percent. Because the sample size is small, the 95% confidence range is 10% to 46%.

The researchers then used the propensity scoring method to find 3 matches per each infected person. Each match is a survey respondent who did not self-report having had Covid-19. See this post about a real-world vaccine study to learn more about propensity scoring. Among the 75 non-infected men, 7 were judged to have ED. The 95% range is 3% to 16%.
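For readers who want to check the two ranges, here's a short sketch using the normal-approximation (Wald) interval for a proportion. With samples this small, other interval methods would give somewhat different endpoints; the Wald formula is simply the one that reproduces the rounded figures quoted above.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


infected = wald_interval(7, 25)  # 7 of 25 men who reported having had Covid-19
matched = wald_interval(7, 75)   # 7 of 75 matched men who did not

print(f"infected:     {infected[0]:.0%} to {infected[1]:.0%}")
print(f"not infected: {matched[0]:.0%} to {matched[1]:.0%}")
```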

The difference between the two subgroups is quite large. The paper also includes other research that investigates the mechanisms that can explain the observed correlation. Nevertheless, the two proportions depicted in the chart have wide error bars around them.

I have always had a question about analysis using this type of survey data (including my own work). How do they know that ED follows infection rather than precedes it? One of the inviolable rules of causation is that the effect follows the cause. If it's a series of surveys, the sequencing may be measurable but a single survey presents challenges. 

The headline of the dataviz is "Get your vaccines". This comes from a "story time" moment in the paper. On page 1, under Discussion and conclusion, they inserted the sentence "Universal vaccination against COVID-19 and the personal protective equipment could possibly have the added benefit of preventing sexual dysfunctions." Nothing in the research actually supports this claim. The only time the word "vaccine" appears in the entire paper is on that first page.

"Story time" is the moment in a scientific paper when the researchers - after lulling readers to sleep over some interesting data - roll out statements that are not supported by the data presented before.

***

The graph succeeds in catching people's attention. The visual metaphor works in one sense but not in a different sense.

 

P.S. [8/6/2021] One final note for those who do care about the science: the internet survey not surprisingly has a youth bias. The median age of 25 infected people was 39, maxing out at 45 while the median of the 75 not infected was 42, maxing out at 49.


Ranking data provide context but can also confuse

This dataviz from the Economist had me spending a lot of time clicking around - which means it is a success.

Econ_usaexcept_hispanic

The graphic presents four measures of wellbeing in society - life expectancy, infant mortality rate, murder rate and prison population. The primary goal is to compare nations across those metrics. The focus is on comparing how certain nations (or subgroups) rank against each other, as indicated by the relative vertical position.

The Economist staff has a particular story to tell about racial division in the US. The dotted bars represent the U.S. average. The colored bars are the averages for Hispanic, white and black Americans. The wider the gap between the colored bars, the more the experiences of American races vary.

The chart shows that the racial gap of life expectancy is the widest. For prison population, the U.S. and its racial subgroups occupy many of the lowest (i.e. least desirable) ranks, with the smallest gap in ranking.

***

The primary element of interactivity is hovering on a bar, which then highlights the four bars corresponding to the particular nation selected. Here is the picture for Thailand:

Econ_usaexcept_thailand

According to this view of the world, Thailand is a close cousin of the U.S. On each metric, the Thai value clings pretty near the U.S. average and sits within the range by racial groups. I'm surprised to learn that the prison population in Thailand is among the highest in the world.

Unfortunately, this chart form doesn't facilitate comparing Thailand to a country other than the U.S., as one can highlight only one country at a time.

***

While the main focus of the chart is on relative comparison through ranking, the reader can extract absolute difference by reading the lengths of the bars.

This is a close-up of the bottom of the prison population metric:

Econ_useexcept_prisonpop_bottom

The length of each bar displays the numeric data. The red line is an outlier in this dataset. Black Americans suffer an incarceration rate that is almost three times the national average. Even white Americans (blue line) are imprisoned at a rate higher than most countries around the world.

As noted above, the prison population metric exhibits the smallest gap between racial subgroups. This chart is a great example of why ranking data frequently hide important information. The small gap in ranking masks the extraordinary absolute difference in incarceration rates between white and black America.

The difference between rank #1 and rank #2 is enormous.

Econ_useexcept_lifeexpect_top

The opposite situation appears for life expectancy. The life expectancy values are bunched up especially at the top of the scale. The absolute difference between Hispanic and black America is 82 - 75 = 7 years, which looks small because the axis starts at zero. On a ranking scale, Hispanic is roughly in the top 15% while black America is just above the median. The relative difference is huge.

For life expectancy, ranking conveys the view that even a 7-year difference is a big deal because the countries are tightly bunched together. For prison population, ranking shows the view that a multiple fold difference is "unimportant" because a 20-0 blowout and a 10-0 blowout are both heavy defeats.

***

Whenever you transform numeric data to ranks, remember that you are artificially treating the gap between each value and the next value as a constant, even when the underlying numeric gaps show wide variance.
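A tiny sketch makes the point. The entity names and rates below are invented for illustration and are not taken from the Economist dataset; `to_ranks` is a hypothetical helper that assumes there are no ties.

```python
def to_ranks(values):
    """Rank 1 goes to the largest value. Ties are assumed absent."""
    ordered = sorted(values, key=values.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


# Hypothetical incarceration rates per 100,000 people.
rates = {"A": 2000, "B": 700, "C": 650, "D": 600}
ranks = to_ranks(rates)

names = sorted(rates, key=rates.get, reverse=True)
value_gaps = [rates[a] - rates[b] for a, b in zip(names, names[1:])]
rank_gaps = [ranks[b] - ranks[a] for a, b in zip(names, names[1:])]

print("gaps between neighbors, in values:", value_gaps)
print("gaps between neighbors, in ranks: ", rank_gaps)
```

The value gaps come out as 1300, 50 and 50, while every rank gap is exactly 1. The outlier at the top looks no more distant from its neighbor than any other pair.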

 

 

 

 

 


Hanging things on your charts

The Financial Times published the following chart that shows the rollout of vaccines in the U.K.

Ft_astrazeneca_uk_rollout

(I can't find the online link to the article. The article is titled "AstraZeneca and Oxford face setbacks and success as battle enters next phase", May 29/30 2021.)

This chart form is known as a "streamgraph", and it is a stacked area chart in disguise. 

The same trick can be applied to a column chart. See the "hanging" column chart below:

Junkcharts_hangingcolumns

The two charts show exactly the same data. The left one roots the columns at the bottom. The right one aligns the middle of the columns. 

I have rarely found these hanging charts useful. The realignment makes it harder to compare the sizes of the different column segments. On the normal stacked column chart, the yellow segments are the easiest to compare because they share the same base level. Even this is taken away from the reader on the right side.

Note also that the hanging version does not admit a vertical axis.

The same comments apply to the streamgraph.
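Here's a small sketch of what the realignment does to the geometry. The segment heights are made up, and `segment_bounds` is a hypothetical helper: it returns the bottom and top of each segment, either rooted at zero or centered on the middle of the column.

```python
def segment_bounds(segments, centered=False):
    """Return (bottom, top) for each segment, stacked from the bottom up."""
    total = sum(segments)
    base = -total / 2 if centered else 0.0
    bounds = []
    for height in segments:
        bounds.append((base, base + height))
        base += height
    return bounds


# Hypothetical columns; the first number in each list is the bottom segment.
columns = [[3, 2, 1], [5, 1, 2], [2, 2, 2]]

rooted = [segment_bounds(col) for col in columns]
hanging = [segment_bounds(col, centered=True) for col in columns]

rooted_bases = [bounds[0][0] for bounds in rooted]
hanging_bases = [bounds[0][0] for bounds in hanging]

print("bottom segment starts at (rooted): ", rooted_bases)
print("bottom segment starts at (hanging):", hanging_bases)
```

In the rooted version, every bottom segment starts at zero, so those segments can be compared directly. In the hanging version, the starting level depends on the column total, so no segment has a common base. The axis values also turn negative below the midline, which is why a vertical axis no longer makes sense.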

***

Nevertheless, I was surprised that the FT chart shown above actually works. The main message I learned was that initially the U.K. primarily rolled out AstraZeneca and, to a lesser extent, Pfizer shots, while later, they introduced other vaccines, including Johnson & Johnson, Novavax, CureVac, Moderna, and "Other".

I can also see that the supply of AstraZeneca has not changed much through the entire time window. Pfizer has grown to roughly the same scale as AstraZeneca. Moderna remains a small fraction of total shots. 

I can even roughly see that the total number of vaccinations has grown about six times from start to finish. 

That's quite a lot for one chart, so job well done!

There is one problem with the FT chart. It should have labelled end of May as "today". Half the chart is history, and the other half is the future.

***

For those following Covid-19 news, the FT chart is informative in a different way.

There is a misleading statement going around blaming the U.K.'s recent surge in cases on the AstraZeneca vaccine, claiming that the U.K. mostly uses AZ. This chart shows that from the start, about a third of the shots administered in the U.K. are Pfizer, and Pfizer's share has been growing over time.

U.K. compared to some countries mostly using mRNA vaccines

Ourworldindata_cases

U.K. is almost back to the winter peak. That's because the U.K. is serious about counting cases. Look at the state of testing in these countries:

Ourworldindata_tests

What's clear about the U.S. case count is that it is kept low by cutting the number of tests by two-thirds. Thus, our data are once again severely biased towards severe cases.

We can do a back-of-the-envelope calculation. The drop in testing may directly lead to a proportional drop in reported cases, thus removing 500 (asymptomatic, or mild) cases per million from the case count. The case count goes below 250 per million so the additional 200 or so reduction is due to other reasons such as vaccinations.


Probabilities and proportions: which one is the chart showing

The New York Times showed this chart (link):

Nyt_unvaccinated_undeterred

My first read: oh my gosh, 40-50% of the unvaccinated Americans are living their normal lives - dining at restaurants, assembling with more than 10 people, going to religious gatherings.

After reading the text around this chart, I realize I have misinterpreted it.

The chart should be read by columns. Each column is a "pie chart". For example, the first column shows that half the restaurant diners are not vaccinated, a third are fully vaccinated, and the remainder are partially vaccinated. The other columns have roughly the same proportions.

The author says "The rates of vaccination among people doing these activities largely reflect the rates in the population." This line is perhaps more confusing than intended. What she's saying is that in the general population, half of us are unvaccinated, a third are fully vaccinated, and the remainder are partially vaccinated.

Here's a picture:

Junkcharts_redo_nyt_unvaccinatedundeterred

What this chart is saying is that the people dining out are like a random sample from all Americans. So too the other groups depicted. What Americans are choosing to do is independent of their vaccination status.

Unvaccinated people are no less likely to be doing all these activities than the fully vaccinated. This raises the question: are half of the people not wearing masks outdoors unvaccinated?

***

Why did I read the chart wrongly in the first place? It has to do with expectations.

Most survey charts plot probabilities not proportions. I haphazardly grabbed the following Pew Research chart as an example:

Pew_kids_socialmedia

From this chart, we learn that 30% of kids 9-11 years old use TikTok compared to 11% of kids 5-8. The percentages down a column do not sum to 100%.
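The two readings come from the same table of counts, divided by different totals. Here's a sketch with invented survey counts, chosen so that activity is independent of vaccination status; none of these numbers come from the New York Times or Pew data.

```python
# Hypothetical respondents who did each activity, by vaccination status.
# Activities are not mutually exclusive.
counts = {
    "unvaccinated": {"dined out": 250, "religious gathering": 150},
    "partially vaccinated": {"dined out": 85, "religious gathering": 50},
    "fully vaccinated": {"dined out": 165, "religious gathering": 100},
}
# Hypothetical number of respondents in each vaccination group.
group_sizes = {"unvaccinated": 500, "partially vaccinated": 170, "fully vaccinated": 330}


def proportions_within_activity(counts, activity):
    """Among people doing the activity, the share in each vaccination group."""
    total = sum(row[activity] for row in counts.values())
    return {status: row[activity] / total for status, row in counts.items()}


def probability_of_activity(counts, group_sizes, activity):
    """Within each vaccination group, the share doing the activity."""
    return {status: counts[status][activity] / group_sizes[status] for status in counts}


proportions = proportions_within_activity(counts, "dined out")
probabilities = probability_of_activity(counts, group_sizes, "dined out")

print("proportions (sum to 100%):", {k: f"{v:.0%}" for k, v in proportions.items()})
print("probabilities (do not):   ", {k: f"{v:.0%}" for k, v in probabilities.items()})
```

In this made-up table, half of the diners are unvaccinated (a proportion), and half of the unvaccinated dined out (a probability). The two halves answer different questions. The proportions mirror the group sizes precisely because the probability of dining out is the same in every group.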

 


The time has arrived for cumulative charts

Long-time reader Scott S. asked me about this Washington Post chart that shows the disappearance of pediatric flu deaths in the U.S. this season:

Washingtonpost_pediatricfludeaths

The dataset behind this chart is highly favorable to the designer, because the signal in the data is so strong. This is a good chart. The key point is shown clearly right at the top, with an informative title. Gridlines are very restrained. I'd draw attention to the horizontal axis. The master stroke here is omitting the week labels, which are likely confusing to all but the people familiar with this dataset.

Scott suggested using a line chart. I agree. And especially if we plot cumulative counts, rather than weekly deaths. Here's a quick sketch of such a chart:

Junkcharts_redo_wppedflu_panel

(On second thought, I'd remove the week numbers from the horizontal axis, and just go with the month labels. The Washington Post designer is right in realizing that those week numbers are meaningless to most readers.)
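Turning weekly counts into cumulative counts is a running sum. A minimal sketch with made-up weekly numbers (not the actual pediatric flu data):

```python
from itertools import accumulate

# Hypothetical weekly deaths for one flu season.
weekly = [0, 1, 3, 2, 5, 4, 0, 1]

# Each entry is the total from the start of the season through that week.
cumulative = list(accumulate(weekly))

print(cumulative)
```

A cumulative line never goes down, so a season with almost no deaths shows up as a line hugging the horizontal axis, while a normal season climbs steadily.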

The vaccine trials have brought this cumulative count chart form to the mainstream. For anyone who has seen the vaccine efficacy charts, the interpretation of the panel of line charts should come naturally.

Instead of four plots, I prefer one plot with four superimposed lines. Like this:

Junkcharts_redo_wppeddeaths_superpose2

 

 

 


Same data + same chart form = same story. Maybe.

We love charts that tell stories.

Some people believe that if they situate the data in the right chart form, the stories reveal themselves.

Some people believe for a given dataset, there exists a best chart form that brings out the story.

An implication of these beliefs is that the story is immutable, given the dataset and the chart form.

If you use the Trifecta Checkup, you already know I don't subscribe to those ideas. That's why the Trifecta has three legs, the third of which is the question - which is related to the message or the story.

***

I came across the following chart by Statista, illustrating the growth in Covid-19 cases from the start of the pandemic to this month. The underlying data are collected by WHO and cover the entire globe. The data are grouped by regions.

Statista_avgnewcases

The story of this chart appears to be that the world moves in lock step, with each region behaving more or less the same.

If you visit the WHO site, they show a similar chart:

WHO_horizontal_casesbyregion

On this chart, the regions at the bottom of the graph (esp. Southeast Asia in purple) clearly do not follow the same time patterns as Americas (orange) or Europe (green).

What we're witnessing is: same data, same chart form, different stories.

This is a feature, not a bug, of the stacked area chart. The story is driven largely by the order in which the pieces are stacked. In the Statista chart, the largest pieces are placed at the bottom while for WHO, the order is exactly reversed.

(There are minor differences which do not affect my argument. The WHO chart omits the "Other" category which accounts for very little. Also, the Statista chart shows the smoothed data using 7-day averaging.)

In this example, the order chosen by WHO preserves the story while the order chosen by Statista wipes it out.
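Why does the stacking order matter so much? In a stacked area chart, each band sits on top of the sum of the bands beneath it. The sketch below uses invented case counts for three regions (not the WHO data) and a hypothetical helper, `baselines`, that computes where each band starts.

```python
def baselines(series, order):
    """Lower edge of each region's band at each time point, stacking bottom-up."""
    n_points = len(next(iter(series.values())))
    running = [0] * n_points
    result = {}
    for region in order:
        result[region] = list(running)
        running = [running[t] + series[region][t] for t in range(n_points)]
    return result


# Hypothetical case counts at four time points.
series = {
    "Americas": [10, 40, 20, 60],
    "Europe": [5, 30, 10, 50],
    "Southeast Asia": [2, 4, 12, 6],
}

large_first = baselines(series, ["Americas", "Europe", "Southeast Asia"])
small_first = baselines(series, ["Southeast Asia", "Europe", "Americas"])

print("Southeast Asia baseline, large regions at the bottom:", large_first["Southeast Asia"])
print("Southeast Asia baseline, small regions at the bottom:", small_first["Southeast Asia"])
```

With the large regions at the bottom, the small region rides on a baseline that swings with the two big waves, so its own pattern gets buried. With the small region at the bottom, its baseline is flat and its shape is read directly.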

***

What might be the underlying question of someone who makes this graph? Perhaps it is to identify the relative prevalence of Covid-19 in different regions at different stages of the pandemic.

Emphasis on the word "relative". Instead of plotting absolute number of cases, I consider plotting relative number of cases, that is to say, the proportion of cases in each region at given times.

This leads to a stacked area percentage chart.

Junkcharts_redo_statistawho_covidregional

In this side-by-side view, you see that this form is not affected by flipping the order of the regions. Both charts say the same thing: that there were two waves in Europe and the Americas that dwarfed all other regions.
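The conversion from absolute counts to proportions is a division by the total at each time point. A sketch with invented counts for three regions (again, not the WHO data); `to_shares` is a hypothetical helper.

```python
def to_shares(cases_by_region):
    """Convert counts to each region's proportion of the total at each time point."""
    regions = list(cases_by_region)
    n_points = len(cases_by_region[regions[0]])
    totals = [sum(cases_by_region[r][t] for r in regions) for t in range(n_points)]
    return {
        r: [cases_by_region[r][t] / totals[t] if totals[t] else 0.0 for t in range(n_points)]
        for r in regions
    }


# Hypothetical case counts at four time points.
cases = {
    "Americas": [10, 40, 20, 60],
    "Europe": [5, 30, 10, 50],
    "Southeast Asia": [2, 4, 12, 6],
}

shares = to_shares(cases)
for region, values in shares.items():
    print(region, [f"{v:.0%}" for v in values])
```

Because the shares at each time point add up to 100%, the chart always fills the same rectangle, and the top and bottom bands both have a flat edge to read against.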

 

 


Making graphics last over time

Yesterday, I analyzed the data visualization by the White House showing the progress of U.S. Covid-19 vaccinations. Here is the chart.

Whgov_proportiongettingvaccinated

John tweeted this at me, saying "please get a better data viz".

I'm happy to work with them or the CDC on better dataviz. Here's an example of what I do.

Junkcharts_redo_whgov_usvaccineprogress

Obviously, I'm using made-up data here and this is a sketch. I want to design a chart that can be updated continuously, as data accumulate. That's one of the shortcomings of that bubble format they used.

In earlier months, the chart can be clipped to just the lower left corner.

Junkcharts_redo_whgov_usvaccineprogress_2