
Simple charts are the hardest to do right

The CDC website has a variety of data graphics about many topics, one of which is U.S. vaccinations. I was looking for information about Covid-19 data broken down by age groups, and that's when I landed on these charts (link).

Cdc_vaccinations_by_age_small

The left panel shows people with at least one dose, and the right panel shows those who are "fully vaccinated." This simple chart takes an unreasonable amount of time to comprehend.

***

The analyst introduces three metrics, all of which are described as "percentages". Upon reflection, they are proportions of the people in specific age ranges.

Readers are thus invited to compare these proportions. It's not clear, however, which comparisons are intended. The first item listed in the legend states "Percent among Persons who completed all recommended doses in last 14 days". For most readers, including me, this introduces an unexpected concept. The 14 days here do not refer to the (in)famous 14-day case-counting window but literally the most recent two weeks relative to when the chart was produced.

It would have been clearer if the concept of Proportions were introduced in the chart title or axis title, while the color legend explains the concept of the base population. From the lighter shade to the darker shade (of red and blue) to the gray color, the base population shifts from "Among Those Who Completed/Initiated Vaccinations Within Last 14 Days" to "Among Those Who Completed/Initiated Vaccinations Any Time" to "Among the U.S. Population (regardless of vaccination status)".

Also, a reverse order helps our comprehension. Each subsequent category is a subset of the one above. First, the whole population, then those who are fully vaccinated, and finally those who recently completed vaccinations.

The next hurdle concerns the Q corner of our Trifecta Checkup. The design leaves few hints as to what question(s) its creator intended to address. The age distribution of the U.S. population is useless unless it is compared to something.

One apparently informative comparison is the age distribution of those fully vaccinated versus the age distribution of all Americans. This is revealed by comparing the lengths of the dark blue bar and the gray bar. But is this comparison informative? It's telling me that people aged 50 to 64 account for ~25% of those who are fully vaccinated, and ~20% of all Americans. Because proportions necessarily add to 100%, this implies that other age groups have been less vaccinated. Duh! Isn't that the result of an age-based vaccination prioritization? During the first week of the vaccination campaign, one might expect close to 100% of all vaccinations to be in the highest age group while it was 0% for the other age groups.

This is a chart in search of a question. The 25% vs 20% comparison does not assist readers in making a judgement. Does this mean the vaccination campaign is working as expected, worse than expected or better than expected? The problem is the wrong baseline. The designer of this chart implies that the expected proportions should conform to the overall age distribution - but that baseline clearly conflicts with the CDC's initial prioritization of higher-risk age groups.

***

In my version of the chart, I illustrate the proportion of people in each age group who have been fully vaccinated.

Junkcharts_cdcvaccinationsbyage_1
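For readers who want to see how the CDC's proportions relate to the within-group rates I plot, here is a minimal sketch. The ~25% and ~20% figures are the ones read off the CDC chart for ages 50 to 64; the overall vaccination rate of 50% is a made-up placeholder, not a CDC number, and the function name is my own.

```python
def within_group_rate(share_of_vaccinated, share_of_population, overall_rate):
    """Fraction of an age group that is fully vaccinated.

    share_of_vaccinated: the group's share of all fully vaccinated people
    share_of_population: the group's share of the whole population
    overall_rate: fraction of the whole population that is fully vaccinated
    """
    if not 0 < share_of_population <= 1:
        raise ValueError("share_of_population must be in (0, 1]")
    return share_of_vaccinated * overall_rate / share_of_population


# ~25% of the fully vaccinated and ~20% of all Americans are aged 50 to 64.
# overall_rate=0.50 is a hypothetical placeholder for illustration only.
rate_50_64 = within_group_rate(0.25, 0.20, overall_rate=0.50)
print(f"Share of 50-64 year olds fully vaccinated: {rate_50_64:.1%}")
```

The point of the conversion is that the within-group rate answers a question ("how much of this age group is covered?"), while the ratio of 25% to 20% on its own does not.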

Among those fully vaccinated, some did it within the most recent two weeks:

Junkcharts_cdcvaccinationsbyage_2

***

Elsewhere on the CDC site, one learns that on these charts, "fully vaccinated" means one shot of J&J or 2 shots of Pfizer or Moderna, without dealing with the 14-day window or other complications. Why do we think different definitions are used in different analyses? Story-first thinking, as I have explained here. When it comes to telling the story about vaccinations, the story is about the number of shots in arms. They want as big a number as possible, and abandon any criterion that decreases the count. When it comes to reporting on vaccine effectiveness, they want as small a number of cases as possible.



Check your presumptions while you're reading this chart about Israel's vaccination campaign

On July 30, Israel began administering third doses of mRNA vaccines to targeted groups of people. This decision was controversial since there is no science to support it. The policymakers do have educated guesses by experts based on best-available information. By science, I mean actual evidence. Since no one has previously been given three shots, there can be no data on which anyone can root such a decision. Nevertheless, the pandemic does not always give us time to collect relevant data, and so speculative analysis has found its calling.

Dvir Aran, at Technion, has been diligently tracking the situation in Israel on his Twitter. Ten days after July 30, he posted the following chart, which immediately led many commentators to bounce out of their seats crowning the third shot as a magic bullet. Notably, Dvir himself did not endorse such a claim. (See here to learn how other hasty conclusions by experts have fared.)

When you look at Dvir's chart, what do you see?

Dvir_aran_chart

Possibly one of the following two things, depending on what concern you have in your head.

1) The red line sits far above the other two lines, showing that unvaccinated people are much more likely to get infected.

2) The blue line diverges from the green line almost immediately after the 3rd shots started getting into arms, showing that the 3rd shot is super effective.

If you take another moment to look, you might start asking questions, as many in Twitter world did. Dvir was startlingly efficient at answering these queries.

A) Does the green line represent people with 2 or 3 doses, or is it strictly 2 doses? Aron asked this question and got the answer (the former):

AronBrand_israelcases_twoorthreedoses

It's time to check our presumptions. When you read that chart, did you presume it's exactly 2 doses or did you presume it's 2 or 3 doses? Or did you immediately spot the ambiguity? As I said in this article, graphs attain efficiency at communication because the designer leverages unspoken rules - the chart conveys certain information without explicitly placing it on the chart. But this can backfire. In this case, I presumed the three lines to display three non-overlapping groups of people, and thus the green line indicates those with 2 doses but not 3. That presumption led me to misinterpret what's on the chart.

B) What is the denominator of the case rates? Is it literal - by that I mean, all unvaccinated people for the red line, and all people with 3 doses for the blue line? Or is the denominator the population of Israel, the same number for all three lines? Lukas asked this question, and got the answer (the former).

Lukas_denominator

C) Since third shots are recommended for people aged 60 and over who were vaccinated at least 5 months ago, and most unvaccinated Israelis are below 60, this answer opens the possibility that the lines compare apples and oranges. Joe S. asked about this, and received an answer (all lines display only those aged 60 and over).

Joescholar_basepopulationquestion

Jason P. asked, and learned that the 5-month-out criterion is immaterial since 90% of the vaccinated have already reached that time point.

JasonPogue_5monthsout

D) We have even more presumptions. Like me, did you presume that the red line represents the "unvaccinated," meaning people who have not had any vaccine shots? If so, we may both be wrong about this. It has become the norm among vaccine researchers to lump "partially vaccinated" people with "unvaccinated", and call this combined group "unvaccinated". Here is an excerpt from a recent report from Public Health Ontario (link to PDF), which clearly states this unintuitive counting rule:

Ontario_case_definition

Notice that in this definition, someone who got infected within 14 days of the first shot is classified as an "unvaccinated" case and not a "partially vaccinated" case.

In the following tweet, Dvir gave a hint of what he plotted:

Dvir_group_definition

In a previous analysis, he averaged the rates of people with 0 doses and 1 dose, which is equivalent to combining them and calling them unvaccinated. It's unclear to me what he did to the 1-dose subgroup in our featured chart - did it just vanish from the chart? (How people and cases are classified into these groups is a major factor in all vaccine effectiveness calculations - a topic I covered here. Unfortunately, most published reports do a poor job explaining what the analysts did.)

E) Did you presume that all three lines are equally important? That's far from true. Since Israel is the world champion in vaccination, the bulk of the 60+ population forms the green line. I asked Dvir and he responded that only 7.5%, or roughly 100K, are unvaccinated.

DvirAran_proportionofunvaccinated

That means 1.2 million people are part of the green line, 12 times as many. There are roughly 50 cases per day among the unvaccinated, and 370 daily cases among those with 2 or 3 doses. In other words, vaccinated people account for almost 90% of all cases.

Yes, this is inevitable when over 90% of the age group have been vaccinated (but it was predictable on the very first day someone blasted everywhere the claim that real-world VE is proved by the fact that almost all new cases were in the unvaccinated).

If your job is to minimize infections, you should be spending more of your time thinking about the 370 cases among the vaccinated than the 50 cases among the unvaccinated. Halving the case rate would avoid 185 daily cases in the first group versus 25 in the second. In Israel, the vaccination campaign has already succeeded; it's time to look forward, which is exactly why they are re-focusing on the already vaccinated.
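The arithmetic behind these statements can be sketched in a few lines. All inputs are the rounded figures quoted above for Israelis aged 60 and over; the variable and function names are my own.

```python
# Figures quoted in the post for Israelis aged 60 and over (all approximate).
unvaccinated_people = 100_000    # roughly 7.5% of the age group
vaccinated_people = 1_200_000    # people with 2 or 3 doses (the green line)
unvaccinated_daily_cases = 50
vaccinated_daily_cases = 370

total_daily_cases = unvaccinated_daily_cases + vaccinated_daily_cases
vaccinated_case_share = vaccinated_daily_cases / total_daily_cases
vaccinated_people_share = vaccinated_people / (
    vaccinated_people + unvaccinated_people
)


def cases_avoided(daily_cases, rate_reduction=0.5):
    """Daily cases avoided if the group's case rate falls by rate_reduction."""
    return daily_cases * rate_reduction


print(f"Vaccinated share of people: {vaccinated_people_share:.0%}")
print(f"Vaccinated share of cases:  {vaccinated_case_share:.0%}")
print(cases_avoided(vaccinated_daily_cases), cases_avoided(unvaccinated_daily_cases))
```

When one group makes up over 90% of the population, it will supply most of the cases unless its case rate is a small fraction of the other group's.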

***

If what you worry about most is the effectiveness of the original two-dose regimen, Dvir's chart raises a puzzle. Ignore the blue line, and remember that the green line already includes everybody represented by the blue line.

In the following chart, I removed the blue line, and added reference lines in dashed purple that correspond to 25%, 50% and 75% vaccine effectiveness. The data plotted on this chart are unadjusted case rates. A 75% effective vaccine cuts case rate by three quarters.

Junkcharts_dviraran_israel_threeshotschart
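For those curious how such reference lines are placed, here is a minimal sketch of the simplistic calculation. The unvaccinated case rate of 40 is a hypothetical number for illustration, not a value read from Dvir's chart, and the function names are mine.

```python
def reference_rate(unvaccinated_rate, ve):
    """Case rate a vaccinated group would show if the vaccine had
    effectiveness `ve` under the crude (unadjusted) calculation."""
    return unvaccinated_rate * (1 - ve)


def crude_ve(vaccinated_rate, unvaccinated_rate):
    """Crude vaccine effectiveness: one minus the ratio of case rates."""
    return 1 - vaccinated_rate / unvaccinated_rate


# Hypothetical unvaccinated case rate per 100K on a given day.
unvaccinated_rate = 40.0

# One dashed reference line per effectiveness level.
reference_lines = {
    ve: reference_rate(unvaccinated_rate, ve) for ve in (0.25, 0.50, 0.75)
}
for ve, rate in reference_lines.items():
    print(f"VE {ve:.0%}: vaccinated rate would be {rate:.1f} per 100K")
```

Repeating this for each day's unvaccinated rate traces out the dashed lines. The calculation uses unadjusted rates, so it inherits every problem of the crude method.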

This chart shows the 2-dose mRNA vaccine was nowhere near 90% effective. (As regular readers know, I don't endorse this simplistic calculation and have outlined the problems here, but this style of calculation keeps getting published and passed around. Those who use it to claim real-world studies confirm prior clinical trial outcomes can either (a) insist on using it and retract their earlier conclusions, or (b) admit that such a calculation was, and is, a bad take.)

Also observe how the vaccinated (green) line is moving away from the unvaccinated (red) line. The vaccine apparently is becoming more effective, which runs counter to the trend used by the Israeli government to justify third doses. This improvement also precedes the start of the third-shot campaign. When the analytical method is bad, it generates all sorts of spurious findings.

***

As Dvir said, it is premature to comment on the third doses based on 10 days of data. For one thing, the vaccine developers insist that their vaccines must be given 14 days to work. In a typical calculation, all of the cases in the blue line fall outside the case-counting window. The effective number of cases that would be attributed to the 3-dose group right now is zero, and the vaccine effectiveness using the standard methodology is 100%, even better than shown in the chart.

There is an alternative interpretation of this graph. Statisticians call this the selection effect. On July 30, the blue line split out of the green: some people were selected to receive the 3rd dose - this includes an official selection (the government makes certain subgroups eligible) as well as a self-selection (within the eligible subgroup, certain people decide to get the 3rd shot earlier.) If those who are less exposed to the virus, or more risk averse, get the shots first, then all that is happening may be that we have split off a high VE subgroup from the green line. Even if the third shot were useless, the selection effect itself could explain the gap.
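A toy calculation shows how this can happen. Every number below is invented for illustration: two equally sized strata of vaccinated people with different exposure, a third dose that changes nothing, and early takers who come mostly from the low-exposure stratum.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    people: int           # vaccinated people in this stratum
    boosted: int          # of whom this many took the third dose early
    rate_per_100k: float  # daily case rate, identical with or without dose 3


def crude_rate(cases, people):
    """Unadjusted daily cases per 100K people."""
    return cases / people * 100_000


# Hypothetical strata: low-exposure people rush to get the third dose.
strata = [
    Stratum(people=500_000, boosted=200_000, rate_per_100k=10.0),  # low exposure
    Stratum(people=500_000, boosted=50_000, rate_per_100k=40.0),   # high exposure
]

boosted_people = sum(s.boosted for s in strata)
boosted_cases = sum(s.boosted * s.rate_per_100k / 100_000 for s in strata)
all_people = sum(s.people for s in strata)
all_cases = sum(s.people * s.rate_per_100k / 100_000 for s in strata)

blue_line = crude_rate(boosted_cases, boosted_people)  # 3 doses
green_line = crude_rate(all_cases, all_people)         # 2 or 3 doses

print(f"3 doses:      {blue_line:.1f} per 100K")
print(f"2 or 3 doses: {green_line:.1f} per 100K")
```

In this made-up setup, the blue line sits well below the green line even though the third dose has zero effect within each stratum. The gap comes entirely from who chose to get boosted first.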

Statistics is about grays. It's not either-or. It's usually some of each. If this feels like Groundhog Day, you're getting the picture. When they rolled out two doses, we lived through an optimistic period in which most experts rejoiced about 90-100% real-world effectiveness, and then as more people got vaccinated, the effect washed away. The selection effect gradually disappears when vaccination becomes widespread. Are we starting a new cycle of hope and despair? We'll find out soon enough.


What metaphors give, they take away

Aleks pointed me to the following graphic making the rounds on Twitter:

Whyaxis_covid_men

It's being passed around as an example of great dataviz.

The entire attraction rests on a risque metaphor. The designer is illustrating a claim that Covid-19 causes erectile dysfunction in men.

That's a well-formed question, so in Trifecta Checkup terms, it earns a pass on the Q corner.

What about the visual metaphor? I advise people to think twice before using metaphors because these devices can take away as much as they give. This example is no exception. Some readers may pay attention to the orientation but other readers may focus on the size.

I pulled out the tape measure. Here's what I found.

Junkcharts_covid_eds

The angle is accurate on the first chart but the diameter has been exaggerated relative to the other. The angle is slightly magnified in the bottom chart which has a smaller circumference.

***

Let's look at the Data to round out our analysis. They come from a study from Italy (link), utilizing survey responses. There were 25 male respondents in the survey who self-reported having had Covid-19. Seven of these submitted answers to a set of five questions that were "suggestive of erectile dysfunction". (This isn't as arbitrary as it sounds - apparently it is an internationally accepted way of conducting research.) Seven out of 25 is 28 percent. Because the sample size is small, the 95% confidence range is 10% to 46%.

The researchers then used the propensity scoring method to find three matches for each infected person. Each match is a survey respondent who did not self-report having had Covid-19. See this post about a real-world vaccine study to learn more about propensity scoring. Among the 75 non-infected men, 7 were judged to have ED. The 95% range is 3% to 16%.
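Both ranges are consistent with the textbook normal-approximation (Wald) interval for a proportion. The sketch below assumes that method and a z-value of 1.96; other interval methods would give somewhat different endpoints for samples this small.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Normal-approximation confidence interval for a proportion,
    clipped to the [0, 1] range."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


infected = wald_interval(7, 25)  # 7 of 25 men who reported having had Covid-19
matched = wald_interval(7, 75)   # 7 of 75 matched men who did not

print(f"Infected: {infected[0]:.0%} to {infected[1]:.0%}")
print(f"Matched:  {matched[0]:.0%} to {matched[1]:.0%}")
```

The half-width shrinks with the square root of the sample size, which is why 25 respondents leave a range spanning more than 35 percentage points.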

The difference between the two subgroups is quite large. The paper also includes other research that investigates the mechanisms that can explain the observed correlation. Nevertheless, the two proportions depicted in the chart have wide error bars around them.

I have always had a question about analysis using this type of survey data (including my own work). How do they know that ED follows infection rather than precedes it? One of the inviolable rules of causation is that the effect follows the cause. If it's a series of surveys, the sequencing may be measurable but a single survey presents challenges. 

The headline of the dataviz is "Get your vaccines". This comes from a "story time" moment in the paper. On page 1, under Discussion and conclusion, they inserted the sentence "Universal vaccination against COVID-19 and the personal protective equipment could possibly have the added benefit of preventing sexual dysfunctions." Nothing in the research actually supports this claim. The only time the word "vaccine" appears in the entire paper is on that first page.

"Story time" is the moment in a scientific paper when the researchers - after lulling readers to sleep over some interesting data - roll out statements that are not supported by the data presented before.

***

The graph succeeds in catching people's attention. The visual metaphor works in one sense but not in a different sense.

P.S. [8/6/2021] One final note for those who do care about the science: the internet survey, not surprisingly, has a youth bias. The median age of the 25 infected people was 39, maxing out at 45, while the median of the 75 not infected was 42, maxing out at 49.