Same data + same chart form = same story. Maybe.

We love charts that tell stories.

Some people believe that if they situate the data in the right chart form, the stories reveal themselves.

Some people believe for a given dataset, there exists a best chart form that brings out the story.

An implication of these beliefs is that the story is immutable, given the dataset and the chart form.

If you use the Trifecta Checkup, you already know I don't subscribe to those ideas. That's why the Trifecta has three legs, the third is the question - which is related to the message or the story.

***

I came across the following chart by Statista, illustrating the growth in Covid-19 cases from the start of the pandemic to this month. The underlying data are collected by WHO and cover the entire globe. The data are grouped by regions.

Statista_avgnewcases

The story of this chart appears to be that the world moves in lock step, with each region behaving more or less the same.

If you visit the WHO site, they show a similar chart:

WHO_horizontal_casesbyregion

On this chart, the regions at the bottom of the graph (esp. Southeast Asia in purple) clearly do not follow the same time patterns as Americas (orange) or Europe (green).

What we're witnessing is: same data, same chart form, different stories.

This is a feature, not a bug, of the stacked area chart. The story is driven largely by the order in which the pieces are stacked. In the Statista chart, the largest pieces are placed at the bottom while for WHO, the order is exactly reversed.

(There are minor differences which do not affect my argument. The WHO chart omits the "Other" category which accounts for very little. Also, the Statista chart shows the smoothed data using 7-day averaging.)

In this example, the order chosen by WHO preserves the story while the order chosen by Statista wipes it out.

***

What might be the underlying question of someone who makes this graph? Perhaps it is to identify the relative prevalence of Covid-19 in different regions at different stages of the pandemic.

Emphasis on the word "relative". Instead of plotting absolute number of cases, I consider plotting relative number of cases, that is to say, the proportion of cases in each region at given times.

This leads to a stacked area percentage chart.

Junkcharts_redo_statistawho_covidregional

In this side-by-side view, you see that this form is not affected by flipping the order of the regions. Both charts say the same thing: that there were two waves in Europe and the Americas that dwarfed all other regions.

 

 


Making graphics last over time

Yesterday, I analyzed the data visualization by the White House showing the progress of U.S. Covid-19 vaccinations. Here is the chart.

Whgov_proportiongettingvaccinated

John who tweeted this at me, saying "please get a better data viz".

I'm happy to work with them or the CDC on better dataviz. Here's an example of what I do.

Junkcharts_redo_whgov_usvaccineprogress

Obviously, I'm using made-up data here and this is a sketch. I want to design a chart that can be updated continuously, as data accumulate. That's one of the shortcomings of that bubble format they used.

In earlier months, the chart can be clipped to just the lower left corner.

Junkcharts_redo_whgov_usvaccineprogress_2


Circular areas offer misleading cues of their underlying data

John M. pointed me on Twitter to this chart about the progress of U.S.'s vaccination campaign:

Whgov_proportiongettingvaccinated

This looks like a White House production, retweeted by WHO. John is unhappy about this nested bubble format, which I'll come back to later.

Let's zoom in on what matters:

Whgov_proportiongettingvaccinated_clip

An even bigger problem with this chart is the Q corner in our Trifecta Checkup. What is the question they are trying to address? It would appear to be the proportion of population that has "already received [one or more doses of] vaccine". And the big words tell us the answer is 8 percent.

_junkcharts_trifectacheckupBut is that really the question? Check out the dark blue circle. It is labeled "population that has already received vaccine" and thus we infer this bubble represents 8 percent. Now look at the outer bubble. Its annotation is "new population that received vaccine since January 27, 2021". The only interpretation that makes sense is that 8 percent  is not the most current number. If that is the case, why would the headline highlight an older statistic, and not the most up-to-date one?

Perhaps the real question is how fast is the progress in vaccination. Perhaps it took weeks to get to the dark circle and then days to get beyond. In order to improve this data visualization, we must first decide what the question really is.

***

Now let's get to those nested bubbles. The bubble chart is a format that is not "sufficient," by which I mean the visual by itself does not convey the data without the help of aids such as labels. Try to answer the following questions:

Junkcharts_whgov_vaccineprogress_bubblequiz

In my view, if your answer to the last question is anything more than 5 seconds, the dataviz has failed. A successful data visualization should not make readers solve puzzles.

The first two questions depict the confusing nature of concentric circle diagrams. The first data point is coded to the inner circle. Where is the second data point? Is it encoded to the outer circle, or just the outer ring?

In either case, human brains are not trained to compare circular areas. For question 1, the outer circle is 70% larger than the smaller circle. For question 2, the ring is 70% of the area of the dark blue circle. If you're thinking those numbers seem unreasonable, I can tell you that was my first reaction too! So I made the following to convince myself that the calculation was correct:

Junkcharts_whgov_vaccineprogress_bubblequiz_2

Circular areas offer misleading visual cues, and should be used sparingly.

[P.S. 2/10/2021. In the next post, I sketch out an alternative dataviz for this dataset.]


Illustrating differential growth rates

Reader Mirko was concerned about a video published in Germany that shows why the new coronavirus variant is dangerous. He helpfully provided a summary of the transcript:

The South African and the British mutations of the SARS-COV-2 virus are spreading faster than the original virus. On average, one infected person infects more people than before. Researchers believe the new variant is 50 to 70 % more transmissible.

Here are two key moments in the video:

Germanvid_newvariant1

This seems to be saying the original virus (left side) replicates 3 times inside the infected person while the new variant (right side) replicates 19 times. So we have a roughly 6-fold jump in viral replication.

Germanvid_newvariant2

Later in the video, it appears that every replicate of the old virus finds a new victim while the 19 replicates of the new variant land on 13 new people, meaning 6 replicates didn't find a host.

As Mirko pointed out, the visual appears to have run away from the data. (In our Trifecta Checkup, we have a problem with the arrow between the D and the V corners. What the visual is saying is not aligned with what the data are saying.)

***

It turns out that the scientists have been very confusing when talking about the infectiousness of this new variant. The most quoted line is that the British variant is "50 to 70 percent more transmissible". At first, I thought this is a comment on the famous "R number". Since the R number around December was roughly 1 in the U.K, the new variant might bring the R number up to 1.7.

However, that is not the case. From this article, it appears that being 5o to 70 percent more transmissible means R goes up from 1 to 1.4. R is interpreted as the average number of people infected by one infected person.

Mirko wonders if there is a better way to illustrate this. I'm sure there are many better ways. Here's one I whipped up:

Junkcharts_redo_germanvideo_newvariant

The left side is for the 40% higher R number. Both sides start at the center with 10 infected people. At each time step, if R=1 (right side), each of the 10 people infects 10 others, so the total infections increase by 10 per time step. It's immediately obvious that a 40% higher R is very serious indeed. Starting with 10 infected people, in 10 steps, the total number of infections is almost 1,000, almost 10 times higher than when R is 1.

The lines of the graphs simulate the transmission chains. These are "average" transmission chains since R is an average number.

 

P.S. [1/29/2021: Added the missing link to the article in which it is reported that 50-70 percent more transmissible implies R increasing by 40%.]

 

 


A beautiful curve and its deadly misinterpretation

When the preliminary analyses of their Phase 3 trials came out , vaccine developers pleased their audience of scientists with the following data graphic:

Pfizerfda_cumcases

The above was lifted out of the FDA briefing document for the Pfizer / Biontech vaccine.

Some commentators have honed in on the blue line for the vaccinated arm of the Pfizer trial.

Junkcharts_pfizerfda_redo_vaccinecases

Since the vertical axis shows cumulative number of cases, it is noted that the vaccine reached peak efficacy after 14 days following the first dose. The second dose was administered around Day 21. At this point, the vaccine curve appeared almost flat. Thus, these commentators argued, we should make a big bet on the first dose.

***

The chart is indeed very beautiful. It's rare to see such a huge gap between the test group and the control group. Notice that I just described the gap between test and control. That's what a statistician is looking at in that chart - not the blue line, but the gap between the red and blue lines.

Imagine: if the curve for the placebo group looked the same as that for the vaccinated group, then the chart would lose all its luster. Screams of victory would be replaced by tears of sadness.

Here I bring back both lines, and you should focus on the gaps between the lines:

Junkcharts_pfizerfda_redo_twocumcases

Does the action stop around day 14? The answer is a resounding No! In fact, the red line keeps rising so over time, the vaccine's efficacy improves (since VE is a ratio between the two groups).

The following shows the vaccine efficacy curve:

Junkcharts_pfizerfda_redo_ve

Right before the second dose, VE is just below 50%. VE keeps rising and reaches 70% by day 50, which is about a month after the second dose.

If the FDA briefing document has shown the VE curve, instead of the cumulative-cases curve, few would argue that you don't need the second dose!

***

What went wrong here? How come the beautiful chart may turn out to be lethal? (See this post on my book blog for reasons why I think foregoing or delaying the second dose will exacerbate the pandemic.)

It's a bit of bait and switch. The original chart plots cumulative case counts, separately for each treatment group. Cumulative case counts are inputs to computing vaccine efficacy. It is true that as the blue line for the vaccine flattens, VE would likely rise. But the case count for the vaccine group is an imperfect proxy for VE. As I showed above, the VE continues to gain strength long after the vaccine case count has levelled.

The important lesson for data visualization designers is: plot the metric that matters to decision-makers; avoid imperfect proxies.

 

P.S. [1/19/2021: For those who wants to get behind the math of all this, the following several posts on my book blog will help.

One-dose Pfizer is not happening, and here's why

The case for one-dose vaccines is lacking key details

One-dose vaccine strategy elevates PR over science

]

[1/21/2021: The Guardian chimes in with "Single Covid vaccine dose in Israel 'less effective than we thought'" (link). "In remarks reported by Army Radio, Nachman Ash said a single dose appeared “less effective than we had thought”, and also lower than Pfizer had suggested." To their credit, Pfizer has never publicly recommended a one-dose treatment.]

[1/21/2021: For people in marketing or business, I wrote up a new post that expresses the one-dose vs two-dose problem in terms of optimizing an email drip campaign. It boils down to: do you accept that argument that you should get rid of your latter touches because the first email did all the work? Or do you want to run an experiment with just one email before you decide? You can read this on the book blog here.]


Handling partial data on graphics

Last week, I posted on the book blog a piece about excess deaths and accelerated deaths (link). That whole piece is about how certain types of analysis have to be executed at certain moments of time.  The same analysis done at the wrong time yields the wrong conclusions.

Here is a good example of what I'm talking about. This is a graph of U.S. monthly deaths from Covid-19 during the entire pandemic. The chart is from the COVID Tracking Project, although I pulled it down from my Twitter feed.

Covidtracking_monthlydeaths

There is nothing majorly wrong with this column chart (I'd remove the axis labels). But there is a big problem. Are we seeing a boomerang of deaths from November to December to January?

Junkcharts_covidtrackingproject_monthlydeaths_1

Not really. This trend is there only because the chart is generated on January 12. The last column contains 12 days while the prior two columns contain 30-31 days.

Junkcharts_covidtrackingproject_monthlydeaths_2

The Trifecta Checkup picks up this problem. What the visual is showing isn't what the data are saying. I'd call this a Type D chart.

***

What to fix this?

One solution is to present partial data for all the other columns, so that the readers can compare the January column to the others.

Junkcharts_covidtrackingmonthydeaths_first12days

One critique of this is the potential seasonality. The first 38% (12 out of 31) of a month may not be comparable across months. A further seasonal adjustment makes this better - if we decide the benefits outweight the complexity.

Another solution is to project the full-month tally.

Junkcharts_covidtrackingmonthydeaths_projected

The critique here is the accuracy of the projection.

But the point is that not making the adjustment would be worse.

 

 


These are the top posts of 2020

It's always very interesting as a writer to look back at a year's of posts and find out which ones were most popular with my readers.

Here are the top posts on Junk Charts from 2020:

How to read this chart about coronavirus risk

This post about a New York Times scatter plot dates from February, a time when many Americans were debating whether Covid-19 was just the flu.

Proportions and rates: we are no dupes

This post about a ArsTechnica chart on the effects of Covid-19 by age is an example of designing the visual to reflect the structure of the data.

When the pie chart is more complex than the data

This post shows a 3D pie chart which is worse than a 2D pie chart.

Twitter people upset with that Covid symptoms diagram

This post discusses some complicated graphics designed to illustrate complicated datasets on Covid-19 symptoms.

Cornell must remove the logs before it reopens in the fall

This post is another warning to think twice before you use log scales.

What is the price of objectivity?

This post turns an "objective" data visualization into a piece of visual story-telling.

The snake pit chart is the best election graphic ever

This post introduces my favorite U.S. presidential election graphic, designed by the FiveThirtyEight team.

***

Here is a list of posts that deserve more attention:

Locating the political center

An example of bringing readers as close to the insights as possible

Visualizing change over time

An example of designing data visualization to reflect the structure of multivariate data

Bloomberg made me digest these graphics slowly

An example of simple and thoughtful graphics

The hidden bad assumption behind most dual-axis time-series charts

Read this before you make a dual-axis chart

Pie chart conventions

Read this before you make a pie chart

***
Looking forward to bring you more content in 2021!

Happy new year.


Convincing charts showing containment measures work

The disorganized nature of U.S.'s response to the coronavirus pandemic has created a sort of natural experiment that allows data journalists to explore important scientific questions, such as the impact of containment measures on cases and hospitalizations. This New York Times article represents the best of such work.

The key finding of the analysis is beautifully captured by this set of scatter plots:

Policies_cases_hosp_static

Each dot is a state. The cases (left plot) and hospitalizations (right plot) are plotted against the severity of containment measures for November. The negative correlation is unmistakable: the more containment measures taken, the lower the counts.

There are a few features worth noting.

The severity index came from a group at Oxford, and is a number between 0 and 100. The journalists decided to leave out the numerical labels, instead simply showing More and Fewer. This significantly reduces processing time. Readers won't be able to understand the index values anyway without reading the manual.

The index values are doubly encoded. They are first encoded by the location on the horizontal axis and redundantly encoded on the blue-red scale. Ordinarily, I do not like redundant encoding because the reader might assume a third dimension exists. In this case, I had no trouble with it.

The easiest way to see the effect is to ignore the muddy middle and focus on the two ends of the severity index. Those states with the fewest measures - South Dakota, North Dakota, Iowa - are the worst in cases and hospitalizations while those states with the most measures - New York, Hawaii - are among the best. This comparison is similar to what is frequently done in scientific studies, e.g. when they say coffee is good for you, they typically compare heavy drinkers (4 or more cups a day) with non-drinkers, ignoring the moderate and light drinkers.

Notably, there is quite a bit of variability for any level of containment measures - roughly 50 cases per 100,000, and 25 hospitalizations per 100,000. This indicates that containment measures are not sufficient to explain the counts. For example, the hospitalization statistic is affected by the stock of hospital beds, which I assume differ by state.

Whenever we use a scatter plot, we run the risk of xyopia. This chart form invites readers to explain an outcome (y-axis values) using one explanatory variable (on x-axis). There is an assumption that all other variables are unimportant, which is usually false.

***

Because of the variability, the horizontal scale has meaningless precision. The next chart cures this by grouping the states into three categories: low, medium and high level of measures.

Cases_over_time_grouped_by_policies

This set of charts extends the time window back to March 1. For the designer, this creates a tricky problem - because states adapt their policies over time. As indicated in the subtitle, the grouping is based on the average severity index since March, rather than just November, as in the scatter plots above.

***

The interplay between policy and health indicators is captured by connected scatter plots, of which the Times article included a few examples. Here is what happened in New York:

NewYork_policies_vs_cases

Up until April, the policies were catching up with the cases. The policies tightened even after the case-per-capita started falling. Then, policies eased a little, and cases started to spike again.

The Note tells us that the containment severity index is time shifted to reflect a two-week lag in effect. So, the case count on May 1 is not paired with the containment severity index of May 1 but of April 15.

***

You can find the full article here.

 

 

 


Visualizing change over time: case study via Arstechnica

ArsTechnica published the following chart in its article titled "Grim new analyses spotlight just how hard the U.S. is failing in  pandemic" (link).

Artechnica-covid-mortality

There are some very good things about this chart, so let me start there.

In a Trifecta Checkup, I'd give the Q corner high marks. The question is clear: how has the U.S. performed relative to other countries? In particular, the chart gives a nuanced answer to this question. The designer realizes that there are phases in the pandemic, so the same question is asked three times: how has the U.S. performed relative to other countries since June, since May, and since the start of the pandemic?

In the D corner, this chart also deserves a high score. It selects a reasonable measure of mortality, which is deaths per population. It simplifies cognition by creating three grades of mortality rates per 100,000. Grade A is below 5 deaths, Grade B, between 5 and 25, and Grade C is above 25. 

A small deduction for not including the source of the data (the article states it's from a JAMA article). If any reader notices problems with the underlying data or calculations, please leave a comment.

***

So far so good. And yet, you might feel like I'm over-praising a chart that feels distinctly average. Not terrible, not great.

The reason for our ambivalence is the V corner. This is what I call a Type V chart. The visual design isn't doing justice to the underlying question and data analysis.

The grouped bar chart isn't effective here because the orange bars dominate our vision. It's easy to see how each country performed over the course of the pandemic but it's hard to learn how countries compare to each other in different periods.

How are the countries ordered? It would seem like the orange bars may be the sorting variable but this interpretation fails in the third group of countries.

The designer apparently made the decision to place the U.S. at the bottom (i.e. the worst of the league table). As I will show later, this is justified but the argument cannot be justified by the orange bars alone. The U.S. is worse in both the blue and purple bars but not the orange.

This points out that there is interest in the change in rates (or ranks) over time. And in the following makeover, I used the Bumps chart as the basis, as its chief use is in showing how ranking changes over time.

Redo_junkcharts_at_coviddeathstable_1

 

Better clarity can often be gained by subtraction:

Redo_junkcharts_at_coviddeathstable_2


Making better pie charts if you must

I saw this chart on an NYU marketing twitter account:

LATAMstartupCEO_covidimpact

The graphical design is not easy on our eyes. It's just hard to read for various reasons.

The headline sounds like a subject line from an email.

The subheaders are long, and differ only by a single word.

Even if one prefers pie charts, they can be improved by following a few guidelines.

First, start the first sector at the 12-oclock direction. Like this:

Redo_junkcharts_latamceo_orientation

The survey uses a 5-point scale from "Very Good" to "Very Bad". Instead of using five different colors, it's better to use two extreme colors and shading. Like this:

Redo_junkcharts_latamceo_color

I also try hard to keep all text horizontal.

Redo_junkcharts_latamceo_labels

For those who prefers not to use pie charts, a side-by-side bar chart works well.

Redo_junkcharts_latamceo_bars

In my article for DataJournalism.com, I outlined "unspoken rules" for making various charts, including pie charts.