These are the top posts of 2020

It's always very interesting as a writer to look back at a year's worth of posts and find out which ones were most popular with my readers.

Here are the top posts on Junk Charts from 2020:

How to read this chart about coronavirus risk

This post about a New York Times scatter plot dates from February, a time when many Americans were debating whether Covid-19 was just the flu.

Proportions and rates: we are no dupes

This post about an ArsTechnica chart on the effects of Covid-19 by age is an example of designing the visual to reflect the structure of the data.

When the pie chart is more complex than the data

This post shows a 3D pie chart which is worse than a 2D pie chart.

Twitter people upset with that Covid symptoms diagram

This post discusses some complicated graphics designed to illustrate complicated datasets on Covid-19 symptoms.

Cornell must remove the logs before it reopens in the fall

This post is another warning to think twice before you use log scales.

What is the price of objectivity?

This post turns an "objective" data visualization into a piece of visual story-telling.

The snake pit chart is the best election graphic ever

This post introduces my favorite U.S. presidential election graphic, designed by the FiveThirtyEight team.

***

Here is a list of posts that deserve more attention:

Locating the political center

An example of bringing readers as close to the insights as possible

Visualizing change over time

An example of designing data visualization to reflect the structure of multivariate data

Bloomberg made me digest these graphics slowly

An example of simple and thoughtful graphics

The hidden bad assumption behind most dual-axis time-series charts

Read this before you make a dual-axis chart

Pie chart conventions

Read this before you make a pie chart

***
Looking forward to bringing you more content in 2021!

Happy New Year.


Convincing charts showing containment measures work

The disorganized U.S. response to the coronavirus pandemic has created a sort of natural experiment that allows data journalists to explore important scientific questions, such as the impact of containment measures on cases and hospitalizations. This New York Times article represents the best of such work.

The key finding of the analysis is beautifully captured by this set of scatter plots:

Policies_cases_hosp_static

Each dot is a state. The cases (left plot) and hospitalizations (right plot) are plotted against the severity of containment measures for November. The negative correlation is unmistakable: the more containment measures taken, the lower the counts.

There are a few features worth noting.

The severity index came from a group at Oxford, and is a number between 0 and 100. The journalists decided to leave out the numerical labels, instead simply showing More and Fewer. This significantly reduces processing time. Readers won't be able to understand the index values anyway without reading the manual.

The index values are doubly encoded. They are first encoded by the location on the horizontal axis and redundantly encoded on the blue-red scale. Ordinarily, I do not like redundant encoding because the reader might assume a third dimension exists. In this case, I had no trouble with it.

The easiest way to see the effect is to ignore the muddy middle and focus on the two ends of the severity index. Those states with the fewest measures - South Dakota, North Dakota, Iowa - are the worst in cases and hospitalizations while those states with the most measures - New York, Hawaii - are among the best. This comparison is similar to what is frequently done in scientific studies, e.g. when they say coffee is good for you, they typically compare heavy drinkers (4 or more cups a day) with non-drinkers, ignoring the moderate and light drinkers.

Notably, there is quite a bit of variability for any level of containment measures - roughly 50 cases per 100,000, and 25 hospitalizations per 100,000. This indicates that containment measures are not sufficient to explain the counts. For example, the hospitalization statistic is affected by the stock of hospital beds, which I assume differ by state.

Whenever we use a scatter plot, we run the risk of xyopia. This chart form invites readers to explain an outcome (y-axis values) using one explanatory variable (on x-axis). There is an assumption that all other variables are unimportant, which is usually false.

***

Because of the variability, the horizontal scale has meaningless precision. The next chart cures this by grouping the states into three categories: low, medium and high level of measures.

Cases_over_time_grouped_by_policies

This set of charts extends the time window back to March 1. For the designer, this creates a tricky problem because states adapt their policies over time. As indicated in the subtitle, the grouping is based on the average severity index since March, rather than just November as in the scatter plots above.
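To make the grouping concrete, here is a minimal sketch (not the Times' actual code) of how states could be binned by their average severity index since March. The dataframe columns and values below are hypothetical.

```python
import pandas as pd

# Hypothetical data: one row per state per day with an Oxford-style severity index (0-100).
df = pd.DataFrame({
    "state": ["NY", "NY", "SD", "SD", "IA", "IA"],
    "date": pd.to_datetime(["2020-03-01", "2020-11-01"] * 3),
    "severity_index": [75, 60, 20, 15, 30, 25],   # placeholder values
})

# Average severity per state since March 1, as described in the subtitle.
avg_severity = (
    df[df["date"] >= "2020-03-01"]
    .groupby("state")["severity_index"]
    .mean()
)

# Split the states into three equal-sized groups of containment measures.
groups = pd.qcut(avg_severity, q=3, labels=["fewer", "medium", "more"])
print(groups)
```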

***

The interplay between policy and health indicators is captured by connected scatter plots, of which the Times article included a few examples. Here is what happened in New York:

NewYork_policies_vs_cases

Up until April, the policies were catching up with the cases. The policies tightened even after the cases per capita started falling. Then, policies eased a little, and cases started to spike again.

The Note tells us that the containment severity index is time-shifted to reflect a two-week lag in effect. So the case count on May 1 is paired not with the containment severity index of May 1 but with that of April 15.
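As a sketch of that time shift, assuming a hypothetical daily dataframe for one state, the pairing could be produced like this:

```python
import pandas as pd

# Hypothetical daily series for one state.
ny = pd.DataFrame({
    "date": pd.date_range("2020-03-01", periods=120, freq="D"),
    "cases_per_100k": range(120),      # placeholder values
    "severity_index": [50] * 120,      # placeholder values
})

# Pair each day's case count with the severity index from two weeks earlier,
# reflecting the assumed lag before a policy change shows up in the case counts.
ny["severity_lagged"] = ny["severity_index"].shift(14)

# Each remaining row holds one point of the connected scatter plot.
paired = ny.dropna(subset=["severity_lagged"])
```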

***

You can find the full article here.


Visualizing change over time: case study via ArsTechnica

ArsTechnica published the following chart in its article titled "Grim new analyses spotlight just how hard the U.S. is failing in pandemic" (link).

Artechnica-covid-mortality

There are some very good things about this chart, so let me start there.

In a Trifecta Checkup, I'd give the Q corner high marks. The question is clear: how has the U.S. performed relative to other countries? In particular, the chart gives a nuanced answer to this question. The designer realizes that there are phases in the pandemic, so the same question is asked three times: how has the U.S. performed relative to other countries since June, since May, and since the start of the pandemic?

In the D corner, this chart also deserves a high score. It selects a reasonable measure of mortality: deaths per population. It simplifies cognition by creating three grades of mortality rates per 100,000: Grade A is below 5 deaths, Grade B is between 5 and 25, and Grade C is above 25.
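A sketch of that binning, using made-up mortality rates, might look like this:

```python
import pandas as pd

# Hypothetical deaths per 100,000 population for a few countries.
mortality = pd.Series({"Australia": 0.4, "Germany": 11.0, "Sweden": 56.0, "US": 47.0})

# Grade A: below 5 deaths per 100k; Grade B: 5 to 25; Grade C: above 25.
grades = pd.cut(mortality, bins=[0, 5, 25, float("inf")], labels=["A", "B", "C"])
print(grades)
```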

A small deduction for not including the source of the data (the article states it's from a JAMA article). If any reader notices problems with the underlying data or calculations, please leave a comment.

***

So far so good. And yet, you might feel like I'm over-praising a chart that feels distinctly average. Not terrible, not great.

The reason for our ambivalence is the V corner. This is what I call a Type V chart. The visual design isn't doing justice to the underlying question and data analysis.

The grouped bar chart isn't effective here because the orange bars dominate our vision. It's easy to see how each country performed over the course of the pandemic but it's hard to learn how countries compare to each other in different periods.

How are the countries ordered? It would seem that the orange bars are the sorting variable, but this interpretation fails in the third group of countries.

The designer apparently made the decision to place the U.S. at the bottom (i.e. the worst of the league table). As I will show later, this placement is warranted, but it cannot be justified by the orange bars alone: the U.S. is worse on the blue and purple bars but not the orange.

This points to an interest in the change in rates (or ranks) over time. In the following makeover, I used the Bumps chart as the basis, since its chief use is showing how rankings change over time.

Redo_junkcharts_at_coviddeathstable_1


Better clarity can often be gained by subtraction:

Redo_junkcharts_at_coviddeathstable_2
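For readers who want to try the idea themselves, a bumps chart only needs the ranks in each period. Here is a minimal sketch with hypothetical numbers, not the JAMA data:

```python
import pandas as pd

# Hypothetical deaths per 100,000 by country for the three periods in the chart.
df = pd.DataFrame({
    "since_start": [40, 67, 58, 56],   # placeholder values
    "since_may":   [25, 30, 10, 35],   # placeholder values
    "since_june":  [15,  8,  3, 20],   # placeholder values
}, index=["US", "UK", "Italy", "Sweden"])

# Rank the countries within each period (1 = lowest mortality).
# The bumps chart simply connects each country's rank across the periods.
ranks = df.rank(axis=0, method="min").astype(int)
print(ranks)
```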


Making better pie charts if you must

I saw this chart on an NYU marketing twitter account:

LATAMstartupCEO_covidimpact

The graphical design is not easy on the eyes; it is hard to read for several reasons.

The headline sounds like a subject line from an email.

The subheaders are long, and differ only by a single word.

Even if one prefers pie charts, they can be improved by following a few guidelines.

First, start the first sector at the 12 o'clock position. Like this:

Redo_junkcharts_latamceo_orientation

The survey uses a 5-point scale from "Very Good" to "Very Bad". Instead of using five different colors, it's better to use two extreme colors and shading. Like this:

Redo_junkcharts_latamceo_color

I also try hard to keep all text horizontal.

Redo_junkcharts_latamceo_labels
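Putting these guidelines together, here is a minimal matplotlib sketch with made-up survey numbers; the colors and values are mine, not the original chart's.

```python
import matplotlib.pyplot as plt

# Hypothetical responses on the 5-point scale, in percent.
labels = ["Very good", "Good", "Neutral", "Bad", "Very bad"]
shares = [10, 25, 30, 20, 15]   # placeholder values

# Two extreme hues with shading, instead of five unrelated colors.
colors = ["#08519c", "#9ecae1", "#cccccc", "#fcae91", "#cb181d"]

fig, ax = plt.subplots()
ax.pie(
    shares,
    labels=labels,
    colors=colors,
    startangle=90,        # first sector starts at the 12 o'clock position
    counterclock=False,   # sectors proceed clockwise
    labeldistance=1.1,    # labels sit outside the pie and stay horizontal
)
ax.set_aspect("equal")
plt.show()
```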

For those who prefer not to use pie charts, a side-by-side bar chart works well.

Redo_junkcharts_latamceo_bars

In my article for DataJournalism.com, I outlined "unspoken rules" for making various charts, including pie charts.


Why you should expunge the defaults from Excel or (insert your favorite graphing program)

Yesterday, I posted the following chart in the post about Cornell's Covid-19 case rate after re-opening for in-person instruction.

Redo_junkchats_fraziercornellreopeningsuccess2

This is an edited version of the chart used in Peter Frazier's presentation.

Pfrazier_cornellreopeningupdate

The original chart carries with it the burden of Excel defaults.

What did I change and why?

I switched away from the default color scheme, which ignores the relationships among the lines. In particular, the key comparison on this chart should be the actual case rate versus the nominal case rate. In addition, the three lines at the top are related, as they all come from the same underlying mathematical model, so I gave them the same color in different shades.

Also, instead of placing the legend as far away from the data labels as possible, I moved the line labels next to the data labels.

Instead of daily date labels, I moved to weekly labels, and set the month names on a separate level from the day names.

The dots were removed from the top three lines, but I'd have retained them, perhaps with some transparency, had I spent more time on the edits. I'd definitely keep the last dot, to make it clear that the blue lines contain one extra data point.
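A rough sketch of these edits in matplotlib, using invented numbers, might look like the following; the point is the shared hue for the related model lines and the labels placed right next to the data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical weekly series: actual case rate plus three related model lines.
weeks = pd.date_range("2020-08-30", periods=8, freq="W")
actual = [2, 3, 4, 4, 5, 6, 6, 7]                     # placeholder values
model = {
    "upper bound": [7, 7, 7, 7, 7, 7, 7, 7],          # placeholder values
    "nominal":     [5, 5, 5, 5, 5, 5, 5, 5],
    "lower bound": [4, 4, 4, 4, 4, 4, 4, 4],
}

fig, ax = plt.subplots()

# The three model lines come from one model, so they share a hue in different shades.
shades = ["#fcae91", "#fb6a4a", "#de2d26"]
for (name, series), shade in zip(model.items(), shades):
    ax.plot(weeks, series, color=shade)
    # Put the label right next to the line instead of in a distant legend.
    ax.annotate(name, (weeks[-1], series[-1]), xytext=(5, 0),
                textcoords="offset points", color=shade, va="center")

ax.plot(weeks, actual, color="#08519c")
ax.annotate("actual", (weeks[-1], actual[-1]), xytext=(5, 0),
            textcoords="offset points", color="#08519c", va="center")

ax.set_xticks(weeks)   # weekly, not daily, tick labels
plt.show()
```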

***

Every graphing program has defaults, typically computed by some algorithm tuned to the average chart. Don't settle for the average chart. Get rid of any default setting that slows down understanding.


Unlocking the secrets of a marvellous data visualization

Scmp_coronavirushk_paper

The graphics team at my hometown paper, SCMP, has developed a formidable reputation in data visualization, and I lapped up every drop of goodness in this beautiful graphic showing how the coronavirus spread around Hong Kong (in the first wave in April). Marcelo uploaded an image of the printed version to his Twitter. This graphic occupied the entire back page of that day's paper.

An online version of the chart is found here.

The data graphic is a masterclass in organizing data. While it looks complicated, I had no problem unpacking the different layers.

Cases were divided into imported cases (people returning to Hong Kong) and local cases. A small number of cases are considered in-betweens.

Scmp_coronavirushk_middle

The two major classes then occupy one half page each. I first looked at the top half, where my attention is drawn to the thickest flows. The majority of imported cases arrived from the U.K., and most of those were returning students. The U.S. is the next largest source of imported cases. The flows are carefully ordered by continent, with the Americas on the left, followed by Europe, Middle East, Africa, and Asia.

Junkcharts_scmpcoronavirushk_americas1

Where there are interesting back stories, the flow blossoms into a flower. An annotation explains the cluster of cases. Each anther represents a case. Eight people caught the virus while touring Bolivia together.

Junkcharts_scmpcoronavirushk_bolivia

One reads the local cases in the same way. Instead of flowers, think of roots. The biggest cluster by far was a band that played at clubs in three different parts of the city, infecting a total of 72 people.

Junkcharts_scmpcoronavirushk_localband

Everything is understood immediately, without a need to read text or refer to legends. The visual elements carry that kind of power.

***

This data graphic presents a perfect amalgam of art and science. In a flow chart, the data are encoded in the relative thickness of the lines. This leaves two unused dimensions of these lines: their curvature and length. The order of the countries and regions takes up the horizontal axis, but the vertical axis is free. Unshackled from the data, the designer introduced curves into the lines, varied their lengths, and dispersed their endings around the white space in an artistic manner.

The flowers/roots present another opportunity for creativity. The only data constraint is the number of cases in a cluster. The positions of the dots, and the shape of the lines leading to the dots are part of the playground.

What's more, the data visualization is a powerful reminder of the benefits of testing and contact tracing. The band cluster led to the closure of bars, which helped slow the spread of the coronavirus. 



Deaths as percent neither of cases nor of population. Deaths as percent of normal.

Yesterday, I posted a note about excess deaths on the book blog (link). The post was inspired by a nice data visualization by the New York Times (link). This is a great example of data journalism.

Nyt_excessdeaths_south

Excess deaths is a superior metric for measuring the effect of Covid-19 on public health: better than deaths as a percent of cases, and better than deaths as a percent of the population. What excess deaths measure is deaths as a percent of normal, where normal is usually defined as the average deaths in the respective week in years past.

The red areas indicate how far the deaths in the Southern states are above normal. The highest peak, registered in Texas in late July, is 60 percent above the normal level.

***

The best way to appreciate the effort that went into this graphic is to imagine receiving the outputs from the model that computes excess deaths: a three-column spreadsheet with columns "state", "week number" and "estimated excess deaths".

The first issue is unequal population sizes. More populous states of course have higher death tolls. Transforming death tolls to an index pegged to the normal level solves this problem. To produce this index, we divide actual deaths by the normal level of deaths. So the spreadsheet must be augmented by two additional columns, showing the historical average deaths and actual deaths for each state for each week. Then, the excess death index can be computed.
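As a sketch of that calculation, with hypothetical column names and placeholder numbers:

```python
import pandas as pd

# Hypothetical spreadsheet: one row per state per week.
df = pd.DataFrame({
    "state": ["TX", "TX", "NY", "NY"],
    "week": [28, 29, 28, 29],
    "actual_deaths": [4800, 5100, 1100, 1050],   # placeholder values
    "normal_deaths": [3000, 3000, 1000, 1000],   # historical weekly averages (placeholder)
})

# The excess death index: actual deaths as a percent of the normal level.
# An index of 160 means deaths ran 60 percent above normal that week.
df["excess_index"] = 100 * df["actual_deaths"] / df["normal_deaths"]
df["excess_deaths"] = df["actual_deaths"] - df["normal_deaths"]
```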

The journalist builds a story around the migration of the coronavirus between regions as it raged across different states during different weeks. To this end, the designer first divides the dataset into four regions (South, West, Midwest and Northeast). Within each region, the states must be ordered. For each state, the week of peak excess deaths is identified, and the peak index is used to sort the states.

The graphic utilizes a small-multiples framework. Time occupies the horizontal axis, by convention. The vertical axis is compressed so that the states are not too distant. For the same reason, the component graphs are allowed to overlap vertically. The benefit of the tight arrangement is clearer for the Northeast as those peaks are particularly tall. The space-saving appearance reminds me of sparklines, championed by Ed Tufte.

There is one small tricky problem. In most of June, Texas suffered at least 50 percent more deaths than normal. The severity of this excess death toll is shortchanged by the low vertical height of each component graph. What forced such congestion is probably the data from the Northeast. For example, New York City:

Nyt_excessdeaths_northeast3


New York City's death toll was almost 8 times the normal level at the start of the epidemic in the U.S. If the same vertical scale is maintained across the four regions, then the Northeastern states dwarf all else.

***

One key takeaway from the graphic for the Southern states is the persistence of the red areas. In each state, for almost every week of the entire pandemic period, actual deaths have exceeded the normal level. This is a strong indication that the coronavirus is not under control.

In fact, I'd like to see a second set of plots showing the cumulative excess deaths since March. The weekly graphic is better for identifying the ebb and flow while the cumulative graphic takes measure of the total impact of Covid-19.

***

The above description leaves out a huge chunk of work related to computing excess deaths. I assumed the designer receives these estimates from a data scientist. See the related post in which I explain how excess deaths are estimated from statistical models.



Ask how you can give

A reader and colleague, Georgette A., was frustrated with the following graphic, which appeared in an otherwise commendable article in National Geographic (link). The NatGeo article provides a history lesson on past pandemics that killed millions.

Natgeo_pandemichistory

What does the design want to convey to readers?

Our attention is drawn to the larger objects, the red triangle on the left or the green triangle on the right. Regarding the red triangle, we learn that the base is the duration of the pandemic while the height of the black bar represents the total deaths.

An immediate curiosity is why a green triangle is lodged in the middle of the red triangle. Answering this question requires figuring out the horizontal layout. Where we expect axis labels we find an unexpected series of numbers (0, 16, 48, 5, 2, 4, ...). These are durations that measure the widths of the triangular bases.

To solve this puzzle, imagine the chart with the triangles removed, leaving just the black columns. Now replace the durations with index numbers, 1 to 13, corresponding to the time order of the ending years of these epidemics. In other words, there is a time axis hidden behind the chart. [As Ken reminded me on Twitter, I forgot to mention that details of each pandemic are revealed by hovering over each triangle.]

This explains why the green triangle (Antonine Plague) is sitting inside the large red triangle (Plague of Justinian). The latter's duration is 3 times that of the former, and the Antonine Plague ended before the Plague of Justinian. In fact, the Antonine occurred during 165-180 while the Justinian happened during 541-588. The overlap is an invention of the design. To receive what the design gives, we have to think of time as a sequence, not of dates.

***

Now, compare the first and second red triangles. Their black columns both encode 50 million deaths. The Justinian Plague however was spread out over 48 years while the Black Death lasted just 5 years. This suggests that the Black Death was more fearsome than the Justinian Plague. And yet, the graphic presents the opposite imagery.

This is a pretty tough dataset to visualize. Here is a side-by-side bar chart that lets readers first compare deaths, and then compare durations.

Redo_natgeo_pandemichistory

In the meantime, I highly recommend the NatGeo article.


Everything in Texas is big, but not this BIG

Long-time reader John forwarded the following chart via Twitter.

Covidtracking_texassquare

The chart shows the recent explosive growth in deaths due to Covid-19 in Texas. John flagged this graphic as yet another example in which the data are encoded to the lengths of the squares, not their areas.

Fixing this chart just requires fixing the length of one side of the square. I also flipped it to make a conventional column chart.

Redo_texasdeathsquares_process

The final product:

Redo_texasdeaths_columns

An important qualification lurks in the footnote; in the makeover, it is applied directly to the label for July.

How much visual distortion is created when data are encoded to the lengths and not the areas? The following chart shows what readers see, assuming they correctly perceive the areas of those squares. The value for March is held the same as above while the other months show the death counts implied by the relative areas of the squares.

Redo_texasdeaths_distortion

Owing to squaring, the smaller counts are artificially compressed while the big numbers are massively exaggerated.
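The arithmetic behind the distortion is simple. A sketch with made-up counts, holding March as the baseline:

```python
# Hypothetical monthly death counts; March is held as the baseline.
deaths = {"March": 100, "April": 800, "May": 1600, "June": 2000, "July": 4400}
march = deaths["March"]

# If the data are encoded as the side length of a square, the area grows with
# the value squared. A reader who judges areas perceives march * (value/march)^2
# rather than the value itself.
perceived = {month: march * (value / march) ** 2 for month, value in deaths.items()}
print(perceived)   # ratios between months are squared: large counts look far larger relative to small ones
```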


This chart shows why the PR agency for the UK government deserves a Covid-19 bonus

The Economist illustrated some interesting consumer research with this chart (link):

Economist_covidpoll

The survey by Dalia Research asked people about their satisfaction with their country's response to the coronavirus crisis. The results are reduced to the "Top 2 Boxes": the proportion of people who rated their government's response "very well" or "somewhat well".

This dimension is laid out along the horizontal axis. The chart is a combo dot and bubble chart, arranged in rows by region of the world. Now what does the bubble size indicate?

It took me a while to find the legend as I was expecting it either in the header or the footer of the graphic. A larger bubble depicts a higher cumulative number of deaths up to June 15, 2020.

The key issue is the correlation between a country's death count and the people's evaluation of the government response.

Bivariate correlation is typically shown on a scatter plot. The following chart sets out the scatter plots in a small multiples format with each panel displaying a region of the world.

Redo_economistcovidpolling_scatter
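Here is a bare-bones sketch, with invented numbers and hypothetical column names, of how such a small-multiples layout could be set up:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical survey results: one row per country.
df = pd.DataFrame({
    "country": ["Japan", "South Korea", "France", "Germany", "US", "Canada"],
    "region": ["Asia", "Asia", "Europe", "Europe", "Americas", "Americas"],
    "deaths": [900, 280, 29000, 8900, 116000, 8100],        # placeholder values
    "pct_rated_well": [23, 80, 40, 70, 50, 74],             # placeholder values
})

# One scatter panel per region: deaths on the x-axis, approval on the y-axis.
regions = df["region"].unique()
fig, axes = plt.subplots(1, len(regions), sharey=True, figsize=(9, 3))
for ax, region in zip(axes, regions):
    sub = df[df["region"] == region]
    ax.scatter(sub["deaths"], sub["pct_rated_well"])
    ax.set_title(region)
    ax.set_xlabel("Cumulative deaths")
axes[0].set_ylabel("% rating response well")
plt.tight_layout()
plt.show()
```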

The death tolls in the Asian countries are low relative to the other regions, and yet the people's ratings vary widely. In particular, the Japanese people are pretty hard on their government.

In Europe, the people of Greece, the Netherlands and Germany think highly of their government responses, which have suppressed deaths. The French, Spaniards and Italians are understandably unhappy. The British appear to be the most forgiving of their government, despite suffering a higher death toll than France, Spain or Italy. This speaks well of their PR operation.

Cumulative deaths should be adjusted by population size for a proper comparison across nations. When the same graphic is produced using deaths per million (shown on the right below), the general story is preserved while the pattern is clarified:

Redo_economistcovidpolling_deathspermillion_2

The right chart shows deaths per million while the left chart shows total deaths.
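The per-capita adjustment itself is one line of arithmetic; a sketch with illustrative, non-authoritative figures:

```python
# Hypothetical cumulative deaths and populations (in millions).
deaths = {"UK": 42000, "France": 29000, "Spain": 27000, "Belgium": 9700}      # placeholder values
population_m = {"UK": 66.8, "France": 67.0, "Spain": 47.4, "Belgium": 11.5}   # placeholder values

# Deaths per million puts countries of different sizes on a comparable footing.
deaths_per_million = {c: deaths[c] / population_m[c] for c in deaths}
print(deaths_per_million)
```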

***

In the original Economist chart, what catches our attention first is the bubble size. Eventually, we notice the horizontal positioning of these bubbles. But the star of this chart ought to be the new survey data. I swapped those variables and obtained the following graphic:

Redo_economistcovidpolling_swappedvar

Instead of using bubble size, I switched to using color to illustrate the deaths-per-million metric. If ratings of the pandemic response correlate tightly with deaths per million, then we expect the color of these dots to evolve from blue on the left side to red on the right side.

The peculiar loss of correlation in the U.K. stands out. Their PR firm deserves a bonus!