Unlocking the secrets of a marvellous data visualization

Scmp_coronavirushk_paperThe graphics team in my hometown paper SCMP has developed a formidable reputation in data visualization, and I lapped every drop of goodness on this beautiful graphic showing how the coronavirus spread around Hong Kong (in the first wave in April). Marcelo uploaded an image of the printed version to his Twitter. This graphic occupied the entire back page of that day's paper.

An online version of the chart is found here.

The data graphic is a masterclass in organizing data. While it looks complicated, I had no problem unpacking the different layers.

Cases were divided into imported cases (people returning to Hong Kong) and local cases. A small number of cases are considered in-betweens.

Scmp_coronavirushk_middle

The two major classes then occupy one half page each. I first looked at the top half, where my attention is drawn to the thickest flows. The majority of imported cases arrived from the U.K., and most of those were returning students. The U.S. is the next largest source of imported cases. The flows are carefully ordered by continent, with the Americas on the left, followed by Europe, Middle East, Africa, and Asia.

Junkcharts_scmpcoronavirushk_americas1

Where there are interesting back stories, the flow blossoms into a flower. An annotation explains the cluster of cases. Each anther represents a case. Eight people caught the virus while touring Bolivia together.

Junkcharts_scmpcoronavirushk_bolivia

One reads the local cases in the same way. Instead of flowers, think of roots. The biggest cluster by far was a band that played at clubs in three different parts of the city, infecting a total of 72 people.

Junkcharts_scmpcoronavirushk_localband

Everything is understood immediately, without a need to read text or refer to legends. The visual elements carry that kind of power.

***

This data graphic presents a perfect amalgam of art and science. For a flow chart, the data are encoded in the relative thickness of the lines. This leaves two unused dimensions of these lines: the curvature and lengths. The order of the countries and regions take up the horizontal axis, but the vertical axis is free. Unshackled from the data, the designer introduced curves into the lines, varied their lengths, and dispersed their endings around the white space in an artistic manner.

The flowers/roots present another opportunity for creativity. The only data constraint is the number of cases in a cluster. The positions of the dots, and the shape of the lines leading to the dots are part of the playground.

What's more, the data visualization is a powerful reminder of the benefits of testing and contact tracing. The band cluster led to the closure of bars, which helped slow the spread of the coronavirus. 

 


Graphing the extreme

The Covid-19 pandemic has brought about extremes. So many events have never happened before. I doubt The Conference Board has previously seen the collapse of confidence in the economy by CEOs. Here is their graphic showing this extreme event:

Tcb_COVID-19-CEO-confidence-1170

To appreciate this effort, you have to see the complexity of the underlying data. There is a CEO Confidence Measure. The measure has three components. Each component is scored on a scale probably from 0 to 100, with 5o as the middle. Then, the components are aggregated into an overall score. The measure is repeatedly estimated over time, and they did two surveys during the Pandemic, pre and post the lockdown in the U.S. And then, there's the rightmost column, which provides another reference point for one of the components of the measure.

One can easily get one's limbs tied up in knots trying to tame this beast.

Of course, the tiny square stands out. CEOs have a super pessimistic outlook for the next 6 months for overall economy. The number 3 on this scale probably means almost every respondent has a negative view. 

The grid arrangement does not appear attractive but it is terrifically functional. The grid delivers horizontal and vertical comparisons. Moving vertically, we learn that even at the start of the year, the average sentiment was negative (9 points below 50), then it lost another 10 points, and finally imploded.

Moving horizontally, we can compare related metrics since everything is conveniently expressed in the same scale. While CEOs are depressed about the overall economy, they have slightly more faith about their own industry. And then moving left, we learn that many CEOs expect a V-shaped recovery, a really fast bounceback within 6 months. 

As the Conference Board surveys this group again in the near future, I wonder if the optimism still holds. 

The Conference Board has an entire set of graphics about the economic crisis of Covid-19 here. For some reason, they don't let me link to a specific chart so I can't directly link to the chart. 
 


Proportions and rates: we are no dupes

Reader Lucia G. sent me this chart, from Ars Technica's FAQ about the coronavirus:

Arstechnica_covid-19-2.001-1280x960

She notices something wrong with the axis.

The designer took the advice not to make a dual axis, but didn't realize that the two metrics are not measured on the same scale even though both are expressed as percentages.

The blue bars, labeled "cases", is a distribution of cases by age group. The sum of the blue bars should be 100 percent.

The orange bars show fatality rates by age group. Each orange bar's rate is based on the number of cases in that age group. The sum of the orange bars will not add to 100 percent.

In general, the rates will have much lower values than the proportions. At least that should be the case for viruses that are not extremely fatal.

This is what the 80 and over section looks like.

Screen Shot 2020-03-12 at 1.19.46 AM

It is true that fatality rate (orange) is particularly high for the elderly while this age group accounts for less than 5 percent of total cases (blue). However, the cases that are fatal, which inhabit the orange bar, must be a subset of the total cases for 80 and over, which are shown in the blue bar. Conceptually, the orange bar should be contained inside the blue bar. So, it's counter-intuitive that the blue bar is so much shorter than the orange bar.

The following chart fixes this issue. It reveals the structure of the data, Total cases are separated by age group, then within each age group, a proportion of the cases are fatal.

Junkcharts_redo_arstechnicacovid19

This chart also shows that most patients recover in every age group. (This is only approximately true as some of the cases may not have been discharged yet.)

***

This confusion of rates and proportions reminds me of something about exit polls I just wrote about the other day on the sister blog.

When the media make statements about trends in voter turnout rate in the primary elections, e.g. when they assert that youth turnout has not increased, their evidence is from exit polls, which can measure only the distribution of voters by age group. Exit polls do not and cannot measure the turnout rate, which is the proportion of registered (or eligible) voters in the specific age group who voted.

Like the coronavirus data, the scales of these two metrics are different even though they are both percentages: the turnout rate is typically a number between 30 and 70 percent, and summing the rates across all age groups will exceed 100 percent many times over. Summing the proportions of voters across all age groups should be 100 percent, and no more.

Changes in the proportion of voters aged 18-29 and changes in the turnout rate of people aged 18-29 are not the same thing. The former is affected by the turnout of all age groups while the latter is a clean metric affected only by 18 to 29-years-old.

Basically, ignore pundits who use exit polls to comment on turnout trends. No matter how many times they repeat their nonsense, proportions and rates are not to be confused. Which means, ignore comments on turnout trends because the only data they've got come from exit polls which don't measure rates.

 

P.S. Here is some further explanation of my chart, as a response to a question from Enrico B. on Twitter.

The chart can be thought of as two distributions, one for cases (gray) and one for deaths (red). Like this:

Junkcharts_redo_arstechnicacoronavirus_2

The side-by-side version removes the direct visualization of the fatality rate within each age group. To understand fatality rate requires someone to do math in their head. Readers can qualitatively assess that for the 80 and over, they accounted for 3 percent of cases but also about 21 percent of deaths. People aged 70 to 79 however accounted for 9 percent of cases but 30 percent of deaths, etc.

What I did was to scale the distribution of deaths so that they can be compared to the cases. It's like fitting the red distribution inside the gray distribution. Within each age group, the proportion of red against the length of the bar is the fatality rate.

For every 100 cases regardless of age, 3 cases are for people aged 80 and over within which 0.5 are fatal (red).

So, the axis labels are correct. The values are proportions of total cases, although as the designer of the chart, I hope people are paying attention more to the proportion of red, as opposed to the units.

What might strike people as odd is that the biggest red bar does not appear against 80 and above. We might believe it's deadlier the older you are. That's because on an absolute scale, more people aged 70-79 died than those 80 and above. The absolute deaths is the product of the proportion of cases and the fatality rate. That's really a different story from the usual plot of fatality rates by age group. In those charts, we "control" for the prevalence of cases. If every age group were infected in the same frequency, then COVID-19 does kill more 80 and over.

 

 

 


Taking small steps to bring out the message

Happy new year! Good luck and best wishes!

***

We'll start 2020 with something lighter. On a recent flight, I saw a chart in The Economist that shows the proportion of operating income derived from overseas markets by major grocery chains - the headline said that some of these chains are withdrawing from international markets.

Econ_internationalgroceries_sm

The designer used one color for each grocery chain, and two shades within each color. The legend describes the shades as "total" and "of which: overseas". As with all stacked bar charts, it's a bit confusing where to find the data. The "total" is actually the entire bar, not just the darker shaded part. The darker shaded part is better labeled "home market" as shown below:

Redo_econgroceriesintl_1

The designer's instinct to bring out the importance of international markets to each company's income is well placed. A second small edit helps: plot the international income amounts first, so they line up with the vertical zero axis. Like this:

Redo_econgroceriesintl_2

This is essentially the same chart. The order of international and home market is reversed. I also reversed the shading, so that the international share of income is displayed darker. This shading draws the readers' attention to the key message of the chart.

A stacked bar chart of the absolute dollar amounts is not ideal for showing proportions, because each bar is a different length. Sometimes, plotting relative values summing to 100% for each company may work better.

As it stands, the chart above calls attention to a different message: that Walmart dwarfs the other three global chains. Just the international income of Walmart is larger than the total income of Costco.

***

Please comment below or write me directly if you have ideas for this blog as we enter a new decade. What do you want to see more of? less of?


This Wimbledon beauty will be ageless

Ft_wimbledonage


This Financial Times chart paints the picture of the emerging trend in Wimbledon men’s tennis: the average age of players has been rising, and hits 30 years old for the first time ever in 2019.

The chart works brilliantly. Let's look at the design decisions that contributed to its success.

The chart contains a good amount of data and the presentation is carefully layered, with the layers nicely tied to some visual cues.

Readers are drawn immediately to the average line, which conveys the key statistical finding. The blue dot  reinforces the key message, aided by the dotted line drawn at 30 years old. The single data label that shows a number also highlights the message.

Next, readers may notice the large font that is applied to selected players. This device draws attention to the human stories behind the dry data. Knowledgable fans may recall fondly when Borg, Becker and Chang burst onto the scene as teenagers.

 

Then, readers may pick up on the ticker-tape data that display the spread of ages of Wimbledon players in any given year. There is some shading involved, not clearly explained, but we surmise that it illustrates the range of ages of most of the contestants. In a sense, the range of probable ages and the average age tell the same story. The current trend of rising ages began around 2005.

 

Finally, a key data processing decision is disclosed in chart header and sub-header. The chart only plots the players who reached the fourth round (16). Like most decisions involved in data analysis, this choice has both desirable and undesirable effects. I like it because it thins out the data. The chart would have appeared more cluttered otherwise, in a negative way.

The removal of players eliminated in the early rounds limits the conclusion that one can draw from the chart. We are tempted to generalize the finding, saying that the average men’s player has increased in age – that was what I said in the first paragraph. Thinking about that for a second, I am not so sure the general statement is valid.

The overall field might have gone younger or not grown older, even as the older players assert their presence in the tournament. (This article provides side evidence that the conjecture might be true: the author looked at the average age of players in the top 100 ATP ranking versus top 1000, and learned that the average age of the top 1000 has barely shifted while the top 100 players have definitely grown older.)

So kudos to these reporters for writing a careful headline that stays true to the analysis.

I also found this video at FT that discussed the chart.

***

This chart about Wimbledon players hits the Trifecta. It has an interesting – to some, surprising – message (Q). It demonstrates thoughtful processing and analysis of the data (D). And the visual design fits well with its intended message (V). (For a comprehensive guide to the Trifecta Checkup, see here.)


Inspiration from a waterfall of pie charts: illustrating hierarchies

Reader Antonio R. forwarded a tweet about the following "waterfall of pie charts" to me:

Water-stats-pie21

Maarten Lamberts loved these charts (source: here).

I am immediately attracted to the visual thinking behind this chart. The data are presented in a hierarchy with three levels. The levels are nested in the sense that the pieces in each pie chart add up to 100%. From the first level to the second, the category of freshwater is sub-divided into three parts. From the second level to the third, the "others" subgroup under freshwater is sub-divided into five further categories.

The designer faces a twofold challenge: presenting the proportions at each level, and integrating the three levels into one graphic. The second challenge is harder to master.

The solution here is quite ingenious. A waterfall/waterdrop metaphor is used to link each layer to the one below. It visually conveys the hierarchical structure.

***

There remains a little problem. There is a confusion related to the part and the whole. The link between levels should be that one part of the upper level becomes the whole of the lower level. Because of the color scheme, it appears that the part above does not account for the entirety of the pie below. For example, water in lakes is plotted on both the second and third layers while water in soil suddenly enters the diagram at the third level even though it should be part of the "drop" from the second layer.

***

I started playing around with various related forms. I like the concept of linking the layers and want to retain it. Here is one graphic inspired by the waterfall pies from above:

Redo_waterfall_pies