Unlocking the secrets of a marvellous data visualization

Scmp_coronavirushk_paperThe graphics team in my hometown paper SCMP has developed a formidable reputation in data visualization, and I lapped every drop of goodness on this beautiful graphic showing how the coronavirus spread around Hong Kong (in the first wave in April). Marcelo uploaded an image of the printed version to his Twitter. This graphic occupied the entire back page of that day's paper.

An online version of the chart is found here.

The data graphic is a masterclass in organizing data. While it looks complicated, I had no problem unpacking the different layers.

Cases were divided into imported cases (people returning to Hong Kong) and local cases. A small number of cases are considered in-betweens.

Scmp_coronavirushk_middle

The two major classes then occupy one half page each. I first looked at the top half, where my attention is drawn to the thickest flows. The majority of imported cases arrived from the U.K., and most of those were returning students. The U.S. is the next largest source of imported cases. The flows are carefully ordered by continent, with the Americas on the left, followed by Europe, Middle East, Africa, and Asia.

Junkcharts_scmpcoronavirushk_americas1

Where there are interesting back stories, the flow blossoms into a flower. An annotation explains the cluster of cases. Each anther represents a case. Eight people caught the virus while touring Bolivia together.

Junkcharts_scmpcoronavirushk_bolivia

One reads the local cases in the same way. Instead of flowers, think of roots. The biggest cluster by far was a band that played at clubs in three different parts of the city, infecting a total of 72 people.

Junkcharts_scmpcoronavirushk_localband

Everything is understood immediately, without a need to read text or refer to legends. The visual elements carry that kind of power.

***

This data graphic presents a perfect amalgam of art and science. For a flow chart, the data are encoded in the relative thickness of the lines. This leaves two unused dimensions of these lines: the curvature and lengths. The order of the countries and regions take up the horizontal axis, but the vertical axis is free. Unshackled from the data, the designer introduced curves into the lines, varied their lengths, and dispersed their endings around the white space in an artistic manner.

The flowers/roots present another opportunity for creativity. The only data constraint is the number of cases in a cluster. The positions of the dots, and the shape of the lines leading to the dots are part of the playground.

What's more, the data visualization is a powerful reminder of the benefits of testing and contact tracing. The band cluster led to the closure of bars, which helped slow the spread of the coronavirus. 

 


A testing mess: one chart, four numbers, four colors, three titles, wrong units, wrong lengths, wrong data

Twitterstan wanted to vote the following infographic off the island:

Tes_Alevelsresults

(The publisher's website is here but I can't find a direct link to this graphic.)

The mishap is particularly galling given the controversy swirling around this year's A-Level results in the U.K. For U.S. readers, you can think of A-Levels as SAT Subject Tests, which in the U.K. are required of all university applicants, and represent the most important, if not the sole, determinant of admissions decisions. Please see the upcoming post on my book blog for coverage of the brouhaha surrounding the statistical adjustments (to be posted sometime this week, it's here.).

The first issue you may notice about the chart is that the bar lengths have no relationship with the numbers printed on them. Here is a scatter plot correlating the bar lengths and the data.

Junkcharts_redo_tes_alevels_scatter


As you can see, nothing.

Then, you may wonder what the numbers mean. The annotation at the bottom right says "Average number of A level qualifications per student". Wow, the British (in this case, English) education system is a genius factory - with the average student mastering close to three thousand subjects in secondary (high) school!

TES is the cool name for what used to be the Times Educational Supplement. I traced the data back to Ofqual, which is the British regulator for these examinations. This is the Ofqual version of the above chart:

Ofqual_threeAstar

The data match. You may see that the header of the data table reads "Number of students in England getting 3 x A*". This is a completely different metric than number of qualifications - in fact, this metric measures geniuses. "A*" is the U.K. equivalent of "A+". When I studied under the British system, there was no such grade. I guess grade inflation is happening all over the world. What used to be A is now A+, and what used to be B is now A. Scoring three A*s is tops - I wonder if this should say 3 or more because I recall that you can take as many subjects as you desire but most students max out at three (may have been four).

The number of students attaining the highest achievement has increased in the last two years compared to the two years before. We can't interpret these data unless we know if the number of students also grew at similar rates.

The units are students while the units we expect from the TES graphic should be subjects. The cutoff for the data defines top students while the TES graphic should connote minimum qualification, i.e. a passing grade.

***
Now, the next section of the Ofqual infographic resolves the mystery. Here is the chart:

Ofqual_Alevelquals

This dataset has the right units and measurement. There is almost no meaningful shift in the last four years. The average number of qualifications per student is only different at the second decimal place. Replacing the original data with this set removes the confusion.

Junkcharts_redo_tes_alevels_correctdata

While I was re-making this chart, I also cleaned out the headers and sub-headers. This is an example of software hegemony: the designer wouldn't have repeated the same information three times on a chart with four numbers if s/he wasn't prompted by software defaults.

***

The corrected chart violates one of the conventions I described in my tutorial for DataJournalism.com: color difference should reflect data difference.

In the following side-by-side comparison, you see that the use of multiple colors on the left chart signals different data - note especially the top and bottom bars which carry the same number, but our expectation is frustrated.

Junkcharts_redo_tes_alevels_sidebyside

***

[P.S. 8/25/2020. Dan V. pointed out another problem with these bar charts: the bars were truncated so that the bar lengths are not proportional to the data. The corrected chart is shown on the right below:

Junkcharts_redo_tes_alevels_barlengths

8/26/2020: added link to the related post on my book blog.]


This chart shows why the PR agency for the UK government deserves a Covid-19 bonus

The Economist illustrated some interesting consumer research with this chart (link):

Economist_covidpoll

The survey by Dalia Research asked people about the satisfaction with their country's response to the coronavirus crisis. The results are reduced to the "Top 2 Boxes", the proportion of people who rated their government response as "very well" or "somewhat well".

This dimension is laid out along the horizontal axis. The chart is a combo dot and bubble chart, arranged in rows by region of the world. Now what does the bubble size indicate?

It took me a while to find the legend as I was expecting it either in the header or the footer of the graphic. A larger bubble depicts a higher cumulative number of deaths up to June 15, 2020.

The key issue is the correlation between a country's death count and the people's evaluation of the government response.

Bivariate correlation is typically shown on a scatter plot. The following chart sets out the scatter plots in a small multiples format with each panel displaying a region of the world.

Redo_economistcovidpolling_scatter

The death tolls in the Asian countries are low relative to the other regions, and yet the people's ratings vary widely. In particular, the Japanese people are pretty hard on their government.

In Europe, the people of Greece, Netherlands and Germany think highly of their government responses, which have suppressed deaths. The French, Spaniards and Italians are understandably unhappy. The British appears to be the most forgiving of their government, despite suffering a higher death toll than France, Spain or Italy. This speaks well of their PR operation.

Cumulative deaths should be adjusted by population size for a proper comparison across nations. When the same graphic is produced using deaths per million (shown on the right below), the general story is preserved while the pattern is clarified:

Redo_economistcovidpolling_deathspermillion_2

The right chart shows deaths per million while the left chart shows total deaths.

***

In the original Economist chart, what catches our attention first is the bubble size. Eventually, we notice the horizontal positioning of these bubbles. But the star of this chart ought to be the new survey data. I swapped those variables and obtained the following graphic:

Redo_economistcovidpolling_swappedvar

Instead of using bubble size, I switched to using color to illustrate the deaths-per-million metric. If ratings of the pandemic response correlate tightly with deaths per million, then we expect the color of these dots to evolve from blue on the left side to red on the right side.

The peculiar loss of correlation in the U.K. stands out. Their PR firm deserves a bonus!


More visuals of the economic crisis

As we move into the next phase of the dataviz bonanza arising from the coronavirus pandemic, we will see a shift from simple descriptive graphics of infections and deaths to bivariate explanatory graphics claiming (usually spurious) correlations.

The FT is leading the way with this effort, and I hope all those who follow will make a note of several wise decisions they made.

  • They source their data. Most of the data about business activities come from private entities, many of which are data vendors who make money selling the data. In this article, FT got restaurant data from OpenTable, retail foot traffic data from Springboard, box office data from Box Office Mojo, flight data from Flightradar24, road traffic data from TomTom, and energy use data from European Network of Transmission System Operators for Electricity.
  • They generally let the data and charts speak without "story time". The text primarily describes the trends of the various data series.
  • They selected sectors that are obviously impacted by the shutdowns so any link between the observed trends and the crisis is plausible.

The FT charts are examples of clarity. Here is the one about road traffic patterns in major cities:

Ft_roadusage_corona_wrongsource

The cities are organized into regions: Europe, US, China, other Asia.

The key comparison is the last seven days versus the historical averages. The stories practically jump out of the page. Traffic in Paris collapsed on Tuesday. Wuhan is still locked down despite the falloff in infections. Drivers of Tokyo are probably wondering why teams are not going to the Olympics this year. Londoners? My guess is they're determined to not let another Brexit deadline slip.

***

I'd hope we go even further than FT when publishing this type of visual analytics involving "Big Data." These business data obtained from private sources typically have OCCAM properties: they are observational, seemingly complete, uncontrolled, adapted and merged. All these properties make the data very challenging to interpret.

The coronavirus case and death counts are simple by comparison. People are now aware of all the problems from differential rates of testing to which groups are selectively tested (i.e. triage) to how an infection or death is defined. The problems involving Big Data are much more complex.

I have three additional proposals:

Disclosure of Biases and Limitations

The private data have many more potential pitfalls. Take OpenTable data for example. The data measure restaurant bookings, not revenues. It measures gross bookings, not net bookings (i.e. removing no-shows). Only a proportion of restaurants use OpenTable (which cost owners money). OpenTable does not strike me as a quasi-monopoly so there are competitors with significant market share. The restaurants that use OpenTable do not form a random subsample of all restaurants. One of the most popular restaurants in the U.S. are pizza joints, with little of no seating, which do not feature in the bookings data. OpenTable also has differential popularity by country, region, or probably cuisine. 

I believe data journalists ought to provide such context in a footnote. Readers should have the information to judge whether they believe the data are sufficiently representative. Private data vendors who want data journalists to feature their datasets should be required to supply a footnote that describes the biases and limitations of their data.

Data journalists should think seriously about how they headline this type of chart. The standard practice is what FT adopted. The headline said "Restaurant bookings have collapsed" with a small footnote saying "Source: OpenTable". Should the headline have said "OpenTable bookings have collapsed" instead?

Disclosure of Definitions and Proxies

In the road traffic chart shown above, the metric is called "TomTom traffic congestion index". In order to earn this free advertising (euphemistically called "earned media" by industry), TomTom should be obliged to explain how this index is constructed. What does index = 100 mean?

[For example, it is curious that the Madrid index values are much lower across the board than those in Paris and Roma.]

For the electric usage chart, FT discloses the name of the data provider as a group of "43 electricity transmission system operators in 36 countries across Europe." Now, that is important context but can be better. The group may consist of 43 operators but how many of them are in the dataset? What proportion of the total electric usage do they account for in each country? If they have low penetration in a particular country, do they just report the low statistics or adjust the numbres?

If the journalist decides to use a proxy, for example, OpenTable restaurant bookings to reflect restaurant revenues, that should be explained, perhaps even justified.

Data as a Public Good

If private businesses choose to supply data to media outlets as a public service, they should allow the underlying data to be published.

Speaking from experience, I've seen a lot of bad data. It's one thing to hold your nose when the data are analyzed to make online advertising more profitable, or to find signals to profit from the stock market. It's another thing for the data analysis to drive public policy, in this case, policies that will have life-or-death implications.


All these charts lament the high prices charged by U.S. hospitals

Nyt_medicalprocedureprices

A former student asked me about this chart from the New York Times that highlights much higher prices of hospital procedures in the U.S. relative to a comparison group of seven countries.

The dot plot is clearly thought through. It is not a default chart that pops out of software.

Based on its design, we surmise that the designer has the following intentions:

  1. The names of the medical procedures are printed to be read, thus the long text is placed horizontally.

  2. The actual price is not as important as the relative price, expressed as an index with the U.S. price at 100%. These reference values are printed in glaring red, unignorable.

  3. Notwithstanding the above point, the actual price is still of secondary importance, and the values are provided as a supplement to the row labels. Getting to the actual prices in the comparison countries requires further effort, and a calculator.

  4. The primary comparison is between the U.S. and the rest of the world (or the group of seven countries included). It is less important to distinguish specific countries in the comparison group, and thus the non-U.S. dots are given pastels that take some effort to differentiate.

  5. Probably due to reader feedback, the font size is subject to a minimum so that some labels are split into two lines to prevent the text from dominating the plotting region.

***

In the Trifecta Checkup view of the world, there is no single best design. The best design depends on the intended message and what’s in the available data.

To illustate this, I will present a few variants of the above design, and discuss how these alternative designs reflect the designer's intentions.

Note that in all my charts, I expressed the relative price in terms of discounts, which is the mirror image of premiums. Instead of saying Country A's price is 80% of the U.S. price, I prefer to say Country A's price is a 20% saving (or discount) off the U.S. price.

First up is the following chart that emphasizes countries instead of hospital procedures:

Redo_medicalprice_hor_dot

This chart encourages readers to draw conclusions such as "Hospital prices are 60-80 percent cheaper in Holland relative to the U.S." But it is more taxing to compare the cost of a specific procedure across countries.

The indexing strategy already creates a barrier to understanding relative costs of a specific procedure. For example, the value for angioplasty in Australia is about 55% and in Switzerland, about 75%. The difference 75%-55% is meaningless because both numbers are relative savings from the U.S. baseline. Comparing Australia and Switzerland requires a ratio (0.75/0.55 = 1.36): Australia's prices are 36% above Swiss prices, or alternatively, Swiss prices are a 64% 26% discount off Australia's prices.

The following design takes it even further, excluding details of individual procedures:

Redo_medicalprice_hor_bar

For some readers, less is more. It’s even easier to get a rough estimate of how much cheaper prices are in the comparison countries, for now, except for two “outliers”, the chart does not display individual values.

The widths of these bars reveal that in some countries, the amount of savings depends on the specific procedures.

The bar design releases the designer from a horizontal orientation. The country labels are shorter and can be placed at the bottom in a vertical design:

Redo_medicalprice_vert_bar

It's not that one design is obviously superior to the others. Each version does some things better. A good designer recognizes the strengths and weaknesses of each design, and selects one to fulfil his/her intentions.

 

P.S. [1/3/20] Corrected a computation, explained in Ken's comment.


This Wimbledon beauty will be ageless

Ft_wimbledonage


This Financial Times chart paints the picture of the emerging trend in Wimbledon men’s tennis: the average age of players has been rising, and hits 30 years old for the first time ever in 2019.

The chart works brilliantly. Let's look at the design decisions that contributed to its success.

The chart contains a good amount of data and the presentation is carefully layered, with the layers nicely tied to some visual cues.

Readers are drawn immediately to the average line, which conveys the key statistical finding. The blue dot  reinforces the key message, aided by the dotted line drawn at 30 years old. The single data label that shows a number also highlights the message.

Next, readers may notice the large font that is applied to selected players. This device draws attention to the human stories behind the dry data. Knowledgable fans may recall fondly when Borg, Becker and Chang burst onto the scene as teenagers.

 

Then, readers may pick up on the ticker-tape data that display the spread of ages of Wimbledon players in any given year. There is some shading involved, not clearly explained, but we surmise that it illustrates the range of ages of most of the contestants. In a sense, the range of probable ages and the average age tell the same story. The current trend of rising ages began around 2005.

 

Finally, a key data processing decision is disclosed in chart header and sub-header. The chart only plots the players who reached the fourth round (16). Like most decisions involved in data analysis, this choice has both desirable and undesirable effects. I like it because it thins out the data. The chart would have appeared more cluttered otherwise, in a negative way.

The removal of players eliminated in the early rounds limits the conclusion that one can draw from the chart. We are tempted to generalize the finding, saying that the average men’s player has increased in age – that was what I said in the first paragraph. Thinking about that for a second, I am not so sure the general statement is valid.

The overall field might have gone younger or not grown older, even as the older players assert their presence in the tournament. (This article provides side evidence that the conjecture might be true: the author looked at the average age of players in the top 100 ATP ranking versus top 1000, and learned that the average age of the top 1000 has barely shifted while the top 100 players have definitely grown older.)

So kudos to these reporters for writing a careful headline that stays true to the analysis.

I also found this video at FT that discussed the chart.

***

This chart about Wimbledon players hits the Trifecta. It has an interesting – to some, surprising – message (Q). It demonstrates thoughtful processing and analysis of the data (D). And the visual design fits well with its intended message (V). (For a comprehensive guide to the Trifecta Checkup, see here.)


Putting the house in order, two Brexit polls

Reader Steve M. noticed an oversight in the Guardian in the following bar chart (link):

Guardian_Brexitpoll_1

The reporter was discussing an important story that speaks to the need for careful polling design. He was comparing two polls, one by Ipsos Mori, and one by YouGov, that estimates the vote support for each party in the future U.K. general election. The bottom line is that the YouGov poll predicts about double the support for the Brexit Party than the Ipsos-Mori poll.

The stacked bar chart should only be used for data that can be added up. Here, we should be comparing the numbers side by side:

Redo_junkcharts_brexitpoll_1

I've always found this standard display inadequate. The story here is the gap in the two bar lengths for the Brexit Party. A secondary story is that the support for the Brexit Party might come from voters breaking from Labour. In other words, we really want the reader to see:

Redo_junkcharts_brexitpoll_1b

Switching to a dot plot helps bring attention to the gaps:

Redo_junkcharts_brexitpoll_2

Now, putting the house in order:

Redo_junkcharts_brexitpoll_2b

Why do these two polls show such different results? As the reporter explained, the answer is in how the question was asked. The Ipsos-Mori is unprompted, meaning the Brexit Party was not announced to the respondent as one of the choices while the YouGov is prompted.

This last version imposes a direction on the gaps to bring out the secondary message - that the support for Brexit might be coming from voters breaking from Labour.

Redo_junkcharts_brexitpoll_2c