Unlocking the secrets of a marvellous data visualization

Scmp_coronavirushk_paperThe graphics team in my hometown paper SCMP has developed a formidable reputation in data visualization, and I lapped every drop of goodness on this beautiful graphic showing how the coronavirus spread around Hong Kong (in the first wave in April). Marcelo uploaded an image of the printed version to his Twitter. This graphic occupied the entire back page of that day's paper.

An online version of the chart is found here.

The data graphic is a masterclass in organizing data. While it looks complicated, I had no problem unpacking the different layers.

Cases were divided into imported cases (people returning to Hong Kong) and local cases. A small number of cases are considered in-betweens.

Scmp_coronavirushk_middle

The two major classes then occupy one half page each. I first looked at the top half, where my attention is drawn to the thickest flows. The majority of imported cases arrived from the U.K., and most of those were returning students. The U.S. is the next largest source of imported cases. The flows are carefully ordered by continent, with the Americas on the left, followed by Europe, Middle East, Africa, and Asia.

Junkcharts_scmpcoronavirushk_americas1

Where there are interesting back stories, the flow blossoms into a flower. An annotation explains the cluster of cases. Each anther represents a case. Eight people caught the virus while touring Bolivia together.

Junkcharts_scmpcoronavirushk_bolivia

One reads the local cases in the same way. Instead of flowers, think of roots. The biggest cluster by far was a band that played at clubs in three different parts of the city, infecting a total of 72 people.

Junkcharts_scmpcoronavirushk_localband

Everything is understood immediately, without a need to read text or refer to legends. The visual elements carry that kind of power.

***

This data graphic presents a perfect amalgam of art and science. For a flow chart, the data are encoded in the relative thickness of the lines. This leaves two unused dimensions of these lines: the curvature and lengths. The order of the countries and regions take up the horizontal axis, but the vertical axis is free. Unshackled from the data, the designer introduced curves into the lines, varied their lengths, and dispersed their endings around the white space in an artistic manner.

The flowers/roots present another opportunity for creativity. The only data constraint is the number of cases in a cluster. The positions of the dots, and the shape of the lines leading to the dots are part of the playground.

What's more, the data visualization is a powerful reminder of the benefits of testing and contact tracing. The band cluster led to the closure of bars, which helped slow the spread of the coronavirus. 

 


Designs of two variables: map, dot plot, line chart, table

The New York Times found evidence that the richest segments of New Yorkers, presumably those with second or multiple homes, have exited the Big Apple during the early months of the pandemic. The article (link) is amply assisted by a variety of data graphics.

The first few charts represent different attempts to express the headline message. Their appearance in the same article allows us to assess the relative merits of different chart forms.

First up is the always-popular map.

Nytimes_newyorkersleft_overallmap

The advantage of a map is its ease of comprehension. We can immediately see which neighborhoods experienced the greater exoduses. Clearly, Manhattan has cleared out a lot more than outer boroughs.

The limitation of the map is also in view. With the color gradient dedicated to the proportions of residents gone on May 1st, there isn't room to express which neighborhoods are richer. We have to rely on outside knowledge to make the correlation ourselves.

The second attempt is a dot plot.

Nytimes_newyorksleft_percentathome

We may have to take a moment to digest the horizontal axis. It's not time moving left to right but income percentiles. The poorest neighborhoods are to the left and the richest to the right. I'm assuming that these percentiles describe the distribution of median incomes in neighborhoods. Typically, when we see income percentiles, they are based on households, regardless of neighborhoods. (The former are equal-sized segments, unlike the latter.)

This data graphic has the reverse features of the map. It does a great job correlating the drop in proportion of residents at home with the income distribution but it does not convey any spatial information. The message is clear: The residents in the top 10% of New York neighborhoods are much more likely to have left town.

In the following chart, I attempted a different labeling of both axes. It cuts out the need for readers to reverse being home to not being home, and 90th percentile to top 10%.

Redo_nyt_newyorkerslefttown

The third attempt to convey the income--exit relationship is the most successful in my mind. This is a line chart, with time on the horizontal axis.

Nyt_newyorkersleft_percenthomebyincome

The addition of lines relegates the dots to the background. The lines show the trend more clearly. If directly translated from the dot plot, this line chart should have 100 lines, one for each percentile. However, the closeness of the top two lines suggests that no meaningful difference in behavior exists between the 20th and 80th percentiles. This can be conveyed to readers through a short note. Instead of displaying all 100 percentiles, the line chart selectively includes only the 99th , 95th, 90th, 80th and 20th percentiles. This is a design choice that adds by subtraction.

Along the time axis, the line chart provides more granularity than either the map or the dot plot. The exit occurred roughly over the last two weeks of March and the first week of April. The start coincided with New York's stay-at-home advisory.

This third chart is a statistical graphic. It does not bring out the raw data but features aggregated and smoothed data designed to reveal a key message.

I encourage you to also study the annotated table later in the article. It shows the power of a well-designed table.

[P.S. 6/4/2020. On the book blog, I have just published a post about the underlying surveillance data for this type of analysis.]

 

 


Consumption patterns during the pandemic

The impact of Covid-19 on the economy is sharp and sudden, which makes for some dramatic data visualization. I enjoy reading the set of charts showing consumer spending in different categories in the U.S., courtesy of Visual Capitalist.

The designer did a nice job cleaning up the data and building a sequential story line. The spending are grouped by categories such as restaurants and travel, and then sub-categories such as fast food and fine dining.

Spending is presented as year-on-year change, smoothed.

Here is the chart for the General Commerce category:

Visualcapitalist_spending_generalcommerce

The visual design is clean and efficient. Even too sparse because one has to keep returning to the top to decipher the key events labelled 1, 2, 3, 4. Also, to find out that the percentages express year-on-year change, the reader must scroll to the bottom, and locate a footnote.

As you move down the page, you will surely make a stop at the Food Delivery category, noting that the routine is broken.

Visualcapitalist_spending_fooddelivery

I've featured this device - an element of surprise - before. Remember this Quartz chart that depicts drinking around the world (link).

The rule for small multiples is to keep the visual design identical but vary the data from chart to chart. Here, the exceptional data force the vertical axis to extend tremendously.

This chart contains a slight oversight - the red line should be labeled "Takeout" because food delivery is the label for the larger category.

Another surprise is in store for us in the Travel category.

Visualcapitalist_spending_travel

I kept staring at the Cruise line, and how it kept dipping below -100 percent. That seems impossible mathematically - unless these cardholders are receiving more refunds than are making new bookings. Not only must the entire sum of 2019 bookings be wiped out, but the records must also show credits issued to these credit (or debit) cards. It's curious that the same situation did not befall the airlines. I think many readers would have liked to see some text discussing this pattern.

***

Now, let me put on a data analyst's hat, and describe some thoughts that raced through my head as I read these charts.

Data analysis is hard, especially if you want to convey the meaning of the data.

The charts clearly illustrate the trends but what do the data reveal? The designer adds commentary on each chart. But most of these comments count as "story time." They contain speculation on what might be causing the trend but there isn't additional data or analyses to support the storyline. In the General Commerce category, the 50 to 100 percent jump in all subcategories around late March is attributed to people stockpiling "non-perishable food, hand sanitizer, and toilet paper". That might be true but this interpretation isn't supported by credit or debit card data because those companies do not have details about what consumers purchased, only the total amount charged to the cards. It's a lot more work to solidify these conclusions.

A lot of data do not mean complete or unbiased data.

The data platform provided data on 5 million consumers. We don't know if these 5 million consumers are representative of the 300+ million people in the U.S. Some basic demographic or geographic analysis can help establish the validity. Strictly speaking, I think they have data on 5 million card accounts, not unique individuals. Most Americans use more than one credit or debit cards. It's not likely the data vendor have a full picture of an individual's or a family's spending.

It's also unclear how much of consumer spending is captured in this dataset. Credit and debit cards are only one form of payment.

Data quality tends to get worse.

One thing that drives data analyst nuts. The spending categories are becoming blurrier. In the last decade or so, big business has come to dominate the American economy. Big business, with bipartisan support, has grown by (a) absorbing little guys, and (b) eliminating boundaries between industry sectors. Around me, there is a Walgreens, several Duane Reades, and a RiteAid. They currently have the same owner, and increasingly offer the same selection. In the meantime, Walmart (big box), CVS (pharmacy), Costco (wholesale), etc. all won regulatory relief to carry groceries, fresh foods, toiletries, etc. So, while CVS or Walgreens is classified as a pharmacy, it's not clear that what proportion of the spending there is for medicines. As big business grows, these categories become less and less meaningful.


More visuals of the economic crisis

As we move into the next phase of the dataviz bonanza arising from the coronavirus pandemic, we will see a shift from simple descriptive graphics of infections and deaths to bivariate explanatory graphics claiming (usually spurious) correlations.

The FT is leading the way with this effort, and I hope all those who follow will make a note of several wise decisions they made.

  • They source their data. Most of the data about business activities come from private entities, many of which are data vendors who make money selling the data. In this article, FT got restaurant data from OpenTable, retail foot traffic data from Springboard, box office data from Box Office Mojo, flight data from Flightradar24, road traffic data from TomTom, and energy use data from European Network of Transmission System Operators for Electricity.
  • They generally let the data and charts speak without "story time". The text primarily describes the trends of the various data series.
  • They selected sectors that are obviously impacted by the shutdowns so any link between the observed trends and the crisis is plausible.

The FT charts are examples of clarity. Here is the one about road traffic patterns in major cities:

Ft_roadusage_corona_wrongsource

The cities are organized into regions: Europe, US, China, other Asia.

The key comparison is the last seven days versus the historical averages. The stories practically jump out of the page. Traffic in Paris collapsed on Tuesday. Wuhan is still locked down despite the falloff in infections. Drivers of Tokyo are probably wondering why teams are not going to the Olympics this year. Londoners? My guess is they're determined to not let another Brexit deadline slip.

***

I'd hope we go even further than FT when publishing this type of visual analytics involving "Big Data." These business data obtained from private sources typically have OCCAM properties: they are observational, seemingly complete, uncontrolled, adapted and merged. All these properties make the data very challenging to interpret.

The coronavirus case and death counts are simple by comparison. People are now aware of all the problems from differential rates of testing to which groups are selectively tested (i.e. triage) to how an infection or death is defined. The problems involving Big Data are much more complex.

I have three additional proposals:

Disclosure of Biases and Limitations

The private data have many more potential pitfalls. Take OpenTable data for example. The data measure restaurant bookings, not revenues. It measures gross bookings, not net bookings (i.e. removing no-shows). Only a proportion of restaurants use OpenTable (which cost owners money). OpenTable does not strike me as a quasi-monopoly so there are competitors with significant market share. The restaurants that use OpenTable do not form a random subsample of all restaurants. One of the most popular restaurants in the U.S. are pizza joints, with little of no seating, which do not feature in the bookings data. OpenTable also has differential popularity by country, region, or probably cuisine. 

I believe data journalists ought to provide such context in a footnote. Readers should have the information to judge whether they believe the data are sufficiently representative. Private data vendors who want data journalists to feature their datasets should be required to supply a footnote that describes the biases and limitations of their data.

Data journalists should think seriously about how they headline this type of chart. The standard practice is what FT adopted. The headline said "Restaurant bookings have collapsed" with a small footnote saying "Source: OpenTable". Should the headline have said "OpenTable bookings have collapsed" instead?

Disclosure of Definitions and Proxies

In the road traffic chart shown above, the metric is called "TomTom traffic congestion index". In order to earn this free advertising (euphemistically called "earned media" by industry), TomTom should be obliged to explain how this index is constructed. What does index = 100 mean?

[For example, it is curious that the Madrid index values are much lower across the board than those in Paris and Roma.]

For the electric usage chart, FT discloses the name of the data provider as a group of "43 electricity transmission system operators in 36 countries across Europe." Now, that is important context but can be better. The group may consist of 43 operators but how many of them are in the dataset? What proportion of the total electric usage do they account for in each country? If they have low penetration in a particular country, do they just report the low statistics or adjust the numbres?

If the journalist decides to use a proxy, for example, OpenTable restaurant bookings to reflect restaurant revenues, that should be explained, perhaps even justified.

Data as a Public Good

If private businesses choose to supply data to media outlets as a public service, they should allow the underlying data to be published.

Speaking from experience, I've seen a lot of bad data. It's one thing to hold your nose when the data are analyzed to make online advertising more profitable, or to find signals to profit from the stock market. It's another thing for the data analysis to drive public policy, in this case, policies that will have life-or-death implications.


Revisiting global car sales

We looked at the following chart in the previous blog. The data concern the growth rates of car sales in different regions of the world over time.

Cnbc zh global car sales

Here is a different visualization of the same data.

Redo_cnbc_globalcarsales

Well, it's not quite the same data. I divided the global average growth rate by four to yield an approximation of the true global average. (The reason for this is explained in the other day's post.)

The chart emphasizes how each region was helping or hurting the global growth. It also features the trend in growth within each region.

 


This Excel chart looks standard but gets everything wrong

The following CNBC chart (link) shows the trend of global car sales by region (or so we think).

Cnbc zh global car sales

This type of chart is quite common in finance/business circles, and has the fingerprint of Excel. After examining it, I nominate it for the Hall of Shame.

***

The chart has three major components vying for our attention: (1) the stacked columns, (2) the yellow line, and (3) the big red dashed arrow.

The easiest to interpret is the yellow line, which is labeled "Total" in the legend. It displays the annual growth rate of car sales around the globe. The data consist of annual percentage changes in car sales, so the slope of the yellow line represents a change of change, which is not particularly useful.

The big red arrow is making the point that the projected decline in global car sales in 2019 will return the world to the slowdown of 2008-9 after almost a decade of growth.

The stacked columns appear to provide a breakdown of the global growth rate by region. Looked at carefully, you'll soon learn that the visual form has hopelessly mangled the data.

Cnbc_globalcarsales_2006

What is the growth rate for Chinese car sales in 2006? Is it 2.5%, the top edge of China's part of the column? Between 1.5% and 2.5%, the extant of China's section? The answer is neither. Because of the stacking, China's growth rate is actually the height of the relevant section, that is to say, 1 percent. So the labels on the vertical axis are not directly useful to learning regional growth rates for most sections of the chart.

Can we read the vertical axis as global growth rate? That's not proper either. The different markets are not equal in size so growth rates cannot be aggregated by simple summing - they must be weighted by relative size.

The negative growth rates present another problem. Even if we agree to sum growth rates ignoring relative market sizes, we still can't get directly to the global growth rate. We would have to take the total of the positive rates and subtract the total of the negative rates.  

***

At this point, you may begin to question everything you thought you knew about this chart. Remember the yellow line, which we thought measures the global growth rate. Take a look at the 2006 column again.

The global growth rate is depicted as 2 percent. And yet every region experienced growth rates below 2 percent! No matter how you aggregate the regions, it's not possible for the world average to be larger than the value of each region.

For 2006, the regional growth rates are: China, 1%; Rest of the World, 1%; Western Europe, 0.1%; United States, -0.25%. A simple sum of those four rates yields 2%, which is shown on the yellow line.

But this number must be divided by four. If we give the four regions equal weight, each is worth a quarter of the total. So the overall average is the sum of each growth rate weighted by 1/4, which is 0.5%. [In reality, the weights of each region should be scaled to reflect its market size.]

***

tldr; The stacked column chart with a line overlay not only fails to communicate the contents of the car sales data but it also leads to misinterpretation.

I discussed several serious problems of this chart form: 

  • stacking the columns make it hard to learn the regional data

  • the trend by region takes a super effort to decipher

  • column stacking promotes reading meaning into the height of the column but the total height is meaningless (because of the negative section) while the net height (positive minus negative) also misleads due to presumptive equal weighting

  • the yellow line shows the sum of the regional data, which is four times the global growth rate that it purports to represent

 

***

PS. [12/4/2019: New post up with a different visualization.]


Say it thrice: a nice example of layering and story-telling

I enjoyed the New York Times's data viz showing how actively the Democratic candidates were criss-crossing the nation in the month of March (link).

It is a great example of layering the presentation, starting with an eye-catching map at the most aggregate level. The designers looped through the same dataset three times.

Nyt_candidatemap_1

This compact display packs quite a lot. We can easily identify which were the most popular states; and which candidate visited which states the most.

I noticed how they handled the legend. There is no explicit legend. The candidate names are spread around the map. The size legend is also missing, replaced by a short sentence explaining that size encodes the number of cities visited within the state. For a chart like this, having a precise size legend isn't that useful.

The next section presents the same data in a small-multiples layout. The heads are replaced by dots.

Nyt_candidatemap_2

This allows more precise comparison of one candidate to another, and one location to another.

This display has one shortcoming. If you compare the left two maps above, those for Amy Klobuchar and Beto O'Rourke, it looks like they have visited roughly similar number of cities when in fact Beto went to 42 compared to 25. Reducing the size of the dots might work.

Then, in the third visualization of the same data, the time dimension is emphasized. Lines are used to animate the daily movements of the candidates, one by one.

Nyt_candidatemap_3

Click here to see the animation.

When repetition is done right, it doesn't feel like repetition.

 


How to describe really small chances

Reader Aleksander B. sent me to the following chart in the Daily Mail, with the note that "the usage of area/bubble chart in combination with bar alignment is not very useful." (link)

Dailymail-image-a-35_1431545452562

One can't argue with that statement. This chart fails the self-sufficiency test: anyone reading the chart is reading the data printed on the right column, and does not gain anything from the visual elements (thus, the visual representation is not self-sufficient). As a quick check, the size of the risk for "motorcycle" should be about 30 times larger than that of "car"; the size of the risk for "car" should be 100 times larger than that of "airplane". The risk of riding motorcycles then is roughly 3,000 times that of flying in an airplane. 

The chart does not appear to be sized properly as a bubble chart:

Dailymail_travelrisk_bubble

You'll notice that the visible proportion of the "car" bubble is much larger than that of the "motorcycle" bubble, which is one part of the problem.

Nor is it sized as a bar chart:

Dailymail_travelrisk_bar

As a bar chart, both the widths and the heights of the bars vary; and the last row presents a further challenge as the bubble for the airplane does not touch the baseline.

***

Besides the Visual, the Data issues are also quite hard. This is how Aleksander describes it: "as a reader I don't want to calculate all my travel distances and then do more math to compare different ways of traveling."

The reader wants to make smarter decisions about travel based on the data provided here. Aleksandr proposes one such problem:

In terms of probability it is also easier to understand: "I am sitting in my car in strong traffic. At the end in 1 hour I will make only 10 miles so what's the probability that I will die? Is it higher or lower than 1 hour in Amtrak train?"

The underlying choice is between driving and taking Amtrak for a particular trip. This comparison is relevant because those two modes of transport are substitutes for this trip. 

One Data issue with the chart is that riding a motorcycle and flying in a plane are rarely substitutes. 

***

A way out is to do the math on behalf of your reader. The metric of deaths per 1 billion passenger-miles is not intuitive for a casual reader. A more relevant question is what's the chance of dying from the time I spend per year of driving (or riding a plane). Because the chance will be very tiny, it is easier to express the risk as the number of years of travel before I expect to see one death.

Let's assume someone drives 300 days per year, and 100 miles per day so that each year, this driver contributes 30,000 passenger-miles to the U.S. total (which is 3.2 trillion). We convert 7.3 deaths per 1 billion passenger-miles to 1 death per 137 million passenger-miles. Since this driver does 30K per year, it will take (137 million / 30K) = about 4,500 years to see one death on average. This calculation assumes that the driver drives alone. It's straightforward to adjust the estimate if the average occupancy is higher than 1. 

Now, let's consider someone who flies once a month (one outbound trip plus one return trip). We assume that each plane takes on average 100 passengers (including our protagonist), and each trip covers on average 1,000 miles. Then each of these flights contributes 100,000 passenger-miles. In a year, the 24 trips contribute 2.4 million passenger-miles. The risk of flying is listed at 0.07 deaths per 1 billion, which we convert to 1 death per 14 billion passenger-miles. On this flight schedule, it will take (14 billion / 2.4 million) = almost 6,000 years to see one death on average.

For the average person on those travel schedules, there is nothing to worry about. 

***

Comparing driving and flying is only valid for those trips in which you have a choice. So a proper comparison requires breaking down the average risks into components (e.g. focusing on shorter trips). 

The above calculation also suggests that the risk is not evenly spread out throughout the population, despite the use of an overall average. A trucker who is on the road every work day is clearly subject to higher risk than an occasional driver who makes a few trips on rental cars each year.

There is a further important point to note about flight risk, due to MIT professor Arnold Barnett. He has long criticized the use of deaths per billion passenger-miles as a risk metric for flights. (In Chapter 5 of Numbers Rule Your World (link), I explain some of Arnie's research on flight risk.) The problem is that almost all fatal crashes involving planes happen soon after take-off or not long before landing. 

 


Is the visual serving the question?

The following chart concerns California's bullet train project.

California_bullettrain

Now, look at the bubble chart at the bottom. Here it is - with all the data except the first number removed:

Highspeedtrains_sufficiency

It is impossible to know how fast the four other train systems run after I removed the numbers. The only way a reader can comprehend this chart is to read the data inside the bubbles. This chart fails the "self-sufficiency test". The self-sufficiency test asks how much work the visual elements on the chart are doing to communicate the data; in this case, the bubbles do nothing at all.

Another problem: this chart buries its lede. The message is in the caption: how California's bullet train rates against other fast train systems. California's train speed of 220 mph is only mentioned in the text but not found in the visual.

Here is a chart that draws attention to the key message:

Redo_highspeedtrains

In a Trifecta checkup, we improved this chart by bringing the visual in sync with the central question of the chart.


Transforming the data to fit the message

A short time ago, there were reports that some theme-park goers were not happy about the latest price hike by Disney. One of these report, from the Washington Post (link), showed a chart that was intended to convey how much Disney park prices have outpaced inflation. Here is the chart:

Wapo_magickingdom_price_changes

I had a lot of trouble processing this chart. The two lines are labeled "original price" and "in 2014 dollars". The lines show a gap back in the 1970s, which completely closes up by 2014. This gives the reader an impression that the problem has melted away - which is the opposite of the designer intended.

The economic concept being marshalled here is the time value of money, or inflation. The idea is that $3.50 in 1971 is equivalent to a much higher ticket price in "2014 dollars" because by virtue of inflation, putting that $3.50 in the bank in 1971 and holding till 2014 would make that sum "nominally" higher. In fact, according to the chart, the $3.50 would have become $20.46, an approx. 7-fold increase.

The gap thus represents the inflation factor. The gap melting away is a result of passing of time. The closer one is to the present, the less the effect of cumulative inflation. The story being visualized is that Disney prices are increasing quickly whether or not one takes inflation into account. Further, if inflation were to be considered, the rate of increase is lower (red line).

What about the alternative story - Disney's price increases are often much higher than inflation? We can take the nominal price increase, and divide it into two parts, one due to inflation (of the prior-period price), and the other in excess of inflation, which we will interpret as a Disney premium.

The following chart then illustrates this point of view:

Redo_disneypricehikes

Most increases are small, and stay close to the inflation rate. But once in a while, and especially in 2010s, the price increases have outpaced inflation by a lot.

Note: since 2013, Disney has introduced price tiers, starting with two and currently at four levels. In the above chart, I took the average of the available prices, making the assumption that all four levels are equally popular. The last number looks like a price decrease because there is a new tier called "Low". The data came from AllEars.net.