Graphing the economic crisis of Covid-19

My friend Ray Vella at The Conference Board has a few charts up on their coronavirus website. TCB is a trusted advisor and consultant to large businesses and thus is a good place to learn how the business community is thinking about this crisis.

I particularly like the following chart:

Tcb_stockmarketindices_fourcrises

This puts the turmoil in the stock market in perspective. We are roughly tracking the decline of the Great Recession of the late 2000s. It's interesting that 9/11 caused very mild gyrations in the S&P index compared to any of the other events. 

The chart uses an index with value 100 at Day 0. Day 0 is defined by the trigger event for each crisis. About three weeks into the current crisis, the S&P has lost over 30% of its value.

The device of a gray background for the bottom half of the chart is surprisingly effective.

***

Here is a chart showing the impact of the Covid-19 crisis on different sectors.

Tcb-COVID-19-manual-services-1170

So the full-service restaurant industry is a huge employer. Restaurants employ 7-8 times more people than airlines. Airlines employ about the same numbers of people as "beverage bars" (which I suppose is the same as "bars" which apparently is different from "drinking places"). Bars employ 7 times more people than "Cafeterias, etc.".

The chart describes where the jobs are, and which sectors they believe will be most impacted. It's not clear yet how deeply these will be impacted. Being in NYC, the complete shutdown is going to impact 100% of these jobs in certain sectors like bars, restaurants and coffee shops.


It's impossible to understand Super Tuesday, this chart says

Twitter people are talking about this chart, from NPR (link):

Npr_delegates

This was published on Wednesday after Super Tuesday, the day on which multiple states held their primary elections. On the Democratic side, something like a third of the delegates were up for grabs (although as the data below this chart shows, a big chunk of the delegates, mostly from California and Texas, have yet to be assigned to a candidate as they were still counting votes.)

Here, I hovered over the Biden line, trying to decipher the secret code in these lines:

Npr_supertuesday_biden

I have to say I failed. Biden won 6 delegates on Feb 3, 9 on Feb 22, 39 on Feb 29, and 512 on Mar 3. I have no idea how those numbers led to this line!

***

Here is what happened so far in the Democratic primary:

Junkcharts_redo_nprsupertuesday_sm

The key tradeoff the designer has to make here is the relative importance of the timeline and the total count. In this chart, it's easiest to compare the total count across candidates as of the Wednesday morning, then to see how each candidate accumulates the delegates over the first five contest days. It takes a little more effort to see who's ahead after each contest day. And it is almost impossible to see the spacing of the contest days over the calendar.

I don't use stacked bar charts often but this chart form makes clear the cumulative counts over time so it's appropriate here.

Also, the as-yet-unassigned delegates is a big part of the story and needs to be visualized.

 

P.S. See comment below. There was a bug in the code and they fixed the line chart.

Npr_supertuesday_2

So, some of the undecided delegates have been awarded and comparing the two charts, it appears that the gap went down from 105 to 76. Still over 150 delegates not assigned.

 


Whither the youth vote

The youth turnout is something that politicians and pundits bring up constantly when talking about the current U.S. presidential primaries. So I decided to look for the data. I found some data at the United States Election Project, a site maintained by Dr. Michael McDonald. The key chart is this one:

Electproject_voterturnoutbyage

This is classic Excel.

***

Here is a quick fix:

Redo_electprojects_voterturnout

The key to the fix is to recognize the structure of the data.

The sawtooth pattern displayed in the original chart does not convey any real trends - it's an artifact that many people only turn out for presidential elections. (As a result, the turnout during presidential election years is driven by the general election turnout.)

The age groups have an order so instead of four different colors, use a progressive color scheme. This is one of the unspoken rules about color usage in data visualization, featured in my Long Read article.

***

What do I learn from this turnout by age group chart?

Younger voters are much more invested in presidential elections than off-year elections. The youth turnout for presidential elections is double that for other years.

Participation increased markedly in the 2018 mid-term elections across all four age groups, reflecting the passion for or against President Donald Trump. This was highly unusual - and in fact, the turnout for that off-year is closer to the turnout of a presidential year election. Whether the turnout will stay at this elevated level is a big question for 2022!

For presidential elections, turnout has been creeping up over time for all age groups. But the increase in 2016 (Hillary Clinton vs Donald Trump) was mild. The growth in participation is more noticable in the younger age groups, including in 2016.

Let's look at the relative jumps in 2018 (right side of the left chart). The younger the age group, the larger the jump. Turnout in the 18-29 group doubled to 32 percent. Turnout in the oldest age group increased by 20%, nothing to sneeze at but less impressive than in the younger age groups.

Why this is the case should be obvious. The 60+ age group has a ceiling. It's already at 60-70%; how much higher can it go? People at that age have many years to develop their preference for voting in elections. It would be hard to convince the holdouts (hideouts?) to vote.

The younger age groups are further from the ceiling. If you're an organizer, will you focus your energy on the 60% non-voting 18-29-years-old, or the 30% non-voting 60+ years-old? [This is the same question any business faces: do you win incremental sales from your more loyal customers, hoping they would spend even more, or your less loyal customers?]

For Democratic candidates, the loss in 2016 is hanging over them. Getting the same people to vote in 2020 as in 2016 is a losing hand. So, they need to expand the base somehow.

If you're a candidate like Joe Biden who relies on the 60+ year old bloc, it's hard to see where he can expand the base. Your advantage is that the core voter bloc is reliable. Your problem is that you don't have appeal to the younger age groups. So a viable path to winning in the general election has to involve flipping older Trump voters. The incremental ex-Trump voters have to offset the potential loss in turnout from younger voters.

If you're a candidate like Bernie Sanders who relies on the youth vote, you'd want to launch a get-out-the-vote effort aimed at younger voters. A viable path can be created by expanding the base through lifting the turnout rate of younger voters. The incremental young voters have to offset the fraction of the 60+ year old bloc who flip to Trump.

 

 

 

 

 

 


This Excel chart looks standard but gets everything wrong

The following CNBC chart (link) shows the trend of global car sales by region (or so we think).

Cnbc zh global car sales

This type of chart is quite common in finance/business circles, and has the fingerprint of Excel. After examining it, I nominate it for the Hall of Shame.

***

The chart has three major components vying for our attention: (1) the stacked columns, (2) the yellow line, and (3) the big red dashed arrow.

The easiest to interpret is the yellow line, which is labeled "Total" in the legend. It displays the annual growth rate of car sales around the globe. The data consist of annual percentage changes in car sales, so the slope of the yellow line represents a change of change, which is not particularly useful.

The big red arrow is making the point that the projected decline in global car sales in 2019 will return the world to the slowdown of 2008-9 after almost a decade of growth.

The stacked columns appear to provide a breakdown of the global growth rate by region. Looked at carefully, you'll soon learn that the visual form has hopelessly mangled the data.

Cnbc_globalcarsales_2006

What is the growth rate for Chinese car sales in 2006? Is it 2.5%, the top edge of China's part of the column? Between 1.5% and 2.5%, the extant of China's section? The answer is neither. Because of the stacking, China's growth rate is actually the height of the relevant section, that is to say, 1 percent. So the labels on the vertical axis are not directly useful to learning regional growth rates for most sections of the chart.

Can we read the vertical axis as global growth rate? That's not proper either. The different markets are not equal in size so growth rates cannot be aggregated by simple summing - they must be weighted by relative size.

The negative growth rates present another problem. Even if we agree to sum growth rates ignoring relative market sizes, we still can't get directly to the global growth rate. We would have to take the total of the positive rates and subtract the total of the negative rates.  

***

At this point, you may begin to question everything you thought you knew about this chart. Remember the yellow line, which we thought measures the global growth rate. Take a look at the 2006 column again.

The global growth rate is depicted as 2 percent. And yet every region experienced growth rates below 2 percent! No matter how you aggregate the regions, it's not possible for the world average to be larger than the value of each region.

For 2006, the regional growth rates are: China, 1%; Rest of the World, 1%; Western Europe, 0.1%; United States, -0.25%. A simple sum of those four rates yields 2%, which is shown on the yellow line.

But this number must be divided by four. If we give the four regions equal weight, each is worth a quarter of the total. So the overall average is the sum of each growth rate weighted by 1/4, which is 0.5%. [In reality, the weights of each region should be scaled to reflect its market size.]

***

tldr; The stacked column chart with a line overlay not only fails to communicate the contents of the car sales data but it also leads to misinterpretation.

I discussed several serious problems of this chart form: 

  • stacking the columns make it hard to learn the regional data

  • the trend by region takes a super effort to decipher

  • column stacking promotes reading meaning into the height of the column but the total height is meaningless (because of the negative section) while the net height (positive minus negative) also misleads due to presumptive equal weighting

  • the yellow line shows the sum of the regional data, which is four times the global growth rate that it purports to represent

 

***

PS. [12/4/2019: New post up with a different visualization.]


This chart tells you how rich is rich - if you can read it

Via twitter, John B. sent me the following YouGov chart (link) that he finds difficult to read:

Yougov_whoisrich

The title is clear enough: the higher your income, the higher you set the bar.

When one then moves from the title to the chart, one gets misdirected. The horizontal axis shows pound values, so the axis naturally maps to "the higher your income". But it doesn't. Those pound values are the "cutoff" values - the line between "rich" and "not rich". Even after one realizes this detail, the axis  presents further challenges: the cutoff values are arbitrary numbers such as "45,001" sterling; and these continuous numbers are treated as discrete categories, with irregular intervals between each category.

There is some very interesting and hard to obtain data sitting behind this chart but the visual form suppresses them. The best way to understand this dataset is to first think about each income group. Say, people who make between 20 to 30 thousand sterling a year. Roughly 10% of these people think "rich" starts at 25,000. Forty percent of this income group think "rich" start at 40,000.

For each income group, we have data on Z percent think "rich" starts at X. I put all of these data points into a heatmap, like this:

Redo_junkcharts_yougovuk_whoisrich

Technical note: in order to restore the horizontal axis to a continuous scale, you can take the discrete data from the original chart, then fit a smoothed curve through those points, and finally compute the interpolated values for any income level using the smoothing model.

***

There are some concerns about the survey design. It's hard to get enough samples for higher-income people. This is probably why the highest income segment starts at 50,000. But notice that 50,ooo is around the level at which lower-income people consider "rich". So, this survey is primarily about how low-income people perceive "rich" people.

The curve for the highest income group is much straighter and smoother than the other lines - that's because it's really the average of a number of curves (for each 10,000 sterling segment).

 

P.S. The YouGov tweet that publicized the small-multiples chart shown above links to a page that no longer contains the chart. They may have replaced it due to feedback.

 

 


Graph literacy, in a sense

Ben Jones tweeted out this chart, which has an unusual feature:

Malefemaleliteracyrates

What's unusual is that time runs in both directions. Usually, the rule is that time runs left to right (except, of course, in right-to-left cultures). Here, the purple area chart follows that convention while the yellow area chart inverts it.

On the one hand, this is quite cute. Lines meeting in the middle. Converging. I get it.

On the other hand, every time a designer defies conventions, the reader has to recognize it, and to rationalize it.

In this particular graphic, I'm not convinced. There are four numbers only. The trend on either side looks linear so the story is simple. Why complicate it using unusual visual design?

Here is an entirely conventional bumps-like chart that tells the story:

Redo_literacyratebygender

I've done a couple of things here that might be considered controversial.

First, I completely straightened out the lines. I don't see what additional precision is bringing to the chart.

Second, despite having just four numbers, I added the year 1996 and vertical gridlines indicating decades. A Tufte purist will surely object.

***

Related blog post: "The Return on Effort in Data Graphics" (link)


Tennis greats at the top of their game

The following chart of world No. 1 tennis players looks pretty but the payoff of spending time to understand it isn't high enough. The light colors against the tennis net backdrop don't work as intended. The annotation is well done, and it's always neat to tug a legend inside the text.

Tableautennisnumberones

The original is found at Tableau Public (link).

The topic of the analysis appears to be the ages at which tennis players attained world #1 ranking. Here are the male players visualized differently:

Redo_junkcharts_no1tennisplayers

Some players like Jimmy Connors and Federer have second springs after dominating the game in their late twenties. It's relatively rare for players to get to #1 after 30.


Choosing between individuals and aggregates

Friend/reader Thomas B. alerted me to this paper that describes some of the key chart forms used by cancer researchers.

It strikes me that many of the "new" charts plot granular data at the individual level. This heatmap showing gene expressions show one column per patient:

Jnci_genemap

This so-called swimmer plot shows one bar per patient:

Jnci_swimlanes

This spider plot shows the progression of individual patients over time. Key events are marked with symbols.

Jnci_spaghetti

These chart forms are distinguished from other ones that plot aggregated statistics: statistical averages, medians, subgroup averages, and so on.

One obvious limitation of such charts is their lack of scalability. The number of patients, the variability of the metric, and the timing of trends all drive up the amount of messiness.

I am left wondering what Question is being addressed by these plots. If we are concerned about treatment of an individual patient, then showing each line by itself would be clearer. If we are interested in the average trends of patients, then a chart that plots the overall average, or subgroup averages would be more accurate. If the interpretation of the individual's trend requires comparing with similar patients, then showing that individual's line against the subgroup average would be preferred.

When shown these charts of individual lines, readers are tempted to play the statistician - without using appropriate tools! Readers draw aggregate conclusions, performing the aggregation in their heads.

The authors of the paper note: "Spider plots only provide good visual qualitative assessment but do not allow for formal statistical inference." I agree with the second part. The first part is a fallacy - if the visual qualitative assessment is good enough, then no formal inference is necessary! The same argument is often made when people say they don't need advanced analysis because their simple analysis is "directionally accurate". When is something "directionally inaccurate"? How would one know?

Reference: Chia, Gedye, et. al., "Current and Evolving Methods to Visualize Biological Data in Cancer Research", JNCI, 2016, 108(8). (link)

***

Meteoreologists, whom I featured in the previous post, also have their own spider-like chart for hurricanes. They call it a spaghetti map:

Dorian_spaghetti

Compare this to the "cone of uncertainty" map that was featured in the prior post:

AL052019_5day_cone_with_line_and_wind

These two charts build upon the same dataset. The cone map, as we discussed, shows the range of probable paths of the storm center, based on all simulations of all acceptable models for projection. The spaghetti map shows selected individual simulations. Each line is the most likely trajectory of the storm center as predicted by a single simulation from a single model.

The problem is that each predictive model type has its own historical accuracy (known as "skill"), and so the lines embody different levels of importance. Further, it's not immediately clear if all possible lines are drawn so any reader making conclusions of, say, the envelope containing x percent of these lines is likely to be fooled. Eyeballing the "cone" that contains x percent of the lines is not trivial either. We tend to naturally drift toward aggregate statistical conclusions without the benefit of appropriate tools.

Plots of individuals should be used to address the specific problem of assessing individuals.


This Wimbledon beauty will be ageless

Ft_wimbledonage


This Financial Times chart paints the picture of the emerging trend in Wimbledon men’s tennis: the average age of players has been rising, and hits 30 years old for the first time ever in 2019.

The chart works brilliantly. Let's look at the design decisions that contributed to its success.

The chart contains a good amount of data and the presentation is carefully layered, with the layers nicely tied to some visual cues.

Readers are drawn immediately to the average line, which conveys the key statistical finding. The blue dot  reinforces the key message, aided by the dotted line drawn at 30 years old. The single data label that shows a number also highlights the message.

Next, readers may notice the large font that is applied to selected players. This device draws attention to the human stories behind the dry data. Knowledgable fans may recall fondly when Borg, Becker and Chang burst onto the scene as teenagers.

 

Then, readers may pick up on the ticker-tape data that display the spread of ages of Wimbledon players in any given year. There is some shading involved, not clearly explained, but we surmise that it illustrates the range of ages of most of the contestants. In a sense, the range of probable ages and the average age tell the same story. The current trend of rising ages began around 2005.

 

Finally, a key data processing decision is disclosed in chart header and sub-header. The chart only plots the players who reached the fourth round (16). Like most decisions involved in data analysis, this choice has both desirable and undesirable effects. I like it because it thins out the data. The chart would have appeared more cluttered otherwise, in a negative way.

The removal of players eliminated in the early rounds limits the conclusion that one can draw from the chart. We are tempted to generalize the finding, saying that the average men’s player has increased in age – that was what I said in the first paragraph. Thinking about that for a second, I am not so sure the general statement is valid.

The overall field might have gone younger or not grown older, even as the older players assert their presence in the tournament. (This article provides side evidence that the conjecture might be true: the author looked at the average age of players in the top 100 ATP ranking versus top 1000, and learned that the average age of the top 1000 has barely shifted while the top 100 players have definitely grown older.)

So kudos to these reporters for writing a careful headline that stays true to the analysis.

I also found this video at FT that discussed the chart.

***

This chart about Wimbledon players hits the Trifecta. It has an interesting – to some, surprising – message (Q). It demonstrates thoughtful processing and analysis of the data (D). And the visual design fits well with its intended message (V). (For a comprehensive guide to the Trifecta Checkup, see here.)


SCMP's fantastic infographic on Hong Kong protests

In the past month, there have been several large-scale protests in Hong Kong. The largest one featured up to two million residents taking to the streets on June 16 to oppose an extradition act that was working its way through the legislature. If the count was accurate, about 25 percent of the city’s population joined in the protest. Another large demonstration occurred on July 1, the anniversary of Hong Kong’s return to Chinese rule.

South China Morning Post, which can be considered the New York Times of Hong Kong, is well known for its award-winning infographics, and they rose to the occasion with this effort.

This is one of the rare infographics that you’d not regret spending time reading. After reading it, you have learned a few new things about protesting in Hong Kong.

In particular, you’ll learn that the recent demonstrations are part of a larger pattern in which Hong Kong residents express their dissatisfaction with the city’s governing class, frequently accused of acting as puppets of the Chinese state. Under the “one country, two systems” arrangement, the city’s officials occupy an unenviable position of mediating the various contradictions of the two systems.

This bar chart shows the growth in the protest movement. The recent massive protests didn't come out of nowhere. 

Scmp_protestsovertime

This line chart offers a possible explanation for burgeoning protests. Residents’ perceived their freedoms eroding in the last decade.

Scmp_freedomsurvey

If you have seen videos of the protests, you’ll have noticed the peculiar protest costumes. Umbrellas are used to block pepper sprays, for example. The following lovely graphic shows how the costumes have evolved:

Scmp_protestcostume

The scale of these protests captures the imagination. The last part in the infographic places the number of protestors in context, by expressing it in terms of football pitches (as soccer fields are known outside the U.S.) This is a sort of universal measure due to the popularity of football almost everywhere. (Nevertheless, according to Wikipedia, the fields do not have one fixed dimension even though fields used for international matches are standardized to 105 m by 68 m.)

Scmp_protestscale_pitches

This chart could be presented as a bar chart. It’s just that the data have been re-scaled – from counting individuals to counting football pitches-ful of individuals. 

***
Here is the entire infographics.