The why axis

A few weeks ago, I replied to a tweet by someone who was angered by the amount of bad graphics about coronavirus. I take a glass-half-full viewpoint: it's actually heart-warming for dataviz designers to realize that their graphics are being read! When someone critiques your work, it is proof that they cared enough to look at it. Worse is when you publish something, and no one reacts to it.

That said, I just wasted half an hour trying to get into the head of the person who made the following:

Fox31_co_newcases edited

Longtime reader Chris P. forwarded this tweet to me, and I saw that Andrew Gelman got sent this one, too.

The chart looks harmless until you check out the vertical axis labels. They are... um... most unusual. The best way to interpret what the designer did is to break up the chart into three components. Like this:

Redo_junkcharts_fox31cocases

The big mystery is why the designer spent the time and energy to make this mischief.

The usual suspect is fake news. The clearest sign of malintent is the huge size of the dots. Each dot spans almost the entirety of the space between gridlines.

But there is almost no fake news here. The overall trend line is intact despite the attempted distortion. The following is a superposition of an unmanipulated line (yellow) on top of the manipulated:

Redo_junkcharts_fox31cocases2

***

The next guess is incompetence. The evidence against this view is the amount of energy required to execute these changes. In Excel, it takes a lot of work. It's easier to do this in R or any programming language with which you can design your own axis.

Even for the R coders, the easy part is to replicate the design, but the hard part is to come up with the concept in the first place!

You can't just stumble onto a design like this. So I am not convinced the designer is an idiot.

***

How much work? You have to create three separate charts, with three carefully chosen vertical scales, and then clip, merge, and sew the seam. The weirdest bit is throwing away three of the twelve axis labels and writing in three fake numbers.

Here's the recipe: (if the gif doesn't load automatically, click on it)

Fox31_co_cases_B6
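The stitched axis in the recipe can be mimicked in a few lines. This is only a sketch in Python: the tick labels below are hypothetical, picked to have three different step sizes, and are not read off the original chart. The gridlines are assumed to be equally spaced, so the axis is linear between any two neighboring labels but changes scale from one segment to the next.

```python
from bisect import bisect_right

# Hypothetical tick labels drawn at equally spaced gridlines (assumption:
# illustrative values only, not the labels on the chart being discussed).
TICK_LABELS = [30, 60, 90, 100, 130, 160, 190, 240, 250, 300]


def gridline_position(value, ticks=TICK_LABELS):
    """Return the vertical position of `value`, measured in gridline units.

    Gridline k sits at height k, so the axis is linear between two
    adjacent labels, but the units per gridline can change at every label.
    """
    if not ticks[0] <= value <= ticks[-1]:
        raise ValueError("value outside the axis range")
    if value == ticks[-1]:
        return float(len(ticks) - 1)
    k = bisect_right(ticks, value) - 1  # gridline at or just below value
    lower, upper = ticks[k], ticks[k + 1]
    return k + (value - lower) / (upper - lower)


def units_per_gridline(ticks=TICK_LABELS):
    """List the step between consecutive labels; a normal axis has one step."""
    return [upper - lower for lower, upper in zip(ticks, ticks[1:])]
```

On a regular axis, `units_per_gridline` returns the same number throughout. Here it returns a mix of 10s, 30s and 50s, which is the telltale sign that several scales have been sewn together.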

Help me readers! I'm stumped. Why oh why did someone make this? What is the point?

 

 


Make your color legend better with one simple rule

The pie chart about COVID-19 worries illustrates why we should follow a basic rule of constructing color legends: order the categories in the way you expect readers to encounter them.

Here is the chart that I discussed the other day, with the data removed since they are not of concern in this post. (link)

Junkcharts_abccovidbiggestworries_sufficiency

First look at the pie chart. Like me, you probably looked at the orange or the yellow slice first, then we move clockwise around the pie.

Notice that the legend leads with the red square ("Getting It"), which is likely the last item you'll see on the chart.

This is the same chart with the legend re-ordered:

Redo_junkcharts_abcbiggestcovidworries_legend

***

Simple charts can be made better if we follow basic rules of construction. When used frequently, these rules can be made silent. I cover rules for legends as well as many other rules in this Long Read article titled "The Unspoken Conventions of Data Visualization" (link).


Graphing the economic crisis of coronavirus 2

Last week, I discussed Ray's chart that compares the S&P 500 performance in this crisis against previous crises.

A reminder:

Tcb_stockmarketindices_fourcrises

Another useful feature is the halo around the right edge of the COVID-19 line. This device directs our eyes to where he wants us to look.

In the same series, he made the following for The Conference Board (link):

TCB-COVID-19-impact-oil-prices-640

Two things I learned from this chart:

The oil market takes a much longer time to recover after crises, compared to the S&P. None of these lines reached above 100 in the first 150 days (5 months).

Just like the S&P, the current crisis is most similar in severity to the 2008 Great Recession, only worse, and currently, the price collapse in oil is quite a bit worse than in 2008.

***
The drop of oil is going to be contentious. This is a drop too many for a Tufte purist. It might as well symbolize a tear shed.

The presence of the icon tells me these lines depict the oil market without having to read text. And I approve.


When the visual runs away from the data

The pressure of the coronavirus news cycle has gotten the better of some graphics designers. Via Twitter, Mark B sent me the following chart:

Junkcharts_abccovidbiggestworries_sufficiency

I applied the self-sufficiency test to this pie chart. That's why you can't see the data which were also printed on the chart.

The idea of self-sufficiency is to test how much work the visual elements of the graphic are doing to convey its message. Look at the above chart, and guess what the three values are.

Roughly speaking, all three answers are equally popular, with perhaps a little less than a third of respondents indicating "Getting It" as their biggest COVID-19 worry.

If measured, the slices represent 38%, 35% and 27%.

Now, here is the same chart with the data:

Abc_covidbiggestworries

Each number is way off! In addition, the three numbers sum to 178%.

Trifectacheckup_junkcharts_image

This is an example of the Visual being at odds with the Data, using a Trifecta Checkup analysis. (Read about the Trifecta here.)

What the Visual is saying is not the same as what the data are saying. So the green arrow between D and V is broken.

***

This is a rather common mistake. This survey question apparently allows each respondent to select more than one answer. Whenever more than one response is accepted, one cannot use a pie chart.
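A small check makes the rule concrete. The sketch below, in Python, uses invented answer labels and invented percentages (they are not the survey's numbers). It tests whether the shares form parts of one whole, and shows what a charting tool would draw if handed multi-select data anyway.

```python
def pie_chart_is_valid(shares, tolerance=1.0):
    """A pie needs parts of one whole: shares (in percent) must sum to ~100."""
    return abs(sum(shares.values()) - 100.0) <= tolerance


def slices_drawn(shares):
    """What gets drawn when the shares are plotted as a pie regardless:
    every value is silently rescaled by the grand total."""
    total = sum(shares.values())
    return {name: round(100.0 * value / total, 1) for name, value in shares.items()}


# Hypothetical multi-select results: percent of respondents picking each answer.
# Labels and values are assumptions made up for this example.
survey = {"A": 60.0, "B": 45.0, "C": 45.0}
```

With these made-up numbers, `pie_chart_is_valid(survey)` is false because the shares sum to 150, and `slices_drawn(survey)` gives slices of 40, 30 and 30 percent, none of which matches the printed value it sits next to.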

Here is a stacked bar chart that does right by the data.

Redo_junkcharts_abcbiggestcovidworries

 


The epidemic of simple comparisons

Another day, another Twitter user sent a sloppy chart featured on TV news. This CNN graphic comes from Hugo K. by way of Kevin T.

And it's another opportunity to apply the self-sufficiency test.

Junkcharts_cnncovidcases_sufficiency_1

Like before, I removed the data printed on the graphic. In reading this chart, we would like to know the number of U.S. reported cases of coronavirus relative to China, and Italy relative to the U.S.

So, our eyes trace these invisible lines:

Junkcharts_cnncovidcases_sufficiency_2

U.S. cases are roughly two-thirds of China while Italian cases are 90% of U.S.

That's what the visual elements, the columns, are telling us. But it's fake news. Here is the chart with the data:

Cnn_covidcases

The counts of reported cases in all three countries were neck and neck around this time.

What this quick exercise shows is that anyone who correctly reads this chart is reading the data off the chart, and ignoring the contradictory message sent by the relative column heights. Thus, the visual elements are not self-sufficient in conveying the message.

***

In a Trifecta Checkup, I'd be most concerned about the D corner. The naive comparison of these case counts is an epidemic of its own. It sometimes leads to poor decisions that can exacerbate the public-health problems. See this post on my sister blog.

The difference in case counts between different countries (or regions or cities or locales) is not a direct measure of the difference in coronavirus spread in these places! This is because there are many often-unobserved factors that will explain most if not all of the differences.

After a lot of work by epidemiologists, medical researchers, statisticians and the like, we now realize that different places conduct different numbers of tests. No test, no positive. The U.S. has been slow to get testing ramped up.

Less understood is the effect of testing selection. Consider the U.S. where it is still hard to get tested. Only those who meet a list of criteria are eligible. Imagine an alternative reality in which the U.S. conducted the same number of tests but, instead of selecting the people most likely to be infected, tested a random sample of people. The incidence of the virus in a random sample is much lower than in the severely infected. Therefore, in this new reality, the number of positives would be lower despite equal numbers of tests.
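The arithmetic behind this thought experiment fits in a short Python sketch. The number of tests and both positivity rates are hypothetical assumptions chosen for illustration, not estimates for any country.

```python
def expected_positives(num_tests, positivity_rate):
    """Expected confirmed cases when a given share of tests comes back positive."""
    if not 0.0 <= positivity_rate <= 1.0:
        raise ValueError("positivity_rate must be between 0 and 1")
    return num_tests * positivity_rate


# All three numbers are hypothetical assumptions, not estimates.
NUM_TESTS = 10_000
RATE_TARGETED = 0.20  # positives among people who meet the testing criteria
RATE_RANDOM = 0.01    # positives in a random sample of the population

targeted_cases = expected_positives(NUM_TESTS, RATE_TARGETED)
random_cases = expected_positives(NUM_TESTS, RATE_RANDOM)
```

Same number of tests, same underlying epidemic, yet the reported count differs by a factor of 20 under these assumptions, purely because of who got tested.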

That's for an equal number of tests. If test kits are readily available, then a targeted (triage) testing strategy will under-count cases since mild cases or asymptomatic infections escape attention. (See my Wired column for problems with triage.)

To complicate things even more, in most countries, the number of tests and the testing selection have changed over time so a cumulative count statistic obscures those differences.

Besides testing, there are a host of other factors that affect reported case counts. These are less talked about now but eventually will be.

Different places have different population densities. A lot of cases in a big city and an equal number of cases in a small town do not signify equal severity. Clearly, the situation in the latter is more serious.

Because the virus affects age groups differently, a direct comparison of the case counts without adjusting for age is also misleading. The number of deaths of 80-year-olds in a college town is low not because the chance of dying from COVID-19 is lower there than in a retirement community; it's low because 80-year-olds are a small proportion of the population.

Next, the cumulative counts ignore which stage of the "epi curve" these countries are at. The following chart can replace most of the charts you're inundated with by the media:

Epicurve_coronavirus

(I found the chart here.)

An epi curve traces the time line of a disease outbreak. Every location is expected to move through stages, with cases reaching a peak and eventually the number of newly recovered will exceed the number of newly infected.

Notice that China, Italy and the US occupy different stages of this curve. It's proper to compare the U.S. to China and Italy when they were at a similar early phase of their respective epi curves.

In addition, any cross-location comparison should account for how reliable the data sources are, and the different definitions of a "case" in different locations.

***

Finally, let's consider the Question posed by the graphic designer. It is the morbid question: which country is hit the worst by coronavirus?

This is a Type DV chart. It's got a reasonable question, but the data require a lot more work to adjust for the list of biases. The visual design is hampered by the common mistake of not starting columns at zero.
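How much a non-zero baseline distorts a comparison is easy to quantify. In the Python sketch below, the two case counts and the axis start are hypothetical values, not the figures from the CNN graphic.

```python
def apparent_ratio(value_a, value_b, axis_start=0.0):
    """Ratio of the column heights of a and b when columns begin at axis_start."""
    if min(value_a, value_b) <= axis_start:
        raise ValueError("both values must exceed the axis start")
    return (value_a - axis_start) / (value_b - axis_start)


# Hypothetical case counts and baseline (assumptions for illustration only).
country_x, country_y = 82_000, 81_000
true_ratio = apparent_ratio(country_x, country_y)                    # about 1.01
distorted = apparent_ratio(country_x, country_y, axis_start=78_000)  # about 1.33
```

A gap of about 1 percent in the data becomes a gap of 33 percent in column height once the columns start at 78,000 instead of zero.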

 


Graphing the economic crisis of Covid-19

My friend Ray Vella at The Conference Board has a few charts up on their coronavirus website. TCB is a trusted advisor and consultant to large businesses and thus is a good place to learn how the business community is thinking about this crisis.

I particularly like the following chart:

Tcb_stockmarketindices_fourcrises

This puts the turmoil in the stock market in perspective. We are roughly tracking the decline of the Great Recession of the late 2000s. It's interesting that 9/11 caused very mild gyrations in the S&P index compared to any of the other events. 

The chart uses an index with value 100 at Day 0. Day 0 is defined by the trigger event for each crisis. About three weeks into the current crisis, the S&P has lost over 30% of its value.
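Rebasing to 100 at Day 0 is a one-line calculation. A minimal Python sketch follows; the closing levels are hypothetical and are not actual S&P 500 data.

```python
def rebase_to_100(levels):
    """Rescale a series so that its first value (Day 0) equals 100."""
    if not levels or levels[0] <= 0:
        raise ValueError("need a positive Day 0 level")
    day0 = levels[0]
    return [100.0 * level / day0 for level in levels]


# Hypothetical closing levels starting at Day 0 (not actual index data).
closes = [3000.0, 2850.0, 2400.0, 2100.0]
indexed = rebase_to_100(closes)
loss_pct = 100.0 - indexed[-1]  # percent of value lost since Day 0
```

Because every crisis starts at 100, the indexed value can be read directly as "percent of Day 0 value remaining", which is what makes the four lines comparable.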

The device of a gray background for the bottom half of the chart is surprisingly effective.

***

Here is a chart showing the impact of the Covid-19 crisis on different sectors.

Tcb-COVID-19-manual-services-1170

So the full-service restaurant industry is a huge employer. Restaurants employ 7-8 times more people than airlines. Airlines employ about the same numbers of people as "beverage bars" (which I suppose is the same as "bars" which apparently is different from "drinking places"). Bars employ 7 times more people than "Cafeterias, etc.".

The chart describes where the jobs are, and which sectors they believe will be most impacted. It's not clear yet how deeply these will be impacted. Being in NYC, the complete shutdown is going to impact 100% of these jobs in certain sectors like bars, restaurants and coffee shops.


Proportions and rates: we are no dupes

Reader Lucia G. sent me this chart, from Ars Technica's FAQ about the coronavirus:

Arstechnica_covid-19-2.001-1280x960

She notices something wrong with the axis.

The designer took the advice not to make a dual axis, but didn't realize that the two metrics are not measured on the same scale even though both are expressed as percentages.

The blue bars, labeled "cases", show the distribution of cases by age group. The sum of the blue bars should be 100 percent.

The orange bars show fatality rates by age group. Each orange bar's rate is based on the number of cases in that age group. The sum of the orange bars will not add to 100 percent.

In general, the rates will have much lower values than the proportions. At least that should be the case for viruses that are not extremely fatal.
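The difference between the two metrics comes down to the denominator. In this Python sketch, the counts by age group are invented for illustration and are not the data behind the chart.

```python
# Hypothetical counts by age group: (cases, deaths). Invented for illustration.
counts = {
    "0-39": (400, 1),
    "40-59": (350, 4),
    "60-79": (200, 12),
    "80+": (50, 8),
}

total_cases = sum(cases for cases, _ in counts.values())

# Proportion: each group's share of ALL cases. One shared denominator.
case_share = {group: 100.0 * cases / total_cases
              for group, (cases, _) in counts.items()}

# Rate: deaths divided by cases WITHIN the group. A new denominator each time.
fatality_rate = {group: 100.0 * deaths / cases
                 for group, (cases, deaths) in counts.items()}
```

The shares add up to 100 by construction. The rates add up to a number with no meaning at all, so plotting both against one percentage axis invites a comparison that the data do not support.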

This is what the 80 and over section looks like.

Screen Shot 2020-03-12 at 1.19.46 AM

It is true that fatality rate (orange) is particularly high for the elderly while this age group accounts for less than 5 percent of total cases (blue). However, the cases that are fatal, which inhabit the orange bar, must be a subset of the total cases for 80 and over, which are shown in the blue bar. Conceptually, the orange bar should be contained inside the blue bar. So, it's counter-intuitive that the blue bar is so much shorter than the orange bar.

The following chart fixes this issue. It reveals the structure of the data: total cases are separated by age group, then within each age group, a proportion of the cases are fatal.

Junkcharts_redo_arstechnicacovid19

This chart also shows that most patients recover in every age group. (This is only approximately true as some of the cases may not have been discharged yet.)

***

This confusion of rates and proportions reminds me of something about exit polls I just wrote about the other day on the sister blog.

When the media make statements about trends in voter turnout rate in the primary elections, e.g. when they assert that youth turnout has not increased, their evidence is from exit polls, which can measure only the distribution of voters by age group. Exit polls do not and cannot measure the turnout rate, which is the proportion of registered (or eligible) voters in the specific age group who voted.

Like the coronavirus data, the scales of these two metrics are different even though they are both percentages: the turnout rate is typically a number between 30 and 70 percent, and summing the rates across all age groups will exceed 100 percent many times over. Summing the proportions of voters across all age groups should be 100 percent, and no more.

Changes in the proportion of voters aged 18-29 and changes in the turnout rate of people aged 18-29 are not the same thing. The former is affected by the turnout of all age groups while the latter is a clean metric affected only by 18-to-29-year-olds.
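A toy example shows how the two can diverge. In the Python sketch below, the sizes of the two groups and their turnout rates are hypothetical.

```python
def voter_share(turnout_rate, eligible, group):
    """Percent of all voters who belong to `group` (what exit polls measure)."""
    voters = {g: eligible[g] * turnout_rate[g] / 100.0 for g in eligible}
    return 100.0 * voters[group] / sum(voters.values())


# Hypothetical electorate (millions of eligible voters) and turnout rates (percent).
eligible = {"18-29": 50, "30+": 150}
before = {"18-29": 20.0, "30+": 40.0}
after = {"18-29": 30.0, "30+": 60.0}  # every group's turnout rate rises by half

share_before = voter_share(before, eligible, "18-29")
share_after = voter_share(after, eligible, "18-29")
```

Youth turnout rises from 20 to 30 percent, yet the youth share of voters stays at about 14 percent in both scenarios because the older group's turnout rose in step. An exit poll would show no change.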

Basically, ignore pundits who use exit polls to comment on turnout trends. No matter how many times they repeat their nonsense, proportions and rates are not to be confused. Which means, ignore comments on turnout trends because the only data they've got come from exit polls which don't measure rates.

 

P.S. Here is some further explanation of my chart, as a response to a question from Enrico B. on Twitter.

The chart can be thought of as two distributions, one for cases (gray) and one for deaths (red). Like this:

Junkcharts_redo_arstechnicacoronavirus_2

The side-by-side version removes the direct visualization of the fatality rate within each age group. To understand fatality rate requires someone to do math in their head. Readers can qualitatively assess that for the 80 and over, they accounted for 3 percent of cases but also about 21 percent of deaths. People aged 70 to 79 however accounted for 9 percent of cases but 30 percent of deaths, etc.

What I did was to scale the distribution of deaths so that they can be compared to the cases. It's like fitting the red distribution inside the gray distribution. Within each age group, the proportion of red against the length of the bar is the fatality rate.

For every 100 cases regardless of age, 3 cases are for people aged 80 and over within which 0.5 are fatal (red).

So, the axis labels are correct. The values are proportions of total cases, although as the designer of the chart, I hope people are paying attention more to the proportion of red, as opposed to the units.

What might strike people as odd is that the biggest red bar does not appear against 80 and above. We might believe it's deadlier the older you are. That's because on an absolute scale, more people aged 70-79 died than those 80 and above. The absolute number of deaths is the product of the proportion of cases and the fatality rate. That's really a different story from the usual plot of fatality rates by age group. In those charts, we "control" for the prevalence of cases. If every age group were infected at the same frequency, then COVID-19 would kill more of those 80 and over.
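The figures quoted in this P.S. are enough to reproduce the scaling. This Python sketch uses only those rounded numbers (3 and 9 percent of cases, 21 and 30 percent of deaths, 0.5 deaths per 100 cases for the 80 and over group), so the results are approximate.

```python
# Rounded figures quoted in the text (percent of all cases, percent of all deaths).
case_share = {"70-79": 9.0, "80+": 3.0}
death_share = {"70-79": 30.0, "80+": 21.0}
DEATHS_PER_100_CASES_80_PLUS = 0.5  # the red segment for 80 and over

# Implied overall deaths per 100 cases: 0.5 makes up 21 percent of all deaths.
overall_deaths_per_100 = DEATHS_PER_100_CASES_80_PLUS / (death_share["80+"] / 100.0)


def red_segment(group):
    """Deaths in this age group per 100 total cases (length of the red part)."""
    return overall_deaths_per_100 * death_share[group] / 100.0


def fatality_rate(group):
    """Share of the group's bar that is red, in percent."""
    return 100.0 * red_segment(group) / case_share[group]
```

The 70-79 group has the longer red segment (about 0.7 versus 0.5 per 100 cases) even though its fatality rate is lower (about 8 percent versus about 17 percent), which is the oddity described above.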

 

 

 


Comparing chance of death of coronavirus and flu

The COVID-19 charts are proving one thing. When the topic of a dataviz is timely and impactful, readers will study the graphics and ask questions. I've been sent some of these charts lately, and will be featuring them here.

A former student saw this chart from Business Insider (link) and didn't like it.

Businesinsider_coronavirus_flu_compare

My initial reaction was generally positive. It's clear the chart addresses a comparison between death rates of the flu and COVID19, an important current question. The side-by-side panel is effective at allowing such a comparison. The column charts look decent, and there aren't excessive gridlines.

Sure, one sees a few simple design fixes, like removing the vertical axis altogether (since the entire dataset has already been printed). I'd also un-slant the age labels.

***

I'd like to discuss some subtler improvements.

A primary challenge is dealing with the different definitions of age groups across the two datasets. While the side-by-side column charts prompt readers to go left-right, right-left in comparing death rates, it's not easy to identify which column to compare to which. This is not fixable in the datasets because the organizations that compile them define their own age groups.

Also, I prefer to superimpose the death rates on the same chart, using something like a dot plot rather than a column chart. This makes the comparison even easier.

Here is a revised visualization:

Redo_businessinsider_covid19fatalitybyage

The contents of this chart raise several challenges to public health officials. Clearly, hospital resources should be preferentially offered to older patients. But young people could be spreading the virus among the community.

Caution is advised as the data for COVID19 suffers from many types of inaccuracies, as outlined here.


It's impossible to understand Super Tuesday, this chart says

Twitter people are talking about this chart, from NPR (link):

Npr_delegates

This was published on Wednesday after Super Tuesday, the day on which multiple states held their primary elections. On the Democratic side, something like a third of the delegates were up for grabs (although as the data below this chart shows, a big chunk of the delegates, mostly from California and Texas, have yet to be assigned to a candidate as they were still counting votes.)

Here, I hovered over the Biden line, trying to decipher the secret code in these lines:

Npr_supertuesday_biden

I have to say I failed. Biden won 6 delegates on Feb 3, 9 on Feb 22, 39 on Feb 29, and 512 on Mar 3. I have no idea how those numbers led to this line!
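For reference, here is what a cumulative line would be expected to pass through, computed in Python from the four daily counts quoted above. That the chart is meant to show running totals is an assumption.

```python
from itertools import accumulate

# Delegates won by Biden on each contest day, as quoted above.
won = {"Feb 3": 6, "Feb 22": 9, "Feb 29": 39, "Mar 3": 512}

# Running total after each contest day.
running_total = dict(zip(won, accumulate(won.values())))
```

The running totals are 6, 15, 54 and 566, so a cumulative line should stay nearly flat through February and then jump sharply on March 3.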

***

Here is what happened so far in the Democratic primary:

Junkcharts_redo_nprsupertuesday_sm

The key tradeoff the designer has to make here is the relative importance of the timeline and the total count. In this chart, it's easiest to compare the total count across candidates as of Wednesday morning, then to see how each candidate accumulates the delegates over the first five contest days. It takes a little more effort to see who's ahead after each contest day. And it is almost impossible to see the spacing of the contest days over the calendar.

I don't use stacked bar charts often but this chart form makes clear the cumulative counts over time so it's appropriate here.

Also, the as-yet-unassigned delegates are a big part of the story and need to be visualized.

 

P.S. See comment below. There was a bug in the code and they fixed the line chart.

Npr_supertuesday_2

So, some of the undecided delegates have been awarded and comparing the two charts, it appears that the gap went down from 105 to 76. Still over 150 delegates not assigned.

 


Whither the youth vote

The youth turnout is something that politicians and pundits bring up constantly when talking about the current U.S. presidential primaries. So I decided to look for the data. I found some data at the United States Election Project, a site maintained by Dr. Michael McDonald. The key chart is this one:

Electproject_voterturnoutbyage

This is classic Excel.

***

Here is a quick fix:

Redo_electprojects_voterturnout

The key to the fix is to recognize the structure of the data.

The sawtooth pattern displayed in the original chart does not convey any real trends - it's an artifact that many people only turn out for presidential elections. (As a result, the turnout during presidential election years is driven by the general election turnout.)

The age groups have an order so instead of four different colors, use a progressive color scheme. This is one of the unspoken rules about color usage in data visualization, featured in my Long Read article.
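A progressive scheme can be generated by blending from a light shade to a dark one. The Python sketch below does a plain linear blend in RGB, which is crude but enough to show the idea. The two endpoint colors are arbitrary choices, and the two middle age-group labels are assumed (only 18-29 and 60+ are named in this post).

```python
def sequential_palette(light, dark, n):
    """Return n hex colors stepping evenly from `light` to `dark`.

    Colors are (r, g, b) tuples with 0-255 channels.
    """
    if n < 2:
        raise ValueError("need at least two colors")
    palette = []
    for i in range(n):
        t = i / (n - 1)  # 0.0 for the first color, 1.0 for the last
        rgb = tuple(round(a + t * (b - a)) for a, b in zip(light, dark))
        palette.append("#{:02x}{:02x}{:02x}".format(*rgb))
    return palette


# Only "18-29" and "60+" appear in the post; the middle labels are assumptions.
AGE_GROUPS = ["18-29", "30-44", "45-59", "60+"]
colors = dict(zip(AGE_GROUPS, sequential_palette((198, 219, 239), (8, 48, 107), 4)))
```

Each older group gets a darker shade of the same hue, so the reader can rank the lines without consulting the legend.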

***

What do I learn from this turnout by age group chart?

Younger voters are much more invested in presidential elections than off-year elections. The youth turnout for presidential elections is double that for other years.

Participation increased markedly in the 2018 mid-term elections across all four age groups, reflecting the passion for or against President Donald Trump. This was highly unusual - and in fact, the turnout for that off-year is closer to the turnout of a presidential year election. Whether the turnout will stay at this elevated level is a big question for 2022!

For presidential elections, turnout has been creeping up over time for all age groups. But the increase in 2016 (Hillary Clinton vs Donald Trump) was mild. The growth in participation is more noticeable in the younger age groups, including in 2016.

Let's look at the relative jumps in 2018 (right side of the left chart). The younger the age group, the larger the jump. Turnout in the 18-29 group doubled to 32 percent. Turnout in the oldest age group increased by 20%, nothing to sneeze at but less impressive than in the younger age groups.

Why this is the case should be obvious. The 60+ age group has a ceiling. It's already at 60-70%; how much higher can it go? People at that age have many years to develop their preference for voting in elections. It would be hard to convince the holdouts (hideouts?) to vote.

The younger age groups are further from the ceiling. If you're an organizer, will you focus your energy on the 60% non-voting 18-to-29-year-olds, or the 30% non-voting 60+ year-olds? [This is the same question any business faces: do you win incremental sales from your more loyal customers, hoping they would spend even more, or your less loyal customers?]
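The 2018 jumps quoted earlier are relative changes, which are different from changes in percentage points. A short Python sketch keeps the two apart. The youth figures follow from "doubled to 32 percent"; the starting value for the 60+ group is a hypothetical 55 percent, since the post gives only the 20% relative increase.

```python
def relative_change_pct(before, after):
    """Change expressed as a percent of the starting value."""
    return 100.0 * (after - before) / before


def point_change(before, after):
    """Change expressed in percentage points."""
    return after - before


# 18-29: turnout "doubled to 32 percent", so the starting value was 16.
youth = (16.0, 32.0)
# 60+: hypothetical starting value of 55 percent, raised by the quoted 20%.
oldest = (55.0, 66.0)
```

Under these assumptions the youth jump is 100 percent in relative terms and 16 points in absolute terms, against 20 percent and 11 points for the oldest group. The ceiling compresses the relative change far more than the point change.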

For Democratic candidates, the loss in 2016 is hanging over them. Getting the same people to vote in 2020 as in 2016 is a losing hand. So, they need to expand the base somehow.

If you're a candidate like Joe Biden who relies on the 60+ year old bloc, it's hard to see where you can expand the base. Your advantage is that the core voter bloc is reliable. Your problem is that you don't have appeal to the younger age groups. So a viable path to winning in the general election has to involve flipping older Trump voters. The incremental ex-Trump voters have to offset the potential loss in turnout from younger voters.

If you're a candidate like Bernie Sanders who relies on the youth vote, you'd want to launch a get-out-the-vote effort aimed at younger voters. A viable path can be created by expanding the base through lifting the turnout rate of younger voters. The incremental young voters have to offset the fraction of the 60+ year old bloc who flip to Trump.