Finding the hidden information behind nice-looking charts

This chart from Business Insider caught my attention recently. (link)

Bi_householdwealthchart

There are various things they did which I like. The use of color to draw a distinction between the top 3 lines and the line at the bottom - which tells the story that the bottom 50% has been left far behind. Lines being labelled directly is another nice touch. I usually like legends that sit atop the chart; in this case, I'd have just written the income groups into the line labels.

Take a closer look at the legend text, and you'd notice they struggled with describing the income percentiles.

Bi_householdwealth_legend

This is a common problem with this type of data. The top and bottom categories are easy, as it's most natural to say "top x%" and "bottom y%". By doing so, we establish two scales, one running from the top, and the other counting from the bottom - and it's a head scratcher which scale to use for the middle categories.

The designer decided to lose the "top" and "bottom" descriptors, and went with "50-90%" and "90-99%". Effectively, these follow the "bottom" scale. "50-90%" is the bottom 50 to 90 percent, which corresponds to the top 10 to 50 percent. "90-99%" is the bottom 90-99%, which corresponds to the top 1 to 10%. On this chart, since we're lumping the top three income groups, I'd go with "top 1-10%" and "top 10-50%".

***

The Business Insider chart is easy to misread. It appears that the second group from the top is the most well-off, and the wealth of the top group is almost 20 times that of the bottom group. Both of those statements are false. What's confusing us is that the lines represent very different numbers of people. The yellow line is 50% of the population while the "top 1%" line is 1% of the population. To see what's really going on, I look at a chart showing per-capita wealth. (Just divide the data of the yellow line by 50, etc.)

Redo_bihouseholdwealth_legend

For this chart, I switched to a relative scale, using the per-capita wealth of the Bottom 50% as the reference level (100). Also, I applied a 4-period moving average to smooth the line. The data actually show that the top 1% holds much more wealth per capita than all other income segments. Around 2011, the gap between the top 1% and the rest was at its widest - the average person in the top 1% is about 3,000 times wealthier than someone in the bottom 50%.
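For readers who want to reproduce the adjustment, here is a minimal sketch of the two steps described above: dividing each group's wealth by its share of the population, indexing to the Bottom 50%, and smoothing with a 4-period moving average. The wealth figures are hypothetical placeholders, not the data behind the chart, and the function names are my own. I also assume a trailing window, which may differ from the smoothing used in the chart.

```python
def per_capita_index(group_wealth, group_share_pct, reference):
    """Per-capita wealth of each group, indexed so the reference group = 100."""
    per_capita = {g: w / group_share_pct[g] for g, w in group_wealth.items()}
    base = per_capita[reference]
    return {g: 100 * v / base for g, v in per_capita.items()}


def moving_average(values, window=4):
    """Trailing moving average; the first window-1 points are dropped."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical wealth totals for one period (not the actual data).
wealth = {"Top 1%": 30.0, "90-99%": 40.0, "50-90%": 30.0, "Bottom 50%": 2.0}
# Share of the population in each group, in percentage points.
share = {"Top 1%": 1, "90-99%": 9, "50-90%": 40, "Bottom 50%": 50}

index = per_capita_index(wealth, share, reference="Bottom 50%")
smoothed = moving_average([1, 2, 3, 4, 5])
```

With these made-up numbers, the group with the largest total (90-99%) is no longer on top once we divide by group size, which is the same reversal the per-capita chart reveals.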

This chart raises another question. What caused the sharp rise in the late 2000s and the subsequent decline? By 2020, the gap between the top and bottom groups is still double the size of the gap from 20 years ago. We'd need additional analyses and charts to answer this question.

***

If you are familiar with our Trifecta Checkup, the Business Insider chart is a Type D chart. The problem with it is in how the data was analyzed.


Same data + same chart form = same story. Maybe.

We love charts that tell stories.

Some people believe that if they situate the data in the right chart form, the stories reveal themselves.

Some people believe for a given dataset, there exists a best chart form that brings out the story.

An implication of these beliefs is that the story is immutable, given the dataset and the chart form.

If you use the Trifecta Checkup, you already know I don't subscribe to those ideas. That's why the Trifecta has three legs; the third is the question, which is related to the message or the story.

***

I came across the following chart by Statista, illustrating the growth in Covid-19 cases from the start of the pandemic to this month. The underlying data are collected by WHO and cover the entire globe. The data are grouped by regions.

Statista_avgnewcases

The story of this chart appears to be that the world moves in lock step, with each region behaving more or less the same.

If you visit the WHO site, they show a similar chart:

WHO_horizontal_casesbyregion

On this chart, the regions at the bottom of the graph (esp. Southeast Asia in purple) clearly do not follow the same time patterns as Americas (orange) or Europe (green).

What we're witnessing is: same data, same chart form, different stories.

This is a feature, not a bug, of the stacked area chart. The story is driven largely by the order in which the pieces are stacked. In the Statista chart, the largest pieces are placed at the bottom while for WHO, the order is exactly reversed.

(There are minor differences which do not affect my argument. The WHO chart omits the "Other" category which accounts for very little. Also, the Statista chart shows the smoothed data using 7-day averaging.)

In this example, the order chosen by WHO preserves the story while the order chosen by Statista wipes it out.

***

What might be the underlying question of someone who makes this graph? Perhaps it is to identify the relative prevalence of Covid-19 in different regions at different stages of the pandemic.

Emphasis on the word "relative". Instead of plotting absolute number of cases, I consider plotting relative number of cases, that is to say, the proportion of cases in each region at given times.

This leads to a stacked area percentage chart.

Junkcharts_redo_statistawho_covidregional

In this side-by-side view, you see that this form is not affected by flipping the order of the regions. Both charts say the same thing: that there were two waves in Europe and the Americas that dwarfed all other regions.
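The transformation behind a stacked area percentage chart is simple: at each time point, divide each region's count by the total across regions. Here is a minimal sketch with hypothetical counts (not the WHO data); the function name is my own.

```python
def to_shares(counts_by_region):
    """Convert per-period counts into each region's share of the period total."""
    regions = list(counts_by_region)
    n_periods = len(next(iter(counts_by_region.values())))
    shares = {r: [] for r in regions}
    for t in range(n_periods):
        total = sum(counts_by_region[r][t] for r in regions)
        for r in regions:
            # Guard against a period with no cases anywhere.
            shares[r].append(counts_by_region[r][t] / total if total else 0.0)
    return shares


# Hypothetical weekly case counts for three regions.
cases = {
    "Americas": [100, 400, 300],
    "Europe": [50, 500, 200],
    "Southeast Asia": [50, 100, 500],
}
shares = to_shares(cases)
```

Because every period sums to 100%, the top of the stack is a flat line, so the stacking order no longer shapes the overall silhouette.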

 

 


Circular areas offer misleading cues of their underlying data

John M. pointed me on Twitter to this chart about the progress of the U.S. vaccination campaign:

Whgov_proportiongettingvaccinated

This looks like a White House production, retweeted by WHO. John is unhappy about this nested bubble format, which I'll come back to later.

Let's zoom in on what matters:

Whgov_proportiongettingvaccinated_clip

An even bigger problem with this chart is the Q corner in our Trifecta Checkup. What is the question they are trying to address? It would appear to be the proportion of population that has "already received [one or more doses of] vaccine". And the big words tell us the answer is 8 percent.

_junkcharts_trifectacheckup

But is that really the question? Check out the dark blue circle. It is labeled "population that has already received vaccine" and thus we infer this bubble represents 8 percent. Now look at the outer bubble. Its annotation is "new population that received vaccine since January 27, 2021". The only interpretation that makes sense is that 8 percent is not the most current number. If that is the case, why would the headline highlight an older statistic, and not the most up-to-date one?

Perhaps the real question is how fast is the progress in vaccination. Perhaps it took weeks to get to the dark circle and then days to get beyond. In order to improve this data visualization, we must first decide what the question really is.

***

Now let's get to those nested bubbles. The bubble chart is a format that is not "sufficient," by which I mean the visual by itself does not convey the data without the help of aids such as labels. Try to answer the following questions:

Junkcharts_whgov_vaccineprogress_bubblequiz

In my view, if your answer to the last question is anything more than 5 seconds, the dataviz has failed. A successful data visualization should not make readers solve puzzles.

The first two questions depict the confusing nature of concentric circle diagrams. The first data point is coded to the inner circle. Where is the second data point? Is it encoded to the outer circle, or just the outer ring?

In either case, human brains are not trained to compare circular areas. For question 1, the outer circle is 70% larger than the smaller circle. For question 2, the ring is 70% of the area of the dark blue circle. If you're thinking those numbers seem unreasonable, I can tell you that was my first reaction too! So I made the following to convince myself that the calculation was correct:

Junkcharts_whgov_vaccineprogress_bubblequiz_2

Circular areas offer misleading visual cues, and should be used sparingly.
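A quick calculation shows why the numbers feel wrong. Area grows with the square of the radius, so a modest difference in radius hides a large difference in area. The sketch below assumes the outer radius is 30% longer than the inner radius. That ratio is my assumption for illustration, not a measurement taken from the chart.

```python
import math


def circle_area(radius):
    """Area of a circle with the given radius."""
    return math.pi * radius ** 2


inner_r = 1.0
outer_r = 1.3  # assumed: outer radius 30% longer than the inner radius

# Question 1: how much larger is the outer circle than the inner circle?
outer_vs_inner = circle_area(outer_r) / circle_area(inner_r)

# Question 2: how large is the ring alone, relative to the inner circle?
ring_vs_inner = (circle_area(outer_r) - circle_area(inner_r)) / circle_area(inner_r)
```

Under that assumption, the outer circle has about 1.69 times the area of the inner circle, and the ring is about 69% of the inner circle, even though the radius looks only slightly longer.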

[P.S. 2/10/2021. In the next post, I sketch out an alternative dataviz for this dataset.]


Illustrating differential growth rates

Reader Mirko was concerned about a video published in Germany that shows why the new coronavirus variant is dangerous. He helpfully provided a summary of the transcript:

The South African and the British mutations of the SARS-COV-2 virus are spreading faster than the original virus. On average, one infected person infects more people than before. Researchers believe the new variant is 50 to 70 % more transmissible.

Here are two key moments in the video:

Germanvid_newvariant1

This seems to be saying the original virus (left side) replicates 3 times inside the infected person while the new variant (right side) replicates 19 times. So we have a roughly 6-fold jump in viral replication.

Germanvid_newvariant2

Later in the video, it appears that every replicate of the old virus finds a new victim while the 19 replicates of the new variant land on 13 new people, meaning 6 replicates didn't find a host.

As Mirko pointed out, the visual appears to have run away from the data. (In our Trifecta Checkup, we have a problem with the arrow between the D and the V corners. What the visual is saying is not aligned with what the data are saying.)

***

It turns out that the scientists have been very confusing when talking about the infectiousness of this new variant. The most quoted line is that the British variant is "50 to 70 percent more transmissible". At first, I thought this was a comment on the famous "R number". Since the R number around December was roughly 1 in the U.K., the new variant might bring the R number up to 1.7.

However, that is not the case. From this article, it appears that being 50 to 70 percent more transmissible means R goes up from 1 to 1.4. R is interpreted as the average number of people infected by one infected person.

Mirko wonders if there is a better way to illustrate this. I'm sure there are many better ways. Here's one I whipped up:

Junkcharts_redo_germanvideo_newvariant

The left side is for the 40% higher R number. Both sides start at the center with 10 infected people. At each time step, if R=1 (right side), each of the 10 people infects one other person, so the total infections increase by 10 per time step. It's immediately obvious that a 40% higher R is very serious indeed. Starting with 10 infected people, in 10 steps, the total number of infections is almost 1,000, almost 10 times higher than when R is 1.

The lines of the graphs simulate the transmission chains. These are "average" transmission chains since R is an average number.
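The arithmetic behind the graphic can be checked with a few lines of code. This is a minimal sketch of the "average" chain, assuming every infected person infects exactly R others in the next time step and counting the initial 10 cases in the total. The function name is my own.

```python
def cumulative_infections(initial, r, steps):
    """Total infections after `steps` time steps, including the initial cases."""
    total = current = float(initial)
    for _ in range(steps):
        current *= r  # each current case infects r others on average
        total += current
    return total


total_r_1 = cumulative_infections(10, 1.0, 10)
total_r_14 = cumulative_infections(10, 1.4, 10)
ratio = total_r_14 / total_r_1
```

With R=1, the total grows by 10 each step and reaches 110. With R=1.4, it reaches roughly 987, which is about nine times as many.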

 

P.S. [1/29/2021: Added the missing link to the article in which it is reported that 50-70 percent more transmissible implies R increasing by 40%.]

 

 


Handling partial data on graphics

Last week, I posted on the book blog a piece about excess deaths and accelerated deaths (link). That whole piece is about how certain types of analysis have to be executed at certain moments in time. The same analysis done at the wrong time yields the wrong conclusions.

Here is a good example of what I'm talking about. This is a graph of U.S. monthly deaths from Covid-19 during the entire pandemic. The chart is from the COVID Tracking Project, although I pulled it down from my Twitter feed.

Covidtracking_monthlydeaths

There is nothing majorly wrong with this column chart (I'd remove the axis labels). But there is a big problem. Are we seeing a boomerang of deaths from November to December to January?

Junkcharts_covidtrackingproject_monthlydeaths_1

Not really. This trend is there only because the chart is generated on January 12. The last column contains 12 days while the prior two columns contain 30-31 days.

Junkcharts_covidtrackingproject_monthlydeaths_2

The Trifecta Checkup picks up this problem. What the visual is showing isn't what the data are saying. I'd call this a Type D chart.

***

How to fix this?

One solution is to present partial data for all the other columns, so that the readers can compare the January column to the others.

Junkcharts_covidtrackingmonthydeaths_first12days

One critique of this is the potential seasonality. The first 39% (12 out of 31) of a month may not be comparable across months. A further seasonal adjustment makes this better - if we decide the benefits outweigh the complexity.

Another solution is to project the full-month tally.

Junkcharts_covidtrackingmonthydeaths_projected

The critique here is the accuracy of the projection.

But the point is that not making the adjustment would be worse.
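The simplest possible projection scales the partial tally by the fraction of the month observed. Here is a minimal sketch, assuming deaths accrue at a constant daily rate for the rest of the month, which is a strong assumption and not necessarily the method behind the projected chart above. The partial tally in the example is hypothetical.

```python
import calendar


def project_full_month(partial_total, year, month, days_observed):
    """Scale a partial-month tally to a full month at a constant daily rate."""
    days_in_month = calendar.monthrange(year, month)[1]
    return partial_total * days_in_month / days_observed


# Hypothetical: 36,000 deaths recorded in the first 12 days of January 2021.
projected = project_full_month(36_000, 2021, 1, 12)
```

A trend-aware or seasonally adjusted projection would be better, but even this crude version avoids comparing 12 days against 31.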

 

 


Is this an example of good or bad dataviz?

This chart is giving me feelings:

Trump_mcconnell_chart

I first saw it on TV and then a reader submitted it.

Let's apply a Trifecta Checkup to the chart.

Starting at the Q corner, I can say the question it's addressing is clear and relevant. It's the relationship between Trump and McConnell's re-election. The designer's intended message comes through strongly - the chart offers evidence that McConnell owes his re-election to Trump.

Visually, the graphic has elements of great story-telling. It presents a simple (others might say, simplistic) view of the data - just the poll results of McConnell vs McGrath at various times, and the election result. It then flags key events, drawing the reader's attention to those. These events are selected based on key points on the timeline.

The chart includes wise design choices, such as no gridlines, infusing the legend into the chart title, no decimals (except for the last pair of numbers, the intention of which I'm not getting), and leading with the key message.

I can nitpick a few things. Get rid of the vertical axis. Also, expand the scale so that the difference between 51%-40% and 58%-38% becomes more apparent. Space the time points in proportion to the dates. The box at the bottom is a confusing afterthought that reduces rather than assists the messaging.

But the designer got the key things right. The above suggestions do not alter the reader's experience that much. It's a nice piece of visual story-telling, and from what I can see, has made a strong impact with the audience it is intended to influence.

_trifectacheckup_junkcharts

This chart is proof of why the Trifecta Checkup has three corners, plus linkages between them. If we just evaluate what the visual is conveying, this chart is clearly above average.

***

In the D corner, we ask: what are the Data saying?

This is where the chart runs into several problems. Let's focus on the last two sets of numbers: 51%-40% and 58%-38%. Just add those numbers and do you notice something?

The last poll sums to 91%. This means that up to 10% of the likely voters responded "not sure" or some other candidate. If these "shy" voters show up at the polls as predicted by the pollsters, and if they voted just like the not shy voters, then the election result would have been 56%-44%, not 51%-40%. So, the 58%-38% result is within the margin of error of these polls. (If the "shy" voters break for McConnell in a 75%-25% split, then he gets 58% of the total votes.)

So, the data behind the line chart aren't suggesting that the election outcome is anomalous. This presents a problem with the Q-D and D-V green arrows as these pairs are not in sync.
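The poll arithmetic can be sketched in a few lines. The function below takes the two candidates' poll shares, treats the remainder as undecided, and allocates it either in proportion to the decided voters or by a stated split. The function and parameter names are my own.

```python
def allocate_undecided(mcconnell, mcgrath, share_to_mcconnell=None):
    """Distribute the undecided share between the two candidates.

    If no split is given, undecided voters break like the decided voters.
    """
    undecided = 100 - mcconnell - mcgrath
    if share_to_mcconnell is None:
        share_to_mcconnell = mcconnell / (mcconnell + mcgrath)
    return (
        mcconnell + undecided * share_to_mcconnell,
        mcgrath + undecided * (1 - share_to_mcconnell),
    )


proportional = allocate_undecided(51, 40)        # roughly (56.0, 44.0)
break_75_25 = allocate_undecided(51, 40, 0.75)   # (57.75, 42.25)
```

The proportional split gives the 56%-44% figure, and the 75%-25% split lands McConnell at about 58%.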

***

In the D corner, we should consider the totality of the data available to the designer, not just what the designer chooses to utilize. The pivot of the chart is the flag annotating the "Trump robocall."

Here are some questions I'd ask the designer:

What else happened on October 31 in Kentucky?

What else happened on October 31, elsewhere in the country?

Was Trump featured in any other robocalls during the period portrayed?

How many robocalls were made by the campaign, and what other celebrities were featured?

Did any other campaign event or effort happen between the Trump robocall and election day?

Is there evidence that nothing else that happened after the robocall produced any value?

The chart commits the XYopia (i.e. X-Y myopia) fallacy of causal analysis. When the data analyst presents one cause and one effect, we are cued to think the cause explains the effect but in every scenario that is not a designed experiment, there are multiple causes at play. Sometimes, the more influential cause isn't the one shown in the chart.

***

Finally, let's draw out the connection between the last set of poll numbers and the election results. This shows why causal inference in observational data is such a beast.

Poll numbers are about a small number of people (500-1,000 in the case of Kentucky polls) who respond to polling. Election results are based on voters (> 2 million). An assumption made by the designer is that these polls are properly conducted, and their results are credible.

The chart above makes the claim that Trump's robocall gave McConnell 7% more votes than expected. This implies the robocall influenced at least 140,000 voters. Each such voter must fit the following criteria:

  • Was targeted by the Trump robocall
  • Was reached by the Trump robocall (phone was on, etc.)
  • Responded to the Trump robocall, by either picking up the phone or listening to the voice recording or dialing a call-back number
  • Did not previously intend to vote for McConnell
  • If reached by a pollster, would refuse to respond, or say not sure, or say he/she was voting for McGrath or a third candidate
  • Had no other reason to change his/her behavior

Just take the first bullet for example. If we found a voter who switched to McConnell after October 31, and if this person was not on the robocall list, then this voter contributes to the unexpected gain in McConnell votes but weakens the case that the robocall influenced the election.

As analysts, our job is to find data to investigate all of the above. Some of these are easier to investigate. The campaign knows, for example, how many people were on the target list, and how many listened to the voice recording.

 

 

 

 


Aligning the visual and the data

The Washington Post reported a surge in donations to the Democrats after the death of Justice Ruth Bader Ginsburg (link). A secondary effect, perhaps unexpected, was that donors decided to spread the money around; the proportion of donors who gave to six or more candidates jumped to 65%, where normally it is at 5%.

Wapo_donations

The text tells us what to look for, and the axis labels are commendably restrained. The color scheme is also intuitive.

There is something frustrating about this chart, though. It's that the spike is shown upside down. The level that the arrow points at is 45%, which is the total of the blue columns. The visual suggests the proportion of multiple beneficiaries (2 or more) should be 55%. There is a divergence between what the visual is saying and what the data are saying. Whichever number is correct, the required proportion is the complement of the level shown on the percentage axis (that is, 100% minus that level)!

***

This is the same chart flipped over.

Junkcharts_redo_wapo_donations

Now, the number we need can be read off the vertical axis.

I also moved the color legend to the right side so that the entries can be printed vertically, in the same direction as the data. This is one of the unspoken rules of data visualization I covered in my feature for DataJournalism.com.

***

In the Trifecta Checkup (link), the issue is with the green arrow between the D corner and the V corner. The data and the visual are not in sync. 

 


Visualizing change over time: case study via Arstechnica

ArsTechnica published the following chart in its article titled "Grim new analyses spotlight just how hard the U.S. is failing in pandemic" (link).

Artechnica-covid-mortality

There are some very good things about this chart, so let me start there.

In a Trifecta Checkup, I'd give the Q corner high marks. The question is clear: how has the U.S. performed relative to other countries? In particular, the chart gives a nuanced answer to this question. The designer realizes that there are phases in the pandemic, so the same question is asked three times: how has the U.S. performed relative to other countries since June, since May, and since the start of the pandemic?

In the D corner, this chart also deserves a high score. It selects a reasonable measure of mortality, which is deaths per population. It simplifies cognition by creating three grades of mortality rates per 100,000. Grade A is below 5 deaths, Grade B, between 5 and 25, and Grade C is above 25. 

A small deduction for not including the source of the data (the article states it's from a JAMA article). If any reader notices problems with the underlying data or calculations, please leave a comment.
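The grading scheme is easy to express in code. In this minimal sketch, I assume rates of exactly 5 or 25 fall into Grade B, since the chart's handling of the boundaries isn't stated.

```python
def mortality_grade(deaths_per_100k):
    """Assign a grade from deaths per 100,000 (boundary handling is assumed)."""
    if deaths_per_100k < 5:
        return "A"
    if deaths_per_100k <= 25:
        return "B"
    return "C"


# Hypothetical rates, for illustration only.
grades = [mortality_grade(rate) for rate in (0.5, 12.0, 60.0)]
```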

***

So far so good. And yet, you might feel like I'm over-praising a chart that feels distinctly average. Not terrible, not great.

The reason for our ambivalence is the V corner. This is what I call a Type V chart. The visual design isn't doing justice to the underlying question and data analysis.

The grouped bar chart isn't effective here because the orange bars dominate our vision. It's easy to see how each country performed over the course of the pandemic but it's hard to learn how countries compare to each other in different periods.

How are the countries ordered? It would seem like the orange bars may be the sorting variable but this interpretation fails in the third group of countries.

The designer apparently made the decision to place the U.S. at the bottom (i.e. the worst of the league table). As I will show later, this placement is justified, but it cannot be justified by the orange bars alone. The U.S. is worse in both the blue and purple bars but not the orange.

This points out that there is interest in the change in rates (or ranks) over time. And in the following makeover, I used the Bumps chart as the basis, as its chief use is in showing how ranking changes over time.

Redo_junkcharts_at_coviddeathstable_1

 

Better clarity can often be gained by subtraction:

Redo_junkcharts_at_coviddeathstable_2
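A Bumps chart plots ranks rather than raw rates, so the first step is to rank the countries within each period. Here is a minimal sketch with hypothetical rates (not the JAMA figures). The function name is my own, and ties keep the order in which the countries were entered.

```python
def ranks_by_period(rates):
    """Map {period: {country: rate}} to {period: {country: rank}}; 1 = lowest rate."""
    result = {}
    for period, by_country in rates.items():
        ordered = sorted(by_country, key=by_country.get)
        result[period] = {country: i + 1 for i, country in enumerate(ordered)}
    return result


# Hypothetical deaths per 100,000 in two periods.
rates = {
    "since start": {"Country X": 60.0, "Country Y": 62.0, "Country Z": 10.0},
    "since June": {"Country X": 27.0, "Country Y": 3.0, "Country Z": 2.0},
}
ranks = ranks_by_period(rates)
```

Each country then becomes one line connecting its rank in each period, which makes changes in rank stand out.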


Bloomberg made me digest these graphics slowly

Ask the experts to name the success metric of good data visualization, and you will receive a dozen answers. The field doesn't have an all-encompassing metric. A useful reference is Andrew Gelman and Antony Unwin (2012) in which they discussed the tradeoff between beautiful and informative, which derives from the familiar tension between art and science.

For a while now, I've been intrigued by metrics that measure "effort". Some years ago, I described the concept of a "return on effort" in this post. Such a metric can be constructed like the dominating financial metric of return on investment. The investment here is an investment of time, of attention. I strongly believe that if the consumer judges a data visualization to be compelling, engaging or well constructed, s/he will expend energy to devour it.

Imagine grub you discard after the first bite, compared to the delicious food experienced slowly, savoring every last bit.

Bloomberg_ambridge_sm

I'm writing this post while enjoying the September issue of Bloomberg Businessweek, which focuses on the upcoming U.S. Presidential election. There are various graphics infused into the pages of the magazine. Many of these graphics operate at a level of complexity above what typically shows up in magazines, and yet I spent energy learning to understand them. This response, I believe, is what visual designers should aim for.

***

Today, I discuss one example of these graphics, shown on the right. You might be shocked by the throwback style of these graphics. They look like they arrived from decades ago!

Grayscale, simple forms, typewriter font, all caps. Have I gone crazy?

The article argues that a town like Ambridge in Beaver County, Pennsylvania may be pivotal in the November election. The set of graphics provides relevant data to understand this argument.

It's evidence that data visualization does not need whiz-bang modern wizardry to excel.

Let me focus on the boxy charts from the top of the column. These:

Bloomberg_ambridge_topboxes

These charts solve a headache with voting margin data in the U.S. We have two dominant political parties so in any given election, the vote share data split into three buckets: Democratic, Republican, and a catch-all category that includes third parties, write-ins, and none of the above. The third category rarely exceeds 5 percent. A generic pie chart representation looks like this:

Redo_junkcharts_bloombergambridgebox_pies

Stacked bars have this look:

Redo_junkcharts_bloombergambridgebox_bars

In using my Trifecta framework (link), the top point is articulating the question. The primary issue here is the voting margin between the winner and the runner-up, which is the loser in what is typically a two-horse race. There exist two sub-questions: the vote-share difference between the top two finishers, and the share of vote effectively removed from the pot by the remaining candidates.

Now, take another look at the unusual chart form used by Bloomberg:

Bloomberg_ambridge_topboxes1

The catch-all vote share sits at the bottom while the two major parties split up the top section. This design demonstrates a keen understanding of the context. Consider the typical outcome, in which the top two finishers are from the two major parties. When answering the first sub-question, we can choose the raw vote shares, or the normalized vote shares. Normalizing shifts the base from all candidates to the top two candidates.

The Bloomberg chart addresses both scales. The normalized vote shares can be read directly by focusing only on the top section. In an even two-horse race, the top section is split in half - this holds true regardless of the size of the bottom section.
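To make the two scales concrete, here is a minimal sketch of the normalization, using hypothetical vote shares rather than the Ambridge data. The function name is my own.

```python
def normalized_shares(dem, rep):
    """Re-base the two major-party shares on their combined total."""
    top_two = dem + rep
    return dem / top_two, rep / top_two


# Hypothetical raw shares: 47% Democratic, 48% Republican, 5% other.
dem_norm, rep_norm = normalized_shares(47, 48)

# An even race stays 50-50 no matter how large the catch-all share is.
even_small_other = normalized_shares(49, 49)   # 2% other
even_large_other = normalized_shares(45, 45)   # 10% other
```

The raw margin of one point becomes a slightly larger normalized margin, because the catch-all votes are removed from the base.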

This is a simple chart that packs a punch.

 


The discontent of circular designs

You have two numbers +84% and -25%.

The textbook method to visualize this pair is to plot two bars. One bar in the positive direction, the other in the negative direction. The chart is clear (more on the analysis later).

Redo_pbs_mask1

But some find this graphic ugly. They don’t like straight lines, right angles and such. They prefer circles, and bends. Like PBS, who put out the following graphic that was forwarded to me by Fletcher D. on Twitter:

Maskwearing_racetrack

Bending the columns is not as simple as it seems. Notice that the designer adds red arrows pointing up and down. Because the circle rounds onto itself, the sense of direction is lost. Now, readers must pick up the magnitude and the direction separately. It doesn’t help that zero is placed at the bottom of the circle.

Can we treat direction like we would on a bar chart? Make counter-clockwise the negative direction. This is what it looks like:

Redo_pbsmaskwearing

But it’s confusing. I made the PBS design worse because now, the value of each position on the circle depends on knowing whether the arrow points up or down. So, we couldn’t remove those red arrows.

The limitations of the “racetrack” design reveal themselves in similar data that are just a shade different. Here are a couple of scenarios to ponder:

  1. You have growth exceeding 100%. This is a hard problem.
  2. You have three or more rates to compare. Making one circle for each rate quickly becomes cluttered. You may make a course with multiple racetracks. But anyone who runs track can tell you the outside lanes are not the same distance as the inside. I wrote about this issue in a long-ago post (see here).

***

For a Trifecta Checkup (link), I'd also have concerns about the analytics. There are so many differences between the states that have required masks and states that haven't - the implied causality is far from proven by this simple comparison. For example, it would be interesting to see the variability around these averages - by state or even by county.