The discontent of circular designs

You have two numbers: +84% and -25%.

The textbook method to visualize this pair is to plot two bars. One bar in the positive direction, the other in the negative direction. The chart is clear (more on the analysis later).

Redo_pbs_mask1
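For readers who want to reproduce the textbook version, here is a minimal matplotlib sketch. The +84% and -25% values come from the example above; the category labels and colors are my own placeholders, not taken from the PBS graphic.

```python
import matplotlib.pyplot as plt

# The two values from the example: one positive, one negative.
values = [84, -25]
labels = ["Group A", "Group B"]   # hypothetical labels for illustration

fig, ax = plt.subplots(figsize=(4, 4))
ax.bar(labels, values, color=["#4575b4", "#d73027"])
ax.axhline(0, color="black", linewidth=0.8)           # zero baseline
ax.set_ylabel("Change (%)")
for x, v in enumerate(values):
    ax.text(x, v, f"{v:+d}%", ha="center",
            va="bottom" if v > 0 else "top")          # value labels at bar ends
plt.tight_layout()
plt.show()
```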

But some find this graphic ugly. They don't like straight lines, right angles and such. They prefer circles and bends. Like PBS, which put out the following graphic, forwarded to me by Fletcher D. on Twitter:

Maskwearing_racetrack

Bending the columns is not as simple as it seems. Notice that the designer adds red arrows pointing up and down. Because the circle rounds onto itself, the sense of direction is lost. Now, readers must pick up the magnitude and the direction separately. It doesn’t help that zero is placed at the bottom of the circle.

Can we treat direction like we would on a bar chart? Make counter-clockwise the negative direction. This is what it looks like:

Redo_pbsmaskwearing

But it’s confusing. I made the PBS design worse because now, the value of each position on the circle depends on knowing whether the arrow points up or down. So, we couldn’t remove those red arrows.

The limitations of the “racetrack” design reveal themselves in similar data that are just a shade different. Here are a couple of scenarios to ponder:

  1. You have growth exceeding 100%. This is a hard problem.
  2. You have three or more rates to compare. Making one circle for each rate quickly becomes cluttered. You could build a course with multiple concentric racetracks, but anyone who runs track can tell you the outside lanes are not the same length as the inside ones (see the quick calculation below). I wrote about this issue in a long-ago post (see here).
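A quick back-of-the-envelope calculation (my own numbers, not from the post) shows why concentric tracks distort: the same value sweeps the same angle on every ring, but the arc length the eye reads grows with the radius.

```python
import math

# Hypothetical radii for three concentric "racetrack" rings (arbitrary units).
radii = [1.0, 1.3, 1.6]

# The same value, say +84%, is drawn as the same angular sweep on every ring,
# but the arc length (the "ink" the eye reads) grows with the radius.
angle = 0.84 * 2 * math.pi          # 84% of a full circle
for r in radii:
    arc = r * angle                  # arc length = radius * angle
    print(f"radius {r:.1f}: arc length {arc:.2f}")
# The outermost ring draws a ~60% longer arc for the identical value.
```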

***

For a Trifecta Checkup (link), I'd also have concerns about the analytics. There are so many differences between the states that have required masks and states that haven't - the implied causality is far from proven by this simple comparison. For example, it would be interesting to see the variability around these averages - by state or even by county.


Hope and reality in one Georgia chart

Over the weekend, Georgia's State Health Department agitated a lot of people when it published the following chart:

Georgia_top5counties_covid19

(This might have appeared a week ago as the last date on the chart is May 9 and the title refers to "past 15 days".)

They could have avoided the embarrassment if they had read my article at DataJournalism.com (link). In that article, I lay out a set of the "unspoken conventions," things that visual designers are, or should be, doing more or less in their sleep. Under the section titled "Order", I explain the following two "rules":

  • Place values in the natural order when it is available
  • Retain the same order across all plots in a panel of charts

In the chart above, the natural order for the horizontal (time) axis is time running left to right. The order chosen by the designer is roughly, but not precisely, the decreasing height of the tallest column in each daily group. Many observers suggested that the columns were arranged to give the appearance of cases dropping over time.

Within each day, the counties are ordered in decreasing number of new cases. The title of the chart reads "number of cases over time" which sounds like cumulative cases but it's not. The "lead" changed hands so many times over the 15 days, meaning the data sequence was extremely noisy, which would be unlikely for cumulative cases. There are thousands of cases in each of these counties by May. Switching the order of the columns within each daily group defeats the purpose of placing these groups side-by-side.

Responding to the bad press, the department changed the chart design for this week's version:

Georgia_top5counties_covid19_revised

This chart now conforms to the two unspoken rules described above. The time axis runs left to right, and within each group of columns, the order of the counties is maintained.

The chart is still very noisy, with no apparent message.

***

Next, I'd like to draw your attention to a Data issue. Notice that the 15-day window has shifted. This revised chart runs from May 2 to May 16, which is this past Saturday. The previous chart ran from Apr 26 to May 9. 

Here's the data for May 8 and 9 placed side by side.

Junkcharts_georgia_covid19_cases

There is a clear lag in reporting cases in the State of Georgia: the counts keep going up for several days until they stabilize. This chart should always exclude the last few days. The same mistake occurs in the revised chart - the last two days appear as if new cases have dwindled toward zero when in fact this reflects the lag in reporting.
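One pragmatic fix is simply to truncate the most recent days before plotting. Here is a minimal sketch under the assumption that the daily counts sit in a pandas series indexed by report date; the function name, the three-day cutoff, and the sample numbers are all hypothetical.

```python
import pandas as pd

def drop_immature_days(daily_cases: pd.Series, lag_days: int = 3) -> pd.Series:
    """Drop the most recent `lag_days` dates, whose counts are still being revised upward."""
    cutoff = daily_cases.index.max() - pd.Timedelta(days=lag_days)
    return daily_cases.loc[:cutoff]

# Example with made-up numbers: the last few days look artificially low.
dates = pd.date_range("2020-05-01", "2020-05-16")
cases = pd.Series([120, 130, 125, 140, 150, 145, 160, 155, 150, 165,
                   170, 160, 150, 90, 40, 10], index=dates)
print(drop_immature_days(cases, lag_days=3))   # excludes May 14-16
```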

The disconnect between the Question being posed and the quality of the Data available dooms this visualization. It is not possible to provide a reliable assessment of the "past 15 days" when during perhaps half of that period, the cases are under-counted.

***

Nyt_tryingtobefashionable

This graphical distortion due to "immature" data has become very commonplace in Covid-19 graphics. It's similar to placing partial-year data next to full-year results, without calling out the partial data.

The following post from the ancient past (2005!) about a New York Times graphic shows that calling out this data problem does not actually solve it. It's a less-bad kind of thing.

The coronavirus data present more headaches for graphic designers than financial statistics do. Because of accounting regulations, we know that only the current quarter's data are immature. For Covid-19 reporting, the numbers keep being adjusted for days and weeks.

Practically all immature counts are under-estimates: over time, more cases are reported. Thus, any plot over time - if unadjusted - paints a misleading picture of declining counts. The effect of the reporting lag is predictable, growing larger as we move from left to right in time. Thus, even if the most recent data show a downward trend, the eventual trend can turn out to be anything: down, flat or up. This is not random noise, though - we know for certain that the bias is downward; we just don't know the magnitude of the distortion for a while.

Another issue that affects coronavirus reporting but not financial reporting is inconsistent standards across counties. Within a business, if one were to break out statistics by county, the analysts would naturally apply the same counting rules. For Covid-19 data, each county follows its own set of rules, not just for how to count things but also for how to conduct testing, and so on.

Finally, with the politics of re-opening, I find it hard to trust the data. Reported cases are human-driven data - by changing the number of tests, by testing different mixes of people, by delaying reporting, by timing the revision of older data, by explicit manipulation, ...., the numbers can be tortured into any shape. That's why it is extremely important that the bean-counters are civil servants, and that politicians are kept away. In the current political environment, that separation between politics and statistics has been breached.

***

Why do we have low-quality data? Human decisions, frequently political decisions, adulterate the data. Epidemiologists are then forced to use the bad data, because that's what they have. Bad data lead to bad predictions and bad decisions, or if the scientists account for the low quality, predictions with high levels of uncertainty. Then, the politicians complain that predictions are wrong, or too wide-ranging to be useful. If they really cared about those predictions, they could start by being more transparent about reporting and more proactive at discovering and removing bad accounting practices. The fact that they aren't focused on improving the data gives the game away. Here's a recent post on the politics of data.

 


Proportions and rates: we are no dupes

Reader Lucia G. sent me this chart, from Ars Technica's FAQ about the coronavirus:

Arstechnica_covid-19-2.001-1280x960

She notices something wrong with the axis.

The designer took the advice not to make a dual axis, but didn't realize that the two metrics are not measured on the same scale even though both are expressed as percentages.

The blue bars, labeled "cases", show the distribution of cases by age group. The sum of the blue bars should be 100 percent.

The orange bars show fatality rates by age group. Each orange bar's rate is based on the number of cases in that age group. The sum of the orange bars will not add to 100 percent.

In general, the rates will have much lower values than the proportions. At least that should be the case for viruses that are not extremely fatal.
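To make the distinction concrete, here is a small sketch with made-up counts (not the Ars Technica data): the case shares sum to 100 percent, while the fatality rates are computed within each age group and their sum is meaningless.

```python
# Made-up counts for three age groups (not the actual Ars Technica data).
cases  = {"0-49": 500, "50-79": 400, "80+": 100}
deaths = {"0-49": 1,   "50-79": 12,  "80+": 15}

total_cases = sum(cases.values())

case_share    = {g: cases[g] / total_cases for g in cases}   # a distribution; sums to 1.0
fatality_rate = {g: deaths[g] / cases[g]   for g in cases}   # a per-group rate; does not

print(case_share)                    # {'0-49': 0.5, '50-79': 0.4, '80+': 0.1}
print(fatality_rate)                 # {'0-49': 0.002, '50-79': 0.03, '80+': 0.15}
print(sum(case_share.values()))      # 1.0
print(sum(fatality_rate.values()))   # 0.182 -- not a meaningful total
```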

This is what the 80 and over section looks like.

Screen Shot 2020-03-12 at 1.19.46 AM

It is true that fatality rate (orange) is particularly high for the elderly while this age group accounts for less than 5 percent of total cases (blue). However, the cases that are fatal, which inhabit the orange bar, must be a subset of the total cases for 80 and over, which are shown in the blue bar. Conceptually, the orange bar should be contained inside the blue bar. So, it's counter-intuitive that the blue bar is so much shorter than the orange bar.

The following chart fixes this issue. It reveals the structure of the data: total cases are split by age group, then within each age group, a proportion of the cases are fatal.

Junkcharts_redo_arstechnicacovid19

This chart also shows that most patients recover in every age group. (This is only approximately true as some of the cases may not have been discharged yet.)

***

This confusion of rates and proportions reminds me of something about exit polls I just wrote about the other day on the sister blog.

When the media make statements about trends in voter turnout rate in the primary elections, e.g. when they assert that youth turnout has not increased, their evidence is from exit polls, which can measure only the distribution of voters by age group. Exit polls do not and cannot measure the turnout rate, which is the proportion of registered (or eligible) voters in the specific age group who voted.

Like the coronavirus data, the scales of these two metrics are different even though they are both percentages: the turnout rate is typically a number between 30 and 70 percent, and summing the rates across all age groups will exceed 100 percent many times over. Summing the proportions of voters across all age groups should be 100 percent, and no more.

Changes in the proportion of voters aged 18-29 and changes in the turnout rate of people aged 18-29 are not the same thing. The former is affected by the turnout of all age groups while the latter is a clean metric affected only by 18-to-29-year-olds.
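Here is the same distinction in code form, with invented numbers: the exit-poll quantity is the share of voters in each age group, while the turnout rate requires the count of eligible voters per group, which exit polls do not observe.

```python
# Invented counts for illustration only.
eligible = {"18-29": 50_000, "30-59": 90_000, "60+": 60_000}
voted    = {"18-29": 18_000, "30-59": 50_000, "60+": 42_000}

total_voted = sum(voted.values())

voter_share  = {g: voted[g] / total_voted for g in voted}   # what exit polls measure; sums to 100%
turnout_rate = {g: voted[g] / eligible[g] for g in voted}   # needs eligible counts; does NOT sum to 100%

print(voter_share)    # {'18-29': ~0.16, '30-59': ~0.45, '60+': ~0.38}
print(turnout_rate)   # {'18-29': 0.36, '30-59': ~0.56, '60+': 0.70}
```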

Basically, ignore pundits who use exit polls to comment on turnout trends. No matter how many times they repeat the claim, proportions and rates are not to be confused. The only data they have come from exit polls, which do not measure turnout rates.

 

P.S. Here is some further explanation of my chart, as a response to a question from Enrico B. on Twitter.

The chart can be thought of as two distributions, one for cases (gray) and one for deaths (red). Like this:

Junkcharts_redo_arstechnicacoronavirus_2

The side-by-side version removes the direct visualization of the fatality rate within each age group; to understand the fatality rate, readers must do math in their heads. Readers can qualitatively assess that people 80 and over accounted for 3 percent of cases but about 21 percent of deaths, while people aged 70 to 79 accounted for 9 percent of cases but 30 percent of deaths, and so on.

What I did was to scale the distribution of deaths so that they can be compared to the cases. It's like fitting the red distribution inside the gray distribution. Within each age group, the proportion of red against the length of the bar is the fatality rate.

For every 100 cases regardless of age, 3 cases are for people aged 80 and over within which 0.5 are fatal (red).
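That arithmetic can be written out in a couple of lines; the 3 percent and the 0.5 come from the sentence above, and the rest is just multiplication.

```python
# Per 100 total cases (figures quoted above for the 80-and-over group).
share_of_cases = 0.03      # 80+ group holds 3% of all cases
fatality_rate  = 0.5 / 3   # about 0.5 of those 3 cases are fatal -> rate ~ 16.7%

red_segment = share_of_cases * fatality_rate   # length of the red portion of that bar
print(round(red_segment * 100, 2), "per 100 cases")   # ~0.5
```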

So, the axis labels are correct. The values are proportions of total cases, although as the designer of the chart, I hope people are paying attention more to the proportion of red, as opposed to the units.

What might strike people as odd is that the biggest red bar does not appear against 80 and above. We might believe it's deadlier the older you are. That's because on an absolute scale, more people aged 70-79 died than those 80 and above. The absolute number of deaths is the product of the proportion of cases and the fatality rate. That's really a different story from the usual plot of fatality rates by age group. In those charts, we "control" for the prevalence of cases: if every age group were infected at the same frequency, then COVID-19 would indeed kill the 80-and-over group at the highest rate.

 

 

 


It's impossible to understand Super Tuesday, this chart says

Twitter people are talking about this chart, from NPR (link):

Npr_delegates

This was published on Wednesday after Super Tuesday, the day on which multiple states held their primary elections. On the Democratic side, something like a third of the delegates were up for grabs (although, as the data below this chart show, a big chunk of the delegates, mostly from California and Texas, had yet to be assigned to a candidate as votes were still being counted).

Here, I hovered over the Biden line, trying to decipher the secret code in these lines:

Npr_supertuesday_biden

I have to say I failed. Biden won 6 delegates on Feb 3, 9 on Feb 22, 39 on Feb 29, and 512 on Mar 3. I have no idea how those numbers led to this line!

***

Here is what happened so far in the Democratic primary:

Junkcharts_redo_nprsupertuesday_sm

The key tradeoff the designer has to make here is the relative importance of the timeline and the total count. In this chart, it's easiest to compare the total count across candidates as of Wednesday morning, then to see how each candidate accumulates delegates over the first five contest days. It takes a little more effort to see who's ahead after each contest day. And it is almost impossible to see the spacing of the contest days over the calendar.

I don't use stacked bar charts often, but this chart form makes the cumulative counts over time clear, so it's appropriate here.
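To make the design concrete, here is a minimal matplotlib sketch of the stacked-bar form. The candidate names are real, but every delegate count in the matrix is a placeholder, and the five contest dates are listed from memory for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt

candidates   = ["Biden", "Sanders", "Warren", "Bloomberg"]
contest_days = ["Feb 3", "Feb 11", "Feb 22", "Feb 29", "Mar 3"]

# Placeholder delegate counts won by each candidate (rows) on each contest day (columns).
won = np.array([
    [ 6,  0, 10, 40, 500],
    [12, 20, 20, 15, 450],
    [ 8,  9,  4,  8,  60],
    [ 0,  0,  0,  0,  45],
])

fig, ax = plt.subplots()
bottom = np.zeros(len(candidates))
for i, day in enumerate(contest_days):
    ax.bar(candidates, won[:, i], bottom=bottom, label=day)  # each contest day adds one segment
    bottom += won[:, i]
ax.set_ylabel("Delegates won to date")
ax.legend(title="Contest day")
plt.tight_layout()
plt.show()
```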

Also, the as-yet-unassigned delegates are a big part of the story and need to be visualized.

 

P.S. See comment below. There was a bug in the code and they fixed the line chart.

Npr_supertuesday_2

So, some of the undecided delegates have been awarded; comparing the two charts, it appears that the gap went down from 105 to 76. There are still over 150 delegates not assigned.

 


Whither the youth vote

The youth turnout is something that politicians and pundits bring up constantly when talking about the current U.S. presidential primaries. So I decided to look for the data. I found some data at the United States Election Project, a site maintained by Dr. Michael McDonald. The key chart is this one:

Electproject_voterturnoutbyage

This is classic Excel.

***

Here is a quick fix:

Redo_electprojects_voterturnout

The key to the fix is to recognize the structure of the data.

The sawtooth pattern displayed in the original chart does not convey any real trend - it is an artifact of many people turning out only for presidential elections. (As a result, the turnout during presidential election years is driven by the general election turnout.)

The age groups have an order so instead of four different colors, use a progressive color scheme. This is one of the unspoken rules about color usage in data visualization, featured in my Long Read article.
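Here is one way to get a progressive color scheme in matplotlib. The 18-29 and 60+ labels come from the chart; the middle group boundaries, the turnout values, and the choice of the Blues colormap are my own assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

age_groups = ["18-29", "30-44", "45-59", "60+"]   # ordered groups (middle labels assumed)

# Sample evenly from a sequential colormap so that darker = older.
colors = cm.Blues(np.linspace(0.35, 0.95, len(age_groups)))

turnout = [32, 42, 50, 66]                        # placeholder turnout values (%)
fig, ax = plt.subplots()
ax.bar(age_groups, turnout, color=colors)
ax.set_ylabel("Turnout (%)")
plt.show()
```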

***

What do I learn from this turnout by age group chart?

Younger voters are much more invested in presidential elections than off-year elections. The youth turnout for presidential elections is double that for other years.

Participation increased markedly in the 2018 mid-term elections across all four age groups, reflecting the passion for or against President Donald Trump. This was highly unusual - and in fact, the turnout for that off-year is closer to the turnout of a presidential year election. Whether the turnout will stay at this elevated level is a big question for 2022!

For presidential elections, turnout has been creeping up over time for all age groups. But the increase in 2016 (Hillary Clinton vs Donald Trump) was mild. The growth in participation is more noticeable in the younger age groups, including in 2016.

Let's look at the relative jumps in 2018 (right side of the left chart). The younger the age group, the larger the jump. Turnout in the 18-29 group doubled to 32 percent. Turnout in the oldest age group increased by 20%, nothing to sneeze at but less impressive than in the younger age groups.

Why this is the case should be obvious. The 60+ age group has a ceiling. It's already at 60-70%; how much higher can it go? People at that age have had many years to develop their voting habits. It would be hard to convince the holdouts (hideouts?) to vote.

The younger age groups are further from the ceiling. If you're an organizer, will you focus your energy on the 60% non-voting 18-29-years-old, or the 30% non-voting 60+ years-old? [This is the same question any business faces: do you win incremental sales from your more loyal customers, hoping they would spend even more, or your less loyal customers?]

For Democratic candidates, the loss in 2016 is hanging over them. Getting the same people to vote in 2020 as in 2016 is a losing hand. So, they need to expand the base somehow.

If you're a candidate like Joe Biden who relies on the 60+ bloc, it's hard to see where you can expand the base. Your advantage is that the core voter bloc is reliable. Your problem is that you have little appeal to the younger age groups. So a viable path to winning the general election has to involve flipping older Trump voters. The incremental ex-Trump voters have to offset the potential loss in turnout from younger voters.

If you're a candidate like Bernie Sanders who relies on the youth vote, you'd want to launch a get-out-the-vote effort aimed at younger voters. A viable path can be created by expanding the base through lifting the turnout rate of younger voters. The incremental young voters have to offset the fraction of the 60+ year old bloc who flip to Trump.

 

 

 

 

 

 


Bad data leave chart hanging by the thread

IGNITE National put out a press release saying that Gen Z white men are different from all other race-gender groups because they are more likely to be or lean Republican. The evidence is in this chart:

Genz_survey

Or is it?

Following our Trifecta Checkup framework (link), let's first look at the data. White men are the bottom-left group: Democratic = 42%, Independent = 28%, Republican = 48%. That's a total of 118%. Unfortunately, this chart construction error erases the message. We don't know which of the three columns were incorrectly sized; perhaps the data were incorrectly weighted so that the error is spread across the three columns.
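This kind of error is easy to catch with a one-line sanity check on each group before publishing. A sketch with the white-men figures quoted above (the 2-point tolerance for rounding is my own choice):

```python
# Party shares for the white-men group, as read off the original chart.
shares = {"Democratic": 42, "Independent": 28, "Republican": 48}

total = sum(shares.values())
if abs(total - 100) > 2:          # allow a little rounding slack
    print(f"Warning: shares sum to {total}%, not 100% -- check the chart construction")
```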

But the story of the graphic is hanging by the thread - the gap between Democratic and Republican lean amongst white men is 6 percentage points, which is smaller than the 18-point excess in the data. I sent them a tweet asking for a correction. Will post the corrected version if they respond.

Update: The thread didn't break. They replied quickly and issued the following corrected chart:

Genz_corrected

Now, the data for white men are: Democratic = 35%, Independent = 22%, Republican = 40%. That's roughly a 7% shift for each party affiliation, so they may have just started the baseline at the wrong level when inverting the columns.

***

The Visual design also has some problems. I am not a fan of inverting columns. In fact, column inversion may be the root of the error above.

Genz_whitemen

Let me zoom in on the white men columns (see right).

Without looking at the legend, can you guess which color is Democratic, Independent or Republican? Go ahead and take your best guess.

For me, I think red is Republican (by convention), then white is Independent (a neutral color) which means yellow is Democratic.

Here is the legend:

Genz-legend

So I got the yellow and white reversed. And that is another problem with the visual design. For a chart that shows two-party politics in the U.S., there is really no good reason to deviate from the red-blue convention. The color for Independents doesn't matter since it would be understood that the third color would represent them.

If the red-blue convention were followed, readers do not need to consult the legend.

***

In my Long Read article at DataJournalism.com, I included an "unspoken rule" about color selection: use the natural color mapping whenever possible. Go here to read about this and other rules.

The chart breaks another one of the unspoken conventions. When making a legend, place it near the top of the chart. Readers need to know the color mapping before they can understand the chart.

In addition, you want the reader's eyes to read the legend in the same way they read the columns. The columns go left to right from Democratic to Independent to Republican. The legend should do the same!

***

Here is a quick re-do that fixes the visual issues (except the data error). It's an Excel chart but it doesn't have to be bad.

Redo_genzsurvey
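For what it's worth, a grouped-bar redo along these lines takes only a few lines of matplotlib. The group names and percentages below are placeholders; the point is the conventional blue/red mapping and a top legend whose order matches the column order.

```python
import numpy as np
import matplotlib.pyplot as plt

groups = ["White men", "White women", "Men of color", "Women of color"]   # hypothetical labels
# Placeholder percentages (Democratic, Independent, Republican) for each group.
dem = [35, 48, 55, 62]
ind = [22, 20, 18, 15]
rep = [40, 28, 22, 18]

x = np.arange(len(groups))
w = 0.25
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(x - w, dem, w, color="#2166ac", label="Democratic")   # blue by convention
ax.bar(x,     ind, w, color="#999999", label="Independent")  # neutral gray
ax.bar(x + w, rep, w, color="#b2182b", label="Republican")   # red by convention
ax.set_xticks(x)
ax.set_xticklabels(groups)
ax.set_ylabel("Percent")
ax.legend(loc="upper center", ncol=3, bbox_to_anchor=(0.5, 1.15))  # legend on top, read left to right
plt.tight_layout()
plt.show()
```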

 


Where are the Democratic donors?

I like Alberto's discussion of the attractive maps about donors to Democratic presidential candidates, produced by the New York Times (direct link).

Here is the headline map:

Nyt_demdonormaps

The message is clear: Bernie Sanders is the only candidate with nationwide appeal. The breadth of his coverage is breathtaking. (I agree with Alberto's critique about the lack of a color scale. It's impossible to know if the counts are trivial or not.)

Bernie's coverage is so broad that his numbers overwhelm those of all other candidates except in their home bases (e.g. O'Rourke in Texas).

A remedy to this is to look at the data after removing Bernie's numbers.

Nyt_demdonormap_2

 

This pair of maps reminds me of the Sri Lanka religions map that I revisualized in this post.

Redo_srilankareligiondistricts_v2

The first two maps divide the districts into those in which one religion dominates and those in which multiple religions share the limelight. The third map then shows the second-rank religion in the mixed-religions districts.

The second map in the NYT's donor map series plots the second-rank candidate in all the precincts that Bernie Sanders led. It's like the designer pulled off the top layer (blue: Bernie) to reveal what's underneath.

Because only Bernie's data are removed, O'Rourke still dominates Texas, Buttigieg Indiana, and so on. An alternative is to pull off the top layer in those pockets as well; then we would likely see Bernie showing up in those areas.

The other startling observation is how small Joe Biden's presence is on these maps. This is likely because Biden relies primarily on big donors.

See here for the entire series of donor maps. See here for past discussion of New York Times's graphics.


What is a bad chart?

In the recent issue of Madolyn Smith’s Conversations with Data newsletter hosted by DataJournalism.com, she discusses “bad charts,” featuring submissions from several dataviz bloggers, including myself.

What is a “bad chart”? Based on this collection of curated "bad charts", it is not easy to nail down “bad-ness”. The common theme is the mismatch between the message intended by the designer and the message received by the reader, a classic error of communication. How such mismatch arises depends on the specific example. I am able to divide the “bad charts” into two groups: charts that are misinterpreted, and charts that are misleading.

 

Charts that are misinterpreted

The Causes of Death entry, submitted by Alberto Cairo, is a "well-designed" chart that requires "reading the story where it is inserted and the numerous caveats." So readers may misinterpret the chart if they do not also read the accompanying story at Our World in Data, which runs over 1,500 words not including the appendix.

Ourworldindata_causesofdeath

The map of Canada, submitted by Highsoft, highlights in green the provinces where the majority of residents are members of the First Nations. The “bad” is that readers may incorrectly “infer that a sizable part of the Canadian population is First Nations.”

Highsoft_CanadaFirstNations

In these two examples, the graphic is considered adequate and yet the reader fails to glean the message intended by the designer.

 

Charts that are misleading

Two fellow bloggers, Cole Knaflic and Jon Schwabish, offer the advice to start bars at zero (here's my take on this rule). The “bad” is the distortion introduced when encoding the data into the visual elements.

The Color-blindness pictogram, submitted by Severino Ribecca, commits a similar faux pas. To compare the rates among men and women, the pictograms should use the same baseline.

Colourblindness_pictogram

In these examples, readers who correctly read the charts nonetheless leave with the wrong message. (We assume the designer does not intend to distort the data.) The readers misinterpret the data without misinterpreting the graphics.

 

Using the Trifecta Checkup

In the Trifecta Checkup framework, these problems are second-level problems, represented by the green arrows linking up the three corners. (Click here to learn more about using the Trifecta Checkup.)

Trifectacheckup_img

The visual design of the Causes of Death chart is not under question, and the intended message of the author is clearly articulated in the text. Our concern is that the reader must go outside the graphic to learn the full message. This suggests a problem related to the syncing between the visual design and the message (the QV edge).

By contrast, in the Color Blindness graphic, the data are not under question, nor is the use of pictograms. Our concern is how the data got turned into figurines. This suggests a problem related to the syncing between the data and the visual (the DV edge).

***

When you complain about a misleading chart, or a chart being misinterpreted, what do you really mean? Is it a visual design problem? a data problem? Or is it a syncing problem between two components?


SCMP's fantastic infographic on Hong Kong protests

In the past month, there have been several large-scale protests in Hong Kong. The largest one featured up to two million residents taking to the streets on June 16 to oppose an extradition act that was working its way through the legislature. If the count was accurate, about 25 percent of the city’s population joined in the protest. Another large demonstration occurred on July 1, the anniversary of Hong Kong’s return to Chinese rule.

South China Morning Post, which can be considered the New York Times of Hong Kong, is well known for its award-winning infographics, and they rose to the occasion with this effort.

This is one of the rare infographics that you’d not regret spending time reading. After reading it, you have learned a few new things about protesting in Hong Kong.

In particular, you’ll learn that the recent demonstrations are part of a larger pattern in which Hong Kong residents express their dissatisfaction with the city’s governing class, frequently accused of acting as puppets of the Chinese state. Under the “one country, two systems” arrangement, the city’s officials occupy an unenviable position of mediating the various contradictions of the two systems.

This bar chart shows the growth in the protest movement. The recent massive protests didn't come out of nowhere. 

Scmp_protestsovertime

This line chart offers a possible explanation for the burgeoning protests: residents perceived their freedoms eroding over the last decade.

Scmp_freedomsurvey

If you have seen videos of the protests, you’ll have noticed the peculiar protest costumes. Umbrellas are used to block pepper sprays, for example. The following lovely graphic shows how the costumes have evolved:

Scmp_protestcostume

The scale of these protests captures the imagination. The last part in the infographic places the number of protestors in context, by expressing it in terms of football pitches (as soccer fields are known outside the U.S.) This is a sort of universal measure due to the popularity of football almost everywhere. (Nevertheless, according to Wikipedia, the fields do not have one fixed dimension even though fields used for international matches are standardized to 105 m by 68 m.)

Scmp_protestscale_pitches

This chart could be presented as a bar chart. It's just that the data have been re-scaled - from counting individuals to counting football pitches' worth of individuals.
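The re-scaling itself is nothing more than a division. Here is the arithmetic with an assumed crowd density; SCMP's own conversion factor is not stated here, so the density figure is purely illustrative.

```python
# Assumptions for illustration only -- SCMP's actual conversion factor is not given here.
pitch_area_m2    = 105 * 68          # standard international pitch, per the Wikipedia figure above
people_per_m2    = 2.5               # assumed crowd density
people_per_pitch = pitch_area_m2 * people_per_m2

protestors = 2_000_000               # the June 16 estimate quoted above
print(round(protestors / people_per_pitch, 1), "pitches' worth of people")   # ~112 pitches
```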

***

Here is the entire infographic.


Three estimates, two differences trip up an otherwise good design

Reader Fernando P. was baffled by this chart from the Perception Gap report by More in Common. (link to report)

Moreincommon_perceptiongap_republicans

Overall, this chart is quite good. Its flaws are subtle. There is so much going on that perhaps even the designer found it hard to keep track.

The title is "Democrat's Perception Gap" which actually means the gap between Democrats' perception of Republicans and Republican's self-reported views. We are talking about two estimates of Republican views. Conversely, in Figure 2 (not shown), the "Republican's Perception Gap" describes two estimates of Democrat views.

The gap is shown visually as the gray bar between the red dot and the blue dot. It is labeled perception gap, and its values are printed in the right column, also labeled perception gap.

Perhaps as an afterthought, the designer added the yellow stripes, which represent a third estimate of Republican views, this time by Independents. This little addition wreaks havoc: there are now three estimates and two gaps. There is a new gap between Independents' perception of Republican views and Republicans' self-reported views. This I-gap is hidden in plain sight, while the words "perception gap" obstinately stick to the D-gap.
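In data terms, the chart carries three estimates per issue and therefore two gaps. A small sketch with made-up percentages for a single issue makes the hidden I-gap explicit.

```python
# Made-up percentages for one issue, for illustration only.
republican_self   = 52   # Republicans' self-reported agreement
democrat_estimate = 28   # what Democrats think Republicans believe
indep_estimate    = 40   # what Independents think Republicans believe

d_gap = democrat_estimate - republican_self   # the labeled "perception gap"
i_gap = indep_estimate    - republican_self   # the gap hidden in plain sight

print(f"D-gap: {d_gap:+d} points, I-gap: {i_gap:+d} points")   # D-gap: -24 points, I-gap: -12 points
```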

***

Here is a slightly modified version of the same chart.

Redo_perceptiongap_republicans

 

The design focuses attention on the two gaps (bars). It also identifies the Republican self-perception as the anchor point from which the gaps are computed.

I have chosen to describe the Republican dot as "self-perception" rather than "actual view," which connotes a form of "truth." Rather than considering the gap as an error of estimation, I like to think of the gap as the difference between two groups of people asked to estimate a common quantity.

Also, one should note that on the last two issues, there is virtual agreement.

***

Aside from the visual, I have doubts about the value of such a study. Only the most divisive issues are addressed here. Adding a few bipartisan issues would provide controls useful for teasing out the baseline perception gap.

I wonder whether there is a self-selection in survey response, such that people with extreme views (from each party) will be under-represented. Further, do we believe that all survey respondents will provide truthful answers to sensitive questions that deal with racism, sexism, etc.? For example, if I am a moderate holding racist views, would I really admit to racism in a survey?