Is this an example of good or bad dataviz?

This chart is giving me feelings:

Trump_mcconnell_chart

I first saw it on TV and then a reader submitted it.

Let's apply a Trifecta Checkup to the chart.

Starting at the Q corner, I can say the question it's addressing is clear and relevant. It's the relationship between Trump and McConnell's re-election. The designer's intended message comes through strongly - the chart offers evidence that McConnell owes his re-election to Trump.

Visually, the graphic has elements of great story-telling. It presents a simple (others might say, simplistic) view of the data - just the poll results of McConnell vs McGrath at various times, and the election result. It then flags key events, drawing the reader's attention to those. These events are selected based on key points on the timeline.

The chart includes wise design choices, such as no gridlines, infusing the legend into the chart title, no decimals (except for the last pair of numbers, the purpose of which I'm not getting), and leading with the key message.

I can nitpick a few things. Get rid of the vertical axis. Also, expand the scale so that the difference between 51%-40% and 58%-38% becomes more apparent. Space the time points in proportion to the dates. The box at the bottom is a confusing afterthought that reduces rather than assists the messaging.

But the designer got the key things right. The above suggestions do not alter the reader's experience that much. It's a nice piece of visual story-telling, and from what I can see, has made a strong impact with the audience it is intended to influence.

_trifectacheckup_junkcharts

This chart is proof why the Trifecta Checkup has three corners, plus linkages between them. If we just evaluate what the visual is conveying, this chart is clearly above average.

***

In the D corner, we ask: what are the Data saying?

This is where the chart runs into several problems. Let's focus on the last two sets of numbers: 51%-40% and 58%-38%. Just add those numbers and do you notice something?

The last poll sums to 91%. This means that up to 10% of the likely voters responded "not sure" or some other candidate. If these "shy" voters show up at the polls as predicted by the pollsters, and if they voted just like the not shy voters, then the election result would have been 56%-44%, not 51%-40%. So, the 58%-38% result is within the margin of error of these polls. (If the "shy" voters break for McConnell in a 75%-25% split, then he gets 58% of the total votes.)
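The reallocation arithmetic above can be checked with a short sketch. It assumes a two-way race (third-party votes are ignored), and the function name `reallocate` is mine, purely for illustration.

```python
# Last poll before the election, in percentage points (from the chart).
mcconnell_poll, mcgrath_poll = 51, 40

decided = mcconnell_poll + mcgrath_poll   # 91
undecided = 100 - decided                 # 9

def reallocate(share_to_mcconnell):
    """Split the undecided points between the two candidates."""
    mcconnell = mcconnell_poll + undecided * share_to_mcconnell
    mcgrath = mcgrath_poll + undecided * (1 - share_to_mcconnell)
    return mcconnell, mcgrath

# Scenario 1: undecided voters break the same way as the decided ones.
proportional = reallocate(mcconnell_poll / decided)

# Scenario 2: undecided voters break 75%-25% for McConnell.
lopsided = reallocate(0.75)

print([round(x) for x in proportional])  # [56, 44]
print([round(x) for x in lopsided])      # [58, 42]
```

Neither scenario requires anything unusual to have happened between the last poll and election day.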

So, the data behind the line chart aren't suggesting that the election outcome is anomalous. This presents a problem with the Q-D and D-V green arrows as these pairs are not in sync.

***

In the D corner, we should consider the totality of the data available to the designer, not just what the designer chooses to utilize. The pivot of the chart is the flag annotating the "Trump robocall."

Here are some questions I'd ask the designer:

What else happened on October 31 in Kentucky?

What else happened on October 31, elsewhere in the country?

Was Trump featured in any other robocalls during the period portrayed?

How many robocalls were made by the campaign, and what other celebrities were featured?

Did any other campaign event or effort happen between the Trump robocall and election day?

Is there evidence that nothing else that happened after the robocall produced any value?

The chart commits the XYopia (i.e. X-Y myopia) fallacy of causal analysis. When the data analyst presents one cause and one effect, we are cued to think the cause explains the effect but in every scenario that is not a designed experiment, there are multiple causes at play. Sometimes, the more influential cause isn't the one shown in the chart.

***

Finally, let's draw out the connection between the last set of poll numbers and the election results. This shows why causal inference in observational data is such a beast.

Poll numbers are about a small number of people (500-1,000 in the case of Kentucky polls) who respond to polling. Election results are based on voters (> 2 million). An assumption made by the designer is that these polls are properly conducted, and their results are credible.

The chart above makes the claim that Trump's robocall gave McConnell 7% more votes than expected. This implies the robocall influenced at least 140,000 voters. Each such voter must fit the following criteria:

  • Was targeted by the Trump robocall
  • Was reached by the Trump robocall (phone was on, etc.)
  • Responded to the Trump robocall, by either picking up the phone or listening to the voice recording or dialing a call-back number
  • Did not previously intend to vote for McConnell
  • If reached by a pollster, would refuse to respond, or say not sure, or voting for McGrath or a third candidate
  • Had no other reason to change his/her behavior

Just take the first bullet for example. If we found a voter who switched to McConnell after October 31, and if this person was not on the robocall list, then this voter contributes to the unexpected gain in McConnell votes but weakens the case that the robocall influenced the election.

As analysts, our job is to find data to investigate all of the above. Some of these are easier to investigate. The campaign knows, for example, how many people were on the target list, and how many listened to the voice recording.
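For reference, the 140,000 figure quoted earlier is just the claimed gain applied to the electorate. A minimal sketch, assuming a round 2 million votes (the post only says "> 2 million", so the true floor is somewhat higher):

```python
# Assumptions: the chart's claimed 7-point gain, and a turnout rounded
# down to 2 million votes.
claimed_gain_points = 7
total_votes = 2_000_000

# Integer arithmetic keeps the result exact.
min_voters_influenced = total_votes * claimed_gain_points // 100
print(min_voters_influenced)  # 140000
```

Every one of those voters has to pass all six criteria in the list, so the number of people targeted by the robocall would have to be far larger than this.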


Aligning the visual and the data

The Washington Post reported a surge in donations to the Democrats after the death of Justice Ruth Bader Ginsburg (link). A secondary effect, perhaps unexpected, was that donors decided to spread the money around; the proportion of donors who gave to six or more candidates jumped to 65%, where normally it is at 5%.

Wapo_donations

The text tells us what to look for, and the axis labels are commendably restrained. The color scheme is also intuitive.

There is something frustrating about this chart, though. It's that the spike is shown upside down. The level that the arrow points at is 45%, which is the total of the blue columns. The visual suggests the proportion of multiple beneficiaries (2 or more) should be 55%. There is a divergence between what the visual is saying and what the data are saying. Whichever number is correct, the required proportion is the inverse of the level shown on the percentage axis!

***

This is the same chart flipped over.

Junkcharts_redo_wapo_donations

Now, the number we need can be read off the vertical axis.

I also moved the color legend to the right side so that the entries can be printed vertically, in the same direction as the data. This is one of the unspoken rules of data visualization I covered in my feature for DataJournalism.com.

***

In the Trifecta Checkup (link), the issue is with the green arrow between the D corner and the V corner. The data and the visual are not in sync. 

 


Why you should expunge the defaults from Excel or (insert your favorite graphing program)

Yesterday, I posted the following chart in the post about Cornell's Covid-19 case rate after re-opening for in-person instruction.

Redo_junkchats_fraziercornellreopeningsuccess2

This is an edited version of the chart used in Peter Frazier's presentation.

Pfrazier_cornellreopeningupdate

The original chart carries with it the burden of Excel defaults.

What did I change and why?

I switched away from the default color scheme, which ignores the relationships between the two lines. In particular, the key comparison on this chart should be the actual case rate versus the nominal case rate. In addition, the three lines at the top are related as they all come from the same underlying mathematical model. I used the same color but different shades.

Also, instead of placing the legend as far away from the data labels as possible, I moved the line labels next to the data labels.

Instead of daily date labels, I moved to weekly labels, and set the month names on a separate level from the day numbers.

The dots were removed from the top three lines but I'd have retained them, perhaps with some level of transparency, if I spent more time making the edits. I'd definitely keep the last dot to make it clear that the blue lines contain one extra dot.

***

Every graphing program has defaults, typically computed by some algorithm tuned to the average chart. Don't settle for the average chart. Get rid of any default setting that slows down understanding.

 

 


This chart shows why the PR agency for the UK government deserves a Covid-19 bonus

The Economist illustrated some interesting consumer research with this chart (link):

Economist_covidpoll

The survey by Dalia Research asked people about their satisfaction with their country's response to the coronavirus crisis. The results are reduced to the "Top 2 Boxes", the proportion of people who rated their government response as "very well" or "somewhat well".

This dimension is laid out along the horizontal axis. The chart is a combo dot and bubble chart, arranged in rows by region of the world. Now what does the bubble size indicate?

It took me a while to find the legend as I was expecting it either in the header or the footer of the graphic. A larger bubble depicts a higher cumulative number of deaths up to June 15, 2020.

The key issue is the correlation between a country's death count and the people's evaluation of the government response.

Bivariate correlation is typically shown on a scatter plot. The following chart sets out the scatter plots in a small multiples format with each panel displaying a region of the world.

Redo_economistcovidpolling_scatter

The death tolls in the Asian countries are low relative to the other regions, and yet the people's ratings vary widely. In particular, the Japanese people are pretty hard on their government.

In Europe, the people of Greece, Netherlands and Germany think highly of their government responses, which have suppressed deaths. The French, Spaniards and Italians are understandably unhappy. The British appear to be the most forgiving of their government, despite suffering a higher death toll than France, Spain or Italy. This speaks well of their PR operation.

Cumulative deaths should be adjusted by population size for a proper comparison across nations. When the same graphic is produced using deaths per million (shown on the right below), the general story is preserved while the pattern is clarified:

Redo_economistcovidpolling_deathspermillion_2

The right chart shows deaths per million while the left chart shows total deaths.
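The population adjustment itself is a one-line calculation. Here is a minimal sketch; the two countries and their figures are hypothetical, chosen only to show how equal death counts can translate into very different rates.

```python
def deaths_per_million(deaths, population):
    """Cumulative deaths scaled to a population of one million."""
    return deaths * 1_000_000 / population

# Hypothetical figures for illustration only (not from the survey or the chart).
countries = {
    "Country A": {"deaths": 30_000, "population": 60_000_000},
    "Country B": {"deaths": 30_000, "population": 10_000_000},
}

rates = {name: deaths_per_million(**c) for name, c in countries.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.0f} deaths per million")
```

Same death toll, but Country B's rate is six times higher. That is why bubble sizes based on raw counts can mislead when countries differ a lot in population.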

***

In the original Economist chart, what catches our attention first is the bubble size. Eventually, we notice the horizontal positioning of these bubbles. But the star of this chart ought to be the new survey data. I swapped those variables and obtained the following graphic:

Redo_economistcovidpolling_swappedvar

Instead of using bubble size, I switched to using color to illustrate the deaths-per-million metric. If ratings of the pandemic response correlate tightly with deaths per million, then we expect the color of these dots to evolve from blue on the left side to red on the right side.

The peculiar loss of correlation in the U.K. stands out. Their PR firm deserves a bonus!


What is the price for objectivity

I knew I had to remake this chart.

TMC_hospitalizations

The simple message of this chart is hidden behind layers of visual complexity. What the analyst wants readers to focus on (as discerned from the text on the right) is the red line, the seven-day moving average of new hospital admissions due to Covid-19 in Texas.

My eyes kept wandering away from the line. It's the sideways data labels on the columns. It's the columns that take up vastly more space than the red line. It's the sideways date labels on the horizontal axis. It's the redundant axis labels for hospitalizations when the entire data set has already been printed. It's the two hanging diamonds, for which the clues are filed away in the legend above.

Here's a version that brings out the message: after Phase 2 re-opening, the number of hospital admissions has been rising steadily.

Redo_junkcharts_texas_covidhospitaladmissions_1

Dots are used in place of columns, which pushes these details to the background. The line as well as periods of re-opening are directly labeled, removing the need for a legend.

Here's another visualization:

Redo_junkcharts_texas_covidhospitaladmissions_2

This chart plots the weekly average of new hospital admissions, instead of the seven-day moving average. In the previous chart, the raggedness of the moving average isn't transmitting any useful information to the average reader. I believe this weekly average metric is easier to grasp for many readers while retaining the general story.
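To make the difference between the two metrics concrete, here is a minimal sketch. The daily numbers are hypothetical (not the TMC data), and I assume a trailing seven-day window; a centered window is another legitimate choice.

```python
def moving_average(values, window=7):
    """Trailing moving average: one value per day once a full window exists."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

def weekly_average(values, days_per_week=7):
    """Average of each non-overlapping seven-day block: one value per week."""
    full_weeks = len(values) // days_per_week
    return [
        sum(values[w * days_per_week : (w + 1) * days_per_week]) / days_per_week
        for w in range(full_weeks)
    ]

# Hypothetical daily admissions for three weeks.
daily = [10, 12, 9, 14, 11, 13, 15,
         16, 14, 18, 17, 20, 19, 22,
         21, 25, 23, 26, 24, 28, 28]

smooth = moving_average(daily)
weekly = weekly_average(daily)

print(len(smooth), len(weekly))  # 15 3
print(weekly)                    # [12.0, 18.0, 25.0]
```

With a trailing window, the weekly averages are simply every seventh point of the moving average. The weekly version throws away the in-between wiggles, which is the point.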

***

On the original chart by TMC, the author said "the daily hospitalization trend shows an objective view of how COVID-19 impacts hospital systems." Objectivity is an impossible standard for any kind of data analysis or visualization. As seen above, the two metrics for measuring the trend in hospitalizations have pros and cons. Even if one insists on using a moving average, there are choices of averaging methods and window sizes.

Scientists are trained to believe in objectivity. It frequently disappoints when we discover that the rest of the world harbors no such notion. If you observe debates between politicians or businesspeople or social scientists, you rarely hear anyone claim one analysis is more objective - or less subjective - than another. The economist who predicts the Dow will reach a new record, the business manager who argues for placing discounted products in the front not the back of the store, the sportscaster who maintains Messi is a better player than Ronaldo: do you ever hear these people describe their methods as objective?

Pursuing objectivity leads to the glorification of data dumps. The scientist proclaims disinterest in holding an opinion about the data. This is self-deception though. We clearly have opinions because when someone else "misinterprets" the data, we express dismay. What is the point of pretending to hold no opinions when most of the world trades in opinions? By being "objective," we never shape the conversation, and forever play defense.


This exercise plan for your lock-down work-out is inspired by Venn

A Twitter follower did not appreciate this chart from Nature showing the combinations of flu-like symptoms that people reported to a UK tracking app.

Nature tracking app venn diagram

It's a super-complicated Venn diagram. I have written about this type of chart before (see here); it appears to be somewhat popular in the medicine/biology field.

A Venn diagram is not a data visualization because it doesn't plot the data.

Notice that the different compartments of the Venn diagram do not have data encoded in the areas. 

The chart also fails the self-sufficiency test because if you remove the data from it, you end up with a data container - like a world map showing country boundaries and no data.

If you're new here: if a graphic requires the entire dataset to be printed on it for comprehension, then the visual elements of the graphic are not doing any work. The graphic cannot stand on its own.

When the Venn diagram gets complicated, teeming with many compartments, there will be quite a few empty compartments. If I have to make this chart, I'd be nervous about leaving out a number or two by accident. An empty cell can be truly empty or an oversight.

Another trap is that the total doesn't add up. The numbers on this graphic add to 1,764 whereas the study population in the preprint was 1,702. Interestingly, this diagram doesn't show up in the research paper. Given how they winnowed down the study population from all the app downloads, I'm sure there is an innocent explanation as to why those two numbers don't match.

***

The chart also strains the reader. Take the number 18, right in the middle. What combination of symptoms did these 18 people experience? You have to figure out the layers sitting beneath the number. You see dark blue, light blue, orange. If you blink, you might miss the gray at the bottom. Then you have to flip your eyes up to the legend to map these colors to diarrhoea, shortness of breath, anosmia, and fatigue. Oops, I missed the yellow, which is the cough. To be sure, you look at the remaining categories to see where they stand - I've named all of them except fever. The number 18 lies outside fever so this compartment represents everything except fever. 

What's even sadder is there is not much gain from having done it once. Try to interpret the number 50 now. Maybe I'm just slow but it doesn't get better the second or third time around. This graphic not only requires work but painstaking work!

Perhaps a more likely question is how many people who had a loss of smell also had fever. Now it's pretty easy to locate the part of the dark gray oval that overlaps with the orange oval. But now, I have to add all those numbers, 69+17+23+50+17+46 = 222. That's not enough. Next, I must find the total of all the numbers inside the orange oval, which is 222 plus what is inside the orange and outside the dark gray. That turns out to be 829. So among those who had lost smell, the proportion who also had fever is 222/(222+829) = 21 percent. 
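For those who want to check the sums, here is the same calculation as a short sketch. The numbers are the ones quoted above, as read off the diagram; the variable names are mine.

```python
# Compartments inside both the orange (loss of smell) and dark gray (fever) ovals.
smell_loss_and_fever = [69, 17, 23, 50, 17, 46]

# Total inside the orange oval but outside the dark gray one.
smell_loss_no_fever = 829

both = sum(smell_loss_and_fever)
all_smell_loss = both + smell_loss_no_fever
share = both / all_smell_loss

print(both, all_smell_loss, f"{share:.0%}")  # 222 1051 21%
```

Answering a single two-symptom question means adding up six compartments by hand, then many more for the denominator. The visual is not doing that work for the reader.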

How many people had three or more symptoms? I'll let you figure this one out!



Make your color legend better with one simple rule

The pie chart about COVID-19 worries illustrates why we should follow a basic rule of constructing color legends: order the categories in the way you expect readers to encounter them.

Here is the chart that I discussed the other day, with the data removed since they are not of concern in this post. (link)

Junkcharts_abccovidbiggestworries_sufficiency

First look at the pie chart. Like me, you probably looked at the orange or the yellow slice first, then moved clockwise around the pie.

Notice that the legend leads with the red square ("Getting It"), which is likely the last item you'll see on the chart.

This is the same chart with the legend re-ordered:

Redo_junkcharts_abcbiggestcovidworries_legend

***

Simple charts can be made better if we follow basic rules of construction. When used frequently, these rules can be made silent. I cover rules for legends as well as many other rules in this Long Read article titled "The Unspoken Conventions of Data Visualization" (link).


Bad data leave chart hanging by the thread

IGNITE National put out a press release saying that Gen Z white men are different from all other race-gender groups because they are more likely to be or lean Republican. The evidence is in this chart:

Genz_survey

Or is it?

Following our Trifecta Checkup framework (link), let's first look at the data. White men are the bottom left group. Democratic = 42%, Independent = 28%, Republican = 48%. That's a total of 118%. Unfortunately, this chart construction error erases the message. We don't know which of the three columns was incorrectly sized, or perhaps the data were incorrectly weighted so that the error is spread out between the three columns.

But the story of the graphic is hanging by the thread - the gap between Democratic and Republican lean amongst white men is 6 percent, which is smaller than the data error of 18 percent (the excess over 100%). I sent them a tweet asking for a correction. Will post the corrected version if they respond.

Update: The thread didn't break. They replied quickly and issued the following corrected chart:

Genz_corrected

Now, the data for white men are: Democratic = 35%, Independent = 22%, Republican = 40%. That is roughly a 7% shift for each party affiliation, so they may have just started the baseline at the wrong level when inverting the columns.
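A quick sanity check of this kind is easy to automate before a chart goes out. Here is a minimal sketch using the white men figures from the two versions of the chart; the helper name `total` is mine.

```python
original = {"Democratic": 42, "Independent": 28, "Republican": 48}
corrected = {"Democratic": 35, "Independent": 22, "Republican": 40}

def total(shares):
    """Sum of the percentage shares across party affiliations."""
    return sum(shares.values())

# How far each column moved between the two versions.
shifts = {party: original[party] - corrected[party] for party in original}

print(total(original))   # 118
print(total(corrected))  # 97
print(shifts)            # {'Democratic': 7, 'Independent': 6, 'Republican': 8}
```

The corrected columns sum to 97%, which is plausible if a few respondents fall outside the three categories or if the figures are rounded; the release doesn't say which.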

***

The Visual design also has some problems. I am not a fan of inverting columns. In fact, column inversion may be the root of the error above.

Genz_whitemen

Let me zoom in on the white men columns. (see right)

Without looking at the legend, can you guess which color is Democratic, Independent or Republican? Go ahead and take your best guess.

For me, I think red is Republican (by convention), then white is Independent (a neutral color) which means yellow is Democratic.

Here is the legend:

Genz-legend

So I got the yellow and white reversed. And that is another problem with the visual design. For a chart that shows two-party politics in the U.S., there is really no good reason to deviate from the red-blue convention. The color for Independents doesn't matter since it would be understood that the third color would represent them.

If the red-blue convention were followed, readers would not need to consult the legend.

***

In my Long Read article at DataJournalism.com, I included an "unspoken rule" about color selection: use the natural color mapping whenever possible. Go here to read about this and other rules.

The chart breaks another one of the unspoken conventions. When making a legend, place it near the top of the chart. Readers need to know the color mapping before they can understand the chart.

In addition, you want the reader's eyes to read the legend in the same way they read the columns. The columns go left to right from Democratic to Independent to Republican. The legend should do the same!

***

Here is a quick re-do that fixes the visual issues (except the data error). It's an Excel chart but it doesn't have to be bad.

Redo_genzsurvey

 


Gazing at petals

Reader Murphy pointed me to the following infographic developed by Altmetric to explain their analytics of citations of journal papers. These metrics are alternative in that they arise from non-academic media sources, such as news outlets, blogs, twitter, and reddit.

The key graphic is the petal diagram with a number in the middle.

Altmetric_tetanus

I have a hard time thinking of this object as “data visualization”. Data visualization should visualize the data. Here, the connection between the data and the visual design is tenuous.

There are eight petals arranged around the circle. The legend below the diagram maps the color of each petal to a source of data. Red, for example, represents mentions in news outlets, and green represents mentions in videos.

Each petal is the same size, even though the counts given below differ. So, the petals are like a duplicative legend.

The order of the colors around the circle does not align with their order in the table below, for a mysterious reason.

Then comes another puzzle. The bluish-gray petal appears three times in the diagram. This color is mapped to tweets. Does the number of petals represent the much higher counts of tweets compared to other mentions?

To confirm, I pulled up the graphic for a different paper.

Altmetric_worldwidedeclineofentomofauna

Here, each petal has a different color. Eight petals, eight colors. The count of tweets is still much larger than the frequencies of the other sources. So, the rule of construction appears to be one petal for each relevant data source, and if the total number of data sources falls below eight, then let Twitter claim all the unclaimed petals.

A third sample paper confirms this rule:

Altmetric_dnananodevices

None of the places we were hoping to find data – size of petals, color of petals, number of petals – actually contain any data. Anything the reader wants to learn can be directly read. The “score” that reflects the aggregate “importance” of the corresponding paper is found at the center of the circle. The legend provides the raw data.

***

Some years ago, one of my NYU students worked on a project relating to paper citations. He eventually presented the work at a conference. I featured it previously.

Michaelbales_citationimpact

Notice how the visual design provides context for interpretation – by placing each paper/researcher among its peers, and by using a relative scale (percentiles).

***

I’m ignoring the D corner of the Trifecta Checkup in this post. For any visualization to be meaningful, the data must be meaningful. The type of counting used by Altmetric treats every tweet, every mention, etc. as a tally, making everything worth the same. A mention on CNN counts as much as a mention by a pseudonymous redditor. A pan is the same as a rave. Let’s not forget the fake data menace (link), which affects all performance metrics.


Taking small steps to bring out the message

Happy new year! Good luck and best wishes!

***

We'll start 2020 with something lighter. On a recent flight, I saw a chart in The Economist that shows the proportion of operating income derived from overseas markets by major grocery chains - the headline said that some of these chains are withdrawing from international markets.

Econ_internationalgroceries_sm

The designer used one color for each grocery chain, and two shades within each color. The legend describes the shades as "total" and "of which: overseas". As with all stacked bar charts, it's a bit confusing where to find the data. The "total" is actually the entire bar, not just the darker shaded part. The darker shaded part is better labeled "home market" as shown below:

Redo_econgroceriesintl_1

The designer's instinct to bring out the importance of international markets to each company's income is well placed. A second small edit helps: plot the international income amounts first, so they line up with the vertical zero axis. Like this:

Redo_econgroceriesintl_2

This is essentially the same chart. The order of international and home market is reversed. I also reversed the shading, so that the international share of income is displayed darker. This shading draws the readers' attention to the key message of the chart.

A stacked bar chart of the absolute dollar amounts is not ideal for showing proportions, because each bar is a different length. Sometimes, plotting relative values summing to 100% for each company may work better.
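Converting absolute amounts to relative values is straightforward. Here is a minimal sketch; the two chains and their income figures are hypothetical, not The Economist's data.

```python
# Hypothetical operating income in $ billions, for illustration only.
income = {
    "Chain A": {"international": 5.0, "home": 15.0},
    "Chain B": {"international": 1.0, "home": 3.0},
}

def to_shares(parts):
    """Rescale a company's components so that they sum to 100%."""
    total = sum(parts.values())
    return {name: 100 * value / total for name, value in parts.items()}

shares = {chain: to_shares(parts) for chain, parts in income.items()}
for chain, s in shares.items():
    print(chain, s)
```

Chain A is five times the size of Chain B, yet both derive a quarter of their income from overseas. On a chart of absolute amounts, that equality is hard to see; on a 100% chart, it is the first thing the reader notices.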

As it stands, the chart above calls attention to a different message: that Walmart dwarfs the other three global chains. Just the international income of Walmart is larger than the total income of Costco.

***

Please comment below or write me directly if you have ideas for this blog as we enter a new decade. What do you want to see more of? less of?