« November 2020 | Main | January 2021 »

Atypical time order and bubble labeling

This chart appeared in a Charles Schwab magazine in Summer, 2019.

Schwab_volatility2018

This bubble chart does not print any data labels. The bubbles take our attention but the designer realizes that the actual values of the volatility are not intuitive numbers. The same is true of any standard deviation numbers. If you're told SD of a data series is 3, it doesn't tell you much by itself.

I first transformed this chart into the equivalent column chart:

Junkcharts_redo_schwabvolatility_columnrank

Two problems surface on the axes.

For the time axis, the years are jumbled. Readers experience vertigo, as we try to figure out how to read the chart. Our expectation that time moves left to right is thwarted. This ordering also requires every single year label to be present.

For the vertical axis, I could have left out the numbers completely. They are not really meaningful. These represent the areas of the bubbles but only relative to how I measured them.

***

In the next version, I sorted time in the conventional manner. Following Tufte's classic advice, only the tops of the columns are plotted.

Junkcharts_redo_schwabvolatility_hashyear

What you see is that this ordering is much easier to comprehend. Figuring out that 2018 is an average year in terms of volatility is not any harder than in the original. In fact, we can reproduce the order of the previous chart just by letting our eyes sweep top to bottom.

To make it even easier to read the vertical axis, I converted the numbers into an index, with the average volatility as 100 (assigned to 0% on the chart) .

Junkcharts_redo_schwabvolatility_hashyearrelative

Now, you can see that 2018 is roughly at the average while 2008 is 400% above the average level. (How should we interpret this statement? That's a question I pose to my statistics students. It's not intuitive how one should interpret the statement that the standard deviation is 5 times higher.)

 

 


Happy holidays

A message of hope.

Junkcharts_2020_yearendmessage

In past years, I've featured pictures from great food from my travels. In this very different year, I'm showing some joyful creations from my kitchen.


Is this an example of good or bad dataviz?

This chart is giving me feelings:

Trump_mcconnell_chart

I first saw it on TV and then a reader submitted it.

Let's apply a Trifecta Checkup to the chart.

Starting at the Q corner, I can say the question it's addressing is clear and relevant. It's the relationship between Trump and McConnell's re-election. The designer's intended message comes through strongly - the chart offers evidence that McConnell owes his re-election to Trump.

Visually, the graphic has elements of great story-telling. It presents a simple (others might say, simplistic) view of the data - just the poll results of McConnell vs McGrath at various times, and the election result. It then flags key events, drawing the reader's attention to those. These events are selected based on key points on the timeline.

The chart includes wise design choices, such as no gridlines, infusing the legend into the chart title, no decimals (except for last pair of numbers, the intention of which I'm not getting), and leading with the key message.

I can nitpick a few things. Get rid of the vertical axis. Also, expand the scale so that the difference between 51%-40% and 58%-38% becomes more apparent. Space the time points in proportion to the dates. The box at the bottom is a confusing afterthought that reduces rather than assists the messaging.

But the designer got the key things right. The above suggestions do not alter the reader's expereince that much. It's a nice piece of visual story-telling, and from what I can see, has made a strong impact with the audience it is intended to influence.

_trifectacheckup_junkchartsThis chart is proof why the Trifecta Checkup has three corners, plus linkages between them. If we just evaluate what the visual is conveying, this chart is clearly above average.

***

In the D corner, we ask: what the Data are saying?

This is where the chart runs into several problems. Let's focus on the last two sets of numbers: 51%-40% and 58%-38%. Just add those numbers and do you notice something?

The last poll sums to 91%. This means that up to 10% of the likely voters responded "not sure" or some other candidate. If these "shy" voters show up at the polls as predicted by the pollsters, and if they voted just like the not shy voters, then the election result would have been 56%-44%, not 51%-40%. So, the 58%-38% result is within the margin of error of these polls. (If the "shy" voters break for McConnell in a 75%-25% split, then he gets 58% of the total votes.)

So, the data behind the line chart aren't suggesting that the election outcome is anomalous. This presents a problem with the Q-D and D-V green arrows as these pairs are not in sync.

***

In the D corner, we should consider the totality of the data available to the designer, not just what the designer chooses to utilize. The pivot of the chart is the flag annotating the "Trump robocall."

Here are some questions I'd ask the designer:

What else happened on October 31 in Kentucky?

What else happened on October 31, elsewhere in the country?

Was Trump featured in any other robocalls during the period portrayed?

How many robocalls were made by the campaign, and what other celebrities were featured?

Did any other campaign event or effort happen between the Trump robocall and election day?

Is there evidence that nothing else that happened after the robocall produced any value?

The chart commits the XYopia (i.e. X-Y myopia) fallacy of causal analysis. When the data analyst presents one cause and one effect, we are cued to think the cause explains the effect but in every scenario that is not a designed experiment, there are multiple causes at play. Sometimes, the more influential cause isn't the one shown in the chart.

***

Finally, let's draw out the connection between the last set of poll numbers and the election results. This shows why causal inference in observational data is such a beast.

Poll numbers are about a small number of people (500-1,000 in the case of Kentucky polls) who respond to polling. Election results are based on voters (> 2 million). An assumption made by the designer is that these polls are properly conducted, and their results are credible.

The chart above makes the claim that Trump's robocall gave McConnell 7% more votes than expected. This implies the robocall influenced at least 140,000 voters. Each such voter must fit the following criteria:

  • Was targeted by the Trump robocall
  • Was reached by the Trump robocall (phone was on, etc.)
  • Responded to the Trump robocall, by either picking up the phone or listening to the voice recording or dialing a call-back number
  • Did not previously intend to vote for McConnell
  • If reached by a pollster, would refuse to respond, or say not sure, or voting for McGrath or a third candidate
  • Had no other reason to change his/her behavior

Just take the first bullet for example. If we found a voter who switched to McConnell after October 31, and if this person was not on the robocall list, then this voter contributes to the unexpected gain in McConnell votes but weakens the case that the robocall influenced the election.

As analysts, our job is to find data to investigate all of the above. Some of these are easier to investigate. The campaign knows, for example, how many people were on the target list, and how many listened to the voice recording.

 

 

 

 


Aligning the visual and the data

The Washington Post reported a surge in donations to the Democrats after the death of Justice Ruth Ginsberg (link). A secondary effect, perhaps unexpected, was that donors decided to spread the money around; the proportion of donors who gave to six or more candidates jumped to 65%, where normally it is at 5%.

Wapo_donations

The text tells us what to look for, and the axis labels are commendably restrained. The color scheme is also intuitive.

There is something frustrating about this chart, though. It's that the spike is shown upside down. The level that the arrow points at is 45%, which is the total of the blue columns. The visual suggests the proportion of multiple beneficiaries (2 or more) should be 55%. There is a divergence between what the visual is saying and what the data are saying. Whichever number is correct, the required proportion is the inverse of the level shown on the percentage axis!

***

This is the same chart flipped over.

Junkcharts_redo_wapo_donations

Now, the number we need can be read off the vertical axis.

I also moved the color legend to the right side so that the entries can be printed vertically, in the same direction as the data. This is one of the unspoken rules of data visualization I featured in my feature for DataJournalism.com.

***

In the Trifecta Checkup (link), the issue is with the green arrow between the D corner and the V corner. The data and the visual are not in sync. 

 


Convincing charts showing containment measures work

The disorganized nature of U.S.'s response to the coronavirus pandemic has created a sort of natural experiment that allows data journalists to explore important scientific questions, such as the impact of containment measures on cases and hospitalizations. This New York Times article represents the best of such work.

The key finding of the analysis is beautifully captured by this set of scatter plots:

Policies_cases_hosp_static

Each dot is a state. The cases (left plot) and hospitalizations (right plot) are plotted against the severity of containment measures for November. The negative correlation is unmistakable: the more containment measures taken, the lower the counts.

There are a few features worth noting.

The severity index came from a group at Oxford, and is a number between 0 and 100. The journalists decided to leave out the numerical labels, instead simply showing More and Fewer. This significantly reduces processing time. Readers won't be able to understand the index values anyway without reading the manual.

The index values are doubly encoded. They are first encoded by the location on the horizontal axis and redundantly encoded on the blue-red scale. Ordinarily, I do not like redundant encoding because the reader might assume a third dimension exists. In this case, I had no trouble with it.

The easiest way to see the effect is to ignore the muddy middle and focus on the two ends of the severity index. Those states with the fewest measures - South Dakota, North Dakota, Iowa - are the worst in cases and hospitalizations while those states with the most measures - New York, Hawaii - are among the best. This comparison is similar to what is frequently done in scientific studies, e.g. when they say coffee is good for you, they typically compare heavy drinkers (4 or more cups a day) with non-drinkers, ignoring the moderate and light drinkers.

Notably, there is quite a bit of variability for any level of containment measures - roughly 50 cases per 100,000, and 25 hospitalizations per 100,000. This indicates that containment measures are not sufficient to explain the counts. For example, the hospitalization statistic is affected by the stock of hospital beds, which I assume differ by state.

Whenever we use a scatter plot, we run the risk of xyopia. This chart form invites readers to explain an outcome (y-axis values) using one explanatory variable (on x-axis). There is an assumption that all other variables are unimportant, which is usually false.

***

Because of the variability, the horizontal scale has meaningless precision. The next chart cures this by grouping the states into three categories: low, medium and high level of measures.

Cases_over_time_grouped_by_policies

This set of charts extends the time window back to March 1. For the designer, this creates a tricky problem - because states adapt their policies over time. As indicated in the subtitle, the grouping is based on the average severity index since March, rather than just November, as in the scatter plots above.

***

The interplay between policy and health indicators is captured by connected scatter plots, of which the Times article included a few examples. Here is what happened in New York:

NewYork_policies_vs_cases

Up until April, the policies were catching up with the cases. The policies tightened even after the case-per-capita started falling. Then, policies eased a little, and cases started to spike again.

The Note tells us that the containment severity index is time shifted to reflect a two-week lag in effect. So, the case count on May 1 is not paired with the containment severity index of May 1 but of April 15.

***

You can find the full article here.