Nov 23, 2020

The folks at FiveThirtyEight were excited about the following dataviz they published last week two weeks ago, illustrating the progression of vote-counting by state. (link) That was indeed the unique and confusing feature of the 2020 Presidential election in the States. For those outside the U.S., what happened (by and large) was that many Americans, skewing Biden supporters, voted by mail before Election Day but their votes were sometimes counted after the same-day votes were tallied.

A number of us kept staring at these charts, hoping for a how-to-read-it explanation. Here is a zoom-in for the state of Michigan:

To save you the trouble, here is how.

The key is to fight your urge to look at the brown area. I know, it's pretty hard to ignore the biggest areas of every chart. But try to make them disappear.

Focus on the top edge of the chart. This line gives the total number of votes counted so far. In Michigan, by hour 12, about 2.4 million votes were counted, and by hour 72, 2.8 million votes were on the book. This line gives the sum of the two major parties' vote totals [since third parties got negligible votes in this election, I'm ignoring them so as to simplify the discussion].

Next, look at the red and blue areas. These represent the gap in the number of votes between the two parties' current vote totals. If the area is red, Trump was leading; if blue, Biden was leading. Each color flip represents a lead change. Suppress the urge to interpret red as the number or share of Trump votes.

***

What have we learned about the vote counting in Michigan?

Counting significantly slowed after the 12th hour. Trump raced to a lead on Election Day, and around hour 20, the race was dead even, and after that, Biden overtook Trump and never looked back. Throughout most of this period, the vote lead was small compared to the total votes cast although at the end, the Biden lead was noticeable.

If you insist on interpreting the brown area, it is equal to twice the vote total of the second-place candidate, so it really isn't something you want to look at.

Just for contrast, here is the chart for Iowa:

Trump led from beginning to end, with his lead widening slightly as more votes were counted.

***

As I was stewing over this chart, a ominous thought overcame me. Would a streamgraph work for this data? You don't hear much about streamgraphs here because I rarely favor them (see this long-ago post) but let's just try one and see.

(These streamgraphs were made in R using the streamgraph package. Post-processing was applied to customize the labeling.)

This chart conveys all the key points listed before. You can see how the gap evolved over time, the lead flips, which candidate was in the lead, and the total mass of votes counted at different times. The gap is shown in the middle.

I can't say I'm completely happy with the streamgraph - I hope readers don't care about the numbers because it's hard to evaluate a difference when it's split two ways on either side of the middle axis!

***

Locating the political center

Oct 26, 2020

I mentioned the September special edition of Bloomberg Businessweek on the election in this prior post. Today, I'm featuring another data visualization from the magazine.

***

Here are the rightmost two charts.

Time runs from top to bottom, spanning four decades.

Each chart covers a political issue. These two charts concern abortion and marijuana.

The marijuana question (far right) has only two answers, legalize or don't legalize. The underlying data measure the proportions of people agreeing to each point of view. Roughly three-quarters of the population disagreed with legalization in 1980 while two-thirds agree with it in 2020.

Notice that there are no horizontal axis labels. This is a great editorial decision. Only coarse trends are of interest here. It's not hard to figure out the relative proportions. Adding labels would just clutter up the display.

By contrast, the abortion question has three answer choices. The middle option is "Sometimes," which is represented by a white color, with a dot pattern. This is an issue on which public opinion in aggregate has barely shifted over time.

The charts are organized in a small-multiples format. It's likely that readers are consuming each chart individually.

***

What about the dashed line that splits each chart in half? Why is it there?

The vertical line assists our perception of the proportions. Think of it as a single gridline.

In fact, this line is underplayed. The headline of the article is "tracking the political center." Where is the center?

Until now, we've paid attention to the boundaries between the differently colored areas. But those boundaries do not locate the political center!

The vertical dashed line is the political center; it represents the view of the median American. In 1980, the line sat inside the gray section, meaning the median American opposed legalizing marijuana. But the prevalent view was losing support over time and by 2010, there wer more Americans wanting to legalize marijuana than not. This is when the vertical line crossed into the green zone.

The following charts draw attention to the middle line, instead of the color boundaries:

On these charts, as you glance down the middle line, you can see that for abortion, the political center has never exited the middle category while for marijuana, the median American didn't want to legalize it until an inflection point was reached around 2010.

I highlight these inflection points with yellow dots.

***

The effect on readers is entirely changed. The original charts draw attention to the areas first while the new charts pull your eyes to the vertical line.

Visualizing change over time: case study via Arstechnica

Oct 22, 2020

ArsTechnica published the following chart in its article titled "Grim new analyses spotlight just how hard the U.S. is failing in  pandemic" (link).

In a Trifecta Checkup, I'd give the Q corner high marks. The question is clear: how has the U.S. performed relative to other countries? In particular, the chart gives a nuanced answer to this question. The designer realizes that there are phases in the pandemic, so the same question is asked three times: how has the U.S. performed relative to other countries since June, since May, and since the start of the pandemic?

In the D corner, this chart also deserves a high score. It selects a reasonable measure of mortality, which is deaths per population. It simplifies cognition by creating three grades of mortality rates per 100,000. Grade A is below 5 deaths, Grade B, between 5 and 25, and Grade C is above 25.

A small deduction for not including the source of the data (the article states it's from a JAMA article). If any reader notices problems with the underlying data or calculations, please leave a comment.

***

So far so good. And yet, you might feel like I'm over-praising a chart that feels distinctly average. Not terrible, not great.

The reason for our ambivalence is the V corner. This is what I call a Type V chart. The visual design isn't doing justice to the underlying question and data analysis.

The grouped bar chart isn't effective here because the orange bars dominate our vision. It's easy to see how each country performed over the course of the pandemic but it's hard to learn how countries compare to each other in different periods.

How are the countries ordered? It would seem like the orange bars may be the sorting variable but this interpretation fails in the third group of countries.

The designer apparently made the decision to place the U.S. at the bottom (i.e. the worst of the league table). As I will show later, this is justified but the argument cannot be justified by the orange bars alone. The U.S. is worse in both the blue and purple bars but not the orange.

This points out that there is interest in the change in rates (or ranks) over time. And in the following makeover, I used the Bumps chart as the basis, as its chief use is in showing how ranking changes over time.

Better clarity can often be gained by subtraction:

Why you should expunge the defaults from Excel or (insert your favorite graphing program)

Sep 25, 2020

Yesterday, I posted the following chart in the post about Cornell's Covid-19 case rate after re-opening for in-person instruction.

This is an edited version of the chart used in Peter Frazier's presentation.

The original chart carries with it the burden of Excel defaults.

What did I change and why?

I switched away from the default color scheme, which ignores the relationships between the two lines. In particular, the key comparison on this chart should be the actual case rate versus the nominal case rate. In addition, the three lines at the top are related as they all come from the same underlying mathematical model. I used the same color but different shades.

Also, instead of placing the legend as far away from the data labels as possible, I moved the line labels next to the data labels.

Instead of daily date labels, I moved to weekly labels, and set the month names on a separate level than the day names.

The dots were removed from the top three lines but I'd have retained them, perhaps with some level of transparency, if I spent more time making the edits. I'd definitely keep the last dot to make it clear that the blue lines contain one extra dot.

***

Every graphing program has defaults, typically computed by some algorithm tuned to the average chart. Don't settle for the average chart. Get rid of any default setting that slows down understanding.

A testing mess: one chart, four numbers, four colors, three titles, wrong units, wrong lengths, wrong data

Aug 25, 2020

Twitterstan wanted to vote the following infographic off the island:

(The publisher's website is here but I can't find a direct link to this graphic.)

The mishap is particularly galling given the controversy swirling around this year's A-Level results in the U.K. For U.S. readers, you can think of A-Levels as SAT Subject Tests, which in the U.K. are required of all university applicants, and represent the most important, if not the sole, determinant of admissions decisions. Please see the upcoming post on my book blog for coverage of the brouhaha surrounding the statistical adjustments (to be posted sometime this week, it's here.).

The first issue you may notice about the chart is that the bar lengths have no relationship with the numbers printed on them. Here is a scatter plot correlating the bar lengths and the data.

As you can see, nothing.

Then, you may wonder what the numbers mean. The annotation at the bottom right says "Average number of A level qualifications per student". Wow, the British (in this case, English) education system is a genius factory - with the average student mastering close to three thousand subjects in secondary (high) school!

TES is the cool name for what used to be the Times Educational Supplement. I traced the data back to Ofqual, which is the British regulator for these examinations. This is the Ofqual version of the above chart:

The data match. You may see that the header of the data table reads "Number of students in England getting 3 x A*". This is a completely different metric than number of qualifications - in fact, this metric measures geniuses. "A*" is the U.K. equivalent of "A+". When I studied under the British system, there was no such grade. I guess grade inflation is happening all over the world. What used to be A is now A+, and what used to be B is now A. Scoring three A*s is tops - I wonder if this should say 3 or more because I recall that you can take as many subjects as you desire but most students max out at three (may have been four).

The number of students attaining the highest achievement has increased in the last two years compared to the two years before. We can't interpret these data unless we know if the number of students also grew at similar rates.

The units are students while the units we expect from the TES graphic should be subjects. The cutoff for the data defines top students while the TES graphic should connote minimum qualification, i.e. a passing grade.

***
Now, the next section of the Ofqual infographic resolves the mystery. Here is the chart:

This dataset has the right units and measurement. There is almost no meaningful shift in the last four years. The average number of qualifications per student is only different at the second decimal place. Replacing the original data with this set removes the confusion.

While I was re-making this chart, I also cleaned out the headers and sub-headers. This is an example of software hegemony: the designer wouldn't have repeated the same information three times on a chart with four numbers if s/he wasn't prompted by software defaults.

***

The corrected chart violates one of the conventions I described in my tutorial for DataJournalism.com: color difference should reflect data difference.

In the following side-by-side comparison, you see that the use of multiple colors on the left chart signals different data - note especially the top and bottom bars which carry the same number, but our expectation is frustrated.

***

[P.S. 8/25/2020. Dan V. pointed out another problem with these bar charts: the bars were truncated so that the bar lengths are not proportional to the data. The corrected chart is shown on the right below:

Deaths as percent neither of cases nor of population. Deaths as percent of normal.

Aug 18, 2020

Yesterday, I posted a note about excess deaths on the book blog (link). The post was inspired by a nice data visualization by the New York Times (link). This is a great example of data journalism.

Excess deaths is a superior metric for measuring the effect of Covid-19 on public health. It's better than deaths as percent of cases. Also better than percent of the population.What excess deaths measure is deaths as a percent of normal. Normal is usually defined as the average deaths in the respective week in years past.

The red areas indicate how far the deaths in the Southern states are above normal. The highest peak, registered in Texas in late July, is 60 percent above the normal level.

***

The best way to appreciate the effort that went into this graphic is to imagine receiving the outputs from the model that computes excess deaths. A three-column spreadsheet with columns "state", "week number" and "estimated excess deaths".

The first issue is unequal population sizes. More populous states of course have higher death tolls. Transforming death tolls to an index pegged to the normal level solves this problem. To produce this index, we divide actual deaths by the normal level of deaths. So the spreadsheet must be augmented by two additional columns, showing the historical average deaths and actual deaths for each state for each week. Then, the excess death index can be computed.

The journalist builds a story around the migration of the coronavirus between different regions as it rages across different states  during different weeks. To this end, the designer first divides the dataset into four regions (South, West, Midwest and Northeast). Within each region, the states must be ordered. For each state, the week of peak excess deaths is identified, and the peak index is used to sort the states.

The graphic utilizes a small-multiples framework. Time occupies the horizontal axis, by convention. The vertical axis is compressed so that the states are not too distant. For the same reason, the component graphs are allowed to overlap vertically. The benefit of the tight arrangement is clearer for the Northeast as those peaks are particularly tall. The space-saving appearance reminds me of sparklines, championed by Ed Tufte.

There is one small tricky problem. In most of June, Texas suffered at least 50 percent more deaths than normal. The severity of this excess death toll is shortchanged by the low vertical height of each component graph. What forced such congestion is probably the data from the Northeast. For example, New York City:

New York City's death toll was almost 8 times the normal level at the start of the epidemic in the U.S. If the same vertical scale is maintained across the four regions, then the Northeastern states dwarf all else.

***

One key takeaway from the graphic for the Southern states is the persistence of the red areas. In each state, for almost every week of the entire pandemic period, actual deaths have exceeded the normal level. This is strong indication that the coronavirus is not under control.

In fact, I'd like to see a second set of plots showing the cumulative excess deaths since March. The weekly graphic is better for identifying the ebb and flow while the cumulative graphic takes measure of the total impact of Covid-19.

***

The above description leaves out a huge chunk of work related to computing excess deaths. I assumed the designer receives these estimates from a data scientist. See the related post in which I explain how excess deaths are estimated from statistical models.

Aug 12, 2020

A reader and colleague Georgette A was frustrated with the following graphic that appeared in the otherwise commendable article in National Geographic (link). The NatGeo article provides a history lesson on past pandemics that killed millions.

What does the design want to convey to readers?

Our attention is drawn to the larger objects, the red triangle on the left or the green triangle on the right. Regarding the red triangle, we learn that the base is the duration of the pandemic while the height of the black bar represents the total deaths.

An immediate curiosity is why a green triangle is lodged in the middle of the red triangle. Answering this question requires figuring out the horizontal layout. Where we expect axis labels we find an unexpected series of numbers (0, 16, 48, 5, 2, 4, ...). These are durations that measure the widths of the triangular bases.

To solve this puzzle, imagine the chart with the triangles removed, leaving just the black columns. Now replace the durations with index numbers, 1 to 13, corresponding to the time order of the ending years of these epidemics. In other words, there is a time axis hidden behind the chart. [As Ken reminded me on Twitter, I forgot to mention that details of each pandemic are revealed by hovering over each triangle.]

This explains why the green triangle (Antonine Plague) is sitting inside the large red triangle (Plague of Justinian). The latter's duration is 3 times that of the former, and the Antonine Plague ended before the Plague of Justinian. In fact, the Antonine occurred during 165-180 while the Justinian happened during 541-588. The overlap is an invention of the design. To receive what the design gives, we have to think of time as a sequence, not of dates.

***

Now, compare the first and second red triangles. Their black columns both encode 50 million deaths. The Justinian Plague however was spread out over 48 years while the Black Death lasted just 5 years. This suggests that the Black Death was more fearsome than the Justinian Plague. And yet, the graphic presents the opposite imagery.

This is a pretty tough dataset to visualize. Here is a side-by-side bar chart that lets readers first compare deaths, and then compare durations.

In the meantime, I highly recommend the NatGeo article.

How many details to include in a chart

Aug 04, 2020

This graphic by Bloomberg provides the context for understanding the severity of the Atlantic storm season. (link)

At this point of the season, 2020 appears to be one of the most severe in history.

I was momentarily fascinated by a feature of modern browser-based data visualization: the death of the aspect ratio. When the browser window is stretched sufficiently wide, the chart above is transformed to this look:

The chart designer has lost control of the aspect ratio.

***

This Bloomberg chart is an example of the spaghetti-style plots that convey variability by displaying individual units of data (here, storm years). The envelope of the growth curves gives the range of historical counts while the density of curves roughly offers some sense of the most likely counts at different points of the season.

But these spaghetti-style plots are not precise at conveying the variability because the density is hard to gauge. That's where aggregating the individual units helps.

The following chart does not show individual storm years. It shows the counts for the median season at selected points in time, and also a band of variability (for example, you'd include say 90 or 95% of the seasons).

I don't have the raw data so the aggregating is done by eyeballing the spaghetti.

I prefer this presentation even though it does not plot every single data point one has in the dataset.

Everything in Texas is big, but not this BIG

Jul 27, 2020

The chart shows the recent explosive growth in deaths due to Covid-19 in Texas. John flagged this graphic as yet another example in which the data are encoded to the lengths of the squares, not their areas.

Fixing this chart just requires fixing the length of one side of the square. I also flipped it to make a conventional column chart.

The final product:

An important qualification lurks in the footnote; it is directly applied to the label of July.

How much visual distortion is created when data are encoded to the lengths and not the areas? The following chart shows what readers see, assuming they correctly perceive the areas of those squares. The value for March is held the same as above while the other months show the death counts implied by the relative areas of the squares.

Owing to squaring, the smaller counts are artificially compressed while the big numbers are massively exaggerated.

On data volume, reliability, uncertainty and confidence bands

Jul 22, 2020

This chart from the Economist caught my eye because of the unusual use of color-coded hexagonal tiles.

The basic design of the chart is easy to grasp: It relates people's "happiness" to national wealth. The thick black line shows that the average citizen of wealthier countries tends to rate their current life situation better.

For readers alert to graphical details, things can get a little confusing. The horizontal "wealth" axis is shown in log scale, which means that the data on the right side of the chart have been compressed while the data on the left side of the chart have been stretched out. In other words, the curve in linear scale is much flatter than depicted.

One thing you might notice is how poor the fit of the line is at both ends. Singapore and Afghanistan are clearly not explained by the fitted line. (That said, the line is based on many more dots than those eight we can see.) Moreover, because countries are widely spread out on the high end of the wealth axis, the fit is not impressive. Log scales tend to give a false impression of the tightness of fit, as I explained before when discussing coronavirus case curves.

***

The hexagonal tiles replace the more typical dot scatter or contour shading. The raw data consist of results from polls conducted in different countries in different years. For each poll, the analyst computes the average life satisfaction score for that country in that year. From national statistics, the analyst pulls out that country's GDP per capita in that year. Thus, each data point is a dot on the canvass. A few data points are shown as black dots. Those are for eight highlighted countries for the year 2018.

The black line is fitted to the underlying dot scatter and summarizes the correlation between average wealth and average life satisfaction. Instead of showing the scatter, this Economist design aggregates nearby dots into hexagons. The deepest red hexagon, sandwiched between Finland and the US, contains about 60-70 dots, according to the color legend.

These details are tough to take in. It's not clear which dots have been collected into that hexagon: are they all Finland or the U.S. in various years, or do they include other countries? Each country is represented by multiple dots, one for each poll year. It's also not clear how much variation there exists within a country across years.

***

The hexagonal tiles presumably serve the same role as a dot scatter or contour shading. They convey the amount of data supporting the fitted curve along its trajectory. More data confers more reliability.

For this chart, the hexagonal tiles do not add any value. The deepest red regions are those closest to the black line so nothing is actually lost by showing just the line and not the tiles.

Using the line chart obviates the need for readers to figure out the hexagons, the polls, the aggregation, and the inevitable unanswered questions.

***

An alternative concept is to show the "confidence band" or "error bar" around the black line. These bars display the uncertainty of the data. The wider the band, the less certain the analyst is of the estimate. Typically, the band expands near the edges where we have less data.

Here is conceptually what we should see (I don't have the underlying dataset so can't compute the confidence band precisely)

The confidence band picture is the mirror image of the hexagonal tiles. Where the poll density is high, the confidence band narrows, and where poll density is low, the band expands.

A simple way to interpret the confidence band is to find the country's wealth on the horizontal axis, and look at the range of life satisfaction rating for that value of wealth. Now pick any number between the range, and imagine that you've just conducted a survey and computed the average rating. That number you picked is a possible survey result, and thus a valid value. (For those who know some probability, you should pick a number not at random within the range but in accordance with a Bell curve, meaning picking a number closer to the fitted line with much higher probability than a number at either edge.)

Visualizing data involves a series of choices. For this dataset, one such choice is displaying data density or uncertainty or neither.