I made a streamgraph

The folks at FiveThirtyEight were excited about the following dataviz they published two weeks ago, illustrating the progression of vote-counting by state. (link) That was indeed the unique and confusing feature of the 2020 Presidential election in the States. For those outside the U.S., what happened (by and large) was that many Americans, skewing toward Biden supporters, voted by mail before Election Day but their votes were sometimes counted after the same-day votes were tallied.

 

538_votetalliesovertimemap

A number of us kept staring at these charts, hoping for a how-to-read-it explanation. Here is a zoom-in for the state of Michigan:

538leadchanges_michigan

To save you the trouble, here is how.

The key is to fight your urge to look at the brown area. I know, it's pretty hard to ignore the biggest areas of every chart. But try to make them disappear.

Focus on the top edge of the chart. This line gives the total number of votes counted so far. In Michigan, by hour 12, about 2.4 million votes were counted, and by hour 72, 2.8 million votes were on the book. This line gives the sum of the two major parties' vote totals [since third parties got negligible votes in this election, I'm ignoring them so as to simplify the discussion].

Next, look at the red and blue areas. These represent the gap in the number of votes between the two parties' current vote totals. If the area is red, Trump was leading; if blue, Biden was leading. Each color flip represents a lead change. Suppress the urge to interpret red as the number or share of Trump votes.

***

What have we learned about the vote counting in Michigan?

Counting significantly slowed after the 12th hour. Trump raced to a lead on Election Day, and around hour 20, the race was dead even, and after that, Biden overtook Trump and never looked back. Throughout most of this period, the vote lead was small compared to the total votes cast although at the end, the Biden lead was noticeable.

If you insist on interpreting the brown area, it is equal to twice the vote total of the second-place candidate, so it really isn't something you want to look at.
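The reading rules above boil down to a little arithmetic. Here is a minimal Python sketch of it; the function name and the vote counts are made up for illustration and are not actual Michigan data.

```python
def decompose(trump, biden):
    """Split two running vote totals into the pieces drawn on the chart."""
    total = trump + biden        # top edge of the chart
    gap = abs(trump - biden)     # the red or blue area
    brown = total - gap          # what is left: twice the trailing candidate's total
    if trump > biden:
        leader = "Trump"         # area would be colored red
    elif biden > trump:
        leader = "Biden"         # area would be colored blue
    else:
        leader = "tie"
    return {"total": total, "gap": gap, "brown": brown, "leader": leader}


# Hypothetical snapshot, not real data
snapshot = decompose(trump=1_300_000, biden=1_100_000)
print(snapshot)
```

In this made-up snapshot, the brown area works out to 2.2 million, which is exactly twice the trailing candidate's 1.1 million.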

Just for contrast, here is the chart for Iowa:

538leadchanges_iowa

Trump led from beginning to end, with his lead widening slightly as more votes were counted.

***

As I was stewing over this chart, an ominous thought overcame me. Would a streamgraph work for this data? You don't hear much about streamgraphs here because I rarely favor them (see this long-ago post) but let's just try one and see.

Junkcharts_redo_538leadchange_mi_ia

(These streamgraphs were made in R using the streamgraph package. Post-processing was applied to customize the labeling.)

This chart conveys all the key points listed before. You can see how the gap evolved over time, the lead flips, which candidate was in the lead, and the total mass of votes counted at different times. The gap is shown in the middle.

I can't say I'm completely happy with the streamgraph - I hope readers don't care about the numbers because it's hard to evaluate a difference when it's split two ways on either side of the middle axis!
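To see why the gap ends up split across the middle axis, here is a small Python sketch of a centered stack. This is not the R code behind the charts above. The layer order (trailing candidate's votes, then the gap, then the matching votes) is an assumption made purely for illustration, as are the numbers.

```python
def centered_bands(trump, biden):
    """Return (lower, upper) edges of three bands stacked symmetrically around zero.

    Assumed layer order, bottom to top: matched votes, gap, matched votes.
    """
    trailing = min(trump, biden)
    gap = abs(trump - biden)
    total = trump + biden
    bottom = -total / 2                      # symmetric baseline
    edges = [bottom,
             bottom + trailing,
             bottom + trailing + gap,
             bottom + total]
    return list(zip(edges[:-1], edges[1:]))


# Hypothetical counts (in thousands)
bands = centered_bands(trump=1300, biden=1100)
print(bands)   # the middle band is the gap
```

With these made-up counts, the gap of 200 occupies the band from -100 to +100, so a reader has to add the two halves to recover the difference.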

***

If you come up with a better idea, make sure to leave a comment.

 

 

 

 


Podcast highlights

Recently, I made a podcast for Ryan Ray, which you can access here. The link sends you to a 14-day free trial to his newsletter, which is where he publishes his podcasts.

Kaiserfung_warroommedia

Ryan contacted me after he read my book Numbers Rule Your World (link). I was happy to learn that he enjoyed the stories, and during the podcast, he gave an example of how he applied the statistical concepts to other situations.

During the podcast, you will hear:

  • I have a line in my course syllabus that reads "after you take this class, you will not be able to look at numbers (in the media) with a straight face ever again." That's a goal of mine. And it also applies to my books.

  • Why are most statisticians skeptics

  • Figuring out the statistical conclusions is the easy part while the hardest challenge is to find a way to communicate them to a non-technical audience. I went through many drafts before I landed on the precise language used in those stories.

  • Why "correlation is not causation" is not useful practical advice
  • You can't unsee something you've already seen, and this creates hindsight bias
  • The biggest bang for the buck when improving statistical models is improving data quality

  • Some models, such as polls and election forecasts, can be thought of as thermometers measuring the mood of the respondents at the time of polling.

***

To hear the podcast, visit Ryan Ray's website.


Using comparison to enrich a visual story

Just found this beauty deep in my submission pile (from Howie H.):

Iwillvote_texas

What's great about this pie chart is the story it's trying to tell. Almost half of the electorate did not vote in Texas in the 2016 Presidential election. The designer successfully draws my attention to the white sector that makes the point.

There are a few problems.

Showing two decimals is too much precision.

The purple sector is not labeled.

The white area seems exaggerated. The four sectors do not appear to meet at the center of the circle. The distortion is not too much but it's schizophrenic: the pie slices are drawn with low precision while the data labels have high precision.

***

The following fixes those problems, and also adds a second chart to contrast the two ways of thinking:

Redo_junkcharts_iwillvotecomtexas


Locating the political center

I mentioned the September special edition of Bloomberg Businessweek on the election in this prior post. Today, I'm featuring another data visualization from the magazine.

Bloomberg_politicalcenter_print_sm

***

Here are the rightmost two charts.

Bloomberg_politicalcenter_rightside

Time runs from top to bottom, spanning four decades.

Each chart covers a political issue. These two charts concern abortion and marijuana.

The marijuana question (far right) has only two answers, legalize or don't legalize. The underlying data measure the proportions of people agreeing to each point of view. Roughly three-quarters of the population disagreed with legalization in 1980 while two-thirds agree with it in 2020.

Notice that there are no horizontal axis labels. This is a great editorial decision. Only coarse trends are of interest here. It's not hard to figure out the relative proportions. Adding labels would just clutter up the display.

By contrast, the abortion question has three answer choices. The middle option is "Sometimes," which is represented by a white color, with a dot pattern. This is an issue on which public opinion in aggregate has barely shifted over time.

The charts are organized in a small-multiples format. It's likely that readers are consuming each chart individually.

***

What about the dashed line that splits each chart in half? Why is it there?

The vertical line assists our perception of the proportions. Think of it as a single gridline.

In fact, this line is underplayed. The headline of the article is "tracking the political center." Where is the center?

Until now, we've paid attention to the boundaries between the differently colored areas. But those boundaries do not locate the political center!

The vertical dashed line is the political center; it represents the view of the median American. In 1980, the line sat inside the gray section, meaning the median American opposed legalizing marijuana. But the prevalent view was losing support over time and by 2010, there were more Americans wanting to legalize marijuana than not. This is when the vertical line crossed into the green zone.
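Finding the median American amounts to walking across the answer categories until the cumulative share reaches 50 percent. A minimal Python sketch, using the rough marijuana proportions quoted earlier (three-quarters against in 1980, two-thirds in favor in 2020); the function name is hypothetical, and an exact 50-50 split is arbitrarily assigned to the left-hand category.

```python
def median_category(shares):
    """shares: ordered list of (label, proportion), left to right across the chart.

    Returns the label of the category that contains the 50 percent mark.
    """
    cumulative = 0.0
    for label, share in shares:
        cumulative += share
        if cumulative >= 0.5:
            return label
    raise ValueError("shares must sum to at least 0.5")


center_1980 = median_category([("don't legalize", 0.75), ("legalize", 0.25)])
center_2020 = median_category([("don't legalize", 1 / 3), ("legalize", 2 / 3)])
print(center_1980)   # don't legalize
print(center_2020)   # legalize
```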

The following charts draw attention to the middle line, instead of the color boundaries:

Junkcharts_redo_bloombergpoliticalcenterrightside

On these charts, as you glance down the middle line, you can see that for abortion, the political center has never exited the middle category while for marijuana, the median American didn't want to legalize it until an inflection point was reached around 2010.

I highlight these inflection points with yellow dots.

***

The effect on readers is entirely changed. The original charts draw attention to the areas first while the new charts pull your eyes to the vertical line.

 


Bloomberg made me digest these graphics slowly

Ask the experts to name the success metric of good data visualization, and you will receive a dozen answers. The field doesn't have an all-encompassing metric. A useful reference is Andrew Gelman and Antony Unwin (2012) in which they discussed the tradeoff between beautiful and informative, which derives from the familiar tension between art and science.

For a while now, I've been intrigued by metrics that measure "effort". Some years ago, I described the concept of a "return on effort" in this post. Such a metric can be constructed like the dominating financial metric of return on investment. The investment here is an investment of time, of attention. I strongly believe that if the consumer judges a data visualization to be compelling, engaging or well constructed, s/he will expend energy to devour it.

Imagine grub you discard after the first bite, compared to the delicious food experienced slowly, savoring every last bit.

Bloomberg_ambridge_sm

I'm writing this post while enjoying the September issue of Bloomberg Businessweek, which focuses on the upcoming U.S. Presidential election. There are various graphics infused into the pages of the magazine. Many of these graphics operate at a level of complexity above what typically shows up in magazines, and yet I spent energy learning to understand them. This response, I believe, is what visual designers should aim for.

***

Today, I discuss one example of these graphics, shown on the right. You might be shocked by the throwback style of these graphics. They look like they arrived from decades ago!

Grayscale, simple forms, typewriter font, all caps. Have I gone crazy?

The article argues that a town like Ambridge in Beaver County, Pennsylvania may be pivotal in the November election. The set of graphics provides relevant data to understand this argument.

It's evidence that data visualization does not need whiz-bang modern wizardry to excel.

Let me focus on the boxy charts from the top of the column. These:

Bloomberg_ambridge_topboxes

These charts solve a headache with voting margin data in the U.S.  We have two dominant political parties so in any given election, the vote share data split into three buckets: Democratic, Republican, and a catch-all category that includes third parties, write-ins, and none of the above. The third category rarely exceeds 5 percent.  A generic pie chart representation looks like this:

Redo_junkcharts_bloombergambridgebox_pies

Stacked bars have this look:

Redo_junkcharts_bloombergambridgebox_bars

In using my Trifecta framework (link), the top point is articulating the question. The primary issue here is the voting margin between the winner and the runner-up, who is the loser in what is typically a two-horse race. There exist two sub-questions: the vote-share difference between the top two finishers, and the share of vote effectively removed from the pot by the remaining candidates.

Now, take another look at the unusual chart form used by Bloomberg:

Bloomberg_ambridge_topboxes1

The catch-all vote share sits at the bottom while the two major parties split up the top section. This design demonstrates a keen understanding of the context. Consider the typical outcome, in which the top two finishers are from the two major parties. When answering the first sub-question, we can choose the raw vote shares, or the normalized vote shares. Normalizing shifts the base from all candidates to the top two candidates.

The Bloomberg chart addresses both scales. The normalized vote shares can be read directly by focusing only on the top section. In an even two-horse race, the top section is split in half - this holds true regardless of the size of the bottom section.
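The normalization is a one-line calculation. Here is a minimal Python sketch with made-up vote shares (none of these numbers come from the Bloomberg charts), showing that an even race reads as 50-50 whatever the size of the catch-all category.

```python
def normalized_shares(dem, rep):
    """Re-base the two major-party shares on their combined total."""
    top_two = dem + rep
    return dem / top_two, rep / top_two


# Hypothetical even races with different catch-all shares
results = []
for other in (0.02, 0.05, 0.10):
    dem = rep = (1 - other) / 2          # raw shares, base = all candidates
    results.append(normalized_shares(dem, rep))
    print(other, round(dem, 3), results[-1])
```

The raw shares shrink as the catch-all bucket grows, while the normalized shares stay put.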

This is a simple chart that packs a punch.

 


Election visuals 4: the snake pit is the best election graphic ever

This is the final post in the series on the data visualizations deployed by FiveThirtyEight to explain their election forecasting model. The previous posts are here, here and here.

I'm saving the best for last.

538_snakepit

This snake-pit chart brings me great joy - I wish I had come up with it!

This chart wins by focusing on a limited set of questions, and doing so excellently. Like many election observers, we understand that the U.S. presidential election will turn on so-called "swing states," and the candidates' strength in these swing states is variable, as the name suggests. Thus, we like to know which states are in play, and within these states, which ones are most unpredictable.

This chart lines up all the states from the reddest of red up top to the bluest of blue at the bottom. Each state is ranked by the voting margin predicted by 538's election forecasting model. The swing states are found in the middle.

Since each state confers a fixed number of electoral votes, and a candidate must amass 270 to win, there is a "tipping" state. In the diagram above, it's Pennsylvania. This pivotal state is neatly foregrounded as the one crossing the line in the middle.
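The logic of the tipping state can be sketched in a few lines of Python. This is only an illustration of the idea, not 538's method; the five states, their electoral votes and margins are invented, and the threshold is scaled down to match (for the real map, `needed` would be 270).

```python
def tipping_state(states, needed):
    """states: list of (name, electoral_votes, margin); margin > 0 favors candidate A.

    Walk from A's strongest state to A's weakest and return the state in which
    A's running electoral-vote total first reaches `needed`.
    """
    running = 0
    for name, votes, _margin in sorted(states, key=lambda s: s[2], reverse=True):
        running += votes
        if running >= needed:
            return name
    return None   # A cannot reach the threshold


# Hypothetical mini-map: 25 electoral votes in total, 13 needed to win
toy_states = [
    ("A-land", 8, 20.0),
    ("B-land", 3, 6.0),
    ("C-land", 5, 1.0),
    ("D-land", 4, -3.0),
    ("E-land", 5, -15.0),
]
tipping = tipping_state(toy_states, needed=13)
print(tipping)   # C-land
```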

The lengths of the segments correspond to the number of electoral votes and so do not change with the data. What changes is the sequencing of the segments, and the color shading.

This data visualization is a gem of visual story-telling. The form lends itself to a story.

***

The snake-pit chart succeeds by not doing too much. There are many items that the chart does not directly communicate.

The exact number of electoral votes by state is not explicit, nor is it easy to compare the lengths of bending segments. The color scale for conveying the predicted voting margins is crude, and it's not clear what the difference is between a deep color and a light color. It's also challenging to learn the electoral vote split; the actual winning margin is not even stated.

The reality is the average reader doesn't care. I got everything I wanted from the chart, and I ain't got the time to explore every state.

There is a hover-over effect that reveals some of the additional information:

538_snakepitchart_detail

One can keep going on. I have no idea how the 40,000 scenarios presented in the other graphics in this series have been reduced to the forecast shown in the inset. But again, those omissions did not lessen my enjoyment. The point is: let your graphics breathe.

***

I'm thinking of potential variations even though I'm fully satisfied with this effort.

I wonder if the color shading should be reversed. The light shading encodes a smaller voting margin, which indicates a tighter race. But our attention is typically drawn first to the darker shades. If the shading scheme is reversed, the color should be described as how tight the race is.

I also wonder if a third color (purple) should be introduced. Doing so would require the editors to make judgment calls on which set of states are swing states.

One strange thing about election day is the specific sequence of when TV stations (!) call the state results, which not only correlates with voting margin but also with time zones. I wonder if the time zone information can be worked into the sequencing of segments.

Let me know what you think of these ideas, or leave your own ideas, in the comments below.

***

I already praised this graphic when it first came out in 2016. (link)

A key improvement is tilting the chart, which avoids vertical state labels.

The previous post was written around election day 2016. The snake pit further cements its status as a story-telling device. As states are called, they are taken out of the picture. So it works very well as a dynamic chart on election day.

I'm nominating this snake-pit chart as the best election graphic ever. Kudos to the FiveThirtyEight team.


Election visual 3: a strange, mash-up visualization

Continuing our review of FiveThirtyEight's election forecasting model visualization (link), I now look at their headline data visualization. (The previous posts in this series are here, and here.)

538_topchartofmaps

It's a set of 22 maps, each showing one election scenario, with one candidate winning. What chart form is this?

Small multiples may come to mind. A small-multiples chart is a grid in which every component graphic has the same form - same chart type, same color scheme, same scale, etc. The only variation from graphic to graphic is the data. The data are typically varied along a dimension of interest, for example, age groups, geographic regions, years. The following small-multiples chart, which I praised in the past (link), shows liquor consumption across the world.

image from junkcharts.typepad.com

Each component graphic changes according to the data specific to a country. When we scan across the grid, we draw conclusions about country-to-country variations. By convention, there are as many graphics as there are countries in the dataset. Sometimes, the designer includes only countries that are directly relevant to the chart's topic.

***

What is the variable FiveThirtyEight chose to vary from map to map? It's the scenario used in the election forecasting model.

This choice is unconventional. The 22 scenarios are a subset of the 40,000 scenarios from the simulation - we are left wondering how those 22 are chosen.

Returning to our question: what chart form is this?

Perhaps you're reminded of the dot plot from the previous post. On that dot plot, the designer summarized the results of 40,000 scenarios using 100 dots. Since Biden is the winner in 75 percent of all scenarios, the dot plot shows 75 blue dots (and 25 red).

The map is the new dot. The 75 blue dots become 16 blue maps (rounded down) while the 25 red dots become 6 red maps.

Is it a pictogram of maps? If we ignore the details on the maps, and focus on the counts of colors, then yes. It's just a bit challenging because of the hole in the middle, and the atypical number of maps.

As with the dot plot, the map details are a nice touch. It connects readers with the simulation model which can feel very abstract.

Oddly, if you're someone familiar with probabilities, this presentation is quite confusing.

With 40,000 scenarios reduced to 22 maps, each map should represent 1818 scenarios. On the dot plot, each dot should represent 400 scenarios. This follows the rule for creating pictograms. Each object in a pictogram - dot, map, figurine, etc. - should encode an equal amount of the data. For the 538 visualization, is it true that each of the six red maps represents 1818 scenarios? This may be the case but not likely.
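The pictogram arithmetic is easy to verify. A minimal Python sketch using the numbers in this post (40,000 scenarios, 100 dots, 22 maps, a 75 percent chance for Biden); the function names are mine.

```python
import math


def scenarios_per_icon(n_scenarios, n_icons):
    """How many scenarios each dot or map stands for under the equal-share rule."""
    return n_scenarios / n_icons


def icon_split(n_icons, win_share):
    """Number of (blue, red) icons, rounding the blue count down."""
    blue = math.floor(n_icons * win_share)
    return blue, n_icons - blue


print(scenarios_per_icon(40_000, 100))         # 400.0 scenarios per dot
print(round(scenarios_per_icon(40_000, 22)))   # 1818 scenarios per map
print(icon_split(100, 0.75))                   # (75, 25)
print(icon_split(22, 0.75))                    # (16, 6)
```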

Recall the dot plot where the most extreme red dot shows a scenario in which Trump wins 376 out of 538 electoral votes (margin = 214). Each dot should represent 400 scenarios. The visualization implies that there are 400 scenarios similar to the one on display. For the grid of maps, the following red map from the top left corner should, in theory, represent 1,818 similar scenarios. Could be, but I'm not sure.

538_electoralvotemap_topleft

Mathematically, each of the depicted scenarios, including the blowout win above, occurs with 1/40,000 chance in the simulation. However, one expects few scenarios that look like the extreme scenario, and ample scenarios that look like the median scenario.

So, the right way to read the 538 chart is to ignore the map details when reading the embedded pictogram, and then look at the small multiples of detailed maps bearing in mind that extreme scenarios are unique while median scenarios have many lookalikes.

(Come to think about it, the analogous situation in the liquor consumption chart is the relative population size of different countries. When comparing country to country, we tend to forget that the data apply to large numbers of people in populous countries, and small numbers in tiny countries.)

***

There's a small improvement that can be made to the detailed maps. As I compare one map to the next, I'm trying to pick out which states have flipped to produce the change in vote margin. Conceptually, the number of states painted red should decrease as the winning margin decreases, and the states that shift colors should be the toss-up states.

So I'd draw the solid Republican (Democratic) states with a lighter shade, forming an easily identifiable bloc on all maps, while the toss-up states are shown with a heavier shade.

Redo_junkcharts_538electoralmap_shading

Here, I just added a darker shade to the states that disappear from the first red map to the second.


Election visuals 2: informative and playful

In yesterday's post, I reviewed one section of 538's visualization of its election forecasting model, specifically the probability plot.

The visualization, technically called a pdf (probability density function), is a mainstay of statistical graphics. While every one of 40,000 scenarios shows up on this chart, it doesn't offer a direct answer to our topline question. What is Nate's call at this point in time? Elsewhere in their post, we learn that the 538 model currently gives Biden a 75% chance of winning, thrice Trump's.

538_pdf_pair

In graphical terms, the area to the right of the 270-line is three times the size of the left area (on the bottom chart). That's not apparent in the pdf representation. Addressing this, statisticians may convert the pdf into a cdf, which depicts the cumulative area as we sweep from the left to the right along the horizontal axis.  

The cdf visualization rarely leaves the pages of a scientific journal because it's not easy for a novice to understand. Not least because the relevant probability is 1 minus the cumulative probability. The cdf for the bottom chart will show 25% at the 270-line while the chance of Biden winning is 1 - 25% = 75%.

The cdf presentation is also wasteful for the election scenario. No one cares about any threshold other than the 270 votes needed to win, but the standard cdf shows every possible threshold.
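To make the "1 minus the cumulative probability" step concrete, here is a minimal Python sketch of an empirical cdf. The eight electoral-vote outcomes are invented stand-ins for the 40,000 simulated scenarios, chosen so that the answer matches the 75-25 split.

```python
from bisect import bisect_right


def empirical_cdf(outcomes, threshold):
    """Share of simulated outcomes at or below `threshold`."""
    ordered = sorted(outcomes)
    return bisect_right(ordered, threshold) / len(ordered)


# Hypothetical mini-simulation: Biden's electoral votes in 8 scenarios
biden_ev = [210, 255, 270, 290, 305, 320, 350, 400]

lose_prob = empirical_cdf(biden_ev, 269)   # cumulative share short of the 270-line
win_prob = 1 - lose_prob
print(lose_prob, win_prob)                 # 0.25 0.75
```

The function can be evaluated at any threshold, which is exactly the wastefulness noted above: only the value at the 270-line matters.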

The second graphical concept in the 538 post (link) is an attempt to solve this problem.

538_dotplot

If you drop all the dots to an imaginary horizontal baseline, the above dotplot looks like this:

Redo_junkcharts_538electionforecast_dotplot_1

There is a recent trend toward centering dots to produce symmetry. It's actually harder to perceive the differences in heights of the band.

The secret sauce is to put down 100 dots, with a 75-25 blue-red split that conveys the 75% chance of a Biden win. Imposing the pdf line from the other visualization, I find that the density of dots roughly mimics the probability of outcomes.

Redo_junkcharts_538electionforecast_dotplot_2

It's easier to estimate the blue vs red areas using those dots than the lines.

The dots are stuffed toys. Clicking on each dot reveals a map showing one of the 40,000 scenarios. It displays which candidate wins which state. For example, the most extreme example of a Trump win is:

538_dotplot_redextreme

Here is a scenario of a razor-tight election won by Trump:

538_dotplot_redmiddle

This presentation has a weakness as well. It gives the impression that each of the dots is equally important because they are the same size. In reality, the importance of each dot is proportional to the height of the band. Since the band is generally wider near the middle, the dots near the middle are more likely scenarios than the dots shown on the two edges.

On balance, I like this visualization that is both informative and playful.

As before, what strikes me about the simulation result is the flatness of the probability surface. This feature is obscured when we summarize the result as 75% chance of a Biden victory.


Election visuals: three views of FiveThirtyEight's probabilistic forecasts

As anyone who is familiar with Nate Silver's forecasting of U.S. presidential elections knows, he runs a simulation that explores the space of possible scenarios. The polls that provide a baseline forecast make certain assumptions, such as who's a likely voter. Nate's model unshackles these assumptions from the polling data, exploring how the outcomes vary as these assumptions shift.

In the most recent simulation, his computer explores 40,000 scenarios, each of which predicts a split of the electoral vote, from which the winner of the election can be determined. The model's outcome is usually summarized by a winning probability, which is just the proportion of scenarios under which one candidate wins.
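As a minimal Python sketch of that summary step: count the scenarios in which the candidate reaches a majority of the electoral votes, and divide by the number of scenarios. The scenario list below is fabricated to give a round answer; it does not come from 538's model.

```python
def winning_probability(ev_by_scenario, total_votes=538):
    """Proportion of scenarios in which the candidate wins an outright majority."""
    needed = total_votes // 2 + 1          # 270 of 538
    wins = sum(1 for ev in ev_by_scenario if ev >= needed)
    return wins / len(ev_by_scenario)


# Hypothetical 40,000 scenarios: 30,000 wins and 10,000 losses
scenarios = [300] * 30_000 + [250] * 10_000
print(winning_probability(scenarios))      # 0.75
```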

This type of forecasting was responsible for the infamous meltdown in 2016 when most of these models - Nate's being an exception - issued extremely confident predictions that Hillary Clinton wins with 95% or higher probability. Essentially, the probability distribution collapses to a point. This is analogous to an extremely narrow confidence band, indicating almost zero uncertainty about the event. It was as if almost all of the 40,000 scenarios predicted Clinton to be the winner.

The 538 data team has come up with various ways of visualizing the outputs of the model (link). The entire post is worth reading. Here, I'll highlight the most scientific, and direct visual representation, which is the third display.

538_pdf_pair

We start by looking at the bottom of the two charts, showing the predicted electoral votes won  by Democratic challenger Joe Biden, in each of the 40,000 scenarios. Our attention is directed to the thick line that gives the relative chance of Biden's electoral-vote tally. This line is a smoothed summary of the columns in the background, which show the number of times the simulation produces each electoral-vote count.

The highlighted, right side of the chart recounts scenarios in which Biden becomes President, that is to say, he wins at least 270 electoral votes (out of 538, doh). The faded, left side represents scenarios in which Biden is defeated and Trump wins a second term.

The reason I focused on the bottom chart is that the top chart is merely a mirror image of this one. Just reflect the bottom chart around the vertical axis of 270 electoral votes, change the color scheme to red, and swap annotations related to Trump and Biden, and you get the other chart. This is because the narrative has excluded third-party and write-in candidates, leaving us with a zero-sum situation.
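The mirroring follows from the zero-sum assumption: whatever Biden wins, Trump gets the remainder of the 538. A minimal Python sketch with an invented histogram of scenario counts; strictly speaking, the mirror point of this mapping is 269 (half of 538), a hair to the left of the 270-line.

```python
from collections import Counter

TOTAL = 538


def mirror(biden_counts):
    """Turn a histogram of Biden's electoral votes into Trump's: ev -> 538 - ev."""
    return Counter({TOTAL - ev: n for ev, n in biden_counts.items()})


# Hypothetical histogram: electoral votes -> number of scenarios
biden_hist = Counter({250: 1, 300: 2, 350: 1})
trump_hist = mirror(biden_hist)
print(sorted(trump_hist.items()))   # [(188, 1), (238, 2), (288, 1)]
```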

Alternatively, one can jam both charts into one, while supplying extra labels, like this:

Redo_junkcharts_538forecastpdf_1

I prefer the denser single chart because my mind wanders away searching for extra meaning when chart elements are mirrored.

One advantage of the mirrored presentation is that the probability profiles of the potential Trump or Biden wins can be directly compared. We learn that Trump's winning margins are smaller, rarely above 150, and never above 250.

This comparison is made easier by flipping the left side of the chart onto the right side:

Redo_junkcharts_538forecastpdf_2

Those are three different visualizations using the same chart form. I'd have to run a poll to figure out which is the best. What's your opinion?


Working with multiple dimensions, an example from Germany

An anonymous reader submitted this mirrored bar chart about violent acts by extremists in the 16 German states.

Germanextremists_bars

At first glance, this looks like a standard design. On a second look, you might notice what the reader discovered: the chart uses two different scales, one for each side. The left side (red) depicting left-wing extremism is artificially compressed relative to the right side (blue). Not sure if this reflects the political bias of the publication - but in any case, this distortion means the only way to consume this chart is to read the numbers.

Even after fixing the scales, this design is challenging for the reader. It's unnatural to compare two years by looking first below then above. It's not simple to compare across states, and even harder to compare left- and right-wing extremism (due to mirroring).

The chart feels busy because the entire dataset is printed on it. I appreciate not including a redundant horizontal axis. (I wonder if the designer first removed the axis, then edited the scale on one side, not realizing the distortion.) Another nice touch, hidden in the legend, is the country totals.

I present two alternatives.

The first is a small-multiples "bumps chart".

Redo_junkcharts_germanextremists_sidebysidelines

Each plot presents the entire picture within a state. You can see the general level of violence, the level of left- and right-wing extremism, and their year-on-year change. States can be compared holistically.

Several German state names are rather long, so I explored a horizontal orientation. In this case, a connected dot plot may be more appropriate.

Redo_junkcharts_germanextremists_dots

The sign of a good multi-dimensional visual display is whether readers can easily learn complex relationships. Depending on the question of interest, the reader can mentally elevate parts of this chart. One can compare the set of blue arrows to the set of red arrows, or focus on just blue arrows pointing right, or red arrows pointing left, or all arrows for Berlin, etc.

 

[P.S. Anonymous reader said the original chart came from the Augsburger newspaper. This link in German contains more information.]