## Think twice before you spiral

##### Jan 10, 2022

After Nathan at FlowingData sang the praises of the following chart, a debate ensued on Twitter because others disliked it.

The chart was printed in an opinion column in the New York Times (link).

I have found few uses for spiral charts, and this example has not changed my mind.

The canonical time-series chart is like this:

***

The area chart takes no effort to understand. We can see when the peaks occurred. We notice that the current surge is already double the last peak seen a year ago.

It's instructive to trace how one gets from the simple area chart to the spiral chart.

Step 1 is to center the area on the zero line, instead of using the zero line as the baseline. While this technique frequently makes for a more pleasant visual (because of our preference for symmetry), it actually makes it harder to see the trend over time. Effectively, any change is split in half, with one half above the line and the other half below, which is why the envelope of the area is less sharp.
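The halving effect can be checked with a few numbers. The sketch below uses made-up case counts (not the data behind the chart) and a hypothetical helper, `centered_band`, that places half of each value above the zero line and half below it.

```python
def centered_band(values):
    """Return (lower, upper) edges of an area centered on the zero line."""
    return [(-v / 2, v / 2) for v in values]

# Made-up counts, for illustration only.
cases = [100, 120, 200, 400]

band = centered_band(cases)
upper_edge = [hi for _, hi in band]

# Change between consecutive points, on each version of the chart.
baseline_steps = [b - a for a, b in zip(cases, cases[1:])]
centered_steps = [b - a for a, b in zip(upper_edge, upper_edge[1:])]

print(baseline_steps)  # [20, 80, 200]
print(centered_steps)  # [10.0, 40.0, 100.0]
```

Every step on the centered version is half as tall as on the zero-baseline version, even though the thickness of the band still equals the original value.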

In Step 2, I massively compress the vertical scale. That's because when you plot a spiral, you are forced to fit each cycle of data into a much shorter range. Such compression causes the year-on-year doubling of cases to appear less dramatic. (Actually, the aspect ratio is devastated because while the vertical scale is hugely compressed, the horizontal scale is dramatically stretched out due to the curled-up design.)

Step 3 may elude your attention. If you simply curl up the compressed, centered area chart, you don't get the spiral chart. The key is to ask about the radius of the spiral. As best I can tell, the radius has no meaning; it is gradually increased so that each year of data has its own "orbit". What would the change in radius translate to on our non-circular chart? It should mean that the center of the area is gradually lifted away from the zero line. On the right chart, I mimic this effect. (I only measured the change in radius every 3 months so the change is more angular than displayed in the spiral chart.) The problem I have with this step is that it serves no purpose, while it complicates cognition.

In Step 4, just curl up the object into a ball based on aligning months of the year.
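To make the curling step concrete, here is a minimal sketch of one way to map a date to a position on a spiral. It is a guess at the general construction, not the method used for the published chart: the function name `spiral_position`, the starting radius, and the growth rate are all assumptions.

```python
import math
from datetime import date

def spiral_position(d, start_year, base_radius=1.0, growth_per_year=1.0):
    """Map a date to (angle, radius) on a spiral with one turn per year.

    The angle depends only on the position within the year, so the same
    calendar date lines up across years. The radius grows steadily with
    elapsed time and carries no data. All parameter values are assumptions.
    """
    year_start = date(d.year, 1, 1)
    days_in_year = (date(d.year + 1, 1, 1) - year_start).days
    fraction = (d - year_start).days / days_in_year
    angle = 2 * math.pi * fraction
    radius = base_radius + growth_per_year * ((d.year - start_year) + fraction)
    return angle, radius

a1, r1 = spiral_position(date(2021, 1, 10), start_year=2020)
a2, r2 = spiral_position(date(2022, 1, 10), start_year=2020)
print(a1 == a2)  # True: same month and day, same angle
```

The two dates share an angle but sit one orbit apart. The case count would then set the thickness of the band drawn around that radius, which is why the radius itself tells the reader nothing.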

This is the point when I realized I missed a Step 2B. I carefully aligned the scales of both charts so that the 150K cases shown in the legend on the right have the same vertical representation as on the left. This exposes a severe horizontal rescaling. The length of the horizontal axis on the left chart is many times smaller than the circumference of the spiral! That's why I said earlier that one of the biggest features of this spiral chart is that it imposes a dubious aspect ratio, one that is extremely wide and extremely short.

As usual, think twice before you spiral.

##### Sep 28, 2021

In the prior post about Canadian elections, I suggested that designers expand beyond plots of one variable at a time. Today, I look at a project by DataWrapper on the German elections which happened this week. Thanks to long-time blog supporter Antonio for submitting the chart.

The following is the centerpiece of Lisa's work:

CDU/CSU is Angela Merkel's party, represented by the black color. The chart answers one question only: did polls correctly predict election results?

The time period from 1994 to 2021 covers eight consecutive elections (counting the one this week). There are eight vertical blocks on the chart representing each administration. The right vertical edge of each block coincides with an election. The chart is best understood as the superposition of two time series.

You can trace the first time series by following a step function - let your eyes follow the flat lines between elections. This dataset shows the popular vote won by the party at each election, with the value updated after each election. The last vertical block represents an election that had not yet happened when this chart was created. As explained in the footnote, Lisa took the average poll result for the last month leading up to the 2021 election - in the context of this chart, she made the assumption that this cycle of polls will be 100% accurate.

The second time series corresponds to the ragged edges of the gray and black areas. If you ignore the colors, and the flat lines, you'll discover that the ragged edges form a contiguous data series. This line encodes the average popularity of the CDU/CSU party according to election polls.

Thus, the area between the step function and the ragged line measures the gap between polls and election day results. When the polls underestimate the actual outcome, the area is colored gray; when the polls are over-optimistic, the area is colored black. In the last completed election of 2017, Merkel's party underperformed relative to the polls. In fact, the polls in the entire period between the 2013 and 2017 elections uniformly painted a rosier picture for CDU/CSU than actually happened.
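The coloring rule amounts to taking the sign of the gap between a poll reading and the election result for that cycle. Here is a minimal sketch with made-up numbers; the helper `shade` is hypothetical and is not taken from the DataWrapper project.

```python
def shade(poll_share, election_share):
    """Classify one poll reading against the election result of its cycle."""
    gap = poll_share - election_share
    if gap > 0:
        return "black"   # polls over-optimistic
    if gap < 0:
        return "gray"    # polls underestimated the outcome
    return "none"        # no gap to fill

# Made-up poll averages (percent) for one cycle, and a made-up result.
polls = [38.0, 36.5, 34.0, 32.0, 33.5]
result = 33.0

colors = [shade(p, result) for p in polls]
print(colors)  # ['black', 'black', 'black', 'gray', 'black']

# A color change within a cycle means the polls crossed the eventual result.
color_changes = sum(1 for a, b in zip(colors, colors[1:]) if a != b)
print(color_changes)  # 2
```

Counting color changes per cycle gives a rough measure of how often the polls crossed the eventual result.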

The last vertical block is interpreted a little differently. Since the reference level is the last month of polls (rather than the actual popular vote), the abundance of black indicates that Merkel's party has been suffering from declining poll numbers on the approach of this week's election.

***

The picture shown above seems to indicate that these polls are not particularly good. It appears they have limited ability to self-correct within each election cycle. Aside from the 1998-2002 period, the area colors seldom changed within each cycle. That means if the first polling average overestimated the party's popularity, then all subsequent polling averages were also optimistic. (The original post focused on a single pollster, which exacerbates this issue. Compare the following chart with the above, and you'll find even fewer color changes within cycle here:

Each pollster may be systematically biased but the poll aggregate is less so.)

Here's the chart for SPD, which is CDU/CSU's biggest opponent, and likely winner of this week's election:

Overall, this chart has features similar to those of the CDU/CSU chart. The most recent polls seem to favor the SPD - the pink area indicates that the older polls of this cycle underestimated the last month's poll result.

Both these parties are in long-term decline, with popularity dropping from the 40% range in the 1990s to the 20% range in the 2020s.

One smaller party that seems to have gained followers is the Green party:

The excess of dark green, however, does not augur well for this election.

## Ridings, polls, elections, O Canada

##### Sep 20, 2021

Stephen Taylor reached out to me about his work to visualize Canadian elections data. I took a look. I appreciate the labor of love behind this project.

He led with a streamgraph, which presents a quick overview of relative party strengths over time.

I am no Canadian election expert, and I did a bare minimum of research in writing this blog. From this chart, I learn that:

• The Canadians have an irregular election schedule.
• The two dominant parties are Liberals and Conservatives. The Liberals currently hold just less than half of the seats. The Conservatives have more than half of the seats not held by Liberals.
• The Conservative party (maybe) rebranded as "progressive conservative" for several decades. The Reform/Alliance party was (maybe) a splinter movement within the Conservatives as well.
• Since the "width" of the entire stream increased over time, I'm guessing the number of seats has expanded.

That's quite a bit of information obtained at a glance. This shows the power of data visualization. Notice Stephen didn't even have to include a "how to read this" box.

The streamgraph form has its limitations.

The feature that makes it more attractive than an area chart is its middle anchoring, resulting in a form of symmetry. The same feature produces erroneous intuition - the red patch draws out a declining trend; the reader must fight the urge to interpret the lines and focus on the areas.

The breadcrumbs are well hidden. The legend below discloses that the Green Party holds 3 seats currently. The party has never held enough seats to appear on the streamgraph though.

The bars showing proportions in the legend are a very nice touch. (The numbers appear messed up - I have to ask Stephen whether the seats shown are current values, or some kind of historical average.) I am a big fan of informative legends.

***

The next featured chart is a dot plot of polling results since 2020.

One can see a three-tier system: the two main parties, then the NDP (yellow) is the clear majority of the minority, and finally you have a host of parties that don't poll over 10%.

It looks like the polls are favoring the Conservatives over the Liberals in this election but it may be an election-day toss-up.

The purple dots represent "PPC" which is a party not found elsewhere on the page.

This chart is clear as crystal because of the structure of the underlying data. It just amazes me that the polls are so highly correlated. For example, across all these polls, the NDP has never once polled better than either the Liberals or the Conservatives, and in addition, it has never polled worse than any of the small parties.

What I'd like to see is a chart that merges the two datasets, addressing the question of how well these polls predicted the actual election outcomes.

***

The project goes very deep as Stephen provides charts for individual "ridings" (perhaps similar to U.S. precincts).

Here we see population pyramids for Vancouver Center, versus British Columbia (Province), versus Canada.

This riding has a large surplus of younger people in their twenties and thirties. Be careful about the changing scales though. The relative differences in proportions are more drastic than visually displayed because the maximum values (5%) on the Province and Canada charts are half that on the Riding chart (10%). Imagine squashing the Province and Canada charts to half their widths.
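A quick calculation shows what the mismatched axes do to bar lengths. The axis maxima come from the charts as described above; the 4% share is a made-up value, and `bar_fraction` is a hypothetical helper.

```python
def bar_fraction(share, axis_max):
    """Fraction of the available axis length that a bar occupies."""
    return share / axis_max

# Axis maxima as described in the text; the 4% share is a made-up value.
riding_axis_max = 10.0     # percent
province_axis_max = 5.0    # percent
share = 4.0                # percent of population in some age band

on_riding_chart = bar_fraction(share, riding_axis_max)
on_province_chart = bar_fraction(share, province_axis_max)
print(on_riding_chart, on_province_chart)  # 0.4 0.8
```

The same share is drawn twice as long on the Province and Canada charts, so those bars have to be mentally halved before comparing them with the Riding chart.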

Analyses of income and rent/own status are also provided.

This part of the dashboard exhibits a problem common in most dashboards - they present each dimension of the data separately and miss out on the more interesting stuff: the correlation between dimensions. Do people in their twenties and thirties favor specific parties? Do richer people vote for certain parties?

***

The riding-level maps are the least polished part of the site. This is where I'm looking for a "how to read it" box.

It took me a while to realize that the colors represent the parties. If I hadn't come in from the front page, I'd have been totally lost.

Next, I got confused by the use of the word "poll". Clicking on any of the subdivisions brings up details of an actual race, with party colors, candidates and a donut chart showing proportions. The title gives a "poll id" and the name of the riding in parentheses. Since the poll id changes as I mouse over different subdivisions, I'm wondering whether a "poll" is the term for a subdivision of a riding. A quick wiki search indicates otherwise.

My best guess is the subdivisions are indicated by the numbers.

Back to the donut charts, I prefer a different sorting of the candidates. For this chart, the two most logical orderings are (a) order by overall popularity of the parties, fixed for all ridings and (b) order by popularity of the candidate, variable for each riding.

The map shown above gives the winner in each subdivision. This type of visualization dumps a lot of information. Stephen tackles this issue by offering a small multiples view of each party. Here are the Liberals in Vancouver.

Again, we encounter ambiguity about the color scheme. Liberals have been associated with a red color but we are faced with abundant yellow. After clicking on the other parties, you get the idea that he has switched to a divergent continuous color scale (red - yellow - green). Is red or green the higher value? (The answer is red.)

I'd suggest using a gray scale for these charts. The hardest decision is going to be the encoding between values and shading. Should each gray scale be different for each riding and each party?

If I were to take a guess, Stephen must have spent weeks if not months creating these maps (depending on whether he's full-time or part-time). What he has published here is a great start. Fine-tuning the issues I've mentioned may take weeks or months more.

***

Stephen is brave and smart to send this project for review. For one thing, he's got some free consulting. More importantly, we should always send work around for feedback; other readers can tell us where our blind spots are.

## Hanging things on your charts

##### Jul 20, 2021

The Financial Times published the following chart that shows the rollout of vaccines in the U.K.

(I can't find the online link to the article. The article is titled "AstraZeneca and Oxford face setbacks and success as battle enters next phase", May 29/30 2021.)

This chart form is known as a "streamgraph", and it is a stacked area chart in disguise.

The same trick can be applied to a column chart. See the "hanging" column chart below:

The two charts show exactly the same data. The left one roots the columns at the bottom. The right one aligns the middle of the columns.

I have rarely found these hanging charts useful. The realignment makes it harder to compare the sizes of the different column segments. On the normal stacked column chart, the yellow segments are the easiest to compare because they share the same base level. Even this is taken away from the reader on the right side.

Note also that the hanging version does not admit a vertical axis.

The same comments apply to the streamgraph.

***

Nevertheless, I was surprised that the FT chart shown above actually works. The main message I learned was that initially the U.K. primarily rolled out AstraZeneca and, to a lesser extent, Pfizer shots, while later, they introduced other vaccines, including Johnson & Johnson, Novavax, CureVac, Moderna, and "Other".

I can also see that the supply of AstraZeneca has not changed much through the entire time window. Pfizer has grown to roughly the same scale as AstraZeneca. Moderna remains a small fraction of total shots.

I can even roughly see that the total number of vaccinations has grown about six times from start to finish.

That's quite a lot for one chart, so job well done!

There is one problem with the FT chart. It should have labelled end of May as "today". Half the chart is history, and the other half is the future.

***

For those following Covid-19 news, the FT chart is informative in a different way.

There is a misleading statement going around blaming the U.K.'s recent surge in cases on the AstraZeneca vaccine, claiming that the U.K. mostly uses AZ. This chart shows that from the start, about a third of the shots administered in the U.K. were Pfizer, and Pfizer's share has been growing over time.

U.K. compared to some countries mostly using mRNA vaccines

U.K. is almost back to the winter peak. That's because the U.K. is serious about counting cases. Look at the state of testing in these countries:

What's clear about the U.S. case count is that it is kept low by cutting the number of tests by two-thirds; thus, our data are once again severely biased towards severe cases.

We can do a back-of-the-envelope calculation. The drop in testing may directly lead to a proportional drop in reported cases, thus removing 500 (asymptomatic, or mild) cases per million from the case count. The case count goes below 250 per million so the additional 200 or so reduction is due to other reasons such as vaccinations.
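The arithmetic can be laid out step by step. The sketch below uses only the three figures quoted in the text (a two-thirds cut in testing, 500 cases per million removed, and a further reduction of about 200) and back-solves the starting and ending levels they imply. Those implied levels are derived here for illustration and are not read off the chart.

```python
from fractions import Fraction

testing_cut = Fraction(2, 3)      # share of tests removed (from the text)
removed_by_testing = 500          # cases per million (from the text)
further_reduction = 200           # cases per million (from the text, approximate)

# Back-solve the starting level implied by a proportional drop.
implied_start = removed_by_testing / testing_cut
after_testing_cut = implied_start - removed_by_testing
implied_current = after_testing_cut - further_reduction

print(implied_start, after_testing_cut, implied_current)  # 750 250 50
```

Under the proportionality assumption, reduced testing alone would account for 500 of the roughly 700 cases per million that disappeared from the count.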

## Same data + same chart form = same story. Maybe.

##### Feb 18, 2021

We love charts that tell stories.

Some people believe that if they situate the data in the right chart form, the stories reveal themselves.

Some people believe for a given dataset, there exists a best chart form that brings out the story.

An implication of these beliefs is that the story is immutable, given the dataset and the chart form.

If you use the Trifecta Checkup, you already know I don't subscribe to those ideas. That's why the Trifecta has three legs; the third is the question, which is related to the message or the story.

***

I came across the following chart by Statista, illustrating the growth in Covid-19 cases from the start of the pandemic to this month. The underlying data are collected by WHO and cover the entire globe. The data are grouped by regions.

The story of this chart appears to be that the world moves in lock step, with each region behaving more or less the same.

If you visit the WHO site, they show a similar chart:

On this chart, the regions at the bottom of the graph (esp. Southeast Asia in purple) clearly do not follow the same time patterns as Americas (orange) or Europe (green).

What we're witnessing is: same data, same chart form, different stories.

This is a feature, not a bug, of the stacked area chart. The story is driven largely by the order in which the pieces are stacked. In the Statista chart, the largest pieces are placed at the bottom while for WHO, the order is exactly reversed.
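The effect of stacking order is easy to reproduce with two made-up series. In this sketch, the hypothetical helper `stack` returns the top edge of each layer, which is the line a reader's eye follows.

```python
def stack(layers):
    """Return the top edge of each layer when stacked bottom to top."""
    edges, running = [], [0] * len(layers[0])
    for layer in layers:
        running = [r + v for r, v in zip(running, layer)]
        edges.append(running)
    return edges

# Made-up series: a large region with two waves, a small region in steady decline.
large = [10, 60, 20, 80, 30]
small = [9, 8, 7, 6, 5]

small_on_top = stack([large, small])[1]     # small layer drawn above the large one
small_at_bottom = stack([small, large])[0]  # small layer drawn on the baseline

print(small_on_top)     # [19, 68, 27, 86, 35]
print(small_at_bottom)  # [9, 8, 7, 6, 5]
```

Placed on top, the small region's edge inherits the two waves of the large region beneath it, which suggests lock step. Placed at the bottom, its edge shows the steady decline that is actually in its data.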

(There are minor differences which do not affect my argument. The WHO chart omits the "Other" category which accounts for very little. Also, the Statista chart shows the smoothed data using 7-day averaging.)

In this example, the order chosen by WHO preserves the story while the order chosen by Statista wipes it out.

***

What might be the underlying question of someone who makes this graph? Perhaps it is to identify the relative prevalence of Covid-19 in different regions at different stages of the pandemic.

Emphasis on the word "relative". Instead of plotting absolute number of cases, I consider plotting relative number of cases, that is to say, the proportion of cases in each region at given times.

This leads to a stacked area percentage chart.

In this side-by-side view, you see that this form is not affected by flipping the order of the regions. Both charts say the same thing: that there were two waves in Europe and the Americas that dwarfed all other regions.
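To see why the percentage version holds up when the order is flipped, here is a minimal sketch with made-up counts for a single week. The helper `boundaries` is hypothetical; it returns the cumulative share at the top of each band for a given stacking order.

```python
def boundaries(counts, order):
    """Cumulative share at the top of each band, stacked in the given order."""
    total = sum(counts.values())
    edges, running = [], 0
    for region in order:
        running += counts[region]
        edges.append(running / total)
    return edges

# Made-up counts for one week.
week = {"Americas": 500, "Europe": 400, "Southeast Asia": 100}
order = ["Americas", "Europe", "Southeast Asia"]

forward = boundaries(week, order)
flipped = boundaries(week, order[::-1])

print(forward)  # [0.5, 0.9, 1.0]
print(flipped)  # [0.1, 0.5, 1.0]
```

Because the stack always fills the range from 0 to 1, each interior boundary in the flipped order is one minus a boundary in the original order. The flipped chart is a mirror image rather than a different picture.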

##### Jan 22, 2021

Let's explore an infographic by SCMP, which draws attention to the alarming temperature recorded at Verkhoyansk in Russia on June 20, 2020. The original work was on the back page of the printed newspaper, referred to in this tweet.

This view of the globe brings out the two key pieces of evidence presented in the infographic: the rise in temperature in unexpected places, and the shrinkage of the Arctic ice.

A notable design decision is to omit the color scale. On inspection, the scale is present - it was sewn into the graphic.

I applaud this decision as it does not take the reader's eyes away from the graphic. Some information is lost as the scale isn't presented in full details but I doubt many readers need those details.

A key takeaway is that the temperature in Verkhoyansk, which is on the edge of the Arctic Circle, was the same as in New Delhi in India on that day. We can see how the red was encroaching upon the Arctic Circle.

***

Next, the rapid shrinkage of the Arctic ice is presented in two ways. First, a series of maps.

The annotations are pared to the minimum. The presentation is simple enough such that we can visually judge that the amount of ice cover has roughly halved from 1980 to 2009.

A numerical measure of the drop is provided on the side.

Then, a line chart reinforces this message.

The line chart emphasizes change over time while the series of maps reveals change over space.

This chart suggests that the year 2020 may break the record for the smallest ice cover since 1980. The maps of Australia and India provide context to interpret the size of the Arctic ice cover.

I'd suggest reversing the pink and black colors so as to refer back to the blue and pink lines in the globe above.

***

The final chart shows the average temperature worldwide and in the Arctic, relative to a reference period (1981-2000).

This one is tough. It looks like an area chart but it should be read as a line chart. The darker line is the anomaly of Arctic average temperature while the lighter line is the anomaly of the global average temperature. The two series are synced except for a brief period around 1940. Since 2000, the temperatures have been dramatically rising above that of the reference period.

If this is a stacked area chart, then we'd interpret the two data series as summable, with the sum of the data series signifying something interesting. For example, the market shares of different web browsers sum to the total size of the market.

But the chart above should not be read as a stacked area chart because the outside envelope isn't the sum of the two anomalies. The problem is revealed if we try to articulate what the color shades mean.

On the far right, it seems like the dark shade is paired with the lighter line and represents global positive anomalies while the lighter shade shows Arctic's anomalies in excess of global. This interpretation only works if the Arctic line always sits above the global line. This pattern is broken in the late 1990s.

Around 1999, the Arctic's anomaly is negative while the global anomaly is positive. Here, the global anomaly gets the lighter shade while the Arctic one is blue.

One possible fix is to encode the size of the anomaly into the color of the line. The further away from zero, the darker the red/blue color.

## Aligning the visual and the data

##### Dec 16, 2020

The Washington Post reported a surge in donations to the Democrats after the death of Justice Ruth Bader Ginsburg (link). A secondary effect, perhaps unexpected, was that donors decided to spread the money around; the proportion of donors who gave to six or more candidates jumped to 65%, where normally it is at 5%.

The text tells us what to look for, and the axis labels are commendably restrained. The color scheme is also intuitive.

There is something frustrating about this chart, though. It's that the spike is shown upside down. The level that the arrow points at is 45%, which is the total of the blue columns. The visual suggests the proportion of multiple beneficiaries (2 or more) should be 55%. There is a divergence between what the visual is saying and what the data are saying. Whichever number is correct, the required proportion is the complement of the level shown on the percentage axis!

***

This is the same chart flipped over.

Now, the number we need can be read off the vertical axis.

I also moved the color legend to the right side so that the entries can be printed vertically, in the same direction as the data. This is one of the unspoken rules of data visualization I covered in my feature for DataJournalism.com.

***

In the Trifecta Checkup (link), the issue is with the green arrow between the D corner and the V corner. The data and the visual are not in sync.

##### Nov 23, 2020

The folks at FiveThirtyEight were excited about the following dataviz they published two weeks ago, illustrating the progression of vote-counting by state. (link) That was indeed the unique and confusing feature of the 2020 Presidential election in the States. For those outside the U.S., what happened (by and large) was that many Americans, skewing Biden supporters, voted by mail before Election Day but their votes were sometimes counted after the same-day votes were tallied.

A number of us kept staring at these charts, hoping for a how-to-read-it explanation. Here is a zoom-in for the state of Michigan:

To save you the trouble, here is how.

The key is to fight your urge to look at the brown area. I know, it's pretty hard to ignore the biggest areas of every chart. But try to make them disappear.

Focus on the top edge of the chart. This line gives the total number of votes counted so far. In Michigan, by hour 12, about 2.4 million votes were counted, and by hour 72, 2.8 million votes were on the books. This line gives the sum of the two major parties' vote totals [since third parties got negligible votes in this election, I'm ignoring them so as to simplify the discussion].

Next, look at the red and blue areas. These represent the gap in the number of votes between the two parties' current vote totals. If the area is red, Trump was leading; if blue, Biden was leading. Each color flip represents a lead change. Suppress the urge to interpret red as the number or share of Trump votes.

***

What have we learned about the vote counting in Michigan?

Counting significantly slowed after the 12th hour. Trump raced to a lead on Election Day, and around hour 20, the race was dead even, and after that, Biden overtook Trump and never looked back. Throughout most of this period, the vote lead was small compared to the total votes cast although at the end, the Biden lead was noticeable.

If you insist on interpreting the brown area, it is equal to twice the vote total of the second-place candidate, so it really isn't something you want to look at.
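The relationship between the three pieces of the chart can be verified with simple arithmetic. This sketch ignores third parties, as in the discussion above, and uses made-up vote totals rather than actual Michigan counts; `chart_parts` is a hypothetical helper.

```python
def chart_parts(trump_votes, biden_votes):
    """Split the running count into the pieces drawn on the chart."""
    total = trump_votes + biden_votes        # top edge of the chart
    gap = abs(trump_votes - biden_votes)     # red or blue area
    if trump_votes > biden_votes:
        leader = "red"
    elif biden_votes > trump_votes:
        leader = "blue"
    else:
        leader = "tie"
    brown = total - gap                      # what remains under the gap
    return total, gap, leader, brown

# Made-up vote totals, for illustration only.
total, gap, leader, brown = chart_parts(1_300_000, 1_100_000)
print(total, gap, leader, brown)  # 2400000 200000 red 2200000
```

The brown area works out to twice the smaller of the two vote totals, which is not a quantity anyone sets out to read.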

Just for contrast, here is the chart for Iowa:

Trump led from beginning to end, with his lead widening slightly as more votes were counted.

***

As I was stewing over this chart, an ominous thought overcame me. Would a streamgraph work for this data? You don't hear much about streamgraphs here because I rarely favor them (see this long-ago post) but let's just try one and see.

(These streamgraphs were made in R using the streamgraph package. Post-processing was applied to customize the labeling.)

This chart conveys all the key points listed before. You can see how the gap evolved over time, the lead flips, which candidate was in the lead, and the total mass of votes counted at different times. The gap is shown in the middle.

I can't say I'm completely happy with the streamgraph - I hope readers don't care about the numbers because it's hard to evaluate a difference when it's split two ways on either side of the middle axis!

***

## Locating the political center

##### Oct 26, 2020

I mentioned the September special edition of Bloomberg Businessweek on the election in this prior post. Today, I'm featuring another data visualization from the magazine.

***

Here are the rightmost two charts.

Time runs from top to bottom, spanning four decades.

Each chart covers a political issue. These two charts concern abortion and marijuana.

The marijuana question (far right) has only two answers, legalize or don't legalize. The underlying data measure the proportions of people agreeing to each point of view. Roughly three-quarters of the population disagreed with legalization in 1980 while two-thirds agree with it in 2020.

Notice that there are no horizontal axis labels. This is a great editorial decision. Only coarse trends are of interest here. It's not hard to figure out the relative proportions. Adding labels would just clutter up the display.

By contrast, the abortion question has three answer choices. The middle option is "Sometimes," which is represented by a white color, with a dot pattern. This is an issue on which public opinion in aggregate has barely shifted over time.

The charts are organized in a small-multiples format. It's likely that readers are consuming each chart individually.

***

What about the dashed line that splits each chart in half? Why is it there?

The vertical line assists our perception of the proportions. Think of it as a single gridline.

In fact, this line is underplayed. The headline of the article is "tracking the political center." Where is the center?

Until now, we've paid attention to the boundaries between the differently colored areas. But those boundaries do not locate the political center!

The vertical dashed line is the political center; it represents the view of the median American. In 1980, the line sat inside the gray section, meaning the median American opposed legalizing marijuana. But the prevalent view was losing support over time and by 2010, there were more Americans wanting to legalize marijuana than not. This is when the vertical line crossed into the green zone.
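Locating the median American amounts to finding the answer category that contains the 50% mark. Here is a minimal sketch: the marijuana proportions are rounded from the figures quoted earlier, the abortion proportions and all category labels are made up, and the left-to-right order is an assumption.

```python
def median_category(shares, order):
    """Return the category containing the 50% mark, reading left to right."""
    running = 0.0
    for category in order:
        running += shares[category]
        if running >= 0.5:
            return category
    return order[-1]

# Roughly three-quarters opposed in 1980; roughly two-thirds in favor in 2020.
marijuana_order = ["oppose", "legalize"]
in_1980 = median_category({"oppose": 0.75, "legalize": 0.25}, marijuana_order)
in_2020 = median_category({"oppose": 1 / 3, "legalize": 2 / 3}, marijuana_order)
print(in_1980, in_2020)  # oppose legalize

# Made-up three-way split with a large middle option.
abortion_order = ["never", "sometimes", "always"]
middle = median_category({"never": 0.2, "sometimes": 0.55, "always": 0.25}, abortion_order)
print(middle)  # sometimes
```

With only two answers, the median flips as soon as one side passes 50%. With a large middle option, both ends can shift quite a bit before the median leaves the middle category.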

The following charts draw attention to the middle line, instead of the color boundaries:

On these charts, as you glance down the middle line, you can see that for abortion, the political center has never exited the middle category while for marijuana, the median American didn't want to legalize it until an inflection point was reached around 2010.

I highlight these inflection points with yellow dots.

***

The effect on readers is entirely changed. The original charts draw attention to the areas first while the new charts pull your eyes to the vertical line.

## Putting vaccine trials in boxes

##### Sep 08, 2020

Bloomberg Businessweek has a special edition about vaccines, and I found this chart on the print edition:

The chart's got a lot of white space. Its structure is a series of simple "treemaps," one for each type of vaccine. Though simple, such a chart burns a few brain cells.

Here, I've extracted the largest block, which corresponds to vaccines that work with the virus's RNA/DNA. I applied a self-sufficiency test, removing the data from the boxes.

What proportion of these projects have moved from pre-clinical to Phase I? To answer this question, we have to understand the relative areas of boxes, since that's how the data are encoded. How many yellow boxes can fit into the gray box?

It's not intuitive. We'd need a ruler to do this task properly.

Then, we learn that the gray box is exactly 8 times the size of the yellow box (72 projects are pre-clinical while 9 are in Phase I). We can cram eight yellows into the gray box. Imagine doing that, and it's pretty clear the visual elements fail to convey the meaning of the data.
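One reason the ratio is hard to judge is that area grows with the square of the side length. The sketch below assumes, for simplicity, that both boxes are squares; the project counts are the ones printed on the chart.

```python
import math

pre_clinical = 72   # projects, from the chart
phase_one = 9       # projects, from the chart

area_ratio = pre_clinical / phase_one
side_ratio = math.sqrt(area_ratio)

print(area_ratio)            # 8.0
print(round(side_ratio, 2))  # 2.83
```

Under that assumption, a box with eight times the area has sides less than three times as long, so readers who compare edges will badly underestimate the difference.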

Self-sufficiency is the idea that a data graphic should not rely on printed data to convey its meaning; the visual elements of a data graphic should bear much of the burden. Otherwise, use a data table. To test for self-sufficiency, cover up the printed data and see if the chart still works.

***

A key decision for the designer is the relative importance of (a) the number of projects reaching Phase III, versus (b) the number of projects utilizing specific vaccine strategies.

This next chart emphasizes the clinical phases:

Contrast this with the version shown in the online edition of Bloomberg (link), which emphasizes the vaccine strategies.