Illustrating differential growth rates

Reader Mirko was concerned about a video published in Germany that shows why the new coronavirus variant is dangerous. He helpfully provided a summary of the transcript:

The South African and the British mutations of the SARS-COV-2 virus are spreading faster than the original virus. On average, one infected person infects more people than before. Researchers believe the new variant is 50 to 70 % more transmissible.

Here are two key moments in the video:


This seems to be saying the original virus (left side) replicates 3 times inside the infected person while the new variant (right side) replicates 19 times. So we have a roughly 6-fold jump in viral replication.


Later in the video, it appears that every replicate of the old virus finds a new victim while the 19 replicates of the new variant land on 13 new people, meaning 6 replicates didn't find a host.

As Mirko pointed out, the visual appears to have run away from the data. (In our Trifecta Checkup, we have a problem with the arrow between the D and the V corners. What the visual is saying is not aligned with what the data are saying.)


It turns out that the scientists have been very confusing when talking about the infectiousness of this new variant. The most quoted line is that the British variant is "50 to 70 percent more transmissible". At first, I thought this is a comment on the famous "R number". Since the R number around December was roughly 1 in the U.K, the new variant might bring the R number up to 1.7.

However, that is not the case. From this article, it appears that being 5o to 70 percent more transmissible means R goes up from 1 to 1.4. R is interpreted as the average number of people infected by one infected person.

Mirko wonders if there is a better way to illustrate this. I'm sure there are many better ways. Here's one I whipped up:


The left side is for the 40% higher R number. Both sides start at the center with 10 infected people. At each time step, if R=1 (right side), each of the 10 people infects 10 others, so the total infections increase by 10 per time step. It's immediately obvious that a 40% higher R is very serious indeed. Starting with 10 infected people, in 10 steps, the total number of infections is almost 1,000, almost 10 times higher than when R is 1.

The lines of the graphs simulate the transmission chains. These are "average" transmission chains since R is an average number.


P.S. [1/29/2021: Added the missing link to the article in which it is reported that 50-70 percent more transmissible implies R increasing by 40%.]



Say it thrice: a nice example of layering and story-telling

I enjoyed the New York Times's data viz showing how actively the Democratic candidates were criss-crossing the nation in the month of March (link).

It is a great example of layering the presentation, starting with an eye-catching map at the most aggregate level. The designers looped through the same dataset three times.


This compact display packs quite a lot. We can easily identify which were the most popular states; and which candidate visited which states the most.

I noticed how they handled the legend. There is no explicit legend. The candidate names are spread around the map. The size legend is also missing, replaced by a short sentence explaining that size encodes the number of cities visited within the state. For a chart like this, having a precise size legend isn't that useful.

The next section presents the same data in a small-multiples layout. The heads are replaced by dots.


This allows more precise comparison of one candidate to another, and one location to another.

This display has one shortcoming. If you compare the left two maps above, those for Amy Klobuchar and Beto O'Rourke, it looks like they have visited roughly similar number of cities when in fact Beto went to 42 compared to 25. Reducing the size of the dots might work.

Then, in the third visualization of the same data, the time dimension is emphasized. Lines are used to animate the daily movements of the candidates, one by one.


Click here to see the animation.

When repetition is done right, it doesn't feel like repetition.


Check out the Lifespan of News project

Alberto Cairo introduces another one of his collaborations with Google, visualizing Google search data. We previously looked at other projects here.

The latest project, designed by Schema, Axios, and Google News Initiative, tracks the trending of popular news stories over time and space, and it's a great example of making sense of a huge pile of data.

The design team produced a sequence of graphics to illustrate the data. The top news stories are grouped by category, such as Politics & Elections, Violence & War, and Environment & Science, each given a distinct color maintained throughout the project.

The first chart is an area chart that looks at individual stories, and tracks the volume over time.


To read this chart, you have to notice that the vertical axis measuring volume is a log scale, meaning that each tick mark up represents a 10-fold increase. Log scale is frequently used to draw far-away data closer to the middle, making it possible to see both ends of a wide distribution on the same chart. The log transformation introduces distortion deliberately. The smaller data look disproportionately large because of it.

The time scrolls automatically so that you feel a rise and fall of various news stories. It's a great way to experience the news cycle in the past year. The overlapping areas show competing news stories that shared the limelight at that point in time.

Just bear in mind that you have to mentally reverse the distortion introduced by the log scale.


In the second part of the project, they tackle regional patterns. Now you see a map with proportional symbols. The top story in each locality is highlighted with the color of the topic. As time flows by, the sizes of the bubbles expand and contract.


Sometimes, the entire nation was consumed by the same story, e.g. certain obituaries. At other times, people in different regions focused on different topics.


In the last part of the project, they describe general shapes of the popularity curves. Most stories have one peak although certain stories like U.S. government shutdown will have multiple peaks. There is also variation in terms of how fast a story rises to the peak and how quickly it fades away.

The most interesting aspect of the project can be learned from the footnote. The data are not direct hits to the Google News stories but searches on Google. For each story, one (or more) unique search terms are matched, and only those stories are counted. A "control" is established, which is an excellent idea. The control gives meaning to those counts. The control used here is the number of searches for the generic term "Google News." Presumably this is a relatively stable number that is a proxy for general search activity. Thus, the "volume" metric is really a relative measure against this control.





Environmental science can use better graphics

Mike A. pointed me to two animated maps made by Caltech researchers published in LiveScience (here).

The first map animation shows the rise and fall of water levels in a part of California over time. It's an impressive feat of stitching together satellite images. Click here to play the video.


The animation grabs your attention. I'm not convinced by the right side of the color scale in which the white comes after the red. I'd want the white in the middle then the yellow and finally the red.

In order to understand this map and the other map in the article, the reader has to bring a lot of domain knowledge. This visualization isn't easy to decipher for a layperson.

Here I put the two animations side by side:


The area being depicted is the same. One map shows "ground deformation" while the other shows "subsidence". Are they the same? What's the connection between the two concepts (if any)?  On a further look, one notices that the time window for the two charts differ: the right map is clearly labeled 1995 to 2003 but there is no corresponding label on the left map. To find the time window of the left map, the reader must inspect the little graph on the top right (1996 to 2000).

This means the time window of the left map is a subset of the time window of the right map. The left map shows a sinusoidal curve that moves up and down rhythmically as the ground shifts. How should I interpret the right map? The periodicity is no longer there despite this map illustrating a longer time window. The scale on the right map is twice the magnitude of the left map. Maybe on average the ground level is collapsing? If that were true, shouldn't the sinusoidal curve drift downward over time?

Caltech_groundwater_sineThe chart on the top right of the left map is a bit ugly. The year labels are given in decimals e.g. 1997.5. In R, this can be fixed by customizing the axis labels.

I also wonder how this curve is related to the map it accompanies. The curve looks like a model - perfect oscillations of a fixed period and amplitude. But one suppose the amount of fluctuation should vary by location, based on geographical features and human activities.

The author of the article points to both natural and human impacts on the ground level. Humans affect this by water usage and also by management policies dictated by law. It would be very helpful to have a map that sheds light on the causes of the movements.

Visualizing the Thai cave rescue operation

The Thai cave rescue was a great story with a happy ending. It's also one that lends itself to visualization. A good visualization can explain the rescue operation more efficiently than mere words.

A good visual should bring out the most salient features of the story, such as:

  • Why the operation was so daunting?
  • What were the tactics used to overcome those challenges?
  • How long did it take?
  • What were the specific local challenges that must be overcome?
  • Were there any surprises?

In terms of what made the rescue challenging, some of the following are pertinent:

  • How far in they were?
  • How deep were they trapped?
  • How much of the caves were flooded? Why couldn't they come out by themselves?
  • How much headroom was there in different sections of the cave "tunnel"?

There were many attempts at visualizing the Thai cave rescue operation. The best ones I saw were: BBC (here, here), The New York Times (here), South China Morning Post (here) and Straits Times (here). It turns out each of these efforts focuses on some of the aspects above, and you have to look at all of them to get the full picture.


BBC's coverage began with a top-down view of the route of the rescue, which seems to be the most popular view adopted by news organizations. This is easily understood because of the standard map aesthetic.


The BBC map is missing a smaller map of Thailand to place this in a geographical context.

While this map provides basic information, it doesn't address many of the elements that make the Thai cave rescue story compelling. In particular, human beings are missing from this visualization. The focus is on the actions ("diving", "standing"). This perspective also does not address the water level, the key underlying environmental factor.


Another popular perspective is the sideway cross-section. The Straits Times has one:

Straittimes_thai rescue_part

The excerpt of the infographic presents a nice collection of data that show the effort of the rescue. The sideway cross-sectional section shows the distance and the up-and-down nature of the journey, the level of flooding along the route, plus a bit about the headroom available at different points. Most of these diagrams bring out the "horizontal" distance but somehow ignore the "vertical" distance. One possibility is that the real trajectory is curvy - but if we can straighten out the horizontal, we should be able to straighten out the vertical too.

The NYT article gives a more detailed view of the same perspective, with annotations that describe key moments along the rescue route.


If, like me, you like to place humans into this picture, then you have to go back to the Straits Times, where they have an expanded version of the sideway cross-section.


This is probably my most favorite single visualization of the rescue operation.

There are better cartoons of the specific diving actions, though. For example, the BBC has this visual that shows the particularly narrow part of the route, corresponding to the circular inset in the Straits Times version above.


The drama!

NYT also has a set of cartoons. Here's one:



There is one perspective that curiously has been underserved in all of the visualizations - this is the first-person perspective. Imagine the rescuer (or the kids) navigating the rescue route. It's a cross-section from the front, not from the side.

Various publications try to address this by augmenting the top-down route view with sporadic cross-sectional diagrams. Recall the first map we showed from the BBC. On the right column are little annotations of this type (here):


I picked out this part of the map because it shows that the little human figure serves two potentially conflicting purposes. In the bottom diagram, the figurine shows that there is limited headroom in this part of the cave, plus the actual position of the figurine on the ledge conveys information about where the kids were. However, on the top cross-section, the location of the figure conveys no information; the only purpose of the human figure is to show how tall the cave is at that site.

The South China Morning Post (here - site appears to be down when I wrote this) has this wonderful animation of how the shape of the headroom changed as they navigated the route. Please visit their page to see the full animation. Here are two screenshots:



This little clip adds a lot to the story! It'd be even better if the horizontal timeline at the bottom is replaced by the top-down route map.

Thank you all the various dataviz teams for these great efforts.




Some Tufte basics brought to you by your favorite birds

Someone sent me this via Twitter, found on the Data is Beautiful reddit:


The chart does not deliver on its promise: It's tough to know which birds like which seeds.

The original chart was also provided in the reddit:


I can see why someone would want to remake this visualization.

Let's just apply some Tufte fixes to it, and see what happens.

Our starting point is this:


First, consider the colors. Think for a second: order the colors of the cells by which ones stand out most. For me, the order is white > yellow > red > green.

That is a problem because for this data, you'd like green > yellow > red > white. (By the way, it's not explained what white means. I'm assuming it means the least preferred, so not preferred that one wouldn't consider that seed type relevant.)

Compare the above with this version that uses a one-dimensional sequential color scale:


The white color still stands out more than necessary. Fix this using a gray color.


What else is grabbing your attention when it shouldn't? It's those gridlines. Push them into the background using white-out.


The gridlines are also too thick. Here's a slimmed-down look:


The visual is much improved.

But one more thing. Let's re-order the columns (seeds). The most popular seeds are shown on the left, and the least on the right in this final revision.


Look for your favorite bird. Then find out which are its most preferred seeds.

Here is an animated gif to see the transformation. (Depending on your browser, you may have to click on it to view it.)



PS. [7/23/18] Fixed the 5th and 6th images and also in the animated gif. The row labels were scrambled in the original version.


Fantastic visual, but the Google data need some pre-processing

Another entry in the Google Newslab data visualization project that caught my eye is the "How to Fix It" project, illustrating search queries across the world that asks "how." The project web page is here.

The centerpiece of the project is an interactive graphic showing queries related to how to fix home appliances. Here is what it looks like in France (It's always instructive to think about how they would count "France" queries. Is it queries from queries written in French? queries from an IP address in France? A combination of the above?)


I particularly appreciate the lack of labels. When we see the pictures, we don't need to be told this is a window and that is a door. The search data concern the relative sizes of the appliances. The red dotted lines show the relative popularity of searches for the respective appliances in aggregate.

By comparison, the Russian picture looks very different:


Are the Russians more sensible? Their searches are far and away about the washing machine, which is the most complicated piece of equipment on the graphic.

At the bottom of the page, the project looks at other queries, such as those related to cooking. I find it fascinating to learn what people need help making:


I have to confess that I searched for "how to make soft boiled eggs". That led me to a lot of different webpages, mostly created for people who search for how to make a soft boiled egg. All of them contain lots of advertising, and the answer boils down to cook it for 6 minutes.


The Russia versus France comparison brings out a perplexing problem with the "Data" in this visualization. For competitive reasons, Google does not provide data on search volume. The so-called Search Index is what is being depicted. The Search Index uses the top-ranked item as the reference point (100). In the Russian diagram, the washing machine has Search Index of 100 and everything else pales in comparison.

In the France example, the window is the search item with the greatest number of searches, so it has Search Index of 100; the door has Index 96, which means it has 96% of the search volume of the window; the washing machine with Index 49 has about half the searches of the window.

The numbers cannot be interpreted as proportions. The Index of 49 does not mean that washing machines account for 49% of all France queries about fixing home appliances. That is really the meaning of popularity we want to have but we don't have. We can obtain true popularity measures by "normalizing" the Search Index: just sum up the Index Values of all the appliances and divide the Search Index by the sum of the Indices. After normalizing, the numbers can be interpreted as proportions and they add up to 100% for each country. When not normalized, the indices do not add to 100%.

Take the case in which we have five appliances, and let's say all five appliances are equally popular, comprising 20% of searches each. The five Search Indices will all be 100 because the top-ranked item is given the value of 100. Those indices add to 500!

By contrast, in the case of Russia (or a more extreme case), the top-ranked query is almost 100% of all the searches, so the sum of the indices will be only slightly larger than 100.

If you realize this, then you'd understand that it is risky to compare Search Indices across countries. The interpretation is clouded by how much of the total queries accounted for by the top query.

In our Trifecta Checkup, this is a chart that does well in the Question and Visual corners, but there is a problem with the Data.



Playfulness in data visualization

The Newslab project takes aggregate data from Google's various services and finds imaginative ways to enliven the data. The Beautiful in English project makes a strong case for adding playfulness to your data visualization.

Newslab_language_wordsnakeThe data came from Google Translate. The authors look at 10 languages, and the top 10 words users ask to translate from those languages into English.

The first chart focuses on the most popular word for each language. The crawling snake presents the "worldwide" top words.

The crawling motion and the curvature are not required by the data but it inserts a dimension of playfulness into the data that engages the reader's attention.

The alternative of presenting a data table loses this virtue without gaining much in return.

Readers are asked to click on the top word in each country to reveal further statistics on the word.

For example, the word "good" leads to the following:




The second chart presents the top 10 words by language in a lollipop style:


The above diagram shows the top 10 Japanese words translated into English. This design sacrifices concise in order to achieve playful.

The standard format is a data table with one column for each country, and 10 words listed below each country header in order of decreasing frequency.

The creative lollipop display generates more extreme emotions - positive, or negative, depending on the reader. The data table is the safer choice, precisely because it does not engage the reader as deeply.



Two nice examples of interactivity

Janie on Twitter pointed me to this South China Morning Post graphic showing off the mighty train line just launched between north China and London (!)


Scrolling down the page simulates the train ride from origin to destination. Pictures of key regions are shown on the left column, as well as some statistics and other related information.

The interactivity has a clear purpose: facilitating cross-reference between two chart forms.

The graphic contains a little oversight ... The label for the key city of Xian, referenced on the map, is missing from the elevation chart on the left here:



I also like the way New York Times handled interactivity to this chart showing the rise in global surface temperature since the 1900s. The accompanying article is here.


When the graph is loaded, the dots get printed from left to right. That's an attention grabber.

Further, when the dots settle, some years sink into the background, leaving the orange dots that show the years without the El Nino effect. The reader can use the toggle under the chart title to view all of the years.

This configuration is unusual. It's more common to show all the data, and allow readers to toggle between subsets of the data. By inverting this convention, it's likely few readers need to hit that toggle. The key message of the story concerns the years without El Nino, and that's where the graphic stands.

This is interactivity that succeeds by not getting in the way. 




What do we think of the "packed" bar chart?

Xan Gregg - my partner in the #onelesspie campaign to replace terrible Wikipedia pie charts one at a time - has come up with a new chart form that he calls "packed bars". It's a combination of bar charts and the treemap.

Here is an example of a packed barchart, in which the top 10 companies on the S&P500 index are displayed:


What he's doing is to add context to help interpret the data. So frequently these days, we encounter data analyses of the "Top X" or "Bottom Y" type. Such analyses are extremely limited in utility as it ignores the bulk of the data. The extreme values have little to nothing to say about the rest of the data. This problem is particularly acute in skewed data.

Compare the two versions:


The left chart is a Top 10 analysis. The reader knows nothing about the market cap of the other 490 companies. The right chart provides the context. We can see that the Top 10 companies have a combined market cap that is roughly a quarter of the total market cap in the S&P 500. We also learn about the size of the next 10 versus the Top 10, etc.

As with any chart form, a nice dataset can really surface its power. I really like what the packed barchart reveals about the election data by county:


(Thanks to Xan for providing me this image.)

Notice the preponderance of red on the right side and the gradual shift from blue/purple to pink/red moving left to right. This is very effective at showing one of the most important patterns in American politics - the small counties are mostly deep red while the Democratic base is to be found primarily in large metropolitan areas. I have previously featured a number of interesting election graphics here. Washington Post's nation of peaks is another way to surface this pattern.

Xan would love to get feedback about this chart type. He has put up a blog post here with more details. I also love this animation he created to show how the packing occurs.