Here's a radar chart that works, sort of

In the same Reuters article that featured the speedometer chart which I discussed in this blog post (link), the author also deployed a small multiples of radar charts.

These radar charts are supposed to illustrate the article's theme that "European countries are racing to fill natural gas storage sites ahead of winter."

Here's the aggregate chart that shows all countries:


In general, I am not a fan of radar charts. When I first looked at this chart, I also disliked it. But keep reading because I eventually decided that this usage is an exception. One just needs to figure out how to read it.

One reason why I dislike radar charts is that they always come with a lot of non-data-ink baggage. We notice that the months of the year are plotted in a circle starting at the top. They marked off the start of the war on Feb 24, 2022 in red. Then, they place the dotted circle, which represents the 80% target gas storage amount.

The trick is to avoid interpreting the areas, or the shapes of the blue and gray patches. I know, they look cool and grab our attention but in the context of conveying data, they are meaningless.

Redo_reuters_eugasradarall_1Instead of areas, focus on the boundaries of those patches. Don't follow one boundary around the circle. Pick a point in time, corresponding to a line between the center of the circle and the outermost circle, and look at the gap between the two lines. In the diagram shown right, I marked off the two relevant points on the day of the start of the war.

From this, we observe that across Europe, the gas storage was far less than the 80% target (recently set).

By comparing two other points (the blue and gray boundaries), we see that during February, Redo_reuters_eugasradarall_2gas storage is at a seasonal low, and in 2022, it is on the low side of the 5-year average. 

However, the visual does not match well with the theme of the article! While the gap between the blue and gray boundaries decreased since the start of the war, the blue boundary does not exceed the historical average, and does not get close to 80% until August, a month in which gas storage reaches 80% in a typical year.

This is example of a chart in which there is a misalignment between the Q and the V corners of the Trifecta Checkup (link).


The question/message is that Europeans are reacting to the war by increasing their gas storage beyond normal. The visual actually says that they are increasing the gas storage as per normal.


As I noted before, when read in a particular way, these radar charts serve their purpose, which is more than can be said for most radar charts.

The designer made several wise choices:

Instead of drawing one ring for each year of data, the designer averaged the past 5 years and turned that into one single ring (patch). You can imagine what this radar chart would look like if the prior data were not averaged: hoola hoop mania!


Simplifying the data in this way also makes the small multiples work. The designer uses the aggregate chart as a legend/how to read this. And in a further section below, the designer plots individual countries, without the non-data-ink baggage:


Thanks againto longtime reader Antonio R. who submitted this chart.

Happy Labor Day weekend for those in the U.S.!




Metaphors give and take

Another submission came in from Euro Twitter. The following chart is probably from Germany:


As JB noted, this chart explains a financial pyramid scheme. I believe the numbers on the left are participants while the numbers on the right are the potential ill-gotten gains per person. The longer the pyramid scheme lasts, the more people participate, the more money flows to the top.

The pyramid is a natural metaphor for visualizing pyramid schemes. The levels of the pyramid correspond to levels of a pyramid scheme - the newly recruited participants expand the base while passing revenues up the pyramid.


The chart fails because it's not really a dataviz. There are exactly three bars that are scaled according to data. Everything else is presented as data labels.

Let's look at the two data series separately:


Each series is exponentially growing (in opposite directions). [Some of the data labels for participants may be incorrect.]

Unfortunately, the triangle is not a good medium to display exponential growth. In fact, the triangular structure imposes a linear growth constraint. The length of the base is directly proportional to the height from the top. As one traverses downwards level by level, the width of the base grows linearly - not exponentially.

To illustrate exponential growth, the edge of the triangle cannot be a straight line - it has to be s steep curve!


While natural, the pyramid metaphor is also severely restricting. The choice of chart form has unexpected consequences.


Funnel is just for fun

This is part 2 of a review of a recent video released by NASA. Part 1 is here.

The NASA video that starts with the spiral chart showing changes in average global temperature takes a long time (about 1 minute) to run through 14 decades of data, and for those who are patient, the chart then undergoes a dramatic transformation.

With a sleight of hand, the chart went from a set of circles to a funnel. Here is a look:


What happens is the reintroduction of a time dimension. Imagine pushing the center of the spiral down into the screen to create a third dimension.

Our question as always is - what does this chart tell readers?


The chart seems to say that the variability of temperature has increased over time (based on the width of the funnel). The red/blue color says the temperature is getting hotter especially in the last 20-40 years.

When the reader looks beneath the surface, the chart starts to lose sense.

The width of the funnel is really a diameter of the spiral chart in the given year. But, if you recall, the diameter of the spiral (polar) chart isn't the same between any pairs of months.


In the particular rendering of this video, the width of the funnel is the diameter linking the April and October values.

Remember the polar gridlines behind the spiral:


Notice the hole in the middle. This hole has arbitrary diameter. It can be as big or as small as the designer makes it. Thus, the width of the funnel is as big or as small as the designer wants it. But the first thing that caught our attention is the width of the funnel.


The entire section between -1 and + 1 is, in fact, meaningless. In the following chart, I removed the core of the funnel, adding back the -1 degree line. Doing so exposes an incompatibility between the spiral and funnel views. The middle of the polar grid is negative infinity, a black hole.


For a moment, the two sides of the funnel look like they are mirror images. That's not correct, either. Each width of the funnel represents a year, and the extreme values represent April and October values. The line between those two values does not signify anything real.

Let's take a pair of values to see what I mean.


I selected two values for October 2021 and October 1899 such that the first value appears as a line double the length of the second. The underlying values are +0.99C and -0.04C, roughly speaking, +1 and 0, so the first value is definitely not twice the size of the second.

The funnel chart can be interpreted, in an obtuse way, as a pair of dot plots. As shown below, if we take dot plots for Aprils and Octobers of every year, turn the chart around, and then connect the corresponding dots, we arrive at the funnel chart.



This NASA effort illustrates a central problem in visual communications: attention (what Andrew Gelman calls "grabbiness") and information integrity. On the one hand, what's the point of an accurate chart when no one is paying attention? On the other hand, what's the point of a grabby chart when anyone who pays attention gets the wrong information? It's not easy to find that happy medium.

How does the U.K. vote in the U.N.?

Through my twitter feed, I found my way to this chart, made by jamie_bio.


This is produced using R code even though it looks like a slide.

The underlying dataset concerns votes at the United Nations on various topics. Someone has already classified these topics. Jamie looked at voting blocs, specifically, countries whose votes agree most often or least often with the U.K.

If you look at his Github, this is one in a series of works he produced to hone his dataviz skills. Ultimately, I think this effort can benefit from some re-thinking. However, I also appreciate the work he has put into this.

Let's start with the things I enjoyed.

Given the dataset, I imagine the first visual one might come up with is a heatmap that shows countries in rows and topics in columns. That would work ok, as any standard chart form would but it would be a data dump that doesn't tell a story. There are almost 200 countries in the entire dataset. The countries can only be ordered in one way so if it's ordered for All Votes, it's not ordered for any of the other columns.

What Jamie attempts here is story-telling. The design leads the reader through a narrative. We start by reading the how-to-read-this box on the top left. This tells us that he's using a lunar eclipse metaphor. A full circle in blue indicates 0% agreement while a full circle in white indicates 100% agreement. The five circles signal that he's binning the agreement percentages into five discrete buckets, which helps simplify our understanding of the data.

Then, our eyes go to the circle of circles, labelled "All votes". This is roughly split in half, with the left side showing mostly blue and the right showing mostly white. That's because he's extracting the top 5 and bottom 5 countries, measured by their vote alignment with the U.K. The countries names are clearly labelled.

Next, we see the votes broken up by topics. I'm assuming not all topics are covered but six key topics are highlighted on the right half of the page.

What I appreciate about this effort is the thought process behind how to deliver a message to the audience. Selecting a specific subset that addresses a specific question. Thinning the materials in a way that doesn't throw the kitchen sink at the reader. Concocting the circular layout that presents a pleasing way of consuming the data.


Now, let me talk about the things that need more work.

I'm not convinced that he got his message across. What is the visual telling us? Half of the cricle are aligned with the U.K. while half aren't so the U.K. sits on the fence on every issue? But this isn't the message. It's a bit of a mirage because the designer picked out the top 5 and bottom 5 countries. The top 5 are surely going to be voting almost 100% with the U.K. while the bottom 5 are surely going to be disagreeing with the U.K. a lot.

I did a quick sketch to understand the whole distribution:


This is not intended as a show-and-tell graphic, just a useful way of exploring the dataset. You can see that Arms Race/Disarmament and Economic Development are "average" issues that have the same form as the "All issues" line. There are a small number of countries that are extremely aligned with the UK, and then about 50 countries that are aligned over 50% of the time, then the other 150 countries are within the 30 to 50% aligned. On human rights, there is less alignment. On Palestine, there is more alignment.

What the above chart shows is that the top 5 and bottom 5 countries both represent thin slithers of this distribution, which is why in the circular diagrams, there is little differentiation. The two subgroups are very far apart but within each subgroup, there is almost no variation.

Another issue is the lunar eclipse metaphor. It's hard to wrap my head around a full white circle indicating 100% agreement while a full blue circle shows 0% agreement.

In the diagrams for individual topics, the two-letter acronyms for countries are used instead of the country names. A decoder needs to be provided, or just print the full names.







Dreamy Hawaii

I really enjoyed this visual story by ProPublica and Honolulu Star-Advertiser about the plight of beaches in Hawaii (link).

The story begins with a beautiful invitation:


This design reminds me of Vimeo's old home page. (It no longer looks like this today but this screenshot came from when I was the data guy there.) In both cases, the images are not static but moving.


The tour de force of this visual story is an annotated walk along the Lanikai Beach. Here is a snapshot at one of the stops:


This shows a particular homeowner who, according to documents, was permitted to rebuild a destroyed seawall even though officials were supposed to disallow reconstruction in order to protect beaches from eroding. The property is marked on the map above. The image inside the box is a gif showing waves smashing the seawall.

As the reader scrolls down, the image window runs through a carousel of gifs of houses along the beach. The images are synchronized to the reader's progress along the shore. The narrative makes stops at specific houses at which point a text box pops up to provide color commentary.


The erosion crisis is shown in this pair of maps.


There's some fancy work behind the scenes to patch together images, and estimate the boundaries of th beaches.


The following map is notable for its simplicity. There are no unnecessary details and labels. We don't need to know the name of every street or a specific restaurant. Removing excess details makes readers focus on the informative parts. 


Clicking on the dots brings up more details.


Enjoy the entire story here.

I made a streamgraph

The folks at FiveThirtyEight were excited about the following dataviz they published last week two weeks ago, illustrating the progression of vote-counting by state. (link) That was indeed the unique and confusing feature of the 2020 Presidential election in the States. For those outside the U.S., what happened (by and large) was that many Americans, skewing Biden supporters, voted by mail before Election Day but their votes were sometimes counted after the same-day votes were tallied.



A number of us kept staring at these charts, hoping for a how-to-read-it explanation. Here is a zoom-in for the state of Michigan:


To save you the trouble, here is how.

The key is to fight your urge to look at the brown area. I know, it's pretty hard to ignore the biggest areas of every chart. But try to make them disappear.

Focus on the top edge of the chart. This line gives the total number of votes counted so far. In Michigan, by hour 12, about 2.4 million votes were counted, and by hour 72, 2.8 million votes were on the book. This line gives the sum of the two major parties' vote totals [since third parties got negligible votes in this election, I'm ignoring them so as to simplify the discussion].

Next, look at the red and blue areas. These represent the gap in the number of votes between the two parties' current vote totals. If the area is red, Trump was leading; if blue, Biden was leading. Each color flip represents a lead change. Suppress the urge to interpret red as the number or share of Trump votes.


What have we learned about the vote counting in Michigan?

Counting significantly slowed after the 12th hour. Trump raced to a lead on Election Day, and around hour 20, the race was dead even, and after that, Biden overtook Trump and never looked back. Throughout most of this period, the vote lead was small compared to the total votes cast although at the end, the Biden lead was noticeable.

If you insist on interpreting the brown area, it is equal to twice the vote total of the second-place candidate, so it really isn't something you want to look at.

Just for contrast, here is the chart for Iowa:


Trump led from beginning to end, with his lead widening slightly as more votes were counted.


As I was stewing over this chart, a ominous thought overcame me. Would a streamgraph work for this data? You don't hear much about streamgraphs here because I rarely favor them (see this long-ago post) but let's just try one and see.


(These streamgraphs were made in R using the streamgraph package. Post-processing was applied to customize the labeling.)

This chart conveys all the key points listed before. You can see how the gap evolved over time, the lead flips, which candidate was in the lead, and the total mass of votes counted at different times. The gap is shown in the middle.

I can't say I'm completely happy with the streamgraph - I hope readers don't care about the numbers because it's hard to evaluate a difference when it's split two ways on either side of the middle axis!


If you come up with a better idea, make sure to leave a comment.





Election visuals 4: the snake pit is the best election graphic ever

This is the final post on the series of data visualization deployed by FiveThirtyEight to explain their election forecasting model. The previous posts are here, here and here.

I'm saving the best for last.


This snake-pit chart brings me great joy - I wish I came up with it!

This chart wins by focusing on a limited set of questions, and doing so excellently. As with many election observers, we understand that the U.S. presidential election will turn on so-called "swing states," and the candidates' strength in these swing states are variable, as the name suggests. Thus, we like to know which states are in play, and within these states, which ones are most unpredictable.

This chart lines up all the states from the reddest of red up top to the bluest of blue at the bottom. Each state is ranked by the voting margin predicted by 538's election forecasting model. The swing states are found in the middle.

Since each state confers a fixed number of electoral votes, and a candidate must amass 270 to win, there is a "tipping" state. In the diagram above, it's Pennsylvania. This pivotal state is neatly foregrounded as the one crossing the line in the middle.

The lengths of the segments correspond to the number of electoral votes and so do not change with the data. What change are the sequencing of the segments, and the color shading.

This data visualization is a gem of visual story-telling. The form lends itself to a story.


The snake-pit chart succeeds by not doing too much. There are many items that the chart does not directly communicate.

The exact number of electoral votes by state is not explicit, nor is it easy to compare the lengths of bending segments. The color scale for conveying the predicted voting margins is crude, and it's not clear what is the difference between a deep color and a light color. It's also challenging to learn the electoral vote split; the actual winning margin is not even stated.

The reality is the average reader doesn't care. I got everything I wanted from the chart, and I ain't got the time to explore every state.

There is a hover-over effect that reveals some of the additional information:


One can keep going on. I have no idea how the 40,000 scenarios presented in the other graphics in this series have been reduced to the forecast shown in the inset. But again, those omissions did not lessen my enjoyment. The point is: let your graphics breathe.


I'm thinking of potential variations even though I'm fully satisfied with this effort.

I wonder if the color shading should be reversed. The light shading encodes a smaller voting margin, which indicates a tighter race. But our attention is typically drawn first to the darker shades. If the shading scheme is reversed, the color should be described as how tight the race is.

I also wonder if a third color (purple) should be introduced. Doing so would require the editors to make judgment calls on which set of states are swing states.

One strange thing about election day is the specific sequence of when TV stations (!) call the state results, which not only correlates with voting margin but also with time zones. I wonder if the time zone information can be worked into the sequencing of segments.

Let me know what you think of these ideas, or leave your own ideas, in the comments below.


I have already praised this graphic when it first came out in 2016. (link)

A key improvement is tilting the chart, which avoids vertical state labels.

The previous post was written around election day 2016. The snake pit further cements its status as a story-telling device. As states are called, they are taken out of the picture. So it works very well as a dynamic chart on election day.

I'm nominating this snake-pit chart as the best election graphic ever. Kudos to the FiveThirtyEight team.

Ask how you can give

A reader and colleague Georgette A was frustrated with the following graphic that appeared in the otherwise commendable article in National Geographic (link). The NatGeo article provides a history lesson on past pandemics that killed millions.


What does the design want to convey to readers?

Our attention is drawn to the larger objects, the red triangle on the left or the green triangle on the right. Regarding the red triangle, we learn that the base is the duration of the pandemic while the height of the black bar represents the total deaths.

An immediate curiosity is why a green triangle is lodged in the middle of the red triangle. Answering this question requires figuring out the horizontal layout. Where we expect axis labels we find an unexpected series of numbers (0, 16, 48, 5, 2, 4, ...). These are durations that measure the widths of the triangular bases.

To solve this puzzle, imagine the chart with the triangles removed, leaving just the black columns. Now replace the durations with index numbers, 1 to 13, corresponding to the time order of the ending years of these epidemics. In other words, there is a time axis hidden behind the chart. [As Ken reminded me on Twitter, I forgot to mention that details of each pandemic are revealed by hovering over each triangle.]

This explains why the green triangle (Antonine Plague) is sitting inside the large red triangle (Plague of Justinian). The latter's duration is 3 times that of the former, and the Antonine Plague ended before the Plague of Justinian. In fact, the Antonine occurred during 165-180 while the Justinian happened during 541-588. The overlap is an invention of the design. To receive what the design gives, we have to think of time as a sequence, not of dates.


Now, compare the first and second red triangles. Their black columns both encode 50 million deaths. The Justinian Plague however was spread out over 48 years while the Black Death lasted just 5 years. This suggests that the Black Death was more fearsome than the Justinian Plague. And yet, the graphic presents the opposite imagery.

This is a pretty tough dataset to visualize. Here is a side-by-side bar chart that lets readers first compare deaths, and then compare durations.


In the meantime, I highly recommend the NatGeo article.

Twitter people UpSet with that Covid symptoms diagram

Been busy with an exciting project, which I might talk about one day. But I promised some people I'll follow up on Covid symptoms data visualization, so here it is.

After I posted about the Venn diagram used to depict self-reported Covid-19 symptoms by users of the Covid Symptom Tracker app (reported by Nature), Xan and a few others alerted me to Twitter discussion about alternative visualizations that people have made after they suffered the indignity of trying to parse the Venn diagram.

To avoid triggering post-trauma, for those want to view the Venn diagram, please click here.

[In the Twitter links below, you almost always have to scroll one message down - saving tweets, linking to tweets, etc. are all stuff I haven't fully figured out.]

Start with the Questions

Xan’s final comment is especially appropriate: "There's an over-riding Type-Q issue: count charts answer the wrong question".

As dataviz designers, we frequently get locked into the mindset of “what is the best way to present this dataset?” This line of thinking leads to overloaded graphics that attempt to answer every possible question that may arise from the data in one panoptic chart, akin to juggling 10 balls at once.

For complex datasets, it is often helpful to narrow down the list of questions, and provide a series of charts, each addressing one or two questions. I’ll come back to this point. I want to first show some of the nicer visuals that others have produced, which brings out the structure and complexity of this dataset.


The UpSet chart

The primary contender is the “UpSet” chart form, as best exemplified by Bart’s effort


The centerpiece of this chart is the matrix of dots. The horizontal rows of dots represent the presence of specific symptoms such as cough and anosmia (loss of smell and taste). The vertical columns are intuitive, once you get it. They represent combinations of symptoms, and the fill/no-fill of the dots indicates which symptoms are being combined. For example, the first column counts people reporting fatigue plus anosmia (but nothing else).

The UpSet chart clearly communicates the structure of the data. In many survey questions (including this one conducted by the Symptom Tracker app), respondents are allowed to check/tick more than one answer choices. This creates a situation where the number of answers (here, symptoms) per respondent can be zero up to the total number of answer choices.

So far, we have built a structure like we have drawn country outlines on a map. There is no data yet. The data are primarily found in the sidebar histograms (column/bar charts). Reading horizontally to the right side, one learns that the most frequently reported symptom was fatigue, covering 88 percent of the users.* Reading vertically, one learns that the top combination of symptoms was fatigue plus anosmia, covering 16 percent of users.


Now come the divisive acts.

Act 1: Bart orders the columns in a particular way that meets his subjective view of how he wants readers to see the data. The columns are sorted from the most frequent combinations to the least. The histogram has a “long tail”, with most of the combinations receiving a small proportion of the total. The top five combinations is where the bulk of the data is – I’d have liked to see all five columns labeled, without decimal places.

This is a choice on the part of the designer. Nils, for example, made two versions of his UpSet charts. The second version arranges the combinations from singles to quintuples.

Nils Gehlenborg_upsetplot_sortedbynumberofsymptoms


Digression: The Visual in Data Visualization

The two rendering of “UpSet” charts, by Nils and Bart, is a perfect illustration of the Trifecta Checkup framework. Each corner of the Trifecta is an independent dimension, and yet all must sync. With the same data and the same question types, what differentiates the two versions is the visual design.

See how many differences you can find, and make your own design choices!


I place the digression here because Act 1 above has to do with the Q corner, and both visual designs can accommodate the sorting decisions. But Act 2 below pertains to the V corner.

Act 2: Bart applies a blue gradient to the matrix of dots that reinforces his subjective view about identifying frequent combinations of symptoms. Nils, by contrast, uses the matrix to show present/absent only.

I’m not sure about Act 2. I think the addition of the color gradient overloads the matrix in the chart. It has the nice effect of focusing the reader’s attention on the top 5 combinations but it also requires the reader to have understood the meaning of columns first. Perhaps applying the gradient to the histogram up top rather than the dots in the matrix can achieve the same goal with less confusion.


Getting Obtuse

For example, some readers (e.g. Robin) expressed confusion.

Robin is alleging something the chart doesn’t do. He pointed out (correctly) that while 16 percent experienced fatigue and anosmia only (without other symptoms), more than 50 percent reported fatigue and anosmia, plus other symptoms. That nugget of information is deeply buried inside Bart’s chart – it’s the sum of each column for which the first two dots are filled in. For example, the second column represents fatigue+anosmia+cough. So Robin wants to aggregate those up.

Robin’s critique arises from the Q(uestion) corner. If the designer wants to highlight specific combinations that occur most frequently in the data, then Bart’s encoding makes perfect sense. On the other hand, if the purpose is to highlight pairs of symptoms that occur most frequently together (disregarding symptoms outside each pair), then the data must be further aggregated. The switch in the Question requires more Data manipulation, which then affects the Visualization. That's the essence of the Trifecta Checkup framework.

Rest assured, the version that addresses Robin’s point will not give an easy answer to Bart’s question. In fact, Xan whipped up a bar chart in response:


This is actually hard to comprehend because Robin’s question is even hard to state. The first bar shows 87 percent of users reported fatigue as a symptom, the same number that appeared on Bart’s version on the right side. Then, the darkened section of the bar indicates the proportion of users who reported only fatigue and nothing else, which appears to be about 10 percent. So 1 out of 9 reported just fatigue while 8 out of 9 who reported fatigue also experienced other symptoms.


Xan’s bar chart can be flipped 90 degrees and replace Bart’s histogram on top of the matrix. But you see, we end up with the same problem as I mentioned up top. By jamming more insights from more questions onto the same chart, we risk dropping the other balls that were already in the air.

So, my advice is always to first winnow down the list of questions you want to address. And don’t be afraid of making a series of charts instead of one panoptic chart.


Act 3: Bart decides to leave out labels for the columns.

This is a curious choice given the key storyline we’ve been working with so far (the Top 5 combinations of symptoms). But notice how annoying this problem is. Combinations require long text, which must be written vertically or slanted on this design. Transposing could help but not really. It’s just a limitation of this chart form. For me, reading the filled dots underneath the columns as column labels isn’t a show-stopper.


Histograms vs Bar Charts

It’s worth pointing out that the sidebar “histograms” are not both histograms. I tend to think of histograms as a specific type of bar (column) chart, in which the sum of the bars (columns) can be interpreted as a whole. So all histograms are bar charts but only some bar charts are histograms.

The column chart up top is a histogram. The combinations of symptoms are disjoint, and the total of the combinations should be the total number of answer choices selected by all respondents. The bar chart on the right side however is not a histogram. Each percentage is a proportion to the whole, and adding those percentages yields way above 100%.

I like the annotation on Bart’s chart a lot. They are succinct and they give just the right information to explain how to read the chart.



I already mentioned the vertical labeling issue for UpSet charts. Here are two other considerations for you.

The majority of the plotting area is dedicated to the matrix of dots. The matrix contains merely labels for data. They are like country boundaries on a map. While it lays out the structure of data very clearly, the designer should ask whether it is essential for the readers to see the entire landscape.

In real-world data, the “long tail” phenomenon we saw earlier is very common. With six featured symptoms, there are 2^6 = 64 possible combinations of symptoms (minus 1 if they filtered out those not reporting symptoms*), almost all of which will be empty. Should the low-frequency columns be removed? This is not as controversial as you think, because implicitly both Bart and Nils already dropped all empty combinations!


Data and Code

Kieran Healy left a comment on the last post, and you can find both the data (thank you!) and some R code for UpSet charts at his blog.

Also, Nils has a Shiny app on Github.


(*) One must be very careful about what “users” are being represented. They form a tiny subset of users of the Symptom Tracker app, just those who have previously taken a diagnostic test and have self-reported at least one symptom. I have separately commented on the analyses of this dataset by the team behind the app. The first post discusses their analytical methods, the second post examines how they pre-processed the data, and a future post will describe the data collection practices. For the purpose of this blog post, I’ll ignore any data issues.

(#) Bart’s chart is conceptual because some of the columns of dots are repeated, and there is one column without fills, which should have been removed by a pre-processing step applied by the research team.

This exercise plan for your lock-down work-out is inspired by Venn

A twitter follower did not appreciate this chart from Nature showing the collection of flu-like symptoms that people reported they have to an UK tracking app. 

Nature tracking app venn diagram

It's a super-complicated Venn diagram. I have written about this type of chart before (see here); it appears to be somewhat popular in the medicine/biology field.

A Venn diagram is not a data visualization because it doesn't plot the data.

Notice that the different compartments of the Venn diagram do not have data encoded in the areas. 

The chart also fails the self-sufficiency test because if you remove the data from it, you end up with a data container - like a world map showing country boundaries and no data.

If you're new here: if a graphic requires the entire dataset to be printed on it for comprehension, then the visual elements of the graphic are not doing any work. The graphic cannot stand on its own.

When the Venn diagram gets complicated, teeming with many compartments, there will be quite a few empty compartments. If I have to make this chart, I'd be nervous about leaving out a number or two by accident. An empty cell can be truly empty or an oversight.

Another trap is that the total doesn't add up. The numbers on this graphic add to 1,764 whereas the study population in the preprint was 1,702. Interestingly, this diagram doesn't show up in the research paper. Given how they winnowed down the study population from all the app downloads, I'm sure there is an innocent explanation as to why those two numbers don't match.


The chart also strains the reader. Take the number 18, right in the middle. What combination of symptoms did these 18 people experience? You have to figure out the layers sitting beneath the number. You see dark blue, light blue, orange. If you blink, you might miss the gray at the bottom. Then you have to flip your eyes up to the legend to map these colors to diarrhoea, shortness of breath, anosmia, and fatigue. Oops, I missed the yellow, which is the cough. To be sure, you look at the remaining categories to see where they stand - I've named all of them except fever. The number 18 lies outside fever so this compartment represents everything except fever. 

What's even sadder is there is not much gain from having done it once. Try to interpret the number 50 now. Maybe I'm just slow but it doesn't get better the second or third time around. This graphic not only requires work but painstaking work!

Perhaps a more likely question is how many people who had a loss of smell also had fever. Now it's pretty easy to locate the part of the dark gray oval that overlaps with the orange oval. But now, I have to add all those numbers, 69+17+23+50+17+46 = 222. That's not enough. Next, I must find the total of all the numbers inside the orange oval, which is 222 plus what is inside the orange and outside the dark gray. That turns out to be 829. So among those who had lost smell, the proportion who also had fever is 222/(222+829) = 21 percent. 

How many people had three or more symptoms? I'll let you figure this one out!