Twitter people UpSet with that Covid symptoms diagram


A Twitter follower did not appreciate this chart from Nature showing the collection of flu-like symptoms that people reported having to a UK tracking app.

Nature tracking app venn diagram

It's a super-complicated Venn diagram. I have written about this type of chart before (see here); it appears to be somewhat popular in the medicine/biology field.

A Venn diagram is not a data visualization because it doesn't plot the data.

Notice that the different compartments of the Venn diagram do not have data encoded in the areas. 

The chart also fails the self-sufficiency test because if you remove the data from it, you end up with a data container - like a world map showing country boundaries and no data.

If you're new here: if a graphic requires the entire dataset to be printed on it for comprehension, then the visual elements of the graphic are not doing any work. The graphic cannot stand on its own.

When the Venn diagram gets complicated, teeming with many compartments, there will be quite a few empty compartments. If I had to make this chart, I'd be nervous about leaving out a number or two by accident. An empty cell can be truly empty or an oversight.
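To get a feel for how many compartments there are to keep track of, here is a minimal Python sketch. It assumes the six symptoms named in this post and that every combination of them gets its own compartment, which a particular drawing may or may not achieve. The only count filled in is the 18 discussed in this post; the helper names are mine.

```python
from itertools import combinations

# The six symptoms named in this post.
SYMPTOMS = ["fever", "cough", "fatigue", "anosmia",
            "shortness of breath", "diarrhoea"]

def all_compartments(symptoms):
    """Every non-empty combination of symptoms, one per possible compartment."""
    return [frozenset(combo)
            for size in range(1, len(symptoms) + 1)
            for combo in combinations(symptoms, size)]

def unlabeled(compartments, counts):
    """Compartments that have no number attached: truly empty, or an oversight?"""
    return [c for c in compartments if c not in counts]

compartments = all_compartments(SYMPTOMS)

# Only one compartment is identified in this post: 18 people with every
# symptom except fever. A real check would fill in the rest from the chart.
counts = {frozenset(SYMPTOMS) - {"fever"}: 18}

missing = unlabeled(compartments, counts)
print(len(compartments), "possible compartments;", len(missing), "still unlabeled")
```

With six symptoms there are 2^6 - 1 = 63 non-empty combinations, so a blank cell is easy to overlook by eye.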

Another trap is that the total doesn't add up. The numbers on this graphic add to 1,764 whereas the study population in the preprint was 1,702. Interestingly, this diagram doesn't show up in the research paper. Given how they winnowed down the study population from all the app downloads, I'm sure there is an innocent explanation as to why those two numbers don't match.


The chart also strains the reader. Take the number 18, right in the middle. What combination of symptoms did these 18 people experience? You have to figure out the layers sitting beneath the number. You see dark blue, light blue, orange. If you blink, you might miss the gray at the bottom. Then you have to flip your eyes up to the legend to map these colors to diarrhoea, shortness of breath, anosmia, and fatigue. Oops, I missed the yellow, which is the cough. To be sure, you look at the remaining categories to see where they stand - I've named all of them except fever. The number 18 lies outside fever so this compartment represents everything except fever. 

What's even sadder is there is not much gain from having done it once. Try to interpret the number 50 now. Maybe I'm just slow but it doesn't get better the second or third time around. This graphic not only requires work but painstaking work!

Perhaps a more likely question is what proportion of people who had a loss of smell also had fever. It's pretty easy to locate the part of the dark gray oval that overlaps with the orange oval. But then I have to add all those numbers: 69+17+23+50+17+46 = 222. That's not enough. Next, I must find the total of all the numbers inside the orange oval, which is 222 plus whatever is inside the orange and outside the dark gray. The latter turns out to be 829. So among those who had lost smell, the proportion who also had fever is 222/(222+829) = 21 percent.
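For anyone who wants to check that arithmetic, here is a short Python sketch using only the numbers quoted in the paragraph above. The variable names are mine.

```python
# Counts read off the chart, as quoted in the post.
fever_and_anosmia_cells = [69, 17, 23, 50, 17, 46]  # orange overlapping dark gray
anosmia_without_fever = 829                          # orange outside dark gray

both = sum(fever_and_anosmia_cells)
anosmia_total = both + anosmia_without_fever
share = both / anosmia_total

print(f"{both} of {anosmia_total} with loss of smell also had fever: {share:.0%}")
```

Six additions, one more addition, and a division: that is a lot of work for the reader to do by hand for a single question.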

How many people had three or more symptoms? I'll let you figure this one out!
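If the numbers were available as data rather than as labels on a picture, this question would take a few lines. The Python sketch below assumes each compartment is stored as a set of symptoms with a count. Only the 18 comes from this post; every other number is a made-up placeholder, not read from the chart.

```python
# Each key is the set of symptoms defining one compartment; each value is the
# number of people in it. Only the 18 is quoted in the post. All other
# numbers are placeholders for illustration, NOT values from the chart.
counts = {
    frozenset({"cough", "fatigue", "anosmia",
               "shortness of breath", "diarrhoea"}): 18,
    frozenset({"fever", "cough", "fatigue"}): 10,  # placeholder
    frozenset({"cough", "fatigue"}): 20,           # placeholder
    frozenset({"anosmia"}): 30,                    # placeholder
}

def people_with_at_least(counts, k):
    """Sum the compartments whose symptom set has k or more members."""
    return sum(n for combo, n in counts.items() if len(combo) >= k)

three_plus = people_with_at_least(counts, 3)
total = sum(counts.values())  # the same data gives the grand total for free

print(three_plus, "of", total, "had three or more symptoms")
```

Summing the values also gives the grand total, which is the check that would flag a mismatch with the reported study population.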










Michael Thompson

Yikes. Simple boring matrix would have been a better start. Or even better, asking: why would I even make such a thing?

Cheryl Renee Thompson Smith

Appreciated is missing its, "r."

Karl Ove Hufthammer

A much better way of visualizing the data is an UpSet plot. Someone has actually created such a plot for these data:


KOH: Yes, Xan alerted me to those and other charts. I'll be posting about them next week. Thanks for your note!

CRTS: I'm making many typos recently. It's quite disturbing. It's fixed.


It would be interesting to look at the full dataset using latent class analysis to see how many classes of subjects there were and how much class membership determines whether someone has COVID-19. I had a moderate case of the flu last year and lost my sense of smell and taste. I didn't notice for a few days, until I realised that I didn't feel hungry because everything I ate seemed to have no taste.

The reason they don't have a matrix is that it would probably require 50 lines or more. The inconsistency in the numbers shows the importance of reproducible research. Having a program that outputs everything means not having to redo things every time you find a data error.
