This exercise plan for your lock-down work-out is inspired by Venn
Apr 23, 2020
A Twitter follower did not appreciate this chart from Nature showing the collection of flu-like symptoms that people reported to a UK tracking app.
It's a super-complicated Venn diagram. I have written about this type of chart before (see here); it appears to be somewhat popular in the medicine/biology field.
A Venn diagram is not a data visualization because it doesn't plot the data.
Notice that the different compartments of the Venn diagram do not have data encoded in the areas.
The chart also fails the self-sufficiency test because if you remove the data from it, you end up with a data container - like a world map showing country boundaries and no data.
If you're new here: if a graphic requires the entire dataset to be printed on it for comprehension, then the visual elements of the graphic are not doing any work. The graphic cannot stand on its own.
When the Venn diagram gets complicated, teeming with many compartments, there will be quite a few empty compartments. If I had to make this chart, I'd be nervous about leaving out a number or two by accident. An empty cell can be truly empty or an oversight.
Another trap is that the total doesn't add up. The numbers on this graphic add to 1,764 whereas the study population in the preprint was 1,702. Interestingly, this diagram doesn't show up in the research paper. Given how they winnowed down the study population from all the app downloads, I'm sure there is an innocent explanation as to why those two numbers don't match.
***
The chart also strains the reader. Take the number 18, right in the middle. What combination of symptoms did these 18 people experience? You have to figure out the layers sitting beneath the number. You see dark blue, light blue, orange. If you blink, you might miss the gray at the bottom. Then you have to flip your eyes up to the legend to map these colors to diarrhoea, shortness of breath, anosmia, and fatigue. Oops, I missed the yellow, which is the cough. To be sure, you look at the remaining categories to see where they stand - I've named all of them except fever. The number 18 lies outside fever so this compartment represents everything except fever.
What's even sadder is there is not much gain from having done it once. Try to interpret the number 50 now. Maybe I'm just slow but it doesn't get better the second or third time around. This graphic not only requires work but painstaking work!
Perhaps a more likely question is how many people who had a loss of smell also had fever. It's pretty easy to locate the part of the dark gray oval that overlaps with the orange oval. But then I have to add all those numbers: 69+17+23+50+17+46 = 222. That's not enough. Next, I must find the total of all the numbers inside the orange oval, which is 222 plus what is inside the orange and outside the dark gray. The latter turns out to be 829. So among those who had lost smell, the proportion who also had fever is 222/(222+829) = 21 percent.
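For anyone who wants to check that arithmetic, here is a minimal sketch in Python. It uses only the numbers quoted in the paragraph above; the variable names are mine, and the 829 is taken as given rather than re-derived from the chart.

```python
# Compartment counts from the paragraph above: the six compartments
# where the dark gray oval overlaps the orange (loss of smell) oval.
overlap_counts = [69, 17, 23, 50, 17, 46]

# Total inside the orange oval but outside the dark gray oval,
# as stated in the text (not re-derived here).
smell_without_fever = 829

smell_and_fever = sum(overlap_counts)
smell_total = smell_and_fever + smell_without_fever
share = smell_and_fever / smell_total

print(smell_and_fever)   # 222
print(smell_total)       # 1051
print(f"{share:.0%}")    # 21%
```

Three lines of arithmetic, once the numbers are in a list. The hard part is reading the right six numbers off the chart in the first place.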
How many people had three or more symptoms? I'll let you figure this one out!
Yikes. A simple, boring matrix would have been a better start. Or even better, asking: why would I even make such a thing?
Posted by: Michael Thompson | Apr 23, 2020 at 09:41 AM
Appreciated is missing its, "r."
Posted by: Cheryl Renee Thompson Smith | Apr 24, 2020 at 08:23 AM
A much better way of visualizing the data is an UpSet plot. Someone has actually created such a plot for these data: https://kieranhealy.org/blog/archives/2020/04/16/upset-plots/.
Posted by: Karl Ove Hufthammer | Apr 24, 2020 at 12:26 PM
KOH: Yes, Xan alerted me to those and other charts. I'll be posting about them next week. Thanks for your note!
CRTS: I'm making many typos recently. It's quite disturbing. It's fixed.
Posted by: Kaiser | Apr 24, 2020 at 02:56 PM
It would be interesting to look at the full dataset using latent class analysis to see how many classes of subjects there were and how much class membership determines whether someone has COVID-19. I had a moderate case of the flu last year and lost my sense of smell and taste. I didn't notice for a few days, until I realised that I didn't feel hungry because everything I ate seemed to have no taste.
The reason they don't have a matrix is that it would probably require 50 lines or more. The inconsistency in the numbers shows the importance of reproducible research. Having a program that outputs everything means not having to redo things every time you find a data error.
Posted by: Ken | May 15, 2020 at 03:24 AM