Twitter people UpSet with that Covid symptoms diagram
May 01, 2020
Been busy with an exciting project, which I might talk about one day. But I promised some people I'll follow up on Covid symptoms data visualization, so here it is.
After I posted about the Venn diagram used to depict self-reported Covid-19 symptoms by users of the Covid Symptom Tracker app (reported by Nature), Xan and a few others alerted me to a Twitter discussion about alternative visualizations that people made after they suffered the indignity of trying to parse the Venn diagram.
To avoid triggering post-trauma, for those who want to view the Venn diagram, please click here.
[In the Twitter links below, you almost always have to scroll one message down - saving tweets, linking to tweets, etc. are all stuff I haven't fully figured out.]
Start with the Questions
Xan’s final comment is especially appropriate: "There's an over-riding Type-Q issue: count charts answer the wrong question".
As dataviz designers, we frequently get locked into the mindset of “what is the best way to present this dataset?” This line of thinking leads to overloaded graphics that attempt to answer every possible question that may arise from the data in one panoptic chart, akin to juggling 10 balls at once.
For complex datasets, it is often helpful to narrow down the list of questions, and provide a series of charts, each addressing one or two questions. I’ll come back to this point. I want to first show some of the nicer visuals that others have produced, which bring out the structure and complexity of this dataset.
The UpSet chart
The primary contender is the “UpSet” chart form, as best exemplified by Bart’s effort.
The centerpiece of this chart is the matrix of dots. The horizontal rows of dots represent the presence of specific symptoms such as cough and anosmia (loss of smell and taste). The vertical columns are intuitive, once you get it. They represent combinations of symptoms, and the fill/no-fill of the dots indicates which symptoms are being combined. For example, the first column counts people reporting fatigue plus anosmia (but nothing else).
The UpSet chart clearly communicates the structure of the data. In many survey questions (including this one conducted by the Symptom Tracker app), respondents are allowed to check/tick more than one answer choice. This creates a situation where the number of answers (here, symptoms) per respondent can range from zero up to the total number of answer choices.
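To make that structure concrete, here is a minimal Python sketch of how multi-select answers become the columns of the dot matrix. The responses are made up for illustration (they are not the Symptom Tracker data), and the names `responses`, `combo_counts` and `dots` are my own.

```python
from collections import Counter

# Hypothetical toy responses: each respondent ticks any subset of symptoms.
responses = [
    {"fatigue", "anosmia"},
    {"fatigue", "anosmia"},
    {"fatigue", "anosmia", "cough"},
    {"fatigue"},
    {"cough"},
]
symptoms = ["fatigue", "anosmia", "cough"]  # one matrix row per symptom

# Each distinct subset of symptoms is one column of the UpSet matrix.
# A respondent belongs to exactly one column: the exact set they ticked.
combo_counts = Counter(frozenset(r) for r in responses)


def dots(combo):
    """Render one matrix column as filled/unfilled dots, in row order."""
    return "".join("●" if s in combo else "○" for s in symptoms)


# Print the columns sideways: dots first, then the number of respondents.
for combo, n in combo_counts.most_common():
    print(dots(combo), n)
```

The dots only say which symptoms make up a combination; the counts next to them are what the histogram on top of the matrix would plot.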
So far, we have built a structure, much as we would draw country outlines on a map. There is no data yet. The data are primarily found in the sidebar histograms (column/bar charts). Reading horizontally to the right side, one learns that the most frequently reported symptom was fatigue, covering 88 percent of the users.* Reading vertically, one learns that the top combination of symptoms was fatigue plus anosmia, covering 16 percent of users.
***
Now come the divisive acts.
Act 1: Bart orders the columns in a particular way that meets his subjective view of how he wants readers to see the data. The columns are sorted from the most frequent combinations to the least. The histogram has a “long tail”, with most of the combinations receiving a small proportion of the total. The top five combinations are where the bulk of the data is; I’d have liked to see all five columns labeled, without decimal places.
This is a choice on the part of the designer. Nils, for example, made two versions of his UpSet charts. The second version arranges the combinations from singles to quintuples.
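The two orderings are easy to compare side by side. Below is a small Python sketch using hypothetical counts (not the real survey numbers). Breaking ties by frequency within each group size is my assumption, not necessarily what Nils did.

```python
# Hypothetical counts per exclusive combination of symptoms.
combo_counts = {
    ("fatigue",): 30,
    ("cough",): 10,
    ("fatigue", "anosmia"): 45,
    ("fatigue", "cough"): 25,
    ("fatigue", "anosmia", "cough"): 40,
}

# Ordering 1: most frequent combination first.
by_frequency = sorted(combo_counts, key=lambda c: -combo_counts[c])

# Ordering 2: singles, then pairs, then triples, and so on.
# Within each group size, the most frequent combination comes first (assumed).
by_degree = sorted(combo_counts, key=lambda c: (len(c), -combo_counts[c]))

for label, order in [("by frequency", by_frequency), ("by degree", by_degree)]:
    print(label)
    for combo in order:
        print("  ", " + ".join(combo), combo_counts[combo])
```

Same data, same chart form; only the sort key changes, and with it the story the reader sees first.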
Digression: The Visual in Data Visualization
The two renderings of “UpSet” charts, by Nils and Bart, are a perfect illustration of the Trifecta Checkup framework. Each corner of the Trifecta is an independent dimension, and yet all must sync. With the same data and the same question types, what differentiates the two versions is the visual design.
See how many differences you can find, and make your own design choices!
I place the digression here because Act 1 above has to do with the Q corner, and both visual designs can accommodate the sorting decisions. But Act 2 below pertains to the V corner.
Act 2: Bart applies a blue gradient to the matrix of dots that reinforces his subjective view about identifying frequent combinations of symptoms. Nils, by contrast, uses the matrix to show present/absent only.
I’m not sure about Act 2. I think the addition of the color gradient overloads the matrix in the chart. It has the nice effect of focusing the reader’s attention on the top 5 combinations but it also requires the reader to have understood the meaning of columns first. Perhaps applying the gradient to the histogram up top rather than the dots in the matrix can achieve the same goal with less confusion.
Getting Obtuse
Some readers (e.g. Robin) expressed confusion.
Robin is alleging something the chart doesn’t do. He pointed out (correctly) that while 16 percent experienced fatigue and anosmia only (without other symptoms), more than 50 percent reported fatigue and anosmia, plus other symptoms. That nugget of information is deeply buried inside Bart’s chart – it’s the sum of each column for which the first two dots are filled in. For example, the second column represents fatigue+anosmia+cough. So Robin wants to aggregate those up.
Robin’s critique arises from the Q(uestion) corner. If the designer wants to highlight specific combinations that occur most frequently in the data, then Bart’s encoding makes perfect sense. On the other hand, if the purpose is to highlight pairs of symptoms that occur most frequently together (disregarding symptoms outside each pair), then the data must be further aggregated. The switch in the Question requires more Data manipulation, which then affects the Visualization. That's the essence of the Trifecta Checkup framework.
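The extra data manipulation can be sketched in a few lines of Python. The counts below are hypothetical, chosen only to show the mechanics; `inclusive_count` is an illustrative helper name of my own.

```python
# Hypothetical counts for exclusive combinations: each respondent is counted
# once, under the exact set of symptoms reported.
exclusive = {
    frozenset({"fatigue"}): 10,
    frozenset({"fatigue", "anosmia"}): 16,
    frozenset({"fatigue", "anosmia", "cough"}): 14,
    frozenset({"fatigue", "cough"}): 8,
    frozenset({"cough"}): 2,
}
total = sum(exclusive.values())


def inclusive_count(pair):
    """Add up every column whose dots are filled for both symptoms in the pair."""
    pair = frozenset(pair)
    return sum(n for combo, n in exclusive.items() if pair <= combo)


pair = {"fatigue", "anosmia"}
only = exclusive[frozenset(pair)]  # the pair and nothing else: 16
at_least = inclusive_count(pair)   # the pair, with or without others: 30

print(f"{only / total:.0%} reported exactly this pair")
print(f"{at_least / total:.0%} reported this pair, possibly with other symptoms")
```

The first number is one column of the UpSet chart; the second is a sum across several columns, which is why it cannot be read directly off the chart.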
Rest assured, the version that addresses Robin’s point will not give an easy answer to Bart’s question. In fact, Xan whipped up a bar chart in response:
This is actually hard to comprehend because Robin’s question is hard even to state. The first bar shows 87 percent of users reported fatigue as a symptom, the same number that appeared on Bart’s version on the right side. Then, the darkened section of the bar indicates the proportion of users who reported only fatigue and nothing else, which appears to be about 10 percent. So, among those who reported fatigue, roughly 1 out of 9 reported just fatigue while 8 out of 9 also experienced other symptoms.
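The "1 out of 9" figure is a ratio of the two percentages read off the bar. A quick check in Python, taking the 10 percent as the eyeballed value it is:

```python
reported_fatigue = 0.87  # share of users reporting fatigue (full bar)
only_fatigue = 0.10      # share reporting fatigue and nothing else (eyeballed)

# Among users who reported fatigue, the fraction who reported nothing else.
share_only = only_fatigue / reported_fatigue
share_with_others = 1 - share_only

print(f"only fatigue: {share_only:.2f}, with other symptoms: {share_with_others:.2f}")
```

The result is close to 1/9 and 8/9, which is as much precision as a value read off a bar chart deserves.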
Xan’s bar chart can be flipped 90 degrees and replace Bart’s histogram on top of the matrix. But you see, we end up with the same problem as I mentioned up top. By jamming more insights from more questions onto the same chart, we risk dropping the other balls that were already in the air.
So, my advice is always to first winnow down the list of questions you want to address. And don’t be afraid of making a series of charts instead of one panoptic chart.
***
Act 3: Bart decides to leave out labels for the columns.
This is a curious choice given the key storyline we’ve been working with so far (the Top 5 combinations of symptoms). But notice how annoying this problem is. Combinations require long text, which must be written vertically or slanted on this design. Transposing could help but not really. It’s just a limitation of this chart form. For me, reading the filled dots underneath the columns as column labels isn’t a show-stopper.
Histograms vs Bar Charts
It’s worth pointing out that the sidebar “histograms” are not both histograms. I tend to think of histograms as a specific type of bar (column) chart, in which the sum of the bars (columns) can be interpreted as a whole. So all histograms are bar charts but only some bar charts are histograms.
The column chart up top is a histogram. The combinations of symptoms are disjoint, so each respondent falls into exactly one column, and the total of the combinations should be the total number of respondents. The bar chart on the right side however is not a histogram. Each percentage is a proportion of the whole, and adding those percentages yields way above 100%.
I like the annotations on Bart’s chart a lot. They are succinct and they give just the right information to explain how to read the chart.
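A quick check of this distinction on hypothetical toy data (not the survey data): shares of disjoint combinations add up to one, while per-symptom shares can add up to more.

```python
from collections import Counter

# Hypothetical toy responses: each set is one respondent's ticked symptoms.
responses = [
    {"fatigue", "anosmia"},
    {"fatigue", "anosmia", "cough"},
    {"fatigue"},
    {"cough"},
]
n = len(responses)

# Top chart: share of respondents in each exact (disjoint) combination.
combo_share = {
    combo: k / n
    for combo, k in Counter(frozenset(r) for r in responses).items()
}

# Side chart: share of respondents reporting each symptom, alone or not.
symptom_share = {
    symptom: k / n
    for symptom, k in Counter(s for r in responses for s in r).items()
}

print("combinations sum to", sum(combo_share.values()))
print("symptoms sum to", sum(symptom_share.values()))
```

The second sum exceeds one because a respondent with three symptoms is counted in three different bars.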
Limitations
I already mentioned the vertical labeling issue for UpSet charts. Here are two other considerations for you.
The majority of the plotting area is dedicated to the matrix of dots. The matrix contains merely labels for data. They are like country boundaries on a map. While it lays out the structure of data very clearly, the designer should ask whether it is essential for the readers to see the entire landscape.
In real-world data, the “long tail” phenomenon we saw earlier is very common. With six featured symptoms, there are 2^6 = 64 possible combinations of symptoms (minus 1 if they filtered out those not reporting symptoms*), almost all of which will be empty. Should the low-frequency columns be removed? This is not as controversial as you think, because implicitly both Bart and Nils already dropped all empty combinations!
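The count of possible combinations, and the effect of dropping empty or rare ones, can be sketched as follows. Only the first three symptom names come from this post; the other three are placeholders, and the observed counts and the `min_count` cutoff are hypothetical.

```python
from itertools import combinations

# Six symptoms; the last three names are placeholders for illustration.
symptoms = ["fatigue", "anosmia", "cough", "symptom_4", "symptom_5", "symptom_6"]

# Every subset of the six symptoms, from the empty set to all six.
all_combos = [
    frozenset(c)
    for k in range(len(symptoms) + 1)
    for c in combinations(symptoms, k)
]
print(len(all_combos))  # 64, including the empty set

# Hypothetical observed counts; any combination not listed is empty.
observed = {
    frozenset({"fatigue", "anosmia"}): 16,
    frozenset({"fatigue", "anosmia", "cough"}): 12,
    frozenset({"fatigue"}): 10,
    frozenset({"cough", "symptom_5"}): 1,
}

min_count = 2  # hypothetical cutoff for "low-frequency"
non_empty = [c for c in all_combos if observed.get(c, 0) > 0]
kept = [c for c in non_empty if observed[c] >= min_count]
print(len(non_empty), "non-empty columns,", len(kept), "after the cutoff")
```

Dropping empty columns is the `min_count = 1` case of the same filter, so the real design question is where to set the threshold.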
Data and Code
Kieran Healy left a comment on the last post, and you can find both the data (thank you!) and some R code for UpSet charts at his blog.
Also, Nils has a Shiny app on Github.
(*) One must be very careful about what “users” are being represented. They form a tiny subset of users of the Symptom Tracker app, just those who have previously taken a diagnostic test and have self-reported at least one symptom. I have separately commented on the analyses of this dataset by the team behind the app. The first post discusses their analytical methods, the second post examines how they pre-processed the data, and a future post will describe the data collection practices. For the purpose of this blog post, I’ll ignore any data issues.
(#) Bart’s chart is conceptual because some of the columns of dots are repeated, and there is one column without fills, which should have been removed by a pre-processing step applied by the research team.
Could you render the visualisation as a network graph with the strength of the vertices corresponding to the prevalence of that combination of symptoms and the size of the nodes corresponding to the prevalence of that symptom overall?
Posted by: Yellek | May 03, 2020 at 11:01 PM
An interesting follow-on question would be the timing/sequence of when each of these symptoms occurs as the disease progresses.
Posted by: Alan | May 04, 2020 at 10:55 AM
Y: That may be a profitable avenue to explore. The "adjacency" matrix has a network representation. The big limitation is the emphasis on pairs of nodes over triples, quadruples, etc. But there may be ways to overcome that.
A: Yes, they threw away the timing information by collapsing to a binary variable for each symptom in the preprint. I discussed the data preprocessing issues in this blog post on the book blog. The researchers also collected levels of severity of most symptoms.
Posted by: Kaiser | May 04, 2020 at 11:01 AM
Thank you for your extensive review of the Upset chart that I have created. A few remarks from the creator's end:
- One of the design principles was to show the most important data first, therefore the sorting of the rows was changed and the dots were colored (that was a shall-I-do-it-or-not tradeoff).
- Another principle was not to leave out info, although some combinations were pretty rare
Two other considerations in creating the graph remain, as always: who is the intended audience, and on which device/medium is the graph being consumed.
The graph is actually interactive and online (https://tinyurl.com/y7me7pzw), and details are disclosed by hovering over them (the labels on the columns, for example).
The audience was not really clear to me, but I kind of took doctors in mind to inform them about the symptom combinations you can encounter. One point I didn't take into account was how distinctive the symptoms were. This was clearer in the actual study that I checked later which shows the odds ratio (e.g. symptoms of infected vs symptoms of non-infected)
The impact of the graph is an aspect of an infographic as well. This was big in this case as I learnt the importance of loss of smell and taste and informed the developers of the Corona Check App in the Netherlands and they added this symptom in their checker, which most likely resulted in finding more people infected.
In the meanwhile I have visualised a German study from Prof Streeck that also verified symptoms (however they didn't disclose symptom combinations). This resulted in another infographic focused on the common public interested in the COVID-19 virus. See https://tinyurl.com/y7dpv5c3
Posted by: VizBiz15 | May 14, 2020 at 10:38 AM
Hi Bart, Thanks for your note, and the link to the interactive version. Also, the Gangelt study looks interesting. On the first chart, what does the blue/gray color represent?
Posted by: Kaiser | May 14, 2020 at 11:45 AM
Hi Kaiser,
Grey means not statistically significant and blue means significant on the first chart. Gathering feedback on this infographic right now. The study itself got criticism from different scientists in Germany, so I need to consider how to address this. It's really the back and forth you get on visualisations that provides additional insights.
Posted by: VizBiz15 | May 14, 2020 at 05:22 PM
If you color-code the text saying statistical significance (and not), that should resolve the issue for that chart. The top right chart I think is the most interesting but also hardest to understand. More symptoms being at the bottom seems unnatural. Also, the 0 symptoms line is confusing: it's the only one where having the gray dot to the right of the blue dot confirms your message. Will write back if I come up with a better graphic.
I'll probably blog about the study. The first thing that seems odd is the claim that restricting each family name to appear once helped "randomness" when, if this town is like any other, some surnames are much more common than others in the base population. I just don't understand why they needed to mess with the random selection.
Posted by: Kaiser | May 14, 2020 at 07:39 PM