Putting vaccine trials in boxes

Bloomberg Businessweek has a special edition about vaccines, and I found this chart on the print edition:

Bloombergbw_vaccinetrials_sm

The chart's got a lot of white space. Its structure is a series of simple "treemaps," one for each type of vaccine. Though simple, such a chart burns a few brain cells.

Here, I've extracted the largest block, which corresponds to vaccines that work with the virus's RNA/DNA. I applied a self-sufficiency test, removing the data from the boxes. 

Redo_junkcharts_bloombergbw_vaccinetrials_0

What proportion of these projects have moved from pre-clinical to Phase 1?  To answer this question, we have to understand the relative areas of boxes, since that's how the data are encoded. How many yellow boxes can fit into the gray box?

It's not intuitive. We'd need a ruler to do this task properly.

Then, we learn that the gray box is exactly 8 times the size of the yellow box (72 projects are pre-clinical while 9 are in Phase I). We can cram eight yellows into the gray box. Imagine doing that, and it's pretty clear the visual elements fail to convey the meaning of the data.

Self-sufficiency is the idea that a data graphic should not rely on printed data to convey its meaning; the visual elements of a data graphic should bear much of the burden. Otherwise, use a data table. To test for self-sufficiency, cover up the printed data and see if the chart still works.

***

A key decision for the designer is the relative importance of (a) the number of projects reaching Phase III, versus (b) the number of projects utilizing specific vaccine strategies.

This next chart emphasizes the clinical phases:

Redo_junkcharts_bloombergbw_vaccinetrials_2

 

Contrast this with the version shown in the online edition of Bloomberg (link), which emphasizes the vaccine strategies.

Bloombergbwonline_vaccinetrials

If any reader can figure out the logic of the ordering of the vaccine strategies, please leave a comment below.


Deaths as percent neither of cases nor of population. Deaths as percent of normal.

Yesterday, I posted a note about excess deaths on the book blog (link). The post was inspired by a nice data visualization by the New York Times (link). This is a great example of data journalism.

Nyt_excessdeaths_south

Excess deaths is a superior metric for measuring the effect of Covid-19 on public health. It's better than deaths as percent of cases. Also better than percent of the population.What excess deaths measure is deaths as a percent of normal. Normal is usually defined as the average deaths in the respective week in years past.

The red areas indicate how far the deaths in the Southern states are above normal. The highest peak, registered in Texas in late July, is 60 percent above the normal level.

***

The best way to appreciate the effort that went into this graphic is to imagine receiving the outputs from the model that computes excess deaths. A three-column spreadsheet with columns "state", "week number" and "estimated excess deaths".

The first issue is unequal population sizes. More populous states of course have higher death tolls. Transforming death tolls to an index pegged to the normal level solves this problem. To produce this index, we divide actual deaths by the normal level of deaths. So the spreadsheet must be augmented by two additional columns, showing the historical average deaths and actual deaths for each state for each week. Then, the excess death index can be computed.

The journalist builds a story around the migration of the coronavirus between different regions as it rages across different states  during different weeks. To this end, the designer first divides the dataset into four regions (South, West, Midwest and Northeast). Within each region, the states must be ordered. For each state, the week of peak excess deaths is identified, and the peak index is used to sort the states.

The graphic utilizes a small-multiples framework. Time occupies the horizontal axis, by convention. The vertical axis is compressed so that the states are not too distant. For the same reason, the component graphs are allowed to overlap vertically. The benefit of the tight arrangement is clearer for the Northeast as those peaks are particularly tall. The space-saving appearance reminds me of sparklines, championed by Ed Tufte.

There is one small tricky problem. In most of June, Texas suffered at least 50 percent more deaths than normal. The severity of this excess death toll is shortchanged by the low vertical height of each component graph. What forced such congestion is probably the data from the Northeast. For example, New York City:

Nyt_excessdeaths_northeast3

 

New York City's death toll was almost 8 times the normal level at the start of the epidemic in the U.S. If the same vertical scale is maintained across the four regions, then the Northeastern states dwarf all else.

***

One key takeaway from the graphic for the Southern states is the persistence of the red areas. In each state, for almost every week of the entire pandemic period, actual deaths have exceeded the normal level. This is strong indication that the coronavirus is not under control.

In fact, I'd like to see a second set of plots showing the cumulative excess deaths since March. The weekly graphic is better for identifying the ebb and flow while the cumulative graphic takes measure of the total impact of Covid-19.

***

The above description leaves out a huge chunk of work related to computing excess deaths. I assumed the designer receives these estimates from a data scientist. See the related post in which I explain how excess deaths are estimated from statistical models.

 


Everything in Texas is big, but not this BIG

Long-time reader John forwarded the following chart via Twitter.

Covidtracking_texassquare

The chart shows the recent explosive growth in deaths due to Covid-19 in Texas. John flagged this graphic as yet another example in which the data are encoded to the lengths of the squares, not their areas.

Fixing this chart just requires fixing the length of one side of the square. I also flipped it to make a conventional column chart.

Redo_texasdeathsquares_process

The final product:

Redo_texasdeaths_columns

An important qualification lurks in the footnote; it is directly applied to the label of July.

How much visual distortion is created when data are encoded to the lengths and not the areas? The following chart shows what readers see, assuming they correctly perceive the areas of those squares. The value for March is held the same as above while the other months show the death counts implied by the relative areas of the squares.

Redo_texasdeaths_distortion

Owing to squaring, the smaller counts are artificially compressed while the big numbers are massively exaggerated.


This chart shows why the PR agency for the UK government deserves a Covid-19 bonus

The Economist illustrated some interesting consumer research with this chart (link):

Economist_covidpoll

The survey by Dalia Research asked people about the satisfaction with their country's response to the coronavirus crisis. The results are reduced to the "Top 2 Boxes", the proportion of people who rated their government response as "very well" or "somewhat well".

This dimension is laid out along the horizontal axis. The chart is a combo dot and bubble chart, arranged in rows by region of the world. Now what does the bubble size indicate?

It took me a while to find the legend as I was expecting it either in the header or the footer of the graphic. A larger bubble depicts a higher cumulative number of deaths up to June 15, 2020.

The key issue is the correlation between a country's death count and the people's evaluation of the government response.

Bivariate correlation is typically shown on a scatter plot. The following chart sets out the scatter plots in a small multiples format with each panel displaying a region of the world.

Redo_economistcovidpolling_scatter

The death tolls in the Asian countries are low relative to the other regions, and yet the people's ratings vary widely. In particular, the Japanese people are pretty hard on their government.

In Europe, the people of Greece, Netherlands and Germany think highly of their government responses, which have suppressed deaths. The French, Spaniards and Italians are understandably unhappy. The British appears to be the most forgiving of their government, despite suffering a higher death toll than France, Spain or Italy. This speaks well of their PR operation.

Cumulative deaths should be adjusted by population size for a proper comparison across nations. When the same graphic is produced using deaths per million (shown on the right below), the general story is preserved while the pattern is clarified:

Redo_economistcovidpolling_deathspermillion_2

The right chart shows deaths per million while the left chart shows total deaths.

***

In the original Economist chart, what catches our attention first is the bubble size. Eventually, we notice the horizontal positioning of these bubbles. But the star of this chart ought to be the new survey data. I swapped those variables and obtained the following graphic:

Redo_economistcovidpolling_swappedvar

Instead of using bubble size, I switched to using color to illustrate the deaths-per-million metric. If ratings of the pandemic response correlate tightly with deaths per million, then we expect the color of these dots to evolve from blue on the left side to red on the right side.

The peculiar loss of correlation in the U.K. stands out. Their PR firm deserves a bonus!


The discontent of circular designs

You have two numbers +84% and -25%.

The textbook method to visualize this pair is to plot two bars. One bar in the positive direction, the other in the negative direction. The chart is clear (more on the analysis later).

Redo_pbs_mask1

But some find this graphic ugly. They don’t like straight lines, right angles and such. They prefer circles, and bends. Like PBS, who put out the following graphic that was forwarded to me by Fletcher D. on twitter:

Maskwearing_racetrack

Bending the columns is not as simple as it seems. Notice that the designer adds red arrows pointing up and down. Because the circle rounds onto itself, the sense of direction is lost. Now, readers must pick up the magnitude and the direction separately. It doesn’t help that zero is placed at the bottom of the circle.

Can we treat direction like we would on a bar chart? Make counter-clockwise the negative direction. This is what it looks like:

Redo_pbsmaskwearing

But it’s confusing. I made the PBS design worse because now, the value of each position on the circle depends on knowing whether the arrow points up or down. So, we couldn’t remove those red arrows.

The limitations of the “racetrack” design reveal themselves in similar data that are just a shade different. Here are a couple of scenarios to ponder:

  1. You have growth exceeding 100%. This is a hard problem.
  2. You have three or more rates to compare. Making one circle for each rate quickly becomes cluttered. You may make a course with multiple racetracks. But anyone who runs track can tell you the outside lanes are not the same distance as the inside. I wrote about this issue in a long-ago post (see here).

***

For a Trifecta Checkup (link), I'd also have concerns about the analytics. There are so many differences between the states that have required masks and states that haven't - the implied causality is far from proven by this simple comparison. For example, it would be interesting to see the variability around these averages - by state or even by county.


Presented without comment

Weekend assignment - which of these tells the story better?

Ourworldindata_cases_log

Or:

Ourworldindata_cases_linear

The cop-out answer is to say both. If you must pick one, which one?

***

When designing a data visualization as a living product (not static), you'd want a design that adapts as the data change.


What is the price for objectivity

I knew I had to remake this chart.

TMC_hospitalizations

The simple message of this chart is hidden behind layers of visual complexity. What the analyst wants readers to focus on (as discerned from the text on the right) is the red line, the seven-day moving average of new hospital admissions due to Covid-19 in Texas.

My eyes kept wandering away from the line. It's the sideway data labels on the columns. It's the columns that take up vastly more space than the red line. It's the sideway date labels on the horizontal axis. It's the redundant axis labels for hospitalizations when the entire data set has already been printed. It's the two hanging diamonds, for which the clues are filed away in the legend above.

Here's a version that brings out the message: after Phase 2 re-opening, the number of hospital admissions has been rising steadily.

Redo_junkcharts_texas_covidhospitaladmissions_1

Dots are used in place of columns, which push these details to the background. The line as well as periods of re-opening are directly labeled, removing the need for a legend.

Here's another visualization:

Redo_junkcharts_texas_covidhospitaladmissions_2

This chart plots the weekly average new hospital admissions, instead of the seven-day moving average. In the previous chart, the raggedness of moving average isn't transmitting any useful information to the average reader. I believe this weekly average metric is easier to grasp for many readers while retaining the general story.

***

On the original chart by TMC, the author said "the daily hospitalization trend shows an objective view of how COVID-19 impacts hospital systems." Objectivity is an impossible standard for any kind of data analysis or visualization. As seen above, the two metrics for measuring the trend in hospitalizations have pros and cons. Even if one insists on using a moving average, there are choices of averaging methods and window sizes.

Scientists are trained to believe in objectivity. It frequently disappoints when we discover that the rest of the world harbors no such notion. If you observe debates between politicians or businesspeople or social scientists, you rarely hear anyone claim one analysis is more objective - or less subjective - than another. The economist who predicts Dow to reach a new record, the business manager who argues for placing discounted products in the front not the back of the store, the sportscaster who maintains Messi is a better player than Ronaldo: do you ever hear these people describe their methods as objective?

Pursuing objectivity leads to the glorification of data dumps. The scientist proclaims disinterest in holding an opinion about the data. This is self-deception though. We clearly have opinions because when someone else  "misinterprets" the data, we express dismay. What is the point of pretending to hold no opinions when most of the world trades in opinions? By being "objective," we never shape the conversation, and forever play defense.


Hope and reality in one Georgia chart

Over the weekend, Georgia's State Health Department agitated a lot of people when it published the following chart:

Georgia_top5counties_covid19

(This might have appeared a week ago as the last date on the chart is May 9 and the title refers to "past 15 days".)

They could have avoided the embarrassment if they had read my article at DataJournalism.com (link). In that article, I lay out a set of the "unspoken conventions," things that visual designers are, or should be, doing more or less in their sleep. Under the section titled "Order", I explain the following two "rules":

  • Place values in the natural order when it is available
  • Retain the same order across all plots in a panel of charts

In the chart above, the natural order for the horizontal (time) axis is time running left to right. The order chosen by the designer  is roughly but not precisely decreasing height of the tallest column in each daily group. Many observers suggested that the columns were arranged to give the appearance of cases dropping over time.

Within each day, the counties are ordered in decreasing number of new cases. The title of the chart reads "number of cases over time" which sounds like cumulative cases but it's not. The "lead" changed hands so many times over the 15 days, meaning the data sequence was extremely noisy, which would be unlikely for cumulative cases. There are thousands of cases in each of these counties by May. Switching the order of the columns within each daily group defeats the purpose of placing these groups side-by-side.

Responding to the bad press, the department changed the chart design for this week's version:

Georgia_top5counties_covid19_revised

This chart now conforms to the two spoken rules described above. The time axis runs left to right, and within each group of columns, the order of the counties is maintained.

The chart is still very noisy, with no apparent message.

***

Next, I'd like to draw your attention to a Data issue. Notice that the 15-day window has shifted. This revised chart runs from May 2 to May 16, which is this past Saturday. The previous chart ran from Apr 26 to May 9. 

Here's the data for May 8 and 9 placed side by side.

Junkcharts_georgia_covid19_cases

There is a clear time lag of reporting cases in the State of Georgia. This chart should always exclude the last few days. The case counts keep going up until it stabilizes. The same mistake occurs in the revised chart - the last two days appear as if new cases have dwindled toward zero when in fact, it reflects a lag in reporting.

The disconnect between the Question being posed and the quality of the Data available dooms this visualization. It is not possible to provide a reliable assessment of the "past 15 days" when during perhaps half of that period, the cases are under-counted.

***

Nyt_tryingtobefashionableThis graphical distortion due to "immature" data has become very commonplace in Covid-19 graphics. It's similar to placing partial-year data next to full-year results, without calling out the partial data.

The following post from the ancient past (2005!) about a New York Times graphic shows that calling out this data problem does not actually solve it. It's a less-bad kind of thing.

The coronavirus data present more headaches for graphic designers than the financial statistics. Because of accounting regulations, we know that only the current quarter's data are immature. For Covid-19 reporting, the numbers are being adjusted for days and weeks.

Practically all immature counts are under-estimates. Over time, more cases are reported. Thus, any plots over time - if unadjusted - paint a misleading picture of declining counts. The effect of the reporting lag is predictable, having a larger impact as we run from left to right in time. Thus, even if the most recent data show a downward trend, it can eventually mean anything: down, flat or up. This is not random noise though - we know for certain of the downward bias; we just don't know the magnitude of the distortion for a while.

Another issue that concerns coronavirus reporting but not financial reporting is inconsistent standards across counties. Within a business, if one were to break out statistics by county, the analysts would naturally apply the same counting rules. For Covid-19 data, each county follows its own set of rules, not just  how to count things but also how to conduct testing, and so on.

Finally, with the politics of re-opening, I find it hard to trust the data. Reported cases are human-driven data - by changing the number of tests, by testing different mixes of people, by delaying reporting, by timing the revision of older data, by explicit manipulation, ...., the numbers can be tortured into any shape. That's why it is extremely important that the bean-counters are civil servants, and that politicians are kept away. In the current political environment, that separation between politics and statistics has been breached.

***

Why do we have low-quality data? Human decisions, frequently political decisions, adulterate the data. Epidemiologists are then forced to use the bad data, because that's what they have. Bad data lead to bad predictions and bad decisions, or if the scientists account for the low quality, predictions with high levels of uncertainty. Then, the politicians complain that predictions are wrong, or too wide-ranging to be useful. If they really cared about those predictions, they could start by being more transparent about reporting and more proactive at discovering and removing bad accounting practices. The fact that they aren't focused on improving the data gives the game away. Here's a recent post on the politics of data.

 


How Covid-19 deaths sneaked into Florida's statistics

Like many others, some Floridians are questioning their state's Covid statistics. It's clear there are numerous "degrees of freedom" for politicians to manipulate the numbers. What's not clear is who's influencing these decisions. Are they public-health experts, donors, voters, or whom?

A Twitter follower sent in the following chart, embedded in an informative article in Sun-Sentinel:

Sun-sentinel_pneumonia_percent_of_total

I like the visual design. It's clean, and conveys a moderately complex concept effectively. The reader may not immediately get what metrics are being plotted but the idea that the blue line should operate within the gray area.. until it doesn't is easily grasped. The range is technically an uncertainty band.

The metric is the proportion of total deaths (all causes) that are attributed to pneumonia and flu. Typical influenza deaths are found in that category. This chart investigates whether there were excess (unexplained) P&F deaths. The gray band measures the variability in the proportions of past years. When the blue line operates inside the band, the metric is normal. When it pierces the upper band, which happened here around week 25, a rare event has occurred.

The concern on Twitter was about the horizontal axis. Those integer labels can be confusing. The designer places a "how to read this" message in a footnote, explaining that week 1 is the first week of a typical flu season (which corresponds to late September 2019). This nugget of information helps a lot. We can see that the flu season peaks around week 20, and by the spring, it should be waning. Not so in 2020.

It's hard to escape the conclusion that deaths from Covid-19 are hiding inside the statistics of Pneumonia & Flu. As a statistician, I want to tell you Statistics Don't Lie! You can hide the data along one dimension, but they show up elsewhere. Misclassifying the deaths does buy someone some time. It takes a few weeks to compile all-cause mortality data (gasp, the CDC said mortality records are only 75 percent accurate after 8 weeks!)

The other small problem with the chart is the labeling. Neither axis has labels. The data label that shows up when you click on the line might be a default from the software that can't be turned off. It shows the two numbers being plotted without labels.

***

Here is a re-working of the chart that tells the story:

Redo_junkcharts_sunsentinelpneunominacovid19

The proportion of deaths attributed to P&F and Covid together is roughly double the upper end of what Florida should be seeing this time of the year (without Covid). Covid-19 accounts for half the gap. The other half are still being classified as P&F. However, I suspect CDC will adjust these numbers later to reflect the reality. (In making this chart, I also learned that Florida stopped including seasonal visitors in the death counts. This is egregious manipulation. If someone died while in Florida, they should be counted. I didn't investigate whether this counting rule applies only to Covid-19 deaths, or to deaths from all causes. If they had always done that, then I might give them a pass.)

On second thought, maybe not. The other egregious thing that appeared to have happened is that the Florida state health department unplugged their prior website (https://www.floridahealth.gov) so no one can cross-reference any prior documents. The only website I can access now for Florida state health is a Covid-specific site (https://floridahealthcovid19.gov).

Florida_state_health_websites

There must be something juicy on the previous influenza page, no?

***

Lastly, when you look at my chart, please pretend that the last week is not on there. In all likelihood, the "drop" is fake because the mortality data have not been fully updated. My chart contains one more week than the Sun Sentinel chart. So you can see that the drastic decline shown on their chart turned up a big uptick on mine (next to last week).

This is a common mistake on many charts I see these days. Half-baked numbers are shown next to fully-baked ones.


Twitter people UpSet with that Covid symptoms diagram

Been busy with an exciting project, which I might talk about one day. But I promised some people I'll follow up on Covid symptoms data visualization, so here it is.

After I posted about the Venn diagram used to depict self-reported Covid-19 symptoms by users of the Covid Symptom Tracker app (reported by Nature), Xan and a few others alerted me to Twitter discussion about alternative visualizations that people have made after they suffered the indignity of trying to parse the Venn diagram.

To avoid triggering post-trauma, for those want to view the Venn diagram, please click here.

[In the Twitter links below, you almost always have to scroll one message down - saving tweets, linking to tweets, etc. are all stuff I haven't fully figured out.]

Start with the Questions

Xan’s final comment is especially appropriate: "There's an over-riding Type-Q issue: count charts answer the wrong question".

As dataviz designers, we frequently get locked into the mindset of “what is the best way to present this dataset?” This line of thinking leads to overloaded graphics that attempt to answer every possible question that may arise from the data in one panoptic chart, akin to juggling 10 balls at once.

For complex datasets, it is often helpful to narrow down the list of questions, and provide a series of charts, each addressing one or two questions. I’ll come back to this point. I want to first show some of the nicer visuals that others have produced, which brings out the structure and complexity of this dataset.

 

The UpSet chart

The primary contender is the “UpSet” chart form, as best exemplified by Bart’s effort

Upset_bartjutte

The centerpiece of this chart is the matrix of dots. The horizontal rows of dots represent the presence of specific symptoms such as cough and anosmia (loss of smell and taste). The vertical columns are intuitive, once you get it. They represent combinations of symptoms, and the fill/no-fill of the dots indicates which symptoms are being combined. For example, the first column counts people reporting fatigue plus anosmia (but nothing else).

The UpSet chart clearly communicates the structure of the data. In many survey questions (including this one conducted by the Symptom Tracker app), respondents are allowed to check/tick more than one answer choices. This creates a situation where the number of answers (here, symptoms) per respondent can be zero up to the total number of answer choices.

So far, we have built a structure like we have drawn country outlines on a map. There is no data yet. The data are primarily found in the sidebar histograms (column/bar charts). Reading horizontally to the right side, one learns that the most frequently reported symptom was fatigue, covering 88 percent of the users.* Reading vertically, one learns that the top combination of symptoms was fatigue plus anosmia, covering 16 percent of users.

***

Now come the divisive acts.

Act 1: Bart orders the columns in a particular way that meets his subjective view of how he wants readers to see the data. The columns are sorted from the most frequent combinations to the least. The histogram has a “long tail”, with most of the combinations receiving a small proportion of the total. The top five combinations is where the bulk of the data is – I’d have liked to see all five columns labeled, without decimal places.

This is a choice on the part of the designer. Nils, for example, made two versions of his UpSet charts. The second version arranges the combinations from singles to quintuples.

Nils Gehlenborg_upsetplot_sortedbynumberofsymptoms

 

Digression: The Visual in Data Visualization

The two rendering of “UpSet” charts, by Nils and Bart, is a perfect illustration of the Trifecta Checkup framework. Each corner of the Trifecta is an independent dimension, and yet all must sync. With the same data and the same question types, what differentiates the two versions is the visual design.

See how many differences you can find, and make your own design choices!

 

I place the digression here because Act 1 above has to do with the Q corner, and both visual designs can accommodate the sorting decisions. But Act 2 below pertains to the V corner.

Act 2: Bart applies a blue gradient to the matrix of dots that reinforces his subjective view about identifying frequent combinations of symptoms. Nils, by contrast, uses the matrix to show present/absent only.

I’m not sure about Act 2. I think the addition of the color gradient overloads the matrix in the chart. It has the nice effect of focusing the reader’s attention on the top 5 combinations but it also requires the reader to have understood the meaning of columns first. Perhaps applying the gradient to the histogram up top rather than the dots in the matrix can achieve the same goal with less confusion.

 

Getting Obtuse

For example, some readers (e.g. Robin) expressed confusion.

Robin is alleging something the chart doesn’t do. He pointed out (correctly) that while 16 percent experienced fatigue and anosmia only (without other symptoms), more than 50 percent reported fatigue and anosmia, plus other symptoms. That nugget of information is deeply buried inside Bart’s chart – it’s the sum of each column for which the first two dots are filled in. For example, the second column represents fatigue+anosmia+cough. So Robin wants to aggregate those up.

Robin’s critique arises from the Q(uestion) corner. If the designer wants to highlight specific combinations that occur most frequently in the data, then Bart’s encoding makes perfect sense. On the other hand, if the purpose is to highlight pairs of symptoms that occur most frequently together (disregarding symptoms outside each pair), then the data must be further aggregated. The switch in the Question requires more Data manipulation, which then affects the Visualization. That's the essence of the Trifecta Checkup framework.

Rest assured, the version that addresses Robin’s point will not give an easy answer to Bart’s question. In fact, Xan whipped up a bar chart in response:

Xan_symptomscombo_barchart

This is actually hard to comprehend because Robin’s question is even hard to state. The first bar shows 87 percent of users reported fatigue as a symptom, the same number that appeared on Bart’s version on the right side. Then, the darkened section of the bar indicates the proportion of users who reported only fatigue and nothing else, which appears to be about 10 percent. So 1 out of 9 reported just fatigue while 8 out of 9 who reported fatigue also experienced other symptoms.

 

Xan’s bar chart can be flipped 90 degrees and replace Bart’s histogram on top of the matrix. But you see, we end up with the same problem as I mentioned up top. By jamming more insights from more questions onto the same chart, we risk dropping the other balls that were already in the air.

So, my advice is always to first winnow down the list of questions you want to address. And don’t be afraid of making a series of charts instead of one panoptic chart.

***

Act 3: Bart decides to leave out labels for the columns.

This is a curious choice given the key storyline we’ve been working with so far (the Top 5 combinations of symptoms). But notice how annoying this problem is. Combinations require long text, which must be written vertically or slanted on this design. Transposing could help but not really. It’s just a limitation of this chart form. For me, reading the filled dots underneath the columns as column labels isn’t a show-stopper.

 

Histograms vs Bar Charts

It’s worth pointing out that the sidebar “histograms” are not both histograms. I tend to think of histograms as a specific type of bar (column) chart, in which the sum of the bars (columns) can be interpreted as a whole. So all histograms are bar charts but only some bar charts are histograms.

The column chart up top is a histogram. The combinations of symptoms are disjoint, and the total of the combinations should be the total number of answer choices selected by all respondents. The bar chart on the right side however is not a histogram. Each percentage is a proportion to the whole, and adding those percentages yields way above 100%.

I like the annotation on Bart’s chart a lot. They are succinct and they give just the right information to explain how to read the chart.

 

Limitations

I already mentioned the vertical labeling issue for UpSet charts. Here are two other considerations for you.

The majority of the plotting area is dedicated to the matrix of dots. The matrix contains merely labels for data. They are like country boundaries on a map. While it lays out the structure of data very clearly, the designer should ask whether it is essential for the readers to see the entire landscape.

In real-world data, the “long tail” phenomenon we saw earlier is very common. With six featured symptoms, there are 2^6 = 64 possible combinations of symptoms (minus 1 if they filtered out those not reporting symptoms*), almost all of which will be empty. Should the low-frequency columns be removed? This is not as controversial as you think, because implicitly both Bart and Nils already dropped all empty combinations!

 

Data and Code

Kieran Healy left a comment on the last post, and you can find both the data (thank you!) and some R code for UpSet charts at his blog.

Also, Nils has a Shiny app on Github.

 

(*) One must be very careful about what “users” are being represented. They form a tiny subset of users of the Symptom Tracker app, just those who have previously taken a diagnostic test and have self-reported at least one symptom. I have separately commented on the analyses of this dataset by the team behind the app. The first post discusses their analytical methods, the second post examines how they pre-processed the data, and a future post will describe the data collection practices. For the purpose of this blog post, I’ll ignore any data issues.

(#) Bart’s chart is conceptual because some of the columns of dots are repeated, and there is one column without fills, which should have been removed by a pre-processing step applied by the research team.