Check your presumptions while you're reading this chart about Israel's vaccination campaign

On July 30, Israel began administering third doses of mRNA vaccines to targeted groups of people. This decision was controversial since there is no science to support it. The policymakers do have educated guesses by experts based on best-available information. By science, I mean actual evidence. Since no one has previously been given three shots, there can be no data on which anyone can root such a decision. Nevertheless, the pandemic does not always give us time to collect relevant data, and so speculative analysis has found its calling.

Dvir Aran, at Technion, has been diligently tracking the situation in Israel on his Twitter. Ten days after July 30, he posted the following chart, which immediately led many commentators to bounce out of their seats crowning the third shot as a magic bullet. Notably, Dvir himself did not endorse such a claim. (See here to learn how other hasty conclusions by experts have fared.)

When you look at Dvir's chart, what do we see?

Dvir_aran_chart

Possibly one of the following two things, depending on what concern you have in your head.

1) The red line sits far above the other two lines, showing that unvaccinated people are much more likely to get infected.

2) The blue line diverges from the green line almost immediately after the 3rd shots started getting into arms, showing that the 3rd shot is super effective.

If you take another moment to look, you might start asking questions, as many in Twitter world did. Dvir was startlingly efficient at answering these queries.

A) Does the green line represent people with 2 or 3 doses, or is it strictly 2 doses? Aron asked this question and got the answer (the former):

AronBrand_israelcases_twoorthreedoses

It's time to check our presumptions. When you read that chart, did you presume it's exactly 2 doses or did you presume it's 2 or 3 doses? Or did you immediately spot the ambiguity? As I said in this article, graphs attain efficiency at communication because the designer leverages unspoken rules - the chart conveys certain information without explicitly placing it on the chart. But this can backfire. In this case, I presumed the three lines to display three non-overlapping groups of people, and thus the green line indicates those with 2 doses but not 3. That presumption led me to misinterpret what's on the chart.

B) What is the denominator of the case rates? Is it literal - by that I mean, all unvaccinated people for the red line, and all people with 3 doses for the blue line? Or is the denominator the population of Israel, the same number for all three lines? Lukas asked this question, and got the answer (the former).

Lukas_denominator

C) Since third shots are recommended for 60 year olds and over who were vaccinated at least 5 months ago, and most unvaccinated Israelis are below 60, this answer opens the possibility that the lines compare apples and oranges. Joe. S. asked about this, and received an answer (all lines display only 60 year olds and over.)

Joescholar_basepopulationquestion

Jason P. asked, and learned that the 5-month-out criterion is immaterial since 90% of the vaccinated have already reached that time point.

JasonPogue_5monthsout

D) We have even more presumptions. Like me, did you presume that the red line represents the "unvaccinated," meaning people who have not had any vaccine shots? If so, we may both be wrong about this. It has become the norm by vaccine researchers to lump "partially vaccinated" people with "unvaccinated", and call this combined group "unvaccinated". Here is an excerpt from a recent report from Public Health Ontario (link to PDF), which clearly states this unintuitive counting rule:

Ontario_case_definition

Notice that in this definition, someone who got infected within 14 days of the first shot is classified as an "unvaccinated" case and not a "partially vaccinated case".

In the following tweet, Dvir gave a hint of what he plotted:

Dvir_group_definition

In a previous analysis, he averaged the rates of people with 0 doses and 1 dose, which is equivalent to combining them and calling them unvaccinated. It's unclear to me what he did to the 1-dose subgroup in our featured chart - did it just vanish from the chart? (How people and cases are classified into these groups is a major factor in all vaccine effectiveness calculations - a topic I covered here. Unfortunately, most published reports do a poor job explaining what the analysts did).

E) Did you presume that all three lines are equally important? That's far from true. Since Israel is the world champion in vaccination, the bulk of the 60+ population form the green line. I asked Dvir and he responded that only 7.5%, or roughly 100K are unvaccinated.

DvirAran_proportionofunvaccinated

That means 1.2 million people are part of the green line, 12 times higher. There are roughly 50 cases per day among unvaccinated, and 370 daily cases among those with 2 or 3 doses. In other words, vaccinated people account for almost 90% of all cases.

Yes, this is inevitable when over 90% of the age group have been vaccinated (but it is predictable on the first day someone blasted everywhere that real-world VE is proved by the fact that almost all new cases were in the unvaccinated.)

If your job is to minimize infections, you should be spending most of your time thinking about the 370 cases among vaccinated than the 50 cases among unvaccinated. If you halve the case rate, that would be a difference of 185 cases vs 25. In Israel, the vaccination campaign has already succeeded; it's time to look forward, which is exactly why they are re-focusing on the already vaccinated.

***

If what you worry about most is the effectiveness of the original two-dose regimen, Dvir's chart raises a puzzle. Ignore the blue line, and remember that the green line already includes everybody represented by the blue line.

In the following chart, I removed the blue line, and added reference lines in dashed purple that correspond to 25%, 50% and 75% vaccine effectiveness. The data plotted on this chart are unadjusted case rates. A 75% effective vaccine cuts case rate by three quarters.

Junkcharts_dviraran_israel_threeshotschart

This chart shows the 2-dose mRNA vaccine was nowhere near 90% effective. (As regular readers know, I don't endorse this simplistic calculation and have outlined the problems here, but this style of calculation keeps getting published and passed around. Those who use it to claim real-world studies confirm prior clinical trial outcomes can either (a) insist on using it and retract their earlier conclusions, or (b) admit that such a calculation was, and is, a bad take.)

Also observe how the vaccinated (green) line is moving away from the unvaccinated (red) line. The vaccine apparently is becoming more effective, which runs counter to the trend used by the Israeli government to justify third doses. This improvement also precedes the start of the third-shot campaign. When the analytical method is bad, it generates all sorts of spurious findings.

***

As Dvir said, it is premature to comment on the third doses based on 10 days of data. For one thing, the vaccine developers insist that their vaccines must be given 14 days to work. In a typical calculation, all of the cases in the blue line fall outside the case-counting window. The effective number of cases that would be attributed to the 3-dose group right now is zero, and the vaccine effectiveness using the standard methodology is 100%, even better than shown in the chart.

There is an alternative interpretation of this graph. Statisticians call this the selection effect. On July 30, the blue line split out of the green: some people were selected to receive the 3rd dose - this includes an official selection (the government makes certain subgroups eligible) as well as a self-selection (within the eligible subgroup, certain people decide to get the 3rd shot earlier.) If those who are less exposed to the virus, or more risk averse, get the shots first, then all that is happening may be that we have split off a high VE subgroup from the green line. Even if the third shot were useless, the selection effect itself could explain the gap.

Statistics is about grays. It's not either-or. It's usually some of each. If you feel like Groundhog Day, you're getting the picture. When they rolled out two doses, we lived through an optimistic period in which most experts rejoiced about 90-100% real-world effectiveness, and then as more people get vaccinated, the effect washed away. The selection effect gradually disappears when vaccination becomes widespread. Are we starting a new cycle of hope and despair? We'll find out soon enough.


Did the pandemic drive mass migration?

The Wall Street Journal ran this nice compact piece about migration patterns during the pandemic in the U.S. (link to article)

Wsj_migration

I'd look at the chart on the right first. It shows the greatest net flow of people out of the Northeast to the South. This sankey diagram is nicely done. The designer shows restraint in not printing the entire dataset on the chart. If a reader really cares about the net migration from one region to a specific other region, it's easy to estimate the number even though it's not printed.

The maps succinctly provide readers the definition of the regions.

To keep things in perspective, we are talking around 100,000 when the death toll of Covid-19 is nearing 600,000. Some people have moved but almost everyone else haven't.

***

The chart on the left breaks down the data in a different way - by urbanicity. This is a variant of the stacked column chart. It is a chart form that fits the particular instance of the dataset. It works only because in every month of the last three years, there was a net outflow from "large metro cores". Thus, the entire series for large metro cores can be pointed downwards.

The fact that this design is sensitive to the dataset is revealed in the footnote, which said that the May 2018 data for "small/medium metro" was omitted from the chart. Why didn't they plot that number?

It's the one datum that sticks out like a sore thumb. It's the only negative number in the entire dataset that is not associated with "large metro cores". I suppose they could have inserted a tiny medium green slither in the bottom half of that chart for May 2018. I don't think it hurts the interpretation of the chart. Maybe the designer thinks it might draw unnecessary attention to one data point that really doesn't warrant it.

***

See my collection of posts about Wall Street Journal graphics.


The time has arrived for cumulative charts

Long-time reader Scott S. asked me about this Washington Post chart that shows the disappearance of pediatric flu deaths in the U.S. this season:

Washingtonpost_pediatricfludeaths

The dataset behind this chart is highly favorable to the designer, because the signal in the data is so strong. This is a good chart. The key point is shown clearly right at the top, with an informative title. Gridlines are very restrained. I'd draw attention to the horizontal axis. The master stroke here is omitting the week labels, which are likely confusing to all but the people familiar with this dataset.

Scott suggested using a line chart. I agree. And especially if we plot cumulative counts, rather than weekly deaths. Here's a quick sketch of such a chart:

Junkcharts_redo_wppedflu_panel

(On second thought, I'd remove the week numbers from the horizontal axis, and just go with the month labels. The Washington Post designer is right in realizing that those week numbers are meaningless to most readers.)

The vaccine trials have brought this cumulative count chart form to the mainstream. For anyone who have seen the vaccine efficacy charts, the interpretation of the panel of line charts should come naturally.

Instead of four plots, I prefer one plot with four superimposed lines. Like this:

Junkcharts_redo_wppeddeaths_superpose2

 

 

 


Locating the political center

I mentioned the September special edition of Bloomberg Businessweek on the election in this prior post. Today, I'm featuring another data visualization from the magazine.

Bloomberg_politicalcenter_print_sm

***

Here are the rightmost two charts.

Bloomberg_politicalcenter_rightside Time runs from top to bottom, spanning four decades.

Each chart covers a political issue. These two charts concern abortion and marijuana.

The marijuana question (far right) has only two answers, legalize or don't legalize. The underlying data measure the proportions of people agreeing to each point of view. Roughly three-quarters of the population disagreed with legalization in 1980 while two-thirds agree with it in 2020.

Notice that there are no horizontal axis labels. This is a great editorial decision. Only coarse trends are of interest here. It's not hard to figure out the relative proportions. Adding labels would just clutter up the display.

By contrast, the abortion question has three answer choices. The middle option is "Sometimes," which is represented by a white color, with a dot pattern. This is an issue on which public opinion in aggregate has barely shifted over time.

The charts are organized in a small-multiples format. It's likely that readers are consuming each chart individually.

***

What about the dashed line that splits each chart in half? Why is it there?

The vertical line assists our perception of the proportions. Think of it as a single gridline.

In fact, this line is underplayed. The headline of the article is "tracking the political center." Where is the center?

Until now, we've paid attention to the boundaries between the differently colored areas. But those boundaries do not locate the political center!

The vertical dashed line is the political center; it represents the view of the median American. In 1980, the line sat inside the gray section, meaning the median American opposed legalizing marijuana. But the prevalent view was losing support over time and by 2010, there wer more Americans wanting to legalize marijuana than not. This is when the vertical line crossed into the green zone.

The following charts draw attention to the middle line, instead of the color boundaries:

Junkcharts_redo_bloombergpoliticalcenterrightsideOn these charts, as you glance down the middle line, you can see that for abortion, the political center has never exited the middle category while for marijuana, the median American didn't want to legalize it until an inflection point was reached around 2010.

I highlight these inflection points with yellow dots.

***

The effect on readers is entirely changed. The original charts draw attention to the areas first while the new charts pull your eyes to the vertical line.

 


Election visuals: three views of FiveThirtyEight's probabilistic forecasts

As anyone who is familiar with Nate Silver's forecasting of U.S. presidential elections knows, he runs a simulation that explores the space of possible scenarios. The polls that provide a baseline forecast make certain assumptions, such as who's a likely voter. Nate's model unshackles these assumptions from the polling data, exploring how the outcomes vary as these assumptions shift.

In the most recent simulation, his computer explores 40,000 scenarios, each of which predicts a split of the electoral vote, from which the winner of the election can be determined. The model's outcome is usually summarized by a winning probability, which is just the proportion of scenarios under which one candidate wins.

This type of forecasting was responsible for the infamous meltdown in 2016 when most of these models - Nate's being an exception - issued extremely confident predictions that Hillary Clinton wins with 95% or higher probability. Essentially, the probability distribution collapses to a point. This is analogous to an extremely narrow confidence band, indicating almost zero uncertainty about the event. It was as if almost all of the 40,000 scenarios predicted Clinton to be the winner.

The 538 data team has come up with various ways of visualizing the outputs of the model (link). The entire post is worth reading. Here, I'll highlight the most scientific, and direct visual representation, which is the third display.

538_pdf_pair

We start by looking at the bottom of the two charts, showing the predicted electoral votes won  by Democratic challenger Joe Biden, in each of the 40,000 scenarios. Our attention is directed to the thick line that gives the relative chance of Biden's electoral-vote tally. This line is a smoothed summary of the columns in the background, which show the number of times the simulation produces each electoral-vote count.

The highlighted, right side of the chart recounts scenarios in which Biden becomes President, that is to say, he wins more than 270 electoral votes (out of 538, doh). The faded, left side represents scenarios in which Biden is defeated and Trump wins a second term.

The reason I focused on the bottom chart is that the top chart is merely a mirror image of this one. Just reflect the bottom chart around the vertical axis of 270 electoral votes, change the color scheme to red, and swap annotations related to Trump and Biden, and you get the other chart. This is because the narrative has excluded third-party and write-in candidates, leaving us with a zero-sum situation.

Alternatively, one can jam both charts into one, while supplying extra labels, like this:

Redo_junkcharts_538forecastpdf_1

I prefer the denser single chart because my mind wanders away searching for extra meaning when chart elements are mirrored.

One advantage of the mirrored presentation is that the probability profiles of the potential Trump or Biden wins can be directly compared. We learn that Trump's winning margins are smaller, rarely above 150, and never above 250.

This comparison is made easier by flipping left side of the chart onto the right side:

Redo_junkcharts_538forecastpdf_2

Those are three different visualizations using the same chart form. I'd have to run a poll to figure out which is the best. What's your opinion?


How many details to include in a chart

This graphic by Bloomberg provides the context for understanding the severity of the Atlantic storm season. (link)

Bloomberg_2020storms_vertical

At this point of the season, 2020 appears to be one of the most severe in history.

I was momentarily fascinated by a feature of modern browser-based data visualization: the death of the aspect ratio. When the browser window is stretched sufficiently wide, the chart above is transformed to this look:

Bloomberg_2020storms_horizontal

The chart designer has lost control of the aspect ratio.

***

This Bloomberg chart is an example of the spaghetti-style plots that convey variability by displaying individual units of data (here, storm years). The envelope of the growth curves gives the range of historical counts while the density of curves roughly offers some sense of the most likely counts at different points of the season.

But these spaghetti-style plots are not precise at conveying the variability because the density is hard to gauge. That's where aggregating the individual units helps.

The following chart does not show individual storm years. It shows the counts for the median season at selected points in time, and also a band of variability (for example, you'd include say 90 or 95% of the seasons).

Redo_bloomberg_2020storms

I don't have the raw data so the aggregating is done by eyeballing the spaghetti.

I prefer this presentation even though it does not plot every single data point one has in the dataset.

 

 


On data volume, reliability, uncertainty and confidence bands

This chart from the Economist caught my eye because of the unusual use of color-coded hexagonal tiles.

Economist_lifequalitywealth1

The basic design of the chart is easy to grasp: It relates people's "happiness" to national wealth. The thick black line shows that the average citizen of wealthier countries tends to rate their current life situation better.

For readers alert to graphical details, things can get a little confusing. The horizontal "wealth" axis is shown in log scale, which means that the data on the right side of the chart have been compressed while the data on the left side of the chart have been stretched out. In other words, the curve in linear scale is much flatter than depicted.

Redo_economistlifesatisfaction_linear

One thing you might notice is how poor the fit of the line is at both ends. Singapore and Afghanistan are clearly not explained by the fitted line. (That said, the line is based on many more dots than those eight we can see.) Moreover, because countries are widely spread out on the high end of the wealth axis, the fit is not impressive. Log scales tend to give a false impression of the tightness of fit, as I explained before when discussing coronavirus case curves.

***

The hexagonal tiles replace the more typical dot scatter or contour shading. The raw data consist of results from polls conducted in different countries in different years. For each poll, the analyst computes the average life satisfaction score for that country in that year. From national statistics, the analyst pulls out that country's GDP per capita in that year. Thus, each data point is a dot on the canvass. A few data points are shown as black dots. Those are for eight highlighted countries for the year 2018.

The black line is fitted to the underlying dot scatter and summarizes the correlation between average wealth and average life satisfaction. Instead of showing the scatter, this Economist design aggregates nearby dots into hexagons. The deepest red hexagon, sandwiched between Finland and the US, contains about 60-70 dots, according to the color legend.

These details are tough to take in. It's not clear which dots have been collected into that hexagon: are they all Finland or the U.S. in various years, or do they include other countries? Each country is represented by multiple dots, one for each poll year. It's also not clear how much variation there exists within a country across years.

***

The hexagonal tiles presumably serve the same role as a dot scatter or contour shading. They convey the amount of data supporting the fitted curve along its trajectory. More data confers more reliability.

For this chart, the hexagonal tiles do not add any value. The deepest red regions are those closest to the black line so nothing is actually lost by showing just the line and not the tiles.

Redo_economistlifesatisfaction_nohex

Using the line chart obviates the need for readers to figure out the hexagons, the polls, the aggregation, and the inevitable unanswered questions.

***

An alternative concept is to show the "confidence band" or "error bar" around the black line. These bars display the uncertainty of the data. The wider the band, the less certain the analyst is of the estimate. Typically, the band expands near the edges where we have less data.

Here is conceptually what we should see (I don't have the underlying dataset so can't compute the confidence band precisely)

Redo_economistlifesatisfaction_confband

The confidence band picture is the mirror image of the hexagonal tiles. Where the poll density is high, the confidence band narrows, and where poll density is low, the band expands.

A simple way to interpret the confidence band is to find the country's wealth on the horizontal axis, and look at the range of life satisfaction rating for that value of wealth. Now pick any number between the range, and imagine that you've just conducted a survey and computed the average rating. That number you picked is a possible survey result, and thus a valid value. (For those who know some probability, you should pick a number not at random within the range but in accordance with a Bell curve, meaning picking a number closer to the fitted line with much higher probability than a number at either edge.)

Visualizing data involves a series of choices. For this dataset, one such choice is displaying data density or uncertainty or neither.


Visualizing black unemployment in the U.S.

In a prior post, I explained how the aggregate unemployment rate paints a misleading picture of the employment situation in the United States. Even though the U3 unemployment rate in 2019 has returned to the lowest level we have seen in decades, the aggregate statistic hides some concerning trends. There is an alarming rise in the proportion of people considered "not in labor force" by the Bureau of Labor Statistics - these forgotten people are not counted as "employable": when a worker drops out of the labor force, the unemployment rate ironically improves.

In that post, I looked at the difference between men and women. This post will examine the racial divide, whites and blacks.

I did not anticipate how many obstacles I'd encounter. It's hard to locate a specific data series, and it's harder to know whether the lack of search results indicates the non-existence of the data, or the incompetence of the search engine. Race-related data tend not to be offered in as much granularity. I was only able to find quarterly data for the racial analysis while I had monthly data for the gender analysis. Also, I only have data from 2000, instead of 1990.

***

As before, I looked at the official unemployment rate first, this time presented by race. Because whites form the majority of the labor force, the overall unemployment rate (not shown) is roughly the same as that for whites, just pulled up slightly toward the line for blacks.

Jc_unemploybyrace

The racial divide is clear as day. Throughout the past two decades, black Americans are much more likely to be unemployed, and worse during recessions.

The above chart determines the color encoding for all the other graphics. Notice that the best employment situations occurred on either end of this period, right before the dotcom bust in 2000, and in 2019 before the Covid-19 pandemic. As explained before, despite the headline unemployment rate being the same in those years, the employment situation was not the same.

***

Here is the scatter plot for white Americans:

Jc_unemploybyrace_scatter_whites

Even though both ends of the trajectory are marked with the same shade of blue, indicating almost identical (low) rates of unemployment, we find that the trajectory has failed to return to its starting point after veering off course during the recession of the early 2010s. While the proportion of part-time workers (counted as employed) returned to 17.5% in 2019, as in 2000, about 15 percent more whites are now excluded from the unemployment rate calculation.

The experience of black Americans appears different:

Jc_unemploybyrace_scatter_blacks

During the first decade, the proportion of black Americans dropping out of the labor force accelerated while among those considered employed, the proportion holding part-time jobs kept increasing. As the U.S. recovered from the Great Recession, we've seen a boomerang pattern. By 2019, the situation was halfway back to 2000. The last available datum for the first quarter of 2020 is before Covid-19; it actually showed a halt of the boomerang.

If the pattern we saw in the prior post holds for the Covid-19 world, we would see a marked spike in the out-of-labor-force statistic, coupled with a drop in part-time employment. It appeared that employers were eliminating part-time workers first.

***

One reader asked about placing both patterns on the same chart. Here is an example of this:

Jc_unemploybyrace_scatter_both

This graphic turns out okay because the two strings of dots fit tightly into the grid while not overlapping. There is a lot going on here; I prefer a multi-step story than throwing everything on the wall.

There is one insight that this chart provides that is not easily observed in two separate plots. Over the two decades, the racial gap has narrowed in these two statistics. Both groups have traveled to the top right corner, which is the worst corner to reside -- where more people are classified as not employable, and more of the employed are part-time workers.

The biggest challenge with making this combined scatter plot is properly controlling the color. I want the color to represent the overall unemployment rate, which is a third data series. I don't want the line for blacks to be all red, and the line for whites to be all blue, just because black Americans face a tough labor market always. The color scheme here facilitates cross-referencing time between the two dot strings.


Designs of two variables: map, dot plot, line chart, table

The New York Times found evidence that the richest segments of New Yorkers, presumably those with second or multiple homes, have exited the Big Apple during the early months of the pandemic. The article (link) is amply assisted by a variety of data graphics.

The first few charts represent different attempts to express the headline message. Their appearance in the same article allows us to assess the relative merits of different chart forms.

First up is the always-popular map.

Nytimes_newyorkersleft_overallmap

The advantage of a map is its ease of comprehension. We can immediately see which neighborhoods experienced the greater exoduses. Clearly, Manhattan has cleared out a lot more than outer boroughs.

The limitation of the map is also in view. With the color gradient dedicated to the proportions of residents gone on May 1st, there isn't room to express which neighborhoods are richer. We have to rely on outside knowledge to make the correlation ourselves.

The second attempt is a dot plot.

Nytimes_newyorksleft_percentathome

We may have to take a moment to digest the horizontal axis. It's not time moving left to right but income percentiles. The poorest neighborhoods are to the left and the richest to the right. I'm assuming that these percentiles describe the distribution of median incomes in neighborhoods. Typically, when we see income percentiles, they are based on households, regardless of neighborhoods. (The former are equal-sized segments, unlike the latter.)

This data graphic has the reverse features of the map. It does a great job correlating the drop in proportion of residents at home with the income distribution but it does not convey any spatial information. The message is clear: The residents in the top 10% of New York neighborhoods are much more likely to have left town.

In the following chart, I attempted a different labeling of both axes. It cuts out the need for readers to reverse being home to not being home, and 90th percentile to top 10%.

Redo_nyt_newyorkerslefttown

The third attempt to convey the income--exit relationship is the most successful in my mind. This is a line chart, with time on the horizontal axis.

Nyt_newyorkersleft_percenthomebyincome

The addition of lines relegates the dots to the background. The lines show the trend more clearly. If directly translated from the dot plot, this line chart should have 100 lines, one for each percentile. However, the closeness of the top two lines suggests that no meaningful difference in behavior exists between the 20th and 80th percentiles. This can be conveyed to readers through a short note. Instead of displaying all 100 percentiles, the line chart selectively includes only the 99th , 95th, 90th, 80th and 20th percentiles. This is a design choice that adds by subtraction.

Along the time axis, the line chart provides more granularity than either the map or the dot plot. The exit occurred roughly over the last two weeks of March and the first week of April. The start coincided with New York's stay-at-home advisory.

This third chart is a statistical graphic. It does not bring out the raw data but features aggregated and smoothed data designed to reveal a key message.

I encourage you to also study the annotated table later in the article. It shows the power of a well-designed table.

[P.S. 6/4/2020. On the book blog, I have just published a post about the underlying surveillance data for this type of analysis.]

 

 


Twitter people UpSet with that Covid symptoms diagram

Been busy with an exciting project, which I might talk about one day. But I promised some people I'll follow up on Covid symptoms data visualization, so here it is.

After I posted about the Venn diagram used to depict self-reported Covid-19 symptoms by users of the Covid Symptom Tracker app (reported by Nature), Xan and a few others alerted me to Twitter discussion about alternative visualizations that people have made after they suffered the indignity of trying to parse the Venn diagram.

To avoid triggering post-trauma, for those want to view the Venn diagram, please click here.

[In the Twitter links below, you almost always have to scroll one message down - saving tweets, linking to tweets, etc. are all stuff I haven't fully figured out.]

Start with the Questions

Xan’s final comment is especially appropriate: "There's an over-riding Type-Q issue: count charts answer the wrong question".

As dataviz designers, we frequently get locked into the mindset of “what is the best way to present this dataset?” This line of thinking leads to overloaded graphics that attempt to answer every possible question that may arise from the data in one panoptic chart, akin to juggling 10 balls at once.

For complex datasets, it is often helpful to narrow down the list of questions, and provide a series of charts, each addressing one or two questions. I’ll come back to this point. I want to first show some of the nicer visuals that others have produced, which brings out the structure and complexity of this dataset.

 

The UpSet chart

The primary contender is the “UpSet” chart form, as best exemplified by Bart’s effort

Upset_bartjutte

The centerpiece of this chart is the matrix of dots. The horizontal rows of dots represent the presence of specific symptoms such as cough and anosmia (loss of smell and taste). The vertical columns are intuitive, once you get it. They represent combinations of symptoms, and the fill/no-fill of the dots indicates which symptoms are being combined. For example, the first column counts people reporting fatigue plus anosmia (but nothing else).

The UpSet chart clearly communicates the structure of the data. In many survey questions (including this one conducted by the Symptom Tracker app), respondents are allowed to check/tick more than one answer choices. This creates a situation where the number of answers (here, symptoms) per respondent can be zero up to the total number of answer choices.

So far, we have built a structure like we have drawn country outlines on a map. There is no data yet. The data are primarily found in the sidebar histograms (column/bar charts). Reading horizontally to the right side, one learns that the most frequently reported symptom was fatigue, covering 88 percent of the users.* Reading vertically, one learns that the top combination of symptoms was fatigue plus anosmia, covering 16 percent of users.

***

Now come the divisive acts.

Act 1: Bart orders the columns in a particular way that meets his subjective view of how he wants readers to see the data. The columns are sorted from the most frequent combinations to the least. The histogram has a “long tail”, with most of the combinations receiving a small proportion of the total. The top five combinations is where the bulk of the data is – I’d have liked to see all five columns labeled, without decimal places.

This is a choice on the part of the designer. Nils, for example, made two versions of his UpSet charts. The second version arranges the combinations from singles to quintuples.

Nils Gehlenborg_upsetplot_sortedbynumberofsymptoms

 

Digression: The Visual in Data Visualization

The two rendering of “UpSet” charts, by Nils and Bart, is a perfect illustration of the Trifecta Checkup framework. Each corner of the Trifecta is an independent dimension, and yet all must sync. With the same data and the same question types, what differentiates the two versions is the visual design.

See how many differences you can find, and make your own design choices!

 

I place the digression here because Act 1 above has to do with the Q corner, and both visual designs can accommodate the sorting decisions. But Act 2 below pertains to the V corner.

Act 2: Bart applies a blue gradient to the matrix of dots that reinforces his subjective view about identifying frequent combinations of symptoms. Nils, by contrast, uses the matrix to show present/absent only.

I’m not sure about Act 2. I think the addition of the color gradient overloads the matrix in the chart. It has the nice effect of focusing the reader’s attention on the top 5 combinations but it also requires the reader to have understood the meaning of columns first. Perhaps applying the gradient to the histogram up top rather than the dots in the matrix can achieve the same goal with less confusion.

 

Getting Obtuse

For example, some readers (e.g. Robin) expressed confusion.

Robin is alleging something the chart doesn’t do. He pointed out (correctly) that while 16 percent experienced fatigue and anosmia only (without other symptoms), more than 50 percent reported fatigue and anosmia, plus other symptoms. That nugget of information is deeply buried inside Bart’s chart – it’s the sum of each column for which the first two dots are filled in. For example, the second column represents fatigue+anosmia+cough. So Robin wants to aggregate those up.

Robin’s critique arises from the Q(uestion) corner. If the designer wants to highlight specific combinations that occur most frequently in the data, then Bart’s encoding makes perfect sense. On the other hand, if the purpose is to highlight pairs of symptoms that occur most frequently together (disregarding symptoms outside each pair), then the data must be further aggregated. The switch in the Question requires more Data manipulation, which then affects the Visualization. That's the essence of the Trifecta Checkup framework.

Rest assured, the version that addresses Robin’s point will not give an easy answer to Bart’s question. In fact, Xan whipped up a bar chart in response:

Xan_symptomscombo_barchart

This is actually hard to comprehend because Robin’s question is even hard to state. The first bar shows 87 percent of users reported fatigue as a symptom, the same number that appeared on Bart’s version on the right side. Then, the darkened section of the bar indicates the proportion of users who reported only fatigue and nothing else, which appears to be about 10 percent. So 1 out of 9 reported just fatigue while 8 out of 9 who reported fatigue also experienced other symptoms.

 

Xan’s bar chart can be flipped 90 degrees and replace Bart’s histogram on top of the matrix. But you see, we end up with the same problem as I mentioned up top. By jamming more insights from more questions onto the same chart, we risk dropping the other balls that were already in the air.

So, my advice is always to first winnow down the list of questions you want to address. And don’t be afraid of making a series of charts instead of one panoptic chart.

***

Act 3: Bart decides to leave out labels for the columns.

This is a curious choice given the key storyline we’ve been working with so far (the Top 5 combinations of symptoms). But notice how annoying this problem is. Combinations require long text, which must be written vertically or slanted on this design. Transposing could help but not really. It’s just a limitation of this chart form. For me, reading the filled dots underneath the columns as column labels isn’t a show-stopper.

 

Histograms vs Bar Charts

It’s worth pointing out that the sidebar “histograms” are not both histograms. I tend to think of histograms as a specific type of bar (column) chart, in which the sum of the bars (columns) can be interpreted as a whole. So all histograms are bar charts but only some bar charts are histograms.

The column chart up top is a histogram. The combinations of symptoms are disjoint, and the total of the combinations should be the total number of answer choices selected by all respondents. The bar chart on the right side however is not a histogram. Each percentage is a proportion to the whole, and adding those percentages yields way above 100%.

I like the annotation on Bart’s chart a lot. They are succinct and they give just the right information to explain how to read the chart.

 

Limitations

I already mentioned the vertical labeling issue for UpSet charts. Here are two other considerations for you.

The majority of the plotting area is dedicated to the matrix of dots. The matrix contains merely labels for data. They are like country boundaries on a map. While it lays out the structure of data very clearly, the designer should ask whether it is essential for the readers to see the entire landscape.

In real-world data, the “long tail” phenomenon we saw earlier is very common. With six featured symptoms, there are 2^6 = 64 possible combinations of symptoms (minus 1 if they filtered out those not reporting symptoms*), almost all of which will be empty. Should the low-frequency columns be removed? This is not as controversial as you think, because implicitly both Bart and Nils already dropped all empty combinations!

 

Data and Code

Kieran Healy left a comment on the last post, and you can find both the data (thank you!) and some R code for UpSet charts at his blog.

Also, Nils has a Shiny app on Github.

 

(*) One must be very careful about what “users” are being represented. They form a tiny subset of users of the Symptom Tracker app, just those who have previously taken a diagnostic test and have self-reported at least one symptom. I have separately commented on the analyses of this dataset by the team behind the app. The first post discusses their analytical methods, the second post examines how they pre-processed the data, and a future post will describe the data collection practices. For the purpose of this blog post, I’ll ignore any data issues.

(#) Bart’s chart is conceptual because some of the columns of dots are repeated, and there is one column without fills, which should have been removed by a pre-processing step applied by the research team.