Probabilities and proportions: which one is the chart showing

The New York Times showed this chart (link):

Nyt_unvaccinated_undeterred

My first read: oh my gosh, 40-50% of the unvaccinated Americans are living their normal lives - dining at restaurants, assembling with more than 10 people, going to religious gatherings.

After reading the text around this chart, I realized I had misinterpreted it.

The chart should be read by columns. Each column is a "pie chart". For example, the first column shows that half the restaurant diners are not vaccinated, a third are fully vaccinated, and the remainder are partially vaccinated. The other columns have roughly the same proportions.

The author says "The rates of vaccination among people doing these activities largely reflect the rates in the population." This line is perhaps more confusing than intended. What she's saying is that in the general population, half of us are unvaccinated, a third are fully vaccinated, and the remainder are partially vaccinated.

Here's a picture:

Junkcharts_redo_nyt_unvaccinatedundeterred

What this chart is saying is that the people dining out are like a random sample from all Americans. So too the other groups depicted. What Americans are choosing to do is independent of their vaccination status.

Unvaccinated people are no less likely to be doing all these activities than the fully vaccinated. This raises the question: are half of the people not wearing masks outdoors unvaccinated?

***

Why did I read the chart wrongly in the first place? It has to do with expectations.

Most survey charts plot probabilities not proportions. I haphazardly grabbed the following Pew Research chart as an example:

Pew_kids_socialmedia

From this chart, we learn that 30% of kids 9-11 years old use TikTok compared to 11% of kids 5-8. The percentages down a column do not sum to 100%.
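The two readings can be made concrete with a small cross-tabulation. The sketch below uses invented counts and group sizes (not the Times's or Pew's data). It computes proportions, which sum to 100% within an activity, and probabilities, which are the share of each group doing the activity and have no reason to sum to 100%.

```python
# Hypothetical survey counts: counts[status][activity] = number of respondents
# with that vaccination status who reported doing the activity.
counts = {
    "unvaccinated":         {"dined out": 250, "religious gathering": 150},
    "partially vaccinated": {"dined out": 80,  "religious gathering": 50},
    "fully vaccinated":     {"dined out": 170, "religious gathering": 100},
}
# Hypothetical number of respondents in each vaccination group.
group_sizes = {
    "unvaccinated": 500,
    "partially vaccinated": 170,
    "fully vaccinated": 330,
}

activities = ["dined out", "religious gathering"]


def proportions(activity):
    """Share of each vaccination status among people doing the activity."""
    total = sum(counts[s][activity] for s in counts)
    return {s: counts[s][activity] / total for s in counts}


def probabilities(activity):
    """Share of each vaccination group that does the activity."""
    return {s: counts[s][activity] / group_sizes[s] for s in counts}


for a in activities:
    print(a, "proportions:", {s: round(p, 2) for s, p in proportions(a).items()})
    print(a, "probabilities:", {s: round(p, 2) for s, p in probabilities(a).items()})
```

With these made-up numbers, the proportions for each activity mirror the group sizes, while the probabilities are roughly equal across groups. That is the "independent of vaccination status" pattern described above.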

 


These are the top posts of 2020

It's always very interesting as a writer to look back at a year's worth of posts and find out which ones were most popular with my readers.

Here are the top posts on Junk Charts from 2020:

How to read this chart about coronavirus risk

This post about a New York Times scatter plot dates from February, a time when many Americans were debating whether Covid-19 was just the flu.

Proportions and rates: we are no dupes

This post about an ArsTechnica chart on the effects of Covid-19 by age is an example of designing the visual to reflect the structure of the data.

When the pie chart is more complex than the data

This post shows a 3D pie chart which is worse than a 2D pie chart.

Twitter people upset with that Covid symptoms diagram

This post discusses some complicated graphics designed to illustrate complicated datasets on Covid-19 symptoms.

Cornell must remove the logs before it reopens in the fall

This post is another warning to think twice before you use log scales.

What is the price of objectivity?

This post turns an "objective" data visualization into a piece of visual story-telling.

The snake pit chart is the best election graphic ever

This post introduces my favorite U.S. presidential election graphic, designed by the FiveThirtyEight team.

***

Here is a list of posts that deserve more attention:

Locating the political center

An example of bringing readers as close to the insights as possible

Visualizing change over time

An example of designing data visualization to reflect the structure of multivariate data

Bloomberg made me digest these graphics slowly

An example of simple and thoughtful graphics

The hidden bad assumption behind most dual-axis time-series charts

Read this before you make a dual-axis chart

Pie chart conventions

Read this before you make a pie chart

***
Looking forward to bringing you more content in 2021!

Happy new year.


Convincing charts showing containment measures work

The disorganized nature of the U.S. response to the coronavirus pandemic has created a sort of natural experiment that allows data journalists to explore important scientific questions, such as the impact of containment measures on cases and hospitalizations. This New York Times article represents the best of such work.

The key finding of the analysis is beautifully captured by this set of scatter plots:

Policies_cases_hosp_static

Each dot is a state. The cases (left plot) and hospitalizations (right plot) are plotted against the severity of containment measures for November. The negative correlation is unmistakable: the more containment measures taken, the lower the counts.

There are a few features worth noting.

The severity index came from a group at Oxford, and is a number between 0 and 100. The journalists decided to leave out the numerical labels, instead simply showing More and Fewer. This significantly reduces processing time. Readers won't be able to understand the index values anyway without reading the manual.

The index values are doubly encoded. They are first encoded by the location on the horizontal axis and redundantly encoded on the blue-red scale. Ordinarily, I do not like redundant encoding because the reader might assume a third dimension exists. In this case, I had no trouble with it.

The easiest way to see the effect is to ignore the muddy middle and focus on the two ends of the severity index. Those states with the fewest measures - South Dakota, North Dakota, Iowa - are the worst in cases and hospitalizations while those states with the most measures - New York, Hawaii - are among the best. This comparison is similar to what is frequently done in scientific studies, e.g. when they say coffee is good for you, they typically compare heavy drinkers (4 or more cups a day) with non-drinkers, ignoring the moderate and light drinkers.

Notably, there is quite a bit of variability for any level of containment measures - roughly 50 cases per 100,000, and 25 hospitalizations per 100,000. This indicates that containment measures are not sufficient to explain the counts. For example, the hospitalization statistic is affected by the stock of hospital beds, which I assume differs by state.

Whenever we use a scatter plot, we run the risk of xyopia. This chart form invites readers to explain an outcome (y-axis values) using one explanatory variable (on x-axis). There is an assumption that all other variables are unimportant, which is usually false.

***

Because of the variability, the horizontal scale has meaningless precision. The next chart cures this by grouping the states into three categories: low, medium and high level of measures.

Cases_over_time_grouped_by_policies

This set of charts extends the time window back to March 1. For the designer, this creates a tricky problem - because states adapt their policies over time. As indicated in the subtitle, the grouping is based on the average severity index since March, rather than just November, as in the scatter plots above.
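Here is a minimal sketch of that kind of grouping. The severity values are invented, the states are anonymous, and splitting the ranked states into three equal-sized groups is my assumption. The Times does not spell out the cutoffs it used.

```python
from statistics import mean

# Hypothetical monthly severity index values (0-100) per state since March.
severity_by_state = {
    "State A": [20, 35, 30, 25],
    "State B": [60, 70, 65, 55],
    "State C": [40, 50, 45, 45],
    "State D": [75, 80, 70, 75],
    "State E": [30, 30, 35, 25],
    "State F": [50, 55, 60, 55],
}


def group_by_average(severity, labels=("low", "medium", "high")):
    """Rank states by average index and split them into equal-sized groups."""
    averages = {state: mean(values) for state, values in severity.items()}
    ranked = sorted(averages, key=averages.get)
    size = len(ranked) / len(labels)
    return {
        state: labels[min(int(i / size), len(labels) - 1)]
        for i, state in enumerate(ranked)
    }


groups = group_by_average(severity_by_state)
print(groups)
```

Averaging over the whole period means a state that tightened late and a state that eased early can land in the same group, which is part of the trickiness mentioned above.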

***

The interplay between policy and health indicators is captured by connected scatter plots, of which the Times article included a few examples. Here is what happened in New York:

NewYork_policies_vs_cases

Up until April, the policies were catching up with the cases. The policies tightened even after the case-per-capita started falling. Then, policies eased a little, and cases started to spike again.

The Note tells us that the containment severity index is time shifted to reflect a two-week lag in effect. So, the case count on May 1 is paired not with the containment severity index of May 1 but with that of April 17, two weeks earlier.
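The time shift can be sketched in a few lines. The daily values below are invented and the dates are arbitrary. The only thing taken from the chart's note is the two-week lag.

```python
from datetime import date, timedelta

LAG = timedelta(days=14)  # two-week lag, as described in the chart's note

# Hypothetical daily series keyed by date (all values are invented).
severity = {date(2020, 6, 1) + timedelta(days=i): 70 - i for i in range(30)}
cases = {date(2020, 6, 1) + timedelta(days=i): 10 + i for i in range(30)}


def pair_with_lag(cases, severity, lag=LAG):
    """Pair each day's case count with the severity index from `lag` earlier."""
    return {
        day: (count, severity[day - lag])
        for day, count in cases.items()
        if day - lag in severity
    }


paired = pair_with_lag(cases, severity)
print(paired[date(2020, 6, 20)])  # cases on June 20, severity from June 6
```

Days in the first two weeks of the series drop out because there is no earlier severity value to pair them with.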

***

You can find the full article here.

 

 

 


Deaths as percent neither of cases nor of population. Deaths as percent of normal.

Yesterday, I posted a note about excess deaths on the book blog (link). The post was inspired by a nice data visualization by the New York Times (link). This is a great example of data journalism.

Nyt_excessdeaths_south

Excess deaths is a superior metric for measuring the effect of Covid-19 on public health. It's better than deaths as percent of cases. Also better than percent of the population. What excess deaths measure is deaths as a percent of normal. Normal is usually defined as the average deaths in the respective week in years past.

The red areas indicate how far the deaths in the Southern states are above normal. The highest peak, registered in Texas in late July, is 60 percent above the normal level.

***

The best way to appreciate the effort that went into this graphic is to imagine receiving the outputs from the model that computes excess deaths. A three-column spreadsheet with columns "state", "week number" and "estimated excess deaths".

The first issue is unequal population sizes. More populous states of course have higher death tolls. Transforming death tolls to an index pegged to the normal level solves this problem. To produce this index, we divide actual deaths by the normal level of deaths. So the spreadsheet must be augmented by two additional columns, showing the historical average deaths and actual deaths for each state for each week. Then, the excess death index can be computed.
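As a rough sketch of that computation, here is the index calculated from an augmented spreadsheet. The rows and death counts are invented for illustration, and the function name is my own.

```python
# Hypothetical rows: (state, week number, historical average deaths, actual deaths).
rows = [
    ("Texas", 30, 4000, 6400),
    ("Florida", 30, 4200, 5460),
    ("Texas", 10, 4100, 4100),
]


def excess_death_index(actual, normal):
    """Actual deaths as a ratio of the normal level (1.0 means exactly normal)."""
    return actual / normal


for state, week, normal, actual in rows:
    index = excess_death_index(actual, normal)
    print(f"{state} week {week}: {index:.2f}x normal, "
          f"{(index - 1) * 100:.0f}% above normal")
```

Because the index is a ratio to each state's own normal, a large and a small state with the same index are equally far above their usual death toll.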

The journalist builds a story around the migration of the coronavirus between different regions as it rages across different states during different weeks. To this end, the designer first divides the dataset into four regions (South, West, Midwest and Northeast). Within each region, the states must be ordered. For each state, the week of peak excess deaths is identified, and the peak index is used to sort the states.
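The ordering step might look like the following sketch. The weekly index values are invented, and sorting from the largest peak to the smallest is my assumption about the direction.

```python
# Hypothetical weekly excess death index values per state, grouped by region.
weekly_index = {
    "South": {
        "Texas": [1.0, 1.2, 1.6, 1.3],
        "Florida": [1.0, 1.1, 1.4, 1.5],
        "Georgia": [1.1, 1.2, 1.3, 1.2],
    },
    "Northeast": {
        "New Jersey": [1.2, 3.5, 2.0, 1.1],
        "Connecticut": [1.1, 2.4, 1.8, 1.0],
    },
}


def order_states_by_peak(states):
    """Return state names sorted by their highest weekly index, largest first."""
    return sorted(states, key=lambda s: max(states[s]), reverse=True)


ordering = {
    region: order_states_by_peak(states)
    for region, states in weekly_index.items()
}
print(ordering)
```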

The graphic utilizes a small-multiples framework. Time occupies the horizontal axis, by convention. The vertical axis is compressed so that the states are not too distant. For the same reason, the component graphs are allowed to overlap vertically. The benefit of the tight arrangement is clearer for the Northeast as those peaks are particularly tall. The space-saving appearance reminds me of sparklines, championed by Ed Tufte.

There is one small tricky problem. In most of June, Texas suffered at least 50 percent more deaths than normal. The severity of this excess death toll is shortchanged by the low vertical height of each component graph. What forced such congestion is probably the data from the Northeast. For example, New York City:

Nyt_excessdeaths_northeast3

 

New York City's death toll was almost 8 times the normal level at the start of the epidemic in the U.S. If the same vertical scale is maintained across the four regions, then the Northeastern states dwarf all else.

***

One key takeaway from the graphic for the Southern states is the persistence of the red areas. In each state, for almost every week of the entire pandemic period, actual deaths have exceeded the normal level. This is a strong indication that the coronavirus is not under control.

In fact, I'd like to see a second set of plots showing the cumulative excess deaths since March. The weekly graphic is better for identifying the ebb and flow while the cumulative graphic takes measure of the total impact of Covid-19.
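The cumulative view is a running total of the weekly excess. A minimal sketch, using invented weekly figures for a single state:

```python
from itertools import accumulate

# Hypothetical weekly figures for one state (invented numbers).
actual_deaths = [4100, 4500, 5200, 6400, 5600]
normal_deaths = [4000, 4000, 4100, 4000, 4100]

# Weekly excess: actual minus normal (negative when deaths fall below normal).
weekly_excess = [a - n for a, n in zip(actual_deaths, normal_deaths)]

# Running total of excess deaths since the first week.
cumulative_excess = list(accumulate(weekly_excess))

print(weekly_excess)
print(cumulative_excess)
```

The weekly series rises and falls, while the cumulative series keeps climbing for as long as deaths stay above normal.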

***

The above description leaves out a huge chunk of work related to computing excess deaths. I assumed the designer receives these estimates from a data scientist. See the related post in which I explain how excess deaths are estimated from statistical models.

 


Designs of two variables: map, dot plot, line chart, table

The New York Times found evidence that the richest segments of New Yorkers, presumably those with second or multiple homes, exited the Big Apple during the early months of the pandemic. The article (link) is amply assisted by a variety of data graphics.

The first few charts represent different attempts to express the headline message. Their appearance in the same article allows us to assess the relative merits of different chart forms.

First up is the always-popular map.

Nytimes_newyorkersleft_overallmap

The advantage of a map is its ease of comprehension. We can immediately see which neighborhoods experienced the greater exoduses. Clearly, Manhattan has cleared out a lot more than outer boroughs.

The limitation of the map is also in view. With the color gradient dedicated to the proportions of residents gone on May 1st, there isn't room to express which neighborhoods are richer. We have to rely on outside knowledge to make the correlation ourselves.

The second attempt is a dot plot.

Nytimes_newyorksleft_percentathome

We may have to take a moment to digest the horizontal axis. It's not time moving left to right but income percentiles. The poorest neighborhoods are to the left and the richest to the right. I'm assuming that these percentiles describe the distribution of median incomes in neighborhoods. Typically, when we see income percentiles, they are based on households, regardless of neighborhoods. (The former are equal-sized segments, unlike the latter.)

This data graphic has the reverse features of the map. It does a great job correlating the drop in proportion of residents at home with the income distribution but it does not convey any spatial information. The message is clear: The residents in the top 10% of New York neighborhoods are much more likely to have left town.

In the following chart, I attempted a different labeling of both axes. It cuts out the need for readers to reverse being home to not being home, and 90th percentile to top 10%.

Redo_nyt_newyorkerslefttown

The third attempt to convey the income--exit relationship is the most successful in my mind. This is a line chart, with time on the horizontal axis.

Nyt_newyorkersleft_percenthomebyincome

The addition of lines relegates the dots to the background. The lines show the trend more clearly. If directly translated from the dot plot, this line chart should have 100 lines, one for each percentile. However, the closeness of the top two lines suggests that no meaningful difference in behavior exists between the 20th and 80th percentiles. This can be conveyed to readers through a short note. Instead of displaying all 100 percentiles, the line chart selectively includes only the 99th, 95th, 90th, 80th and 20th percentiles. This is a design choice that adds by subtraction.

Along the time axis, the line chart provides more granularity than either the map or the dot plot. The exit occurred roughly over the last two weeks of March and the first week of April. The start coincided with New York's stay-at-home advisory.

This third chart is a statistical graphic. It does not bring out the raw data but features aggregated and smoothed data designed to reveal a key message.

I encourage you to also study the annotated table later in the article. It shows the power of a well-designed table.

[P.S. 6/4/2020. On the book blog, I have just published a post about the underlying surveillance data for this type of analysis.]

 

 


How to read this chart about coronavirus risk

In my just-published Long Read article at DataJournalism.com, I touched upon the subject of "How to Read this Chart".

Most data graphics do not come with directions of use because dataviz designers follow certain conventions. We do not need to tell you, for example, that time runs left to right on the horizontal axis (substitute right to left for those living in right-to-left countries). It's when we deviate from the norms that a "How to Read this Chart" box is called for.

***
A discussion over Twitter during the weekend on the following New York Times chart perfectly illustrates this issue. (The article is well worth reading to educate oneself on this red-hot public-health issue. I made some comments on the sister blog about the data a few days ago.)

Nyt_coronavirus_scatter

Reading this chart, I quickly grasp that the horizontal axis is the speed of infection and the vertical axis represents the deadliness. Without being told, I used the axis labels (and some of you might notice the annotations with the arrows on the top right). But most people will likely miss - at a glance - that the vertical axis utilizes a log scale while the horizontal axis is linear (regular).

The effect of a log scale is to pull the large numbers toward the average while spreading the smaller numbers apart - when compared to a linear scale. So when we look at the top of the coronavirus box, it appears that this virus could be as deadly as SARS.

The height of the pink box is 3.9, while the gap between the top edge of the box and the SARS dot is 6. Yet our eyes tell us the top edge is closer to the SARS dot than it is to the bottom edge!
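To see why our eyes are fooled, compare the same distances on the two scales. In the sketch below, the three values are my assumptions, chosen only so that the differences match the 3.9 and 6 quoted above. They are not read off the Times's chart.

```python
import math

# Assumed values, chosen only to match the differences quoted in the text
# (box height 3.9, gap 6); they are not taken from the Times's chart.
box_bottom = 0.1
box_top = 4.0
sars = 10.0

# Distances on a linear scale are plain differences.
linear_box_height = box_top - box_bottom
linear_gap = sars - box_top

# Distances on a log scale are differences of logarithms, i.e. ratios.
log_box_height = math.log10(box_top) - math.log10(box_bottom)
log_gap = math.log10(sars) - math.log10(box_top)

print(f"linear: box height {linear_box_height:.1f}, gap to SARS {linear_gap:.1f}")
print(f"log10:  box height {log_box_height:.2f}, gap to SARS {log_gap:.2f}")
```

Under these assumed values, the gap is larger than the box on the linear scale, but on the log scale the box is several times taller than the gap. The log scale measures ratios, and the box spans a 40-fold range while the gap spans only 2.5-fold.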

There is nothing inaccurate about this chart - the log scale introduces such distortion. The designer has to make a choice.

Indeed, there were two camps on Twitter, arguing for and against the log scale.

***

I use log scales a lot in analyzing data, but tend not to use log scales in a graph. It's almost a given that using the log scale requires a "How to Read this Chart" message. And the NY Times crew delivers!

Right below the chart is a paragraph:

Nyt_coronavirus_howtoreadthis

To make this even more interesting, the horizontal axis is a hidden "log" scale. That's because infections spread exponentially. Even though the scale is not labeled "log", think as if the large values have been pulled toward the middle.

Here is an over-simplified way to see this. A disease that spreads at a rate of fifteen people at a time is not 3 times worse than one that spreads five at a time. In the former case, the first sick person transmits it to 15, and then each of the 15 transmits it to 15 others, thus after two steps, 241 people have been infected (225 + 15 + 1). In the latter case, it's 5x5 + 5 + 1 = 31 infections after two steps. So at this point, the number of infected is already 8 times worse, not 3 times. And the gap keeps widening with each step.
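The same over-simplified arithmetic can be extended by a few steps. The sketch assumes, as above, that every infected person passes the disease to a fixed number of new people at each step. The function name is my own.

```python
def total_infected(rate, steps):
    """Total infections after `steps` rounds, counting the first sick person,
    assuming every infected person passes the disease to `rate` new people."""
    return sum(rate ** k for k in range(steps + 1))


for steps in (1, 2, 3):
    fast = total_infected(15, steps)
    slow = total_infected(5, steps)
    print(f"after {steps} step(s): {fast} vs {slow}, ratio {fast / slow:.1f}")
```

The ratio between the two diseases grows at every step, which is why equal distances on the horizontal axis do not represent equal differences in how bad the spread is.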

P.S. See also my post on the sister blog that digs deeper into the metrics.

 


All these charts lament the high prices charged by U.S. hospitals

Nyt_medicalprocedureprices

A former student asked me about this chart from the New York Times that highlights much higher prices of hospital procedures in the U.S. relative to a comparison group of seven countries.

The dot plot is clearly thought through. It is not a default chart that pops out of software.

Based on its design, we surmise that the designer has the following intentions:

  1. The names of the medical procedures are printed to be read, thus the long text is placed horizontally.

  2. The actual price is not as important as the relative price, expressed as an index with the U.S. price at 100%. These reference values are printed in glaring red, unignorable.

  3. Notwithstanding the above point, the actual price is still of secondary importance, and the values are provided as a supplement to the row labels. Getting to the actual prices in the comparison countries requires further effort, and a calculator.

  4. The primary comparison is between the U.S. and the rest of the world (or the group of seven countries included). It is less important to distinguish specific countries in the comparison group, and thus the non-U.S. dots are given pastels that take some effort to differentiate.

  5. Probably due to reader feedback, the font size is subject to a minimum so that some labels are split into two lines to prevent the text from dominating the plotting region.

***

In the Trifecta Checkup view of the world, there is no single best design. The best design depends on the intended message and what’s in the available data.

To illustrate this, I will present a few variants of the above design, and discuss how these alternative designs reflect the designer's intentions.

Note that in all my charts, I expressed the relative price in terms of discounts, which is the mirror image of premiums. Instead of saying Country A's price is 80% of the U.S. price, I prefer to say Country A's price is a 20% saving (or discount) off the U.S. price.

First up is the following chart that emphasizes countries instead of hospital procedures:

Redo_medicalprice_hor_dot

This chart encourages readers to draw conclusions such as "Hospital prices are 60-80 percent cheaper in Holland relative to the U.S." But it is more taxing to compare the cost of a specific procedure across countries.

The indexing strategy already creates a barrier to understanding relative costs of a specific procedure. For example, the value for angioplasty in Australia is about 55% and in Switzerland, about 75%. The difference 75%-55% is meaningless because both numbers are relative savings from the U.S. baseline. Comparing Australia and Switzerland requires a ratio (0.75/0.55 = 1.36): Australia's prices are 36% above Swiss prices, or alternatively, Swiss prices are a 26% discount off Australia's prices.

The following design takes it even further, excluding details of individual procedures:

Redo_medicalprice_hor_bar

For some readers, less is more. It’s even easier to get a rough estimate of how much cheaper prices are in the comparison countries because, except for two “outliers”, the chart no longer displays individual values.

The widths of these bars reveal that in some countries, the amount of savings depends on the specific procedures.

The bar design releases the designer from a horizontal orientation. The country labels are shorter and can be placed at the bottom in a vertical design:

Redo_medicalprice_vert_bar

It's not that one design is obviously superior to the others. Each version does some things better. A good designer recognizes the strengths and weaknesses of each design, and selects one to fulfil his/her intentions.

 

P.S. [1/3/20] Corrected a computation, explained in Ken's comment.


Where are the Democratic donors?

I like Alberto's discussion of the attractive maps about donors to Democratic presidential candidates, produced by the New York Times (direct link).

Here is the headline map:

Nyt_demdonormaps

The message is clear: Bernie Sanders is the only candidate with nation-wide appeal. The breadth of his coverage is breath-taking. (I agree with Alberto's critique about the lack of a color scale. It's impossible to know if the counts are trivial or not.)

Bernie's coverage is so broad that his numbers overwhelm those of all other candidates except in their home bases (e.g. O'Rourke in Texas).

A remedy to this is to look at the data after removing Bernie's numbers.

Nyt_demdonormap_2

 

This pair of maps reminds me of the Sri Lanka religions map that I revisualized in this post.

Redo_srilankareligiondistricts_v2

The first two maps divide the districts into those in which one religion dominates and those in which multiple religions share the limelight. The third map then shows the second-rank religion in the mixed-religions districts.

The second map in the NYT's donor map series plots the second-rank candidate in all the precincts that Bernie Sanders leads. It's like the designer pulled off the top layer (blue: Bernie) to reveal what's underneath.

Because all of Bernie's data are removed, O'Rourke is still dominating Texas, Buttigieg in Indiana, etc. An alternative is to pull off the top layer in those pockets as well. Then, Bernie would likely show up in those areas.

The other startling observation is how small Joe Biden's presence is on these maps. This is likely because Biden relies primarily on big donors.

See here for the entire series of donor maps. See here for past discussion of New York Times's graphics.


Powerful photos visualizing housing conditions in Hong Kong

I was going to react to Alberto's post about the New York Times's article about economic inequality in Hong Kong, which is proposed as one origin to explain the current protest movement. I agree that the best graphic in this set is the "photoviz" showing the "coffins" or "cages" that many residents live in, because of the population density. 

Nyt_hongkong_apartment_photoviz

Then I searched the archives, and found this old post from 2015 which is the perfect response to it. What's even better, that post was also inspired by Alberto.

The older post featured a wonderful campaign by human rights organization Society for Community Organization that uses photoviz to draw attention to the problem of housing conditions in Hong Kong. They organized a photography exhibit on this theme in 2014. They then updated the exhibit in 2016.

Here is one of the iconic photos by Benny Lam:

Soco_trapped_B1

I found more coverage of Benny's work here. There is also a book that we can flip on Vimeo.

In 2017, the South China Morning Post (SCMP) published drone footage showing the outside view of the apartment buildings.

***

What's missing is the visual comparison to the luxury condos where the top 1 percent live. For these, one can visit the real estate sites, such as Sotheby's. Here is their "12 luxury homes for sales" page.

Another comparison: a 1,000-square-foot apartment that sits between those extremes. The photo by John Butlin comes from SCMP's Post Magazine's feature on the apartment:

Butlin_scmp_home

***

Also check out my review of Alberto's fantastic, recent book, How Charts Lie.

Cairo_howchartslie_cover

 

 


Say it thrice: a nice example of layering and story-telling

I enjoyed the New York Times's data viz showing how actively the Democratic candidates were criss-crossing the nation in the month of March (link).

It is a great example of layering the presentation, starting with an eye-catching map at the most aggregate level. The designers looped through the same dataset three times.

Nyt_candidatemap_1

This compact display packs quite a lot. We can easily identify which were the most popular states; and which candidate visited which states the most.

I noticed how they handled the legend. There is no explicit legend. The candidate names are spread around the map. The size legend is also missing, replaced by a short sentence explaining that size encodes the number of cities visited within the state. For a chart like this, having a precise size legend isn't that useful.

The next section presents the same data in a small-multiples layout. The heads are replaced by dots.

Nyt_candidatemap_2

This allows more precise comparison of one candidate to another, and one location to another.

This display has one shortcoming. If you compare the left two maps above, those for Amy Klobuchar and Beto O'Rourke, it looks like they have visited a roughly similar number of cities when in fact Beto went to 42 compared to 25. Reducing the size of the dots might work.

Then, in the third visualization of the same data, the time dimension is emphasized. Lines are used to animate the daily movements of the candidates, one by one.

Nyt_candidatemap_3

Click here to see the animation.

When repetition is done right, it doesn't feel like repetition.