Come si dice donut in italiano

One of my Italian readers sent me the following "horror chart". (Last I checked, it's not Halloween.)

Horrorchart

I mean, people are selling these rainbow sunglasses.

Rainbowwunglasses

The dataset behind the chart is the market share of steel production by country in 1992 and in 2014. The presumed story is how steel production has shifted from country to country over those 22 years.

Before anything else, readers must decipher the colors. This takes their eyes off the data and onto the color legend placed in the right column. The order of the color legend is different from that found in the nearest object, the 2014 donut. The following shows how our eyes roll while making sense of the donut chart.

Junkcharts_steeldonuts_eye1

It's easier to read the 1992 donut because of the order but now, our eyes must leapfrog the 2014 donut.

Junkcharts_steeldonuts_eye2

This is another example of a visualization that fails the self-sufficiency test. The entire dataset is actually printed around the two circles. If we delete the data labels, it becomes clear that readers are consuming the data labels, not the visual elements of the chart.

Junkcharts_steeldonuts_sufficiency

The chart is aimed at an Italian audience so they may have a patriotic interest in the data for Italia. What they find is disappointing. Italy apparently completely dropped out of steel production. It produced 3% of the world's steel in 1992 but zero in 2014.

Now I don't know if that is true because while reproducing the chart, I noticed that in the 2014 donut, there is a dark orange color that is not found in the legend. Is that Italy or a mysterious new entrant to steel production?

One alternative is a dot plot. This design accommodates arrows between the dots indicating growth versus decline.

Junkcharts_redo_steeldonuts

 


A note to science journal editors: require better visuals

In reviewing a new small-scale study of the Moderna vaccine, I found this chart:

Modernahalfdoses_fig3a

This style of chart is quite common in scientific papers, and it is horrible. It irks me to think that some authors are forced to adopt such styles.

The study's main goal is to compare two half doses to two full doses of the Moderna vaccine. (To understand the science, read the post on my book blog.) The participants were stratified by age group. The vaccine is expected to work better for younger people than for older people. The point of the study isn't to measure the difference by age group, and so the age-group dimension is secondary.

Upon recognizing that, I reduce the number of colors from 4 to 2:

Junkcharts_redo_modernahalfdoses_1

Halving the number of colors presents no additional difficulty. The reader spends less time cross-referencing.

The existence of the Pbo (placebo) and Conv (convalescent plasma) columns on the sides is both unsightly and suboptimal. The "Conv" serves as a reference level for the amount of antibodies the vaccine stimulates in people. A better way to display reference levels is using reference lines.

Junkcharts_redo_modernahalfdoses_2color

The biggest problem with the chart is the log scale on the vertical axis. This isn't even a log-10 but a log-2. (Each tick is a doubling of value.)

Take the first set of columns as an example. The second column is clearly less than twice the height of the first column, and yet 25 is roughly 3.5 times as large as 7. The third column is also visually less than double the size of the second column, and yet 189 is roughly 7.5 times as large as 25. The areas (heights) of the columns do not convey the right information about relative sizes of the underlying data.
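To make the mismatch concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that the log-2 axis starts at 1, so that a column's drawn height equals log2 of its value; the published chart's baseline may differ, which would change the height ratios. The names `AXIS_BASELINE` and `drawn_height` are hypothetical.

```python
import math

# Values of the first set of columns, as quoted in the text.
values = [7, 25, 189]

# Assumption: the log-2 axis starts at 1 (not confirmed by the source chart).
AXIS_BASELINE = 1


def drawn_height(value, baseline=AXIS_BASELINE):
    """Height of a column on a log-2 axis, measured in ticks (doublings)."""
    return math.log2(value / baseline)


# Compare each column with the next one: ratio of data vs ratio of heights.
comparisons = []
for smaller, larger in zip(values, values[1:]):
    data_ratio = larger / smaller
    height_ratio = drawn_height(larger) / drawn_height(smaller)
    comparisons.append((data_ratio, height_ratio))
    print(f"{smaller} -> {larger}: data x{data_ratio:.2f}, height x{height_ratio:.2f}")
```

Under this assumed baseline, both height ratios come out below 2 even though the data ratios are well above 3.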

Here's an amusing observation. The brown area shaded below is half of the entire area of the chart - if we reverted it to a linear scale. And yet there is not a single data point above 250 in the data so the brown area is entirely empty.

Junkcharts_redo_modernahalfdoses_logscale

An effect of a log scale is to compress the larger values of a dataset. That's what you're seeing here.

I now revisualize using dotplots:

Junkcharts_redo_modernahalfdoses_dotplotlinear

The version on the left retains the log scale while the right one (pun intended) reverts to the linear scale.

The biggest effect by far is the spike of antibodies between days 29 and 43 - which is after the second shot is administered. (For Moderna, the second shot is targeted for day 28.) In fact, it is during that window that the level of antibodies went from below the "conv" level (i.e. from natural infection) to far above.

The log-scale version buries this finding because it squeezes the large numbers on the chart. In addition, it artificially pulls the small numbers toward the "Conv" level. On the right chart, the second dot for 18-54, full doses is only at half the level of "Conv" but it looks tantalizingly close to the "Conv" level on the left chart.

The authors of the study also claim that there is negligible dropoff by 30 days after the second dose, i.e. between the third and fourth dots in each set. That may be so on the log-scale chart but on the linear chart, we see a moderate reduction. I don't believe the size of this study allows us to make a stronger conclusion but the claim of no dropoff is dubious.

The left chart also obscures the age-group differences. It appears as if all four sets show roughly the same pattern. With the linear scale, we notice that the vaccine clearly works better for the younger subgroup. As I discussed on the book blog, no one actually knows what level of antibodies constitutes "protection," and so I can't say whether that age-group difference has practical significance.

***

I recommend using log scales sparingly and carefully. They are a source of much mischief and misadventure.

 

 

 


Election visual 3: a strange, mash-up visualization

Continuing our review of FiveThirtyEight's election forecasting model visualization (link), I now look at their headline data visualization. (The previous posts in this series are here, and here.)

538_topchartofmaps

It's a set of 22 maps, each showing one election scenario, with one candidate winning. What chart form is this?

Small multiples may come to mind. A small-multiples chart is a grid in which every component graphic has the same form - same chart type, same color scheme, same scale, etc. The only variation from graphic to graphic is the data. The data are typically varied along a dimension of interest, for example, age groups, geographic regions, years. The following small-multiples chart, which I praised in the past (link), shows liquor consumption across the world.

image from junkcharts.typepad.com

Each component graphic changes according to the data specific to a country. When we scan across the grid, we draw conclusions about country-to-country variations. By convention, there are as many graphics as there are countries in the dataset. Sometimes, the designer includes only countries that are directly relevant to the chart's topic.

***

What is the variable FiveThirtyEight chose to vary from map to map? It's the scenario used in the election forecasting model.

This choice is unconventional. The 22 scenarios are a subset of the 40,000 scenarios from the simulation - we are left wondering how those 22 were chosen.

Returning to our question: what chart form is this?

Perhaps you're reminded of the dot plot from the previous post. On that dot plot, the designer summarized the results of 40,000 scenarios using 100 dots. Since Biden is the winner in 75 percent of all scenarios, the dot plot shows 75 blue dots (and 25 red).

The map is the new dot. The 75 blue dots become 16 blue maps (rounded down) while the 25 red dots become 6 red maps.

Is it a pictogram of maps? If we ignore the details on the maps, and focus on the counts of colors, then yes. It's just a bit challenging because of the hole in the middle, and the atypical number of maps.

As with the dot plot, the map details are a nice touch. It connects readers with the simulation model which can feel very abstract.

Oddly, if you're someone familiar with probabilities, this presentation is quite confusing.

With 40,000 scenarios reduced to 22 maps, each map should represent about 1,818 scenarios. On the dot plot, each dot should represent 400 scenarios. This follows the rule for creating pictograms. Each object in a pictogram - dot, map, figurine, etc. - should encode an equal amount of the data. For the 538 visualization, is it true that each of the six red maps represents about 1,818 scenarios? This may be the case but it is not likely.
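The pictogram arithmetic can be checked with a short Python sketch. The figures (40,000 scenarios, a 75 percent Biden share, 100 dots, 22 maps) come from the text; the function name `pictogram_split` is only illustrative.

```python
TOTAL_SCENARIOS = 40_000   # simulation runs cited in the text
BIDEN_WIN_SHARE = 0.75     # share of scenarios won by Biden, as cited


def pictogram_split(n_icons):
    """Return (scenarios per icon, exact blue icons, exact red icons)."""
    per_icon = TOTAL_SCENARIOS / n_icons
    blue = BIDEN_WIN_SHARE * n_icons
    red = n_icons - blue
    return per_icon, blue, red


for n_icons in (100, 22):
    per_icon, blue, red = pictogram_split(n_icons)
    print(f"{n_icons} icons: {per_icon:.0f} scenarios each, {blue} blue, {red} red")
```

With 22 maps the exact split is 16.5 blue and 5.5 red, so whole maps cannot honor the 75-25 split exactly; the grid shows 16 and 6.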

Recall the dot plot where the most extreme red dot shows a scenario in which Trump wins 376 out of 538 electoral votes (margin = 214). Each dot should represent 400 scenarios. The visualization implies that there are 400 scenarios similar to the one on display. For the grid of maps, the following red map from the top left corner should, in theory, represent 1,818 similar scenarios. Could be, but I'm not sure.

538_electoralvotemap_topleft

Mathematically, each of the depicted scenarios, including the blowout win above, occurs with a 1/40,000 chance in the simulation. However, one expects few scenarios that look like the extreme scenario, and ample scenarios that look like the median scenario.

So, the right way to read the 538 chart is to ignore the map details when reading the embedded pictogram, and then look at the small multiples of detailed maps bearing in mind that extreme scenarios are unique while median scenarios have many lookalikes.

(Come to think about it, the analogous situation in the liquor consumption chart is the relative population size of different countries. When comparing country to country, we tend to forget that the data apply to large numbers of people in populous countries, and small numbers in tiny countries.)

***

There's a small improvement that can be made to the detailed maps. As I compare one map to the next, I'm trying to pick out which states have flipped to change the vote margin. Conceptually, the number of states painted red should decrease as the winning margin decreases, and the states that shift colors should be the toss-up states.

So I'd draw the solid Republican (Democratic) states with a lighter shade, forming an easily identifiable bloc on all maps, while the toss-up states are shown with a heavier shade.

Redo_junkcharts_538electoralmap_shading

Here, I just added a darker shade to the states that disappear from the first red map to the second.


Election visuals 2: informative and playful

In yesterday's post, I reviewed one section of 538's visualization of its election forecasting model; specifically, that post focused on the probability plot.

The visualization, technically called a pdf, is a mainstay of statistical graphics. While every one of 40,000 scenarios shows up on this chart, it doesn't offer a direct answer to our topline question. What is Nate's call at this point in time? Elsewhere in their post, we learn that the 538 model currently gives Biden a 75% chance of winning, thrice Trump's.

538_pdf_pair

In graphical terms, the area to the right of the 270-line is three times the size of the left area (on the bottom chart). That's not apparent in the pdf representation. Addressing this, statisticians may convert the pdf into a cdf, which depicts the cumulative area as we sweep from the left to the right along the horizontal axis.  

The cdf visualization rarely leaves the pages of a scientific journal because it's not easy for a novice to understand. Not least because the relevant probability is 1 minus the cumulative probability. The cdf for the bottom chart will show 25% at the 270-line while the chance of Biden winning is 1 - 25% = 75%.

The cdf presentation is also wasteful for the election scenario. No one cares about any threshold other than the 270 votes needed to win, but the standard cdf shows every possible threshold.
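Here is a small Python sketch of the "1 minus the cumulative probability" step. The eight electoral-vote totals are made up for illustration (they are not 538's simulation output) and are chosen so that the cdf at the 270-line matches the 25% quoted above; `empirical_cdf` is a hypothetical helper.

```python
# Hypothetical Biden electoral-vote totals from eight simulated scenarios.
biden_votes = [210, 250, 280, 300, 320, 340, 360, 400]
VOTES_TO_WIN = 270


def empirical_cdf(outcomes, threshold):
    """Share of outcomes at or below the threshold."""
    return sum(1 for x in outcomes if x <= threshold) / len(outcomes)


# Cumulative probability of falling short of 270, i.e. of 269 votes or fewer.
cdf_below_win = empirical_cdf(biden_votes, VOTES_TO_WIN - 1)

# The quantity readers care about is the complement of that cumulative value.
win_probability = 1 - cdf_below_win

print(f"cdf just below 270: {cdf_below_win:.0%}, chance of winning: {win_probability:.0%}")
```

The full cdf would report this cumulative share at every threshold; only the value at the 270-line feeds the headline number.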

The second graphical concept in the 538 post (link) is an attempt to solve this problem.

538_dotplot

If you drop all the dots to an imaginary horizontal baseline, the above dotplot looks like this:

Redo_junkcharts_538electionforecast_dotplot_1

There is a recent trend toward centering dots to produce symmetry. It's actually harder to perceive the differences in heights of the band.

The secret sauce is to put down 100 dots, with a 75-25 blue-red split that conveys the 75% chance of a Biden win. Superimposing the pdf line from the other visualization, I find that the density of dots roughly mimics the probability of outcomes.

Redo_junkcharts_538electionforecast_dotplot_2

It's easier to estimate the blue vs red areas using those dots than the lines.

The dots are stuffed toys. Clicking on each dot reveals a map showing one of the 40,000 scenarios. It displays which candidate wins which state. For example, the most extreme example of a Trump win is:

538_dotplot_redextreme

Here is a scenario of a razor-tight election won by Trump:

538_dotplot_redmiddle

This presentation has a weakness as well. It gives the impression that each of the dots is equally important because they are the same size. In reality, the importance of each dot is proportional to the height of the band. Since the band is generally taller near the middle, the dots near the middle are more likely scenarios than the dots shown on the two edges.

On balance, I like this visualization that is both informative and playful.

As before, what strikes me about the simulation result is the flatness of the probability surface. This feature is obscured when we summarize the result as 75% chance of a Biden victory.


Putting vaccine trials in boxes

Bloomberg Businessweek has a special edition about vaccines, and I found this chart in the print edition:

Bloombergbw_vaccinetrials_sm

The chart's got a lot of white space. Its structure is a series of simple "treemaps," one for each type of vaccine. Though simple, such a chart burns a few brain cells.

Here, I've extracted the largest block, which corresponds to vaccines that work with the virus's RNA/DNA. I applied a self-sufficiency test, removing the data from the boxes. 

Redo_junkcharts_bloombergbw_vaccinetrials_0

What proportion of these projects have moved from pre-clinical to Phase I? To answer this question, we have to understand the relative areas of boxes, since that's how the data are encoded. How many yellow boxes can fit into the gray box?

It's not intuitive. We'd need a ruler to do this task properly.

Then, we learn that the gray box is exactly 8 times the size of the yellow box (72 projects are pre-clinical while 9 are in Phase I). We can cram eight yellows into the gray box. Imagine doing that, and it's pretty clear the visual elements fail to convey the meaning of the data.
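One reason the area comparison is hard by eye: lengths grow only with the square root of area. The Python sketch below uses the 72 and 9 project counts from the text and assumes, for illustration only, that both boxes are squares (the printed boxes need not be).

```python
import math

pre_clinical_projects = 72   # gray box, from the text
phase_one_projects = 9       # yellow box, from the text

# Data are encoded as areas, so the area ratio equals the data ratio.
area_ratio = pre_clinical_projects / phase_one_projects

# Assumption: both boxes are squares, so side length scales as sqrt(area).
side_ratio = math.sqrt(area_ratio)

print(f"area ratio: {area_ratio:.0f}x, side ratio: {side_ratio:.2f}x")
```

Under that assumption, a box holding eight times the projects is less than three times as wide, which helps explain why the eye underestimates the gap.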

Self-sufficiency is the idea that a data graphic should not rely on printed data to convey its meaning; the visual elements of a data graphic should bear much of the burden. Otherwise, use a data table. To test for self-sufficiency, cover up the printed data and see if the chart still works.

***

A key decision for the designer is the relative importance of (a) the number of projects reaching Phase III, versus (b) the number of projects utilizing specific vaccine strategies.

This next chart emphasizes the clinical phases:

Redo_junkcharts_bloombergbw_vaccinetrials_2

 

Contrast this with the version shown in the online edition of Bloomberg (link), which emphasizes the vaccine strategies.

Bloombergbwonline_vaccinetrials

If any reader can figure out the logic of the ordering of the vaccine strategies, please leave a comment below.


This chart shows why the PR agency for the UK government deserves a Covid-19 bonus

The Economist illustrated some interesting consumer research with this chart (link):

Economist_covidpoll

The survey by Dalia Research asked people about their satisfaction with their country's response to the coronavirus crisis. The results are reduced to the "Top 2 Boxes", the proportion of people who rated their government as responding "very well" or "somewhat well".

This dimension is laid out along the horizontal axis. The chart is a combo dot and bubble chart, arranged in rows by region of the world. Now what does the bubble size indicate?

It took me a while to find the legend as I was expecting it either in the header or the footer of the graphic. A larger bubble depicts a higher cumulative number of deaths up to June 15, 2020.

The key issue is the correlation between a country's death count and the people's evaluation of the government response.

Bivariate correlation is typically shown on a scatter plot. The following chart sets out the scatter plots in a small multiples format with each panel displaying a region of the world.

Redo_economistcovidpolling_scatter

The death tolls in the Asian countries are low relative to the other regions, and yet the people's ratings vary widely. In particular, the Japanese people are pretty hard on their government.

In Europe, the people of Greece, Netherlands and Germany think highly of their government responses, which have suppressed deaths. The French, Spaniards and Italians are understandably unhappy. The British appear to be the most forgiving of their government, despite suffering a higher death toll than France, Spain or Italy. This speaks well of their PR operation.

Cumulative deaths should be adjusted by population size for a proper comparison across nations. When the same graphic is produced using deaths per million (shown on the right below), the general story is preserved while the pattern is clarified:

Redo_economistcovidpolling_deathspermillion_2

The right chart shows deaths per million while the left chart shows total deaths.
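The population adjustment itself is a one-line calculation. The Python sketch below uses two made-up countries (the figures are hypothetical, not taken from the Economist or Dalia Research data) to show how a comparison can shift once deaths are scaled by population.

```python
# Hypothetical figures for illustration only.
countries = {
    "Country A": {"deaths": 40_000, "population": 65_000_000},
    "Country B": {"deaths": 8_000, "population": 10_000_000},
}


def deaths_per_million(deaths, population):
    """Cumulative deaths scaled to a population of one million."""
    return deaths / population * 1_000_000


rates = {
    name: deaths_per_million(row["deaths"], row["population"])
    for name, row in countries.items()
}

for name, rate in rates.items():
    print(f"{name}: {countries[name]['deaths']:,} deaths, {rate:.0f} per million")
```

In this invented example, Country A has five times the deaths of Country B but the lower rate per million.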

***

In the original Economist chart, what catches our attention first is the bubble size. Eventually, we notice the horizontal positioning of these bubbles. But the star of this chart ought to be the new survey data. I swapped those variables and obtained the following graphic:

Redo_economistcovidpolling_swappedvar

Instead of using bubble size, I switched to using color to illustrate the deaths-per-million metric. If ratings of the pandemic response correlate tightly with deaths per million, then we expect the color of these dots to evolve from blue on the left side to red on the right side.

The peculiar loss of correlation in the U.K. stands out. Their PR firm deserves a bonus!


Working with multiple dimensions, an example from Germany

An anonymous reader submitted this mirrored bar chart about violent acts by extremists in the 16 German states.

Germanextremists_bars

At first glance, this looks like a standard design. On a second look, you might notice what the reader discovered: the chart uses two different scales, one for each side. The left side (red) depicting left-wing extremism is artificially compressed relative to the right side (blue). Not sure if this reflects the political bias of the publication - but in any case, this distortion means the only way to consume this chart is to read the numbers.

Even after fixing the scales, this design is challenging for the reader. It's unnatural to compare two years by looking first below then above. It's not simple to compare across states, and even harder to compare left- and right-wing extremism (due to mirroring).

The chart feels busy because the entire dataset is printed on it. I appreciate not including a redundant horizontal axis. (I wonder if the designer first removed the axis, then edited the scale on one side, not realizing the distortion.) Another nice touch, hidden in the legend, is the country totals.

I present two alternatives.

The first is a small-multiples "bumps chart".

Redo_junkcharts_germanextremists_sidebysidelines

Each plot presents the entire picture within a state. You can see the general level of violence, the level of left- and right-wing extremism, and their year-on-year change. States can be compared holistically.

Several German state names are rather long, so I explored a horizontal orientation. In this case, a connected dot plot may be more appropriate.

Redo_junkcharts_germanextremists_dots

The sign of a good multi-dimensional visual display is whether readers can easily learn complex relationships. Depending on the question of interest, the reader can mentally elevate parts of this chart. One can compare the set of blue arrows to the set of red arrows, or focus on just blue arrows pointing right, or red arrows pointing left, or all arrows for Berlin, etc.

 

[P.S. Anonymous reader said the original chart came from the Augsburger newspaper. This link in German contains more information.]


What is the price for objectivity

I knew I had to remake this chart.

TMC_hospitalizations

The simple message of this chart is hidden behind layers of visual complexity. What the analyst wants readers to focus on (as discerned from the text on the right) is the red line, the seven-day moving average of new hospital admissions due to Covid-19 in Texas.

My eyes kept wandering away from the line. It's the sideways data labels on the columns. It's the columns that take up vastly more space than the red line. It's the sideways date labels on the horizontal axis. It's the redundant axis labels for hospitalizations when the entire data set has already been printed. It's the two hanging diamonds, for which the clues are filed away in the legend above.

Here's a version that brings out the message: after Phase 2 re-opening, the number of hospital admissions has been rising steadily.

Redo_junkcharts_texas_covidhospitaladmissions_1

Dots are used in place of columns, which pushes these details to the background. The line as well as periods of re-opening are directly labeled, removing the need for a legend.

Here's another visualization:

Redo_junkcharts_texas_covidhospitaladmissions_2

This chart plots the weekly average new hospital admissions, instead of the seven-day moving average. In the previous chart, the raggedness of the moving average isn't transmitting any useful information to the average reader. I believe this weekly average metric is easier to grasp for many readers while retaining the general story.
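The difference between the two metrics is easy to see in code. The Python sketch below uses fourteen made-up daily admission counts (not the TMC data); `moving_average` is a trailing seven-day average computed every day, while `weekly_average` is computed once per non-overlapping week. Both function names are illustrative.

```python
# Hypothetical daily admissions for two weeks (not the TMC data).
daily = [50, 62, 48, 70, 66, 40, 35, 58, 75, 60, 82, 79, 52, 47]


def moving_average(values, window=7):
    """Trailing average: one value per day once a full window is available."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def weekly_average(values, week=7):
    """One average per complete, non-overlapping week."""
    return [
        sum(values[i : i + week]) / week
        for i in range(0, len(values) - week + 1, week)
    ]


moving = moving_average(daily)
weekly = weekly_average(daily)

print(f"moving average: {len(moving)} points, weekly average: {len(weekly)} points")
```

The weekly series is a subset of the moving-average series (every seventh point), which is why it looks smoother while telling the same general story.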

***

On the original chart by TMC, the author said "the daily hospitalization trend shows an objective view of how COVID-19 impacts hospital systems." Objectivity is an impossible standard for any kind of data analysis or visualization. As seen above, the two metrics for measuring the trend in hospitalizations have pros and cons. Even if one insists on using a moving average, there are choices of averaging methods and window sizes.

Scientists are trained to believe in objectivity. It frequently disappoints when we discover that the rest of the world harbors no such notion. If you observe debates between politicians or businesspeople or social scientists, you rarely hear anyone claim one analysis is more objective - or less subjective - than another. The economist who predicts the Dow will reach a new record, the business manager who argues for placing discounted products in the front not the back of the store, the sportscaster who maintains Messi is a better player than Ronaldo: do you ever hear these people describe their methods as objective?

Pursuing objectivity leads to the glorification of data dumps. The scientist proclaims disinterest in holding an opinion about the data. This is self-deception though. We clearly have opinions because when someone else "misinterprets" the data, we express dismay. What is the point of pretending to hold no opinions when most of the world trades in opinions? By being "objective," we never shape the conversation, and forever play defense.


Visualizing black unemployment in the U.S.

In a prior post, I explained how the aggregate unemployment rate paints a misleading picture of the employment situation in the United States. Even though the U3 unemployment rate in 2019 has returned to the lowest level we have seen in decades, the aggregate statistic hides some concerning trends. There is an alarming rise in the proportion of people considered "not in labor force" by the Bureau of Labor Statistics - these forgotten people are not counted as "employable": when a worker drops out of the labor force, the unemployment rate ironically improves.
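The irony follows directly from the definition: the unemployment rate is the number of unemployed divided by the labor force (employed plus unemployed), and people classified as "not in labor force" appear in neither term. Here is a minimal Python sketch with made-up head counts:

```python
def unemployment_rate(employed, unemployed):
    """Unemployed as a share of the labor force (employed + unemployed)."""
    return unemployed / (employed + unemployed)


# Hypothetical head counts, for illustration only.
employed = 95
unemployed = 5

rate_before = unemployment_rate(employed, unemployed)

# Two unemployed workers stop looking and are reclassified as "not in labor force".
dropouts = 2
rate_after = unemployment_rate(employed, unemployed - dropouts)

print(f"before: {rate_before:.1%}, after: {rate_after:.1%}")
```

No one found a job in this example, yet the rate falls because the dropouts leave both the numerator and the denominator.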

In that post, I looked at the difference between men and women. This post will examine the racial divide, whites and blacks.

I did not anticipate how many obstacles I'd encounter. It's hard to locate a specific data series, and it's harder to know whether the lack of search results indicates the non-existence of the data, or the incompetence of the search engine. Race-related data tend not to be offered in as much granularity. I was only able to find quarterly data for the racial analysis while I had monthly data for the gender analysis. Also, I only have data from 2000, instead of 1990.

***

As before, I looked at the official unemployment rate first, this time presented by race. Because whites form the majority of the labor force, the overall unemployment rate (not shown) is roughly the same as that for whites, just pulled up slightly toward the line for blacks.

Jc_unemploybyrace

The racial divide is clear as day. Throughout the past two decades, black Americans have been much more likely to be unemployed, and the gap has been worse during recessions.

The above chart determines the color encoding for all the other graphics. Notice that the best employment situations occurred on either end of this period, right before the dotcom bust in 2000, and in 2019 before the Covid-19 pandemic. As explained before, despite the headline unemployment rate being the same in those years, the employment situation was not the same.

***

Here is the scatter plot for white Americans:

Jc_unemploybyrace_scatter_whites

Even though both ends of the trajectory are marked with the same shade of blue, indicating almost identical (low) rates of unemployment, we find that the trajectory has failed to return to its starting point after veering off course during the recession of the early 2010s. While the proportion of part-time workers (counted as employed) returned to 17.5% in 2019, as in 2000, about 15 percent more whites are now excluded from the unemployment rate calculation.

The experience of black Americans appears different:

Jc_unemploybyrace_scatter_blacks

During the first decade, the proportion of black Americans dropping out of the labor force accelerated while among those considered employed, the proportion holding part-time jobs kept increasing. As the U.S. recovered from the Great Recession, we've seen a boomerang pattern. By 2019, the situation was halfway back to 2000. The last available datum for the first quarter of 2020 is before Covid-19; it actually showed a halt of the boomerang.

If the pattern we saw in the prior post holds for the Covid-19 world, we would see a marked spike in the out-of-labor-force statistic, coupled with a drop in part-time employment. It appeared that employers were eliminating part-time workers first.

***

One reader asked about placing both patterns on the same chart. Here is an example of this:

Jc_unemploybyrace_scatter_both

This graphic turns out okay because the two strings of dots fit tightly into the grid while not overlapping. There is a lot going on here; I prefer a multi-step story to throwing everything at the wall.

There is one insight that this chart provides that is not easily observed in two separate plots. Over the two decades, the racial gap has narrowed in these two statistics. Both groups have traveled to the top right corner, which is the worst corner in which to reside -- where more people are classified as not employable, and more of the employed are part-time workers.

The biggest challenge with making this combined scatter plot is properly controlling the color. I want the color to represent the overall unemployment rate, which is a third data series. I don't want the line for blacks to be all red, and the line for whites to be all blue, just because black Americans face a tough labor market always. The color scheme here facilitates cross-referencing time between the two dot strings.


Designs of two variables: map, dot plot, line chart, table

The New York Times found evidence that the richest segments of New Yorkers, presumably those with second or multiple homes, have exited the Big Apple during the early months of the pandemic. The article (link) is amply assisted by a variety of data graphics.

The first few charts represent different attempts to express the headline message. Their appearance in the same article allows us to assess the relative merits of different chart forms.

First up is the always-popular map.

Nytimes_newyorkersleft_overallmap

The advantage of a map is its ease of comprehension. We can immediately see which neighborhoods experienced the greater exoduses. Clearly, Manhattan has cleared out a lot more than outer boroughs.

The limitation of the map is also in view. With the color gradient dedicated to the proportions of residents gone on May 1st, there isn't room to express which neighborhoods are richer. We have to rely on outside knowledge to make the correlation ourselves.

The second attempt is a dot plot.

Nytimes_newyorksleft_percentathome

We may have to take a moment to digest the horizontal axis. It's not time moving left to right but income percentiles. The poorest neighborhoods are to the left and the richest to the right. I'm assuming that these percentiles describe the distribution of median incomes in neighborhoods. Typically, when we see income percentiles, they are based on households, regardless of neighborhoods. (The former are equal-sized segments, unlike the latter.)

This data graphic has the reverse features of the map. It does a great job correlating the drop in proportion of residents at home with the income distribution but it does not convey any spatial information. The message is clear: The residents in the top 10% of New York neighborhoods are much more likely to have left town.

In the following chart, I attempted a different labeling of both axes. It cuts out the need for readers to reverse being home to not being home, and 90th percentile to top 10%.

Redo_nyt_newyorkerslefttown

The third attempt to convey the income--exit relationship is the most successful in my mind. This is a line chart, with time on the horizontal axis.

Nyt_newyorkersleft_percenthomebyincome

The addition of lines relegates the dots to the background. The lines show the trend more clearly. If directly translated from the dot plot, this line chart should have 100 lines, one for each percentile. However, the closeness of the top two lines suggests that no meaningful difference in behavior exists between the 20th and 80th percentiles. This can be conveyed to readers through a short note. Instead of displaying all 100 percentiles, the line chart selectively includes only the 99th, 95th, 90th, 80th and 20th percentiles. This is a design choice that adds by subtraction.

Along the time axis, the line chart provides more granularity than either the map or the dot plot. The exit occurred roughly over the last two weeks of March and the first week of April. The start coincided with New York's stay-at-home advisory.

This third chart is a statistical graphic. It does not bring out the raw data but features aggregated and smoothed data designed to reveal a key message.

I encourage you to also study the annotated table later in the article. It shows the power of a well-designed table.

[P.S. 6/4/2020. On the book blog, I have just published a post about the underlying surveillance data for this type of analysis.]