Same data + same chart form = same story. Maybe.

We love charts that tell stories.

Some people believe that if they situate the data in the right chart form, the stories reveal themselves.

Some people believe that for a given dataset, there exists a best chart form that brings out the story.

An implication of these beliefs is that the story is immutable, given the dataset and the chart form.

If you use the Trifecta Checkup, you already know I don't subscribe to those ideas. That's why the Trifecta has three legs: the third is the question, which is related to the message or the story.


I came across the following chart by Statista, illustrating the growth in Covid-19 cases from the start of the pandemic to this month. The underlying data are collected by WHO and cover the entire globe. The data are grouped by regions.


The story of this chart appears to be that the world moves in lock step, with each region behaving more or less the same.

If you visit the WHO site, they show a similar chart:


On this chart, the regions at the bottom of the graph (esp. Southeast Asia in purple) clearly do not follow the same time patterns as Americas (orange) or Europe (green).

What we're witnessing is: same data, same chart form, different stories.

This is a feature, not a bug, of the stacked area chart. The story is driven largely by the order in which the pieces are stacked. In the Statista chart, the largest pieces are placed at the bottom while for WHO, the order is exactly reversed.

(There are minor differences which do not affect my argument. The WHO chart omits the "Other" category which accounts for very little. Also, the Statista chart shows the smoothed data using 7-day averaging.)

In this example, the order chosen by WHO preserves the story while the order chosen by Statista wipes it out.


What might be the underlying question of someone who makes this graph? Perhaps it is to identify the relative prevalence of Covid-19 in different regions at different stages of the pandemic.

Emphasis on the word "relative". Instead of plotting the absolute number of cases, I consider plotting the relative number of cases, that is to say, the proportion of cases in each region at given times.

This leads to a stacked area percentage chart.


In this side-by-side view, you see that this form is not affected by flipping the order of the regions. Both charts say the same thing: that there were two waves in Europe and the Americas that dwarfed all other regions.
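For readers who want to see the arithmetic behind a stacked area percentage chart, here is a minimal sketch. The region names come from the charts discussed above, but the counts are made-up numbers for illustration, not WHO data, and `to_shares` is a hypothetical helper name. The point is simply that each period's counts are divided by that period's total, so the shares always add up to 100%.

```python
# Hypothetical case counts by region for three periods (illustrative, not WHO data).
cases = {
    "Americas": [100, 400, 900],
    "Europe": [300, 200, 1200],
    "Southeast Asia": [10, 150, 100],
}

def to_shares(counts_by_region):
    """Convert absolute counts into each region's proportion of the period total."""
    n_periods = len(next(iter(counts_by_region.values())))
    # Total cases across all regions, one total per period.
    totals = [sum(series[t] for series in counts_by_region.values())
              for t in range(n_periods)]
    return {region: [series[t] / totals[t] for t in range(n_periods)]
            for region, series in counts_by_region.items()}

shares = to_shares(cases)
for region, series in shares.items():
    print(region, [f"{s:.0%}" for s in series])
```

Because the total is pinned at 100% in every period, the outer envelope of the chart is a flat band, and the shares themselves do not depend on the order in which the regions are listed.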



Dreamy Hawaii

I really enjoyed this visual story by ProPublica and Honolulu Star-Advertiser about the plight of beaches in Hawaii (link).

The story begins with a beautiful invitation:


This design reminds me of Vimeo's old home page. (It no longer looks like this today but this screenshot came from when I was the data guy there.) In both cases, the images are not static but moving.


The tour de force of this visual story is an annotated walk along the Lanikai Beach. Here is a snapshot at one of the stops:


This shows a particular homeowner who, according to documents, was permitted to rebuild a destroyed seawall even though officials were supposed to disallow reconstruction in order to protect beaches from eroding. The property is marked on the map above. The image inside the box is a gif showing waves smashing the seawall.

As the reader scrolls down, the image window runs through a carousel of gifs of houses along the beach. The images are synchronized to the reader's progress along the shore. The narrative makes stops at specific houses at which point a text box pops up to provide color commentary.


The erosion crisis is shown in this pair of maps.


There's some fancy work behind the scenes to patch together images and estimate the boundaries of the beaches.


The following map is notable for its simplicity. There are no unnecessary details and labels. We don't need to know the name of every street or a specific restaurant. Removing excess details makes readers focus on the informative parts. 


Clicking on the dots brings up more details.


Enjoy the entire story here.

Convincing charts showing containment measures work

The disorganized nature of the U.S. response to the coronavirus pandemic has created a sort of natural experiment that allows data journalists to explore important scientific questions, such as the impact of containment measures on cases and hospitalizations. This New York Times article represents the best of such work.

The key finding of the analysis is beautifully captured by this set of scatter plots:


Each dot is a state. The cases (left plot) and hospitalizations (right plot) are plotted against the severity of containment measures for November. The negative correlation is unmistakable: the more containment measures taken, the lower the counts.

There are a few features worth noting.

The severity index came from a group at Oxford, and is a number between 0 and 100. The journalists decided to leave out the numerical labels, instead simply showing More and Fewer. This significantly reduces processing time. Readers won't be able to understand the index values anyway without reading the manual.

The index values are doubly encoded. They are first encoded by the location on the horizontal axis and redundantly encoded on the blue-red scale. Ordinarily, I do not like redundant encoding because the reader might assume a third dimension exists. In this case, I had no trouble with it.

The easiest way to see the effect is to ignore the muddy middle and focus on the two ends of the severity index. Those states with the fewest measures - South Dakota, North Dakota, Iowa - are the worst in cases and hospitalizations while those states with the most measures - New York, Hawaii - are among the best. This comparison is similar to what is frequently done in scientific studies, e.g. when they say coffee is good for you, they typically compare heavy drinkers (4 or more cups a day) with non-drinkers, ignoring the moderate and light drinkers.

Notably, there is quite a bit of variability for any level of containment measures - roughly 50 cases per 100,000, and 25 hospitalizations per 100,000. This indicates that containment measures are not sufficient to explain the counts. For example, the hospitalization statistic is affected by the stock of hospital beds, which I assume differs by state.

Whenever we use a scatter plot, we run the risk of xyopia. This chart form invites readers to explain an outcome (y-axis values) using one explanatory variable (on x-axis). There is an assumption that all other variables are unimportant, which is usually false.


Because of the variability, the horizontal scale has meaningless precision. The next chart cures this by grouping the states into three categories: low, medium and high level of measures.


This set of charts extends the time window back to March 1. For the designer, this creates a tricky problem - because states adapt their policies over time. As indicated in the subtitle, the grouping is based on the average severity index since March, rather than just November, as in the scatter plots above.
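The grouping step can be sketched in a few lines. This is only an illustration under my own assumptions: the index values are invented, the state names are placeholders, and I split the states into three equal-sized groups by rank, which may not be how the Times chose its cutoffs. The helper name `group_by_average` is hypothetical.

```python
from statistics import mean

# Hypothetical monthly severity index values (0-100) per state; not the Oxford data.
index_by_state = {
    "State A": [20, 35, 30],
    "State B": [60, 70, 65],
    "State C": [45, 50, 40],
    "State D": [80, 75, 70],
    "State E": [25, 30, 20],
    "State F": [55, 45, 50],
}

def group_by_average(index_by_state, labels=("low", "medium", "high")):
    """Average each state's index over time, then split states into equal-sized groups by rank."""
    averages = {state: mean(values) for state, values in index_by_state.items()}
    ranked = sorted(averages, key=averages.get)  # lowest average first
    group_size = len(ranked) / len(labels)
    return {
        state: labels[min(int(rank / group_size), len(labels) - 1)]
        for rank, state in enumerate(ranked)
    }

groups = group_by_average(index_by_state)
for state, level in sorted(groups.items()):
    print(state, level)
```

Averaging over the whole period is what lets a state with shifting policies land in a single group, at the cost of hiding when the policies changed.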


The interplay between policy and health indicators is captured by connected scatter plots, of which the Times article included a few examples. Here is what happened in New York:


Up until April, the policies were catching up with the cases. The policies tightened even after the case-per-capita started falling. Then, policies eased a little, and cases started to spike again.

The Note tells us that the containment severity index is time shifted to reflect a two-week lag in effect. So, the case count on May 1 is paired not with the containment severity index of May 1 but with that of roughly two weeks earlier, in mid-April.


You can find the full article here.




Why you should expunge the defaults from Excel or (insert your favorite graphing program)

Yesterday, I posted the following chart in the post about Cornell's Covid-19 case rate after re-opening for in-person instruction.


This is an edited version of the chart used in Peter Frazier's presentation.


The original chart carries with it the burden of Excel defaults.

What did I change and why?

I switched away from the default color scheme, which ignores the relationships between the lines. In particular, the key comparison on this chart should be the actual case rate versus the nominal case rate. In addition, the three lines at the top are related as they all come from the same underlying mathematical model. I used the same color but different shades.

Also, instead of placing the legend as far away from the data labels as possible, I moved the line labels next to the data labels.

Instead of daily date labels, I moved to weekly labels, and set the month names on a separate level from the day names.

The dots were removed from the top three lines but I'd have retained them, perhaps with some level of transparency, if I spent more time making the edits. I'd definitely keep the last dot to make it clear that the blue lines contain one extra dot.


Every graphing program has defaults, typically computed by some algorithm tuned to the average chart. Don't settle for the average chart. Get rid of any default setting that slows down understanding.



Everything in Texas is big, but not this BIG

Long-time reader John forwarded the following chart via Twitter.


The chart shows the recent explosive growth in deaths due to Covid-19 in Texas. John flagged this graphic as yet another example in which the data are encoded to the lengths of the squares, not their areas.

Fixing this chart just requires fixing the length of one side of the square. I also flipped it to make a conventional column chart.


The final product:


An important qualification lurks in the footnote; it is directly applied to the label of July.

How much visual distortion is created when data are encoded to the lengths and not the areas? The following chart shows what readers see, assuming they correctly perceive the areas of those squares. The value for March is held the same as above while the other months show the death counts implied by the relative areas of the squares.


Owing to squaring, the smaller counts are artificially compressed while the big numbers are massively exaggerated.
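The size of the distortion is easy to work out. In the sketch below, the side of each square is proportional to the data, the reader is assumed to judge areas, and the first month is held fixed as the reference. The death counts are hypothetical round numbers, not the figures from the Texas chart, and `perceived_values` is a name I made up for this illustration.

```python
def perceived_values(values, base_index=0):
    """Values implied by the areas when side lengths, not areas, are proportional to the data."""
    base = values[base_index]
    # A side ratio of v / base becomes an area ratio of (v / base) squared.
    return [base * (v / base) ** 2 for v in values]

# Hypothetical monthly death counts (illustrative only).
deaths = [50, 100, 200, 400]
print(perceived_values(deaths))
```

In this made-up series, each month doubles the previous one, but a reader judging areas sees each month quadruple.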

On data volume, reliability, uncertainty and confidence bands

This chart from the Economist caught my eye because of the unusual use of color-coded hexagonal tiles.


The basic design of the chart is easy to grasp: It relates people's "happiness" to national wealth. The thick black line shows that the average citizen of wealthier countries tends to rate their current life situation better.

For readers alert to graphical details, things can get a little confusing. The horizontal "wealth" axis is shown in log scale, which means that the data on the right side of the chart have been compressed while the data on the left side of the chart have been stretched out. In other words, the curve in linear scale is much flatter than depicted.
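To make the compression concrete, here is a small sketch with hypothetical GDP-per-capita values (not taken from the Economist chart). On a log scale, equal distances correspond to equal ratios, so each tenfold step takes up the same width no matter how many dollars it spans.

```python
import math

# Hypothetical GDP-per-capita values in dollars (illustrative only).
gdp = [1_000, 10_000, 100_000]

# Position of each value along a base-10 log axis.
log_positions = [math.log10(x) for x in gdp]

# Gaps between neighbors on the linear scale versus the log scale.
linear_gaps = [b - a for a, b in zip(gdp, gdp[1:])]
log_gaps = [b - a for a, b in zip(log_positions, log_positions[1:])]

print("linear gaps:", linear_gaps)
print("log gaps:", [round(g, 3) for g in log_gaps])
```

The second gap is ten times the first in dollars, yet both occupy the same width on the log axis.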


One thing you might notice is how poor the fit of the line is at both ends. Singapore and Afghanistan are clearly not explained by the fitted line. (That said, the line is based on many more dots than those eight we can see.) Moreover, because countries are widely spread out on the high end of the wealth axis, the fit is not impressive. Log scales tend to give a false impression of the tightness of fit, as I explained before when discussing coronavirus case curves.


The hexagonal tiles replace the more typical dot scatter or contour shading. The raw data consist of results from polls conducted in different countries in different years. For each poll, the analyst computes the average life satisfaction score for that country in that year. From national statistics, the analyst pulls out that country's GDP per capita in that year. Thus, each data point is a dot on the canvas. A few data points are shown as black dots. Those are for eight highlighted countries for the year 2018.

The black line is fitted to the underlying dot scatter and summarizes the correlation between average wealth and average life satisfaction. Instead of showing the scatter, this Economist design aggregates nearby dots into hexagons. The deepest red hexagon, sandwiched between Finland and the US, contains about 60-70 dots, according to the color legend.
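For the curious, the aggregation of dots into hexagons can be sketched as follows. This is a generic hexagonal binning routine using the common cube-rounding approach, not the Economist's actual code. The points are hypothetical and assumed to be already in plotting coordinates (for example, after the log transform of the wealth axis), and `hex_cell`, `hex_counts` and `size` are names I chose for the illustration.

```python
import math
from collections import Counter

def hex_cell(x, y, size):
    """Axial (q, r) index of the pointy-top hexagon with circumradius `size` containing (x, y)."""
    # Fractional axial coordinates of the point.
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    # Convert to cube coordinates and round to the nearest hexagon center.
    cx, cz = q, r
    cy = -cx - cz
    rx, ry, rz = round(cx), round(cy), round(cz)
    dx, dy, dz = abs(rx - cx), abs(ry - cy), abs(rz - cz)
    # Reset the coordinate with the largest rounding error so the three sum to zero.
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return (rx, rz)

def hex_counts(points, size):
    """Count how many points fall into each hexagon."""
    return Counter(hex_cell(x, y, size) for x, y in points)

# Hypothetical dots in plotting coordinates (illustrative only).
points = [(0.0, 0.0), (0.1, 0.1), (1.732, 0.0), (1.7, 0.1)]
counts = hex_counts(points, size=1.0)
print(counts)
```

The count in each cell is what the color legend encodes. Note that the binning discards which country and which year each dot came from.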

These details are tough to take in. It's not clear which dots have been collected into that hexagon: are they all Finland or the U.S. in various years, or do they include other countries? Each country is represented by multiple dots, one for each poll year. It's also not clear how much variation there exists within a country across years.


The hexagonal tiles presumably serve the same role as a dot scatter or contour shading. They convey the amount of data supporting the fitted curve along its trajectory. More data confers more reliability.

For this chart, the hexagonal tiles do not add any value. The deepest red regions are those closest to the black line so nothing is actually lost by showing just the line and not the tiles.


Using the line chart obviates the need for readers to figure out the hexagons, the polls, the aggregation, and the inevitable unanswered questions.


An alternative concept is to show the "confidence band" or "error bar" around the black line. These bars display the uncertainty of the data. The wider the band, the less certain the analyst is of the estimate. Typically, the band expands near the edges where we have less data.

Here is conceptually what we should see (I don't have the underlying dataset so can't compute the confidence band precisely)


The confidence band picture is the mirror image of the hexagonal tiles. Where the poll density is high, the confidence band narrows, and where poll density is low, the band expands.
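Why does more data narrow the band? A rough sketch, under a simplifying assumption: treat the estimate at a given wealth level as the average of n poll results with a common standard deviation, and use the normal approximation for a 95% band. This is not the exact formula for a band around a fitted curve, and the numbers are hypothetical, but it shows the key relationship: the half-width shrinks with the square root of the amount of data.

```python
import math

def band_half_width(std_dev, n, z=1.96):
    """Approximate 95% half-width for an average of n results (normal approximation)."""
    return z * std_dev / math.sqrt(n)

# Hypothetical spread of poll averages and hypothetical poll counts.
std_dev = 1.0
for n in (4, 16, 64):
    print(n, round(band_half_width(std_dev, n), 3))
```

Quadrupling the number of polls halves the width of the band, which is why the band balloons at the sparse edges of the chart.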

A simple way to interpret the confidence band is to find the country's wealth on the horizontal axis, and look at the range of life satisfaction ratings for that value of wealth. Now pick any number within the range, and imagine that you've just conducted a survey and computed the average rating. That number you picked is a possible survey result, and thus a valid value. (For those who know some probability, you should pick a number not at random within the range but in accordance with a bell curve, meaning picking a number closer to the fitted line with much higher probability than a number at either edge.)

Visualizing data involves a series of choices. For this dataset, one such choice is displaying data density or uncertainty or neither.

Working with multiple dimensions, an example from Germany

An anonymous reader submitted this mirrored bar chart about violent acts by extremists in the 16 German states.


At first glance, this looks like a standard design. On a second look, you might notice what the reader discovered: the chart used two different scales, one for each side. The left side (red) depicting left-wing extremism is artificially compressed relative to the right side (blue). Not sure if this reflects the political bias of the publication - but in any case, this distortion means the only way to consume this chart is to read the numbers.

Even after fixing the scales, this design is challenging for the reader. It's unnatural to compare two years by looking first below then above. It's not simple to compare across states, and even harder to compare left- and right-wing extremism (due to mirroring).

The chart feels busy because the entire dataset is printed on it. I appreciate not including a redundant horizontal axis. (I wonder if the designer first removed the axis, then edited the scale on one side, not realizing the distortion.) Another nice touch, hidden in the legend, is the country totals.

I present two alternatives.

The first is a small-multiples "bumps chart".


Each plot presents the entire picture within a state. You can see the general level of violence, the level of left- and right-wing extremism, and their year-on-year change. States can be compared holistically.

Several German state names are rather long, so I explored a horizontal orientation. In this case, a connected dot plot may be more appropriate.


The sign of a good multi-dimensional visual display is whether readers can easily learn complex relationships. Depending on the question of interest, the reader can mentally elevate parts of this chart. One can compare the set of blue arrows to the set of red arrows, or focus on just blue arrows pointing right, or red arrows pointing left, or all arrows for Berlin, etc.


[P.S. Anonymous reader said the original chart came from the Augsburger newspaper. This link in German contains more information.]

Presented without comment

Weekend assignment - which of these tells the story better?




The cop-out answer is to say both. If you must pick one, which one?


When designing a data visualization as a living product (not static), you'd want a design that adapts as the data change.

Designs of two variables: map, dot plot, line chart, table

The New York Times found evidence that the richest segments of New Yorkers, presumably those with second or multiple homes, have exited the Big Apple during the early months of the pandemic. The article (link) is amply assisted by a variety of data graphics.

The first few charts represent different attempts to express the headline message. Their appearance in the same article allows us to assess the relative merits of different chart forms.

First up is the always-popular map.


The advantage of a map is its ease of comprehension. We can immediately see which neighborhoods experienced the greater exoduses. Clearly, Manhattan has cleared out a lot more than outer boroughs.

The limitation of the map is also in view. With the color gradient dedicated to the proportions of residents gone on May 1st, there isn't room to express which neighborhoods are richer. We have to rely on outside knowledge to make the correlation ourselves.

The second attempt is a dot plot.


We may have to take a moment to digest the horizontal axis. It's not time moving left to right but income percentiles. The poorest neighborhoods are to the left and the richest to the right. I'm assuming that these percentiles describe the distribution of median incomes in neighborhoods. Typically, when we see income percentiles, they are based on households, regardless of neighborhoods. (The former are equal-sized segments, unlike the latter.)

This data graphic has the reverse features of the map. It does a great job correlating the drop in proportion of residents at home with the income distribution but it does not convey any spatial information. The message is clear: The residents in the top 10% of New York neighborhoods are much more likely to have left town.

In the following chart, I attempted a different labeling of both axes. It cuts out the need for readers to reverse being home to not being home, and 90th percentile to top 10%.


The third attempt to convey the income--exit relationship is the most successful in my mind. This is a line chart, with time on the horizontal axis.


The addition of lines relegates the dots to the background. The lines show the trend more clearly. If directly translated from the dot plot, this line chart should have 100 lines, one for each percentile. However, the closeness of the top two lines suggests that no meaningful difference in behavior exists between the 20th and 80th percentiles. This can be conveyed to readers through a short note. Instead of displaying all 100 percentiles, the line chart selectively includes only the 99th, 95th, 90th, 80th and 20th percentiles. This is a design choice that adds by subtraction.

Along the time axis, the line chart provides more granularity than either the map or the dot plot. The exit occurred roughly over the last two weeks of March and the first week of April. The start coincided with New York's stay-at-home advisory.

This third chart is a statistical graphic. It does not bring out the raw data but features aggregated and smoothed data designed to reveal a key message.

I encourage you to also study the annotated table later in the article. It shows the power of a well-designed table.

[P.S. 6/4/2020. On the book blog, I have just published a post about the underlying surveillance data for this type of analysis.]



How the pandemic affected employment of men and women

In the last post, I looked at the overall employment situation in the U.S. Here is the trend of the "official" unemployment rate since 1990.


I was talking about the missing 100 million. These are people who are neither employed nor unemployed in the eyes of the Bureau of Labor Statistics (BLS). They are simply unrepresented in the numbers shown in the chart above.

This group is visualized in my scatter plot as "not in labor force", as a percent of the employment-age population. The horizontal axis of this scatter plot shows the proportion of employed people who hold part-time jobs. Anyone who worked at least one hour during the month is counted as employed part-time.
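The three statistics in play are simple ratios, and it helps to see them side by side. The sketch below uses hypothetical round numbers in millions, not actual BLS figures, and `labor_metrics` is a name I made up. The formulas follow the definitions described above: the official unemployment rate is computed only over the labor force, so anyone outside it drops out of both the numerator and the denominator.

```python
def labor_metrics(population, employed, unemployed, part_time):
    """Compute the three ratios from counts (any consistent unit, e.g. millions)."""
    labor_force = employed + unemployed
    return {
        # Official rate: unemployed as a share of the labor force only.
        "unemployment_rate": unemployed / labor_force,
        # People neither employed nor unemployed, as a share of the population.
        "not_in_labor_force_share": (population - labor_force) / population,
        # Part-time workers as a share of all employed people.
        "part_time_share": part_time / employed,
    }

# Hypothetical round numbers in millions (illustrative only).
metrics = labor_metrics(population=260, employed=150, unemployed=10, part_time=30)
for name, value in metrics.items():
    print(name, f"{value:.1%}")
```

With these made-up numbers, 100 million people sit outside the labor force, yet the unemployment rate is only about 6%.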


Today, I visualize the differences between men and women.

The first scatter plot shows the situation for men:


This plot reveals a long-term structural problem for the U.S. economy. Regardless of the overall economic health, more and more men have been declared not in labor force each year. Between 2007, the start of the Great Recession, and 2019, the proportion went up from 27% to 31%, and the pandemic has pushed this to almost 34%. As mentioned in the last post, this sharp rise in April raises concern that the criteria for "not in labor force" capture a lot of people who actually want a job, and therefore should be counted as part of the labor force but unemployed.

Also, as seen in the last post, the severe drop in part-time workers is unprecedented during economic hardship. As dots turn from blue to red, they typically are moving right, meaning more part-time workers. Since the pandemic, among those people still employed, the proportion holding full-time jobs has paradoxically exploded.


The second scatter plot shows the situation with women:


Women have always faced a tougher job market. If they are employed, they are more likely to be holding part-time jobs relative to employed men; and a significantly larger proportion of women are not in the labor force. Between 1990 and 2001, more women entered the labor force. Just as for men, the Great Recession resulted in a marked jump in the proportion out of labor force. Since 2014, a positive trend emerged, now interrupted by the pandemic, which has pushed both metrics to levels never seen before.

The same story persists: the sharp rise in women "not in labor force" exposes a problem with this statistic - as it apparently includes people who do want to work, not as intended. In addition, unlike the pattern in the last 30 years, the severe economic crisis is coupled with a shift toward full-time employment, indicating that part-time jobs were disappearing much faster than full-time jobs.