The time has arrived for cumulative charts

Long-time reader Scott S. asked me about this Washington Post chart that shows the disappearance of pediatric flu deaths in the U.S. this season:


The dataset behind this chart is highly favorable to the designer, because the signal in the data is so strong. This is a good chart. The key point is shown clearly right at the top, with an informative title. Gridlines are very restrained. I'd draw attention to the horizontal axis. The master stroke here is omitting the week labels, which are likely confusing to all but the people familiar with this dataset.

Scott suggested using a line chart. I agree, especially if we plot cumulative counts rather than weekly deaths. Here's a quick sketch of such a chart:


(On second thought, I'd remove the week numbers from the horizontal axis, and just go with the month labels. The Washington Post designer is right in realizing that those week numbers are meaningless to most readers.)
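For readers who want to try the cumulative transformation themselves, here is a minimal sketch in Python. The weekly numbers are invented for illustration and are not the flu data; the function name is my own.

```python
from itertools import accumulate


def cumulative_counts(weekly_counts):
    """Return the running total of a sequence of weekly counts."""
    return list(accumulate(weekly_counts))


# Invented weekly counts, for illustration only (not the actual flu data).
weekly = [0, 2, 5, 9, 4, 1]
running = cumulative_counts(weekly)
print(running)  # [0, 2, 7, 16, 20, 21]
```

A cumulative line never goes down, so a season with almost no deaths shows up as a line hugging the horizontal axis, which is easy to compare against earlier seasons.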

The vaccine trials have brought this cumulative count chart form to the mainstream. For anyone who has seen the vaccine efficacy charts, the interpretation of the panel of line charts should come naturally.

Instead of four plots, I prefer one plot with four superimposed lines. Like this:





Water stress served two ways

Via Alberto Cairo (whose new book How Charts Lie can be pre-ordered!), I found the Water Stress data visualization by the Washington Post. (link)

The main interest here is how they visualized the different levels of water stress across the U.S. Water stress is some metric defined by the World Resources Institute that, to my mind, measures the demand versus supply of water. The higher the water stress, the higher the risk of experiencing droughts.

There are two ways in which the water stress data are shown: the first is a map, and the second is a bubble plot.


This project provides a great setting to compare and contrast these chart forms.

How Data are Coded

In a map, the data are usually coded as colors. Sometimes, additional details can be coded as shades, or moire patterns within the colors. But the map form locks down a number of useful dimensions - including x and y location, size and shape. The outline map reserves all these dimensions, rendering them unavailable to encode data.

By contrast, the bubble plot admits a good number of dimensions. The key ones are the x- and y- location. Then, you can also encode data in the size of the dots, the shape, and the color of the dots.

In our map example, the colors encode the water stress level, and a moire pattern encodes "arid areas". For the scatter plot, x = daily water use, y = water stress level, grouped by magnitude, color = water stress level, size = population. (Shape is constant.)

Spatial Correlation

The map is far superior in displaying spatial correlation. It's visually obvious that the southwestern states experience higher stress levels.

This spatial knowledge is relinquished when using a bubble plot. The designer relies on the knowledge of the U.S. map in the head of the readers. It is possible to code this into one of the available dimensions, e.g. one could make x = U.S. regions, but another variable is sacrificed.

Non-contiguous Spatial Patterns

When spatial patterns are contiguous, the map functions well. Sometimes, spatial patterns are disjoint. In that case, the bubble plot, which de-emphasizes the physical locations, can be superior. In our example, the vertical axis divides the states into five groups based on their water stress levels. Try figuring out which states are "medium to high" water stress from the map, and you'll see the difference.

Finer Geographies

The map handles finer geographical units like counties and precincts better. It's completely natural.

In the bubble plot, shifting to finer units causes the number of dots to explode. This clutters up the chart. Besides, while most (we hope) Americans know the 50 states, most of us can't recite counties or precincts. Thus, the designer can't rely on knowledge in our heads. It would be impossible to learn spatial patterns from such a chart.


The key, as always, is to nail down your message, then select the right chart form.



Transforming the data to fit the message

A short time ago, there were reports that some theme-park goers were not happy about the latest price hike by Disney. One of these reports, from the Washington Post (link), showed a chart that was intended to convey how much Disney park prices have outpaced inflation. Here is the chart:


I had a lot of trouble processing this chart. The two lines are labeled "original price" and "in 2014 dollars". The lines show a gap back in the 1970s, which completely closes up by 2014. This gives the reader an impression that the problem has melted away - which is the opposite of what the designer intended.

The economic concept being marshalled here is the time value of money, or inflation. The idea is that $3.50 in 1971 is equivalent to a much higher ticket price in "2014 dollars" because by virtue of inflation, putting that $3.50 in the bank in 1971 and holding till 2014 would make that sum "nominally" higher. In fact, according to the chart, the $3.50 would have become $20.46, a roughly six-fold increase.

The gap thus represents the inflation factor. The gap melting away is a result of passing of time. The closer one is to the present, the less the effect of cumulative inflation. The story being visualized is that Disney prices are increasing quickly whether or not one takes inflation into account. Further, if inflation were to be considered, the rate of increase is lower (red line).
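The size of that gap can be put as a single number. Below is a minimal sketch that backs out the cumulative inflation factor implied by the two 1971 figures quoted from the chart ($3.50 and $20.46). The function name is my own, and the factor is only what those two numbers imply, not an official price index.

```python
def inflation_factor(nominal_price, price_in_base_dollars):
    """Ratio between a price in base-year dollars and its nominal value."""
    return price_in_base_dollars / nominal_price


# The two 1971 figures as read off the chart.
factor_1971 = inflation_factor(3.50, 20.46)
print(round(factor_1971, 2))  # 5.85

# In the base year itself (2014), nominal and adjusted prices coincide,
# so the factor is 1 and the gap between the lines closes.
print(inflation_factor(100.0, 100.0))  # 1.0
```

The factor shrinks toward 1 as the year approaches 2014, which is why the two lines converge regardless of what Disney did with its prices.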

What about the alternative story - Disney's price increases are often much higher than inflation? We can take the nominal price increase, and divide it into two parts, one due to inflation (of the prior-period price), and the other in excess of inflation, which we will interpret as a Disney premium.

The following chart then illustrates this point of view:


Most increases are small, and stay close to the inflation rate. But once in a while, and especially in 2010s, the price increases have outpaced inflation by a lot.
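The split of each increase into an inflation part and a premium can be sketched as follows. This is one reading of the decomposition described above, assuming inflation is applied to the prior-period price; the prices and the 5% rate are invented for illustration.

```python
def split_increase(prior_price, new_price, inflation_rate):
    """Split a nominal price increase into inflation and premium parts.

    inflation_rate is the inflation over the period as a fraction
    (0.05 means 5%), applied to the prior-period price.
    """
    nominal_increase = new_price - prior_price
    inflation_part = prior_price * inflation_rate
    premium = nominal_increase - inflation_part
    return inflation_part, premium


# Invented example: an $80 ticket rises to $90 while inflation runs at 5%.
inflation_part, premium = split_increase(80.0, 90.0, 0.05)
print(round(inflation_part, 2), round(premium, 2))  # 4.0 6.0
```

A negative premium means the price rose by less than inflation, or fell outright.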

Note: since 2013, Disney has introduced price tiers, starting with two and currently at four levels. In the above chart, I took the average of the available prices, making the assumption that all four levels are equally popular. The last number looks like a price decrease because there is a new tier called "Low". The data came from

A gem among the snowpack of Olympics data journalism

It's not often I come across a piece of data journalism that pleases me so much. Here it is: the "Happy 700" article by the Washington Post is amazing.



When data journalism and dataviz are done right, the designers have made good decisions. Here are some of the key elements that make this article work:

(1) Unique

The topic is timely but timeliness heightens both the demand and supply of articles, which means only the unique and relevant pieces get the readers' attention.

(2) Fun

The tone is light-hearted. It's a fun read. A little bit informative - when they describe the towns that few have heard of. The notion is slightly silly but the reader won't care.

(3) Data

It's always a challenge to make data come alive, and these authors succeeded. Most of the data work involves finding, collecting and processing the data. There isn't any sophisticated analysis. But it is a powerful demonstration that complex analysis is not always necessary.

(4) Organization

The structure of the data is three criteria (elevation, population, and terrain) by cities. A typical way of showing such data might be an annotated table, or a Bumps-type chart, grouped columns, and so on. All these formats try to stuff the entire dataset onto one chart. The designers chose to highlight one variable at a time, cumulatively, on three separate maps. This presentation fits perfectly with the flow of the writing. 

(5) Details

The execution involves some smart choices. I am a big fan of legend/axis labels that are informative, for example, note that the legend doesn't say "Elevation in Meters":


The color scheme across all three maps shows a keen awareness of background/foreground concerns. 

Three pies and a bar: serving visual goodness

If you are not sick of the Washington Post article about friends (not) letting friends join the other party, allow me to write yet another post on, gasp, that pie chart. And sorry to have kept reader Daniel L. waiting, as he pointed out, when submitting this chart to me, that he had tremendous difficulty understanding it:



This is not one pie but six pies on a platter. There are two sources of confusion: first, the repeated labels of Republicans and Democrats to refer to different groups of people; and second, the indecision between using two or four categories of "how many".

Let me begin by re-ordering and re-labeling the chart:


From this version, one can pull out the key messages of the analysis: (A) most voters, regardless of party, have mostly friends from the same party; and (B) Republicans are more likely to have more friends from the other party than Democrats. A third, but really not that interesting, point is that regardless of party, people have about the same likelihood to befriend Independents.

In visualization, "less is more" is frequently appropriate. So, here is a view of the same chart, using two categories instead of four.


The added advantage is only two required colors, and thus even grayscale can work.

The new arrangement of the pie platter makes it clear that there really isn't that much difference between Republican and Democratic voters along this dimension. Thus, visualizing the aggregate gets us to the same place.


After three servings of pies, the reader might be craving some energy bars:


One can say that for very simple data like this, pie charts are acceptable. However, the stacked bar is better.

Thanks again Daniel, and it's a pleasure to serve you!

Lop-sided precincts, a visual exploration

In the last post, I discussed one of the charts in the very nice Washington Post feature, delving into polarizing American voters. See the post here. (Thanks again Daniel L.)

Today's post is inspired by the following chart (I am showing only the top of it - click here to see the entire chart):


The chart plots each state as a separate row, so like most such charts, it is tall. The data analysis behind the chart is fascinating and unusual, although I find the chart harder to grasp than expected. The analyst starts with precinct-level data, and determines which precincts were "lop-sided," defined as having a winning margin of over 50 percent for the winner (either Trump or Clinton). The analyst then sums the voters in those lop-sided precincts, and expresses this as a percent of all voters in the state.

For example, in Alabama, the long red bar indicates that about 48% of the state's voters live in lop-sided precincts that went for Trump. It's important to realize that not all such people voted for Trump - they happened to live in precincts that went heavily for Trump. Interestingly, about 12% of the state's voters reside in precincts that went heavily for Clinton. Thus, overall, 60% of Alabama's voters live in lop-sided precincts.
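To make the calculation concrete, here is a minimal sketch of how such percentages could be computed from precinct-level counts. It is not the Post's code. It assumes only two candidates, measures the margin in percentage points of the two-candidate vote, and uses invented precinct numbers.

```python
def lopsided_shares(precincts, threshold=50.0):
    """Percent of a state's voters living in lop-sided precincts.

    precincts: list of (trump_votes, clinton_votes) tuples.
    Returns (share_in_trump_lopsided, share_in_clinton_lopsided).
    """
    total = 0
    trump_lopsided = 0
    clinton_lopsided = 0
    for trump, clinton in precincts:
        voters = trump + clinton
        if voters == 0:
            continue  # skip empty precincts
        total += voters
        margin = 100.0 * (trump - clinton) / voters
        if margin > threshold:
            trump_lopsided += voters  # every voter counts, not just the winners
        elif margin < -threshold:
            clinton_lopsided += voters
    if total == 0:
        return 0.0, 0.0
    return 100.0 * trump_lopsided / total, 100.0 * clinton_lopsided / total


# Invented precincts for a hypothetical state.
precincts = [(800, 200), (100, 400), (750, 750), (1100, 900)]
print(lopsided_shares(precincts))  # (20.0, 10.0)
```

Note that the numerator counts all voters in a lop-sided precinct, including those who voted for the losing candidate there.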

This is more sophisticated than the usual analysis that shows up in journalism.

The bar chart may confuse readers for several reasons:

  • The horizontal axis is labeled "50-point plus margin for Trump/Clinton" and has values ranging from 0% to somewhere in the 40-60% range. This description seemingly implies that the values being plotted are winning margins. However, the sub-header tells readers that the data values are percentages of total voters in the state.
  • The shades of colors are not explained. I believe the dark shade indicates the winning party in each state, so Trump won Alabama and Clinton, California. The addition of this information allows the analysis to become multi-dimensional. It also reveals that the designer wants to address how lop-sided precincts affect the outcome of the election. However, adding shade in this manner effectively turns a two-color composition into a four-color composition, adding to the processing load.
  • The chart adopts what Howard Wainer calls the "Alabama first" ordering. This always messes up the designer's message because the alphabetical order typically does not yield a meaningful correlation.

The bars are facing out from the middle, which is the 0% line. This arrangement is most often used in a population pyramid, and used when the designer feels it important to let readers compare the magnitudes of two segments of a population. I do not feel that the Democrat versus Republican comparison within each state is crucial to this chart, given that most states were not competitive.

What is more interesting to me is the total proportion of voters who live in these lop-sided precincts. The designer agrees on this point, and employs bar stacking to make this point. This yields some amazing insights here: several Democratic strongholds such as Massachusetts surprisingly have few lop-sided precincts.

Here then is a remake of the chart according to my priorities. Click here for the full chart.


The emphasis is on the total proportion of voters in lop-sided precincts. The states are ordered by that metric from most lop-sided to least. This draws out an unexpected insight: most red states have a relatively high proportion of voters in lop-sided precincts (~ 30 to 40%) while most blue states - except for the quartet of Maryland, New York, California and Illinois - do not exhibit such demographic concentration.

The gray/grey area offers a counterpoint, that most voters do not live in lop-sided districts.

P.S. I should add that this is one of those chart designs that frustrate standard - I mean, point-and-click - charting software because I am placing the longest bar segments on the left, regardless of color.
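For anyone scripting the chart instead, the ordering logic is short. Here is a minimal sketch, assuming the data have already been reduced to two percentages per state. The Alabama figures are the approximate ones quoted above; the other two rows are invented placeholders.

```python
def order_rows(states):
    """Sort states by total lop-sided share; put the longer segment first.

    states: dict mapping state name -> (trump_share, clinton_share).
    Returns a list of (name, total, segments), where segments is a list of
    (label, share) pairs ordered from longest to shortest.
    """
    rows = []
    for name, (trump, clinton) in states.items():
        segments = sorted(
            [("Trump", trump), ("Clinton", clinton)],
            key=lambda seg: seg[1],
            reverse=True,
        )
        rows.append((name, trump + clinton, segments))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


states = {
    "State B": (5, 20),   # invented
    "Alabama": (48, 12),  # approximate values quoted in the text
    "State C": (12, 8),   # invented
}
for name, total, segments in order_rows(states):
    print(name, total, [label for label, _ in segments])
```

The stacked bar for each row is then drawn from the left in the order given by `segments`, with color looked up from the label.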

Let's not mix these polarized voters as the medians run away from one another

Long-time follower Daniel L. sent in a gem, by the Washington Post. This is a multi-part story about the polarization of American voters, nicely laid out, with superior analyses and some interesting graphics. Click here to see the entire article.

Today's post focuses on the first graphic. This one:


The key messages are written out on the 2017 charts: namely, 95% of Republicans are more conservative than the median Democrat, and 97% of Democrats are more liberal than the median Republican.

This is a nice statistical way of laying out the polarization. There are a number of additional insights one can draw from the population distributions: for example, in the bottom row, the Democrats have been moving left consistently, and decisively in 2017. By contrast, Republicans moved decisively to the right from 2004 to 2017. I recall reading about polarization in past elections but it is really shocking to see the extreme in 2017.

A really astounding but hidden feature is that the median Democrat and the median Republican were not too far apart in 1994 and 2004 but the gap exploded in 2017.
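Both statistics, the share of one party beyond the other party's median and the gap between the medians, are easy to compute from respondent-level scores. Here is a minimal sketch, assuming an ideology score where higher means more conservative. The scores are invented and far too few to mean anything; they only show the mechanics.

```python
from statistics import median


def share_beyond_other_median(scores, other_scores, direction):
    """Percent of `scores` beyond the median of `other_scores`.

    direction: "higher" counts scores above the other median,
               "lower" counts scores below it.
    """
    other_median = median(other_scores)
    if direction == "higher":
        count = sum(1 for s in scores if s > other_median)
    else:
        count = sum(1 for s in scores if s < other_median)
    return 100.0 * count / len(scores)


# Invented scores; higher = more conservative (an assumption).
democrats = [-8, -6, -5, -3, 0]
republicans = [-6, 1, 4, 6, 9]

print(share_beyond_other_median(republicans, democrats, "higher"))  # 80.0
print(share_beyond_other_median(democrats, republicans, "lower"))   # 100.0
print(median(republicans) - median(democrats))                      # 9
```

The last line is the gap between the two medians, the quantity that is hidden in the original graphic.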


I'd like to solve a few minor problems on this graphic. It's a bit confusing to have each chart display information on both Republican and Democratic distributions. The reader has to understand that in the top row, the red area represents Republican voters but the blue line shows the median Democrat.

Also, I want to surface two key insights: the huge divide that developed in 2017, and the exploding gap between the two medians.

Here is the revised graphic:


On the left side, each chart focuses on one party, and the trend over the three elections. The reader can cross charts to discover that the median voter in one party is more extreme than essentially all of the voters of the other party. This same conclusion can be drawn from the exploding gap between the median voters in either party, which is explicitly plotted in the lower right chart. The top right chart is a pretty visualization of how polarized the country was in the 2017 election.


Visualizing electoral college politics: exercise in displaying relationships between variables

Reader Berry B. sent in a tip quite some months ago that I just pulled out of my inbox. He really liked the Washington Post's visualization of the electoral college in the Presidential election. (link)

One of the strengths of this project is the analysis that went on behind the visualization. The authors point out that there are three variables at play: the population of each state, the votes cast by state, and the number of electoral votes by state. A side-by-side comparison of the two tile maps gives a perspective of the story:


The under/over representation of electoral votes is much less pronounced if we take into account the propensity to vote. With three metrics at play, there is quite a bit going on. On these maps, orange and blue are used to indicate the direction of difference. Then the shade of the color codes the degree of difference, which was classified into severe versus slight (but only for one direction). Finally, solid squares are used for the comparison with population, and square outlines are for comparison with votes cast.

Pick Florida (FL) for example. On the left side, we have a solid, dark orange square while on the right, we have a square outline in dark orange. From that, we are asked to match the dark orange with the dark orange and to contrast the solid versus the outline. It works to some extent but the required effort seems more than desirable.


I'd like to make it easier for readers to see the interplay between all three metrics.

In the following effort, I ditch the map aesthetic, and focus on three transformed measures: share of population, share of popular vote, and share of electoral vote. The share of popular vote is a re-interpretation of what Washington Post calls "votes cast".

The information is best presented by grouping states that behaved similarly. The two most interesting subgroups are the large states like Texas and California where the residents loudly complained that their voice was suppressed by the electoral vote allocation but in fact, the allocated electoral votes were not far from their share of the popular vote! By contrast, Floridians had a more legitimate reason to gripe since their share of the popular vote much exceeded their share of the electoral vote. This pattern also persisted throughout the battleground states.
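The three transformed measures are plain shares of a national total. Here is a minimal sketch of the transformation, using three hypothetical states with invented numbers (not actual state data) and my own function name.

```python
def shares(values):
    """Convert raw values per state into percent shares of the total."""
    total = sum(values.values())
    return {state: 100.0 * v / total for state, v in values.items()}


# Invented figures for three hypothetical states.
population = {"A": 50, "B": 30, "C": 20}   # e.g. millions of residents
votes_cast = {"A": 40, "B": 40, "C": 20}   # e.g. millions of ballots
electoral = {"A": 5, "B": 3, "C": 2}       # electoral votes

pop_share = shares(population)
vote_share = shares(votes_cast)
ev_share = shares(electoral)

for state in population:
    # Positive gap: more electoral weight than the state's share of ballots.
    gap = ev_share[state] - vote_share[state]
    print(state, pop_share[state], vote_share[state], ev_share[state], gap)
```

In this toy example, state B plays the role of Florida: its share of the popular vote exceeds its share of the electoral vote, so the gap is negative.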


The hardest part of this design is making the legend:





Unintentional deception of area expansion #bigdata #piechart

Someone sent me this chart via Twitter, as an example of yet another terrible pie chart. (I couldn't find that tweet anymore but thank you to the reader for submitting this.)


At first glance, this looks like a pie chart with the radius as a second dimension. But that is the wrong interpretation.

In a pie chart, we typically encode the data in the angles of the pie sectors, or equivalently, the areas of the sectors. In this special case, the angle is invariant across the slices, and the data are encoded in the radius.

Since the data are found in the radii, let's deconstruct this chart by reducing each sector to its left-side edge.

This leads to a different interpretation of the chart: it’s actually a simple bar chart, manipulated.


The process of the manipulation runs against what data visualization should be. It takes the bar chart (bottom right) that is easy to read, introduces slants so it becomes harder to digest (top right), and finally absorbs a distortion to go from inefficient to incompetent (left).

What is this distortion I just mentioned? When readers look at the original chart, they are not focusing on the left-side edge of each sector but they are seeing the area of each sector. The ratio of areas is not the same as the ratio of lengths. Adding purple areas to the chart seems harmless but in fact, despite applying the same angles, the designer added disproportionately more area to the larger data points compared to the smaller ones.


In order to remedy this situation, the designer has to take the square root of the lengths of the edges. But of course, the simple bar chart is more effective.
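The distortion follows from the area of a circular sector, which is one half times the angle times the radius squared. A minimal sketch, with an invented pair of data values and an arbitrary common angle:

```python
import math


def sector_area(radius, angle):
    """Area of a circular sector with the given radius and angle (radians)."""
    return 0.5 * angle * radius ** 2


angle = math.pi / 6      # same angle for every slice, as in the chart
small, large = 1.0, 3.0  # invented data values; the true ratio is 3 to 1

# Radius proportional to the data: the areas are in the ratio 9 to 1.
distorted = sector_area(large, angle) / sector_area(small, angle)

# Radius proportional to the square root of the data: the ratio is 3 to 1.
corrected = (sector_area(math.sqrt(large), angle)
             / sector_area(math.sqrt(small), angle))

print(round(distorted, 6), round(corrected, 6))  # 9.0 3.0
```

With the radius carrying the data, a value three times as large gets nine times the ink. Scaling the radius by the square root restores the three-to-one ratio in the areas.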



Your charts need the gift of purpose

Via Twitter, I received this chart:


My readers are nailing it when it comes to finding charts that deserve close study. On Twitter, the conversation revolved around the inversion of the horizontal axis. Favorability is associated with positive numbers, and unfavorability with negative numbers, and so, it seems the natural ordering should be to place Favorable on the right and Unfavorable on the left.

Ordinarily, I'd have a problem with the inversion but here, the designer used the red-orange color scheme to overcome the potential misconception. It's hard to imagine that orange would be the color of disapproval, and red, of approval!

I am more concerned about a different source of confusion. Take a look at the following excerpt:


If you had to guess, what are the four levels of favorability? Using the same positive-negative scale discussed above, most of us will assume that going left to right, we are looking at Strongly Favorable, Favorable, Unfavorable, Strongly Unfavorable. The people in the middle are neutrals and the people on the edges are extremists.

But we'd be mistaken. The order going left to right is Favorable, Strongly Favorable, Strongly Unfavorable, Unfavorable. The designer again used tints and shades to counter our pre-conception. This is less successful because the order defies logic. It is a double inversion.

The other part of the chart I'd draw attention to is the column of data printed on the right. Each such column is an act of giving up - the designer admits he or she couldn't find a way to incorporate that data into the chart itself. It's like a footnote in a book. The problem arises because such a column frequently contains very important information. On this chart, the data are "net favorable" ratings, the proportion of Favorables minus the proportion of Unfavorables, or visually, the length of the orange bar minus the length of the red bar.

The net rating is a succinct way to summarize the average sentiment of the population. But it's been banished to a footnote.
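For reference, the net rating is simple arithmetic on the four response levels. Here is a minimal sketch, with invented percentages and not the poll's actual figures:

```python
def net_favorable(favorable, strongly_favorable,
                  strongly_unfavorable, unfavorable):
    """Net rating: all favorables minus all unfavorables, in points."""
    total_favorable = favorable + strongly_favorable
    total_unfavorable = strongly_unfavorable + unfavorable
    return total_favorable - total_unfavorable


# Invented percentages for a hypothetical politician.
print(net_favorable(30, 25, 20, 15))  # 20
```

A positive number means more respondents view the person favorably than unfavorably; the four inputs need not add up to 100 because some respondents have no opinion.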


Anyone who follows American politics a little in recent years recognizes the worsening polarization of opinions. A chart showing the population average is thus rather meaningless. I'd like to see the above chart broken up by party affiliation (Republican, Independent, Democrat).

This led me to the original source of the chart. It turns out that the data came from a Fox News poll but the chart was not produced by Fox News - it accompanied this Washington Post article. Further, the article contains three other charts, broken out by party affiliation, as I hoped. The headline of the article was "Bernie Sanders remains one of the most popular politicians..."

But reading three charts, printed vertically, is not the simplest matter. One way to make it easier is to gift the chart a purpose. It turns out there are no surprises among the Republican and Democratic voters - they are as polarized as one can imagine. So the real interesting question in this data is the orientation of the Independent voters - are they more likely to side with Democrats or Republicans?

Good house-keeping means when you acquire stuff, you must remove other stuff. After adding the party dimension, it makes more sense to collapse the favorability dimension - precisely by using the net favorable rating column: