Dataviz is good at comparisons if we make the right comparisons

In an article about gas prices around the world, the Washington Post uses the following bar chart (link):

Wpost_gasprices_highincome

There are a few wrinkles in this one compared to the most generic bar chart one can produce:

Redo_wpost_gasprices_0

(The numbers on my chart are not the same as Washington Post's. That's because the data vendor charges for data, except for the most recent week. So, my data is from a different week.)

The gas prices are not expressed in dollars; instead, a transformation turns prices into a cost-effectiveness metric: miles per dollar, or more precisely, miles per $40 of gas. The metric runs in the reverse direction: the higher the price, the fewer the miles. The data transformation belongs to the D corner of the Trifecta Checkup framework (link). Depending on how one poses the Q(uestion) of the chart, the shift from dollars to miles can bring the Q and the D in sync.

In the V(isual) corner, the designer embellishes the bars. A car icon is placed at the tip of each bar while the bar itself is turned into a wavy path, symbolizing a dirt path. The driving metaphor is in full play. In fact, the video makes the most out of it. There is no doubt that the embellishment has turned a mere scientific presentation into a form of entertainment.

***

Did the embellishment harm visual clarity? For the most part, no.

The worst it can get is when they compared U.S. and India/South Africa:

Redo_wpost_gasprices_indiasouthafrica

The left column shows the original charts from the article. In both charts, the two cars are so close together that it is impossible to learn the scale of the difference. The difference amounts to a fraction of the width of a car icon.

The right column shows the "self-sufficiency test". Imagine the data labels are not on the chart. What we learn is that, when reading the charts on the left, we rely on the data labels, not the visual elements, to know how big the gap between the two countries is. On the right side, if we really want to learn the gaps, we have to look through the car icons to find the tips of the bars!

This discussion does not necessarily doom the appealing chart. If the message one wants to send with the India/South Africa charts is that there is negligible difference between them, then it is not crucial to present the precise differences in prices.

***

The real problem with this dataviz is in the D corner. Comparing countries is hard.

As shown above, by the miles per $40 spend metric, U.S. and India are rated essentially the same. So is the average American and the average Indian suffering equally?

Far from it. The clue comes from the aggregate chart, in which countries are divided into three tiers: high income, upper middle income and lower middle income. The U.S. belongs to the high-income tier while India falls into the lower-middle-income tier.

The cost of living in India is much lower than in the US. Forty dollars is a much bigger chunk of an Indian paycheck than an American one.

To adjust for cost of living, economists use a PPP (purchasing power parity) value. The following chart shows the difference:

Redo_wpost_gasprices_1

The right graph contains cost-of-living adjustments. It shows a completely different picture. Nominally (left chart), the price of gas is about the same in dollar terms between U.S. and India. In terms of cost of living, gas is actually 5 times more expensive in India. Thus, the adjusted miles per $40 gas number is much smaller for India than the unadjusted. (Because PPP is relative to U.S. prices, the U.S. numbers are not affected.)
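Here is a minimal sketch of one plausible way to make such an adjustment. The function names and all input values are hypothetical, chosen only to illustrate the arithmetic; they are not the data behind the chart. The assumption is that the nominal dollar price is converted to local currency at the market exchange rate, then re-expressed using a PPP conversion factor.

```python
def ppp_adjusted_price(price_usd, market_rate, ppp_factor):
    """Re-express a nominal USD price in PPP (international) dollars.

    market_rate: local currency units per US dollar (market exchange rate).
    ppp_factor:  local currency units per international dollar.
    """
    local_price = price_usd * market_rate  # what the driver pays locally
    return local_price / ppp_factor        # valued at purchasing power parity


def miles_per_budget(price_per_gallon, mpg=32.0, budget=40.0):
    """Miles that `budget` dollars of gas buys at a given price and mpg."""
    return budget / price_per_gallon * mpg


# Hypothetical inputs for illustration only.
nominal_price = 5.00  # USD per gallon
market_rate = 80.0    # local currency per USD
ppp_factor = 20.0     # local currency per international dollar

adjusted_price = ppp_adjusted_price(nominal_price, market_rate, ppp_factor)
unadjusted_miles = miles_per_budget(nominal_price)
adjusted_miles = miles_per_budget(adjusted_price)
print(unadjusted_miles, adjusted_miles)
```

For the reference country, the market rate and the PPP factor are both 1, so the adjusted and unadjusted numbers coincide, which matches the remark that the U.S. numbers are not affected.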

PPP is not the end-all here. According to the Economic Times (India), only 22 out of 1,000 Indians own cars, compared to 980 out of 1,000 Americans. Think about the implication of using any statistic that averages the entire population!

***

Why is gas more expensive in California than the U.S. average? The talking point I keep hearing is environmental regulations. Gas prices may be higher in Europe for a similar reason. Residents in those places may be willing to pay higher prices because they get satisfaction from playing their part in preserving the planet for future generations.

The footnote discloses this non-trivial issue.

Wpost_gasprices_footnote

When converting from dollars per gallon/liter into miles per $40, we need data on miles per gallon/liter. Americans notoriously drive cars (trucks, SUVs, etc.) that have much lower mileage than those driven in other countries. However, this factor is artificially removed by assuming the same car with 32 mpg in all countries. A quick hop to the BTS website tells us that the average mpg of American cars is a third of that assumption. [See note below.]

Ignoring cross-country comparisons for the time being, the true number for U.S. is not 247 miles per $40 spent on gas as claimed. It is a third of that value: 82 miles per $40 spent.
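The rescaling is simple proportionality: for a fixed price, miles per $40 moves in step with fuel economy. The sketch below uses a hypothetical helper name; the 247 miles and 32 mpg figures come from the text, and 23 mpg is the corrected figure from the P.S. at the end of this post.

```python
def rescale_miles(miles_at_assumed, assumed_mpg, actual_mpg):
    """Rescale a miles-per-budget figure from an assumed mpg to another mpg.

    With the gas price held fixed, miles per dollar is proportional to mpg.
    """
    return miles_at_assumed * actual_mpg / assumed_mpg


# One third of the assumed 32 mpg, as in the text above.
third = rescale_miles(247, 32, 32 / 3)
# 23 mpg, the corrected figure noted in the P.S.
corrected = rescale_miles(247, 32, 23)
print(round(third), round(corrected))
```

Either way, the adjustment is far from trivial, which is the larger point.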

It's tough to find data on fuel economy of all passenger cars, not just new passenger cars. I found Australia's number, which is 21 mpg. So this brings the miles per $40 number down from about 230 to 115. These are not small adjustments.

Washington Post's analysis paints a simplistic picture that presupposes that price is the only thing people care about. I call this issue xyopia. It's when the analyst frames the problem as factor x explaining outcome y, and when factor x is not the only, and frequently not even the most important, factor affecting y.

More on xyopia.

More discussion of Washington Post graphics.

 

[P.S. 7-25-2022. Reader Cody Curtis pointed out in the comments that the Bureau of Transportation Statistics report was using km/liter as units, not miles per gallon. The 10 km/liter number for average cars is roughly 23 mpg. I'll leave the text as is in the post as the larger point is valid: that there is variation in average fuel economy between nations - partly due to environmental regulation and consumer behavior - and thus, a proper comparison requires adjusting for this factor.]


The time has arrived for cumulative charts

Long-time reader Scott S. asked me about this Washington Post chart that shows the disappearance of pediatric flu deaths in the U.S. this season:

Washingtonpost_pediatricfludeaths

The dataset behind this chart is highly favorable to the designer, because the signal in the data is so strong. This is a good chart. The key point is shown clearly right at the top, with an informative title. Gridlines are very restrained. I'd draw attention to the horizontal axis. The master stroke here is omitting the week labels, which are likely confusing to all but the people familiar with this dataset.

Scott suggested using a line chart. I agree. And especially if we plot cumulative counts, rather than weekly deaths. Here's a quick sketch of such a chart:

Junkcharts_redo_wppedflu_panel

(On second thought, I'd remove the week numbers from the horizontal axis, and just go with the month labels. The Washington Post designer is right in realizing that those week numbers are meaningless to most readers.)
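Turning weekly counts into cumulative counts is a running sum. A minimal sketch follows; the weekly numbers are made up for illustration and are not the flu data.

```python
from itertools import accumulate


def cumulative(weekly_counts):
    """Return the running total of a sequence of weekly counts."""
    return list(accumulate(weekly_counts))


# Hypothetical weekly counts for one season (illustration only).
weekly = [0, 1, 3, 2, 5, 4]
season_total = cumulative(weekly)
print(season_total)  # [0, 1, 4, 6, 11, 15]
```

A season with no deaths produces a flat line at zero, which stands out immediately against the rising lines of other seasons.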

The vaccine trials have brought this cumulative count chart form to the mainstream. For anyone who has seen the vaccine efficacy charts, the interpretation of the panel of line charts should come naturally.

Instead of four plots, I prefer one plot with four superimposed lines. Like this:

Junkcharts_redo_wppeddeaths_superpose2

 

 

 


Water stress served two ways

Via Alberto Cairo (whose new book How Charts Lie can be pre-ordered!), I found the Water Stress data visualization by the Washington Post. (link)

The main interest here is how they visualized the different levels of water stress across the U.S. Water stress is a metric defined by the World Resources Institute that, to my mind, measures the demand versus supply of water. The higher the water stress, the higher the risk of experiencing droughts.

There are two ways in which the water stress data are shown: the first is a map, and the second is a bubble plot.

Wp_waterstress

This project provides a great setting to compare and contrast these chart forms.

How Data are Coded

In a map, the data are usually coded as colors. Sometimes, additional details can be coded as shades, or moire patterns within the colors. But the map form locks down a number of useful dimensions - including x and y location, size and shape. The outline map reserves all these dimensions, rendering them unavailable to encode data.

By contrast, the bubble plot admits a good number of dimensions. The key ones are the x- and y- location. Then, you can also encode data in the size of the dots, the shape, and the color of the dots.

In our map example, the colors encode the water stress level, and a moire pattern encodes "arid areas". For the scatter plot, x = daily water use, y = water stress level, grouped by magnitude, color = water stress level, size = population. (Shape is constant.)

Spatial Correlation

The map is far superior in displaying spatial correlation. It's visually obvious that the southwestern states experience higher stress levels.

This spatial knowledge is relinquished when using a bubble plot. The designer relies on the knowledge of the U.S. map in the head of the readers. It is possible to code this into one of the available dimensions, e.g. one could make x = U.S. regions, but another variable is sacrificed.

Non-contiguous Spatial Patterns

When spatial patterns are contiguous, the map functions well. Sometimes, spatial patterns are disjoint. In that case, the bubble plot, which de-emphasizes the physical locations, can be superior. In our example, the vertical axis divides the states into five groups based on their water stress levels. Try figuring out which states are "medium to high" water stress from the map, and you'll see the difference.

Finer Geographies

The map handles finer geographical units like counties and precincts better. It's completely natural.

In the bubble plot, shifting to finer units causes the number of dots to explode. This clutters up the chart. Besides, while most (we hope) Americans know the 50 states, most of us can't recite counties or precincts. Thus, the designer can't rely on knowledge in our heads. It would be impossible to learn spatial patterns from such a chart.

***

The key, as always, is to nail down your message, then select the right chart form.

 

 


Transforming the data to fit the message

A short time ago, there were reports that some theme-park goers were not happy about the latest price hike by Disney. One of these reports, from the Washington Post (link), showed a chart that was intended to convey how much Disney park prices have outpaced inflation. Here is the chart:

Wapo_magickingdom_price_changes

I had a lot of trouble processing this chart. The two lines are labeled "original price" and "in 2014 dollars". The lines show a gap back in the 1970s, which completely closes up by 2014. This gives the reader an impression that the problem has melted away - which is the opposite of what the designer intended.

The economic concept being marshalled here is the time value of money, or inflation. The idea is that $3.50 in 1971 is equivalent to a much higher ticket price in "2014 dollars" because by virtue of inflation, putting that $3.50 in the bank in 1971 and holding till 2014 would make that sum "nominally" higher. In fact, according to the chart, the $3.50 would have become $20.46, an approx. 6-fold increase.

The gap thus represents the inflation factor. The gap melting away is a result of passing of time. The closer one is to the present, the less the effect of cumulative inflation. The story being visualized is that Disney prices are increasing quickly whether or not one takes inflation into account. Further, if inflation were to be considered, the rate of increase is lower (red line).

What about the alternative story - Disney's price increases are often much higher than inflation? We can take the nominal price increase, and divide it into two parts, one due to inflation (of the prior-period price), and the other in excess of inflation, which we will interpret as a Disney premium.

The following chart then illustrates this point of view:

Redo_disneypricehikes

Most increases are small, and stay close to the inflation rate. But once in a while, and especially in 2010s, the price increases have outpaced inflation by a lot.
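The split described above can be sketched as follows. The function name and the example numbers are hypothetical; the assumption is that the inflation part is the prior-period price times that period's inflation rate, and the premium is whatever remains of the nominal increase.

```python
def split_increase(prev_price, new_price, inflation_rate):
    """Split a nominal price change into an inflation part and a premium.

    inflation_rate: inflation over the period as a fraction (0.02 = 2%).
    Returns (inflation_part, premium), both in nominal dollars.
    """
    nominal_change = new_price - prev_price
    inflation_part = prev_price * inflation_rate  # inflation on the prior price
    premium = nominal_change - inflation_part     # increase in excess of inflation
    return inflation_part, premium


# Hypothetical example: a ticket going from $100 to $106 while inflation is 2%.
inflation_part, premium = split_increase(100.0, 106.0, 0.02)
print(inflation_part, premium)
```

A negative premium means the price rose by less than inflation, or fell outright, in that period.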

Note: since 2013, Disney has introduced price tiers, starting with two and currently at four levels. In the above chart, I took the average of the available prices, making the assumption that all four levels are equally popular. The last number looks like a price decrease because there is a new tier called "Low". The data came from AllEars.net.


A gem among the snowpack of Olympics data journalism

It's not often I come across a piece of data journalism that pleases me so much. Here it is: the "Happy 700" article by the Washington Post. It is amazing.

Wpost_happy700_map2

 

When data journalism and dataviz are done right, the designers have made good decisions. Here are some of the key elements that make this article work:

(1) Unique

The topic is timely but timeliness heightens both the demand and supply of articles, which means only the unique and relevant pieces get the readers' attention.

(2) Fun

The tone is light-hearted. It's a fun read. A little bit informative - when they describe the towns that few have heard of. The notion is slightly silly but the reader won't care.

(3) Data

It's always a challenge to make data come alive, and these authors succeeded. Most of the data work involves finding, collecting and processing the data. There isn't any sophisticated analysis. But it is a powerful demonstration that complex analysis is not always necessary.

(4) Organization

The structure of the data is three criteria (elevation, population, and terrain) by cities. A typical way of showing such data might be an annotated table, or a Bumps-type chart, grouped columns, and so on. All these formats try to stuff the entire dataset onto one chart. The designers chose to highlight one variable at a time, cumulatively, on three separate maps. This presentation fits perfectly with the flow of the writing. 

(5) Details

The execution involves some smart choices. I am a big fan of legend/axis labels that are informative, for example, note that the legend doesn't say "Elevation in Meters":

Wpost_happy700_legend

The color scheme across all three maps shows a keen awareness of background/foreground concerns. 


Three pies and a bar: serving visual goodness

If you are not sick of the Washington Post article about friends (not) letting friends join the other party, allow me to write yet another post on, gasp, that pie chart. And sorry to have kept reader Daniel L. waiting, as he pointed out, when submitting this chart to me, that he had tremendous difficulty understanding it:

Wpost_friendsparties4

 

This is not one pie but six pies on a platter. There are two sources of confusion: first, the repeated labels of Republicans and Democrats to refer to different groups of people; and second, the indecision between using two or four categories of "how many".

Let me begin by re-ordering and re-labeling the chart:

Redo_junkcharts_friendsparties4

From this version, one can pull out the key messages of the analysis: (A) most voters, regardless of party, have mostly friends from the same party; and (B) Republicans are more likely to have more friends from the other party than Democrats. A third, but really not that interesting, point is that regardless of party, people have about the same likelihood to befriend Independents.

In visualization, less is more is frequently appropriate. So, here is a view of the same chart, using two categories instead of four.

Redo_junkcharts_friendsparties4b

The added advantage is only two required colors, and thus even grayscale can work.

The new arrangement of the pie platter makes it clear that there really isn't that much difference between Republican and Democratic voters along this dimension. Thus, visualizing the aggregate gets us to the same place.

Redo_junkcharts_friendsparties4c

After three servings of pies, the reader might be craving some energy bars:

Redo_junkcharts_friendsparties4d

One can say that for very simple data like this, pie charts are acceptable. However, the stacked bar is better.

Thanks again Daniel, and it's a pleasure to serve you!


Lop-sided precincts, a visual exploration

In the last post, I discussed one of the charts in the very nice Washington Post feature, delving into polarizing American voters. See the post here. (Thanks again Daniel L.)

Today's post is inspired by the following chart (I am showing only the top of it - click here to see the entire chart):

Wpost_friendsparties2_top

The chart plots each state as a separate row, so like most such charts, it is tall. The data analysis behind the chart is fascinating and unusual, although I find the chart harder to grasp than expected. The analyst starts with precinct-level data, and determines which precincts were "lop-sided," defined as having a winning margin of over 50 percent for the winner (either Trump or Clinton). The analyst then sums the voters in those lop-sided precincts, and expresses this as a percent of all voters in the state.

For example, in Alabama, the long red bar indicates that about 48% of the state's voters live in lop-sided precincts that went for Trump. It's important to realize that not all such people voted for Trump - they happened to live in precincts that went heavily for Trump. Interestingly, about 12% of the state's voters reside in precincts that went heavily for Clinton. Thus, overall, 60% of Alabama's voters live in lop-sided precincts.
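To make the calculation concrete, here is a minimal sketch of how such state-level percentages could be computed from precinct-level data. The function name, the data layout and the precinct numbers are all hypothetical. I am also assuming that the winning margin is the gap between the two candidates' vote shares, in percentage points of all votes cast in the precinct; the article may define it differently.

```python
def lopsided_shares(precincts, threshold=50.0):
    """Percent of a state's voters living in lop-sided precincts, by winner.

    precincts: list of (trump_votes, clinton_votes, total_votes) tuples.
    threshold: winning margin in percentage points that counts as lop-sided.
    """
    state_total = sum(total for _, _, total in precincts)
    voters = {"Trump": 0, "Clinton": 0}
    for trump, clinton, total in precincts:
        if total == 0:
            continue  # skip empty precincts
        margin = (trump - clinton) * 100 / total
        if margin > threshold:
            voters["Trump"] += total      # every voter in the precinct counts
        elif margin < -threshold:
            voters["Clinton"] += total
    return {who: n * 100 / state_total for who, n in voters.items()}


# Hypothetical state with four precincts (illustration only).
state = [
    (800, 150, 1000),  # margin +65: lop-sided for Trump
    (90, 400, 500),    # margin -62: lop-sided for Clinton
    (760, 700, 1500),  # margin +4: competitive
    (700, 250, 1000),  # margin +45: not lop-sided
]
result = lopsided_shares(state)
print(result)
```

Note that the whole precinct's voter count is added, not just the winner's votes, which is why not all such people voted for the winning candidate.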

This is more sophisticated than the usual analysis that shows up in journalism.

The bar chart may confuse readers for several reasons:

  • The horizontal axis is labeled "50-point plus margin for Trump/Clinton" and has values ranging from 0% up to the 40-60% range. This description seemingly implies that the values being plotted are winning margins. However, the sub-header tells readers that the data values are percentages of total voters in the state.
  • The shades of colors are not explained. I believe the dark shade indicates the winning party in each state, so Trump won Alabama and Clinton, California. The addition of this information allows the analysis to become multi-dimensional. It also reveals that the designer wants to address how lop-sided precincts affect the outcome of the election. However, adding shade in this manner effectively turns a two-color composition into a four-color composition, adding to the processing load.
  • The chart adopts what Howard Wainer calls the "Alabama first" ordering. This always messes up the designer's message because the alphabetical order typically does not yield a meaningful correlation.

The bars are facing out from the middle, which is the 0% line. This arrangement is most often used in a population pyramid, and used when the designer feels it important to let readers compare the magnitudes of two segments of a population. I do not feel that the Democrat versus Republican comparison within each state is crucial to this chart, given that most states were not competitive.

What is more interesting to me is the total proportion of voters who live in these lop-sided precincts. The designer agrees on this point, and employs bar stacking to make this point. This yields some amazing insights here: several Democratic strongholds such as Massachusetts surprisingly have few lop-sided precincts.

***
Here then is a remake of the chart according to my priorities. Click here for the full chart.

Redo_wpost_friendsparties2_top

The emphasis is on the total proportion of voters in lop-sided precincts. The states are ordered by that metric from most lop-sided to least. This draws out an unexpected insight: most red states have a relatively high proportion of voters in lop-sided precincts (~ 30 to 40%) while most blue states - except for the quartet of Maryland, New York, California and Illinois - do not exhibit such demographic concentration.

The gray/grey area offers a counterpoint, that most voters do not live in lop-sided districts.

P.S. I should add that this is one of those chart designs that frustrate standard - I mean, point-and-click - charting software because I am placing the longest bar segments on the left, regardless of color.


Let's not mix these polarized voters as the medians run away from one another

Long-time follower Daniel L. sent in a gem, by the Washington Post. This is a multi-part story about the polarization of American voters, nicely laid out, with superior analyses and some interesting graphics. Click here to see the entire article.

Today's post focuses on the first graphic. This one:

Wpost_friendsparties1

The key messages are written out on the 2017 charts: namely, 95% of Republicans are more conservative than the median Democrat, and 97% of Democrats are more liberal than the median Republican.
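The statistic itself is easy to compute once each respondent has an ideology score. Here is a minimal sketch, assuming a hypothetical scale on which higher scores are more conservative; the scores are invented for illustration and are not the survey data.

```python
from statistics import median


def share_beyond_other_median(own_scores, other_scores, more_conservative):
    """Percent of one party that is more extreme than the other party's median.

    Scores are on a hypothetical scale where higher means more conservative.
    more_conservative: True to count scores above the other party's median,
                       False to count scores below it.
    """
    m = median(other_scores)
    if more_conservative:
        count = sum(score > m for score in own_scores)
    else:
        count = sum(score < m for score in own_scores)
    return count * 100 / len(own_scores)


# Hypothetical ideology scores (illustration only).
democrats = [-8, -6, -5, -3, -1, 2]
republicans = [-5, 1, 3, 4, 6, 8, 9, 9]

rep_share = share_beyond_other_median(republicans, democrats, True)
dem_share = share_beyond_other_median(democrats, republicans, False)
print(rep_share, dem_share)
```

By construction, a party compared against its own median gives about 50%, so values in the mid-90s signal two distributions that barely overlap.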

This is a nice statistical way of laying out the polarization. There are a number of additional insights one can draw from the population distributions: for example, in the bottom row, the Democrats have been moving left consistently, and decisively in 2017. By contrast, Republicans moved decisively to the right from 2004 to 2017. I recall reading about polarization in past elections but it is really shocking to see the extreme in 2017.

A really astounding but hidden feature is that the median Democrat and the median Republican were not too far apart in 1994 and 2004 but the gap exploded in 2017.

***

I'd like to solve a few minor problems on this graphic. It's a bit confusing to have each chart display information on both Republican and Democratic distributions. The reader has to understand that in the top row, the red area represents Republican voters but the blue line shows the median Democrat.

Also, I want to surface two key insights: the huge divide that developed in 2017, and the exploding gap between the two medians.

Here is the revised graphic:

  Redo_wpost_friendsparties1

On the left side, each chart focuses on one party, and the trend over the three elections. The reader can cross charts to discover that the median voter in one party is more extreme than essentially all of the voters of the other party. This same conclusion can be drawn from the exploding gap between the median voters in either party, which is explicitly plotted in the lower right chart. The top right chart is a pretty visualization of how polarized the country was in the 2017 election.

 


Visualizing electoral college politics: exercise in displaying relationships between variables

Reader Berry B. sent in a tip quite some months ago that I just pulled out of my inbox. He really liked the Washington Post's visualization of the electoral college in the Presidential election. (link)

One of the strengths of this project is the analysis that went on behind the visualization. The authors point out that there are three variables at play: the population of each state, the votes cast by state, and the number of electoral votes by state. A side-by-side comparison of the two tile maps gives a perspective of the story:

Wp_electoralcollege_maps

The under/over representation of electoral votes is much less pronounced if we take into account the propensity to vote. With three metrics at play, there is quite a bit going on. On these maps, orange and blue are used to indicate the direction of difference. Then the shade of the color codes the degree of difference, which was classified into severe versus slight (but only for one direction). Finally, solid squares are used for the comparison with population, and square outlines are for comparison with votes cast.

Pick Florida (FL) for example. On the left side, we have a solid, dark orange square while on the right, we have a square outline in dark orange. From that, we are asked to match the dark orange with the dark orange and to contrast the solid versus the outline. It works to some extent but the required effort seems more than desirable.

***

I'd like to make it easier for readers to see the interplay between all three metrics.

In the following effort, I ditch the map aesthetic, and focus on three transformed measures: share of population, share of popular vote, and share of electoral vote. The share of popular vote is a re-interpretation of what Washington Post calls "votes cast".
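Each of the three measures is a state's count divided by the national total. A minimal sketch follows, with hypothetical state names and made-up counts for a three-state country; the gap between the electoral share and the popular-vote share is one way to express over- or under-representation.

```python
def shares(counts):
    """Convert a dict of counts by state into percent shares of the total."""
    total = sum(counts.values())
    return {state: n * 100 / total for state, n in counts.items()}


# Hypothetical three-state country (illustration only).
population = {"A": 50, "B": 30, "C": 20}
votes_cast = {"A": 20, "B": 18, "C": 12}
electoral_votes = {"A": 12, "B": 8, "C": 5}

pop_share = shares(population)
vote_share = shares(votes_cast)
ev_share = shares(electoral_votes)

# Positive gap: more electoral weight than the state's share of votes cast.
gap = {state: ev_share[state] - vote_share[state] for state in ev_share}
print(gap)
```

In this made-up example, state A holds 48% of the electoral votes but cast 40% of the popular vote, so it is over-represented by 8 points relative to turnout.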

The information is best presented by grouping states that behaved similarly. The two most interesting subgroups are the large states like Texas and California where the residents loudly complained that their voice was suppressed by the electoral vote allocation but in fact, the allocated electoral votes were not far from their share of the popular vote! By contrast, Floridians had a more legitimate reason to gripe since their share of the popular vote much exceeded their share of the electoral vote. This pattern also persisted throughout the battleground states.

Redo_wp_electoralcollege

The hardest part of this design is making the legend:

Redo_wp_electoralcollege_legend

 

 

 


Unintentional deception of area expansion #bigdata #piechart

Someone sent me this chart via Twitter, as an example of yet another terrible pie chart. (I couldn't find that tweet anymore but thank you to the reader for submitting this.)

Uk_itsurvey_left

At first glance, this looks like a pie chart with the radius as a second dimension. But that is the wrong interpretation.

In a pie chart, we typically encode the data in the angles of the pie sectors, or equivalently, the areas of the sectors. In this special case, the angle is invariant across the slices, and the data are encoded in the radius.

Since the data are found in the radii, let's deconstruct this chart by reducing each sector to its left-side edge.

This leads to a different interpretation of the chart: it’s actually a simple bar chart, manipulated.

Redo_ukitsurvey_1

The process of the manipulation runs against what data visualization should be. It takes the bar chart (bottom right) that is easy to read, introduces slants so it becomes harder to digest (top right), and finally absorbs a distortion to go from inefficient to incompetent (left).

What is this distortion I just mentioned? When readers look at the original chart, they are not focusing on the left-side edge of each sector but they are seeing the area of each sector. The ratio of areas is not the same as the ratio of lengths. Adding purple areas to the chart seems harmless but in fact, despite applying the same angles, the designer added disproportionately more area to the larger data points compared to the smaller ones.

  Redo_ukitsurvey_2

In order to remedy this situation, the designer has to take the square root of the lengths of the edges. But of course, the simple bar chart is more effective.
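The distortion follows from geometry: with a fixed angle, the area of a sector grows with the square of its radius. A minimal sketch, using a hypothetical 30-degree angle and made-up data values:

```python
import math


def sector_area(radius, angle_deg):
    """Area of a circular sector with the given radius and central angle."""
    return math.radians(angle_deg) / 2 * radius ** 2


angle = 30          # same angle for every slice (hypothetical)
small, large = 20, 40  # hypothetical data values; the true ratio is 2

# Radius proportional to the data: the areas overstate the ratio.
naive_ratio = sector_area(large, angle) / sector_area(small, angle)

# Radius proportional to the square root of the data: the areas match the data.
fixed_ratio = (sector_area(math.sqrt(large), angle)
               / sector_area(math.sqrt(small), angle))

print(naive_ratio, fixed_ratio)
```

A data point twice as large gets four times the ink when the radius encodes the data directly; after the square-root correction, it gets twice the ink.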