On data volume, reliability, uncertainty and confidence bands

This chart from the Economist caught my eye because of the unusual use of color-coded hexagonal tiles.


The basic design of the chart is easy to grasp: It relates people's "happiness" to national wealth. The thick black line shows that the average citizen of wealthier countries tends to rate their current life situation better.

For readers alert to graphical details, things can get a little confusing. The horizontal "wealth" axis is shown in log scale, which means that the data on the right side of the chart have been compressed while the data on the left side of the chart have been stretched out. In other words, the curve in linear scale is much flatter than depicted.


One thing you might notice is how poor the fit of the line is at both ends. Singapore and Afghanistan are clearly not explained by the fitted line. (That said, the line is based on many more dots than those eight we can see.) Moreover, because countries are widely spread out on the high end of the wealth axis, the fit is not impressive. Log scales tend to give a false impression of the tightness of fit, as I explained before when discussing coronavirus case curves.


The hexagonal tiles replace the more typical dot scatter or contour shading. The raw data consist of results from polls conducted in different countries in different years. For each poll, the analyst computes the average life satisfaction score for that country in that year. From national statistics, the analyst pulls out that country's GDP per capita in that year. Thus, each data point is a dot on the canvass. A few data points are shown as black dots. Those are for eight highlighted countries for the year 2018.

The black line is fitted to the underlying dot scatter and summarizes the correlation between average wealth and average life satisfaction. Instead of showing the scatter, this Economist design aggregates nearby dots into hexagons. The deepest red hexagon, sandwiched between Finland and the US, contains about 60-70 dots, according to the color legend.

These details are tough to take in. It's not clear which dots have been collected into that hexagon: are they all Finland or the U.S. in various years, or do they include other countries? Each country is represented by multiple dots, one for each poll year. It's also not clear how much variation there exists within a country across years.


The hexagonal tiles presumably serve the same role as a dot scatter or contour shading. They convey the amount of data supporting the fitted curve along its trajectory. More data confers more reliability.

For this chart, the hexagonal tiles do not add any value. The deepest red regions are those closest to the black line so nothing is actually lost by showing just the line and not the tiles.


Using the line chart obviates the need for readers to figure out the hexagons, the polls, the aggregation, and the inevitable unanswered questions.


An alternative concept is to show the "confidence band" or "error bar" around the black line. These bars display the uncertainty of the data. The wider the band, the less certain the analyst is of the estimate. Typically, the band expands near the edges where we have less data.

Here is conceptually what we should see (I don't have the underlying dataset so can't compute the confidence band precisely)


The confidence band picture is the mirror image of the hexagonal tiles. Where the poll density is high, the confidence band narrows, and where poll density is low, the band expands.

A simple way to interpret the confidence band is to find the country's wealth on the horizontal axis, and look at the range of life satisfaction rating for that value of wealth. Now pick any number between the range, and imagine that you've just conducted a survey and computed the average rating. That number you picked is a possible survey result, and thus a valid value. (For those who know some probability, you should pick a number not at random within the range but in accordance with a Bell curve, meaning picking a number closer to the fitted line with much higher probability than a number at either edge.)

Visualizing data involves a series of choices. For this dataset, one such choice is displaying data density or uncertainty or neither.

This chart shows why the PR agency for the UK government deserves a Covid-19 bonus

The Economist illustrated some interesting consumer research with this chart (link):


The survey by Dalia Research asked people about the satisfaction with their country's response to the coronavirus crisis. The results are reduced to the "Top 2 Boxes", the proportion of people who rated their government response as "very well" or "somewhat well".

This dimension is laid out along the horizontal axis. The chart is a combo dot and bubble chart, arranged in rows by region of the world. Now what does the bubble size indicate?

It took me a while to find the legend as I was expecting it either in the header or the footer of the graphic. A larger bubble depicts a higher cumulative number of deaths up to June 15, 2020.

The key issue is the correlation between a country's death count and the people's evaluation of the government response.

Bivariate correlation is typically shown on a scatter plot. The following chart sets out the scatter plots in a small multiples format with each panel displaying a region of the world.


The death tolls in the Asian countries are low relative to the other regions, and yet the people's ratings vary widely. In particular, the Japanese people are pretty hard on their government.

In Europe, the people of Greece, Netherlands and Germany think highly of their government responses, which have suppressed deaths. The French, Spaniards and Italians are understandably unhappy. The British appears to be the most forgiving of their government, despite suffering a higher death toll than France, Spain or Italy. This speaks well of their PR operation.

Cumulative deaths should be adjusted by population size for a proper comparison across nations. When the same graphic is produced using deaths per million (shown on the right below), the general story is preserved while the pattern is clarified:


The right chart shows deaths per million while the left chart shows total deaths.


In the original Economist chart, what catches our attention first is the bubble size. Eventually, we notice the horizontal positioning of these bubbles. But the star of this chart ought to be the new survey data. I swapped those variables and obtained the following graphic:


Instead of using bubble size, I switched to using color to illustrate the deaths-per-million metric. If ratings of the pandemic response correlate tightly with deaths per million, then we expect the color of these dots to evolve from blue on the left side to red on the right side.

The peculiar loss of correlation in the U.K. stands out. Their PR firm deserves a bonus!

Taking small steps to bring out the message

Happy new year! Good luck and best wishes!


We'll start 2020 with something lighter. On a recent flight, I saw a chart in The Economist that shows the proportion of operating income derived from overseas markets by major grocery chains - the headline said that some of these chains are withdrawing from international markets.


The designer used one color for each grocery chain, and two shades within each color. The legend describes the shades as "total" and "of which: overseas". As with all stacked bar charts, it's a bit confusing where to find the data. The "total" is actually the entire bar, not just the darker shaded part. The darker shaded part is better labeled "home market" as shown below:


The designer's instinct to bring out the importance of international markets to each company's income is well placed. A second small edit helps: plot the international income amounts first, so they line up with the vertical zero axis. Like this:


This is essentially the same chart. The order of international and home market is reversed. I also reversed the shading, so that the international share of income is displayed darker. This shading draws the readers' attention to the key message of the chart.

A stacked bar chart of the absolute dollar amounts is not ideal for showing proportions, because each bar is a different length. Sometimes, plotting relative values summing to 100% for each company may work better.

As it stands, the chart above calls attention to a different message: that Walmart dwarfs the other three global chains. Just the international income of Walmart is larger than the total income of Costco.


Please comment below or write me directly if you have ideas for this blog as we enter a new decade. What do you want to see more of? less of?

The rule governing which variable to put on which axis, served a la mode

When making a scatter plot, the two variables should not be placed arbitrarily. There is a rule governing this: the outcome variable should be shown on the vertical axis (also called y-axis), and the explanatory variable on the horizontal (or x-) axis.

This chart from the archives of the Economist has this reversed:


The title of the accompanying article is "Ice Cream and IQ"...

In a Trifecta Checkup (link), it's a Type DV chart. It's preposterous to claim eating ice cream makes one smarter without more careful studies. The chart also carries the xyopia fallacy: by showing just two variables, readers are unwittingly led to explain differences in "IQ" using differences in per-capita ice-cream consumption when lots of other stronger variables will explain any gaps in IQ.

In this post, I put aside my objections to the analysis, and focus on the issue of assigning variables to axes. Notice that this chart reverses the convention: the outcome variable (IQ) is shown on the horizontal, and the explanatory variable (ice cream) is shown on the vertical.

Here is a reconstruction of the above chart, showing only the dots that were labeled with country names. I fitted a straight regression line instead of a curve. (I don't understand why the red line in the original chart bends upwards when the data for Japan, South Korea, Singapore and Hong Kong should be dragging it down.)


Note that the interpretation of the regression line raises eyebrows because the presumed causality is reversed. For each 50 points increase in PISA score (IQ), this line says to expect ice cream consumption to raise by about 1-2 liters per person per year. So higher IQ makes people eat more ice cream.


If the convention is respected, then the following scatter plot results:


The first thing to note is that the regression analysis is different here from that shown in the previous chart. The blue regression line is not equivalent to the black regression line from the previous chart. You cannot reverse the roles of the x and y variables in a regression analysis, and so neither should you reverse the roles of the x and y variables in a scatter plot.

The blue regression line can be interpreted as having two sections, roughly, for countries consuming more than or less than 6 liters of ice cream per person per year. In the less-ice-cream countries, the correlation between ice cream and IQ is stronger (I don't endorse the causal interpretation of this statement).


When you make a scatter plot, you have two variables for which you want to analyze their correlation. In most cases, you are exploring a cause-effect relationship.

Higher income households cares more on politics.
Less educated citizens are more likely to not register to vote.
Companies with more diverse workforce has better business performance.

Frequently, the reverse correlation does not admit a causal interpretation:

Caring more about politics does not make one richer.
Not registering to vote does not make one less educated.
Making more profits does not lead to more diversity in hiring.

In each of these examples, it's clear that one variable is the outcome, the other variable is the explanatory factor. Always put the outcome in the vertical axis, and the explanation in the horizontal axis.

The justification is scientific. If you are going to add a regression line (what Excel calls a "trendline"), you must follow this convention, otherwise, your regression analysis will yield the wrong result, with an absurd interpretation!


[PS. 11/3/2019: The comments below contain different theories that link the two variables, including theories that treat PISA score ("IQ") as the explanatory variable and ice cream consumption as the outcome. Also, I elaborated that the rule does not dictate which variable is the outcome - the designer effectively signals to the reader which variable is regarded as the outcome by placing it in the vertical axis.]

Pulling the multi-national story out, step by step

Reader Aleksander B. found this Economist chart difficult to understand.


Given the chart title, the reader is looking for a story about multinationals producing lower return on equity than local firms. The first item displayed indicates that multinationals out-performed local firms in the technology sector.

The pie charts on the right column provide additional information about the share of each sector by the type of firms. Is there a correlation between the share of multinationals, and their performance differential relative to local firms?


We can clean up the presentation. The first changes include using dots in place of pipes, removing the vertical gridlines, and pushing the zero line to the background:


The horizontal gridlines attached to the zero line can also be removed:


Now, we re-order the rows. Start with the aggregate "All sectors". Then, order sectors from the largest under-performance by multinationals to the smallest.


The pie charts focus only on the share of multinationals. Taking away the remainders speeds up our perception:


Help the reader understand the data by dividing the sectors into groups, organized by the performance differential:


For what it's worth, re-sort the sectors from largest to smallest share of multinationals:


Having created groups of sectors by share of multinationals, I simplify further by showing the average pie chart within each group:



To recap all the edits, here is an animated gif: (if it doesn't play automatically, click on it)



Judging from the last graphic, I am not sure there is much correlation between share of multinationals and the performance differentials. It's interesting that in aggregate, local firms and multinationals performed the same. The average hides the variability by sector: in some sectors, local firms out-performed multinationals, as the original chart title asserted.

Five steps to let the young ones shine

Knife stabbings are in the news in the U.K. and the Economist has a quartet of charts to illustrate what's going on.


I'm going to focus on the chart on the bottom right. This shows the trend in hospital admissions due to stabbings in England from 2000 to 2018. The three lines show all ages, and two specific age groups: under 16 and 16-18.

The first edit I made was to spell out all years in four digits. For this chart, numbers like 15 and 18 can be confused with ages.


The next edit corrects an error in the subtitle. The reference year is not 2010 as those three lines don't cross 100. It appears that the reference year is 2000. Another reason to use four-digit years on the horizontal axis is to be consistent with the subtitle.


The next edit removes the black dot which draws attention to itself. The chart though is not about the year 2000, which has the least information since all data have been forced to 100.


The next edit makes the vertical axis easier to interpret. The indices 150, 200, are much better stated as + 50%, + 100%. The red line can be labeled "at 2000 level". One can even remove the subtitle 2000=100 if desired.


Finally, I surmise the message the designer wants to get across is the above-average jump in hospital admissions among children under 16 and 16 to 18. Therefore, the "All" line exists to provide context. Thus, I made it a dashed line pushing it to the background.





Big Macs in Switzerland are amazing, according to my friend

Bigmac_chNote for those in or near Zurich: I'm giving a Keynote Speech tomorrow morning at the Swiss Statistics Meeting (link). Here is the abstract:

The best and the worst of data visualization share something in common: these graphics provoke emotions. In this talk, I connect the emotional response of readers of data graphics to the design choices made by their creators. Using a plethora of examples, collected over a dozen years of writing online dataviz criticism, I discuss how some design choices generate negative emotions such as confusion and disbelief while other choices elicit positive feelings including pleasure and eureka. Important design choices include how much data to show; which data to highlight, hide or smudge; what research question to address; whether to introduce imagery, or playfulness; and so on. Examples extend from graphics in print, to online interactive graphics, to visual experiences in society.


The Big Mac index seems to never want to go away. Here is the latest graphic from the Economist, saying what it says:


The index never made much sense to me. I'm in Switzerland, and everything here is expensive. My friend, who is a U.S. transplant, seems to have adopted McDonald's as his main eating-out venue. Online reviews indicate that the quality of the burger served in Switzerland is much better than the same thing in the States. So, part of the price differential can be explained by quality. The index also confounds several other issues, such as local inflation and exchange rate

Now, on to the data visualization, which is primarily an exercise in rolling one's eyeballs. In order to understand the red and blue line segments, our eyes have to hop over the price bubbles to the top of the page. Then, in order to understand the vertical axis labels, unconventionally placed on the right side, our eyes have to zoom over to the left of the page, and search for the line below the header of the graph. Next, if we want to know about a particular country, our eyes must turn sideways and scan from bottom up.

Here is a different take on the same data:


I transformed the data as I don't find it compelling to learn that Russian Big Macs are 60% less than American Big Macs. Instead, on my chart, the reader learns that the price paid for a U.S. Big Mac will buy him/her almost 2 and a half Big Macs in Russia.

The arrows pointing left indicate that in most countries, the values of their currencies are declining relative to the dollar from 2017 to 2018 (at least by the Big Mac Index point of view). The only exception is Turkey, where in 2018, one can buy more Big Macs equivalent to the price paid for one U.S. Big Mac. compared to 2017.

The decimal differences are immaterial so I have grouped the countries by half Big Macs.

This example demonstrates yet again, to make good data visualization, one has to describe an interesting question, make appropriate transformations of the data, and then choose the right visual form. I describe this framework as the Trifecta - a guide to it is here.

(P.S. I noticed that Bitly just decided unilaterally to deactivate my customized Bitly link that was configured years and years ago, when it switched design (?). So I had to re-create the custom link. I have never grasped  why "unreliability" is a feature of the offering by most Tech companies.)

Is the chart answering your question? Excavating the excremental growth map

Economist_excrement_growthSan Franciscans are fed up with excremental growth. Understandably.

Here is how the Economist sees it - geographically speaking.


In the Trifecta Checkup analysis, one of the questions to ask is "What does the visual say?" and with respect to the question being asked.

The question is how much has the problem of human waste in SF grew from 2011 to 2017.

What does the visual say?

The number of complaints about human waste has increased from 2011 to 2014 to 2017.

The areas where there are complaints about human waste expanded.

The worst areas are around downtown, and that has not changed during this period of time.


Now, what does the visual not say?

Let's make a list:

  • How many complaints are there in total in any year?
  • How many complaints are there in each neighborhood in any year?
  • What's the growth rate in number of complaints, absolute or relative?
  • What proportion of complaints are found in the worst neighborhoods?
  • What proportion of the area is covered by the green dots on each map?
  • What's the growth in terms of proportion of areas covered by the green dots?
  • Does the density of green dots reflect density of human waste or density of human beings?
  • Does no green dot indicate no complaints or below the threshold of the color scale?

There's more:

  • Is the growth in complaints a result of more reporting or more human waste?
  • Is each complainant unique? Or do some people complain multiple times?
  • Does each piece of human waste lead to one and only one complaint? In other words, what is the relationship between the count of complaints and the count of human waste?
  • Is it easy to distinguish between human waste and animal waste?

And more:

  • Are all complaints about human waste valid? Does anyone verify complaints?
  • Are the plotted locations describing where the human waste is or where the complaint was made?
  • Can all complaints be treated identically as a count of one?
  • What is the per-capita rate of complaints?

In other words, the set of maps provides almost all no information about the excrement problem in San Francisco.

After you finish working, go back and ask what the visual is saying about the question you're trying to address!


As a reference, I found this map of the population density in San Francisco (link):



Digital revolution in China: two visual takes

The following map accompanied an article in the Economist about China's drive to create a "digital silkroad," roughly defined as making a Silicon Valley. 


The two variables plotted are the wealth of each province (measured by GDP per capita) and the level of Internet penetration. The designer made the following choices:

  • GDP per capita is presented with less precision than Internet penetration. The former is grouped into five large categories while the latter is given as a percentage to one decimal place.
  • The visual design favors GDP per capita which is encoded as the shade of color of each province. The Internet penetration data appeared added on as an afterthought.

If we apply the self-sufficiency test (i.e. by removing the printed data from the chart), it's immediately clear that the visual elements convey zero information about Internet penetration. This is a serious problem for a chart about the "digital silkroad"!


If those two variables are chosen, it would seem appropriate to convey to readers the correlation between the two variables. The following sketch is focused on surfacing the correlation.


(Click on the image to see it in full.) Here is the top of the graphic:


The individual maps are not strictly necessary. Just placing provincial names onto the grid is enough, because regional pattern isn't salient here.

The Internet penetration data were grouped into five categories as well, putting it on equal footing as GDP per capita.