On data volume, reliability, uncertainty and confidence bands
Jul 22, 2020
This chart from the Economist caught my eye because of the unusual use of color-coded hexagonal tiles.
The basic design of the chart is easy to grasp: It relates people's "happiness" to national wealth. The thick black line shows that the average citizen of wealthier countries tends to rate their current life situation better.
For readers alert to graphical details, things can get a little confusing. The horizontal "wealth" axis is shown in log scale, which means that the data on the right side of the chart have been compressed while the data on the left side of the chart have been stretched out. In other words, the curve in linear scale is much flatter than depicted.
One thing you might notice is how poor the fit of the line is at both ends. Singapore and Afghanistan are clearly not explained by the fitted line. (That said, the line is based on many more dots than those eight we can see.) Moreover, because countries are widely spread out on the high end of the wealth axis, the fit is not impressive. Log scales tend to give a false impression of the tightness of fit, as I explained before when discussing coronavirus case curves.
The hexagonal tiles replace the more typical dot scatter or contour shading. The raw data consist of results from polls conducted in different countries in different years. For each poll, the analyst computes the average life satisfaction score for that country in that year. From national statistics, the analyst pulls out that country's GDP per capita in that year. Thus, each data point is a dot on the canvass. A few data points are shown as black dots. Those are for eight highlighted countries for the year 2018.
The black line is fitted to the underlying dot scatter and summarizes the correlation between average wealth and average life satisfaction. Instead of showing the scatter, this Economist design aggregates nearby dots into hexagons. The deepest red hexagon, sandwiched between Finland and the US, contains about 60-70 dots, according to the color legend.
These details are tough to take in. It's not clear which dots have been collected into that hexagon: are they all Finland or the U.S. in various years, or do they include other countries? Each country is represented by multiple dots, one for each poll year. It's also not clear how much variation there exists within a country across years.
The hexagonal tiles presumably serve the same role as a dot scatter or contour shading. They convey the amount of data supporting the fitted curve along its trajectory. More data confers more reliability.
For this chart, the hexagonal tiles do not add any value. The deepest red regions are those closest to the black line so nothing is actually lost by showing just the line and not the tiles.
Using the line chart obviates the need for readers to figure out the hexagons, the polls, the aggregation, and the inevitable unanswered questions.
An alternative concept is to show the "confidence band" or "error bar" around the black line. These bars display the uncertainty of the data. The wider the band, the less certain the analyst is of the estimate. Typically, the band expands near the edges where we have less data.
Here is conceptually what we should see (I don't have the underlying dataset so can't compute the confidence band precisely)
The confidence band picture is the mirror image of the hexagonal tiles. Where the poll density is high, the confidence band narrows, and where poll density is low, the band expands.
A simple way to interpret the confidence band is to find the country's wealth on the horizontal axis, and look at the range of life satisfaction rating for that value of wealth. Now pick any number between the range, and imagine that you've just conducted a survey and computed the average rating. That number you picked is a possible survey result, and thus a valid value. (For those who know some probability, you should pick a number not at random within the range but in accordance with a Bell curve, meaning picking a number closer to the fitted line with much higher probability than a number at either edge.)
Visualizing data involves a series of choices. For this dataset, one such choice is displaying data density or uncertainty or neither.
You can follow this conversation by subscribing to the comment feed for this post.