Messing with expectations

A co-worker sent me to the following map, found in Forbes:


It shows the amount of state tax surcharge per gallon of gas in the U.S. And it's got one of the most common issues found in choropleth maps - the color scheme runs opposite to reader expectations.

Typically, if we see a red-green color scale, we would expect red to represent large numbers and green, small numbers. This map reverses the typical setup: California, the state with the heftiest gas tax, is shown green.

I know, I know - if we apply the typical color scheme, California would bleed red, and it's a blue state, damn it.

The solution is to avoid the red color. Just don't use red or blue.


There is no need to use two colors either.
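A minimal sketch of what a one-color scale could look like, assuming Python and two endpoint shades of green that I picked purely for illustration (they are not taken from the map). Larger values get darker shades of the same hue, so no red or blue is needed.

```python
def single_hue_shade(value, lo, hi, light=(237, 248, 233), dark=(0, 90, 50)):
    """Map a value in [lo, hi] to a hex color between a light and a dark shade of one hue."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    # Clamp to the range, then normalize to 0..1 so larger values get darker shades.
    t = (min(max(value, lo), hi) - lo) / (hi - lo)
    rgb = [round(l + t * (d - l)) for l, d in zip(light, dark)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)
```

The lowest tax rate would get the lightest shade and the highest the darkest, which matches the usual reader expectation that darker means more.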


A few minor fixes. Given that all dollar amounts on the map are shown to two decimal places, the legend labels should also be shown to two decimal places, and with dollar signs.


The subtitle should read "Dollars per gallon" instead of "Cents per gallon". Alternatively, keep "Cents per gallon" but convert all data labels into cents.

Some of the states are missing data labels.


I recast this as a small-multiples by categorizing states into four subgroups.


With this change, one can almost justify using maps because there is sort of a spatial pattern.



An elaborate data vessel

I recently came across the following dataviz showing global oil production (link).

This is an ambitious graphic that addresses several questions of composition.

The raw data show the amount of production by country adding up to the global total. The countries are then grouped by region. Further, the graph presents an oil-and-gas specific grouping, as indicated by the legend shown just below the chart title. This grouping is indicated by the color of the circumference of the circle containing the flag of the country.

This chart form is popular in modern online graphics programs. It is like an elaborate data vessel. Because the countries are lined up around the barrel, a space has been created on three sides to admit labels and text annotations. This is a strength of this chart form.


The chart conveys little information about the underlying data. Each country is given a unique, oddly shaped polygon, making it impossible to compare sizes. It’s definitely possible to pick out the U.S., Russia, and Saudi Arabia as the top producers. But in presenting the ranks of the data, this chart form pales in comparison to a straightforward data table, or a bar chart. The less said about presenting values, the better.

Indeed, our self-sufficiency test exposes the inability of these polygons to convey the data. This is precisely why almost all values of the dataset are present on the chart.


The dataviz subtly presumes some knowledge on the part of the readers.

The regions are not directly labeled. The readers must know that Saudi Arabia is in the Middle East, the U.S. is part of North America, etc. Admittedly this is not a big ask, but it is an ask.

It is also assumed that readers know their flags, especially those of smaller countries. Some of the small polygons have no space left for country names and they are labeled with just flags.


In addition, knowing country acronyms is required for smaller countries as well. For example, in Africa, we find AGO, COG and GAB.


For this chart form the designer treats each country according to the space it has on the chart (except those countries that found themselves on the edges of the barrel). Font sizes, icons, labels, acronyms, data labels, etc. vary.

The readers are assumed to know the significance of OPEC and OPEC+. This grouping plays second fiddle, and can be found via the color of the circumference of the flag icons.


I would not have assigned a color to the non-OPEC countries, and would have used just the yellow and blue for OPEC and OPEC+. This is a small edit but it makes the search for the edges more efficient.



Let’s now return to the perception of composition.

In exactly the same manner as individual countries, the larger regions are represented by polygons that have arbitrary shapes. One can strain to compile the rank order of regions but it’s impossible to compare the relative values of production across regions. Perhaps this explains the presence of another chart at the bottom that addresses this regional comparison.

The situation is worse for the OPEC/OPEC+ grouping. Now, the readers must find all flag icons with edges of a specific color, then mentally piece together these arbitrarily shaped polygons, then realize that they won’t fit together nicely, and so mentally morph the shapes in an area-preserving manner, in order to complete this puzzle.

This is why I said earlier this is an elaborate data vessel. It’s nice to look at but it doesn’t convey information about composition as readers might expect it to.


What is the question is the question

I picked up a Fortune magazine while traveling, and saw this bag of bubbles chart.


This chart is visually appealing, that must be said. Each circle represents the combined reported revenues of the corporations on the “Global 500 Companies” list that are headquartered in one city, and it is labeled by that city. The largest bubble shows Beijing, the capital of China, indicating that companies based in Beijing count $6 trillion of revenues amongst them. The color of the bubbles shows large geographical units; the red bubbles are cities in Greater China.

I appreciate a couple of the design decisions. The chart title and legend are placed on the top, making it easy to find one’s bearings – effective while non-intrusive. The labeling signals a layering: the first and biggest group has icons; the second biggest group has both name and value inside the bubbles; the third group has values inside the bubbles but names outside; the smallest group contains no labels.

Note the judgement call the designer made. For cities that readers might not be familiar with, a country name (typically abbreviated) is added. This is a tough call since mileage varies.


As I discussed before (link), the bag of bubbles does not elevate comprehension. Just try answering any of the following questions, which any of us may have, using just the bag of bubbles:

  • What proportion of the total revenues are found in Beijing?
  • What proportion of the total revenues are found in Greater China?
  • What are the top 5 cities in Greater China?
  • What are the ranks of the six regions?

If we apply the self-sufficiency test and remove all the value labels, it’s even harder to figure out what’s what.
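By contrast, questions like these are easy to answer from the underlying table. Here is a rough sketch in Python; the rows are hypothetical placeholders (only the Beijing figure of about $6 trillion comes from the chart), and `share_by` is a helper name I made up.

```python
from collections import defaultdict

# Hypothetical rows: (city, region, revenues in $ trillions).
# Only the Beijing value is from the chart; the rest are made up.
rows = [
    ("Beijing", "Greater China", 6.0),
    ("City B", "Greater China", 1.0),
    ("City C", "North America", 2.0),
    ("City D", "Europe", 1.0),
]

def share_by(rows, key_index):
    """Return {key: share of total revenues}, ordered from largest to smallest."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_index]] += row[2]
    grand = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return {k: v / grand for k, v in ranked}

city_share = share_by(rows, 0)    # proportion of revenues by city
region_share = share_by(rows, 1)  # proportion by region, already in rank order
```

Proportions and ranks fall out of one aggregation, which is exactly the information the bag of bubbles makes readers work for.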



Moving to the D corner of the Trifecta Checkup, we aren’t sure how to interpret this dataset. It’s unclear if these companies derive most of their revenues locally, or internationally. A company headquartered in Washington D.C. may earn most of its revenues in other places. Even if Beijing-based companies serve mostly Chinese customers, only a minority of revenues would be directly drawn from Beijing. Some U.S. corporations may choose their headquarters based on tax considerations. It’s a bit misleading to assign all revenues to one city.

As we explore this further, it becomes clear that the designer must establish a target – a strong idea of what question s/he wants to address. The Fortune piece comes with a paragraph. It appears that an important story is the spatial dispersion of corporate revenues in different countries. They point out that U.S. corporate HQs are more distributed geographically than Chinese corporate HQs, which tend to be found in the key cities.

There is a disconnect between the Question and the Data used to create the visualization. There is also a disconnect between the Question and the Visual display.

Partition of Europe

A long-time reader sent me the following map via twitter:


This map tells how the major political groups divide up the European Parliament. I’ll spare you the counting. There are 27 countries, and nine political groups (including the "unaffiliated").

The key chart type is a box of dots. Each country gets its own box. Each box has its own width. What determines the width? If you ask me, it’s the relative span of the countries on the map. For example, the narrow countries like Ireland and Portugal have three dots across while the wider countries like Spain, Germany and Italy have 7, 10 and 8 dots across respectively.

Each dot represents one seat in the Parliament. Each dot has one of 9 possible colors. Each color shows a political lean e.g. the green dots represent Green parties while the maroon dots display “Left” parties.

The end result is a counting game. If we are interested in counts of seats, we have to literally count each dot. If we are interested in proportion of seats, pick your poison: either eyeball it or count each color and count the total.

Who does the underlying map serve? Only readers who know the map of Europe. If you don’t know where Hungary or Latvia is, good luck. The physical constraints of the map work against the small-multiples setup of the data. In a small-multiples display, you want each chart to be identical, except for the country-specific data. The small-multiples structure requires a panel of equal-sized cells. The map does not offer this feature, as many small countries are crammed into Eastern Europe. Also, Europe has a few tiny states e.g. Luxembourg (population 660K) and Malta (population 520K). To overcome the map, the designer produces boxes of different sizes, substantially loading up the cognitive burden on readers.

The map also dictates where the boxes are situated. The centroids of each country form the scaffolding, with adjustments required when the charts overlap. This restriction ensures a disorderly appearance. By contrast, the regular panel layout of a small multiples facilitates comparisons.


Here is something I sketched using a tile map.


First, I have to create a tile map of European countries. Some parts, e.g. western part, are straightforward. The eastern side becomes very congested.

The tile map encodes location in an imprecise sense. Think about the scaffolding of centroids of countries referred to prior. The tile map imposes an order to the madness - we're shifting these centroids so that they line up in a tidier pattern. What we gain in comparability we concede in location precision.

For the EU tile map, I decided to show the Baltic countries in a row rather than a column; the latter would have been more faithful to the true geography. Malta is shown next to Italy even though it could have been placed below. Similarly, Cyprus in relation to Greece. I also included several key countries that are not part of the EU for context.

Instead of raw seat counts, I'm showing the proportion of seats within each country claimed by each political group. I think this metric is more useful to readers.
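The conversion from seat counts to within-country proportions is simple; here is a minimal Python sketch. The counts and the function name `seat_shares` are hypothetical, used only to show the arithmetic.

```python
def seat_shares(seats_by_group):
    """Convert {group: seat count} for one country into {group: proportion of that country's seats}."""
    total = sum(seats_by_group.values())
    if total == 0:
        return {g: 0.0 for g in seats_by_group}
    return {g: n / total for g, n in seats_by_group.items()}

# Hypothetical counts for one country, not taken from the map.
example = {"Greens": 3, "Left": 1, "Unaffiliated": 4}
shares = seat_shares(example)
```

Because every country's shares add up to one, each tile can be drawn at the same size regardless of how many seats the country holds.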

The legend is itself a chart that shows the aggregate statistics for all 27 countries.

Visual story-telling: do you know or do you think?

One of the most important data questions of all time is: do you know? or do you think?

And one of the easiest traps to fall into is: I think, therefore I know.


Visual story-telling can be great but it can also mislead. Deception sometimes happens when readers are nudged to "fill in the blanks" with stuff they think they know, but they don't.

A Twitter reader asked me to look at the map in this Los Angeles Times (paywall) opinion column.


The column promptly announces its premise:

Years of widening economic inequality, compounded by the pandemic and political storm and stress, have given Americans the impression that the country is on the wrong track. Now there’s empirical data to show just how far the country has run off the rails: Life expectancies have been falling.

The writer creates the expectation that he will reveal evidence in the form of data to show that life expectancies have been driven down by economic inequality, pandemic, and politics. Does he succeed?


The map portrays average life expectancy (at birth) for some mysterious, presumably very recent, year for every county in the United States. From the color legend, we learn that the bottom-to-top range is about 20 years. There is a clear spatial pattern, with the worst results in the south (excepting south Florida).

The choice of colors is telling. Red and blue on a U.S. map have heavy baggage, as they signify the two main political parties in the country. Given that the author believes politics to be a key driver of health outcomes, the usage of red and blue here is deliberate. Throughout the article, the columnist connects the lower life expectancies in southern states to their politics.

For example, he said "these geographical disparities aren't artifacts of pure geography or demographics; they're the consequences of policy decisions at the state level... Of the 20 states with the worst life expectancies, eight are among the 12 that have not implemented Medicaid expansion under the Affordable Care Act..."

Casual readers may fall into a trap here. There is nothing on the map itself that draws the connection between politics and life expectancies; the idea is evoked purely through the red-blue color scheme. So, as readers, we are filling in the blanks with our own politics.

What could have been done instead? Let's look at the life expectancy map side by side with the map of the U.S. 2020 Presidential election.


Because of how close recent elections have been, we may think the political map has a nice balance of red and blue but it doesn't. The Democrats' votes are heavily concentrated in densely-populated cities so most of the Presidential election map is red. When placed next to each other, it's obvious that politics don't explain the variance in life expectancy well. The Midwest is deep red and yet it has above-average life expectancies. I have circled various regions that contradict the claim that Republican politics drove life expectancies down.

It's not sufficient to point to the South, in which Republican votes and life expectancy are indeed inversely correlated. A good theory has to explain most of the country.


The columnist also suggests that poverty is the cause of low life expectancy. That too cannot be gleaned from the published map. Again, readers are nudged to use their wild imagination to fill in the blank.

Data come to the rescue. Here is a side-by-side comparison of the map of life expectancies and the map of median incomes.


A similar conundrum. While the story feels right in the South, it fails to explain the northwest, Florida, and various other parts of the country. Take a look again at the circled areas. Lower income brackets are also sometimes associated with high life expectancies.


The author supplies a third cause of lower life expectancies: Covid-19 response. Because Covid-19 was the "most obvious and convenient" explanation for the loss of life expectancy during the pandemic, this theory suggests that the red areas on the life expectancy map should correspond to the regions most ravaged by Covid-19.

Let's see the data.


The map on the right shows the number of confirmed cases until June 2021. As before, the correlation holds somewhat in the South but there are notable exceptions, e.g. the Midwest. We also have states with low Covid-19 cases but below-average life expectancy.


What caused the decline of life expectancy in the U.S. - which began before the pandemic, and has continued beyond - is highly complex, beyond what a single map or a pair of maps or a few pairs of maps could convey. Showing a red-blue map presents a trap for readers to fall into, in which they start thinking, without knowing.


Graph workflow and defaults wreak havoc

For the past week or 10 days, every time I visited one news site, it insisted on showing me an article about precipitation in North Platte. It's baiting me to write a post about this lamentable bar chart (link):



This chart has problems, and the problems start with the tooling, which dictates a workflow.

I imagine what the chart designer had to deal with.

For a bar chart, the tool requires one data series to be numeric, and the other to be categorical. A four-digit year is a number, which can be treated either as numeric or categorical. In most cases, and by default, numbers are considered numeric. To make this chart, the user asked the tool to treat years as categorical.


Many tools treat categories as distinct entities ("nominal"), mapping each category to a distinct color. So they have 11 colors for 11 years, which is surely excessive.

This happens because the year data is not truly categorical. These eleven years were picked based on the amount of rainfall. There isn't a single year with two values, it's not even possible. The years are just irregularly spaced indices. Nevertheless, the tool misbehaves if the year data are regarded as numeric. (It automatically selects a time-series line chart, because someone's data visualization flowchart says so.) Mis-specification in order to trick the tool has consequences.

The designer's intention is to compare the current year 2023 to the driest years in history. This is obvious from the subtitle in which 2023 is isolated and its purple color is foregrounded.


How unfortunate then that among the 11 colors, this tool grabbed 4 variations of purple! I like to think that the designer wanted to keep 2023 purple, and turn the other bars gray -- but the tool thwarted this effort.
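What I imagine the designer wanted can be sketched in a few lines of Python: one highlight color for the focus year and gray for everything else. The hex codes, the subset of years and the function name are my own assumptions, not anything from the tool or the chart.

```python
def highlight_colors(years, focus, focus_color="#7b3fa0", other_color="#bbbbbb"):
    """One color per bar: the focus year stands out, every other year is gray."""
    return [focus_color if y == focus else other_color for y in years]

# Years are kept as strings so a plotting tool treats them as labels, not as a numeric axis.
years = ["2023", "2012", "2002"]  # hypothetical subset; the chart shows 11 years
colors = highlight_colors(years, focus="2023")
```

With colors assigned this way, a legend has nothing left to explain, since color encodes only "this year" versus "all other years".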


The tool does other offensive things. By default, it makes a legend for categorical data. I like the placement of the legend right beneath the title, a recognition that on most charts, the reader must look at the legend first to comprehend what's on the chart.

Not so in this case. The legend is entirely redundant. Removing the legend does not affect our cognition one bit. That's because the colors encode nothing.

Worse, the legend sows confusion because it presents the same set of years in chronological order while the bars below are sorted by amount of precipitation: thus, the order of colors in the legend differs from that in the bar chart.


I can imagine the frustration of the designer who finds out that the tool offers no option to delete the legend. (I don't know this particular tool but I have encountered tools that are rigid in this manner.)


Something else went wrong. What's the variable being plotted on the numeric (horizontal) axis?

The answer is inches of rainfall but the answer is actually not found anywhere on the chart. How is it possible that a graphing tool does not indicate the variables being plotted?

I imagine the workflow like this: the tool by default puts an axis label which uses the name of the column that holds the data. That column may have a name that is not reader-friendly, e.g. PRECIP. The designer edits the name to "Rainfall in inches". Being a fan of the Economist graphics style, they move the axis label to the chart title area.

The designer now works the chart title. The title is made to spell out the story, which is that North Platte is experiencing a historically dry year. Instead of mentioning rainfall, the new title emphasizes the lack thereof.

The individual steps of this workflow make a lot of sense. It's great that the title is informative, and tells the story. It's great that the axis label was fixed to describe rainfall in words not database-speak. But the end result is a confusing mess.

The reader must now infer that the values being plotted are inches of rainfall.

Further, the tool also imposes a default sorting of the bars. The bars run from longest to shortest, in this case, the longest bar has the most rainfall. After reading the title, our expectation is to find data on the Top 11 driest years, from the driest of the driest to the least dry of the driest. But what we encounter is the opposite order.


Most graphics software behaves like this because it plots the ranks of the categories, with the driest being rank 1, counting up. Because the vertical axis moves upwards from zero, the top-ranked item ends up at the bottom of the chart.
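The ordering problem can be sketched in Python as follows. The values are made up purely to show the ordering, and the `axis_starts_at_bottom` flag is a hypothetical stand-in for however a given tool lays out its category axis.

```python
def driest_first(records):
    """Sort (year, inches) pairs so the driest year comes first."""
    return sorted(records, key=lambda r: r[1])

def top_to_bottom(records, axis_starts_at_bottom=True):
    """Order to hand to a tool so that the driest bar is drawn at the top.

    Tools that place the first category at the bottom of a horizontal bar
    chart need the driest-first list reversed.
    """
    ordered = driest_first(records)
    return ordered[::-1] if axis_starts_at_bottom else ordered

# Made-up values, only to illustrate the ordering.
records = [("year A", 9.5), ("year B", 4.2), ("year C", 7.1)]
```

Here `driest_first(records)` puts "year B" first, while `top_to_bottom(records)` puts it last so that a bottom-up axis still draws it at the top.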


Moving now from the V corner to the D corner of the Trifecta checkup (link), I can't end this post without pointing out that the comparisons shown on the chart don't work. It's the first few months of 2023 versus the full years of the others.

The fix is to plot the same number of months for all years. This can be done in two ways: find the partial year data for the historical years, or project the 2023 data for the full year.
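The first option, cutting every year off at the same month, might look like this in Python. The monthly values are invented and the function names are hypothetical; the point is only that one cutoff is applied to all years.

```python
def year_to_date(monthly_inches, through_month):
    """Sum precipitation for January through `through_month` (1-12)."""
    if not 1 <= through_month <= 12:
        raise ValueError("through_month must be between 1 and 12")
    return sum(monthly_inches[:through_month])

def comparable_totals(monthly_by_year, through_month):
    """Apply the same cutoff month to every year so partial and full years line up."""
    return {year: year_to_date(m, through_month) for year, m in monthly_by_year.items()}

# Made-up monthly values; the current year only has three months so far.
monthly_by_year = {
    "past year": [1.0, 0.5, 0.5, 2.0, 3.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.5],
    "current year": [0.5, 0.5, 1.0],
}
totals = comparable_totals(monthly_by_year, through_month=3)
```

In this invented example the two years tie through March, even though the full-year total of the past year is much larger.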

(If the rainy season is already over, then the chart will look exactly the same at the end of 2023 as it is now. Then, I'd just add a note to explain this.)


Here is a version of the chart after doing away with unhelpful default settings:


Deconstructing graphics as an analysis tool in dataviz

One of the useful exercises I like to do with charts is to "deconstruct" them. (This amounts to a deeper version of the self-sufficiency test.)

Here is a chart stripped down to just the main visual elements.


The game is to guess what is the structure of the data given these visual elements.

I guessed the following:

  • The data has a top-level split into two groups
  • Within each group, the data is further split into 3 parts, corresponding to the 3 columns
  • With each part, there are a variable number of subparts, each of which is given a unique color
  • The color legend suggests that each group's data are split into 7 subparts, so I'm guessing that the 7 subparts are aggregated into 3 parts
  • The core chart form is a stacked column chart with absolute values so relative proportions within each column (part) are important
  • Comparing across columns is not supported because each column has its own total value
  • Comparing same-color blocks across the two groups is meaningful. It's easier to compare their absolute values but harder to compare the relative values (proportions of total)

If I knew that the two groups are time periods, I'd also guess that the group on the left is the earlier time period, and the one on the right is the later time period. In addition to the usual left-to-right convention for time series, the columns are getting taller going left to right. Many things (not all, obviously) grow over time.

The color choice is a bit confusing because if the subparts are what I think they are, then it makes more sense to use one color and different shades within each column.


The above guesses are a mixed bag. What one learns from the exercise is what cues readers are receiving from the visual structure.

Here is the same chart with key contextual information added back:


Now I see that the chart concerns revenues of a business over two years.

My guess on the direction of time was wrong. The more recent year is placed on the left, counter to convention. This entity therefore suffered a loss of revenues from 2017-8 to 2018-9.

The entity receives substantial government funding. In 2017-8, it has 1 dollar of government funds for every 2 dollars of revenues. In 2018-9, it's roughly 2 dollars of government funds for every 3 dollars of revenues. Thus, the ratio of government funding to revenues has increased.

On closer inspection, the 7 colors do not represent 7 components of this entity's funding. The categories listed in the color legend overlap.

It's rather confusing but I missed one very important feature of the chart in my first assessment: the three columns within each year group are nested. The second column breaks down revenues into 3 parts while the third column subdivides advertising revenues into two parts.

What we've found is that this design does not offer any visual cues to help readers understand how the three columns within a year-group relate to each other. Adding guiding lines or changing the color scheme helps.


Next, I add back the data labels:


The system of labeling can be described as: label everything that is not further broken down into parts on the chart.

Because of the nested structure, this means two of the column segments, which are the sums of subparts, are not labeled. This creates a very strange appearance: usually, the largest parts are split into subparts, so such a labeling system means the largest parts/subparts are not labeled while the smaller, less influential, subparts are labeled!

You may notice another oddity. The pink segment is well above $1 billion but it is roughly the size of the third column, which represents $250 million. Thus, these columns are not drawn to scale. What happened? Keep reading.


Here is the whole chart:


A twitter follower sent me this chart. Elon Musk has been feuding with the Canadian broadcaster CBC.

Notice the scale of the vertical axis. It has a discontinuity between $700 million and $1.7 billion. In other words, the two pink sections are artificially shortened. The erased section contains $1 billion (!) Notice that the erased section is larger than the visible section.

The focus of Musk's feud with CBC is on what proportion of the company's funds come from the government. On this chart, the only way to figure that out is to copy out the data and divide. It's roughly 1.2/1.7 = 70% approx.
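The copy-and-divide step, written out as a small Python sketch. The two inputs are the rounded figures quoted above, in billions of dollars; the function name is my own.

```python
def funding_share(government, total):
    """Proportion of total funds that comes from government."""
    if total <= 0:
        raise ValueError("total must be positive")
    return government / total

# Figures in $ billions, as read off the chart in the text above.
share = funding_share(1.2, 1.7)
print(f"{share:.0%}")  # prints 71%
```

That is in line with the "roughly 70%" figure; the chart itself never shows this number.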


The exercise of deconstructing graphics helps us understand what parts are doing what, and it also reveals what cues certain parts send to readers.

In better dataviz, every part of the chart is doing something useful, it's free of redundant parts that take up processing time for no reason, and the cues to readers move them towards the intended message, not away from it.


A couple of additional comments:

I'm not sure why old data was cited because in the most recent accounting report, the proportion of government funding was around 65%.

Source of funding is not a useful measure of pro- or anti-government bias, especially in a democracy where different parties lead the government at different times. There are plenty of mouthpiece media that do not apparently receive government funding.

Showing both absolute and relative values on the same chart 1

Visual Capitalist has a helpful overview on the "uninsured" deposits problem that has become the talking point of the recent banking crisis. Here is a snippet of the chart that you can see in full at this link:


This is in infographics style. It's a bar chart that shows the top X banks. Even though the headline says "by uninsured deposits", the sort order is really based on the proportion of deposits that are uninsured, i.e. residing in accounts that exceed $250K. They used a red color to highlight the two failed banks, both of which have at least 90% of deposits uninsured.

The right column provides further context: the total amounts of deposits, presented both as a list of numbers as well as a column of bubbles. As readers know, bubbles are not self-sufficient, and if the list of numbers were removed, the bubbles would lose most of their power of communication. Big, small, but how much smaller?

There are little nuggets of text in various corners that provide other information.

Overall, this is a pretty good one as far as infographics go.


I'd prefer to elevate information about the Too Big to Fail banks (which are hiding in plain sight). Addressing this surfaces the usual battle between relative and absolute values. While the smaller banks have some of the highest concentrations of uninsured deposits, each TBTF bank holds a multiple of the absolute dollars of uninsured deposits held by the smaller banks.

Here is a revised version:


The banks are still ordered in the same way by the proportions of uninsured value. The data being plotted are not the proportions but the actual deposit amounts. Thus, the three TBTF banks (Citibank, Chase and Bank of America) stand out of the crowd. Aside from Citibank, the other two have relatively moderate proportions of uninsured assets but the sizes of the red bars for any of these three dominate those of the smaller banks.

Notice that I added the gray segments, which portray the amount of deposits that are FDIC protected. I did this not just to show the relative sizes of the banks. Having the other part of the deposits allows readers to answer additional questions, such as which banks have the most insured deposits? They also visually present the relative proportions.
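The data preparation behind this kind of revision can be sketched in Python, assuming one has both the total deposits and the proportion uninsured for each bank. The bank names and numbers below are hypothetical, as is the function name.

```python
def deposit_segments(banks):
    """banks: list of (name, total_deposits, share_uninsured).

    Returns (name, uninsured, insured) tuples ordered by share_uninsured
    from highest to lowest, so bar lengths show dollars while the order
    still reflects proportions.
    """
    ordered = sorted(banks, key=lambda b: b[2], reverse=True)
    return [(name, total * share, total * (1 - share)) for name, total, share in ordered]

# Hypothetical banks, totals in $ billions.
banks = [("Big Bank", 1000.0, 0.5), ("Small Bank", 100.0, 0.9)]
segments = deposit_segments(banks)
```

In this made-up case the small bank sorts first because of its higher proportion, yet the big bank's uninsured segment is several times longer, which is the tension between relative and absolute values that the chart is meant to show.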


The most amazing part of this dataset is the amount of uninsured money. I'm trying to think who these account holders are. It would seem like a very small collection of people and/or businesses would be holding these accounts. If they are mostly businesses, is FDIC insurance designed to protect business deposits? If they are mostly personal accounts, then surely only very wealthy individuals hold most of these accounts.

In the above chart, I'm assuming that deposits and assets are referring to the same thing. This may not be the correct interpretation. Deposits may be only a portion of the assets. It would be strange though that the analysts only have the proportions but not the actual deposit amounts at these banks. Nevertheless, until proven otherwise, you should see my revision as a sketch - what you can do if you have both the total deposits and the proportions uninsured.

Bivariate choropleths

A reader submitted a link to Joshua Stevens's post about bivariate choropleths, which is the technical term for the map that FiveThirtyEight printed on abortion bans, discussed here. Joshua advocates greater usage of maps with two-dimensional color scales.

As a reminder, the fundamental building block is expressed in this bivariate color legend:


Counties are classified into one of these nine groups, based on low/middle/high ratings on two dimensions, distance and congestion.

The nine groups are given nine colors, built from superimposing shades of green and pink. All nine colors are printed on the same map.
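To make the classification step concrete, here is a rough Python sketch. I am assuming tercile cut points for the low/middle/high ratings; the actual map may use different thresholds, and all function names are hypothetical.

```python
def tercile_cuts(values):
    """Two cut points splitting the sorted values into three roughly equal groups."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered[n // 3], ordered[(2 * n) // 3]

def rating(value, cuts):
    """0 = low, 1 = middle, 2 = high."""
    low_cut, high_cut = cuts
    if value < low_cut:
        return 0
    return 1 if value < high_cut else 2

def bivariate_class(distance, congestion, distance_cuts, congestion_cuts):
    """Index 0..8 into a 3x3 color legend: row = distance rating, column = congestion rating."""
    return 3 * rating(distance, distance_cuts) + rating(congestion, congestion_cuts)
```

Each county ends up with one of nine indices, which is why the reader has to decode two ratings from a single color.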


Without a doubt, using these nine related colors is better than nine arbitrary colors. But is this a good data visualization?

Specifically, is the above map better than the pair of maps below?


The split map is produced by Josh to explain that the bivariate choropleth is just the superposition of two univariate choropleths. I much prefer the split map to the superimposed one.


Think about what the reader goes through when comparing two counties.


Superimposing the two univariate maps solves one problem: it removes the need to scan back and forth between two maps, looking for the same locations, something that is imprecise. (Unless the map is interactive, and highlighting one county highlights the same county in the other map.)

For me, that's a small price to pay for quicker translation of color into information.



Yet another off radar plot

Bloomberg compares people's lives in retirement in this interesting dataviz project (link, paywall). The "showcase" chart is a radar plot that looks like this:


The radar plot may count as the single chart type that has the greatest number of lives. I'm afraid this one does not go into the hall of fame, either.

The setup leading to this plot is excellent, though. The analytical framework is to divide the retirement period into two parts: healthy and not so healthy. The countries in the radar plot are in fact ordered by the duration of the "healthy retirement period", with France leading the pack. The reference level used throughout the article is the OECD average. On average, the OECD resident retires at age 64, and dies at age 82, so they spend 18 years in retirement, and 13 of them while "healthy".

In the radar plot, the three key dates are plotted as yellow, green and purple dots. The yellow represents the retirement age, the green, the end of the healthy period, and the purple, the end of life.

Now, take 10, 20, 30 seconds, and try to come up with a message for the above chart.

Not easy at all.


Notice the control panel up top. The male and female data are plotted separately. I place the two segments next to each other:


It's again hard to find any insight - other than the most obvious, which is that female life expectancy is higher.

But note that the order for the countries is different for each chart, and so even the above statement takes a bit of time to verify.


There are many structural challenges to using radar charts. I'll cover one of these here - the amount of non data-ink baggage that comes with using this chart form.

In the Bloomberg example, the baggage includes radial gridlines for countries, concentric gridlines for the years dimension, the country labels around the circle, the age labels in the middle, the color legend, the set of arrows that map to the healthy retirement period, and the country ranks (and little arrow) that indicate the direction of reading. That's a lot of information to process.

In the next post, I'll try a different visual form.