Visual story-telling: do you know or do you think?

One of the most important data questions of all time is: do you know? or do you think?

And one of the easiest traps to fall into is: I think, therefore I know.

***

Visual story-telling can be great but it can also mislead. Deception sometimes happens when readers are nudged to "fill in the blanks" with things they think they know but don't.

A Twitter reader asked me to look at the map in this Los Angeles Times (paywall) opinion column.

Latimes_lifeexpectancy_postcovid

The column promptly announces its premise:

Years of widening economic inequality, compounded by the pandemic and political storm and stress, have given Americans the impression that the country is on the wrong track. Now there’s empirical data to show just how far the country has run off the rails: Life expectancies have been falling.

The writer creates the expectation that he will reveal evidence in the form of data to show that life expectancies have been driven down by economic inequality, pandemic, and politics. Does he succeed?

***

The map portrays average life expectancy (at birth) for some mysterious, presumably very recent, year for every county in the United States. From the color legend, we learn that the bottom-to-top range is about 20 years. There is a clear spatial pattern, with the worst results in the south (excepting south Florida).

The choice of colors is telling. Red and blue on a U.S. map carry heavy baggage, as they signify the two main political parties in the country. Given that the author believes politics to be a key driver of health outcomes, the usage of red and blue here is deliberate. Throughout the article, the columnist connects the lower life expectancies in southern states to their politics.

For example, he said "these geographical disparities aren't artifacts of pure geography or demographics; they're the consequences of policy decisions at the state level... Of the 20 states with the worst life expectancies, eight are among the 12 that have not implemented Medicaid expansion under the Affordable Care Act..."

Casual readers may fall into a trap here. There is nothing on the map itself that draws the connection between politics and life expectancies; the idea is evoked purely through the red-blue color scheme. So, as readers, we are filling in the blanks with our own politics.

What could have been done instead? Let's look at the life expectancy map side by side with the map of the U.S. 2020 Presidential election.

Junkcharts_lifeexpectancy_elections

Because of how close recent elections have been, we may think the political map has a nice balance of red and blue but it doesn't. The Democrats' votes are heavily concentrated in densely-populated cities so most of the Presidential election map is red. When the two maps are placed next to each other, it's obvious that politics doesn't explain the variance in life expectancy well. The Midwest is deep red and yet it has above-average life expectancies. I have circled various regions that contradict the claim that Republican politics drove life expectancies down.

It's not sufficient to point to the South, in which Republican votes and life expectancy are indeed inversely correlated. A good theory has to explain most of the country.

***

The columnist also suggests that poverty is the cause of low life expectancy. That too cannot be gleaned from the published map. Again, readers are nudged to use their wild imagination to fill in the blank.

Data come to the rescue. Here is a side-by-side comparison of the map of life expectancies and the map of median incomes.

Junkcharts_lifeexpectancy_income

A similar conundrum. While the story feels right in the South, it fails to explain the northwest, Florida, and various other parts of the country. Take a look again at the circled areas. Lower income brackets are also sometimes associated with high life expectancies.

***

The author supplies a third cause of lower life expectancies: Covid-19 response. Because Covid-19 was the "most obvious and convenient" explanation for the loss of life expectancy during the pandemic, this theory suggests that the red areas on the life expectancy map should correspond to the regions most ravaged by Covid-19.

Let's see the data.

Junkcharts_lifeexpectancy_covidcases

The map on the right shows the number of confirmed cases until June 2021. As before, the correlation holds somewhat in the South but there are notable exceptions, e.g. the Midwest. We also have states with low Covid-19 cases but below-average life expectancy.

***

What caused the decline of life expectancy in the U.S. - which began before the pandemic, and has continued beyond - is highly complex, beyond what a single map or a pair of maps or a few pairs of maps could convey. Showing a red-blue map presents a trap for readers to fall into, in which they start thinking, without knowing.

 


Showing both absolute and relative values on the same chart 2

In the previous post, I looked at Visual Capitalist's visualization of the amount of uninsured deposits at U.S. banks. Using a stacked bar chart, I placed both absolute and relative values on the same chart.

In making that chart, I made these three tradeoffs.

First, I elevated absolute values (dollar amounts) over relative values (proportions). The original designer decided the opposite.

Second, I elevated the TBTF banks over the smaller banks. The original designer also decided the opposite.

Third, I elevated the total value over the disaggregated values (insured, uninsured). The original designer only visualized the uninsured values in the bars.

Which chart is better depends on what story one wants to tell.

***
For today's post, I'm showing another sketch of the same data, with the same goal of putting both absolute and relative values on the same chart.

Redo_visualcapitalist_uninsureddeposits_2b

The starting point of this sketch is the original chart - the stacked bar chart showing relative proportions. I added the insured portion so that it is on almost equal footing with the uninsured portion of the deposits. This edit is crucial to convey the impression of proportions.

My story hasn't changed; I still want to elevate the TBTF banks.

For this version, I try a different way of elevating TBTF banks. The key step is to encode data into the heights of the bars. I use these bar heights to convey the relative importance of banks, as reflected by total deposits.

The areas of the red blocks represent the uninsured amounts. That said, it's not easy to compare rectangular areas when both dimensions are different.
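Here is a minimal sketch of the geometry behind this kind of design, assuming each bar's height is proportional to the bank's total deposits and the red block spans the uninsured proportion of the bar's width. The bank names and figures are hypothetical and are not taken from the Visual Capitalist data.

```python
from dataclasses import dataclass


@dataclass
class Bank:
    name: str
    total_deposits: float   # hypothetical, in $ billions
    share_uninsured: float  # proportion between 0 and 1


def red_block(bank, chart_width=1.0, height_per_billion=1.0):
    """Return (width, height, area) of the uninsured (red) block."""
    width = bank.share_uninsured * chart_width
    height = bank.total_deposits * height_per_billion
    return width, height, width * height


# Hypothetical banks: one large with a moderate share, one small with a high share.
banks = [Bank("Big Bank", 1000.0, 0.4), Bank("Small Bank", 100.0, 0.9)]
for b in banks:
    w, h, area = red_block(b)
    print(f"{b.name}: width={w:.2f}, height={h:.0f}, area={area:.0f}")
```

Since the area is width times height, it works out to the proportion times the total, which is the uninsured dollar amount. The two blocks differ in both dimensions, which is exactly why they are hard to compare by eye.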

Comparing the total red area with the total yellow area, we learn that the majority of deposits in these banks are uninsured(!)

 


Showing both absolute and relative values on the same chart 1

Visual Capitalist has a helpful overview on the "uninsured" deposits problem that has become the talking point of the recent banking crisis. Here is a snippet of the chart that you can see in full at this link:

Visualcapitalist_uninsureddeposits_top

This is in infographics style. It's a bar chart that shows the top X banks. Even though the headline says "by uninsured deposits", the sort order is really based on the proportion of deposits that are uninsured, i.e. residing in accounts that exceed $250K. They used a red color to highlight the two failed banks, both of which have at least 90% of deposits uninsured.

The right column provides further context: the total amounts of deposits, presented both as a list of numbers and as a column of bubbles. As readers know, bubbles are not self-sufficient, and if the list of numbers were removed, the bubbles would lose most of their power of communication. Big, small, but how much smaller?

There are little nuggets of text in various corners that provide other information.

Overall, this is a pretty good one as far as infographics go.

***

I'd prefer to elevate information about the Too Big to Fail banks (which are hiding in plain sight). Addressing this surfaces the usual battle between relative and absolute values. While the smaller banks have some of the highest concentrations of uninsured deposits, each TBTF bank holds many times the absolute dollars of uninsured deposits held by the smaller banks.

Here is a revised version:

Redo_visualcapitalist_uninsuredassets_1

The banks are still ordered in the same way, by the proportions of uninsured value. The data being plotted are not the proportions but the actual deposit amounts. Thus, the three TBTF banks (Citibank, Chase and Bank of America) stand out from the crowd. Aside from Citibank, the other two have relatively moderate proportions of uninsured assets but the sizes of the red bars for any of these three dominate those of the smaller banks.

Notice that I added the gray segments, which portray the amount of deposits that are FDIC protected. I did this not just to show the relative sizes of the banks. Having the other part of the deposits allows readers to answer additional questions, such as which banks have the most insured deposits? The gray segments also visually present the relative proportions.
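The arithmetic behind such a stacked bar can be sketched as follows, assuming we have the total deposits and the proportion uninsured for each bank. The banks and numbers are hypothetical placeholders, chosen only to show how a bank with a modest proportion can still have the largest red bar.

```python
def split_deposits(total, share_uninsured):
    """Split total deposits into (uninsured, insured) dollar amounts."""
    uninsured = total * share_uninsured
    return uninsured, total - uninsured


# Hypothetical banks: (name, total deposits in $ billions, share uninsured)
banks = [
    ("Large Bank A", 1200.0, 0.50),
    ("Small Bank B", 150.0, 0.90),
    ("Large Bank C", 2000.0, 0.25),
]

# Keep the original ordering: highest proportion uninsured first.
rows = sorted(banks, key=lambda b: b[2], reverse=True)
for name, total, share in rows:
    uninsured, insured = split_deposits(total, share)
    print(f"{name:13s} uninsured={uninsured:7.1f} insured={insured:7.1f}")
```

The sort key is the relative value while the bar lengths are the absolute values, so both appear on the same chart.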

***

The most amazing part of this dataset is the amount of uninsured money. I'm trying to think who these account holders are. It would seem like a very small collection of people and/or businesses would be holding these accounts. If they are mostly businesses, is FDIC insurance designed to protect business deposits? If they are mostly personal accounts, then surely only very wealthy individuals hold most of these accounts.

In the above chart, I'm assuming that deposits and assets are referring to the same thing. This may not be the correct interpretation. Deposits may be only a portion of the assets. It would be strange though that the analysts only have the proportions but not the actual deposit amounts at these banks. Nevertheless, until proven otherwise, you should see my revision as a sketch - what you can do if you have both the total deposits and the proportions uninsured.


Thoughts on Daniel's fix for dual-axes charts

I've taken a little time to ponder Daniel Z's proposed "fix" for dual-axes charts (link). The example he used is this:

Danielzvinca_dualaxes_linecolumn

In that long post, Daniel explained why he preferred to mix a line with columns, rather than using the more common dual lines construction: to prevent readers from falsely attributing meaning to crisscrossing lines. There are many issues with dual-axes charts, which I won't repeat in this post; one of their most dissatisfying features is the lack of connection between the two vertical scales, and thus, it's pretty easy to manufacture an image of correlation when it doesn't exist. As shown in this old post, one can expand or restrict one of the vertical axes and shift the line up and down to "match" the other vertical axis.

Daniel's proposed fix retains the dual axes, and he even restores the dual lines construction.

Danielzvinca_dualaxes_estimatedy

How is this chart different from the typical dual-axes chart, like the first graph in this post?

Recall that the problem with using two axes is that the designer could squeeze, expand or shift one of the axes in any number of ways to manufacture many realities. What Daniel effectively did here is select one specific way to transform the "New Customers" axis (shown in gray).

His idea is to run a simple linear regression between the two time series. Think of fitting a "trendline" in Excel between Revenues and New Customers. Then, use the resulting regression equation to compute "estimated" revenues based on the New Customers series. The coefficients of this regression equation then determine the degree of squeezing/expansion and shifting applied to the New Customers axis.

The main advantage of this "fix" is to eliminate the freedom to manufacture multiple realities. There is exactly one way to transform the New Customers axis.

The chart itself takes a bit of time to get used to. The actual values plotted in the gray line are "estimated revenues" from the regression model, thus the blue axis values on the left apply to the gray line as well. The gray axis shows the respective customer values. Because we performed a linear fit, each value of estimated revenues corresponds to a particular customer value. The gray line is thus a squeezed/expanded/shifted replica of the New Customers line (shown in orange in the first graph). The gray line can then be interpreted on two connected scales, and both the blue and gray labels are relevant.
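To make the transformation concrete, here is a minimal sketch of my reading of the procedure, not Daniel's own code. The two series are hypothetical, and `fit_line` and `customers_at` are names I made up for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical monthly series, for illustration only.
new_customers = [120.0, 135.0, 150.0, 160.0, 155.0, 170.0]
revenues = [1.9, 2.3, 2.4, 2.8, 2.5, 3.0]  # e.g. $ millions

slope, intercept = fit_line(new_customers, revenues)

# The gray line: the customers series re-expressed in revenue units.
estimated_revenues = [intercept + slope * c for c in new_customers]


def customers_at(revenue_value):
    """Customer value sitting at a given position on the revenue axis."""
    return (revenue_value - intercept) / slope


# Labels for the linked (gray) axis at a few revenue tick marks.
for tick in (2.0, 2.5, 3.0):
    print(f"revenue {tick:.1f} <-> customers {customers_at(tick):.0f}")
```

The slope sets the squeezing or expansion and the intercept sets the shift. Once the regression is fitted, neither is left to the designer's discretion.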

***

What are we staring at?

The blue line shows the observed revenues while the gray line displays the estimated revenues (predicted by the regression line). Thus, the vertical gaps between the two lines are the "residuals" of the regression model, i.e. the estimation errors. If you have studied Statistics 101, you may remember that the residuals are the components that make up the R-squared, which measures the quality of fit of the regression model. R-squared is the square of r, which stands for the correlation between Customers and the observed revenues. Thus the higher the (linear) correlation between the two time series, the higher the R-squared, the better the regression fit, the smaller the gaps between the two lines.
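The link between the gaps and the correlation can be checked numerically. The sketch below uses hypothetical data, computes R-squared from the residuals of a simple linear regression, and compares it with the squared correlation coefficient.

```python
import math


def r_squared_and_r(x, y):
    """Return (R-squared from residuals, correlation r) for a simple linear fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residuals are the vertical gaps between observed and estimated values.
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    ss_res = sum(e ** 2 for e in residuals)

    r_squared = 1 - ss_res / syy
    r = sxy / math.sqrt(sxx * syy)
    return r_squared, r


# Hypothetical customers and revenues.
customers = [120.0, 135.0, 150.0, 160.0, 155.0, 170.0]
revenues = [1.9, 2.3, 2.4, 2.8, 2.5, 3.0]
r2, r = r_squared_and_r(customers, revenues)
print(f"R-squared = {r2:.4f}, r squared = {r * r:.4f}")
```

The two printed numbers agree up to floating-point error. This identity holds for a regression with one predictor and an intercept, which is the case here.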

***

There is some value to this chart, although it'd be challenging to explain to someone who has not taken Statistics 101.

While I like that this linear regression approach is "principled", I wonder why this transformation should be preferred to all others. I don't have an answer to this question yet.

***

Daniel's fix reminds me of a different, but very common, chart.

Forecastvsactualinflationchart

This chart shows actual vs forecasted inflation rates. This chart has two lines but only needs one axis since both lines represent inflation rates in the same range.

We can think of the "estimated revenues" line above as forecasted or expected revenues, based on the actual number of new customers. In particular, this forecast is based on a specific model: one that assumes that revenues are linearly related to the number of new customers. The "residuals" are forecasting errors.

In this sense, I think Daniel's solution amounts to rephrasing the question of the chart from "how closely are revenues and new customers correlated?" to "given the trend in new customers, are we over- or under-performing on revenues?"

Instead of using the dual-axes chart with two different scales, I'd prefer to answer the question by showing this expected vs actual revenues chart with one scale.

This does not eliminate the question about the "principle" behind the estimated revenues, but it makes clear that the challenge is to justify why revenues are a linear function of new customers, and of no other variables.

Unlike the dual-axes chart, the actual vs forecasted chart is independent of the forecasting method. One can produce forecasted revenues based on a complicated function of new customers, existing customers, and any other factors. A different model just changes the shape of the forecasted revenues line. We still have two comparable lines on one scale.

 

 

 

 

 


All about Connecticut

This dataviz project by CT Mirror is excellent. The project walks through key statistics of the state of Connecticut.

Here are a few charts I enjoyed.

The first one shows the industries employing the most CT residents. The left and right arrows are perfect, much better than the usual dot plots.

Ctmirror_growingindustries

The industries are sorted by decreasing size from top to bottom, based on employment in 2019. The chosen scale is absolute, showing the number of employees. The relative change is shown next to the arrow heads in percentages.

The inclusion of both absolute and relative scales may be a source of confusion as the lengths of the arrows encode the absolute differences, not the relative differences indicated by the data labels. This type of decision is always difficult for the designer. Selecting one of the two scales may improve clarity but induce loss aversion.
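A small sketch shows how the two scales can disagree. The industries and employment counts are hypothetical, not CT Mirror's data.

```python
def change(start, end):
    """Return (absolute change, relative change) between two counts."""
    absolute = end - start
    return absolute, absolute / start


# Hypothetical employment counts: (earlier year, later year)
industries = {
    "Industry A": (200_000, 210_000),
    "Industry B": (20_000, 26_000),
}

for name, (start, end) in industries.items():
    absolute, relative = change(start, end)
    # Arrow length follows the absolute change; the label shows the relative change.
    print(f"{name}: arrow length = {absolute:+,d}, label = {relative:+.0%}")
```

In this made-up example, Industry A gets the longer arrow while Industry B gets the larger percentage label. That is the kind of mismatch a reader has to reconcile.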

***

The next example is a bumps chart showing the growth in residents with at least a bachelor's degree.

Ctmirror_highered

This is more like a slopegraph as it appears to draw straight lines between two time points 9 years apart, omitting the intervening years. Each line represents a state. Connecticut's line is shown in red. The message is clear. Connecticut is among the most highly educated out of the 50 states. It maintained this advantage throughout the period.

I'd prefer to use solid lines for the background states, and the axis labels can be sparser.

It's a little odd that pretty much every line has the same slope. I suspect that the numbers came out of a regression model, with varying slopes by state, but the inter-state variance is low.

In the online presentation, one can click on each line to see the values.

***

The final example is a two-sided bar chart:

Ctmirror_migration

This shows migration in and out of the state. The red bars represent the number of people who moved out, while the green bars represent those who moved into the state. The states are arranged from the largest number of in-migrants to the smallest.

I have clipped the bottom of the chart as it extends to 50 states, and the bottom half is barely visible since the absolute numbers are so small.

I'd suggest showing the top 10 states. Then group the rest of the states by region, and plot them as regions. This change makes the chart more compact, as well as more useful.
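The regrouping suggested above can be sketched as follows. The migrant counts are hypothetical, the region assignment is one possible choice, and `top_n_then_regions` is an illustrative helper rather than anything CT Mirror uses.

```python
from collections import defaultdict


def top_n_then_regions(flows, region_of, n=10):
    """Keep the n largest states; sum the remaining states by region."""
    ranked = sorted(flows.items(), key=lambda kv: kv[1], reverse=True)
    top = ranked[:n]
    by_region = defaultdict(int)
    for state, count in ranked[n:]:
        by_region[region_of[state]] += count
    return top, dict(by_region)


# Hypothetical in-migrant counts, with n=2 to keep the example short.
in_migrants = {"New York": 900, "Massachusetts": 500, "Maine": 40,
               "Oregon": 30, "Vermont": 25, "Utah": 10}
region_of = {"Maine": "Northeast", "Vermont": "Northeast",
             "Oregon": "West", "Utah": "West"}

top, regions = top_n_then_regions(in_migrants, region_of, n=2)
print(top)      # the individually plotted states
print(regions)  # the remaining states, rolled up into regions
```

The same roll-up would be applied to the out-migration side so that the two-sided bars stay aligned.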

***

There are many other charts, and I encourage you to visit and support this data journalism.

 

 

 


Lay off bubbles

Wall Street Journal says that the scale of layoffs in the tech industry recently is worse than those caused by the pandemic lockdown. Here is the chart:

Redo_wsj_tech_layoffs_sufficiency

It's the dreaded bubble chart, complete with overlapping circles. Each bubble represents the total number of employees laid off in the U.S. in a given month.

The above isn't really the chart you find in the Journal. I have removed the two data labels from the chart. Look at the highlighted months of April 2020 and November 2022. Can you guess how much larger the number of laid-off employees is in November 2022 relative to April 2020?

***

If you guessed it's 100% - that the larger bubble is twice the size of the smaller one, then you're much better than I at reading bubble charts. Here is the published chart with the data labels:

Wsj tech layoffs

I like to run this exercise - removing data labels - in order to reveal whether the graphical elements on the page are sufficient to convey the underlying data. Bubbles are typically not great at this. (This is what I call the self-sufficiency test.)

***

Another problem with bubble charts is that the sizes of the bubbles are arbitrary. This allows the designer to convey different messages with the same data.

Take a look at these two bubble charts:

Redo_wsj_layoff_bubbles

The first one has huge bubbles, and lots of overlapping while the second one is roughly the same as the WSJ chart (I pulled a different dataset so the numbers may not be exactly the same).

Both charts are made from exactly the same data! In the second chart, the smallest bubbles are made very small while in the first chart, the smallest bubbles are still quite large.
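Here is a minimal sketch of why bubble sizes are both hard to read and arbitrary, assuming the common convention that a bubble's area is proportional to the value. The counts are hypothetical, and `max_radius` stands in for the free choice the designer makes.

```python
import math


def bubble_radius(value, max_value, max_radius):
    """Radius under area scaling: bubble area is proportional to value."""
    return max_radius * math.sqrt(value / max_value)


# Hypothetical monthly counts; the larger is exactly twice the smaller.
small, large = 10_000, 20_000

for max_radius in (20, 60):  # two arbitrary design choices
    r_small = bubble_radius(small, large, max_radius)
    r_large = bubble_radius(large, large, max_radius)
    print(f"max_radius={max_radius}: radii {r_small:.1f} and {r_large:.1f}, "
          f"ratio {r_large / r_small:.2f}")
```

Doubling the value increases the radius by only about 41 percent, which may be why a 100% difference is hard to guess. Nothing in the data pins down `max_radius`, so the same numbers can produce huge overlapping bubbles or tiny ones.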

Think twice before you make a bubble chart.

 


Dual axes: a favorite of tricksters

Twitter readers directed me to this abomination from the St. Louis Fed (link).

Stlouisfed_military_spend

This chart is designed to paint the picture that China is a grave threat because it has been ramping up military expenditure so much that it has exceeded U.S. spending since the 2000s.

Sadly, this is not what the data are suggesting at all! This story is constructed by manipulating the dual axes. Someone has already fixed it. Here's the same data plotted with a single axis:

Redo_military_spend

(There are two sets of axis labels but they have the same scale and both start at zero, so there is only one axis.)

Certainly, China has been ramping up military spending. Nevertheless, China's current level of spending is about one-third of America's. Also, imagine the cumulative spending excess over the 30 years shown on the chart.

Note also that the growth line of U.S. military spending in this period is actually about as steep as China's.
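The mechanics of the manipulation can be sketched with a few lines of arithmetic. The spending figures and axis ranges below are hypothetical, chosen only to mimic a three-to-one gap. They are not the St. Louis Fed's numbers.

```python
def to_pixels(value, axis_min, axis_max, plot_height=300):
    """Height above the baseline (in pixels) of a value on a linear axis."""
    return (value - axis_min) / (axis_max - axis_min) * plot_height


# Hypothetical latest-year spending in $ billions: B spends one-third of A.
spend_a, spend_b = 900.0, 300.0

# One shared axis from 0 to 1000: plotted heights preserve the 3-to-1 ratio.
shared = (to_pixels(spend_a, 0, 1000), to_pixels(spend_b, 0, 1000))

# Dual axes: A is drawn against 0-3000 and B against 0-350.
dual = (to_pixels(spend_a, 0, 3000), to_pixels(spend_b, 0, 350))

print(f"shared axis: A at {shared[0]:.0f}px, B at {shared[1]:.0f}px")
print(f"dual axes:   A at {dual[0]:.0f}px, B at {dual[1]:.0f}px")
```

With the second pair of axis ranges, the smaller spender is drawn above the larger one. The data have not changed, only the two scales.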

***

Apparently, the St. Louis Fed is intent on misleading its readers. Even though they acknowledged people's feedback on Twitter, they decided not to alter the chart.

Stlouisfed_militaryexpenditure_tweet

If you click through to the article, you'll find the same flawed chart as before so I'm not sure how they "listened". I went to the Wayback Machine to check the first version of this page, and I noticed no difference.

***

If one must make a dual axes chart, it is the responsibility of the chart designer to make it clear to readers that different lines on the chart use different axes. In this case, since the only line that uses the right hand side axis is the U.S. line, which is blue, they should have colored the right hand axis blue. Doing that does not solve the visualization problem; it merely reduces the chance of not noticing the dual axes.

***

I have written about dual axes a lot in the past. Here's a McKinsey chart from 2006 that offends.


Following this pretty flow chart

Bloomberg did a very nice feature on how drought has been causing havoc with river transportation of grains and other commodities in the U.S., which included several well-executed graphics.

Mississippi_sankey

I'm particularly attracted to this flow chart/sankey diagram that shows the flows of grains from various U.S. ports to foreign countries.

It looks really great.

Here are some things one can learn from this chart:

  • The Mississippi River (blue flow) is by far the most important conduit of American grain exports
  • China is by far the largest importer of American grains
  • Mexico is the second largest importer of American grains, and it has a special relationship with the "interior" ports (yellow). Notice how the Interior almost exclusively sends grains to Mexico
  • Similarly, the Puget Sound almost exclusively trades with China

The above list is impressive for one chart.

***

Some key questions are not as easy to see from this layout:

  • What proportion of the total exports does the Mississippi River account for? (Turns out to be almost exactly half.)
  • What proportion of the total exports go to China? (About 40%. This question is even harder than the previous one because of all the unlabeled values for the smaller countries.)
  • What is the relative importance of different ports to Japan/Philippines/Indonesia/etc.? (Notice how the green lines merge from the other side of the country names.)
  • What is the relative importance of any of the countries listed, outside the top 5 or so?
  • What is the ranking of importance of export nations to each port? For Mississippi River, it appears that the countries may have been drawn from least important (up top) to most important (down below). That is not the case for the other ports... otherwise the threads would tie up into knots.

***

Some of the features that make the chart look pretty are not data-driven.

See this artificial "hole" in the brown branch.

Bloomberg_mississippigrains_branchgap

In this part of the flow, there are two tiny outflows to Myanmar and Yemen, so most of the goods that got diverted to the right side ended up merging back to the main branch. However, the creation of this hole allows a layering effect which enhances the visual cleanliness.

Next, pay attention to the yellow sub-branches:

Bloomberg_mississippigrains_subbranching

At the scale used by the designer, all of the countries shown essentially import about the same amount from the Interior (yellow). Notice the special treatment of Singapore and Philippines. Instead of each having a yellow sub-branch coming off the "main" flow, these two countries share the sub-branch, which later splits.

 

 

 


A graphical compass

A Twitter user pointed me to this article from the Washington Post, ruminating about the correlation between gas prices and measures of political sentiment (such as Biden's approval rating or right-track-wrong-track). As is common in this genre, the analyst proclaims that he has found something "counter intuitive".

The declarative statement strikes me as odd. In the first two paragraphs, he said the data showed "as gas prices fell, American optimism rose. As prices rose, optimism fell... This seems counterintuitive."

I'm struggling to see what's counterintuitive. Aren't the data suggesting people like lower prices? Is that not what we think people like?

The centerpiece of the article concerns the correlation between metrics. "If two numbers move in concert, they can be depicted literally moving in concert. One goes up, the other moves either up or down consistently." That's a confused statement and he qualifies it by typing "That sort of thing."

He's reacting to the following scatter plot with lines. The Twitter user presumably found it hard to understand. Count me in.

Washingtonpost_gasprices

Why is this chart difficult to grasp?

The biggest puzzle is: what differentiates those two lines? The red and the gray lines are not labelled. One would have to consult the article to learn that the gray line represents the "raw" data at weekly intervals. The red line is aggregated data at monthly intervals. In other words, each red dot is an average of 4 or 5 weekly data points. The red line is just a smoothed version of the gray line. Smoothed lines show the time trend better.
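The relationship between the two lines can be sketched as below, assuming the monthly value is a plain average of the weekly observations falling in each calendar month. The Post's exact aggregation is not documented in the chart, and the weekly prices here are hypothetical.

```python
from collections import defaultdict
from datetime import date


def monthly_means(weekly):
    """Average weekly (date, value) observations within each calendar month."""
    buckets = defaultdict(list)
    for day, value in weekly:
        buckets[(day.year, day.month)].append(value)
    return {month: sum(vals) / len(vals) for month, vals in sorted(buckets.items())}


# Hypothetical weekly gas prices ($ per gallon).
weekly_prices = [
    (date(2022, 5, 2), 4.20), (date(2022, 5, 9), 4.35),
    (date(2022, 5, 16), 4.50), (date(2022, 5, 23), 4.60),
    (date(2022, 5, 30), 4.65),
    (date(2022, 6, 6), 4.90), (date(2022, 6, 13), 5.00),
    (date(2022, 6, 20), 4.95), (date(2022, 6, 27), 4.85),
]

for (year, month), mean in monthly_means(weekly_prices).items():
    print(f"{year}-{month:02d}: {mean:.2f}")
```

On a scatter plot, the same averaging would be applied to both variables, so each red dot summarizes four or five gray dots.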

The next missing piece is the direction of time, which can only be inferred by reading the month labels on the red line. But the chart without the direction of time is like a map without a compass. Take this segment for example:

Wpost_gaspricesapproval_directionoftime

If time is running up to down, then approval ratings are increasing over time while gas prices are decreasing. If time is running down to up, then approval ratings are decreasing over time while gas prices are increasing. Exactly the opposite!

The labels on the red line are not sufficient. It's possible that time runs in the opposite direction on the gray line! We only exclude that possibility if we know that the red line is a smoothed version of the gray line.

This type of chart benefits from having a compass. Here's one:

Wpost_gaspricesapproval_compass

It's useful for readers to know that the southeast direction is "good" (higher approval ratings, lower gas prices) while the northwest direction is "bad". Going back to the original chart, one can see that the metrics went in the "bad" direction at the start of the year and have reverted to a "good" direction since.
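The compass idea, and the ambiguity caused by a missing direction of time, can be expressed in a short sketch. It assumes approval ratings on the horizontal axis and gas prices on the vertical axis, as on the chart. The two points are hypothetical.

```python
def compass(dx, dy):
    """Compass label for a move of dx (approval change) and dy (gas price change)."""
    ns = "north" if dy > 0 else "south" if dy < 0 else ""
    ew = "east" if dx > 0 else "west" if dx < 0 else ""
    return (ns + ew) or "no change"


def directions(path):
    """Compass direction of each step along a time-ordered path of (x, y) points."""
    return [compass(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(path, path[1:])]


# Hypothetical segment: (approval rating in %, gas price in $ per gallon)
segment = [(38.0, 4.2), (41.0, 3.6)]

print(directions(segment))        # reading the segment in one time order
print(directions(segment[::-1]))  # the same segment read in the opposite order
```

The same line segment yields opposite compass directions depending on which end comes first, which is why the chart needs an explicit direction of time.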

***

What does this chart really say? The author remarked that "correlation is not causation". "Just because Biden’s approval rose as prices dropped doesn’t mean prices caused the drop."

Here's an alternative: People have general sentiments. When they feel good, they respond more positively to polls, as in they rate everything more positively. The approval ratings are at least partially driven by this general sentiment. The same author apparently has another article saying that the right-track-wrong-track sentiment also moved in tandem with gas prices.

One issue with this type of scatter plot is that it always cues readers to make an incorrect assumption: that the outcome variable (approval rating) is solely - or predominantly - driven by the one factor being visualized (gas prices). This visual choice completely biases the reader's perception.

P.S. [11-11-22] The source of the submission was incorrectly attributed.


Painting the corner

Found an old one sitting in my folder. This came from the Wall Street Journal in 2018.

At first glance, the chart looks like a pretty decent effort.

The scatter plot shows Ebitda against market value, both measured in billions of dollars. The placement of the vertical axis title on the far side is a little unusual.

Ebitda is a measure of business profit (something for a different post on the sister blog: the "b" in Ebitda means "before", and allows management to paint a picture of profits without accounting for the entire cost of running the business). In the financial markets, the market value is claimed to represent a "fair" assessment of the value of the business. The ratio of the market value to Ebitda is known as the "Ebitda multiple", which describes the number of dollars the "market" places on each dollar of Ebitda profit earned by the company.

Almost all scatter plots suffer from xyopia: the chart form encourages readers to take an overly simplistic view in which the market cares about one and only one business metric (Ebitda). The reality is that the market value contains information about Ebitda plus lots of other factors, such as competitors, growth potential, etc.

Consider Alphabet vs AT&T. On this chart, both companies have about $50 billion in Ebitda profits. However, the market value of Alphabet (Google's parent company) is about four times higher than that of AT&T. This excess valuation has nothing to do with profitability but is partly explained by the market's view that Google has greater growth potential.
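The Ebitda multiple is simple arithmetic. In the sketch below, the two companies are hypothetical stand-ins with the same Ebitda and a four-to-one gap in market value. The dollar figures are placeholders, not the values shown on the WSJ chart.

```python
def ebitda_multiple(market_value, ebitda):
    """Dollars of market value per dollar of Ebitda."""
    return market_value / ebitda


# Hypothetical companies: (market value, Ebitda), both in $ billions.
companies = {
    "Company A": (800.0, 50.0),
    "Company B": (200.0, 50.0),
}

for name, (market_value, ebitda) in companies.items():
    print(f"{name}: multiple = {ebitda_multiple(market_value, ebitda):.0f}x")
```

With equal Ebitda, the whole difference in the multiple comes from factors other than current profits, which the scatter plot does not display.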

***

Unusually, the designer chose not to utilize the log scale. The right side of the following display is the same chart with a log horizontal axis.

The big market values are artificially pulled into the middle while the small values are pulled apart. As one reads from left to right, the same amount of distance represents more and more dollars. While all data visualization books love log scales, I am not a big fan of them. That's because the human brain doesn't process spatial information this way. We don't tend to think in terms of continuously evolving scales. Thus, presenting the log view causes readers to underestimate large values and overestimate small differences.
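A short sketch of how a log axis places values, using hypothetical market values and an assumed axis range of 1 to 1000 (in $ billions):

```python
import math


def log_position(value, axis_min, axis_max, width=100.0):
    """Horizontal position of a value on a base-10 log axis of the given width."""
    span = math.log10(axis_max) - math.log10(axis_min)
    return (math.log10(value) - math.log10(axis_min)) / span * width


# Hypothetical market values in $ billions.
for value in (1, 10, 100, 1000):
    print(f"{value:>5} -> position {log_position(value, 1, 1000):.1f}")
```

Each equal step along the axis multiplies the value by ten, so the step from 100 to 1000 covers the same distance as the step from 1 to 10 despite representing 100 times as many dollars.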

Now let's get to the main interest of this chart. Notice the bar chart shown on the top right, which by itself is very strange. The colors of the bar chart are coordinated with those on the scatter plot, as the colors divide the companies into two groups: "media" companies (old, red), and tech companies (new, orange).

Scratch that. Netflix is found in the scatter plot but with a red color while AT&T and Verizon appear on the scatter plot as orange dots. So it appears that the colors mean different things on different plots. As far as I could tell, on the scatter plot, the orange dots are companies with over $30 billion in Ebitda profits.

At this point, you may have noticed the stray orange dot. Look carefully at the top right corner, above the bar chart, and you'll find the orange dot representing Apple. It is by far the most important datum, the company that has the greatest market value and the largest Ebitda.

I'm not sure burying Apple in the corner was a feature or a bug. It really makes little sense to insert the bar chart where it is, creating a gulf between Apple and the rest of the companies. This placement draws the most attention away from the datum that demands the most attention.