Trying too hard

Today, I return to the life expectancy graphic that Antonio submitted. In a previous post, I looked at the bumps chart. The centerpiece of that graphic is the following complicated bar chart.

Aburto_covid_lifeexpectancy

Let's start with the dual axes. On the left, age; on the right, year of birth. I actually like this type of dual axis: the two axes present two versions of the same scale, so there is no distortion. The reader simply picks whichever scale they prefer.

It baffles me that the range of each bar runs from 2.5 years to 7.5 years, or 7.5 years to 12.5 years, with 5 or 10 years situated in the middle of each bar.

Reading the rest of the chart is like untangling some balled-up wires. The author has created a statistical model that attributes changes in male life expectancy to causes of death, in such a way that you can take the difference in life expectancy between two time points and do a kind of waterfall analysis, in which each cause of death either adds to or subtracts from the prior life expectancy, with the sum of these additions and subtractions leading to the end-of-period life expectancy.

The model is complicated enough, and the chart doesn't make it any easier.

The bars are rooted at the zero value. The horizontal axis plots additions or subtractions to life expectancy, so zero represents no change during the period. Zero does not mean the cause of death (e.g. cancer) does not contribute to life expectancy; it just means the contribution remains the same.

The changes to life expectancy are shown in units of months. I'd prefer to see units of years because life expectancy is almost always given in years. Using years turns 2.5 months into 0.2 years, which is a fraction, but it allows me to see the impact on the reported life expectancy without having to do a month-to-year conversion.
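
To make the mechanics concrete, here is a tiny sketch of the waterfall logic, with entirely made-up numbers (none of these values come from the study), including the month-to-year conversion:

```python
# Made-up contributions (in months) of each cause of death to the change in male
# life expectancy over a period; the waterfall logic says they sum to the total change.
contributions_months = {
    "COVID-19": -24.0,
    "CVD":       -3.0,
    "Cancer":     2.5,
    "Despair":   -1.5,
    "Rest":       1.0,
}

total_change_months = sum(contributions_months.values())   # -25.0 months
total_change_years = total_change_months / 12               # about -2.1 years

start_life_expectancy = 76.3                                 # hypothetical start-of-period value
end_life_expectancy = start_life_expectancy + total_change_years

print(f"total change: {total_change_months} months = {total_change_years:.2f} years")
print(f"end-of-period life expectancy: {end_life_expectancy:.1f} years")
```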

The chart highlights seven causes of death with seven different colors, plus gray for others.

What really does a number on readers is the shading, which adds another layer on top of the hues. Each color comes in two shades, referencing two periods of time. The unshaded bar segments concern changes between 2010 and "2019" while the shaded segments concern changes between "2019" and 2020. The two periods are chosen to highlight the impact of COVID-19 (the red-orange color), which did not exist before "2019".

Let's zoom in on one of the rows of data - the 72.5 to 77.5 age group.

Screen Shot 2022-09-14 at 1.06.59 PM

COVID-19 (red-orange) has a negative impact on life expectancy and that's the easy one to see. That's because COVID-19's contribution as a cause of death is exactly zero prior to "2019". Thus, the change in life expectancy is a change from zero. This is not how we can interpret any of the other colors.

Next, we look at cancer (blue). Since this bar segment sits on the right side of zero, cancer has contributed positively to change in life expectancy between 2010 and 2020. Practically, that means proportionally fewer people have died from cancer. Since the lengths of these bar segments correspond to the relative value, not absolute value, of life expectancy, longer bars do not necessarily indicate more numerous deaths.

Now the blue segment is actually divided into two parts, the shaded and not shaded. The not-shaded part is for the period "2019" to 2020 in the first year of the COVID-19 pandemic. The shaded part is for the period 2010 to "2019". It is a much wider span but it also contains 9 years of changes versus "1 year" so it's hard to tell if the single-year change is significantly different from the average single-year change of the past 9 years. (I'm using these quotes because I don't know whether they split the year 2019 in the middle since COVID-19 didn't show up till the end of that year.)

Next, we look at the yellow-brown color corresponding to CVD. The key feature is that this block is split into two parts, one positive, one negative. Prior to "2019", CVD had been contributing positively to life expectancy changes; after "2019", it contributed negatively. This observation raises some questions: why would CVD behave differently with the arrival of the pandemic? Are there data problems?

***

A small multiples design - splitting the period into two charts - may help here. To make those two charts comparable, I'd suggest annualizing the data so that the 9-year numbers represent the average annual values instead of the cumulative values.
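
Here's a minimal sketch of that annualization, using a hypothetical table of decomposition results (the column names and numbers are stand-ins, not the study's data):

```python
import pandas as pd

# Hypothetical decomposition output: cumulative contributions in months per cause.
df = pd.DataFrame({
    "cause": ["Cancer", "CVD", "Respiratory"],
    "months_2010_2019": [2.4, 1.8, 0.9],    # 9 years of cumulative change (made up)
    "months_2019_2020": [0.1, -0.6, 0.2],   # 1 year of change (made up)
})

# Put both periods on a per-year footing before plotting them side by side.
df["per_year_2010_2019"] = df["months_2010_2019"] / 9
df["per_year_2019_2020"] = df["months_2019_2020"] / 1

print(df)
```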

 

 


Here's a radar chart that works, sort of

In the same Reuters article that featured the speedometer chart which I discussed in this blog post (link), the author also deployed a small multiples of radar charts.

These radar charts are supposed to illustrate the article's theme that "European countries are racing to fill natural gas storage sites ahead of winter."

Here's the aggregate chart that shows all countries:

Reuters_gastorage_radar_details

In general, I am not a fan of radar charts. When I first looked at this chart, I also disliked it. But keep reading because I eventually decided that this usage is an exception. One just needs to figure out how to read it.

One reason why I dislike radar charts is that they always come with a lot of non-data-ink baggage. We notice that the months of the year are plotted in a circle starting at the top. The start of the war (Feb 24, 2022) is marked off in red. Then there is the dotted circle, which represents the 80% gas storage target.

The trick is to avoid interpreting the areas, or the shapes of the blue and gray patches. I know, they look cool and grab our attention but in the context of conveying data, they are meaningless.

Redo_reuters_eugasradarall_1

Instead of areas, focus on the boundaries of those patches. Don't follow one boundary around the circle. Pick a point in time, corresponding to a line between the center of the circle and the outermost circle, and look at the gap between the two lines. In the diagram shown on the right, I marked off the two relevant points on the day of the start of the war.

From this, we observe that across Europe, the gas storage was far less than the 80% target (recently set).

Redo_reuters_eugasradarall_2

By comparing two other points (the blue and gray boundaries), we see that during February, gas storage is at a seasonal low, and in 2022, it is on the low side of the 5-year average.

However, the visual does not match well with the theme of the article! While the gap between the blue and gray boundaries decreased since the start of the war, the blue boundary does not exceed the historical average, and does not get close to 80% until August, a month in which gas storage reaches 80% in a typical year.

This is an example of a chart in which there is a misalignment between the Q and the V corners of the Trifecta Checkup (link).

_trifectacheckup_image

The question/message is that Europeans are reacting to the war by increasing their gas storage beyond normal. The visual actually says that they are increasing the gas storage as per normal.

***

As I noted before, when read in a particular way, these radar charts serve their purpose, which is more than can be said for most radar charts.

The designer made several wise choices:

Instead of drawing one ring for each year of data, the designer averaged the past 5 years and turned that into one single ring (patch). You can imagine what this radar chart would look like if the prior data were not averaged: hula hoop mania!

Marawa-bgt
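
Computationally, the averaging is the easy part. A minimal sketch, assuming a long-format table of daily storage levels with columns date and fill_pct (both names, and the synthetic data, are my assumptions):

```python
import numpy as np
import pandas as pd

# Fake five years of daily storage levels with a seasonal pattern, standing in for
# the real data behind the Reuters chart.
dates = pd.date_range("2017-01-01", "2021-12-31", freq="D")
fill_pct = 55 + 30 * np.sin(2 * np.pi * (dates.dayofyear - 80) / 365)
hist = pd.DataFrame({"date": dates, "fill_pct": fill_pct})

# One average value per calendar month: the single "ring" that replaces five hoops.
avg_ring = hist.groupby(hist["date"].dt.month)["fill_pct"].mean()
print(avg_ring.round(1))
```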

Simplifying the data in this way also makes the small multiples work. The designer uses the aggregate chart as a legend and a how-to-read-this guide. And in a further section below, the designer plots individual countries, without the non-data-ink baggage:

Reuters_gastorage_mosttofill

Thanks again to longtime reader Antonio R. who submitted this chart.

Happy Labor Day weekend for those in the U.S.!

 

 

 


Two uses of bumps charts

Long-time reader Antonio R. submitted the following chart, which illustrates analysis from a preprint on the effect of Covid-19 on life expectancy in the U.S. (link)

Aburto_covid_lifeexpectancy

Aburto_lifeexpectancy

For this post, I want to discuss the bumps chart on the lower right corner. Bumps charts are great at showing change over time. In this case, the authors are comparing two periods "2010-2019" and "2019-2020". By glancing at the chart, one quickly divides the causes of death into three groups: (a) COVID-19 and CVD, which experienced a big decline; (b) respiratory, accidents, others ("rest"), and despair, which experienced increases; and (c) cancer and infectious, which remained the same.

And yet, something doesn't seem right.

What isn't clear is the measured quantity. The chart title says "months gained or lost" but it takes a moment to realize the plotted data are not number of months but ranks of the effects of the causes of deaths on life expectancy.

Observe that the distance between each cause of death is the same. Look at the first rising line (respiratory): the actual values went from 0.8 months down to 0.2 - the line rises only because respiratory moved up in rank relative to the other causes.

***

While the canonical bumps chart plots ranks, the same chart form can be used to show numeric data. I prefer to use the same term for both charts. In recent years, the bumps chart showing numeric data has been called "slopegraph".

Here is a side-by-side comparison of the two charts:

Redo_aburto_covidlifeexpectancy

The one on the left is the same as the original. The one on the right plots the number of months increased or decreased.

The choice of chart form paints very different pictures. There are four blue lines on the left, indicating a relative increase in life expectancy - these causes of death contributed more to life expectancy between the two periods. Three of the four are red lines on the right chart. Cancer was shown as a flat line on the left - because it was the highest ranked item in both periods. The right chart shows that the numeric value for cancer suffered one of the largest drops.

The left chart exaggerates small numeric changes while it condenses large numeric changes.
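
For anyone curious about the mechanical difference, here's a small sketch that draws both forms from the same made-up numbers (the values below are placeholders, not the preprint's estimates):

```python
import matplotlib.pyplot as plt

# Made-up contributions to life-expectancy change (months) in the two periods.
data = {
    "Cancer":      ( 1.2,  0.1),
    "CVD":         ( 0.9, -0.8),
    "Respiratory": ( 0.8,  0.2),
    "Despair":     (-0.5, -0.3),
    "COVID-19":    ( 0.0, -2.0),
}

# Rank of each cause within each period (1 = largest contribution).
ranks = {p: {name: r + 1 for r, name in enumerate(sorted(data, key=lambda k: -data[k][p]))}
         for p in (0, 1)}

fig, (ax_rank, ax_value) = plt.subplots(1, 2, figsize=(9, 4))

for name, (v0, v1) in data.items():
    # Left: the canonical bumps chart plots ranks.
    ax_rank.plot([0, 1], [ranks[0][name], ranks[1][name]], marker="o")
    ax_rank.text(1.05, ranks[1][name], name, va="center")
    # Right: the slopegraph plots the numeric values themselves.
    ax_value.plot([0, 1], [v0, v1], marker="o")
    ax_value.text(1.05, v1, name, va="center")

ax_rank.invert_yaxis()   # rank 1 at the top
ax_rank.set_title("Bumps chart (ranks)")
ax_value.set_title("Slopegraph (months gained or lost)")
for ax in (ax_rank, ax_value):
    ax.set_xticks([0, 1])
    ax.set_xticklabels(["2010-2019", "2019-2020"])

plt.tight_layout()
plt.show()
```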

 

 


Speedometer charts: love or hate

Pie chart hate is tired. In this post, I explain my speedometer hate. (Speedometer charts are also called gauges or dials.)

Next to pie charts, speedometers are perhaps the second most beloved chart species found on business dashboards. Here is a typical example:

Speedometers_example

 

For this post, I found one on Reuters about natural gas in Europe. (Thanks to long-time contributor Antonio R. for the tip.)

Eugas_speedometer

The reason for my dislike is the inefficiency of this chart form. In classic Tufte-speak, the speedometer chart has a very poor data-to-ink ratio. The entire chart above contains just one datum (73%). Most of the ink is spilled on non-data things.

This single number has a large entourage:

- the curved axis
- ticks on the axis
- labels on the scale
- the dial
- the color segments
- the reference level "EU target"

These are not mere decorations. Taking these elements away makes it harder to understand what's on the chart.

Here is the chart without the curved axis:

Redo_eugas_noaxis

Here is the chart without axis labels:

Redo_eugas_noaxislabels

Here is the chart without ticks:

Redo_eugas_notickmarks

When the tick labels are present, the chart still functions.

Here is the chart without the dial:

Redo_eugas_nodial

The datum is redundantly encoded in the color segments of the "axis".

Here is the chart without the dial or the color segments:

Redo_eugas_nodialnosegments

If you find yourself stealing a peek at the chart title below, you're not alone.

All versions except one increase our cognitive load. This means the entourage is largely necessary if one encodes the single number in a speedometer chart.

The problem with the entourage is that readers may resort to reading the text rather than the chart.

***

The following is a minimalist version of the Reuters chart:

Redo_eugas_onedial

I removed the axis labels and the color segments. The number 73% is shown using the dial angle.

The next chart adds back the secondary message about the EU target, as an axis label, and uses color segments to show the 73% number.

Redo_eugas_nodialjustsegments

As with pie charts, there are limited situations in which speedometer charts are acceptable. But most of the ones we see out there are just not right.

***

One acceptable situation is to illustrate percentages or proportions, which is what the EU gas chart does. Of course, in that situation, one can also use a pie chart without shame.

For illustrating proportions, I prefer to use a full semicircle, instead of the circular sector of arbitrary angle as Reuters did. The semicircle lends itself to easy marks of 25%, 50%, 75%, etc, eliminating the need to print those tick labels.
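
For what it's worth, such a semicircular gauge is easy to mock up. Here is a rough matplotlib sketch (the 73% figure is the one from the Reuters chart; everything else is my own choice):

```python
import numpy as np
import matplotlib.pyplot as plt

value = 0.73   # EU gas storage level from the Reuters chart

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.set_theta_zero_location("W")   # 0% at the left end of the semicircle
ax.set_theta_direction(-1)        # sweep clockwise, over the top, toward the right
ax.set_thetamin(0)
ax.set_thetamax(180)              # a full semicircle, so 25/50/75% fall at obvious angles

ax.barh(1, np.pi, height=0.4, color="lightgray")           # the unfilled track
ax.barh(1, value * np.pi, height=0.4, color="steelblue")   # the filled arc encodes 73%

ax.set_xticks(np.linspace(0, np.pi, 5))
ax.set_xticklabels(["0%", "25%", "50%", "75%", "100%"])
ax.set_yticks([])
ax.set_title("Gas storage: 73% of capacity")
plt.show()
```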

***

One use case to avoid is numeric data.

Take the regional sales chart pulled randomly from a Web search above:

Speedometers_example

These charts are completely useless without the axis labels.

Besides, because the span of the axis isn't 0% to 100%, every tick mark must be labelled with the numeric value. That's a lot of extra ink used to display a single value!


Another reminder that aggregate trends hide information

The last time I looked at the U.S. employment situation, it was during the pandemic. The data revealed the deep flaws of the so-called "not in labor force" classification. This classification is used to dehumanize unemployed people who are declared "not in labor force," in which case they are neither employed nor unemployed -- just not counted at all in the official unemployment (or employment) statistics.

The reason given for such a designation was that some people just have no interest in working, or even looking for a job. These are not merely the "discouraged" workers - there is a separate category for those. In theory, these people haven't looked for a job for so long that they are no longer visible to the bean counters at the Bureau of Labor Statistics.

What happened when the pandemic precipitated a shutdown in many major cities across America? The number of "not in labor force" shot up instantly, literally within a few weeks. That makes a mockery of the reason for such a designation. See this post for more.

***

The data we saw last time was up to April, 2020. That's more than two years old.

So I have updated the charts to show what has happened in the last couple of years.

Here is the overall picture.

Junkcharts_unemployment_notinLFparttime_all_2

In this new version, I centered the chart at the 1990 data. The chart features two key drivers of the headline unemployment rate - the proportion of people designated "invisible", and the proportion of those who are considered "employed" who are "part-time" workers.

The last two recessions have caused structural changes to the labor market. From 1990 to the late 2000s, which included the dot-com bust, these two metrics circulated within a small area of the chart. The Great Recession of the late 2000s led to a huge jump in the proportion called "invisible". It also pushed the proportion of part-timers to all-time highs. The proportion of part-timers has since fallen, although this trend is hard to interpret from the chart alone - if the newly invisible were previously part-time employed, the same cause could be responsible for both trends.
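
For readers who want the precise definitions behind the plotted proportions and the headline rate, here's a sketch using hypothetical counts; the real numbers come from the BLS household survey:

```python
# Hypothetical monthly counts (in thousands), shaped like the BLS household-survey series.
civilian_population = 264_000   # civilian noninstitutional population
employed            = 158_000
unemployed          =   6_000
part_time           =  26_000   # part-timers are a subset of the employed

labor_force        = employed + unemployed
not_in_labor_force = civilian_population - labor_force

invisible_share   = not_in_labor_force / civilian_population   # the "invisible" proportion
part_time_share   = part_time / employed                        # part-timers among the employed
unemployment_rate = unemployed / labor_force                    # the headline unemployment rate

print(f"invisible: {invisible_share:.1%}  part-time: {part_time_share:.1%}  unemployment: {unemployment_rate:.1%}")
```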

_numbersense_bookcover

Readers of Numbersense (link) might be reminded of a trick used by school deans to pump up their US News rankings. Some schools accept lots of transfer students. This subpopulation is invisible to the US News statisticians since they do not factor into the rankings. The recent scandal at Columbia University also involves reclassifying students (see this post).

Zooming in on the last two years, it appears that the pandemic-related unemployment situation has reversed.

***

Let's split the data by gender.

American men have been stuck in a negative spiral since the 1990s. With each recession, a higher proportion of men are designated BLS invisibles.

Junkcharts_unemployment_notinLFparttime_men_2

In the grid system set up in this scatter plot, the top right corner is the worst of all worlds - the work force has shrunk and there are more part-timers among those counted as employed. American men are not exiting this quadrant any time soon.

***
What about the women?

Junkcharts_unemployment_notinLFparttime_women_2

If we compare 1990 with 2022, the story is not bad. The female work force is gradually reaching the same scale as in 1990 while the proportion of part-time workers has declined.

However, celebrating the above is to ignore the tremendous gains American women made in the 1990s and 2000s. In 1990, only 58% of women were considered part of the work force - the other 42% were not working, but they were not counted as unemployed. By 2000, the female work force had expanded to include about 60%, with similar proportions counted as part-time employed as in 1990. That's great news.

The Great Recession of the late 2000s changed that picture. Just like men, many women became invisible to BLS. The invisible proportion reached 44% in 2015 and has not returned to anywhere near the 2000 level. Fewer women are counted as part-time employed; as I said above, it's hard to tell whether this is because the women exiting the work force previously worked part-time.

***

The color of the dots in all charts is determined by the headline unemployment number. Blue represents low unemployment. During the 1990-2022 period, there are three moments in which unemployment is reported as 4 percent or lower. These charts are intended to show that an aggregate statistic hides a lot of information. The three times at which the unemployment rate reached historic lows represent three very different situations, if one considers the sizes of the work force and the number of part-time workers.

 

P.S. [8-15-2022] Some more background about the visualization can be found in prior posts on the blog: here is the introduction, and here's one that breaks it down by race. Chapter 6 of Numbersense (link) gets into the details of how unemployment rate is computed, and the implications of the choices BLS made.

P.S. [8-16-2022] Corrected the axis title on the charts (see comment below). Also, added source of data label.


What does Elon Musk do every day?

The Wall Street Journal published a fun little piece about tweets by Elon Musk (link).

Here is an overview of every tweet he sent since he started using Twitter more than a decade ago.

Wsj_musk_tweets_alldaylong2
Apparently, he sent at least one tweet almost every day for the last four years. In addition, his tweets appear at all hours of the day. (Presumably, he is not the only one tweeting from his account.)

He doesn't just spend time writing tweets; he also reads other people's tweets. WSJ finds that up to 80% of his tweets include mentions of other users.

Wsj_musk_tweets_mentionsothers7

***

One problem with "big data" analytics is that they often don't answer interesting questions. Twitter is already one of the companies that put more of their data out there, but still, analysts are missing some of the most important variables.

We know that Musk has 93 million followers. We already know from recent news that a large proportion of those followers may be spam/fake. It is frequently assumed in Twitter analysis that any tweet he makes reaches 93 million accounts. That's actually far from correct. Twitter uses algorithms to decide what posts show up in each user's feed, so we have no idea how many of the 93 million accounts are in fact exposed to any of Musk's tweets.

Further, not every user reads everything on their Twitter feed. I don't even check it every day. Because Twitter operates as a "firehose" with ever-changing content as users send out short messages at all hours, what one sees depends on when one reads. If Musk tweets in the morning, the users who log on in the afternoon won't see it.

Let's say an analyst wants to learn how impactful Musk's tweets are. That's pretty difficult when one can't figure out which of the 93 million followers were shown these tweets, and who read them. The typical data used to measure response are retweets and likes. Those are convenient metrics because they are available. They are very limited in what they measure. There are lots of users who don't like or retweet at all.

***

The available data do make for some fun charts. This one gave me a big smile:

Wsj_musk_tweets_emojis9

Between writing tweets, reading tweets, and ROTFL, every hour of almost every day, Musk finds time to run his several companies. That's impressive.

 


Selecting the right analysis plan is the first step to good dataviz

It's a new school term, and my friend Ray Vella shared some student projects from his NYU class on infographics. There's always something to learn from these projects.

The starting point is a chart published in the Economist a few years ago.

Economist_richgetricher

This is a challenging chart to read. To save you the time, the following key points are pertinent:

a) income inequality is measured by the disparity between regional averages

b) the incomes are given in a double index, a relative measure. For each country-year combination, the national average GDP is set to 100. So a value of 150 for the richest region of Spain in 2015 means that its average income was 50% higher than Spain's national average in that year.

The original chart - as well as most of the student work - is based on a specific analysis plan. The difference in the index values between the richest and poorest regions is used as a measure of the degree of income inequality, and the change in that difference over time as a measure of the change in income inequality. That's as big a mouthful as it sounds.

This analysis plan can be summarized as:

1) all incomes -> relative indices, at each region-year combination
2) inequality = rich - poor region gap, at each country-year combination
3) inequality over time = inequality in 2015 - inequality in 2000, for each country
4) country difference = inequality in country A - inequality in country B, for each year

***

One student, J. Harrington, looks at the data through an alternative lens that brings clarity to the underlying data. Harrington starts with change in income within the richest regions (then the poorest regions), so that a worsening income inequality should imply that the richest region is growing incomes at a faster clip than the poorest region.

This alternative analysis plan can be summarized as:
1) change in income over time for richest regions for each country
2) change in income over time for poorest regions for each country
3) inequality = change in income over time: rich - poor, for each country (see the sketch just below)
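
Here is a quick sketch of that alternative plan in pandas, using a made-up two-country extract of the indexed data (all numbers are placeholders):

```python
import pandas as pd

# Made-up long-format extract: one row per (country, region_type, year), with incomes
# expressed as the double index (national average = 100).
df = pd.DataFrame({
    "country":     ["Spain"] * 4 + ["Germany"] * 4,
    "region_type": ["richest", "richest", "poorest", "poorest"] * 2,
    "year":        [2000, 2015, 2000, 2015] * 2,
    "index":       [145, 150, 70, 65, 160, 175, 80, 82],
})

wide = df.pivot_table(index=["country", "region_type"], columns="year", values="index")

# Steps 1 and 2: change in (indexed) income over time, separately for richest and poorest regions.
wide["change"] = wide[2015] - wide[2000]

# Step 3: inequality = change for the richest region minus change for the poorest, per country.
by_country = wide["change"].unstack("region_type")
by_country["inequality_change"] = by_country["richest"] - by_country["poorest"]
print(by_country)
```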

The restructuring of the analysis plan makes a big difference!

Here is one way to show this alternative analysis:

Junkcharts_kfung_sixeurocountries_gdppercapita

The underlying data have not changed but the reader's experience is transformed.


Deficient deficit depiction

A Twitter user alerted me to this chart put out by the Biden administration trumpeting a reduction in the budget deficit from 2020 to 2021:

Omb_deficitreduction

This column chart embodies a form that is popular in many presentations, including in scientific journals. It's deficient in so many ways it's a marvel how it continues to live.

There are just two numbers: -3132 and -2772. Their difference is $360 billion, which is just over 10 percent of the earlier number. It's not clear what any data graphic can add.

Indeed, the chart does not do much. It obscures the actual data. What is the budget deficit in 2020? Readers must look at the axis labels, and judge that it's about a quarter of the way between 3000 and 3500. Five hundred quartered is 125. So it's roughly $3.125 trillion. Similarly, the 2021 number is slightly above the halfway point between 2,500 and 3,000.

These numbers are upside down. Taller columns are bad! Shortening the columns is good. It's all counter intuitive.

Column charts encode data in the heights of the columns. The designer apparently wants readers to believe the deficit has been cut by about a third.

As usual, this deception is achieved by cutting the column chart off at its knees. Removing equal sections of each column destroys the proportionality of the heights.

Why hold back? Here's a version of the chart showing the deficit was cut by half:

Junkcharts_redo_ombbudgetdeficit

The relative percent reduction depends on where the baseline is placed. The only defensible baseline is the zero baseline. That's the only setting under which the relative percent reduction is accurately represented visually.
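
To put numbers on it, here's a tiny sketch of how the apparent reduction depends on where the axis is cut (the baselines below are illustrative picks, not values read off either chart):

```python
deficit_2020, deficit_2021 = 3132, 2772   # absolute deficits, in $ billions

def apparent_reduction(baseline):
    """Visual shrinkage of the 2021 column when the axis starts at `baseline`."""
    return 1 - (deficit_2021 - baseline) / (deficit_2020 - baseline)

print(f"{apparent_reduction(0):.0%}")      # 11% - the true relative reduction
print(f"{apparent_reduction(2000):.0%}")   # 32% - "cut by about a third"
print(f"{apparent_reduction(2400):.0%}")   # 49% - roughly "cut by half"
```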

***

This same problem presents itself subtly in Covid-19 vaccine studies. I explain in this post, which I rate as one of my best Covid-19 posts. Check it out!

 

 


Superb tile map offering multiple avenues for exploration

Here's a beauty by WSJ Graphics:

Wsj_powerproduction

The article is here.

This data graphic illustrates the power of the visual medium. The underlying dataset is complex: power production by type of source by state by month by year. That's more than 90,000 numbers. They all reside on this graphic.

Readers amazingly make sense of all these numbers without much effort.

It starts with the summary chart on top.

Wsj_powerproduction_us_summary

The designer made several wise decisions. The data are presented in relative terms, as proportion of total power production. Only the first and last years are labeled, thus drawing our attention to the long-term trend. The order of the color blocks is carefully selected so that the cleaner sources are listed at the top and the dirtier sources at the bottom. The order of the legend labels mirrors the color blocks in the area chart.

It takes only a few seconds to learn that U.S. power production has largely shifted away from coal with most of it substituted by natural gas. Other than wind, the green sources of power have not gained much ground during these years - in a relative sense.
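
As a rough sketch of those design decisions in code (the file name, column names, and source labels are all my assumptions, not WSJ's actual data):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed long-format input: one row per (month, source) with generation in GWh.
df = pd.read_csv("us_generation.csv", parse_dates=["month"])   # columns: month, source, gwh

# Decision 1: present the data in relative terms - shares of total production per month.
wide = df.pivot_table(index="month", columns="source", values="gwh", aggfunc="sum")
shares = wide.div(wide.sum(axis=1), axis=0)

# Decision 2: a deliberate stacking order - cleaner sources at the top, dirtier at the bottom.
order = ["solar", "wind", "hydro", "nuclear", "natural_gas", "coal"]   # assumed labels
shares = shares.reindex(columns=order)

# matplotlib stacks the first series at the bottom, so feed the list in reverse.
plt.stackplot(shares.index, [shares[c] for c in reversed(order)],
              labels=list(reversed(order)))
plt.legend(loc="upper left")
plt.ylabel("share of U.S. power production")
plt.show()
```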

This summary chart serves as a reading guide for the rest of the chart, which is a tile map of all fifty states. Embedded in the tile map is a small-multiples arrangement.

***

The map offers multiple avenues for exploration.

Some readers may look at specific states. For example, California.

Wsj_powerproduction_california

Currently, about half of the power production in California comes from natural gas. Notably, there is no coal at all in any of these years. In addition to wind, solar energy has also gained. All of these insights come without the need for any labels or gridlines!

Wsj_powerproduction_westernstates

Browsing around California, readers find different patterns in other Western states like Oregon and Washington.

Hydroelectric energy is the dominant source in those two states, with wind gradually taking share.

At this point, readers realize that the summary chart up top hides remarkable state-level variations.

***

There are other paths through the map.

Some readers may scan the whole map, seeking patterns that pop out.

One such pattern is the cluster of states that use coal. In most of these states, the proportion of coal has declined.

Yet another path exists for those interested in specific sources of power.

For example, the trend in nuclear power usage is easily followed by tracking the purple. South Carolina, Illinois and New Hampshire are three states that rely on nuclear for more than half of their power.

Wsj_powerproduction_vermont

I wonder what happened in Vermont about 8 years ago.

The chart says they renounced nuclear energy. Here is some history. This one-time event caused a disruption in the time series, unique on the entire map.

***

This work is wonderful. Enjoy it!


Funnel is just for fun

This is part 2 of a review of a recent video released by NASA. Part 1 is here.

The NASA video that starts with the spiral chart showing changes in average global temperature takes a long time (about 1 minute) to run through 14 decades of data, and for those who are patient, the chart then undergoes a dramatic transformation.

With a sleight of hand, the chart went from a set of circles to a funnel. Here is a look:

Nasa_climatespiral_funnel

What happens is the reintroduction of a time dimension. Imagine pushing the center of the spiral down into the screen to create a third dimension.

Our question as always is - what does this chart tell readers?

***

The chart seems to say that the variability of temperature has increased over time (based on the width of the funnel). The red/blue color says the temperature is getting hotter especially in the last 20-40 years.

When the reader looks beneath the surface, the chart starts to lose sense.

The width of the funnel is really a diameter of the spiral chart in the given year. But, if you recall, the diameter of the spiral (polar) chart isn't the same for every pair of opposite months.

Nasa_climatespiral_fullperiod

In the particular rendering of this video, the width of the funnel is the diameter linking the April and October values.

Remember the polar gridlines behind the spiral:

Nasa_spiral_gridlines

Notice the hole in the middle. This hole has arbitrary diameter. It can be as big or as small as the designer makes it. Thus, the width of the funnel is as big or as small as the designer wants it. But the first thing that caught our attention is the width of the funnel.

***

The entire section between -1 and +1 is, in fact, meaningless. In the following chart, I removed the core of the funnel, adding back the -1 degree line. Doing so exposes an incompatibility between the spiral and funnel views. The middle of the polar grid is negative infinity, a black hole.

Junkcharts_nasafunnel_arbitrarygap

For a moment, the two sides of the funnel look like they are mirror images. That's not correct, either. Each width of the funnel represents a year, and the extreme values represent April and October values. The line between those two values does not signify anything real.

Let's take a pair of values to see what I mean.

Junkcharts_nasafunnel_lines

I selected two values for October 2021 and October 1899 such that the first value appears as a line double the length of the second. The underlying values are +0.99C and -0.04C, roughly speaking, +1 and 0, so the first value is definitely not twice the size of the second.
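
To see why, here's the arithmetic under an assumed radial mapping (the hole size and the baseline are the designer's choices; the values below are my guesses, picked only to reproduce the roughly 2-to-1 appearance):

```python
# Assumed mapping from temperature anomaly to rendered radial length:
# length = hole + (value - baseline). Both hole and baseline are design choices.
hole, baseline = 0.1, -1.0

oct_2021, oct_1899 = 0.99, -0.04   # the underlying anomalies (deg C)

len_2021 = hole + (oct_2021 - baseline)   # 2.09
len_1899 = hole + (oct_1899 - baseline)   # 1.06

print(len_2021 / len_1899)   # ~2.0 - the first line looks about twice as long
# ...even though the underlying values are roughly +1 and 0, nowhere near a 2:1 ratio.
```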

The funnel chart can be interpreted, in an obtuse way, as a pair of dot plots. As shown below, if we take dot plots for Aprils and Octobers of every year, turn the chart around, and then connect the corresponding dots, we arrive at the funnel chart.

Junkcharts_nasafunnel_fromdotplots
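
Here's a sketch of that construction, with synthetic April and October series and the same assumed radial mapping as in the previous snippet, just to show the mechanics:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic April and October anomaly series (deg C), one value per year.
years = np.arange(1880, 2023)
rng = np.random.default_rng(0)
trend = np.linspace(-0.3, 1.0, years.size)
april = trend + rng.normal(0, 0.08, years.size)
october = trend - 0.05 + rng.normal(0, 0.08, years.size)

hole, baseline = 0.1, -1.0                 # assumed design choices, as before
r_april = hole + (april - baseline)        # radial length assigned to each April value
r_october = hole + (october - baseline)    # radial length assigned to each October value

fig, (ax_dots, ax_funnel) = plt.subplots(1, 2, figsize=(9, 5), sharey=True)

# Left: the two dot plots, with years running upward.
ax_dots.plot(april, years, ".", label="April")
ax_dots.plot(october, years, ".", label="October")
ax_dots.set_xlabel("anomaly (deg C)")
ax_dots.legend()

# Right: turn the chart around and connect corresponding dots - the funnel silhouette.
# October lengths are flipped to the left of center; April lengths sit to the right.
ax_funnel.fill_betweenx(years, -r_october, r_april, color="lightgray")
ax_funnel.axvline(0, color="black", lw=0.5)
ax_funnel.set_xlabel("rendered width")

plt.tight_layout()
plt.show()
```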

***

This NASA effort illustrates a central problem in visual communications: attention (what Andrew Gelman calls "grabbiness") and information integrity. On the one hand, what's the point of an accurate chart when no one is paying attention? On the other hand, what's the point of a grabby chart when anyone who pays attention gets the wrong information? It's not easy to find that happy medium.