Longest life, shortest length

Racetrack charts refuse to die. For old time's sake, here is a blog post from 2005 in which I explain why they don't make good dataviz.

Our latest example comes from Visual Capitalist (link), which publishes a fair share of nice dataviz. In this infographics, they feature a racetrack chart, just because the topic is the lifespan of cars.

Visualcapitalist_lifespan_cars_top

The whole infographic has four parts, each a racetrack chart. I'll focus on the first racetrack chart (shown above), which deals with the product category of sedans and hatchbacks.

The first thing I noticed is the reference value of 100,000 miles, which is described as the expected lifespan of a typical car made in the 1970s. This is of dubious value since the top of the page informs us the current relevant reference value is 200,000 miles, which is unlabeled. We surmise that 200,000 miles is indicated by the end of the grey sections of the racetrack. (This is eventually confirmed in the next racettrack chart for SUVs in the second sectiotn of the infographic.)

Now let's zoom in on the brown section of the track. Each of the four sections illustrates the same datum = 100,000 miles and yet they exhibit different lengths. From this, we learn that the data are not encoded in the lengths of these tracks -- but rather the data are to be found in the angle sustained at the centre of the concentric circles. The problem with racetrack charts is that readers are drawn to the lengths of the tracks rather than the angles at the center, which are not explicitly represented.

The Avalon model has the longest life span on this chart, and yet it is shown as the shortest curve.

***

The most baffling part of this chart is not the visual but the analysis methodology.

I quote:

iSeeCars analyzed over 2M used cars on the road between Jan. and Oct. 2022. Rankings are based on the mileage that the top 1% of cars within each model obtained.

According to this blurb, the 245,710 miles number for Avalon is the average mileage found in the top 1% of Avalons within the iSeeCars sample of 2M used cars.

The word "lifespan" strikes me as incorporating a date of death, and yet nothing in the above text indicates that any of the sampled cars are at end of life. The cars they really need are not found in their sample at all.

I suppose taking the top 1% is meant to exclude younger cars but why 1%? Also, this sample completely misses the cars that prematurely died, e.g. the cars that failed after 100,000 miles but before 200,000 miles. This filtering also ensures that newer models are excluded from the sample.

_trifectacheckup_imageIn the Trifecta Checkup, this qualifies as Type DV. The dataset does not answer the question of concern while the visual form distorts the data.


Achieving symmetry and obscurity

The following diagram found in an article on a logistics problem absorbed me for the larger part of an hour:

Table7_orderpicking_pyramiddiagram

I haven't seen this chart form before, and it looks cute.

Quickly, I realize this to be one of those charts that require a big box "How to read me". The only hint comes in the chart title: the chart concerns combinations of planning problems. The planning problems are listed on the left. If you want to give it a go, try now before continuing with this blog post. 

***

It took me and a coworker together to unpack this chart. Here's one way to read it:

Fig7_howtoread

Assume I want to know what other problems the problem of "workforce allocation" is associated with. I'd go to the workforce allocation row, then scan both up and down the diagonals. Going up, I see that the authors found one (1) paper that discusses workforce allocation together with workforce level, two (2) papers that feature workforce allocation together with storage location assignment, etc. while going down, I see that workforce allocation is paired with batching in two papers and with order consolidation & sorting in one paper.

You may recognize the underlying data as a type of correlation matrix, which is commonly shown as an upper or lower triangular matrix. Indeed, the same data can be found in a different presentation in the same paper:

Table6_orderpicking

All the numbers are the same. What happened was the designer transformed the upper triangular matrix into an inverted (isoceles) triangle, then turned it aside. The row labels are preserved, while the column labels are dropped. Then, the row labels are snapped to cover the space which was formerly the empty lower triangular matrix.

Junkcharts_vangil_transform

A gain in symmetry, a loss in clarity.

***

Why is this cute, symmetric arrangement so much harder to read? It's out of step with the reader's cognitive path. The reader first picks a planning problem, then scans up and down looking for the correct pair.

Fig7_howtoread_2

Compare this to the matrix view: the reader picks a pair of problems, then finds the single cell that gives the number of articles.

Fig7andfig6_cognition

One could borrow the reading strategy from the matrix, and proceed like this:

Fig7_howtoread_3

The reason why this cognition path doesn't come naturally is that there is only one set of labels on this triangular chart, compared to two sets in the common matrix format. It's unusual to have to pick out two items simultaneously from a single axis.

***

In the end, even though I like the idea of inducing symmetry, I am not convinced by the result.

***

The color scheme for the cells is also baffling. According to the legend, the dark color indicates research that solves a pair of problems in an integrated way while the light color is used when the researchers only analyze the interactions between the two problems.

What's odd is that each cell (pair of problems) is designated a single color. Since we expect researchers to take the different approaches to solving a given pair of problems, we deduce that the designated color represents the most frequent approach. What then does the number inside each cell represent? It can be the number of papers applying the color-coded solution approach, or it can be the total number of papers regardless of the solution approach.

 

P.S. [12-18-2022] See comments below for other examples of the triangular chart.

 

 


Energy efficiency deserves visual efficiency

Long-time contributor Aleksander B. found a good one, in the World Energy Outlook Report, published by IEA (International Energy Agency).

Iea_balloonchart_emissions

The use of balloons is unusual, although after five minutes, I decided I must do some research to have any hope of understanding this data visualization.

A lot is going on. Below, I trace my own journey through this chart.

The text on the top left explains that the chart concerns emissions and temperature change. The first set of balloons (the grey ones) includes helpful annotations. The left-right position of the balloons indicates time points, in 10-year intervals except for the first.

The trapezoid that sits below the four balloons is more mysterious. It's labelled "median temperature rise in 2100". I debate two possibilities: (a) this trapezoid may serve as the fifth balloon, extending the time series from 2050 to 2100. This interpretation raises a couple of questions: why does the symbol change from balloon to trapezoid? why is the left-right time scale broken? (b) this trapezoid may represent something unrelated to the balloons. This interpretation also raises questions: its position on the horizontal axis still breaks the time series; and  if the new variable is "median temperature rise", then what determines its location on the chart?

That last question is answered if I move my glance all the way to the right edge of the chart where there are vertical axis labels. This axis is untitled but the labels shown in degree Celsius units are appropriate for "median temperature rise".

Turning to the balloons, I wonder what the scale is for the encoded emissions data. This is also puzzling because only a few balloons wear data labels, and a scale is nowhere to be found.

Iea_balloonchart_emissions_legend

The gridlines suggests that the vertical location of the balloons is meaningful. Tracing those gridlines to the right edge leads me back to the Celsius scale, which seems unrelated to emissions. The amount of emissions is probably encoded in the sizes of the balloons although none of these four balloons have any data labels so I'm rather flustered. My attention shifts to the colored balloons, a few of which are labelled. This confirms that the size of the balloons indeed measures the amount of emissions. Nevertheless, it is still impossible to gauge the change in emissions for the 10-year periods.

The colored balloons rising above, way above, the gridlines is an indication that the gridlines may lack a relationship with the balloons. But in some charts, the designer may deliberately use this device to draw attention to outlier values.

Next, I attempt to divine the informational content of the balloon strings. Presumably, the chart is concerned with drawing the correlation between emissions and temperature rise. Here I'm also stumped.

I start to look at the colored balloons. I've figured out that the amount of emissions is shown by the balloon size but I am still unclear about the elevation of the balloons. The vertical locations of these balloons change over time, hinting that they are data-driven. Yet, there is no axis, gridline, or data label that provides a key to its meaning.

Now I focus my attention on the trapezoids. I notice the labels "NZE", "APS", etc. The red section says "Pre-Paris Agreement" which would indicate these sections denote periods of time. However, I also understand the left-right positions of same-color balloons to indicate time progression. I'm completely lost. Understanding these labels is crucial to understanding the color scheme. Clearly, I have to read the report itself to decipher these acronyms.

The research reveals that NZE means "net zero emissions", which is a forecasting scenario - an utterly unrealistic one - in which every country is assumed to fulfil fully its obligations, a sort of best-case scenario but an unattainable optimum. APS and STEPS embed different assumptions about the level of effort countries would spend on reducing emissions and tackling global warming.

At this stage, I come upon another discovery. The grey section is missing any acronym labels. It's actually the legend of the chart. The balloon sizes, elevations, and left-right positions in the grey section are all arbitrary, and do not represent any real data! Surprisingly, this legend does not contain any numbers so it does not satisfy one of the traditional functions of a legend, which is to provide a scale.

There is still one final itch. Take a look at the green section:

Iea_balloonchart_emissions_green

What is this, hmm, caret symbol? It's labeled "Net Zero". Based on what I have been able to learn so far, I associate "net zero" to no "emissions" (this suggests they are talking about net emissions not gross emissions). For some reason, I also want to associate it with zero temperature rise. But this is not to be. The "net zero" line pins the balloon strings to a level of roughly 2.5 Celsius rise in temperature.

Wait, that's a misreading of the chart because the projected net temperature increase is found inside the trapezoid, meaning at "net zero", the scientists expect an increase in 1.5 degrees Celsius. If I accept this, I come face to face with the problem raised above: what is the meaning of the vertical positioning of the balloons? There must be a reason why the balloon strings are pinned at 2.5 degrees. I just have no idea why.

I'm also stealthily presuming that the top and bottom edges of the trapezoids represent confidence intervals around the median temperature rise values. The height of each trapezoid appears identical so I'm not sure.

I have just learned something else about this chart. The green "caret" must have been conceived as a fully deflated balloon since it represents the value zero. Its existence exposes two limitations imposed by the chosen visual design. Bubbles/circles should not be used when the value of zero holds significance. Besides, the use of balloon strings to indicate four discrete time points breaks down when there is a scenario which involves only three buoyant balloons.

***

The underlying dataset has five values (four emissions, one temperature rise) for four forecasting scenarios. It's taken a lot more time to explain the data visualization than to just show readers those 20 numbers. That's not good!

I'm sure the designer did not set out to confuse. I think what happened might be that the design wasn't shown to potential readers for feedback. Perhaps they were shown only to insiders who bring their domain knowledge. Insiders most likely would not have as much difficulty with reading this chart as did I.

This is an important lesson for using data visualization as a means of communications to the public. It's easy for specialists to assume knowledge that readers won't have.

For the IEA chart, here is a list of things not found explicitly on the chart that readers have to know in order to understand it.

  • Readers have to know about the various forecasting scenarios, and their acronyms (APS, NZE, etc.). This allows them to interpret the colors and section titles on the chart, and to decide whether the grey section is missing a scenario label, or is a legend.
  • Since the legend does not contain any scale information, neither for the balloon sizes nor for the temperatures, readers have to figure out the scales on their own. For temperature, they first learn from the legend that the temperature rise information is encoded in the trapezoid, then find the vertical axis on the right edge, notice that this axis has degree Celsius units, and recognize that the Celsius scale is appropriate for measuring median temperature rise.
  • For the balloon size scale, readers must resist the distracting gridlines around the grey balloons in the legend, notice the several data labels attached to the colored balloons, and accept that the designer has opted not to provide a proper size scale.

Finally, I still have several unresolved questions:

  • The horizontal axis may have no meaning at all, or it may only have meaning for emissions data but not for temperature
  • The vertical positioning of balloons probably has significance, or maybe it doesn't
  • The height of the trapezoids probably has significance, or maybe it doesn't

 

 


Following this pretty flow chart

Bloomberg did a very nice feature on how drought has been causing havoc with river transportation of grains and other commodities in the U.S., which included several well-executed graphics.

Mississippi_sankeyI'm particularly attracted to this flow chart/sankey diagram that shows the flows of grains from various U.S. ports to foreign countries.

It looks really great.

Here are some things one can learn from this chart:

  • The Mississippi River (blue flow) is by far the most important conduit of American grain exports
  • China is by far the largest importer of American grains
  • Mexico is the second largest importer of American grains, and it has a special relationship with the "interior" ports (yellow). Notice how the Interior almost exclusively sends grains to Mexico
  • Similarly, the Puget Sound almost exclusively trades with China

The above list is impressive for one chart.

***

Some key questions are not as easy to see from this layout:

  • What proportion of the total exports does the Mississippi River account for? (Turns out to be almost exactly half.)
  • What proportion of the total exports go to China? (About 40%. This question is even harder than the previous one because of all the unlabeled values for the smaller countries.)
  • What is the relative importance of different ports to Japan/Philippines/Indonesia/etc.? (Notice how the green lines merge from the other side of the country names.)
  • What is the relative importance of any of the countries listed, outside the top 5 or so?
  • What is the ranking of importance of export nations to each port? For Mississippi River, it appears that the countries may have been drawn from least important (up top) to most important (down below). That is not the case for the other ports... otherwise the threads would tie up into knots.

***

Some of the features that make the chart look pretty are not data-driven.

See this artificial "hole" in the brown branch.

Bloomberg_mississippigrains_branchgap

In this part of the flow, there are two tiny outflows to Myanmar and Yemen, so most of the goods that got diverted to the right side ended up merging back to the main branch. However, the creation of this hole allows a layering effect which enhances the visual cleanliness.

Next, pay attention to the yellow sub-branches:

Bloomberg_mississippigrains_subbranching

At the scale used by the designer, all of the countries shown essentially import about the same amount from the Interior (yellow). Notice the special treatment of Singapore and Phillippines. Instead of each having a yellow sub-branch coming off the "main" flow, these two countries share the sub-branch, which later splits.

 

 

 


Finding the right context to interpret household energy data

Bloomberg_energybillBloomberg's recent article on surging UK household energy costs, projected over this winter, contains data about which I have long been intrigued: how much energy does different household items consume?

A twitter follower alerted me to this chart, and she found it informative.

***
If the goal is to pick out the appliances and estimate the cost of running them, the chart serves its purpose. Because the entire set of data is printed, a data table would have done equally well.

I learned that the mobile phone costs almost nothing to charge: 1 pence for six hours of charging, which is deemed a "single use" which seems double what a full charge requires. The games console costs 14 pence for a "single use" of two hours. That might be an underestimate of how much time gamers spend gaming each day.

***

Understanding the design of the chart needs a bit more effort. Each appliance is measured by two metrics: the number of hours considered to be "single use", and a currency value.

It took me a while to figure out how to interpret these currency values. Each cost is associated with a single use, and the duration of a single use increases as we move down the list of appliances. Since the designer assumes a fixed cost of electicity (shown in the footnote as 34p per kWh), at first, it seems like the costs should just increase from top to bottom. That's not the case, though.

Something else is driving these numbers behind the scene, namely, the intensity of energy use by appliance. The wifi router listed at the bottom is turned on 24 hours a day, and the daily cost of running it is just 6p. Meanwhile, running the fridge and freezer the whole day costs 41p. Thus, the fridge&freezer consumes electricity at a rate that is almost 7 times higher than the router.

The chart uses a split axis, which artificially reduces the gap between 8 hours and 24 hours. Here is another look at the bottom of the chart:

Bloomberg_energycost_bottom

***

Let's examine the choice of "single use" as a common basis for comparing appliances. Consider this:

  • Continuous appliances (wifi router, refrigerator, etc.) are denoted as 24 hours, so a daily time window is also implied
  • Repeated-use appliances (e.g. coffee maker, kettle) may be run multiple times a day
  • Infrequent use appliances may be used less than once a day

I prefer standardizing to a "per day" metric. If I use the microwave three times a day, the daily cost is 3 x 3p = 9 p, which is more than I'd spend on the wifi router, run 24 hours. On the other hand, I use the washing machine once a week, so the frequency is 1/7, and the effective daily cost is 1/7 x 36 p = 5p, notably lower than using the microwave.

The choice of metric has key implications on the appearance of the chart. The bubble size encodes the relative energy costs. The biggest bubbles are in the heating category, which is no surprise. The next largest bubbles are tumble dryer, dishwasher, and electric oven. These are generally not used every day so the "per day" calculation would push them lower in rank.

***

Another noteworthy feature of the Bloomberg chart is the split legend. The colors divide appliances into five groups based on usage category (e.g. cleaning, food, utility). Instead of the usual color legend printed on a corner or side of the chart, the designer spreads the category labels around the chart. Each label is shown the first time a specific usage category appears on the chart. There is a presumption that the reader scans from top to bottom, which is probably true on average.

I like this arrangement as it delivers information to the reader when it's needed.

 

 

 


Modern design meets dataviz

This chart was submitted via Twitter (thanks John G.).

OptimisticEstimatingHomeValue

Perhaps the designer is inspired by this:

Royalontariomuseum

That's the Royal Ontario Museum, one of the beautiful landmarks in Toronto.

***

The chart addresses an interesting question - how much do home buyers over or under-estimate home value?  That said, gathering data to answer this question is challenging. I won't delve into this issue in this post.

Let's ask where readers are looking for data on the chart. It appears that we should use the right edge of each triangle. While the left edge of the red triangle might be useful, the left edges of the other triangles definitely would not contain data.

Note that, like modern architecture, the designer is playing with edges. None of the four right edges is properly vertical - none of the lines cuts the horizontal axis at a right angle. So the data actually reside in the imaginary vertical lines from the apexes to the horizontal baseline.

Where is the horizontal baseline? It's not where it is drawn either. The last number in the series is a negative number and so the real baseline is in the middle of the plot area, where the 0% value is.

The following chart shows (left side) the misleading signals sent to readers and (right side) the proper way to consume the data.

Redo_rockethomes_priceprojection

The degree of distortion is quite extreme. Only the fourth value is somewhat accurate, albeit by accident.

The design does not merely perturb the chart; it causes a severe adverse reaction.

 

P.S. [9/19/2022] Added submitter name.

 

 

 


Trying too hard

Today, I return to the life expectancy graphic that Antonio submitted. In a previous post, I looked at the bumps chart. The centerpiece of that graphic is the following complicated bar chart.

Aburto_covid_lifeexpectancy

Let's start with the dual axes. On the left, age, and on the right, year of birth. I actually like this type of dual axes. The two axes present two versions of the same scale so the dual axes exist without distortion. It just allows the reader to pick which scale they want to use.

It baffles me that the range of each bar runs from 2.5 years to 7.5 years or 7.5 years to 2.5 years, with 5 or 10 years situated in the middle of each bar.

Reading the rest of the chart is like unentangling some balled up wires. The author has created a statistical model that attributes cause of death to male life expectancy in such a way that you can take the difference in life expectancy between two time points, and do a kind of waterfall analysis in which each cause of death either adds to or subtracts from the prior life expectancy, with the sum of these additions and substractions leading to the end-of-period life expectancy.

The model is complicated enough, and the chart doesn't make it any easier.

The bars are rooted at the zero value. The horizontal axis plots addition or substraction to life expectancy, thus zero represents no change during the period. Zero does not mean the cause of death (e.g. cancer) does not contribute to life expectancy; it just means the contribution remains the same.

The changes to life expectancy are shown in units of months. I'd prefer to see units of years because life expectancy is almost always given in years. Using years turn 2.5 months into 0.2 years which is a fraction, but it allows me to see the impact on the reported life expectancy without having to do a month-to-year conversion.

The chart highlights seven causes of death with seven different colors, plus gray for others.

What really does a number on readers is the shading, which adds another layer on top of the hues. Each color comes in one of two shading, referencing two periods of time. The unshaded bar segments concern changes between 2010 and "2019" while the shaded segments concern changes between "2019" and 2020. The two periods are chosen to highlight the impact of COVID-19 (the red-orange color), which did not exist before "2019".

Let's zoom in on one of the rows of data - the 72.5 to 77.5 age group.

Screen Shot 2022-09-14 at 1.06.59 PM

COVID-19 (red-orange) has a negative impact on life expectancy and that's the easy one to see. That's because COVID-19's contribution as a cause of death is exactly zero prior to "2019". Thus, the change in life expectancy is a change from zero. This is not how we can interpret any of the other colors.

Next, we look at cancer (blue). Since this bar segment sits on the right side of zero, cancer has contributed positively to change in life expectancy between 2010 and 2020. Practically, that means proportionally fewer people have died from cancer. Since the lengths of these bar segments correspond to the relative value, not absolute value, of life expectancy, longer bars do not necessarily indicate more numerous deaths.

Now the blue segment is actually divided into two parts, the shaded and not shaded. The not-shaded part is for the period "2019" to 2020 in the first year of the COVID-19 pandemic. The shaded part is for the period 2010 to "2019". It is a much wider span but it also contains 9 years of changes versus "1 year" so it's hard to tell if the single-year change is significantly different from the average single-year change of the past 9 years. (I'm using these quotes because I don't know whether they split the year 2019 in the middle since COVID-19 didn't show up till the end of that year.)

Next, we look at the yellow-brown color correponding to CVD. The key feature is that this block is split into two parts, one positive, one negative. Prior to "2019", CVD has been contributing positively to life expectancy changes while after "2019", it has contributed negatively. This observation raises some questions: why would CVD behave differently with the arrival of the pandemic? Are there data problems?

***

A small multiples design - splitting the period into two charts - may help here. To make those two charts comparable, I'd suggest annualizing the data so that the 9-year numbers represent the average annual values instead of the cumulative values.

 

 


Speedometer charts: love or hate

Pie chart hate is tired. In this post, I explain my speedometer hate. (Also called gauges,  dials)

Next to pie charts, speedometers are perhaps the second most beloved chart species found on business dashboards. Here is a typical example:

Speedometers_example

 

For this post, I found one on Reuters about natural gas in Europe. (Thanks to long-time contributor Antonio R. for the tip.)

Eugas_speedometer

The reason for my dislike is the inefficiency of this chart form. In classic Tufte-speak, the speedometer chart has a very poor data-to-ink ratio. The entire chart above contains just one datum (73%). Most of the ink are spilled over non-data things.

This single number has a large entourage:

- the curved axis
- ticks on the axis
- labels on the scale
- the dial
- the color segments
- the reference level "EU target"

These are not mere decorations. Taking these elements away makes it harder to understand what's on the chart.

Here is the chart without the curved axis:

Redo_eugas_noaxis

Here is the chart without axis labels:

Redo_eugas_noaxislabels

Here is the chart without ticks:

Redo_eugas_notickmarks

When the tick labels are present, the chart still functions.

Here is the chart without the dial:

Redo_eugas_nodial

The datum is redundantly encoded in the color segments of the "axis".

Here is the chart without the dial or the color segments:

Redo_eugas_nodialnosegments

If you find yourself stealing a peek at the chart title below, you're not alone.

All versions except one increases our cognitive load. This means the entourage is largely necessary if one encodes the single number in a speedometer chart.

The problem with the entourage is that readers may resort to reading the text rather than the chart.

***

The following is a minimalist version of the Reuters chart:

Redo_eugas_onedial

I removed the axis labels and the color segments. The number 73% is shown using the dial angle.

The next chart adds back the secondary message about the EU target, as an axis label, and uses color segments to show the 73% number.

Redo_eugas_nodialjustsegments

Like pie charts, there are limited situations in which speedometer charts are acceptable. But most of the ones we see out there are just not right.

***

One acceptable situation is to illustrate percentages or proportions, which is what the EU gas chart does. Of course, in that situation, one can alo use a pie chart without shame.

For illustrating proportions, I prefer to use a full semicircle, instead of the circular sector of arbitrary angle as Reuters did. The semicircle lends itself to easy marks of 25%, 50%, 75%, etc, eliminating the need to print those tick labels.

***

One use case to avoid is numeric data.

Take the regional sales chart pulled randomly from a Web search above:

Speedometers_example

These charts are completely useless without the axis labels.

Besides, because the span of the axis isn't 0% to 100%, every tick mark must be labelled with the numeric value. That's a lot of extra ink used to display a single value!


Funnels and scatters

I took a peek at some of the work submitted by Ray Vella's students in his NYU dataviz class recently.

The following chart by Hosanah Bryan caught my eye:

Rich Get Richer_Hosanah Bryan (v2)

The data concern the GDP gap between rich and poor regions in various countries. In some countries, especially in the U.K., the gap is gigantic. In other countries, like Spain and Sweden, the gap is much smaller.

The above chart uses a funnel metaphor to organize the data, although the funnel does not add more meaning (not that it has to). Between that, the color scheme and the placement of text, it's visually clean and pleasant to look at.

The data being plotted are messy. They are not actual currency values of GDP. Each number is an index, and represents the relative level of the GDP gap in a given year and country. The gap being shown by the colored bars are differences in these indices 15 years apart. (The students were given this dataset to work with.)

So the chart is very hard to understand if one focuses on the underlying data. Nevertheless, the same visual form can hold other datasets which are less complicated.

One can nitpick about the slight misrepresentation of the values due to the slanted edges on both sides of the bars. This is yet another instance of the tradeoff between beauty and precision.

***

The next chart by Liz Delessert engages my mind for a different reason.

The Rich Get Richerv2

The scatter plot sets up four quadrants. The top right is "everyone gets richer". The top left, where most of the dots lie, is where "the rich get richer, the poor get poorer".  This chart shows a thoughtfulness about organizing the data, and the story-telling.

The grid setup cues readers toward a particular way of looking at the data.

But power comes with responsibility. Such scatter plots are particularly susceptible to the choice of data, in this case, countries. It is tempting to conclude that there are no countries in which everyone gets poorer. But that statement more likely tells us more about which countries were chosen than the real story.

I like to see the chart applied to other data transformations that are easier. For example, we can start with the % change in GDP computed separately for rich and for poor. Then we can form a ratio of these two percent changes.

 

 


Dataviz is good at comparisons if we make the right comparisons

In an article about gas prices around the world, the Washington Post uses the following bar chart (link):

Wpost_gasprices_highincome

There are a few wrinkles in this one compared to the most generic bar chart one can produce:

Redo_wpost_gasprices_0

(The numbers on my chart are not the same as Washington Post's. That's because the data vendor charges for data, except for the most recent week. So, my data is from a different week.)

_trifectacheckup_imageThe gas prices are not expressed in dollars but a transformation turns prices into a cost-effectiveness metric: miles per dollar, or more precisely, miles per $40 dollars of gas. The metric has a reverse direction - the higher the price, the lower the miles. The data transformation belongs to the D corner of the Trifecta Checkup framework (link). Depending on how one poses the Q(uestion) of the chart, the shift from dollars to miles can bring the Q and the D in sync.

In the V(isual) corner, the designer embellishes the bars. A car icon is placed at the tip of each bar while the bar itself is turned into a wavy path, symbolizing a dirt path. The driving metaphor is in full play. In fact, the video makes the most out of it. There is no doubt that the embellishment has turned a mere scientific presentation into a form of entertainment.

***

Did the embellishment harm visual clarity? For the most part, no.

The worst it can get is when they compared U.S. and India/South Africa:

Redo_wpost_gasprices_indiasouthafrica

The left column shows the original charts from the article. In  both charts, the two cars are so close together that it is impossible to learn the scale of the difference. The amount of difference is a fraction of the width of a car icon.

The right column shows the "self-sufficiency test". Imagine the data labels are not on the chart. What we learn is that if we wanted to know how big of a gap is between the two countries, when reading the charts on the left, we are relying on the data labels, not the visual elements. On the right side, if we really want to learn the gaps, we have to look through the car icons to find the tips of the bars!

This discussion does not necessarily doom the appealing chart. If the message one wants to send with the India/South Afrcia charts is that there is negligible difference between them, then it is not crucial to present the precise differences in prices.

***

The real problem with this dataviz is in the D corner. Comparing countries is hard.

As shown above, by the miles per $40 spend metric, U.S. and India are rated essentially the same. So is the average American and the average Indian suffering equally?

Far from it. The clue comes from the aggregate chart, in which countries are divided into three tiers: high income, upper middle income and lower middle income. The U.S. belongs to the high-income tier while India falls into the lower-middle-income tier.

The cost of living in India is much lower than in the US. Forty dollars is a much bigger chunk of an Indian paycheck than an American one.

To adjust for cost of living, economists use a PPP (purchasing power parity) value. The following chart shows the difference:

Redo_wpost_gasprices_1

The right graph contains cost-of-living adjustments. It shows a completely different picture. Nominally (left chart), the price of gas in about the same in dollar terms between U.S. and India. In terms of cost of living, gas is actually 5 times more expensive in India. Thus, the adjusted miles per $40 gas number is much smaller for India than the unadjusted. (Because PPP is relative to U.S. prices, the U.S. numbers are not affected.)

PPP is not the end-all here. According to the Economic Times (India), only 22 out of 1,000 Indians own cars, compared to 980 out of 1,000 Americans. Think about the implication of using any statistic that averages the entire population!

***

Why is gas more expensive in California than the U.S. average? The talking point I keep hearing is environmental regulations. Gas prices may be higher in Europe for a similar reason. Residents in those places may be willing to pay higher prices because they get satisfaction from playing their part in preserving the planet for future generations.

The footnote discloses this not-trivial issue.

Wpost_gasprices_footnote

When converting from dollars per gallon/liter into miles per $40, we need data on miles per gallon/liter. Americans notoriously drive cars (trucks, SUVs, etc.) that have much lower mileage than those driven by other countries. However, this factor is artificially removed by assuming the same car with 32 mpg on all countries. A quick hop to the BTS website tells us that the average mpg of American cars is a third of that assumption. [See note below.]

Ignoring cross-country comparisons for the time being, the true number for U.S. is not 247 miles per $40 spent on gas as claimed. It is a third of that value: 82 miles per $40 spent.

It's tough to find data on fuel economy of all passenger cars, not just new passenger cars. I found Australia's number, which is 21 mpg. So this brings the miles per $40 number down from about 230 to 115. These are not small adjustments.

Washington Post's analysis paints a simplistic picture that presupposes that price is the only thing people care about. I call this issue xyopia. It's when the analyst frames the problem as factor x explaining outcome y, and when factor x is not the only, and frequently not even the most important, factor affecting y.

More on xyopia.

More discussion of Washington Post graphics.

 

[P.S. 7-25-2022. Reader Cody Curtis pointed out in the comments that the Bureau of Transportation Statistics report was using km/liter as units, not miles per gallon. The 10 km/liter number for average cars is roughly 23 mpg. I'll leave the text as is in the post as the larger point is valid: that there is variation in average fuel economy between nations - partly due to environemental regulation and consumer behavior - and thus, a proper comparison requires adjusting for this factor.]