Making major things easy, revisited

In the prior post, I made a chart that shows the driver license status of British drivers at different ages. The key change unplugs the obsession with a+b+c = 100%. Instead, the revised chart makes it easier to figure out what proportion of which age group holds which type of license.

This is the right-side plot from the panel of two plots:

Junkcharts_redo_significanceolddrivers_male

Looking at this chart, one might think my primary point of interest is the relative proportion with full license vs no license. But on second thought, I'm less interested in this comparison than that between male and female drivers. Does the prevalence of full licenses differ between men and women as they age?

In the original panel, the reader has to run back and forth between the two plots. Why not put that comparison on a single plot?

Like this:

Junkcharts_redo_significanceolderdrivers_fulllicense

This chart surfaces the difference between men and women (at all age groups) in owning full driver's licenses. Women are much more likely to stop driving earlier.

Here is the entire panel:

Junkcharts_redo_significanceolderdrivers_bylicense

Because of this structural choice, it is harder on this panel to learn the distribution of license status.

 

 


Making major things easy, and minor things hard

A recent issue of Significance magazine carried the following stacked column chart showing how the driver license status of men and women change as they age. The data came from the U.K.

Siginificance_olddrivers_1

Quick question - what percentage of British men in their sixties hold full driver licenses?

***

I was just kidding. Those questions can't be quickly answered on a stacked column chart. That's because you have to find the axis, and then mentally invert the axis.

On that chart, larger values are shown pointing down (green) and also pointing up (blue), and ... well, I don't have words for the yellow. In fact, the yellow segments, showing people without licenses, are possibly the most important category for this report.

In making decisions about visualizing data, it's important to separate out the major things from the minor things.

***

Here is a reimagination of the chart using connected dots:

Junkcharts_redo_significanceolderdrivers

What is hard to do using this chart is to verify that the three proportions add to 100%. What is easy is to read off the proportion for any gender, age and license status subgroup.

It's really quite intricate how these researchers binned the age data. There are bins of size 1, 4, 5 and 10, plus the top group is 85 and above. The way I handled these is to turn everything to 1-year bins. I assume that in the wider bins, we don't have precise data for each age, and the bin value is the average among the bin, thus it is as if someone had drawn a horizontal line across the bin width. (I left the top bin alone as I don't know what is the maximum age of a person in this study.)

***

Those of you who have laminated the flowchart of data visualization are probably irate. According to such a flowchart, one must use a column chart because the x variable (age band) has irregularly-sized discrete values, and one must use a stacked column chart because the y variable is a percentage, grouped by a third variable (license status).

Don't be mad, just ditch the flowchart.

 


Ranks, labels, metrics, data and alignment

A long-time reader Chris V. (since 2012!) sent me to this WSJ article on airline ratings (link).

The key chart form is this:

Wsj_airlines_overallranks

It's a rhombus shaped chart, really a bar chart rotated counter-clockwise by 45 degrees. Thus, all the text is at 45 degree angles. An airplane icon is imprinted on each bar.

There is also this cute interpretation of the white (non-data-ink) space as a symmetric reflection of the bars (with one missing element). On second thought, the decision to tilt the chart was probably made in service of this quasi-symmetry. If the data bars were horizontal, then the white space would have been sliced up into columns, which just doesn't hold the same appeal.

If we be Tuftian, all of these flourishes do not serve the data. But do they do much harm? This is a case that's harder to decide. The data consist of just a ranking of airlines. The message still comes across. The head must tilt, but the chart beguiles.

***

As the article progresses, the same chart form shows up again and again, with added layers of detail. I appreciate how the author has constructed the story. Subtly, the first chart teaches the readers how the graphic encodes the data, and fills in contextual information such as there being nine airlines in the ranking table.

In the second section, the same chart form is used, while the usage has evolved. There are now a pair of these rhombuses. Each rhombus shows the rankings of a single airline while each bar inside the rhombus shows the airline's ranking on a specific metric. Contrast this with the first chart, where each bar is an airline, and the ranking is the overall ranking on all metrics.

Wsj_airlines_deltasouthwestranks

You may notice that you've used a piece of knowledge picked up from the first chart - that on each of these metrics, each airline has been ranked against eight others. Without that knowledge, we don't know that being 4th is just better than the median. So, in a sense, this second section is dependent on the first chart.

There is a nice use of layering, which links up both charts. A dividing line is drawn between the first place (blue) and not being first (gray). This layering allows us to quickly see that Delta, the overall winner, came first in two of the seven metrics while Southwest, the second-place airline, came first in three of the seven (leaving two metrics for which neither of these airlines came first).

I'd be the first to admit that I have motion sickness. I wonder how many of you are starting to feel dizzy while you read the labels, heads tilted. Maybe you're trying, like me, to figure out the asterisks and daggers.

***

Ironically, but not surprisingly, the asterisks reveal a non-trivial matter. Asterisks direct readers to footnotes, which should be supplementary text that adds color to the main text without altering its core meaning. Nowadays, asterisks may hide information that changes how one interprets the main text, such as complications that muddy the main argument.

Here, the asterisks address a shortcoming of representing ranking using bars. By convention, lower ranking indicates better, and most ranking schemes start counting from 1. If ranks are directly encoded in bars, then the best object is given the shortest bar. But that's not what we see on the chart. The bars actually encode the reverse ranking so the longest bar represents the lowest ranking.

That's level one of this complication. Level two is where these asterisks are at.

Notice that the second metric is called "Canceled flights". The asterisk quipped "fewest". The data collected is on the number of canceled flights but the performance metric for the ranking is really "fewest canceled flights". 

If we see a long bar labelled "1st" under "canceled flights", it causes a moment of pause. Is the airline ranked first because it had the most canceled flights? That would imply being first is worst for this category. It couldn't be that. So perhaps "1st" means having the fewest canceled flights but then it's just weird to show that using the longest bar. The designer correctly anticipates this moment of pause, and that's why the chart has those asterisks.

Unfortunately, six out of the seven metrics require asterisks. In almost every case, we have to think in reverse. "Extreme delays" really mean "Least extreme delays"; "Mishandled baggage" really mean "Less mishandled baggage"; etc. I'd spend some time renaming the metrics to try to fix this avoiding footnotes. For example, saying "Baggage handling" instead of "mishandled baggage" is sufficient.

***

The third section contains the greatest details. Now, each chart prints the ranking of nine airlines for a particular metric.

Wsj_airlinerankings_bymetric

 

By now, the cuteness faded while the neck muscles paid. Those nice annotations, written horizontally, offered but a twee respite.

 

 

 

 

 


Dot plots with varying dot sizes

In a prior post, I appreciated the effort by the Bloomberg Graphics team to describe the diverging fortunes of Japanese and Chinese car manufacturers in various Asian markets.

The most complex chart used in that feature is the following variant of a dot plot:

Bloomberg_japancars_chinamarket

This chart plots the competitors in the Chinese domestic car market. Each bubble represents a car brand. Using the styling of the entire article, the red color is associated with Japanese brands while the medium gray color indicates Chinese brands. The light gray color shows brands from the rest of the world. (In my view, adding the pink for U.S. and blue for German brands - seen on the first chart in this series - isn't too much.)

The dot size represents the current relative market share of the brand. The main concern of the Bloomberg article is the change in market share in the period 2019-2024. This is placed on the horizontal axis, so the bubbles on the right side represent growing brands while the bubbles on the left, weakening brands.

All the Japanese brands are stagnating or declining, from the perspective of market share.

The biggest loser appears to be Volkswagen although it evidently started off at a high level since its bubble size after shrinkage is still among the largest.

***

This chart form is a composite. There are at least two ways to describe it. I prefer to see it as a dot plot with an added dimension of dot size. A dot plot typically plots a single dimension on a single axis, and here, a second dimension is encoded in the sizes of the dots.

An alternative interpretation is that it is a scatter plot with a third dimension in the dot size. Here, the vertical dimension is meaningless, as the dots are arbitrarily spread out to prevent overplotting. This arrangement is also called the bubble plot if we adopt a convention that a bubble is a dot of variable size. In a typical bubble plot, both vertical and horizontal axes carry meaning but here, the vertical axis is arbitrary.

The bubble plot draws attention to the variable in the bubble size, the scatter plot emphasizes two variables encoded in the grid while the dot plot highlights a single metric. Each shows secondary metrics.

***

Another revelation of the graph is the fragmentation of the market. There are many dots, especially medium gray dots. There are quite a few Chinese local manufacturers, most of which experienced moderate growth. Most of these brands are startups - this can be inferred because the size of the dot is about the same as the change in market share.

The only foreign manufacturer to make material gains in the Chinese market is Tesla.

The real story of the chart is BYD. I almost missed its dot on first impression, as it sits on the far right edge of the chart (in the original webpage, the right edge of the chart is aligned with the right edge of the text). BYD is the fastest growing brand in China, and its top brand. The pedestrian gray color chosen for Chinese brands probably didn't help. Besides, I had a little trouble figuring out if the BYD bubble is larger than the largest bubble in the size legend shown on the opposite end of BYD. (I measured, and indeed the BYD bubble is slightly larger.)

This dot chart (with variable dot sizes) is nice for highlighting individual brands. But it doesn't show aggregates. One of the callouts on the chart reads: "Chinese cars' share rose by 23%, with BYD at the forefront". These words are necessary because it's impossible to figure out that the total share gain by all Chinese brands is 23% from this chart form.

They present this information in the line chart that I included in the last post, repeated here:

Bloomberg_japancars_marketshares

The first chart shows that cumulatively, Chinese brands have increased their share of the Chinese market by 23 percent while Japanese brands have ceded about 9 percent of market share.

The individual-brand view offers other insights that can't be found in the aggregate line chart. We can see that in addition to BYD, there are a few local brands that have similar market shares as Tesla.

***

It's tough to find a single chart that brings out insights at several levels of analysis, which is why we like to talk about a "visual story" which typically comprises a sequence of charts.

 


Fantastic auto show from the Bloomberg crew

I really enjoyed the charts in this Bloomberg feature on the state of Japanese car manufacturers in the Southeast Asian and Chinese markets (link). This article contains five charts, each of which is both engaging and well-produced.

***

Each chart has a clear message, and the visual display is clearly adapted for purpose.

The simplest chart is the following side-by-side stacked bar chart, showing the trend in share of production of cars:

Bloomberg_japancars_production

Back in 1998, Japan was the top producer, making about 22% of all passenger cars in the world. China did not have much of a car industry. By 2023, China has dominated global car production, with almost 40% of share. Japan has slipped to second place, and its share has halved.

The designer is thoughtful about each label that is placed on the chart. If something is not required to tell the story, it's not there. Consistently across all five charts, they code Japan in red, and China in a medium gray color. (The coloring for the rest of the world is a bit inconsistent; we'll get to that later.)

Readers may misinterpret the cause of this share shift if this were the only chart presented to them. By itself, the chart suggests that China simply "stole" share from Japan (and other countries). What is true is that China has invested in a car manufacturing industry. A more subtle factor is that the global demand for cars has grown, with most of the growth coming from the Chinese domestic market and other emerging markets - and many consumers favor local brands. Said differently, the total market size in 2023 is much higher than that in 1998.

***

Bloomberg also made a chart that shows market share based on demand:

Bloomberg_japancars_marketshares

This is a small-multiples chart consisting of line charts. Each line chart shows market share trends in five markets (China and four Southeast Asian nations) from 2019 to 2024. Take the Chinese market for example. The darker gray line says Chinese brands have taken 20 percent additional market share since 2019; note that the data series is cumulative over the entire window. Meanwhile, brands from all other countries lost market share, with the Japanese brands (in red) losing the most.

The numbers are relative, which means that the other brands have not necessarily suffered declines in sales. This chart by itself doesn't tell us what happened to sales; all we know is the market shares of brands from different countries relative to their baseline market share in 2019. (Strange period to pick out as it includes the entire pandemic.)

The designer demonstrates complete awareness of the intended message of the chart. The lines for Chinese and Japanese brands were bolded to highlight the diverging fortunes, not just in China, but also in Southeast Asia, to various extents.

On this chart, the designer splits out US and German brands from the rest of the world. This is an odd decision because the categorization is not replicated in the other four charts. Thus, the light gray color on this chart excludes U.S. and Germany while the same color on the other charts includes them. I think they could have given U.S. and Germany their own colors throughout.

***

The primacy of local brands is hinted at in the following chart showing how individual brands fared in each Southeast Asian market:

Bloomberg_japancars_seasiamarkets

 

This chart takes the final numbers from the line charts above, that is to say, the change in market share from 2019 to 2024, but now breaks them down by individual brand names. As before, the red bubbles represent Japanese brands, and the gray bubbles Chinese brands. The American and German brands are lumped in with the rest of the world and show up as light gray bubbles.

I'll discuss this chart form in a next post. For now, I want to draw your attention to the Malaysia market which is the last row of this chart.

What we see there are two dominant brands (Perodua, Proton), both from "rest of the world" but both brands are Malaysian. These two brands are the biggest in Malaysia and they account for two of the three highest growing brands there. The other high-growth brand is Chery, which is a Chinese brand; even though it is growing faster, its market share is still much smaller than the Malaysian brands, and smaller than Toyota and Honda. Honda has suffered a lot in this market while Toyota eked out a small gain.

The impression given by this bubble chart is that Chinese brands have not made much of a dent in Malaysia. But that would not be correct, if we believe the line chart above. According to the line chart, Chinese brands roughly earned the same increase in market share (about 3%) as "other" brands.

What about the bubble chart might be throwing us off?

It seems that the Chinese brands were starting from zero, thus the growth is the whole bubble. For the Malaysian brands, the growth is in the outer ring of the bubbles, and the larger the bubble, the thinner is the ring. Our attention is dominated by the bubble size which represents a snapshot in the ending year, providing no information about the growth (which is shown on the horizontal axis).

***

For more discussion of Bloomberg graphics, see here.


Excess delay

The hot topic in New York at the moment is congestion pricing for vehicles entering Manhattan, which is set to debut during the month of June. I found this chart (link) that purports to prove the effectiveness of London's similar scheme introduced a while back.

Transportxtra_2

This is a case of the visual fighting against the data. The visual feels very busy and yet the story lying beneath the data isn't that complex.

This chart was probably designed to accompany some text which isn't available free from that link so I haven't seen it. The reader's expectation is to compare the periods before and after the introduction of congestion charges. But even the task of figuring out the pre- and post-period is taking more time than necessary. In particular, "WEZ" is not defined. (I looked this up, it's "Western Extension Zone" so presumably they expanded the area in which charges were applied when the travel rates went back to pre-charging levels.)

The one element of the graphic that raises eyebrows is the legend which screams to be read.

Transportxtra_londoncongestioncharge_legend

Why are there four colors for two items? The legend is not self-sufficient. The reader has to look at the chart itself and realize that purple is the pre-charging period while green (and blue) is the post-charging period (ignoring the distinction between CCZ and WEZ).

While we are solving this puzzle, we also notice that the bottom two colors are used to represent an unchanging quantity - which is the definition of "no congestion". This no-congestion travel rate is a constant throughout the chart and yet a lot of ink of two colors have been spilled on it. The real story is in the excess delay, which the congestion charging scheme was supposed to reduce.

The excess on the chart isn't harmless. The excess delay on the roads has been transferred to the chart reader. It actually distracts from the story the analyst is wanting to tell. Presumably, the story is that the excess delays dropped quite a bit after congestion charging was introduced. About four years later, the travel rates had creeped back to pre-charging levels, whereupon the authorities responded by extending the charging zone to WEZ (which as of the time of the chart, wasn't apparently bringing the travel rate down.)

Instead of that story, the excess of the chart makes me wonder... the roads are still highly congested with travel rates far above the level required to achieve no congestion, even after the charging scheme was introduced.

***

I started removing some of the excess from the chart. Here's the first cut:

Junkcharts_redo_transportxtra_londoncongestioncharge

This is better but it is still very busy. One problem is the choice of columns, even though the data are found strictly on the top of each column. (Besides, when I chop off the unchanging sections of the columns, I created a start-not-from-zero problem.) Also, the labeling of the months leaves much to be desired, there are too many grid lines, etc.

***

Here is the version I landed on. Instead of columns, I use lines. When lines are used, there is no need for month labels since we can assume a reader knows the structure of months within a year.

Junkcharts_redo_transportxtra_londoncongestioncharge-2

A priniciple I hold dear is not to have legends unless it is absolutely required. In this case, there is no need to have a legend. I also brought back the notion of a uncongested travel speed, with a single line (and annotation).

***

The chart raises several questions about the underlying analysis. I'd interested in learning more about "moving car observer surveys". What are those? Are they reliable?

Further, for evidence of efficacy, I think the pre-charging period must be expanded to multiple years. Was 2002 a particularly bad year?

Thirdly, assuming WEZ indicates the expansion of the program to a new geographical area, I'm not sure whether the data prior to its introduction represents the travel rate that includes the WEZ (despite no charging) or excludes it. Arguments can be made for each case so the key from a dataviz perspective is to clarify what was actually done.

 

P.S. [6-6-24] On the day I posted this, NY State Governer decided to cancel the congestion pricing scheme that was set to start at the end of June.


Messing with expectations

A co-worker sent me to the following map, found in Forbes:

Forbes_gastaxmap

It shows the amount of state tax surcharge per gallon of gas in the U.S. And it's got one of the most common issues found in choropleth maps - the color scheme runs opposite to reader expectations.

Typically, if we see a red-green color scale, we would expect red to represent large numbers and green, small numbers. This map reverses the typical setup: California, the state with the heftiest gas tax, is shown green.

I know, I know - if we apply the typical color scheme, California would bleed red, and it's a blue state, damn it.

The solution is to avoid the red color. Just don't use red or blue.

Junkcharts_redo_forbes_gastaxmap_green

There is no need to use two colors either.

***

A few minor fixes. Given that all dollar amounts on the map are shown to two decimal places, the legend labels should also be shown to 2 decimal places, and with dollar signs.

Forbes_gastaxmap_legend

The subtitle should read "Dollars per gallon" instead of "Cents per gallon". Alternatively, keep "Cents per gallon" but convert all data labels into cents.

Some of the states are missing data labels.

***

I recast this as a small-multiples by categorizing states into four subgroups.

Junkcharts_redo_forbes_gastaxmap_split

With this change, one can almost justify using maps because there is sort of a spatial pattern.

 

 


The choice to encode data using colors

NBC News published the following heatmap that shows inflation by product category in the last year or so:

Nbcnews_inflationtracker

The general story might be that inflation was rampant in airfare and electricity prices about a year ago but these prices have moderated recently, especially in airfare. Gas prices appear to have inflated far less than overall inflation during these months.

***

Now, if you're someone who cares about the magnitude of differences, not just the direction, then revisit the above statements, and you'll feel a sense of inadequacy.

When we choose to encode data in colors, we're giving up on showing magnitudes or precision. The color scale shown up top sends the message that the continuous nature of the number line is being displayed but it really isn't.

The largest value of the chart is found on the left side of the airfare row:

Nbcnews_inflationtracker_highest

The value is about 36% which strangely enough is far larger than the maximum value shown in the legend above. Even if those values align, it is still impossible to guess what values the different colors and shades in the cells map to from the legend.

***

The following small-multiples chart shows the underlying values more precisely:

Redo_junkcharts_nbcnewsinflation

I have transformed the data differently. In these line charts, the data are indexed to the first month (100) so each chart shows the cumulative change in prices from that month to the current month, for each category, compared to the overall.

The two most interesting categories are airfare and gas. Airfare has recently decreased quite drastically relative to September 2022, and thus the line is far below the overall inflation trend. Gas prices moved in reverse: they dropped in the last quarter of 2022 but have steadily risen over 2023, and in the most recent month, is tracking overall inflation.

 

 


An elaborate data vessel

Visualcapitalist_globaloilproductionI recently came across the following dataviz showing global oil production (link).

This is an ambitious graphic that addresses several questions of composition.

The raw data show the amount of production by country adding up to the global total. The countries are then grouped by region. Further, the graph presents an oil-and-gas specific grouping, as indicated by the legend shown just below the chart title. This grouping is indicated by the color of the circumference of the circle containing the flag of the country.

This chart form is popular in modern online graphics programs. It is like an elaborate data vessel. Because the countries are lined up around the barrel, a space has been created on three sides to admit labels and text annotations. This is a strength of this chart form.

***

The chart conveys little information about the underlying data. Each country is given a unique odd shaped polygon, making it impossible to compare sizes. It’s definitely possible to pick out U.S., Russia, Saudi Arabia as the top producers. But in presenting the ranks of the data, this chart form pales in comparison to a straightforward data table, or a bar chart. The less said about presenting values, the better.

Indeed, our self-sufficiency test exposes the inability of these polygons to convey the data. This is precisely why almost all values of the dataset are present on the chart.

***

The dataviz subtly presumes some knowledge on the part of the readers.

The regions are not directly labeled. The readers must know that Saudi Arabia is in the Middle East, U.S. is part of North America, etc. Admittedly this is not a big ask, but it is an ask.

It is also assumed that readers know their flags, especially those of smaller countries. Some of the small polygons have no space left for country names and they are labeled with just flags.

Visualcapitalist_globaloilproduction_nocountrylabels

In addition, knowing country acronyms is required for smaller countries as well. For example, in Africa, we find AGO, COG and GAB.

Visualcapitalist_globaloilproduction_countryacronyms

For this chart form the designer treats each country according to the space it has on the chart (except those countries that found themselves on the edges of the barrel). Font sizes, icons, labels, acronyms, data labels, etc. vary.

The readers are assumed to know the significance of OPEC and OPEC+. This grouping is given second fiddle, and can be found via the color of the circumference of the flag icons.

Visualcapitalist_globaloilproduction_opeclegend

I’d have not assigned a color to the non-OPEC countries, and just use the yellow and blue for OPEC and OPEC+. This is a little edit but makes the search for the edges more efficient.

Visualcapitalist_globaloilproduction_twoopeclabels

***

Let’s now return to the perception of composition.

In exactly the same manner as individual countries, the larger regions are represented by polygons that have arbitrary shapes. One can strain to compile the rank order of regions but it’s impossible to compare the relative values of production across regions. Perhaps this explains the presence of another chart at the bottom that addresses this regional comparison.

The situation is worse for the OPEC/OPEC+ grouping. Now, the readers must find all flag icons with edges of a specific color, then mentally piece together these arbitrarily shaped polygons, then realizing that they won’t fit together nicely, and so must now mentally morph the shapes in an area-preserving manner, in order to complete this puzzle.

This is why I said earlier this is an elaborate data vessel. It’s nice to look at but it doesn’t convey information about composition as readers might expect it to.

Visualcapitalist_globaloilproduction_excerpt


Dataviz in camouflage

This subway timetable in Tokyo caught my eye:

Tokyosubway_timetable_red

It lists the departure times of all trains going toward Shibuya on Saturdays and holidays.

It's a "stem and leaf" plot.

The stem-and-leaf plot is a crude histogram. In this version, the stem is the hour of the day (24-hour clock) and the leaf is the minute (between 0 and 59). The longer the leaf, the higher the frequency of trains.

We can see that there isn't one peak but rather a plateau between hours 9 and 18.

***

Contrast this with the weekday schedule in blue:

Tokyosubway_timetable_blue

We can clearly see two rush hours, one peak at hour 8 and a second one at hours 17-18.

Love seeing dataviz in camouflage!