Out of line

This simple chart showing life expectancies in 10 countries raises one's eyebrows.

Lifeexpectancy_indiatv

The first curiosity is the deliberate placement of Pakistan behind India and China. Every nation is sorted from lowest to highest, except for Pakistan. Is the reason politics? I have no idea. If you have an explanation, please leave a comment.

***
This graphic is an example of data visualization that does not actually show the data.

The positions of the flags do not in fact encode the data! For example, the Indian flag is closer to the Chinese flag than to the Pakistani flag even though the gap between India and China (7) is more than double the gap between India and Pakistan (3).

Here is what it looks like if the gaps encode the data. With this selection of countries, Pakistan and India are separated from the rest. 

Junkcharts_redo_indiatvlifeexpectancy

In the original chart, readers must read the data labels to understand it, and resist interpreting the visual elements.

I removed the flag poles because they have the unintended consequence of establishing a zero level (where the cartoon characters stand) but the positions of the flags don't reflect a start-at-zero posture.

***

Returning to our first topic for a second. If the message of the chart is to single out Pakistan, it actually works! If all other countries are sorted by value, with Pakistan inserted out of order, it draws our attention.

In a conventional layout, Pakistan is shoved to the left side in the bottom corner. See below:

Junkcharts_redo_indiatvlifeexpectancy_2

 

 


Making major things easy, revisited

In the prior post, I made a chart that shows the driver license status of British drivers at different ages. The key change unplugs the obsession with a+b+c = 100%. Instead, the revised chart makes it easier to figure out what proportion of which age group holds which type of license.

This is the right-side plot from the panel of two plots:

Junkcharts_redo_significanceolddrivers_male

Looking at this chart, one might think my primary point of interest is the relative proportion with full license vs no license. But on second thought, I'm less interested in this comparison than that between male and female drivers. Does the prevalence of full licenses differ between men and women as they age?

In the original panel, the reader has to run back and forth between the two plots. Why not put that comparison on a single plot?

Like this:

Junkcharts_redo_significanceolderdrivers_fulllicense

This chart surfaces the difference between men and women (at all age groups) in owning full driver's licenses. Women are much more likely to stop driving earlier.

Here is the entire panel:

Junkcharts_redo_significanceolderdrivers_bylicense

Because of this structural choice, it is harder on this panel to learn the distribution of license status.

 

 


Making major things easy, and minor things hard

A recent issue of Significance magazine carried the following stacked column chart showing how the driver license status of men and women changes as they age. The data came from the U.K.

Siginificance_olddrivers_1

Quick question - what percentage of British men in their sixties hold full driver licenses?

***

I was just kidding. That question can't be quickly answered on a stacked column chart. That's because you have to find the axis, and then mentally invert the axis.

On that chart, larger values are shown pointing down (green) and also pointing up (blue), and ... well, I don't have words for the yellow. In fact, the yellow segments, showing people without licenses, are possibly the most important category for this report.

In making decisions about visualizing data, it's important to separate out the major things from the minor things.

***

Here is a reimagination of the chart using connected dots:

Junkcharts_redo_significanceolderdrivers

What is hard to do using this chart is to verify that the three proportions add to 100%. What is easy is to read off the proportion for any gender, age and license status subgroup.

It's really quite intricate how these researchers binned the age data. There are bins of size 1, 4, 5 and 10, plus the top group is 85 and above. The way I handled these is to turn everything into 1-year bins. I assume that in the wider bins, we don't have precise data for each age, and the bin value is the average across the bin, thus it is as if someone had drawn a horizontal line across the bin width. (I left the top bin alone as I don't know the maximum age of a person in this study.)
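The binning step can be sketched in a few lines of Python. This is a minimal illustration, assuming each bin is stored as a (start age, end age, value) triple; the function name `expand_bins` and the numbers are hypothetical and are not the Significance data.

```python
def expand_bins(bins):
    """Expand (start_age, end_age, value) bins into one value per year of age.

    Both ends are inclusive. Every year inside a bin receives the bin's value,
    which is the "horizontal line across the bin width" assumption.
    """
    per_year = {}
    for start, end, value in bins:
        for age in range(start, end + 1):
            per_year[age] = value
    return per_year


# Hypothetical percentages (NOT the published data), with bins of size 1, 4, 5 and 10.
example_bins = [(16, 16, 10.0), (17, 20, 35.0), (21, 25, 60.0), (26, 35, 80.0)]
per_year = expand_bins(example_bins)
print(per_year[18], per_year[30])
```

An open-ended top bin such as "85 and above" has no end age, so it would have to be handled separately, as in the chart.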

***

Those of you who have laminated the flowchart of data visualization are probably irate. According to such a flowchart, one must use a column chart because the x variable (age band) has irregularly-sized discrete values, and one must use a stacked column chart because the y variable is a percentage, grouped by a third variable (license status).

Don't be mad, just ditch the flowchart.

 


Patiently looking

Voronoi (aka Visual Economist) made this map about service times at emergency rooms around the U.S.

 

Voronoi_EmergencyRoomWaitTImes

This map shows why one shouldn’t just stick state-level data into a state-level map by default.

The data are median service times, defined as the duration of the visit from the moment a patient arrives to the moment they leave. For reasons to be explained below, I don’t like this metric. The data are in terms of hours and minutes, and encoded in the color scale.

As with any choropleth, the dominant features of this map are the shapes and sizes of various pieces but these don’t carry any data. The eastern seaboard contains many states that are small in area but dense in population, and always produces a messy, crowded smorgasbord of labels and guiding lines.

The color scale is progressive (continuous), making it even harder to gain an appreciation of the spatial pattern. For the sake of argument, imagine a truly continuous color scale tuned to the median service times in number of minutes. There would be as many shades as there are unique values on the map. For example, the state with 2 hr 12 min median time would receive a different shade than the one with 2 hr 11 min. Looking at the dataset, I found 43 unique values of median service time in the 52 states and territories. Thus, almost every state would wear its unique shade, making it hard to answer such common questions as: which cluster of states have high/medium/low median service times?

(The underlying software may only be capable of printing a finite number of shades, so in reality there aren’t any truly continuous scales. A continuous scale is just a discrete scale with many levels of shades. For this map, I’d group the states into at most five categories, requiring five shades.)
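One way to form five categories is to cut the values at their quintiles. The sketch below is only one possible grouping rule (equal-count classes), and the state labels and minutes are hypothetical stand-ins rather than the actual dataset.

```python
from bisect import bisect_right
from statistics import quantiles


def five_classes(minutes_by_state):
    """Assign each state a class from 0 (shortest times) to 4 (longest times)."""
    # quantiles(..., n=5) returns the four cut points between five equal-count groups.
    cuts = quantiles(list(minutes_by_state.values()), n=5)
    return {state: bisect_right(cuts, m) for state, m in minutes_by_state.items()}


# Hypothetical median service times in minutes (labels A-J are placeholders).
minutes = {"A": 110, "B": 131, "C": 132, "D": 140, "E": 150,
           "F": 165, "G": 170, "H": 185, "I": 200, "J": 240}
classes = five_classes(minutes)
print(classes)
```

Equal-width classes or hand-picked thresholds are reasonable alternatives; the point is that five shades are enough to reveal clusters.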

***

We’re now reaching the D corner of the Trifecta Checkup (link). _trifectacheckup_image

I’d transform the data to relative values, such as an index against the median or average in the nation. The colors now indicate how much higher or lower is the state’s median service time than that of the nation. With this transformed data, it makes more sense to use a bidirectional color scale so that there are different colors for higher vs lower than average.
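A rough sketch of this transformation follows. It assumes the national reference is taken as the median of the state values, which is only a stand-in for a true national median; the function name and the numbers are hypothetical.

```python
from statistics import median


def index_to_national(minutes_by_state):
    """Express each state's median service time as an index, national reference = 100."""
    # Assumption: the median of the state medians stands in for the national figure.
    national = median(list(minutes_by_state.values()))
    return {state: 100 * m / national for state, m in minutes_by_state.items()}


# Hypothetical values in minutes.
example = {"A": 120, "B": 150, "C": 180}
indexed = index_to_national(example)
print(indexed)
```

Values above 100 would take one hue of the bidirectional scale and values below 100 the other.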

Lastly, I’m not sure about the use of median service time, as opposed to average (mean) service time. I suspect that the distribution is heavily skewed toward longer values so that the median service time falls below the mean service time. If, however, the service time distribution is roughly symmetric around the median, then the mean and median service times will be very similar, and thus the metric selection doesn’t matter.

Imagine you're the healthcare provider and your bonus is based on managing median service times. You have an incentive to let a small number of patients wait an extraordinary amount of time, while serving a bunch of patients who require relatively simple procedures. If it's a mean service time, the values of the extreme outliers will be spread over all the patients while the median service time is affected by the number of such outliers but not their magnitudes.
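A toy example makes the point concrete. The service times below are invented: nine patients at 60 minutes each, plus one outlier whose wait is either long or extraordinarily long.

```python
from statistics import mean, median

# Hypothetical service times in minutes for ten patients.
long_wait = [60] * 9 + [600]
extreme_wait = [60] * 9 + [6000]

# The median ignores how large the single outlier is.
print(median(long_wait), median(extreme_wait))
# The mean spreads the outlier's magnitude over all ten patients.
print(mean(long_wait), mean(extreme_wait))
```

Making the outlier ten times worse leaves the median untouched, while the mean moves from 114 to 654 minutes.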

When I pulled down the publicly available data (link), I found additional data fields. The emergency room visits are further broken into four categories (low, medium, high, very high), and a median is reported within each category. Thus, we have a little idea how extreme the top values can be.

The following dotplot shows this:

Junkcharts_redo_voronoi_emergencyrooms

A chart like this is still challenging to read since there are 52 territories, ordered by the value on a metric. If the analyst can say what are interesting questions, e.g. breaking up the territories into regions, then a grouping can be applied to the above chart to aid comprehension.

 


Five-value summaries of distributions

BG commented on my previous post, describing her frustration with the “stacked range chart”:

A stacked graph visualizes cubes stacked one on top of the other. So you can't use it for negative numbers, because there's no such thing [as] "negative data". In graphs, a "minus" sign visualizes the opposite direction of one series from another. Doing average plus average plus average plus average doesn't seem logical at all.

***

I have already planned a second post to discuss the problems of using a stacked column chart to show markers of a numeric distribution.

I tried to replicate how the Youtuber generated his “stacked range chart” by appropriating Excel’s stacked column chart, but failed. I think there are some missing steps not mentioned in the video. At around 3:33 of the video, he shows a “hack” involving adding 100 degrees (any large enough value) to all values (already converted to ranges). Then, the next screen displays the resulting chart. Here is the dataset on the left and the chart on the right.

Minutephysics_londontemperature_datachart

Afterwards, he replaces the axis labels with new labels, effectively shifting the axis. But something is missing from the narrative. Since he’s using a stacked column chart, the values in the table are encoded in the heights of the respective blocks. The total stacked heights of each column should be in the hundreds since he has added 100 to each cell. But that’s not what the chart shows.

***

In the rest of the post, I’ll skip over how to make such a chart in Excel, and talk about the consequences of inserting “range” values into the heights of the blocks of a stacked column chart.

Let’s focus on London, Ontario; the five temperature values, corresponding to various average temperatures, are -3, 5, 9, 14, 24. Just throwing those numbers into a stacked column chart in Excel results in the following useless chart:

Stackedcolumnchart_londonontario

The temperature averages are cumulatively summed, which makes no sense, as noted by reader BG. [My daily temperature data differ somewhat from those in the YouTube video. My source is here.]

We should ignore the interiors of the blocks, and instead interpret the edges of these blocks. There are five edges corresponding to the five data values. As in:

Junkcharts_redo_londonontariotemperatures_dotplot

The average temperature in London, Ontario (during Spring 2023-Winter 2024) is 9 C. This overall average hides seasonal as well as diurnal variations in temperature.

If we want to acknowledge that night-time temperatures are lower than day-time temperatures, we draw attention to the two values bracketing 9 C, i.e. 5 C and 14 C. The average daytime (max) temperature is 14 C while the average night-time (min) temperature is 5 C. Furthermore, Ontario experiences seasons, so that the average daytime temperature of 14 C is subject to seasonal variability; in the summer, it goes up to 24 C. In the winter, the average night-time temperature goes down to -3 C, compared to 5 C across all seasons. [For those paying closer attention, daytime/max and night-time/min form congruous pairs because the max temperature occurs during daytime while the min temperature occurs during night-time. Thus, the average of maximum temperatures is the same as the average of daytime maximum temperatures.]

The above dotplot illustrates this dataset adequately. The Youtuber explained why he didn’t like it – I couldn’t quite make sense of what he said. It’s possible he thinks the gaps between those averages are more meaningful than the averages themselves, and therefore he prefers a chart form that draws our attention to the ranges, rather than the values.

***

Our basic model of temperature can be thought of as: temperature on a given day = overall average + adjustment for seasonality + adjustment for diurnality.

Take the top three values 9, 14, 24 from the above list. Starting at the overall average of 9 C, the analyst gets to 14 if he homes in on max daily temperatures, and to 24 if he further restricts the analysis to summer months (which have the higher temperatures). The second gap is 10 C, twice as large as the first gap of 5 C. Thus, the seasonal fluctuations have larger magnitude than daily fluctuations. Said differently, the effect of seasons on temperature is bigger than that of hour of day.

In interpreting the “ranges” or gaps between averages, narrow ranges suggest low variability while wider ranges suggest higher variability.
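The arithmetic behind the London, Ontario values can be checked with a short sketch. It contrasts the running totals that result from summing the five averages with the gaps between neighboring averages. The variable names are mine; the five values come from the post.

```python
from itertools import accumulate

# The five averages for London, Ontario, in degrees C.
values = [-3, 5, 9, 14, 24]

# Summing the averages cumulatively gives totals with no physical meaning.
running_totals = list(accumulate(values))

# The gaps between neighboring averages are the "ranges".
gaps = [b - a for a, b in zip(values, values[1:])]

# Stacking the gaps on top of the lowest value recovers the original edges.
edges = list(accumulate(gaps, initial=values[0]))

print(running_totals)  # [-3, 2, 11, 25, 49]
print(gaps)            # [8, 4, 5, 10]
print(edges)           # [-3, 5, 9, 14, 24]
```

The last two gaps are the 5 C and 10 C discussed above. Only when the block heights are the gaps, with the lowest value as the base, do the block edges line up with the five averages.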

Here's a set of boxplots for the same data:

Junkcharts_redo_londonontariotemperatures

The boxplot "edges" also demarcate five values; they are not the same five values as defined by the Youtuber but both sets of five values describe the underlying distribution of temperatures.

 

P.S. For a different example of something similar, see this old post.


Dot plots with varying dot sizes

In a prior post, I appreciated the effort by the Bloomberg Graphics team to describe the diverging fortunes of Japanese and Chinese car manufacturers in various Asian markets.

The most complex chart used in that feature is the following variant of a dot plot:

Bloomberg_japancars_chinamarket

This chart plots the competitors in the Chinese domestic car market. Each bubble represents a car brand. Using the styling of the entire article, the red color is associated with Japanese brands while the medium gray color indicates Chinese brands. The light gray color shows brands from the rest of the world. (In my view, adding the pink for U.S. and blue for German brands - seen on the first chart in this series - isn't too much.)

The dot size represents the current relative market share of the brand. The main concern of the Bloomberg article is the change in market share in the period 2019-2024. This is placed on the horizontal axis, so the bubbles on the right side represent growing brands while the bubbles on the left, weakening brands.

All the Japanese brands are stagnating or declining, from the perspective of market share.

The biggest loser appears to be Volkswagen although it evidently started off at a high level since its bubble size after shrinkage is still among the largest.

***

This chart form is a composite. There are at least two ways to describe it. I prefer to see it as a dot plot with an added dimension of dot size. A dot plot typically plots a single dimension on a single axis, and here, a second dimension is encoded in the sizes of the dots.

An alternative interpretation is that it is a scatter plot with a third dimension in the dot size. Here, the vertical dimension is meaningless, as the dots are arbitrarily spread out to prevent overplotting. This arrangement is also called the bubble plot if we adopt a convention that a bubble is a dot of variable size. In a typical bubble plot, both vertical and horizontal axes carry meaning but here, the vertical axis is arbitrary.

The bubble plot draws attention to the variable in the bubble size, the scatter plot emphasizes two variables encoded in the grid while the dot plot highlights a single metric. Each shows secondary metrics.

***

Another revelation of the graph is the fragmentation of the market. There are many dots, especially medium gray dots. There are quite a few Chinese local manufacturers, most of which experienced moderate growth. Most of these brands are startups - this can be inferred because the size of the dot is about the same as the change in market share.

The only foreign manufacturer to make material gains in the Chinese market is Tesla.

The real story of the chart is BYD. I almost missed its dot on first impression, as it sits on the far right edge of the chart (in the original webpage, the right edge of the chart is aligned with the right edge of the text). BYD is the fastest growing brand in China, and its top brand. The pedestrian gray color chosen for Chinese brands probably didn't help. Besides, I had a little trouble figuring out if the BYD bubble is larger than the largest bubble in the size legend shown on the opposite end of BYD. (I measured, and indeed the BYD bubble is slightly larger.)

This dot chart (with variable dot sizes) is nice for highlighting individual brands. But it doesn't show aggregates. One of the callouts on the chart reads: "Chinese cars' share rose by 23%, with BYD at the forefront". These words are necessary because it's impossible to figure out that the total share gain by all Chinese brands is 23% from this chart form.
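The aggregate is simple to compute from brand-level data even though it can't be read off the dots. The sketch below uses made-up brands and made-up share changes (in percentage points) purely to show the grouping step; none of these numbers come from Bloomberg.

```python
from collections import defaultdict


def share_change_by_origin(brands):
    """Sum the market share changes of individual brands by country of origin."""
    totals = defaultdict(float)
    for _name, origin, change in brands:
        totals[origin] += change
    return dict(totals)


# Hypothetical (brand, origin, change in share in percentage points).
example_brands = [
    ("Brand A", "China", 12.0),
    ("Brand B", "China", 6.0),
    ("Brand C", "Japan", -4.0),
    ("Brand D", "Japan", -2.0),
    ("Brand E", "Other", -5.0),
]
totals = share_change_by_origin(example_brands)
print(totals)
```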

They present this information in the line chart that I included in the last post, repeated here:

Bloomberg_japancars_marketshares

The first chart shows that cumulatively, Chinese brands have increased their share of the Chinese market by 23 percent while Japanese brands have ceded about 9 percent of market share.

The individual-brand view offers other insights that can't be found in the aggregate line chart. We can see that in addition to BYD, there are a few local brands that have similar market shares as Tesla.

***

It's tough to find a single chart that brings out insights at several levels of analysis, which is why we like to talk about a "visual story" which typically comprises a sequence of charts.

 


The most dangerous day

Our World in Data published this interesting chart about infant mortality in the U.S.

Mostdangerousday

The article that sent me to this chart called the first day of life the "most dangerous day". This dot plot seems to support the notion, as the "per-day" death rate is the highest on the day of birth, and then drops quite fast (note log scale) over the first year of life.

***

Based on the same dataset, I created the following different view of the data, using the same dot plot form:

Junkcharts_redo_ourworldindata_infantmortality

By this measure, a baby has a 99.63% chance of surviving the first 30 days while the survival rate drops to 99.5% by day 180.

There is an important distinction between these two metrics.

The "per day" death rate is the chance of dying on a given day, conditional on having survived up to that day. The chance of dying on day 2 is lower partly because some of the more vulnerable ones have died on day 1 or day 0, etc.

The survival rate metric is cumulative: it measures how many babies are still alive given they were born on day 0. The survival rate can never go up, so long as we can't bring back the dead.

***

If we are assessing a 5-day-old baby's chance of surviving day 6, the "per-day" death rate is relevant since that baby has not died in the first 5 days.

If the baby has just been born, and we want to know the chance it might die in the first five days (or survive beyond day 5), then the cumulative survival rate curve is the answer. If we use the per-day death rate, we can't simply add the first five "per-day" death rates. It's a more complicated calculation: the chance of dying on day 0, then having not died on day 0, dying on day 1, then having not died on day 0 or day 1, dying on day 2, etc.
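The link between the two metrics can be sketched as follows, assuming each per-day rate is the conditional chance of dying on that day given survival up to it. The rates used here are hypothetical and far larger than real infant mortality rates, to make the arithmetic visible.

```python
def cumulative_survival(per_day_death_rates):
    """Turn conditional per-day death rates into a cumulative survival curve.

    Surviving through day k means surviving day 0, then day 1, ..., then day k,
    so the conditional survival chances (1 - rate) are multiplied together.
    """
    surviving = 1.0
    curve = []
    for rate in per_day_death_rates:
        surviving *= 1.0 - rate
        curve.append(surviving)
    return curve


# Hypothetical per-day death rates for days 0 through 4.
rates = [0.10, 0.05, 0.02, 0.02, 0.01]
curve = cumulative_survival(rates)

chance_of_dying_in_first_five_days = 1.0 - curve[-1]
naive_sum = sum(rates)
print(chance_of_dying_in_first_five_days, naive_sum)
```

The naive sum of the per-day rates overstates the chance of dying, because each later rate applies only to the babies who are still alive on that day.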

 


Making colors and groups come alive

_numbersense_cover

In the May 2024 issue of Significance, there is an enlightening article (link, paywall) about a new measure of inflation being adopted by the U.K. government known as HCI (Household Costs Indices). This is expected to replace CPI which is the de facto standard measure used around the world. In Chapter 7 of Numbersense (link), I discuss the construction of the CPI, which critics have alleged is manipulated by public officials to be over-optimistic.

The HCI looks promising as it addresses several weaknesses in the CPI measure. First, it implements accounting for household spending on housing - this has always been a tricky subject, regarding those who own homes rather than rent. Second, it recognizes that the average inflation number, which represents the average price changes on the average basket of goods purchased by the average person, does not reflect the experience of many. The HCI measures are broken down into demographic subgroups, so it's possible to compare the HCI of retirees vs non-retirees, for example.

Then comes this multi-colored bar chart:

Sig_hci sm

***

The chart is serviceable: the reader can find the story. For almost all the subgroups listed, the HCI measure comes in higher than the CPI measure (black). For the income deciles, the reader senses that the relationship is not linear, that is to say, inflation does not steadily increase (or decrease) with income. It appears that inflation is highest at both ends of the spectrum, and lowest for those who are in deciles 6 to 8. The only subgroup for whom CPI overestimates inflation is "private renter," which totally makes sense since the CPI index previously did not account for "owner-occupier housing" cost.

This is a chart with 19 bars, and 19 colors. The colors do not encode any data at all, which is a bit wasteful. We can make the colors come alive by encoding subgroup identity. This is what the grouped bar chart looks like:

Junkcharts_redo_sig_hci_grouped_bars

While this is still messy, this version makes it a bit easier to compare across subgroups. The chart simultaneously plots four different grouping methods: by retired/not, by income deciles, by housing situation and by having children/not. Within each grouping, the segments are mutually exclusive but between the groupings, the segments overlap. For example, the same person can be counted in Retired, and in having Children, and also some retirees have children while others don't.

***

To better display the interactions between groups and subgroups, I prefer using a dot plot.

Junkcharts_redo_sig_hci_dots

This is not a simple dot plot either. It's a grouped dot plot with four levels that correspond to each grouping method. One can see the distribution of HCI values across the subgroups within each grouping, and also compare the range of values from one group to another group.

One side benefit of using the dot plot is to get rid of the non-informative space between values 0 and 20. When using a bar chart, we have to start the bars at zero to avoid distorting the encoding. Not so for a dot plot.

P.S. In the next iteration, I'd consider flipping the axes as that might simplify labeling the subgroups.

 


Reading log: HBR's specialty bar charts

Today, I want to talk about a type of analysis that I used to ask students to do. I'm calling it a reading log analysis – it's a reading report that traces how one consumes a dataviz work from where your eyes first land to the moment of full comprehension (or abandonment, if that is the outcome). Usually, we do this orally during a live session, but it's difficult to arrive at a full report within the limited class time. A written report overcomes this problem. A stack of reading logs should be a gift to any chart designer.

My report below is very detailed, reflecting the amount of attention I pay to the craft. Most readers won't spend as much time consuming a graphic. The value of the report is not only in what it covers but also in what it does not mention.

***

The chart being analyzed showed up in a Harvard Business Review article (link), and it was submitted by longtime reader Howie H.

Hbr_specialbarcharts

First and foremost, I recognized the chart form as a bar chart. It's an advanced bar chart in which each bar has stacked sections and a vertical line in the middle. Now, I wanted to figure out how data enter the picture.

My eyes went to the top legend which tells me the author was comparing the proportion of respondents who said "business should take responsibility" to the proportion who rated "business is doing well". The difference in proportions is called the "performance gap". I glanced quickly at the first row label to discover the underlying survey addresses social issues such as environmental concerns.

Next, I looked at the first bar, trying to figure out its data encoding scheme. The bold, blue vertical line in the middle of the bar caused me to think each bar is split into left and right sections. The right section is shaded and labeled with the performance gap numbers so I focused on the segment to the left of the blue line.

My head started to hurt a little. The green number (76%) is associated with the left edge of the left section of the bar. And if the blue line represents the other number (29%), then the width of the left section should map to the performance gap. This interpretation was obviously incorrect since the right section already showed the gap, and the width of the left section was not equal to that of the right shaded section.

I jumped to the next row. My head hurt a little bit more. The only difference between the two rows is the green number being 74%, 2 percent smaller. I couldn't explain how the left sections of both bars have the same width, which confirms that the left section doesn't display the performance gap (assuming that no graphical mistakes have been made). It also appeared that the left edge of the bar was unrelated to the green number. So I retreated to square one. Let's start over. How were the data encoded in this bar chart?

I scrolled down to the next figure, which applies the same chart form to other data.

Hbr_specialbarcharts_2

I became even more confused. The first row showed labels (green number 60%, blue number 44%, performance gap -16%). This bar is much bigger than the one in the previous figure, even though 60% was less than 76%. Besides, the left section, which is bracketed by the green number on the left and the blue number on the right, appeared much wider than the 16% difference that would have been merited. I again lapsed into thinking that the left section represents performance gaps.

Then I noticed that the vertical blue lines were roughly in proportion. Soon, I realized that the total bar width (both sections) maps to the green number. Now back to the first figure. The proportion of respondents who believe business should take responsibility (green number) is encoded in the full bar. In other words, the left edges of all the bars represent 0%. Meanwhile the proportion saying business is doing well is encoded in the left section. Thus, the difference between the full width and the left-section width is both the right-section width and the performance gap.
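The encoding I eventually decoded can be written down as a small sketch. The function and key names are mine, not HBR's; the percentages are the labels read off the two figures.

```python
def bar_segments(should_take_responsibility, doing_well):
    """Map the two survey percentages to the widths drawn in the bar.

    The full bar starts at 0% and spans the "should take responsibility" value.
    The left section spans the "doing well" value, and the shaded right
    section is whatever remains.
    """
    return {
        "full_width": should_take_responsibility,
        "left_width": doing_well,
        "right_width": should_take_responsibility - doing_well,
        # The chart labels the gap with a negative sign (doing well minus should).
        "performance_gap": doing_well - should_take_responsibility,
    }


first_figure_row = bar_segments(76, 29)
second_figure_row = bar_segments(60, 44)
print(first_figure_row)
print(second_figure_row)
```

For the second figure's first row, this gives a right section of width 16 and a performance gap of -16, matching the label.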

Here is an edited version that clarifies the encoding scheme:

Hbr_specialbarcharts_2

***

That's my reading log. Howie gave me his take:

I had to interrupt my reading of the article for quite a while to puzzle this one out. It's sorted by performance gap, and I'm sure there's a better way to display that. Maybe a dot plot, similar to here - https://junkcharts.typepad.com/junk_charts/2023/12/the-efficiency-of-visual-communications.html.

A dot plot might look something like this:

Junkcharts_redo_hbr_specialcharts_2

Howie also said:

I interpret the authors' gist to be something like "Companies underperform public expectations on a wide range of social challenges" so I think I'd want to focus on the uniform direction and breadth of the performance gap more than the specifics of each line item.

And I agree.


A nice plot of densities, but what's behind the colors?

I came across this chart by Planet Anomaly that compares air quality across the world's cities (link). The chart is in long form. The top part looks like this:

Visualcapitalist_airqualityinches_top

The bottom part looks like this:

Visualcapitalist_airqualityinches_bottom

You can go to the Visual Capitalist website to see the entire chart.

***

Plots of densities are relatively rare. The metric for air quality is micrograms of fine particulate matter (PM) per cubic meter, so showing densities is natural.

It's pretty clear the cities with the worst air quality at the bottom have a lot more PM in the air than the cleanest cities shown at the top.

This density chart plays looser with the data than our canonical chart types. The perceived densities of dots inside the squares do not represent the actual concentrations of PM. It's certainly not true that in New Delhi, the air is packed tightly with PM.

Further, a random number generator is required to scatter the red dots inside the circle. Thus, different software or designers will make the same chart look a bit different - the densities will be the same but the locations of the dots will not be.

I don't have a problem with this. Do you?

***

Another notable feature of this chart is the double encoding. The same metric is not just presented as densities; it is also encoded in a color scale.

Visualcapitalist_airqualityinches_color_scale

I don't think this adds much.

Both color and density are hard for humans to perceive precisely so adding color does not convey precision to readers.

The color scale is gradated, so it effectively divides the cities into seven groups. But I don't attach particular significance to the classification. If that is important, it would be clearer to put boxes around the groups of plots. So I don't think the color scale conveys clustering to readers effectively.

There is one important grouping which is defined by WHO's safe limit of 5 μg/cubic meter. A few cities pass this test while almost every other place fails. But the design pays no attention to this test, as it uses the same hue on both sides, and even the same tint changes on either side of the limit.

***

Another notable project that shows densities as red dots is this emotional chart by Mona Chalabi about measles, which I wrote about in 2019.

Monachalabi_measles