Small tweaks that make big differences

It was one of those days when a web search led me to an unfamiliar corner, and I found myself poring over a pile of column charts that look like this:


This pair of charts appears to be canonical in a type of genetics analysis. I'll focus on the column chart up top.

The chart plots a variety of gene functions along the horizontal axis. These functions are classified into three broad categories, indicated using axis annotation.

What are some small tweaks that readers will enjoy?


First, use colors. Here is an example in which the designer uses color to indicate the function classes:


The primary design difference between these two column charts is using three colors to indicate the three function classes. This little change makes it much easier to recognize the ending of one class and the start of the other.

Color doesn't have to be limited to column areas. The following example extends the colors to the axis labels:


Again, just the smallest of changes, but it makes a big difference.


It bugs me a lot that the long axis labels are printed at a slant, forcing every serious reader to read with a tilted head.

Slanting it the other way doesn't help:




These vertical labels are best read while doing side planks.



I'm surprised the horizontal alignment is rather rare. Here's one:



Tidying up the details

This column chart caught my attention because of the color labels.


Well, it also concerns me that the chart takes longer to take in than you'd think.


The color labels say "FY2123", "FY2022", and "FY1921". It's possible but unlikely that the author is making comparisons across centuries. The year 2123 hasn't yet passed, so such an interpretation would map the three categories to long-ago past, present and far-into-the-future.

Perhaps hyphens were inadvertently left off so "FY2123" means "FY2021 - FY2023". It's odd to report financial metrics in multi-year aggregations. I rule this out because the three categories would then also overlap.

Here's what I think the mistake is: somehow the prefix is rolled forward when it is applied to the years. "FY23", "FY22", "FY21" got turned into "FY[21]23", "FY[20]22", "FY[19]21" instead of putting 20 in all three slots.

The chart appeared in an annual financial report, and the comparisons were mostly about the reporting year versus the year before so I'm pretty confident the last two digits are accurately represented.
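A minimal sketch of this reading, assuming the last two digits of each label are reliable and that all three years belong to the 2000s. The function name `normalize_fy` is hypothetical and used only for illustration:

```python
def normalize_fy(label: str, century_prefix: str = "20") -> str:
    """Keep the last two digits of a fiscal-year label and apply one fixed prefix."""
    last_two = label[-2:]
    return f"FY{century_prefix}{last_two}"


# Labels as printed on the chart's color legend.
labels = ["FY2123", "FY2022", "FY1921"]

# Assumption: the prefix should have been "20" in all three slots.
fixed = [normalize_fy(label) for label in labels]
print(fixed)
```

Under that assumption, the three labels map to three consecutive fiscal years with no overlap.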

Please let me know if you have another key to this puzzle.

In the following, I'm going to assume that the three colors represent the most recent three fiscal years.


A few details conspire to blow up our perception time.

There was no extra spacing between groups of columns.

The columns are arranged in reverse time order, with the most recent year shown on the left. (This confuses those of us who use the left-to-right convention.)

The colors are not ordered. If asked to sort the three colors, you will probably suggest what is described as "intuitive" below:


The intuitive order aligns with the amount of black added to a base color (hue). But this isn't the order assigned to the three years on the original chart.
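To make "amount of black added" concrete, here is a rough sketch that sorts three shades of one hue by their HSV value (brightness). The base color and the black fractions are hypothetical, not sampled from the chart:

```python
import colorsys

# Hypothetical base hue as (R, G, B) in the 0-1 range.
base = (0.2, 0.5, 0.9)


def shade(rgb, black):
    """Mix a fraction of black (0 = none, 1 = all black) into a color."""
    return tuple(channel * (1 - black) for channel in rgb)


# Three hypothetical legend colors, listed in a deliberately scrambled order.
colors = {"A": shade(base, 0.6), "B": shade(base, 0.0), "C": shade(base, 0.3)}


def brightness(rgb):
    """HSV value: drops as more black is mixed in."""
    return colorsys.rgb_to_hsv(*rgb)[2]


# Lightest first, darkest last.
ordered = sorted(colors, key=lambda name: brightness(colors[name]), reverse=True)
print(ordered)
```

Assigning the years to the colors in this sorted order (in either direction) gives readers a sequence they can decode without consulting the legend.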


Some of the other details on the chart are well done. For example, I like the handling of the gridlines and the axes.

The following revision tidies up some of the details mentioned above, without changing the key features of the chart:



Using disaggregation in dataviz

This chart appears in a journal article on the use of AI (artificial intelligence) in healthcare (link).


It's a stacked bar chart in which each bar is subdivided into four segments. The authors are interested in the relative frequency of research using AI by disease type. The chart only shows the top 10 disease types.

What is unusual is that the subdivisions are years. So these authors revealed four years of journal articles, and while the overall ranking of the disease types is by the aggregated four-year total counts, each total count has been disaggregated by color so readers can also see the annual counts.


A slight rearrangement yields the following:


Most readers will only care about the left chart showing the total counts. More invested readers may consider the colored charts that show annual totals. These are arranged so that the annual counts are easily read and compared.


One annoying aspect of this type of presentation is that in almost all cases, the top 10 types in aggregate will not be the top 10 types by individual year. In some of those years, I expect that the 10 types shown do not include all of the top 10 types for a particular year.
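The effect is easy to reproduce with made-up numbers. The sketch below uses hypothetical counts for four disease types over four years (not the journal's data) and a top 2 instead of a top 10, and lists which types make the annual top 2 without making the aggregate top 2:

```python
# Hypothetical article counts per disease type, one entry per year.
counts = {
    "A": [10, 12, 14, 16],  # total 52
    "B": [20, 15, 8, 5],    # total 48
    "C": [2, 4, 12, 25],    # total 43
    "D": [9, 9, 9, 9],      # total 36
}


def top_n(scores, n):
    """Return the set of the n keys with the highest scores."""
    return set(sorted(scores, key=scores.get, reverse=True)[:n])


n = 2
totals = {name: sum(series) for name, series in counts.items()}
top_overall = top_n(totals, n)

# For each year, find types in that year's top n that the aggregate ranking omits.
missed = {}
for year in range(4):
    yearly = {name: series[year] for name, series in counts.items()}
    missed[year] = top_n(yearly, n) - top_overall

print(top_overall, missed)
```

In this toy example, type C tops the final year yet never appears on a chart restricted to the aggregate top 2.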

Making colors and groups come alive

In the May 2024 issue of Significance, there is an enlightening article (link, paywall) about a new measure of inflation being adopted by the U.K. government known as HCI (Household Costs Indices). This is expected to replace CPI which is the de facto standard measure used around the world. In Chapter 7 of Numbersense (link), I discuss the construction of the CPI, which critics have alleged is manipulated by public officials to be over-optimistic.

The HCI looks promising as it addresses several weaknesses in the CPI measure. First, it implements accounting for household spending on housing - this has always been a tricky subject, regarding those who own homes rather than rent. Second, it recognizes that the average inflation number, which represents the average price changes on the average basket of goods purchased by the average person, does not reflect the experience of many. The HCI measures are broken down into demographic subgroups, so it's possible to compare the HCI of retirees vs non-retirees, for example.

Then comes this multi-colored bar chart:



The chart is serviceable: the reader can find the story. For almost all the subgroups listed, the HCI measure comes in higher than the CPI measure (black). For the income deciles, the reader senses that the relationship is not linear, that is to say, inflation does not steadily increase (or decrease) with income. It appears that inflation is highest at both ends of the spectrum, and lowest for those who are in deciles 6 to 8. The only subgroup for whom CPI overestimates inflation is "private renter," which totally makes sense since the CPI index previously did not account for "owner-occupier housing" cost.

This is a chart with 19 bars, and 19 colors. The colors do not encode any data at all, which is a bit wasteful. We can make the colors come alive by encoding subgroup identity. This is what the grouped bar chart looks like:


While this is still messy, this version makes it a bit easier to compare across subgroups. The chart simultaneously plots four different grouping methods: by retired/not, by income deciles, by housing situation and by having children/not. Within each grouping, the segments are mutually exclusive but between groupings, the segments overlap. For example, the same person can be counted in both Retired and having Children, since some retirees have children while others don't.


To better display the interactions between groups and subgroups, I prefer using a dot plot.


This is not a simple dot plot either. It's a grouped dot plot with four levels that correspond to each grouping method. One can see the distribution of HCI values across the subgroups within each grouping, and also compare the range of values from one group to another group.

One side benefit of using the dot plot is to get rid of the non-informative space between values 0 and 20. When using a bar chart, we have to start the bars at zero to avoid distorting the encoding. Not so for a dot plot.

P.S. In the next iteration, I'd consider flipping the axes as that might simplify labeling the subgroups.


When should we use bar charts?


Two innocent looking column charts.

These came from an article in Significance magazine (link to paywall) that applies the "difference-in-difference" technique to analyze whether the superstitious act of skipping the number 13 when numbering floors in tall buildings causes an inflation of condo pricing.

The study authors are quite careful in their analysis, recognizing that building managers who decide to relabel the 13th floor as 14th may differ in other systematic ways from those who don't relabel. They use a matching technique to construct comparison groups. The left-side chart shows one effect of matching buildings, which narrowed the gap in average square footage between the relabeled and non-relabeled groups. (Any such gap suggests potential confounding; in a hypothetical, randomized experiment, the average square footage of both groups should be statistically identical.)

The left-side chart features columns that don't start at zero, thus the visualization exaggerates the differences. The degree of exaggeration here is tame: about 150 got chopped off at the bottom, which is about 10% of the total height. But why?
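A rough sketch of how much a chopped baseline exaggerates a difference. The baseline of 150 is read off the chart; the two column values are hypothetical stand-ins chosen so that 150 is about 10% of the height, not the article's data:

```python
def drawn_ratio(a: float, b: float, baseline: float) -> float:
    """Ratio of column heights as drawn when the axis starts at `baseline`."""
    return (a - baseline) / (b - baseline)


baseline = 150.0                   # amount chopped off the bottom of the chart
larger, smaller = 1500.0, 1400.0   # hypothetical average square footages

true_ratio = larger / smaller                          # what the data say
shown_ratio = drawn_ratio(larger, smaller, baseline)   # what the eye sees

print(round(true_ratio, 3), round(shown_ratio, 3))
```

With these stand-in values, a true gap of about 7% is drawn as a gap of 8%. The distortion is mild here, and it grows quickly as the baseline approaches the smaller value.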


The right-side chart is even more problematic.

This chart shows the effect of matching buildings on the average age of the buildings (measured using the average construction year). Again, the columns don't start at zero. But for this dataset, zero is a meaningless value. Never make a column chart when the zero level has no meaning!

The story is simple: by matching, the average construction year in the relabeled group was brought closer to that in the non-relabeled group. The construction year is an ordinal categorical variable, with integer values. I think a comparison of two histograms will show the message more clearly, and also provide more information than just the two average values.

Is this dataviz?

The message in this Visual Capitalist chart is simple - that big tech firms are spending a lot of cash buying back their own stock (which reduces the number of shares in the market, which pushes up their stock price - all without actually having improved their business results.)

But is this data visualization? How does the visual design reflect the data?

The chart form is a half-pie chart, composed of five sectors, of increasing radii. In a pie chart, the data are encoded in the sector areas. But when the sectors are of different radii, it's possible that the data are found in the angles.

The text along the perimeter, coupled with the bracketing, suggests that the angles convey information - specifically, the amount of shares repurchased as a proportion of outstanding share value (market cap). On inspection, the angles are the same for all five sectors, and each one is 180 degrees divided by five, the number of companies depicted on the chart, so they convey no information, unless the company tally is deemed informative.

Each slice of the pie represents a proportion but these proportions don't add up. So the chart isn't even a half-pie chart. (Speaking of which, should the proportions in a half-pie add up to 100% or 50%?)

What about the sector areas? Since the angles are fixed, the sector areas are proportional to the squares of the radii. It took me a bit of time to figure this one out. The radius, by way of the sector area, encodes the amount spent by each company on the buyback transaction. Take the ratio of Microsoft to Meta: 20 over 25 is 80%. To obtain a ratio of areas of 80%, the ratio of radii is roughly 90%; and the radius of Microsoft's sector is indeed about 90% of that of Meta. The ratio between Alphabet and Apple is similar.
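The arithmetic can be checked with a few lines. This sketch assumes each sector is a plain circular sector with the same angle (180 degrees split five ways), so its area is half the angle in radians times the squared radius. The 20 and 25 are the Microsoft and Meta figures quoted above:

```python
import math

n_sectors = 5
angle_deg = 180 / n_sectors  # every sector spans the same angle


def sector_area(radius: float, angle_deg: float) -> float:
    """Area of a circular sector: (angle in radians / 2) * radius squared."""
    return math.radians(angle_deg) / 2 * radius ** 2


spend_ratio = 20 / 25                   # Microsoft relative to Meta
radius_ratio = math.sqrt(spend_ratio)   # radius ratio needed for that area ratio

# Sanity check: with equal angles, the area ratio recovers the spend ratio.
area_ratio = sector_area(radius_ratio, angle_deg) / sector_area(1.0, angle_deg)

print(angle_deg, round(radius_ratio, 3), round(area_ratio, 3))
```

The radius ratio comes out at about 0.894, which matches the "roughly 90%" measured on the chart.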

The sector areas represent the dollar value of these share buybacks, although these transactions range from 0.6% to 2.9% as a proportion of outstanding share value.

Here is a more straightforward presentation of the data:


I'm not suggesting using this display. The sector areas in the original chart depict the data in the red bars. It's not clear to me how the story is affected by the inclusion of the market value data (gray bars).

Excess delay

The hot topic in New York at the moment is congestion pricing for vehicles entering Manhattan, which is set to debut during the month of June. I found this chart (link) that purports to prove the effectiveness of London's similar scheme introduced a while back.


This is a case of the visual fighting against the data. The visual feels very busy and yet the story lying beneath the data isn't that complex.

This chart was probably designed to accompany some text which isn't available free from that link so I haven't seen it. The reader's expectation is to compare the periods before and after the introduction of congestion charges. But even the task of figuring out the pre- and post-period is taking more time than necessary. In particular, "WEZ" is not defined. (I looked this up, it's "Western Extension Zone" so presumably they expanded the area in which charges were applied when the travel rates went back to pre-charging levels.)

The one element of the graphic that raises eyebrows is the legend which screams to be read.


Why are there four colors for two items? The legend is not self-sufficient. The reader has to look at the chart itself and realize that purple is the pre-charging period while green (and blue) is the post-charging period (ignoring the distinction between CCZ and WEZ).

While we are solving this puzzle, we also notice that the bottom two colors are used to represent an unchanging quantity - which is the definition of "no congestion". This no-congestion travel rate is a constant throughout the chart and yet a lot of ink of two colors has been spilled on it. The real story is in the excess delay, which the congestion charging scheme was supposed to reduce.

The excess on the chart isn't harmless. The excess delay on the roads has been transferred to the chart reader. It actually distracts from the story the analyst wants to tell. Presumably, the story is that the excess delays dropped quite a bit after congestion charging was introduced. About four years later, the travel rates had crept back to pre-charging levels, whereupon the authorities responded by extending the charging zone to WEZ (which, as of the time of the chart, apparently wasn't bringing the travel rate down).

Instead of that story, the excess of the chart makes me wonder... the roads are still highly congested with travel rates far above the level required to achieve no congestion, even after the charging scheme was introduced.


I started removing some of the excess from the chart. Here's the first cut:


This is better but it is still very busy. One problem is the choice of columns, even though the data are found strictly on the top of each column. (Besides, when I chopped off the unchanging sections of the columns, I created a start-not-from-zero problem.) Also, the labeling of the months leaves much to be desired, there are too many grid lines, etc.


Here is the version I landed on. Instead of columns, I use lines. When lines are used, there is no need for month labels since we can assume a reader knows the structure of months within a year.


A principle I hold dear is not to have a legend unless it is absolutely required. In this case, there is no need to have a legend. I also brought back the notion of an uncongested travel speed, with a single line (and annotation).


The chart raises several questions about the underlying analysis. I'd be interested in learning more about "moving car observer surveys". What are those? Are they reliable?

Further, for evidence of efficacy, I think the pre-charging period must be expanded to multiple years. Was 2002 a particularly bad year?

Thirdly, assuming WEZ indicates the expansion of the program to a new geographical area, I'm not sure whether the data prior to its introduction represents the travel rate that includes the WEZ (despite no charging) or excludes it. Arguments can be made for each case so the key from a dataviz perspective is to clarify what was actually done.


P.S. [6-6-24] On the day I posted this, the NY State Governor decided to cancel the congestion pricing scheme that was set to start at the end of June.

Chart without an axis

When it comes to global warming, most reports cite a single number such as an average temperature rise of Y degrees by year X. Most reports also claim the existence of a consensus among scientists. The Guardian presented the following chart that shows the spread of opinions amongst the experts.


Experts were asked how many degrees they expect average global temperature to increase by 2100. The estimates ranged from "below 1.5 degrees" to "5 degrees or more". The most popular answer was 2.5 degrees. Roughly three out of four respondents picked a number at 2.5 degrees or above. The distribution is close to symmetric around the middle.


What kind of chart is this?

It's a type of histogram, given that the horizontal axis shows binned ranges of temperature change while the vertical axis shows number of respondents (out of 380).

A (count) histogram typically encodes the count data in the vertical axis. Did you notice there isn't a vertical axis?

That's because the chart has an abnormal axis. Each of the 380 respondents is shown here as a cell. What looks like a "column" is actually two-dimensional. Each row of cells has 10 slots. To find out how many respondents chose the 2.5 degrees Celsius category, you count the number of rows and then the number of stray items on top. (It's 132.)

Only the top row of cells can be partially filled so the general shape of the distribution isn't affected much. However, the lack of axis labels makes it hard to learn the count of each column.

It's even harder to know the proportions of respondents, which should be the primary message of the chart. The proportion would have been possible to show if the maximum number of rows was set to 38. The maximum number of rows on the above chart is 22. Using 38 rows leads to a chart with a lot of white space as the tallest column (count of 132) is roughly 35% of the total response.
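The numbers in this argument can be worked out directly. A minimal sketch, using the figures quoted above (380 respondents, 10 cells per row, 132 in the tallest column):

```python
import math

total = 380     # respondents, one cell each
per_row = 10    # cells per row, as described above
tallest = 132   # count in the most popular category

# If the vertical scale had total / per_row rows, each row would be 1/38 of the sample.
rows_for_all = total // per_row
share_per_row = per_row / total

# Share of respondents in the tallest column, and the rows it would occupy.
share = tallest / total
rows_used = math.ceil(tallest / per_row)

print(rows_for_all, round(share_per_row, 4), round(share, 3), rows_used)
```

On a 38-row scale, each row is worth about 2.6% of respondents, and the tallest column fills only 14 of the 38 rows, which is where the white space comes from.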

In the end, I'm not sure this variant of histogram beats the standard histogram.

One doesn't have to plot raw data

Visual Capitalist chose a treemap to show us where gold is produced (link):


The treemap is embedded into a brick of gold. Any treemap is difficult to read, mostly because some blocks are vertical, others horizontal. A rough understanding is nevertheless possible: the entire global production can be roughly divided into four parts: China plus three other Asian producers account for roughly (not quite) a quarter; "rest of the world" (i.e. all countries not individually listed) is a quarter; Russia and Australia together are again a bit less than a quarter.


When I look at datasets that rank countries by some metric, I'm hoping to present insights, rather than the raw data. Insights typically involve comparing countries, or sets of countries, or one country against a set of countries. So, I made the following chart that includes some of these insights I found in the gold production dataset:


For example, the top 4 producers in Asia account for almost a quarter of the world's output; Canada, U.S. and Australia together also roughly produce a quarter; the rest of the world has a similar output. In Asia, China's output is about the sum of the next 3 producers, which is about the same as U.S. and Canada, which is about the same as the top 5 in Africa.


Aligning V and Q by way of D

In the Trifecta Checkup (link), there is a green arrow between the Q (question) and V (visual) corners, indicating that they should align. This post illustrates what I mean by that.

I saw the following chart in a Washington Post article comparing dairy milk and plant-based "milks".


The article contains a whole series of charts. The one shown here focuses on vitamins.

The red color screams at the reader. At first, it appears to suggest that dairy milk is a standout on all four categories of vitamins. But that's not what the data say.

Let's take a look at the chart form: it's a grid of four plots, each containing one square for each of four types of "milk". The data are encoded in the areas of the squares. The red and green colors represent category labels and do not reflect data values.

Whenever we make bubble plots (the closest relative of these square plots), we have to solve a scale problem. What is the relationship between the scales of the four plots?

I noticed the largest square is the same size across all four plots. So, the size of each square is made relative to the maximum value in each plot, which is assigned a fixed size. In effect, the data encoding scheme is that the areas of the squares show the index values relative to the group maximum of each vitamin category. So, soy milk has 72% as much potassium as dairy milk while oat and almond milks have roughly 45% as much as dairy.

The same encoding scheme is applied also to riboflavin. Oat milk has the most riboflavin, so its square is the largest. Soy milk is 80% of oat, while dairy has 60% of oat.


Let's step back to the Trifecta Checkup (link). What's the question being asked in this chart? We're interested in the amount of vitamins found in plant-based milk relative to dairy milk. We're less interested in which type of "milk" has the highest amount of a particular vitamin.

Thus, I'd prefer the indexing tied to the amount found in dairy milk, rather than the maximum value in each category. The following set of column charts shows this encoding:


I changed the color coding so that blue columns represent higher amounts than dairy while yellow represent lower.

From the column chart, we find that plant-based "milks" contain significantly less potassium and phosphorus than dairy milk while oat and soy "milks" contain more riboflavin than dairy. Almond "milk" has negligible amounts of riboflavin and phosphorus. There is virtually no difference between the four "milk" types in providing vitamin D.


In the above redo, I strengthen the alignment of the Q and V corners. This is accomplished by making a stop at the D corner: I change how the raw data are transformed into index values. 
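The change of index can be sketched in a few lines. The raw amounts are not listed in this post, so the sketch starts from the riboflavin index values relative to the group maximum quoted earlier (oat 100, soy 80, dairy 60) and re-bases them on dairy. The helper name `reindex` is hypothetical:

```python
# Riboflavin index values relative to the group maximum (oat = 100), as quoted above.
max_indexed = {"oat": 100.0, "soy": 80.0, "dairy": 60.0}


def reindex(values: dict, reference: str) -> dict:
    """Re-base index values so that the reference item equals 100."""
    ref = values[reference]
    return {name: 100 * value / ref for name, value in values.items()}


dairy_indexed = reindex(max_indexed, "dairy")

for name, value in dairy_indexed.items():
    print(f"{name}: {value:.0f}")
```

With dairy as the reference, oat comes out at about 167 and soy at about 133, so anything above 100 reads directly as "more than dairy". For potassium, where dairy is already the maximum, the two indexing schemes give the same values.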

Just for comparison, if I only change the indexing strategy but retain the square plot chart form, the revised chart looks like this:


The four squares showing dairy on this version have the same size. Readers can evaluate the relative sizes of the other "milk" types.