Small tweaks that make big differences

It's one of those days that a web search led me to an unfamiliar corner, and I found myself poring over a pile of column charts that look like this:

GO-and-KEGG-diagrams-A-Forty-nine-different-GO-term-annotations-of-the-parental-genes

This pair of charts appears to be canonical in a type of genetics analysis. I'll focus on the column chart up top.

The chart plots a variety of gene functions along the horizontal axis. These functions are classified into three broad categories, indicated using axis annotation.

What are some small tweaks that readers will enjoy?

***

First, use colors. Here is an example in which the designer uses color to indicate the function classes:

Fcvm-09-810257-g006-3-colors

The primary design difference between these two column charts is using three colors to indicate the three function classes. This little change makes it much easier to recognize the ending of one class and the start of the other.

Color doesn't have to be limited to column areas. The following example extends the colors to the axis labels:

Fcell-09-755670-g004-coloredlabels

Again, just a smallest of changes but it makes a big difference.

***

It bugs me a lot that the long axis labels are printed in a slanted way, forcing every serious reader to read with slanted heads.

Slanting it the other way doesn't help:

Fig7-swayright

Vertical labels are best read...

OR-43-05-1413-g06-vertical

These vertical labels are best read while doing side planks.

Side-Plank

***

I'm surprised the horizontal alignment is rather rare. Here's one:

Fcell-09-651142-g004-horizontal

 


Tidying up the details

This column chart caught my attention because of the color labels.

Thall_financials2023_pandl

Well, it also concerns me that the chart takes longer to take in than you'd think.

***

The color labels say "FY2123", "FY2022", and "FY1921". It's possible but unlikely that the author is making comparisons across centuries. The year 2123 hasn't yet passed, so such an interpretation would map the three categories to long-ago past, present and far-into-the-future.

Perhaps hyphens were inadvertently left off so "FY2123" means "FY2021 - FY2023". It's odd to report financial metrics in multi-year aggregations. I rule this out because the three categories would then also overlap.

Here's what I think the mistake is: somehow the prefix is rolled forward when it is applied to the years. "FY23", "FY22", "FY21" got turned into "FY[21]23", "FY[20]22", "FY[19]21" instead of putting 20 in all three slots.

The chart appeared in an annual financial report, and the comparisons were mostly about the reporting year versus the year before so I'm pretty confident the last two digits are accurately represented.

Please let me know if you have another key to this puzzle.

In the following, I'm going to assume that the three colors represent the most recent three fiscal years.

***

A few details conspire to blow up our perception time.

There was no extra spacing between groups of columns.

The columns are arranged in reverse time order, with the most recent year shown on the left. (This confuses those of us that use the left-to-right convention.)

The colors are not ordered. If asked to sort the three colors, you will probably suggest what is described as "intuitive" below:

Junkcharts_color_order

The intuitive order aligns with the amount of black added to a base color (hue). But this isn't the order assigned to the three years on the original chart.

***

Some of the other details on the chart are well done. For example, I like the handling of the gridlines and the axes.

The following revision tidies up some of the details mentioned above, without changing the key features of the chart:

Junkcharts_redo_trinhallfinancials

 


Using disaggregation in dataviz

This chart appears in a journal article on the use of AI (artificial intelligence) in healthcare (link).

Ai_healthcare_barchart

It's a stacked bar chart in which each bar is subdivided into four segments. The authors are interested in the relative frequency of research using AI by disease type. The chart only shows the top 10 disease types.

What is unusual is that the subdivisions are years. So these authors revealed four years of journal articles, and while the overall ranking of the disease types is by the aggregated four-year total counts, each total count has been disaggregated by color so readers can also see the annual counts.

***

A slight rearrangement yields the following:

Junkcharts_redo_aihealthcare

Most readers will only care about the left chart showing the total counts. More invested readers may consider the colored charts that show annual totals. These are arranged so that the annual counts are easily read and compared.

***

One annoying aspect of this type of presentation is that in almost all cases, the top 10 types in aggregate will not be the top 10 types by individual year. In some of those years, I expect that the 10 types shown do not include all of the top 10 types for a particular year.


The most dangerous day

Our World in Data published this interesting chart about infant mortality in the U.S.

Mostdangerousday

The article that sent me to this chart called the first day of life the "most dangerous day". This dot plot seems to support the notion, as the "per-day" death rate is the highest on the day of birth, and then drops quite fast (note log scale) over the the first year of life.

***

Based on the same dataset, I created the following different view of the data, using the same dot plot form:

Junkcharts_redo_ourworldindata_infantmortality

By this measure, a baby has 99.63% chance of surviving the first 30 days while the survival rate drops to 99.5% by day 180.

There is an important distinction between these two metrics.

The "per day" death rate is the chance of dying on a given day, conditional on having survived up to that day. The chance of dying on day 2 is lower partly because some of the more vulnerable ones have died on day 1 or day 0,  etc.

The survival rate metric is cumulative: it measures how many babies are still alive given they were born on day 0. The survival rate can never go up, so long as we can't bring back the dead.

***

If we are assessing a 5-day-old baby's chance of surviving day 6, the "per-day" death rate is relevant since that baby has not died in the first 5 days.

If the baby has just been born, and we want to know the chance it might die in the first five days (or survive beyond day 5), then the cumulative survival rate curve is the answer. If we use the per-day death rate, we can't add the first five "per-day" death rates It's a more complicated calculation of dying on day 0, then having not died on day 0, dying on day 1, then having not died on day 0 or day 1, dying on day 2, etc.