Hammock plots

Prof. Matthias Schonlau gave a presentation about "hammock plots" in New York recently.

Here is an example of a hammock plot that shows the progression of different rounds of voting during the 1903 papal conclave. (These images were taken at the event and thus are a little askew.)

Hammockplot_conclave

The chart shows how Cardinal Sarto beat the early favorite Rampolla during later rounds of voting. The chart traces the movement of votes from one round to the next. The Vatican normally destroys voting records; apparently, the records for this particular conclave were unexpectedly retained.

The dataset has several features that bring out the strengths of such a plot.

There is a fixed number of votes, and a fixed number of candidates. At each stage, the votes are distributed across a subset of the candidates. From stage to stage, the support levels for each candidate shift. The chart brings out the evolution of the vote.

From the "marginals", i.e. the stacked columns shown at each time point, we learn the relative strengths of the candidates, as they evolve from vote to vote.

The links between the column blocks display the evolution of support from one vote to the next. We can see which candidate received more votes, as well as where the additional votes came from (or, to whom some voters have drifted).

The data are neatly arranged in successive stages, resulting in discrete time steps.

Because the total number of votes is fixed, the relative sizes of the marginals are nicely constrained.

The chart is made much more readable because of binning. Only the top three candidates are shown individually with all the others combined into a single category. This chart would have been quite a mess if it showed, say, 10 candidates.

How precisely we can show the intra-stage movement depends on how the data records were kept. If we have the votes for each person in each round, then it should be simple to execute the above! If we only have the marginals (the vote distribution by candidate) at each round, then we are forced to make some assumptions about which voters switched their votes. We'd likely have to rule out unlikely scenarios, such as one in which all of candidate X's previous voters switched to other candidates while an entirely different set of voters switched to candidate X.
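If individual ballots are available, computing the flows is indeed simple - a cross-tabulation of one round against the next. Here is a minimal pandas sketch with made-up ballots (the candidate names are real; the votes are not), which also performs the binning into a top-candidates-plus-Other scheme:

import pandas as pd

# Made-up ballots: one row per cardinal, one column per round.
votes = pd.DataFrame({
    "round1": ["Rampolla", "Sarto", "Gotti", "Rampolla", "Oreglia", "Sarto"],
    "round2": ["Rampolla", "Sarto", "Sarto", "Rampolla", "Sarto", "Sarto"],
})

# Bin: keep the top candidates, lump everyone else into "Other".
top = ["Rampolla", "Sarto", "Gotti"]
binned = votes.apply(lambda col: col.where(col.isin(top), "Other"))

# With individual-level data, round-to-round flows are a simple crosstab.
flows = pd.crosstab(binned["round1"], binned["round2"])
print(flows)

Each cell of the crosstab is the width of one link in the hammock plot.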

***

Matthias also showed examples of hammock plots applied to different types of datasets.

The following chart displays data from course evaluations. Unlike the conclave example, the variables tied to questions on the survey are neither ordered nor sequential. Therefore, there is no natural sorting available for the vertical axes.

Hammockplot_evals

Time is a highly useful organizing element for this type of chart. Without such an element, the designer must manually choose an order.

The vertical axes correspond to specific questions on the course evaluation. Students are aggregated into groups based on the "profile" of grades given for the whole set of questions. It's quite easy to see that opinions are most aligned on the "workload" question while most of the scores are skewed high.

Missing values are handled by plotting them as a new category at the bottom of each vertical axis.

This example is similar to the conclave example in that each survey response is categorical, one of five values (plus missing). Matthias also showed examples of hammock plots in which some or all of the variables are numeric data.

***

Some of you will see a resemblance between the hammock plot and various similar charts, such as the profile chart, the alluvial chart, the parallel coordinates chart, and Sankey diagrams. Matthias discussed all of those as well.

Matthias has a book out called "Applied Statistical Learning" (link).

Also, there is a Python package for the hammock plot on GitHub.


On the interpretability of log-scaled charts

A previous post featured the following chart showing stock returns over time:

Gelman_overnightreturns_tsla

Unbeknownst to readers, the chart plots one thing but labels it something else.

The designer of the chart explains how to read the chart in a separate note, which I included in my previous post (link). It's a crucial piece of information. Before reading his explanation, I didn't realize the sleight of hand: he made a chart with one time series, then substituted the y-axis labels with another set of values.

As I explored this design choice further, I realized that it has been widely adopted in a common chart form, without fanfare. I'll get to it in due course.

***

Let's start our journey with as simple a chart as possible. Here is a line chart showing constant growth in the revenues of a small business:

Junkcharts_dollarchart_origvalues

For all the charts in this post, the horizontal axis depicts time (x = 0, 1, 2, ...). To simplify further, I describe discrete time steps although nothing changes if time is treated as continuous.

The vertical scale is in dollars, the original units. It's conventional to modify the scale to units of thousands of dollars, like this:

Junkcharts_dollarchart_thousands

No controversy arises if we treat these two charts as identical. Here I put them onto the same plot, using dual axes, emphasizing the one-to-one correspondence between the two scales.

Junkcharts_dollarchart_dualaxes

We can do the same thing for two time series that are linearly related. The following chart shows constant growth in temperature using both Celsius and Fahrenheit scales:

Junkcharts_tempchart_dualaxes
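As an aside, this dual-axes construction is easy to reproduce. Here is a minimal matplotlib sketch with made-up data, in which the right axis is tied to the left through the conversion formula and its inverse:

import matplotlib.pyplot as plt

# Constant growth in temperature, recorded in Celsius (made-up data).
t = [0, 1, 2, 3, 4]
celsius = [0, 10, 20, 30, 40]

fig, ax = plt.subplots()
ax.plot(t, celsius)
ax.set_ylabel("Temperature (C)")

# The secondary axis stays in one-to-one correspondence with the primary.
secax = ax.secondary_yaxis("right", functions=(lambda c: c * 9 / 5 + 32,
                                               lambda f: (f - 32) * 5 / 9))
secax.set_ylabel("Temperature (F)")
plt.show()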

Here is the chart displaying only the Fahrenheit axis:

Junkcharts_tempchart_fahrenheit

This chart admits two interpretations: (A) it is a chart constructed using F values directly and (B) it is a chart created using C values, after which the axis labels were replaced by F values. Interpretation B implements the sleight of hand of the log-returns plot. The issue I'm wrestling with in this post is the utility of interpretation B.

Before we move to our next stop, let's stipulate that if we are exposed to that Fahrenheit-scaled chart, either interpretation can apply; readers can't tell them apart.

***

Next, we look at the following line chart:

Junkcharts_trendchart_y

Notice the vertical axis uses a log10 scale. We know it's a log scale because the equally-spaced tickmarks represent different jumps in value: the first jump is from 1 to 10, the next jump is from 10, not to 20, but to 100.

Just like before, I make a dual-axes version of the chart, putting the log Y values on the left axis, and the original Y values on the right axis.

Junkcharts_trendchart_dualaxes

By convention, we often print the original values as the axis labels of a log chart. Can you recognize the sleight of hand? We make the chart using the log values, after which we replace the log value labels with the original value labels. We adopt this graphical trick because humans don't think in log units, thus the log value labels are less "interpretable".

As with the temperature chart, we will attempt to interpret the chart two ways. I've already covered interpretation B. For interpretation A, we regard the line chart as a straightforward plot of the values shown on the right axis (i.e., the original values). Alas, this viewpoint fails for the log chart.

If the original data are plotted directly, the chart should look like this:

Junkcharts_trendchart_y_origvalues

It's not a straight line but a curve.

What have I just shown? That, after using the sleight of hand, we cannot interpret the chart as if it were directly plotting the data expressed in the original scale.

To nail down this idea, we ask a basic question of any chart showing trendlines. What's the rate of change of Y?

Using the transformed log scale (left axis), we find that the rate of change is 1 unit per unit time. Using the original scale, the rate of change from t=1 to t=2 is (100-10)/1 = 90 units per unit time; from t=2 to t=3, it is (1000-100)/1 = 900 units per unit time. Even though the rate of change varies by time step, the log chart with original value labels gives the misleading impression that the rate of change is constant over time (hence a straight line). The decision to substitute the log value labels backfires!
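To see the contrast directly, here is a minimal matplotlib sketch (using the same made-up values as above) that plots the series on a log scale and on the original scale side by side:

import matplotlib.pyplot as plt

t = [1, 2, 3]
y = [10, 100, 1000]

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(t, y)
ax1.set_yscale("log")   # renders as a straight line: each step multiplies y by 10
ax1.set_title("log scale")
ax2.plot(t, y)          # renders as a curve: the rate of change grows with y
ax2.set_title("original scale")
plt.show()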

This is one reason why I use log charts sparingly. (I do like them a lot for exploratory analyses, but I avoid using them as presentation graphics.) This issue of interpretation is why I dislike the sleight of hand used to produce those log stock returns charts, even if the designer offers a note of explanation.

Do we gain or lose "interpretability" when we substitute those axis labels?

***

Let's re-examine the dual-axes temperature chart, building on what we just learned.

Junkcharts_tempchart_dualaxes

The above chart suggests that whichever scale (axis) is chosen, we get the same line, with the same steepness. Thus, the rate of change is the same regardless of scale. This turns out to be an illusion.

Using the left axis, the slope of the line is 10 degrees Celsius per unit time. Using the right axis, the slope is 18 degrees Fahrenheit per unit time. 18 F is different from 10 C, thus, the slopes are not really the same! The rate of change of the temperature is given algebraically by the slope, and visually by the steepness of the line. Since two different slopes result in the same line steepness, the visualization conveys a lie.

The situation here is a bit better than that of the log chart. Here, in either scale, the rate of change is constant over time. Differentiating the temperature conversion formula, we find that the slope of the Fahrenheit line is always 9/5 times the slope of the Celsius line. So a rate of 10 Celsius per unit time corresponds to 18 Fahrenheit per unit time.

What if the chart is presented with only the Fahrenheit axis labels although it is built using Celsius data? Since readers only see the F labels, the observed slope is in Fahrenheit units. Meanwhile, the chart creator uses Celsius units. This discrepancy is harmless for the temperature chart but it is egregious for the log chart. The underlying reason is the nonlinearity of the log transform - the slope of log Y vs time is not proportional to the slope of Y vs time; in fact, it depends on the value of Y.  
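In symbols (a quick derivation to nail down the difference): the temperature conversion F = (9/5)C + 32 gives dF/dt = (9/5) dC/dt, a constant multiple, so the two slopes always stay in proportion. By contrast, setting z = log y gives dz/dt = (1/y) dy/dt, a multiple that changes with the level of y.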

***

Log charts are a sacred cow of scientists, a symbol of our sophistication. Are they as potent as we'd like to think? In particular, when we put original data values on a log chart, are we making it more interpretable, or less?

 

P.S. I want to tie this discussion back to my Trifecta Checkup framework. The design decision to substitute those axis labels is an example of an act that moves the visual (V) away from the data (D). If the log units were printed, the visual would make sense; with the original units dropped in, the visual no longer conveys features of the data - the reader must ignore what the eyes are seeing, and let the brain do the work instead.


Logging a sleight of hand

Andrew puts up an interesting chart submitted by one of his readers (link):

Gelman_overnightreturns_tsla

Bruce Knuteson, who created this chart, is pursuing a theory that there is something fishy going on in the stock markets overnight (i.e. between the close of one day and the open of the next day). He split the price data into two interleaving parts: the blue line represents returns overnight and the green line represents returns intraday (from the open of one day to the close of the same day). In this example related to Tesla's stock, the overnight "return" is an eye-popping 36,850% while the intraday "return" is -46%.
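Bruce's split is straightforward to reproduce in code. Here is a minimal pandas sketch with made-up prices (the column names and numbers are mine, not his):

import pandas as pd

# Made-up daily prices; a real analysis would use actual market data.
px = pd.DataFrame({"open": [100.0, 102.0, 101.0, 105.0],
                   "close": [101.0, 100.5, 104.0, 103.0]})

# Intraday return: open of a day to the close of the same day.
intraday = px["close"] / px["open"] - 1
# Overnight return: close of one day to the open of the next day.
overnight = px["open"] / px["close"].shift(1) - 1

# Cumulative value of $1 along each interleaved subsequence.
intraday_value = (1 + intraday).cumprod()
overnight_value = (1 + overnight.fillna(0)).cumprod()
print(intraday_value.iloc[-1], overnight_value.iloc[-1])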

This is an example of an average masking interesting details in the data. One typically looks at the entire sequence of values at once, while this analysis breaks it up into two subsequences. I'll write more about the data analysis at a later point. This post will be purely about the visualization.

***

It turns out that while the chart looks like a standard time series, it isn't. Bruce wrote out the following essential explanation:

Gelman_overnightreturns

The chart can't be interpreted without first reading this note.

The left chart (a) is the standard time-series chart we're thinking about. It plots the relative cumulative percentage change in the value of the investment over time. Imagine one buys $1 of Apple stock on day 1. It shows the cumulative return on day X, expressed as a percent relative to the initial investment amount. As mentioned above, the data series was split into two: the intraday return series (green) is dwarfed by the overnight return series (blue), and is barely visible, hugging the horizontal axis.

Almost without thinking, a graphics designer applies a log transform to the vertical axis. This has the effect of "taming" the extreme values in the blue line. This is the key design change in the middle chart (b). The other change is to switch back to absolute values. The day 1 number is now $1 so the day X number shows the cumulative value of the investment on day X if one started with $1 on day 1.

There's a reason why I emphasized the log transform over the switch to absolute values. That's because the relationship between absolute and relative values here is a linear one. If y(t) is the absolute cumulative value of $1 at time t, then the percent change r(t) = 100*(y(t) - 1). (Note that y(1) = 1 by definition.) The shape of the middle chart is primarily conditioned by the log transform.

In the right chart (c), which is the design that Bruce features in all his work, the visual elements of chart (b) are retained, but the vertical axis labels are replaced with those from chart (a). In other words, the lines show the cumulative absolute values while the labels show the relative cumulative percent returns.

I left this note on Gelman's blog (corrected a mislabeling of the chart indices):

I'm interested in the sleight of hand related to the plots, also tying this back to the recent post about log scales. In plot (b) [middle of the panel], he transformed the data to show the cumulative value of the investment assuming one puts $1 in the stock on day 1. He applied a log scale on the vertical axis. This is fine. Then in plot (c), he retained the chart but changed the vertical axis labels so instead of absolute value of the investment, he shows percent changes relative to the initial value.

Why didn't he just plot the relative percent changes? Let y(t) be the absolute value; the percent change r(t) = 100*(y(t) - 1) is a simple linear transformation of y(t). This is where the log transform creates problems! The y(t) series is guaranteed to be positive since hitting y(t) = 0 means the entire investment is lost. However, the r(t) series can hit negative values and also cross over zero many times. Thus, log r(t) is inoperable. The problem is using the log transform for data that are not always positive, and the sleight of hand does not fix it!

Just pick any day on which the absolute return fell below $1, e.g. the last day of the plot, on which the absolute value of the investment was down to $0.80. In the middle plot (b), the value depicted is ln(0.8) = -0.22. Note that the plot is in log scale, so what is labeled as $1 is really ln(1) = 0. If we instead try to plot the relative percent changes, then the day 1 number should be ln(0), which is undefined, while the last number should be ln(-20%), which is also undefined.

This is another example of something uncomfortable about using log scales, which I pointed out in this post. It's this idea that when we do log plots, we can freely substitute axis labels which are not directly proportional to the actual labels. It's plotting one thing, and labeling it something else. These labels are then disconnected from the visual encoding. It's against the goal of visualizing data.

 


Making major things easy, and minor things hard

A recent issue of Significance magazine carried the following stacked column chart showing how the driver license status of men and women changes as they age. The data came from the U.K.

Siginificance_olddrivers_1

Quick question - what percentage of British men in their sixties hold full driver licenses?

***

I was just kidding. Such questions can't be quickly answered from a stacked column chart. That's because you have to find the axis, and then mentally invert it.

On that chart, larger values are shown pointing down (green) and also pointing up (blue), and ... well, I don't have words for the yellow. In fact, the yellow segments, showing people without licenses, are possibly the most important category for this report.

In making decisions about visualizing data, it's important to separate out the major things from the minor things.

***

Here is a reimagination of the chart using connected dots:

Junkcharts_redo_significanceolderdrivers

What is hard to do using this chart is to verify that the three proportions add to 100%. What is easy is to read off the proportion for any gender, age and license status subgroup.

It's really quite intricate how these researchers binned the age data. There are bins of size 1, 4, 5 and 10, plus a top group of 85 and above. The way I handled these is to turn everything into 1-year bins. I assume that in the wider bins, we don't have precise data for each age, and that the bin value is the average within the bin; thus, it is as if someone had drawn a horizontal line across the bin width. (I left the top bin alone as I don't know the maximum age of a person in this study.)
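The expansion into 1-year bins can be sketched as follows; the bin edges and percentages here are made up for illustration:

import pandas as pd

# (start age, end age, percent with full license) - made-up values.
bins = [(17, 17, 30.0), (18, 20, 55.0), (21, 29, 75.0), (30, 39, 82.0)]

rows = []
for start, end, pct in bins:
    for age in range(start, end + 1):
        # Assign the bin average to every year of age within the bin,
        # i.e., draw a horizontal line across the bin width.
        rows.append({"age": age, "pct_full_license": pct})
expanded = pd.DataFrame(rows)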

***

Those of you who have laminated the flowchart of data visualization are probably irate. According to such a flowchart, one must use a column chart because the x variable (age band) has irregularly-sized discrete values, and one must use a stacked column chart because the y variable is a percentage, grouped by a third variable (license status).

Don't be mad, just ditch the flowchart.

 


Deliberately obstructing chart elements as a plot point

Bbc_globalwarming_ridgeplot_sm

These "ridge plots" have become quite popular in recent times. The following example, from this BBC report (link), shows the change in global air temperatures over time.

***

This chart is in reality a panel of probability density plots, one for each year of the dataset. The years are arranged with the oldest at the top and the most recent at the bottom. You take those plots and squeeze every ounce of space out, so that each chart overlaps massively with the ones above it.

The plot at the bottom is the only one that can be seen unobstructed.

Overplotting chart elements, deliberately obstructing them, doesn't sound useful. Is there something gained for what's lost?
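For the record, here is a minimal sketch of how such a ridge plot is assembled - made-up normal curves, each drawn at a small vertical offset so it partially covers the ones behind it:

import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
n_years = 20
x = np.linspace(-3, 5, 300)
for i in range(n_years):
    mu = i * 0.1                    # made-up warming trend
    density = np.exp(-(x - mu) ** 2 / 2) / np.sqrt(2 * np.pi)
    offset = (n_years - i) * 0.15   # oldest year sits highest on the page
    ax.fill_between(x, offset, offset + density,
                    color="white", edgecolor="black", zorder=i)
plt.show()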

***

The appeal of the ridge plot is the metaphor of ridges, or crests if you see ocean waves. What do these features signify?

The legend at the bottom of the chart gives a hint.

The main metric used to describe global warming is the amount of excess temperature, defined as the temperature relative to a historical average, set as the average temperature during the pre-industrial age. In recent years, the average global temperature is about 1.5 degrees Celsius above the reference level.

One might think that the higher the peak in a given plot, the higher the excess temperature. Not so. The heights of those peaks do not indicate temperatures.

What's the scale of the vertical axis? The labels suggest years, but that's a distractor also. If we consider the panel of non-overlapping probability density charts, the vertical axis should show probability density. In such a panel, the year labels should go to the titles of individual plots. On the ridge plot, the density axes are sacrificed, while the year labels are shifted to the vertical axis.

Admittedly, probability density is not an intuitive concept, so not much is lost by its omission.

The legend appears to suggest that the vertical scale is expressed in number of days, so that in any given year, the peak of the curve occurs at the most likely excess temperature. But the amount of excess is read from the horizontal axis, not the vertical axis - it is encoded as a horizontal displacement away from the historical average. In other words, the height of the peak still doesn't correlate with the magnitude of the excess temperature.

The following set of probability density curves (with made-up data) each has the same average excess temperature of 1.5 degrees. Going from top to bottom, the variability of the excess temperatures increases. The height of the peak decreases accordingly because in a density plot, we require the total area under the curve to be fixed. Thus, the higher the peak, the lower the daily variability of the excess temperature.

Kfung_pdf_variances
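Curves like these can be generated from any density family. Here is a sketch using normal densities with a fixed mean of 1.5 and increasing standard deviations (my choice for illustration, not necessarily how the chart above was made):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-3, 6, 400)
fig, axes = plt.subplots(3, 1, sharex=True)
for ax, sd in zip(axes, [0.5, 1.0, 2.0]):
    # Same average excess temperature (1.5), increasing variability;
    # the area under each curve is 1, so the peak height falls as 1/sd.
    ax.plot(x, norm.pdf(x, loc=1.5, scale=sd))
plt.show()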

A problem with this ridge plot is that it draws our attention to the heights of the peaks, which provide information about a secondary metric.

If we want to find the story that the amount of excess temperature has been increasing over time, we would have to trace a curve through the ridges, which strangely enough is a line that moves top to bottom, initially somewhat vertically, then moving sideways to the right. In a more conventional chart, the line that shows growth over time moves from bottom left to top right.

***

The BBC article (link) features several charts. The first one shows how the average excess temperature trends year to year. This is a simple column chart. By supplementing the column chart with the ridge plot, I assume that the designer wants to tell readers that the average annual excess temperature masks daily variability. Therefore, each annual average has been disaggregated into 366 daily averages.

In the column chart, the annual average is compared to the historical average of 50 years. In the ridge plot, the daily average is compared to ... the same historical average of 50 years. That's what the reference line labeled pre-industrial average is saying to me.

It makes more sense to compare the 366 daily averages to 366 daily averages from those 50 years.

But now I've ruined the dataviz because in each probability density plot, there are 366 different reference points. But not really. We just have to think a little more abstractly. These 366 different temperatures are all mapped to the number zero, after adjustment. Thus, they all coincide at the same location on the horizontal axis.

(It's possible that they actually used 366 daily averages as references to construct the ridge plot. I'm guessing not but feel free to comment if you know how these values are computed.)


Don't show everything

There are many examples where one should not show everything when visualizing data.

A long-time reader sent me this chart from the Economist, published around Thanksgiving last year:

Economist_musk

It's a scatter plot with each dot representing a single tweet by Elon Musk against a grid of years (on the horizontal axis) and time of day (on the vertical axis).

The easy messages to pick up include:

  • the increase in frequency of tweets over the years
  • especially, the jump in density after Musk bought Twitter in late 2022 (there is also a less obvious level up around 2018)
  • the almost continuous tweeting throughout 24 hours.

By contrast, it's hard if not impossible to learn the following:

  • how many tweets did he make on average or in total per year, per day, per hour?
  • the density of tweets for any single period of time (i.e., a reference for everything else)
  • the growth rate over time, especially the magnitude of the jumps

The paradox: a chart that is data-dense but information-poor.

***

The designer added gridlines and axis labels to help structure our reading. Specifically, we're cued to separate the 24 hours into four 6-hour chunks. We're also expected to divide the years into two groups (pre- and post- the Musk acquisition), and secondarily, into one-year intervals.

If we accept this analytical frame, we can divide time into these boxes, compute summary statistics within each box, and present those values. I'm working on some concepts, and will show them next time.
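As a preview of that frame, here is a minimal pandas sketch that counts tweets within each (year, 6-hour chunk) box - the timestamps are made up:

import pandas as pd

tweets = pd.DataFrame({"ts": pd.to_datetime([
    "2017-03-01 02:15", "2018-07-04 14:30", "2023-01-15 03:45",
    "2023-01-15 09:10", "2024-06-30 22:05",
])})

# Divide each day into four 6-hour chunks, then count tweets per box.
tweets["year"] = tweets["ts"].dt.year
tweets["chunk"] = tweets["ts"].dt.hour // 6   # 0 = midnight-6am, 1 = 6am-noon, ...
summary = tweets.groupby(["year", "chunk"]).size().unstack(fill_value=0)
print(summary)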

 


Simple presentations

In the previous post, I looked at this chart that shows the distributions of four subgroups found in a dataset:

Davidcurran_originenglishwords

This chart takes quite some effort to decipher, as does another version I featured.

The key messages appear to be: (i) most English words are of Germanic origin, (ii) the most popular English words are even more skewed towards Germanic origin, (iii) words of French origin started showing up around rank 50, those of Latin origin around rank 250.

***

If we are making a graphic for presentation, we can simplify the visual clutter tremendously by - hmmm - a set of pie charts.

Junkcharts_redo_originenglishwords_pies

For those allergic to pies, here's a stacked column chart:

Junkcharts_redo_originenglishwords_columns

Both of these can be thought of as "samples" from the original chart, selected to highlight shifts in the relative proportions.

Davidcurran_originenglishwords_sampled

I also reversed the direction of the horizontal axis as I think the story is better told starting from the whole dataset and homing in on subsets.

 

P.S. [1/10/2025] A reader who has expertise in this subject also suggested a stacked column chart with reversed axis in a comment, so my recommendation here is confirmed.


Five-value summaries of distributions

BG commented on my previous post, describing her frustration with the “stacked range chart”:

A stacked graph visualizes cubes stacked one on top of the other. So you can't use it for negative numbers, because there's no such thing [as] "negative data". In graphs, a "minus" sign visualizes the opposite direction of one series from another. Doing average plus average plus average plus average doesn't seem logical at all.

***

I have already planned a second post to discuss the problems of using a stacked column chart to show markers of a numeric distribution.

I tried to replicate how the Youtuber generated his “stacked range chart” by appropriating Excel’s stacked column chart, but failed. I think there are some missing steps not mentioned in the video. At around 3:33 of the video, he shows a “hack” involving adding 100 degrees (any large enough value) to all values (already converted to ranges). Then, the next screen displays the resulting chart. Here is the dataset on the left and the chart on the right.

Minutephysics_londontemperature_datachart

Afterwards, he replaces the axis labels with new labels, effectively shifting the axis. But something is missing from the narrative. Since he's using a stacked column chart, the values in the table are encoded in the heights of the respective blocks. The total stacked height of each column should be in the hundreds since he has added 100 to each cell. But that's not what the chart shows.

***

In the rest of the post, I’ll skip over how to make such a chart in Excel, and talk about the consequences of inserting “range” values into the heights of the blocks of a stacked column chart.

Let’s focus on London, Ontario; the five temperature values, corresponding to various average temperatures, are -3, 5, 9, 14, 24. Just throwing those numbers into a stacked column chart in Excel results in the following useless chart:

Stackedcolumnchart_londonontario

The temperature averages are cumulatively summed, which makes no sense, as noted by reader BG. [My daily temperature data differ somewhat from those in the YouTube video. My source is here.]
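By the way, the usual way to float segments in a column chart is to stack the gaps between successive values, anchoring each block at the previous value; this may be the step missing from the video. A minimal matplotlib sketch using the five values above:

import matplotlib.pyplot as plt

values = [-3, 5, 9, 14, 24]   # London, Ontario five-value summary

fig, ax = plt.subplots()
# Each block spans from one value to the next; its height is the gap.
for lo, hi in zip(values[:-1], values[1:]):
    ax.bar("London, ON", hi - lo, bottom=lo)
plt.show()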

We should ignore the interiors of the blocks, and instead interpret the edges of these blocks. There are five edges corresponding to the five data values. As in:

Junkcharts_redo_londonontariotemperatures_dotplot

The average temperature in London, Ontario (during Spring 2023-Winter 2024) is 9 C. This overall average hides seasonal as well as diurnal variations in temperature.

If we want to acknowledge that night-time temperatures are lower than day-time temperatures, we draw attention to the two values bracketing 9 C, i.e. 5 C and 14 C. The average daytime (max) temperature is 14 C while the average night-time (min) temperature is 5 C. Furthermore, Ontario experiences seasons, so that the average daytime temperature of 14 C is subject to seasonal variability; in the summer, it goes up to 24 C. In the winter, the average night-time temperature goes down to -3 C, compared to 5 C across all seasons. [For those paying closer attention, daytime/max and night-time/min form congruous pairs because the max temperature occurs during daytime while the min temperature occurs during night-time. Thus, the average of maximum temperatures is the same as the average of daytime maximum temperatures.]

The above dotplot illustrates this dataset adequately. The Youtuber explained why he didn’t like it – I couldn’t quite make sense of what he said. It’s possible he thinks the gaps between those averages are more meaningful than the averages themselves, and therefore he prefers a chart form that draws our attention to the ranges, rather than the values.

***

Our basic model of temperature can be thought of as: temperature on a given day = overall average + adjustment for seasonality + adjustment for diurnality.

Take the top three values 9, 14, 24 from the above list. Starting at the overall average of 9 C, the analyst gets to 14 if he homes in on max daily temperatures, and to 24 if he further restricts the analysis to summer months (which have the higher temperatures). The second gap is 10 C, twice as large as the first gap of 5 C. Thus, the seasonal fluctuations have larger magnitude than daily fluctuations. Said differently, the effect of seasons on temperature is bigger than that of hour of day.

In interpreting the “ranges” or gaps between averages, narrow ranges suggest low variability while wider ranges suggest higher variability.

Here's a set of boxplots for the same data:

Junkcharts_redo_londonontariotemperatures

The boxplot "edges" also demarcate five values; they are not the same five values as defined by the Youtuber but both sets of five values describe the underlying distribution of temperatures.

 

P.S. For a different example of something similar, see this old post.


Dot plots with varying dot sizes

In a prior post, I appreciated the effort by the Bloomberg Graphics team to describe the diverging fortunes of Japanese and Chinese car manufacturers in various Asian markets.

The most complex chart used in that feature is the following variant of a dot plot:

Bloomberg_japancars_chinamarket

This chart plots the competitors in the Chinese domestic car market. Each bubble represents a car brand. Using the styling of the entire article, the red color is associated with Japanese brands while the medium gray color indicates Chinese brands. The light gray color shows brands from the rest of the world. (In my view, adding the pink for U.S. and blue for German brands - seen on the first chart in this series - isn't too much.)

The dot size represents the current relative market share of the brand. The main concern of the Bloomberg article is the change in market share in the period 2019-2024. This is placed on the horizontal axis, so the bubbles on the right side represent growing brands while the bubbles on the left, weakening brands.

All the Japanese brands are stagnating or declining, from the perspective of market share.

The biggest loser appears to be Volkswagen, although it evidently started off at a high level, since its bubble size after shrinkage is still among the largest.

***

This chart form is a composite. There are at least two ways to describe it. I prefer to see it as a dot plot with an added dimension of dot size. A dot plot typically plots a single dimension on a single axis, and here, a second dimension is encoded in the sizes of the dots.

An alternative interpretation is that it is a scatter plot with a third dimension in the dot size. Here, the vertical dimension is meaningless, as the dots are arbitrarily spread out to prevent overplotting. This arrangement is also called a bubble plot, if we adopt the convention that a bubble is a dot of variable size. In a typical bubble plot, both the vertical and horizontal axes carry meaning, but here the vertical axis is arbitrary.

The bubble plot draws attention to the variable encoded in the bubble size; the scatter plot emphasizes the two variables encoded in the grid; the dot plot highlights a single metric. Each treats the remaining variables as secondary.
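To make the distinction concrete, here is a minimal matplotlib sketch of the dot-plot-with-sizes form, using made-up brands; note that the vertical position is pure jitter:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Made-up data: change in market share (x) and current share (dot size).
change = np.array([-3.0, -1.5, 0.2, 1.0, 4.5, 9.0])
share = np.array([8.0, 5.0, 1.0, 2.0, 3.0, 12.0])

# The vertical position carries no meaning; jitter only prevents overplotting.
jitter = rng.uniform(-1, 1, size=change.size)

fig, ax = plt.subplots()
ax.scatter(change, jitter, s=share * 40, alpha=0.6)
ax.get_yaxis().set_visible(False)   # hide the meaningless axis
plt.show()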

***

Another revelation of the graph is the fragmentation of the market. There are many dots, especially medium gray dots. There are quite a few Chinese local manufacturers, most of which experienced moderate growth. Most of these brands are startups - this can be inferred because the size of the dot is about the same as the change in market share.

The only foreign manufacturer to make material gains in the Chinese market is Tesla.

The real story of the chart is BYD. I almost missed its dot on first impression, as it sits on the far right edge of the chart (in the original webpage, the right edge of the chart is aligned with the right edge of the text). BYD is the fastest growing brand in China, and its top brand. The pedestrian gray color chosen for Chinese brands probably didn't help. Besides, I had a little trouble figuring out whether the BYD bubble is larger than the largest bubble in the size legend, shown at the opposite end of the chart from BYD. (I measured, and indeed the BYD bubble is slightly larger.)

This dot chart (with variable dot sizes) is nice for highlighting individual brands. But it doesn't show aggregates. One of the callouts on the chart reads: "Chinese cars' share rose by 23%, with BYD at the forefront". These words are necessary because it's impossible to figure out that the total share gain by all Chinese brands is 23% from this chart form.

They present this information in the line chart that I included in the last post, repeated here:

Bloomberg_japancars_marketshares

The first chart shows that cumulatively, Chinese brands have increased their share of the Chinese market by 23 percent while Japanese brands have ceded about 9 percent of market share.

The individual-brand view offers other insights that can't be found in the aggregate line chart. We can see that, in addition to BYD, there are a few local brands with market shares similar to Tesla's.

***

It's tough to find a single chart that brings out insights at several levels of analysis, which is why we like to talk about a "visual story" which typically comprises a sequence of charts.

 


Small tweaks that make big differences

It's one of those days when a web search led me to an unfamiliar corner, and I found myself poring over a pile of column charts that look like this:

GO-and-KEGG-diagrams-A-Forty-nine-different-GO-term-annotations-of-the-parental-genes

This pair of charts appears to be canonical in a type of genetics analysis. I'll focus on the column chart up top.

The chart plots a variety of gene functions along the horizontal axis. These functions are classified into three broad categories, indicated using axis annotation.

What are some small tweaks that readers will enjoy?

***

First, use colors. Here is an example in which the designer uses color to indicate the function classes:

Fcvm-09-810257-g006-3-colors

The primary design difference between these two column charts is the use of three colors to indicate the three function classes. This little change makes it much easier to recognize the end of one class and the start of the next.

Color doesn't have to be limited to column areas. The following example extends the colors to the axis labels:

Fcell-09-755670-g004-coloredlabels

Again, just the smallest of changes, but it makes a big difference.

***

It bugs me a lot that the long axis labels are printed in a slanted way, forcing every serious reader to read with a slanted head.

Slanting it the other way doesn't help:

Fig7-swayright

Vertical labels are best read...

OR-43-05-1413-g06-vertical

These vertical labels are best read while doing side planks.

Side-Plank

***

I'm surprised that horizontal alignment is rather rare. Here's one example:

Fcell-09-651142-g004-horizontal
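Putting these tweaks together - one color per function class, axis labels colored to match, and horizontal bars so the labels read naturally - here is a minimal matplotlib sketch with made-up data:

import matplotlib.pyplot as plt

# Made-up terms grouped into three function classes.
terms = ["term A", "term B", "term C", "term D", "term E", "term F"]
counts = [40, 35, 28, 22, 18, 10]
classes = [0, 0, 1, 1, 2, 2]
palette = ["tab:blue", "tab:orange", "tab:green"]
colors = [palette[c] for c in classes]

fig, ax = plt.subplots()
ax.barh(terms, counts, color=colors)   # horizontal bars: no slanted heads needed
ax.invert_yaxis()                      # first term at the top

# Extend the class colors to the axis labels themselves.
for label, color in zip(ax.get_yticklabels(), colors):
    label.set_color(color)
plt.show()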