
Don't show everything

There are many examples where one should not show everything when visualizing data.

A long-time reader sent me this chart from the Economist, published around Thanksgiving last year:

Economist_musk

It's a scatter plot with each dot representing a single tweet by Elon Musk against a grid of years (on the horizontal axis) and time of day (on the vertical axis).

The easy messages to pick up include:

  • the increase in frequency of tweets over the years
  • especially, the jump in density after Musk bought Twitter in late 2022 (there is also a less obvious level up around 2018)
  • the almost continuous tweeting throughout 24 hours.

By contrast, it's hard if not impossible to learn the following:

  • how many tweets did he make on average or in total per year, per day, per hour?
  • the density of tweets for any single period of time (i.e., a reference for everything else)
  • the growth rate over time, especially the magnitude of the jumps

The paradox: a chart that is data-dense but information-poor.

***

The designer added gridlines and axis labels to help structure our reading. Specifically, we're cued to separate the 24 hours into four 6-hour chunks. We're also expected to divide the years into two groups (pre- and post- the Musk acquisition), and secondarily, into one-year intervals.

If we accept this analytical frame, then we can divide time into these boxes, compute summary statistics within each box, and present those values. I'm working on some concepts and will show them next time.
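As a rough sketch of that framing, the snippet below bins timestamps into boxes of year by 6-hour block and counts the tweets in each box. The function name `box_counts` and the sample timestamps are made up for illustration; they are not the Economist's data or method.

```python
from collections import Counter
from datetime import datetime

BLOCK_HOURS = 6  # the four 6-hour chunks suggested by the chart's gridlines


def box_counts(timestamps, block_hours=BLOCK_HOURS):
    """Count tweets per (year, starting hour of the time-of-day block)."""
    counts = Counter()
    for ts in timestamps:
        block_start = (ts.hour // block_hours) * block_hours
        counts[(ts.year, block_start)] += 1
    return counts


# Made-up timestamps, for illustration only.
sample = [
    datetime(2018, 5, 1, 2, 15),
    datetime(2018, 5, 1, 13, 40),
    datetime(2023, 1, 9, 3, 5),
    datetime(2023, 1, 9, 4, 50),
    datetime(2023, 6, 2, 23, 59),
]

counts = box_counts(sample)
for (year, block_start), n in sorted(counts.items()):
    block_end = block_start + BLOCK_HOURS
    print(f"{year}  {block_start:02d}:00-{block_end:02d}:00  {n} tweet(s)")
```

Each box now carries a single number, which is what a reader needs to compare densities across years or times of day.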

 


Ranks, labels, metrics, data and alignment

A long-time reader Chris V. (since 2012!) sent me to this WSJ article on airline ratings (link).

The key chart form is this:

Wsj_airlines_overallranks

It's a rhombus-shaped chart, really a bar chart rotated counter-clockwise by 45 degrees. Thus, all the text is at 45-degree angles. An airplane icon is imprinted on each bar.

There is also this cute interpretation of the white (non-data-ink) space as a symmetric reflection of the bars (with one missing element). On second thought, the decision to tilt the chart was probably made in service of this quasi-symmetry. If the data bars were horizontal, then the white space would have been sliced up into columns, which just doesn't hold the same appeal.

If we be Tuftian, all of these flourishes do not serve the data. But do they do much harm? This is a case that's harder to decide. The data consist of just a ranking of airlines. The message still comes across. The head must tilt, but the chart beguiles.

***

As the article progresses, the same chart form shows up again and again, with added layers of detail. I appreciate how the author has constructed the story. Subtly, the first chart teaches the readers how the graphic encodes the data, and fills in contextual information such as there being nine airlines in the ranking table.

In the second section, the same chart form is used, while the usage has evolved. There is now a pair of these rhombuses. Each rhombus shows the rankings of a single airline while each bar inside the rhombus shows the airline's ranking on a specific metric. Contrast this with the first chart, where each bar is an airline, and the ranking is the overall ranking on all metrics.

Wsj_airlines_deltasouthwestranks

You may notice that you've used a piece of knowledge picked up from the first chart - that on each of these metrics, each airline has been ranked against eight others. Without that knowledge, we don't know that being 4th is just better than the median. So, in a sense, this second section is dependent on the first chart.

There is a nice use of layering, which links up both charts. A dividing line is drawn between the first place (blue) and not being first (gray). This layering allows us to quickly see that Delta, the overall winner, came first in two of the seven metrics while Southwest, the second-place airline, came first in three of the seven (leaving two metrics for which neither of these airlines came first).

I'd be the first to admit that I have motion sickness. I wonder how many of you are starting to feel dizzy while you read the labels, heads tilted. Maybe you're trying, like me, to figure out the asterisks and daggers.

***

Ironically, but not surprisingly, the asterisks reveal a non-trivial matter. Asterisks direct readers to footnotes, which should be supplementary text that adds color to the main text without altering its core meaning. Nowadays, asterisks may hide information that changes how one interprets the main text, such as complications that muddy the main argument.

Here, the asterisks address a shortcoming of representing rankings using bars. By convention, a lower rank number is better, and most ranking schemes start counting from 1. If ranks are directly encoded in bars, then the best object is given the shortest bar. But that's not what we see on the chart. The bars actually encode the reverse ranking, so the longest bar represents the best rank (1st).
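One simple way to reverse a ranking is sketched below. The formula `n_items + 1 - rank` is an assumption for illustration; the WSJ's actual bar scaling is not stated in the article.

```python
def bar_length(rank, n_items):
    """Reverse a rank so that 1st place gets the longest bar.

    Assumed formula for illustration: length = n_items + 1 - rank.
    """
    if not 1 <= rank <= n_items:
        raise ValueError("rank must be between 1 and n_items")
    return n_items + 1 - rank


N_AIRLINES = 9  # nine airlines appear in the ranking table

lengths = {rank: bar_length(rank, N_AIRLINES) for rank in range(1, N_AIRLINES + 1)}
for rank, length in lengths.items():
    print(f"rank {rank}: " + "#" * length)
```

Under this assumption, the bar for 1st place is nine units long and the bar for 9th place is one unit long.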

That's level one of this complication. Level two is where these asterisks are at.

Notice that the second metric is called "Canceled flights". The asterisk quipped "fewest". The data collected is on the number of canceled flights but the performance metric for the ranking is really "fewest canceled flights". 

If we see a long bar labelled "1st" under "canceled flights", it causes a moment of pause. Is the airline ranked first because it had the most canceled flights? That would imply being first is worst for this category. It couldn't be that. So perhaps "1st" means having the fewest canceled flights but then it's just weird to show that using the longest bar. The designer correctly anticipates this moment of pause, and that's why the chart has those asterisks.

Unfortunately, six out of the seven metrics require asterisks. In almost every case, we have to think in reverse. "Extreme delays" really means "Least extreme delays"; "Mishandled baggage" really means "Less mishandled baggage"; etc. I'd spend some time renaming the metrics to fix this without resorting to footnotes. For example, saying "Baggage handling" instead of "Mishandled baggage" is sufficient.

***

The third section contains the greatest detail. Now, each chart prints the ranking of nine airlines for a particular metric.

Wsj_airlinerankings_bymetric

 

By now, the cuteness faded while the neck muscles paid. Those nice annotations, written horizontally, offered but a twee respite.


Patiently looking

Voronoi (aka Visual Economist) made this map about service times at emergency rooms around the U.S.

 

Voronoi_EmergencyRoomWaitTImes

This map shows why one shouldn’t just stick state-level data into a state-level map by default.

The data are median service times, defined as the duration of the visit from the moment a patient arrives to the moment they leave. For reasons to be explained below, I don’t like this metric. The data are in terms of hours and minutes, and encoded in the color scale.

As with any choropleth, the dominant features of this map are the shapes and sizes of various pieces but these don’t carry any data. The eastern seaboard contains many states that are small in area but dense in population, and always produces a messy, crowded smorgasbord of labels and guiding lines.

The color scale is progressive (continuous), making it even harder to gain an appreciation of the spatial pattern. For the sake of argument, imagine a truly continuous color scale tuned to the median service times in minutes. There would be as many shades as there are unique values on the map. For example, the state with a median time of 2 hr 12 min would receive a different shade than the one with 2 hr 11 min. Looking at the dataset, I found 43 unique values of median service time among the 52 states and territories. Thus, almost every state would wear its own shade, making it hard to answer such common questions as: which clusters of states have high/medium/low median service times?

(The underlying software can only print a finite number of shades, so in reality there aren’t any truly continuous scales. A continuous scale is just a discrete scale with many levels of shades. For this map, I’d group the states into at most five categories, requiring five shades.)
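A minimal sketch of such a grouping follows, assuming equal-count (quantile) classes; other schemes, such as equal-width bins or hand-picked thresholds, would be just as defensible. The function name `quantile_classes` and the minute values are hypothetical, not taken from the dataset.

```python
def quantile_classes(values, n_classes=5):
    """Assign each value a class from 0 (lowest) to n_classes - 1 (highest).

    Classes hold roughly equal numbers of values; tied values always
    share a class because the class depends only on the value itself.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Cut points: the smallest value belonging to each class above the first.
    cuts = [ordered[(k * n) // n_classes] for k in range(1, n_classes)]
    return [sum(v >= c for c in cuts) for v in values]


# Hypothetical median service times in minutes, for illustration only.
minutes = [240, 131, 155, 170, 132, 200, 140, 180, 150, 160]
classes = quantile_classes(minutes)

for value, cls in sorted(zip(minutes, classes)):
    print(f"{value} min -> shade {cls + 1} of 5")
```

With five shades instead of dozens, clusters of neighboring states that share a shade become visible at a glance.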

***

We’re now reaching the D corner of the Trifecta Checkup (link).

_trifectacheckup_image

I’d transform the data to relative values, such as an index against the national median or average. The colors then indicate how much higher or lower the state’s median service time is than the nation’s. With this transformed data, it makes more sense to use a bidirectional (diverging) color scale so that there are different colors for higher vs lower than average.
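The transformation can be sketched as below. Two assumptions are made for illustration: the reference defaults to the median of the state values (a stand-in, since the true national figure would come from the underlying visit data), and the index is scaled so that 100 means "equal to the reference". The state labels and minute values are made up.

```python
from statistics import median


def index_against(values_by_state, reference=None):
    """Express each state's value as an index where 100 equals the reference."""
    if reference is None:
        # Stand-in for the national figure: the median of the state values.
        reference = median(values_by_state.values())
    return {
        state: round(100 * minutes / reference)
        for state, minutes in values_by_state.items()
    }


def direction(index_value):
    """Side of a diverging color scale that the index falls on."""
    if index_value > 100:
        return "above"
    if index_value < 100:
        return "below"
    return "at"


# Hypothetical median service times in minutes, for illustration only.
toy = {"State A": 120, "State B": 150, "State C": 180}
indexed = index_against(toy)

for state, idx in indexed.items():
    print(f"{state}: index {idx} ({direction(idx)} the reference)")
```

The `direction` helper picks the hue of a diverging scale, while the distance of the index from 100 would set the intensity.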

Lastly, I’m not sure about the use of median service time, as opposed to average (mean) service time. I suspect that the distribution is heavily skewed toward longer values so that the median service time falls below the mean service time. If, however, the service time distribution is roughly symmetric around the median, then the mean and median service times will be very similar, and thus the metric selection doesn’t matter.

Imagine you're the healthcare provider and your bonus is based on managing median service times. You have an incentive to let a small number of patients wait an extraordinary amount of time, while serving a bunch of patients who require relatively simple procedures. If it's a mean service time, the values of the extreme outliers will be spread over all the patients while the median service time is affected by the number of such outliers but not their magnitudes.
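A small numerical illustration of this point, using made-up visit durations in minutes (not the published data):

```python
from statistics import mean, median

# Hypothetical visit durations in minutes, for illustration only.
quick_visits = [90] * 9
one_long_wait = quick_visits + [300]      # one patient waits 5 hours
one_extreme_wait = quick_visits + [3000]  # the same patient waits 50 hours
many_long_waits = [90] * 4 + [300] * 6    # most patients wait 5 hours

scenarios = [
    ("one long wait", one_long_wait),
    ("one extreme wait", one_extreme_wait),
    ("many long waits", many_long_waits),
]
for label, visits in scenarios:
    print(f"{label:18s} median={median(visits):6.1f}  mean={mean(visits):7.1f}")
```

Stretching the single outlier from 300 to 3,000 minutes leaves the median at 90 while the mean rises from 111 to 381. The median only moves once enough patients have long waits, as in the third scenario.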

When I pulled down the publicly available data (link), I found additional data fields. The emergency room visits are further broken into four categories (low, medium, high, very high), and a median is reported within each category. Thus, we have some idea of how extreme the top values can be.

The following dotplot shows this:

Junkcharts_redo_voronoi_emergencyrooms

A chart like this is still challenging to read since there are 52 territories, ordered by the value on a metric. If the analyst can identify the interesting questions, e.g. by breaking up the territories into regions, then a grouping can be applied to the above chart to aid comprehension.

 


Simple presentations

In the previous post, I looked at this chart that shows the distributions of four subgroups found in a dataset:

Davidcurran_originenglishwords

This chart takes quite some effort to decipher, as does another version I featured.

The key messages appear to be: (i) most English words are of Germanic origin, (ii) the most popular English words are even more skewed towards Germanic origin, (iii) words of French origin started showing up around rank 50, those of Latin origin around rank 250.

***

If we are making a graphic for presentation, we can simplify the visual clutter tremendously by - hmmm - a set of pie charts.

Junkcharts_redo_originenglishwords_pies

For those allergic to pies, here's a stacked column chart:

Junkcharts_redo_originenglishwords_columns

Both of these can be thought of as "samples" from the original chart, selected to highlight shifts in the relative proportions.

Davidcurran_originenglishwords_sampled

I also reversed the direction of the horizontal axis as I think the story is better told starting from the whole dataset and homing in on subsets.

 

P.S. [1/10/2025] A reader who has expertise in this subject also suggested a stacked column chart with reversed axis in a comment, so my recommendation here is confirmed.


Two challenging charts showing group distributions

Long-time reader Georgette A. found this chart from a LinkedIn post by David Curran:

Davidcurran_originenglishwords

She found it hard to understand. Me too.

It's one of those charts that require some time to digest. And when you've figured it out, you don't get the satisfaction of time well spent.

***

If I have to write a reading guide for this chart, I'd start from the right edge. The dataset consists of the top 2000 English words, ranked by popularity. The right edge of the chart says that roughly two-thirds of these 2000 words are of Germanic origin, followed by 20% French origin, 10% Latin origin, and 3% "others".

Now, look at the middle of the chart, where the 1000 gridline lies. The analyst did the same analysis but using just the top 1000 words, instead of the top 2000 words. Not surprisingly, Germanic words predominate. In fact, Germanic words account for an even higher percentage of the total, roughly three-quarters. French words are at 16% (relative to 20%), and Latin at 7% (compared to 10%).

The trend is this: the more we restrict the word list to fewer, more popular words, the more Germanic words dominate. Of the top 50 words, all but one are of Germanic origin. (You can't tell that directly from the chart but you can figure it out if you measure it and do some calculations.)

Said differently, there are some non-Germanic words in the English language but they tend not to be used particularly often.

As we move our eyes from left to right on this chart, we are analyzing more words but the newly added words are less popular than those included prior. The distribution of words by origin is cumulative.

The problem with this data visualization is that it doesn't "locate" where these non-Germanic words exist. It's focused on a cumulative metric so the reader has to figure out where the area has increased and where it has flat-lined. This task is quite challenging in an area chart.
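The difference between the cumulative metric and a "locating" view can be sketched with a toy word list. The eight-word list and both function names are hypothetical; the real dataset has 2000 ranked words.

```python
from collections import Counter


def share_in_top(origins_by_rank, x):
    """Share of each origin among the top-x words (the cumulative metric)."""
    top_words = origins_by_rank[:x]
    tally = Counter(top_words)
    return {origin: count / len(top_words) for origin, count in tally.items()}


def counts_by_band(origins_by_rank, band_size):
    """Count origins within non-overlapping rank bands, which locates them."""
    return [
        Counter(origins_by_rank[start:start + band_size])
        for start in range(0, len(origins_by_rank), band_size)
    ]


# Toy list of word origins ordered by popularity rank, for illustration only.
toy = ["Germanic"] * 4 + ["French", "Germanic", "Latin", "French"]

print(share_in_top(toy, 4))    # composition of the top 4
print(share_in_top(toy, 8))    # composition of the top 8
print(counts_by_band(toy, 4))  # where each origin actually sits
```

The cumulative shares only hint that non-Germanic words entered somewhere between ranks 5 and 8; the banded counts say so directly.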

***

The following chart showing the same information is more canonical in the scientific literature.

Junkcharts_redo_curran_originenglishwords

This chart also requires a reading guide for the uninitiated. (Therefore, I'm not saying it's better than the original.)

The chart shows how words of a specific origin accumulate over the top X most popular English words. Each line starts at 0% on the left and ends at 100% on the right.

Note that the "other" line hugs the zero level until X = 400, which means that there are no words of "other" origin in the top 400 list. We can see that words of "other" origin are mostly found in ranks 700-1000 and ranks 1700-2000, where the line is steepest. We can be even more precise: about 25% of these words are found in ranks 700-1000 while 45% are found in ranks 1700-2000.

In such a chart, the 45-degree line acts as a reference line. Any line that follows the 45-degree line indicates an even distribution: X% of the words of origin A are found in the top X% of the distribution. Origin A's words are not more or less popular than average anywhere in the distribution.

In this chart, no line coincides with the 45-degree line. The Germanic line is everywhere above the 45-degree line. This means that on the left side, the line is steeper than 45 degrees while on the right side, its slope is less than 45 degrees. In other words, Germanic words are biased towards the left side, i.e. they are more likely to be popular words.

For example, the top 400 words (20% of the word list) contain 27% of all the Germanic words.
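The quantity plotted on this chart, and its comparison to the 45-degree reference, can be sketched with a toy list. The eight-word list and the function names are hypothetical, chosen only to show the calculation.

```python
def cumulative_within_origin(origins_by_rank, origin, x):
    """Fraction of all words of `origin` that fall within the top x ranks."""
    total = origins_by_rank.count(origin)
    if total == 0:
        raise ValueError(f"no words of origin {origin!r} in the list")
    return origins_by_rank[:x].count(origin) / total


def reference_line(n_words, x):
    """Height of the 45-degree line at rank x: an even spread across ranks."""
    return x / n_words


# Toy list of word origins ordered by popularity rank, for illustration only.
toy = ["Germanic"] * 4 + ["French", "Germanic", "Latin", "French"]

x = 4
for origin in ("Germanic", "French"):
    observed = cumulative_within_origin(toy, origin, x)
    expected = reference_line(len(toy), x)
    side = "above" if observed > expected else "below"
    print(f"{origin}: {observed:.0%} of its words are in the top {x} "
          f"({side} the {expected:.0%} reference)")
```

In the toy list, 80% of the Germanic words sit in the top half of the ranks, so the Germanic line lies above the reference there, while the French line lies below it.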

I can't imagine this chart is easy for anyone who hasn't seen it before; but if you are a scientist or economist, you might find this one easier to digest than the original.