On the interpretability of log-scaled charts

A previous post featured the following chart showing stock returns over time:

Gelman_overnightreturns_tsla

Unbeknownst to readers, the chart plots one thing but labels it as something else.

The designer of the chart explains how to read the chart in a separate note, which I included in my previous post (link). It's a crucial piece of information. Before reading his explanation, I didn't realize the sleight of hand: he made a chart with one time series, then substituted the y-axis labels with another set of values.

As I explored this design choice further, I realized that it has been widely adopted in a common chart form, without fanfare. I'll get to it in due course.

***

Let's start our journey with as simple a chart as possible. Here is a line chart showing constant growth in the revenues of a small business:

Junkcharts_dollarchart_origvalues

For all the charts in this post, the horizontal axis depicts time (x = 0, 1, 2, ...). To simplify further, I describe discrete time steps although nothing changes if time is treated as continuous.

The vertical scale is in dollars, the original units. It's conventional to modify the scale to units of thousands of dollars, like this:

Junkcharts_dollarchart_thousands

No controversy arises if we treat these two charts as identical. Here I put them onto the same plot, using dual axes, emphasizing the one-to-one correspondence between the two scales.

Junkcharts_dollarchart_dualaxes

We can do the same thing for two time series that are linearly related. The following chart shows constant growth in temperature using both Celsius and Fahrenheit scales:

Junkcharts_tempchart_dualaxes
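A dual-axes chart like this can be sketched in matplotlib, where the second axis is a pure relabeling of the first (the data values here are made up for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

# Hypothetical constant-growth temperature series, in Celsius
t = [0, 1, 2, 3]
celsius = [10, 20, 30, 40]

def c2f(c):
    return c * 9 / 5 + 32

def f2c(f):
    return (f - 32) * 5 / 9

fig, ax = plt.subplots()
ax.plot(t, celsius)
ax.set_ylabel("Temperature (C)")
# secondary_yaxis draws a second scale that is a one-to-one
# relabeling of the first -- the line itself is unchanged
secax = ax.secondary_yaxis("right", functions=(c2f, f2c))
secax.set_ylabel("Temperature (F)")
fig.savefig("tempchart_dualaxes.png")
```

Because the two scales are linearly related, either axis describes the same line; that is the one-to-one correspondence the dual-axes chart emphasizes.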

Here is the chart displaying only the Fahrenheit axis:

Junkcharts_tempchart_fahrenheit

This chart admits two interpretations: (A) it is a chart constructed using F values directly and (B) it is a chart created using C values, after which the axis labels were replaced by F values. Interpretation B implements the sleight of hand of the log-returns plot. The issue I'm wrestling with in this post is the utility of interpretation B.

Before we move to our next stop, let's stipulate that if we are exposed to that Fahrenheit-scaled chart, either interpretation can apply; readers can't tell them apart.

***

Next, we look at the following line chart:

Junkcharts_trendchart_y

Notice the vertical axis uses a log10 scale. We know it's a log scale because the equally-spaced tickmarks represent different jumps in value: the first jump is from 1 to 10, the next jump is from 10, not to 20, but to 100.

Just like before, I make a dual-axes version of the chart, putting the log Y values on the left axis, and the original Y values on the right axis.

Junkcharts_trendchart_dualaxes

By convention, we often print the original values as the axis labels of a log chart. Can you recognize that sleight of hand? We make the chart using the log values, after which we replace the log value labels with the original value labels. We adopt this graphical trick because humans don't think in log units; thus, the log value labels are less "interpretable".
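To make the trick concrete, here is a sketch in matplotlib (values are made up): we plot the log values, then swap in the original-value labels. matplotlib's built-in log scale performs the same relabeling automatically.

```python
import math
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

t = [1, 2, 3, 4]
y = [10, 100, 1000, 10000]          # original values
log_y = [math.log10(v) for v in y]  # what is actually plotted

fig, ax = plt.subplots()
ax.plot(t, log_y)                   # a straight line, in log units
# the sleight of hand: keep the line, swap in the original-value labels
ax.set_yticks(log_y)
ax.set_yticklabels([str(v) for v in y])
fig.savefig("trendchart_relabeled.png")

# matplotlib's log scale does the same relabeling for us:
fig2, ax2 = plt.subplots()
ax2.plot(t, y)
ax2.set_yscale("log")
fig2.savefig("trendchart_logscale.png")
```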

As with the temperature chart, we will attempt to interpret the chart two ways. I've already covered interpretation B. For interpretation A, we regard the line chart as a straightforward plot of the values shown on the right axis (i.e., the original values). Alas, this viewpoint fails for the log chart.

If the original data are plotted directly, the chart should look like this:

Junkcharts_trendchart_y_origvalues

It's not a straight line but a curve.

What have I just shown? That, after using the sleight of hand, we cannot interpret the chart as if it were directly plotting the data expressed in the original scale.

To nail down this idea, we ask a basic question of any chart showing trendlines. What's the rate of change of Y?

Using the transformed log scale (left axis), we find that the rate of change is 1 unit per unit time. Using the original scale, the rate of change from t=1 to t=2 is (100-10)/1 = 90 units per unit time; from t=2 to t=3, it is (1000-100)/1 = 900 units per unit time. Even though the rate of change varies by time step, the log chart using original value labels sends the misleading picture that the rate of change is constant over time (thus a straight line). The decision to substitute the log value labels backfires!
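The two rates of change can be computed side by side (same illustrative values as above):

```python
import math

t = [1, 2, 3]
y = [10, 100, 1000]

# slope read off the left (log) axis: constant, 1 log-unit per time step
log_rates = [math.log10(y[i + 1]) - math.log10(y[i]) for i in range(len(y) - 1)]

# slope in the original units (right axis): grows tenfold each step
raw_rates = [y[i + 1] - y[i] for i in range(len(y) - 1)]

print(log_rates)  # approximately [1.0, 1.0] -- hence the straight line
print(raw_rates)  # [90, 900] -- anything but constant
```

The constant log-rates explain why the line looks straight, while the original-unit rates show what the straight line hides.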

This is one reason why I use log charts sparingly. (I do like them a lot for exploratory analyses, but I avoid using them as presentation graphics.) This issue of interpretation is why I dislike the sleight of hand used to produce those log stock returns charts, even if the designer offers a note of explanation.

Do we gain or lose "interpretability" when we substitute those axis labels?

***

Let's re-examine the dual-axes temperature chart, building on what we just learned.

Junkcharts_tempchart_dualaxes

The above chart suggests that whichever scale (axis) is chosen, we get the same line, with the same steepness. Thus, the rate of change is the same regardless of scale. This turns out to be an illusion.

Using the left axis, the slope of the line is 10 degrees Celsius per unit time. Using the right axis, the slope is 18 degrees Fahrenheit per unit time. 18 F is different from 10 C, thus, the slopes are not really the same! The rate of change of the temperature is given algebraically by the slope, and visually by the steepness of the line. Since two different slopes result in the same line steepness, the visualization conveys a lie.

The situation here is a bit better than that in the log chart. Here, in either scale, the rate of change is constant over time. Differentiating the temperature conversion formula, we find that the slope of the Fahrenheit line is always 9/5 times the slope of the Celsius line. So a rate of 10 Celsius per unit time corresponds to 18 Fahrenheit per unit time.

What if the chart is presented with only the Fahrenheit axis labels although it is built using Celsius data? Since readers only see the F labels, the observed slope is in Fahrenheit units. Meanwhile, the chart creator uses Celsius units. This discrepancy is harmless for the temperature chart but it is egregious for the log chart. The underlying reason is the nonlinearity of the log transform - the slope of log Y vs time is not proportional to the slope of Y vs time; in fact, it depends on the value of Y.  

***

The log chart is a sacred cow of scientists, a symbol of our sophistication. But is it as potent as we think? In particular, when we put original data values on the log chart, are we making it more interpretable, or less?

 

P.S. I want to tie this discussion back to my Trifecta Checkup framework. The design decision to substitute those axis labels is an example of an act that moves the visual (V) away from the data (D). If the log units were printed, the visual would make sense; when the original units are dropped in, the visual no longer conveys features of the data - the reader must ignore what the eyes are seeing, and focus instead on the brain's perspective.


The reckless practice of eyeballing trend lines

MSN showed this chart claiming a huge increase in the number of British children who believe they are born the wrong gender.

Msn_genderdysphoria

The graph has a number of defects, starting with drawing a red line that clearly isn’t the trend in the data.

To find the trend line, we have to draw a line that is closest to the top of every column. The true trend line is closer to the blue line drawn below:

Junkcharts_redo_msngenderdysphoria_1

The red line moves up one unit roughly every three years while the blue line does so every four years.

Notice the dramatic jump in the last column of the chart. The observed trend is not a straight line, and therefore it is not appropriate to force a straight-line model. Instead, it makes more sense to divide the time line into three periods, with different rates of change.

Junkcharts_redo_msngenderdysphoria_2

Most of the growth during this 10-year period occurred in the last year. One should check the data, and also check whether any accounting criterion changed that might explain this large, unexpected jump.

***

The other curiosity about this chart is the scale of the vertical axis. Nowhere on the chart does it say which metric of gender dysphoria it is depicting. The title suggests they are counting the number of diagnoses but the axis labels that range from one to five point to some other metric.

From the article, we learn that the annual number of gender dysphoria diagnoses was about 10,000 in 2021, and that this is encoded as 4.5 in the column chart. The sub-header of the chart indicates that the unit is number per 1,000 people. Ten thousand diagnoses, divided by the under-18 population, multiplied by 1,000, equals 4.5. This implies there were roughly 2.2 million people under 18 in the U.K. in 2021.
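The back-of-envelope arithmetic can be checked directly (numbers taken from the article as quoted above):

```python
diagnoses = 10_000     # annual diagnoses in 2021, per the article
rate_per_1000 = 4.5    # the value encoded in the column chart

# rate = diagnoses / population * 1000  =>  population = diagnoses / rate * 1000
implied_population = diagnoses / rate_per_1000 * 1000
print(f"{implied_population:,.0f}")  # roughly 2.2 million under-18s
```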

But according to these official statistics (link), there were about 13 million people aged 0-18 in just England and Wales in mid-2022, which is nowhere near that range. From a dataviz perspective, the designer needs to explain what the values on the vertical axis represent. Right now, I have no idea what they mean.

***

Using the Trifecta Checkup framework, we say that the question addressed by the chart is clear but there are problems relating to data encoding as well as the trend-line visual.

_trifectacheckup_image


Don't show everything

There are many examples where one should not show everything when visualizing data.

A long-time reader sent me this chart from the Economist, published around Thanksgiving last year:

Economist_musk

It's a scatter plot with each dot representing a single tweet by Elon Musk against a grid of years (on the horizontal axis) and time of day (on the vertical axis).

The easy messages to pick up include:

  • the increase in frequency of tweets over the years
  • especially, the jump in density after Musk bought Twitter in late 2022 (there is also a less obvious level up around 2018)
  • the almost continuous tweeting throughout 24 hours.

By contrast, it's hard if not impossible to learn the following:

  • how many tweets did he make on average or in total per year, per day, per hour?
  • the density of tweets for any single period of time (i.e., a reference for everything else)
  • the growth rate over time, especially the magnitude of the jumps

The paradox: a chart that is data-dense but information-poor.

***

The designer added gridlines and axis labels to help structure our reading. Specifically, we're cued to separate the 24 hours into four 6-hour chunks. We're also expected to divide the years into two groups (pre- and post- the Musk acquisition), and secondarily, into one-year intervals.

If we accept this analytical frame, then we can divide time into these boxes, and then compute summary statistics within each box, and present those values.  I'm working on some concepts, will show them next time.

 


Ranks, labels, metrics, data and alignment

A long-time reader Chris V. (since 2012!) sent me to this WSJ article on airline ratings (link).

The key chart form is this:

Wsj_airlines_overallranks

It's a rhombus-shaped chart, really a bar chart rotated counter-clockwise by 45 degrees. Thus, all the text is at 45-degree angles. An airplane icon is imprinted on each bar.

There is also this cute interpretation of the white (non-data-ink) space as a symmetric reflection of the bars (with one missing element). On second thought, the decision to tilt the chart was probably made in service of this quasi-symmetry. If the data bars were horizontal, then the white space would have been sliced up into columns, which just doesn't hold the same appeal.

If we be Tuftian, all of these flourishes do not serve the data. But do they do much harm? This is a case that's harder to decide. The data consist of just a ranking of airlines. The message still comes across. The head must tilt, but the chart beguiles.

***

As the article progresses, the same chart form shows up again and again, with added layers of detail. I appreciate how the author has constructed the story. Subtly, the first chart teaches the readers how the graphic encodes the data, and fills in contextual information such as there being nine airlines in the ranking table.

In the second section, the same chart form is used, while the usage has evolved. There are now a pair of these rhombuses. Each rhombus shows the rankings of a single airline while each bar inside the rhombus shows the airline's ranking on a specific metric. Contrast this with the first chart, where each bar is an airline, and the ranking is the overall ranking on all metrics.

Wsj_airlines_deltasouthwestranks

You may notice that you've used a piece of knowledge picked up from the first chart - that on each of these metrics, each airline has been ranked against eight others. Without that knowledge, we don't know that being 4th is just better than the median. So, in a sense, this second section is dependent on the first chart.

There is a nice use of layering, which links up both charts. A dividing line is drawn between the first place (blue) and not being first (gray). This layering allows us to quickly see that Delta, the overall winner, came first in two of the seven metrics while Southwest, the second-place airline, came first in three of the seven (leaving two metrics for which neither of these airlines came first).

I'd be the first to admit that I have motion sickness. I wonder how many of you are starting to feel dizzy while you read the labels, heads tilted. Maybe you're trying, like me, to figure out the asterisks and daggers.

***

Ironically, but not surprisingly, the asterisks reveal a non-trivial matter. Asterisks direct readers to footnotes, which should be supplementary text that adds color to the main text without altering its core meaning. Nowadays, asterisks may hide information that changes how one interprets the main text, such as complications that muddy the main argument.

Here, the asterisks address a shortcoming of representing rankings with bars. By convention, a lower rank number indicates better performance, and most ranking schemes start counting from 1. If ranks are directly encoded in bar lengths, then the best airline gets the shortest bar. But that's not what we see on the chart. The bars actually encode the reverse ranking, so the longest bar represents the best (numerically lowest) rank.
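One way to express that reversal (a hypothetical encoding; the WSJ's actual computation isn't documented):

```python
n_airlines = 9  # number of airlines in the ranking table

def bar_length(rank):
    # reverse the rank so that rank 1 (best) gets the longest bar
    return n_airlines + 1 - rank

print(bar_length(1))  # 9 -- the longest bar
print(bar_length(9))  # 1 -- the shortest bar
```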

That's level one of this complication. Level two is where these asterisks are at.

Notice that the second metric is called "Canceled flights". The asterisk quips "fewest". The data collected are counts of canceled flights, but the performance metric behind the ranking is really "fewest canceled flights".

If we see a long bar labelled "1st" under "canceled flights", it causes a moment of pause. Is the airline ranked first because it had the most canceled flights? That would imply being first is worst for this category. It couldn't be that. So perhaps "1st" means having the fewest canceled flights but then it's just weird to show that using the longest bar. The designer correctly anticipates this moment of pause, and that's why the chart has those asterisks.

Unfortunately, six out of the seven metrics require asterisks. In almost every case, we have to think in reverse. "Extreme delays" really means "least extreme delays"; "Mishandled baggage" really means "least mishandled baggage"; etc. I'd spend some time renaming the metrics to fix this while avoiding footnotes. For example, saying "Baggage handling" instead of "Mishandled baggage" is sufficient.

***

The third section contains the greatest details. Now, each chart prints the ranking of nine airlines for a particular metric.

Wsj_airlinerankings_bymetric

 

By now, the cuteness faded while the neck muscles paid. Those nice annotations, written horizontally, offered but a twee respite.

Five-value summaries of distributions

BG commented on my previous post, describing her frustration with the “stacked range chart”:

A stacked graph visualizes cubes stacked one on top of the other. So you can't use it for negative numbers, because there's no such thing [as] "negative data". In graphs, a "minus" sign visualizes the opposite direction of one series from another. Doing average plus average plus average plus average doesn't seem logical at all.

***

I have already planned a second post to discuss the problems of using a stacked column chart to show markers of a numeric distribution.

I tried to replicate how the Youtuber generated his “stacked range chart” by appropriating Excel’s stacked column chart, but failed. I think there are some missing steps not mentioned in the video. At around 3:33 of the video, he shows a “hack” involving adding 100 degrees (any large enough value) to all values (already converted to ranges). Then, the next screen displays the resulting chart. Here is the dataset on the left and the chart on the right.

Minutephysics_londontemperature_datachart

Afterwards, he replaces the axis labels with new labels, effectively shifting the axis. But something is missing from the narrative. Since he’s using a stacked column chart, the values in the table are encoded in the heights of the respective blocks. The total stacked heights of each column should be in the hundreds since he has added 100 to each cell. But that’s not what the chart shows.

***

In the rest of the post, I’ll skip over how to make such a chart in Excel, and talk about the consequences of inserting “range” values into the heights of the blocks of a stacked column chart.

Let’s focus on London, Ontario; the five temperature values, corresponding to various average temperatures, are -3, 5, 9, 14, 24. Just throwing those numbers into a stacked column chart in Excel results in the following useless chart:

Stackedcolumnchart_londonontario

The temperature averages are cumulatively summed, which makes no sense, as reader BG noted. [My daily temperature data differ somewhat from those in the Youtube video. My source is here.]
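The problem, and the workaround behind the "stacked range chart", can be sketched in a few lines. This does not reproduce the video's Excel steps; it's a minimal sketch of the arithmetic:

```python
from itertools import accumulate

values = [-3, 5, 9, 14, 24]  # London, Ontario temperature markers (deg C)

# What a stacked column chart does with these values:
# cumulative sums, which are meaningless for this data
stack_tops = list(accumulate(values))
print(stack_tops)  # [-3, 2, 11, 25, 49]

# The idea behind the "stacked range chart": stack the *gaps* between
# consecutive values instead, so each block edge lands on an original value
gaps = [values[i + 1] - values[i] for i in range(len(values) - 1)]
edges = list(accumulate(gaps, initial=values[0]))
print(edges)  # recovers [-3, 5, 9, 14, 24]
```

Stacking the gaps is exactly why the interiors of the blocks carry no meaning; only the edges do.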

We should ignore the interiors of the blocks, and instead interpret the edges of these blocks. There are five edges corresponding to the five data values. As in:

Junkcharts_redo_londonontariotemperatures_dotplot

The average temperature in London, Ontario (during Spring 2023-Winter 2024) is 9 C. This overall average hides seasonal as well as diurnal variations in temperature.

If we want to acknowledge that night-time temperatures are lower than day-time temperatures, we draw attention to the two values bracketing 9 C, i.e. 5 C and 14 C. The average daytime (max) temperature is 14 C while the average night-time (min) temperature is 5 C. Furthermore, Ontario experiences seasons, so that the average daytime temperature of 14 C is subject to seasonal variability; in the summer, it goes up to 24 C. In the winter, the average night-time temperature goes down to -3 C, compared to 5 C across all seasons. [For those paying closer attention, daytime/max and night-time/min form congruous pairs because the max temperature occurs during daytime while the min temperature occurs during night-time. Thus, the average of maximum temperatures is the same as the average of daytime maximum temperatures.]

The above dotplot illustrates this dataset adequately. The Youtuber explained why he didn’t like it – I couldn’t quite make sense of what he said. It’s possible he thinks the gaps between those averages are more meaningful than the averages themselves, and therefore he prefers a chart form that draws our attention to the ranges, rather than the values.

***

Our basic model of temperature can be thought of as: temperature on a given day = overall average + adjustment for seasonality + adjustment for diurnality.

Take the top three values 9, 14, 24 from the above list. Starting at the overall average of 9 C, the analyst gets to 14 by homing in on max daily temperatures, and to 24 by further restricting the analysis to summer months (which have the highest temperatures). The second gap is 10 C, twice as large as the first gap of 5 C. Thus, the seasonal fluctuations have larger magnitude than the daily fluctuations. Said differently, the effect of season on temperature is bigger than that of hour of day.

In interpreting the “ranges” or gaps between averages, narrow ranges suggest low variability while wider ranges suggest higher variability.

Here's a set of boxplots for the same data:

Junkcharts_redo_londonontariotemperatures

The boxplot "edges" also demarcate five values; they are not the same five values as defined by the Youtuber but both sets of five values describe the underlying distribution of temperatures.

 

P.S. For a different example of something similar, see this old post.


What is this "stacked range chart"?

Long-time reader Aleksander B. sent me to this video (link), in which a Youtuber ranted that most spreadsheet programs do not make his favorite chart. This one:

Two questions immediately come to mind: a) what kind of chart is this? and b) is it useful?

Evidently, the point of the above chart is to tell readers there are (at least) three places called “London”, only one of which features red double-decker buses. He calls this a “stacked range chart”. This example has three stacked columns, one for each place called London.

What can we learn from this chart? The range of temperatures is narrowest in London, England while it is broadest in London, Ontario (Canada). The highest temperature is in London, Kentucky (USA) while the lowest is in London, Ontario.

But what kind of “range” are we talking about? Do the top and bottom of each stacked column indicate the maximum and minimum temperatures as we’ve interpreted them to be? In theory, yes, but in this example, not really.

Let’s take one step back, and think about the data. Elsewhere in the video, another version of this chart contains a legend giving us hints about the data. (It's the chart on the right of the screenshot.)

Each column contains four values: the average maximum and minimum temperatures in each place, the average maximum temperature in summer, and the average minimum temperature in winter. These metrics are mouthfuls of words, because the analyst has to describe what choices were made while aggregating the raw data.

The raw data comprise daily measurements of temperatures at each location. (To make things even more complex, there are likely multiple measurement stations in each town, and thus, the daily temperatures themselves may already be averages; or else, the analyst has picked a representative station for each town.) From this single sequence of daily data, we extract two subsequences: the maximum daily, and the minimum daily. This transformation acknowledges that temperatures fluctuate, sometimes massively, over the course of each day.

Each such subsequence is aggregated to four representative numbers. The first pair of max, min is just the averages of the respective subsequences. The remaining two numbers require even more explanation. The "summer average maximum temperature" is the average of the max subsequence after filtering it down to the "summer" months. Thus, it's a trimmed average of the max subsequence, or the average of the summer subsequence of the max subsequence. Since summer temperatures are the highest of the four seasons, this number approximates the maximum of the max subsequence, but it's not the maximum daily maximum since it's still an average. Similarly, the "winter average minimum temperature" is another trimmed average, computed over the winter months, which is related to, but not exactly, the minimum daily minimum.

Thus, the full range of each column is the difference between the trimmed summer average and the trimmed winter average. I assume weather scientists use this metric instead of the full range of max to min temperature because it’s less affected by outlier values.
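The aggregation pipeline just described can be sketched on made-up daily data. The records, the month numbers, and the season definitions (summer = Jun-Aug, winter = Dec-Feb) are all assumptions for illustration:

```python
# Synthetic daily records: (month, daily max in C, daily min in C)
days = [
    (1, -2, -10), (1, -1, -9),   # winter days
    (4, 12, 3),   (4, 14, 5),    # spring days
    (7, 26, 15),  (7, 28, 16),   # summer days
    (10, 13, 4),  (10, 11, 2),   # fall days
]

def mean(xs):
    return sum(xs) / len(xs)

avg_max = mean([mx for _, mx, _ in days])  # average daily maximum
avg_min = mean([mn for _, _, mn in days])  # average daily minimum
# trimmed averages: filter to a season first, then average
summer_avg_max = mean([mx for m, mx, _ in days if m in (6, 7, 8)])
winter_avg_min = mean([mn for m, _, mn in days if m in (12, 1, 2)])

print(avg_max, avg_min, summer_avg_max, winter_avg_min)
```

Note that the summer average max (an average) sits below the single hottest daily max, which is the point made above about trimmed averages versus true extremes.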

***

Stepping out of the complexity, I’ll say this: what the “stacked range chart” depicts are selected values along the distribution of a single numeric data series. In this sense, this chart is a type of “boxplot”.

Here is a random one I grabbed from a search engine.

Analytica_tukeyboxplot

A boxplot, per its inventor Tukey, shows a five-number summary of a distribution: the median, the 25th and 75th percentiles, and two "whisker values". Effectively, the boxplot shows five percentile values. The two whisker values are also percentiles, but not fixed percentiles like the 25th, 50th, and 75th. The placement of the whiskers is determined automatically by a formula that sets the threshold for outliers, which in turn depends on the shape of the data distribution. Anything contained within the whiskers is regarded as a "normal" value of the distribution, not an outlier. Any value larger than the upper whisker value, or lower than the lower whisker value, is an outlier. (Outliers are shown individually as dots above or below the whiskers - I see this as an optional feature because it doesn't make sense to show them individually for large datasets with lots of outliers.)
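Tukey's whisker placement can be sketched as follows, using the common 1.5 × IQR convention (quantile methods vary across software, so the exact box edges may differ; the data are made up):

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # made-up data with one high outlier

q1, _, q3 = statistics.quantiles(data, n=4)  # 25th and 75th percentiles
iqr = q3 - q1

# Tukey's convention: whiskers reach the most extreme data points
# still within 1.5 * IQR of the box edges; anything beyond is an outlier
lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
lo_whisker = min(x for x in data if x >= lo_fence)
hi_whisker = max(x for x in data if x <= hi_fence)
outliers = [x for x in data if x < lo_fence or x > hi_fence]

print(lo_whisker, hi_whisker, outliers)  # 1 9 [30]
```

Because the fences depend on the IQR, the whiskers land on data-dependent percentiles, which is why they are not fixed percentiles like the box edges.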

The stacked range chart of temperatures picks off different waypoints along the distribution but in spirit, it is a boxplot.

***

This discussion leads me to the answer to our second question: is the "stacked range chart" useful?  The boxplot is indeed useful. It does a good job describing the basic shape of any distribution.

I make variations of the boxplot all the time, with different percentiles. One variation commonly seen out there replaces the whisker values with the maximum and minimum values. Thus all the data live within the whiskers. This wasn’t what Tukey originally intended but the max-min version can be appropriate in some situations.

Most statistical software makes the boxplot. Excel is the one big exception. It has always been a mystery to me why the Excel developers are so hostile to the boxplot.

 

P.S. Here is the official manual for making a box plot in Excel. I wonder if they are the leading promoter of the max-min boxplot that strays from Tukey's original. It is possible to make the original whiskers but I suppose they don't want to explain it, and it's much easier to have people compute the maximum and minimum values in the dataset.

The max-min boxplot is misleading if the dataset contains true outliers. If the maximum value is really far from the 75th percentile, then most of the data between the 75th and 100th percentile could be sitting just above the top of the box.

 

P.S. [1/9/2025] See the comments below. Steve made me realize that the color legend of the London chart actually has five labels, the last one is white which blends into the white background. Note that, in the next post in this series, I found that I could not replicate the guy's process to produce the stacked column chart in Excel so I went in a different direction.


the wtf moment

You're reading some article that contains a standard chart. You're busy looking for the author's message on the chart. And then, the wtf moment strikes.

It's the moment when you discover that the chart designer has done something unexpected, something that changes how you should read the chart. It's when you learn that time is running right to left, for example. It's when you realize that negative numbers are displayed up top. It's when you notice that the columns are ordered by descending y-value despite time being on the x-axis.

Tell me about your best wtf moments!

***

The latest case of the wtf moment occurred to me when I was reading Rajiv Sethi's blog post on his theory that Kennedy voters crowded out Cheney voters in the 2024 Presidential election (link). Was the strategy to cosy up to Cheney and push out Kennedy wise?

In the post, Rajiv has included this chart from Pew:

Pew_science_confidence

The chart is actually about the public's confidence in scientists. Rajiv summarizes the message as: 'Public confidence in scientists has fallen sharply since the early days of the pandemic, especially among Republicans. There has also been a shift among Democrats, but of a slightly different kind—the proportion with “a great deal” of trust in scientists to act in our best interests rose during the first few months of the pandemic but has since fallen back.'

Pew produced a stacked column chart, with three levels for each demographic segment and month of the survey. The question about confidence in scientists admits three answers: a great deal, a fair amount, and not too much/None at all. [It's also possible that they offered 4 responses, with the bottom two collapsed as one level in the visual display.]

As I scanned around the chart, trying to understand the data, I suddenly realized that the three responses were not listed in the expected order. The top (light blue) section is the middling response of "a fair amount", while the middle (dark blue) section is the "a great deal" answer.

wtf?

***

Looking more closely, this stacked column chart has bells and whistles, indicating that the person who made it expended quite a bit of effort. Whether it's worthwhile effort is for us readers to decide.

By placing "a great deal" right above the horizon, the designer made it easier to see the trend in the proportion responding with "a great deal". It's also easy to read the trend of those picking the "negative" response because of how the columns are anchored. In effect, the designer is expressing the opinion that the middle group (which is also the most popular answer) is just background, and readers should not pay much attention to it.

The designer expects readers to care about one other trend, that of the "top 2 box" proportion. This is why sitting atop the columns are the data labels called "NET" which is the sum of those responding "a great deal" or "a fair amount".

***

For me, it's interesting to know whether the prior believers in science who lost faith went down one notch or two. Looking at the Republicans, the proportion saying "a great deal" went down by roughly 10 percentage points while the proportion saying "Not too much/None at all" went up by about 13 percentage points. Thus, the shift in the middle segment wasn't enough to explain all of the jump in negative sentiment; a good portion went from believer to skeptic during the pandemic.

As for Democrats, the proportion of believers also dropped by about 10 percentage points while the proportion saying "a fair amount" went up by almost 10 percentage points, accounting for most of the shift. The proportion of skeptics increased by about 2 percentage points.

So, for Democrats, I'm imagining a gentle slide in confidence that applies to the whole distribution while for Republicans, if someone loses confidence, it's likely straight to the bottom.

If I'm interested in the trends of all three responses, it's more effective to show the data in a panel like this:

Junkcharts_redo_pew_scientists

***

Remember to leave a comment when you hit your wtf moment next time!

 


Election coverage prompts good graphics

The election broadcasts in the U.S. are full-day affairs, and they make a great showcase for interactive graphics.

The election setting is optimal as it demands clear graphics that are instantly digestible. Anything else would have left viewers confused or frustrated.

The analytical concepts conveyed by the talking heads during these broadcasts are quite sophisticated, and the hosts did a wonderful job presenting them.

***

One such concept is the value of comparing statistics against a benchmark (or, even multiple benchmarks). This analytics tactic comes in handy in the 2024 election especially, because both leading candidates are in some sense incumbents. Kamala was part of the Biden ticket in 2020, while Trump competed in both 2016 and 2020 elections.

Msnbc_2024_ga_douglas

In the above screenshot, taken around 11 pm on election night, the MSNBC host (who looks like Steve K.) was searching for Kamala votes because it appeared that she was losing the state of Georgia. The question of the moment: were there enough votes left for her to close the gap?

In the graphic (first numeric column), we were seeing Kamala winning 65% of the votes, against Trump's 34%, in Douglas county in Georgia. At first sight, one would conclude that Kamala did spectacularly well here.

But, is 65% good enough? One can't answer this question without knowing past results. How did Biden-Harris do in the 2020 election when they won the presidency?

The host touched the interactive screen to reveal the second column of numbers, which allows viewers to directly compare the results. At the time of the screenshot, with 94% of the votes counted, Kamala was performing better in this county than the Biden-Harris ticket did in 2020 (65% vs 62%). This should help her narrow the gap.

If the ticket had also won 65% of the Douglas county votes in 2020, then we should not expect the vote margin to shrink after counting the remaining 6% of votes. This is why the benchmark from 2020 is crucial. (Of course, there is still the possibility that the remaining votes were severely biased in Kamala's favor, but that would not be enough, as I'll explain further below.)

All stations used this benchmark; some did not show the two columns side by side, making it harder to do the comparison.

Interesting side note: Douglas county has been rapidly shifting blue in the last two decades. The proportion of whites in the county dropped from 76% to 35% since 2000 (link).

***

Though Douglas county was encouraging for Kamala supporters, the vote gap in the state of Georgia at the time was over 130,000 in favor of Trump. The 6% in Douglas represented only about 4,500 votes (= 70,000*0.06/0.94). Even if she won all of them (extremely unlikely), it would be far from enough.
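As a quick check of that calculation (the 70,000 is the approximate number of votes already counted in Douglas, representing 94% of the total):

```python
# Douglas county, GA: roughly 70,000 votes counted, representing 94% of the total
counted = 70_000
pct_counted = 0.94

# Remaining votes = total minus counted, where total = counted / pct_counted
remaining = counted / pct_counted - counted
print(round(remaining))  # about 4,500
```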

So, the host flipped to Fulton county, the most populous county in Georgia, and also a Democratic stronghold. This is where the battle should be decided.

Msnbc_2024_ga_fulton

Using the same format (an interactive version of a small-multiples arrangement), the host looked at the situation in Fulton. The encouraging sign was that 22% of the votes here had not yet been counted. Moreover, she captured 73% of those votes that had been tallied, 8 percentage points better than her performance in Douglas. So, we knew that many more votes were coming in from Fulton, with the vast majority being Democratic.

But that wasn't the full story. We have to compare these statistics to our 2020 benchmark. This comparison revealed that she faced a tough road ahead. That's because Biden-Harris also won 73% of the Fulton votes in 2020. She might not earn additional votes here that could be used to close the state-wide gap.

If the 73% margin held to the end of the count, she would win 90,000 additional votes in Fulton but Trump would win 33,000, so that the state-wide gap should narrow by 57,000 votes. Let's round that up, and say Fulton halved Trump's lead in Georgia. But where else could she claw back the other half?
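A sketch of that arithmetic, assuming roughly 123,000 Fulton ballots remained (a hypothetical figure implied by the 90,000 votes at a 73% share; it did not appear on screen):

```python
# Fulton county, GA: assume ~123,000 ballots remained (inferred, not shown on air)
remaining = 123_000
harris_share = 0.73

harris_votes = remaining * harris_share        # about 90,000
trump_votes = remaining * (1 - harris_share)   # about 33,000
gap_narrows = harris_votes - trump_votes       # about 57,000
print(round(harris_votes), round(trump_votes), round(gap_narrows))
```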

***

From this point, the analytics can follow one of two paths, which should lead to the same conclusion. The first path runs down the list of Georgia counties. The second path goes up a level to a state-wide analysis, similar to what was done in my post on the book blog (link).

Cnn_2024_ga

Around this time, Georgia had counted 4.8 million votes, with another 12% outstanding. So, about 650,000 votes had not been assigned to any candidate. The margin was about 135,000 in Trump's favor, which amounted to 20% of the outstanding votes. In other words, she had to beat Trump by 20 points among the remaining votes, roughly a 60/40 split, when she had claimed only 48% of those already counted. (If she got the same 48% share of the outstanding votes, she would lose the state by the same percentage margin, which translates to an even larger absolute vote margin.)
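These two statewide figures can be reproduced from the counted total (assuming the 4.8 million counted votes represent 88% of the eventual total):

```python
# Georgia statewide: 4.8 million votes counted, said to be 88% of the total
counted = 4_800_000
pct_counted = 0.88
trump_lead = 135_000

outstanding = counted / pct_counted - counted   # about 650,000
lead_as_share = trump_lead / outstanding        # about 20% of outstanding votes
print(round(outstanding), round(lead_as_share, 2))
```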

The situation was even more hopeless than it sounded here because the 48% base value came from the 2024 votes that had already been counted; thus, for example, it included her better-than-benchmark performance in Douglas county. She would have to do even better than that to close the gap! In Fulton, which had the biggest potential, she was unable to push the vote share above the 2020 level.

That's why in my book blog (link), I suggested that the networks could have called Georgia (and several other swing states) earlier, if they used "numbersense" rather than mathematical impossibility as the criterion.

***

Before ending, let's praise the unsung heroes - the data analysts who worked behind the scenes to make these interactive graphics possible.

The graphics require data feeds, which cover a broad scope, from real-time vote tallies to total votes cast, both at the county level and the state level. While the focus is on the two leading candidates, any votes going to other candidates have to be tabulated, even if not displayed. The talking heads don't just want raw vote counts; in order to tell the story of the election, they need some understanding of how many votes are still to be counted, where they are coming from, what the partisan lean of those votes is, and how likely the result is to deviate from past elections.

All those computations must be automated, but manually checked. The graphics software has to be reliable; the hosts can touch any part of the map to reveal details, and it's not possible to predict all of the user interactions in advance.

Most importantly, things go wrong unexpectedly during election night, so many data analysts were on standby, scrambling to fix issues such as the breakage of a data feed from some county in some state.


Book review: Getting (more out of) Graphics by Antony Unwin

Unwin_gettingmoreoutofgraphics_cover

Antony Unwin, a statistics professor at Augsburg, has published a new dataviz textbook called "Getting (more out of) Graphics", and he kindly sent me a review copy. (Amazon link)

I am - not surprisingly - in the prime audience for such a book. It fills some gaps in the market:
a) it emphasizes exploratory graphics rather than presentation graphics
b) it deals not just with designing graphics but also interpreting (i.e. reading) them
c) it covers data pre-processing and data visualization in a more balanced way
d) it develops full case studies involving multiple graphics from the same data sources

The book is divided into two parts: the first, which accounts for about 75% of the material, presents case studies, while the final quarter offers "advice". The book has a github page containing R code which, as I shall explain below, is indispensable to the serious reader.

Given the aforementioned design, the case studies in Unwin's book have a certain flavor: most of the data sets are relatively complex, with many variables, including a time component. The primary goal of Unwin's exploratory graphics can be stated as stimulating "entertaining discussions" about and "involvement" with the data. They are open-ended, and frequently inconclusive. This is a major departure from other data visualization textbooks on the market, and also from many of my own blog posts, where we focus on selecting a good graphic for presenting insights visually to an intended audience, without assuming domain expertise.

I particularly enjoyed the following sections: a discussion of building graphs via "layering" (starting on p. 326), an enumeration of iterative improvements to graphics (starting on p. 402), and several examples of data wrangling (e.g. p. 52).

Unwin_fig4.7

Unwin does not give "advice" in the typical style of do this, don't do that. His advice is fashioned in the style of an analyst. He frames and describes the issues, shows rather than tells. This paragraph from the section about grouping data is representative:

Sorting into groups gets complicated when there are several grouping variables. Variables may be nested in a hierarchy... or they may have no such structure... Groupings need to be found that reflect the aims of the study. (p. 371)

He writes down what he has done, may provide a reason for his choices, but is always understated. He sees no point in selling his reasoning.

The structure of the last part of the book, the "advice" chapters, is quite unusual. The chapter headers are: (data) provenance and quality; wrangling; colour; setting the scene (scaling, layout, etc.); ordering, sorting and arranging; what affects interpretation; and varieties of plots.

What you won't find are extended descriptions of chart forms, rules of visualization, or flowcharts tying data types to chart forms. Those are easily found online if you want them (you probably won't care if you're reading Unwin's book.)

***

For the serious reader, the book should be consumed together with the code on github. Find specific graphs from the case studies that interest you, open the code in your R editor, and follow how Unwin did it. The "advice" chapters highlight points of interest from the case studies presented earlier so you may start there, cross-reference the case studies, then jump to the code.

Unfortunately, the code is sparsely commented. So also open up your favorite chatbot, which helps to explain the code, and annotate it yourself. Unwin uses R, and in particular, lives in the "tidyverse".

To understand the data manipulation bits, reviewing the code is essential. It's hard to grasp what is being done to the data without actually seeing the datasets. There are no visuals of the datasets in the book, as the text is primarily focused on the workflow leading to a graphic. The data processing can get quite involved, as in Chapter 16.

I'm glad Unwin has taken the time to write this book and publish the code. It rewards the serious reader with skills that are not commonly covered in other textbooks. For example, I was rather amazed to find this sentence (p. 366):

To ensure that a return to a particular ordering is always possible, it is essential to have a variable with a unique value for every case, possibly an ID variable constructed for just this reason. Being able to return to the initial order of a dataset is useful if something goes wrong (and something will).

Anyone who has analyzed real-world datasets would immediately recognize this as good advice, but who'd have thought to put it down in a book?


Expert handling of multiple dimensions of data

I enjoyed reading this Washington Post article about immigration in America. It features a number of graphics. Here's one graphic I particularly like:

Wpost_smallmultiplesmap

This is a small multiples of six maps, showing the spatial distribution of immigrants from different countries. The maps reveal some interesting patterns: Los Angeles is a big favorite of Guatemalans while Houston is preferred by Hondurans. Venezuelans like Salt Lake City and Denver (where there are also some Colombians and Mexicans). The breadth of the spatial distribution surprises me.

The dataset behind this graphic is complex. It's got country of origin, place of settlement, and time of arrival. The maps above collapsed the time dimension, while drawing attention to the other two dimensions.

***

They have another set of charts that highlight the time dimension while collapsing the place of settlement dimension. Here's one view of it:

Wpost_inkblot_overall

There are various names for this chart form; streamgraph is one. I like to call it "inkblot", where the two sides are symmetric around the middle horizontal line. The chart shows that migrants in the U.S. immigration court system have grown substantially since the end of the Covid-19 pandemic, during which they stopped coming.

I'm not a fan of the inkblot. One reason is visible in the following view, which showcases three Central American countries.

Wpost_inkblot_centralamerica

The main message is clear enough. The volume of immigrants from these three countries has been relatively stable over the last decade, with a bulge in the late 2000s. The recent spurt in migrants has come from other places.

But try figuring out what proportion of total immigration is accounted for by these three countries in, say, 2024. It's a task that is tougher than it should be, and the culprit is that the "other countries" category has been split in half, with the two halves separated.