What is this "stacked range chart"?
Dec 20, 2024
Long-time reader Aleksander B. sent me to this video (link), in which a YouTuber ranted that most spreadsheet programs cannot make his favorite chart. This one:
Two questions immediately come to mind: a) what kind of chart is this? and b) is it useful?
Evidently, the point of the above chart is to tell readers there are (at least) three places called “London”, only one of which features red double-decker buses. He calls this a “stacked range chart”. This example has three stacked columns, one for each place called London.
What can we learn from this chart? The range of temperatures is narrowest in London, England while it is broadest in London, Ontario (Canada). The highest temperature is in London, Kentucky (USA) while the lowest is in London, Ontario.
But what kind of “range” are we talking about? Do the top and bottom of each stacked column indicate the maximum and minimum temperatures as we’ve interpreted them to be? In theory, yes, but in this example, not really.
Let’s take one step back, and think about the data. Elsewhere in the video, another version of this chart contains a legend giving us hints about the data. (It's the chart on the right of the screenshot.)
Each column contains four values: the average maximum and minimum temperatures in each place, the average maximum temperature in summer, and the average minimum temperature in winter. These metrics are mouthfuls of words, because the analyst has to describe what choices were made while aggregating the raw data.
The raw data comprise daily measurements of temperatures at each location. (To make things even more complex, there are likely multiple measurement stations in each town, and thus, the daily temperatures themselves may already be averages; or else, the analyst has picked a representative station for each town.) From this single sequence of daily data, we extract two subsequences: the maximum daily, and the minimum daily. This transformation acknowledges that temperatures fluctuate, sometimes massively, over the course of each day.
Together, the two subsequences are aggregated to four representative numbers. The first pair, the average max and the average min, are just the averages of the respective subsequences. The remaining two numbers require even more explanation. The “summer average maximum temperature” should be the average of the max subsequence after filtering it down to the “summer” months. Thus, it’s a trimmed average of the max subsequence, or the average of the summer subsequence of the max subsequence. Since summer temperatures are the highest of the four seasons, this number suggests the maximum of the max subsequence, but it’s not the maximum daily maximum since it’s still an average. Similarly, the “winter average minimum temperature” is another trimmed average, computed over the winter months, which is related to but not exactly the minimum daily minimum.
Thus, the full range of each column is the difference between the trimmed summer average and the trimmed winter average. I assume weather scientists use this metric instead of the full range of max to min temperature because it’s less affected by outlier values.
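To make the aggregation concrete, here is a minimal Python sketch of how the four numbers could be computed from daily data. It is my reading of the process, not necessarily what the video's author did. The function name `range_summary`, the choice of June to August as "summer" and December to February as "winter" (a Northern Hemisphere assumption), and the four sample records are all made up for illustration.

```python
from datetime import date
from statistics import mean

# Assumption: meteorological seasons in the Northern Hemisphere.
SUMMER_MONTHS = {6, 7, 8}
WINTER_MONTHS = {12, 1, 2}

def range_summary(daily):
    """Aggregate (date, daily_max, daily_min) records into four numbers."""
    highs = [hi for _, hi, _ in daily]   # the "max" subsequence
    lows = [lo for _, _, lo in daily]    # the "min" subsequence
    # Filter each subsequence down to one season before averaging.
    summer_highs = [hi for d, hi, _ in daily if d.month in SUMMER_MONTHS]
    winter_lows = [lo for d, _, lo in daily if d.month in WINTER_MONTHS]
    return {
        "summer_avg_max": mean(summer_highs),
        "avg_max": mean(highs),
        "avg_min": mean(lows),
        "winter_avg_min": mean(winter_lows),
    }

# Made-up records, one per season, in degrees Celsius.
sample = [
    (date(2024, 1, 15), 4.0, -2.0),
    (date(2024, 4, 15), 14.0, 6.0),
    (date(2024, 7, 15), 26.0, 16.0),
    (date(2024, 10, 15), 12.0, 4.0),
]
summary = range_summary(sample)
# The full height of the stacked column is the gap between the two
# seasonal averages, not between the extreme daily readings.
full_range = summary["summer_avg_max"] - summary["winter_avg_min"]
```

The four values, sorted from top to bottom, are the boundaries of the stacked segments in each column.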
***
Stepping out of the complexity, I’ll say this: what the “stacked range chart” depicts are selected values along the distribution of a single numeric data series. In this sense, this chart is a type of “boxplot”.
Here is a random one I grabbed from a search engine.
A boxplot, per its inventor Tukey, shows a five-number summary of a distribution: the median, the 25th and 75th percentiles, and two “whisker values”. Effectively, the boxplot shows five percentile values. The two whisker values are also percentiles, but not fixed percentiles like 25th, 50th, and 75th. The placement of the whiskers is determined automatically by a formula that determines the threshold for outliers, which in turn depends on the shape of the data distribution. Anything contained within the whiskers is regarded as a “normal” value of the distribution, not an outlier. Any value larger than the upper whisker value, or lower than the lower whisker value, is an outlier. (Outliers are shown individually as dots above or below the whiskers; I see this as an optional feature because it doesn't make sense to show them individually for large datasets with lots of outliers.)
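For readers who want to see the whisker logic spelled out, here is a small Python sketch. It assumes the conventional rule of placing the outlier thresholds 1.5 times the interquartile range beyond the box, and it uses one of several common quartile conventions (the "inclusive" method in Python's `statistics` module); other software may give slightly different quartiles. The function name `tukey_boxplot_stats` and the sample data are made up.

```python
from statistics import quantiles

def tukey_boxplot_stats(values, k=1.5):
    """Five-number summary with Tukey-style whiskers (assumes the k * IQR rule)."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Thresholds beyond which a value is treated as an outlier.
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers extend to the most extreme data points inside the thresholds.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "lower_whisker": inside[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "upper_whisker": inside[-1],
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }

# Made-up data: nine ordinary values and one extreme value.
stats = tukey_boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
```

In this example, the upper whisker lands on 9 and the value 30 is flagged as an outlier. The whisker position depends on the data, which is why it is not a fixed percentile.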
The stacked range chart of temperatures picks off different waypoints along the distribution but in spirit, it is a boxplot.
***
This discussion leads me to the answer to our second question: is the "stacked range chart" useful? The boxplot is indeed useful. It does a good job describing the basic shape of any distribution.
I make variations of the boxplot all the time, with different percentiles. One variation commonly seen out there replaces the whisker values with the maximum and minimum values. Thus all the data live within the whiskers. This wasn’t what Tukey originally intended but the max-min version can be appropriate in some situations.
Most statistical software makes the boxplot. Excel is the one big exception. It has always been a mystery to me why the Excel developers are so hostile to the boxplot.
P.S. Here is the official manual for making a box plot in Excel. I wonder if they are the leading promoter of the max-min boxplot that strays from Tukey's original. It is possible to make the original whiskers but I suppose they don't want to explain it, and it's much easier to have people compute the maximum and minimum values in the dataset.
The max-min boxplot is misleading if the dataset contains true outliers. If the maximum value is really far from the 75th percentile, then most of the data between the 75th and 100th percentile could be sitting just above the top of the box.
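A quick made-up example shows the problem. The sketch below again assumes the 1.5 times interquartile range rule for the Tukey-style whisker and the "inclusive" quartile method; the dataset of 19 ordinary values plus one extreme value is invented for illustration.

```python
from statistics import quantiles

# Made-up data: the integers 1 to 19 plus one extreme value.
data = list(range(1, 20)) + [100]

q1, median, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Max-min version: the upper whisker is simply the largest value.
maxmin_whisker = max(data)

# Tukey-style version (assuming the 1.5 * IQR rule): the largest value
# that does not exceed the outlier threshold.
upper_fence = q3 + 1.5 * iqr
tukey_whisker = max(v for v in data if v <= upper_fence)

# Where the values above the box actually sit.
top_quarter = [v for v in data if v > q3]
```

Here the max-min whisker stretches all the way to 100, even though four of the five values above the box lie between 16 and 19. The Tukey-style whisker stops at 19 and leaves 100 to be drawn as a separate dot.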
P.S. [1/9/2025] See the comments below. Steve made me realize that the color legend of the London chart actually has five labels; the last one is white, which blends into the white background. Note that, in the next post in this series, I found that I could not replicate the guy's process to produce the stacked column chart in Excel, so I went in a different direction.