Five-value summaries of distributions
Dec 30, 2024
BG commented on my previous post, describing her frustration with the “stacked range chart”:
A stacked graph visualizes cubes stacked one on top of the other. So you can't use it for negative numbers, because there's no such thing [as] "negative data". In graphs, a "minus" sign visualizes the opposite direction of one series from another. Doing average plus average plus average plus average doesn't seem logical at all.
***
I have already planned a second post to discuss the problems of using a stacked column chart to show markers of a numeric distribution.
I tried to replicate how the Youtuber generated his “stacked range chart” by appropriating Excel’s stacked column chart, but failed. I think there are some missing steps not mentioned in the video. At around 3:33 of the video, he shows a “hack” involving adding 100 degrees (any large enough value) to all values (already converted to ranges). Then, the next screen displays the resulting chart. Here is the dataset on the left and the chart on the right.
Afterwards, he replaces the axis labels with new labels, effectively shifting the axis. But something is missing from the narrative. Since he’s using a stacked column chart, the values in the table are encoded in the heights of the respective blocks. The total stacked heights of each column should be in the hundreds since he has added 100 to each cell. But that’s not what the chart shows.
***
In the rest of the post, I’ll skip over how to make such a chart in Excel, and talk about the consequences of inserting “range” values into the heights of the blocks of a stacked column chart.
Let’s focus on London, Ontario; the five temperature values, corresponding to various average temperatures, are -3, 5, 9, 14, 24. Just throwing those numbers into a stacked column chart in Excel results in the following useless chart:
The temperature averages are cumulatively summed, which makes no sense, as noted by reader BG. [My daily temperature data differ somewhat from those in the Youtube. My source is here.]
We should ignore the interiors of the blocks, and instead interpret the edges of these blocks. There are five edges corresponding to the five data values. As in:
The average temperature in London, Ontario (during Spring 2023-Winter 2024) is 9 C. This overall average hides seasonal as well as diurnal variations in temperature.
If we want to acknowledge that night-time temperatures are lower than day-time temperatures, we draw attention to the two values bracketing 9 C, i.e. 5 C and 14 C. The average daytime (max) temperature is 14 C while the average night-time (min) temperature is 5 C. Furthermore, Ontario experiences seasons, so that the average daytime temperature of 14 C is subject to seasonal variability; in the summer, it goes up to 24 C. In the winter, the average night-time temperature goes down to -3 C, compared to 5 C across all seasons. [For those paying closer attention, daytime/max and night-time/min form congruous pairs because the max temperature occurs during daytime while the min temperature occurs during night-time. Thus, the average of maximum temperatures is the same as the average of daytime maximum temperatures.]
The above dotplot illustrates this dataset adequately. The Youtuber explained why he didn’t like it – I couldn’t quite make sense of what he said. It’s possible he thinks the gaps between those averages are more meaningful than the averages themselves, and therefore he prefers a chart form that draws our attention to the ranges, rather than the values.
***
Our basic model of temperature can be thought of as: temperature on a given day = overall average + adjustment for seasonality + adjustment for diurnality.
Take the top three values 9, 14, 24 from above list. Starting at the overall average of 9 C, the analyst gets to 14 if he hones in on max daily temperatures, and to 24 if he further restricts the analysis to summer months (which have the higher temperatures). The second gap is 10 C, twice as large as the first gap of 5 C. Thus, the seasonal fluctuations have larger magnitude than daily fluctuations. Said differently, the effect of seasons on temperature is bigger than that of hour of day.
In interpreting the “ranges” or gaps between averages, narrow ranges suggest low variability while wider ranges suggest higher variability.
Here's a set of boxplots for the same data:
The boxplot "edges" also demarcate five values; they are not the same five values as defined by the Youtuber but both sets of five values describe the underlying distribution of temperatures.
P.S. For a different example of something similar, see this old post.
Comments
You can follow this conversation by subscribing to the comment feed for this post.