This post is the second post in response to a blog post at StackOverflow (link) in which the author discusses the "harm" of "aggregating away the signal" in your dataset. The first post appears on my book blog earlier this week (link).
One stop in their exploratory data analysis journey was the following chart:
This chart plots all the raw data, all 8,760 values of electricity consumption in California in 2020. Most analysts know this isn't a nice chart, and it's an abuse of ink. This chart is used as a contrast to the 4-week moving average, which was hoisted up as an example of "over-aggregation".
Why is the above chart bad (aside from the waste of ink)? Think about how you consume the information. For me, I notice these features in the following order:
- I see the upper "envelope" of the data, i.e. the top values at each hour of each day throughout the year. This gives me the seasonal pattern with a peak in the summer months.
- I see the lower "envelope" of the data
- I see the "height" of the data, which is, roughly speaking, the range of values within a day
- If I squint hard enough, I see a darker band within the band, which roughly maps to the most frequently occurring values (this feature becomes more prominent if we select a lighter shade of gray)
The chart may not be as bad as it looks. The "moving average" is sort of visible. The variability of consumption is visible. The primary problem is it draws attention to the outliers, rather than the more common values.
The envelope of any dataset is composed of extreme values, by definition. For most analysis objectives, extreme values are "noise". In the chart above, it's hard to tell how common the maximum values are relative to other possible values but it's the upper envelope that captures my attention - simply because it's the easiest trend to make out.
The same problem actually surfaces in the "improved" chart:
As explained in the preceding post, this chart rearranges the data. Instead of a single line, therea are now 52 overlapping lines, one for each week of the year. So each line is much less dense and we can make out the hour of day/day of week pattern.
Notice that the author draws attention to the upper envelope of this chart. They notice the line(s) near the top are from the summer, and this further guides their next analysis.
The reason for focusing on the envelope is the same as in the other chart. Where the lines are dense, it's not easy to make out the pattern.
Even the envelope is not as clear as it seems! There is no reason why the highlighted week (August 16 to 23) should have the highest consumption value each hour of each day of the week. It's possible that the line dips into the middle of the range at various points along the line. In the following chart, I highlight two time points in which lines may or may not have crossed:
In an interactive chart, each line can be highlighted to resolve the confusion.
Note that the lower envelope is much harder to decipher, given the density of lines.
The author then pursues a hypothesis that there are lines (weeks) with one intra-day peak and there are those with two peaks.
I'd propose that those are not discrete states but continuous. The base pattern can be one with two peaks, a higher peak in the evening, and a lower peak in the morning. Now, if you imagine pushing up the evening peak while holding the lower peak at its height, you'd gradually "erase" the lower peak but it's just receded into the background.
Possibly the underlying driver is the total demand for energy. The higher the demand, the more likely it's concentrated in the evening, which causes the lower peak to recede. The lower the demand, the more likely we see both peaks.
In either case, the prior chart drives the direction of the next analysis.