Today, I'm returning to those "equal-area histograms" that Andrew wrote about last month. I have two previous posts about this. The first post introduces the concept: in a traditional histogram, the columns have the same bin width while the column heights can represent a variety of metrics, such as counts, relative frequencies (i.e. proportion of the data) and densities; in the equal-area histogram, the columns have varying widths while the area of each column is constant, and determined by the number of bins (columns).
In a second post, I explained the differences between using counts, frequencies and densities on the vertical axis. The underlying issue is that the histogram is not merely a column chart, in which the width of the columns is arbitrary and data-free - in the histogram, both the heights and widths of columns carry meaning. One feature of the histogram that almost everyone expects is that the areas of the columns sum to 1. Readers also want to interpret the column heights as probabilities of data falling into specified ranges, as we'd like the amount of data in the entire range to add up to 100%. Unfortunately, these two expectations (areas summing to 1, and heights read as probabilities) are usually incompatible with each other.
If the height of the columns represents the probability of data falling into the range as indicated by its width, then the sum of the column heights is 1, which implies that the sum of the column areas cannot be 1. On the other hand, if the column areas add up to 1, then the column heights will not add up to 1, and thus, in this scenario, we cannot interpret the column heights to be probabilities. As explained in the second post, the column heights in this situation are densities, which can be defined as the proportion of data divided by the bin width. Intuitively, it gives information on how dense or sparse the data are within the specified range.
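Here is a minimal sketch of that distinction, assuming NumPy and an arbitrary simulated sample (the seed, the sample and the choice of 20 bins are mine, for illustration only). It computes both kinds of column heights from the same bins and checks which quantity sums to 1 in each case.

```python
import numpy as np

# Illustrative sample; any numeric data would do (seed and size are arbitrary).
rng = np.random.default_rng(0)
data = rng.normal(loc=4.0, scale=1.0, size=10_000)

# Equal-width bins; 20 is an arbitrary choice for this sketch.
counts, edges = np.histogram(data, bins=20)
widths = np.diff(edges)

# Heights as proportions: the heights sum to 1, the areas do not.
proportions = counts / counts.sum()

# Heights as densities: proportion of data divided by bin width.
densities = proportions / widths

print("sum of proportions:", proportions.sum())            # 1 up to rounding
print("sum of density areas:", (densities * widths).sum())  # 1 up to rounding
print("sum of density heights:", densities.sum())           # generally not 1
```

The densities computed this way should match what `np.histogram(..., density=True)` returns for the same bins.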
Today's post starts with a toy dataset, containing randomly generated values from a normal distribution (bell curve) centered at 4 and with standard deviation 1.
Here is the traditional histogram of the dataset, using 100 equal-width bins. (I generated 10,000 values.)
Next come the equal-area versions. The first histogram divides the data into 4 bins; the others use 10 bins, 20 bins and 100 bins.
In the 4-bin case, each column contains 1/4 = 25% of the data. The middle two columns contain 50% of the data, and they have high densities because these columns are narrow. It's a crude approximation of the familiar bell curve.
As we increase the number of bins, the columns in the middle of the distribution, where most of the data are concentrated, become narrower. In the sparse regions, the column width doesn't necessarily grow because each column must contain 1/n of the data, where n is the number of columns. As the number of columns increases, each column contains less of the data.
The bottom chart is the "percentogram", which is what Andrew's correspondent proposed. The number of bins is set to 100, so each column contains exactly 1 percent of the data. For a normal distribution, the columns in the middle are very tall and thin.
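For readers who want to reproduce this, here is one way to build such bins. It is a sketch under my own assumptions, not necessarily how the charts in this post were made: the function name `equal_area_histogram` is hypothetical, the bin edges are taken to be sample quantiles via `np.quantile`, and the data are assumed to have no ties (ties could produce zero-width bins).

```python
import numpy as np

def equal_area_histogram(data, n_bins):
    """Return (edges, densities) for bins that each hold about 1/n_bins of the data.

    Hypothetical helper: edges are sample quantiles, so every column has
    area 1/n_bins and its height is (1/n_bins) / width.
    """
    data = np.asarray(data, dtype=float)
    edges = np.quantile(data, np.linspace(0.0, 1.0, n_bins + 1))
    widths = np.diff(edges)  # assumes no ties, so no zero-width bins
    densities = (1.0 / n_bins) / widths
    return edges, densities

# Toy data matching the post: normal, mean 4, sd 1, 10,000 values (seed is arbitrary).
rng = np.random.default_rng(0)
data = rng.normal(4.0, 1.0, 10_000)

edges, densities = equal_area_histogram(data, 100)  # the "percentogram"
widths = np.diff(edges)
counts, _ = np.histogram(data, bins=edges)

print("counts per bin (min, max):", counts.min(), counts.max())
print("narrowest and widest bin:", widths.min(), widths.max())
```

To draw it, something like matplotlib's `plt.bar(edges[:-1], densities, width=widths, align="edge")` should work, ideally without column borders.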
The reason why the middle of the percentogram looks faded is that I asked for a white border around each column. But when the columns are so thin, even if one sets the border width very small, what readers see is a mixture of orange and white.
With a high number of bins, we notice a few things: a) the outline of the histogram becomes "ragged" (more so as the number of bins grows), b) the middle columns become razor-thin, and c) the width conceded by the middle columns is absorbed not by the columns at the edges but by those between the peak and the edge.
I'm struggling a bit to justify this percentogram versus the typical, equal-width histogram.
Let me go down a different path.
In "principled" histograms, the column heights represent data densities, while the total area of the columns adds up to 1. This leads us to a new understanding of the relationship between the equal-width histogram and the equal-area histogram.
We start with data density defined by (proportion of data) / (bin width). Those two values are not independent - one is fully determined by the other, given the underlying dataset. In a traditional equal-width histogram, the question is: how much of the data is found in a column of fixed width? In the new equal-area histogram, the question is: how wide is the bin that contains a fixed amount of data? In the former, the denominator is fixed while the numerator varies; the opposite occurs in the latter.
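In symbols, the contrast can be written as follows. The notation is mine, introduced only for this sketch: p_i is the proportion of data in bin i, w_i its width, d_i its height (the density), n the number of bins, and R the range of the data.

```latex
\begin{aligned}
d_i &= \frac{p_i}{w_i}, \qquad \sum_{i=1}^{n} p_i \;=\; \sum_{i=1}^{n} d_i\, w_i \;=\; 1 \\[4pt]
\text{equal-width:}\quad w_i &= \frac{R}{n}\ \text{(fixed)} \;\Longrightarrow\; d_i = \frac{n}{R}\, p_i \\[4pt]
\text{equal-area:}\quad p_i &= \frac{1}{n}\ \text{(fixed)} \;\Longrightarrow\; d_i = \frac{1}{n\, w_i}
\end{aligned}
```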
We also recognize that, given the range of the data, there is a relationship between the sets of bin widths in the two types of histograms. In the traditional histogram, all bin widths have the same value, equal to the range of the data divided by the number of bins. Think of this as the average bin width. In an equal-area histogram, the bin widths vary; however, they must still add up to the range of the data. For two comparable histograms with the same number of bins, the average of the bin widths must be the same for both sets. (I'm ignoring any rounding situations in which the range of the histogram is larger than the range of the data.)
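A quick numerical check of that claim, as a sketch assuming NumPy, a simulated normal sample, and quantile-based edges for the equal-area bins (all my choices for illustration):

```python
import numpy as np

# Simulated data in the spirit of the post's toy dataset (seed is arbitrary).
rng = np.random.default_rng(0)
data = rng.normal(4.0, 1.0, 10_000)
n_bins = 100

# Equal-width edges span the data range in identical steps.
equal_width_edges = np.linspace(data.min(), data.max(), n_bins + 1)
# Equal-area edges are sample quantiles (an assumption of this sketch).
equal_area_edges = np.quantile(data, np.linspace(0.0, 1.0, n_bins + 1))

ew_widths = np.diff(equal_width_edges)
ea_widths = np.diff(equal_area_edges)
data_range = data.max() - data.min()

# Both sets of widths add up to the range, so their averages coincide.
print("sums:", ew_widths.sum(), ea_widths.sum(), "range:", data_range)
print("means:", ew_widths.mean(), ea_widths.mean())
# The equal-area widths scatter on both sides of that common average.
print("equal-area min/max width:", ea_widths.min(), ea_widths.max())
```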
Now, consider the middle of the normal distribution where the data are dense. In the traditional histogram, the column in the middle still has width equal to the average bin width. In the equal-area histogram, the middle column has width much smaller than the average bin width. In other words, we can think of the column in the traditional histogram being broken up into many thin columns in the equal-area histogram, each containing 1% of the data in the case of the percentogram.
The height of the column is the data density. In the traditional histogram, the middle column pools a larger sample; in the equal-area histogram, each of those thin columns is a partition of that sample. This explains observation (a) above, in which the outline of the equal-area histogram is more ragged - it's because each column contains fewer data from which to estimate the data density.
But this raggedness is artificial, sampling noise.
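One way to see the noise is by simulation. The sketch below is my own illustration with arbitrary settings (200 replications, NumPy, quantile-based equal-area bins): it redraws the toy dataset many times and records the estimated density at the center of the distribution under both binning schemes, so that the spread of the two estimates can be compared.

```python
import numpy as np

rng = np.random.default_rng(1)           # arbitrary seed
n, n_bins, n_reps = 10_000, 100, 200     # n_reps is an illustrative choice
center = 4.0                             # the mean of the toy distribution

def density_at(edges, densities, x):
    """Height of the histogram column that contains x."""
    i = np.searchsorted(edges, x, side="right") - 1
    i = min(max(i, 0), len(densities) - 1)
    return densities[i]

equal_width_est, equal_area_est = [], []
for _ in range(n_reps):
    sample = rng.normal(4.0, 1.0, n)

    # Traditional histogram: fixed width, varying amount of data per bin.
    dens, edges = np.histogram(sample, bins=n_bins, density=True)
    equal_width_est.append(density_at(edges, dens, center))

    # Equal-area histogram: fixed 1/n_bins of the data per bin, varying width.
    q_edges = np.quantile(sample, np.linspace(0.0, 1.0, n_bins + 1))
    q_dens = (1.0 / n_bins) / np.diff(q_edges)
    equal_area_est.append(density_at(q_edges, q_dens, center))

print("sd of equal-width estimate:", np.std(equal_width_est))
print("sd of equal-area estimate: ", np.std(equal_area_est))
```

With these settings I would expect the equal-area estimate to show the larger spread at the center, since each of its columns holds about 100 values while the central equal-width column holds several hundred.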
The sparse areas are more complicated still, and the situation is the reverse of the above. On the edges of the normal distribution, the columns of the new histogram are wider than those of the traditional histogram. So, we can think of breaking up the edge column of the new histogram into multiple columns of the traditional histogram.
The interpretation is more complicated because the data are sparse in this region. Obviously, the estimates of density on the traditional histogram in sparse regions are poor because not enough data reside there. The density estimate on the new histogram is based on a larger sample size.
However, whether the new histogram's density estimate is better depends on the shape of the tail of the distribution. A normal distribution has rapidly decaying tails (its density falls off even faster than an exponential), which means that the data density declines quite drastically the further we go into the tail. Therefore, the new histogram averages the data densities across a large part of the tail, wiping out that shape, while the traditional histogram preserves it - at the expense of greater sampling variability due to smaller sample sizes.
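To make the averaging concrete, here is a small sketch (my own, assuming NumPy, a simulated sample and quantile-based bin edges) that looks at the outermost right-hand column of a 100-bin equal-area histogram. It compares the single height of that column with the true normal density at the column's two ends.

```python
import math
import numpy as np

def normal_pdf(x, mu=4.0, sigma=1.0):
    """True density of the toy distribution (normal, mean 4, sd 1)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

rng = np.random.default_rng(0)  # arbitrary seed
data = rng.normal(4.0, 1.0, 10_000)

# Equal-area edges taken as sample quantiles (an assumption of this sketch).
edges = np.quantile(data, np.linspace(0.0, 1.0, 101))

# The last column runs from the 99th percentile to the sample maximum.
lo, hi = edges[-2], edges[-1]
bin_density = 0.01 / (hi - lo)   # one flat height for the whole column

print("true density at inner edge:", normal_pdf(lo))
print("true density at outer edge:", normal_pdf(hi))
print("equal-area column height:  ", bin_density)
```

The column reports one flat height for a stretch over which the true density changes by a large factor, which is the loss of tail shape described above.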
For what it's worth, let's look at some histograms for an exponential random variable.
Here is the traditional histogram:
The data are extremely dense on the left side, with a long tail on the right side.
The four-bin equal-area version gives a nice summary of the shape. As the number of bins goes up, as before, the denser regions now have tall, thin spikes. Again, because of the white borders, the last histogram with 100 bins is faded where the data are densest. (So obviously, don't follow my lead: eliminate the borders if you want to use this chart.)
The 100-bin version looks almost the same as the traditional histogram.
At this stage of the exploration, I still haven't found a compelling reason to switch to equal-area histograms. In the denser regions, they add sampling noise. If I don't care about the sparser areas, specifically, the shape of the tails, maybe they provide a cleaner presentation.