Apr 28, 2023
Andrew posted a message from one of his readers about a "percentogram", a variant of the histogram (link).
Let's review what a histogram is. This one I created by grabbing a dataset of scoring in international football (soccer) matches.
The histogram is a specialized type of column chart, typically displayed with zero spacing between columns. The horizontal axis represents the metric being plotted. In the example above, it's the minute of scoring with values between 0 and 120. (There are few points beyond 90 minutes as only certain tournaments prescribe extra-time in case of ties at the end of 90 minutes.) Other examples are income if it's a histogram of income distribution, or age if it's an age distribution. The metric is typically measured numerically.
The notion of "bins" is fundamental. The numeric metric on the horizontal axis is discretized into "bins". (In the above example, R selected a bin width of 10 minutes.) Binning transforms the data to fit natural concepts such as income groups and age groups. Each column has a bin width, and its height is either counts or frequencies. Counts are the number of data points that fall inside a given bin. Frequencies are the proportion of data points that fall inside a given bin. The sum of counts equals the sample size while the sum of frequencies is 100%.
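To make the counts-versus-frequencies distinction concrete, here is a minimal sketch in Python (not the R code behind the chart). The `minutes` data are made-up stand-ins for the scoring minutes, and `equal_width_histogram` is a hypothetical helper written for this illustration.

```python
import random

def equal_width_histogram(values, low, high, bin_width):
    """Count how many values fall in each equal-width bin over [low, high)."""
    n_bins = int((high - low) / bin_width)
    counts = [0] * n_bins
    for v in values:
        if low <= v < high:
            counts[int((v - low) // bin_width)] += 1
    return counts

# Hypothetical data: 1,000 made-up scoring minutes between 0 and 119.
rng = random.Random(0)
minutes = [rng.randrange(120) for _ in range(1000)]

counts = equal_width_histogram(minutes, 0, 120, 10)   # 12 bins of 10 minutes
frequencies = [c / len(minutes) for c in counts]      # proportions per bin

print(sum(counts))                  # equals the sample size
print(round(sum(frequencies), 6))   # proportions add up to 1, i.e. 100%
```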
There is a large literature on how to "bin" the data. In the canonical histogram, the horizontal axis is subdivided into bins of equal length. This means the histogram columns are equal width. You can readily see that in the histogram shown above.
The histogram conveys information about where the data is. The following set includes three possibilities: a uniform distribution (in which the data density is the same over the range of the horizontal axis), a normal distribution (the familiar bell curve), and a long-tailed distribution with mostly small values.
In general, the bin widths need not be the same for all bins. One can vary them. The "percentogram" referenced in Andrew's post specifies one way of setting varying bin widths: it prescribes that the breaks between bins be percentiles of the distribution.
In other words, the first bin contains the lowest 1% of the data, the second bin, the next 1%, etc. Let's see the effect of varying bin widths on the normal distribution. Note that I set the average value to be 4 so that almost all the data fall between 1 and 7.
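A rough sketch of how such breaks could be computed, using only Python's standard library. Several things here are my assumptions rather than details from Andrew's post: a standard deviation of 1 (implied by "between 1 and 7" around a mean of 4), a sample of 10,000, the "inclusive" interpolation rule, and the use of the sample minimum and maximum as the outermost edges.

```python
import random
import statistics

# Assumed setup: normal data with mean 4 and standard deviation 1.
rng = random.Random(1)
data = [rng.gauss(4, 1) for _ in range(10_000)]

# statistics.quantiles with n=100 returns the 99 interior percentile cut points.
cuts = statistics.quantiles(data, n=100, method="inclusive")

# Assumption: close the outer bins at the sample minimum and maximum.
edges = [min(data)] + cuts + [max(data)]
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]

print(len(widths))              # 100 bins, each holding about 1% of the data
print(widths[0] > widths[50])   # the tail bin is wider than a middle bin
```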
The top chart is a dot plot showing the actual data. The middle chart shows a generic equal-width histogram. The third chart shows the percentogram, in which each bin contains 1% of the data.
In the percentogram, the bin width is a function of the density of data, with wider bins where data are sparse and narrower bins where data are dense. For a normal distribution, the data are quite concentrated in the middle. The columns on the side are wide and short.
While the standard histogram has equal-width bins, the percentogram has bin widths that vary. What is fixed is the amount of data in each column, that is to say, the area of each column is fixed.
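If the area is to stay fixed, the height of each column has to be the bin's share of the data divided by its width. The sketch below illustrates this with a handful of made-up bin edges (they are hypothetical, not taken from the charts), each bin assumed to hold the same share of the data.

```python
# Hypothetical bin edges for 8 equal-count bins of bell-shaped data:
# wide bins in the tails, narrow bins in the middle.
edges = [1.0, 3.0, 3.5, 3.8, 4.0, 4.2, 4.5, 5.0, 7.0]

n_bins = len(edges) - 1
share = 1 / n_bins  # every bin holds the same proportion of the data

widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
heights = [share / w for w in widths]             # height = share / width
areas = [h * w for h, w in zip(heights, widths)]  # area = height * width

for w, h, a in zip(widths, heights, areas):
    print(f"width={w:.2f}  height={h:.3f}  area={a:.3f}")
```

Every row prints the same area, while the wide tail bins come out short and the narrow middle bins come out tall.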
The following set of charts corresponds to the triple shown above. Each chart contains 10,000 random samples. The same datasets were used in both sets of charts.
A few negatives of the percentogram jump out.
The column heights are rather jagged, and that is purely random noise, since the data are drawn from standard distributions. Because the data are divided into 100 parts, each column contains 100 data points, and that sample size is not large enough to tame the sampling noise.
Also, the middle of the normal distribution, where most of the data sit, looks hollow. This is a feature, not a bug, of the percentogram. Dense columns will be narrow and tall, like lines, while sparse columns will be wide and short. Even though each column has the same area, our eyes tend to focus on the short wide ones rather than the pencil-thin columns.
Nothing should stop us from making equal-area bins based on other quantiles. For example, the following set divides the data into 20 bins (demideciles) instead of 100:
With a smaller number of bins, the envelope of each histogram is less jagged. The denser columns are also less narrow, and thus don't exhibit the hollowing effect.
For validation, the equal-area histogram for uniform distributions looks the same as the equal-width histogram. This is expected since by definition, the data density is uniform across the whole range. Columns that contain equal counts should therefore have equal widths as the heights should be equal (excluding randomness).
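The uniform case also gives a convenient way to check the jaggedness numerically: the true density is flat, so any variation in column heights is sampling noise. The sketch below compares 100 bins against 20 on simulated uniform data. The helper names, the seed, and the choice of spread measure (standard deviation of the heights divided by their mean) are my own assumptions for illustration.

```python
import random
import statistics

def equal_count_heights(data, n_bins):
    """Column heights of an equal-count histogram: share of data / bin width."""
    cuts = statistics.quantiles(data, n=n_bins, method="inclusive")
    edges = [min(data)] + cuts + [max(data)]
    share = 1 / n_bins
    return [share / (hi - lo) for lo, hi in zip(edges, edges[1:])]

def relative_spread(heights):
    """Standard deviation of the heights relative to their mean."""
    return statistics.pstdev(heights) / statistics.fmean(heights)

rng = random.Random(42)
uniform_data = [rng.random() for _ in range(10_000)]

spread_100 = relative_spread(equal_count_heights(uniform_data, 100))
spread_20 = relative_spread(equal_count_heights(uniform_data, 20))

# With 20 bins each column holds 500 points instead of 100,
# so the heights should wobble less around the flat true density.
print(spread_100, spread_20)
```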
What's a use case for equal-area histograms? It's for reading off sections of the data.
With 20 bins, each column contains 5% of the data. So it's easy to find the cutoff value for the top 10% of the above distribution: just count two columns in from the right. Finding the middle 50% is a bit harder unless the column index is printed on the chart, but it's still possible to find the range of values that contains half the dataset.
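The same reading-off can be done on the bin edges directly. In this sketch the data are a made-up long-tailed sample (an exponential draw, chosen only for illustration), and the outer edges are assumed to sit at the sample minimum and maximum.

```python
import random
import statistics

# Hypothetical long-tailed data.
rng = random.Random(7)
data = [rng.expovariate(1.0) for _ in range(10_000)]

# 20 equal-count bins: 19 interior cut points plus the two outer edges.
cuts = statistics.quantiles(data, n=20, method="inclusive")
edges = [min(data)] + cuts + [max(data)]

# Top 10% = the two rightmost columns, so the cutoff is the third edge from the right.
top10_cutoff = edges[-3]

# Middle 50% = columns 6 through 15, spanning the 25th to the 75th percentile.
mid_lo, mid_hi = edges[5], edges[15]

share_top = sum(v > top10_cutoff for v in data) / len(data)
share_mid = sum(mid_lo <= v <= mid_hi for v in data) / len(data)
print(round(share_top, 3), round(share_mid, 3))
```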
By contrast, the standard histogram does not offer a ready answer to this type of question. One would have to look at the height of each column and start adding.
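For comparison, here is what the "start adding" procedure looks like on an equal-width histogram. The counts are invented for illustration, and `bin_containing_top_share` is a hypothetical helper. Note that it can only say which bin the cutoff falls in, not where inside that bin.

```python
def bin_containing_top_share(counts, share):
    """Add counts from the right until the running total reaches the share."""
    total = sum(counts)
    running = 0
    for i in range(len(counts) - 1, -1, -1):
        running += counts[i]
        if running >= share * total:
            return i
    return 0

# Hypothetical equal-width histogram counts for a long-tailed sample of 1,000.
counts = [400, 250, 150, 90, 50, 30, 20, 10]

# From the right: 10, 30, 60, 110. The 10% mark (100 points) is crossed in bin index 4.
print(bin_containing_top_share(counts, 0.10))  # prints 4
```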
I have a few other comments on this variant of the histogram as well as some details I glossed over in this post. Stay tuned!