The one thing you're afraid to ask about histograms
May 02, 2023
In the previous post about a variant of the histogram, I glossed over a few perplexing issues - deliberately. Today's post addresses one of these topics: what is going on with the vertical axis of a histogram?
The real question is: what data are encoded in the histogram, and where?
Let's return to the dataset from the last post. I grabbed data from a set of international football (i.e. soccer) matches. Each goal scored has a scoring minute. If the goal is scored in regulation time, the scoring minute is a number between 1 and 90. Specifically, the data collector rounds up: any goal scored between 0 and 60 seconds is recorded as 1, all the way up to a goal scored between the 89th and 90th minute being recorded as 90. In this post, I only consider goals scored in regulation time, so the horizontal axis runs from 1 to 90 minutes.
The kneejerk answer to the posed question is: counts in bins. Isn't it the case that in constructing a histogram, we divide the range of values (1-90) into bins, and then plot the counts within bins, i.e. the number of goals scored within each bin of minutes?
The following is what we have in mind:
Let's call this the "count histogram".
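For readers who like to see the mechanics, here is a minimal Python sketch of the counting step behind a count histogram. The `minutes` list is made-up toy data (not the football dataset), and the helper name `count_histogram` is purely illustrative. The only thing carried over from the post is the rounding-up convention, under which 5-minute bins cover minutes 1-5, 6-10, and so on.

```python
# Toy scoring minutes (made up for illustration; NOT the post's dataset).
minutes = [2, 4, 9, 13, 17, 22, 28, 33, 41, 44, 45, 45, 51,
           57, 62, 68, 73, 79, 84, 86, 88, 89, 90, 90, 90]


def count_histogram(values, bin_width=5, upper=90):
    """Return the number of values in each bin.

    Bin k covers minutes k*bin_width + 1 through (k + 1)*bin_width,
    so with bin_width=5 the bins are 1-5, 6-10, ..., 86-90.
    """
    counts = [0] * (upper // bin_width)
    for m in values:
        counts[(m - 1) // bin_width] += 1
    return counts


counts = count_histogram(minutes)

# Crude text rendering: one '#' per goal, so bar length encodes the count.
for k, c in enumerate(counts):
    print(f"minutes {5 * k + 1:2d}-{5 * k + 5:2d}: {'#' * c} ({c})")
```

The bar lengths are the bin counts, and they add up to the number of goals in the toy list.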
Some readers may dislike the scale of the vertical axis, as its interpretation hinges on the total sample size. Hence, another kneejerk answer is: frequencies in bins. Instead of plotting counts directly, plot frequencies, which are just standardized counts. Just divide each value by the sample size. Here's the "frequency histogram":
The count and frequency histograms are identical except for the scale, and appear intuitively clear. The count and frequency data are encoded in the heights of the columns. The column widths are an afterthought, and they adhere to a fixed constant. Unlike a column chart, typically the gap width in a histogram is zero, as we want to partition the horizontal range into adjoining sections.
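Standardizing the counts is a one-line operation. The sketch below uses made-up bin counts (not the post's data) and only assumes that "frequency" means count divided by sample size, as defined above.

```python
# Toy counts per 5-minute bin (made up for illustration; NOT the post's data).
counts = [2, 1, 1, 1, 1, 1, 1, 0, 4, 0, 1, 1, 1, 1, 1, 1, 1, 6]

n = sum(counts)                         # sample size
frequencies = [c / n for c in counts]   # standardized counts

# The heights now add up to 1 (i.e. 100%), whatever the sample size is.
print(n, round(sum(frequencies), 10))
```

Only the vertical scale changes: every height is divided by the same constant, so the shape of the chart is untouched.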
Now, if you look carefully at the histogram from the last post, reproduced below, you'd find that it plots neither counts nor frequencies:
The numbers on the axis are fractions, which suggests that they may be frequencies, but a quick check proves otherwise: with 9 columns, the average column should contain about 11 percent (one-ninth) of the data. The total of the displayed fractions is nowhere near the 100% we'd expect if the values were relative frequencies. You may have come across this strangeness when creating histograms using R or some other software.
The purpose of this post is to explain what values are being plotted and why.
What are the kinds of questions we like to answer about the distribution of data?
At a high level, we want to know "where are my data"?
Arguably these two questions are fundamental:
- what is the probability that the data falls within a given range of values? e.g., what is the probability that a goal is scored in the first 15 minutes of a football match?
- what is the relative probability of data between two ranges of values? e.g. are teams more likely to score in the last 5 minutes of the first half or the last 5 minutes of the second half of a football match?
In a histogram, the first question is answered by comparing a given column to the entire set of columns while the second question is answered by comparing one column to another column.
Let's see what we can learn from the count histogram.
In a count histogram, the heights encode the count data. To address the relative probability question, we note that the ratio of heights is the ratio of counts, and the ratio of counts is the same as the ratio of frequencies. Thus, we learn that teams are roughly 3000/1500 = 2 times more likely to score in the last 5 minutes of the second half than during the last 5 minutes of the first half. (See the green columns.)
[For those who follow football, it's clear that the data collector treated goals scored during injury time of either half as scored during the last minute of the half, so this dataset can't be used to analyze timing of goals unless the real minutes were recorded for injury-time goals.]
To address the range probability question, we compare the aggregate height of the three orange columns with the total heights of all columns. Note that I said "height", not "area," because the heights directly encode counts. It's actually taxing to figure out the total height!
We resort to reading the total area of all columns. This should yield the correct answer: the area is directly proportional to the height because the column widths are fixed as a constant. Bear in mind, though, if the column widths vary (the theme of the last post), then areas and heights are not interchangeable concepts.
Estimating the total area is still not easy, especially if the column heights exhibit high variance. What we need is the proportion of the total area that is orange. It's possible to eyeball this, but it's not easy.
You may interject now to point out that the total area should equal the aggregate count (sample size). But that is a fallacy! It's very easy to make this error. The aggregate count is actually the total height, and because of that, the total area is the aggregate count multiplied by the column width! In my example, the total height is 23,682, which is the number of goals in the dataset, while the total area is 23,682 times 5 minutes.
[For those who think in equations, the total area is the sum over all columns of height(i) x width(i). When width is constant, we can take it outside the sum, and the sum of height(i) is just the total count.]
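The same bookkeeping can be checked in a few lines of Python. The bin counts are made-up toy numbers (not the post's data); the last line simply repeats the arithmetic with the 23,682 goals and 5-minute bins quoted above.

```python
# Toy counts per 5-minute bin (made up for illustration; NOT the post's data).
counts = [2, 1, 1, 1, 1, 1, 1, 0, 4, 0, 1, 1, 1, 1, 1, 1, 1, 6]
bin_width = 5  # minutes, constant across columns

total_height = sum(counts)                        # the aggregate count
total_area = sum(c * bin_width for c in counts)   # sum of height(i) x width(i)

# With a constant width, the width factors out of the sum.
assert total_area == total_height * bin_width
print(total_height, total_area)

# The same arithmetic with the figures quoted in the post.
post_total_area = 23682 * 5
print(post_total_area)
```

For the football data, that works out to a total area of 118,410 (in units of goals times minutes), five times the sample size.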
The count histogram is hard to use because it requires knowing the sample size. It's the first thing that is produced because the raw data are counts in bins. The frequency histogram is better at delivering answers.
In the frequency histogram, the heights encode frequency data. We can therefore just read the probability off the heights of the orange columns, bypassing the need to compute the total area.
This workaround actually promotes the fallacy described above for the count histogram. It is easy to fall into the trap of thinking that the total area of all columns is 100%. It isn't.
Similar to before, the total height is the total frequency, which is 1, but the total area is the total frequency multiplied by the column width; that is to say, the total area equals the bin width. In the football example, using equal 5-minute intervals, the total area of the frequency histogram is 1 x 5 minutes = 5 minutes, not 1.
How about the relative probability question? On the frequency histogram, the ratio of column heights is the ratio of frequencies, which is exactly what we want. So long as the column width is constant, comparing column heights is easy.
One theme in the above discussion is that in the count and frequency histograms, the count and frequency data are encoded in the column heights but not the column areas. This is a source of major confusion. Because of the convention of using equal column widths, one treats areas and heights as interchangeable... but not always. The total column area isn't the same as the total column height.
This observation has some unsettling implications.
As shown above, the total area is affected by the column width. The column width in an equal-width histogram is the range of the x-values divided by the number of bins. Thus, the total area is a function of the number of bins.
Consider the following frequency histograms of the same scoring minutes dataset. The only difference is the number of bins used.
Increasing the number of bins has a series of effects:
- the columns become narrower
- the columns become shorter, because each narrower bin can contain at most the same count as the wider bin that contains it.
- the total area of the columns becomes smaller.
This last one is unexpected and completely messes up our intuition. When we increase the number of bins, not only are the columns shortening but the total area covered by all the columns is also shrinking. Remember that the total area, whether it is a count or frequency histogram, has a factor equal to the bin width. A higher number of bins means a smaller bin width, which means a smaller total area.
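To see the shrinking area numerically, here is a small sketch that rebuilds a frequency histogram of the same toy data with 3, 6, 9 and 18 equal-width bins. The data and the helper `frequency_histogram` are made up for illustration, and the helper assumes the number of bins divides the 90-minute range evenly.

```python
# Toy scoring minutes (made up for illustration; NOT the post's dataset).
minutes = [2, 4, 9, 13, 17, 22, 28, 33, 41, 44, 45, 45, 51,
           57, 62, 68, 73, 79, 84, 86, 88, 89, 90, 90, 90]


def frequency_histogram(values, n_bins, upper=90):
    """Return (bin_width, heights), where heights are relative frequencies.

    Assumes n_bins divides `upper` evenly, so all bins have equal width.
    """
    bin_width = upper // n_bins
    counts = [0] * n_bins
    for m in values:
        counts[(m - 1) // bin_width] += 1
    n = len(values)
    return bin_width, [c / n for c in counts]


areas = {}
tallest = {}
for n_bins in (3, 6, 9, 18):
    width, heights = frequency_histogram(minutes, n_bins)
    areas[n_bins] = sum(h * width for h in heights)   # total column area
    tallest[n_bins] = max(heights)                    # tallest column
    print(n_bins, width, round(tallest[n_bins], 3), round(areas[n_bins], 6))
```

In every case the total area comes out equal to the bin width, so it drops as the number of bins goes up, while the heights always sum to 1.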
What if we force the total area to be constant regardless of how many bins we use? This setting seems more intuitive: in the 5-bin histogram, we partition the total area into five parts while in the 10-bin histogram, we divide it into 10 parts.
This is the principle used by R and other statistical software when they produce so-called density histograms. The count and frequency data are encoded in the column areas - by implication, the same data could not have been encoded simultaneously in the column heights!
The way to accomplish this is to divide by the bin width. If you look at the total area formulas above, for the count histogram, total area is total count x bin width. If the height is count divided by bin width, then the total area is the total count. Similarly, if the height in the frequency histogram is frequency divided by bin width, then the total area is 100%.
Count divided by some section of the x-range is otherwise known as "density". It captures the concept of how tightly the data are packed inside a particular section of the dataset. Thus, in a count-density histogram, the heights encode densities while the areas encode counts. In this case, total area is the total count. If we want to standardize total area to be 1, then we should compute densities using frequencies rather than counts. Frequency densities are just count densities divided by the total count.
To summarize, in a frequency-density histogram, the heights encode densities, defined as frequency divided by the bin width. This is not very intuitive; just think of densities as how closely packed the data are in the specified bin. The column areas encode frequencies so that the total area is 100%.
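Here is the frequency-density calculation as a minimal sketch, again on made-up toy data and with an illustrative helper name (`density_histogram`). The height of each column is its frequency divided by the bin width, and the loop checks the total area for several bin counts.

```python
# Toy scoring minutes (made up for illustration; NOT the post's dataset).
minutes = [2, 4, 9, 13, 17, 22, 28, 33, 41, 44, 45, 45, 51,
           57, 62, 68, 73, 79, 84, 86, 88, 89, 90, 90, 90]


def density_histogram(values, n_bins, upper=90):
    """Return (bin_width, heights) with height = frequency / bin_width.

    Assumes n_bins divides `upper` evenly, so all bins have equal width.
    """
    bin_width = upper // n_bins
    counts = [0] * n_bins
    for m in values:
        counts[(m - 1) // bin_width] += 1
    n = len(values)
    return bin_width, [c / n / bin_width for c in counts]


total_areas = {}
total_heights = {}
for n_bins in (3, 6, 9, 18):
    width, heights = density_histogram(minutes, n_bins)
    total_areas[n_bins] = sum(h * width for h in heights)
    total_heights[n_bins] = sum(heights)
    print(n_bins, round(total_heights[n_bins], 4), round(total_areas[n_bins], 6))
```

The total area stays at 1 however many bins are used; it is now the sum of heights that depends on the bin width.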
The reason why density histograms are confusing is that we are reading off column heights while thinking that the total area should add up to 100%. Column heights and column areas cannot both add up to 100%. We have to pick one or the other.
Comparing relative column heights still works when the density histogram has equal bin widths. In this case, the relative height and relative area are the same because relative density equals relative frequencies if the bin width is fixed.
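A quick numeric check of this point, using two made-up bin counts and a hypothetical sample size (none of these numbers come from the football data):

```python
# Toy counts for two 5-minute bins (made up; NOT the post's data).
count_a, count_b = 6, 4   # e.g. minutes 86-90 and minutes 41-45
n = 25                    # hypothetical sample size
bin_width = 5             # minutes, the same for both bins

ratio_counts = count_a / count_b
ratio_frequencies = (count_a / n) / (count_b / n)
ratio_densities = (count_a / n / bin_width) / (count_b / n / bin_width)

# All three ratios agree (up to floating-point rounding) because n and
# bin_width are common factors that cancel.
print(ratio_counts, ratio_frequencies, ratio_densities)
```

The cancellation relies on both bins having the same width; with unequal widths the density ratio and the frequency ratio would differ.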
The following charts recap the discussion above. They show how the frequency histogram does not preserve the total area when bin sizes are changed while the density histogram does.
The density histogram is a major pain for solving range probability questions because the frequencies are encoded in the column areas, not the heights. Areas are not marked out in a graph.
The column height gives us densities which are not probabilities. In order to retrieve probabilities, we have to multiply the density by the bin width, that is to say, we must estimate the area of the column. That requires mapping two dimensions (width, height) onto one (area). It is in fact impossible without measurement - unless we make the bin widths constant.
When we make the bin widths constant, we still can't read densities off the vertical axis and treat them as probabilities. If I must use the density histogram to answer the question of how likely a team is to score in the first 15 minutes, I'd sum the heights of the first 3 columns, which is about 0.025, and then multiply it by the bin width of 5 minutes, which gives 0.125 or 12.5%.
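The recipe (sum the heights, then multiply by the bin width) can be sketched as follows. The data are made-up toy minutes, not the football dataset, so the resulting probability differs from the 12.5% above; the last lines repeat the post's own arithmetic.

```python
# Toy scoring minutes (made up for illustration; NOT the post's dataset).
minutes = [2, 4, 9, 13, 17, 22, 28, 33, 41, 44, 45, 45, 51,
           57, 62, 68, 73, 79, 84, 86, 88, 89, 90, 90, 90]
bin_width = 5  # minutes
n = len(minutes)

# Build the density heights: frequency divided by bin width.
counts = [0] * (90 // bin_width)
for m in minutes:
    counts[(m - 1) // bin_width] += 1
densities = [c / n / bin_width for c in counts]

# Probability of a goal in the first 15 minutes = area of the first 3 columns.
prob_from_density = sum(densities[:3]) * bin_width

# Cross-check by counting the goals directly.
prob_direct = sum(1 for m in minutes if m <= 15) / n
print(round(prob_from_density, 4), round(prob_direct, 4))

# The post's own reading: heights summing to about 0.025, times 5 minutes.
post_estimate = 0.025 * 5
print(round(post_estimate, 3))
```

Skipping the multiplication by the bin width is exactly the mistake of treating densities as probabilities.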
At the end of this exploration, I like the frequency histogram best. The density histogram is useful when we are comparing different histograms, which isn't the most common use case.
The histogram is a basic chart in the tool kit. It's more complicated than it seems. I haven't come across any intro dataviz books that explain this clearly.
Most of this post deals with equal-width histograms. If we allow bin widths to vary, it gets even more complicated. Stay tuned.
For those using base R graphics, I hope this post helps you interpret what they say in the manual. The default behavior of the "hist" function depends on whether the bins are equal width:
- if the bin width is constant, then R produces a count histogram. As shown above, in a count histogram, the column heights indicate counts in bins, but the total column area equals the total sample size multiplied by the bin width, not the total sample size itself. (Equal width is the default unless the user specifies bin breakpoints.)
- if the bin width is not constant, then R produces a (frequency-)density histogram. The column heights are densities, defined as frequencies divided by bin width while the column areas are frequencies, with the total area summing to 100%.
Unfortunately, R does not generate a frequency histogram. To make one, you'd have to divide the counts in bins by the sum of counts. (In making some of the graphs above, I tricked it.) To make a frequency-density histogram with equal-width bins, you have to override the default by setting freq = FALSE (or probability = TRUE), since "hist" produces a count histogram by default when the bin size is fixed.
P.S. [5-2-2023] As pointed out by a reader, I should clarify that R and I use the word "frequency" differently. Specifically, R uses frequency to mean counts; therefore, what I have been calling the "count histogram" R would call a "frequency histogram", and what I have been describing as a "frequency histogram" is something the "hist" function simply does not generate unless you trick it into doing so. I'm using "frequency" in the everyday sense of the word, such as "the frequency of the bus". In many statistical packages, frequency is used to mean "count", as in the frequency table, which is just a table of counts. The reader suggested "proportion", which I like, or something like "weight".