
Equal-area histograms

Andrew posted about a message from one of his readers about a "percentogram", which is a variant of the histogram (link).

Let's review what a histogram is. I created this one from a dataset of scoring in international football (soccer) matches.

Junkcharts_histogram_scoringminute

The histogram is a specialized type of column chart, typically displayed with zero spacing between columns. The horizontal axis represents the metric being plotted. In the example above, it's the minute of scoring with values between 0 and 120. (There are few points beyond 90 minutes as only certain tournaments prescribe extra-time in case of ties at the end of 90 minutes.) Other examples are income if it's a histogram of income distribution, or age if it's an age distribution. The metric is typically measured numerically.

The notion of "bins" is fundamental. The numeric metric on the horizontal axis is discretized into "bins". (In the above example, R selected a bin width of 10 minutes.) Binning transforms the data to fit natural concepts such as income groups and age groups. Each column has a bin width, and its height shows either a count or a frequency. Counts are the number of data points that fall inside a given bin. Frequencies are the proportion of data points that fall inside a given bin. The sum of counts equals the sample size while the sum of frequencies is 100%.

There is a large literature on how to "bin" the data. In the canonical histogram, the horizontal axis is subdivided into bins of equal length. This means the histogram columns are equal width. You can readily see that in the histogram shown above.
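
Here's a minimal sketch of the mechanics in R. The minutes vector below is simulated stand-in data, not the actual scoring dataset:

    # stand-in for the scoring data: minute of each goal
    minutes <- sample(1:90, 500, replace = TRUE)
    breaks  <- seq(0, 120, by = 10)          # equal-width bins of 10 minutes
    counts  <- table(cut(minutes, breaks))   # counts sum to the sample size
    freqs   <- counts / length(minutes)      # frequencies sum to 100%
    hist(minutes, breaks = breaks)           # the canonical equal-width histogram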

The histogram conveys information about where the data is. The following set includes three possibilities: a uniform distribution (in which the data density is the same over the range of the horizontal axis), a normal distribution (the familiar bell curve), and a long-tailed distribution with mostly small values.

Junkcharts_histograms_panel

***

In general, the bin widths need not be the same for all bins. The "percentogram" referenced in Andrew's post specifies one way of setting varying bin widths: it prescribes that the breaks between bins be percentiles of the distribution.

In other words, the first bin contains the lowest 1% of the data, the second bin, the next 1%, etc. Let's see the effect of varying bin widths on the normal distribution. Note that I set the average value to be 4 so that almost all the data fall between 1 and 7.
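
Here's a minimal sketch of the construction in R, using simulated data (normal with mean 4 and standard deviation 1, which is my assumption for the charts below):

    set.seed(1)
    x <- rnorm(10000, mean = 4, sd = 1)                  # normal centered at 4
    breaks <- quantile(x, probs = seq(0, 1, by = 0.01))  # breaks at the percentiles
    hist(x, breaks = breaks, freq = FALSE)               # percentogram: 100 equal-count bins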

Junkcharts_histograms_normals

The top chart is a dot plot showing the actual data. The middle chart shows a generic equal-width histogram. The third chart shows the percentogram, in which each bin contains 1% of the data.

In the percentogram, the bin width is a function of the density of data, with wider bins where data are sparse and narrower bins where data are dense. For a normal distribution, the data are quite concentrated in the middle. The columns on the side are wide and short.

While the standard histogram has equal-width bins, the percentogram has bin widths that vary. What is fixed is the amount of data in each column, that is to say, the area of each column is fixed.
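
We can check the equal-area property directly: a column's height (the density) times its width equals the share of data in the bin. Continuing the sketch above:

    h <- hist(x, breaks = breaks, plot = FALSE)
    areas <- h$density * diff(h$breaks)   # each element is 0.01, i.e. 1% of the data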


The following set of charts corresponds to the triple shown above. Each chart contains 10,000 random samples. The same datasets were used in both sets of charts.

Junkcharts_standardhistograms_panel

A few negatives of the percentogram jump out.

The column heights are rather jagged, and that is purely random noise, since the data are drawn from standard distributions. Because the data are divided into 100 parts, each column contains 100 data points, and that sample size is not large enough to tame the sampling noise.

Also, the middle of the normal distribution, where most of the data sit, looks hollow. This is a feature, not a bug, of the percentogram. Dense columns will be narrow and tall, like lines, while sparse columns will be wide and short. Even though each column has the same area, our eyes tend to focus on the short, wide ones rather than the pencil columns.

***

Nothing should stop us from making equal-area bins based on other quantiles. For example, the following set divides the data into 20 bins (demideciles) instead of 100:

Junkcharts_equalareahistograms20
With a smaller number of bins, the envelope of the histograms is less jagged. The denser columns are also less narrow, and thus don't exhibit the hollowing effect.
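
In the sketch shown earlier, only the probs argument changes:

    breaks20 <- quantile(x, probs = seq(0, 1, by = 0.05))  # breaks at demideciles
    hist(x, breaks = breaks20, freq = FALSE)               # 20 equal-count bins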

For validation, the equal-area histogram for the uniform distribution looks the same as the equal-width histogram. This is expected since, by definition, the data density is uniform across the whole range. Columns that contain equal counts should therefore have equal widths, as the heights should be equal (up to randomness).
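
A quick check, using the same simulation setup as before:

    u <- runif(10000)   # uniform data: equal-count bins come out (almost) equal-width
    hist(u, breaks = quantile(u, probs = seq(0, 1, by = 0.05)), freq = FALSE)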

What's a use case for equal-area histograms? It's for reading off sections of the data.

Junkcharts_equalareahistogram_usecase

With 20 bins, each column contains 5% of the data. So it's easy to find the cutoff value for the top 10% of the above distribution: just count two columns from the right. Finding the middle 50% is a bit harder unless the column indices are printed on the chart, but it's still possible to find the range of values that contains half the dataset.
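
Of course, anyone holding the raw data can get these cutoffs in one line; the chart's advantage is that readers can recover them visually. Using the simulated sample from before:

    quantile(x, 0.90)           # cutoff for the top 10%: count two columns from the right
    quantile(x, c(0.25, 0.75))  # interval containing the middle 50% of the data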

By contrast, the standard histogram does not offer a ready answer to this type of question. One would have to look at the height of each column and start adding.

***

I have a few other comments on this variant of the histogram as well as some details I glossed over in this post. Stay tuned!



Deconstructing graphics as an analysis tool in dataviz

One of the useful exercises I like to do with charts is to "deconstruct" them. (This amounts to a deeper version of the self-sufficiency test.)

Here is a chart stripped down to just the main visual elements.

Junkcharts_cbcrevenues_deconstructed1

The game is to guess the structure of the data given these visual elements.

I guessed the following:

  • The data has a top-level split into two groups
  • Within each group, the data is further split into 3 parts, corresponding to the 3 columns
  • Within each part, there are a variable number of subparts, each of which is given a unique color
  • The color legend suggests that each group's data are split into 7 subparts, so I'm guessing that the 7 subparts are aggregated into 3 parts
  • The core chart form is a stacked column chart with absolute values, so relative proportions within each column (part) are important
  • Comparing across columns is not supported because each column has its own total value
  • Comparing same-color blocks across the two groups is meaningful. It's easier to compare their absolute values but harder to compare the relative values (proportions of total)

If I knew that the two groups are time periods, I'd also guess that the group on the left is the earlier time period, and the one on the right is the later time period. In addition to the usual left-to-right convention for time series, the columns are getting taller going left to right. Many things (not all, obviously) grow over time.

The color choice is a bit confusing because if the subparts are what I think they are, then it makes more sense to use one color and different shades within each column.

***

The above guesses are a mixed bag. What one learns from the exercise is what cues readers are receiving from the visual structure.

Here is the same chart with key contextual information added back:

Junkcharts_cbcrevenues_deconstructed2

Now I see that the chart concerns revenues of a business over two years.

My guess on the direction of time was wrong. The more recent year is placed on the left, counter to convention. This entity therefore suffered a loss of revenues from 2017-8 to 2018-9.

The entity receives substantial government funding. In 2017-8, it had 1 dollar of government funds for every 2 dollars of revenues (50%). In 2018-9, it was roughly 2 dollars of government funds for every 3 dollars of revenues (about 67%). Thus, the ratio of government funding to revenues has increased.

On closer inspection, the 7 colors do not represent 7 components of this entity's funding. The categories listed in the color legend overlap.

It's rather confusing, and I missed one very important feature of the chart in my first assessment: the three columns within each year group are nested. The second column breaks down revenues into three parts, while the third column subdivides advertising revenues into two parts.

What we've found is that this design does not offer any visual cues to help readers understand how the three columns within a year group relate to each other. Adding guiding lines or changing the color scheme would help.

***

Next, I add back the data labels:

Cbc_revenues_original

The system of labeling can be described as: label everything that is not further broken down into parts on the chart.

Because of the nested structure, this means two of the column segments, which are the sums of subparts, are not labeled. This creates a very strange appearance: usually, the largest parts are split into subparts, so such a labeling system means the largest parts are not labeled while the smaller, less influential subparts are labeled!

You may notice another oddity. The pink segment is well above $1 billion but it is roughly the size of the third column, which represents $250 million. Thus, these columns are not drawn to scale. What happened? Keep reading.

***

Here is the whole chart:

Cbc_revenues_original

A Twitter follower sent me this chart. Elon Musk has been feuding with the Canadian broadcaster CBC.

Notice the scale of the vertical axis. It has a discontinuity between $700 million and $1.7 billion. In other words, the two pink sections are artificially shortened. The erased section contains $1 billion(!), which makes it larger than the visible section.

The focus of Musk's feud with CBC is on what proportion of the company's funds come from the government. On this chart, the only way to figure that out is to copy out the data and divide: roughly 1.2/1.7, or about 70%.

***

The exercise of deconstructing graphics helps us understand what parts are doing what, and it also reveals what cues certain parts send to readers.

In better dataviz, every part of the chart is doing something useful, it's free of redundant parts that take up processing time for no reason, and the cues to readers move them towards the intended message, not away from it.

***

A couple of additional comments:

I'm not sure why old data was cited because in the most recent accounting report, the proportion of government funding was around 65%.

Source of funding is not a useful measure of pro- or anti-government bias, especially in a democracy where different parties lead the government at different times. There are plenty of mouthpiece media that apparently do not receive government funding.


Showing both absolute and relative values on the same chart 2

In the previous post, I looked at Visual Capitalist's visualization of the amount of uninsured deposits at U.S. banks. Using a stacked bar chart, I placed both absolute and relative values on the same chart.

In making that chart, I made these three tradeoffs.

First, I elevated absolute values (dollar amounts) over relative values (proportions). The original designer decided the opposite.

Second, I elevated the TBTF banks over the smaller banks. The original designer also decided the opposite.

Third, I elevated the total value over the disaggregated values (insured, uninsured). The original designer only visualized the uninsured values in the bars.

Which chart is better depends on what story one wants to tell.

***
For today's post, I'm showing another sketch of the same data, with the same goal of putting both absolute and relative values on the same chart.

Redo_visualcapitalist_uninsureddeposits_2b

The starting point of this sketch is the original chart - the stacked bar chart showing relative proportions. I added the insured portion so that it is on almost equal footing with the uninsured portion of the deposits. This edit is crucial to conveying the impression of proportions.

My story hasn't changed; I still want to elevate the TBTF banks.

For this version, I try a different way of elevating TBTF banks. The key step is to encode data into the heights of the bars. I use these bar heights to convey the relative importance of banks, as reflected by total deposits.

The areas of the red blocks represent the uninsured amounts. That said, it's not easy to compare rectangular areas when both dimensions are different.
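
Here's a rough sketch of this construction in base R. The numbers are made up for illustration; they are not the actual deposit data:

    # hypothetical figures: share uninsured and total deposits for three banks
    uninsured <- c(0.85, 0.60, 0.50)           # proportions uninsured (made up)
    totals    <- c(100, 900, 1400)             # total deposits, $ billions (made up)
    barplot(rbind(uninsured, 1 - uninsured),   # each bar split into uninsured/insured
            width = totals, horiz = TRUE,      # bar thickness encodes total deposits
            col = c("red", "yellow"),
            names.arg = c("Bank A", "Bank B", "Bank C"))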

Comparing the total red area with the total yellow area, we learn that the majority of deposits in these banks are uninsured(!)



Showing both absolute and relative values on the same chart 1

Visual Capitalist has a helpful overview on the "uninsured" deposits problem that has become the talking point of the recent banking crisis. Here is a snippet of the chart that you can see in full at this link:

Visualcapitalist_uninsureddeposits_top

This is in infographics style. It's a bar chart that shows the top X banks. Even though the headline says "by uninsured deposits", the sort order is really based on the proportion of deposits that are uninsured, i.e. residing in accounts that exceed $250K. They used red to highlight the two failed banks, both of which have at least 90% of deposits uninsured.

The right column provides further context: the total amounts of deposits, presented both as a list of numbers and as a column of bubbles. As readers know, bubbles are not self-sufficient: if the list of numbers were removed, the bubbles would lose most of their power of communication. Big, small, but how much smaller?

There are little nuggets of text in various corners that provide other information.

Overall, this is a pretty good one as far as infographics go.

***

I'd prefer to elevate information about the Too Big to Fail banks (which are hiding in plain sight). Addressing this surfaces the usual battle between relative and absolute values. While the smaller banks have some of the highest concentrations of uninsured deposits, each TBTF bank has multiples of the absolute dollars of uninsured deposits as the smaller banks.

Here is a revised version:

Redo_visualcapitalist_uninsuredassets_1

The banks are still ordered in the same way, by the proportions of uninsured deposits. The data being plotted are not the proportions but the actual deposit amounts. Thus, the three TBTF banks (Citibank, Chase and Bank of America) stand out from the crowd. Aside from Citibank, the other two have relatively moderate proportions of uninsured assets, but the sizes of the red bars for any of these three dominate those of the smaller banks.

Notice that I added the gray segments, which portray the amount of deposits that are FDIC-protected. I did this not just to show the relative sizes of the banks. Having the other part of the deposits allows readers to answer additional questions, such as: which banks have the most insured deposits? The segments also visually present the relative proportions.

***

The most amazing part of this dataset is the amount of uninsured money. I'm trying to think who these account holders are. It would seem like a very small collection of people and/or businesses would be holding these accounts. If they are mostly businesses, is FDIC insurance designed to protect business deposits? If they are mostly personal accounts, then surely only very wealthy individuals hold most of these accounts.

In the above chart, I'm assuming that deposits and assets are referring to the same thing. This may not be the correct interpretation. Deposits may be only a portion of the assets. It would be strange though that the analysts only have the proportions but not the actual deposit amounts at these banks. Nevertheless, until proven otherwise, you should see my revision as a sketch - what you can do if you have both the total deposits and the proportions uninsured.


Bivariate choropleths

A reader submitted a link to Joshua Stephens' post about bivariate choropleths, which is the technical term for the map that FiveThirtyEight printed on abortion bans, discussed here. Joshua advocates greater usage of maps with two-dimensional color scales.

As a reminder, the fundamental building block is expressed in this bivariate color legend:

Fivethirtyeight_abortionmap_colorlegend

Counties are classified into one of these nine groups, based on low/middle/high ratings on two dimensions, distance and congestion.

The nine groups are given nine colors, built from superimposing shades of green and pink. All nine colors are printed on the same map.
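
One way to build such a palette is to mix a green ramp with a pink ramp, channel by channel. Here's a rough sketch in R; the colors are my own picks to illustrate the idea, not the exact colors used by FiveThirtyEight:

    greens <- colorRampPalette(c("white", "darkgreen"))(3)  # low/mid/high on dimension 1
    pinks  <- colorRampPalette(c("white", "deeppink"))(3)   # low/mid/high on dimension 2
    blend  <- function(g, p) rgb(t((col2rgb(g) + col2rgb(p)) / 2), maxColorValue = 255)
    palette9 <- outer(greens, pinks, blend)                 # the 3 x 3 bivariate legend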

Joshuastephens_singlemap

Without a doubt, using these nine related colors is better than using nine arbitrary colors. But is this a good data visualization?

Specifically, is the above map better than the pair of maps below?

Joshuastephens_twomaps

The split map is produced by Josh to explain that the bivariate choropleth is just the superposition of two univariate choropleths. I much prefer the split map to the superimposed one.

***

Think about what the reader goes through when comparing two counties.

Junkcharts_bivariatechoropleths

Superimposing the two univariate maps solves one problem: it removes the need to scan back and forth between two maps, looking for the same locations, which is imprecise. (Unless the map is interactive, and highlighting a county in one map highlights the same county in the other.)

For me, that back-and-forth scanning is a small price to pay for the quicker translation of color into information that the split maps offer.