What's a histogram?

Almost all graphing tools make histograms, and almost all dataviz books cover the subject. But I've always felt there are many unanswered questions. In my talk this Thursday in NYC, I'll provide some answers. You can reserve a spot here.

***

Here's the most generic histogram:

Salaries_count_histogram

Even Excel can make this kind of histogram. Notice that we have counts in the y-axis. Is this really a useful chart?

I haven't found this type of histogram useful ever, since I don't do analyses in which I needed to know the exact count of something - when I analyze data, I'm generalizing from the observed sample to a larger group.

Speaking of Excel, I felt that the developers have always hated histograms. Why is it much harder to make histograms than other basic charts?

***

Another question. We often think of histograms as a crude approximation to a probability density function (PDF). An example of a PDF is the famous bell curve. Textbooks sometimes show the concept like this:

Histogram_normal_pdf

This is true of only some types of histograms (and not the one shown in the first section!) Instead, we often face the following situation:

Normals_histogram50_undercurve

This isn't a trick. The data in the histogram above were generated by sampling the pink bell curve.

***

If you've used histograms, you probably also have run into strange issues. I haven't found much materials out there to address these questions, and they have been lingering in my mind, hidden, for a long time.

My Thursday talk will hopefully fill in some of these gaps.


My talk next week on histograms

Next Thursday (March 14), I'll be presenting at the Data Visualization New York Meetup, hosted by Naomi and Cameron. The event is in-person at Datadog's office. You can reserve your spot here.

Kfung_dataviznewyorkmeetup_mar2024

This talk is brand new, based on some work inspired by a blog post by Andrew Gelman. One of Andrew's correspondents asked about a particular type of histogram. While exploring this topic, I filled some of my own gaps in knowledge about this deceptively simple chart form. I'll be sharing this story.

Bits and pieces have appeared before on my blog. See this, this, and this for background.

If you're attending the talk, come up and say hi.

To register, click here.


Lost in the middle class

Washington Post asks people what it means to be middle class in the U.S. (link; paywall)

The following graphic illustrates one type of definition, purely based on income ranges.

Wpost_middleclass

For me, this chart is more taxing to read than it appears.

It can be read column by column. Each column represents a hypotheticial annual income for a family of four. People are asked whether they consider that family lower/working class, middle class or upper class. Be careful as the increments from column to column are not uniform.

Now, what's the question again? We're primarily interested in what incomes constitute middle class.

So, we should be looking at the deep green blocks that hang in the middle of each column. It's not easy to read the proportion of middle blocks in a stacked column chart.

***

I tried separating out the three perceived income classes, using a small-multiples design.

Junkcharts_redo_wpost_middleclass

One can more directly see what income ranges are most popularly perceived as being in each income class.

***

The article also goes into alternative definitions of middle class, using more qualitative metrics, such as "able to pay all bills on time without worry". That's a whole other post.

 


The art of making simple things harder

It's no longer a shock when a TV network such as MSNBC plays loose with the scaling of the column heights, as in this recent example:

Rachelbitecofer_markp_2024candidatescashonhand

Hat tip to Mark P. for forwarding the image, and Rachel for the original tweet.

***

What's shocking is that the designer appears to believe that the column heights of a column chart can be determined without reference to the data.

There is not a single relationship that has been retained on this chart. The designer just picks whatever size column is desired.

One obvious distortion is between the Biden and Trump columns. Trump's number is about 1/3 of Biden's (120 vs 40), and yet the red column's height is 70% of the blue's.

Furthermore, amongst the red columns, the heights are also haphazard. Trump's number is almost 3 times larger than Haley's; the ratio of column heights is almost 4 times. Haley's number is just a tad higher than DeSantis and yet Haley's column is twice the height of DeSantis.

Junkcharts_msnbc_candidatecash_analysis

***

There is a further, subtle distortion of the column's widths. By curving the chart canvas, certain columns are widened more than others. The diagram above retains the distorted widths and you can see that the Desantis column is wider than that of Haley's.

Here is what the undistorted column chart looks like:

Junkcharts_redo_msnbc_candidatecash

It's easy to make such a chart in Excel or any charting software, so it's mystery why this type of distortion happens. Did the designer open up an empty canvas and start putting up columns of any size?


Elevator shoes for column charts

Continuing my review of some charts spammed to me, I wasn’t expecting to find any interest in the following:

Masterworks_chart4

It’s a column chart showing the number of years of data available for different asset classes. The color has little value other than to subtly draw the reader’s attention to the bar called “Art,” which is the focus of the marketing copy.

Do the column heights encode the data?

The answer is no.

***

Let’s take a little journey. First I notice there is a grid behind the column chart, hanging above the baseline.

Redo_masterworks4_grid
I marked out two columns with values 50 and 25, so the second column should be exactly half the height of the first. Each column consists of two parts, the first overlapping the grid while the second connecting the bottom of the grid to the baseline. The second part is a constant for every column; I label this distance Y.  

Against the grid, the column “50” spans 9 cells while the column “25” spans 4 cells. I label the grid height X. Now, if the first column is twice the height of the second, the equation: 9X + Y = 2*(4X+Y) should hold.

The only solution to this equation is X = Y. In other words, the distance between the bottom of the grid to the baseline must be exactly the height of one grid cell if the column heights were to faithfully represent the data. Well – it’s obvious that the former is larger than the latter.

In the revision, I have chopped off the excess height by moving the baseline upwards.

Redo_masterworks4_corrected

That’s the mechanics. Now, figuring out the motivation is another matter.


Chartjunk as marketing copy

I got some spam marketing message last week. How exciting. They even use a subject line that has absolutely nothing to do with its content, baiting me to open it. And open I did, to some data graphics horrors.

The marketer promises a whole series of charts to prove that art is a great asset class for investment returns.

The very first chart already caught my full attention. It's this one:

Masterworks_chart1

It's a simple bar chart, with four values. Looks innocuous.

I'm unable to appreciate the recent trend to align bars in the middle, rather than at their bases. So I converted it to the canonical form:

Redo_masterworks_1_barchart

Do you see the problem?

The second value ($1.7 trillion) is exactly half the size of the first value ($3.4 trillion) and yet the second bar is two-thirds of the length of the first bar. So, the size of the second bar is exaggerated relative to its label – and that’s the bar displaying the market size for “art,” which is what the spammer is pitching.

The bottom pair of values share the same relationship: $0.8 trillion is exactly half of $1.6 trillion. Again, the relative lengths of those two bars are not 50% but slightly over 60%.

Redo_masterworks_1_barchart_excess

Did the designer think that the bar lengths could be customized to whatever s/he desires? This one is hard to crack.

***

The sixth chart in the series is a different kind of puzzle:

Masterworks_chart6

All three lines have the exact same labels but show different values over time.

***

And they have pie charts, of course. Take a look:

Masterworks_chart

Something went wrong here too. I'll leave it to my readers who can certainly figure it out :)

***

These charts were probably spammed to at least thousands.

 


Two metrics in-fighting

The Wall Street Journal shows the following chart which pits two metrics against each other:

Wsj_salaries25to29

The primary metric is the change in median yearly salary between the two periods of time. We presume it's primary because of its presence in the chart title, and the blue bars being more readable than the green bubbles. The secondary metric is the median yearly salary in the later period.

That, I believe, was the intended design. When I saw this chart, my eyes went to the numbers inside the green bubbles. Perhaps it's because I didn't read the chart title first, and the horizontal axis wasn't labelled so it wasn't obvious what the blue bars coded.

As with most bubble charts, the data labels exist to cover up the inadequacy of circular areas. The self-sufficiency test - removing the data labels - shows this well:

Redo_wsj_salaries25to29

It's simply impossible to know what values should be in each bubble, or to perceive the relative sizes of those bubbles.

***

Reversing the order of the blue bars also helps:

Redo_wsjsalaries25to29_2

The original order is one of the more annoying features in most visualization packages. Because internally, the categories are numbered 1, 2, 3, ..., and because the convention is to have values run higher as they run up the vertical axis, these packages would place the top-ranked item at the bottom of the chart.

Most people read top to bottom, which means that they read the least important item first, and the most important item last!

In most visualization packages, it takes only 1 click or 1 action to reverse the order of the items. Please do it!

***

For change over time, I like using a Bumps chart, otherwise called a slope graph:

Redo_wsjsalaries25to29_3


An elaborate data vessel

Visualcapitalist_globaloilproductionI recently came across the following dataviz showing global oil production (link).

This is an ambitious graphic that addresses several questions of composition.

The raw data show the amount of production by country adding up to the global total. The countries are then grouped by region. Further, the graph presents an oil-and-gas specific grouping, as indicated by the legend shown just below the chart title. This grouping is indicated by the color of the circumference of the circle containing the flag of the country.

This chart form is popular in modern online graphics programs. It is like an elaborate data vessel. Because the countries are lined up around the barrel, a space has been created on three sides to admit labels and text annotations. This is a strength of this chart form.

***

The chart conveys little information about the underlying data. Each country is given a unique odd shaped polygon, making it impossible to compare sizes. It’s definitely possible to pick out U.S., Russia, Saudi Arabia as the top producers. But in presenting the ranks of the data, this chart form pales in comparison to a straightforward data table, or a bar chart. The less said about presenting values, the better.

Indeed, our self-sufficiency test exposes the inability of these polygons to convey the data. This is precisely why almost all values of the dataset are present on the chart.

***

The dataviz subtly presumes some knowledge on the part of the readers.

The regions are not directly labeled. The readers must know that Saudi Arabia is in the Middle East, U.S. is part of North America, etc. Admittedly this is not a big ask, but it is an ask.

It is also assumed that readers know their flags, especially those of smaller countries. Some of the small polygons have no space left for country names and they are labeled with just flags.

Visualcapitalist_globaloilproduction_nocountrylabels

In addition, knowing country acronyms is required for smaller countries as well. For example, in Africa, we find AGO, COG and GAB.

Visualcapitalist_globaloilproduction_countryacronyms

For this chart form the designer treats each country according to the space it has on the chart (except those countries that found themselves on the edges of the barrel). Font sizes, icons, labels, acronyms, data labels, etc. vary.

The readers are assumed to know the significance of OPEC and OPEC+. This grouping is given second fiddle, and can be found via the color of the circumference of the flag icons.

Visualcapitalist_globaloilproduction_opeclegend

I’d have not assigned a color to the non-OPEC countries, and just use the yellow and blue for OPEC and OPEC+. This is a little edit but makes the search for the edges more efficient.

Visualcapitalist_globaloilproduction_twoopeclabels

***

Let’s now return to the perception of composition.

In exactly the same manner as individual countries, the larger regions are represented by polygons that have arbitrary shapes. One can strain to compile the rank order of regions but it’s impossible to compare the relative values of production across regions. Perhaps this explains the presence of another chart at the bottom that addresses this regional comparison.

The situation is worse for the OPEC/OPEC+ grouping. Now, the readers must find all flag icons with edges of a specific color, then mentally piece together these arbitrarily shaped polygons, then realizing that they won’t fit together nicely, and so must now mentally morph the shapes in an area-preserving manner, in order to complete this puzzle.

This is why I said earlier this is an elaborate data vessel. It’s nice to look at but it doesn’t convey information about composition as readers might expect it to.

Visualcapitalist_globaloilproduction_excerpt


Losing the plot while stacking up the bars

I came across this chart from an infographics that claims to show which zip codes in the U.S. are the "dirtiest" (link). I won't go into the data analysis in this post - it's the usual "open data" style analysis that takes whatever data they could find (in this case, 311 calls) and make some hay out of it.

03_Dirtiest-Zip-Codes-in-New-York

It's amazing how such analyses frequently land on the Top N, Bottom N table. Top/Bottom N is euphemistically called "insights". But "insights" should answer at least one of these following questions: Where are these zip codes? What's the reason why 11216 has the highest rate of complaints while 11040 has the lowest? What measures can be taken to make the city cleaner?

***

The basic form chosen for this graphic is the bar chart. The data concerns the number of complaints per 100,000 people (about sanitation - they didn't disclose how they classified a complaint as about sanitation).

To mitigate the "boredom" of bar charts, the designer made the edges of the bars swiggly, and added icons of items found in trash inside the bars. These are thankfully not too intrusive.

Why are all the data printed on the chart? Try mentally wiping the data labels, and you'll understand why the designer did it.

If readers look at data labels rather than the bars, then the data visualization surely has failed. I'd prefer to use an axis

If you spend a few more minutes on the chart, you may notice the gray parts. This is not the simple bar chart but a stacked bar chart. In effect, every bar is referenced to the first bar, which shows the maximum number of complaints per 100K people. For example, zip code 10474 has about 90% of the complaints experienced in zip code 11216, the "dirtiest" place in New York.

***

The infographic then moves on to Los Angeles, and repeats the Top N/Bottom N presentation:

04_Dirtiest-Zip-Codes-in-Los-Angeles

With this, the plot is lost.

For an inexplicable reason, the dirtiest zip code in LA does not occupy the entire length of the bar. The worst zip code here fills out 87% of the bar length, implying that the entire bar represents the value of 34,978 complaints per 100K people. How did the designer decide on this number?

As a result, every other value is referenced to 34,978 and not to the rate of complaints in the dirtiest zip code!

***

The infographic eventually covers Houston. Here are the dirtiest two zip codes in Houston:

Housefresh_houston_dirtiest2

How does one interpret the orange section of the second bar? The original intention is for us to see that this zip code is about 80% as dirty as the dirtiest zip code. However, the full length of the bar does not here represent the dirtiest zip code.

***

We also got a hint as to why this entire analysis is problematic. The values in LA are way bigger than those in NY, about 4 times higher at the top of the table. Is LA really that much dirtier than NY? Or perhaps the data have not been properly aligned between cities?

 

P.S. [8-26-2023] Added link to the infographic.

 


Partition of Europe

A long-time reader sent me the following map via twitter:

Europeelects_map

This map tells how the major political groups divide up the European Parliament. I’ll spare you the counting. There are 27 countries, and nine political groups (including the "unaffiliated").

The key chart type is a box of dots. Each country gets its own box. Each box has its own width. What determines the width? If you ask me, it’s the relative span of the countries on the map. For example, the narrow countries like Ireland and Portugal have three dots across while the wider countries like Spain, Germany and Italy have 7, 10 and 8 dots across respectively.

Each dot represents one seat in the Parliament. Each dot has one of 9 possible colors. Each color shows a political lean e.g. the green dots represent Green parties while the maroon dots display “Left” parties.

The end result is a counting game. If we are interested in counts of seats, we have to literally count each dot. If we are interested in proportion of seats, take your poison: either eyeball it or count each color and count the total.

Who does the underlying map serve? Only readers who know the map of Europe. If you don’t know where Hungary or Latvia is, good luck. The physical constraints of the map work against the small-multiples set up of the data. In a small multiples, you want each chart to be identical, except for the country-specific data. The small-multiples structure requires a panel of equal-sized cells. The map does not offer this feature, as many small countries are cramped into Eastern Europe. Also, Europe has a few tiny states e.g. Luxembourg (population 660K)  and Malta (population 520K). To overcome the map, the designer produces boxes of different sizes, substantially loading up the cognitive burden on readers.

The map also dictates where the boxes are situated. The centroids of each country form the scaffolding, with adjustments required when the charts overlap. This restriction ensures a disorderly appearance. By contrast, the regular panel layout of a small multiples facilitates comparisons.

***

Here is something I sketched using a tile map.

Eu parties print sm

First, I have to create a tile map of European countries. Some parts, e.g. western part, are straightforward. The eastern side becomes very congested.

The tile map encodes location in an imprecise sense. Think about the scaffolding of centroids of countries referred to prior. The tile map imposes an order to the madness - we're shifting these centroids so that they line up in a tidier pattern. What we gain in comparability we concede in location precision.

For the EU tile map, I decided to show the Baltic countries in a row rather than a column; the latter would have been more faithful to the true geography. Malta is shown next to Italy even though it could have been placed below. Similarly, Cyprus in relation to Greece. I also included several key countries that are not part of the EU for context.

Instead of raw seat counts, I'm showing the proportion of seats within each country claimed by each political group. I think this metric is more useful to readers.

The legend is itself a chart that shows the aggregate statistics for all 27 countries.