Adjust, and adjust some more

This Financial Times report illustrates the reason why we should adjust data.

The story explores the trend in economic statistics during 14 years of governing by conservatives. One of those metrics is so-called council funding (local governments). The graphic is interactive: as the reader scrolls the page, the chart transforms.

The first chart shows the "raw" data.

Ft_councilfunding1

The vertical axis shows year-on-year change in funding. It is an index relative to the level in 2010. From this line chart, one concludes that council funding decreased from 2010 to around 2016, then grew; by 2020, funding has recovered to the level of 2010 and then funding expanded rapidly in recent years.

When the reader scrolls down, this chart is replaced by another one:

Ft_councilfunding2

This chart contains a completely different picture. The line dropped from 2010 to 2016 as before. Then, it went flat, and after 2021, it started raising, even though by 2024, the value is still 10 percent below the level in 2010.

What happened? The data journalist has taken the data from the first chart, and adjusted the values for inflation. Inflation was rampant in recent years, thus, some of the raw growth have been dampened. In economics, adjusting for inflation is also called expressing in "real terms". The adjustment is necessary because the same dollar (hmm, pound) is worth less when there is inflation. Therefore, even though on paper, council funding in 2024 is more than 25 percent higher than in 2010, inflation has gobbled up all of that and more, to the point in which, in real terms, council funding has fallen by 20 percent.

This is one material adjustment!

Wait, they have a third chart:

Ft_councilfunding3

It's unfortunate they didn't stabilize the vertical scale. Relative to the middle chart, the lowest point in this third chart is about 5 percent lower, while the value in 2024 is about 10 percent lower.

This means, they performed a second adjustment - for population change. It is a simple adjustment of dividing by the population. The numbers look worse probably because population has grown during these years. Thus, even if the amount of funding stayed the same, the money would have to be split amongst more people. The per-capita adjustment makes this point clear.

***

The final story is much different from the initial one. Not only was the magnitude of change different but the direction of change reversed.

Whenever it comes to adjustments, remember that all adjustments are subjective. In fact, choosing not to adjust is also subjective. Not adjusting is usually much worse.

 

 

 

 


Chart without an axis

When it comes to global warming, most reports cite a single number such as an average temperature rise of Y degrees by year X. Most reports also claim the existence of a consensus within scientists. The Guardian presented the following chart that shows the spread of opinions amongst the experts.

Guardian_globalwarming

Experts were asked how many degrees they expect average global temperature to increase by 2100. The estimates ranged from "below 1.5 degrees" to "5 degrees or more". The most popular answer was 2.5 degrees. Roughly three out of four respondents picked a number at 2.5 degrees or above. The distribution is close to symmetric around the middle.

***

What kind of chart is this?

It's a type of histogram, given that the horizontal axis shows binned ranges of temperature change while the vertical axis shows number of respondents (out of 380).

A (count) histogram typically encodes the count data in the vertical axis. Did you notice there isn't a vertical axis?

That's because the chart has an abnormal axis. Each of the 380 respondents is shown here as a cell. What looks like a "column" is actually two-dimensional. Each row of cells has 10 slots. To find out how many respondents chose the 2.5 celsius category, you count the number of rows and then the number of stray items on top. (It's 132.)

Only the top row of cells can be partially filled so the general shape of the distribution isn't affected much. However, the lack of axis labels makes it hard to learn the count of each column.

It's even harder to know the proportions of respondents, which should be the primary message of the chart. The proportion would have been possible to show if the maximum number of rows was set to 38. The maximum number of rows on the above chart is 22. Using 38 rows leads to a chart with a lot of white space as the tallest column (count of 132) is roughly 35% of the total response.

At the end, I'm not sure this variant of histogram beats the standard histogram.


Aligning V and Q by way of D

In the Trifecta Checkup (link), there is a green arrow between the Q (question) and V (visual) corners, indicating that they should align. This post illustrates what I mean by that.

I saw the following chart in a Washington Post article comparing dairy milk and plant-based "milks".

Vitamins

The article contains a whole series of charts. The one shown here focuses on vitamins.

The red color screams at the reader. At first, it appears to suggest that dairy milk is a standout on all four categories of vitamins. But that's not what the data say.

Let's take a look at the chart form: it's a grid of four plots, each containing one square for each of four types of "milk". The data are encoded in the areas of the squares. The red and green colors represent category labels and do not reflect data values.

Whenever we make bubble plots (the closest relative of these square plots), we have to solve a scale problem. What is the relationship between the scales of the four plots?

I noticed the largest square is the same size across all four plots. So, the size of each square is made relative to the maximum value in each plot, which is assigned a fixed size. In effect, the data encoding scheme is that the areas of the squares show the index values relative to the group maximum of each vitamin category. So, soy milk has 72% as much potassium as dairy milk while oat and almond milks have roughly 45% as much as dairy.

The same encoding scheme is applied also to riboflavin. Oat milk has the most riboflavin, so its square is the largest. Soy milk is 80% of oat, while dairy has 60% of oat.

***

_trifectacheckup_imageLet's step back to the Trifecta Checkup (link). What's the question being asked in this chart? We're interested in the amount of vitamins found in plant-based milk relative to dairy milk. We're less interested in which type of "milk" has the highest amount of a particular vitamin.

Thus, I'd prefer the indexing tied to the amount found in dairy milk, rather than the maximum value in each category. The following set of column charts show this encoding:

Junkcharts_redo_msn_dairyplantmilks_2

I changed the color coding so that blue columns represent higher amounts than dairy while yellow represent lower.

From the column chart, we find that plant-based "milks" contain significantly less potassium and phosphorus than dairy milk while oat and soy "milks" contain more riboflavin than dairy. Almond "milk" has negligible amounts of riboflavin and phosphorus. There is vritually no difference between the four "milk" types in providing vitamin D.

***

In the above redo, I strengthen the alignment of the Q and V corners. This is accomplished by making a stop at the D corner: I change how the raw data are transformed into index values. 

Just for comparison, if I only change the indexing strategy but retain the square plot chart form, the revised chart looks like this:

Junkcharts_redo_msn_dairyplantmilks_1

The four squares showing dairy on this version have the same size. Readers can evaluate the relative sizes of the other "milk" types.


The cult of raw unadjusted data

Long-time reader Aleks came across the following chart on Facebook:

Unadjusted temp data fgfU4-ia fb post from aleks

The author attached a message: "Let's look at raw, unadjusted temperature data from remote US thermometers. What story do they tell?"

I suppose this post came from a climate change skeptic, and the story we're expected to take away from the chart is that there is nothing to see here.

***

What are we looking at, really?

"Nothing to see" probably refers to the patch of blue squares that cover the entire plot area, as time runs left to right from the 1910s to the present.

But we can't really see what's going on in the middle of the patch. So, "nothing to see" is effectively only about the top-to-bottom range of roughly 29.8 to 82.0. What does that range signify?

The blue patch is subdivided into vertical lines consisting of blue squares. Each line is a year's worth of temperature measurements. Each square is the average temperature on a specific day. The vertical range is the difference between the maximum and minimum daily temperatures in a given year. These are extreme values that say almost nothing about the temperatures in the other ~363 days of the year.

We know quite a bit more about the density of squares along each vertical line. They are broken up roughly by seasons. Those values near the top came from summers while the values near the bottom came from winters. The density is the highest near the middle, where the overplotting is so severe that we can barely see anything.

Within each vertical line, the data are not ordered chronologically. This is a very key observation. From left to right, the data are ordered from earliest to latest but not from top to bottom! Therefore, it is impossible for the human eye to trace the entire trajectory of the daily temperature readings from this chart. At best, you can trace the yearly average temperature – but only extremely roughly by eyeballing where the annual averages are inside the blue patch.

Indeed, there is "nothing to see" on this chart because its design has pulverized the data.

***

_numbersense_bookcoverIn Numbersense (link), I wrote "not adjusting the raw data is to knowingly publish bad information. It is analogous to a restaurant's chef knowingly sending out spoilt fish."

It's a fallacy to think that "raw unadjusted" data are the best kind of data. It's actually the opposite. Adjustments are designed to correct biases or other problems in the data. Of course, adjustments can be subverted to introduce biases in the data as well. It is subversive to presume that all adjustments are of the subversive kind.

What kinds of adjustments are of interest in this temperature dataset?

Foremost is the seasonal adjustment. See my old post here. If we want to learn whether temperatures have risen over these decades, we can't do so without separating out the seasons.

The whole dataset can be simplified by drawing the smoothed annual average temperature grouped by season of the year, and when that is done, the trend of rising temperatures is obvious.

***

The following chart by the EPA roughly implements the above:

Epa-seasonal-temperature_2022

The original can be found here. They made one adjustment which isn't the one I expected.

Note the vertical scale is titled "temperature anomaly". So, they are not plotting the actual recorded average temperatures, but the "anomalies", i.e. the difference between the recorded temperatures and some kind of "expected" temperature. This is a type of data adjustment as well. The purpose is to focus attention on the relative rather than absolute values. Think of this formula: recorded value = expected value + anomaly. The chart shows how many degrees above or below expectation, rather than how many degrees.

For a chart like this, there should be a required footnote that defines what "anomaly" is. Specifically, the reader should know about the model behind the "expectation". Typically, it's a kind of long-term average value.

For me, this adjustment is not necessary. Without the adjustment, the four panels can be combined into one panel with four lines. That's because the data nicely fit into four levels based on seasons.

The further adjustment I'd have liked to see is "smoothing". Each line above has a "smooth" trend, as well as some variability around this trend. The latter is not a big part of the story.

***

It's weird to push back on climate change advocacy by attacking data adjustments. The more productive direction, in my view, is to ask whether the observed trend is caused by human activities or part of some long-term up-and-down cycle. That is a very challenging question to answer.


Several tips for visualizing matrices

Continuing my review of charts that were spammed to my inbox, today I look at the following visualization of a matrix of numbers:

Masterworks_chart9

The matrix shows pairwise correlations between the returns of 16 investment asset classes. Correlation is a number between -1 and 1. It is a symmetric scale around 0. It embeds two dimensions: the magnitude of the correlation, and its direction (positive or negative).

The correlation matrix is a special type of matrix: a bit easier to deal with as the data already come “standardized”. As with the other charts in this series, there is a good number of errors in the chart's execution.

I’ll leave the details maybe for a future post. Just check two key properties of a correlation matrix: the diagonal consisting of self-correlations should contain all 1s; and the matrix should be symmetric across that diagonal.

***

For this post, I want to cover nuances of visualizing matrices. The chart designer knows exactly what the message of the chart is - that the asset class called "art" is attractive because it has little correlation with other popular asset classes. Regardless of the chart's errors, it’s hard for the reader to find the message in the matrix shown above.

That's because the specific data carrying the message sit in the bottom row (and the rightmost column). The cells in this row (and column) has a light purple color, which has been co-opted by the even lighter gray color used for the diagonal cells. These diagonal cells pop out of the chart despite being the least informative (they have the same values for all correlation matrices!)

***

Several tactics can be deployed to push the message to the fore.

First, let's bring the key data to the prime location on the chart - this is the top row and left column (for cultures which read top to bottom, left to right).

Redo_masterwork9_matrix_arttop

For all the drafts in this post, I have dropped the text descriptions of the asset classes, and replaced them with numbers so that it's easier to follow the changes. (For those who're paying attention, I also edited the data to make the matrix symmetric.)

Second, let's look at the color choice. Here, the designer made a wise choice of restricting the number of color levels to three (dark, medium and light). I retained that decision in the above revision - actually, I used four colors but there are no values in one of the four sections, therefore, effectively, only three colors appear. But let's look at what happens when the number of color levels is increased.

Redo_masterwork9_matrix_colors

The more levels of color, the more strain it puts on our processing... with little reward.

Third, and most importantly, the order of the categories affects perception majorly. I have no idea what the designer used as the sorting criterion. In step one of the fix, I moved the art category to the front but left all the other categories in the original order.

The next chart has the asset classes organized from lowest to highest average correlation. Conveniently, using this sorting metric leaves the art category in its prime spot.

Redo_masterwork9_matrix_orderbyavg

Notice that the appearance has completely changed. The new version brings out clusters in the data much more effectively. Most of the assets in the bottom of the chart have high correlation with each other.

Finally, because the correlation matrix is symmetric across the diagonal of self-correlations, the two halves are mirror images and thus redundant. The following removes one of the mirrored halves, and also removes the diagonal, leading to a much cleaner look.

Redo_masterwork9_matrix_orderbyavg_tri

Next time you visualize a matrix, think about how you sort the rows/columns, how you choose the color scale, and whether to plot the mirrored image and the diagonal.

 

 

 


Elevator shoes for column charts

Continuing my review of some charts spammed to me, I wasn’t expecting to find any interest in the following:

Masterworks_chart4

It’s a column chart showing the number of years of data available for different asset classes. The color has little value other than to subtly draw the reader’s attention to the bar called “Art,” which is the focus of the marketing copy.

Do the column heights encode the data?

The answer is no.

***

Let’s take a little journey. First I notice there is a grid behind the column chart, hanging above the baseline.

Redo_masterworks4_grid
I marked out two columns with values 50 and 25, so the second column should be exactly half the height of the first. Each column consists of two parts, the first overlapping the grid while the second connecting the bottom of the grid to the baseline. The second part is a constant for every column; I label this distance Y.  

Against the grid, the column “50” spans 9 cells while the column “25” spans 4 cells. I label the grid height X. Now, if the first column is twice the height of the second, the equation: 9X + Y = 2*(4X+Y) should hold.

The only solution to this equation is X = Y. In other words, the distance between the bottom of the grid to the baseline must be exactly the height of one grid cell if the column heights were to faithfully represent the data. Well – it’s obvious that the former is larger than the latter.

In the revision, I have chopped off the excess height by moving the baseline upwards.

Redo_masterworks4_corrected

That’s the mechanics. Now, figuring out the motivation is another matter.


One bubble is a tragedy, and a bag of bubbles is...

From Kathleen Tyson's twitter account, I came across a graphic showing the destinations of Ukraine's grain exports since 2022 under the auspices of a UN deal. This graphic, made by AFP, uses one of the chart forms that baffle me - the bag of bubbles.

Ukraine_grains_bubbles

The first trouble with a bag of bubbles is the single bubble. The human brain is just not fit for comparing bubble sizes. The self-sufficiency test is my favorite device for demonstrating this weakness. The following is the European section of the above chart, with the data labels removed.

Redo_junkcharts_afp_ukrainegrains_europe_1

How much bigger is Spain than the Netherlands? What's the difference between Italy and the Netherlands? The answers don't come easily to mind. (The Netherlands is about 40% the size of Spain, and Italy is about 20% larger than the Netherlands.)

While comparing relative circular areas is a struggle, figuring out the relative ranks is not. Sure, it gets tougher with small differences (Germany vs S. Korea, Belgium vs Portugal) but saying those pairs are tied isn't a tragedy.

***

Another issue with bubble charts is how difficult it is to assess absolute values. A circle on its own has no reference point. The designer needs to add data labels or a legend. Adding data labels is an act of giving up. The data labels become the primary instrument for communicating the data, not the visual construct. Adding one data label is not enough, as the following shows:

Redo_junkcharts_afpukrainegrains_2

Being told that Spain's value is 4.1 does little to help estimate the values for the non-labelled bubbles.

The chart does come with the following legend:

Afp_ukrianegrains_legend

For this legend to work, the sample bubble sizes should span the range of the data. Notice that it's difficult to extrapolate from the size of the 1-million-ton bubble to 2-million, 4-million, etc. The analogy is a column chart in which the vertical axis does not extend through the full range of the dataset.

The designer totally gets this. The chart therefore contains both selected data labels and the partial legend. Every bubble larger than 1 million tons has an explicit data label. That's one solution for the above problem.

Nevertheless, why not use another chart form that avoids these problems altogether?

***

In Tyson's tweet, she showed another chart that pretty much contains the same information, this one from TASS.

Ukraine_grains_flows

This chart uses the flow diagram concept - in an abstract way, as I explained in previous post.

This chart form imposes structure on the data. The relative ranks of the countries within each region are listed from top to bottom. The relative amounts of grains are shown in black columns (and also in the thickness of the flows).

The aggregate value of movements within each region is called out in that middle section. It is impossible to learn this from the bag of bubbles version.

The designer did print the entire dataset onto this chart (except for the smallest countries grouped together as "other"). This decision takes away from the power of the underlying flow chart. Instead of thinking about the proportional representation of each country within its respective region, or the distribution of grains among regions, our eyes hone in on the data labels.

This brings me back to the principle of self-sufficiency: if we expect readers to consume the data labels - which comprise the entire dataset, why not just print a data table? If we decide to visualize, make the visual elements count!


Why some dataviz fail

Maxim Lisnic's recent post should delight my readers (link). Thanks Alek for the tip. Maxim argues that charts "deceive" not merely by using visual tricks but by a variety of other non-visual means.

This is also the reasoning behind my Trifecta Checkup framework which looks at a data visualization project holistically. There are lots of charts that are well designed and constructed but fail for other reasons. So I am in agreement with Maxim.

He analyzed "10,000 Twitter posts with data visualizations about COVID-19", and found that 84% are "misleading" while only 11% of the 84% "violate common design guidelines". I presume he created some kind of computer program to evaluate these 10,000 charts, and he compiled some fixed set of guidelines that are regarded as "common" practice.

***

Let's review Maxim's examples in the context of the Trifecta Checkup.

_trifectacheckup_image

The first chart shows Covid cases in the U.S. in July and August of 2021 (presumably the time when the chart was published) compared to a year ago (prior to the vaccination campaign).

Maxim_section1

Maxim calls this cherry-picking. He's right - and this is a pet peeve of mine, even with all the peer-reviewed scientific research. In my paper on problems with observational studies (link), my coauthors and I call for a new way forward: researchers should put their model calculations up on a website which is updated as new data arrive, so that we can be sure that the conclusions they published apply generally to all periods of time, not just the time window chosen for the publication.

Looking at the pair of line charts, readers can quickly discover its purpose, so it does well on the Q(uestion) corner of the Trifecta. The cherry-picking relates to the link between the Question and the Data, showing that this chart suffers from subpar analysis.

In addition, I find that the chart also misleads visually - the two vertical scales are completely different: the scale on the left chart spans about 60,000 cases while on the right, it's double the amount.

Thus, I'd call this a Type DV chart, offering opportunities to improve in two of the three corners.

***

The second chart cited by Maxim plots a time series of all-cause mortality rates (per 100,000 people) from 1999 to 2020 as columns.

The designer does a good job drawing our attention to one part of the data - that the average increase in all-cause mortality rate in 2020 over the previous five years was 15%. I also like the use of a different color for the pandemic year.

Then, the designer lost the plot. Instead of drawing a conclusion based on the highlighted part of the data, s/he pushed a story that the 2020 rate was about the same as the 2003 rate. If that was the main message, then instead of computing a 15% increase relative to the past five years, s/he should have shown how the 2003 and 2020 levels are the same!

On a closer look, there is a dashed teal line on the chart but the red line and text completely dominate our attention.

This chart is also Type DV. The intention of the designer is clear: the question is to put the jump in all-cause mortality rate in a historical context. The problem lies again with subpar analysis. In fact, if we take the two insights from the data, they both show how serious a problem Covid was at the time.

When the rate returned to the level of 2003, we have effectively gave up all the gains made over 17 years in a few months.

Besides, a jump in 15% from year to year is highly significant if we look at all other year-to-year changes shown on the chart.

***

The next section concerns a common misuse of charts to suggest causality when the data could only indicate correlation (and where the causal interpretation appears to be dubious). I may write a separate post about this vast topic in the future. Today, I just want to point out that this problem is acute with any Covid-19 research, including official ones.

***

I find the fourth section of Maxim's post to be less convincing. In the following example, the tweet includes two charts, one showing proportion of people vaccinated, and the other showing the case rate, in Iceland and Nigeria.

Maxim_section4

This data visualization is poor even on the V(isual) corner. The first chart includes lots of countries that are irrelevant to the comparison. It includes the unnecessary detail of fully versus partially vaccinated, unnecessary because the two countries selected are at two ends of the scale. The color coding is off sync between the two charts.

Maxim's critique is:

The user fails to account, however, for the fact that Iceland had a much higher testing rate—roughly 200 times as high at the time of posting—making it unreasonable to compare the two countries.

And the section is titled "Issues with Data Validity". It's really not that simple.

First, while the differential testing rate is one factor that should be considered, this factor alone does not account for the entire gap. Second, this issue alone does not disqualify the data. Third, if testing rate differences should be used to invalidate this set of data, then all of the analyses put out by official sources lauding the success of vaccination should also be thrown out since there are vast differences in testing rates across all countries (and also across different time periods for the same country).

One typical workaround for differential testing rate is to look at deaths rather than cases. For the period of time plotted on the case curve, Nigeria's cumulative death per million is about 1/8th that of Iceland. The real problem is again in the Data analysis, and it is about how to interpret this data casually.

This example is yet another Type DV chart. I'd classify it under problems with "Casual Inference". "Data Validity" is definitely a real concern; I just don't find this example convincing.

***

The next section, titled "Failure to account for statistical nuance," is a strange one. The example is a chart that the CDC puts out showing the emergence of cases in a specific county, with cases classified by vaccination status. The chart shows that the vast majority of cases were found in people who were fully vaccinated. The person who tweeted concluded that vaccinated people are the "superspreaders". Maxim's objection to this interpretation is that most cases are in the fully vaccinated because most people are fully vaccinated.

I don't think it's right to criticize the original tweeter in this case. If by superspreader, we mean people who are infected and out there spreading the virus to others through contacts, then what the data say is exactly that most such people are fully vaccinated. In fact, one should be very surprised if the opposite were true.

Indeed, this insight has major public health implications. If the vaccine is indeed 90% effective at stopping cases, we should not be seeing that level of cases. And if the vaccine is only moderately effective, then we may not be able to achieve "herd immunity" status, as was the plan originally.

I'd be reluctant to complain about this specific data visualization. It seems that the data allow different interpretations - some of which are contradictory but all of which are needed to draw a measured conclusion.

***
The last section on "misrepresentation of scientific results" could use a better example. I certainly agree with the message: that people have confirmation bias. I have been calling this "story-first thinking": people with a set story visualize only the data that support their preconception.

However, the example given is not that. The example shows a tweet that contains a chart from a scientific paper that apparently concludes that hydroxychloroquine helps treat Covid-19. Maxim adds this study was subsequently retracted. If the tweet was sent prior to the retraction, then I don't think we can grumble about someone citing a peer reviewed study published in Lancet.

***

Overall, I like Maxim's message. In some cases, I think there are better examples.

 

 


The one thing you're afraid to ask about histograms

In the previous post about a variant of the histogram, I glossed over a few perplexing issues - deliberately. Today's post addresses one of these topics: what is going on in the vertical axis of a histogram?

The real question is: what data are encoded in the histogram, and where?

***

Let's return to the dataset from the last post. I grabbed data from a set of international football (i.e. soccer) matches. Each goal scored has a scoring minute. If the goal is scored in regulation time, the scoring minute is a number between 1 and 90 minutes. Specifically, the data collector applies a rounding up: any goal scored between 0 and 60 seconds is recorded as 1, all the way up to a goal scored between 89 and 90th minute being recorded as 90. In this post, I only consider goals scored in regulation time so the horizontal axis is between 1-90 minutes.

The kneejerk answer to the posed question is: counts in bins. Isn't it the case that in constructing a histogram, we divide the range of values (1-90) into bins, and then plot the counts within bins, i.e. the number of goals scored within each bin of minutes?

The following is what we have in mind:

Junkcharts_counthistogram_1

Let's call this the "count histogram".

Some readers may dislike the scale of the vertical axis, as its interpretation hinges on the total sample size. Hence, another kneejerk answer is: frequencies in bins. Instead of plotting counts directly, plot frequencies, which are just standardized counts. Just divide each value by the sample size. Here's the "frequency histogram":

Junkcharts_freqhistogram_1

The count and frequency histograms are identical except for the scale, and appear intuitively clear. The count and frequency data are encoded in the heights of the columns. The column widths are an afterthought, and they adhere to a fixed constant. Unlike a column chart, typically the gap width in a histogram is zero, as we want to partition the horizontal range into adjoining sections.

Now, if you look carefully at the histogram from the last post, reproduced below, you'd find that it plots neither counts nor frequencies:

Junkcharts_densityhistogram_1

The numbers on the axis are fractions, and suggest that they may be frequencies, but a quick check proves otherwise: with 9 columns, the average column should contain at least 10 percent of the data. The total of the displayed fractions is nowhere near 100%, which is our expectation if the values are relative frequencies. You may have come across this strangeness when creating histograms using R or some other software.

The purpose of this post is to explain what values are being plotted and why.

***

What are the kinds of questions we like to answer about the distribution of data?

At a high level, we want to know "where are my data"?

Arguably these two questions are fundamental:

  • what is the probability that the data falls within a given range of values? e.g., what is the probability that a goal is scored in the first 15 minutes of a football match?
  • what is the relative probability of data between two ranges of values? e.g. are teams more likely to score in last 5 minutes of the first half or the last five minutes of the second half of a football match?

In a histogram, the first question is answered by comparing a given column to the entire set of columns while the second question is answered by comparing one column to another column.

Let's see what we can learn from the count histogram.

Junkcharts_counthistograms_questions

In a count histogram, the heights encode the count data. To address the relative probability question, we note that the ratio of heights is the ratio of counts, and the ratio of counts is the same as the ratio of frequencies. Thus, we learn that teams are roughly 3000/1500 = 1.5 times more likely to score in the last 5 minutes of the second half than during the last 5 minutes of the first half. (See the green columns).

[For those who follow football, it's clear that the data collector treated goals scored during injury time of either half as scored during the last minute of the half, so this dataset can't be used to analyze timing of goals unless the real minutes were recorded for injury-time goals.]

To address the range probability question, we compare the aggregate height of the three orange columns with the total heights of all columns. Note that I said "height", not "area," because the heights directly encode counts. It's actually taxing to figure out the total height!

We resort to reading the total area of all columns. This should yield the correct answer: the area is directly proportional to the height because the column widths are fixed as a constant. Bear in mind, though, if the column widths vary (the theme of the last post), then areas and heights are not interchangable concepts.

Estimating the total area is still not easy, especially if the column heights exhibit high variance. What we need is the proportion of the total area that is orange. It's possible to see, not easy.

You may interject now to point out that the total area should equal the aggregate count (sample size). But that is a fallacy! It's very easy to make this error. The aggregate count is actually the total height, and because of that, the total area is the aggregate count multiplied by the column width! In my example, the total height is 23,682, which is the number of goals in the dataset, while the total area is 23,682 times 5 minutes.

[For those who think in equations, the total area is the sum over all columns of height(i) x width(i). When width is constant, we can take it outside the sum, and the sum of height(i) is just the total count.]

***

The count histogram is hard to use because it requires knowing the sample size. It's the first thing that is produced because the raw data are counts in bins. The frequency histogram is better at delivering answers.

In the frequency histogram, the heights encode frequency data. We can therefore just read off the relative probability of the orange column, bypassing the need to compute the total area.

This workaround actually promotes the fallacy described above for the count histogram. It is easy to fall into the trap of thinking that the total area of all columns is 100%. It isn't.

Similar to before, the total height should be the total frequency but the total area is the total frequency multipled by the column width, that is to say, the total area is the reciprocal of the bin width. In the football example, using 5-minute intervals, the total area of the frequency histogram is 1/(5 minutes) in the case of equal bin widths.

How about the relative probability question? On the frequency histogram, the ratio of column heights is the ratio of frequencies, which is exactly what we want. So long as the column width is constant, comparing column heights is easy.

***

One theme in the above discussion is that in the count and frequency histograms, the count and frequency data are encoded in the column heights but not the column areas. This is a source of major confusion. Because of the convention of using equal column widths, one treats areas and heights as interchangable... but not always. The total column area isn't the same as the total column height.

This observation has some unsettling implications.

As shown above, the total area is affected by the column width. The column width in an equal-width histogram is the range of the x-values divided by the number of bins. Thus, the total area is a function of the number of bins.

Consider the following frequency histograms of the same scoring minutes dataset. The only difference is the number of bins used.

Junkcharts_freqhistogram_differentbins

Increasing the number of bins has a series of effects:

  • the columns become narrower
  • the columns become shorter, because each narrower bin can contain at most the same count as the wider bin that contains it.
  • the total area of the columns become smaller.

This last one is unexpected and completely messes up our intuition. When we increase the number of bins, not only are the columns shortening but the total area covered by all the columns is also shrinking. Remember that the total area whether it is a count or frequency histogram has a factor equal to the bin width. Higher number of bins means smaller bin width, which means smaller total area.

***

What if we force the total area to be constant regardless of how many bins we use? This setting seems more intuitive: in the 5-bin histogram, we partition the total area into five parts while in the 10-bin histogram, we divide it into 10 parts.

This is the principle used by R and the other statistical software when they produce so-called density histograms. The count and frequency data are encoded in the column areas - by implication, the same data could not have been encoded simultaneously in the column heights!

The way to accomplish this is to divide by the bin width. If you look at the total area formulas above, for the count histogram, total area is total count x bin width. If the height is count divided by bin width, then the total area is the total count. Similarly, if the height in the frequency histogram is frequency divided by bin width, then the total area is 100%.

Count divided by some section of the x-range is otherwise known as "density". It captures the concept of how tightly the data are packed inside a particular section of the dataset. Thus, in a count-density histogram, the heights encode densities while the areas encode counts. In this case, total area is the total count. If we want to standardize total area to be 1, then we should compute densities using frequencies rather than counts. Frequency densities are just count densities divided by the total count.

To summarize, in a frequency-density histogram, the heights encode densities, defined as frequency divided by the bin width. This is not very intuitive; just think of densities as how closely packed the data are in the specified bin. The column areas encode frequencies so that the total area is 100%.

The reason why density histograms are confusing is that we are reading off column heights while thinking that the total area should add up to 100%. Column heights and column areas cannot both add up to 100%. We have to pick one or the other.

Comparing relative column heights still works when the density histogram has equal bin widths. In this case, the relative height and relative area are the same because relative density equals relative frequencies if the bin width is fixed.

The following charts recap the discussion above. It shows how the frequency histogram does not preserve the total area when bin sizes are changed while the density histogram does.

Junkcharts_freqdensityhistograms_differentbins

***

The density histogram is a major pain for solving range probability questions because the frequencies are encoded in the column areas, not the heights. Areas are not marked out in a graph.

The column height gives us densities which are not probabilities. In order to retrieve probabilities, we have to multiply the density by the bin width, that is to say, we must estimate the area of the column. That requires mapping two dimensions (width, height) onto one (area). It is in fact impossible without measurement - unless we make the bin widths constant.

When we make the bin widths constant, we still can't read densities off the vertical axis, and treat them as probabilities. If I must use the density histogram to answer the question of how likely a team scores in the first 15 minutes, I'd sum the heights of the first 3 columns, which is about 0.025, and then multiply it by the bin width of 5 minutes, which gives 0.125 or 12.5%.

At the end of this exploration, I like the frequency histogram best. The density histogram is useful when we are comparing different histograms, which isn't the most common use case.

***

The histogram is a basic chart in the tool kit. It's more complicated than it seems. I haven't come across any intro dataviz books that explain this clearly.

Most of this post deals with equal-width histograms. If we allow bin widths to vary, it gets even more complicated. Stay tuned.

***

For those using base R graphics, I hope this post helps you interpret what they say in the manual. The default behavior of the "hist" function depends on whether the bins are equal width:

  • if the bin width is constant, then R produces a count histogram. As shown above, in a count histogram, the column heights indicate counts in bins but the total column area does not equal the total sample size, but the total sample size multiplied by the bin width. (Equal width is the default unless the user specifies bin breakpoints.)
  • if the bin width is not constant, then R produces a (frequency-)density histogram. The column heights are densities, defined as frequencies divided by bin width while the column areas are frequencies, with the total area summing to 100%.

Unfortunately, R does not generate a frequency histogram. To make one, you'd have to divide the counts in bins by the sum of counts. (In making some of the graphs above, I tricked it.) You also need to trick it to make a frequency-density histogram with equal-width bins, as it's coded to produce a count histogram when bin size is fixed.

 

P.S. [5-2-2023] As pointed out by a reader, I should clarify that R and I use the word "frequency" differently. Specifically, R uses frequency to mean counts, therefore, what I have been calling the "count histogram", R would have called it a "frequency histogram", and what I have been describing as a "frequency histogram", the "hist" function simply does not generate it unless you trick it to do so. I'm using "frequency" in the everyday sense of the word, such as "the frequency of the bus". In many statistical packages, frequency is used to mean "count", as in the frequency table which is just a table of counts. The reader suggested proportion which I like, or something like weight.

 

 

 

 

 


Bivariate choropleths

A reader submitted a link to Joshua Stephen's post about bivariate choropleths, which is the technical term for the map that FiveThirtyEight printed on abortion bans, discussed here. Joshua advocates greater usage of maps with two-dimensional color scales.

As a reminder, the fundamental building block is expressed in this bivariate color legend:

Fivethirtyeight_abortionmap_colorlegend

Counties are classified into one of these nine groups, based on low/middle/high ratings on two dimensions, distance and congestion.

The nine groups are given nine colors, built from superimposing shades of green and pink. All nine colors are printed on the same map.

Joshuastephens_singlemap

Without a doubt, using these nine related colors are better than nine arbitrary colors. But is this a good data visualization?

Specifically, is the above map better than the pair of maps below?

Joshuastephens_twomaps

The split map is produced by Josh to explain that the bivariate choropleth is just the superposition of two univariate choropleths. I much prefer the split map to the superimposed one.

***

Think about what the reader goes through when comparing two counties.

Junkcharts_bivariatechoropleths

Superimposing the two univariate maps solves one problem: it removes the need to scan back and forth between two maps, looking for the same locations, something that is imprecise. (Unless, the map is interactive, and highlighting one county highlights the same county in the other map.)

For me, that's a small price to pay for quicker translation of color into information.