The one thing you're afraid to ask about histograms

In the previous post about a variant of the histogram, I glossed over a few perplexing issues - deliberately. Today's post addresses one of these topics: what is going on in the vertical axis of a histogram?

The real question is: what data are encoded in the histogram, and where?


Let's return to the dataset from the last post. I grabbed data from a set of international football (i.e. soccer) matches. Each goal scored has a scoring minute. If the goal is scored in regulation time, the scoring minute is a number between 1 and 90 minutes. Specifically, the data collector applies a rounding up: any goal scored between 0 and 60 seconds is recorded as 1, all the way up to a goal scored between 89 and 90th minute being recorded as 90. In this post, I only consider goals scored in regulation time so the horizontal axis is between 1-90 minutes.

The kneejerk answer to the posed question is: counts in bins. Isn't it the case that in constructing a histogram, we divide the range of values (1-90) into bins, and then plot the counts within bins, i.e. the number of goals scored within each bin of minutes?

The following is what we have in mind:


Let's call this the "count histogram".

Some readers may dislike the scale of the vertical axis, as its interpretation hinges on the total sample size. Hence, another kneejerk answer is: frequencies in bins. Instead of plotting counts directly, plot frequencies, which are just standardized counts. Just divide each value by the sample size. Here's the "frequency histogram":


The count and frequency histograms are identical except for the scale, and appear intuitively clear. The count and frequency data are encoded in the heights of the columns. The column widths are an afterthought, and they adhere to a fixed constant. Unlike a column chart, typically the gap width in a histogram is zero, as we want to partition the horizontal range into adjoining sections.

Now, if you look carefully at the histogram from the last post, reproduced below, you'd find that it plots neither counts nor frequencies:


The numbers on the axis are fractions, and suggest that they may be frequencies, but a quick check proves otherwise: with 9 columns, the average column should contain at least 10 percent of the data. The total of the displayed fractions is nowhere near 100%, which is our expectation if the values are relative frequencies. You may have come across this strangeness when creating histograms using R or some other software.

The purpose of this post is to explain what values are being plotted and why.


What are the kinds of questions we like to answer about the distribution of data?

At a high level, we want to know "where are my data"?

Arguably these two questions are fundamental:

  • what is the probability that the data falls within a given range of values? e.g., what is the probability that a goal is scored in the first 15 minutes of a football match?
  • what is the relative probability of data between two ranges of values? e.g. are teams more likely to score in last 5 minutes of the first half or the last five minutes of the second half of a football match?

In a histogram, the first question is answered by comparing a given column to the entire set of columns while the second question is answered by comparing one column to another column.

Let's see what we can learn from the count histogram.


In a count histogram, the heights encode the count data. To address the relative probability question, we note that the ratio of heights is the ratio of counts, and the ratio of counts is the same as the ratio of frequencies. Thus, we learn that teams are roughly 3000/1500 = 1.5 times more likely to score in the last 5 minutes of the second half than during the last 5 minutes of the first half. (See the green columns).

[For those who follow football, it's clear that the data collector treated goals scored during injury time of either half as scored during the last minute of the half, so this dataset can't be used to analyze timing of goals unless the real minutes were recorded for injury-time goals.]

To address the range probability question, we compare the aggregate height of the three orange columns with the total heights of all columns. Note that I said "height", not "area," because the heights directly encode counts. It's actually taxing to figure out the total height!

We resort to reading the total area of all columns. This should yield the correct answer: the area is directly proportional to the height because the column widths are fixed as a constant. Bear in mind, though, if the column widths vary (the theme of the last post), then areas and heights are not interchangable concepts.

Estimating the total area is still not easy, especially if the column heights exhibit high variance. What we need is the proportion of the total area that is orange. It's possible to see, not easy.

You may interject now to point out that the total area should equal the aggregate count (sample size). But that is a fallacy! It's very easy to make this error. The aggregate count is actually the total height, and because of that, the total area is the aggregate count multiplied by the column width! In my example, the total height is 23,682, which is the number of goals in the dataset, while the total area is 23,682 times 5 minutes.

[For those who think in equations, the total area is the sum over all columns of height(i) x width(i). When width is constant, we can take it outside the sum, and the sum of height(i) is just the total count.]


The count histogram is hard to use because it requires knowing the sample size. It's the first thing that is produced because the raw data are counts in bins. The frequency histogram is better at delivering answers.

In the frequency histogram, the heights encode frequency data. We can therefore just read off the relative probability of the orange column, bypassing the need to compute the total area.

This workaround actually promotes the fallacy described above for the count histogram. It is easy to fall into the trap of thinking that the total area of all columns is 100%. It isn't.

Similar to before, the total height should be the total frequency but the total area is the total frequency multipled by the column width, that is to say, the total area is the reciprocal of the bin width. In the football example, using 5-minute intervals, the total area of the frequency histogram is 1/(5 minutes) in the case of equal bin widths.

How about the relative probability question? On the frequency histogram, the ratio of column heights is the ratio of frequencies, which is exactly what we want. So long as the column width is constant, comparing column heights is easy.


One theme in the above discussion is that in the count and frequency histograms, the count and frequency data are encoded in the column heights but not the column areas. This is a source of major confusion. Because of the convention of using equal column widths, one treats areas and heights as interchangable... but not always. The total column area isn't the same as the total column height.

This observation has some unsettling implications.

As shown above, the total area is affected by the column width. The column width in an equal-width histogram is the range of the x-values divided by the number of bins. Thus, the total area is a function of the number of bins.

Consider the following frequency histograms of the same scoring minutes dataset. The only difference is the number of bins used.


Increasing the number of bins has a series of effects:

  • the columns become narrower
  • the columns become shorter, because each narrower bin can contain at most the same count as the wider bin that contains it.
  • the total area of the columns become smaller.

This last one is unexpected and completely messes up our intuition. When we increase the number of bins, not only are the columns shortening but the total area covered by all the columns is also shrinking. Remember that the total area whether it is a count or frequency histogram has a factor equal to the bin width. Higher number of bins means smaller bin width, which means smaller total area.


What if we force the total area to be constant regardless of how many bins we use? This setting seems more intuitive: in the 5-bin histogram, we partition the total area into five parts while in the 10-bin histogram, we divide it into 10 parts.

This is the principle used by R and the other statistical software when they produce so-called density histograms. The count and frequency data are encoded in the column areas - by implication, the same data could not have been encoded simultaneously in the column heights!

The way to accomplish this is to divide by the bin width. If you look at the total area formulas above, for the count histogram, total area is total count x bin width. If the height is count divided by bin width, then the total area is the total count. Similarly, if the height in the frequency histogram is frequency divided by bin width, then the total area is 100%.

Count divided by some section of the x-range is otherwise known as "density". It captures the concept of how tightly the data are packed inside a particular section of the dataset. Thus, in a count-density histogram, the heights encode densities while the areas encode counts. In this case, total area is the total count. If we want to standardize total area to be 1, then we should compute densities using frequencies rather than counts. Frequency densities are just count densities divided by the total count.

To summarize, in a frequency-density histogram, the heights encode densities, defined as frequency divided by the bin width. This is not very intuitive; just think of densities as how closely packed the data are in the specified bin. The column areas encode frequencies so that the total area is 100%.

The reason why density histograms are confusing is that we are reading off column heights while thinking that the total area should add up to 100%. Column heights and column areas cannot both add up to 100%. We have to pick one or the other.

Comparing relative column heights still works when the density histogram has equal bin widths. In this case, the relative height and relative area are the same because relative density equals relative frequencies if the bin width is fixed.

The following charts recap the discussion above. It shows how the frequency histogram does not preserve the total area when bin sizes are changed while the density histogram does.



The density histogram is a major pain for solving range probability questions because the frequencies are encoded in the column areas, not the heights. Areas are not marked out in a graph.

The column height gives us densities which are not probabilities. In order to retrieve probabilities, we have to multiply the density by the bin width, that is to say, we must estimate the area of the column. That requires mapping two dimensions (width, height) onto one (area). It is in fact impossible without measurement - unless we make the bin widths constant.

When we make the bin widths constant, we still can't read densities off the vertical axis, and treat them as probabilities. If I must use the density histogram to answer the question of how likely a team scores in the first 15 minutes, I'd sum the heights of the first 3 columns, which is about 0.025, and then multiply it by the bin width of 5 minutes, which gives 0.125 or 12.5%.

At the end of this exploration, I like the frequency histogram best. The density histogram is useful when we are comparing different histograms, which isn't the most common use case.


The histogram is a basic chart in the tool kit. It's more complicated than it seems. I haven't come across any intro dataviz books that explain this clearly.

Most of this post deals with equal-width histograms. If we allow bin widths to vary, it gets even more complicated. Stay tuned.


For those using base R graphics, I hope this post helps you interpret what they say in the manual. The default behavior of the "hist" function depends on whether the bins are equal width:

  • if the bin width is constant, then R produces a count histogram. As shown above, in a count histogram, the column heights indicate counts in bins but the total column area does not equal the total sample size, but the total sample size multiplied by the bin width. (Equal width is the default unless the user specifies bin breakpoints.)
  • if the bin width is not constant, then R produces a (frequency-)density histogram. The column heights are densities, defined as frequencies divided by bin width while the column areas are frequencies, with the total area summing to 100%.

Unfortunately, R does not generate a frequency histogram. To make one, you'd have to divide the counts in bins by the sum of counts. (In making some of the graphs above, I tricked it.) You also need to trick it to make a frequency-density histogram with equal-width bins, as it's coded to produce a count histogram when bin size is fixed.


P.S. [5-2-2023] As pointed out by a reader, I should clarify that R and I use the word "frequency" differently. Specifically, R uses frequency to mean counts, therefore, what I have been calling the "count histogram", R would have called it a "frequency histogram", and what I have been describing as a "frequency histogram", the "hist" function simply does not generate it unless you trick it to do so. I'm using "frequency" in the everyday sense of the word, such as "the frequency of the bus". In many statistical packages, frequency is used to mean "count", as in the frequency table which is just a table of counts. The reader suggested proportion which I like, or something like weight.






Finding the story in complex datasets

In CT Mirror's feature about Connecticut, which I wrote about in the previous post, there is one graphic that did not rise to the same level as the others.


This section deals with graduation rates of the state's high school districts. The above chart focuses on exactly five districts. The line charts are organized in a stack. No year labels are provided. The time window is 11 years from 2010 to 2021. The column of numbers show the difference in graduation rates over the entire time window.

The five lines look basically the same, if we ignore what looks to be noisy year-to-year fluctuations. This is due to the weird aspect ratio imposed by stacking.

Why are those five districts chosen? Upon investigation, we learn that these are the five districts with the biggest improvement in graduation rates during the 11-year time window.

The same five schools also had some of the lowest graduation rates at the start of the analysis window (2010). This must be so because if a school graduated 90% of its class in 2010, it would be mathematically impossible for it to attain a 35% percent point improvement! This is a dissatisfactory feature of the dataviz.


In preparing an alternative version, I start by imagining how readers might want to utilize a visualization of this dataset. I assume that the readers may have certain school(s) they are particularly invested in, and want to see its/their graduation performance over these 11 years.

How does having the entire dataset help? For one thing, it provides context. What kind of context is relevant? As discussed above, it's futile to compare a school at the top of the ranking to one that is near the bottom. So I created groups of schools. Each school is compared to other schools that had comparable graduation rates at the start of the analysis period.

Amistad School District, which takes pole position in the original dataviz, graduated only 58% of its pupils in 2010 but vastly improved its graduation rate by 35% over the decade. In the chart below (left panel), I plotted all of the schools that had graduation rates between 50 and 74% in 2010. The chart shows that while Amistad is a standout, almost all schools in this group experienced steady improvements. (Whether this phenomenon represents true improvement, or just grade inflation, we can't tell from this dataset alone.)


The right panel shows the group of schools with the next higher level of graduation rates in 2010. This group of schools too increased their graduation rates almost always. The rate of improvement in this group is lower than in the previous group of schools.

The next set of charts show school districts that already achieved excellent graduation rates (over 85%) by 2010. The most interesting group of schools consists of those with 85-89% rates in 2010. Their performance in 2021 is the most unpredictable of all the school groups. The majority of districts did even better while others regressed.


Overall, there is less variability than I'd expect in the top two school groups. They generally appeared to have been able to raise or maintain their already-high graduation rates. (Note that the scale of each chart is different, and many of the lines in the second set of charts are moving within a few percentages.)

One more note about the charts: The trend lines are "smoothed" to focus on the trends rather than the year to year variability. Because of smoothing, there is some awkward-looking imprecision e.g. the end-to-end differences read from the curves versus the observed differences in the data. These discrepancies can easily be fixed if these charts were to be published.

All about Connecticut

This dataviz project by CT Mirror is excellent. The project walks through key statistics of the state of Connecticut.

Here are a few charts I enjoyed.

The first one shows the industries employing the most CT residents. The left and right arrows are perfect, much better than the usual dot plots.


The industries are sorted by decreasing size from top to bottom, based on employment in 2019. The chosen scale is absolute, showing the number of employees. The relative change is shown next to the arrow heads in percentages.

The inclusion of both absolute and relative scales may be a source of confusion as the lengths of the arrows encode the absolute differences, not the relative differences indicated by the data labels. This type of decision is always difficult for the designer. Selecting one of the two scales may improve clarity but induce loss aversion.


The next example is a bumps chart showing the growth in residents with at least a bachelor's degree.


This is more like a slopegraph as it appears to draw straight lines between two time points 9 years apart, omitting the intervening years. Each line represents a state. Connecticut's line is shown in red. The message is clear. Connecticut is among the most highly educated out of the 50 states. It maintained this advantage throughout the period.

I'd prefer to use solid lines for the background states, and the axis labels can be sparser.

It's a little odd that pretty much every line has the same slope. I'm suspecting that the numbers came out of a regression model, with varying slopes by state, but the inter-state variance is low.

In the online presentation, one can click on each line to see the values.


The final example is a two-sided bar chart:


This shows migration in and out of the state. The red bars represent the number of people who moved out, while the green bars represent those who moved into the state. The states are arranged from the most number of in-migrants to the least.

I have clipped the bottom of the chart as it extends to 50 states, and the bottom half is barely visible since the absolute numbers are so small.

I'd suggest showing the top 10 states. Then group the rest of the states by region, and plot them as regions. This change makes the chart more compact, as well as more useful.


There are many other charts, and I encourage you to visit and support this data journalism.




Visual cues affect how data are perceived

Here's a recent NYT graphic showing California's water situation at different time scales (link to article).


It's a small multiples display, showing the spatial distribution of the precipitation amounts in California. The two panels show, respectively, the short-term view (past month) and the longer-term view (3 years). Precipitation is measured in relative terms,  so what is plotted is the relative ratio of precipitation in the reference period, with 100 being the 30-year average.

Green is much wetter than average while brown is much drier than average.

The key to making this chart work is a common color scheme across the two panels.

Also, the placement of major cities provides anchor points for our eyes to move back and forth between the two panels.


The NYT graphic is technically well executed. I'm a bit unhappy with the headline: "Recent rains haven't erased California's long-term drought".

At the surface, the conclusion seems sensible. Look, there is a lot of green, even deep green, on the left panel, which means the state got lots more rain than usual in the past month. Now, on the right panel, we find patches of brown, and very little green.

But pay attention to the scale. The light brown color, which covers the largest area, has value 70 to 90, thus, these regions have gotten 10-30% less precipitation than average in the past three years relative to the 30-year average.

Here's the question: what does it mean by "erasing California's long-term drought"? Does the 3-year average have to equal or exceed the 30-year average? Why should that be the case?

If we took all 3-year windows within those 30 years, we're definitely not going to find that each such 3-year average falls at or above the 30-year average. To illustrate this, I pulled annual rainfall data for San Francisco. Here is a histogram of 3-year averages for the 30-year period 1991-2020.


For example, the first value is the average rainfall for years 1989, 1990 and 1991, the next value is the average of 1990, 1991, and 1992, and so on. Each value is a relative value relative to the overall average in the 30-year window. There are two more values beyond 2020 that is not shown in the histogram. These are 57%, and 61%, so against the 30-year average, those two 3-year averages were drier than usual.

The above shows the underlying variability of the 3-year averages inside the reference time window. We have to first define "normal", and that might be a value between 70% and 130%.

In the same way, we can establish the "normal" range for the entire state of California. If it's also 70% to 130%, then the last 3 years as shown in the map above should be considered normal.



A graphical compass

A Twitter user pointed me to this article from Washington Post, ruminating about the correlation between gas prices and measures of political sentiment (such as Biden's approval rating or right-track-wrong-track). As common in this genre, the analyst proclaims that he has found something "counter intuitive".

The declarative statement strikes me as odd. In the first two paragraphs, he said the data showed "as gas prices fell, American optimism rose. As prices rose, optimism fell... This seems counterintuitive."

I'm struggling to see what's counterintuitive. Aren't the data suggesting people like lower prices? Is that not what we think people like?

The centerpiece of the article concerns the correlation between metrics. "If two numbers move in concert, they can be depicted literally moving in concert. One goes up, the other moves either up or down consistently." That's a confused statement and he qualifies it by typing "That sort of thing."

He's reacting to the following scatter plot with lines. The Twitter user presumably found it hard to understand. Count me in.


Why is this chart difficult to grasp?

The biggest puzzle is: what differentiates those two lines? The red and the gray lines are not labelled. One would have to consult the article to learn that the gray line represents the "raw" data at weekly intervals. The red line is aggregated data at monthly intervals. In other words, each red dot is an average of 4 or 5 weekly data points. The red line is just a smoothed version of the gray line. Smoothed lines show the time trend better.

The next missing piece is the direction of time, which can only be inferred by reading the month labels on the red line. But the chart without the direction of time is like a map without a compass. Take this segment for example:


If time is running up to down, then approval ratings are increasing over time while gas prices are decreasing. If time is running down to up, then approval ratings are decreasing over time while gas prices are increasing. Exactly the opposite!

The labels on the red line are not sufficient. It's possible that time runs in the opposite direction on the gray line! We only exclude that possibility if we know that the red line is a smoothed version of the gray line.

This type of chart benefits from having a compass. Here's one:


It's useful for readers to know that the southeast direction is "good" (higher approval ratings, lower gas prices) while the northwest direction is "bad". Going back to the original chart, one can see that the metrics went in the "bad" direction at the start of the year and has reverted to a "good" direction since.


What does this chart really say? The author remarked that "correlation is not causation". "Just because Biden’s approval rose as prices dropped doesn’t mean prices caused the drop."

Here's an alternative: People have general sentiments. When they feel good, they respond more positively to polls, as in they rate everything more positively. The approval ratings are at least partially driven by this general sentiment. The same author apparently has another article saying that the right-track-wrong-track sentiment also moved in tandem with gas prices.

One issue with this type of scatter plot is that it always cues readers to make an incorrect assumption: that the outcome variables (approval rating) is solely - or predominantly - driven by the one factor being visualized (gas prices). This visual choice completely biases the reader's perception.

P.S. [11-11-22] The source of the submission was incorrectly attributed.

Another reminder that aggregate trends hide information

The last time I looked at the U.S. employment situation, it was during the pandemic. The data revealed the deep flaws of the so-called "not in labor force" classification. This classification is used to dehumanize unemployed people who are declared "not in labor force," in which case they are neither employed nor unemployed -- just not counted at all in the official unemployment (or employment) statistics.

The reason given for such a designation was that some people just have no interest in working, or even looking for a job. Now they are not merely discouraged - as there is a category of those people. In theory, these people haven't been looking for a job for so long that they are no longer visible to the bean counters at the Bureau of Labor Statistics.

What happened when the pandemic precipitated a shutdown in many major cities across America? The number of "not in labor force" shot up instantly, literally within a few weeks. That makes a mockery of the reason for such a designation. See this post for more.


The data we saw last time was up to April, 2020. That's more than two years old.

So I have updated the charts to show what has happened in the last couple of years.

Here is the overall picture.


In this new version, I centered the chart at the 1990 data. The chart features two key drivers of the headline unemployment rate - the proportion of people designated "invisible", and the proportion of those who are considered "employed" who are "part-time" workers.

The last two recessions have caused structural changes to the labor market. From 1990 to late 2000s, which included the dot-com bust, these two metrics circulated within a small area of the chart. The Great Recession of late 2000s led to a huge jump in the proportion called "invisible". It also pushed the proportion of part-timers to all0time highs. The proportion of part-timers has fallen although it is hard to interpret from this chart alone - because if the newly invisible were previously part-time employed, then the same cause can be responsible for either trend.

_numbersense_bookcoverReaders of Numbersense (link) might be reminded of a trick used by school deans to pump up their US News rankings. Some schools accept lots of transfer students. This subpopulation is invisible to the US News statisticians since they do not factor into the rankings. The recent scandal at Columbia University also involves reclassifying students (see this post).

Zooming in on the last two years. It appears that the pandemic-related unemployment situation has reversed.


Let's split the data by gender.

American men have been stuck in a negative spiral since the 1990s. With each recession, a higher proportion of men are designated BLS invisibles.


In the grid system set up in this scatter plot, the top right corner is the worse of all worlds - the work force has shrunken and there are more part-timers among those counted as employed. The U.S. men are not exiting this quadrant any time soon.

What about the women?


If we compare 1990 with 2022, the story is not bad. The female work force is gradually reaching the same scale as in 1990 while the proportion of part-time workers have declined.

However, celebrating the above is to ignore the tremendous gains American women made in the 1990s and 2000s. In 1990, only 58% of women are considered part of the work force - the other 42% are not working but they are not counted as unemployed. By 2000, the female work force has expanded to include about 60% with similar proportions counted as part-time employed as in 1990. That's great news.

The Great Recession of the late 2000s changed that picture. Just like men, many women became invisible to BLS. The invisible proportion reached 44% in 2015 and have not returned to anywhere near the 2000 level. Fewer women are counted as part-time employed; as I said above, it's hard to tell whether this is because the women exiting the work force previously worked part-time.


The color of the dots in all charts are determined by the headline unemployment number. Blue represents low unemployment. During the 1990-2022 period, there are three moments in which unemployment is reported as 4 percent or lower. These charts are intended to show that an aggregate statistic hides a lot of information. The three times at which unemployment rate reached historic lows represent three very different situations, if one were to consider the sizes of the work force and the number of part-time workers.


P.S. [8-15-2022] Some more background about the visualization can be found in prior posts on the blog: here is the introduction, and here's one that breaks it down by race. Chapter 6 of Numbersense (link) gets into the details of how unemployment rate is computed, and the implications of the choices BLS made.

P.S. [8-16-2022] Corrected the axis title on the charts (see comment below). Also, added source of data label.

Dataviz is good at comparisons if we make the right comparisons

In an article about gas prices around the world, the Washington Post uses the following bar chart (link):


There are a few wrinkles in this one compared to the most generic bar chart one can produce:


(The numbers on my chart are not the same as Washington Post's. That's because the data vendor charges for data, except for the most recent week. So, my data is from a different week.)

_trifectacheckup_imageThe gas prices are not expressed in dollars but a transformation turns prices into a cost-effectiveness metric: miles per dollar, or more precisely, miles per $40 dollars of gas. The metric has a reverse direction - the higher the price, the lower the miles. The data transformation belongs to the D corner of the Trifecta Checkup framework (link). Depending on how one poses the Q(uestion) of the chart, the shift from dollars to miles can bring the Q and the D in sync.

In the V(isual) corner, the designer embellishes the bars. A car icon is placed at the tip of each bar while the bar itself is turned into a wavy path, symbolizing a dirt path. The driving metaphor is in full play. In fact, the video makes the most out of it. There is no doubt that the embellishment has turned a mere scientific presentation into a form of entertainment.


Did the embellishment harm visual clarity? For the most part, no.

The worst it can get is when they compared U.S. and India/South Africa:


The left column shows the original charts from the article. In  both charts, the two cars are so close together that it is impossible to learn the scale of the difference. The amount of difference is a fraction of the width of a car icon.

The right column shows the "self-sufficiency test". Imagine the data labels are not on the chart. What we learn is that if we wanted to know how big of a gap is between the two countries, when reading the charts on the left, we are relying on the data labels, not the visual elements. On the right side, if we really want to learn the gaps, we have to look through the car icons to find the tips of the bars!

This discussion does not necessarily doom the appealing chart. If the message one wants to send with the India/South Afrcia charts is that there is negligible difference between them, then it is not crucial to present the precise differences in prices.


The real problem with this dataviz is in the D corner. Comparing countries is hard.

As shown above, by the miles per $40 spend metric, U.S. and India are rated essentially the same. So is the average American and the average Indian suffering equally?

Far from it. The clue comes from the aggregate chart, in which countries are divided into three tiers: high income, upper middle income and lower middle income. The U.S. belongs to the high-income tier while India falls into the lower-middle-income tier.

The cost of living in India is much lower than in the US. Forty dollars is a much bigger chunk of an Indian paycheck than an American one.

To adjust for cost of living, economists use a PPP (purchasing power parity) value. The following chart shows the difference:


The right graph contains cost-of-living adjustments. It shows a completely different picture. Nominally (left chart), the price of gas in about the same in dollar terms between U.S. and India. In terms of cost of living, gas is actually 5 times more expensive in India. Thus, the adjusted miles per $40 gas number is much smaller for India than the unadjusted. (Because PPP is relative to U.S. prices, the U.S. numbers are not affected.)

PPP is not the end-all here. According to the Economic Times (India), only 22 out of 1,000 Indians own cars, compared to 980 out of 1,000 Americans. Think about the implication of using any statistic that averages the entire population!


Why is gas more expensive in California than the U.S. average? The talking point I keep hearing is environmental regulations. Gas prices may be higher in Europe for a similar reason. Residents in those places may be willing to pay higher prices because they get satisfaction from playing their part in preserving the planet for future generations.

The footnote discloses this not-trivial issue.


When converting from dollars per gallon/liter into miles per $40, we need data on miles per gallon/liter. Americans notoriously drive cars (trucks, SUVs, etc.) that have much lower mileage than those driven by other countries. However, this factor is artificially removed by assuming the same car with 32 mpg on all countries. A quick hop to the BTS website tells us that the average mpg of American cars is a third of that assumption. [See note below.]

Ignoring cross-country comparisons for the time being, the true number for U.S. is not 247 miles per $40 spent on gas as claimed. It is a third of that value: 82 miles per $40 spent.

It's tough to find data on fuel economy of all passenger cars, not just new passenger cars. I found Australia's number, which is 21 mpg. So this brings the miles per $40 number down from about 230 to 115. These are not small adjustments.

Washington Post's analysis paints a simplistic picture that presupposes that price is the only thing people care about. I call this issue xyopia. It's when the analyst frames the problem as factor x explaining outcome y, and when factor x is not the only, and frequently not even the most important, factor affecting y.

More on xyopia.

More discussion of Washington Post graphics.


[P.S. 7-25-2022. Reader Cody Curtis pointed out in the comments that the Bureau of Transportation Statistics report was using km/liter as units, not miles per gallon. The 10 km/liter number for average cars is roughly 23 mpg. I'll leave the text as is in the post as the larger point is valid: that there is variation in average fuel economy between nations - partly due to environemental regulation and consumer behavior - and thus, a proper comparison requires adjusting for this factor.]

Superb tile map offering multiple avenues for exploration

Here's a beauty by WSJ Graphics:


The article is here.

This data graphic illustrates the power of the visual medium. The underlying dataset is complex: power production by type of source by state by month by year. That's more than 90,000 numbers. They all reside on this graphic.

Readers amazingly make sense of all these numbers without much effort.

It starts with the summary chart on top.


The designer made decisions. The data are presented in relative terms, as proportion of total power production. Only the first and last years are labeled, thus drawing our attention to the long-term trend. The order of the color blocks is carefully selected so that the cleaner sources are listed at the top and the dirtier sources at the bottom. The order of the legend labels mirrors the color blocks in the area chart.

It takes only a few seconds to learn that U.S. power production has largely shifted away from coal with most of it substituted by natural gas. Other than wind, the green sources of power have not gained much ground during these years - in a relative sense.

This summary chart serves as a reading guide for the rest of the chart, which is a tile map of all fifty states. Embedded in the tile map is a small-multiples arrangement.


The map offers multiple avenues for exploration.

Some readers may look at specific states. For example, California.


Currently, about half of the power production in California come from natural gas. Notably, there is no coal at all in any of these years. In addition to wind, solar energy has also gained. All of these insights come without the need for any labels or gridlines!

Wsj_powerproduction_westernstatesBrowsing around California, readers find different patterns in other Western states like Oregon and Washington.

Hydroelectric energy is the dominant source in those two states, with wind gradually taking share.

At this point, readers realize that the summary chart up top hides remarkable state-level variations.


There are other paths through the map.

Some readers may scan the whole map, seeking patterns that pop out.

One such pattern is the cluster of states that use coal. In most of these states, the proportion of coal has declined.

Yet another path exists for those interested in specific sources of power.

For example, the trend in nuclear power usage is easily followed by tracking the purple. South Carolina, Illinois and New Hampshire are three states that rely on nuclear for more than half of its power.

Wsj_powerproduction_vermontI wonder what happened in Vermont about 8 years ago.

The chart says they renounced nuclear energy. Here is some history. This one-time event caused a disruption in the time series, unique on the entire map.


This work is wonderful. Enjoy it!

To explain or to eliminate, that is the question

Today, I take a look at another project from Ray Vella's class at NYU.

Rich Get Richer Assigment 2 top

(The above image is a honeypot for "smart" algorithms that don't know how to handle image dimensions which don't fit their shadow "requirement". Human beings should proceed to the full image below.)

As explained in this post, the students visualized data about regional average incomes in a selection of countries. It turns out that remarkable differences persist in regional income disparity between countries, almost all of which are more advanced economies.

Rich Get Richer Assigment 2 Danielle Curran_1

The graphic is by Danielle Curran.

I noticed two smart decisions.

First, she came up with a different main metric for gauging regional disparity, landing on a metric that is simple to grasp.

Based on hints given on the chart, I surmised that Danielle computed the change in per-capita income in the richest and poorest regions separately for each country between 2000 and 2015. These regional income growth values are expressed in currency, not indiced. Then, she computed the ratio of these growth rates, for each country. The end result is a simple metric for each country that describes how fast income has been growing in the richest region relative to the poorest region.

One of the challenges of this dataset is the complex indexing scheme (discussed here). Carlos' solution keeps the indices but uses design to facilitate comparisons. Danielle avoids the indices altogether.

The reader is relieved of the need to make comparisons, and so can focus on differences in magnitude. We see clearly that regional disparity is by far the highest in the U.K.


The second smart decision Danielle made is organizing the countries into clusters. She took advantage of the horizontal axis which does not encode any data. The branching structure places different clusters of countries along the axis, making it simple to navigate. The locations of these clusters are cleverly aligned to the map below.


Danielle's effort is stronger on communications while Carlos' effort provides more information. The key is to understand who your readers are. What proportion of your readers would want to know the values for each country, each region and each year?


A couple of suggestions

a) The reference line should be set at 1, not 0, for a ratio scale. The value of 1 happens when the richest region and the poorest region have identical per-capita incomes.

b) The vertical scale should be fixed.

Check your presumptions while you're reading this chart about Israel's vaccination campaign

On July 30, Israel began administering third doses of mRNA vaccines to targeted groups of people. This decision was controversial since there is no science to support it. The policymakers do have educated guesses by experts based on best-available information. By science, I mean actual evidence. Since no one has previously been given three shots, there can be no data on which anyone can root such a decision. Nevertheless, the pandemic does not always give us time to collect relevant data, and so speculative analysis has found its calling.

Dvir Aran, at Technion, has been diligently tracking the situation in Israel on his Twitter. Ten days after July 30, he posted the following chart, which immediately led many commentators to bounce out of their seats crowning the third shot as a magic bullet. Notably, Dvir himself did not endorse such a claim. (See here to learn how other hasty conclusions by experts have fared.)

When you look at Dvir's chart, what do we see?


Possibly one of the following two things, depending on what concern you have in your head.

1) The red line sits far above the other two lines, showing that unvaccinated people are much more likely to get infected.

2) The blue line diverges from the green line almost immediately after the 3rd shots started getting into arms, showing that the 3rd shot is super effective.

If you take another moment to look, you might start asking questions, as many in Twitter world did. Dvir was startlingly efficient at answering these queries.

A) Does the green line represent people with 2 or 3 doses, or is it strictly 2 doses? Aron asked this question and got the answer (the former):


It's time to check our presumptions. When you read that chart, did you presume it's exactly 2 doses or did you presume it's 2 or 3 doses? Or did you immediately spot the ambiguity? As I said in this article, graphs attain efficiency at communication because the designer leverages unspoken rules - the chart conveys certain information without explicitly placing it on the chart. But this can backfire. In this case, I presumed the three lines to display three non-overlapping groups of people, and thus the green line indicates those with 2 doses but not 3. That presumption led me to misinterpret what's on the chart.

B) What is the denominator of the case rates? Is it literal - by that I mean, all unvaccinated people for the red line, and all people with 3 doses for the blue line? Or is the denominator the population of Israel, the same number for all three lines? Lukas asked this question, and got the answer (the former).


C) Since third shots are recommended for 60 year olds and over who were vaccinated at least 5 months ago, and most unvaccinated Israelis are below 60, this answer opens the possibility that the lines compare apples and oranges. Joe. S. asked about this, and received an answer (all lines display only 60 year olds and over.)


Jason P. asked, and learned that the 5-month-out criterion is immaterial since 90% of the vaccinated have already reached that time point.


D) We have even more presumptions. Like me, did you presume that the red line represents the "unvaccinated," meaning people who have not had any vaccine shots? If so, we may both be wrong about this. It has become the norm by vaccine researchers to lump "partially vaccinated" people with "unvaccinated", and call this combined group "unvaccinated". Here is an excerpt from a recent report from Public Health Ontario (link to PDF), which clearly states this unintuitive counting rule:


Notice that in this definition, someone who got infected within 14 days of the first shot is classified as an "unvaccinated" case and not a "partially vaccinated case".

In the following tweet, Dvir gave a hint of what he plotted:


In a previous analysis, he averaged the rates of people with 0 doses and 1 dose, which is equivalent to combining them and calling them unvaccinated. It's unclear to me what he did to the 1-dose subgroup in our featured chart - did it just vanish from the chart? (How people and cases are classified into these groups is a major factor in all vaccine effectiveness calculations - a topic I covered here. Unfortunately, most published reports do a poor job explaining what the analysts did).

E) Did you presume that all three lines are equally important? That's far from true. Since Israel is the world champion in vaccination, the bulk of the 60+ population form the green line. I asked Dvir and he responded that only 7.5%, or roughly 100K are unvaccinated.


That means 1.2 million people are part of the green line, 12 times higher. There are roughly 50 cases per day among unvaccinated, and 370 daily cases among those with 2 or 3 doses. In other words, vaccinated people account for almost 90% of all cases.

Yes, this is inevitable when over 90% of the age group have been vaccinated (but it is predictable on the first day someone blasted everywhere that real-world VE is proved by the fact that almost all new cases were in the unvaccinated.)

If your job is to minimize infections, you should be spending most of your time thinking about the 370 cases among vaccinated than the 50 cases among unvaccinated. If you halve the case rate, that would be a difference of 185 cases vs 25. In Israel, the vaccination campaign has already succeeded; it's time to look forward, which is exactly why they are re-focusing on the already vaccinated.


If what you worry about most is the effectiveness of the original two-dose regimen, Dvir's chart raises a puzzle. Ignore the blue line, and remember that the green line already includes everybody represented by the blue line.

In the following chart, I removed the blue line, and added reference lines in dashed purple that correspond to 25%, 50% and 75% vaccine effectiveness. The data plotted on this chart are unadjusted case rates. A 75% effective vaccine cuts case rate by three quarters.


This chart shows the 2-dose mRNA vaccine was nowhere near 90% effective. (As regular readers know, I don't endorse this simplistic calculation and have outlined the problems here, but this style of calculation keeps getting published and passed around. Those who use it to claim real-world studies confirm prior clinical trial outcomes can either (a) insist on using it and retract their earlier conclusions, or (b) admit that such a calculation was, and is, a bad take.)

Also observe how the vaccinated (green) line is moving away from the unvaccinated (red) line. The vaccine apparently is becoming more effective, which runs counter to the trend used by the Israeli government to justify third doses. This improvement also precedes the start of the third-shot campaign. When the analytical method is bad, it generates all sorts of spurious findings.


As Dvir said, it is premature to comment on the third doses based on 10 days of data. For one thing, the vaccine developers insist that their vaccines must be given 14 days to work. In a typical calculation, all of the cases in the blue line fall outside the case-counting window. The effective number of cases that would be attributed to the 3-dose group right now is zero, and the vaccine effectiveness using the standard methodology is 100%, even better than shown in the chart.

There is an alternative interpretation of this graph. Statisticians call this the selection effect. On July 30, the blue line split out of the green: some people were selected to receive the 3rd dose - this includes an official selection (the government makes certain subgroups eligible) as well as a self-selection (within the eligible subgroup, certain people decide to get the 3rd shot earlier.) If those who are less exposed to the virus, or more risk averse, get the shots first, then all that is happening may be that we have split off a high VE subgroup from the green line. Even if the third shot were useless, the selection effect itself could explain the gap.

Statistics is about grays. It's not either-or. It's usually some of each. If you feel like Groundhog Day, you're getting the picture. When they rolled out two doses, we lived through an optimistic period in which most experts rejoiced about 90-100% real-world effectiveness, and then as more people get vaccinated, the effect washed away. The selection effect gradually disappears when vaccination becomes widespread. Are we starting a new cycle of hope and despair? We'll find out soon enough.