If you blink, you'd miss this axis trick

When I set out to write this post, I was intending to make a quick point about the following chart found in the current issue of Harvard Magazine (link):

Harvardmag_humanities

This chart concerns the "tectonic shift" of undergraduates to STEM majors at the expense of humanities in the last 10 years.

I like the chart. The dot plot is great for showing this data. They placed the long text horizontally. The use of color is crucial, allowing us to visually separate the STEM majors from the humanities majors.

My intended post is to suggest dividing the chart into four horizontal slices, each showing one of the general fields. It's a small change that makes the chart even more readable. (It has the added benefit of not needing a legend box.)

***

Then, the axis announced itself.

I was baffled, then disgusted.

Here is a magnified view of the axis:

Harvardmag_humanitiesmajors_axis

It's not a linear scale, as one would have expected. What kind of transformation did they use? It's baffling.

Notice the following features of this transformed scale:

  • It can't be a log scale because many of the growth values are negative.
  • The interval for 0%-25% is longer than for 25%-50%. The interval for 0%-50% is also longer than for 50%-100%. On the positive side, the larger values are pulled in and the smaller values are pushed out.
  • The interval for -20%-0% is the same length as that for 0%-25%. So, the transformation is not symmetric around 0

I have no idea what transformation was applied. I took the growth values, measured the locations of the dots, and asked Excel to fit a polynomial function, and it gave me a quadratic fit, R square > 99%.

Redo_harvardmaghumanitiesmajors_scale2

This formula fits the values within the range extremely well. I hope this isn't the actual transformation. That would be disgusting. Regardless, they ought to have advised readers of their unusual scale.

***

Without having the fitted formula, there is no way to retrieve the actual growth values except for those that happen to fall on the vertical gridlines. Using the inverse of the quadratic formula, I deduced what the actual values were. The hardest one is for Computer Science, since the dot sits to the right of the last gridline. I checked that value against IPEDS data.

The growth values are not extreme, falling between -50% and 125%. There is nothing to be gained by transforming the scale.

The following chart undoes the transformation, and groups the majors by field as indicated above:

Redo_harvardmagazine_humanitiesmajors

***

Yesterday, I published a version of this post at Andrew's blog. Several readers there figured out that the scale is the log of the relative ratio of the number of degrees granted. In the above notation, it is log10(100%+x), where x is the percent change in number of degrees between 2011 and 2021.

Here is a side-by-side view of the two scales:

Redo_harvardmaghumanitiesmajors_twoscales

The chart on the right spreads the negative growth values further apart while slightly compressing the large positive values. I still don't think there is much benefit to transforming this set of data.

 

P.S. [1/31/2023]

(1) A reader on Andrew's blog asked what's wrong with using the log relative ratio scale. What's wrong is exactly what this post is about. For any non-linear scale, the reader can't make out the values between gridlines. In the original chart, there are four points that exist between 0% and 25%. What values are those? That chart is even harder because now that we know what the transform is, we'd need to first think in terms of relative ratios, so 1.25 instead of 25%, then think in terms of log, if we want to know what those values are.

(2) The log scale used for change values is often said to have the advantage that equal distances on either side represent counterbalancing values. For example, (1.5) (0.66) = (3/2) (2/3)  = 1. But this is a very specific scenario that doesn't actually apply to our dataset.  Consider these scenarios:

History: # degrees went from 1000 to 666 i.e. Relative ratio = 2/3
Psychology: # degrees went from 2000 to 3000 i.e. Relative ratio = 3/2

The # of History degrees dropped by 334 while the number of Psychology degrees grew by 1000 (Psychology I think is the more popular major)

History: # degrees went from 1000 to 666 i.e. Relative ratio = 2/3
Psychology: from 1000 to 1500, i.e. Relative ratio = 3/2

The # of History degrees dropped by 334 while # of Psychology degrees grew by 500
(Assume same starting values)

History: # degrees went from 1000 to 666 i.e. Relative ratio = 2/3
Psychology: from 666 to 666*3/2 = 999 i.e. Relative ratio = 3/2

The # of History degrees dropped by 334 while # of Psychology degrees grew by 333
(Assume Psychology's starting value to be History's ending value)

Psychology: # degrees went from 1000 to 1500 i.e. Relative ratio = 3/2
History: # degrees went from 1500 to 1000 i.e. Relative ratio = 2/3

The # of Psychology degrees grew by 500 while the # of History degrees dropped by 500
(Assume History's starting value to be Psychology's ending value)

 

 


Dual axes: a favorite of tricksters

Twitter readers directed me to this abomination from the St. Louis Fed (link).

Stlouisfed_military_spend

This chart is designed to paint the picture that China is this grave threat because it's been ramping up military expenditure so much so that it exceeded U.S. spending since the 2000s.

Sadly, this is not what the data are suggesting at all! This story is constructed by manipulating the dual axes. Someone has already fixed it. Here's the same data plotted with a single axis:

Redo_military_spend

(There are two set of axis labels but they have the same scale and both start at zero, so there is only one axis.)

Certainly, China has been ramping up military spending. Nevertheless, China's current level of spending is about one-third of America's. Also, imagine the cumulative spending excess over the 30 years shown on the chart.

Note also, the growth line of U.S. military spending in this period is actually similarly steep as China's.

***

Apparently, the St. Louis Fed is intent on misleading its readers. Even though on Twitter, they acknowledged people's feedback, they decided not to alter the chart.

Stlouisfed_militaryexpenditure_tweet

If you click through to the article, you'll find the same flawed chart as before so I'm not sure how they "listened". I went to Wayback Machine to check the first version of this page, and I notice no difference.

***

If one must make a dual axes chart, it is the responsibility of the chart designer to make it clear to readers that different lines on the chart use different axes. In this case, since the only line that uses the right hand side axis is the U.S. line, which is blue, they should have colored the right hand axis blue. Doing that does not solve the visualization problem; it merely reduces the chance of not noticing the dual axes.

***

I have written about dual axes a lot in the past. Here's a McKinsey chart from 2006 that offends.


Visual cues affect how data are perceived

Here's a recent NYT graphic showing California's water situation at different time scales (link to article).

Nyt_california_drought

It's a small multiples display, showing the spatial distribution of the precipitation amounts in California. The two panels show, respectively, the short-term view (past month) and the longer-term view (3 years). Precipitation is measured in relative terms,  so what is plotted is the relative ratio of precipitation in the reference period, with 100 being the 30-year average.

Green is much wetter than average while brown is much drier than average.

The key to making this chart work is a common color scheme across the two panels.

Also, the placement of major cities provides anchor points for our eyes to move back and forth between the two panels.

***

The NYT graphic is technically well executed. I'm a bit unhappy with the headline: "Recent rains haven't erased California's long-term drought".

At the surface, the conclusion seems sensible. Look, there is a lot of green, even deep green, on the left panel, which means the state got lots more rain than usual in the past month. Now, on the right panel, we find patches of brown, and very little green.

But pay attention to the scale. The light brown color, which covers the largest area, has value 70 to 90, thus, these regions have gotten 10-30% less precipitation than average in the past three years relative to the 30-year average.

Here's the question: what does it mean by "erasing California's long-term drought"? Does the 3-year average have to equal or exceed the 30-year average? Why should that be the case?

If we took all 3-year windows within those 30 years, we're definitely not going to find that each such 3-year average falls at or above the 30-year average. To illustrate this, I pulled annual rainfall data for San Francisco. Here is a histogram of 3-year averages for the 30-year period 1991-2020.

Redo_nyt_californiadrought_sfrainfall

For example, the first value is the average rainfall for years 1989, 1990 and 1991, the next value is the average of 1990, 1991, and 1992, and so on. Each value is a relative value relative to the overall average in the 30-year window. There are two more values beyond 2020 that is not shown in the histogram. These are 57%, and 61%, so against the 30-year average, those two 3-year averages were drier than usual.

The above shows the underlying variability of the 3-year averages inside the reference time window. We have to first define "normal", and that might be a value between 70% and 130%.

In the same way, we can establish the "normal" range for the entire state of California. If it's also 70% to 130%, then the last 3 years as shown in the map above should be considered normal.

 

 


Accounting app advertises that it doesn't understand fractions

I captured the following image of an ad at the airport at the wrong moment, so you can only see the dataviz but not the text that came with it. The dataviz is animated with blue section circling around and then coming to a halt.

Tripactions_partial sm

The text read something like "75% of the people who saw this ad subsequently purchased something". I think the advertiser was TripActions. It is an app for accounting. Too bad their numbers people don't know 75% is three-quarters and their donut chart showed a number larger than 75%.

***

Browsing around the TripActions website, I also found this pie chart.

Tripactions_Most_Popular_Recurring_Pandemic_Era_Monthly_Expenses_-_TripActions_jmogxx

The radius of successive sectors is decreasing as the size of the proportions shrinks. As a result, the same two sectors labeled 12% at the bottom have differently-sized areas. The only way this dataviz can work is if the reader decodes the angles sustained at the center, and ignores the areas of the sectors. However, the visual cues all point readers to the areas rather than the angles.

In this sense, the weakness of this pie chart is the same as that of the racetrack chart, discussed recently here.

In addition, the color dimension is not used wisely. Color can be used to group the expenses into categories, or it can be used to group them by proportion of total (20%+, 10-19%, 5-9%, 1-4%, <1%).