Pie charts and self-sufficiency

This graphic, a series of pie charts, shows up in a recent issue of Princeton's alumni magazine.

Pu_aid sm

The story being depicted is clear: the school has been generously increasing the amount of financial aid given to students since 1998. The proportion receiving any aid went from 43% to 67%, so about two out of three students who enrolled in 2023 are getting aid.

The key components of the story are the values in 1998 and 2023, and the growth trend over this period.

***

Here is an exercise worth doing. Think about how you figured out the story components.

Is it this?

Junkcharts_redo_pu_aid_1

Or is it this?

Junkcharts_redo_pu_aid_2

***

This is what I've been calling a "self-sufficiency test" (link). How much work are the visual elements doing in conveying the graph's message to you? If the visual elements aren't doing much, then the designer hasn't taken advantage of the visual medium.


Approaching the Paris Olympics

If you're looking for dataviz about the upcoming Paris Olympics, I recommend this one by the great SCMP team.

Scmp_parisianolympics100years

The impact of this piece starts with picking an engaging topic: how have the disciplines changed over the last 100 years? It capitalizes on the fact that the Games are returning to Paris after a century.

Most of the infographics contain illustrations, with the interactive device of a slider that makes it easier to compare two graphics, one for each year. Without the slider, the graphics have to be placed top and bottom, or side by side, both of which require a lot of eye movements.

Here are some bits that I particularly enjoyed:

Scmp_olympics_medaldesign

Not surprisingly, the 2024 medal is much larger and heavier than the 1924 one. The old one emphasizes sportsmanship while the new medal frontlines victory.

Scmp_olympics_polevault

Having only seen pole vaulting on modern equipment, I find it fascinating to imagine athletes using rigid wooden poles, and then having to land on their feet in the sawdust pit. Moving the slider to the left reveals the current setup, with fiberglass poles that bend, and landing mattresses. Cheekily, they also tell us where the cameras are placed. Quite a bit of the performance gain (from 3.95 to 6.22 m) can be attributed to equipment improvements.

These illustrations convince me that a lot of the performance gains over time can be attributed to better technologies, better equipment, and rule changes (that accommodate these modern innovations). For example, swimmers now start off a jumping block rather than from the side of the pool.

Scmp_olympics_roadrace

Yes, and they have some statistical graphics. This one about the cycling road race is really nice. It shows that the total distance of the 2024 race is about 1/3 longer than that of the 1924 race. It also shows that the new route features a lot more ups and downs than the original route. The highest point of the 1924 route is higher than that of the new route, though. This is a great example of the conciseness of visual language.

Scmp_olympics_womenfencing

I chuckled at this one. This was the gear worn by women fencers back at the 1924 Olympics.

***

There's a lot more at SCMP. Go take a look!


Expert handling of multiple dimensions of data

I enjoyed reading this Washington Post article about immigration in America. It features a number of graphics. Here's one graphic I particularly like:

Wpost_smallmultiplesmap

This is a small-multiples display of six maps, showing the spatial distribution of immigrants from different countries. The maps reveal some interesting patterns: Los Angeles is a big favorite of Guatemalans while Houston is preferred by Hondurans. Venezuelans like Salt Lake City and Denver (where there are also some Colombians and Mexicans). The breadth of the spatial distribution surprises me.

The dataset behind this graphic is complex. It's got country of origin, place of settlement, and time of arrival. The maps above collapse the time dimension, while drawing attention to the other two dimensions.

***

They have another set of charts that highlight the time dimension while collapsing the place of settlement dimension. Here's one view of it:

Wpost_inkblot_overall

There are various names for this chart form. Stream river is one. I like to call it "inkblot", where the two sides are symmetric around the middle vertical line. The chart shows that the number of "migrants in the U.S. immigration court" system has grown substantially since the end of the Covid-19 pandemic, during which they stopped coming.

I'm not a fan of the inkblot. One reason is visible in the following view, which showcases three Central American countries.

Wpost_inkblot_centralamerica

The main message is clear enough. The volume of immigrants from these three countries has been relatively stable over the last decade, with a bulge in the late 2010s. The recent spurt in migrants has come from other places.

But try figuring out what proportion of total immigration is accounted for by these three countries in, say, 2024. It's a task that is tougher than it should be, and the culprit is that the "other countries" category has been split in half, with the two halves separated.

 


When should we use bar charts?

Significance_13thfl sm

Two innocent-looking column charts.

These came from an article in Significance magazine (link to paywall) that applies the "difference-in-difference" technique to analyze whether the superstitious act of skipping the number 13 when numbering floors in tall buildings causes an inflation of condo pricing.

The study authors are quite careful in their analysis, recognizing that building managers who decide to relabel the 13th floor as 14th may differ in other systematic ways from those who don't relabel. They use a matching technique to construct comparison groups. The left-side chart shows one effect of matching buildings, which narrowed the gap in average square footage between the relabeled and non-relabeled groups. (Any such gap suggests potential confounding; in a hypothetical, randomized experiment, the average square footage of both groups should be statistically identical.)

The left-side chart features columns that don't start at zero, thus the visualization exaggerates the differences. The degree of exaggeration here is tame: about 150 got chopped off at the bottom, which is about 10% of the total height. But why?
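Here is a rough sketch of how much a chopped axis distorts a comparison. Only the baseline of about 150 comes from the chart; the two column values are hypothetical, picked so that the cut is about 10% of the taller column.

```python
def height_ratio(a, b, baseline=0.0):
    """Ratio of the two column heights as drawn when the axis starts at `baseline`."""
    return (a - baseline) / (b - baseline)


baseline = 150.0           # roughly what was chopped off, per the chart
relabeled = 1500.0         # hypothetical value
not_relabeled = 1400.0     # hypothetical value

true_ratio = height_ratio(relabeled, not_relabeled)             # axis starts at zero
drawn_ratio = height_ratio(relabeled, not_relabeled, baseline)  # axis starts at 150
chopped_share = baseline / relabeled                            # share of the taller column removed

print(f"true ratio:    {true_ratio:.3f}")
print(f"drawn ratio:   {drawn_ratio:.3f}")
print(f"chopped share: {chopped_share:.0%}")
```

Under these assumed values, the drawn ratio is only slightly larger than the true ratio, which is why I call the exaggeration tame.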

***

The right-side chart is even more problematic.

This chart shows the effect of matching buildings on the average age of the buildings (measured using the average construction year). Again, the columns don't start at zero. But for this dataset, zero is a meaningless value. Never make a column chart when the zero level has no meaning!

The story is simple: by matching, the average construction year in the relabeled group was brought closer to that in the non-relabeled group. The construction year is an ordinal categorical variable, with integer values. I think a comparison of two histograms will show the message more clearly, and also provide more information than just the two average values.
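A minimal sketch of the two-histogram comparison follows. The construction years are invented for illustration, and binning by decade is my own choice, not something taken from the article.

```python
from collections import Counter


def decade_counts(years):
    """Count buildings per construction decade (1988 falls in 1980)."""
    return Counter((y // 10) * 10 for y in years)


def side_by_side(a, b, label_a, label_b):
    """Render two decade histograms as text, sharing the same bins."""
    ca, cb = decade_counts(a), decade_counts(b)
    first = min(min(ca), min(cb))
    last = max(max(ca), max(cb))
    rows = [f"{'decade':<8}{label_a:<14}{label_b}"]
    for d in range(first, last + 10, 10):
        # A Counter returns 0 for decades missing from one group.
        rows.append(f"{str(d) + 's':<8}{'#' * ca[d]:<14}{'#' * cb[d]}")
    return "\n".join(rows)


# Hypothetical construction years (not from the article).
relabeled_years = [1988, 1995, 2001, 2004, 2006, 2007, 2011, 2013]
not_relabeled_years = [1972, 1985, 1993, 1999, 2003, 2005, 2008, 2012]

print(side_by_side(relabeled_years, not_relabeled_years, "relabeled", "not relabeled"))
```

Unlike two averages, the paired histograms show the spread of construction years in each group, and there is no zero level to worry about.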