Expert handling of multiple dimensions of data

I enjoyed reading this Washington Post article about immigration in America. It features a number of graphics. Here's one graphic I particularly like:

Wpost_smallmultiplesmap

This is a set of small multiples, six maps showing the spatial distribution of immigrants from different countries. The maps reveal some interesting patterns: Los Angeles is a big favorite of Guatemalans while Houston is preferred by Hondurans. Venezuelans like Salt Lake City and Denver (where there are also some Colombians and Mexicans). The breadth of the spatial distribution surprises me.

The dataset behind this graphic is complex. It's got country of origin, place of settlement, and time of arrival. The maps above collapsed the time dimension, while drawing attention to the other two dimensions.

***

They have another set of charts that highlight the time dimension while collapsing the place of settlement dimension. Here's one view of it:

Wpost_inkblot_overall

There are various names for this chart form. Stream river is one. I like to call it "inkblot", where the two sides are symmetric around the middle vertical line. The chart shows that "migrants in the U.S. immigration court" system have grown substantially since the end of the Covid-19 pandemic, during which they stopped coming.

I'm not a fan of the inkblot. One reason is visible in the following view, which showcases three Central American countries.

Wpost_inkblot_centralamerica

The main message is clear enough. The volume of immigrants from these three countries has been relatively stable over the last decade, with a bulge in the late 2000s. The recent spurt in migrants has come from other places.

But try figuring out what proportion of total immigration is accounted for by these three countries, say, in 2024. It's a task that is tougher than it should be, and the culprit is that the "other countries" category has been split in half with the two halves separated.

 


Aligning V and Q by way of D

In the Trifecta Checkup (link), there is a green arrow between the Q (question) and V (visual) corners, indicating that they should align. This post illustrates what I mean by that.

I saw the following chart in a Washington Post article comparing dairy milk and plant-based "milks".

Vitamins

The article contains a whole series of charts. The one shown here focuses on vitamins.

The red color screams at the reader. At first, it appears to suggest that dairy milk is a standout on all four categories of vitamins. But that's not what the data say.

Let's take a look at the chart form: it's a grid of four plots, each containing one square for each of four types of "milk". The data are encoded in the areas of the squares. The red and green colors represent category labels and do not reflect data values.

Whenever we make bubble plots (the closest relative of these square plots), we have to solve a scale problem. What is the relationship between the scales of the four plots?

I noticed the largest square is the same size across all four plots. So, the size of each square is made relative to the maximum value in each plot, which is assigned a fixed size. In effect, the data encoding scheme is that the areas of the squares show the index values relative to the group maximum of each vitamin category. So, soy milk has 72% as much potassium as dairy milk while oat and almond milks have roughly 45% as much as dairy.

The same encoding scheme is applied also to riboflavin. Oat milk has the most riboflavin, so its square is the largest. Soy milk is 80% of oat, while dairy has 60% of oat.

***

Let's step back to the Trifecta Checkup (link). What's the question being asked in this chart? We're interested in the amount of vitamins found in plant-based milk relative to dairy milk. We're less interested in which type of "milk" has the highest amount of a particular vitamin.

Thus, I'd prefer the indexing tied to the amount found in dairy milk, rather than the maximum value in each category. The following set of column charts shows this encoding:

Junkcharts_redo_msn_dairyplantmilks_2

I changed the color coding so that blue columns represent higher amounts than dairy while yellow represent lower.

From the column chart, we find that plant-based "milks" contain significantly less potassium and phosphorus than dairy milk while oat and soy "milks" contain more riboflavin than dairy. Almond "milk" has negligible amounts of riboflavin and phosphorus. There is virtually no difference between the four "milk" types in providing vitamin D.

***

In the above redo, I strengthen the alignment of the Q and V corners. This is accomplished by making a stop at the D corner: I change how the raw data are transformed into index values. 
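The stop at the D corner can be sketched in a few lines of Python. The inputs are the index values quoted earlier in this post (each relative to the largest value in its vitamin group); the function name `reindex` and the dictionary layout are illustrative assumptions, and almond riboflavin is left out because no figure is quoted for it.

```python
# Index values relative to the largest value in each vitamin group,
# as quoted in the text above. Almond riboflavin is omitted because
# the post gives no number for it.
relative_to_max = {
    "potassium": {"dairy": 1.00, "soy": 0.72, "oat": 0.45, "almond": 0.45},
    "riboflavin": {"oat": 1.00, "soy": 0.80, "dairy": 0.60},
}


def reindex(values, baseline="dairy"):
    """Re-express every value as a ratio to the baseline item."""
    base = values[baseline]
    return {name: value / base for name, value in values.items()}


relative_to_dairy = {
    vitamin: reindex(values) for vitamin, values in relative_to_max.items()
}

for vitamin, values in relative_to_dairy.items():
    for milk, index in values.items():
        print(f"{vitamin:<10} {milk:<6} {index:.2f}")
```

Under dairy indexing, oat comes out at about 1.67 times dairy for riboflavin and soy at about 1.33 times. The potassium values do not change because dairy was already the maximum in that group.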

Just for comparison, if I only change the indexing strategy but retain the square plot chart form, the revised chart looks like this:

Junkcharts_redo_msn_dairyplantmilks_1

The four squares showing dairy on this version have the same size. Readers can evaluate the relative sizes of the other "milk" types.


Lost in the middle class

Washington Post asks people what it means to be middle class in the U.S. (link; paywall)

The following graphic illustrates one type of definition, purely based on income ranges.

Wpost_middleclass

For me, this chart is more taxing to read than it appears.

It can be read column by column. Each column represents a hypothetical annual income for a family of four. People are asked whether they consider that family lower/working class, middle class or upper class. Be careful as the increments from column to column are not uniform.

Now, what's the question again? We're primarily interested in what incomes constitute middle class.

So, we should be looking at the deep green blocks that hang in the middle of each column. It's not easy to read the proportion of middle blocks in a stacked column chart.

***

I tried separating out the three perceived income classes, using a small-multiples design.

Junkcharts_redo_wpost_middleclass

One can more directly see what income ranges are most popularly perceived as being in each income class.

***

The article also goes into alternative definitions of middle class, using more qualitative metrics, such as "able to pay all bills on time without worry". That's a whole other post.

 


Stranger things found on scatter plots

Washington Post published a nice scatter plot which deconstructs scores from the recent World Championships in Gymnastics. (link)

Wpost_simonebiles

The chart presents the main message clearly - the winner Simone Biles scored the highest on both components of the score (difficulty and execution), by quite some margin.

What else can we learn from this chart?

***

Every athlete who qualified for the final scored at or above average on both components.

Scoring below average on either component is a death knell: no athlete scored enough on the other component to compensate. (The top left and bottom right quadrants would have had some yellow dots otherwise.)

Several athletes in the top right quadrant presumably scored enough to qualify but didn't. The footnote likely explains it: each country can send at most two athletes to the final. It may be useful to mark out these "unlucky" athletes using a third color.

Curiously, it's not easy to figure out who these unlucky athletes were from this chart alone. We need two pieces of data: the minimum qualifying score, and the total score for each athlete. The scatter plot isn't the best chart form to show totals, but qualification to the final is based on the sum of the difficulty and execution scores. (Note also, neither axis starts at zero, compounding the challenge.)
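The bookkeeping needed to find these unlucky athletes can be sketched in Python. Everything in the example is a placeholder: the athlete names, countries, scores and number of final spots are hypothetical, and the only rule taken from the chart's footnote is the cap of two athletes per country.

```python
from collections import Counter


def select_finalists(athletes, spots, max_per_country=2):
    """Rank by total score, then fill the final while capping each country.

    athletes: list of (name, country, difficulty, execution) tuples.
    Returns (finalists, blocked). Blocked athletes ranked above the last
    finalist on total score but were shut out by the per-country cap.
    """
    ranked = sorted(athletes, key=lambda a: a[2] + a[3], reverse=True)
    per_country = Counter()
    finalists, blocked = [], []
    for athlete in ranked:
        if len(finalists) == spots:
            break
        country = athlete[1]
        if per_country[country] < max_per_country:
            finalists.append(athlete)
            per_country[country] += 1
        else:
            blocked.append(athlete)
    return finalists, blocked


# Hypothetical field, NOT the championship results.
field = [
    ("Athlete A", "X", 6.0, 9.0),
    ("Athlete B", "X", 5.8, 8.9),
    ("Athlete C", "X", 5.6, 8.8),
    ("Athlete D", "Y", 5.4, 8.7),
    ("Athlete E", "Z", 5.2, 8.5),
]
finalists, blocked = select_finalists(field, spots=3)
print([a[0] for a in finalists], [a[0] for a in blocked])
```

In this made-up field, Athlete C has the third-highest total but is the third entrant from country X, so the last spot passes to Athlete D. That is exactly the case a third color on the scatter plot would flag.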

***

This scatter plot is most memorable for shattering one of my expectations about risk and reward in sports.

I expect risk-seeking athletes to suffer from higher variance in performance. The tennis player who goes for big serves tends to also commit more double faults. The sluggers who hit home runs tend to strike out more often. Similarly, I expect gymnasts who attempt more difficult skills to receive lower execution scores.

Indeed, the headline writer seemed to agree, suggesting that Biles is special because she's both high in difficulty and strong in execution.

The scatter plot, however, sends the opposite message - this should not surprise. The entire field shows a curiously strong positive correlation between difficulty and execution scores. The more difficult the routine, the higher the execution score!

It's hard to explain such a pattern. My guesses are:

a) judges reward difficult routines, and subconsciously confound execution and difficulty scores. They use separate judges for execution and difficulty. Paradoxically, this arrangement may have caused separation anxiety - the judges for execution might just feel the urge to reward high difficulty.

b) those athletes who are skilled enough to attempt more difficult routines are also those who are more consistent in execution. This is a type of self-selection bias frequently found in observational data.

Regardless of the reasons for the strong correlation, the chart shows that these two components of the total score are not independent, i.e. the metrics have significant overlap in what they measure. Thus, one cannot really talk about a difficult routine without also noting that it's a well-executed routine, and vice versa. In an ideal scoring design, we'd like to have independent components.
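The overlap between the two components can be put into a single number with the Pearson correlation coefficient. Below is a minimal Python sketch; the scores are hypothetical values chosen to mimic an upward-sloping cloud of dots, not the championship data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores, NOT the championship data.
difficulty = [4.8, 5.0, 5.2, 5.4, 5.6, 6.0]
execution = [7.6, 7.9, 8.0, 8.3, 8.4, 8.9]

r = pearson(difficulty, execution)
print(round(r, 2))
```

A coefficient near zero would indicate components that measure different things. A value close to one, as in this made-up example, is what overlapping metrics look like.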


Graphics that stretch stomachs and make merry

Washington Post has a fun article about the Hot Dog Eating Contest in Coney Island here.

This graphic shows various interesting insights about the annual competition:

Washingtonpost_hotdogeating_scatter

Joey Chestnut is the recent king of hot-dog eating. Since the late 2000s, he's dominated the competition. He typically chows down over 60 hot dogs in 10 minutes. This is shown by the yellow line. Even at that high level, Chestnut has shown steady growth over time.

The legend tells us that the chart shows the results of all the other competitors. It's pretty clear that few have been able to even get close to Chestnut all these years. Most contestants were able to swallow 30 hot dogs or fewer.

It doesn't appear that the general standard has increased over time.

In 2011, a separate competition for women started. There is also a female champion (Miki Sudo) who has won almost every competition since she started playing.

One strange feature is the lack of competition in the early years. The footnote informs us that the trend is not real - they simply did not keep records of other competitors in early contests.

The only question I can't answer from this chart is the general standard and number of female competitors. The chart designer chooses not to differentiate between male and female contestants, other than the champions. I can understand that. Adding another dimension to the chart is a double-edged sword.

***

There is even more fun. There is a little video illustrating theories about what kind of human bodies can take in that many hot dogs in a short time. Here is a screen shot of it:

Washingtonpost_hotdogeating_body

 

 


A graphical compass

A Twitter user pointed me to this article from Washington Post, ruminating about the correlation between gas prices and measures of political sentiment (such as Biden's approval rating or right-track-wrong-track). As is common in this genre, the analyst proclaims that he has found something "counter intuitive".

The declarative statement strikes me as odd. In the first two paragraphs, he said the data showed "as gas prices fell, American optimism rose. As prices rose, optimism fell... This seems counterintuitive."

I'm struggling to see what's counterintuitive. Aren't the data suggesting people like lower prices? Is that not what we think people like?

The centerpiece of the article concerns the correlation between metrics. "If two numbers move in concert, they can be depicted literally moving in concert. One goes up, the other moves either up or down consistently." That's a confused statement and he qualifies it by typing "That sort of thing."

He's reacting to the following scatter plot with lines. The Twitter user presumably found it hard to understand. Count me in.

Washingtonpost_gasprices

Why is this chart difficult to grasp?

The biggest puzzle is: what differentiates those two lines? The red and the gray lines are not labelled. One would have to consult the article to learn that the gray line represents the "raw" data at weekly intervals. The red line is aggregated data at monthly intervals. In other words, each red dot is an average of 4 or 5 weekly data points. The red line is just a smoothed version of the gray line. Smoothed lines show the time trend better.
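The relationship between the two lines can be sketched in Python: each red dot is the average of the weekly points that fall in the same calendar month. The weekly readings below are hypothetical, and the function name `monthly_means` is an illustrative assumption.

```python
from collections import defaultdict
from datetime import date


def monthly_means(weekly):
    """Average weekly (day, gas_price, approval) rows within each month."""
    buckets = defaultdict(list)
    for day, gas_price, approval in weekly:
        buckets[(day.year, day.month)].append((gas_price, approval))
    means = {}
    for month, rows in sorted(buckets.items()):
        prices = [p for p, _ in rows]
        approvals = [a for _, a in rows]
        means[month] = (sum(prices) / len(rows), sum(approvals) / len(rows))
    return means


# Hypothetical weekly readings, NOT the data behind the chart.
weekly = [
    (date(2022, 1, 3), 3.2, 43.0),
    (date(2022, 1, 10), 3.3, 42.0),
    (date(2022, 1, 17), 3.3, 42.0),
    (date(2022, 1, 24), 3.4, 41.0),
    (date(2022, 2, 7), 3.4, 42.0),
    (date(2022, 2, 14), 3.5, 41.0),
    (date(2022, 2, 21), 3.5, 41.0),
    (date(2022, 2, 28), 3.6, 40.0),
]
monthly = monthly_means(weekly)
for (year, month), (price, approval) in monthly.items():
    print(year, month, round(price, 2), round(approval, 1))
```

Eight gray points collapse into two red ones, which is why the red line looks smoother and why the two lines must run in the same direction of time.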

The next missing piece is the direction of time, which can only be inferred by reading the month labels on the red line. But the chart without the direction of time is like a map without a compass. Take this segment for example:

Wpost_gaspricesapproval_directionoftime

If time is running up to down, then approval ratings are increasing over time while gas prices are decreasing. If time is running down to up, then approval ratings are decreasing over time while gas prices are increasing. Exactly the opposite!

The labels on the red line are not sufficient. It's possible that time runs in the opposite direction on the gray line! We can only exclude that possibility if we know that the red line is a smoothed version of the gray line.

This type of chart benefits from having a compass. Here's one:

Wpost_gaspricesapproval_compass

It's useful for readers to know that the southeast direction is "good" (higher approval ratings, lower gas prices) while the northwest direction is "bad". Going back to the original chart, one can see that the metrics went in the "bad" direction at the start of the year and have reverted to a "good" direction since.

***

What does this chart really say? The author remarked that "correlation is not causation". "Just because Biden’s approval rose as prices dropped doesn’t mean prices caused the drop."

Here's an alternative: People have general sentiments. When they feel good, they respond more positively to polls, as in they rate everything more positively. The approval ratings are at least partially driven by this general sentiment. The same author apparently has another article saying that the right-track-wrong-track sentiment also moved in tandem with gas prices.

One issue with this type of scatter plot is that it always cues readers to make an incorrect assumption: that the outcome variable (approval rating) is solely - or predominantly - driven by the one factor being visualized (gas prices). This visual choice completely biases the reader's perception.

P.S. [11-11-22] The source of the submission was incorrectly attributed.


Where have the graduates gone?

Someone submitted this chart on Twitter as an example of good dataviz.

Washingtonpost_aftercollege

The chart shows the surprising leverage colleges have on where students live after graduation.

The primary virtue of this chart is conservation of space. If our main line of inquiry is the destination states of college graduates, by state, then it's hard to beat this chart's efficiency at delivering this information. For each state, it's easy to see what proportion of graduates leave the state after graduation, and then within those who leave, the reader can learn which are the most popular destination states, and their relative importance.

The colors link the most popular destination states (e.g. Texas in orange) but they are not enough because the designer uses state labels also. A next set of states are labeled without being differentiated by color. In particular, New York and Massachusetts share shades of blue, which also is the dominant color on the left side.

***

The following is a draft of a concept I have in my head.

Junkcharts_redo_washpost_postgraddestinations_1

I imagine this to be a tile map. The underlying data are not public so I just copied down a bunch of interesting states. This view brings out the spatial information, as we expect graduates are moving to neighboring states (or the states with big cities).

The students in the Western states are more likely to stay in their own state, and if they move, they stay on the West Coast. The graduates in the Eastern states also tend to stay nearby, except for California.

I decided to use groups of color - blue for East, green for South, red for West. Color is a powerful device, if used well. If the reader wants to know which states send graduates to New York, I'm hoping the reader will see the chart this way:

Junkcharts_redo_washpost_postgraddestinations_2

 


Dataviz is good at comparisons if we make the right comparisons

In an article about gas prices around the world, the Washington Post uses the following bar chart (link):

Wpost_gasprices_highincome

There are a few wrinkles in this one compared to the most generic bar chart one can produce:

Redo_wpost_gasprices_0

(The numbers on my chart are not the same as Washington Post's. That's because the data vendor charges for data, except for the most recent week. So, my data is from a different week.)

The gas prices are not expressed in dollars but a transformation turns prices into a cost-effectiveness metric: miles per dollar, or more precisely, miles per $40 of gas. The metric has a reverse direction - the higher the price, the lower the miles. The data transformation belongs to the D corner of the Trifecta Checkup framework (link). Depending on how one poses the Q(uestion) of the chart, the shift from dollars to miles can bring the Q and the D in sync.

In the V(isual) corner, the designer embellishes the bars. A car icon is placed at the tip of each bar while the bar itself is turned into a wavy path, symbolizing a dirt path. The driving metaphor is in full play. In fact, the video makes the most out of it. There is no doubt that the embellishment has turned a mere scientific presentation into a form of entertainment.

***

Did the embellishment harm visual clarity? For the most part, no.

The worst it can get is when they compared U.S. and India/South Africa:

Redo_wpost_gasprices_indiasouthafrica

The left column shows the original charts from the article. In both charts, the two cars are so close together that it is impossible to learn the scale of the difference. The amount of difference is a fraction of the width of a car icon.

The right column shows the "self-sufficiency test". Imagine the data labels are not on the chart. What we learn is that if we wanted to know how big a gap there is between the two countries, when reading the charts on the left, we are relying on the data labels, not the visual elements. On the right side, if we really want to learn the gaps, we have to look through the car icons to find the tips of the bars!

This discussion does not necessarily doom the appealing chart. If the message one wants to send with the India/South Africa charts is that there is negligible difference between them, then it is not crucial to present the precise differences in prices.

***

The real problem with this dataviz is in the D corner. Comparing countries is hard.

As shown above, by the miles per $40 spend metric, the U.S. and India are rated essentially the same. So are the average American and the average Indian suffering equally?

Far from it. The clue comes from the aggregate chart, in which countries are divided into three tiers: high income, upper middle income and lower middle income. The U.S. belongs to the high-income tier while India falls into the lower-middle-income tier.

The cost of living in India is much lower than in the US. Forty dollars is a much bigger chunk of an Indian paycheck than an American one.

To adjust for cost of living, economists use a PPP (purchasing power parity) value. The following chart shows the difference:

Redo_wpost_gasprices_1

The right graph contains cost-of-living adjustments. It shows a completely different picture. Nominally (left chart), the price of gas is about the same in dollar terms between the U.S. and India. In terms of cost of living, gas is actually 5 times more expensive in India. Thus, the adjusted miles per $40 gas number is much smaller for India than the unadjusted. (Because PPP is relative to U.S. prices, the U.S. numbers are not affected.)
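The adjustment can be sketched as a one-line scaling in Python. The parameter `price_level_ratio` is an assumption introduced for illustration: it stands for the local price level relative to the U.S. (U.S. = 1.0). The value 0.2 for India is not looked up from any PPP table; it is simply the ratio implied by the "5 times" statement above, and the nominal miles figure for India is set equal to the U.S. figure because the two are described as essentially the same.

```python
def ppp_adjusted_miles(nominal_miles, price_level_ratio):
    """Scale a miles-per-$40 figure by the local price level.

    price_level_ratio: local price level relative to the U.S. (U.S. = 1.0).
    A ratio of 0.2 means a dollar stretches five times as far locally, so a
    price quoted in nominal dollars feels five times as expensive.
    """
    return nominal_miles * price_level_ratio


US_MILES = 247.0             # U.S. figure quoted in the article
INDIA_NOMINAL_MILES = 247.0  # assumption: "essentially the same" as the U.S.
INDIA_PRICE_LEVEL = 0.2      # assumption implied by "5 times more expensive"

us_adjusted = ppp_adjusted_miles(US_MILES, 1.0)
india_adjusted = ppp_adjusted_miles(INDIA_NOMINAL_MILES, INDIA_PRICE_LEVEL)
print(round(us_adjusted), round(india_adjusted))
```

With a ratio of 1.0 the U.S. number passes through unchanged, while India's drops to a fifth of its nominal value.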

PPP is not the end-all here. According to the Economic Times (India), only 22 out of 1,000 Indians own cars, compared to 980 out of 1,000 Americans. Think about the implication of using any statistic that averages the entire population!

***

Why is gas more expensive in California than the U.S. average? The talking point I keep hearing is environmental regulations. Gas prices may be higher in Europe for a similar reason. Residents in those places may be willing to pay higher prices because they get satisfaction from playing their part in preserving the planet for future generations.

The footnote discloses this not-trivial issue.

Wpost_gasprices_footnote

When converting from dollars per gallon/liter into miles per $40, we need data on miles per gallon/liter. Americans notoriously drive cars (trucks, SUVs, etc.) that have much lower mileage than those driven in other countries. However, this factor is artificially removed by assuming the same car with 32 mpg for all countries. A quick hop to the BTS website tells us that the average mpg of American cars is a third of that assumption. [See note below.]

Ignoring cross-country comparisons for the time being, the true number for U.S. is not 247 miles per $40 spent on gas as claimed. It is a third of that value: 82 miles per $40 spent.
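The arithmetic behind the metric is simple enough to sketch in Python: miles per $40 equals $40 divided by the price per gallon, times miles per gallon. The function name `miles_per_budget` is an illustrative assumption; the 32 mpg and 247 miles figures come from the article as quoted above, and the price per gallon is backed out from them rather than taken from any data source.

```python
def miles_per_budget(price_per_gallon, mpg, budget=40.0):
    """Miles that `budget` dollars of gas will cover."""
    return budget / price_per_gallon * mpg


ASSUMED_MPG = 32.0  # fuel economy the footnote applies to every country
US_MILES = 247.0    # U.S. miles per $40 as claimed in the article

# Back out the U.S. price per gallon implied by the two numbers above.
implied_price = 40.0 * ASSUMED_MPG / US_MILES

# Re-run the metric with a fuel economy one third of the assumption.
one_third = miles_per_budget(implied_price, ASSUMED_MPG / 3)

# Re-run it with the roughly 23 mpg figure given in the P.S. to this post.
corrected = miles_per_budget(implied_price, 23.0)

print(round(implied_price, 2), round(one_third), round(corrected))
```

Because the metric is linear in mpg, any change to the fuel-economy assumption scales the result by the same proportion: a third of the mpg yields 82 miles, and the roughly 23 mpg from the P.S. yields about 178 miles.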

It's tough to find data on fuel economy of all passenger cars, not just new passenger cars. I found Australia's number, which is 21 mpg. So this brings the miles per $40 number down from about 230 to 115. These are not small adjustments.

Washington Post's analysis paints a simplistic picture that presupposes that price is the only thing people care about. I call this issue xyopia. It's when the analyst frames the problem as factor x explaining outcome y, and when factor x is not the only, and frequently not even the most important, factor affecting y.

More on xyopia.

More discussion of Washington Post graphics.

 

[P.S. 7-25-2022. Reader Cody Curtis pointed out in the comments that the Bureau of Transportation Statistics report was using km/liter as units, not miles per gallon. The 10 km/liter number for average cars is roughly 23 mpg. I'll leave the text as is in the post as the larger point is valid: that there is variation in average fuel economy between nations - partly due to environmental regulation and consumer behavior - and thus, a proper comparison requires adjusting for this factor.]


The time has arrived for cumulative charts

Long-time reader Scott S. asked me about this Washington Post chart that shows the disappearance of pediatric flu deaths in the U.S. this season:

Washingtonpost_pediatricfludeaths

The dataset behind this chart is highly favorable to the designer, because the signal in the data is so strong. This is a good chart. The key point is shown clearly right at the top, with an informative title. Gridlines are very restrained. I'd draw attention to the horizontal axis. The master stroke here is omitting the week labels, which are likely confusing to all but the people familiar with this dataset.

Scott suggested using a line chart. I agree. And especially if we plot cumulative counts, rather than weekly deaths. Here's a quick sketch of such a chart:

Junkcharts_redo_wppedflu_panel

(On second thought, I'd remove the week numbers from the horizontal axis, and just go with the month labels. The Washington Post designer is right in realizing that those week numbers are meaningless to most readers.)
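Turning weekly counts into cumulative counts is a running sum. Here is a minimal Python sketch; the weekly numbers and season labels are hypothetical and are not the figures behind the Washington Post chart.

```python
from itertools import accumulate

# Hypothetical weekly death counts for two seasons, NOT the real figures.
weekly_deaths = {
    "typical season": [1, 2, 4, 7, 9, 6, 3, 1],
    "this season": [0, 0, 1, 0, 0, 0, 0, 0],
}

# Running total per season: each entry is the count up to and including that week.
cumulative = {
    season: list(accumulate(counts)) for season, counts in weekly_deaths.items()
}

for season, totals in cumulative.items():
    print(season, totals)
```

The cumulative line never decreases, so a season with almost no deaths shows up as a flat line hugging the axis, which is the contrast the chart is meant to highlight.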

The vaccine trials have brought this cumulative count chart form to the mainstream. For anyone who has seen the vaccine efficacy charts, the interpretation of the panel of line charts should come naturally.

Instead of four plots, I prefer one plot with four superimposed lines. Like this:

Junkcharts_redo_wppeddeaths_superpose2

 

 

 


Water stress served two ways

Via Alberto Cairo (whose new book How Charts Lie can be pre-ordered!), I found the Water Stress data visualization by the Washington Post. (link)

The main interest here is how they visualized the different levels of water stress across the U.S. Water stress is some metric defined by the World Resources Institute that, to my mind, measures the demand versus supply of water. The higher the water stress, the higher the risk of experiencing droughts.

There are two ways in which the water stress data are shown: the first is a map, and the second is a bubble plot.

Wp_waterstress

This project provides a great setting to compare and contrast these chart forms.

How Data are Coded

In a map, the data are usually coded as colors. Sometimes, additional details can be coded as shades, or moire patterns within the colors. But the map form locks down a number of useful dimensions - including x and y location, size and shape. The outline map reserves all these dimensions, rendering them unavailable to encode data.

By contrast, the bubble plot admits a good number of dimensions. The key ones are the x- and y- location. Then, you can also encode data in the size of the dots, the shape, and the color of the dots.

In our map example, the colors encode the water stress level, and a moire pattern encodes "arid areas". For the scatter plot, x = daily water use, y = water stress level, grouped by magnitude, color = water stress level, size = population. (Shape is constant.)

Spatial Correlation

The map is far superior in displaying spatial correlation. It's visually obvious that the southwestern states experience higher stress levels.

This spatial knowledge is relinquished when using a bubble plot. The designer relies on the knowledge of the U.S. map in the head of the readers. It is possible to code this into one of the available dimensions, e.g. one could make x = U.S. regions, but another variable is sacrificed.

Non-contiguous Spatial Patterns

When spatial patterns are contiguous, the map functions well. Sometimes, spatial patterns are disjoint. In that case, the bubble plot, which de-emphasizes the physical locations, can be superior. In our example, the vertical axis divides the states into five groups based on their water stress levels. Try figuring out which states are "medium to high" water stress from the map, and you'll see the difference.

Finer Geographies

The map handles finer geographical units like counties and precincts better. It's completely natural.

In the bubble plot, shifting to finer units causes the number of dots to explode. This clutters up the chart. Besides, while most (we hope) Americans know the 50 states, most of us can't recite counties or precincts. Thus, the designer can't rely on knowledge in our heads. It would be impossible to learn spatial patterns from such a chart.

***

The key, as always, is to nail down your message, then select the right chart form.