
Further exploration of tessellation density

Last year, I explored using bar-density (and pie-density) charts to illustrate 80/20-type distributions, which are very common in real life (link).


The key advantage of this design is that the most important units (i.e. the biggest stars/creators) are represented by larger pieces while the long tail is shown by little pieces. The skewness is encoded in the density of the tessellation.

So when the following chart showed up on my Twitter feed, I returned to the idea of using tessellation density as a visual cue.


This WBUR chart is a good statistical chart - efficient at communicating the data, but "boring". The only things I'd change are the vertical axis, the gridlines, and the decimals, all of which I'd remove.

In concept, the underlying data is similar to the YouTube data. Fewer than 0.5 percent of YouTubers produced 38% of the views on the platform. The richest 1% of the population took 15% of Harvard's spots; the richest 20% took 70%.

As I explored this further, the analogy fell apart. In the YouTube scenario, the stars should naturally occupy bigger spaces. In the Harvard scenario, letting the children of the top 1% take up more space on the chart doesn't really make sense since each incoming Harvard student has equal status.

Instead of going down that potential dead end, I investigated how tessellation density can be used for visualization. For one thing, tessellations are pretty and appealing.

Here is something I created:


The chart is read vertically by comparing Harvard's selection of students with the hypothetical "ideal" of equal selection. (I don't agree that this type of equality is the right thing but let me focus on the visualization here.) Thus, selectivity is encoded in the density. Selectivity is defined here as the over/under representation. Harvard is more "selective" in lower-income groups.

In the first and second columns, we see that Harvard's densities are lower than the densities expected in the general population, indicating that the poorest 20% and the middle 20% of the population are under-represented in Harvard's student body. Then in the third column, the comparison flips. The density in the top box is about 3-4 times as high as that in the bottom box. You may have to expand the graphic to see the 1% sliver, which also shows a much higher density in the top box.

I was surprised by how well I was able to eyeball the relative densities. You can try it and let me know how you fare.

(There is even a trick to do this. From the diagram with larger pieces, pick a representative piece. Then, roughly estimate how many smaller pieces from the other tessellation can fit into that representative piece. Using this guideline, I estimate the ratios of the densities to be 1:6, 1:2, 3:1, 10:1. The actual ratios are 1:6.7, 1:2.5, 3:1, 15:1. I find that my intuition gets me most of the way there even if I don't use this trick.)
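The over/under representation can be worked out directly from the two shares quoted earlier in this post (the richest 1% took 15% of the spots, the richest 20% took 70%). The sketch below is only an illustration: the function name is made up, and treating the third column as "the top 20% excluding the top 1%" is an assumption about how the chart was built, not something stated above.

```python
# Representation ratio = share of Harvard spots / share of the population.
# The shares are the figures quoted in this post. Splitting the top 20% into
# "top 1%" and "the other 19%" is an ASSUMPTION about the chart's columns.

def representation_ratio(share_of_spots, share_of_population):
    """Return how over- (>1) or under- (<1) represented a group is."""
    return share_of_spots / share_of_population

top_1 = representation_ratio(15, 1)
top_20 = representation_ratio(70, 20)
next_19 = representation_ratio(70 - 15, 20 - 1)

print(f"top 1%: {top_1:.1f} to 1")
print(f"top 20% including the top 1%: {top_20:.1f} to 1")
print(f"top 20% excluding the top 1%: {next_19:.1f} to 1")
```

Under that assumption the figures come out at 15, 3.5 and roughly 2.9, which is in line with the 15:1 and 3:1 ratios mentioned above.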

Density encoding is under-used as a visual cue. I think our ability to compare densities is surprisingly good (when the units are not overlapping). Of course, you wouldn't use density if you need to be precise, just as you wouldn't use color, or circular areas. Nevertheless, there are many occasions where you can afford to be less precise, and you'd like to spice up your charts.

Plotting the signal or the noise

Antonio alerted me to the following graphic that appeared in the Economist. This is a playful (?) attempt to draw attention to racism in the game of football (soccer).

The analyst proposed that non-white players played better in 2020, when stadiums had no fans due to Covid-19, because they were not distracted by racist abuse from fans. Italy's Serie A is used as the case study.


The chart struggles to bring out this finding. There are many lines that criss-cross. The conclusion is primarily based on the two thick lines - which show the average performance with and without fans of white and non-white players. The blue line (non-white) inched to the right (better performance) while the red line (white) shifted slightly to the left.

If the reader wants to understand the chart fully, there's a lot to take in. Presumably, all players are ranked by performance score from lowest to highest and divided into ten equally sized tiers (known as "deciles"). They are sorted by the 2019 performance when fans were in the stadiums. Each tier is represented by the average performance score of its members. These are the values shown on the top axis labeled "with fans".

Then, with the tiers fixed, the players are rated in 2020 when stadiums were empty. For each tier, an average 2020 performance score is computed, and compared to the 2019 performance score.
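The tiering procedure described above can be sketched in a few lines. This is my reading of the method, not the Economist's code: the scores are randomly generated stand-ins, and the number of players, the score scale and the function name are all invented for illustration.

```python
import random

random.seed(0)  # fixed seed so the made-up data are reproducible

# MADE-UP data: 200 hypothetical players, each with (score_2019, score_2020).
players = []
for _ in range(200):
    skill = random.gauss(6.0, 0.3)
    players.append((skill + random.gauss(0, 0.2), skill + random.gauss(0, 0.2)))

def decile_averages(players, n_tiers=10):
    """Rank by the 2019 score, cut into equal tiers, average both seasons per tier."""
    ranked = sorted(players, key=lambda p: p[0])  # tiers are fixed by 2019 only
    size = len(ranked) // n_tiers  # any leftover players would be dropped
    rows = []
    for t in range(n_tiers):
        tier = ranked[t * size:(t + 1) * size]
        with_fans = sum(p[0] for p in tier) / len(tier)
        no_fans = sum(p[1] for p in tier) / len(tier)
        rows.append((with_fans, no_fans))
    return rows

tiers = decile_averages(players)
for i, (with_fans, no_fans) in enumerate(tiers, start=1):
    print(f"tier {i:2d}: with fans {with_fans:.2f}, no fans {no_fans:.2f}")
```

Only the "with fans" column is guaranteed to rise from tier to tier, since it alone determines the ordering.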

The following chart reveals the structure of the data:


The players are lined up from left to right, from the worst performers to the best. Each decile is one tenth of the players, and is represented by the average score within the tier. The vertical axis is the actual score while the horizontal axis is a relative ranking - so we expect a positive correlation.

The blue line shows the 2019 (with fans) data, which are used to determine tier membership. The gray dotted line is the 2020 (no fans) data - because they don't decide the ranking, it's possible that the average score of a lower tier (e.g. tier 3 for non-whites) is higher than the average score of a higher tier (e.g. tier 4 for non-whites).

What do we learn from the graphic?

It's very hard to know whether the blue and gray lines differ by chance or because fans were absent from the stadiums. The maximum gap between the lines is not quite 0.2 on the raw score scale, which is roughly a one-decile shift. It'd be interesting to know the variability of the score of a given player across, say, the 5 seasons prior to 2019. I suspect it could be more than 0.2. In any case, the tiny shifts in the averages (around 0.05) can't be distinguished from noise.


This type of analysis is tough to do. Like other observational studies, there are multiple problems of biases and confounding. Fan attendance was not the only thing that changed between 2019 and 2020. The score used to rank players is a "Fantacalcio algorithmic match-level fantasy-football score." It's odd that real-life players should be judged by their fantasy scores rather than their on-the-field performance.

The causal model appears to assume that every non-white player gets racially abused. At least, the analyst didn't look at the curves above and conclude, post-hoc, that players in the third decile are most affected by racial abuse - which is exactly what has happened with the observational studies I have featured on the book blog recently.

Being a Serie A fan, I happen to know non-white players are a small minority so the error bars are wider, which is another issue to think about. I wonder if this factor by itself explains the shifts in those curves. The curve for white players is based on a much larger sample, so its season-to-season fluctuations are much smaller (regardless of fans or no fans).
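A small simulation illustrates the sample-size point. Everything in it is hypothetical: the group sizes (30 and 300), the noise level and the function names are invented, and no real Serie A data are used. It asks how much a group's average score moves between two seasons when nothing real has changed.

```python
import random
from statistics import stdev

random.seed(1)  # fixed seed so the simulation is reproducible

def simulated_shift(n_players, noise_sd=0.3):
    """Change in a group's average score between two seasons with NO true effect."""
    season_1 = [random.gauss(0, noise_sd) for _ in range(n_players)]
    season_2 = [random.gauss(0, noise_sd) for _ in range(n_players)]
    return sum(season_2) / n_players - sum(season_1) / n_players

def spread_of_shifts(n_players, n_sims=1000):
    """Standard deviation of the simulated shifts across many repetitions."""
    return stdev([simulated_shift(n_players) for _ in range(n_sims)])

small_group = spread_of_shifts(30)    # hypothetical minority group
large_group = spread_of_shifts(300)   # hypothetical majority group

print(f"typical shift, 30 players:  {small_group:.3f}")
print(f"typical shift, 300 players: {large_group:.3f}")
```

With ten times fewer players, the average wobbles about three times as much (the square root of ten), so a visible shift in the smaller group's curve is weaker evidence than the same shift in the larger group's curve.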





Stumped by the ATM

The neighborhood bank recently installed brand new ATMs, with tablet monitors and all that jazz. Then, I found myself staring at this screen:


I wanted to withdraw $100. I ordinarily love this banknote picker because I can get $5, $10, and $20 notes, instead of the $50 and $100 notes that come out of the slot when I don't specify my preference.

Something changed this time. I found myself wondering which row represented which note. For my non-U.S. readers, you may not know that all our notes are the same size and color. The screen resolution wasn't great and I had to squint really hard to see the numbers on those banknote images.

I suppose if I grew up here, I might be able to tell the note values from the figureheads. This is an example of a visualization that makes my life harder!

I imagine that the software developer might be a foreigner, perhaps someone who lives in Europe. In this case, the developer might have this image in his/her head:


Euro banknotes are heavily differentiated - by color, by image, by height and by width. The numeric value also occupies a larger proportion of the area. This makes a lot of sense.

I like designs to be adaptable. Switching data from one country to another should not alter the design. Switching data at different time scales should not affect the design. This banknote picker UI is not adaptable across countries.


Once I figured out the note values, I learned another reason why I couldn't tell which row is which note. It's because one note is absent.


Where is the $10 note? That and the twenty are probably the most frequently used. I am also surprised people want $1 notes from an ATM. But I assume the bank knows something I don't.