Start at zero improves this chart but only slightly

The following chart was forwarded to me recently:

Average_female_height

It's a good illustration of why the "start at zero" rule exists for column charts. The poor Indian lady looks extremely short in this women's club. Is the average Indian woman really half as tall as the average South African woman? (Surely not!)

Junkcharts_redo_womenheight_column

The problem is only superficially fixed by starting the vertical axis at zero. Doing so highlights the fact that the difference in average heights is but a fraction of the average heights themselves. The inter-country differences are squashed in such a representation - which works against the primary goal of the data visualization itself.

Recall the Trifecta Checkup. At the top of the trifecta is the Question. The designer obviously wants to focus our attention on the difference of the averages. A column chart showing average heights fails the job!

This "proper" column chart sends the message that the difference in average heights is noise, unworthy of our attention. But this is a bad take on the underlying data. The range of average heights across countries isn't that wide, by virtue of large population sizes.

According to Wikipedia, they range from 4 feet 10.5 to 5 feet 6. (I'm ignoring several entries in the table based on non-representative small samples.) How do we know that the difference of 2 inches between the averages of South Africa and India is actually a sizable difference? The Wikipedia table has the average heights for most of the world's countries. There are perhaps 200 values. These values are sprinkled inside the range of about 8 inches top to bottom. If we divide the full range into 10 equal bins, that's roughly 0.8 inches per bin. So if we have two numbers that are 2 inches apart, they sit about 2.5 bins apart. If the data were evenly distributed, that's a huge shift.
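The binning argument above is simple arithmetic, so here is a minimal sketch of it in Python. It takes the Wikipedia range quoted above and the 2-inch gap at face value; the variable names are mine, and the choice of 10 bins is the same arbitrary choice as in the text. Because it uses the exact 7.5-inch range rather than the rounded 8 inches, the bins come out slightly narrower than 0.8 inches.

```python
# Range of country-average female heights quoted above, in inches.
low_in = 4 * 12 + 10.5    # 4 ft 10.5 in
high_in = 5 * 12 + 6.0    # 5 ft 6 in

full_range = high_in - low_in      # the "about 8 inches" in the text
n_bins = 10                        # arbitrary choice, as in the text
bin_width = full_range / n_bins    # inches per bin

gap_in = 2.0                       # South Africa vs India, per the text
bins_apart = gap_in / bin_width    # how many bin widths the gap covers
share_of_range = gap_in / full_range

print(f"bin width: {bin_width:.2f} in")
print(f"gap covers {bins_apart:.1f} bin widths, {share_of_range:.0%} of the full range")
```

Either way, the gap covers more than a quarter of the entire range of country averages, which is the point of the paragraph.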

(In reality, the data should be normally distributed, bell-shaped, with much more at the center than on the edges. That makes a difference of 2 inches even more significant if these are normal values near the center but less significant if these are extreme values on the tails. Stats students should be able to articulate why we are sure the data are normally distributed without having to plot the data.)

***

The original chart has further problems.

Another source of distortion comes from the scaling of the stick figures. The aspect ratio is being preserved, which means the area is being scaled. Given that the heights are scaled as per the data, the data are encoded twice, the second time in the widths. This means that the sizes of these figures grow at the rate of the square of the heights. (Contrast this with the scaling discussed in my earlier post this week which preserves the relative areas.)
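To make the double encoding concrete, here is a small sketch. The pixel heights are hypothetical, chosen only for illustration; the one assumption carried over from the text is that each figure is scaled with its aspect ratio preserved, so width grows in step with height.

```python
def drawn_area_ratio(height_a: float, height_b: float) -> float:
    """Ratio of drawn areas for two figures scaled to the given heights
    with the aspect ratio preserved. Width scales with height, so the
    area scales with the square of the height."""
    return (height_a / height_b) ** 2

# Hypothetical drawn heights in pixels (not measured from the chart):
# one figure drawn twice as tall as another.
tall_px, short_px = 200.0, 100.0

height_ratio = tall_px / short_px
area_ratio = drawn_area_ratio(tall_px, short_px)

print(f"height ratio {height_ratio:.1f}, area ratio {area_ratio:.1f}")
```

A figure drawn twice as tall takes up four times the ink, so a reader who judges by area sees an even bigger gap than the already exaggerated heights suggest.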

At the end of that last post, I discuss why adding colors to a chart when the colors do not encode any data is a distraction to the reader. And this average height chart is an example.

From the Data corner of the Trifecta Checkup, I'm intrigued by the choice of countries. Why is Scotland highlighted instead of the U.K.? Why Latvia? According to Wikipedia, the Latvia estimate is based on a 1% sample of only 19-year-olds.

Some of the data appear to be incorrect (or the designer used a different data source). Wikipedia lists the average height of Latvian women as 5 ft 6.5 while the chart shows 5 ft 5 in. Peru's average height of females is listed as 4 ft 11.5 and of males as 5 ft 4.5. The chart shows 5 ft 4 in.

***

Lest we think only amateurs make this type of chart, here is an example of a similar chart in a scientific research journal:

Fnhum-14-00338-g007

(link to original)

I have seen many versions of the above column charts with error bars, and the vertical axes not starting at zero. In every case, the heights (and areas) of these columns do not scale with the underlying data.

***

I tried a variant of the stem-and-leaf plot:

Junkcharts_redo_womenheight_stemleaf

The scale is chosen to reflect the full range of average heights given in Wikipedia. The chart works better with more countries to fill out the distribution. It shows India is on the short end of the scale but not quite the lowest. (As mentioned above, Peru actually should be placed close to the lower edge.)



Distorting perception versus distorting the data

This chart appears in the latest issue (the "last print issue") of Schwab's On Investing magazine:

Schwab_oninvesting_returnlandscape

I know I don't like triangular charts, and in this post, I attempt to verbalize why.

It's not the usual complaint of distorting the data. When the base of the triangle is fixed, and only the height is varied, then the area is proportional to the height and thus nothing is distorted.

Nevertheless, my ability to compare those triangles pales in comparison to the following columns.

Junkcharts_triangles_rectangles

This phenomenon is not limited to triangles. One can take columns and start varying the width, and achieve a similar effect:

Junkcharts_changing_base

It's really the aspect ratio - the relationship between the height and the width that's the issue.

***

Interestingly, with an appropriately narrow base, even the triangular shape can be saved.

Junkcharts_narrower_base

In a sense, we can think of the width of these shapes as noise, a distraction - because the width is constant, and not encoding any data.

It's like varying colors for no reason at all. It introduces a pointless dimension.

Junkcharts_color_notdata

It may be prettier but the colors also interfere with our perception of the changing heights.


Further exploration of tessellation density

Last year, I explored using bar-density (and pie-density) charts to illustrate 80/20-type distributions, which are very common in real life (link).

Kaiserfung_youtube_bardensity

The key advantage of this design is that the most important units (i.e. the biggest stars/creators) are represented by larger pieces while the long tail is shown by little pieces. The skewness is encoded in the density of the tessellation.

So when the following chart showed up on my Twitter feed, I returned to the idea of using tessellation density as a visual cue.

Harvard_income_students

This wbur chart is a good statistical chart - efficient at communicating the data, but "boring". The only things I'd change are to remove the vertical axis, gridlines, and the decimals.

In concept, the underlying data is similar to the YouTube data. Less than 0.5 percent of YouTubers produced 38% of the views on the platform. The richest 1% of the population took 15% of Harvard's spots; the richest 20% took 70%.

As I explore this further, the analogy falls apart. In the YouTube scenario, the stars should naturally occupy bigger spaces. In the Harvard scenario, letting the children of the top 1% take up more space on the chart doesn't really make sense since each incoming Harvard student has equal status.

Instead of going down that potential dead end, I investigated how tessellation density can be used for visualization. For one thing, tessellations are pretty things and appealing.

Here is something I created:

Junkcharts_redo_wbur_harvard_rich

The chart is read vertically by comparing Harvard's selection of students with the hypothetical "ideal" of equal selection. (I don't agree that this type of equality is the right thing but let me focus on the visualization here.) Thus, selectivity is encoded in the density. Selectivity is defined here as the over/under representation. Harvard is more "selective" in lower-income groups.

In the first and second columns, we see that Harvard's densities are lower than the densities as expected in the general population, indicating that the poorest 20%, and the middle 20% of the population are under-represented in Harvard's student body. Then in the third column, the comparison flips. The density in the top box is about 3-4 times as high as the bottom box. You may have to expand the graphic to see the 1% sliver, which also shows a much higher density in the top box.

I was surprised by how well I was able to eyeball the relative densities. You can try it and let me know how you fare.

(There is even a trick to do this. From the diagram with larger pieces, pick a representative piece. Then, roughly estimate how many smaller pieces from the other tessellation can fit into that representative piece. Using this guideline, I estimate the ratios of the densities to be 1:6, 1:2, 3:1, 10:1. The actual ratios are 1:6.7, 1:2.5, 3:1, 15:1. I find that my intuition gets me most of the way there even if I don't use this trick.)
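For readers who want to check the density ratios against the data, here is a rough sketch. The function name is mine. The two pairs of percentages are the ones quoted earlier in this post; the last calculation rests on my own assumption that the third column covers the richest 20% excluding the top 1%, which the post does not state.

```python
def representation_ratio(pct_of_spots: float, pct_of_population: float) -> float:
    """Over/under-representation: share of Harvard spots divided by the
    share of the population. Above 1 means over-represented."""
    return pct_of_spots / pct_of_population

# Figures quoted earlier in the post, in percent.
top_1 = representation_ratio(15, 1)     # richest 1% took 15% of spots
top_20 = representation_ratio(70, 20)   # richest 20% took 70% of spots

# ASSUMPTION (not stated in the post): the third column is the richest
# 20% with the top 1% taken out, i.e. 55% of spots for 19% of people.
top_20_excl_top_1 = representation_ratio(70 - 15, 20 - 1)

print(f"top 1%: {top_1:.1f} to 1")
print(f"top 20%: {top_20:.1f} to 1")
print(f"top 20% excluding top 1%: {top_20_excl_top_1:.1f} to 1")
```

The density of the tessellation in each Harvard box, relative to the matching population box, is this ratio drawn as a picture.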

Density encoding is under-used as a visual cue. I think our ability to compare densities is surprisingly good (when the units are not overlapping). Of course, you wouldn't use density if you need to be precise, just as you wouldn't use color, or circular areas. Nevertheless, there are many occasions where you can afford to be less precise, and you'd like to spice up your charts.


Plotting the signal or the noise

Antonio alerted me to the following graphic that appeared in the Economist. This is a playful (?) attempt to draw attention to racism in the game of football (soccer).

The analyst proposed that non-white players have played better in stadiums without fans due to Covid19 in 2020 because they have not been distracted by racist abuse from fans, using Italy's Serie A as the case study.

Econ_seriea_racism

The chart struggles to bring out this finding. There are many lines that criss-cross. The conclusion is primarily based on the two thick lines - which show the average performance with and without fans of white and non-white players. The blue line (non-white) inched to the right (better performance) while the red line (white) shifted slightly to the left.

If the reader wants to understand the chart fully, there's a lot to take in. (Presumably) all players are ranked by the performance score from lowest to highest into ten equally sized tiers (known as "deciles"). They are sorted by the 2019 performance when fans were in the stadiums. Each tier is represented by the average performance score of its members. These are the values shown on the top axis labeled "with fans".

Then, with the tiers fixed, the players are rated in 2020 when stadiums were empty. For each tier, an average 2020 performance score is computed, and compared to the 2019 performance score.
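The tiering procedure just described can be sketched in a few lines. This is my reading of the method, not the Economist's code, and the scores below are made up for ten hypothetical players cut into five tiers so the output stays short.

```python
from statistics import mean

def tier_averages(scores_2019, scores_2020, n_tiers=10):
    """Rank players by 2019 score, cut them into equal-sized tiers, and
    return one (average 2019 score, average 2020 score) pair per tier.

    Tier membership is fixed by the 2019 ranking only. For simplicity,
    this sketch assumes the player count is divisible by n_tiers.
    """
    players = sorted(zip(scores_2019, scores_2020), key=lambda p: p[0])
    size = len(players) // n_tiers
    tiers = []
    for t in range(n_tiers):
        members = players[t * size:(t + 1) * size]
        avg_2019 = mean([s19 for s19, _ in members])
        avg_2020 = mean([s20 for _, s20 in members])
        tiers.append((avg_2019, avg_2020))
    return tiers

# Hypothetical scores for ten players (not real Serie A data).
scores_2019 = [6.2, 5.8, 6.0, 5.5, 6.5, 5.9, 6.1, 5.7, 6.3, 6.4]
scores_2020 = [6.0, 6.1, 5.9, 5.6, 6.3, 6.2, 6.0, 5.5, 6.4, 6.1]

tiers = tier_averages(scores_2019, scores_2020, n_tiers=5)
for i, (a19, a20) in enumerate(tiers, start=1):
    print(f"tier {i}: with fans {a19:.2f}, no fans {a20:.2f}")
```

By construction the 2019 averages rise from tier to tier. Nothing forces the 2020 averages to do the same, since the 2020 scores play no part in deciding who sits in which tier.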

The following chart reveals the structure of the data:

Junkcharts_redo_seriea_racism

The players are lined up from left to right, from the worst performers to the best. Each decile is one tenth of the players, and is represented by the average score within the tier. The vertical axis is the actual score while the horizontal axis is a relative ranking - so we expect a positive correlation.

The blue line shows the 2019 (with fans) data, which are used to determine tier membership. The gray dotted line is the 2020 (no fans) data - because they don't decide the ranking, it's possible that the average score of a lower tier (e.g. tier 3 for non-whites) is higher than the average score of a higher tier (e.g. tier 4 for non-whites).

What do we learn from the graphic?

It's very hard to know if the blue and gray lines differ by chance or because of whether fans were in the stadium. The maximum gap between the lines is not quite 0.2 on the raw score scale, which is roughly a one-decile shift. It'd be interesting to know the variability of the score of a given player across say 5 seasons prior to 2019. I suspect it could be more than 0.2. In any case, the tiny shifts in the averages (around 0.05) can't be distinguished from noise.

***

This type of analysis is tough to do. Like other observational studies, there are multiple problems of biases and confounding. Fan attendance was not the only thing that changed between 2019 and 2020. The score used to rank players is a "Fantacalcio algorithmic match-level fantasy-football score." It's odd that real-life players should be judged by their fantasy scores rather than their on-the-field performance.

The causal model appears to assume that every non-white player gets racially abused. At least, the analyst didn't look at the curves above and conclude, post-hoc, that players in the third decile are most affected by racial abuse - which is exactly what has happened with the observational studies I have featured on the book blog recently.

Being a Serie A fan, I happen to know non-white players are a small minority so the error bars are wider, which is another issue to think about. I wonder if this factor by itself explains the shifts in those curves. The curve for white players has a much higher sample size thus season-to-season fluctuations are much smaller (regardless of fans or no fans).
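To see how much sample size alone can matter, here is a back-of-the-envelope sketch using the textbook standard error of a mean. Every number in it is hypothetical: I don't know the actual player counts or the spread of the scores, and the formula assumes independent scores, which season-level football data may not satisfy.

```python
import math

def standard_error(sd: float, n: int) -> float:
    """Standard error of a sample mean, assuming independent scores."""
    return sd / math.sqrt(n)

# ALL VALUES HYPOTHETICAL, for illustration only.
score_sd = 0.6      # assumed spread of individual player scores
n_white = 360       # assumed number of white players
n_nonwhite = 40     # assumed number of non-white players

se_white = standard_error(score_sd, n_white)
se_nonwhite = standard_error(score_sd, n_nonwhite)

print(f"white: +/- {se_white:.3f}, non-white: +/- {se_nonwhite:.3f}")
print(f"ratio of standard errors: {se_nonwhite / se_white:.1f}")
```

With a group one-ninth the size, the average bounces around three times as much from season to season, before any effect of fans enters the picture. Splitting each group into ten tiers shrinks the counts further.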



Stumped by the ATM

The neighborhood bank recently installed brand new ATMs, with tablet monitors and all that jazz. Then, I found myself staring at this screen:

Banknote_picker_us

I wanted to withdraw $100. I ordinarily love this banknote picker because I can get the $5, $10, $20 notes, instead of the $50 and $100 notes that come out of the slot when I don't specify my preference.

Something changed this time. I found myself wondering which row represents which note. For my non-U.S. readers, you may not know that all our notes are the same size and color. The screen resolution wasn't great and I had to squint really hard to see the numbers on those banknote images.

I suppose if I grew up here, I might be able to tell the note values from the portraits. This is an example of a visualization that makes my life harder!

***

I imagine that the software developer might be a foreigner, perhaps someone living in Europe. In that case, the developer might have this image in his/her head:

Banknote_picker_euro

Euro banknotes are heavily differentiated - by color, by image, by height and by width. The numeric value also occupies a larger proportion of the area. This makes a lot of sense.

I like designs to be adaptable. Switching data from one country to another should not alter the design. Switching data at different time scales should not affect the design. This banknote picker UI is not adaptable across countries.

***

Once I figured out the note values, I learned another reason why I couldn't tell which row is which note. It's because one note is absent.

Banknote_us_2

Where is the $10 note? That and the twenty are probably the most frequently used. I am also surprised people want $1 notes from an ATM. But I assume the bank knows something I don't.


Tip of the day: transform data before plotting

The Financial Times called out a Twitter user for some graphical mischief. Here are the two charts illustrating the plunge in Bitcoin's price last week (hat tip to Mark P.):

Ft_tradingview_btcprices

There are some big differences between the two charts. The left chart depicts this month's price actions, drawing attention to the last week while the right chart shows a longer period of time, starting from 2012. The author of the tweet apparently wanted to say that the recent drop is nothing to worry about. 

The Financial Times reporter noted another subtle difference - the right chart uses a log scale while the left chart is linear. Specifically, it's a log 2 scale, which means that each step up is double the previous number (1, 2, 4, 8, etc.). The effect is to make large changes look smaller. Presumably most readers fail to notice the scale. Even if they do, it's not natural to assign different differences to the same physical distances.
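A quick sketch shows what "same physical distance, different differences" means on a log 2 axis. The prices are hypothetical round numbers, not actual Bitcoin quotes, and the function name is mine.

```python
import math

def log2_position(price: float) -> float:
    """Position on a log-2 axis: one unit of distance per doubling."""
    return math.log2(price)

# Hypothetical price moves, each one a doubling.
moves = [(1, 2), (100, 200), (30_000, 60_000)]

rows = []
for start, end in moves:
    axis_distance = log2_position(end) - log2_position(start)
    dollar_change = end - start
    rows.append((dollar_change, axis_distance))
    print(f"{start} -> {end}: ${dollar_change} change, "
          f"{axis_distance:.2f} units on the axis")

# A halving is the mirror image: the same distance, downwards.
halving_distance = log2_position(30_000) - log2_position(60_000)
print(f"60000 -> 30000: {halving_distance:.2f} units on the axis")
```

A one-dollar move and a thirty-thousand-dollar move occupy the same length of axis because both are doublings. The scale encodes ratios, not dollar amounts, and a reader who misses that will misjudge the size of the drop.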

***

Junkcharts_redo_fttradingviewbitcoinpricechart

These price charts always miss the mark. That's because the current price is insufficient to capture whether a Bitcoin investor made money or lost money. If you purchased Bitcoins this month, you lost money. If your purchase was a year ago, you still made quite a bit of money despite the recent price plunge.

The following chart should not be read as a time series, even though the horizontal axis is time. Think date of Bitcoin purchase. This chart tells you how much $1 of Bitcoin was worth last week, based on what day the purchase was made.

Junkcharts_redo_fttradingviewbitcoinpricechart_2

People who bought this year have mostly been in the red. Those who purchased before October 2020 and held on are still very pleased with their decision.

This example illustrates that simple transformations of the raw data yield graphics that are much more informative.
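The transformation behind the second chart is a single division. Here is a minimal sketch with hypothetical prices (they are not actual Bitcoin quotes) and labels of my own in place of real purchase dates.

```python
def value_of_one_dollar(purchase_price: float, reference_price: float) -> float:
    """What $1 of Bitcoin bought at purchase_price is worth when the
    price is reference_price."""
    return reference_price / purchase_price

# Hypothetical prices by purchase date, oldest first (illustration only).
purchase_prices = [
    ("long ago", 10_000.0),
    ("a few months ago", 30_000.0),
    ("last month", 60_000.0),
    ("last week", 40_000.0),
]
reference_price = 40_000.0   # hypothetical price last week

worth = {
    label: value_of_one_dollar(price, reference_price)
    for label, price in purchase_prices
}

for label, value in worth.items():
    status = "in the red" if value < 1 else "at or above break-even"
    print(f"bought {label}: $1 is now ${value:.2f} ({status})")
```

Anything below $1 is a loss and anything above is a gain, so the reader no longer has to compare two points on a price curve to answer the question that matters to an investor.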



Did the pandemic drive mass migration?

The Wall Street Journal ran this nice compact piece about migration patterns during the pandemic in the U.S. (link to article)

Wsj_migration

I'd look at the chart on the right first. It shows the greatest net flow of people out of the Northeast to the South. This sankey diagram is nicely done. The designer shows restraint in not printing the entire dataset on the chart. If a reader really cares about the net migration from one region to a specific other region, it's easy to estimate the number even though it's not printed.

The maps succinctly provide readers the definition of the regions.

To keep things in perspective, we are talking around 100,000 people when the death toll of Covid-19 is nearing 600,000. Some people have moved but almost everyone else hasn't.

***

The chart on the left breaks down the data in a different way - by urbanicity. This is a variant of the stacked column chart. It is a chart form that fits the particular instance of the dataset. It works only because in every month of the last three years, there was a net outflow from "large metro cores". Thus, the entire series for large metro cores can be pointed downwards.

The fact that this design is sensitive to the dataset is revealed in the footnote, which said that the May 2018 data for "small/medium metro" was omitted from the chart. Why didn't they plot that number?

It's the one datum that sticks out like a sore thumb. It's the only negative number in the entire dataset that is not associated with "large metro cores". I suppose they could have inserted a tiny medium green sliver in the bottom half of that chart for May 2018. I don't think it hurts the interpretation of the chart. Maybe the designer thinks it might draw unnecessary attention to one data point that really doesn't warrant it.

***

See my collection of posts about Wall Street Journal graphics.


Probabilities and proportions: which one is the chart showing

The New York Times showed this chart (link):

Nyt_unvaccinated_undeterred

My first read: oh my gosh, 40-50% of the unvaccinated Americans are living their normal lives - dining at restaurants, assembling with more than 10 people, going to religious gatherings.

After reading the text around this chart, I realize I have misinterpreted it.

The chart should be read by columns. Each column is a "pie chart". For example, the first column shows that half the restaurant diners are not vaccinated, a third are fully vaccinated, and the remainder are partially vaccinated. The other columns have roughly the same proportions.

The author says "The rates of vaccination among people doing these activities largely reflect the rates in the population." This line is perhaps more confusing than intended. What she's saying is that in the general population, half of us are unvaccinated, a third are fully vaccinated, and the remainder are partially vaccinated.

Here's a picture:

Junkcharts_redo_nyt_unvaccinatedundeterred

What this chart is saying is that the people dining out are like a random sample from all Americans. So too the other groups depicted. What Americans are choosing to do is independent of their vaccination status.

Unvaccinated people are no less likely to be doing all these activities than the fully vaccinated. This raises the question: are half of the people not wearing masks outdoors unvaccinated?
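The two readings of the chart are two different conditional quantities, and a small sketch makes the distinction explicit. The population shares are the rough ones quoted above (half, a third, the remainder). The 40% dining-out rate is a number I made up, and applying it equally to every group is the independence assumption, not a finding.

```python
from fractions import Fraction

# Rough population shares quoted in the text.
population = {
    "unvaccinated": Fraction(1, 2),
    "fully vaccinated": Fraction(1, 3),
    "partially vaccinated": Fraction(1, 6),
}

# HYPOTHETICAL: probability of dining out, identical for every group.
# Setting these equal is what "independent of vaccination status" means.
p_dine_given_status = {status: Fraction(2, 5) for status in population}

# Joint share of the population that has a given status AND dines out.
joint = {s: population[s] * p_dine_given_status[s] for s in population}
p_dine = sum(joint.values())

# Proportions: the make-up of the diners. This is one column of the chart.
diner_composition = {s: joint[s] / p_dine for s in population}

for s in population:
    print(f"{s}: P(dines | status) = {float(p_dine_given_status[s]):.2f}, "
          f"share of diners = {float(diner_composition[s]):.2f}")
```

The shares of diners sum to 100% and reproduce the population shares exactly, which is the pattern in the New York Times chart. The probabilities, one per group, have no reason to sum to 100%; that is the kind of number most survey charts show.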

***

Why did I read the chart wrongly in the first place? It has to do with expectations.

Most survey charts plot probabilities not proportions. I haphazardly grabbed the following Pew Research chart as an example:

Pew_kids_socialmedia

From this chart, we learn that 30% of kids 9-11 years old use TikTok compared to 11% of kids 5-8. The percentages down a column do not sum to 100%.



Reading this chart won't take as long as withdrawing troops from Afghanistan

Art sent me the following Economist chart, noting how hard it is to understand. I took a look, and agreed. It's an example of a visual representation that takes more time to comprehend than the underlying data.

Econ_theendisnear

The chart presents responses to 3 questions on a survey. For each question, the choices are Approve, Disapprove, and "Neither" (just picking a word since I haven't seen the actual survey question). The overall approval/disapproval rates are presented, and then broken into two subgroups (Democrats and Republicans).

The first hurdle is reading the scale. Because the section from 75% to 100% has been removed, we are left with labels 0, 25, 50, 75, which do not say percentages unless we've consumed the title and subtitle. The Economist style guide places the units of data in the subtitle instead of on the axis itself.

Our attention is drawn to the thick lines, which represent the differences between approval and disapproval rates. These differences are signed: it matters whether the proportion approving is higher or lower than the proportion disapproving. This means the data are encoded in the order of the dots plus the length of the line segment between them.

The two bottom rows of the Afghanistan question demonstrate this mental challenge. Our brains have to process the following visual cues:

1) the two lines are about the same lengths

2) the Republican dots are shifted to the right by a little

3) the colors of the dots are flipped

What do they all mean?

Econ_theendofforever_subset

A chart runs into trouble when you need a paragraph to explain how to read it.

It's sometimes alright to make complicated data visualization that illustrates complicated concepts. What justifies it is the payoff. I wrote about the concept of return on effort in data visualization here.

The payoff for this chart escaped me. Take the Democratic response to troop withdrawal. About 3/4 of Democrats approve while 15% disapprove. The thick line says the share of Democrats who approve is 60 percentage points higher than the share who disapprove.

***

Here, I show the full axis, and add a 50% reference line:

Junkcharts_redo_econ_theendofforever_1

Small edits but they help visualize "half of", "three quarters of".

***

Next, I switch to the more conventional stacked bars.

Junkcharts_redo_econ_theendofforever_stackedbars

This format reveals some of the hidden data on the chart - the proportion answering neither approve/disapprove, and neither yes/no.

On the stacked bars visual, the proportions are counted from both ends while in the dot plot above, the proportions are measured from the left end only.

***

Read all my posts about Economist charts here



Two commendable student projects, showing different standards of beauty

A few weeks ago, I did a guest lecture for Ray Vella's dataviz class at NYU, and discussed a particularly hairy dataset that he assigns to students.

I'm happy to see the work of the students, and there are two pieces in particular that show promise.

The following dot plot by Christina Barretto shows the disparities between the richest and poorest nations increasing between 2000 and 2015.

BARRETTO  Christina - RIch Gets Richer Homework - 2021-04-14

The underlying dataset has the average GDP per capita for the richest and the poorest regions in each of nine countries, for two years (2000 and 2015). Within each year, the data are indexed to the national average income (100). In the U.K., the gap increased from around 800 to 1,100 in the 15 years. It's evidence that the richer regions are getting richer, and the poorer regions are getting poorer.

(For those into interpreting data, you should notice that I didn't say the rich are getting richer. During the lecture, I explained how to interpret regional averages.)

Christina's chart reflects the tidy, minimalist style advocated by Tufte. The countries are sorted by the 2000-to-2015 difference, with Britain showing up as an extreme outlier.

***

The next chart by Adrienne Umali is more infographic than Tufte.

Adrienne Umali_v2

It's great story-telling. The top graphic explains the underlying data. It shows the four numbers and how the gap between the richest and poorest regions is computed. Then, it summarizes these four numbers into a single metric, "gap increase". She chooses to measure the change as a ratio while Christina's chart uses the difference, encoded as a vertical line.
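The two ways of summarizing the four numbers can be sketched as follows. The function names are mine, the index values for the single country are hypothetical, and I don't know the exact formula behind Adrienne's "gap increase", so the ratio here is only one plausible version. The U.K. gaps of roughly 800 and 1,100 are the approximate figures quoted above.

```python
def gap(richest_index: float, poorest_index: float) -> float:
    """Gap between the richest and poorest regions, in index points
    (national average = 100)."""
    return richest_index - poorest_index

def change_as_difference(gap_start: float, gap_end: float) -> float:
    return gap_end - gap_start

def change_as_ratio(gap_start: float, gap_end: float) -> float:
    return gap_end / gap_start

# HYPOTHETICAL index values for one country (illustration only).
gap_2000 = gap(richest_index=180.0, poorest_index=60.0)
gap_2015 = gap(richest_index=210.0, poorest_index=50.0)
print(change_as_difference(gap_2000, gap_2015),
      round(change_as_ratio(gap_2000, gap_2015), 2))

# U.K., using the approximate gaps quoted in the text.
uk_difference = change_as_difference(800, 1100)
uk_ratio = change_as_ratio(800, 1100)
print(uk_difference, uk_ratio)
```

The difference keeps the units of the index, so countries with large gaps to begin with dominate the picture; the ratio puts every country on a common relative scale.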

Adrienne's chart is successful because she focuses our attention on a single country - the U.S. It's much too hard to drink data from nine countries in one gulp.

This then sets her up for the second graphic. Now, she presents the other eight countries. Because of the work she did in the first graphic, the reader understands what those red and green arrows mean, without having to know the underlying index values.

Two small suggestions: a) order the countries from greatest to smallest change; b) leave off the decimals. These are minor flaws in a brilliant piece of work.