To explain or to eliminate, that is the question

Today, I take a look at another project from Ray Vella's class at NYU.

Rich Get Richer Assigment 2 top

(The above image is a honeypot for "smart" algorithms that don't know how to handle image dimensions which don't fit their shadow "requirement". Human beings should proceed to the full image below.)

As explained in this post, the students visualized data about regional average incomes in a selection of countries. It turns out that remarkable differences persist in regional income disparity between countries, almost all of which are more advanced economies.

Rich Get Richer Assigment 2 Danielle Curran_1

The graphic is by Danielle Curran.

I noticed two smart decisions.

First, she came up with a different main metric for gauging regional disparity, landing on a metric that is simple to grasp.

Based on hints given on the chart, I surmised that Danielle computed the change in per-capita income in the richest and poorest regions separately for each country between 2000 and 2015. These regional income growth values are expressed in currency, not indiced. Then, she computed the ratio of these growth rates, for each country. The end result is a simple metric for each country that describes how fast income has been growing in the richest region relative to the poorest region.

One of the challenges of this dataset is the complex indexing scheme (discussed here). Carlos' solution keeps the indices but uses design to facilitate comparisons. Danielle avoids the indices altogether.

The reader is relieved of the need to make comparisons, and so can focus on differences in magnitude. We see clearly that regional disparity is by far the highest in the U.K.

***

The second smart decision Danielle made is organizing the countries into clusters. She took advantage of the horizontal axis which does not encode any data. The branching structure places different clusters of countries along the axis, making it simple to navigate. The locations of these clusters are cleverly aligned to the map below.

***

Danielle's effort is stronger on communications while Carlos' effort provides more information. The key is to understand who your readers are. What proportion of your readers would want to know the values for each country, each region and each year?

***

A couple of suggestions

a) The reference line should be set at 1, not 0, for a ratio scale. The value of 1 happens when the richest region and the poorest region have identical per-capita incomes.

b) The vertical scale should be fixed.


Displaying convoluted indices

I reviewed another batch of projects from Ray Vella's class at NYU. The following piece by Carlos Lasso made an impression on me. There are no pyrotechnics but he made one decision that added a lot of clarity to the graphic.

The Rich get Richer - Carlos Lasso

The underlying dataset gauges the income disparity of regions within nine countries. The richest and the poorest regions are selected for each country. Two time points are shown. Altogether, there are 9x2x2 = 36 numbers.

***

Let's take a deeper look at these numbers. Notice they are not in dollars, or any kind of currency, despite being about incomes. The numbers are index values, relative to 100. What does the reference level of 100 represent?

The value of 100 crosses every bar of the chart so that 100 has meaning in each country and each year. In fact, there are 18 definitions of 100 in this chart with 36 numbers, one for each country-year pair. The average national income is set to 100 for each country in each year. This is a highly convoluted indexing strategy.

The following chart is a re-visualization of the bottom part of Carlos' infographic.

Junkcharts_richricher2021_2columns

I shifted the scale of the horizontal axis. The value of zero does not hold special meaning in Carlos' chart. I subtracted 100 from the relative regional income indices, thus all regions with income above the average have positive values while those below the national average have negative values. (There are other challenges with the ratio scale, which I'll skip over in this post. The minimum value is -100 while the maximum value can be very large.)

The rescaling is not really the point of this post. To see what Carlos did, we have to look at the example shown in class. The graphic which the students were asked to improve has the following structure:

Junkcharts_richricher2021_1column

This one-column structure places four bars beside each country, grouped by year. Carlos pulled the year dimension out, and showed the same dataset in two columns.

This small change makes a great difference in ease of comprehension. Carlos' version unpacks the two key types of comparisons one might want to make: trend within a given country (horizontal comparison) and contrast between countries in a given year (vertical comparison).

***

I always try to avoid convoluted indexing. The cost of using such indices is the big how-to-read-this box.


Asymmetry and orientation

An author in Significance claims that a single season of Premier League football without live spectators is enough to prove that the so-called home field advantage is really a live-spectator advantage.

The following chart depicts the data going back many seasons:

Significance_premierleaguehomeadvantage_chart_2

I find this bar chart challenging.

It plots the ratio of home wins to away wins using an odds scale, which is not intuitive. The odds scale (probability of success divided by probability of failure) runs from 0 to positive infinity, with 1 being a special value indicating equal odds. But all the values for which away wins exceed home wins are squeezed into the interval between 0 and 1 while the values for which home wins exceed away wins are laid out between 1 and infinity. So it's an inherently asymmetric graphic for a symmetric formula.

The section labeled "more away wins than home wins" are filled with red bars for all those seasons with positive home field advantage while the most recent season, the outlier, has a shorter bar in that section than the rest.

Here's an alternative view:

Redo_significance_premierleaguehomeawaywins_2

I have incorporated dual axes here - but both axes are different only by scaling. There are 380 games in a Premier League season so the percentage scale is just a re-expression of the counts.

 

 


Ridings, polls, elections, O Canada

Stephen Taylor reached out to me about his work to visualize Canadian elections data. I took a look. I appreciate the labor of love behind this project.

He led with a streamgraph, which presents a quick overview of relative party strengths over time.

Stephentaylor_canadianelections_streamgraph

I am no Canadian election expert, and I did a bare minimum of research in writing this blog. From this chart, I learn that:

  • the Canadians have an irregular election schedule
  • Canada has a two party plus breadcrumbs system
  • The two dominant parties are Liberals and Conservatives. The Liberals currently hold just less than half of the seats. The Conservatives have more than half of the seats not held by Liberals
  • The Conservative party (maybe) rebranded as "progressive conservative" for several decades. The Reform/Alliance party was (maybe) a splinter movement within the Conservatives as well.
  • Since the "width" of the entire stream increased over time, I'm guessing the number of seats has expanded

That's quite a bit of information obtained at a glance. This shows the power of data visualization. Notice Stephen didn't even have to include a "how to read this" box.

The streamgraph form has its limitations.

The feature that makes it more attractive than an area chart is its middle anchoring, resulting in a form of symmetry. The same feature produces erroneous intuition - the red patch draws out a declining trend; the reader must fight the urge to interpret the lines and focus on the areas.

The breadcrumbs are well hidden. The legend below discloses that the Green Party holds 3 seats currently. The party has never held enough seats to appear on the streamgraph though.

The bars showing proportions in the legend is a very nice touch. (The numbers appear messed up - I have to ask Stephen whether the seats shown are current values, or some kind of historical average.) I am a big fan of informative legends.

***

The next featured chart is a dot plot of polling results since 2020.

Stephentaylor_canadianelections_streamgraph_polls_dotplot

One can see a three-tier system: the two main parties, then the NDP (yellow) is the clear majority of the minority, and finally you have a host of parties that don't poll over 10%.

It looks like the polls are favoring the Conservatives over the Liberals in this election but it may be an election-day toss-up.

The purple dots represent "PPC" which is a party not found elsewhere on the page.

This chart is clear as crystal because of the structure of the underlying data. It just amazes me that the polls are so highly correlated. For example, across all these polls, the NDP has never once polled better than either the Liberals or the Conservatives, and in addition, it has never polled worse than any of the small parties.

What I'd like to see is a chart that merges the two datasets, addressing the question of how well these polls predicted the actual election outcomes.

***

The project goes very deep as Stephen provides charts for individual "ridings" (perhaps similar to U.S. precincts).

Here we see population pyramids for Vancouver Center, versus British Columbia (Province), versus Canada.

Stephentaylor_canadianelections_riding_populationpyramids

This riding has a large surplus of younger people in their twenties and thirties. Be careful about the changing scales though. The relative difference in proportions are more drastic than visually displayed because the maximum values (5%) on the Province and Canada charts are half that on the Riding chart (10%). Imagine squashing the Province and Canada charts to half their widths.

Analyses of income and rent/own status are also provided.

This part of the dashboard exhibits a problem common in most dashboards - they present each dimension of the data separately and miss out on the more interesting stuff: the correlation between dimensions. Do people in their twenties and thirties favor specific parties? Do richer people vote for certain parties?

***

The riding-level maps are the least polished part of the site. This is where I'm looking for a "how to read it" box.

Stephentaylor_canadianelections_ridingmaps_pollwinner

It took me a while to realize that the colors represent the parties. If I haven't come in from the front page, I'd have been totally lost.

Next, I got confused by the use of the word "poll". Clicking on any of the subdivisions bring up details of an actual race, with party colors, candidates and a donut chart showing proportions. The title gives a "poll id" and the name of the riding in parentheses. Since the poll id changes as I mouse over different subdivisions, I'm wondering whether a "poll" is the term for a subdivision of a riding. A quick wiki search indicates otherwise.

Stephentaylor_canadianelections_ridingmaps_donut

My best guess is the subdivisions are indicated by the numbers.

Back to the donut charts, I prefer a different sorting of the candidates. For this chart, the two most logical orderings are (a) order by overall popularity of the parties, fixed for all ridings and (b) order by popularity of the candidate, variable for each riding.

The map shown above gives the winner in each subdivision. This type of visualization dumps a lot of information. Stephen tackles this issue by offering a small multiples view of each party. Here is the Liberals in Vancouver.

Stephentaylor_canadianelections_ridingmaps_partystrength

Again, we encounter ambiguity about the color scheme. Liberals have been associated with a red color but we are faced with abundant yellow. After clicking on the other parties, you get the idea that he has switched to a divergent continuous color scale (red - yellow - green). Is red or green the higher value? (The answer is red.)

I'd suggest using a gray scale for these charts. The hardest decision is going to be the encoding between values and shading. Should each gray scale be different for each riding and each party?

If I were to take a guess, Stephen must have spent weeks if not months creating these maps (depending on whether he's full-time or part-time). What he has published here is a great start. Fine-tuning the issues I've mentioned may take more weeks or months more.

****

Stephen is brave and smart to send this project for review. For one thing, he's got some free consulting. More importantly, we should always send work around for feedback; other readers can tell us where our blind spots are.

To read more, start with this post by Stephen in which he introduces his project.


Ranking data provide context but can also confuse

This dataviz from the Economist had me spending a lot of time clicking around - which means it is a success.

Econ_usaexcept_hispanic

The graphic presents four measures of wellbeing in society - life expectancy, infant mortality rate, murder rate and prison population. The primary goal is to compare nations across those metrics. The focus is on comparing how certain nations (or subgroups) rank against each other, as indicated by the relative vertical position.

The Economist staff has a particular story to tell about racial division in the US. The dotted bars represent the U.S. average. The colored bars are the averages for Hispanic, white and black Americans. The wider the gap between the colored bars, the more variant is the experiences between American races.

The chart shows that the racial gap of life expectancy is the widest. For prison population, the U.S. and its racial subgroups occupy many of the lowest (i.e. least desirable) ranks, with the smallest gap in ranking.

***

The primary element of interactivity is hovering on a bar, which then highlights the four bars corresponding to the particular nation selected. Here is the picture for Thailand:

Econ_usaexcept_thailand

According to this view of the world, Thailand is a close cousin of the U.S. On each metric, the Thai value clings pretty near the U.S. average and sits within the range by racial groups. I'm surprised to learn that the prison population in Thailand is among the highest in the world.

Unfortunately, this chart form doesn't facilitate comparing Thailand to a country other than the U.S as one can highlight only one country at a time.

***

While the main focus of the chart is on relative comparison through ranking, the reader can extract absolute difference by reading the lengths of the bars.

This is a close-up of the bottom of the prison population metric:

Econ_useexcept_prisonpop_bottomThe length of each bar displays the numeric data. The red line is an outlier in this dataset. Black Americans suffer an incarceration rate that is almost three times the national average. Even white Americans (blue line) is imprisoned at a rate higher than most countries around the world.

As noted above, the prison population metric exhibits the smallest gap between racial subgroups. This chart is a great example of why ranking data frequently hide important information. The small gap in ranking masks the extraordinary absolute difference in incareration rates between white and black America.

The difference between rank #1 and rank #2 is enormous.

Econ_useexcept_lifeexpect_topThe opposite situation appears for life expectancy. The life expectancy values are bunched up especially at the top of the scale. The absolute difference between Hispanic and black America is 82 - 75 = 7 years, which looks small because the axis starts at zero. On a ranking scale, Hispanic is roughly in the top 15% while black America is just above the median. The relative difference is huge.

For life expectancy, ranking conveys the view that even a 7-year difference is a big deal because the countries are tightly bunched together. For prison population, ranking shows the view that a multiple fold difference is "unimportant" because a 20-0 blowout and a 10-0 blowout are both heavy defeats.

***

Whenever you transform numeric data to ranks, remember that you are artificially treating the gap between each value and the next value as a constant, even when the underlying numeric gaps show wide variance.

 

 

 

 

 


Start at zero improves this chart but only slightly

The following chart was forwarded to me recently:

Average_female_height

It's a good illustration of why the "start at zero" rule exists for column charts. The poor Indian lady looks extremely short in this women's club. Is the average Indian woman really half as tall as the average South African woman? (Surely not!)

Junkcharts_redo_womenheight_columnThe problem is only superficially fixed by starting the vertical axis at zero. Doing so highlights the fact that the difference in average heights is but a fraction of the average heights themselves. The intra-country differences are squashed in such a representation - which works against the primary goal of the data visualization itself.

Recall the Trifecta Checkup. At the top of the trifecta is the Question. The designer obviously wants to focus our attention on the difference of the averages. A column chart showing average heights fails the job!

This "proper" column chart sends the message that the difference in average heights is noise, unworthy of our attention. But this is a bad take of the underlying data. The range of average heights across countries isn't that wide, by virtue of large population sizes.

According to Wikipedia, they range from 4 feet 10.5 to 5 feet 6 (I'm ignoring several entries in the table based on non representative small samples.) How do we know that the difference of 2 inches between averages of South Africa and India is actually a sizable difference? The Wikipedia table has the average heights for most of the world's countries. There are perhaps 200 values. These values are sprinkled inside the range of about 8 inches top to bottom. If we divide the full range into 10 equal bins, that's roughly 0.8 inches per bin. So if we have two numbers that are 2 inches apart, they almost span 2 bins. If the data were evenly distributed, that's a huge shift.

(In reality, the data should be normally distributed, bell-shaped, with much more at the center than on the edges. That makes a difference of 2 inches even more significant if these are normal values near the center but less significant if these are extreme values on the tails. Stats students should be able to articulate why we are sure the data are normally distributed without having to plot the data.)

***

The original chart has further problems.

Another source of distortion comes from the scaling of the stick figures. The aspect ratio is being preserved, which means the area is being scaled. Given that the heights are scaled as per the data, the data are encoded twice, the second time in the widths. This means that the sizes of these figures grow at the rate of the square of the heights. (Contrast this with the scaling discussed in my earlier post this week which preserves the relative areas.)

At the end of that last post, I discuss why adding colors to a chart when the colors do not encode any data is a distraction to the reader. And this average height chart is an example.

From the Data corner of the Trifecta Checkup, I'm intrigued by the choice of countries. Why is Scotland highlighted instead of the U.K.? Why Latvia? According to Wikipedia, the Latvia estimate is based on a 1% sample of only 19 year olds.

Some of the data appear to be incorrect (or the designer used a different data source). Wikipedia lists the average height of Latvian women as 5 ft 6.5 while the chart shows 5 ft 5 in. Peru's average height of females is listed as 4 ft 11.5 and of males as 5 ft 4.5. The chart shows 5 ft 4 in.

***

Lest we think only amateurs make this type of chart, here is an example of a similar chart in a scientific research journal:

Fnhum-14-00338-g007

(link to original)

I have seen many versions of the above column charts with error bars, and the vertical axes not starting at zero. In every case, the heights (and areas) of these columns do not scale with the underlying data.

***

I tried a variant of the stem-and-leaf plot:

Junkcharts_redo_womenheight_stemleaf

The scale is chosen to reflect the full range of average heights given in Wikipedia. The chart works better with more countries to fill out the distribution. It shows India is on the short end of the scale but not quite the lowest. (As mentioned above, Peru actually should be placed close to the lower edge.)

 


Distorting perception versus distorting the data

This chart appears in the latest ("last print issue") of Schwab's On Investing magazine:

Schwab_oninvesting_returnlandscape

I know I don't like triangular charts, and in this post, I attempt to verbalize why.

It's not the usual complaint of distorting the data. When the base of the triangle is fixed, and only the height is varied, then the area is proportional to the height and thus nothing is distorted.

Nevertheless, my ability to compare those triangles pales in comparison to the following columns.

Junkcharts_triangles_rectangles

This phenomenon is not limited to triangles. One can take columns and start varying the width, and achieve a similar effect:

Junkcharts_changing_base

It's really the aspect ratio - the relationship between the height and the width that's the issue.

***

Interestingly, with an appropriately narrow base, even the triangular shape can be saved.

Junkcharts_narrower_base

In a sense, we can think of the width of these shapes as noise, a distraction - because the width is constant, and not encoding any data.

It's like varying colors for no reason at all. It introduces a pointless dimension.

Junkcharts_color_notdata

It may be prettier but the colors also interfere with our perception of the changing heights.


Plotting the signal or the noise

Antonio alerted me to the following graphic that appeared in the Economist. This is a playful (?) attempt to draw attention to racism in the game of football (soccer).

The analyst proposed that non-white players have played better in stadiums without fans due to Covid19 in 2020 because they have not been distracted by racist abuse from fans, using Italy's Serie A as the case study.

Econ_seriea_racism

The chart struggles to bring out this finding. There are many lines that criss-cross. The conclusion is primarily based on the two thick lines - which show the average performance with and without fans of white and non-white players. The blue line (non-white) inched to the right (better performance) while the red line (white) shifted slightly to the left.

If the reader wants to understand the chart fully, there's a lot to take in. All (presumably) players are ranked by the performance score from lowest to highest into ten equally sized tiers (known as "deciles"). They are sorted by the 2019 performance when fans were in the stadiums. Each tier is represented by the average performance score of its members. These are the values shown on the top axis labeled "with fans".

Then, with the tiers fixed, the players are rated in 2020 when stadiums were empty. For each tier, an average 2020 performance score is computed, and compared to the 2019 performance score.

The following chart reveals the structure of the data:

Junkcharts_redo_seriea_racism

The players are lined up from left to right, from the worst performers to the best. Each decile is one tenth of the players, and is represented by the average score within the tier. The vertical axis is the actual score while the horizontal axis is a relative ranking - so we expect a positive correlation.

The blue line shows the 2019 (with fans) data, which are used to determine tier membership. The gray dotted line is the 2020 (no fans) data - because they don't decide the ranking, it's possible that the average score of a lower tier (e.g. tier 3 for non-whites) is higher than the average score of a higher tier (e.g. tier 4 for non-whites).

What do we learn from the graphic?

It's very hard to know if the blue and gray lines are different by chance or by whether fans were in the stadium. The maximum gap between the lines is not quite 0.2 on the raw score scale, which is roughly a one-decile shift. It'd be interesting to know the variability of the score of a given player across say 5 seasons prior to 2019. I suspect it could be more than 0.2. In any case, the tiny shifts in the averages (around 0.05) can't be distinguished from noise.

***

This type of analysis is tough to do. Like other observational studies, there are multiple problems of biases and confounding. Fan attendance was not the only thing that changed between 2019 and 2020. The score used to rank players is a "Fantacalcio algorithmic match-level fantasy-football score." It's odd that real-life players should be judged by their fantasy scores rather than their on-the-field performance.

The causal model appears to assume that every non-white player gets racially abused. At least, the analyst didn't look at the curves above and conclude, post-hoc, that players in the third decile are most affected by racial abuse - which is exactly what has happened with the observational studies I have featured on the book blog recently.

Being a Serie A fan, I happen to know non-white players are a small minority so the error bars are wider, which is another issue to think about. I wonder if this factor by itself explains the shifts in those curves. The curve for white players has a much higher sample size thus season-to-season fluctuations are much smaller (regardless of fans or no fans).

 

 

 

 


Tip of the day: transform data before plotting

The Financial Times called out a twitter user for some graphical mischief. Here are the two charts illustrating the plunge in Bitcoin's price last week : (Hat tip to Mark P.)

Ft_tradingview_btcprices

There are some big differences between the two charts. The left chart depicts this month's price actions, drawing attention to the last week while the right chart shows a longer period of time, starting from 2012. The author of the tweet apparently wanted to say that the recent drop is nothing to worry about. 

The Financial Times reporter noted another subtle difference - the right chart uses a log scale while the left chart is linear. Specifically, it's a log 2 scale, which means that each step up is double the previous number (1, 2, 4, 8, etc.). The effect is to make large changes look smaller. Presumably most readers fail to notice the scale. Even if they do, it's not natural to assign different differences to the same physical distances.

***

Junkcharts_redo_fttradingviewbitcoinpricechart

These price charts always miss the mark. That's because the current price is insufficient to capture whether a Bitcoin investor made money or lost money. If you purchased Bitcoins this month, you lost money. If your purchase was a year ago, you still made quite a bit of money despite the recent price plunge.

The following chart should not be read as a time series, even though the horizontal axis is time. Think date of Bitcoin purchase. This chart tells you how much $1 of Bitcoin is worth last week, based on what day the purchase was made.

Junkcharts_redo_fttradingviewbitcoinpricechart_2

People who bought this year have mostly been in the red. Those who purchased before October 2020 and held on are still very pleased with their decision.

This example illustrates that simple transformations of the raw data yield graphics that are much more informative.

 


Reading this chart won't take as long as withdrawing troops from Afghanistan

Art sent me the following Economist chart, noting how hard it is to understand. I took a look, and agreed. It's an example of a visual representation that takes more time to comprehend than the underlying data.

Econ_theendisnear

The chart presents responses to 3 questions on a survey. For each question, the choices are Approve, Disapprove, and "Neither" (just picking a word since I haven't seen the actual survey question). The overall approval/disapproval rates are presented, and then broken into two subgroups (Democrats and Republicans).

The first hurdle is reading the scale. Because the section from 75% to 100% has been removed, we are left with labels 0, 25, 50, 75, which do not say percentages unless we've consumed the title and subtitle. The Economist style guide places the units of data in the subtitle instead of on
the axis itself.

Our attention is drawn to the thick lines, which represent the differences between approval and disapproval rates. These differences are signed: it matters whether the proportion approving is higher or lower than the proportion disapproving. This means the data are encoded in the order of the dots plus the length of the line segment between them.

The two bottom rows of the Afghanistan question demonstrates this mental challenge. Our brains have to process the following visual cues:

1) the two lines are about the same lengths

2) the Republican dots are shifted to the right by a little

3) the colors of the dots are flipped

What do they all mean?

Econ_theendofforever_subset

A chart runs in trouble when you need a paragraph to explain how to read it.

It's sometimes alright to make complicated data visualization that illustrates complicated concepts. What justifies it is the payoff. I wrote about the concept of return on effort in data visualization here.

The payoff for this chart escaped me. Take the Democratic response to troop withdrawal. About 3/4 of Democrats approve while 15% disapprove. The thick line says 60% more Democrats approve than disapprove.

***

Here, I show the full axis, and add a 50% reference line

Junkcharts_redo_econ_theendofforever_1

Small edits but they help visualize "half of", "three quarters of".

***

Next, I switch to the more conventional stacked bars.

Junkcharts_redo_econ_theendofforever_stackedbars

This format reveals some of the hidden data on the chart - the proportion answering neither approve/disapprove, and neither yes/no.

On the stacked bars visual, the proportions are counted from both ends while in the dot plot above, the proportions are measured from the left end only.

***

Read all my posts about Economist charts here