Students demonstrate how analytics underlie strong dataviz

In today's post, I'm delighted to feature work by several students of Ray Vella's data visualization class at NYU. They were asked to improve the following Economist chart entitled "The Rich Get Richer".

Economist_richgetricher

In my guest lecture to the class, I emphasized the importance of upfront analytics when constructing data visualizations.

One of the key messages is to pay attention to definitions. How does the Economist define "rich" and "poor"? (It's not what you think.) Instead of using percentiles (e.g. top 1% of the income distribution), they define "rich" as people living in the richest region by average GDP, and "poor" as people living in the poorest region by average GDP. Thus, the "gap" between the rich and the poor is measured by the difference in GDP between the average persons in those two regions.

I don't like this metric at all but we'll just have to accept that that's the data available for the class assignment.

***

Shulin Huang's work is notable in how she clarifies the underlying algebra.

Shulin_rvella_economist_richpoorgap

The middle section classifies the countries into two groups, those with widening vs narrowing gaps. The side panels show the two components of the gap change. The gap change is the change in the richest region minus the change in the poorest region.

If we take the U.S. as an example, the gap increased by 1976 units. This is because the richest region gained 1777 while the poorest region lost 199. Germany has a very different experience: the richest region regressed by 2215 while the poorest region improved by 424, leading to the gap narrowing by 2638.
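For readers who want to check the algebra, here is a minimal sketch using the numbers quoted above. The function name is my own, and a loss is treated as a negative change. With the rounded figures, Germany comes out at 2639 rather than 2638, presumably because the underlying values were rounded before display.

```python
def gap_change(richest_change, poorest_change):
    """Change in the rich-poor gap: richest region's change minus poorest region's change."""
    return richest_change - poorest_change

# Figures quoted in the text; losses are entered as negative changes.
us = gap_change(1777, -199)       # richest gained, poorest lost
germany = gap_change(-2215, 424)  # richest regressed, poorest improved

print("U.S. gap change:", us)
print("Germany gap change:", germany)
```

A positive result means the gap widened; a negative result means it narrowed.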

Note how important it is to keep the order of the countries fixed across all three panels. I'm not sure how she decided the order of these countries, which is a small oversight in an otherwise excellent effort.

Shulin's text is very thoughtful throughout. The chart title clearly states "rich regions" rather than "the rich". Take a look at the bottom of the side panels. The label "national AVG" shows that the zero level is the national average. Then, the label "regions pulled further ahead" perfectly captures the positive direction.

Compared to the original, this chart is much more easily understood. The secret is the clarity of thought, the deep understanding of the nature of the data.

***

Michael Unger focuses his work on elucidating the indexing strategy employed by the Economist. In the original, each value of regional average GDP is indexed to the national average of the relevant year. A number like 150 means the region has an average GDP for the given year that is 50% higher than the national average. It's tough to explain how such indices work.
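A tiny sketch may help with the indexing. The dollar amounts below are hypothetical and the function name is mine; only the interpretation of "150" comes from the chart.

```python
def index_to_national(regional_gdp, national_gdp):
    """Regional average GDP expressed as an index, with the national average set to 100."""
    return 100 * regional_gdp / national_gdp

# Hypothetical figures for one region in one year.
index = index_to_national(45_000, 30_000)
pct_above_national = index - 100

print(index, pct_above_national)
```

Because the national average changes every year, the same index value in two different years can correspond to different amounts of GDP, which is part of why these indices are hard to explain.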

Michael's revision goes back to the raw data. He presents them in two panels. On the left, the absolute change over time in average GDP is presented for each of the richest and poorest regions, while on the right, the relative change is shown.

Mungar_rvella_economist_richpoorgap

(Some of the country labels are incorrect. I'll replace with a corrected version when I receive one.)

Presenting both sides is not redundant. In France, for example, the richest region improved by 17K while the poorest region went up by not quite 6K. But 6K on a much lower base represents a much higher proportional jump as the right side shows.
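Here is the same point as arithmetic. The absolute gains are the rounded ones quoted above (17K and roughly 6K); the starting values are hypothetical, picked only to illustrate the effect of a lower base.

```python
def relative_change(base, absolute_change):
    """Percent change relative to the starting value."""
    return 100 * absolute_change / base

# Hypothetical bases; absolute gains rounded from the text.
rich_pct = relative_change(50_000, 17_000)
poor_pct = relative_change(12_000, 6_000)

print(f"richest: +{rich_pct:.0f}%, poorest: +{poor_pct:.0f}%")
```

The richest region wins on the absolute scale while the poorest region wins on the relative scale, which is why the two panels tell different stories.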

***

Related to Michael's work, but even simpler, is Debbie Hsieh's effort.

Debbiehsieh_rayvella_economist_richpoorgap

Debbie reduces the entire exercise to one message - the relative change over time in average GDP between the richest and poorest region in each country. In this simplest presentation, if both columns point up, then both the richest and the poorest region increased their average GDP; if both point down, then both regions suffered GDP drops.

If the GDP increased in the richest region while it decreased in the poorest region, then the gap widened by the most. This is represented by the blue column pointing up and the red column pointing down.

In some countries (e.g. Sweden), the poorest region (orange) got worse while the richest region (blue) improved slightly. In Italy and Spain, both the best and worst regions gained in average GDP although the richest region attained a greater relative gain.

While Debbie's chart is simpler, it hides something that Michael's work shows more clearly. If both the richest and poorest regions increased GDP by the same percentage amount, the average person in the richest region actually experienced a higher absolute increase because the base of the percentage is higher.

***

The numbers across these charts aren't necessarily well aligned. That's actually one of the challenges of this dataset. There are many ways to process the data, and small differences in how each student handles the data lead to differences in the derived values, resulting in differences in the visual effects.


Out of line

This simple chart showing life expectancies in 10 countries raises one's eyebrows.

Lifeexpectancy_indiatv

The first curiosity is the deliberate placement of Pakistan behind India and China. Every nation is sorted from lowest to highest, except for Pakistan. Is the reason politics? I have no idea. If you have an explanation, please leave a comment.

***
This graphic is an example of data visualization that does not actually show the data.

The positions of the flags do not in fact encode the data! For example, the Indian flag is closer to the Chinese flag than to the Pakistani flag even though the gap between India and China (7) is more than double the gap between India and Pakistan (3).

Here is what it looks like if the gaps encode the data. With this selection of countries, Pakistan and India are separated from the rest. 

Junkcharts_redo_indiatvlifeexpectancy

In the original chart, readers must read the data labels to understand it, and resist interpreting the visual elements.

I removed the flag poles because they have the unintended consequence of establishing a zero level (where the cartoon characters stand) but the positions of the flags don't reflect a start-at-zero posture.

***

Returning to our first topic for a second. If the message of the chart is to single out Pakistan, it actually works! If all other countries are sorted by value, with Pakistan inserted out of order, it draws our attention.

In a conventional layout, Pakistan is shoved to the left side in the bottom corner. See below:

Junkcharts_redo_indiatvlifeexpectancy_2



Hammock plots

Prof. Matthias Schonlau gave a presentation about "hammock plots" in New York recently.

Here is an example of a hammock plot that shows the progression of different rounds of voting during the 1903 papal conclave. (These pictures were taken at the event and are thus a little askew.)

Hammockplot_conclave

The chart shows how Cardinal Sarto beat the early favorite Rampolla during later rounds of voting. The chart traces the movement of votes from one round to the next. The Vatican normally destroys voting records, but apparently the records were unexpectedly retained for this particular conclave.

The dataset has several features that bring out the strengths of such a plot.

There is a fixed number of votes, and a fixed number of candidates. At each stage, the votes are distributed across the subset of candidates. From stage to stage, the support levels for each candidate shift. The chart brings out the evolution of the vote.

From the "marginals", i.e. the stacked columns shown at each time point, we learn the relative strengths of the candidates, as they evolve from vote to vote.

The links between the column blocks display the evolution of support from one vote to the next. We can see which candidate received more votes, as well as where the additional votes came from (or, to whom some voters have drifted).

The data are neatly arranged in successive stages, resulting in discrete time steps.

Because the total number of votes is fixed, the relative sizes of the marginals are nicely constrained.

The chart is made much more readable because of binning. Only the top three candidates are shown individually with all the others combined into a single category. This chart would have been quite a mess if it showed, say, 10 candidates.

How precisely we can show the inter-stage movement depends on how the data records were kept. If we have the votes for each person in each round, then it should be simple to execute the above! If we only have the marginals (the vote distribution by candidate) at each round, then we are forced to make some assumptions about which voters switched their votes. We'd likely have to rule out unlikely scenarios, such as one in which all of the previous voters for candidate X switched to other candidates while another set of voters switched their votes to candidate X.
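To make the distinction concrete, here is a minimal sketch with made-up ballots (four hypothetical voters; the function names are mine and have nothing to do with Matthias's package). With per-voter records, the links between columns are just counts of (previous choice, next choice) pairs. Two very different sets of links can produce identical marginals.

```python
from collections import Counter

def transition_counts(round_a, round_b):
    """Count voters moving from each candidate in round_a to each candidate in round_b.

    Both arguments map a voter id to the candidate that voter chose.
    """
    return Counter((round_a[voter], round_b[voter]) for voter in round_a)

def marginals(round_votes):
    """Vote totals by candidate for a single round."""
    return Counter(round_votes.values())

# Hypothetical ballots, not the actual 1903 records.
round_1 = {"v1": "Rampolla", "v2": "Rampolla", "v3": "Sarto", "v4": "Other"}
round_2 = {"v1": "Rampolla", "v2": "Sarto", "v3": "Sarto", "v4": "Sarto"}
# An alternative second round with the same totals but different switching.
round_2_alt = {"v1": "Sarto", "v2": "Sarto", "v3": "Rampolla", "v4": "Sarto"}

flows = transition_counts(round_1, round_2)
flows_alt = transition_counts(round_1, round_2_alt)

print(marginals(round_2) == marginals(round_2_alt))  # same column heights
print(flows == flows_alt)                            # different links
```

In the alternative scenario, every first-round Rampolla voter leaves while a former Sarto voter joins Rampolla, which is the kind of implausible pattern one would want to rule out when only the marginals are known.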

***

Matthias also showed examples of hammock plots applied to different types of datasets.

The following chart displays data from course evaluations. Unlike the conclave example, the variables tied to questions on the survey are neither ordered nor sequential. Therefore, there is no natural sorting available for the vertical axes.

Hammockplot_evals

Time is a highly useful organizing element for this type of chart. Without such an organizing element, the designer manually customizes an order.

The vertical axes correspond to specific questions on the course evaluation. Students are aggregated into groups based on the "profile" of grades given for the whole set of questions. It's quite easy to see that opinions are most aligned on the "workload" question while most of the scores are skewed high.

Missing values are handled by plotting them as a new category at the bottom of each vertical axis.

This example is similar to the conclave example in that each survey response is categorical, one of five values (plus missing). Matthias also showed examples of hammock plots in which some or all of the variables are numeric data.

***

Some of you will see some resemblance between the hammock plot and various similar charts, such as the profile chart, the alluvial chart, the parallel coordinates chart, and Sankey diagrams. Matthias discussed all those as well.

Matthias has a book out called "Applied Statistical Learning" (link).

Also, there is a Python package for the hammock plot on github.


Logging a sleight of hand

Andrew puts up an interesting chart submitted by one of his readers (link):

Gelman_overnightreturns_tsla

Bruce Knuteson, who created this chart, is pursuing a theory that there is something fishy going on in the stock markets overnight (i.e. between the close of one day and the open of the next day). He split the price data into two interleaving parts: the blue line represents returns overnight and the green line represents returns intraday (from open of one day to the close of the same day). In this example related to Tesla's stock, the overnight "return" is an eyepopping 36850% while the intraday "return" is -46%.
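Here is one plausible way to carry out such a split, sketched with made-up prices. I don't know exactly how Bruce computed his series, so treat the function below (and its name) as an assumption: the overnight leg compounds open/previous close, the intraday leg compounds close/open.

```python
def split_returns(opens, closes):
    """Cumulative overnight and intraday returns, in percent.

    opens[i] and closes[i] are the opening and closing prices on day i.
    """
    overnight_growth = 1.0
    intraday_growth = 1.0
    for i in range(len(opens)):
        intraday_growth *= closes[i] / opens[i]             # open -> close, same day
        if i > 0:
            overnight_growth *= opens[i] / closes[i - 1]    # previous close -> open
    return 100 * (overnight_growth - 1), 100 * (intraday_growth - 1)

# Hypothetical prices: the stock jumps every night and gives it all back during the day.
opens = [100.0, 120.0, 120.0]
closes = [100.0, 100.0, 100.0]

overnight_pct, intraday_pct = split_returns(opens, closes)
print(f"overnight: {overnight_pct:.1f}%, intraday: {intraday_pct:.1f}%")
```

In this toy example, a buy-and-hold investor ends up exactly where they started, yet the two legs look dramatic on their own. The two growth factors always multiply back to the overall growth from the first open to the last close.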

This is an example of an average masking interesting details in the data. One typically looks at the entire sequence of values at once, while this analysis breaks it up into two subsequences. I'll write more about the data analysis at a later point. This post will be purely about the visualization.

***

It turns out that while the chart looks like a standard time series, it isn't. Bruce wrote out the following essential explanation:

Gelman_overnightreturns

The chart can't be interpreted without first reading this note.

The left chart (a) is the standard time-series chart we're thinking about. It plots the relative cumulative percentage change in the value of the investment over time. Imagine one buys $1 of Apple stock on day 1. It shows the cumulative return on day X, expressed as a percent relative to the initial investment amount. As mentioned above, the data series was split into two: the intraday return series (green) is dwarfed by the overnight return series (blue), and is barely visible hugging the horizontal axis.

Almost without thinking, a graphics designer applies a log transform to the vertical axis. This has the effect of "taming" the extreme values in the blue line. This is the key design change in the middle chart (b). The other change is to switch back to absolute values. The day 1 number is now $1 so the day X number shows the cumulative value of the investment on day X if one started with $1 on day 1.

There's a reason why I emphasized the log transform over the switch to absolute values. That's because the relationship between absolute and relative values here is a linear one. If y(t) is the absolute cumulative value of $1 at time t, then the percent change r(t) = 100(y(t) -1). (Note that y(0) = 1 by definition.)  The shape of the middle chart is primarily conditioned by the log transform.

In the right chart (c), which is the design that Bruce features in all his work, the visual elements of chart (b) are retained while he replaced the vertical axis labels with those from chart (a). In other words, the lines show the cumulative absolute values while the labels show the relative cumulative percent returns.

I left this note on Gelman's blog (corrected a mislabeling of the chart indices):

I'm interested in the sleight of hand related to the plots, also tying this back to the recent post about log scales. In plot (b) [middle of the panel], he transformed the data to show the cumulative value of the investment assuming one puts $1 in the stock on day 1. He applied a log scale on the vertical axis. This is fine. Then in plot (c), he retained the chart but changed the vertical axis labels so instead of absolute value of the investment, he shows percent changes relative to the initial value.

Why didn't he just plot the relative percent changes? Let y(t) be the absolute values and r(t) = the percent change = 100*(y(t) -1) is a simple linear transformation of y(t). This is where the log transform creates problems! The y(t) series is guaranteed to be positive since hitting y(t) = 0 means the entire investment is lost. However, the r(t) series can hit negative values and also cross over zero many times over time. Thus, log r(t) is inoperable. The problem is using the log transform for data that are not always positive, and the sleight of hand does not fix it!

Just pick any day in which the absolute return fell below $1, e.g. the last day of the plot in which the absolute value of the investment was down to $0.80. In the middle plot (b), the value depicted is ln(0.8) = -0.22. Note that the plot is in log scale, so what is labeled as $1 is really ln(1) = 0. If we instead try to plot the relative percent changes, then the day 1 number should be ln(0) which is undefined while the last number should be ln(-20%) which is also undefined.

This is another example of something uncomfortable about using log scales which I pointed out in this post. It's this idea that when we do log plots, we can freely substitute axis labels which are not directly proportional to the actual labels. It's plotting one thing, and labelling it something else. These labels are then disconnected from the visual encoding. It's against the goal of visualizing data.
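The arithmetic in the note above is easy to verify. In this sketch, safe_log is a hypothetical helper of mine that returns None wherever the natural log is undefined; the $0.80 value is the one used in the note.

```python
import math

def percent_change(y):
    """r(t) = 100 * (y(t) - 1), the percent change relative to the initial $1."""
    return 100 * (y - 1)

def safe_log(x):
    """Natural log, or None where it is undefined (x <= 0)."""
    return math.log(x) if x > 0 else None

y_first, y_last = 1.0, 0.80   # absolute value of the investment on day 1 and on the last day

# Absolute values are always positive, so the log is always defined.
print(safe_log(y_first), round(safe_log(y_last), 2))

# Percent changes can be zero or negative, so the log breaks down.
r_first, r_last = percent_change(y_first), percent_change(y_last)
print(r_first, safe_log(r_first))
print(round(r_last), safe_log(r_last))
```

The y(t) series plots fine on a log scale while the r(t) series does not, even though one is just a linear transformation of the other.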



The message left the visual

The following chart showed up in Princeton Alumni Weekly, in a report about China's population:

Sciam_chinapop_19802020

This chart was one of several that appeared in a related Scientific American article.

The story itself is not surprising. As China develops, its birth rate declines while the death rate also falls; thus, the population ages. The same story has played out in all advanced economies.

***

From a Trifecta Checkup perspective, this chart suffers from several problems.

The text annotation on the top right suggests what message the authors intended to deliver. Pointing to the group of people aged between 30 and 59 in 2020, they remarked that this large cohort would likely cause "a crisis" when they age. There would be fewer youngsters to support them.

Unfortunately, the data and visual elements of the chart do not align with this message. Instead of looking forward in time, the chart compares the 2020 population pyramid with that from 1980, looking back 40 years. The chart shows an insight from the data, just not the right one.

A major feature of a population pyramid is the split by gender. The trouble is gender isn't part of the story here.

In terms of age groups, the chart treats each subgroup "fairly". As a result, the reader isn't shown which of the 22 subgroups to focus on. There are really 44 subgroups if we count each gender separately, and 88 subgroups if we include the year split.

***

The following redesign traces the "crisis" subgroup (those who were 30-59 in 2020) both backwards and forwards.

Junkcharts_redo_chinapopulationpyramids

The gender split has been removed; here, the columns show the total population. Color is used to focus attention to one cohort as it moves through time.

Notice I switched up the sample times. I pulled the population data for 1990 and 2060 (from this website). The original design used the population data from 1980 instead of 1990. However, this choice is at odds with the message. People who were 30 in 2020 were not yet born in 1980! They started showing up in the 1990 dataset.

At the other end of the "crisis" cohort, the oldest (59 years old in 2020) would be deceased by 2100, as 59+80 = 139. Even the youngest (30 in 2020) would be 110 by 2100, so almost everyone in the pink section of the 2020 chart would have fallen off the right side of the chart by 2100.
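The cohort arithmetic behind these choices can be laid out in a few lines. The function name is mine; the years are the ones mentioned in this post.

```python
def age_in(year, age_in_2020):
    """Age in a given year of someone who was age_in_2020 years old in 2020.

    A negative result means the person was not yet born.
    """
    return age_in_2020 + (year - 2020)

for year in (1980, 1990, 2020, 2060, 2100):
    youngest = age_in(year, 30)
    oldest = age_in(year, 59)
    print(year, youngest, oldest)
```

The youngest members of the cohort have a negative age in 1980, are newborns in 1990, and would be 110 in 2100, which is why 1990 and 2060 frame the cohort better than 1980 and 2100.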

These design decisions insert a gap between the visual and the message.



Aligning the visual and the message

Today's post is about work by Diane Barnhart, who is a product manager at Bloomberg, and is taking Ray Vella's infographics class at NYU. The class is given a chart from the Economist, as well as some data on GDP per capita in selected countries at the regional level. The students are asked to produce a data visualization that explores the change in income inequality (as indicated by GDP per capita).

Here is Diane's work:

Diane Barnhart_Rich Get Richer

In this chart, the key measure is the GDP per capita of different regions in Germany relative to the national average GDP. Hamburg, for example, has a GDP per capita that was 80% above the national average in 2000 while Leipzig's GDP per capita was 30% below the national average in 2000. (This metric is a bit of a head scratcher, and forms the basis of the Economist chart.)

***

Diane made several insightful design choices.

The key insight of this graph is also one of the easiest to see. It's the narrowing of the range of possible values. In 2000, the top value is about 90% while the bottom is under -40%, making a range of 130%. In 2020, the range has narrowed to 90%, with the values falling between 60% and -30%. In other words, the gap between rich and poor regions in Germany has reduced over these two decades.

The chosen chart form makes this message come alive.

Diane divided the regions into three groups, mapped to the black, red and yellow colors of the German flag. Black is for those regions that have GDP per capita above the average; yellow is for those regions with GDP per capita more than 25% below the average; red presumably covers the regions in between.

Instead of applying color to individual lines that trace the GDP metric over time for each region, she divided the area between the lines into three, and painted them. This necessitates a definition of the boundary line between colored areas over time. I gathered that she classified the regions using the latest GDP data (2020) and then traced the GDP trend lines back in time. Other definitions are also possible.

The two-column data table shown on the right provides further details that aren't found in the data visualization. The table is nicely enhanced with colors. They represent an augmentation of the information in the main chart, not a repetition.

All in all, this is a delightful project, and worthy of a top grade!


Coffee in different shapes and sizes: a test of self-sufficiency

Take a look at the following graphic showing top producers of coffee in 2024:

Junkcharts_voronoicoffeeproduction

Then, try the following tasks:

  • Which country is the top producer?
  • What proportion of the world's production does the top country make?
  • Which countries form the top three?
  • How much is the "Rest of the World" compared to Brazil?
  • How many countries account for the top 50% of the world's production?
  • Does Indonesia or Colombia produce more coffee?
  • Compare India and Uganda
  • How about Honduras vs Peru?

I finished two cups of coffee and still couldn't answer most of these questions. How about you?

***

Now, let's look at the original chart, published by Voronoi, and sent to me by a long-time reader:

Visualcapitalist_coffee

Try those questions again, and the answers seem much more available.

How so?

What we've just demonstrated is that when the reader takes information from this graphic, the reader is consuming the data labels, while the visual encoding of data to shapes has offered zero help.

Given this finding, replacing the above chart with a data table would have achieved the same result, if not expediting understanding.

***

I'm using this graphic to illustrate my "self-sufficiency" test: by removing all data labels from the chart, we reveal how much work the visual elements are doing to enable understanding of the message and the underlying data.

***

Now, our long-time reader has a few comments, with which I agree:

  • what they did right: avoided the "let's just use a choropleth" trap
  • what went wrong? a) using shapes you can't compare at a glance
  • what went wrong? b) no color difference between the shapes
  • what went wrong? c) it looks like larger values are on top, except for Mexico which is squeezed up top for some reason



Making major things easy, revisited

In the prior post, I made a chart that shows the driver license status of British drivers at different ages. The key change unplugs the obsession with a+b+c = 100%. Instead, the revised chart makes it easier to figure out what proportion of which age group holds which type of license.

This is the right-side plot from the panel of two plots:

Junkcharts_redo_significanceolddrivers_male

Looking at this chart, one might think my primary point of interest is the relative proportion with full license vs no license. But on second thought, I'm less interested in this comparison than that between male and female drivers. Does the prevalence of full licenses differ between men and women as they age?

In the original panel, the reader has to run back and forth between the two plots. Why not put that comparison on a single plot?

Like this:

Junkcharts_redo_significanceolderdrivers_fulllicense

This chart surfaces the difference between men and women (at all age groups) in owning full driver's licenses. Women are much more likely to stop driving earlier.

Here is the entire panel:

Junkcharts_redo_significanceolderdrivers_bylicense

Because of this structural choice, it is harder on this panel to learn the distribution of license status.



Deliberately obstructing chart elements as a plot point

These "ridge plots" have become quite popular in recent times. The following example, from this BBC report (link), shows the change in global air temperatures over time.

Bbc_globalwarming_ridgeplot sm

***

This chart is in reality a panel of probability density plots, one for each year of the dataset. The years are arranged with the oldest at the top and the most recent at the bottom. You take those plots and squeeze every ounce of the space out, so that each chart overlaps massively with the ones above it.

The plot at the bottom is the only one that can be seen unobstructed.

Overplotting chart elements, deliberately obstructing them, doesn't sound useful. Is there something gained for what's lost?

***

The appeal of the ridge plot is the metaphor of ridges, or crests if you see ocean waves. What do these features signify?

The legend at the bottom of the chart gives a hint.

The main metric used to describe global warming is the amount of excess temperature, defined as the temperature relative to a historical average, set as the average temperature during the pre-industrial age. In recent years, the average global temperature is about 1.5 degrees Celsius above the reference level.

One might think that the higher the peak in a given plot, the higher the excess temperature. Not so. The heights of those peaks do not indicate temperatures.

What's the scale of the vertical axis? The labels suggest years, but that's a distractor also. If we consider the panel of non-overlapping probability density charts, the vertical axis should show probability density. In such a panel, the year labels should go to the titles of individual plots. On the ridge plot, the density axes are sacrificed, while the year labels are shifted to the vertical axis.

Admittedly, probability density is not an intuitive concept, so not much is lost by its omission.

The legend appears to suggest that the vertical scale is expressed in number of days so that in any given year, the peak of the curve occurs where the most likely excess temperature is found. But the amount of excess is read from the horizontal axis, not the vertical axis - it is encoded as a displacement in location horizontally away from the historical average. In other words, the height of the peak still doesn't correlate with the magnitude of the excess temperature.

The following set of probability density curves (with made-up data) each has the same average excess temperature of 1.5 degrees. Going from top to bottom, the variability of the excess temperatures increases. The height of the peak decreases accordingly because in a density plot, we require the total area under the curve to be fixed. Thus, the higher the peak, the lower the daily variability of the excess temperature.

Kfung_pdf_variances
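The same relationship can be checked numerically. This sketch assumes normal curves purely for convenience (the made-up data in the picture need not be normal), and the standard deviations are hypothetical. Every curve has the same mean of 1.5 and the same area of one, so the peak height is driven entirely by the spread.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def area_under(mean, sd, lo=-20.0, hi=20.0, steps=40_000):
    """Approximate area under the density curve with a midpoint Riemann sum."""
    width = (hi - lo) / steps
    return width * sum(normal_pdf(lo + (i + 0.5) * width, mean, sd) for i in range(steps))

mean_excess = 1.5
spreads = [0.25, 0.5, 1.0]   # hypothetical standard deviations, increasing top to bottom

peaks = [normal_pdf(mean_excess, mean_excess, sd) for sd in spreads]
areas = [area_under(mean_excess, sd) for sd in spreads]

for sd, peak, area in zip(spreads, peaks, areas):
    print(f"sd={sd}: peak height={peak:.3f}, area={area:.3f}")
```

Doubling the spread halves the peak, while the location of the peak on the horizontal axis, which is what carries the excess temperature, does not move at all.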

A problem with this ridge plot is that it draws our attention to the heights of the peaks, which provide information about a secondary metric.

If we want to find the story that the amount of excess temperature has been increasing over time, we would have to trace a curve through the ridges, which strangely enough is a line that moves top to bottom, initially somewhat vertically, then moving sideways to the right. In a more conventional chart, the line that shows growth over time moves from bottom left to top right.

***

The BBC article (link) features several charts. The first one shows how the average excess temperature trends year to year. This is a simple column chart. By supplementing the column chart with the ridge plot, I assume that the designer wants to tell readers that the average annual excess temperature masks daily variability. Therefore, each annual average has been disaggregated into 366 daily averages.

In the column chart, the annual average is compared to the historical average of 50 years. In the ridge plot, the daily average is compared to ... the same historical average of 50 years. That's what the reference line labeled pre-industrial average is saying to me.

It makes more sense to compare the 366 daily averages to 366 daily averages from those 50 years.

But now I've ruined the dataviz because in each probability density plot, there are 366 different reference points. But not really. We just have to think a little more abstractly. These 366 different temperatures are all mapped to the number zero, after adjustment. Thus, they all coincide at the same location on the horizontal axis.
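A minimal sketch of that adjustment, using three hypothetical days instead of 366 (all temperatures are made up, and I am not claiming this is how the BBC computed its values): each day's temperature is compared to that same day's historical reference.

```python
def anomalies(daily_temps, daily_reference):
    """Excess temperature for each day relative to that same day's reference value."""
    return [temp - ref for temp, ref in zip(daily_temps, daily_reference)]

# Hypothetical values for a winter, a spring and a summer day, in degrees Celsius.
reference = [2.0, 15.0, 24.0]    # historical averages for those calendar days
this_year = [3.5, 16.2, 25.9]    # temperatures on the same calendar days this year

print(anomalies(this_year, reference))
print(anomalies(reference, reference))
```

The second line prints all zeros: after adjustment, every one of the daily references sits at the same zero point on the horizontal axis, so a single reference line still suffices.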

(It's possible that they actually used 366 daily averages as references to construct the ridge plot. I'm guessing not but feel free to comment if you know how these values are computed.)


Don't show everything

There are many examples where one should not show everything when visualizing data.

A long-time reader sent me this chart from the Economist, published around Thanksgiving last year:

Economist_musk

It's a scatter plot with each dot representing a single tweet by Elon Musk against a grid of years (on the horizontal axis) and time of day (on the vertical axis).

The easy messages to pick up include:

  • the increase in frequency of tweets over the years
  • especially, the jump in density after Musk bought Twitter in late 2022 (there is also a less obvious level up around 2018)
  • the almost continuous tweeting throughout 24 hours.

By contrast, it's hard if not impossible to learn the following:

  • how many tweets did he make on average or in total per year, per day, per hour?
  • the density of tweets for any single period of time (i.e., a reference for everything else)
  • the growth rate over time, especially the magnitude of the jumps

The paradox: a chart that is data-dense but information-poor.

***

The designer added gridlines and axis labels to help structure our reading. Specifically, we're cued to separate the 24 hours into four 6-hour chunks. We're also expected to divide the years into two groups (pre- and post- the Musk acquisition), and secondarily, into one-year intervals.

If we accept this analytical frame, then we can divide time into these boxes, and then compute summary statistics within each box, and present those values.  I'm working on some concepts, will show them next time.
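As a rough sketch of that computation, assuming each tweet comes with a timestamp (the five timestamps below are invented, and the function name is mine), the boxes can be formed by pairing the year with a 6-hour block of the day:

```python
from collections import Counter
from datetime import datetime

def box_counts(timestamps):
    """Count tweets in each (year, 6-hour block) box.

    Block 0 covers 00:00-05:59, block 1 covers 06:00-11:59, and so on.
    """
    counts = Counter()
    for ts in timestamps:
        counts[(ts.year, ts.hour // 6)] += 1
    return counts

# Invented timestamps, for illustration only.
tweets = [
    datetime(2021, 3, 1, 2, 15),
    datetime(2021, 7, 9, 14, 0),
    datetime(2023, 1, 5, 14, 30),
    datetime(2023, 1, 5, 15, 45),
    datetime(2023, 6, 2, 23, 59),
]

counts = box_counts(tweets)
for (year, block), n in sorted(counts.items()):
    print(year, f"{block * 6:02d}:00-{block * 6 + 6:02d}:00", n)
```

Dividing each count by the number of days in the box would turn it into a tweets-per-day rate, which makes boxes of unequal length comparable.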