Selecting the right analysis plan is the first step to good dataviz

It's a new term, and my friend Ray Vella shared some student projects from his NYU class on infographics. There's always something to learn from these projects.

The starting point is a chart published in the Economist a few years ago.

Economist_richgetricher

This is a challenging chart to read. To save you the time, the following key points are pertinent:

a) income inequality is measured by the disparity between regional averages

b) the incomes are given in a double index, a relative measure. For each country and year combination, the average national GDP is set to 100. A value of 150 means the richest region of Spain has an average income that is 50% higher than Spain's national average in the year 2015.

The original chart - as well as most of the student work - is based on a specific analysis plan. The difference in the index values between the richest and poorest regions is used as a measure of the degree of income inequality, and the change in the difference in the index values over time, as a measure of change in the degree of income inequality over time. That's as big a mouthful as the bag of words sounds.

This analysis plan can be summarized as:

1) all incomes -> relative indices, at each region-year combination
2) inequality = rich - poor region gap, at each region-year combination
3) inequality over time = inequality in 2015 - inequality in 2000, for each country
4) country difference = inequality in country A - inequality in country B, for each year

***

One student, J. Harrington, looks at the data through an alternative lens that brings clarity to the underlying data. Harrington starts with change in income within the richest regions (then the poorest regions), so that a worsening income inequality should imply that the richest region is growing incomes at a faster clip than the poorest region.

This alternative analysis plan can be summarized as:
1) change in income over time for richest regions for each country
2) change in income over time for poorest regions for each country
3) inequality = change in income over time: rich - poor, for each country

The restructuring of the analysis plan makes a big difference!

Here is one way to show this alternative analysis:

Junkcharts_kfung_sixeurocountries_gdppercapita

The underlying data have not changed but the reader's experience is transformed.


Superb tile map offering multiple avenues for exploration

Here's a beauty by WSJ Graphics:

Wsj_powerproduction

The article is here.

This data graphic illustrates the power of the visual medium. The underlying dataset is complex: power production by type of source by state by month by year. That's more than 90,000 numbers. They all reside on this graphic.

Readers amazingly make sense of all these numbers without much effort.

It starts with the summary chart on top.

Wsj_powerproduction_us_summary

The designer made decisions. The data are presented in relative terms, as proportion of total power production. Only the first and last years are labeled, thus drawing our attention to the long-term trend. The order of the color blocks is carefully selected so that the cleaner sources are listed at the top and the dirtier sources at the bottom. The order of the legend labels mirrors the color blocks in the area chart.

It takes only a few seconds to learn that U.S. power production has largely shifted away from coal with most of it substituted by natural gas. Other than wind, the green sources of power have not gained much ground during these years - in a relative sense.

This summary chart serves as a reading guide for the rest of the chart, which is a tile map of all fifty states. Embedded in the tile map is a small-multiples arrangement.

***

The map offers multiple avenues for exploration.

Some readers may look at specific states. For example, California.

Wsj_powerproduction_california

Currently, about half of the power production in California come from natural gas. Notably, there is no coal at all in any of these years. In addition to wind, solar energy has also gained. All of these insights come without the need for any labels or gridlines!

Wsj_powerproduction_westernstatesBrowsing around California, readers find different patterns in other Western states like Oregon and Washington.

Hydroelectric energy is the dominant source in those two states, with wind gradually taking share.

At this point, readers realize that the summary chart up top hides remarkable state-level variations.

***

There are other paths through the map.

Some readers may scan the whole map, seeking patterns that pop out.

One such pattern is the cluster of states that use coal. In most of these states, the proportion of coal has declined.

Yet another path exists for those interested in specific sources of power.

For example, the trend in nuclear power usage is easily followed by tracking the purple. South Carolina, Illinois and New Hampshire are three states that rely on nuclear for more than half of its power.

Wsj_powerproduction_vermontI wonder what happened in Vermont about 8 years ago.

The chart says they renounced nuclear energy. Here is some history. This one-time event caused a disruption in the time series, unique on the entire map.

***

This work is wonderful. Enjoy it!


What do I think about spirals?

A twitter user asked how I feel about this latest effort (from NASA) to illustrate global warming. To see the entire video, go to their website.

Nasa_climatespiral_fullperiod

This video hides the lede so be patient or jump ahead to 0:56 and watch till the end.

Let's first describe what we are seeing.

The dataset consists of monthly average global temperature "anomalies" from 1880 to 2021 - an "anomaly" is the deviation of the average temperature that month from a reference level (seems like this is fixed at the average temperatures by month between 1951 and 1980).

A simple visualization of the dataset is this:

Junkcharts_redo_nasasprials_longline

We see a gradual rise in temperature from the 1980s to today. The front half of this curve is harder to interpret. The negative values suggest that the average temperatures prior to 1951 are generally lower than the temperature in the reference period. Other than 1880-1910, temperatures have generally been rising.

Now imagine chopping up the above chart into yearly increments, 12 months per year. Then wrap each year's line into a circle, and place all these lines onto the following polar grid system.

Junkcharts_redo_nasaspiral_linesandcircles

Close but not quite there. The circles in the NASA video look much smoother. Two possibilities here. First is the aspect ratio. Note that the polar grid stretches the time axis to the full circle while the vertical axis is squashed. Not enough to explain the smoothness, as seen below.

Junkcharts_redo_nasaspirals_unsmoothedwide

The second possibility is additional smoothing between months.

Junkcharts_redo_nasaspirals_smoothedlines

The end result is certainly pretty:

Nasa_climatespiral_fullperiod

***

Is it a good piece of scientific communications?

What is the chart saying?

I see red rings on the outside, white rings in the middle, and blue rings near the center. Red presumably means hotter, blue cooler.

The gridlines are painted over. The 0 degree (green) line is printed over again and again.

The biggest red circles are just beyond the 1 degree line with the excess happening in the January-March months. In making that statement, I'm inferring meaning to excess above 1 degree. This inference is purely based on where the 1-degree line is placed.

I also see in the months of December and January, there may have been "cooling", as the blue circles edge toward the -1 degree gridline. Drawing this inference actually refutes my previous claim. I had said that the bulge beyond the +1 degree line is informative because the designer placed the +1 degree line there. If I applied the same logic, then the location of the -1 degree line implies that only values more negative than -1 matter, which excludes the blue bulge!

Now what years are represented by these circles? Test your intuition. Are you tempted to think that the red lines are the most recent years, and the blue lines are the oldest years? If you think so, like I do, then we fall into a trap. We have now imputed two meanings to color -- temperature and recency, when the color coding can only hold one.

The only way to find out for sure is to rewind the tape and watch from the start. The year dimension is pushed to the background in this spiral chart. Instead, the month dimension takes precedence. Recall that at the start, the circles are white. The bluer circles appear in the middle of the date range.

This dimensional flip flop is a key difference between the spiral chart and the line chart (shown again for comparison).

Junkcharts_redo_nasasprials_longline

In the line chart, the year dimension is primary while the month dimension is pushed to the background.

Now, we have to decide what the message of the chart should be. For me, the key message is that on a time scale of decades, the world has experienced a significant warming to the tune of about 1.5 degrees Celsius (35 F2.7 F). The warming has been more pronounced in the last 40 years. The warming is observed in all twelve months of the year.

Because the spiral chart hides the year dimension, it does not convey the above messages.

The spiral chart shares the same weakness as the energy demand chart discussed recently (link). Our eyes tend to focus on the outer and inner envelopes of these circles, which by definition are extreme values. Those values do not necessarily represent the bulk of the data. The spiral chart in fact tells us that there is not much to learn from grouping the data by month. 

The appeal of a spiral chart for periodic data is similar to a map for spatial data. I don't recommend using maps unless the spatial dimension is where the signal lies. Similarly, the spiral chart is appropriate if there are important deviations from a seasonal pattern.

 

 


Dots, lines, and 2D histograms

Daniel Z. tweeted about my post from last week. In particular, he took a deeper look at the chart of energy demand that put all hourly data onto the same plot, originally published at the StackOverflow blog:

Stackoverflow_variabilitychart

I noted that this is not a great chart particularly since what catches our eyes are not the key features of the underlying data. Daniel made a clearly better chart:

Danielzvinca_densitychart

This is a dot plot, rather than a line chart. The dots are painted in light gray, pushed to the background, because readers should be looking at the orange line. (I'm not sure what is going on with the horizontal scale as I could not get the peaks to line up on the two charts.)

What is this orange line? It's supposed to prove the point that the apparent dark band seen in the line chart does not represent the most frequently occurring values, as one might presume.

Looking closer, we see that the gray dots do not show all the hourly data but binned values.

Danielzvinca_densitychart_inset
We see vertical columns of dots, each representing a bin of values. The size of the dots represents the frequency of values of each bin. The orange line connects the bins with the highest number of values.

Daniel commented that

"The visual aggregation doesn't in fact map to the most frequently occurring values. That is because the ink of almost vertical lines fills in all the space between start and end."

Xan Gregg investigated further, and made a gif to show this effect better. Here is a screenshot of it (see this tweet):

Xangregg_dots_vs_line

The top chart is a true dot plot so that the darker areas are denser as the dots overlap. The bottom chart is the line chart that has the see-saw pattern. As Xan noted, the values shown are strangely very well behaved (aggregated? modeled?) - with each day, it appears that the values sweep up and down consistently.  This means the values are somewhat evenly spaced on the underlying trendline, so I think this dataset is not the best one to illustrate Daniel's excellent point.

It's usually not a good idea to connect lots of dots with a single line.

 

[P.S. 3/21/2022: Daniel clarified what the orange line shows: "In the posted chart, the orange line encodes the daily demand average (the mean of the daily distribution), rounded, for displaying purposes, to the closed bin. Bin size = 1000. Orange could have encode the daily median as well."]

 


The what of visualization, beyond the how

A long-time reader sent me the following chart from a Nature article, pointing out that it is rather worthless.

Nautre_scihub

The simple bar chart plots the number of downloads, organized by country, from the website called Sci-Hub, which I've just learned is where one can download scientific articles for free - working around the exorbitant paywalls of scientific journals.

The bar chart is a good example of a Type D chart (Trifecta Checkup). There is nothing wrong with the purpose or visual design of the chart. Nevertheless, the chart paints a misleading picture. The Nature article addresses several shortcomings of the data.

The first - and perhaps most significant - problem is that many Sci-Hub users are expected to access the site via VPN servers that hide their true countries of origin. If the proportion of VPN users is high, the entire dataset is called into doubt. The data would contain both false positives (in countries with VPN servers) and false negatives (in countries with high numbers of VPN users). 

The second problem is seasonality. The dataset covered only one month. Many users are expected to be academics, and in the southern hemisphere, schools are on summer vacation in January and February. Thus, the data from those regions may convey the wrong picture.

Another problem, according to the Nature article, is that Sci-Hub has many competitors. "The figures include only downloads from original Sci-Hub websites, not any replica or ‘mirror’ site, which can have high traffic in places where the original domain is banned."

This mirror-site problem may be worse than it appears. Yes, downloads from Sci-Hub underestimate the entire market for "free" scientific articles. But these mirror sites also inflate Sci-Hub statistics. Presumably, these mirror sites obtain their inventory from Sci-Hub by setting up accounts, thus contributing lots of downloads.

***

Even if VPN and seasonality problems are resolved, the total number of downloads should be adjusted for population. The most appropriate adjustment factor is the population of scientists, but that statistic may be difficult to obtain. A useful proxy might be the number of STEM degrees by country - obtained from a UNESCO survey (link).

A metric of the type "number of Sci-Hub downloads per STEM degree" sounds odd and useless. I'd argue it's better than the unadjusted total number of Sci-Hub downloads. Just don't focus on the absolute values but the relative comparisons between countries. Even better, we can convert the absolute values into an index to focus attention on comparisons.

 


The envelope of one's data

This post is the second post in response to a blog post at StackOverflow (link) in which the author discusses the "harm" of "aggregating away the signal" in your dataset. The first post appears on my book blog earlier this week (link).

One stop in their exploratory data analysis journey was the following chart:

Stackoverflow_variabilitychart

This chart plots all the raw data, all 8,760 values of electricity consumption in California in 2020. Most analysts know this isn't a nice chart, and it's an abuse of ink. This chart is used as a contrast to the 4-week moving average, which was hoisted up as an example of "over-aggregation".

Why is the above chart bad (aside from the waste of ink)? Think about how you consume the information. For me, I notice these features in the following order:

  1. I see the upper "envelope" of the data, i.e. the top values at each hour of each day throughout the year. This gives me the seasonal pattern with a peak in the summer months.
  2. I see the lower "envelope" of the data
  3. I see the "height" of the data, which is, roughly speaking, the range of values within a day
  4. If I squint hard enough, I see a darker band within the band, which roughly maps to the most frequently occurring values (this feature becomes more prominent if we select a lighter shade of gray)

The chart may not be as bad as it looks. The "moving average" is sort of visible. The variability of consumption is visible. The primary problem is it draws attention to the outliers, rather than the more common values.

The envelope of any dataset is composed of extreme values, by definition. For most analysis objectives, extreme values are "noise". In the chart above, it's hard to tell how common the maximum values are relative to other possible values but it's the upper envelope that captures my attention - simply because it's the easiest trend to make out.

***

The same problem actually surfaces in the "improved" chart:

Stackoverflow_weekofyearchart

As explained in the preceding post, this chart rearranges the data. Instead of a single line, therea are now 52 overlapping lines, one for each week of the year. So each line is much less dense and we can make out the hour of day/day of week pattern.

Notice that the author draws attention to the upper envelope of this chart. They notice the line(s) near the top are from the summer, and this further guides their next analysis.

The reason for focusing on the envelope is the same as in the other chart. Where the lines are dense, it's not easy to make out the pattern.

Even the envelope is not as clear as it seems! There is no reason why the highlighted week (August 16 to 23) should have the highest consumption value each hour of each day of the week. It's possible that the line dips into the middle of the range at various points along the line. In the following chart, I highlight two time points in which lines may or may not have crossed:

Junkcharts_stackoverflow_confusingenvelope

In an interactive chart, each line can be highlighted to resolve the confusion.

Note that the lower envelope is much harder to decipher, given the density of lines.

***
The author then pursues a hypothesis that there are lines (weeks) with one intra-day peak and there are those with two peaks.

I'd propose that those are not discrete states but continuous. The base pattern can be one with two peaks, a higher peak in the evening, and a lower peak in the morning. Now, if you imagine pushing up the evening peak while holding the lower peak at its height, you'd gradually "erase" the lower peak but it's just receded into the background.

Possibly the underlying driver is the total demand for energy. The higher the demand, the more likely it's concentrated in the evening, which causes the lower peak to recede. The lower the demand, the more likely we see both peaks.

In either case, the prior chart drives the direction of the next analysis.

 

 

 

 

 


How does the U.K. vote in the U.N.?

Through my twitter feed, I found my way to this chart, made by jamie_bio.

Jamie_bio_un_votes25032021

This is produced using R code even though it looks like a slide.

The underlying dataset concerns votes at the United Nations on various topics. Someone has already classified these topics. Jamie looked at voting blocs, specifically, countries whose votes agree most often or least often with the U.K.

If you look at his Github, this is one in a series of works he produced to hone his dataviz skills. Ultimately, I think this effort can benefit from some re-thinking. However, I also appreciate the work he has put into this.

Let's start with the things I enjoyed.

Given the dataset, I imagine the first visual one might come up with is a heatmap that shows countries in rows and topics in columns. That would work ok, as any standard chart form would but it would be a data dump that doesn't tell a story. There are almost 200 countries in the entire dataset. The countries can only be ordered in one way so if it's ordered for All Votes, it's not ordered for any of the other columns.

What Jamie attempts here is story-telling. The design leads the reader through a narrative. We start by reading the how-to-read-this box on the top left. This tells us that he's using a lunar eclipse metaphor. A full circle in blue indicates 0% agreement while a full circle in white indicates 100% agreement. The five circles signal that he's binning the agreement percentages into five discrete buckets, which helps simplify our understanding of the data.

Then, our eyes go to the circle of circles, labelled "All votes". This is roughly split in half, with the left side showing mostly blue and the right showing mostly white. That's because he's extracting the top 5 and bottom 5 countries, measured by their vote alignment with the U.K. The countries names are clearly labelled.

Next, we see the votes broken up by topics. I'm assuming not all topics are covered but six key topics are highlighted on the right half of the page.

What I appreciate about this effort is the thought process behind how to deliver a message to the audience. Selecting a specific subset that addresses a specific question. Thinning the materials in a way that doesn't throw the kitchen sink at the reader. Concocting the circular layout that presents a pleasing way of consuming the data.

***

Now, let me talk about the things that need more work.

I'm not convinced that he got his message across. What is the visual telling us? Half of the cricle are aligned with the U.K. while half aren't so the U.K. sits on the fence on every issue? But this isn't the message. It's a bit of a mirage because the designer picked out the top 5 and bottom 5 countries. The top 5 are surely going to be voting almost 100% with the U.K. while the bottom 5 are surely going to be disagreeing with the U.K. a lot.

I did a quick sketch to understand the whole distribution:

Redo_junkcharts_ukvotes_overview_2

This is not intended as a show-and-tell graphic, just a useful way of exploring the dataset. You can see that Arms Race/Disarmament and Economic Development are "average" issues that have the same form as the "All issues" line. There are a small number of countries that are extremely aligned with the UK, and then about 50 countries that are aligned over 50% of the time, then the other 150 countries are within the 30 to 50% aligned. On human rights, there is less alignment. On Palestine, there is more alignment.

What the above chart shows is that the top 5 and bottom 5 countries both represent thin slithers of this distribution, which is why in the circular diagrams, there is little differentiation. The two subgroups are very far apart but within each subgroup, there is almost no variation.

Another issue is the lunar eclipse metaphor. It's hard to wrap my head around a full white circle indicating 100% agreement while a full blue circle shows 0% agreement.

In the diagrams for individual topics, the two-letter acronyms for countries are used instead of the country names. A decoder needs to be provided, or just print the full names.

 

 

 

 

 

 


To explain or to eliminate, that is the question

Today, I take a look at another project from Ray Vella's class at NYU.

Rich Get Richer Assigment 2 top

(The above image is a honeypot for "smart" algorithms that don't know how to handle image dimensions which don't fit their shadow "requirement". Human beings should proceed to the full image below.)

As explained in this post, the students visualized data about regional average incomes in a selection of countries. It turns out that remarkable differences persist in regional income disparity between countries, almost all of which are more advanced economies.

Rich Get Richer Assigment 2 Danielle Curran_1

The graphic is by Danielle Curran.

I noticed two smart decisions.

First, she came up with a different main metric for gauging regional disparity, landing on a metric that is simple to grasp.

Based on hints given on the chart, I surmised that Danielle computed the change in per-capita income in the richest and poorest regions separately for each country between 2000 and 2015. These regional income growth values are expressed in currency, not indiced. Then, she computed the ratio of these growth rates, for each country. The end result is a simple metric for each country that describes how fast income has been growing in the richest region relative to the poorest region.

One of the challenges of this dataset is the complex indexing scheme (discussed here). Carlos' solution keeps the indices but uses design to facilitate comparisons. Danielle avoids the indices altogether.

The reader is relieved of the need to make comparisons, and so can focus on differences in magnitude. We see clearly that regional disparity is by far the highest in the U.K.

***

The second smart decision Danielle made is organizing the countries into clusters. She took advantage of the horizontal axis which does not encode any data. The branching structure places different clusters of countries along the axis, making it simple to navigate. The locations of these clusters are cleverly aligned to the map below.

***

Danielle's effort is stronger on communications while Carlos' effort provides more information. The key is to understand who your readers are. What proportion of your readers would want to know the values for each country, each region and each year?

***

A couple of suggestions

a) The reference line should be set at 1, not 0, for a ratio scale. The value of 1 happens when the richest region and the poorest region have identical per-capita incomes.

b) The vertical scale should be fixed.


Displaying convoluted indices

I reviewed another batch of projects from Ray Vella's class at NYU. The following piece by Carlos Lasso made an impression on me. There are no pyrotechnics but he made one decision that added a lot of clarity to the graphic.

The Rich get Richer - Carlos Lasso

The underlying dataset gauges the income disparity of regions within nine countries. The richest and the poorest regions are selected for each country. Two time points are shown. Altogether, there are 9x2x2 = 36 numbers.

***

Let's take a deeper look at these numbers. Notice they are not in dollars, or any kind of currency, despite being about incomes. The numbers are index values, relative to 100. What does the reference level of 100 represent?

The value of 100 crosses every bar of the chart so that 100 has meaning in each country and each year. In fact, there are 18 definitions of 100 in this chart with 36 numbers, one for each country-year pair. The average national income is set to 100 for each country in each year. This is a highly convoluted indexing strategy.

The following chart is a re-visualization of the bottom part of Carlos' infographic.

Junkcharts_richricher2021_2columns

I shifted the scale of the horizontal axis. The value of zero does not hold special meaning in Carlos' chart. I subtracted 100 from the relative regional income indices, thus all regions with income above the average have positive values while those below the national average have negative values. (There are other challenges with the ratio scale, which I'll skip over in this post. The minimum value is -100 while the maximum value can be very large.)

The rescaling is not really the point of this post. To see what Carlos did, we have to look at the example shown in class. The graphic which the students were asked to improve has the following structure:

Junkcharts_richricher2021_1column

This one-column structure places four bars beside each country, grouped by year. Carlos pulled the year dimension out, and showed the same dataset in two columns.

This small change makes a great difference in ease of comprehension. Carlos' version unpacks the two key types of comparisons one might want to make: trend within a given country (horizontal comparison) and contrast between countries in a given year (vertical comparison).

***

I always try to avoid convoluted indexing. The cost of using such indices is the big how-to-read-this box.


The gift of small edits and subtraction

While making the chart on fertility rates (link), I came across a problem that pops up quite often, and is  ignored by most software programs.

Here is an earlier version of the chart I later discarded:

Junkcharts_redofertilitychart_2

Compare this to the version I published in the blog post:

Junkcharts_redofertilitychart_1

Aside from adding the chart title, there is one major change. I removed the empty plots from the grid. This is a visualization trick that should be called adding by subtracting. The empty scaffolding on the first chart increases our cognitive load without yielding any benefit. The whitespace brings out the message that only countries in Asia and Africa have fertility rates above 5.0. 

This is a small edit. But small edits accumulate and deliver a big impact. Bear this in mind the next time you make a chart.

 

P.S.

(1) You'd have to use a lower-level coding language to execute this small edit. Most software programs are quite rigid when it comes to making small-multiples (facet) charts.

(2) If there is a next iteration, I'd reverse the Asia and Oceania rows.