Visual cues affect how data are perceived

Here's a recent NYT graphic showing California's water situation at different time scales (link to article).

Nyt_california_drought

It's a small-multiples display showing the spatial distribution of precipitation in California. The two panels show, respectively, the short-term view (past month) and the longer-term view (3 years). Precipitation is measured in relative terms: what is plotted is the ratio of precipitation in the reference period to the 30-year average, with 100 representing that average.

Green is much wetter than average while brown is much drier than average.

The key to making this chart work is a common color scheme across the two panels.

Also, the placement of major cities provides anchor points for our eyes to move back and forth between the two panels.

***

The NYT graphic is technically well executed. I'm a bit unhappy with the headline: "Recent rains haven't erased California's long-term drought".

At the surface, the conclusion seems sensible. Look, there is a lot of green, even deep green, on the left panel, which means the state got lots more rain than usual in the past month. Now, on the right panel, we find patches of brown, and very little green.

But pay attention to the scale. The light brown color, which covers the largest area, has values from 70 to 90; thus, these regions have gotten 10-30% less precipitation in the past three years than the 30-year average.

Here's the question: what does "erasing California's long-term drought" mean? Does the 3-year average have to equal or exceed the 30-year average? Why should that be the case?

If we took all 3-year windows within those 30 years, we're definitely not going to find that each such 3-year average falls at or above the 30-year average. To illustrate this, I pulled annual rainfall data for San Francisco. Here is a histogram of 3-year averages for the 30-year period 1991-2020.

Redo_nyt_californiadrought_sfrainfall

For example, the first value is the average rainfall for years 1989, 1990 and 1991, the next value is the average of 1990, 1991, and 1992, and so on. Each value is expressed relative to the overall average in the 30-year window. There are two more values beyond 2020 that are not shown in the histogram. These are 57% and 61%, so against the 30-year average, those two 3-year averages were drier than usual.

The above shows the underlying variability of the 3-year averages inside the reference time window. We have to first define "normal", and that might be a value between 70% and 130%.
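Here is a minimal sketch of that calculation, assuming a simple list of annual totals. The rainfall numbers are made up for illustration (they are not the San Francisco data), the function name is hypothetical, and the windows here stay inside the series rather than reaching back before the first year.

```python
from statistics import mean

def rolling_relative_averages(annual, window=3):
    """Return each `window`-year average as a percentage of the overall mean."""
    overall = mean(annual)
    return [
        100 * mean(annual[i:i + window]) / overall
        for i in range(len(annual) - window + 1)
    ]

# Made-up annual rainfall totals (inches), for illustration only.
rainfall = [22, 15, 31, 18, 12, 27, 35, 14, 20, 25, 17, 29]

relative = rolling_relative_averages(rainfall)
low, high = min(relative), max(relative)
print(f"3-year averages range from {low:.0f}% to {high:.0f}% of the long-run average")
```

The spread between the lowest and highest values is one rough way to describe the "normal" range; a percentile band would be another.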

In the same way, we can establish the "normal" range for the entire state of California. If it's also 70% to 130%, then the last 3 years as shown in the map above should be considered normal.

 

 


Energy efficiency deserves visual efficiency

Long-time contributor Aleksander B. found a good one, in the World Energy Outlook Report, published by IEA (International Energy Agency).

Iea_balloonchart_emissions

The use of balloons is unusual, although after five minutes, I decided I must do some research to have any hope of understanding this data visualization.

A lot is going on. Below, I trace my own journey through this chart.

The text on the top left explains that the chart concerns emissions and temperature change. The first set of balloons (the grey ones) includes helpful annotations. The left-right position of the balloons indicates time points, in 10-year intervals except for the first.

The trapezoid that sits below the four balloons is more mysterious. It's labelled "median temperature rise in 2100". I debate two possibilities: (a) this trapezoid may serve as the fifth balloon, extending the time series from 2050 to 2100. This interpretation raises a couple of questions: why does the symbol change from balloon to trapezoid? why is the left-right time scale broken? (b) this trapezoid may represent something unrelated to the balloons. This interpretation also raises questions: its position on the horizontal axis still breaks the time series; and  if the new variable is "median temperature rise", then what determines its location on the chart?

That last question is answered if I move my glance all the way to the right edge of the chart where there are vertical axis labels. This axis is untitled but the labels shown in degree Celsius units are appropriate for "median temperature rise".

Turning to the balloons, I wonder what the scale is for the encoded emissions data. This is also puzzling because only a few balloons wear data labels, and a scale is nowhere to be found.

Iea_balloonchart_emissions_legend

The gridlines suggest that the vertical location of the balloons is meaningful. Tracing those gridlines to the right edge leads me back to the Celsius scale, which seems unrelated to emissions. The amount of emissions is probably encoded in the sizes of the balloons, although none of these four balloons has any data labels, so I'm rather flustered. My attention shifts to the colored balloons, a few of which are labelled. This confirms that the size of the balloons indeed measures the amount of emissions. Nevertheless, it is still impossible to gauge the change in emissions for the 10-year periods.

The colored balloons rising above, way above, the gridlines is an indication that the gridlines may lack a relationship with the balloons. But in some charts, the designer may deliberately use this device to draw attention to outlier values.

Next, I attempt to divine the informational content of the balloon strings. Presumably, the chart is concerned with drawing the correlation between emissions and temperature rise. Here I'm also stumped.

I start to look at the colored balloons. I've figured out that the amount of emissions is shown by the balloon size but I am still unclear about the elevation of the balloons. The vertical locations of these balloons change over time, hinting that they are data-driven. Yet, there is no axis, gridline, or data label that provides a key to its meaning.

Now I focus my attention on the trapezoids. I notice the labels "NZE", "APS", etc. The red section says "Pre-Paris Agreement" which would indicate these sections denote periods of time. However, I also understand the left-right positions of same-color balloons to indicate time progression. I'm completely lost. Understanding these labels is crucial to understanding the color scheme. Clearly, I have to read the report itself to decipher these acronyms.

The research reveals that NZE means "net zero emissions", which is a forecasting scenario - an utterly unrealistic one - in which every country is assumed to fulfil fully its obligations, a sort of best-case scenario but an unattainable optimum. APS and STEPS embed different assumptions about the level of effort countries would spend on reducing emissions and tackling global warming.

At this stage, I come upon another discovery. The grey section is missing any acronym labels. It's actually the legend of the chart. The balloon sizes, elevations, and left-right positions in the grey section are all arbitrary, and do not represent any real data! Surprisingly, this legend does not contain any numbers so it does not satisfy one of the traditional functions of a legend, which is to provide a scale.

There is still one final itch. Take a look at the green section:

Iea_balloonchart_emissions_green

What is this, hmm, caret symbol? It's labeled "Net Zero". Based on what I have been able to learn so far, I associate "net zero" to no "emissions" (this suggests they are talking about net emissions not gross emissions). For some reason, I also want to associate it with zero temperature rise. But this is not to be. The "net zero" line pins the balloon strings to a level of roughly 2.5 Celsius rise in temperature.

Wait, that's a misreading of the chart because the projected net temperature increase is found inside the trapezoid, meaning at "net zero", the scientists expect an increase of 1.5 degrees Celsius. If I accept this, I come face to face with the problem raised above: what is the meaning of the vertical positioning of the balloons? There must be a reason why the balloon strings are pinned at 2.5 degrees. I just have no idea why.

I'm also stealthily presuming that the top and bottom edges of the trapezoids represent confidence intervals around the median temperature rise values. The height of each trapezoid appears identical so I'm not sure.

I have just learned something else about this chart. The green "caret" must have been conceived as a fully deflated balloon since it represents the value zero. Its existence exposes two limitations imposed by the chosen visual design. Bubbles/circles should not be used when the value of zero holds significance. Besides, the use of balloon strings to indicate four discrete time points breaks down when there is a scenario which involves only three buoyant balloons.

***

The underlying dataset has five values (four emissions, one temperature rise) for four forecasting scenarios. It's taken a lot more time to explain the data visualization than to just show readers those 20 numbers. That's not good!

I'm sure the designer did not set out to confuse. I think what happened might be that the design wasn't shown to potential readers for feedback. Perhaps it was shown only to insiders who bring their domain knowledge. Insiders most likely would not have as much difficulty reading this chart as I did.

This is an important lesson for using data visualization as a means of communication to the public. It's easy for specialists to assume knowledge that readers won't have.

For the IEA chart, here is a list of things not found explicitly on the chart that readers have to know in order to understand it.

  • Readers have to know about the various forecasting scenarios, and their acronyms (APS, NZE, etc.). This allows them to interpret the colors and section titles on the chart, and to decide whether the grey section is missing a scenario label, or is a legend.
  • Since the legend does not contain any scale information, neither for the balloon sizes nor for the temperatures, readers have to figure out the scales on their own. For temperature, they first learn from the legend that the temperature rise information is encoded in the trapezoid, then find the vertical axis on the right edge, notice that this axis has degree Celsius units, and recognize that the Celsius scale is appropriate for measuring median temperature rise.
  • For the balloon size scale, readers must resist the distracting gridlines around the grey balloons in the legend, notice the several data labels attached to the colored balloons, and accept that the designer has opted not to provide a proper size scale.

Finally, I still have several unresolved questions:

  • The horizontal axis may have no meaning at all, or it may only have meaning for emissions data but not for temperature
  • The vertical positioning of balloons probably has significance, or maybe it doesn't
  • The height of the trapezoids probably has significance, or maybe it doesn't

 

 


Painting the corner

Found an old one sitting in my folder. This came from the Wall Street Journal in 2018.

At first glance, the chart looks like a pretty decent effort.

The scatter plot shows Ebitda against market value, both measured in billions of dollars. The placement of the vertical axis title on the far side is a little unusual.

Ebitda is a measure of business profit (something for a different post on the sister blog: the "b" in Ebitda means "before", and allows management to paint a picture of profits without accounting for the entire cost of running the business). In the financial markets, the market value is claimed to represent a "fair" assessment of the value of the business. The ratio of the market value to Ebitda is known as the "Ebitda multiple", which describes the number of dollars the "market" places on each dollar of Ebitda profit earned by the company.

Almost all scatter plots suffer from xyopia: the chart form encourages readers to take an overly simplistic view in which the market cares about one and only one business metric (Ebitda). The reality is that the market value contains information about Ebitda plus lots of other factors, such as competitors, growth potential, etc.

Consider Alphabet vs AT&T. On this chart, both companies have about $50 billion in Ebitda profits. However, the market value of Alphabet (Google's parent company) is about four times higher than that of AT&T. This excess valuation has nothing to do with profitability but is partly explained by the market's view that Google has greater growth potential.

***

Unusually, the designer chose not to utilize the log scale. The right side of the following display is the same chart with a log horizontal axis.

The big market values are artificially pulled into the middle while the small values are pried apart. As one reads from left to right, the same amount of distance represents more and more dollars. While all data visualization books love log scales, I am not a big fan of them. That's because the human brain doesn't process spatial information this way. We don't tend to think in terms of continuously evolving scales. Thus, presenting the log view causes readers to underestimate large values and overestimate small differences.
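To see why the same distance represents more and more dollars, here is a small sketch of where values land on a log axis. The axis range ($10 billion to $1,000 billion) and the function name are assumptions for illustration, not values taken from the WSJ chart.

```python
import math

def log_position(value, lo, hi):
    """Fraction of the axis length (0 to 1) at which `value` sits on a log scale from lo to hi."""
    return (math.log10(value) - math.log10(lo)) / (math.log10(hi) - math.log10(lo))

# Hypothetical axis running from $10 billion to $1,000 billion.
lo, hi = 10, 1000
for value in (10, 100, 1000):
    print(value, round(log_position(value, lo, hi), 2))
```

Under these assumptions, $100 billion sits at the midpoint of the axis: the left half spans $90 billion while the right half spans $900 billion.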

Now let's get to the main interest of this chart. Notice the bar chart shown on the top right, which by itself is very strange. The colors of the bar chart are coordinated with those on the scatter plot, as the colors divide the companies into two groups: "media" companies (old, red), and tech companies (new, orange).

Scratch that. Netflix is found in the scatter plot but with a red color while AT&T and Verizon appear on the scatter plot as orange dots. So it appears that the colors mean different things on different plots. As far as I could tell, on the scatter plot, the orange dots are companies with over $30 billion in Ebitda profits.

At this point, you may have noticed the stray orange dot. Look carefully at the top right corner, above the bar chart, and you'll find the orange dot representing Apple. It is by far the most important datum, the company that has the greatest market value and the largest Ebitda.

I'm not sure burying Apple in the corner was a feature or a bug. It really makes little sense to insert the bar chart where it is, creating a gulf between Apple and the rest of the companies. This placement draws the most attention away from the datum that demands the most attention.

 

 

 


Finding the right context to interpret household energy data

Bloomberg_energybill

Bloomberg's recent article on surging UK household energy costs, projected over this winter, contains data about which I have long been intrigued: how much energy do different household items consume?

A twitter follower alerted me to this chart, and she found it informative.

***
If the goal is to pick out the appliances and estimate the cost of running them, the chart serves its purpose. Because the entire set of data is printed, a data table would have done equally well.

I learned that the mobile phone costs almost nothing to charge: 1 pence for six hours of charging, which is deemed a "single use", though that seems double what a full charge requires. The games console costs 14 pence for a "single use" of two hours. That might be an underestimate of how much time gamers spend gaming each day.

***

Understanding the design of the chart needs a bit more effort. Each appliance is measured by two metrics: the number of hours considered to be "single use", and a currency value.

It took me a while to figure out how to interpret these currency values. Each cost is associated with a single use, and the duration of a single use increases as we move down the list of appliances. Since the designer assumes a fixed cost of electricity (shown in the footnote as 34p per kWh), at first, it seems like the costs should just increase from top to bottom. That's not the case, though.

Something else is driving these numbers behind the scenes, namely, the intensity of energy use by appliance. The wifi router listed at the bottom is turned on 24 hours a day, and the daily cost of running it is just 6p. Meanwhile, running the fridge and freezer the whole day costs 41p. Thus, the fridge and freezer consume electricity at a rate that is almost 7 times higher than the router.
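The intensity can be backed out of the printed costs. The sketch below uses the 34p per kWh rate from the footnote and the two daily costs quoted above; the function name is my own, and because the chart rounds costs to whole pence, the implied wattages are approximate.

```python
PRICE_PENCE_PER_KWH = 34  # fixed electricity price from the chart's footnote

def average_watts(cost_pence, hours):
    """Average power draw (in watts) implied by a cost over a number of hours."""
    kwh = cost_pence / PRICE_PENCE_PER_KWH
    return 1000 * kwh / hours

router = average_watts(6, 24)    # wifi router: 6p for 24 hours
fridge = average_watts(41, 24)   # fridge and freezer: 41p for 24 hours
print(f"router: {router:.1f} W, fridge and freezer: {fridge:.1f} W, ratio: {fridge / router:.1f}")
```

Since both appliances run for the same 24 hours at the same price, the ratio of their power draws is just the ratio of their costs, 41 to 6.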

The chart uses a split axis, which artificially reduces the gap between 8 hours and 24 hours. Here is another look at the bottom of the chart:

Bloomberg_energycost_bottom

***

Let's examine the choice of "single use" as a common basis for comparing appliances. Consider this:

  • Continuous appliances (wifi router, refrigerator, etc.) are denoted as 24 hours, so a daily time window is also implied
  • Repeated-use appliances (e.g. coffee maker, kettle) may be run multiple times a day
  • Infrequent use appliances may be used less than once a day

I prefer standardizing to a "per day" metric. If I use the microwave three times a day, the daily cost is 3 x 3p = 9 p, which is more than I'd spend on the wifi router, run 24 hours. On the other hand, I use the washing machine once a week, so the frequency is 1/7, and the effective daily cost is 1/7 x 36 p = 5p, notably lower than using the microwave.
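The "per day" standardization can be sketched as follows. The costs per use are the ones cited above; the usage frequencies are my own habits and will differ by household, and the function name is just for illustration.

```python
def daily_cost(cost_per_use_pence, uses_per_day):
    """Effective daily cost in pence, given the cost of one use and the uses per day."""
    return cost_per_use_pence * uses_per_day

# (cost per single use in pence, uses per day)
# Costs come from the chart; the frequencies are assumptions about one household.
usage = {
    "microwave": (3, 3),             # three times a day
    "washing machine": (36, 1 / 7),  # once a week
    "wifi router": (6, 1),           # the 24-hour "single use" is already one day
}

per_day = {name: daily_cost(cost, freq) for name, (cost, freq) in usage.items()}
for name, pence in sorted(per_day.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {pence:.1f}p per day")
```

With these frequencies, the microwave moves ahead of both the router and the washing machine, which is the opposite of the "single use" ordering.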

The choice of metric has key implications on the appearance of the chart. The bubble size encodes the relative energy costs. The biggest bubbles are in the heating category, which is no surprise. The next largest bubbles are tumble dryer, dishwasher, and electric oven. These are generally not used every day so the "per day" calculation would push them lower in rank.

***

Another noteworthy feature of the Bloomberg chart is the split legend. The colors divide appliances into five groups based on usage category (e.g. cleaning, food, utility). Instead of the usual color legend printed on a corner or side of the chart, the designer spreads the category labels around the chart. Each label is shown the first time a specific usage category appears on the chart. There is a presumption that the reader scans from top to bottom, which is probably true on average.

I like this arrangement as it delivers information to the reader when it's needed.

 

 

 


Trying too hard

Today, I return to the life expectancy graphic that Antonio submitted. In a previous post, I looked at the bumps chart. The centerpiece of that graphic is the following complicated bar chart.

Aburto_covid_lifeexpectancy

Let's start with the dual axes. On the left, age, and on the right, year of birth. I actually like this type of dual axes. The two axes present two versions of the same scale so the dual axes exist without distortion. It just allows the reader to pick which scale they want to use.

It baffles me that the range of each bar runs from 2.5 years to 7.5 years or 7.5 years to 2.5 years, with 5 or 10 years situated in the middle of each bar.

Reading the rest of the chart is like untangling some balled-up wires. The author has created a statistical model that attributes cause of death to male life expectancy in such a way that you can take the difference in life expectancy between two time points, and do a kind of waterfall analysis in which each cause of death either adds to or subtracts from the prior life expectancy, with the sum of these additions and subtractions leading to the end-of-period life expectancy.

The model is complicated enough, and the chart doesn't make it any easier.

The bars are rooted at the zero value. The horizontal axis plots addition or subtraction to life expectancy, thus zero represents no change during the period. Zero does not mean the cause of death (e.g. cancer) does not contribute to life expectancy; it just means the contribution remains the same.

The changes to life expectancy are shown in units of months. I'd prefer to see units of years because life expectancy is almost always given in years. Using years turns 2.5 months into 0.2 years, which is a fraction, but it allows me to see the impact on the reported life expectancy without having to do a month-to-year conversion.

The chart highlights seven causes of death with seven different colors, plus gray for others.

What really does a number on readers is the shading, which adds another layer on top of the hues. Each color comes in one of two shadings, referencing two periods of time. The unshaded bar segments concern changes between 2010 and "2019" while the shaded segments concern changes between "2019" and 2020. The two periods are chosen to highlight the impact of COVID-19 (the red-orange color), which did not exist before "2019".

Let's zoom in on one of the rows of data - the 72.5 to 77.5 age group.

Screen Shot 2022-09-14 at 1.06.59 PM

COVID-19 (red-orange) has a negative impact on life expectancy and that's the easy one to see. That's because COVID-19's contribution as a cause of death is exactly zero prior to "2019". Thus, the change in life expectancy is a change from zero. This is not how we can interpret any of the other colors.

Next, we look at cancer (blue). Since this bar segment sits on the right side of zero, cancer has contributed positively to change in life expectancy between 2010 and 2020. Practically, that means proportionally fewer people have died from cancer. Since the lengths of these bar segments correspond to the relative value, not absolute value, of life expectancy, longer bars do not necessarily indicate more numerous deaths.

Now the blue segment is actually divided into two parts, the shaded and not shaded. The not-shaded part is for the period "2019" to 2020 in the first year of the COVID-19 pandemic. The shaded part is for the period 2010 to "2019". It is a much wider span but it also contains 9 years of changes versus "1 year" so it's hard to tell if the single-year change is significantly different from the average single-year change of the past 9 years. (I'm using these quotes because I don't know whether they split the year 2019 in the middle since COVID-19 didn't show up till the end of that year.)

Next, we look at the yellow-brown color corresponding to CVD. The key feature is that this block is split into two parts, one positive, one negative. Prior to "2019", CVD had been contributing positively to life expectancy changes while after "2019", it has contributed negatively. This observation raises some questions: why would CVD behave differently with the arrival of the pandemic? Are there data problems?

***

A small multiples design - splitting the period into two charts - may help here. To make those two charts comparable, I'd suggest annualizing the data so that the 9-year numbers represent the average annual values instead of the cumulative values.
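Annualizing is a one-line calculation. In this sketch, the two contribution values are made up to show the mechanics (they are not read off the chart), and the function name is hypothetical.

```python
def annualize(cumulative_change_months, years):
    """Average change per year, given a cumulative change over a number of years."""
    return cumulative_change_months / years

# Made-up contributions of one cause of death, in months of life expectancy.
change_2010_2019 = 1.8   # cumulative over 9 years (hypothetical)
change_2019_2020 = -0.5  # single year (hypothetical)

per_year_before = annualize(change_2010_2019, 9)
per_year_after = annualize(change_2019_2020, 1)
print(f"before: {per_year_before:+.2f} months/year, after: {per_year_after:+.2f} months/year")
```

Once both periods are on a per-year basis, the bar lengths in the two charts can be compared directly.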

 

 


Two uses of bumps charts

Long-time reader Antonio R. submitted the following chart, which illustrates analysis from a preprint on the effect of Covid-19 on life expectancy in the U.S. (link)

Aburto_covid_lifeexpectancy

Aburto_lifeexpectancy

For this post, I want to discuss the bumps chart on the lower right corner. Bumps charts are great at showing change over time. In this case, the authors are comparing two periods "2010-2019" and "2019-2020". By glancing at the chart, one quickly divides the causes of death into three groups: (a) COVID-19 and CVD, which experienced a big decline, (b) respiratory, accidents, others ("rest"), and despair, which experienced increases, and (c) cancer and infectious, which remained the same.

And yet, something doesn't seem right.

What isn't clear is the measured quantity. The chart title says "months gained or lost" but it takes a moment to realize the plotted data are not number of months but ranks of the effects of the causes of deaths on life expectancy.

Observe that the distance between each cause of death is the same. Look at the first rising line (respiratory): the actual values went from 0.8 months down to 0.2.

***

While the canonical bumps chart plots ranks, the same chart form can be used to show numeric data. I prefer to use the same term for both charts. In recent years, the bumps chart showing numeric data has been called "slopegraph".

Here is a side-by-side comparison of the two charts:

Redo_aburto_covidlifeexpectancy

The one on the left is the same as the original. The one on the right plots the number of months increased or decreased.

The choice of chart form paints very different pictures. There are four blue lines on the left, indicating a relative increase in life expectancy - these causes of death contributed more to life expectancy between the two periods. Three of the four are red lines on the right chart. Cancer was shown as a flat line on the left - because it was the highest ranked item in both periods. The right chart shows that the numeric value for cancer suffered one of the largest drops.

The left chart exaggerates small numeric changes while it condenses large numeric changes.
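A toy example shows how ranks can hide what the numbers are doing. The three causes and their values below are invented for illustration, and the ranking helper ignores ties.

```python
def ranks(values):
    """Rank items from largest (rank 1) to smallest value; ties are not handled."""
    ordered = sorted(values, key=values.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}

# Invented contributions (in months) for three causes in two periods.
period1 = {"A": 3.0, "B": 0.8, "C": 0.5}
period2 = {"A": 1.0, "B": 0.2, "C": 0.4}

for name in period1:
    change = period2[name] - period1[name]
    print(name, "rank", ranks(period1)[name], "->", ranks(period2)[name], f"value change {change:+.1f}")
```

In this made-up data, A keeps rank 1 and would plot as a flat line despite having the largest numeric drop, while C rises in rank even though its value fell.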

 

 


Visualizing the impossible

Note [July 6, 2022]: Typepad's image loader is broken yet again. There is no way for me to fix the images right now. They are not showing despite being loaded properly yesterday. I also cannot load new images. Apologies!

Note 2: Manually worked around the automated image loader.

Note 3: Thanks Glenn for letting me know about the image loading problem. It turns out the comment approval function is also broken, so I am not able to approve the comment.

***

A twitter user sent me this chart:

twitter_greatreplacement

It's, hmm, mystifying. It performs magic, as I explain below.

What's the purpose of the gridlines and axis labels? Even if there is a rationale for printing those numbers, they make it harder, not easier, for readers to understand the chart!

I think the following chart shows the main message of this poll result. Democrats are much more likely to think of immigration as a positive compared to Republicans, with Independents situated in between.

Redo_greatreplacement

***

The axis title gives a hint as to what the chart designer was aiming for with the unconventional axis. It reads "Overall Percentage for All Participants". It appears that the total length of the stacked bar is the weighted aggregate response rate. Roughly 17% of Americans thought this development to be "very positive", a figure that aggregates 8% of Republicans, 27% of Democrats and 12% of Independents. Since the three segments are not equal in size, 17% is a weighted average of the three proportions.

Within each of the three political affiliations, the data labels add to 100%. These numbers therefore are unweighted response rates for each segment. (If weighted, they should add up to the proportion of each segment.)

This sets up an impossible math problem. The three segments within each bar then represent the sum of three proportions, each unweighted within its segment. Adding these unweighted proportions does not yield the desired weighted average response rate. To get the weighted average response rate, we need to sum the weighted segment response rates instead.

This impossible math problem somehow got resolved visually. We can see that each bar segment faithfully represents the unweighted response rate shown in the respective data label. Summing them would not yield the aggregate response rates as shown on the axis title. The difference is not a simple multiplicative constant because each segment must be weighted by a different multiplier. So, your guess is as good as mine: what is the magic that makes the impossible possible?

[P.S. Another way to see this inconsistency. The sum of all the data labels is 300% because the proportions of each segment add up to 100%. At the same time, the axis title implies that the sum of the lengths of all five bars should be 100%. So, the chart asserts that 300% = 100%.]
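The gap between the two calculations is easy to demonstrate. In the sketch below, the "very positive" rates are the ones on the chart, but the shares of respondents by affiliation are hypothetical, since the chart does not report them.

```python
# "Very positive" response rates within each affiliation, as labeled on the chart.
very_positive = {"Republican": 0.08, "Democrat": 0.27, "Independent": 0.12}

# Hypothetical shares of all respondents by affiliation (not given on the chart).
weights = {"Republican": 0.35, "Democrat": 0.40, "Independent": 0.25}

# What the stacked bar appears to add up: the unweighted labels.
unweighted_sum = sum(very_positive.values())

# What the axis title promises: the weighted aggregate response rate.
weighted_avg = sum(weights[group] * very_positive[group] for group in very_positive)

print(f"sum of labels: {unweighted_sum:.0%}, weighted average: {weighted_avg:.1%}")
```

Each segment shrinks by its own weight when going from the label to its contribution to the aggregate, so no single rescaling of the bar can reconcile the two.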

***

This poll question is perfect classroom fodder for discussing how the wording of poll questions affects responses (something called "response bias"). Look at the following variants of the same question. Are we likely to get answers consistent with the above question?

As you know, the demographic makeup of America is changing and becoming more diverse, while the U.S. Census estimates that white people will still be the largest race in approximately 25 years. Generally speaking, do you find these changes to be very positive, somewhat positive, somewhat negative or very negative?

***

As you know, the demographic makeup of America is changing and becoming more diverse, with the U.S. Census estimating that black people will still be a minority in approximately 25 years. Generally speaking, do you find these changes to be very positive, somewhat positive, somewhat negative or very negative?

***

As you know, the demographic makeup of America is changing and becoming more diverse, with the U.S. Census estimating that Hispanic, black, Asian and other non-white people together will be a majority in approximately 25 years. Generally speaking, do you find these changes to be very positive, somewhat positive, somewhat negative or very negative?

What is also amusing is that in the world described by the pollster in 25 years, every race will qualify as a "minority". There will no longer be a majority since no race will constitute at least 50% of the U.S. population. So at that time, the word "minority" will have lost meaning.


Best chart I have seen this year

Marvelling at this chart:

 

***

The credit ultimately goes to a Reddit user (account deleted). I first saw it in this nice piece of data journalism by my friends at System 2 (link). They linked to Visual Capitalism (link).

There are so many things on this one chart that make me smile.

The animation. The message of the story is aging population. Average age is moving up. This uptrend is clear from the chart, as the bulge of the population pyramid is migrating up.

The trend happens to be slow, and that gives the movement a mesmerizing, soothing effect.

Other items on the chart are synced to the time evolution. The year label on the top but also the year labels on the right side of the chart, plus the counts of total population at the bottom.

OMG, it even gives me average age, and life expectancy, and how those statistics are moving up as well.

Even better, the designer adds useful context to the data: look at the names of the generations paired with the birth years.

This chart is also an example of dual axes that work. Age, birth year and current year are connected to each other, and given two of the three, the third is fixed. So even though there are two vertical axes, there is only one scale.

The only thing I'm not entirely convinced about is placing the scroll bar on the very top. It's a redundant piece that belongs to a less prominent part of the chart.


Think twice before you spiral

After Nathan at FlowingData sang praises of the following chart, a debate ensued on Twitter as others disliked it.

Nyt_spiral_covidcases

The chart was printed in an opinion column in the New York Times (link).

I have found few uses for spiral charts, and this example has not changed my mind.

The canonical time-series chart is like this:

Junkcharts_redo_nyt_covidcasesspiral_1

 

***

The area chart takes no effort to understand. We can see when the peaks occurred. We notice that the current surge is already double the last peak seen a year ago.

It's instructive to trace how one gets from the simple area chart to the spiral chart.

Junkcharts_redo_nyt_covidcasesspiral_2

Step 1 is to center the area on the zero line, instead of using the zero line as the baseline. While this technique frequently makes for a more pleasant visual (because of our preference for symmetry), it actually makes it harder to see the trend over time. Effectively, any change is split in half, which is why the envelope of the area is less sharp.

Junkcharts_redo_nyt_covidcasesspiral_3

In Step 2, I massively compress the vertical scale. That's because when you plot a spiral, you are forced to fit each cycle of data into a much shorter range. Such compression causes the year-on-year doubling of cases to appear less dramatic. (Actually, the aspect ratio is devastated because while the vertical scale is hugely compressed, the horizontal scale is dramatically stretched out due to the curled-up design.)

Junkcharts_redo_nyt_covidcasesspiral_4

Step 3 may elude your attention. If you simply curl up the compressed, centered area chart, you don't get the spiral chart. The key is to ask about the radius of the spiral. As best I can tell, the radius has no meaning; it is gradually increased so that each year of data has its own "orbit". What would the change in radius translate to on our non-circular chart? It should mean that the center of the area is gradually lifted away from the zero line. On the right chart, I mimic this effect. (I only measured the change in radius every 3 months, so the change is more angular than displayed in the spiral chart.) The problem I have with this step is that it serves no purpose, while it complicates cognition.

In Step 4, just curl up the object into a ball based on aligning months of the year.
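Steps 3 and 4 can be sketched as a mapping from (day, cases) to polar coordinates. This is my reading of the construction, not the designer's actual code: the starting radius r0, the gap between orbits, and the scale that compresses case counts are all assumed values, and leap years are ignored.

```python
import math

DAYS_PER_YEAR = 365.0  # assumption: leap years ignored


def spiral_band(day_index, cases, r0=1.0, gap=1.0, scale=1e-6):
    """Map one daily value to (angle, inner radius, outer radius).

    r0, gap and scale are illustrative assumptions, not values from the NYT chart.
    """
    years = day_index / DAYS_PER_YEAR
    theta = 2 * math.pi * (years % 1.0)  # Step 4: same date each year, same angle
    center = r0 + gap * years            # Step 3: the center drifts outward over time
    half_width = scale * cases / 2       # Steps 1 and 2: centered and compressed
    return theta, center - half_width, center + half_width


def to_xy(theta, r):
    """Polar to Cartesian, with the angle measured clockwise from 12 o'clock."""
    return r * math.sin(theta), r * math.cos(theta)


# Day 0 and day 365 land on the same angle but on different orbits.
print(spiral_band(0, 50_000))
print(spiral_band(365, 200_000))
```

Notice that the case count only enters through the width of the band. The distance from the center of the spiral is driven by time alone, which is why the radius carries no meaning.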

Junkcharts_redo_nyt_covidcasesspiral_5

This is the point when I realized I missed a Step 2B. I carefully aligned the scales of both charts so that the 150K cases shown in the legend on the right have the same vertical representation as on the left. This exposes a severe horizontal rescaling. The horizontal axis on the left chart is many times shorter than the circumference of the spiral! That's why I said earlier that one of the biggest features of this spiral chart is that it imposes a dubious aspect ratio, one that is extremely wide and extremely short.
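The horizontal stretching can be checked with a little arithmetic. The sketch below approximates the length of a spiral's centerline by summing short straight segments. The radii, the number of turns and the width of the straight chart are assumed numbers for illustration, not measurements from the NYT graphic.

```python
import math


def unrolled_length(r0, gap, turns, steps_per_turn=1000):
    """Approximate arc length of the centerline r = r0 + gap * (number of turns so far)."""
    total = 0.0
    prev = None
    n = int(turns * steps_per_turn)
    for i in range(n + 1):
        t = i / steps_per_turn  # turns completed so far
        theta = 2 * math.pi * t
        r = r0 + gap * t
        point = (r * math.cos(theta), r * math.sin(theta))
        if prev is not None:
            total += math.dist(prev, point)
        prev = point
    return total


# Assumed layout: two years of data, first orbit at radius 1, orbits one unit apart,
# compared with a straight time axis that is 4 units wide.
spiral_axis = unrolled_length(r0=1.0, gap=1.0, turns=2)
straight_axis = 4.0
print(spiral_axis / straight_axis)
```

When the vertical scale is held equal on both charts, the time axis of the spiral is this many times longer than the straight one, and the ratio grows with every additional orbit.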

As usual, think twice before you spiral.

 

 


To explain or to eliminate, that is the question

Today, I take a look at another project from Ray Vella's class at NYU.

Rich Get Richer Assigment 2 top

(The above image is a honeypot for "smart" algorithms that don't know how to handle image dimensions which don't fit their shadow "requirement". Human beings should proceed to the full image below.)

As explained in this post, the students visualized data about regional average incomes in a selection of countries. It turns out that remarkable differences persist in regional income disparity between countries, almost all of which are more advanced economies.

Rich Get Richer Assigment 2 Danielle Curran_1

The graphic is by Danielle Curran.

I noticed two smart decisions.

First, she came up with a different main metric for gauging regional disparity, landing on a metric that is simple to grasp.

Based on hints given on the chart, I surmised that Danielle computed the change in per-capita income in the richest and poorest regions separately for each country between 2000 and 2015. These regional income growth values are expressed in currency, not indexed. Then, she computed the ratio of these two growth values, for each country. The end result is a simple metric for each country that describes how fast income has been growing in the richest region relative to the poorest region.
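If my reading is right, the metric can be sketched as follows. The function name and the income figures are hypothetical, chosen only to show the arithmetic; they are not values from the dataset, and this is my guess at the calculation rather than Danielle's confirmed method.

```python
def growth_ratio(rich_2000, rich_2015, poor_2000, poor_2015):
    """Ratio of income growth (in currency) in the richest vs. the poorest region."""
    rich_growth = rich_2015 - rich_2000
    poor_growth = poor_2015 - poor_2000
    if poor_growth == 0:
        raise ValueError("poorest region shows no growth; ratio is undefined")
    return rich_growth / poor_growth


# Hypothetical country: the richest region gains 15,000 and the poorest gains 5,000.
print(growth_ratio(40_000, 55_000, 20_000, 25_000))
```

A value above 1 means income grew by more in the richest region than in the poorest, and a value of exactly 1 means both regions gained the same amount.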

One of the challenges of this dataset is the complex indexing scheme (discussed here). Carlos' solution keeps the indices but uses design to facilitate comparisons. Danielle avoids the indices altogether.

The reader is relieved of the need to make comparisons, and so can focus on differences in magnitude. We see clearly that regional disparity is by far the highest in the U.K.

***

The second smart decision Danielle made is organizing the countries into clusters. She took advantage of the horizontal axis which does not encode any data. The branching structure places different clusters of countries along the axis, making it simple to navigate. The locations of these clusters are cleverly aligned to the map below.

***

Danielle's effort is stronger on communication while Carlos' effort provides more information. The key is to understand who your readers are. What proportion of your readers would want to know the values for each country, each region and each year?

***

A couple of suggestions

a) The reference line should be set at 1, not 0, for a ratio scale. The value of 1 happens when the richest region and the poorest region have identical growth in per-capita income.

b) The vertical scale should be fixed.