The elusive meaning of black paintings and red blocks

Joe N, a longtime reader, tweeted about the following chart, by the People's Policy Project:

3p_oneyearinonemonth_laborflow

This is a simple column chart containing only two numbers, far exceeded by the count of labels and gridlines.

I look at charts like the lady staring at these Ad Reinhardts:

 

SUBJPREINHARDT2-videoSixteenByNine1050

My artist friends say the black squares are not the same, if you look hard enough.

Here is what I learned after one such seating:

The tiny data labels sitting on the inside top edges of the columns hint that the right block is slightly larger than the left block.

The five labels of the vertical axis serve no purpose, nor the gridlines.

The horizontal axis for time is reversed, with 2019 appearing after 2020 (when read left to right).

The left block has one month while the right block has 12 months. This is further confused by the word "All" which shares the same starting and ending letters as "April".

As far as I can tell, the key message of this chart is that the month of April has the impact of a full year. It's like 12 months of outflows from employment hitting the economy in one month.

***

My first response is this chart:

Junkcharts_oneyearinonemonth_laborflow_1

Breaking the left block into 12 pieces, and color-coding the April piece brings out the comparison. You can also see that in 2019, the outflows from employment to unemployment were steady month to month.

Next, I want to see what happens if I restored the omitted months of Jan to March, 2020.

Junkcharts_oneyearinonemonth_laborflow_2

The story changes slightly. Now, the chart says that the first four months have already exceeded the full year of 2019.

Since the values hold steady month to month, with the exception of April 2020, I make a monthly view:

Junkcharts_oneyearinonemonth_laborflow_monthly_bar_1

You can see the slight nudge-up in March 2020 as well. This draws more attention to the break in pattern.

For time-series data, I prefer to look at line charts:

Junkcharts_oneyearinonemonth_laborflow_monthly_line_1

As I explained in this post about employment statistics (or Chapter 6 of Numbersense (link)), the Bureau of Labor Statistics classifies people into three categories: Employed, Unemployed and Not in Labor Force. Exits from Employed to Unemployed status contribute to unemployment in the U.S. To depict a negative trend, it's often natural to use negative numbers:

Junkcharts_oneyearinonemonth_laborflow_monthly_line_neg_1

You may realize that this data series paints only a partial picture of the health of the labor market. While some people exit the Employed status each month, there are others who re-enter or enter the Employed status. We should really care about net flows.

Junkcharts_oneyearinonemonth_laborflow_net_lines

In all of 2019, there were more entrants than exits, leading to a slightly positive net inflow to the Employed status from Unemployed (blue line). In April 2020, the red line (exits) drags the blue line dramatically.

Of course, even this chart is omitting important information. There are also flows from Employed to and from Not in Labor Force.

 

 

 

 

 


Comparing chance of death of coronavirus and flu

The COVID-19 charts are proving one thing. When the topic of a dataviz is timely and impactful, readers will study the graphics and ask questions. I've been sent some of these charts lately, and will be featuring them here.

A former student saw this chart from Business Insider (link) and didn't like it.

Businesinsider_coronavirus_flu_compare

My initial reaction was generally positive. It's clear the chart addresses a comparison between death rates of the flu and COVID19, an important current question. The side-by-side panel is effective at allowing such a comparison. The column charts look decent, and there aren't excessive gridlines.

Sure, one sees a few simple design fixes, like removing the vertical axis altogether (since the entire dataset has already been printed). I'd also un-slant the age labels.

***

I'd like to discuss some subtler improvements.

A primary challenge is dealing with the different definitions of age groups across the two datasets. While the side-by-side column charts prompt readers to go left-right, right-left in comparing death rates, it's not easy to identify which column to compare to which. This is not fixable in the datasets because the organizations that compile them define their own age groups.

Also, I prefer to superimpose the death rates on the same chart, using something like a dot plot rather than a column chart. This makes the comparison even easier.

Here is a revised visualization:

Redo_businessinsider_covid19fatalitybyage

The contents of this chart raise several challenges to public health officials. Clearly, hospital resources should be preferentially offered to older patients. But young people could be spreading the virus among the community.

Caution is advised as the data for COVID19 suffers from many types of inaccuracies, as outlined here.


Who is a millennial? An example of handling uncertainty

I found this fascinating chart from CNBC, which attempts to nail down the definition of a millennial.

Millennials2-01

It turns out everyone defines "millennials" differently. They found 23 different definitions. Some media outlets apply different definitions in different items.

I appreciate this effort a lot. The design is thoughtful. In making this chart, the designer added the following guides:

  • The text draws attention to the definition with the shortest range of birth years, and the one with the largest range.
  • The dashed gray gridlines help with reading the endpoints of each bar.
  • The yellow band illustrates the so-called average range. It appears that this average range is formed by taking the average of the beginning years and the average of the ending years. This indicates a desire to allow comparisons between each definition and the average range.
  • The bars are ordered by the ending birth year (right edge).

The underlying issue is how to display uncertainty. The interest here is not just to feature the "average" definition of a millennial but to show the range of definitions.

***

In making my chart, I apply a different way to find the "average" range. Given any year, say 1990, what is the chance that it is included in any of the definitions? In other words, what proportion of the definitions include that year? In the following chart, the darker the color, the more likely that year is included by the "average" opinion.

Redo_junkcharts_cnbcmillennials

I ordered the bars from shortest to the longest so there is no need to annotate them. Based on this analysis, 90 percent (or higher) of the sources list 19651985 to 1993 as part of the range while 70 percent (or higher) list 19611981 to 1996 as part of the range.

 

 


Does this chart tell the sordid tale of TI's decline?

The Hustle has an interesting article on the demise of the TI calculator, which is popular in business circles. The article uses this bar chart:

Hustle_ti_calculator_chart

From a Trifecta Checkup perspective, this is a Type DV chart. (See this guide to the Trifecta Checkup.)

The chart addresses a nice question: is the TI graphing calculator a victim of new technologies?

The visual design is marred by the use of the calculator images. The images add nothing to our understanding and create potential for confusion. Here is a version without the images for comparison.

Redo_junkcharts_hustlet1calc

The gridlines are placed to reveal the steepness of the decline. The sales in 2019 will likely be half those of 2014.

What about the Data? This would have been straightforward if the revenues shown are sales of the TI calculator. But according to the subtitle, the data include a whole lot more than calculators - it's the "other revenues" category in the financial reports of Texas Instrument which markets the TI. 

It requires a leap of faith to believe this data. It is entirely possible that TI calculator sales increased while total "other revenues" decreased! The decline of TI calculator could be more drastic than shown here. We simply don't have enough data to say for sure.

 

P.S. [10/3/2019] Fixed TI.

 

 


The time of bird seeds and chart tuneups

The recent post about multi-national companies reminded me of an older post, in which I stepped through data table enhancements.

Here is a video of the process. You can use any tool to implement the steps; even Excel is good enough.

 

 

The video is part of a series called "Data science: the Missing Pieces". In these episodes, I cover the parts of data science that are between the cracks, the little things that textbooks and courses do not typically cover - the things that often block students from learning efficiently.

If you have encountered such things, please comment below to suggest future topics. What is something about visualizing data you wish you learned formally?

***

P.S. Placed here to please the twitter-bot

DSTMP2_goodchart_thumb

 

 


Pulling the multi-national story out, step by step

Reader Aleksander B. found this Economist chart difficult to understand.

Redo_multinat_1

Given the chart title, the reader is looking for a story about multinationals producing lower return on equity than local firms. The first item displayed indicates that multinationals out-performed local firms in the technology sector.

The pie charts on the right column provide additional information about the share of each sector by the type of firms. Is there a correlation between the share of multinationals, and their performance differential relative to local firms?

***

We can clean up the presentation. The first changes include using dots in place of pipes, removing the vertical gridlines, and pushing the zero line to the background:

Redo_multinat_2

The horizontal gridlines attached to the zero line can also be removed:

Redo_multinat_3

Now, we re-order the rows. Start with the aggregate "All sectors". Then, order sectors from the largest under-performance by multinationals to the smallest.

Redo_multinat_4

The pie charts focus only on the share of multinationals. Taking away the remainders speeds up our perception:

Redo_multinat_5

Help the reader understand the data by dividing the sectors into groups, organized by the performance differential:

Redo_multinat_6

For what it's worth, re-sort the sectors from largest to smallest share of multinationals:

Redo_multinat_7

Having created groups of sectors by share of multinationals, I simplify further by showing the average pie chart within each group:

Redo_multinat_8

***

To recap all the edits, here is an animated gif: (if it doesn't play automatically, click on it)

Redo_junkcharts_econmultinat

***

Judging from the last graphic, I am not sure there is much correlation between share of multinationals and the performance differentials. It's interesting that in aggregate, local firms and multinationals performed the same. The average hides the variability by sector: in some sectors, local firms out-performed multinationals, as the original chart title asserted.


Clarifying comparisons in censored cohort data: UK housing affordability

If you're pondering over the following chart for five minutes or more, don't be ashamed. I took longer than that.

Ft_ukgenerationalhousing

The chart accompanied a Financial Times article about inter-generational fairness in the U.K. To cut to the chase, a recently released study found that younger generations are spending substantially higher proportions of their incomes to pay for housing costs. The FT article is here (behind paywall). FT actually slightly modified the original chart, which I pulled from the Home Affront report by the Intergenerational Commission.

Uk_generational_propincomehousing

One stumbling block is to figure out what is plotted on the horizontal axis. The label "Age" has gone missing. Even though I am familiar with cohort analysis (here, generational analysis), it took effort to understand why the lines are not uniformly growing in lengths. Typically, the older generation is observed for a longer period of time, and thus should have a longer line.

In particular, the orange line, representing people born before 1895 only shows up for a five-year range, from ages 70 to 75. This was confusing because surely these people have lived through ages 20 to 70. I'm assuming the "left censoring" (missing data on the left side) is because of non-existence of old records.

The dataset is also right-censored (missing data on the right side). This occurs with the younger generations (the top three lines) because those cohorts have not yet reached certain ages. The interpretation is further complicated by the range of birth years in each cohort but let me not go there.

TL;DR ... each line represents a generation of Britons, defined by their birth years. The generations are compared by how much of their incomes did they spend on housing costs. The twist is that we control for age, meaning that we compare these generations at the same age (i.e. at each life stage).

***

Here is my version of the same chart:

Junkcharts_redo_ukgenerationalhousing_1

Here are some of the key edits:

  • Vertical blocks are introduced to break up the analysis by life stage. These guide readers to compare the lines vertically i.e. across generations
  • The generations are explicitly described as cohorts by birth years
  • The labels for the generations are placed next to the lines
  • Gridlines are pushed to the back
  • The age axis is explicitly labeled
  • Age labels are thinned
  • A hierarchy on colors
  • The line segments with incomplete records are dimmed

The harmful effect of colors can be seen below. This chart is the same as the one above, except for retaining the colors of the original chart:

Junkcharts_redo_ukgenerationalhousing_2

 

 


Re-thinking a standard business chart of stock purchases and sales

Here is a typical business chart.

Cetera_amd_chart

A possible story here: institutional investors are generally buying AMD stock, except in Q3 2018.

Let's give this chart a three-step treatment.

STEP 1: The Basics

Remove the data labels, which stand sideways awkwardly, and are redundant given the axis labels. If the audience includes people who want to take the underlying data, then supply a separate data table. It's easier to copy and paste from, and doing so removes clutter from the visual.

The value axis is probably created by an algorithm - hard to imagine someone deliberately placing axis labels  $262 million apart.

The gridlines are optional.

Redo_amdinstitution_1

STEP 2: Intermediate

Simplify and re-organize the time axis labels; show the quarter and year structure. The years need not repeat.

Align the vocabulary on the chart. The legend mentions "inflows and outflows" while the chart title uses the words "buying and selling". Inflows is buying; outflows is selling.

Redo_amdinstitution_2

STEP 3: Advanced

This type of data presents an interesting design challenge. Arguably the most important metric is the net purchases (or the net flow), i.e. inflows minus outflows. And yet, the chart form leaves this element in the gaps, visually.

The outflows are numerically opposite to inflows. The sign of the flow is encoded in the color scheme. An outflow still points upwards. This isn't a criticism, but rather a limitation of the chart form. If the red bars are made to point downwards to indicate negative flow, then the "net flow" is almost impossible to visually compute!

Putting the columns side by side allows the reader to visually compute the gap, but it is hard to visually compare gaps from quarter to quarter because each gap is hanging off a different baseline.

The following graphic solves this issue by focusing the chart on the net flows. The buying and selling are still plotted but are deliberately pushed to the side:

Redo_amd_1

The structure of the data is such that the gray and pink sections are "symmetric" around the brown columns. A purist may consider removing one of these columns. In other words:

Redo_amd_2

Here, the gray columns represent gross purchases while the brown columns display net purchases. The reader is then asked to infer the gross selling, which is the difference between the two column heights.

We are almost back to the original chart, except that the net buying is brought to the foreground while the gross selling is pushed to the background.

 


This chart advises webpages to add more words

A reader sent me the following chart. In addition to the graphical glitch, I was asked about the study's methodology.

Serp-iq-content-length

I was able to trace the study back to this page. The study uses a line chart instead of the bar chart with axis not starting at zero. The line shows that web pages ranked higher by Google on the first page tend to have more words, i.e. longer content may help with Google ranking.

Backlinko_02_Content-Total-Word-Count_line

On the bar chart, Position 1 is more than 6 times as big as Position 10, if one compares the bar areas. But it's really only 20% larger in the data.

In this case, even the line chart is misleading. If we extend the Google Position to 20, the line would quickly dip below the horizontal axis if the same trend applies.

The line chart includes too much grid, one of Tufte's favorite complaints. The Google position is an integer and yet the chart's gridlines imply that 0.5 rank is possible.

Any chart of this data should supply information about the variance around these average word counts. Would like to see a side-by-side box plot, for example.

Another piece of context is the word counts for results on the second or third pages of Google results. Where are the short pages?

***

Turning to methodology, we learn that the research team analyzed 1 million pages of Google search results, and they also "removed outliers from our data (pages that contained fewer than 51 words and more than 9999 words)."

When you read a line like this, you have to ask some questions:

How do they define "outlier"? Why do they choose 51 and 9,999 as the cut-offs?

What proportion of the data was removed at either end of the distribution?

If these proportions are small, then the outliers are not going to affect that average word count by much, and thus there is no point to their removal. If they are large, we'd like to see what impact removing them might have.

In any case, the median is a better number to use here, or just show us the distribution, not just the average number.

It could well be true that Google's algorithm favors longer content, but we need to see more of the data to judge.

 

 


Big Macs in Switzerland are amazing, according to my friend

Bigmac_chNote for those in or near Zurich: I'm giving a Keynote Speech tomorrow morning at the Swiss Statistics Meeting (link). Here is the abstract:

The best and the worst of data visualization share something in common: these graphics provoke emotions. In this talk, I connect the emotional response of readers of data graphics to the design choices made by their creators. Using a plethora of examples, collected over a dozen years of writing online dataviz criticism, I discuss how some design choices generate negative emotions such as confusion and disbelief while other choices elicit positive feelings including pleasure and eureka. Important design choices include how much data to show; which data to highlight, hide or smudge; what research question to address; whether to introduce imagery, or playfulness; and so on. Examples extend from graphics in print, to online interactive graphics, to visual experiences in society.

***

The Big Mac index seems to never want to go away. Here is the latest graphic from the Economist, saying what it says:

Econ_bigmacindex

The index never made much sense to me. I'm in Switzerland, and everything here is expensive. My friend, who is a U.S. transplant, seems to have adopted McDonald's as his main eating-out venue. Online reviews indicate that the quality of the burger served in Switzerland is much better than the same thing in the States. So, part of the price differential can be explained by quality. The index also confounds several other issues, such as local inflation and exchange rate

Now, on to the data visualization, which is primarily an exercise in rolling one's eyeballs. In order to understand the red and blue line segments, our eyes have to hop over the price bubbles to the top of the page. Then, in order to understand the vertical axis labels, unconventionally placed on the right side, our eyes have to zoom over to the left of the page, and search for the line below the header of the graph. Next, if we want to know about a particular country, our eyes must turn sideways and scan from bottom up.

Here is a different take on the same data:

Redo_jc_econbigmac2018

I transformed the data as I don't find it compelling to learn that Russian Big Macs are 60% less than American Big Macs. Instead, on my chart, the reader learns that the price paid for a U.S. Big Mac will buy him/her almost 2 and a half Big Macs in Russia.

The arrows pointing left indicate that in most countries, the values of their currencies are declining relative to the dollar from 2017 to 2018 (at least by the Big Mac Index point of view). The only exception is Turkey, where in 2018, one can buy more Big Macs equivalent to the price paid for one U.S. Big Mac. compared to 2017.

The decimal differences are immaterial so I have grouped the countries by half Big Macs.

This example demonstrates yet again, to make good data visualization, one has to describe an interesting question, make appropriate transformations of the data, and then choose the right visual form. I describe this framework as the Trifecta - a guide to it is here.

(P.S. I noticed that Bitly just decided unilaterally to deactivate my customized Bitly link that was configured years and years ago, when it switched design (?). So I had to re-create the custom link. I have never grasped  why "unreliability" is a feature of the offering by most Tech companies.)