On data volume, reliability, uncertainty and confidence bands

This chart from the Economist caught my eye because of the unusual use of color-coded hexagonal tiles.

Economist_lifequalitywealth1

The basic design of the chart is easy to grasp: It relates people's "happiness" to national wealth. The thick black line shows that the average citizen of wealthier countries tends to rate their current life situation better.

For readers alert to graphical details, things can get a little confusing. The horizontal "wealth" axis is shown in log scale, which means that the data on the right side of the chart have been compressed while the data on the left side of the chart have been stretched out. In other words, the curve in linear scale is much flatter than depicted.

Redo_economistlifesatisfaction_linear

One thing you might notice is how poor the fit of the line is at both ends. Singapore and Afghanistan are clearly not explained by the fitted line. (That said, the line is based on many more dots than those eight we can see.) Moreover, because countries are widely spread out on the high end of the wealth axis, the fit is not impressive. Log scales tend to give a false impression of the tightness of fit, as I explained before when discussing coronavirus case curves.

***

The hexagonal tiles replace the more typical dot scatter or contour shading. The raw data consist of results from polls conducted in different countries in different years. For each poll, the analyst computes the average life satisfaction score for that country in that year. From national statistics, the analyst pulls out that country's GDP per capita in that year. Thus, each data point is a dot on the canvas. A few data points are shown as black dots. Those are for eight highlighted countries for the year 2018.

The black line is fitted to the underlying dot scatter and summarizes the correlation between average wealth and average life satisfaction. Instead of showing the scatter, this Economist design aggregates nearby dots into hexagons. The deepest red hexagon, sandwiched between Finland and the US, contains about 60-70 dots, according to the color legend.

These details are tough to take in. It's not clear which dots have been collected into that hexagon: are they all Finland or the U.S. in various years, or do they include other countries? Each country is represented by multiple dots, one for each poll year. It's also not clear how much variation there exists within a country across years.
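To make the aggregation step concrete, here is a minimal sketch of one common way to bin dots into hexagons: assign each dot to the nearest center on a triangular lattice, whose cells are regular hexagons. I don't know how the Economist computed its tiles, so treat this as an illustration only; the function names, the tile width and the sample points are all made up.

```python
import math
from collections import Counter

def hex_center(x, y, width):
    """Return the center of the regular hexagon containing (x, y).

    Hexagon centers sit on two interleaved rectangular lattices, so that
    neighboring centers are `width` apart.
    """
    height = width * math.sqrt(3)  # row spacing within each rectangular lattice
    # Candidate 1: nearest point on the lattice (i * width, j * height)
    ax = round(x / width) * width
    ay = round(y / height) * height
    # Candidate 2: nearest point on the same lattice shifted by half a cell
    bx = (round(x / width - 0.5) + 0.5) * width
    by = (round(y / height - 0.5) + 0.5) * height
    dist_a = (x - ax) ** 2 + (y - ay) ** 2
    dist_b = (x - bx) ** 2 + (y - by) ** 2
    return (ax, ay) if dist_a <= dist_b else (bx, by)

def hexbin_counts(points, width):
    """Count how many points fall into each hexagon, keyed by hexagon center."""
    return Counter(hex_center(x, y, width) for x, y in points)

# Hypothetical dots (not the Economist's data): one dot per country-year poll
points = [(0.1, 0.1), (-0.2, 0.05), (0.05, -0.1), (0.5, 0.9), (0.55, 0.8), (2.0, 0.1)]
counts = hexbin_counts(points, width=1.0)
for center, n in counts.items():
    print(center, n)
```

The count per hexagon is what the color legend encodes. Notice that the output keeps only the count: which countries and years landed in a tile is lost, which is exactly the information the reader is left guessing about.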

***

The hexagonal tiles presumably serve the same role as a dot scatter or contour shading. They convey the amount of data supporting the fitted curve along its trajectory. More data confers more reliability.

For this chart, the hexagonal tiles do not add any value. The deepest red regions are those closest to the black line so nothing is actually lost by showing just the line and not the tiles.

Redo_economistlifesatisfaction_nohex

Using the line chart obviates the need for readers to figure out the hexagons, the polls, the aggregation, and the inevitable unanswered questions.

***

An alternative concept is to show the "confidence band" or "error bar" around the black line. These bars display the uncertainty of the data. The wider the band, the less certain the analyst is of the estimate. Typically, the band expands near the edges where we have less data.

Here is conceptually what we should see (I don't have the underlying dataset, so I can't compute the confidence band precisely):

Redo_economistlifesatisfaction_confband

The confidence band picture is the mirror image of the hexagonal tiles. Where the poll density is high, the confidence band narrows, and where poll density is low, the band expands.
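For readers who want to see where the widening comes from, here is a small sketch under simplifying assumptions: a straight-line fit (the Economist's line is curved), a normal-approximation multiplier of 1.96, and made-up data with many polls in the middle of the wealth range and few at the extremes. The function name and the numbers are hypothetical.

```python
import math

def band_half_width(xs, ys, x0, z=1.96):
    """Approximate half-width of the pointwise confidence band of a fitted
    straight line at x0, using a normal multiplier z."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual standard error with n - 2 degrees of freedom
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(rss / (n - 2))
    # The second term grows as x0 moves away from the bulk of the data
    return z * s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)

# Made-up data: wealth (x) is dense near the middle, sparse at both ends
xs = [1, 4, 4.5, 5, 5, 5.5, 6, 9]
ys = [3.2, 5.1, 5.6, 5.4, 6.0, 5.9, 6.3, 7.1]

for x0 in (1, 5, 9):
    print(x0, round(band_half_width(xs, ys, x0), 2))
```

The half-width is smallest at the center of the data and grows toward either edge, which is the flaring shape sketched in the chart above.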

A simple way to interpret the confidence band is to find the country's wealth on the horizontal axis, and look at the range of life satisfaction ratings for that value of wealth. Now pick any number within the range, and imagine that you've just conducted a survey and computed the average rating. That number you picked is a possible survey result, and thus a valid value. (For those who know some probability, you should pick a number not at random within the range but in accordance with a bell curve, meaning picking a number closer to the fitted line with much higher probability than a number at either edge.)

Visualizing data involves a series of choices. For this dataset, one such choice is displaying data density or uncertainty or neither.


This chart shows why the PR agency for the UK government deserves a Covid-19 bonus

The Economist illustrated some interesting consumer research with this chart (link):

Economist_covidpoll

The survey by Dalia Research asked people about their satisfaction with their country's response to the coronavirus crisis. The results are reduced to the "Top 2 Boxes", the proportion of people who rated their government response as "very well" or "somewhat well".

This dimension is laid out along the horizontal axis. The chart is a combo dot and bubble chart, arranged in rows by region of the world. Now what does the bubble size indicate?

It took me a while to find the legend as I was expecting it either in the header or the footer of the graphic. A larger bubble depicts a higher cumulative number of deaths up to June 15, 2020.

The key issue is the correlation between a country's death count and the people's evaluation of the government response.

Bivariate correlation is typically shown on a scatter plot. The following chart sets out the scatter plots in a small multiples format with each panel displaying a region of the world.

Redo_economistcovidpolling_scatter

The death tolls in the Asian countries are low relative to the other regions, and yet the people's ratings vary widely. In particular, the Japanese people are pretty hard on their government.

In Europe, the people of Greece, the Netherlands and Germany think highly of their government responses, which have suppressed deaths. The French, Spaniards and Italians are understandably unhappy. The British appear to be the most forgiving of their government, despite suffering a higher death toll than France, Spain or Italy. This speaks well of their PR operation.

Cumulative deaths should be adjusted by population size for a proper comparison across nations. When the same graphic is produced using deaths per million (shown on the right below), the general story is preserved while the pattern is clarified:

Redo_economistcovidpolling_deathspermillion_2

The right chart shows deaths per million while the left chart shows total deaths.

***

In the original Economist chart, what catches our attention first is the bubble size. Eventually, we notice the horizontal positioning of these bubbles. But the star of this chart ought to be the new survey data. I swapped those variables and obtained the following graphic:

Redo_economistcovidpolling_swappedvar

Instead of using bubble size, I switched to using color to illustrate the deaths-per-million metric. If ratings of the pandemic response correlate tightly with deaths per million, then we expect the color of these dots to evolve from blue on the left side to red on the right side.

The peculiar loss of correlation in the U.K. stands out. Their PR firm deserves a bonus!


Visualizing black unemployment in the U.S.

In a prior post, I explained how the aggregate unemployment rate paints a misleading picture of the employment situation in the United States. Even though the U3 unemployment rate in 2019 has returned to the lowest level we have seen in decades, the aggregate statistic hides some concerning trends. There is an alarming rise in the proportion of people considered "not in labor force" by the Bureau of Labor Statistics - these forgotten people are not counted as "employable": when a worker drops out of the labor force, the unemployment rate ironically improves.
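The arithmetic behind that irony is easy to check. The sketch below uses a hypothetical economy with made-up head counts; the only thing taken from the text is the definition that the unemployment rate divides the unemployed by the labor force (employed plus unemployed), ignoring everyone classified as not in labor force.

```python
def unemployment_rate(employed, unemployed):
    """Unemployed as a share of the labor force (employed + unemployed)."""
    labor_force = employed + unemployed
    return unemployed / labor_force

# Hypothetical economy, in thousands of people
employed, unemployed = 950, 50
before = unemployment_rate(employed, unemployed)      # 50 / 1000 = 5.0%

# Ten thousand unemployed people stop looking for work and are
# reclassified as "not in labor force". Nobody found a job.
dropouts = 10
after = unemployment_rate(employed, unemployed - dropouts)  # 40 / 990, about 4.04%

print(f"before: {before:.2%}, after: {after:.2%}")
```

The number of people with jobs is identical in both scenarios, yet the headline rate falls.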

In that post, I looked at the difference between men and women. This post will examine the racial divide, whites and blacks.

I did not anticipate how many obstacles I'd encounter. It's hard to locate a specific data series, and it's harder to know whether the lack of search results indicates the non-existence of the data, or the incompetence of the search engine. Race-related data tend not to be offered in as much granularity. I was only able to find quarterly data for the racial analysis while I had monthly data for the gender analysis. Also, I only have data from 2000, instead of 1990.

***

As before, I looked at the official unemployment rate first, this time presented by race. Because whites form the majority of the labor force, the overall unemployment rate (not shown) is roughly the same as that for whites, just pulled up slightly toward the line for blacks.

Jc_unemploybyrace

The racial divide is clear as day. Throughout the past two decades, black Americans have been much more likely to be unemployed, and even more so during recessions.

The above chart determines the color encoding for all the other graphics. Notice that the best employment situations occurred on either end of this period, right before the dotcom bust in 2000, and in 2019 before the Covid-19 pandemic. As explained before, despite the headline unemployment rate being the same in those years, the employment situation was not the same.

***

Here is the scatter plot for white Americans:

Jc_unemploybyrace_scatter_whites

Even though both ends of the trajectory are marked with the same shade of blue, indicating almost identical (low) rates of unemployment, we find that the trajectory has failed to return to its starting point after veering off course during the recession of the early 2010s. While the proportion of part-time workers (counted as employed) returned to 17.5% in 2019, as in 2000, about 15 percent more whites are now excluded from the unemployment rate calculation.

The experience of black Americans appears different:

Jc_unemploybyrace_scatter_blacks

During the first decade, the proportion of black Americans dropping out of the labor force accelerated while among those considered employed, the proportion holding part-time jobs kept increasing. As the U.S. recovered from the Great Recession, we've seen a boomerang pattern. By 2019, the situation was halfway back to 2000. The last available datum for the first quarter of 2020 is before Covid-19; it actually showed a halt of the boomerang.

If the pattern we saw in the prior post holds for the Covid-19 world, we would see a marked spike in the out-of-labor-force statistic, coupled with a drop in part-time employment. It appeared that employers were eliminating part-time workers first.

***

One reader asked about placing both patterns on the same chart. Here is an example of this:

Jc_unemploybyrace_scatter_both

This graphic turns out okay because the two strings of dots fit tightly into the grid while not overlapping. There is a lot going on here; I prefer a multi-step story to throwing everything at the wall.

There is one insight that this chart provides that is not easily observed in two separate plots. Over the two decades, the racial gap has narrowed in these two statistics. Both groups have traveled to the top right corner, which is the worst corner to be in -- where more people are classified as not employable, and more of the employed are part-time workers.

The biggest challenge with making this combined scatter plot is properly controlling the color. I want the color to represent the overall unemployment rate, which is a third data series. I don't want the line for blacks to be all red, and the line for whites to be all blue, just because black Americans face a tough labor market always. The color scheme here facilitates cross-referencing time between the two dot strings.


How the pandemic affected employment of men and women

In the last post, I looked at the overall employment situation in the U.S. Here is the trend of the "official" unemployment rate since 1990.

Junkcharts_kfung_unemployment_apr20

I was talking about the missing 100 million. These are people who are neither employed nor unemployed in the eyes of the Bureau of Labor Statistics (BLS). They are simply unrepresented in the numbers shown in the chart above.

This group is visualized in my scatter plot as "not in labor force", as a percent of the employment-age population. The horizontal axis of this scatter plot shows the proportion of employed people who hold part-time jobs. Anyone who worked at least one hour during the month is counted as employed part-time.

***

Today, I visualize the differences between men and women.

The first scatter plot shows the situation for men:

Junkcharts_unemployment_scatter_men

This plot reveals a long-term structural problem for the U.S. economy. Regardless of the overall economic health, more and more men have been declared not in labor force each year. Between 2007, the start of the Great Recession, and 2019, the proportion went up from 27% to 31%, and the pandemic has pushed this to almost 34%. As mentioned in the last post, this sharp rise in April raises concern that the criteria for "not in labor force" capture a lot of people who actually want a job, and therefore should be counted as part of the labor force but unemployed.

Also, as seen in the last post, the severe drop in part-time workers is unprecedented during economic hardship. As dots turn from blue to red, they typically are moving right, meaning more part-time workers. Since the pandemic, among those people still employed, the proportion holding full-time jobs has paradoxically exploded.

***

The second scatter plot shows the situation with women:

Junkcharts_unemployment_scatter_women

Women have always faced a tougher job market. If they are employed, they are more likely to be holding part-time jobs relative to employed men; and a significantly larger proportion of women are not in the labor force. Between 1990 and 2001, more women entered the labor force. Just as with men, the Great Recession resulted in a marked jump in the proportion out of labor force. Since 2014, a positive trend emerged, now interrupted by the pandemic, which has pushed both metrics to levels never seen before.

The same story persists: the sharp rise in women "not in labor force" exposes a problem with this statistic - it apparently includes people who do want to work, contrary to its intent. In addition, unlike the pattern in the last 30 years, the severe economic crisis is coupled with a shift toward full-time employment, indicating that part-time jobs were disappearing much faster than full-time jobs.


The missing 100 million: how the pandemic reveals the fallacy of not in labor force

Last Friday, the U.S. published the long-feared employment situation report. It should come as no surprise to anyone, since U.S. businesses were quick to lay off employees once much of the economy was shut down to abate the spread of the coronavirus.

Numbersense_cover

I've been following employment statistics for a while. Chapter 6 of Numbersense (link) addresses the statistical aspects of how the unemployment rate is computed. The title of the chapter is "Are they new jobs when no one can apply?" What you learn is that the final number being published starts off as survey tallies, which then undergo a variety of statistical adjustments.

One such adjustment - which ought to be controversial - results in the disappearance of 100 million Americans. I mean that they are invisible to the Bureau of Labor Statistics (BLS), considered neither employed nor unemployed. You don't hear about them because the media report the "headline" unemployment rate, which excludes these people. They are officially designated "not in the labor force". I'll come back to this topic later in the post.

***

Last year, I used a pair of charts to visualize the unemployment statistics. I have updated the charts to include all of 2019 and 2020 up to April, the just released numbers.

The first chart shows the trend in the official unemployment rate ("U3") from 1990 to present. It's color-coded so that the periods of high unemployment are red, and the periods of low unemployment are blue. This color code will come in handy for the next chart.

Junkcharts_kfung_unemployment_apr20

The time series is smoothed. However, I had to exclude the April 2020 outlier from the smoother.

The next plot, a scatter plot, highlights two of the more debatable definitions used by the BLS. On the horizontal axis, I plot the proportion of employed people who have part-time jobs. People only need to have worked one hour in a month to be counted as employed. On the vertical axis, I plot the proportion of the population who are labeled "not in labor force". These are people who are not employed and not counted in the unemployment rate.

Junkcharts_kfung_unemployment_apr20_2

The value of data visualization is its ability to reveal insights about the data. I'm happy to report that this design succeeded.

Previously, we learned that (a) part-timers as a proportion of employment tend to increase during periods of worsening unemployment (red dots moving right) while decreasing during periods of improving employment (blue dots moving left); and (b) despite the overall unemployment rate being about the same in 2007 and 2017, the employment situation was vastly different in the sense that the labor force has shrunk significantly during the recession and never returned to normal. These two insights are still found at the bottom right corner of the chart. The 2019 situation did not differ much from 2018.

What is the effect of the current Covid-19 pandemic?

On both dimensions, we have broken records since 1990. The proportion of people designated not in labor force was already the worst in three decades before the pandemic, and now it has almost reached 40 percent of the population!

Remember these people are invisible to the media, neither employed nor unemployed. Back in February 2020, with the unemployment rate at around 4 percent, it's absolutely not the case that 96 percent of the employment-age population was employed. The number of employed Americans was just under 160 million. The population 16 years and older at the time was 260 million.
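Here is the back-of-envelope accounting, using only the rounded figures quoted above (160 million employed, 260 million aged 16 and over, a 4 percent unemployment rate). Because the inputs are rounded, the outputs are approximate; the variable names are mine.

```python
employed = 160e6           # "just under 160 million", rounded up
population_16_plus = 260e6
unemployment_rate = 0.04   # "around 4 percent"

# unemployment_rate = unemployed / labor_force and labor_force = employed + unemployed,
# so labor_force = employed / (1 - unemployment_rate)
labor_force = employed / (1 - unemployment_rate)
unemployed = labor_force - employed
not_employed = population_16_plus - employed
not_in_labor_force = population_16_plus - labor_force
employment_to_population = employed / population_16_plus

print(f"not employed:             {not_employed / 1e6:.0f} million")
print(f"  counted as unemployed:  {unemployed / 1e6:.1f} million")
print(f"  not in labor force:     {not_in_labor_force / 1e6:.0f} million")
print(f"employed share of 16+:    {employment_to_population:.1%}")
```

With these inputs, about 100 million people aged 16 and over were not employed, and fewer than 7 million of them showed up in the unemployment rate. Only about 62 percent of the population was employed, nowhere near 96 percent.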

Who are these 100 million people? BLS says all but 2 million of these are people who "do not want a job". Some of them are retired. There are about 50 million Americans above 65 years old although 25 percent of them are still in the labor force, so only 38 million are "not in labor force," according to this Census report.

It would seem like the majority of these people don't want to work, are not paid enough to work, etc. Since part-time workers are counted as employed, with as little as one working hour per month, these are not the gig workers, not Uber/Lyft drivers, and not college students who have work-study or part-time jobs.

This category has long been suspect, and what happened in April isn't going to help build its case. There is no reason why the "not in labor force" group should spike immediately as a result of the pandemic. It's not plausible to argue that people who lost their jobs in the last few weeks suddenly turned into people who "do not want a job". I think this spike is solid evidence that the unemployed have been hiding inside the not in labor force number.

The unemployment rate has under-reported unemployment because many of the unemployed have been taken out of the labor force based on BLS criteria. The recovery of jobs since the Great Recession is partially nullified since the jump in "not in labor force" never returned to the prior level.

***

The other dimension, part-time employment, also showed a striking divergence from the past behavior. Typically, when the unemployment rate deteriorates, the proportion of employed people who have part-time jobs increases. However, in the current situation, not only is that not happening, but the proportion of part-timers plunged to a level not seen in the last 30 years.

This suggests that employers are getting rid of their part-time work force first.


How to read this chart about coronavirus risk

In my just-published Long Read article at DataJournalism.com, I touched upon the subject of "How to Read this Chart".

Most data graphics do not come with directions of use because dataviz designers follow certain conventions. We do not need to tell you, for example, that time runs left to right on the horizontal axis (substitute right to left for those living in right-to-left countries). It's when we deviate from the norms that a "How to Read this Chart" box is called for.

***

A discussion over Twitter during the weekend on the following New York Times chart perfectly illustrates this issue. (The article is well worth reading to educate oneself on this red-hot public-health issue. I made some comments on the sister blog about the data a few days ago.)

Nyt_coronavirus_scatter

Reading this chart, I quickly grasp that the horizontal axis is the speed of infection and the vertical axis represents the deadliness. Without being told, I used the axis labels (and some of you might notice the annotations with the arrows on the top right). But most people will likely miss - at a glance - that the vertical axis utilizes a log scale while the horizontal axis is linear (regular).

The effect of a log scale is to pull the large numbers toward the average while spreading the smaller numbers apart - when compared to a linear scale. So when we look at the top of the coronavirus box, it appears that this virus could be as deadly as SARS.

The height of the pink box is 3.9, while the gap between the top edge of the box and the SARS dot is 6. Yet our eyes tell us the top edge is closer to the SARS dot than it is to the bottom edge!
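A few lines of arithmetic show why the eyes are fooled. The endpoints below (box from 0.1 to 4, SARS at 10) are my assumptions, chosen only because they reproduce the height of 3.9 and the gap of 6 quoted above; I have not read them off the NYT's data.

```python
import math

# Assumed values, consistent with a box height of 3.9 and a gap of 6
box_bottom, box_top, sars = 0.1, 4.0, 10.0

# Distances as the numbers themselves tell it (linear scale)
linear_height = box_top - box_bottom
linear_gap = sars - box_top

# Distances as drawn on a log10 axis: differences of logarithms
log_height = math.log10(box_top) - math.log10(box_bottom)
log_gap = math.log10(sars) - math.log10(box_top)

print(f"linear: box height {linear_height:.1f}, gap to SARS {linear_gap:.1f}")
print(f"log10:  box height {log_height:.2f}, gap to SARS {log_gap:.2f}")
```

On the linear scale the gap to SARS is larger than the box is tall; on the log scale the ordering flips, and the box looks several times taller than the gap.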

There is nothing inaccurate about this chart - the log scale introduces such distortion. The designer has to make a choice.

Indeed, there were two camps on Twitter, arguing for and against the log scale.

***

I use log scales a lot in analyzing data, but tend not to use log scales in a graph. It's almost a given that using the log scale requires a "How to Read this Chart" message. And the NY Times crew delivers!

Right below the chart is a paragraph:

Nyt_coronavirus_howtoreadthis

To make this even more interesting, the horizontal axis is a hidden "log" scale. That's because infections spread exponentially. Even though the scale is not labeled "log", think as if the large values have been pulled toward the middle.

Here is an over-simplified way to see this. A disease that spreads at a rate of fifteen people at a time is not 3 times worse than one that spreads five at a time. In the former case, the first sick person transmits it to 15, and then each of the 15 transmits it to 15 others, thus after two steps, 241 people have been infected (225 + 15 + 1). In the latter case, it's 5x5 + 5 + 1 = 31 infections after two steps. So at this point, the number of infected is already 8 times worse, not 3 times. And the gap keeps widening with each step.
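The same over-simplified model can be run for a few more steps. This sketch assumes, as in the paragraph above, that every newly infected person passes the disease to a fixed number of others at each step and that nobody recovers or is infected twice; the function name is mine.

```python
def total_infected(rate, steps):
    """Cumulative infections after `steps` rounds of transmission, starting
    from one case, when each new case infects `rate` others in the next round."""
    return sum(rate ** k for k in range(steps + 1))

for steps in range(1, 5):
    fast = total_infected(15, steps)
    slow = total_infected(5, steps)
    print(f"step {steps}: {fast} vs {slow}, ratio {fast / slow:.1f}")
```

At two steps this reproduces the 241 versus 31 above, and the ratio keeps growing with every additional step.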

P.S. See also my post on the sister blog that digs deeper into the metrics.


The rule governing which variable to put on which axis, served a la mode

When making a scatter plot, the two variables should not be placed arbitrarily. There is a rule governing this: the outcome variable should be shown on the vertical axis (also called y-axis), and the explanatory variable on the horizontal (or x-) axis.

This chart from the archives of the Economist has this reversed:

20160402_WOC883_icecream_PISA

The title of the accompanying article is "Ice Cream and IQ"...

In a Trifecta Checkup (link), it's a Type DV chart. It's preposterous to claim eating ice cream makes one smarter without more careful studies. The chart also carries the xyopia fallacy: by showing just two variables, readers are unwittingly led to explain differences in "IQ" using differences in per-capita ice-cream consumption when lots of other stronger variables will explain any gaps in IQ.

In this post, I put aside my objections to the analysis, and focus on the issue of assigning variables to axes. Notice that this chart reverses the convention: the outcome variable (IQ) is shown on the horizontal, and the explanatory variable (ice cream) is shown on the vertical.

Here is a reconstruction of the above chart, showing only the dots that were labeled with country names. I fitted a straight regression line instead of a curve. (I don't understand why the red line in the original chart bends upwards when the data for Japan, South Korea, Singapore and Hong Kong should be dragging it down.)

Redo_econ_icecreamIQ_1A

Note that the interpretation of the regression line raises eyebrows because the presumed causality is reversed. For each 50-point increase in PISA score (IQ), this line says to expect ice cream consumption to rise by about 1-2 liters per person per year. So higher IQ makes people eat more ice cream.

***

If the convention is respected, then the following scatter plot results:

Redo_econ_icecreamIQ_2

The first thing to note is that the regression analysis is different here from that shown in the previous chart. The blue regression line is not equivalent to the black regression line from the previous chart. You cannot reverse the roles of the x and y variables in a regression analysis, and so neither should you reverse the roles of the x and y variables in a scatter plot.
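The non-equivalence of the two regressions is easy to verify numerically. The sketch below fits ordinary least squares both ways on made-up numbers (they are not the Economist's data, and the helper `ols_slope` is my own). If swapping the axes merely flipped the same line, the second slope would be the reciprocal of the first.

```python
def ols_slope(xs, ys):
    """Least-squares slope of the regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical countries: liters of ice cream per person, test score
ice_cream = [1, 2, 3, 5, 6, 8, 9, 12]
score = [400, 470, 430, 500, 460, 520, 480, 510]

slope_score_on_ice = ols_slope(ice_cream, score)   # score points per extra liter
slope_ice_on_score = ols_slope(score, ice_cream)   # liters per extra score point

# Express both fits in the same units (score points per liter) to compare them
print("score regressed on ice cream:", round(slope_score_on_ice, 2))
print("ice cream regressed on score:", round(1 / slope_ice_on_score, 2))

# The product of the two slopes equals the squared correlation,
# so the lines coincide only when the fit is perfect.
r_squared = slope_score_on_ice * slope_ice_on_score
print("r-squared:", round(r_squared, 2))
```

Each regression minimizes errors in a different direction (vertical in one case, horizontal in the other), so the noisier the data, the further apart the two lines drift.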

The blue regression line can be interpreted as having two sections, roughly, for countries consuming more than or less than 6 liters of ice cream per person per year. In the less-ice-cream countries, the correlation between ice cream and IQ is stronger (I don't endorse the causal interpretation of this statement).

***

When you make a scatter plot, you have two variables whose correlation you want to analyze. In most cases, you are exploring a cause-effect relationship.

Higher-income households care more about politics.
Less educated citizens are more likely to not register to vote.
Companies with a more diverse workforce have better business performance.

Frequently, the reverse correlation does not admit a causal interpretation:

Caring more about politics does not make one richer.
Not registering to vote does not make one less educated.
Making more profits does not lead to more diversity in hiring.

In each of these examples, it's clear that one variable is the outcome, the other variable is the explanatory factor. Always put the outcome on the vertical axis, and the explanation on the horizontal axis.

The justification is scientific. If you are going to add a regression line (what Excel calls a "trendline"), you must follow this convention; otherwise, your regression analysis will yield the wrong result, with an absurd interpretation!

[PS. 11/3/2019: The comments below contain different theories that link the two variables, including theories that treat PISA score ("IQ") as the explanatory variable and ice cream consumption as the outcome. Also, I elaborated that the rule does not dictate which variable is the outcome - the designer effectively signals to the reader which variable is regarded as the outcome by placing it in the vertical axis.]


Wayward legend takes sides in a chart of two sides, plus data woes

Reader Chris P. submitted the following graph, found on Axios:

Axios_newstopics

From a Trifecta Checkup perspective, the chart has a clear question: are consumers getting what they wanted to read in the news they are reading?

Nevertheless, the chart is a visual mess, and the underlying data analytics fail to convince. So, it’s a Type DV chart. (See this overview of the Trifecta Checkup for the taxonomy.)

***

The designer did something tricky with the axis but the trick went off the rails. The underlying data consist of two sets of ranks, one for news people consumed and the other for news people wanted covered. With 14 topics included in the study, the two data series contain the same values, 1 to 14. The trick is to collapse both axes onto one. The trouble is that the same value occurs twice, and the reader must differentiate the plot symbols (triangle or circle) to figure out which is which.

It does not help that the lines look like arrows suggesting movement. Without first reading the text, readers may assume that topics change in rank between two periods of time. Some topics moved right, increasing in importance while others shifted left.

The design wisely separated the 14 topics into three logical groups. The blue group comprises news topics for which “want covered” ranking exceeds the “read” ranking. The orange group has the opposite disposition such that the data for “read” sit to the right side of the data for “want covered”. Unfortunately, the legend up top does more harm than good: it literally takes sides!

***

Here, I've put the data onto a scatter plot:

Redo_junkcharts_aiosnewstopics_1

The two sets of ranks are basically uncorrelated, as the regression line is almost flat, with “R-squared” of 0.02.

The analyst tried to "rescue" the data in the following way. Draw the 45-degree line, and color the points above the diagonal blue, and those below the diagonal orange. Color the points on the line gray. Then, write stories about those three subgroups.

Redo_junkcharts_aiosnewstopics_2

Further, the ranking of what was read came from Parse.ly, which appears to be surveillance data (“traffic analytics”) while the ranking of what people want covered came from an Axios/SurveyMonkey poll. As far as I could tell, there was no attempt to establish that the two populations are compatible and comparable.


Clearing a forest of labels

This chart by the Financial Times has a strong message, and I like a lot about it:

Ft-europe-growth

The countries are by and large aligned along a diagonal, with the poorer countries growing strongly between 2007 and 2019 while the richer countries suffered negative growth.

A small issue with the chart is the thick forest of text - redundant text. The sub-title, the axis titles, the quadrant labels, and the left-right-half labels all repeat the same things. In the following chart, I simplify the text:

Redo_fteuropegrowth_text

Typically, I don't put axis titles as a sub-header (or, header of the graphic) but as this may be part of the FT style, I respected it.


A data graphic that solves a consumer problem

Saw this great little sign at Ippudo, the ramen shop, the other day:

Ippudo_board

It's a great example of highly effective data visualization. The names on the board are sake brands. 

The menu (a version of a data table) is the conventional way of displaying this information.

The Question

Customers are selecting a sake. They don't have a favorite, or don't recognize many of these brands. They know a bit about their preferences: I like full-bodied, or I want the dry one. 

The Data

On a menu, the key data are missing. So the first order of business is to find data on full- and light-bodied, and dry and sweet. The pricing data are omitted, possibly because they clutter up the design, or because the shop doesn't want customers to focus on price - or both.

The Visual

The design uses a scatter plot. The customer finds the right quadrant, thus narrowing the choices to three or four brands. Then, the positions on the two axes allow the customer to drill down further.

This user experience is leaps and bounds above scanning a list of names, and asking someone who may or may not be an expert.

Back to the Data

The success of the design depends crucially on selecting the right data. Baked into the scatter plot is the assumption that the designer knows the two factors most influential to the customer's decision. Technically, this is a "variable selection" problem: of all factors determining the brand choice, which two are the most important? 

Think about the downside of selecting the wrong factors. Then, the scatter plot makes it harder to choose the sake compared to the menu.