When the pie chart is more complex than the data

The trading house, Charles Schwab, included the following graphic in a recent article:

Charleschwab_portfolio_1000

This graphic is more complicated than the story that it illustrates. The author describes a simple scenario in which an investor divides his investments into stocks, bonds and cash. After a stock crash, the value of the portfolio declines.

The graphic is a 3-D pie chart, in which the data are encoded twice, first in the areas of the sectors and then in the heights of the part-cylinders.

As readers, we perceive the relative volumes of the part-cylinders. Volume is the cross-sectional area (i.e. of the base) multipled by the height. Since each component holds the data, the volumes are proportional to the squares of the data.

Here is a different view of the same data:

Redo_junkcharts_schwab_portfolio

This "bumps chart" (also called a slopegraph) shows clearly the only thing that drives the change is the drop in stock prices. Because the author assumes no change in bonds or cash, the drop in the entire portfolio is completely accounted for by the decline in stocks. Of course, this scenario seems patently unrealistic - different investment asset classes tend to be correlated.

***

A cardinal rule of data visualization is that the visual should be less complex than the data.


Visualizing black unemployment in the U.S.

In a prior post, I explained how the aggregate unemployment rate paints a misleading picture of the employment situation in the United States. Even though the U3 unemployment rate in 2019 has returned to the lowest level we have seen in decades, the aggregate statistic hides some concerning trends. There is an alarming rise in the proportion of people considered "not in labor force" by the Bureau of Labor Statistics - these forgotten people are not counted as "employable": when a worker drops out of the labor force, the unemployment rate ironically improves.

In that post, I looked at the difference between men and women. This post will examine the racial divide, whites and blacks.

I did not anticipate how many obstacles I'd encounter. It's hard to locate a specific data series, and it's harder to know whether the lack of search results indicates the non-existence of the data, or the incompetence of the search engine. Race-related data tend not to be offered in as much granularity. I was only able to find quarterly data for the racial analysis while I had monthly data for the gender analysis. Also, I only have data from 2000, instead of 1990.

***

As before, I looked at the official unemployment rate first, this time presented by race. Because whites form the majority of the labor force, the overall unemployment rate (not shown) is roughly the same as that for whites, just pulled up slightly toward the line for blacks.

Jc_unemploybyrace

The racial divide is clear as day. Throughout the past two decades, black Americans are much more likely to be unemployed, and worse during recessions.

The above chart determines the color encoding for all the other graphics. Notice that the best employment situations occurred on either end of this period, right before the dotcom bust in 2000, and in 2019 before the Covid-19 pandemic. As explained before, despite the headline unemployment rate being the same in those years, the employment situation was not the same.

***

Here is the scatter plot for white Americans:

Jc_unemploybyrace_scatter_whites

Even though both ends of the trajectory are marked with the same shade of blue, indicating almost identical (low) rates of unemployment, we find that the trajectory has failed to return to its starting point after veering off course during the recession of the early 2010s. While the proportion of part-time workers (counted as employed) returned to 17.5% in 2019, as in 2000, about 15 percent more whites are now excluded from the unemployment rate calculation.

The experience of black Americans appears different:

Jc_unemploybyrace_scatter_blacks

During the first decade, the proportion of black Americans dropping out of the labor force accelerated while among those considered employed, the proportion holding part-time jobs kept increasing. As the U.S. recovered from the Great Recession, we've seen a boomerang pattern. By 2019, the situation was halfway back to 2000. The last available datum for the first quarter of 2020 is before Covid-19; it actually showed a halt of the boomerang.

If the pattern we saw in the prior post holds for the Covid-19 world, we would see a marked spike in the out-of-labor-force statistic, coupled with a drop in part-time employment. It appeared that employers were eliminating part-time workers first.

***

One reader asked about placing both patterns on the same chart. Here is an example of this:

Jc_unemploybyrace_scatter_both

This graphic turns out okay because the two strings of dots fit tightly into the grid while not overlapping. There is a lot going on here; I prefer a multi-step story than throwing everything on the wall.

There is one insight that this chart provides that is not easily observed in two separate plots. Over the two decades, the racial gap has narrowed in these two statistics. Both groups have traveled to the top right corner, which is the worst corner to reside -- where more people are classified as not employable, and more of the employed are part-time workers.

The biggest challenge with making this combined scatter plot is properly controlling the color. I want the color to represent the overall unemployment rate, which is a third data series. I don't want the line for blacks to be all red, and the line for whites to be all blue, just because black Americans face a tough labor market always. The color scheme here facilitates cross-referencing time between the two dot strings.


Designs of two variables: map, dot plot, line chart, table

The New York Times found evidence that the richest segments of New Yorkers, presumably those with second or multiple homes, have exited the Big Apple during the early months of the pandemic. The article (link) is amply assisted by a variety of data graphics.

The first few charts represent different attempts to express the headline message. Their appearance in the same article allows us to assess the relative merits of different chart forms.

First up is the always-popular map.

Nytimes_newyorkersleft_overallmap

The advantage of a map is its ease of comprehension. We can immediately see which neighborhoods experienced the greater exoduses. Clearly, Manhattan has cleared out a lot more than outer boroughs.

The limitation of the map is also in view. With the color gradient dedicated to the proportions of residents gone on May 1st, there isn't room to express which neighborhoods are richer. We have to rely on outside knowledge to make the correlation ourselves.

The second attempt is a dot plot.

Nytimes_newyorksleft_percentathome

We may have to take a moment to digest the horizontal axis. It's not time moving left to right but income percentiles. The poorest neighborhoods are to the left and the richest to the right. I'm assuming that these percentiles describe the distribution of median incomes in neighborhoods. Typically, when we see income percentiles, they are based on households, regardless of neighborhoods. (The former are equal-sized segments, unlike the latter.)

This data graphic has the reverse features of the map. It does a great job correlating the drop in proportion of residents at home with the income distribution but it does not convey any spatial information. The message is clear: The residents in the top 10% of New York neighborhoods are much more likely to have left town.

In the following chart, I attempted a different labeling of both axes. It cuts out the need for readers to reverse being home to not being home, and 90th percentile to top 10%.

Redo_nyt_newyorkerslefttown

The third attempt to convey the income--exit relationship is the most successful in my mind. This is a line chart, with time on the horizontal axis.

Nyt_newyorkersleft_percenthomebyincome

The addition of lines relegates the dots to the background. The lines show the trend more clearly. If directly translated from the dot plot, this line chart should have 100 lines, one for each percentile. However, the closeness of the top two lines suggests that no meaningful difference in behavior exists between the 20th and 80th percentiles. This can be conveyed to readers through a short note. Instead of displaying all 100 percentiles, the line chart selectively includes only the 99th , 95th, 90th, 80th and 20th percentiles. This is a design choice that adds by subtraction.

Along the time axis, the line chart provides more granularity than either the map or the dot plot. The exit occurred roughly over the last two weeks of March and the first week of April. The start coincided with New York's stay-at-home advisory.

This third chart is a statistical graphic. It does not bring out the raw data but features aggregated and smoothed data designed to reveal a key message.

I encourage you to also study the annotated table later in the article. It shows the power of a well-designed table.

[P.S. 6/4/2020. On the book blog, I have just published a post about the underlying surveillance data for this type of analysis.]

 

 


Consumption patterns during the pandemic

The impact of Covid-19 on the economy is sharp and sudden, which makes for some dramatic data visualization. I enjoy reading the set of charts showing consumer spending in different categories in the U.S., courtesy of Visual Capitalist.

The designer did a nice job cleaning up the data and building a sequential story line. The spending are grouped by categories such as restaurants and travel, and then sub-categories such as fast food and fine dining.

Spending is presented as year-on-year change, smoothed.

Here is the chart for the General Commerce category:

Visualcapitalist_spending_generalcommerce

The visual design is clean and efficient. Even too sparse because one has to keep returning to the top to decipher the key events labelled 1, 2, 3, 4. Also, to find out that the percentages express year-on-year change, the reader must scroll to the bottom, and locate a footnote.

As you move down the page, you will surely make a stop at the Food Delivery category, noting that the routine is broken.

Visualcapitalist_spending_fooddelivery

I've featured this device - an element of surprise - before. Remember this Quartz chart that depicts drinking around the world (link).

The rule for small multiples is to keep the visual design identical but vary the data from chart to chart. Here, the exceptional data force the vertical axis to extend tremendously.

This chart contains a slight oversight - the red line should be labeled "Takeout" because food delivery is the label for the larger category.

Another surprise is in store for us in the Travel category.

Visualcapitalist_spending_travel

I kept staring at the Cruise line, and how it kept dipping below -100 percent. That seems impossible mathematically - unless these cardholders are receiving more refunds than are making new bookings. Not only must the entire sum of 2019 bookings be wiped out, but the records must also show credits issued to these credit (or debit) cards. It's curious that the same situation did not befall the airlines. I think many readers would have liked to see some text discussing this pattern.

***

Now, let me put on a data analyst's hat, and describe some thoughts that raced through my head as I read these charts.

Data analysis is hard, especially if you want to convey the meaning of the data.

The charts clearly illustrate the trends but what do the data reveal? The designer adds commentary on each chart. But most of these comments count as "story time." They contain speculation on what might be causing the trend but there isn't additional data or analyses to support the storyline. In the General Commerce category, the 50 to 100 percent jump in all subcategories around late March is attributed to people stockpiling "non-perishable food, hand sanitizer, and toilet paper". That might be true but this interpretation isn't supported by credit or debit card data because those companies do not have details about what consumers purchased, only the total amount charged to the cards. It's a lot more work to solidify these conclusions.

A lot of data do not mean complete or unbiased data.

The data platform provided data on 5 million consumers. We don't know if these 5 million consumers are representative of the 300+ million people in the U.S. Some basic demographic or geographic analysis can help establish the validity. Strictly speaking, I think they have data on 5 million card accounts, not unique individuals. Most Americans use more than one credit or debit cards. It's not likely the data vendor have a full picture of an individual's or a family's spending.

It's also unclear how much of consumer spending is captured in this dataset. Credit and debit cards are only one form of payment.

Data quality tends to get worse.

One thing that drives data analyst nuts. The spending categories are becoming blurrier. In the last decade or so, big business has come to dominate the American economy. Big business, with bipartisan support, has grown by (a) absorbing little guys, and (b) eliminating boundaries between industry sectors. Around me, there is a Walgreens, several Duane Reades, and a RiteAid. They currently have the same owner, and increasingly offer the same selection. In the meantime, Walmart (big box), CVS (pharmacy), Costco (wholesale), etc. all won regulatory relief to carry groceries, fresh foods, toiletries, etc. So, while CVS or Walgreens is classified as a pharmacy, it's not clear that what proportion of the spending there is for medicines. As big business grows, these categories become less and less meaningful.


The elusive meaning of black paintings and red blocks

Joe N, a longtime reader, tweeted about the following chart, by the People's Policy Project:

3p_oneyearinonemonth_laborflow

This is a simple column chart containing only two numbers, far exceeded by the count of labels and gridlines.

I look at charts like the lady staring at these Ad Reinhardts:

 

SUBJPREINHARDT2-videoSixteenByNine1050

My artist friends say the black squares are not the same, if you look hard enough.

Here is what I learned after one such seating:

The tiny data labels sitting on the inside top edges of the columns hint that the right block is slightly larger than the left block.

The five labels of the vertical axis serve no purpose, nor the gridlines.

The horizontal axis for time is reversed, with 2019 appearing after 2020 (when read left to right).

The left block has one month while the right block has 12 months. This is further confused by the word "All" which shares the same starting and ending letters as "April".

As far as I can tell, the key message of this chart is that the month of April has the impact of a full year. It's like 12 months of outflows from employment hitting the economy in one month.

***

My first response is this chart:

Junkcharts_oneyearinonemonth_laborflow_1

Breaking the left block into 12 pieces, and color-coding the April piece brings out the comparison. You can also see that in 2019, the outflows from employment to unemployment were steady month to month.

Next, I want to see what happens if I restored the omitted months of Jan to March, 2020.

Junkcharts_oneyearinonemonth_laborflow_2

The story changes slightly. Now, the chart says that the first four months have already exceeded the full year of 2019.

Since the values hold steady month to month, with the exception of April 2020, I make a monthly view:

Junkcharts_oneyearinonemonth_laborflow_monthly_bar_1

You can see the slight nudge-up in March 2020 as well. This draws more attention to the break in pattern.

For time-series data, I prefer to look at line charts:

Junkcharts_oneyearinonemonth_laborflow_monthly_line_1

As I explained in this post about employment statistics (or Chapter 6 of Numbersense (link)), the Bureau of Labor Statistics classifies people into three categories: Employed, Unemployed and Not in Labor Force. Exits from Employed to Unemployed status contribute to unemployment in the U.S. To depict a negative trend, it's often natural to use negative numbers:

Junkcharts_oneyearinonemonth_laborflow_monthly_line_neg_1

You may realize that this data series paints only a partial picture of the health of the labor market. While some people exit the Employed status each month, there are others who re-enter or enter the Employed status. We should really care about net flows.

Junkcharts_oneyearinonemonth_laborflow_net_lines

In all of 2019, there were more entrants than exits, leading to a slightly positive net inflow to the Employed status from Unemployed (blue line). In April 2020, the red line (exits) drags the blue line dramatically.

Of course, even this chart is omitting important information. There are also flows from Employed to and from Not in Labor Force.

 

 

 

 

 


How the pandemic affected employment of men and women

In the last post, I looked at the overall employment situation in the U.S. Here is the trend of the "official" unemployment rate since 1990.

Junkcharts_kfung_unemployment_apr20

I was talking about the missing 100 million. These are people who are neither employed nor unemployed in the eyes of the Bureau of Labor Statistics (BLS). They are simply unrepresented in the numbers shown in the chart above.

This group is visualized in my scatter plot as "not in labor force", as a percent of the employment-age population. The horizontal axis of this scatter plot shows the proportion of employed people who hold part-time jobs. Anyone who worked at least one hour during the month is counted as employed part-time.

***

Today, I visualize the differences between men and women.

The first scatter plot shows the situation for men:

Junkcharts_unemployment_scatter_men

This plot reveals a long-term structural problem for the U.S. economy. Regardless of the overall economic health, more and more men have been declared not in labor force each year. Between 2007, the start of the Great Recession to 2019, the proportion went up from 27% to 31%, and the pandemic has pushed this to almost 34%. As mentioned in the last post, this sharp rise in April raises concern that the criteria for "not in labor force" capture a lot of people who actually want a job, and therefore should be counted as part of the labor force but unemployed.

Also, as seen in the last post, the severe drop in part-time workers is unprecedented during economic hardship. As dots turn from blue to red, they typically are moving right, meaning more part-time workers. Since the pandemic, among those people still employed, the proportion holding full-time jobs has paradoxically exploded.

***

The second scatter plot shows the situation with women:

Junkcharts_unemployment_scatter_women

Women have always faced a tougher job market. If they are employed, they are more likely to be holding part-time jobs relative to employed men; and a significantly larger proportion of women are not in the labor force. Between 1990 and 2001, more women entered the labor force. Just like men, the Great Recession resulted in a marked jump in the proportion out of labor force. Since 2014, a positive trend emerged, now interrupted by the pandemic, which has pushed both metrics to levels never seen before.

The same story persists: the sharp rise in women "not in labor force" exposes a problem with this statistic - as it apparently includes people who do want to work, not as intended. In addition, unlike the pattern in the last 30 years, the severe economic crisis is coupled with a shift toward full-time employment, indicating that part-time jobs were disappearing much faster than full-time jobs.


The missing 100 million: how the pandemic reveals the fallacy of not in labor force

Last Friday, the U.S. published the long-feared employment situation report. It should come as no surprise to anyone since U.S. businesses were quick to lay off employees since much of the economy was shut down to abate the spread of the coronavirus.

Numbersense_coverI've been following employment statistics for a while. Chapter 6 of Numbersense (link) addresses the statistical aspects of how the unemployment rate is computed. The title of the chapter is "Are they new jobs when no one can apply?" What you learn is that the final number being published starts off as survey tallies, which then undergo a variety of statistical adjustments.

One such adjustment - which ought to be controversial - results in the disappearance of 100 million Americans. I mean, that they are invisible to the Bureau of Labor Statistics (BLS), considered neither employed nor unemployed. You don't hear about them because the media report the "headline" unemployment rate, which excludes these people. They are officially designated "not in the labor force". I'll come back to this topic later in the post.

***

Last year, I used a pair of charts to visualize the unemployment statistics. I have updated the charts to include all of 2019 and 2020 up to April, the just released numbers.

The first chart shows the trend in the official unemployment rate ("U3") from 1990 to present. It's color-coded so that the periods of high unemployment are red, and the periods of low unemployment are blue. This color code will come in handy for the next chart.

Junkcharts_kfung_unemployment_apr20

The time series is smoothed. However, I had to exclude the April 2020 outlier from the smoother.

The next plot, a scatter plot, highlights two of the more debatable definitions used by the BLS. On the horizontal axis, I plot the proportion of employed people who have part-time jobs. People only need to have worked one hour in a month to be counted as employed. On the vertical axis, I plot the proportion of the population who are labeled "not in labor force". These are people who are not employed and not counted in the unemployment rate.

Junkcharts_kfung_unemployment_apr20_2

The value of data visualization is its ability to reveal insights about the data. I'm happy to report that this design succeeded.

Previously, we learned that (a) part-timers as a proportion of employment tend to increase during periods of worsening unemployment (red dots moving right) while decreasing during periods of improving employment (blue dots moving left); and (b) despite the overall unemployment rate being about the same in 2007 and 2017, the employment situation was vastly different in the sense that the labor force has shrunk significantly during the recession and never returned to normal. These two insights are still found at the bottom right corner of the chart. The 2019 situation did not differ much from 2018.

What is the effect of the current Covid-19 pandemic?

On both dimensions, we have broken records since 1990. The proportion of people designated not in labor force was already the worst in three decades before the pandemic, and now it has almost reached 40 percent of the population!

Remember these people are invisible to the media, neither employed nor unemployed. Back in February 2020, with unemployment rate at around 4 percent, it's absolutely not the case that 96 pecent of the employment-age population was employed. The number of employed Americans was just under 160 million. The population 16 years and older at the time was 260 million.

Who are these 100 million people? BLS says all but 2 million of these are people who "do not want a job". Some of them are retired. There are about 50 million Americans above 65 years old although 25 percent of them are still in the labor force, so only 38 million are "not in labor force," according to this Census report.

It would seem like the majority of these people don't want to work, are not paid enough to work, etc. Since part-time workers are counted as employed, with as little as one working hour per month, these are not the gig workers, not Uber/Lyft drivers, and not college students who has work-study or part-time jobs.

This category has long been suspect, and what happened in April isn't going to help build its case. There is no reason why the "not in labor force" group should spike immediately as a result of the pandemic. It's not plausible to argue that people who lost their jobs in the last few weeks suddenly turned into people who "do not want a job". I think this spike is solid evidence that the unemployed have been hiding inside the not in labor force number.

The unemployment rate has under-reported unemployment because many of the unemployed have been taken out of the labor force based on BLS criteria. The recovery of jobs since the Great Recession is partially nullified since the jump in "not in labor force" never returned to the prior level.

***

The other dimension, part-time employment, also showed a striking divergence from the past behavior. Typically, when the unemployment rate deteriorates, the proportion of employed people who have part-time jobs increases. However, in the current situation, not only is that not happening, but the proportion of part-timers plunged to a level not seen in the last 30 years.

This suggests that employers are getting rid of their part-time work force first.

 

 


Graphing the extreme

The Covid-19 pandemic has brought about extremes. So many events have never happened before. I doubt The Conference Board has previously seen the collapse of confidence in the economy by CEOs. Here is their graphic showing this extreme event:

Tcb_COVID-19-CEO-confidence-1170

To appreciate this effort, you have to see the complexity of the underlying data. There is a CEO Confidence Measure. The measure has three components. Each component is scored on a scale probably from 0 to 100, with 5o as the middle. Then, the components are aggregated into an overall score. The measure is repeatedly estimated over time, and they did two surveys during the Pandemic, pre and post the lockdown in the U.S. And then, there's the rightmost column, which provides another reference point for one of the components of the measure.

One can easily get one's limbs tied up in knots trying to tame this beast.

Of course, the tiny square stands out. CEOs have a super pessimistic outlook for the next 6 months for overall economy. The number 3 on this scale probably means almost every respondent has a negative view. 

The grid arrangement does not appear attractive but it is terrifically functional. The grid delivers horizontal and vertical comparisons. Moving vertically, we learn that even at the start of the year, the average sentiment was negative (9 points below 50), then it lost another 10 points, and finally imploded.

Moving horizontally, we can compare related metrics since everything is conveniently expressed in the same scale. While CEOs are depressed about the overall economy, they have slightly more faith about their own industry. And then moving left, we learn that many CEOs expect a V-shaped recovery, a really fast bounceback within 6 months. 

As the Conference Board surveys this group again in the near future, I wonder if the optimism still holds. 

The Conference Board has an entire set of graphics about the economic crisis of Covid-19 here. For some reason, they don't let me link to a specific chart so I can't directly link to the chart. 
 


An exposed seam in the crystal ball of coronavirus recovery

One of the questions being asked by the business community is when the economy will recover and how. The Conference Board has offered their outlook in this new article. (This link takes you to the collection of Covid-19 related graphics. You have to find the right one from the carousel. I can't seem to find the direct link to that page.)

This chart summarizes their viewpoint:

TCB-COVID-19-US-level-of-GDP-1170

They considered three scenarios, starting the recovery in May, over the summer, and in the Fall. In all scenarios, the GDP of the U.S. will contract in 2020 relative to 2019. The faster the start of the recovery, the lower the decline.

My reaction to the map icon is different from the oil-drop icon in the previously-discussed chart (link). I think here, the icon steals too much attention. The way lines were placed on the map initially made me think the chart is about cross-country travel.

On the other hand, I love the way he did the horizontal axis / time-line. It elegantly tells us which numbers are actual and which numbers are projected, without explicitly saying so.

Tcb_timelineaxis

Also notice through the use of color, font size and bolding, he organizes the layers of detail, and conveys which items are more important to read first.

***

Trifectacheckup_imageAs I round out the Trifecta Checkup, I found a seam in the Data.

On the right edge, the number for December 2020 is 100.6 which is 0.6 above the reference level. But this number corresponds to a 1.6% reduction. How so?

This seam exposes a gap between how modelers and decision-makers see the world. Evidently, the projections by the analyst are generated using Q3 2019's GDP as baseline (index=100). I'm guessing the analyst chose that quarter because at the time of analysis, the Q4 data have not reached the final round of revision (which came out at the end of March).

A straight-off-the-report conclusion of the analysis is that the GDP would be just back to Q3 2019 level by December 2020 in the most optimistic scenario. (It's clear to me that the data series has been seasonally adjusted as well so that we can compare any month to any month. Years ago, I wrote this primer to understand seasonal adjustments.)

Decision-makers might push back on that conclusion because the reference level of Q3 2019 seems arbitrary. Instead, what they like to know is the year-on-year change to GDP. A small calculation is completed to bridge between the two numbers.

The decision-makers are satisfied after finding the numbers they care about. They are not curious about how the sausage is made, i.e., how the monthly numbers result in the year-on-year change. So the seam is left on the chart.

 


Graphing the economic crisis of coronavirus 2

Last week, I discussed Ray's chart that compares the S&P 500 performance in this crisis against previous crises.

A reminder:

Tcb_stockmarketindices_fourcrises

Another useful feature is the halo around the right edge of the COVID-19 line. This device directs our eyes to where he wants us to look.

In the same series, he made the following for The Conference Board (link):

TCB-COVID-19-impact-oil-prices-640

Two things I learned from this chart:

The oil market takes a much longer time to recover after crises, compared to the S&P. None of these lines reached above 100 in the first 150 days (5 months).

Just like the S&P, the current crisis is most similar in severity to the 2008 Great Recession, only worse, and currently, the price collapse in oil is quite a bit worse than in 2008.

***
The drop of oil is going to be contentious. This is a drop too many for a Tufte purist. It might as well symbolize a tear shed.

The presence of the icon tells me these lines depict the oil market without having to read text. And I approve.