The elusive meaning of black paintings and red blocks

Joe N, a longtime reader, tweeted about the following chart, by the People's Policy Project:

3p_oneyearinonemonth_laborflow

This is a simple column chart containing only two numbers, far exceeded by the count of labels and gridlines.

I look at charts like the lady staring at these Ad Reinhardts:

SUBJPREINHARDT2-videoSixteenByNine1050

My artist friends say the black squares are not the same, if you look hard enough.

Here is what I learned after one such sitting:

The tiny data labels sitting on the inside top edges of the columns hint that the right block is slightly larger than the left block.

The five labels of the vertical axis serve no purpose, nor the gridlines.

The horizontal axis for time is reversed, with 2019 appearing after 2020 (when read left to right).

The left block has one month while the right block has 12 months. This is further confused by the word "All", which shares the same starting and ending letters as "April".

As far as I can tell, the key message of this chart is that the month of April has the impact of a full year. It's like 12 months of outflows from employment hitting the economy in one month.

***

My first response is this chart:

Junkcharts_oneyearinonemonth_laborflow_1

Breaking the left block into 12 pieces, and color-coding the April piece brings out the comparison. You can also see that in 2019, the outflows from employment to unemployment were steady month to month.

Next, I want to see what happens if I restore the omitted months of January to March 2020.

Junkcharts_oneyearinonemonth_laborflow_2

The story changes slightly. Now, the chart says that the first four months have already exceeded the full year of 2019.

Since the values hold steady month to month, with the exception of April 2020, I make a monthly view:

Junkcharts_oneyearinonemonth_laborflow_monthly_bar_1

You can see the slight nudge-up in March 2020 as well. This draws more attention to the break in pattern.

For time-series data, I prefer to look at line charts:

Junkcharts_oneyearinonemonth_laborflow_monthly_line_1

As I explained in this post about employment statistics (or Chapter 6 of Numbersense (link)), the Bureau of Labor Statistics classifies people into three categories: Employed, Unemployed and Not in Labor Force. Exits from Employed to Unemployed status contribute to unemployment in the U.S. To depict a negative trend, it's often natural to use negative numbers:

Junkcharts_oneyearinonemonth_laborflow_monthly_line_neg_1

You may realize that this data series paints only a partial picture of the health of the labor market. While some people exit the Employed status each month, there are others who re-enter or enter the Employed status. We should really care about net flows.

Junkcharts_oneyearinonemonth_laborflow_net_lines

In all of 2019, there were more entrants than exits, leading to a slightly positive net inflow to the Employed status from Unemployed (blue line). In April 2020, the red line (exits) drags the blue line down dramatically.

Of course, even this chart is omitting important information. There are also flows from Employed to and from Not in Labor Force.
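The net flow shown by the blue line is just entries minus exits, month by month. Here is a minimal sketch of that subtraction in Python. The function name and the sample counts are made up for illustration; they are not the BLS figures behind the charts.

```python
def net_flows(entries, exits):
    """Return entries minus exits for each month (positive = net inflow to Employed)."""
    if len(entries) != len(exits):
        raise ValueError("entries and exits must cover the same months")
    return [e - x for e, x in zip(entries, exits)]


# Hypothetical monthly counts in thousands, NOT the actual BLS data.
entries = [2000, 2100, 1500]
exits = [1900, 2000, 7000]
print(net_flows(entries, exits))
```

A month in which exits dwarf entries shows up as a large negative number, which is the pattern of April 2020. A fuller picture would apply the same subtraction to the flows between Employed and Not in Labor Force.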


Hope and reality in one Georgia chart

Over the weekend, Georgia's State Health Department agitated a lot of people when it published the following chart:

Georgia_top5counties_covid19

(This might have appeared a week ago as the last date on the chart is May 9 and the title refers to "past 15 days".)

They could have avoided the embarrassment if they had read my article at DataJournalism.com (link). In that article, I lay out a set of "unspoken conventions," things that visual designers are, or should be, doing more or less in their sleep. Under the section titled "Order", I explain the following two "rules":

  • Place values in the natural order when it is available
  • Retain the same order across all plots in a panel of charts

In the chart above, the natural order for the horizontal (time) axis is time running left to right. The order chosen by the designer is roughly, but not precisely, by decreasing height of the tallest column in each daily group. Many observers suggested that the columns were arranged to give the appearance of cases dropping over time.

Within each day, the counties are ordered by decreasing number of new cases. The title of the chart reads "number of cases over time", which sounds like cumulative cases, but it's not. The "lead" changed hands many times over the 15 days, meaning the data sequence was extremely noisy, which would be unlikely for cumulative cases. There are thousands of cases in each of these counties by May. Switching the order of the columns within each daily group defeats the purpose of placing these groups side by side.

Responding to the bad press, the department changed the chart design for this week's version:

Georgia_top5counties_covid19_revised

This chart now conforms to the two spoken rules described above. The time axis runs left to right, and within each group of columns, the order of the counties is maintained.

The chart is still very noisy, with no apparent message.
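The two ordering rules can be expressed as a sort key: days in calendar order, and within each day, counties in one fixed order. The sketch below is my own illustration, not the department's code; the function name, the row layout and the county labels are all assumptions.

```python
from datetime import date


def order_for_plot(records, county_order):
    """Sort (day, county, new_cases) rows by day, then by a fixed county order."""
    rank = {county: i for i, county in enumerate(county_order)}
    return sorted(records, key=lambda r: (r[0], rank[r[1]]))


# Hypothetical rows; "A" and "B" stand in for county names.
rows = [
    (date(2020, 5, 3), "B", 5),
    (date(2020, 5, 2), "B", 9),
    (date(2020, 5, 2), "A", 4),
    (date(2020, 5, 3), "A", 7),
]
for row in order_for_plot(rows, ["A", "B"]):
    print(row)
```

Because the county rank never depends on the case counts, each county keeps its position in every daily group, no matter which one "leads" that day.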

***

Next, I'd like to draw your attention to a Data issue. Notice that the 15-day window has shifted. This revised chart runs from May 2 to May 16, which is this past Saturday. The previous chart ran from Apr 26 to May 9. 

Here's the data for May 8 and 9 placed side by side.

Junkcharts_georgia_covid19_cases

There is a clear time lag in the reporting of cases in the State of Georgia. This chart should always exclude the last few days. The case counts for a given day keep going up until they stabilize. The same mistake occurs in the revised chart: the last two days appear as if new cases have dwindled toward zero when, in fact, the drop reflects a lag in reporting.
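Excluding the last few days amounts to trimming the tail of the series before plotting. A minimal sketch follows, assuming the daily counts are already sorted by date. The name `lag_days` is my own parameter; how many days to drop has to be learned from how long Georgia's counts take to settle, which this post does not pin down.

```python
def drop_immature_days(daily_counts, lag_days):
    """Return the series without the most recent lag_days entries."""
    if lag_days < 0:
        raise ValueError("lag_days must be non-negative")
    if lag_days == 0:
        return list(daily_counts)
    return list(daily_counts[:-lag_days])


# Hypothetical counts, oldest day first; the last two days are still incomplete.
counts = [40, 38, 41, 12, 3]
print(drop_immature_days(counts, 2))
```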

The disconnect between the Question being posed and the quality of the Data available dooms this visualization. It is not possible to provide a reliable assessment of the "past 15 days" when during perhaps half of that period, the cases are under-counted.

***

Nyt_tryingtobefashionable

This graphical distortion due to "immature" data has become very commonplace in Covid-19 graphics. It's similar to placing partial-year data next to full-year results, without calling out the partial data.

The following post from the ancient past (2005!) about a New York Times graphic shows that calling out this data problem does not actually solve it. It's a less-bad kind of thing.

The coronavirus data present more headaches for graphic designers than the financial statistics. Because of accounting regulations, we know that only the current quarter's data are immature. For Covid-19 reporting, the numbers are being adjusted for days and weeks.

Practically all immature counts are under-estimates. Over time, more cases are reported. Thus, any plots over time - if unadjusted - paint a misleading picture of declining counts. The effect of the reporting lag is predictable, having a larger impact as we run from left to right in time. Thus, even if the most recent data show a downward trend, it can eventually mean anything: down, flat or up. This is not random noise though - we know for certain of the downward bias; we just don't know the magnitude of the distortion for a while.
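To see why the bias always points downward, here is a toy simulation in Python. It assumes the true daily count is flat, and that only a fraction of a day's cases have been reported by the time the chart is made, with the fraction growing as the day ages. The completeness percentages are invented for illustration and are not estimates of any state's reporting lag.

```python
def reported_as_of_last_day(true_counts, completeness_pct):
    """Counts visible on the last day, given percent reported by age in days."""
    n = len(true_counts)
    reported = []
    for i, true in enumerate(true_counts):
        age = n - 1 - i  # days elapsed since day i, as of the last day
        # Days older than the table are treated as fully reported.
        pct = completeness_pct[age] if age < len(completeness_pct) else 100
        reported.append(true * pct // 100)
    return reported


# Assumed: 20% reported on the day itself, 50% a day later, 80% two days later.
flat_truth = [100] * 6
print(reported_as_of_last_day(flat_truth, [20, 50, 80]))
```

Even though nothing changes in the underlying series, the most recent days look like a steep decline. Rerunning the same chart a few days later would lift those columns, which is the behavior seen in the May 8 and 9 comparison.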

Another issue that concerns coronavirus reporting but not financial reporting is inconsistent standards across counties. Within a business, if one were to break out statistics by county, the analysts would naturally apply the same counting rules. For Covid-19 data, each county follows its own set of rules, not just for how to count things but also for how to conduct testing, and so on.

Finally, with the politics of re-opening, I find it hard to trust the data. Reported cases are human-driven data. By changing the number of tests, by testing different mixes of people, by delaying reporting, by timing the revision of older data, by explicit manipulation, and so on, the numbers can be tortured into any shape. That's why it is extremely important that the bean-counters are civil servants, and that politicians are kept away. In the current political environment, that separation between politics and statistics has been breached.

***

Why do we have low-quality data? Human decisions, frequently political decisions, adulterate the data. Epidemiologists are then forced to use the bad data, because that's what they have. Bad data lead to bad predictions and bad decisions, or if the scientists account for the low quality, predictions with high levels of uncertainty. Then, the politicians complain that predictions are wrong, or too wide-ranging to be useful. If they really cared about those predictions, they could start by being more transparent about reporting and more proactive at discovering and removing bad accounting practices. The fact that they aren't focused on improving the data gives the game away. Here's a recent post on the politics of data.

 


How the pandemic affected employment of men and women

In the last post, I looked at the overall employment situation in the U.S. Here is the trend of the "official" unemployment rate since 1990.

Junkcharts_kfung_unemployment_apr20

I was talking about the missing 100 million. These are people who are neither employed nor unemployed in the eyes of the Bureau of Labor Statistics (BLS). They are simply unrepresented in the numbers shown in the chart above.

This group is visualized in my scatter plot as "not in labor force", as a percent of the employment-age population. The horizontal axis of this scatter plot shows the proportion of employed people who hold part-time jobs. Anyone who worked at least one hour during the month is counted as employed part-time.

***

Today, I visualize the differences between men and women.

The first scatter plot shows the situation for men:

Junkcharts_unemployment_scatter_men

This plot reveals a long-term structural problem for the U.S. economy. Regardless of the overall economic health, more and more men have been declared not in labor force each year. Between 2007, the start of the Great Recession, and 2019, the proportion went up from 27% to 31%, and the pandemic has pushed this to almost 34%. As mentioned in the last post, this sharp rise in April raises concern that the criteria for "not in labor force" capture a lot of people who actually want a job, and therefore should be counted as part of the labor force but unemployed.

Also, as seen in the last post, the severe drop in part-time workers is unprecedented during economic hardship. As dots turn from blue to red, they typically are moving right, meaning more part-time workers. Since the pandemic, among those people still employed, the proportion holding full-time jobs has paradoxically exploded.

***

The second scatter plot shows the situation with women:

Junkcharts_unemployment_scatter_women

Women have always faced a tougher job market. If they are employed, they are more likely to be holding part-time jobs relative to employed men; and a significantly larger proportion of women are not in the labor force. Between 1990 and 2001, more women entered the labor force. Just like men, the Great Recession resulted in a marked jump in the proportion out of labor force. Since 2014, a positive trend emerged, now interrupted by the pandemic, which has pushed both metrics to levels never seen before.

The same story persists: the sharp rise in women "not in labor force" exposes a problem with this statistic, as it apparently includes people who do want to work, contrary to its intent. In addition, unlike the pattern in the last 30 years, the severe economic crisis is coupled with a shift toward full-time employment, indicating that part-time jobs were disappearing much faster than full-time jobs.


The missing 100 million: how the pandemic reveals the fallacy of not in labor force

Last Friday, the U.S. published the long-feared employment situation report. It should come as no surprise to anyone: U.S. businesses were quick to lay off employees once much of the economy was shut down to abate the spread of the coronavirus.

Numbersense_cover

I've been following employment statistics for a while. Chapter 6 of Numbersense (link) addresses the statistical aspects of how the unemployment rate is computed. The title of the chapter is "Are they new jobs when no one can apply?" What you learn is that the final number being published starts off as survey tallies, which then undergo a variety of statistical adjustments.

One such adjustment - which ought to be controversial - results in the disappearance of 100 million Americans. I mean, that they are invisible to the Bureau of Labor Statistics (BLS), considered neither employed nor unemployed. You don't hear about them because the media report the "headline" unemployment rate, which excludes these people. They are officially designated "not in the labor force". I'll come back to this topic later in the post.

***

Last year, I used a pair of charts to visualize the unemployment statistics. I have updated the charts to include all of 2019 and 2020 up to April, the just released numbers.

The first chart shows the trend in the official unemployment rate ("U3") from 1990 to present. It's color-coded so that the periods of high unemployment are red, and the periods of low unemployment are blue. This color code will come in handy for the next chart.

Junkcharts_kfung_unemployment_apr20

The time series is smoothed. However, I had to exclude the April 2020 outlier from the smoother.

The next plot, a scatter plot, highlights two of the more debatable definitions used by the BLS. On the horizontal axis, I plot the proportion of employed people who have part-time jobs. People only need to have worked one hour in a month to be counted as employed. On the vertical axis, I plot the proportion of the population who are labeled "not in labor force". These are people who are not employed and not counted in the unemployment rate.

Junkcharts_kfung_unemployment_apr20_2
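Each dot on this scatter plot is a pair of ratios for one month. As a rough sketch in Python, under the definitions given above (the function and argument names are mine, and the inputs in the example are round placeholder numbers, not BLS data):

```python
def scatter_coordinates(part_time, employed, not_in_labor_force, population):
    """Return (x, y) for one month: part-time share and not-in-labor-force share."""
    x = part_time / employed  # share of employed people holding part-time jobs
    y = not_in_labor_force / population  # share of the population outside the labor force
    return x, y


# Placeholder counts in millions, for illustration only.
print(scatter_coordinates(part_time=25, employed=100, not_in_labor_force=50, population=200))
```

Note that the two ratios have different denominators: the horizontal axis is relative to employed people only, while the vertical axis is relative to the whole population.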

The value of data visualization is its ability to reveal insights about the data. I'm happy to report that this design succeeded.

Previously, we learned that (a) part-timers as a proportion of employment tend to increase during periods of worsening unemployment (red dots moving right) while decreasing during periods of improving employment (blue dots moving left); and (b) despite the overall unemployment rate being about the same in 2007 and 2017, the employment situation was vastly different in the sense that the labor force had shrunk significantly during the recession and never returned to normal. These two insights are still found at the bottom right corner of the chart. The 2019 situation did not differ much from 2018.

What is the effect of the current Covid-19 pandemic?

On both dimensions, we have broken records since 1990. The proportion of people designated not in labor force was already the worst in three decades before the pandemic, and now it has almost reached 40 percent of the population!

Remember these people are invisible to the media, neither employed nor unemployed. Back in February 2020, with the unemployment rate at around 4 percent, it's absolutely not the case that 96 percent of the employment-age population was employed. The number of employed Americans was just under 160 million. The population 16 years and older at the time was 260 million.

Who are these 100 million people? BLS says all but 2 million of these are people who "do not want a job". Some of them are retired. There are about 50 million Americans above 65 years old although 25 percent of them are still in the labor force, so only 38 million are "not in labor force," according to this Census report.
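The arithmetic behind these figures is simple enough to check by hand. The sketch below uses only the rounded numbers quoted in this post; the function names are mine. Keep in mind that the gap between population and employment also includes the officially unemployed, so it is an upper bound on the not-in-labor-force count.

```python
def not_employed_millions(population, employed):
    """People 16 and older who are not counted as employed, in millions."""
    return population - employed


def seniors_outside_labor_force(seniors, pct_in_labor_force):
    """Seniors not in the labor force, given the percent who still are."""
    return seniors * (100 - pct_in_labor_force) / 100


gap = not_employed_millions(260, 160)  # rounded figures from the text
employed_share = 160 / 260  # far below the 96 percent a naive reading suggests
seniors_out = seniors_outside_labor_force(50, 25)
print(gap, round(employed_share, 3), seniors_out)
```

With these rounded inputs, the senior count comes out at 37.5 million, in line with the 38 million cited above.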

It would seem like the majority of these people don't want to work, are not paid enough to work, etc. Since part-time workers are counted as employed, with as little as one working hour per month, these are not the gig workers, not Uber/Lyft drivers, and not college students who have work-study or part-time jobs.

This category has long been suspect, and what happened in April isn't going to help build its case. There is no reason why the "not in labor force" group should spike immediately as a result of the pandemic. It's not plausible to argue that people who lost their jobs in the last few weeks suddenly turned into people who "do not want a job". I think this spike is solid evidence that the unemployed have been hiding inside the not in labor force number.

The unemployment rate has under-reported unemployment because many of the unemployed have been taken out of the labor force based on BLS criteria. The recovery of jobs since the Great Recession is partially nullified since the jump in "not in labor force" never returned to the prior level.

***

The other dimension, part-time employment, also showed a striking divergence from the past behavior. Typically, when the unemployment rate deteriorates, the proportion of employed people who have part-time jobs increases. However, in the current situation, not only is that not happening, but the proportion of part-timers plunged to a level not seen in the last 30 years.

This suggests that employers are getting rid of their part-time work force first.