Working with multiple dimensions, an example from Germany

An anonymous reader submitted this mirrored bar chart about violent acts by extremists in the 16 German states.

Germanextremists_bars

At first glance, this looks like a standard design. On a second look, you might notice what the reader discovered: the chart uses two different scales, one for each side. The left side (red) depicting left-wing extremism is artificially compressed relative to the right side (blue). I'm not sure if this reflects the political bias of the publication, but in any case, this distortion means the only way to consume this chart is to read the numbers.
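To make the distortion concrete, here is a minimal sketch of how bar lengths come out under two scales versus one shared scale. The counts, the scale maxima and the `bar_length` helper are all hypothetical, chosen only for illustration; they are not read off the original chart.

```python
def bar_length(value, scale_max, max_length=200):
    """Length of a bar (say, in pixels) on a linear scale running from 0 to scale_max."""
    return value / scale_max * max_length

# Hypothetical counts for one state; these are not the numbers on the chart.
left_wing, right_wing = 40, 80

# Two scales: the left side is drawn against a larger maximum, which compresses it.
left_distorted = bar_length(left_wing, scale_max=160)
right_distorted = bar_length(right_wing, scale_max=80)

# One shared scale for both sides.
shared_max = max(left_wing, right_wing)
left_fair = bar_length(left_wing, scale_max=shared_max)
right_fair = bar_length(right_wing, scale_max=shared_max)

print(left_wing / right_wing)            # ratio in the data
print(left_distorted / right_distorted)  # ratio the eye sees with two scales
print(left_fair / right_fair)            # ratio the eye sees with one scale
```

With these made-up inputs, the data ratio is one half, but the two-scale version draws the left bar at a quarter of the right bar's length. Only the shared scale keeps the visual ratio equal to the data ratio.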

Even after fixing the scales, this design is challenging for the reader. It's unnatural to compare two years by looking first below then above. It's not simple to compare across states, and even harder to compare left- and right-wing extremism (due to mirroring).

The chart feels busy because the entire dataset is printed on it. I appreciate not including a redundant horizontal axis. (I wonder if the designer first removed the axis, then edited the scale on one side, not realizing the distortion.) Another nice touch, hidden in the legend, is the country totals.

I present two alternatives.

The first is a small-multiples "bumps chart".

Redo_junkcharts_germanextremists_sidebysidelines

Each plot presents the entire picture within a state. You can see the general level of violence, the level of left- and right-wing extremism, and their year-on-year change. States can be compared holistically.

Several German state names are rather long, so I explored a horizontal orientation. In this case, a connected dot plot may be more appropriate.

Redo_junkcharts_germanextremists_dots

The sign of a good multi-dimensional visual display is whether readers can easily learn complex relationships. Depending on the question of interest, the reader can mentally elevate parts of this chart. One can compare the set of blue arrows to the set of red arrows, or focus on just blue arrows pointing right, or red arrows pointing left, or all arrows for Berlin, etc.

 

[P.S. Anonymous reader said the original chart came from the Augsburger newspaper. This link in German contains more information.]


What is the price for objectivity

I knew I had to remake this chart.

TMC_hospitalizations

The simple message of this chart is hidden behind layers of visual complexity. What the analyst wants readers to focus on (as discerned from the text on the right) is the red line, the seven-day moving average of new hospital admissions due to Covid-19 in Texas.

My eyes kept wandering away from the line. It's the sideways data labels on the columns. It's the columns that take up vastly more space than the red line. It's the sideways date labels on the horizontal axis. It's the redundant axis labels for hospitalizations when the entire data set has already been printed. It's the two hanging diamonds, for which the clues are filed away in the legend above.

Here's a version that brings out the message: after Phase 2 re-opening, the number of hospital admissions has been rising steadily.

Redo_junkcharts_texas_covidhospitaladmissions_1

Dots are used in place of columns, which push these details to the background. The line as well as periods of re-opening are directly labeled, removing the need for a legend.

Here's another visualization:

Redo_junkcharts_texas_covidhospitaladmissions_2

This chart plots the weekly average new hospital admissions, instead of the seven-day moving average. In the previous chart, the raggedness of the moving average isn't transmitting any useful information to the average reader. I believe this weekly average metric is easier to grasp for many readers while retaining the general story.

***

On the original chart by TMC, the author said "the daily hospitalization trend shows an objective view of how COVID-19 impacts hospital systems." Objectivity is an impossible standard for any kind of data analysis or visualization. As seen above, the two metrics for measuring the trend in hospitalizations have pros and cons. Even if one insists on using a moving average, there are choices of averaging methods and window sizes.
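The two metrics discussed above differ only in how the days are grouped, which a short sketch can show. This is an illustration under stated assumptions: the daily numbers are made up (not the TMC data), the moving average is taken to be a trailing one, and the function names are my own.

```python
def moving_average(values, window=7):
    """Trailing moving average: one value per day once a full window is available."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

def weekly_average(values, week_length=7):
    """Non-overlapping averages: one value per complete week."""
    n_weeks = len(values) // week_length
    return [sum(values[w * week_length:(w + 1) * week_length]) / week_length
            for w in range(n_weeks)]

# Made-up daily admissions for three weeks (illustrative only).
daily = [10, 12, 9, 14, 11, 13, 15,
         16, 14, 18, 17, 20, 19, 22,
         21, 25, 23, 27, 26, 30, 30]

seven_day = moving_average(daily)            # one point per day from day 7 onward
three_day = moving_average(daily, window=3)  # a different, equally defensible window
weekly = weekly_average(daily)               # one point per week

print(len(seven_day), len(three_day), len(weekly))
print(weekly)
```

The weekly series has three points where the seven-day moving average has fifteen, and changing the window changes the picture again. Each of these is a choice the analyst makes, not something the data dictates.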

Scientists are trained to believe in objectivity. It frequently disappoints when we discover that the rest of the world harbors no such notion. If you observe debates between politicians or businesspeople or social scientists, you rarely hear anyone claim one analysis is more objective - or less subjective - than another. The economist who predicts the Dow will reach a new record, the business manager who argues for placing discounted products in the front not the back of the store, the sportscaster who maintains Messi is a better player than Ronaldo: do you ever hear these people describe their methods as objective?

Pursuing objectivity leads to the glorification of data dumps. The scientist proclaims disinterest in holding an opinion about the data. This is self-deception though. We clearly have opinions because when someone else "misinterprets" the data, we express dismay. What is the point of pretending to hold no opinions when most of the world trades in opinions? By being "objective," we never shape the conversation, and forever play defense.


Visualizing black unemployment in the U.S.

In a prior post, I explained how the aggregate unemployment rate paints a misleading picture of the employment situation in the United States. Even though the U3 unemployment rate in 2019 has returned to the lowest level we have seen in decades, the aggregate statistic hides some concerning trends. There is an alarming rise in the proportion of people considered "not in labor force" by the Bureau of Labor Statistics - these forgotten people are not counted as "employable": when a worker drops out of the labor force, the unemployment rate ironically improves.
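The irony is easy to verify with the definition of the rate, unemployed divided by labor force, where the labor force is employed plus unemployed. Here is a minimal sketch; the counts are hypothetical round numbers in millions, not BLS figures.

```python
def unemployment_rate(employed, unemployed):
    """Headline-style rate: unemployed as a share of the labor force."""
    return unemployed / (employed + unemployed)

# Hypothetical counts in millions, chosen for round numbers.
employed, unemployed, not_in_labor_force = 95.0, 5.0, 50.0

before = unemployment_rate(employed, unemployed)  # 5 / 100

# One million unemployed people are reclassified as "not in labor force".
moved = 1.0
unemployed -= moved
not_in_labor_force += moved

after = unemployment_rate(employed, unemployed)   # 4 / 99

print(f"before: {before:.2%}, after: {after:.2%}")
```

Nobody found a job in this example, yet the rate falls from 5 percent to roughly 4 percent, because the people who left are removed from both the numerator and the denominator.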

In that post, I looked at the difference between men and women. This post will examine the racial divide, whites and blacks.

I did not anticipate how many obstacles I'd encounter. It's hard to locate a specific data series, and it's harder to know whether the lack of search results indicates the non-existence of the data, or the incompetence of the search engine. Race-related data tend not to be offered in as much granularity. I was only able to find quarterly data for the racial analysis while I had monthly data for the gender analysis. Also, I only have data from 2000, instead of 1990.

***

As before, I looked at the official unemployment rate first, this time presented by race. Because whites form the majority of the labor force, the overall unemployment rate (not shown) is roughly the same as that for whites, just pulled up slightly toward the line for blacks.
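The reason the overall rate hugs the line for whites is that it is a weighted average, with each group weighted by its share of the labor force. A rough sketch follows. The labor force sizes and rates are hypothetical, and the example pretends there are only two groups, which is a simplification.

```python
def overall_rate(groups):
    """Weighted average of group rates.

    groups: list of (labor_force_size, unemployment_rate) pairs.
    """
    total_labor_force = sum(size for size, _ in groups)
    return sum(size * rate for size, rate in groups) / total_labor_force

# Hypothetical inputs: labor force in millions, rates as fractions.
white = (125.0, 0.04)
black = (20.0, 0.08)

combined = overall_rate([white, black])
print(f"{combined:.2%}")
```

With these made-up inputs the combined rate lands between the two group rates but much nearer to the larger group's 4 percent than to the smaller group's 8 percent.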

Jc_unemploybyrace

The racial divide is clear as day. Throughout the past two decades, black Americans are much more likely to be unemployed, and worse during recessions.

The above chart determines the color encoding for all the other graphics. Notice that the best employment situations occurred on either end of this period, right before the dotcom bust in 2000, and in 2019 before the Covid-19 pandemic. As explained before, despite the headline unemployment rate being the same in those years, the employment situation was not the same.

***

Here is the scatter plot for white Americans:

Jc_unemploybyrace_scatter_whites

Even though both ends of the trajectory are marked with the same shade of blue, indicating almost identical (low) rates of unemployment, we find that the trajectory has failed to return to its starting point after veering off course during the recession of the early 2010s. While the proportion of part-time workers (counted as employed) returned to 17.5% in 2019, as in 2000, about 15 percent more whites are now excluded from the unemployment rate calculation.

The experience of black Americans appears different:

Jc_unemploybyrace_scatter_blacks

During the first decade, the proportion of black Americans dropping out of the labor force accelerated while among those considered employed, the proportion holding part-time jobs kept increasing. As the U.S. recovered from the Great Recession, we've seen a boomerang pattern. By 2019, the situation was halfway back to 2000. The last available datum for the first quarter of 2020 is before Covid-19; it actually showed a halt of the boomerang.

If the pattern we saw in the prior post holds for the Covid-19 world, we would see a marked spike in the out-of-labor-force statistic, coupled with a drop in part-time employment. It appeared that employers were eliminating part-time workers first.

***

One reader asked about placing both patterns on the same chart. Here is an example of this:

Jc_unemploybyrace_scatter_both

This graphic turns out okay because the two strings of dots fit tightly into the grid while not overlapping. There is a lot going on here; I prefer a multi-step story to throwing everything at the wall.

There is one insight that this chart provides that is not easily observed in two separate plots. Over the two decades, the racial gap has narrowed in these two statistics. Both groups have traveled to the top right corner, which is the worst corner to reside in -- where more people are classified as not employable, and more of the employed are part-time workers.

The biggest challenge with making this combined scatter plot is properly controlling the color. I want the color to represent the overall unemployment rate, which is a third data series. I don't want the line for blacks to be all red, and the line for whites to be all blue, just because black Americans face a tough labor market always. The color scheme here facilitates cross-referencing time between the two dot strings.


Designs of two variables: map, dot plot, line chart, table

The New York Times found evidence that the richest segments of New Yorkers, presumably those with second or multiple homes, have exited the Big Apple during the early months of the pandemic. The article (link) is amply assisted by a variety of data graphics.

The first few charts represent different attempts to express the headline message. Their appearance in the same article allows us to assess the relative merits of different chart forms.

First up is the always-popular map.

Nytimes_newyorkersleft_overallmap

The advantage of a map is its ease of comprehension. We can immediately see which neighborhoods experienced the greater exoduses. Clearly, Manhattan has cleared out a lot more than outer boroughs.

The limitation of the map is also in view. With the color gradient dedicated to the proportions of residents gone on May 1st, there isn't room to express which neighborhoods are richer. We have to rely on outside knowledge to make the correlation ourselves.

The second attempt is a dot plot.

Nytimes_newyorksleft_percentathome

We may have to take a moment to digest the horizontal axis. It's not time moving left to right but income percentiles. The poorest neighborhoods are to the left and the richest to the right. I'm assuming that these percentiles describe the distribution of median incomes in neighborhoods. Typically, when we see income percentiles, they are based on households, regardless of neighborhoods. (The former are equal-sized segments, unlike the latter.)

This data graphic has the reverse features of the map. It does a great job correlating the drop in proportion of residents at home with the income distribution but it does not convey any spatial information. The message is clear: The residents in the top 10% of New York neighborhoods are much more likely to have left town.

In the following chart, I attempted a different labeling of both axes. It cuts out the need for readers to reverse being home to not being home, and 90th percentile to top 10%.

Redo_nyt_newyorkerslefttown

The third attempt to convey the income--exit relationship is the most successful in my mind. This is a line chart, with time on the horizontal axis.

Nyt_newyorkersleft_percenthomebyincome

The addition of lines relegates the dots to the background. The lines show the trend more clearly. If directly translated from the dot plot, this line chart should have 100 lines, one for each percentile. However, the closeness of the top two lines suggests that no meaningful difference in behavior exists between the 20th and 80th percentiles. This can be conveyed to readers through a short note. Instead of displaying all 100 percentiles, the line chart selectively includes only the 99th, 95th, 90th, 80th and 20th percentiles. This is a design choice that adds by subtraction.

Along the time axis, the line chart provides more granularity than either the map or the dot plot. The exit occurred roughly over the last two weeks of March and the first week of April. The start coincided with New York's stay-at-home advisory.

This third chart is a statistical graphic. It does not bring out the raw data but features aggregated and smoothed data designed to reveal a key message.

I encourage you to also study the annotated table later in the article. It shows the power of a well-designed table.

[P.S. 6/4/2020. On the book blog, I have just published a post about the underlying surveillance data for this type of analysis.]

 

 


How the pandemic affected employment of men and women

In the last post, I looked at the overall employment situation in the U.S. Here is the trend of the "official" unemployment rate since 1990.

Junkcharts_kfung_unemployment_apr20

I was talking about the missing 100 million. These are people who are neither employed nor unemployed in the eyes of the Bureau of Labor Statistics (BLS). They are simply unrepresented in the numbers shown in the chart above.

This group is visualized in my scatter plot as "not in labor force", as a percent of the employment-age population. The horizontal axis of this scatter plot shows the proportion of employed people who hold part-time jobs. Anyone who worked at least one hour during the month is counted as employed part-time.

***

Today, I visualize the differences between men and women.

The first scatter plot shows the situation for men:

Junkcharts_unemployment_scatter_men

This plot reveals a long-term structural problem for the U.S. economy. Regardless of the overall economic health, more and more men have been declared not in labor force each year. Between 2007, the start of the Great Recession, and 2019, the proportion went up from 27% to 31%, and the pandemic has pushed this to almost 34%. As mentioned in the last post, this sharp rise in April raises concern that the criteria for "not in labor force" capture a lot of people who actually want a job, and therefore should be counted as part of the labor force but unemployed.

Also, as seen in the last post, the severe drop in part-time workers is unprecedented during economic hardship. As dots turn from blue to red, they typically are moving right, meaning more part-time workers. Since the pandemic, among those people still employed, the proportion holding full-time jobs has paradoxically exploded.

***

The second scatter plot shows the situation with women:

Junkcharts_unemployment_scatter_women

Women have always faced a tougher job market. If they are employed, they are more likely to be holding part-time jobs relative to employed men; and a significantly larger proportion of women are not in the labor force. Between 1990 and 2001, more women entered the labor force. As with men, the Great Recession resulted in a marked jump in the proportion out of labor force. Since 2014, a positive trend emerged, now interrupted by the pandemic, which has pushed both metrics to levels never seen before.

The same story persists: the sharp rise in women "not in labor force" exposes a problem with this statistic - as it apparently includes people who do want to work, not as intended. In addition, unlike the pattern in the last 30 years, the severe economic crisis is coupled with a shift toward full-time employment, indicating that part-time jobs were disappearing much faster than full-time jobs.


The missing 100 million: how the pandemic reveals the fallacy of not in labor force

Last Friday, the U.S. published the long-feared employment situation report. It should come as no surprise to anyone, since U.S. businesses were quick to lay off employees once much of the economy was shut down to abate the spread of the coronavirus.

Numbersense_cover

I've been following employment statistics for a while. Chapter 6 of Numbersense (link) addresses the statistical aspects of how the unemployment rate is computed. The title of the chapter is "Are they new jobs when no one can apply?" What you learn is that the final number being published starts off as survey tallies, which then undergo a variety of statistical adjustments.

One such adjustment - which ought to be controversial - results in the disappearance of 100 million Americans. I mean, that they are invisible to the Bureau of Labor Statistics (BLS), considered neither employed nor unemployed. You don't hear about them because the media report the "headline" unemployment rate, which excludes these people. They are officially designated "not in the labor force". I'll come back to this topic later in the post.

***

Last year, I used a pair of charts to visualize the unemployment statistics. I have updated the charts to include all of 2019 and 2020 up to April, the just released numbers.

The first chart shows the trend in the official unemployment rate ("U3") from 1990 to present. It's color-coded so that the periods of high unemployment are red, and the periods of low unemployment are blue. This color code will come in handy for the next chart.

Junkcharts_kfung_unemployment_apr20

The time series is smoothed. However, I had to exclude the April 2020 outlier from the smoother.

The next plot, a scatter plot, highlights two of the more debatable definitions used by the BLS. On the horizontal axis, I plot the proportion of employed people who have part-time jobs. People only need to have worked one hour in a month to be counted as employed. On the vertical axis, I plot the proportion of the population who are labeled "not in labor force". These are people who are not employed and not counted in the unemployment rate.

Junkcharts_kfung_unemployment_apr20_2

The value of data visualization is its ability to reveal insights about the data. I'm happy to report that this design succeeded.

Previously, we learned that (a) part-timers as a proportion of employment tend to increase during periods of worsening unemployment (red dots moving right) while decreasing during periods of improving employment (blue dots moving left); and (b) despite the overall unemployment rate being about the same in 2007 and 2017, the employment situation was vastly different in the sense that the labor force has shrunk significantly during the recession and never returned to normal. These two insights are still found at the bottom right corner of the chart. The 2019 situation did not differ much from 2018.

What is the effect of the current Covid-19 pandemic?

On both dimensions, we have broken records since 1990. The proportion of people designated not in labor force was already the worst in three decades before the pandemic, and now it has almost reached 40 percent of the population!

Remember these people are invisible to the media, neither employed nor unemployed. Back in February 2020, with unemployment rate at around 4 percent, it's absolutely not the case that 96 percent of the employment-age population was employed. The number of employed Americans was just under 160 million. The population 16 years and older at the time was 260 million.

Who are these 100 million people? BLS says all but 2 million of these are people who "do not want a job". Some of them are retired. There are about 50 million Americans above 65 years old although 25 percent of them are still in the labor force, so only 38 million are "not in labor force," according to this Census report.
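The arithmetic behind these figures can be laid out in a few lines. This is a back-of-envelope sketch using only the rounded numbers quoted above; the step that backs out the unemployed from the headline rate is my own addition and assumes the rate is unemployed divided by employed plus unemployed.

```python
# Rounded figures quoted in the text (millions), February 2020.
population_16_plus = 260
employed = 160
unemployment_rate = 0.04  # "around 4 percent"

# Share of the 16-and-over population that was employed.
employment_share = employed / population_16_plus

# Everyone who is not employed: the unemployed plus those not in labor force.
neither_employed = population_16_plus - employed

# Back out the unemployed from the headline rate: rate = U / (E + U).
unemployed = employed * unemployment_rate / (1 - unemployment_rate)
not_in_labor_force = population_16_plus - employed - unemployed

# Americans above 65: about 50 million, of whom 25 percent are still in the labor force.
over_65 = 50
over_65_not_in_labor_force = over_65 * (1 - 0.25)

print(f"employed share of population: {employment_share:.1%}")
print(f"not employed: {neither_employed} million")
print(f"not in labor force (rounded inputs): {not_in_labor_force:.1f} million")
print(f"over 65 and not in labor force: {over_65_not_in_labor_force:.1f} million")
```

With these rounded inputs, only about 62 percent of the population is employed, nowhere near 96 percent. The gap between population and employed is 100 million, and removing the unemployed implied by a 4 percent rate still leaves over 93 million. The over-65 group accounts for about 37.5 million of them, which rounds to the 38 million cited.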

It would seem like the majority of these people don't want to work, are not paid enough to work, etc. Since part-time workers are counted as employed, with as little as one working hour per month, these are not the gig workers, not Uber/Lyft drivers, and not college students who have work-study or part-time jobs.

This category has long been suspect, and what happened in April isn't going to help build its case. There is no reason why the "not in labor force" group should spike immediately as a result of the pandemic. It's not plausible to argue that people who lost their jobs in the last few weeks suddenly turned into people who "do not want a job". I think this spike is solid evidence that the unemployed have been hiding inside the not in labor force number.

The unemployment rate has under-reported unemployment because many of the unemployed have been taken out of the labor force based on BLS criteria. The recovery of jobs since the Great Recession is partially nullified since the jump in "not in labor force" never returned to the prior level.

***

The other dimension, part-time employment, also showed a striking divergence from the past behavior. Typically, when the unemployment rate deteriorates, the proportion of employed people who have part-time jobs increases. However, in the current situation, not only is that not happening, but the proportion of part-timers plunged to a level not seen in the last 30 years.

This suggests that employers are getting rid of their part-time work force first.

 

 


Reviewing the charts in the Oxford Covid-19 study

On my sister (book) blog, I published a mega-post that examines the Oxford study that was cited two weeks ago as a counterpoint to the "doomsday" Imperial College model. These studies bring attention to the art of statistical modeling. Those six posts together are designed to give you a primer, and you don't need math to get a feel for it.

One aspect that didn't make it to the mega-post is the data visualization. Sad to say, the charts in the Oxford study (link) are uniformly terrible. Figure 3 is typical:

Oxford_covidmodel_fig3

There are numerous design decisions that frustrate readers.

a) The graphic contains two charts, one on top of the other. The left axis extends floor-to-ceiling, giving the false impression that it is relevant to both charts. In fact, the graphic uses dual axes. The bottom chart references the axis shown in the bottom right corner; the left axis is meaningless. The two charts should be drawn separately.

For those who have not read the mega-post about the Oxford models, let me give a brief description of what these charts are saying. The four colors refer to four different models - these models have the same structure but different settings. The top chart shows the proportion of the population that is still susceptible to infection by a certain date. In these models, no one can get re-infected, and so you see downward curves. The bottom chart displays the growth in deaths due to Covid-19. The first death in the UK was reported on March 5. The black dots are the official fatalities.

b) The designer allocates two-thirds of the space to the top chart, which has a much simpler message. This causes the bottom chart to be compressed beyond cognition.

c) The top chart contains just five lines, smooth curves of the same shape but different slopes. The designer chose to use thick colored lines with black outlines. As a result, nothing precise can be read from the chart. When does the yellow line start dipping? When do the two orange lines start to separate?

d) The top chart should have included margins of error. These models are very imprecise due to the sparsity of data.

e) The bottom chart should be rejected by peer reviewers. We are supposed to judge how well each of the five models fits the cumulative death counts. But three design decisions conspire to prevent us from getting the answer: (i) the vertical axis is severely compressed by tucking this chart underneath the top chart (ii) the vertical axis uses a log scale which compresses large values and (iii) the larger-than-life dots.

As I demonstrated in this post, also from the sister blog, many models, especially those assuming an exponential growth rate, have poor fits after the first few days. Charting in log scale hides the degree of error.

f) There is a third chart squeezed into the same canvas. Notice the four little overlapping hills located around Feb 1. These hills are probability distributions, which are presented without an appropriate vertical axis. Each hill represents a particular model's estimate of the date on which the novel coronavirus entered the UK. But that date is unknowable. So the model expresses this uncertainty using a probability distribution. The "peak" of the distribution is the most likely date. The spread of the hill gives the range of plausible dates, and the height at a given date indicates the chance that that is the date of introduction. The missing axis is a probability scale, which is neither the left nor the right axis.

***

The bottom chart shows up in a slightly different form as Figure 1(A).

Oxford_covidmodels_Fig1A

Here, the green, gray (blocked) and red thick lines correspond to the yellow/orange/red diamonds in Figure 3. The thin green and red lines show the margins of error I referred to above (these lines are not explicitly explained in the chart annotation). The actual counts are shown as white rather than black diamonds.

Again, the thick lines and big diamonds conspire to swamp the gaps between model fit and actual data. Again, notice the use of a log scale. This means that the same amount of gap signifies much bigger errors as time moves to the right.

When using the log scale, we should label it using the original units. With a base 10 logarithm, the axis should have labels 1, 10, 100, 1000 instead of 0, 1, 2, 3. (This explains my previous point - why small gaps between a model line and a diamond can mean a big error as the counts go up.)
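Both points about the log scale can be checked with a little arithmetic. Below is a minimal sketch with hypothetical helper names; the gap of 0.1 log units and the counts of 10 and 1,000 are picked for illustration and are not taken from the Oxford figures.

```python
def log10_tick_labels(exponents):
    """Turn log10 tick positions (0, 1, 2, ...) into labels in original units."""
    return [str(10 ** e) for e in exponents]

def absolute_error(actual, log_gap):
    """Error in original units when a model sits log_gap above the actual
    value on a base-10 log axis."""
    return actual * (10 ** log_gap - 1)

labels = log10_tick_labels([0, 1, 2, 3])
print(labels)

# The same visual gap (0.1 log units) at two different heights on the axis.
early_error = absolute_error(10, 0.1)
late_error = absolute_error(1000, 0.1)
print(round(early_error, 1), round(late_error, 1))
```

The tick positions 0, 1, 2, 3 become 1, 10, 100, 1000. A gap of 0.1 log units is an error of under 3 deaths when the actual count is 10, but over 250 when the actual count is 1,000, although the two gaps look identical on the chart.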

Also notice how the line of white diamonds makes it impossible to see what the models are doing prior to March 5, the date of the first reported death. The models apparently start showing fatalities prior to March 5. This is a key part of their conclusion - the Oxford team concluded that the coronavirus has been circulating in the U.K. even before the first infection was reported. The data visualization should therefore bring out the difference in timing.

I hope by the time the preprint is revised, the authors will have improved the data visualization.

 

 

 


Comparing chance of death of coronavirus and flu

The COVID-19 charts are proving one thing. When the topic of a dataviz is timely and impactful, readers will study the graphics and ask questions. I've been sent some of these charts lately, and will be featuring them here.

A former student saw this chart from Business Insider (link) and didn't like it.

Businesinsider_coronavirus_flu_compare

My initial reaction was generally positive. It's clear the chart addresses a comparison between death rates of the flu and COVID19, an important current question. The side-by-side panel is effective at allowing such a comparison. The column charts look decent, and there aren't excessive gridlines.

Sure, one sees a few simple design fixes, like removing the vertical axis altogether (since the entire dataset has already been printed). I'd also un-slant the age labels.

***

I'd like to discuss some subtler improvements.

A primary challenge is dealing with the different definitions of age groups across the two datasets. While the side-by-side column charts prompt readers to go left-right, right-left in comparing death rates, it's not easy to identify which column to compare to which. This is not fixable in the datasets because the organizations that compile them define their own age groups.

Also, I prefer to superimpose the death rates on the same chart, using something like a dot plot rather than a column chart. This makes the comparison even easier.

Here is a revised visualization:

Redo_businessinsider_covid19fatalitybyage

The contents of this chart raise several challenges to public health officials. Clearly, hospital resources should be preferentially offered to older patients. But young people could be spreading the virus among the community.

Caution is advised as the data for COVID19 suffers from many types of inaccuracies, as outlined here.


All these charts lament the high prices charged by U.S. hospitals

Nyt_medicalprocedureprices

A former student asked me about this chart from the New York Times that highlights much higher prices of hospital procedures in the U.S. relative to a comparison group of seven countries.

The dot plot is clearly thought through. It is not a default chart that pops out of software.

Based on its design, we surmise that the designer has the following intentions:

  1. The names of the medical procedures are printed to be read, thus the long text is placed horizontally.

  2. The actual price is not as important as the relative price, expressed as an index with the U.S. price at 100%. These reference values are printed in glaring red, unignorable.

  3. Notwithstanding the above point, the actual price is still of secondary importance, and the values are provided as a supplement to the row labels. Getting to the actual prices in the comparison countries requires further effort, and a calculator.

  4. The primary comparison is between the U.S. and the rest of the world (or the group of seven countries included). It is less important to distinguish specific countries in the comparison group, and thus the non-U.S. dots are given pastels that take some effort to differentiate.

  5. Probably due to reader feedback, the font size is subject to a minimum so that some labels are split into two lines to prevent the text from dominating the plotting region.

***

In the Trifecta Checkup view of the world, there is no single best design. The best design depends on the intended message and what’s in the available data.

To illustrate this, I will present a few variants of the above design, and discuss how these alternative designs reflect the designer's intentions.

Note that in all my charts, I expressed the relative price in terms of discounts, which is the mirror image of premiums. Instead of saying Country A's price is 80% of the U.S. price, I prefer to say Country A's price is a 20% saving (or discount) off the U.S. price.

First up is the following chart that emphasizes countries instead of hospital procedures:

Redo_medicalprice_hor_dot

This chart encourages readers to draw conclusions such as "Hospital prices are 60-80 percent cheaper in Holland relative to the U.S." But it is more taxing to compare the cost of a specific procedure across countries.

The indexing strategy already creates a barrier to understanding relative costs of a specific procedure. For example, the value for angioplasty in Australia is about 55% and in Switzerland, about 75%. The difference 75%-55% is meaningless because both numbers are relative savings from the U.S. baseline. Comparing Australia and Switzerland requires a ratio (0.75/0.55 = 1.36): Australia's prices are 36% above Swiss prices, or alternatively, Swiss prices are a 26% discount off Australia's prices.
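The ratio calculation generalizes to any two values indexed to a common baseline. Here is a small sketch; the function name is hypothetical, and the inputs 0.75 and 0.55 are the two values quoted above.

```python
def relative_gap(higher, lower):
    """Compare two values that are both indexed to the same baseline.

    Returns (premium of the higher over the lower,
             discount of the lower off the higher), as fractions.
    """
    ratio = higher / lower
    return ratio - 1, 1 - 1 / ratio

premium, discount = relative_gap(0.75, 0.55)
print(f"premium: {premium:.1%}, discount: {discount:.1%}")
```

The premium and the discount are not the same number even though they describe the same pair of values: about 36 percent one way and 26 to 27 percent the other, depending on rounding. Neither equals the 20-point difference between the two indexed values.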

The following design takes it even further, excluding details of individual procedures:

Redo_medicalprice_hor_bar

For some readers, less is more. It’s even easier to get a rough estimate of how much cheaper prices are in the comparison countries because, except for two “outliers”, the chart no longer displays individual values.

The widths of these bars reveal that in some countries, the amount of savings depends on the specific procedures.

The bar design releases the designer from a horizontal orientation. The country labels are shorter and can be placed at the bottom in a vertical design:

Redo_medicalprice_vert_bar

It's not that one design is obviously superior to the others. Each version does some things better. A good designer recognizes the strengths and weaknesses of each design, and selects one to fulfil his/her intentions.

 

P.S. [1/3/20] Corrected a computation, explained in Ken's comment.


Revisiting global car sales

We looked at the following chart in the previous blog. The data concern the growth rates of car sales in different regions of the world over time.

Cnbc zh global car sales

Here is a different visualization of the same data.

Redo_cnbc_globalcarsales

Well, it's not quite the same data. I divided the global average growth rate by four to yield an approximation of the true global average. (The reason for this is explained in the other day's post.)

The chart emphasizes how each region was helping or hurting the global growth. It also features the trend in growth within each region.