Losses trickle down while gains trickle up

In a rich dataset, it's hard to convey all the interesting insights on a single chart. Following up on the previous post, I looked further at the wealth distribution dataset. In the previous post, I showed this chart, which indicated that the relative wealth of the super-rich (top 1%) rose dramatically around 2011.

Redo_bihouseholdwealth_legend

As a couple of commenters noticed, that's relative wealth. I indexed everything to the Bottom 50%.

In this next chart, I apply a different index. Each income segment is set to 100 at the start of the time period under study (2000), and I track how each segment evolved in the last two decades.
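The re-indexing described above is simple arithmetic: divide every value in a segment's series by that segment's value in the base year, then multiply by 100. Here is a minimal Python sketch. The function name `reindex_to_base` and the wealth figures are made up for illustration; they are not the actual dataset.

```python
def reindex_to_base(series, base=100.0):
    """Rescale a series so that its first value equals `base`."""
    first = series[0]
    return [base * value / first for value in series]


# Hypothetical wealth levels in arbitrary units (NOT the real data).
# The first entry of each list plays the role of the year 2000.
wealth = {
    "Top 1%": [10.0, 12.0, 16.0],
    "Bottom 50%": [2.0, 0.5, 1.8],
}

# Every segment now starts at 100, so the lines show growth since the base year
indexed = {segment: reindex_to_base(values) for segment, values in wealth.items()}

for segment, values in indexed.items():
    print(segment, [round(v, 1) for v in values])
```

Because each segment is divided by its own starting value, the chart compares growth paths rather than absolute levels of wealth.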

Junkcharts_redo_bihouseholdwealth_2

This chart offers many insights.

The Bottom 50% have been left far, far behind in the last 20 years. In fact, from 2000 to 2018, this segment's wealth never once reached the 2000 level. At its worst, around 2010, the Bottom 50% found themselves 80% poorer than they were 10 years earlier!

In the meantime, the other half of the population has seen their wealth climb continuously through the 20 years. This is particularly odd because the major crisis of these two decades was the Too Big to Fail implosion of financial instruments, which the Bottom 50% almost surely did not play a part in. During that crisis, the top 50% were 30-60% better off than they were in 2000. Is this the "trickle-down" economy in which losses are passed down (but gains are passed up)?

The chart also shows how the recession hit the Bottom 50% much harder, and how the recovery took more than a decade. For the top half, the recovery came within 2 to 4 years.

It also appears that the top 10% are further peeling off from the rest of the population. Since 2009, the top 11-49% have been steadily losing ground relative to the top 10%, while the gap between them and the Bottom 50% has narrowed.

***

This second chart is not nearly as dramatic as the first one but it reveals much more about the data.

 


Atypical time order and bubble labeling

This chart appeared in a Charles Schwab magazine in Summer, 2019.

Schwab_volatility2018

This bubble chart does not print any data labels. The bubbles take our attention but the designer realizes that the actual values of the volatility are not intuitive numbers. The same is true of any standard deviation numbers. If you're told SD of a data series is 3, it doesn't tell you much by itself.

I first transformed this chart into the equivalent column chart:

Junkcharts_redo_schwabvolatility_columnrank

Two problems surface on the axes.

For the time axis, the years are jumbled. Readers experience vertigo as they try to figure out how to read the chart. Our expectation that time moves left to right is thwarted. This ordering also requires every single year label to be present.

For the vertical axis, I could have left out the numbers completely. They are not really meaningful. These represent the areas of the bubbles but only relative to how I measured them.

***

In the next version, I sorted time in the conventional manner. Following Tufte's classic advice, only the tops of the columns are plotted.

Junkcharts_redo_schwabvolatility_hashyear

What you see is that this ordering is much easier to comprehend. Figuring out that 2018 is an average year in terms of volatility is not any harder than in the original. In fact, we can reproduce the order of the previous chart just by letting our eyes sweep top to bottom.

To make it even easier to read the vertical axis, I converted the numbers into an index, with the average volatility as 100 (assigned to 0% on the chart).

Junkcharts_redo_schwabvolatility_hashyearrelative

Now, you can see that 2018 is roughly at the average while 2008 is 400% above the average level. (How should we interpret this statement? That's a question I pose to my statistics students. It's not intuitive how one should interpret the statement that the standard deviation is 5 times higher.)
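The conversion to "percent relative to the average" can be sketched in a few lines of Python. The volatility readings below are invented so that the average works out to exactly 1.0; they are not measurements from the Schwab chart, and `percent_vs_average` is just an illustrative name.

```python
def percent_vs_average(values):
    """Express each value as a percent difference from the mean of all values."""
    mean = sum(values) / len(values)
    return [100.0 * (v - mean) / mean for v in values]


# Hypothetical volatility readings by year, in arbitrary units (NOT the real data)
volatility = {2008: 5.0, 2017: 1.0, 2018: 1.0}
volatility.update({year: 0.5 for year in range(2009, 2017)})

years = sorted(volatility)
relative = dict(zip(years, percent_vs_average([volatility[y] for y in years])))

print(relative[2018])  # a year sitting at the average shows as 0%
print(relative[2008])  # a year at 5 times the average shows as +400%
```

The sketch makes the "400% above" wording concrete: a value that is 5 times the average sits 4 averages above it, which is why "400% above average" and "5 times the average" describe the same year.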

 

 


This holiday retailers hope it will snow dollars

According to the Conference Board, the pandemic will not deter U.S. consumers from emptying their wallets this holiday season. Here's a chart that shows their expectation (link):

COVID-19-Holiday-Spend-847

 

A few little things make this chart work:

The "More" category is placed on the left, since English is read left to right, and it is also given the deepest green, drawing our attention.

Only the "More" segments have data labels. I'd have omitted the decimals. I suspect they are added because financial analysts may be multiplying these percentages to yield dollar amounts, in which case the extra precision helps.

The categories are ordered by the decreasing propensity of increased spending this year relative to last year. (The business community has an optimism bias.)

The choice of three shades of one color instead of three different colors keeps the chart clean.

***

The use of snowflakes surely infuriates a hardcore Tufte fan although I like that they add a festive note to the presentation. The large snowflake isn't randomly positioned but placed exactly where it causes the least interference with the bar chart.

 


Book Review: Visualizing with Text by Richard Brath

Richardbarth_bookcover

The creative process is sometimes described in terms of diverge-converge cycles. The diverge step involves experimentation and rewards suspending disbelief, while excesses are curbed and concepts refined during the converge step. Richard Brath's just-released book Visualizing with Text is an important resource that expands our appreciation for the place of text in visual displays.

Books on data visualization fall into recognizable types, of which two popular ones are the style guide, such as those by Edward Tufte, Dona Wong, and Alberto Cairo, and the coding manual, such as those by Ben Fry (Processing) and Hadley Wickham (ggplot, Shiny). Brath's volume belongs to neither of those - it reads more like an encyclopedic catalog of how text can be incorporated into charts and graphs. He challenges us to blow up our imaginative space for characters, words, sentences, paragraphs and prose. It is a valuable aid for the diverge step of our creative process.

In modern data visualization, text is treated as an accessory, frequently found in titles, labels, legends, footnotes or surrounding text. Brath wants us to elevate text to the starring attraction. Starting with baby steps, such as direct labeling of lines and objects, and coordinating colors between chart elements and words, he experiments with inserting text into unlikely crannies, not shying away from ideas that even he admits may be somewhat of a dead-end.

One of the more immediately useful examples is the use of text labels that hug the lines on a line chart, similar to how roads and rivers are labeled on maps. I wish all software developers would implement this function without delay.

Barth_riverlabelsonlines

A more esoteric example is to replace these lines with small-size text, as Brath makes an analogy between sentences and lines.

Barth_textinlines

I am still deciding if this is a gold mine or a minefield. It is thought-provoking nonetheless.

Finally, the book includes some flights of fancy, like this one:

Barth_french_departments

The red superscripts are numeric codes for French departments (provinces), arranged in ascending order of a given metric, and placed in proportional distance within the prose!

The converge step is left to the reader, as Brath refrains from bullhorning his opinions about chart types, which is why readers should not expect a style guide. He includes many experimental graphics, and may provide the pros and cons of a form without registering a judgement.

Because many of these ideas have yet to enter the mainstream, we'd need to implement these ideas on our own, which is why readers will not find a coding manual. As mentioned above, even the simplest and least controversial tactic of directly labeling lines is not available in Excel, let alone text that hugs or replaces lines. (This proves Brath's point that our community has done text a disservice.) Other ideas explored in later chapters require such features as italicizing numeric proportions of a word, rather than the entire word.

Recently, text has become a mainstay of Big Data. Visualizing with Text is timely, relevant and provocative. It is also clearly written, and tightly organized. Chapter 13 neatly summarizes the key concepts that have appeared along the way. There are plenty of use cases, primarily derived from research or business. After reading this book, you'll revel in the new sandbox of text, and long to free yourself from the constraints of your tool.


***

I recommend that you get the paper copy of the book. I reviewed the electronic version, and what irony! As you may have guessed, the electronic version ruins the typesetting. On every page, certain paragraphs show up in a tiny font that resists all attempts to magnify, making Brath's case that legibility is an important metric for text visualization. Some of the more unusual fonts are dropped. The images are too small, even when popped up.

[P.S. Richard has a webpage where he included larger images and some code.]


Locating the political center

I mentioned the September special edition of Bloomberg Businessweek on the election in this prior post. Today, I'm featuring another data visualization from the magazine.

Bloomberg_politicalcenter_print_sm

***

Here are the rightmost two charts.

Bloomberg_politicalcenter_rightside

Time runs from top to bottom, spanning four decades.

Each chart covers a political issue. These two charts concern abortion and marijuana.

The marijuana question (far right) has only two answers, legalize or don't legalize. The underlying data measure the proportions of people agreeing to each point of view. Roughly three-quarters of the population disagreed with legalization in 1980 while two-thirds agree with it in 2020.

Notice that there are no horizontal axis labels. This is a great editorial decision. Only coarse trends are of interest here. It's not hard to figure out the relative proportions. Adding labels would just clutter up the display.

By contrast, the abortion question has three answer choices. The middle option is "Sometimes," which is represented by a white color, with a dot pattern. This is an issue on which public opinion in aggregate has barely shifted over time.

The charts are organized in a small-multiples format. It's likely that readers are consuming each chart individually.

***

What about the dashed line that splits each chart in half? Why is it there?

The vertical line assists our perception of the proportions. Think of it as a single gridline.

In fact, this line is underplayed. The headline of the article is "tracking the political center." Where is the center?

Until now, we've paid attention to the boundaries between the differently colored areas. But those boundaries do not locate the political center!

The vertical dashed line is the political center; it represents the view of the median American. In 1980, the line sat inside the gray section, meaning the median American opposed legalizing marijuana. But the prevalent view was losing support over time and by 2010, there were more Americans wanting to legalize marijuana than not. This is when the vertical line crossed into the green zone.
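To make the idea of the median American concrete, here is a small Python sketch that walks across the answer categories in their left-to-right order and reports which one contains the 50% mark. The marijuana shares are rounded from the proportions quoted earlier; the abortion shares and the function name `median_category` are hypothetical, chosen only for illustration.

```python
def median_category(shares):
    """Return the category containing the 50% mark.

    `shares` is a list of (label, percent) pairs in the order in which
    the categories are stacked on the chart.
    """
    cumulative = 0.0
    for label, share in shares:
        cumulative += share
        # The first category whose cumulative share reaches 50% holds the median
        # (an exact 50/50 split is resolved in favor of the earlier category)
        if cumulative >= 50.0:
            return label
    raise ValueError("shares must add up to at least 50 percent")


# Rounded from the text: about three-quarters opposed in 1980, two-thirds in favor in 2020
marijuana_1980 = [("Legalize", 25.0), ("Don't legalize", 75.0)]
marijuana_2020 = [("Legalize", 67.0), ("Don't legalize", 33.0)]

# Hypothetical shares for the three-answer abortion question (NOT the real data)
abortion = [("Always", 25.0), ("Sometimes", 55.0), ("Never", 20.0)]

print(median_category(marijuana_1980))  # Don't legalize
print(median_category(marijuana_2020))  # Legalize
print(median_category(abortion))        # Sometimes
```

On the chart, this is exactly what the dashed line does: whichever colored band the line passes through in a given year is the view of the median respondent.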

The following charts draw attention to the middle line, instead of the color boundaries:

Junkcharts_redo_bloombergpoliticalcenterrightside

On these charts, as you glance down the middle line, you can see that for abortion, the political center has never exited the middle category while for marijuana, the median American didn't want to legalize it until an inflection point was reached around 2010.

I highlight these inflection points with yellow dots.

***

The effect on readers is entirely changed. The original charts draw attention to the areas first while the new charts pull your eyes to the vertical line.

 


Making better pie charts if you must

I saw this chart on an NYU marketing twitter account:

LATAMstartupCEO_covidimpact

The graphical design is not easy on our eyes. It's just hard to read for various reasons.

The headline sounds like a subject line from an email.

The subheaders are long, and differ only by a single word.

Even if one prefers pie charts, they can be improved by following a few guidelines.

First, start the first sector at the 12 o'clock position. Like this:

Redo_junkcharts_latamceo_orientation

The survey uses a 5-point scale from "Very Good" to "Very Bad". Instead of using five different colors, it's better to use two extreme colors and shading. Like this:

Redo_junkcharts_latamceo_color

I also try hard to keep all text horizontal.

Redo_junkcharts_latamceo_labels

For those who prefer not to use pie charts, a side-by-side bar chart works well.

Redo_junkcharts_latamceo_bars

In my article for DataJournalism.com, I outlined "unspoken rules" for making various charts, including pie charts.

 

 

 


Putting vaccine trials in boxes

Bloomberg Businessweek has a special edition about vaccines, and I found this chart on the print edition:

Bloombergbw_vaccinetrials_sm

The chart's got a lot of white space. Its structure is a series of simple "treemaps," one for each type of vaccine. Though simple, such a chart burns a few brain cells.

Here, I've extracted the largest block, which corresponds to vaccines that work with the virus's RNA/DNA. I applied a self-sufficiency test, removing the data from the boxes. 

Redo_junkcharts_bloombergbw_vaccinetrials_0

What proportion of these projects have moved from pre-clinical to Phase 1?  To answer this question, we have to understand the relative areas of boxes, since that's how the data are encoded. How many yellow boxes can fit into the gray box?

It's not intuitive. We'd need a ruler to do this task properly.

Then, we learn that the gray box is exactly 8 times the size of the yellow box (72 projects are pre-clinical while 9 are in Phase I). We can cram eight yellows into the gray box. Imagine doing that, and it's pretty clear the visual elements fail to convey the meaning of the data.

Self-sufficiency is the idea that a data graphic should not rely on printed data to convey its meaning; the visual elements of a data graphic should bear much of the burden. Otherwise, use a data table. To test for self-sufficiency, cover up the printed data and see if the chart still works.

***

A key decision for the designer is the relative importance of (a) the number of projects reaching Phase III, versus (b) the number of projects utilizing specific vaccine strategies.

This next chart emphasizes the clinical phases:

Redo_junkcharts_bloombergbw_vaccinetrials_2

 

Contrast this with the version shown in the online edition of Bloomberg (link), which emphasizes the vaccine strategies.

Bloombergbwonline_vaccinetrials

If any reader can figure out the logic of the ordering of the vaccine strategies, please leave a comment below.


When the pie chart is more complex than the data

The trading house, Charles Schwab, included the following graphic in a recent article:

Charleschwab_portfolio_1000

This graphic is more complicated than the story that it illustrates. The author describes a simple scenario in which an investor divides his investments into stocks, bonds and cash. After a stock crash, the value of the portfolio declines.

The graphic is a 3-D pie chart, in which the data are encoded twice, first in the areas of the sectors and then in the heights of the part-cylinders.

As readers, we perceive the relative volumes of the part-cylinders. Volume is the cross-sectional area (i.e. of the base) multiplied by the height. Since both the area and the height encode the data, the volumes are proportional to the squares of the data.
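A quick numerical check shows how the double encoding distorts comparisons. In this Python sketch, the portfolio values are hypothetical and the scale factors are arbitrary; the only assumption carried over from the chart is that both the sector area and the height are proportional to the data.

```python
def perceived_volume(value, area_scale=1.0, height_scale=1.0):
    """Volume of a part-cylinder whose base area AND height both encode `value`."""
    base_area = area_scale * value
    height = height_scale * value
    return base_area * height


# Hypothetical holdings in thousands of dollars (NOT the figures in the graphic)
stocks, bonds = 60.0, 30.0

data_ratio = stocks / bonds
volume_ratio = perceived_volume(stocks) / perceived_volume(bonds)

print(data_ratio)    # 2.0: stocks are twice the size of bonds
print(volume_ratio)  # 4.0: but the stocks cylinder has four times the volume
```

The scale factors cancel out in the ratio, so the distortion does not depend on how the chart is sized: any ratio in the data is squared in the volumes.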

Here is a different view of the same data:

Redo_junkcharts_schwab_portfolio

This "bumps chart" (also called a slopegraph) shows clearly the only thing that drives the change is the drop in stock prices. Because the author assumes no change in bonds or cash, the drop in the entire portfolio is completely accounted for by the decline in stocks. Of course, this scenario seems patently unrealistic - different investment asset classes tend to be correlated.

***

A cardinal rule of data visualization is that the visual should be less complex than the data.


Visualizing black unemployment in the U.S.

In a prior post, I explained how the aggregate unemployment rate paints a misleading picture of the employment situation in the United States. Even though the U3 unemployment rate in 2019 has returned to the lowest level we have seen in decades, the aggregate statistic hides some concerning trends. There is an alarming rise in the proportion of people considered "not in labor force" by the Bureau of Labor Statistics - these forgotten people are not counted as "employable": when a worker drops out of the labor force, the unemployment rate ironically improves.
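The irony is easy to verify with arithmetic. The unemployment rate is the number of unemployed divided by the labor force, where the labor force counts only the employed and the unemployed. The head counts in this Python sketch are invented for illustration.

```python
def unemployment_rate(employed, unemployed):
    """Unemployed as a percent of the labor force (employed plus unemployed)."""
    labor_force = employed + unemployed
    return 100.0 * unemployed / labor_force


# Hypothetical population: 95 employed, 5 unemployed and looking for work
before = unemployment_rate(employed=95, unemployed=5)

# One unemployed person stops looking and is reclassified as "not in labor force".
# Nobody found a job, yet the person leaves both the numerator and the denominator.
after = unemployment_rate(employed=95, unemployed=4)

print(round(before, 2))  # 5.0
print(round(after, 2))   # 4.04
```

The number of people with jobs is the same in both scenarios; only the classification of one person changed.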

In that post, I looked at the difference between men and women. This post will examine the racial divide, whites and blacks.

I did not anticipate how many obstacles I'd encounter. It's hard to locate a specific data series, and it's harder to know whether the lack of search results indicates the non-existence of the data, or the incompetence of the search engine. Race-related data tend not to be offered in as much granularity. I was only able to find quarterly data for the racial analysis while I had monthly data for the gender analysis. Also, I only have data from 2000, instead of 1990.

***

As before, I looked at the official unemployment rate first, this time presented by race. Because whites form the majority of the labor force, the overall unemployment rate (not shown) is roughly the same as that for whites, just pulled up slightly toward the line for blacks.

Jc_unemploybyrace

The racial divide is clear as day. Throughout the past two decades, black Americans have been much more likely to be unemployed, and even more so during recessions.

The above chart determines the color encoding for all the other graphics. Notice that the best employment situations occurred on either end of this period, right before the dotcom bust in 2000, and in 2019 before the Covid-19 pandemic. As explained before, despite the headline unemployment rate being the same in those years, the employment situation was not the same.

***

Here is the scatter plot for white Americans:

Jc_unemploybyrace_scatter_whites

Even though both ends of the trajectory are marked with the same shade of blue, indicating almost identical (low) rates of unemployment, we find that the trajectory has failed to return to its starting point after veering off course during the recession of the early 2010s. While the proportion of part-time workers (counted as employed) returned to 17.5% in 2019, as in 2000, about 15 percent more whites are now excluded from the unemployment rate calculation.

The experience of black Americans appears different:

Jc_unemploybyrace_scatter_blacks

During the first decade, the proportion of black Americans dropping out of the labor force accelerated while among those considered employed, the proportion holding part-time jobs kept increasing. As the U.S. recovered from the Great Recession, we've seen a boomerang pattern. By 2019, the situation was halfway back to 2000. The last available datum for the first quarter of 2020 is before Covid-19; it actually showed a halt of the boomerang.

If the pattern we saw in the prior post holds for the Covid-19 world, we would see a marked spike in the out-of-labor-force statistic, coupled with a drop in part-time employment. It appeared that employers were eliminating part-time workers first.

***

One reader asked about placing both patterns on the same chart. Here is an example of this:

Jc_unemploybyrace_scatter_both

This graphic turns out okay because the two strings of dots fit tightly into the grid while not overlapping. There is a lot going on here; I prefer a multi-step story to throwing everything at the wall.

There is one insight that this chart provides that is not easily observed in two separate plots. Over the two decades, the racial gap has narrowed in these two statistics. Both groups have traveled to the top right corner, which is the worst corner to reside in -- where more people are classified as not employable, and more of the employed are part-time workers.

The biggest challenge with making this combined scatter plot is properly controlling the color. I want the color to represent the overall unemployment rate, which is a third data series. I don't want the line for blacks to be all red, and the line for whites to be all blue, just because black Americans face a tough labor market always. The color scheme here facilitates cross-referencing time between the two dot strings.


Consumption patterns during the pandemic

The impact of Covid-19 on the economy is sharp and sudden, which makes for some dramatic data visualization. I enjoy reading the set of charts showing consumer spending in different categories in the U.S., courtesy of Visual Capitalist.

The designer did a nice job cleaning up the data and building a sequential story line. Spending is grouped by categories such as restaurants and travel, and then sub-categories such as fast food and fine dining.

Spending is presented as year-on-year change, smoothed.

Here is the chart for the General Commerce category:

Visualcapitalist_spending_generalcommerce

The visual design is clean and efficient, perhaps even too sparse, because one has to keep returning to the top to decipher the key events labelled 1, 2, 3, 4. Also, to find out that the percentages express year-on-year change, the reader must scroll to the bottom and locate a footnote.

As you move down the page, you will surely make a stop at the Food Delivery category, noting that the routine is broken.

Visualcapitalist_spending_fooddelivery

I've featured this device - an element of surprise - before. Remember this Quartz chart that depicts drinking around the world (link).

The rule for small multiples is to keep the visual design identical but vary the data from chart to chart. Here, the exceptional data force the vertical axis to extend tremendously.

This chart contains a slight oversight - the red line should be labeled "Takeout" because food delivery is the label for the larger category.

Another surprise is in store for us in the Travel category.

Visualcapitalist_spending_travel

I kept staring at the Cruise line, and how it kept dipping below -100 percent. That seems impossible mathematically - unless these cardholders are receiving more refunds than they are making new bookings. Not only must the entire sum of 2019 bookings be wiped out, but the records must also show credits issued to these credit (or debit) cards. It's curious that the same situation did not befall the airlines. I think many readers would have liked to see some text discussing this pattern.
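Here is the arithmetic behind that reasoning as a short Python sketch. The dollar amounts are hypothetical, and the sketch ignores the smoothing applied in the published charts; it only shows that a year-on-year change below -100 percent requires net spending to be negative.

```python
def yoy_change(current, prior):
    """Year-on-year change as a percent of the prior-year amount."""
    return 100.0 * (current - prior) / prior


def net_spending(new_bookings, refunds):
    """Amount charged to cards minus credits issued back to them."""
    return new_bookings - refunds


prior_year = 100.0  # hypothetical spending in the same week of 2019

# Spending stops entirely: the floor one would normally expect
no_bookings = yoy_change(net_spending(0.0, 0.0), prior_year)

# Refunds exceed new bookings: net spending turns negative
refunds_exceed = yoy_change(net_spending(10.0, 30.0), prior_year)

print(no_bookings)     # -100.0
print(refunds_exceed)  # -120.0
```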

***

Now, let me put on a data analyst's hat, and describe some thoughts that raced through my head as I read these charts.

Data analysis is hard, especially if you want to convey the meaning of the data.

The charts clearly illustrate the trends but what do the data reveal? The designer adds commentary on each chart. But most of these comments count as "story time." They contain speculation on what might be causing the trend but there isn't additional data or analyses to support the storyline. In the General Commerce category, the 50 to 100 percent jump in all subcategories around late March is attributed to people stockpiling "non-perishable food, hand sanitizer, and toilet paper". That might be true but this interpretation isn't supported by credit or debit card data because those companies do not have details about what consumers purchased, only the total amount charged to the cards. It's a lot more work to solidify these conclusions.

A lot of data do not mean complete or unbiased data.

The data platform provided data on 5 million consumers. We don't know if these 5 million consumers are representative of the 300+ million people in the U.S. Some basic demographic or geographic analysis can help establish the validity. Strictly speaking, I think they have data on 5 million card accounts, not unique individuals. Most Americans use more than one credit or debit card. It's not likely that the data vendor has a full picture of an individual's or a family's spending.

It's also unclear how much of consumer spending is captured in this dataset. Credit and debit cards are only one form of payment.

Data quality tends to get worse.

One thing drives data analysts nuts: the spending categories are becoming blurrier. In the last decade or so, big business has come to dominate the American economy. Big business, with bipartisan support, has grown by (a) absorbing little guys, and (b) eliminating boundaries between industry sectors. Around me, there is a Walgreens, several Duane Reades, and a RiteAid. They currently have the same owner, and increasingly offer the same selection. In the meantime, Walmart (big box), CVS (pharmacy), Costco (wholesale), etc. all won regulatory relief to carry groceries, fresh foods, toiletries, etc. So, while CVS or Walgreens is classified as a pharmacy, it's not clear what proportion of the spending there is for medicines. As big business grows, these categories become less and less meaningful.