Crazy rich Asians inspire some rich graphics

On the occasion of the hit movie Crazy Rich Asians, the New York Times did a very nice report on Asian immigration in the U.S.

The first two graphics will be of great interest to those who have attended my free dataviz seminar (coming to Lyon, France in October, by the way. Register here.), as they deal with a related issue.

The first chart shows an income gap widening between 1970 and 2016.


This uses a two-lines design in a small-multiples setting. The distance between the two lines is labeled the "income gap". The clear story here is that the income gap is widening over time across the board, but especially rapidly among Asians, and then followed by whites.

The second graphic is a bumps chart (slopegraph) that compares the endpoints of 1970 and 2016, but using an "income ratio" metric, that is to say, the ratio of the 90th-percentile income to the 10th-percentile income.


Asians are still a key story on this chart, as income inequality has ballooned from 6.1 to 10.7. That is where the similarity ends.

Notice how whites now appear at the bottom of the list while blacks show up as the second "worst" in terms of income inequality. Even though the underlying data are the same, what can be seen in the Bumps chart is hidden in the two-lines design!

In short, the reason is that the scale of the two-lines design is such that the small numbers are squashed. The bottom 10 percent did see an increase in income over time but because those increases pale in comparison to the large incomes, they do not show up.
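To make the gap-versus-ratio point concrete, here is a minimal Python sketch. The incomes are made up for illustration and are not the figures behind the NYT charts; the group names and helper functions are hypothetical too. It shows how a group can have the larger absolute gap and yet the smaller income ratio.

```python
# Hypothetical 10th- and 90th-percentile incomes (illustrative values only,
# not the data behind the NYT charts).
incomes = {
    "Group A": {"p10": 10_000, "p90": 100_000},
    "Group B": {"p10": 20_000, "p90": 120_000},
}


def income_gap(p10, p90):
    """Absolute distance between the two lines in the two-lines design."""
    return p90 - p10


def income_ratio(p10, p90):
    """90th-percentile income divided by 10th-percentile income."""
    return p90 / p10


# Map each group to a (gap, ratio) pair.
summary = {
    name: (income_gap(**values), income_ratio(**values))
    for name, values in incomes.items()
}

for name, (gap, ratio) in summary.items():
    print(f"{name}: gap = {gap:,}, ratio = {ratio:.1f}")
```

In this made-up example, Group B has the wider gap but Group A has the higher ratio, because the ratio is sensitive to the small numbers at the bottom that the two-lines scale squashes.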

What else does not show up in the two-lines design? Notice that in 1970, the income ratio for blacks was 9.1, way above other racial groups.

Kudos to the NYT team for realizing that the two-lines design provides an incomplete, potentially misleading picture.


The third chart in the series is a marvellous scatter plot (with one small snafu, which I'll get to).


What are all the things one can learn from this chart?

  • There is, as expected, a strong correlation between having college degrees and earning higher salaries.
  • The Asian immigrant population is diverse, from the perspectives of both education attainment and median household income.
  • The largest source countries are China, India and the Philippines, followed by Korea and Vietnam.
  • The Indian immigrants are on average professionals with college degrees and high salaries, and form an outlier group among the subgroups.

Through careful design decisions, those points are clearly conveyed.

Here's the snafu. The designer forgot to say which year is being depicted. I suspect it is 2016.

Dating the data is very important here because of the following excerpt from the article:

Asian immigrants make up a less monolithic group than they once did. In 1970, Asian immigrants came mostly from East Asia, but South Asian immigrants are fueling the growth that makes Asian-Americans the fastest-expanding group in the country.

This means that a key driver of the rapid increase in income inequality among Asian-Americans is the shift in composition of the ethnicities. More and more South Asian (most of whom are Indians) arrivals push up the education attainment and household income of the average Asian-American. Not only are Indians becoming more numerous, but they are also richer.

An alternative design is to show two bubbles per ethnicity (one for 1970, one for 2016). To reduce clutter, the smaller ethnicities can be aggregated into Other or South Asian Other. This chart may help explain the driver behind the jump in income inequality.






Two good charts can use better titles

NPR has this chart, which I like:


It's a set of small multiples of bumps charts. Nice, clear labels. No unnecessary things like axis labels. Intuitive organization by Major Factor, Minor Factor, and Not a Factor.

Above all, the data convey a strong, surprising, message - despite many high-profile gun violence incidents this year, some Democratic voters are actually much less likely to see guns as a "major factor" in deciding their vote!

Of course, the overall importance of gun policy is down but the story of the chart is really about the collapse on the Democratic side, in a matter of two months.

The one missing thing about this chart is a nice, informative title: In two months, gun policy went from a major to a minor issue for some Democratic voters.


 I am impressed by this Financial Times effort:


The key here is the analysis. Most lazy analyses compare millennials to other generations at their current ages, but this analyst looked at each generation at the same age range of 18 to 33 (i.e. controlling for age).

Again, the data convey a strong message - millennials have significantly higher un(der)employment than previous generations at their age range. Similar to the NPR chart above, the overall story is not nearly as interesting as the specific story - it is the pink area ("not in labour force") that is driving this trend.

Specifically, the millennial unemployment rate is high because the proportion of people classified as "not in labour force" doubled in 2014, compared to all previous generations depicted here. I really like this chart because it lays waste to a prevailing theory spread around by reputable economists - that somehow after the Great Recession, demographic trends are causing the explosion in people classified as "not in labour force". These people are nobodies when it comes to computing the unemployment rate. They literally do not count! There is simply no reason why someone who just graduated from college should not be in the labour force by choice. (Dean Baker has a discussion of the theory that people not wanting to work is a long term trend.)

The legend would be better placed to the right of the columns, rather than the top.

Again, this chart benefits from a stronger headline: BLS Finds Millennials are twice as likely as previous generations to have dropped out of the labour force.





Several problems with stacked bar charts, as demonstrated by a Delta chart designer

In the Trifecta Checkup (link), I like to see the Question and the Visual work well together. Sometimes, you have a nice message but you just pick the wrong Visual.

An example is the following stacked column chart, used in an investor presentation by Delta.


From what I can tell, the five types of aircraft are divided into RJ (regional jet) and others (perhaps, larger jets). With each of those types, there are two or three subtypes. The primary message here is the reduction in the RJ fleet and the expansion of Small/Medium/Large.

One problem with a stacked column chart with five types is that it takes too much effort to understand the trends of the middle types.

The two types on the edges are not immune to confusion either. As shown below, both the dark blue (Large) type and the dark red (50-seat RJ) type are associated with downward sloping lines except that the former type is growing rapidly while the latter is vanishing from the mix!


 In this case, the slopegraph (Bumps-type chart) can overcome some of the limitations.



This example was used in my new dataviz workshop, launched in St. Louis yesterday. Thank you to the participants for making it a lively session!

Fifty-nine intersections supporting forty dots of data

My friend Ray V. asked how this chart can be improved:


Let's try to read this chart. The Economist is always the best at writing headlines, and this one is simple and to the point: the rich get richer. This is about inequality but not just inequality - the growth in inequality over time.

Each country has four dots, divided into two pairs. From the legend, we learn that the line represents the gap between the rich and the poor. But what is rich and what is poor? Looking at the sub-header, we learn that the population is divided by domicile, and the per-capita GDP of the poorest and richest regions is drawn. This is an indirect metric, and may or may not be good, depending on how many regions a country is divided into, the dispersion of incomes within each region, the distribution of population between regions, and so on.

Now, looking at the axis labels, it's pretty clear that the data depicted are not in dollars (or currency), despite the reference to GDP in the sub-header. The numbers represent indices, relative to the national average GDP per head. For many of the countries, the poorest region produces about half as much per-capita GDP as the richest region.

Back to the original question. A growing inequality would be represented by a longer line below a shorter line within each country. That is true in some of these countries. The exceptions are Sweden, Japan, South Korea.

It doesn't jump out that the key task requires comparing the lengths of the two lines. Another issue is the outdated convention of breaking up a line (Britain) when the line is of extreme length - particularly unwise given that the length of the line encodes the key metric in the chart.

Further, it has a low data-ink ratio a la Tufte. The gridlines, reference lines, and data lines weave together in a complex pattern creating 59 intersections in a chart that contains only 36 numbers.


I decided to compute a simpler metric - the ratio of rich to poor. For example, in the UK, the richest area produces about 20 times as much GDP per capita as the poorest one in 2015. That is easier to understand than an index to the average region.

I had fun making the following chart, although many standard forms like the Bumps chart (i.e. slopegraph) or paired columns and so on also work.


This chart is influenced by Ed Tufte, who spent a good number of pages in his first book advocating stripping even the standard column chart to its bare essence. The chart also acknowledges the power of design to draw attention.



PS. Sorry I counted incorrectly. The chart has 36 dots not 40. 

Making people jump over hoops

Take a look at the following chart, and guess what message the designer wants to convey:


This chart accompanied an article in the Wall Street Journal about Wells Fargo losing brokers due to the fake account scandal, and using bonuses to lure them back. Like you, my first response to the chart was that little has changed from 2015 to 2017.

The intention of the whitespace inserted to split the four columns into two pairs is a bit mysterious. It's not obvious that UBS and Merrill are different from Wells Fargo and Morgan Stanley. This device might have been used to overcome the difficulty of reading four columns side by side.

The additional challenge of this dataset is the outlier values for UBS, which elongate the range of the vertical axis, squeezing together the values of the other three banks.

In this first alternative version, I play around with irregular gridlines.


Grouped column charts are not great at conveying changes over time, as they cause our eyes to literally jump over hoops. In the second version, I use a bumps chart to compactly highlight the trends. I also zoom in on the quarterly growth rates.


The rounded interpolation removes the sharp angles from the typical bumps chart (aka slopegraph) but it does add patterns that might not be there. This type of interpolation however respects the values at the "knots" (here, the quarterly values) while a smoother may move those points. On balance, I like this treatment.
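Here is a small Python sketch of that trade-off. I don't state above which rounded interpolation the chart uses, so the Catmull-Rom spline below is a stand-in I am assuming for illustration, and the quarterly values are made up. The sketch checks three things: the curve passes through the knots, it can still dip below both neighbouring values (a pattern that is not in the data), and a simple smoother (a 3-point moving average) moves a knot.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Value at fraction t (0..1) of the way from knot p1 to knot p2.

    p0 and p3 are the neighbouring knots on either side.
    """
    return 0.5 * (
        2 * p1
        + (p2 - p0) * t
        + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
        + (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3
    )


# Hypothetical quarterly values (illustrative only).
knots = [1.0, 1.0, 5.0, 5.0]

# Curve between the first two knots, sampled at t = 0, 0.25, 0.5, 0.75, 1.
# The first knot is repeated to stand in for the missing left neighbour.
segment = [
    catmull_rom(knots[0], knots[0], knots[1], knots[2], step / 4)
    for step in range(5)
]

# A 3-point moving average (a simple smoother) evaluated at the second knot.
smoothed_second = sum(knots[0:3]) / 3

print("interpolated segment:", segment)
print("smoothed value at second knot:", round(smoothed_second, 2))
```

The segment starts and ends exactly on the two knots but sags between them even though both quarters have the same value, while the moving average pulls the second knot well away from its actual value.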


PS. [6/2/2017] Given the commentary below, I am including the straight version of the chart, so you can compare. The straight-line version is more precise. One aspect of this chart form I dislike is the sharp angles. When there are more lines, it gets very entangled.


Sorting out the data, and creating the head-shake manual

Yesterday's post attracted a few good comments.

Several readers don't like the data used in the NAEP score chart. The authors labeled the metric "gain in NAEP scale scores" which I interpreted to be "gain scores," a popular way of evaluating educational outcomes. A gain score is the change in test score between (typically consecutive) years. I also interpreted the label "2000-2009" as the average of eight gain scores, in other words, the average year-on-year change in test scores during those 10 years.

After thinking about what reader mankoff wrote, which prompted me to download the raw data, I realized that the designer did not compute gain scores. "2000-2009" really means the difference between the 2009 score and the 2000 score, ignoring all values between those end points. So mankoff is correct in saying that the 2009 number was used in both "2000-2009" and "2009-2015" computations.
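The two readings of "2000-2009" can be sketched in a few lines of Python. The scores and assessment years below are made up for illustration and are not the NAEP data.

```python
# Hypothetical scale scores by assessment year (illustrative values only).
scores = {2000: 220, 2003: 226, 2005: 229, 2007: 231, 2009: 232}

years = sorted(scores)

# Gain scores: the change between consecutive assessments.
gain_scores = [scores[later] - scores[earlier] for earlier, later in zip(years, years[1:])]
average_gain = sum(gain_scores) / len(gain_scores)

# Endpoint difference: last score minus first score, ignoring everything between.
endpoint_difference = scores[years[-1]] - scores[years[0]]

print("gain scores:", gain_scores)
print("average gain:", average_gain)
print("endpoint difference:", endpoint_difference)
```

The gain scores always sum to the endpoint difference, so the average gain is just the endpoint difference divided by the number of intervals. That is why the length of the period matters so much once endpoint differences are compared across periods.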

This treatment immediately raises concerns. Why is a 10-year period compared to a 7-year period?

Andrew prefers to see the raw scores ("scale scores") instead of relative values. Here is the corresponding chart:


I placed a line at 2009, just to see if there is a reason for that year to be a special year. (I don't think so.) The advantage of plotting raw scores is that it is easier to interpret. As Andrew said, less abstraction. It also soothes the nerves of those who are startled that the lines for white students appear at the bottom of the chart of gain scores.

I suppose the reason why the original designer chose to use score differentials is to highlight their message concerning change in scores. One can nitpick that their message isn't particularly cogent because if you look at 8th grade math or reading scores, comparing 2009 and 2015, there appeared to be negligible change, and yet between those end-points, the scores did spike and then drop back to the 2009 level.

One way to mitigate the confusion that mankoff encountered in interpreting my gain-score graphic is to use "informative" labels, rather than "uninformative" labels.


Instead of saying the vertical axis plots "gain scores" or "change in scores," directly label one end as "no progress" and the other end as "more progress."

Everything on this chart is progress over time, and the stalling of progress is their message. This chart requires more upfront learning, after which the message jumps out. The chart of raw scores shown above has almost no perceptual overhead but the message has to be teased out. I prefer the chart of raw scores in this case.


Let me now address another objection, which pops up every time I convert a bar chart to a line chart (a type of Bumps chart, which has been called slope graphs by Tufte followers). The objection is that the line chart causes readers to see a trend when there isn't one.

So let me make the case one more time.

Start with the original column chart. If you want to know that Hispanic students have seen progress in their 4th grade math scores grind to a halt, you have to shake your head involuntarily in the following manner:


(Notice how the legend interferes with your line of sight.)

By the time you finish interpreting this graphic, you would have shaken your head in all of the following directions:


Now, I am a scavenger. I collect all these lines and rearrange them into four panels of charts. That becomes the chart I showed in yesterday's post. All I have done is to bring to the surface the involuntary motions readers were undertaking. I didn't invent any trends.

Involuntary head-shaking is probably not an intended consequence of data visualization

This chart is in the Sept/Oct edition of Harvard Magazine:


Pretty standard fare. It even is Tufte-esque in the sparing use of axes, labels, and other non-data-ink.

Does it bug you how much work you need to do to understand this chart?

Here is the junkchart version:


In the accompanying article, the journalist declared that student progress on NAEP tests came to a virtual standstill, and this version highlights the drop in performance between the two periods, as measured by these "gain scores."

The clarity is achieved through proximity as well as slopes.

The column chart form has a number of deficiencies when used to illustrate this data. It requires too many colors. It induces involuntary head-shaking.

Most unforgivably, it leaves us with a puzzle: does the absence of a column mean no progress or unknown?


PS. The inclusion of 2009 on both time periods is probably an editorial oversight.



Political winds and hair styling

Washington Post (link) and New York Times (link) published dueling charts last week, showing the swing-swang of the political winds in the U.S. Of course, you know that the pendulum has shifted riotously rightward towards Republican red in this election.

The Post focused its graphic on the urban / not urban division within the country:


Over Twitter, Lazaro Gamio told me they are calling these troll-hair charts. You certainly can see the imagery of hair blowing with the wind. In small counties (right), the wind is strongly to the right. In urban counties (left), the straight hair style has been in vogue since 2008. The numbers at the bottom of the chart drive home the story.

Previously, I discussed the Two Americas map by the NY Times, which covers a similar subject. The Times version emphasizes the geography, and is a snapshot while the Post graphic reveals longer trends.

Meanwhile, the Times published its version of a hair chart.


This particular graphic highlights the movement among the swing states. (Time moves bottom to top in this chart.) These states shifted left for Obama and marched right for Trump.

The two sets of charts have many similarities. They both use curvy lines (hair) as the main aesthetic feature. The left-right dimension is the anchor of both charts, and sways to the left or right are important tropes. In both presentations, the charts provide visual aid, and are nicely embedded within the story. Neither is intended as exploratory graphics.

But the designers diverged on many decisions, mostly in the D(ata) or V(isual) corner of the Trifecta framework.


The Times chart is at the state level while the Post uses county-level data.

The Times plots absolute values while the Post focuses on relative values (cumulative swing from the 2004 position). In the Times version, the reader can see the popular vote margin for any state in any election. The middle vertical line is keyed to the electoral vote (plurality of the popular vote in most states). It is easy to find the crossover states and times.

The Post's designer did some data transformations. Everything is indexed to 2004. Each number in the chart is the county's current leaning relative to 2004. Thus, left of vertical means said county has shifted more blue compared to 2004. The numbers are cumulative moving top to bottom. If a county is 10% left of center in the 2016 election, this effect may have come about this year, or 4 years ago, or 8 years ago, or some combination of the above. Again, left of center does not mean the county voted Democratic in that election. So, the chart must be read with some care.

One complaint about anchoring the data is the arbitrary choice of the starting year. Indeed, the Times chart goes back to 2000, another arbitrary choice. But clearly, the two teams were aiming to address slightly different variations of the key question.

There is a design advantage to anchoring the data. The Times chart is noticeably more entangled than the Post chart. There is far more criss-crossing. This is particularly glaring given that the Times chart contains many fewer lines than the Post chart, due to state versus county.

Anchoring the data to a starting year has the effect of combing one's unruly hair. Mathematically, they are just shifting the lines so that they start at the same location, without altering the curvature. Of course, this is double-edged: the re-centering means the left-blue / right-red interpretation is co-opted.
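The anchoring step is easy to sketch in Python. The margins below are made up for illustration (not the Post's data), and the sign convention and the `anchor` helper are my own assumptions.

```python
# Hypothetical vote margins by election year, in percentage points.
# Assumed convention: negative = Democratic lead, positive = Republican lead.
margins = {
    "County X": {2004: -20, 2008: -25, 2012: -22, 2016: -15},
    "County Y": {2004: 10, 2008: 8, 2012: 12, 2016: 25},
}


def anchor(series, base_year=2004):
    """Shift a series so that it reads zero in the base year."""
    base = series[base_year]
    return {year: value - base for year, value in series.items()}


anchored = {name: anchor(series) for name, series in margins.items()}

for name, series in anchored.items():
    print(name, series)
```

Every line now starts at zero and keeps its shape, since each value is shifted by the same amount. In this made-up example, County X ends up 5 points right of center in 2016 even though it still leans Democratic by 15 points, which is exactly the co-opted interpretation described above.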

On the Times chart, they used a different coping strategy. Each version of their charts has a filter: they highlight the set of lines to demonstrate different vignettes: the swing states moved slightly to the right, the Republican states marched right, and the Democratic states also moved right. Without these filters, the readers would be winking at the Times's bad-hair day.


Another decision worth noting: the direction of time. The Post's choice of top to bottom seems more natural to me than the Times's reverse order but I am guessing some of you may have different inclinations.

Finally, what about the thickness of the lines? The Post encoded population (voter) size while the Times used electoral votes. This decision is partly driven by the choice of state versus county level data.

One can consider electoral votes as a kind of log transformation. The effect of electorizing the popular vote is to pull the extreme values to the center. This significantly simplifies the designer's life. To wit, in the Post chart (shown below), they have to apply a filter to highlight key counties, and you notice that those lines are so thick that all the other counties become barely visible.



Bumps chart goes mainstream

It’s a happy day when one of my favorite chart types, the Bumps chart, makes it to the Wall Street Journal, and the front page no less! (Link to article)

This chart shows the ground shifting in global auto production in the next five years, with Mexico and India gaining in rank over Germany and South Korea.


The criss-crossing of lines is key to reading these charts. A crossing ("bump") necessarily means one entity has surpassed the other entity in absolute terms, even though we are looking at the relative rank.
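A short Python sketch shows how the ranks and the crossings fall out of the absolute values. The countries and production volumes are hypothetical, not the figures behind the WSJ chart.

```python
# Hypothetical production volumes in millions of vehicles (illustrative only).
production = {
    "Country A": {"start": 5.9, "end": 6.0},
    "Country B": {"start": 4.5, "end": 4.4},
    "Country C": {"start": 3.4, "end": 4.8},
}


def ranks(period):
    """Rank countries by volume in the given period (1 = largest producer)."""
    ordered = sorted(production, key=lambda c: production[c][period], reverse=True)
    return {country: position for position, country in enumerate(ordered, start=1)}


start_ranks, end_ranks = ranks("start"), ranks("end")

# A crossing ("bump"): a was ranked ahead of b at the start and behind b at the end.
crossings = [
    (a, b)
    for a in production
    for b in production
    if start_ranks[a] < start_ranks[b] and end_ranks[a] > end_ranks[b]
]

print("start ranks:", start_ranks)
print("end ranks:", end_ranks)
print("crossings:", crossings)
```

The only crossing in this made-up data is Country C overtaking Country B, which can happen only because C's absolute volume ends up above B's.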

Of course, there is no Swiss Army Knife of charts. This graphic provides no clue as to the share of world production. It's quite possible that the first few countries account for the majority of the world's production, so that the rank shifts toward the bottom of the chart are relatively inconsequential. Wikipedia says that the top player (China) produces a quarter of the world's vehicles, and twice as many as the next biggest producer. Any country ranked below 4 accounts for less than 5 percent of global volume.


I made a few minor edits in this version below. For example, it's unclear why both 2014 and 2015 are depicted since there were no rank shifts and also the 2015 data is a projection. (I don't have any problem with the two red lines even though I didn't carry over the color scheme.)


A startling chart about income inequality, with interpretative difficulties

Reader Robbi B. submitted the following chart posted to Twitter by Branko Milanovic:


The chart took a little time to figure out. This isn't a bad chart. Robbi wondered if there are alternative ways to plot this information.

The U.S. population is divided into percentiles across the horizontal axis, presumably based on the income distribution in some year (I'm guessing 2007, the start of the recession). For each percentile of people, the real per capita growth (decline) in disposable income is computed for two periods: the blue line shows the decline during the recession (2007-2010) and the orange shows the growth (in some cases further decline) during the recovery (2010-2013).

This chart draws attention to the two tails of the distribution, namely, the bottom 10 percent, and the top 5 percent. At one level, these two groups (excepting the bottom 2%) experienced the best of the recovery. But then, they also suffered the worst declines during the recession.


Here is one possible view of the same data, in a format with which I have been experimenting recently. You might call this a Bumps panel or a slopegraph panel.


The slopes draw attention to the relative magnitude of the declines and the subsequent recoveries. (I thinned the middle 80% substantially because there isn't much going on in that part of the dataset.) If I had more time, I'd have chosen a different color instead of grayscale for those lines.

I ignored any questions I have about the underlying data. How is disposable income defined and measured? Does it carry the same meaning across the entire spectrum of income distribution? etc. (Milanovic points to the Survey of Consumer Finances as the source.)


One reason for the reading difficulty is the absence of a reference point. It's unclear how to judge the orange line. Two answers are suggestive (but problematic). One is the zero line: which segments of the population experienced a recovery and which didn't? Another is the mirror image of the blue line: how much of what one lost during the recession did one recover by 2013 (roughly speaking)?

Both of these easy interpretations worry me because they carry an assumption of equal guilt (blue line) and/or equal spoils (orange line). It is very possible that the unwarranted risk-taking or fraud was not evenly spread out amongst the percentiles, and if so, it is impossible to judge whether the distribution exhibited in the blue line was "fair". It is then also impossible to know if the distribution contained in the orange line was "fair". Indeed, if the orange line mirrored the blue line, then all segments recovered similarly what they lost--this would only make sense if all segments are equally culpable in the recession.