Quick example of layering

The New York Times uses layering to place the Alabama tornadoes in context. (link)

Today's wide availability of detailed data allows designers to create dense data graphics like this:


The graphic shows the starting and ending locations and trajectory of each tornado, as well as the wind speeds (shown in color).

Too much data slows down our understanding of the visual message. The remedy is to subtract. Here is a second graphic that focuses only on the strongest tornadoes (graded 4 or 5 on a 5-point scale):


Another goal of the data visualization is to place in context the tornado that hit Beauregard:


The area around Beauregard is not typically visited by strong tornadoes. Also, the tornadoes were strong but there have been stronger ones.


The designer unfolds the story in three stages. There are no knobs and sliders and arrows, and that's a beauty. It's usually not a good idea to make readers find the story themselves.

NYT hits the trifecta with this market correction chart

Yesterday, in the front page of the Business section, the New York Times published a pair of charts that perfectly captures the story of the ongoing turbulence in the stock market.

Here is the first chart:


Most market observers are very concerned about the S&P entering "correction" territory, which the industry arbitrarily defines as a drop of 10% or more from a peak. This corresponds to the shortest line on the above chart.

The chart promotes a longer-term reflection on the recent turbulence, using two reference points: the index has returned to the level even with that at the start of 2018, and about 16 percent higher since the beginning of 2017.

This is all done tastefully in a clear, understandable graphic.

Then, in a bit of a rhetorical flourish, the bottom of the page makes another point:


When viewed back to a 10-year period, this chart shows that the S&P has exploded by 300% since 2009.

A connection is made between the two charts via the color of the lines, plus the simple, effective annotation "Chart above".

The second chart adds even more context, through vertical bands indicating previous corrections (drops of at least 10%). These moments are connected to the first graphic via the beige color. The extra material conveys the message that the market has survived multiple corrections during this long bull period.

Together, the pair of charts addresses a pressing current issue, and presents a direct, insightful answer in a simple, effective visual design, so it hits the Trifecta!


There are a couple of interesting challenges related to connecting plots within a multiple-plot framework.

While the beige color connects the concept of "market correction" in the top and bottom charts, it can also be a source of confusion. The orientation and the visual interpretation of those bands differ. The first chart uses one horizontal band while the chart below shows multiple vertical bands. In the first chart, the horizontal band refers to a definition of correction while in the second chart, the vertical bands indicate experienced corrections.

Is there a solution in which the bands have the same orientation and same meaning?


These graphs solve a visual problem concerning the visualization of growth over time. Growth rates are anchored to some starting time. A ten-percent reduction means nothing unless you are told ten-percent of what.

Using different starting times as reference points, one gets different values of growth rates. With highly variable series of data like stock prices, picking starting times even a day apart can lead to vastly different growth rates.

The designer here picked several obvious reference times, and superimposes multiple lines on the same plotting canvass. Instead of having four lines on one chart, we have three lines on one, and four lines on the other. This limits the number of messages per chart, which speeds up cognition.

The first chart depicts this visual challenge well. Look at the start of 2018. This second line appears as if you can just reset the start point to 0, and drag the remaining portion of the line down. The part of the top line (to the right of Jan 2018) looks just like the second line that starts at Jan 2018.


However, a closer look reveals that the shape may be the same but the magnitude isn't. There is a subtle re-scaling in addition to the re-set to zero.

The same thing happens at the starting moment of the third line. You can't just drag the portion of the first or second line down - there is also a needed re-scaling.

Crazy rich Asians inspire some rich graphics

On the occasion of the hit movie Crazy Rich Asians, the New York Times did a very nice report on Asian immigration in the U.S.

The first two graphics will be of great interest to those who have attended my free dataviz seminar (coming to Lyon, France in October, by the way. Register here.), as it deals with a related issue.

The first chart shows an income gap widening between 1970 and 2016.


This uses a two-lines design in a small-multiples setting. The distance between the two lines is labeled the "income gap". The clear story here is that the income gap is widening over time across the board, but especially rapidly among Asians, and then followed by whites.

The second graphic is a bumps chart (slopegraph) that compares the endpoints of 1970 and 2016, but using an "income ratio" metric, that is to say, the ratio of the 90th-percentile income to the 10th-percentile income.


Asians are still a key story on this chart, as income inequality has ballooned from 6.1 to 10.7. That is where the similarity ends.

Notice how whites now appears at the bottom of the list while blacks shows up as the second "worse" in terms of income inequality. Even though the underlying data are the same, what can be seen in the Bumps chart is hidden in the two-lines design!

In short, the reason is that the scale of the two-lines design is such that the small numbers are squashed. The bottom 10 percent did see an increase in income over time but because those increases pale in comparison to the large incomes, they do not show up.

What else do not show up in the two-lines design? Notice that in 1970, the income ratio for blacks was 9.1, way above other racial groups.

Kudos to the NYT team to realize that the two-lines design provides an incomplete, potentially misleading picture.


The third chart in the series is a marvellous scatter plot (with one small snafu, which I'd get t0).


What are all the things one can learn from this chart?

  • There is, as expected, a strong correlation between having college degrees and earning higher salaries.
  • The Asian immigrant population is diverse, from the perspectives of both education attainment and median household income.
  • The largest source countries are China, India and the Philippines, followed by Korea and Vietnam.
  • The Indian immigrants are on average professionals with college degrees and high salaries, and form an outlier group among the subgroups.

Through careful design decisions, those points are clearly conveyed.

Here's the snafu. The designer forgot to say which year is being depicted. I suspect it is 2016.

Dating the data is very important here because of the following excerpt from the article:

Asian immigrants make up a less monolithic group than they once did. In 1970, Asian immigrants came mostly from East Asia, but South Asian immigrants are fueling the growth that makes Asian-Americans the fastest-expanding group in the country.

This means that a key driver of the rapid increase in income inequality among Asian-Americans is the shift in composition of the ethnicities. More and more South Asian (most of whom are Indians) arrivals push up the education attainment and household income of the average Asian-American. Not only are Indians becoming more numerous, but they are also richer.

An alternative design is to show two bubbles per ethnicity (one for 1970, one for 2016). To reduce clutter, the smaller ethnicites can be aggregated into Other or South Asian Other. This chart may help explain the driver behind the jump in income inequality.






Visualizing the Thai cave rescue operation

The Thai cave rescue was a great story with a happy ending. It's also one that lends itself to visualization. A good visualization can explain the rescue operation more efficiently than mere words.

A good visual should bring out the most salient features of the story, such as:

  • Why the operation was so daunting?
  • What were the tactics used to overcome those challenges?
  • How long did it take?
  • What were the specific local challenges that must be overcome?
  • Were there any surprises?

In terms of what made the rescue challenging, some of the following are pertinent:

  • How far in they were?
  • How deep were they trapped?
  • How much of the caves were flooded? Why couldn't they come out by themselves?
  • How much headroom was there in different sections of the cave "tunnel"?

There were many attempts at visualizing the Thai cave rescue operation. The best ones I saw were: BBC (here, here), The New York Times (here), South China Morning Post (here) and Straits Times (here). It turns out each of these efforts focuses on some of the aspects above, and you have to look at all of them to get the full picture.


BBC's coverage began with a top-down view of the route of the rescue, which seems to be the most popular view adopted by news organizations. This is easily understood because of the standard map aesthetic.


The BBC map is missing a smaller map of Thailand to place this in a geographical context.

While this map provides basic information, it doesn't address many of the elements that make the Thai cave rescue story compelling. In particular, human beings are missing from this visualization. The focus is on the actions ("diving", "standing"). This perspective also does not address the water level, the key underlying environmental factor.


Another popular perspective is the sideway cross-section. The Straits Times has one:

Straittimes_thai rescue_part

The excerpt of the infographic presents a nice collection of data that show the effort of the rescue. The sideway cross-sectional section shows the distance and the up-and-down nature of the journey, the level of flooding along the route, plus a bit about the headroom available at different points. Most of these diagrams bring out the "horizontal" distance but somehow ignore the "vertical" distance. One possibility is that the real trajectory is curvy - but if we can straighten out the horizontal, we should be able to straighten out the vertical too.

The NYT article gives a more detailed view of the same perspective, with annotations that describe key moments along the rescue route.


If, like me, you like to place humans into this picture, then you have to go back to the Straits Times, where they have an expanded version of the sideway cross-section.


This is probably my most favorite single visualization of the rescue operation.

There are better cartoons of the specific diving actions, though. For example, the BBC has this visual that shows the particularly narrow part of the route, corresponding to the circular inset in the Straits Times version above.


The drama!

NYT also has a set of cartoons. Here's one:



There is one perspective that curiously has been underserved in all of the visualizations - this is the first-person perspective. Imagine the rescuer (or the kids) navigating the rescue route. It's a cross-section from the front, not from the side.

Various publications try to address this by augmenting the top-down route view with sporadic cross-sectional diagrams. Recall the first map we showed from the BBC. On the right column are little annotations of this type (here):


I picked out this part of the map because it shows that the little human figure serves two potentially conflicting purposes. In the bottom diagram, the figurine shows that there is limited headroom in this part of the cave, plus the actual position of the figurine on the ledge conveys information about where the kids were. However, on the top cross-section, the location of the figure conveys no information; the only purpose of the human figure is to show how tall the cave is at that site.

The South China Morning Post (here - site appears to be down when I wrote this) has this wonderful animation of how the shape of the headroom changed as they navigated the route. Please visit their page to see the full animation. Here are two screenshots:



This little clip adds a lot to the story! It'd be even better if the horizontal timeline at the bottom is replaced by the top-down route map.

Thank you all the various dataviz teams for these great efforts.




Checking the scale on a chart

Dot maps, and by extension, bubble maps are popular options for spatial data; but the scale of these maps can be deceiving. Here is an example of a poorly-scaled dot map:

Farm-Dot Density

The U.S. was primarily an agrarian economy in 1997, if you believe your eyes.

Here is a poorly-scaled bubble map:

image from junkcharts.typepad.com

New Yorkers have all become Citibikers, if you believe what you see.

Last week, I saw a nice dot map embedded inside this New York Times Graphics feature on the destruction of the Syrian city of Raqqa.


Before I conclude that the destruction was broadly felt, I'd like to check the scale on the map to make sure it doesn't have the problem seen above. What is helpful here is the scale provided on the map itself.


That line segment representing a quarter mile fits about 15 dots side by side. Then, I found out that a Manhattan avenue (longer) block is roughly a quarter mile. That means the map places about 15 buildings to an avenue block. In my experience, that sounds about right: I'd imagine 15-20 buildings per block.

So I'm convinced that the designer chose an appropriate scale to display the data. It is actually true that the entire city of Raqqa was pretty much annihilated by U.S. bombs.

Two nice examples of interactivity

Janie on Twitter pointed me to this South China Morning Post graphic showing off the mighty train line just launched between north China and London (!)


Scrolling down the page simulates the train ride from origin to destination. Pictures of key regions are shown on the left column, as well as some statistics and other related information.

The interactivity has a clear purpose: facilitating cross-reference between two chart forms.

The graphic contains a little oversight ... The label for the key city of Xian, referenced on the map, is missing from the elevation chart on the left here:



I also like the way New York Times handled interactivity to this chart showing the rise in global surface temperature since the 1900s. The accompanying article is here.


When the graph is loaded, the dots get printed from left to right. That's an attention grabber.

Further, when the dots settle, some years sink into the background, leaving the orange dots that show the years without the El Nino effect. The reader can use the toggle under the chart title to view all of the years.

This configuration is unusual. It's more common to show all the data, and allow readers to toggle between subsets of the data. By inverting this convention, it's likely few readers need to hit that toggle. The key message of the story concerns the years without El Nino, and that's where the graphic stands.

This is interactivity that succeeds by not getting in the way. 




A look at how the New York Times readers look at the others


The above chart, when it was unveiled at the end of November last year, got some mileage on my Twitter feed so it got some attention. A reader, Eric N., didn't like it at all, and I think he has a point.

Here are several debatable design decisions.

The chart uses an inverted axis. A tax cut (negative growth) is shown on the right while a tax increase is shown on the left. This type of inversion has gotten others in trouble before, namely, the controversy over the gun deaths chart (link). The green/red color coding is used to signal the polarity although some will argue this is bad for color-blind readers. The annotation below the axis is probably the reason why I wasn't confused in the first place but the other charts further down the page do not repeat the annotation, and that's where the interpretation of -$2,000 as a tax increase is unnatural!

The chart does not aggregate the data. It plots 25,000 households with 25,000 points. Because of the variance of the data, it's hard to judge trends. It's easy enough to see that there are more green dots than red but how many more? 10 percent, 20 percent, 40 percent? It's also hard to answer any specific questions, say, about households with a certain range of incomes. There are various ways to aggregate the data, such as heatmaps, histograms, and so on.

For those used to looking at scientific charts, the x- and y-axes are reversed. By convention, we'd have put the income ranges on the horizontal axis and the tax changes (the "outcome" variable) on the vertical axis.


The text labels do not describe the data patterns on the chart so much as they offer additional information. To see this, remove the labels as I have done below. Try adding the labels based on what is shown on the chart.


Perhaps it's possible to illustrate those insights with a set of charts.


While reading this chart, I kept wondering how those 25,000 households were chosen. This is a sample of  households. The methodology is explained in a footnote, which describes the definition of "middle class" but unfortunately, they forgot to tell us how the 25,000 households were chosen from all such middle-class households.


The decision to omit the households with income below $40,000 needs more explanation as it usurps the household-size adjustment. Also, it's not clear that the impact of the tax bill on the households with incomes between $20-40K can be assumed the same as for those above $40K.

Are the 25,000 households is a simple random sample of all "middle class" households or are they chosen in some ways to represent the relative counts? It's also useful to know if they applied the $40K cutoff before or after selecting the 25,000 households. 

Ironically, the media kit of the Times discloses an affluent readership with median household income of almost $190K so it appears that the majority of readers are not represented in the graphic at all!


Storm story, a masterpiece

The visual story published by the New York Times on hurricane Irma is a masterpiece. See the presentation here.

The story starts with the standard presentation of the trajectories of past hurricane on a map:


Maps are great at conveying location and direction but much is lost in this rendering - wind speeds, time, strength, energy, to name but a few.

The Times then switches to other chart forms to convey some of the other data. A line chart is used to convey the strength of wind speeds as the storms shake through the Atlantic. Some kind of approximation is used to straighten the trajectories along an east-west orientation.


The key insight here is how strong Irma was pretty far out in the Atlantic. The lines in the background can be brought to live by clicking on them. This view omits some details - the passage of time is ignored, and location has been reduced to one dimension.

The display then switches again, and this time it shows time and wind speed.


This shows Irma's strength, sustaining Category 5 level windss for three days. This line chart ignores location completely.

Finally, a composite metric called cyclone energy is introduced.


This chart also ignores location. It does show Irma as a special storm. The storm that has reached the maximum energy by far is Ivan. Will Irma beat that standard? I am not so sure.

Each chart form has limitations. The use of multiple charts helps convey a story from multiple perspectives. A very nice example indeed.


Sorting out what's meaningful and what's not

A few weeks ago, the New York Times Upshot team published a set of charts exploring the relationship between school quality, home prices and commute times in different regions of the country. The following is the chart for the New York/New Jersey region. (The article and complete data visualization is here.)


This chart is primarily a scatter plot of home prices against school quality, which is represented by average test scores. The designer wants to explore the decision to live in the so-called central city versus the decision to live in the suburbs, hence the centering of the chart about New York City. Further, the colors of the dots represent the average commute times, which are divided into two broad categories (under/over 30 minutes). The dots also have different sizes, which I presume measures the populations of each district (but there is no legend for this).

This data visualization has generated some negative reviews, and so has the underlying analysis. In a related post on the sister blog, I discuss the underlying statistical issues. For this post, I focus on the data visualization.


One positive about this chart is the designer has a very focused question in mind - the choice between living in the central city or living in the suburbs. The line scatter has the effect of highlighting this particular question.

Boy, those lines are puzzling.

Each line connects New York City to a specific school district. The slope of the line is, nominally, the trade-off between home price and school quality. The slope is the change in home prices for each unit shift in school quality. But these lines don't really measure that tradeoff because the slopes span too wide a range.

The average person should have a relatively fixed home-price-to-school-quality trade-off. If we could estimate this average trade-off, it should be represented by a single slope (with a small cone of error around it). The wide range of slopes actually undermines this chart, as it demonstrates that there are many other variables that factor into the decision. Other factors are causing the average trade-off coefficient to vary so widely.


The line scatter is confusing for a different reason. It reminds readers of a flight route map. For example:


The first instinct may be to interpret the locations on the home-price-school-quality plot as geographical. Such misinterpretation is reinforced by the third factor being commute time.

Additionally, on an interactive chart, it is typical to hide the data labels behind mouseovers or clicks. I like the fact that the designer identifies some interesting locales by name without requiring a click. However, one slight oversight is the absence of data labels for NYC. There is nothing to click on to reveal the commute/population/etc. data for central cities.


In the sister blog post, I mentioned another difficulty - most of the neighborhoods are situated to the right and below New York City, challenging the notion of a "trade-off" between home price and school quality. It appears as if most people can spend less on housing and also send kids to better schools by moving out of NYC.

In the New York region, commute times may be the stronger factor relative to school quality. Perhaps families chose NYC because they value shorter commute times more than better school quality. Or, perhaps the improvement in school quality is not sufficient to overcome the negative of a much longer commute. The effect of commute times is hard to discern on the scatter plot as it is coded into the colors.


A more subtle issue can be seen when comparing San Francisco and Boston regions:


One key insight is that San Francisco homes are on average twice as expensive as Boston homes. Also, the variability of home prices is much higher in San Francisco. By using the same vertical scale on both charts, the designer makes this insight clear.

But what about the horizontal scale? There isn't any explanation of this grade-level scale. It appears that the central cities have close to average grade level in each chart so it seems that each region is individually centered. Otherwise, I'd expect to see more variability in the horizontal dots across regions.

If one scale is fixed across regions, and the other scale is adapted to each region, then we shouldn't compare the slopes across regions. The fact that the lines are generally steeper in the San Francisco chart may be an artifact of the way the scales are treated.


Finally, I'd recommend aggregating the data, and not plot individual school districts. The obsession with magnifying little details is a Big Data disease. On a chart like this, users are encouraged to click on individual districts and make inferences. However, as I discussed in the sister blog (link), most of the differences in school quality shown on these charts are not statistically meaningful (whereas the differences on the home-price scale are definitely notable). 


If you haven't already, see this related post on my sister blog for a discussion of the data analysis.





Attractive, interactive graphic challenges lazy readers

The New York Times spent a lot of effort making a nice interactive graphical feature to accompany their story about Uber's attempt to manipulate its drivers. The article is here. Below is a static screenshot of one of the graphics.


The illustrative map at the bottom is exquisite. It has Uber cars driving around, it has passengers waiting at street corners, the cars pick up passengers, new passengers appear, etc. There are also certain oddities: all the cars go at the same speed, some strange things happen when cars visually run into each other, etc.

This interactive feature is mostly concerned with entertainment. I don't think it is possible to infer either of the two metrics listed above the chart by staring at the moving Uber cars. The metrics are the percentage of Uber drivers who are idle and the average number of minutes that a passenger waits. Those two metrics are crucial to understanding the operational problem facing Uber planners. You can increase the number of Uber cars on the road to reduce average waiting time but the trade-off is a higher idle rate among drivers.


One of the key trends in interactive graphics at the Times is simplication. While a lot of things are happening behind the scenes, there is only one interactive control. The only thing the reader can control is the number of drivers in the grid.

As one of the greatest producers of interactive graphics, I trust that they know what they are doing. In fact, this article describes some comments made by Gregor Aisch, who works at the Times. The gist is: very few readers play with their interactive graphics. Someone else said, "If you make a tooltip or rollover, assume no one will ever see it." I also have heard someone say (hope this is not merely a voice in my own head): "Every extra button or knob you place on the graphic, you lose another batch of readers." This might be called the law of the interactive knob, analogous to the law of the printed equation, in the realm of popular book publishing, which stipulates that every additional equation you print in a book, you lose another batch of readers.

(Note, however, that we are talking about graphics for communications here, not exploratory graphics.)


Several years ago, I introduced the concept of "return on effort" in this blog post. Most interactive graphics are high effort to produce. The question is whether there is enough reward for the readers.