A gem among the snowpack of Olympics data journalism

It's not often I come across a piece of data journalism that pleases me so much. Here it is: the "Happy 700" article by the Washington Post is amazing.



When data journalism and dataviz are done right, the designers have made good decisions. Here are some of the key elements that make this article work:

(1) Unique

The topic is timely but timeliness heightens both the demand and supply of articles, which means only the unique and relevant pieces get the readers' attention.

(2) Fun

The tone is light-hearted. It's a fun read. It's also a little bit informative, as when they describe the towns that few have heard of. The notion is slightly silly but the reader won't care.

(3) Data

It's always a challenge to make data come alive, and these authors succeeded. Most of the data work involves finding, collecting and processing the data. There isn't any sophisticated analysis. But the piece is a powerful demonstration that complex analysis is not always necessary.

(4) Organization

The structure of the data is three criteria (elevation, population, and terrain) by cities. A typical way of showing such data might be an annotated table, or a Bumps-type chart, grouped columns, and so on. All these formats try to stuff the entire dataset onto one chart. The designers chose to highlight one variable at a time, cumulatively, on three separate maps. This presentation fits perfectly with the flow of the writing. 

(5) Details

The execution involves some smart choices. I am a big fan of legend/axis labels that are informative. For example, note that the legend doesn't say "Elevation in Meters":


The color scheme across all three maps shows a keen awareness of background/foreground concerns. 

Two nice examples of interactivity

Janie on Twitter pointed me to this South China Morning Post graphic showing off the mighty train line just launched between north China and London (!)


Scrolling down the page simulates the train ride from origin to destination. Pictures of key regions are shown on the left column, as well as some statistics and other related information.

The interactivity has a clear purpose: facilitating cross-reference between two chart forms.

The graphic contains a little oversight ... The label for the key city of Xian, referenced on the map, is missing from the elevation chart on the left here:



I also like the way the New York Times handled interactivity in this chart showing the rise in global surface temperature since the 1900s. The accompanying article is here.


When the graph is loaded, the dots get printed from left to right. That's an attention grabber.

Further, when the dots settle, some years sink into the background, leaving the orange dots that show the years without the El Nino effect. The reader can use the toggle under the chart title to view all of the years.

This configuration is unusual. It's more common to show all the data, and allow readers to toggle between subsets of the data. With this convention inverted, it's likely that few readers need to hit that toggle. The key message of the story concerns the years without El Nino, and that's where the graphic stands.

This is interactivity that succeeds by not getting in the way. 




Choosing the right metric reveals the story behind the subway mess in NYC

I forgot who sent this chart to me - it may have been a Twitter follower. The person complained that the following chart exaggerated how much trouble the New York mass transit system (MTA) has been facing in 2017, because of the choice of the vertical axis limits.


This chart is vintage Excel, using Excel defaults. I find this style ugly and uninviting. But the chart does contain some good analysis. The analyst made two smart moves: the chart controls for month-to-month seasonality by plotting the data for the same month over successive years; and the designation "12 month averages" really means moving averages with a window size of 12 months - this has the effect of smoothing out the short-term fluctuations to reveal the longer-term trend.
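To make the smoothing concrete, here is a minimal sketch of a trailing 12-month moving average. The function name and the ridership numbers are made up for illustration; they are not the MTA figures behind the chart, and the analyst may have computed the window differently.

```python
def trailing_average(values, window=12):
    """Trailing moving average; the first window-1 positions are None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


# Hypothetical monthly ridership (millions): a seasonal year, then the
# same pattern shifted down by 1 million every month in the second year.
year1 = [140, 135, 150, 148, 152, 149, 141, 138, 147, 155, 146, 142]
year2 = [v - 1 for v in year1]
smoothed = trailing_average(year1 + year2)
```

In this toy series the raw numbers bounce around by more than 10 million from month to month, while the smoothed series slides down steadily by a fraction of a million each month. That is the same effect that lets the red line in the chart expose a slow, sustained decline.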

The red line is very alarming as it depicts a sustained negative trend over the entire year of 2017, even though the actual decline is a small percentage.

If this chart showed up on a business dashboard, the CEO would be extremely unhappy. Slow but steady declines are the most difficult trends to deal with because they cannot be explained by one-time impacts. Until the analytics department figures out what the underlying cause is, the decline is very difficult to curtail, and with each monthly report, the sense of despair grows.

Because the base number of passengers in the New York transit system is so high, using percentages to think about the shift in volume underplays the message. It's better to use actual millions of passengers lost. That's what I did in my version of this chart:


The quantity depicted is the unexpected loss of revenue passengers, measured against a forecast. The forecast I used is the average of the past two years' passenger counts. Above the zero line means out-performing the forecast but of course, in this case, since October 2016, the performance has dipped ever farther below the forecast. By April 2017, the gap had widened to over 5 million passengers. That's a lot of lost customers and lost revenues, regardless of percent!
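The arithmetic behind this metric can be sketched in a few lines. I am assuming here that the forecast for a given month is the mean of the same month's count in the two prior years; the function name and the numbers are hypothetical, chosen only to show how a small percentage can hide a large absolute loss.

```python
def forecast_gap(actual, one_year_ago, two_years_ago):
    """Return (absolute gap, relative gap) against a two-year-average forecast."""
    forecast = (one_year_ago + two_years_ago) / 2
    gap = actual - forecast  # negative means under-performing the forecast
    return gap, gap / forecast


# Hypothetical counts in millions of passengers, not the actual MTA data.
gap, pct = forecast_gap(actual=145.0, one_year_ago=151.0, two_years_ago=149.0)
```

With these made-up inputs the gap is 5 million passengers below forecast, yet the relative shortfall is only about 3 percent, which is why the percentage view underplays the message.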

The biggest headache is investigating the cause of this decline. Most likely, it is a combination of factors.

Sorting out what's meaningful and what's not

A few weeks ago, the New York Times Upshot team published a set of charts exploring the relationship between school quality, home prices and commute times in different regions of the country. The following is the chart for the New York/New Jersey region. (The article and complete data visualization are here.)


This chart is primarily a scatter plot of home prices against school quality, which is represented by average test scores. The designer wants to explore the decision to live in the so-called central city versus the decision to live in the suburbs, hence the centering of the chart about New York City. Further, the colors of the dots represent the average commute times, which are divided into two broad categories (under/over 30 minutes). The dots also have different sizes, which I presume measure the population of each district (but there is no legend for this).

This data visualization has generated some negative reviews, and so has the underlying analysis. In a related post on the sister blog, I discuss the underlying statistical issues. For this post, I focus on the data visualization.


One positive about this chart is the designer has a very focused question in mind - the choice between living in the central city or living in the suburbs. The line scatter has the effect of highlighting this particular question.

Boy, those lines are puzzling.

Each line connects New York City to a specific school district. The slope of the line is, nominally, the trade-off between home price and school quality. The slope is the change in home prices for each unit shift in school quality. But these lines don't really measure that tradeoff because the slopes span too wide a range.

The average person should have a relatively fixed home-price-to-school-quality trade-off. If we could estimate this average trade-off, it should be represented by a single slope (with a small cone of error around it). The wide range of slopes actually undermines this chart, as it demonstrates that there are many other variables that factor into the decision. Other factors are causing the average trade-off coefficient to vary so widely.
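To see what a wide range of slopes looks like in numbers, here is a small sketch that computes the slope of each line from the central city to a district. The coordinates and district names are invented for illustration (school quality in grade levels relative to average, home price in thousands of dollars); they are not taken from the Upshot data.

```python
def slopes_from_center(center, districts):
    """Slope (change in price per unit of school quality) of each center-to-district line."""
    cx, cy = center
    slopes = {}
    for name, (x, y) in districts.items():
        if x == cx:
            continue  # vertical line: slope undefined, so skip it
        slopes[name] = (y - cy) / (x - cx)
    return slopes


# Hypothetical points: (school quality, home price in $ thousands).
center = (0.0, 500.0)
districts = {
    "District A": (1.0, 400.0),
    "District B": (2.0, 900.0),
    "District C": (0.5, 450.0),
}
slopes = slopes_from_center(center, districts)
spread = max(slopes.values()) - min(slopes.values())
```

In this toy example the slopes run from -100 to +200 per grade level. If there were a single underlying trade-off, the slopes would cluster around one value rather than changing sign.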


The line scatter is confusing for a different reason. It reminds readers of a flight route map. For example:


The first instinct may be to interpret the locations on the home-price-school-quality plot as geographical. Such misinterpretation is reinforced by the third factor being commute time.

Additionally, on an interactive chart, it is typical to hide the data labels behind mouseovers or clicks. I like the fact that the designer identifies some interesting locales by name without requiring a click. However, one slight oversight is the absence of data labels for NYC. There is nothing to click on to reveal the commute/population/etc. data for central cities.


In the sister blog post, I mentioned another difficulty - most of the neighborhoods are situated to the right and below New York City, challenging the notion of a "trade-off" between home price and school quality. It appears as if most people can spend less on housing and also send kids to better schools by moving out of NYC.

In the New York region, commute times may be the stronger factor relative to school quality. Perhaps families chose NYC because they value shorter commute times more than better school quality. Or, perhaps the improvement in school quality is not sufficient to overcome the negative of a much longer commute. The effect of commute times is hard to discern on the scatter plot as it is coded into the colors.


A more subtle issue can be seen when comparing San Francisco and Boston regions:


One key insight is that San Francisco homes are on average twice as expensive as Boston homes. Also, the variability of home prices is much higher in San Francisco. By using the same vertical scale on both charts, the designer makes this insight clear.

But what about the horizontal scale? There isn't any explanation of this grade-level scale. It appears that the central cities have close to average grade level in each chart so it seems that each region is individually centered. Otherwise, I'd expect to see more variability in the horizontal dots across regions.

If one scale is fixed across regions, and the other scale is adapted to each region, then we shouldn't compare the slopes across regions. The fact that the lines are generally steeper in the San Francisco chart may be an artifact of the way the scales are treated.
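A quick sketch shows how the drawn steepness of a line depends on the axis ranges. It assumes the per-region adaptation changes the range of the horizontal axis; a pure shift (re-centering) would leave the slopes unchanged. The function name and all numbers are hypothetical.

```python
def on_screen_slope(dx, dy, x_range, y_range):
    """Slope as drawn on a unit square: each difference is scaled by its axis range."""
    return (dy / y_range) / (dx / x_range)


# Same data trade-off in both regions: +1 grade level costs +$200K.
# The vertical range is fixed at $2,000K; the horizontal range differs.
region_narrow = on_screen_slope(dx=1.0, dy=200.0, x_range=4.0, y_range=2000.0)
region_wide = on_screen_slope(dx=1.0, dy=200.0, x_range=8.0, y_range=2000.0)
```

Under these assumptions the identical trade-off is drawn twice as steep in the region whose horizontal axis covers twice the range, which is the kind of artifact that makes cross-region slope comparisons unsafe.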


Finally, I'd recommend aggregating the data, and not plotting individual school districts. The obsession with magnifying little details is a Big Data disease. On a chart like this, users are encouraged to click on individual districts and make inferences. However, as I discussed in the sister blog, most of the differences in school quality shown on these charts are not statistically meaningful (whereas the differences on the home-price scale are definitely notable).


If you haven't already, see this related post on my sister blog for a discussion of the data analysis.





Attractive, interactive graphic challenges lazy readers

The New York Times spent a lot of effort making a nice interactive graphical feature to accompany their story about Uber's attempt to manipulate its drivers. The article is here. Below is a static screenshot of one of the graphics.


The illustrative map at the bottom is exquisite. It has Uber cars driving around, it has passengers waiting at street corners, the cars pick up passengers, new passengers appear, etc. There are also certain oddities: all the cars go at the same speed, some strange things happen when cars visually run into each other, etc.

This interactive feature is mostly concerned with entertainment. I don't think it is possible to infer either of the two metrics listed above the chart by staring at the moving Uber cars. The metrics are the percentage of Uber drivers who are idle and the average number of minutes that a passenger waits. Those two metrics are crucial to understanding the operational problem facing Uber planners. You can increase the number of Uber cars on the road to reduce average waiting time but the trade-off is a higher idle rate among drivers.


One of the key trends in interactive graphics at the Times is simplification. While a lot of things are happening behind the scenes, there is only one interactive control. The only thing the reader can control is the number of drivers in the grid.

The Times is one of the greatest producers of interactive graphics, so I trust that they know what they are doing. In fact, this article describes some comments made by Gregor Aisch, who works at the Times. The gist is: very few readers play with their interactive graphics. Someone else said, "If you make a tooltip or rollover, assume no one will ever see it." I also have heard someone say (hope this is not merely a voice in my own head): "Every extra button or knob you place on the graphic, you lose another batch of readers." This might be called the law of the interactive knob, analogous to the law of the printed equation in the realm of popular book publishing, which stipulates that with every additional equation you print in a book, you lose another batch of readers.

(Note, however, that we are talking about graphics for communications here, not exploratory graphics.)


Several years ago, I introduced the concept of "return on effort" in this blog post. Most interactive graphics are high effort to produce. The question is whether there is enough reward for the readers. 


Denver outspends everyone on this

Someone at the Wall Street Journal noticed that Denver's transit agency has outspent other top transit agencies, after accounting for number of rides -- and by a huge margin.

But the accompanying graphic conspires against the journalist.


For one thing, Denver is at the bottom of the page. Denver's two bars do not stand out in any way. New York's transit system dwarfs everyone else in both number of rides and total capital expenses and funding. And the division into local, state, and federal sources of funds is on the page, absorbing readers' mindspace for unknown reasons.
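The journalist's point rests on a simple normalization: capital spending divided by number of rides. The sketch below uses invented agencies and invented numbers (not the Wall Street Journal's data) to show how a small agency can top the per-ride ranking while being invisible on a chart of totals.

```python
# Hypothetical figures: capital spending in $ millions, rides in millions.
agencies = {
    "Agency A": {"capital_spend": 9000.0, "rides": 3000.0},  # huge system
    "Agency B": {"capital_spend": 1000.0, "rides": 100.0},   # small system
    "Agency C": {"capital_spend": 800.0, "rides": 400.0},
}

# Dollars of capital spending per ride for each agency.
per_ride = {
    name: figures["capital_spend"] / figures["rides"]
    for name, figures in agencies.items()
}

# Rank from highest to lowest spending per ride.
ranked = sorted(per_ride, key=per_ride.get, reverse=True)
```

Here the agency with the largest totals spends $3 per ride, while the small one spends $10 per ride and ranks first. A chart that plots totals buries exactly the comparison the story is about.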

But Denver is an outlier, as can be seen here:



A multidimensional graphic that holds a number of surprises, via NYT

The New York Times has an eye-catching graphic illustrating the Amtrak crash last year near Philadelphia. The article is here.

The various images associated with this article vary in the amount of contextual details offered to readers.

This graphic provides an overview of the situation:


Initially, I had a fair amount of trouble deciphering this chart. I was searching hard to find the contrast between the orange (labeled RECENT TRAINS) and the red (labeled TRAIN # 188). The orange color forms a wavy area akin to a river on a map. The red line segments suggest bridges that span the river bank. The visual cues kept telling me train #188 is a typical train but that conclusion was obviously wrong.

The confusion went away after I read the next graphic:


This zoomed-in view offered some helpful annotation. The data came from three days of trains prior to the accident. Surprisingly, the orange band does not visualize a range of speeds. The width of the orange band fluctuates with the median speed over those three days. And then, the red line segments represent the speed of train #188 as it passed through specific points on the itinerary.

The key visual element to look for is the red lines exceeding the width of the orange band as train #188 rounds Frankford Junction.


In the second graphic, the speeding is more visible. But it can be made even more prominent. For example, instead of line segments, use the same curvy element to portray the speed of train #188. Then through line width or color, emphasize train #188 and push the average train to the background.


Notice that there is an additional line snaking through the middle of the orange band. The data have been centered around this line. This type of centering is problematic: the excess speed relative to the median train has been split into halves. The reader must mentally reassemble the halves. The impact of the speeding has therefore been artificially muted.
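The muting effect is easy to quantify. When both the band and the line segment are centered on the same midline, only half of the excess speed sticks out on each side. The speeds below are hypothetical round numbers, not the readings from the article.

```python
def excess_per_side(train_speed, median_speed):
    """Excess visible on each side of a band when both widths share one center line."""
    total_excess = max(train_speed - median_speed, 0)
    return total_excess / 2


# Hypothetical speeds in mph.
total = 100 - 60                  # excess the reader should perceive
per_side = excess_per_side(100, 60)  # excess actually protruding on each side
```

With these numbers a 40 mph excess shows up as two 20 mph overhangs, one on each side of the band, and the reader has to add them back together.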

In this next version, I keep that midpoint line and use it to indicate the median speed of the trains. Then, I show how train #188 diverged from the median speed as it neared the Junction.



This version brings out one other confusing element of the original. This line that traces the median speed is also tracing the path of the train (geographically). Actually, the line does not encode speed--it just encodes the reference level of speed. The graphic above creates an impression that train #188 "ran off track" if the reader interprets the green line as a railroad track on a map. But it is off in speed, not in physical location.




When in Seattle, don't look for the bus map

The past week in Seattle, I was blessed with amazing weather. The city has great coffee and restaurants, so it pleased me all right.

But Seattle-ites, please tell your government to burn your transit map presto!  I tried looking at the map three or four times, and each time, my eyes were burning so much from the colors, the details, the lack of labels, the general confusion that I gave up. Yes, that's the worst thing an information graphics designer wants to hear - the reader waves the white flag.


How do you make sense of that? In the excerpt below, I labeled with black boxes my desired origin and destination.


There are many obstacles to figuring out a route. Firstly, the precise locations of bus stops are not indicated on the map. From the black box up top, if I wanted to catch a bus, I wasn't even sure which corner to go to! Seattle, by the way, is full of one-way streets. Eventually, you realize that different lines have different operators, and they don't use a common ticket.

I ended up at the Westlake Station wanting to take public transit to the International District. I purchased a ticket from the machine. Then I boarded a bus seemingly heading in the right direction. The bus driver stared me down as if I just stepped into disputed territory. She told me my ticket was for a train. I asked her how I'd catch a train. Her eyes told me to get off quickly or else...

I too thought I bought a train ticket but it turns out the train and buses share the same platform.

Back to the map, it would appear that the green line labeled 40 would be useful to me. I tried to trace the green line but it started looping around and I gave up.


Milan EXPO: further thoughts

I promised to blog more about the Milan EXPO so this is it.

My first reactions were recorded here.

This post is primarily intended for those who are planning a visit.



One of the smartest design decisions is to line everything up along one street (the Decumano). It will take some genius to get lost even though there are many dozens of buildings. Once you get to the far end of the Decumano, there is a smaller road that runs perpendicular to it, which houses the buildings that showcase individual regions of Italy. This smaller road leads to the Tree of Life structure, where I found those delightful, swirling chairs. Here they are again:




The EXPO site is in the Milan suburbs. It is easily accessible by the Metro (subway) or by train. Either means of transportation takes about 20 minutes. The train takes riders right to the entrance, saving 10 minutes of walking from the subway stop, but depending on your origin, the train may be inconvenient. I later discovered that there are two subway exits: one exit links to an overpass while the other links to an underpass. Choose carefully if under/over makes a difference for you.

You need to carry a printed copy of your ticket. Your bags will be scanned. Liquids are allowed and are also scanned. This process is painless unless you fight with the crowds that appear at 7 pm because of reduced-price entry. Most pavilions close by 9 pm, leaving only restaurants open.



The food is great if you bring realistic expectations. You’re at a fair, not a gourmet food market. I was very happy with what I ate, and here are some highlights.

Eataly is there in a big way. They have 10 or 12 restaurants, representing different regions in Italy. Eataly is this high-end supermarket / restaurant chain that started in Italy and now also has stores in New York, Boston and Chicago. Not spectacular but way better than your average meal. If you want Italian food, you won’t go wrong here. I particularly like the Tuscany (Toscana) menu, serving two of my favorites: panzanella (bread salad), and pici (an extra-thick spaghetti) with duck ragu. You have to walk all the way to the back of the Eataly row to find the Toscana section.



Inside the Pavilions. You can fill yourself by sampling snacks as you run around the pavilions. I recommend this strategy because your schedule will be dominated by trying to get into certain pavilions (or more pavilions). The food is going to be hit or miss. Austria (left) has great stuff. France looks good. Belgium serves pub grub and beer. Holland has food trucks, mostly fast food. I liked the summer rolls in the Vietnam pavilion (right).



Vietnam and Belgium

Russia was giving away caviar on toast, which attracted a mob. Heard Chile has good food. Mexico has a food line. If you like cannoli, go to the “Civil Society” building and visit the Sicilian vendor.


You can always go to McDonald’s for American fast food. There are also various places where you can get Italian fast food, such as simple pastas and pizzas.

Several pavilions have proper sit-down restaurants. I can’t vouch for them as I didn’t try them. The French pavilion for example has a restaurant upstairs. I think Russia also has a restaurant.

Gelato. When I am in Italy, I am eating gelato every day. Gelato is a godsend on these hot summer days. There are many places to get gelato at the EXPO. My favorite is Pernigotti, which has a booth in the chocolate area. I also got gelato behind the Israel pavilion. There is a small stand outside the Italy Pavilion. Also across from the Italy Pavilion, the Love It food store serves gelato on the far side. Granita (slushed ice drinks) would have been even better but I didn’t find any worth mentioning here.


Espresso. The safe and great options include Lavazza and Illy. Lavazza is in the Italian regions street, which runs perpendicular to the Decumano. Lavazza has some great-looking tarts and cakes, in addition to coffee. Illy is in the coffee exhibition area.


Other Pavilions

I also enjoyed France (most on-subject), Morocco, Slow Food, and especially the chocolate area.

I didn’t make it to Japan, Kazakhstan, China or Italy. Those attracted excellent reviews but the lines were too long. Several countries (Japan, Kazakhstan, etc.) produce staged experiences, which means once you are inside, you have to spend at least 30-45 minutes.

Have fun!