
Reorientation in the French election

The Financial Times has this chart up about the voters for the National Front, which is Marine Le Pen's party.


I find the chart very hard to decipher, even though I usually like the dot plot format.

The first thing to figure out is not visual. It's the definition of the data. The average voter represents those who voted in the 2015 regional election. The National Front voters are those who intended to vote in 2015, and these are sub-divided into "loyal" and "new" voters. All it takes for one to be "loyal" is to have voted for the National Front in 2012; all others are "new."

All of the above information you pick up primarily from the footnotes, combined with various parts of the title, and legend. Similarly, you also learn that FN is the acronym for National Front.


The following version is clearer:


The new version mostly just re-orients the original chart, turning it on its side. It's quite surprising how much better I feel about it. I think it's because the message is primarily about the relative ages, and in the original chart, aging is portrayed downwards, which is not natural.

Sorting out what's meaningful and what's not

A few weeks ago, the New York Times Upshot team published a set of charts exploring the relationship between school quality, home prices and commute times in different regions of the country. The following is the chart for the New York/New Jersey region. (The article and complete data visualization are here.)


This chart is primarily a scatter plot of home prices against school quality, which is represented by average test scores. The designer wants to explore the decision to live in the so-called central city versus the decision to live in the suburbs, hence the centering of the chart about New York City. Further, the colors of the dots represent the average commute times, which are divided into two broad categories (under/over 30 minutes). The dots also have different sizes, which I presume measure the population of each district (but there is no legend for this).

This data visualization has generated some negative reviews, and so has the underlying analysis. In a related post on the sister blog, I discuss the underlying statistical issues. For this post, I focus on the data visualization.


One positive about this chart is that the designer has a very focused question in mind - the choice between living in the central city or living in the suburbs. The line scatter has the effect of highlighting this particular question.

Boy, those lines are puzzling.

Each line connects New York City to a specific school district. The slope of the line is, nominally, the trade-off between home price and school quality. The slope is the change in home prices for each unit shift in school quality. But these lines don't really measure that tradeoff because the slopes span too wide a range.

The average person should have a relatively fixed home-price-to-school-quality trade-off. If we could estimate this average trade-off, it should be represented by a single slope (with a small cone of error around it). The wide range of slopes actually undermines this chart, as it demonstrates that there are many other variables that factor into the decision. Other factors are causing the average trade-off coefficient to vary so widely.
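To make the slope argument concrete, here is a minimal sketch of how each line's slope would be computed. Everything in it is an assumption for illustration: the district names, the grade-level scores and the home prices are invented, not taken from the Upshot data, and the units (grade levels, dollars per square foot) are my guess at what such a chart would use.

```python
from dataclasses import dataclass


@dataclass
class District:
    name: str
    grade_level: float  # school quality, in grade levels relative to average (assumed unit)
    home_price: float   # home price, in dollars per square foot (assumed unit)


def tradeoff_slope(city: District, district: District) -> float:
    """Change in home price per one-unit change in school quality,
    measured along the line from the central city to the district."""
    quality_gap = district.grade_level - city.grade_level
    if quality_gap == 0:
        raise ValueError("slope is undefined when school quality is identical")
    return (district.home_price - city.home_price) / quality_gap


# Hypothetical numbers, chosen only to show how widely the slopes can range.
city = District("Central city", -0.5, 400.0)
districts = [
    District("Suburb A", 1.5, 300.0),
    District("Suburb B", 0.5, 600.0),
    District("Suburb C", 2.5, 700.0),
]

slopes = {d.name: tradeoff_slope(city, d) for d in districts}
spread = max(slopes.values()) - min(slopes.values())

for name, slope in slopes.items():
    print(f"{name}: {slope:+.0f} dollars per grade level")
print(f"Range of slopes: {spread:.0f}")
```

If a single trade-off governed the decision, these slopes would cluster around one value. In this made-up example they even change sign, which is the pattern that suggests other variables are at work.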


The line scatter is confusing for a different reason. It reminds readers of a flight route map. For example:


The first instinct may be to interpret the locations on the home-price-school-quality plot as geographical. Such misinterpretation is reinforced by the third factor being commute time.

Additionally, on an interactive chart, it is typical to hide the data labels behind mouseovers or clicks. I like the fact that the designer identifies some interesting locales by name without requiring a click. However, one slight oversight is the absence of data labels for NYC. There is nothing to click on to reveal the commute/population/etc. data for central cities.


In the sister blog post, I mentioned another difficulty - most of the neighborhoods are situated to the right and below New York City, challenging the notion of a "trade-off" between home price and school quality. It appears as if most people can spend less on housing and also send kids to better schools by moving out of NYC.

In the New York region, commute times may be the stronger factor relative to school quality. Perhaps families chose NYC because they value shorter commute times more than better school quality. Or, perhaps the improvement in school quality is not sufficient to overcome the negative of a much longer commute. The effect of commute times is hard to discern on the scatter plot as it is coded into the colors.


A more subtle issue can be seen when comparing San Francisco and Boston regions:


One key insight is that San Francisco homes are on average twice as expensive as Boston homes. Also, the variability of home prices is much higher in San Francisco. By using the same vertical scale on both charts, the designer makes this insight clear.

But what about the horizontal scale? There isn't any explanation of this grade-level scale. It appears that the central cities have close to average grade level in each chart so it seems that each region is individually centered. Otherwise, I'd expect to see more variability in the horizontal dots across regions.

If one scale is fixed across regions, and the other scale is adapted to each region, then we shouldn't compare the slopes across regions. The fact that the lines are generally steeper in the San Francisco chart may be an artifact of the way the scales are treated.


Finally, I'd recommend aggregating the data, and not plotting individual school districts. The obsession with magnifying little details is a Big Data disease. On a chart like this, users are encouraged to click on individual districts and make inferences. However, as I discussed in the sister blog (link), most of the differences in school quality shown on these charts are not statistically meaningful (whereas the differences on the home-price scale are definitely notable).


If you haven't already, see this related post on my sister blog for a discussion of the data analysis.





Light entertainment: is it safe for our eyes?

There is a brewing controversy over ads shown on video websites. Because of the automation, and general opacity, of the online advertising market, advertisers sometimes find their ads next to undesirable content, such as extremist videos.

This chart analyzes the situation, but it is also an extremist assault on our eyes:


Brought to you by Business Insider (link).

Confuse, confuses, confused, confusing

Via Twitter, @Stoltzmaniac sent me this chart, from the Economist (link to article):


There is simply too much going on on the right side of the chart. The designer seems unable to decide which metric is more important, the cumulative growth rate of vehicles in use from 2005 to 2014, or the vehicles per 1,000 people in 2014. So both sets of numbers are placed on the chart, regrettably in close proximity.

In the meantime, the other components of the chart, such as the gridlines and the red line indicating 2005 = 100 are only relevant to the cumulative vehicle growth metric. Perhaps noticing the imbalance, the designer then paints the other data series in rainbow-colored boxes, and prints the label for this data series in a big white box. This decision tilts the chart towards the vehicle per capita metric, as our eyes now cannot help but stare at the white box.


There are really three trends: the growth in population, the growth in vehicles, and the resultant growth in vehicles per capita. They can all be accommodated in a small-multiples setting, as follows:


There are some curious angular trends revealed here. The German population somehow dipped into negative territory around 2007-8 but since then has turned around. Nigeria's vehicle growth declined sharply after 2006 so that the density of vehicles has stabilized.
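The three series are tied together by simple arithmetic, which a short sketch can make explicit. The figures below are invented for a fictional country, and the function name `index_to_base` is my own; the only thing borrowed from the chart is the convention of indexing each series so that 2005 = 100.

```python
def index_to_base(series: dict[int, float], base_year: int) -> dict[int, float]:
    """Rescale a yearly series so that its value in base_year equals 100."""
    base = series[base_year]
    return {year: 100.0 * value / base for year, value in series.items()}


# Hypothetical data for one country, in millions.
population = {2005: 50.0, 2010: 55.0, 2014: 60.0}
vehicles = {2005: 10.0, 2010: 14.0, 2014: 18.0}

# Vehicle density: vehicles per 1,000 people in each year.
per_1000 = {year: 1000.0 * vehicles[year] / population[year] for year in population}

# One indexed series per panel of the small multiples.
population_index = index_to_base(population, 2005)
vehicles_index = index_to_base(vehicles, 2005)
density_index = index_to_base(per_1000, 2005)

for year in sorted(population):
    print(
        year,
        round(population_index[year], 1),
        round(vehicles_index[year], 1),
        round(density_index[year], 1),
    )
```

The density index is the vehicle index divided by the population index (times 100), so vehicle density stabilizes whenever vehicles and population grow at the same pace.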


Attractive, interactive graphic challenges lazy readers

The New York Times spent a lot of effort making a nice interactive graphical feature to accompany their story about Uber's attempt to manipulate its drivers. The article is here. Below is a static screenshot of one of the graphics.


The illustrative map at the bottom is exquisite. It has Uber cars driving around, it has passengers waiting at street corners, the cars pick up passengers, new passengers appear, etc. There are also certain oddities: all the cars go at the same speed, some strange things happen when cars visually run into each other, etc.

This interactive feature is mostly concerned with entertainment. I don't think it is possible to infer either of the two metrics listed above the chart by staring at the moving Uber cars. The metrics are the percentage of Uber drivers who are idle and the average number of minutes that a passenger waits. Those two metrics are crucial to understanding the operational problem facing Uber planners. You can increase the number of Uber cars on the road to reduce average waiting time but the trade-off is a higher idle rate among drivers.


One of the key trends in interactive graphics at the Times is simplification. While a lot of things are happening behind the scenes, there is only one interactive control. The only thing the reader can control is the number of drivers in the grid.

The Times is one of the greatest producers of interactive graphics, so I trust that they know what they are doing. In fact, this article describes some comments made by Gregor Aisch, who works at the Times. The gist is: very few readers play with their interactive graphics. Someone else said, "If you make a tooltip or rollover, assume no one will ever see it." I have also heard someone say (hope this is not merely a voice in my own head): "Every extra button or knob you place on the graphic, you lose another batch of readers." This might be called the law of the interactive knob, analogous to the law of the printed equation in the realm of popular book publishing, which stipulates that with every additional equation you print in a book, you lose another batch of readers.

(Note, however, that we are talking about graphics for communications here, not exploratory graphics.)


Several years ago, I introduced the concept of "return on effort" in this blog post. Most interactive graphics are high effort to produce. The question is whether there is enough reward for the readers.