Financial Times has this chart up about the voters for the National Front, which is Marine Le Pen's party.
I find the chart very hard to decipher, even though I usually like the dot plot format.
The first thing to figure out is not visual: it's the definition of the data. The average voter represents those who voted in the 2015 regional election. The National Front voters are those who intended to vote for the party in 2015, and these are sub-divided into "loyal" and "new" voters. All it takes for one to be "loyal" is to have voted for the National Front in 2012; all others are "new."
You pick up all of the above information primarily from the footnotes, combined with various parts of the title and legend. Along the way, you also learn that FN is the acronym for the National Front.
The following version is clearer:
The new version mostly just re-orients the original chart, turning it on its side. It's quite surprising how much better I feel about it. I think it's because the message is primarily about the relative ages, and in the original chart, aging is portrayed downwards, which is not natural.
A few weeks ago, the New York Times Upshot team published a set of charts exploring the relationship between school quality, home prices and commute times in different regions of the country. The following is the chart for the New York/New Jersey region. (The article and complete data visualization is here.)
This chart is primarily a scatter plot of home prices against school quality, which is represented by average test scores. The designer wants to explore the decision to live in the so-called central city versus the decision to live in the suburbs, hence the centering of the chart about New York City. Further, the colors of the dots represent the average commute times, which are divided into two broad categories (under/over 30 minutes). The dots also have different sizes, which I presume measures the populations of each district (but there is no legend for this).
This data visualization has generated some negative reviews, and so has the underlying analysis. In a related post on the sister blog, I discuss the underlying statistical issues. For this post, I focus on the data visualization.
One positive about this chart is that the designer has a very focused question in mind - the choice between living in the central city and living in the suburbs. The line scatter has the effect of highlighting this particular question.
Boy, those lines are puzzling.
Each line connects New York City to a specific school district. The slope of the line is, nominally, the trade-off between home price and school quality. The slope is the change in home prices for each unit shift in school quality. But these lines don't really measure that trade-off because the slopes span too wide a range.
The average person should have a relatively fixed home-price-to-school-quality trade-off. If we could estimate this average trade-off, it should be represented by a single slope (with a small cone of error around it). The wide range of slopes actually undermines this chart, as it demonstrates that there are many other variables that factor into the decision. Other factors are causing the average trade-off coefficient to vary so widely.
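To make the point concrete, here is a minimal sketch - with entirely made-up numbers - of what estimating a single average trade-off slope would look like, compared with the many per-district slopes the chart actually draws:

```python
import numpy as np

# Hypothetical (school quality, home price) gaps for a handful of districts,
# measured relative to the central city at the origin. Numbers are invented.
quality_gap = np.array([0.5, 1.0, 1.5, 2.0, 2.5])  # grade levels above NYC
price_gap = np.array([-50, -80, -60, -120, -90])   # $000s relative to NYC

# A single average trade-off would be one slope fit through the origin:
# price_gap ~ beta * quality_gap
beta = np.sum(quality_gap * price_gap) / np.sum(quality_gap ** 2)
print(round(beta, 1))  # the average trade-off, in $000s per grade level

# The per-district slopes that the chart's lines depict:
slopes = price_gap / quality_gap
print(slopes)  # a wide range, so other variables must drive the decision
```

If the trade-off were really a fixed coefficient, the individual slopes would cluster tightly around `beta`; the spread is the tell that other factors are at work.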
The line scatter is confusing for a different reason. It reminds readers of a flight route map. For example:
The first instinct may be to interpret the locations on the home-price-school-quality plot as geographical. Such misinterpretation is reinforced by the third factor being commute time.
Additionally, on an interactive chart, it is typical to hide the data labels behind mouseovers or clicks. I like the fact that the designer identifies some interesting locales by name without requiring a click. However, one slight oversight is the absence of data labels for NYC. There is nothing to click on to reveal the commute/population/etc. data for central cities.
In the sister blog post, I mentioned another difficulty - most of the neighborhoods are situated to the right and below New York City, challenging the notion of a "trade-off" between home price and school quality. It appears as if most people can spend less on housing and also send kids to better schools by moving out of NYC.
In the New York region, commute times may be the stronger factor relative to school quality. Perhaps families chose NYC because they value shorter commute times more than better school quality. Or, perhaps the improvement in school quality is not sufficient to overcome the negative of a much longer commute. The effect of commute times is hard to discern on the scatter plot as it is coded into the colors.
A more subtle issue can be seen when comparing San Francisco and Boston regions:
One key insight is that San Francisco homes are on average twice as expensive as Boston homes. Also, the variability of home prices is much higher in San Francisco. By using the same vertical scale on both charts, the designer makes this insight clear.
But what about the horizontal scale? There isn't any explanation of this grade-level scale. It appears that the central cities have close to average grade level in each chart so it seems that each region is individually centered. Otherwise, I'd expect to see more variability in the horizontal dots across regions.
If one scale is fixed across regions, and the other scale is adapted to each region, then we shouldn't compare the slopes across regions. The fact that the lines are generally steeper in the San Francisco chart may be an artifact of the way the scales are treated.
Finally, I'd recommend aggregating the data, and not plotting individual school districts. The obsession with magnifying little details is a Big Data disease. On a chart like this, users are encouraged to click on individual districts and make inferences. However, as I discussed in the sister blog (link), most of the differences in school quality shown on these charts are not statistically meaningful (whereas the differences on the home-price scale are definitely notable).
If you haven't already, see this related post on my sister blog for a discussion of the data analysis.
What made this infographic from South Carolina Ports work is the choice of contextual comparisons. The simple animation also helps. (Original here if the animated gif isn't working.) The random colors mean nothing but they did make me look at the graphic in the first place.
There is a brewing controversy over ads shown on video websites. Because of the automation and general opacity of the online advertising market, advertisers sometimes find their ads next to undesirable content, such as extremist videos.
This chart analyzes the situation, but it is also an extremist assault on our eyes:
Via Twitter, @Stoltzmaniac sent me this chart, from the Economist (link to article):
There is simply too much going on on the right side of the chart. The designer seems unable to decide which metric is more important: the cumulative growth rate of vehicles in use from 2005 to 2014, or the vehicles per 1,000 people in 2014. So both sets of numbers are placed on the chart, regrettably in close proximity.
In the meantime, the other components of the chart, such as the gridlines and the red line indicating 2005 = 100, are only relevant to the cumulative vehicle growth metric. Perhaps noticing the imbalance, the designer then paints the other data series in rainbow-colored boxes, and prints the label for this data series in a big white box. This decision tilts the chart towards the vehicles-per-capita metric, as our eyes now cannot help but stare at the white box.
There are really three trends: the growth in population, the growth in vehicles, and the resultant growth in vehicles per capita. They can all be accommodated in a small-multiples setting, as follows:
There are some curious angular trends revealed here. The German population somehow dipped into negative territory around 2007-8 but since then has turned around. Nigeria's vehicle growth declined sharply after 2006 so that the density of vehicles has stabilized.
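The arithmetic behind such a small-multiples chart is simple to sketch: each series is rebased so its 2005 value equals 100, and vehicles per capita falls out as vehicles divided by population. The numbers below are invented for illustration:

```python
# Hypothetical data for one country panel (all figures made up).
years = [2005, 2008, 2011, 2014]
population = [80.0, 79.5, 80.3, 81.2]  # millions
vehicles = [45.0, 47.0, 50.0, 53.0]    # millions

def rebase(series):
    """Index a series so that its first value equals 100."""
    return [100.0 * v / series[0] for v in series]

pop_index = rebase(population)
veh_index = rebase(vehicles)
density = [v / p * 1000 for v, p in zip(vehicles, population)]  # per 1,000 people

# The three panels of the small multiples would each plot one of these columns:
for row in zip(years, pop_index, veh_index, density):
    print(row)
```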
The New York Times spent a lot of effort making a nice interactive graphical feature to accompany their story about Uber's attempt to manipulate its drivers. The article is here. Below is a static screenshot of one of the graphics.
The illustrative map at the bottom is exquisite. It has Uber cars driving around, it has passengers waiting at street corners, the cars pick up passengers, new passengers appear, etc. There are also certain oddities: all the cars go at the same speed, some strange things happen when cars visually run into each other, etc.
This interactive feature is mostly concerned with entertainment. I don't think it is possible to infer either of the two metrics listed above the chart by staring at the moving Uber cars. The metrics are the percentage of Uber drivers who are idle and the average number of minutes that a passenger waits. Those two metrics are crucial to understanding the operational problem facing Uber planners. You can increase the number of Uber cars on the road to reduce average waiting time but the trade-off is a higher idle rate among drivers.
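That trade-off can be illustrated with a textbook queueing model - my own toy sketch, not anything the Times used: treat riders as arrivals to an M/M/c queue where the c servers are drivers.

```python
import math

# Toy model (not the Times' simulation): riders arrive at rate lam per hour,
# each trip takes 1/mu hours on average, and there are c drivers.
def mmc_metrics(lam, mu, c):
    a = lam / mu                 # offered load: busy drivers on average
    rho = a / c                  # utilization per driver (must be < 1)
    # Erlang C formula: probability an arriving rider must wait
    s = sum(a**k / math.factorial(k) for k in range(c))
    last = a**c / (math.factorial(c) * (1 - rho))
    p_wait = last / (s + last)
    wait_min = 60 * p_wait / (c * mu - lam)  # average wait, in minutes
    idle = 1 - rho                           # fraction of driver time idle
    return wait_min, idle

# Adding drivers cuts the riders' wait but raises the drivers' idle rate.
for c in (12, 16, 20):
    wait, idle = mmc_metrics(lam=10.0, mu=1.0, c=c)
    print(c, round(wait, 2), round(idle, 2))
```

Even this crude model reproduces the planner's dilemma: the two metrics above the Times chart move in opposite directions as the driver count grows.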
One of the key trends in interactive graphics at the Times is simplification. While a lot of things are happening behind the scenes, there is only one interactive control. The only thing the reader can control is the number of drivers in the grid.
The Times is one of the greatest producers of interactive graphics, so I trust that they know what they are doing. In fact, this article describes some comments made by Gregor Aisch, who works at the Times. The gist is: very few readers play with their interactive graphics. Someone else said, "If you make a tooltip or rollover, assume no one will ever see it." I have also heard someone say (I hope this is not merely a voice in my own head): "For every extra button or knob you place on the graphic, you lose another batch of readers." This might be called the law of the interactive knob, analogous to the law of the printed equation in popular book publishing, which stipulates that for every additional equation you print in a book, you lose another batch of readers.
(Note, however, that we are talking about graphics for communications here, not exploratory graphics.)
Several years ago, I introduced the concept of "return on effort" in this blog post. Most interactive graphics are high effort to produce. The question is whether there is enough reward for the readers.
Michael Bales and his associates at Cornell are working on a new visual tool for citations data. This is an area that is ripe for innovation. There is a lot of data available but it seems difficult to gain insights from them. The prototypical question is how authoritative a particular researcher or research group is, judging from their publications.
A proxy for "quality" is the number of times a paper is cited by others. More sophisticated metrics take into account the quality of the researchers who cite one's work. There are various summary statistics, e.g. the h-index, that attempt to capture the data distribution, but reducing it to a single number may remove too much context.
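For concreteness, here is how the h-index is computed; the citation counts below are made up to show exactly the loss of context I mean:

```python
def h_index(citations):
    """h-index: the largest h such that the researcher has at least
    h papers with at least h citations each."""
    cites = sorted(citations, reverse=True)
    h = 0
    for i, c in enumerate(cites, start=1):
        if c >= i:
            h = i
        else:
            break
    return h

# Two very different publication records collapse to the same number:
print(h_index([100, 90, 80, 3, 3]))  # a few influential papers -> 3
print(h_index([3, 3, 3, 3, 3]))      # many modestly cited papers -> 3
```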
Contextual information is very important for interpretation: certain disciplines might enjoy higher average numbers of citations because researchers tend to list more references, or because papers typically have large numbers of co-authors; individual researchers may have a few influential papers, a lot of rarely-cited papers, or anything in between.
A good tool should be able to address a number of such problems.
Michael is a former student who attended the Data Visualization workshop at NYU (syllabus here), and the class spent some time discussing his citations impact tool. He contacted me to let me know that what we did during the workshop has now reached the research conferences.
Here is a wireframe of the visual form we developed:
This particular chart shows the evolution of citations data over three time periods for a specific sub-field of study. The vertical scale is a percentile ranking based on some standard used in the citations industry. We grouped the data into deciles (and within each decile, into thirds) to facilitate understanding. The median rank is highlighted - we can see that in this sub-field, the publications have increased not only in quantity but also in quality, with the median rank showing improvement over the three periods. Because "review articles" are interpreted differently by some, those are highlighted in purple.
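The grouping described above can be sketched in a few lines; the percentile ranks here are invented, chosen only so the median improves across periods as in the wireframe:

```python
from statistics import median

def decile_bin(pct):
    """Map a percentile rank (0-100) to (decile, third-within-decile),
    both 0-indexed, with 100 folded into the top bin."""
    decile = min(int(pct // 10), 9)
    within = pct - decile * 10          # position inside the decile, 0-10
    third = min(int(within // (10 / 3)), 2)
    return decile, third

# Hypothetical percentile ranks of a sub-field's papers in three periods.
periods = {
    "2005-07": [42, 55, 61, 38],
    "2008-10": [50, 58, 66, 71, 45],
    "2011-13": [57, 63, 72, 80, 69, 75],
}

for period, ranks in periods.items():
    bins = [decile_bin(r) for r in ranks]
    print(period, "median rank:", median(ranks), "bins:", bins)
```

The count of papers per period gives the quantity story, and the rising median rank gives the quality story - the two messages the wireframe is built around.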
One of the key strengths of this design is the filter mechanism shown on the right. The citations researcher can customize comparisons. This is really important because the citations data are meaningless by themselves; they only attain meaning when compared to peer groups.
Here is an even rougher sketch of the design:
For a single researcher, this view will list all of his or her papers, ordered by each paper's percentile rank, with review papers given a purple color.
The entire VIVO dashboard project by Weill Cornell Medicine has a Github page, but the citation impact tool does not seem to be there at the moment. Michael tells me it can be found here.