A few weeks ago, the New York Times Upshot team published a set of charts exploring the relationship between school quality, home prices and commute times in different regions of the country. The following is the chart for the New York/New Jersey region. (The article and complete data visualization is here.)
This chart is primarily a scatter plot of home prices against school quality, which is represented by average test scores. The designer wants to explore the decision to live in the so-called central city versus the decision to live in the suburbs, hence the centering of the chart about New York City. Further, the colors of the dots represent the average commute times, which are divided into two broad categories (under/over 30 minutes). The dots also have different sizes, which I presume measures the populations of each district (but there is no legend for this).
This data visualization has generated some negative reviews, and so has the underlying analysis. In a related post on the sister blog, I discuss the underlying statistical issues. For this post, I focus on the data visualization.
One positive about this chart is the designer has a very focused question in mind - the choice between living in the central city or living in the suburbs. The line scatter has the effect of highlighting this particular question.
Boy, those lines are puzzling.
Each line connects New York City to a specific school district. The slope of the line is, nominally, the trade-off between home price and school quality. The slope is the change in home prices for each unit shift in school quality. But these lines don't really measure that tradeoff because the slopes span too wide a range.
The average person should have a relatively fixed home-price-to-school-quality trade-off. If we could estimate this average trade-off, it should be represented by a single slope (with a small cone of error around it). The wide range of slopes actually undermines this chart, as it demonstrates that there are many other variables that factor into the decision. Other factors are causing the average trade-off coefficient to vary so widely.
The line scatter is confusing for a different reason. It reminds readers of a flight route map. For example:
The first instinct may be to interpret the locations on the home-price-school-quality plot as geographical. Such misinterpretation is reinforced by the third factor being commute time.
Additionally, on an interactive chart, it is typical to hide the data labels behind mouseovers or clicks. I like the fact that the designer identifies some interesting locales by name without requiring a click. However, one slight oversight is the absence of data labels for NYC. There is nothing to click on to reveal the commute/population/etc. data for central cities.
In the sister blog post, I mentioned another difficulty - most of the neighborhoods are situated to the right and below New York City, challenging the notion of a "trade-off" between home price and school quality. It appears as if most people can spend less on housing and also send kids to better schools by moving out of NYC.
In the New York region, commute times may be the stronger factor relative to school quality. Perhaps families chose NYC because they value shorter commute times more than better school quality. Or, perhaps the improvement in school quality is not sufficient to overcome the negative of a much longer commute. The effect of commute times is hard to discern on the scatter plot as it is coded into the colors.
A more subtle issue can be seen when comparing San Francisco and Boston regions:
One key insight is that San Francisco homes are on average twice as expensive as Boston homes. Also, the variability of home prices is much higher in San Francisco. By using the same vertical scale on both charts, the designer makes this insight clear.
But what about the horizontal scale? There isn't any explanation of this grade-level scale. It appears that the central cities have close to average grade level in each chart so it seems that each region is individually centered. Otherwise, I'd expect to see more variability in the horizontal dots across regions.
If one scale is fixed across regions, and the other scale is adapted to each region, then we shouldn't compare the slopes across regions. The fact that the lines are generally steeper in the San Francisco chart may be an artifact of the way the scales are treated.
Finally, I'd recommend aggregating the data, and not plot individual school districts. The obsession with magnifying little details is a Big Data disease. On a chart like this, users are encouraged to click on individual districts and make inferences. However, as I discussed in the sister blog (link), most of the differences in school quality shown on these charts are not statistically meaningful (whereas the differences on the home-price scale are definitely notable).
If you haven't already, see this related post on my sister blog for a discussion of the data analysis.