Jeff W made some astute comments on the New York Times Netflix visualization, which I praised in the last post. He pointed out that there is so much more to the underlying data than what can be shown within the confines of these maps. For example, he wanted to know the relationship between Metacritic scores and Netflix ranks (or rentals), explore the heavy-tailed distribution of titles, expose regional differences, etc.

What he is hitting on is the shortcoming of the current approach to infographics... an approach that is about imposing order on messy data, rather than summarizing, extracting and generalizing. It is also the difference between "data graphics" and "statistical graphics".

This is related to the modelers versus non-modelers dichotomy Andrew Gelman just discussed in this blog post. (He cites Hal Stern as the source of the quote.)

Basically, non-modelers have the same philosophy as infographics designers: they want to make as few assumptions as possible and rely exclusively on the data set. Modelers, by contrast, want to reduce the data; their instinct is to generalize. Everything Jeff asked for requires statistical modeling. As I mentioned before (say, here), I believe infographics has to move in this direction eventually to be successful.

Take the correlation between Metacritic score and Netflix ranking... the designers actually thought about this, and they tried to surface the correlation in a way that is straitjacketed by the infographics aesthetic. What they did was allow the movies to be sorted either by Netflix ranking or by Metacritic score, using the controls at the top right. When Netflix ranking is chosen for sorting, the Metacritic score is printed next to each map, so as the reader scrolls along, he or she can mentally evaluate the correlation. Of course, this is inefficient and error-prone, but we should give the designers props for trying.
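What the reader is being asked to do mentally is, in effect, compute a rank correlation. Here is a minimal sketch of that computation, using Spearman's rank correlation on entirely made-up movie data (none of these numbers come from Netflix or Metacritic):

```python
def ranks(values):
    """Return the rank of each value (1 = largest), assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman rank correlation: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

metacritic = [94, 81, 77, 65, 40]    # hypothetical critic scores
rentals = [120, 90, 100, 30, 10]     # hypothetical rental counts

print(round(spearman(metacritic, rentals), 2))  # prints 0.9
```

A single number like this summarizes what the scrolling exercise can only hint at, which is precisely the modeler's move: reduce, then generalize.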

Building a model for this data is no simple matter either, because multiple factors are at play in determining the Netflix ranking. A good model is one that can somewhat accurately predict the Netflix ranking (color) from the various factors included in the model: the type of movie, the cost of the movie, the number of screens it played on, any affinity of a movie to a locale (witness "New in Town"), regions (at different levels of specificity), the recency of the movie, whether it has been released in multiple formats, etc.
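To make this concrete, here is a toy sketch of the simplest version of such a model: an ordinary least squares fit of Netflix rank on two invented predictors (screen count and age of the movie). All the data are fabricated for illustration, and a serious model would use far more titles, more of the factors above, and likely something richer than plain least squares:

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    n, p = len(X), len(X[0])
    XtX = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)] for j in range(p)]
    Xty = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    return solve(XtX, Xty)

# Columns: intercept, screens (hundreds), years since release -- all invented.
X = [[1, 30, 1], [1, 25, 2], [1, 5, 1], [1, 10, 8], [1, 2, 10]]
y = [1, 3, 10, 25, 40]  # hypothetical Netflix ranks

beta = fit_ols(X, y)
print(beta)  # fitted coefficients for intercept, screens, age
```

The payoff of a model like this is exactly the kind of question Jeff asked: after accounting for screens and recency, does a locale effect remain?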

Jeff's other point about ranking vs number of rentals raises another interesting statistical issue. I suspect that it is precisely because the number of rentals is highly skewed with a long tail that the analyst chose to use rank orders. If an untransformed number of rentals is used, the top few blockbuster films will dominate pretty much every map.
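A small sketch of why rank order tames the long tail, using invented rental counts that follow a rough power law: on a raw scale, the top title crowds everything else into the bottom of the color scale, while rank order spaces the same titles evenly.

```python
rentals = [100000, 20000, 5000, 1200, 300, 80, 20]  # hypothetical counts, sorted

top_share = rentals[0] / sum(rentals)
print(f"top title's share of all rentals: {top_share:.0%}")  # prints 79%

# A color scale on raw counts: the second-place title already falls
# near the bottom of the scale, and the rest are indistinguishable.
lo, hi = min(rentals), max(rentals)
raw_scaled = [(r - lo) / (hi - lo) for r in rentals]

# A color scale on rank order: titles are spread uniformly.
rank_scaled = [i / (len(rentals) - 1) for i in range(len(rentals))]

print([round(v, 2) for v in raw_scaled])
print([round(v, 2) for v in rank_scaled])
```

The rank transform throws away the magnitudes, of course, which is why the ranking-versus-rentals distinction Jeff raised is a genuine trade-off and not a free lunch.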

Keep the comments coming!
