How effective visualization brings data alive
May 08, 2014
Back in 2009, I wrote about a failed attempt to visualize regional dialects in the U.S. (link). The raw data came from Bert Vaux's surveys. I recently came across some fantastic maps based on the same data. Here's one:
These maps are very pleasing to look at, and also very effective at showing the data. We learn that Americans use three major words to describe what others might call "soft drinks". The regional contrast is the point of the raw data, and Joshua Katz, who created these maps while a grad student at North Carolina State, did wonders with the data. (Looks like Katz has been hired by the New York Times.)
The entire set of maps can be found here.
What more evidence do we need that effective data visualization brings data alive... the corollary being bad data visualization takes the life out of data!
Look at the side-by-side comparisons of two ways to visualize the same data. This is the "soft drinks" question:
And this is the "caramel" question:
The set of maps referred to in the 2009 post can be found here.
Now, the maps on the left are more truthful to the data (at the zip code level), while Katz applies smoothing liberally to achieve the pleasing effect.
Katz has a poster describing the methodology -- at each location on the map, he averages the closest data. This is why the white areas on the left-side maps disappear from Katz's maps.
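To make "averaging the closest data" concrete, here is a minimal sketch of nearest-neighbor smoothing. It is not Katz's actual procedure: the function name `knn_smooth`, the choice of k, the unweighted average, the flat (x, y) coordinates and the sample values are all assumptions made for illustration.

```python
import math

def knn_smooth(points, query, k=3):
    """Average the values of the k survey points closest to `query`.

    points: list of ((x, y), value) pairs, where value might be the local
            share of respondents giving one particular answer.
    query:  (x, y) location on the map where we want an estimate.
    """
    if not points:
        raise ValueError("need at least one survey point")
    # Rank survey points by straight-line distance to the query location.
    nearest = sorted(points, key=lambda p: math.dist(p[0], query))[:k]
    # Plain unweighted mean (an assumption; a real smoother may weight by distance).
    return sum(value for _, value in nearest) / len(nearest)

# Hypothetical data: share of one answer at four survey locations.
survey = [
    ((0.0, 0.0), 0.9),
    ((1.0, 0.0), 0.7),
    ((0.0, 1.0), 0.8),
    ((10.0, 10.0), 0.1),
]

# (4, 4) has no respondents of its own, yet it still gets a value
# borrowed from its three nearest neighbors.
estimate = knn_smooth(survey, (4.0, 4.0), k=3)
print(round(estimate, 3))
```

Because every location borrows from its neighbors, a spot with no respondents is still colored in, which is how the blank areas get filled.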
The dot notation on the left-side maps has a major deficiency in that it is a binary element: the dot is either present or absent. We lose the granularity of how strongly the responses are biased toward that answer. This may be the reason why, in both examples, several of the heaviest patches on Katz's maps correspond to relatively sparse regions on the left-side maps.
Katz also tells us that his maps use only part of the data. For each point on his maps, he uses only the most frequent answer; in reality, there are proportions of respondents for each of the available choices. Dropping the other responses is not a big deal if the responses are highly concentrated on the top choice. But if the responses are evenly split, or well-balanced among, say, the top two choices, then using only the top choice presents a problem.
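A small sketch shows what is lost by keeping only the top choice. The answer shares below are made up for illustration, and `top_choice_summary` is a hypothetical helper, not something taken from Katz's poster.

```python
def top_choice_summary(shares):
    """Return (top answer, its share, margin over the runner-up).

    shares: dict mapping each answer to its proportion of respondents
            at one location. Needs at least two answers.
    """
    if len(shares) < 2:
        raise ValueError("need at least two answers to compute a margin")
    ranked = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)
    (top, top_share), (_, second_share) = ranked[0], ranked[1]
    return top, top_share, top_share - second_share

# Two hypothetical locations with the same winner but very different splits.
concentrated = {"soda": 0.85, "pop": 0.10, "coke": 0.05}
split = {"soda": 0.45, "pop": 0.43, "coke": 0.12}

for name, shares in [("concentrated", concentrated), ("split", split)]:
    top, share, margin = top_choice_summary(shares)
    print(f"{name}: top={top}, share={share:.2f}, margin={margin:.2f}")
```

Both locations report the same top answer, so a map built only from the winner cannot tell them apart; the margin over the runner-up is the information that gets dropped.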
Don't the white areas take care of the case where one choice does not dominate utterly? It's what I'd do.
Posted by: derek | May 08, 2014 at 01:30 PM
Derek: those are two different things. Think of the original data as a vector at every location. What Katz did is to pick out the maximum element of each vector and then take the average of these max-elements spatially. White areas on these maps would indicate that there is large dispersion in the max-elements in those spatial locations.
Posted by: Kaiser | May 08, 2014 at 04:25 PM
One possibility is to add a category of "No clear choice" for when the most common answer is not, say, 20% greater than the next. This is just one of several open questions. There are probably other ways of dealing with the paucity of data in some areas and with the boundaries. Many of these problems are shared with geospatial mapping and ecology. I went to a talk on geospatial methods a few years ago, and it was surprising how recently they had developed models for a lot of their mapping, and how much they still had to do.
Posted by: Ken | May 09, 2014 at 04:33 AM
This is kernel density estimation, right?
Posted by: omegatron | May 09, 2014 at 08:10 AM