This is a continuation of my previous post on the map of the age of Brooklyn's buildings, in which I suggested that aggregating the data would bring out the geographical patterns better.
For its map illustrating the pattern of insurance coverage in several large cities across America, the New York Times team produced two versions, one using dots to plot the raw data (at the finest level, each dot represents 40 residents) and another showing aggregate data to the level of Census tracts.
We can therefore compare the two views side-by-side.
The structure of this data is similar to that of the Brooklyn map. Where Rhiel has age of buildings as the third dimension, the NYT has the insurance status of people living in Census tracts. (Given that the Census does not disclose individual responses, we know that the data is really tract-level. The "persons" being depicted can be thought of as simulated.) The NYT data poses a greater challenge because it is categorical. Each "person" has one of four statuses: "uninsured", "public insurance", "private insurance" and "both public and private insurance". The last category is primarily due to aggregation to the tract level. By contrast, the Brooklyn data is "continuous" (ordinal, to be specific) in the year of construction.
The aggregated chart at the bottom speaks to me much more loudly. What it gives up in granularity, both at the geographical level and at the metric level, it gains in clarity and readability. The dots on the top chart end up conveying mostly information about population density across Census tracts, which distracts readers from taking in the spatial pattern of the uninsured. The chart at the bottom aggregates the data to the level of a tract. Also, instead of showing all four levels of insuredness, it concentrates its energy on showing the proportion of uninsured.
In short, the chart that uses fewer elements (areas rather than dots), fewer colors, and fewer individual data points ends up answering the question of "mapping uninsured Americans" more effectively. (It is a common misunderstanding that aggregation throws away data -- in fact, aggregation consumes the data.)
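To make the aggregation step concrete, here is a minimal sketch of collapsing dot-level records into one number per tract, the proportion of uninsured. The tract names, the short status labels and the records themselves are made up for illustration; this is not the NYT's data or code.

```python
from collections import Counter, defaultdict

# Hypothetical dot-level records: (tract_id, insurance_status).
# Each record stands for one dot on the dot map; all values are invented.
dots = [
    ("tract_A", "uninsured"),
    ("tract_A", "private"),
    ("tract_A", "private"),
    ("tract_A", "public"),
    ("tract_B", "uninsured"),
    ("tract_B", "uninsured"),
    ("tract_B", "both"),
    ("tract_B", "private"),
]

def uninsured_share_by_tract(records):
    """Collapse dot-level records into one number per tract: the share uninsured."""
    counts = defaultdict(Counter)
    for tract, status in records:
        counts[tract][status] += 1
    return {
        tract: status_counts["uninsured"] / sum(status_counts.values())
        for tract, status_counts in counts.items()
    }

shares = uninsured_share_by_tract(dots)
print(shares)
```

Every record contributes to the result, which is the sense in which aggregation consumes the data. Four categories and many dots per tract become a single shade per tract.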
When designers choose to plot raw data, they often find a need to compensate for its weakness of losing the signal in the noise. One of the strategies is to produce a hover-over effect that shows aggregated statistics, like this:
Notice the connection between this and my previous comment. What the aggregated map displays are two elements of the hover-over: the boundary of the Census tract, and the first statistic (the proportion of uninsured).
In addition to the hassle of having to hover over different tracts one at a time, the reader also loses the ability to interpret the statistics. For example, is the proportion of uninsured (21.4%) a good or bad number? The reader can't tell unless he or she has an understanding of the full range of possibilities. In the other chart, this task has been performed by the designer when constructing the legend:
This trade-off between relative and absolute metrics is one of the key decisions designers have to make all the time. Relative metrics also have problems. For instance, on the bottom chart, the reader loses the understanding of the relative population density between different Census tracts.
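The legend's job of placing a number like 21.4% in context can be sketched as a simple binning step. The break points and labels below are hypothetical, chosen only for illustration; they are not the classes used in the NYT legend.

```python
from bisect import bisect_right

# Hypothetical class breaks for the share of uninsured (not the NYT's actual legend).
BREAKS = [0.10, 0.20, 0.30]
LABELS = ["under 10%", "10% to 20%", "20% to 30%", "30% or more"]

def legend_class(share):
    """Return the legend class that a tract's uninsured share falls into."""
    return LABELS[bisect_right(BREAKS, share)]

print(legend_class(0.214))
```

Once the designer has picked the breaks, the reader no longer has to judge the raw number; the color already says where the tract sits relative to the others.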
A similar design problem faced by Rhiel in the Brooklyn chart is whether to use the year of construction (e.g. 2003) as the metric or the age of buildings (e.g. 10 years old). Rhiel chose the former while another designer might have selected the latter.
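The two metrics carry the same information once a reference year is fixed, as this small sketch shows. The reference year of 2013 is an assumption inferred from the example above (a 2003 building being 10 years old), and the function names are mine.

```python
# Assumed reference year, inferred from "2003" corresponding to "10 years old".
REFERENCE_YEAR = 2013

def building_age(year_built, reference_year=REFERENCE_YEAR):
    """Age of a building in years, relative to the reference year."""
    return reference_year - year_built

def year_built_from_age(age, reference_year=REFERENCE_YEAR):
    """Recover the year of construction from an age."""
    return reference_year - age

print(building_age(2003))
print(building_age(2003, reference_year=2023))
```

The difference is that the year of construction never changes, while the age depends on when the chart is read, so a map of ages goes stale in a way that a map of years does not.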
Again, thanks for reading, and see you next year!