
Two good maps, considered part 2

This is a continuation of my previous post on the map of the age of Brooklyn's buildings, in which I suggested that aggregating the data would bring out the geographical patterns better.

For its map illustrating the pattern of insurance coverage in several large cities across America, the New York Times team produced two versions, one using dots to plot the raw data (at the finest level, each dot represents 40 residents) and another showing aggregate data to the level of Census tracts.

We can therefore compare the two views side-by-side.


The structure of this data is similar to that of the Brooklyn map. Where Rhiel has age of buildings as the third dimension, the NYT has the insurance status of people living in Census tracts. (Given that the Census does not disclose individual responses, we know that the data is really tract-level. The "persons" being depicted can be thought of as simulated.) The NYT data poses a greater challenge because it is categorical. Each "person" has one of four statuses: "uninsured", "public insurance", "private insurance" and "both public and private insurance". The last category is primarily due to aggregation to the tract level. By contrast, the Brooklyn data is "continuous" (ordinal, to be specific) in the year of construction.

The aggregated chart at the bottom speaks to me much more loudly. What it gives up in granularity, both at the geographical level and at the metric level, it gains in clarity and readability. The dots on the top chart end up conveying mostly information about population density across Census tracts, which distracts readers from taking in the spatial pattern of the uninsured. The chart at the bottom aggregates the data to the level of a tract. Also, instead of showing all four levels of insuredness, the chart at the bottom concentrates its energy on showing the proportion of uninsured.
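To make the aggregation concrete, here is a minimal Python sketch that collapses person-level statuses into a single proportion of uninsured per tract. The tract names, the records and the function name uninsured_share_by_tract are all made up for illustration; this is not the NYT's data or method.

```python
from collections import Counter, defaultdict

def uninsured_share_by_tract(people):
    """Collapse (tract, status) records into the share of uninsured per tract."""
    counts = defaultdict(Counter)
    for tract, status in people:
        counts[tract][status] += 1
    return {
        tract: statuses["uninsured"] / sum(statuses.values())
        for tract, statuses in counts.items()
    }

# Hypothetical records, invented for this sketch.
people = [
    ("tract-1", "uninsured"), ("tract-1", "private insurance"),
    ("tract-1", "private insurance"), ("tract-1", "public insurance"),
    ("tract-2", "uninsured"), ("tract-2", "public insurance"),
]
shares = uninsured_share_by_tract(people)
```

Four categories per person become one number per tract, which is the trade the bottom chart makes.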

In short, the chart that uses fewer elements (areas rather than dots), fewer colors, and fewer individual data points ends up answering the question of "mapping uninsured Americans" more effectively. (It is a common misunderstanding that aggregation throws away data -- in fact, aggregation consumes the data.)


When designers choose to plot raw data, they often find a need to compensate for its weakness of losing the signal in the noise. One of the strategies is to produce a hover-over effect that shows aggregated statistics, like this:


Notice the connection between this and my previous comment. What the aggregated map displays are two elements of the hover-over: the boundary of the Census tract, and the first statistic (the proportion of uninsured).

In addition to the hassle of having to hover over different tracts asynchronously, the reader also loses the ability to interpret the statistics. For example, is the proportion of uninsured (21.4%) a good or bad number? The reader can't tell unless he or she has an understanding of the full range of possibilities. In the other chart, this task has been performed by the designer when constructing the legend:


This trade-off between relative and absolute metrics is one of the key decisions designers have to make all the time. Relative metrics also have problems. For instance, on the bottom chart, the reader loses the understanding of the relative population density between different Census tracts.

A similar design problem faced by Rhiel in the Brooklyn chart is whether to use the year of construction (e.g. 2003) as the metric or the age of buildings (10 years old). Rhiel chose the former while some other designer would have selected the latter.


Again, thanks for reading, and see you next year!


Two good maps, considered

A Reflection on the past year:

Thanks to you for continuing to make this blog a success. Writing it has given me much enjoyment over the years, and I have learned much from your comments as well as from the visualization projects of many colleagues. 2013 also saw the publication of my new book Numbersense: How to Use Big Data to Your Advantage (link). I thank those of you who have purchased the book, and supported my writing. For those who haven't, please check it out. I have also been speaking at various events, mostly about interpreting data analyses published in the mass media, and building effective data analytics teams. In addition, I am heavily involved in the new Certificate in Analytics and Data Visualization at New York University (link). While the frequency of posting has suffered a little due to my other projects, I hope you found the contents as engaging, fun, and constructive as before.

Looking forward to 2014, I have as usual a basket of projects. Besides the two blogs, I will be expanding my teaching at NYU, including a visualization workshop that I'll be writing about here soon; taking on consulting projects; evangelizing better communications of data and analytics; and prospecting several book projects. I continue to spend most of the week at Vimeo, where my team analyzes data.

This will be my last post in 2013. It is an extra-long post to tide you over to the New Year. Happy New Year!



A short while ago, I was in correspondence with Thomas Rhiel who created a lovely map depicting the age of buildings in Brooklyn (link). In this case, it's the data that piques my interest. I haven't seen this type of data visualized before. The map type is exquisitely aligned to the data: buildings are geographically located and the age is a third, non-geographical dimension which is encoded in the colors. Red-orange is the most recent while green-blue is the oldest.




The data is at the level of individual buildings. If you hover over a building, you find the raw data including the address and the year of construction. The details seem to show that even the shape of each building is depicted. This really impressed me since a lot of manual labor must have been applied (according to Rhiel, there is a source for this type of data). Here is the map at its most magnified:


I came across this starry patch near the Manhattan Bridge, in which the buildings show up as red asterisks. (Rhiel said the shape came from the data. I am not sure I believe the data. Does anyone live near Sands Street?)



The map is useful if you are interested in questions such as "where are the new developments" (look for the deep red buildings) or "what's the average age of the buildings in a specific block" or "what's the age distribution of the buildings in a set of blocks". At the magnified level shown above, the street names are available to help readers orient themselves. The light gray color keeps the roads and the names safely in the background.

Now, zoomed to the other extreme, we get the image of the whole of Brooklyn:



I have a couple of suggestions for Rhiel. As someone who is not familiar with the geography of Brooklyn, I find that this view presumes knowledge that I don't have. Unlike the magnified view, there are no text labels to help us decipher the different sections of Brooklyn. It would be nice if there were a background map to indicate the better-known areas like Williamsburg or Brooklyn Heights or Red Hook, etc.

The other concern is the apparent lack of pattern shown here. At this level, an appropriate question is which sections of Brooklyn are being redeveloped and which sections have older buildings. I see sprinkles of colors everywhere, giving the impression that everything is average. I suggested to Rhiel that aggregating the data would help bring out the pattern.

In data visualization, there is an obsession with plotting the "raw data" at its most granular level. Sometimes, this strategy backfires. It's the classic signal versus noise problem. Aggregation is a noise removal procedure. If, for example, Rhiel gave up the data for individual buildings, including those beloved building shapes, and looked at the average age of buildings within each block, or even within Census tracts, I suspect that the resulting map would be more informative.
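As a rough sketch of what this kind of aggregation involves, the Python below averages the year of construction within each block. The block identifiers, the records and the function name mean_year_by_block are hypothetical; they are not drawn from Rhiel's data.

```python
from collections import defaultdict
from statistics import mean

def mean_year_by_block(buildings):
    """Average the year of construction within each block.

    `buildings` is an iterable of (block_id, year_built) pairs.
    """
    years = defaultdict(list)
    for block_id, year_built in buildings:
        years[block_id].append(year_built)
    return {block_id: mean(ys) for block_id, ys in years.items()}

# Hypothetical records, invented for this sketch.
buildings = [
    ("block-A", 1901), ("block-A", 1905), ("block-A", 2003),
    ("block-B", 1995), ("block-B", 2005),
]
block_means = mean_year_by_block(buildings)
```

One color per block would then replace one color per building. Note that a single new building in block-A pulls the average forward by decades, so a median may be a better summary for some blocks.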

It turns out that the Graphics team at the New York Times just published an interactive map that illustrates exactly what I suggested to Rhiel. Since this post is getting long, please go to the next post to continue reading.


The need to think about what you're seeing: an incomplete geography lesson

If your chart is titled "The Most Popular TV Show Set in Every State," what would you expect the data to look like?

You'd think the list would be dominated by the hit shows like The Walking Dead and Downton Abbey, and you might guess that there are probably only four or five unique shows on the list.

But then it's easy to miss the word "set" in the title. They are looking for the most popular show given that it is set in a particular state. Now this is a completely different question -- and conversely, it guarantees that there will be 50 different shows for the 50 states, assuming that one show can't be set in multiple states. This is also, computationally, a much more complex question. Some locations, like New York, Mass. (Boston), and Illinois (Chicago), are many times more likely to be the settings of TV shows than other states. This means one might need to go back many years to find the "popular" shows in the less attention-grabbing states.

I used quotation marks for the word "popular" because if one has to dig deep into history for a specific state, then it is possible that the selected show would not be popular in the aggregate! This is not unlike the issue of whether having your kids pick up a popular sport (like basketball) or instrument (like violin) is better or worse than an unpopular one (like squash or trombone). The latter route is potentially the shorter path to standing out, but the achievement will be known only to a niche audience.


This brings me to how one should look at a map like this one in Business Insider (link):


The first thing that strikes you is the colors. The colors that signify nothing. Since each state has its own TV show, by definition each piece of information is unique. As far as I can tell, the choice of which states share the same color is totally up to the designer.

As I have remarked in the past, too often the designer uses the map as a lesson in geography. The only information presented to readers through the map type is where each state is located in the union. Without the state names, even this lesson is incomplete. We learn nothing about the relative popularity of these shows, the longevity, the years in which they went on air, etc.

Geographical data should not automatically be placed on a map.


Is there any "data" in this map? It depends on how you see it. Here's how the author described what went into pairing each state with a TV show:

To qualify, we looked at television series as opposed to reality shows.* Selections were based on each show’s longevity, audience and critical acclaim using info from IMDB/Metacritic, awards, and lasting impact on American culture and television... *When there wasn't a famous enough series to choose from, we selected a more popular reality show. That happens once on this list (IA).


One is enough

If you blink, you might think the following graphic came from USA Today.


It turns out the Wall Street Journal has adopted USA-Today-style graphics. The chart about missions to the moon (together with two other similar ones) showed up this weekend.

There are only six data points, and the story is much simpler than the graphic implies.


The original chart uses three separate motifs to encode the six data points: the column height, the data label, and the axis. Usually only one is needed.

Digital music business needs numbersense

Joran E. sends us to the following chart via Twitter.


Link to the original here.


The top chart fails our self-sufficiency test. There are only eight numbers in the data. All eight numbers are printed onto the chart. If they were removed, the chart would be neutered.

The triangle elements are distracting and pointless. The data is encoded in the two ends of those black lines. The two data series have very different scales so that when plotted on the same canvas, the information on digital albums (in red) becomes almost imperceptible.

The tiny font size strains our eyes.


But the bigger problem with the chart is the absence of numbersense.

Start with summing the number of digital albums and the number of tracks. While both units are literally units, they are different units.

A host of statistical adjustments is called for. Revenues would be more telling than unit sales since the average price paid is probably not constant across time. Price is typically inverse to quantity. Singles are cheaper than albums so comparing the units of tracks and the units of albums makes little sense.


The chart on the bottom is a nice idea but again can use some adjusting. As far as we know, the 47 weeks of sales data have not been seasonally adjusted. While the second half of 2013 looked worse than the first half, this insight is remarkable only if this pattern was not likely based on history. Adding lines from previous years would help put things into perspective.

Besides, if sales of other consumption goods fell by 20 percent while sales of digital music dropped by 10%, then by comparison, the digital music industry has fared well. In order to understand the song download data series, we also need to consult the trend for other related goods.

Lastly, by using an area chart, the designer is cornered into starting the vertical axis at zero. If a line chart were used instead, there would be no need to start the axis at zero, and consequently, the drop in weekly sales would appear more pronounced.

The exception to the rule against dual axes

Dual axes are almost always a bad idea. But there is one situation in which I'd use them.


Last week, Alberto Cairo (link) engaged in a Twitter/blogging debate about a chart that first appeared in Reuters concerning the state of the woman CEO in the Fortune 500 companies. Here is the chart under discussion:


This chart already is cleaner and more useful than the original original, which came from a research report from Catalyst (link):


Jonathan Keller re-made the Reuters chart as follows:



Jorge Camões contributed this version:


The Voila blog (link) has yet another take:


Then Chris Moore, responding to Cairo, created this view and also left some insightful comments:



What's at stake here? There are really three related topics of discussion.

First, there is the matter of the upper limit of the vertical axis. Three solutions were suggested: 100 percent, 50 percent, and 4 percent. (Cairo at one point suggested 25 percent, which can be wrapped into the 50 percent bucket.) In reality, this is an argument over which of two key messages should be emphasized. The first message is that women still comprise a pathetically small proportion of Fortune 500 CEOs. The second message is more hopeful, that the growth in this proportion has been quite rapid since 1995.

All versions of the chart actually display both messages. In the Reuters chart (as well as Moore and Cairo), the message about the absolute proportion of women is given as an annotation while the Keller and Voila versions extend the vertical axis, thus encoding this message directly to the chart. Conversely, the Keller and Voila versions deemphasize the growth in proportions, and so I'd have preferred to see a note about that growth when using their versions.

Voila selects a 50% upper limit because the 50/50 split has an intuitive meaning in the context of gender balance. Because the resulting chart is so visually arresting, and so biased to one of the two key messages, I'd only consider it if the point of the display is to draw attention to the female deficit.


The second disagreement is in using absolute counts versus relative proportions. Moore chose absolute counts. I am in this camp as well. This is primarily because we are talking about Fortune 500 and the 500 number is an idee fixe. In Moore's version, I find the data labels distracting since all the numbers are small and insignificant.

Finally, the linkage between the absolute and the relative numbers also produces multiple solutions. Cairo's post pinpoints this issue. His solution is to include an inset pie chart with an arrow to explicitly link the two views. Moore likes the inset idea, but experimented with a donut chart or a partition in place of the pie chart. He also removes the explicit guiding arrow.


It turns out this dataset is perfectly made for the dual axes. The absolute counts and relative proportions are in one to one correspondence because it's really only one data series expressed twice. This happy situation leads to one line that can be cross-referenced on two axes, one side showing counts and the other side showing proportions. This is shown in my version below (the orange line).
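The one-to-one correspondence is just a division by 500, which is why a single line can be read off both axes. Here is a minimal Python sketch of that mapping; the function names and the tick values are my own choices for illustration, not taken from any of the charts.

```python
FORTUNE_500 = 500  # the list has a fixed size, so the mapping is linear

def count_to_percent(count, total=FORTUNE_500):
    """Convert a number of CEOs into a percentage of the list."""
    return 100.0 * count / total

def percent_to_count(percent, total=FORTUNE_500):
    """Convert a percentage of the list back into a number of CEOs."""
    return percent * total / 100.0

# Tick marks for the count axis, and the matching ticks for the percent axis.
count_ticks = [0, 5, 10, 15, 20]
percent_ticks = [count_to_percent(c) for c in count_ticks]
```

Because every tick on one axis lands exactly on a tick on the other, the two axes cannot tell conflicting stories, which is the usual objection to dual axes.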


In addition to having two axes, I have plotted two related data series. The second series (in red) shows the incremental change in the number of women CEOs from the previous year (also shown in both counts and proportions).

The first series (the same one everyone plotted) draws attention to the first message, that the growth rate of women CEOs has been quite strong since 1995. The second series is a bit of a downer on that message, suggesting that from the absolute count perspective, the progress (only one or two additions per year) has been painfully slow, and not that impressive.
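The second series is simply the year-over-year difference of the first. A minimal Python sketch, using made-up counts rather than the Catalyst figures:

```python
def yearly_increments(counts):
    """Difference between each year's count and the previous year's."""
    return [later - earlier for earlier, later in zip(counts, counts[1:])]

# Hypothetical counts for five consecutive years, invented for this sketch.
counts = [2, 3, 3, 5, 6]
increments = yearly_increments(counts)

# The same increments expressed as a percentage of the 500 companies.
increment_shares = [100.0 * step / 500 for step in increments]
```

With these invented counts the total triples in four years, yet no single year adds more than two CEOs, which is the tension between the two messages.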

Thanks again to Alberto for making me aware of this discussion. This has been fun!


PS. I have left out the other chart and may return to it in a future post.

The importance of a proper scale

Business Insider (link) highlighted a map showing childhood food insecurity across the 50 states, with the data coming from a report by Brookings.


This is a nice map. I like the tones of the chosen colors although the colors are not intuitively matched to magnitude. (There is a small labeling issue in the New England section.) The message is very clear.

I wondered about the scale, in particular, the use of equal sized buckets to split the scale. For a designer, the key decisions here include the number of buckets and the size of each bucket. The following chart shows the choice made by this designer:


In this chart, all the states are ranked by their food insecurity rates with the lowest on the left and the highest on the right. The three horizontal lines show where the current cutoff values are. They form two equal sized blocks because of the equal spacing chosen by the designer. There are a total of four buckets.

Now if you ignore the dashed lines, and focus on the solid line showing the increasing food insecurity rates, you'd notice that maybe there are only three buckets, not four. The following amended chart shows where I'd put the cutoff values (18% and 23%), resulting in three buckets.
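The two bucketing schemes can be sketched in a few lines of Python. Only the 18% and 23% cutoffs come from this post; the state names, the rates, the 10-to-30 range and the function names are placeholders I invented for illustration.

```python
from bisect import bisect_right

def equal_width_cutoffs(lowest, highest, n_buckets):
    """Cutoffs that split the range [lowest, highest] into equal sized buckets."""
    width = (highest - lowest) / n_buckets
    return [lowest + width * i for i in range(1, n_buckets)]

def bucket_index(rate, cutoffs):
    """Bucket number for a rate: 0 below the first cutoff, 1 for the next, etc.

    A rate exactly equal to a cutoff falls into the higher bucket.
    """
    return bisect_right(cutoffs, rate)

# Hypothetical rates in percent, not the Brookings figures.
rates = {"State A": 11.0, "State B": 17.5, "State C": 19.0,
         "State D": 22.0, "State E": 27.5}

equal = equal_width_cutoffs(10.0, 30.0, 4)  # three cutoffs, four buckets
custom = [18.0, 23.0]                       # two cutoffs, three buckets

equal_buckets = {s: bucket_index(r, equal) for s, r in rates.items()}
custom_buckets = {s: bucket_index(r, custom) for s, r in rates.items()}
```

In this toy data, State C and State D sit on opposite sides of an equal-width cutoff but share a bucket under the custom cutoffs, which is how a lone odd-colored state can disappear after rescaling.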


With the new cutoff values, let's look at what the map looks like:


I'm pretty happy with this. It shows an even clearer picture. There are three clusters of states; most of the south and west suffer more than the north and east. The odd state here and there (e.g. Louisiana) turned out not to be so special.

But this version picks out the "outliers", the group that has better food insecurity rates than the rest of the country (as shown on the left side of the line charts). These particularly well-performing states are North Dakota and Minnesota, New Hampshire and Mass. and Virginia.

A small shift in the scaling cleans up the message!



Here is the same map with a progressive color scheme:




The graphical version of "to be seen"

In New York, there are many restaurants that serve mediocre food but which people go to in order to be seen. Here is the graphical equivalent, courtesy of Scientific American (link):



This is an attractive chart, but one from which one should not expect to learn much.

The labels are well placed and unintrusive. The colors are not too sharp.

The size of the font draws our attention to the percentages -- the proportion of patents granted to China that falls into the specified categories. These percentages pertain to the single stacked column chart.

Looking right to left, the reader notices that the stacked column chart is an extension of the rightmost edge of the "ink blot". The ink blot is a variant of the stacked area chart. The massive growth between 1985 and 2010 looks mighty impressive. But the reader must navigate the transition from relative numbers to absolute numbers because the ink blot chart uses the number of patents, not the relative proportion.

In fact, the switch to absolute numbers leaves a void. The reader needs to know the relative proportion from decades past in order to interpret that single column representing just the year 2010. As the chart stands, has there been a change in distribution over time? Your guess is as good as mine.

I have previously explained why the ink blot chart is a silly invention. The central axis is arbitrary and meaningless. It's challenging to judge the growth from one year to another year because the growth is split in half and moving in different directions. The reader is asked to measure the vertical height at two points in time, and mentally shift the two line segments onto an even plane.

The other obstacle to understanding the rate of growth is the choice of scale. The exponential growth in recent years causes the earlier years to look completely flat.


Furthermore, the taxonomy of patents is hard to grasp. There are two dimensions: purely Chinese invention versus co-invention; and assignment to {Chinese indigenous firms only, or multinational firms only, or either, or other types of organizations}.

Without reading the article itself, it's hard to understand what the point of this taxonomy is. It's hard to learn anything from looking at this chart.

But it's nice to look at. That's for sure.

Beyond the obvious

Flowing Data has been doing some fine work on the baby names data. The names voyager is a successful project by Martin Wattenberg that has received praise from many corners. It's one of these projects that have taken on a commercial life as you can see from the link.

Here is a typical area chart presentation of the baby names data:


The typical insight one takes from this chart is that the name "Michael" (as a boy's name) reached a peak in the 1970s and has not been as popular lately. The data is organized as a series of trend lines, for each name and each gender.

Speaking of area charts, I have never understood their appeal. If I were to click on Michael in the above chart, the design responds by restricting itself to all names starting with "Michael", meaning it includes Michael given to a girl, and Michaela, for example. See below.


What is curious is that the peak has a red lining. At first thought, one expects to find hiding behind the blue Michael a girl's name that is almost as popular. But this is a stacked area chart so in fact, the girl's name (Michael given to a girl, if you mouse over it) is much less popular than the boy Michael (20,000 to 500 roughly).


Nathan decides to dig a layer deeper. Is there more information beyond the popularity of baby names over time?

In this post, Nathan zeroes in on the subset of names that are "unisex," that is to say, have been used to name both boys and girls. He selects the top 35 names based on a mean-square-error criterion and exposes the gender bias for each name. The metric being plotted is no longer pure popularity but gender popularity. The larger the red area, the greater the proportion of girls being given that name.

You can readily see some interesting trends. Kim (#34) has become almost predominantly female since the 1960s. On the other hand, Robbie (#18) used to be predominantly female but is now mostly a boy's name.


One useful tip when performing this analysis is to pay attention to the popularity of each name (the original metric) even though you've decided to switch to the new metric of gender bias. This is because the relative proportions are unstable and difficult to interpret for less popular names. For example, the Name Voyager shows no values for Gale (#29) after the 1970s, which probably explains the massive gyrations in the 1990s and beyond.
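One way to act on this tip is to suppress the proportion whenever the name is too rare in a given year. The Python below is a minimal sketch; the threshold of 100 babies, the counts and the function name girl_share are arbitrary assumptions of mine, not anything used by Flowing Data or the Name Voyager.

```python
def girl_share(girls, boys, min_total=100):
    """Proportion of babies given a name who are girls.

    Returns None when fewer than `min_total` babies received the name,
    since the proportion is too unstable to interpret at that size.
    """
    total = girls + boys
    if total < min_total:
        return None
    return girls / total

# Hypothetical counts for one name in three different years.
popular_year = girl_share(girls=4000, boys=6000)
rare_year_a = girl_share(girls=3, boys=2)
rare_year_b = girl_share(girls=1, boys=4)
```

Without the threshold, the two rare years would plot as 60% and 20% girls, a swing of 40 points caused by a difference of two babies.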