
Popularizing public data

Dona Wong, whose graphics book I reviewed two years ago (link), has recently joined the New York Fed to lead an effort to visualize data. This is exciting because consumers are unlikely to learn anything from Excel spreadsheets, HTML tables, and the like, which are the typical formats of public data.

One of their efforts is visualization of mortgage delinquency data in the Tri-state and Long Island regions (link). This animation reminds me of the CDC obesity map, to which I gave a positive review in 2005 (link). This type of chart is great for revealing the evolution of a metric over time and over space. The sliding control is a very nice extra touch. This allows readers to freeze-frame the map and examine the details.

I sent Dona a few comments:

  • A speed control would be nice
  • Remove the word "quintiles" from the legend
  • Add some cumulative measures, such as "90-day delinquency or worse"


Let's start with the last recommendation. Any given delinquent mortgage will move between the different states of delinquencies over time, and so it is useful to look simultaneously at the evolution of all three levels of problems. In particular, a reduction in 90-day delinquencies may mean that people are starting to pay off their loans, or it may mean that some of these mortgages have become foreclosures.

In Long Island, for example, the proportion of loans 90-day delinquent appeared to have decreased from Jan 2010 to Dec 2011 but as the second set of charts showed, the reduction was probably because many of these loans turned into foreclosures.



The second recommendation requires a bit of explanation, and peering behind the scenes. The legend is shown on the right (this is for Tri-State, foreclosures).

The easiest way to read the chart is to ignore the quintiles, and think of the colors as representing different ranges of delinquency rates within a county.

The quintile is a description of how the designer divides the counties into five equal-sized groups (corresponding to the five shades of colors). All the counties are sorted in increasing order of delinquency rate from the lowest to the highest. The bottom 20% of counties are classified as Quintile 1, the next 20% as Quintile 2, and so on up to Quintile 5 (the worst performing counties).

Each quintile represents a range of delinquency rates. For instance, in the above example for Tri-State foreclosures, Quintile 1 consists of counties with foreclosure rates between 0% and 1.3%. My point is that it is sufficient for readers to know the range of foreclosure rates associated with each color. Sometimes, introducing technical terms is more trouble than it's worth.


There is another reason why I would hide the quintile information if I were the designer of this chart. It's because there are different ways to define quintiles, and it takes too much time to explain your choice.

Notice that the maps of 2007 are very light colored across most counties (see example on the right) and the later maps are much darker colored. This means that my description above is too simplistic: Quintile 1 is not really the best-performing 20% of counties - it contains the best-performing 20% of county-month-year combinations. The designer starts with a list which contains an entry for each county for each month of each year, instead of a list that contains one entry for each county.

Both ways of defining quintiles are legitimate. The resulting maps will emphasize different aspects of the data. The way Dona's team did this, the maps emphasize the general worsening of delinquency rates over this period of time. (This is why the delinquency rates rather than quintiles are more important to understanding the chart.)

Alternatively, one can choose to take each month as a separate dataset, and then divide the list of counties into quintiles. The maps would now look totally different because in this rendition, all five colors will feature in equal numbers in each freeze-frame. This view allows readers to know at any moment in time which are the best counties and which are the worst. However, it has the disadvantage that the range of delinquency rates defining each color would shift from month to month. In this version, the legend should be described in terms of quintiles rather than rates.
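To make the two definitions concrete, here is a minimal Python sketch. The county names and rates are made up for illustration (they are not New York Fed data), and ranking into equal-sized groups is only one of several conventions for computing quintiles. The helper names `assign_quintiles` and `per_month_quintiles` are mine.

```python
def assign_quintiles(rates):
    """Map each key to a quintile (1 = lowest rates, 5 = highest) by rank."""
    ordered = sorted(rates, key=rates.get)  # keys sorted by increasing rate
    n = len(ordered)
    return {key: rank * 5 // n + 1 for rank, key in enumerate(ordered)}


def per_month_quintiles(rates):
    """Rank counties separately within each month."""
    result = {}
    for month in {month for _, month in rates}:
        subset = {key: rate for key, rate in rates.items() if key[1] == month}
        result.update(assign_quintiles(subset))
    return result


# Hypothetical delinquency rates (%), keyed by (county, month).
rates = {
    ("A", "2007-01"): 0.5, ("B", "2007-01"): 0.8, ("C", "2007-01"): 1.0,
    ("D", "2007-01"): 1.2, ("E", "2007-01"): 1.5,
    ("A", "2011-12"): 2.0, ("B", "2011-12"): 2.5, ("C", "2011-12"): 3.0,
    ("D", "2011-12"): 3.5, ("E", "2011-12"): 4.0,
}

# Pooled: one list of county-month entries, as the published maps appear to do.
pooled = assign_quintiles(rates)

# Per month: every frame uses all five colors in equal numbers.
by_month = per_month_quintiles(rates)

for key in sorted(rates, key=lambda k: (k[1], k[0])):
    print(key, "pooled:", pooled[key], "per-month:", by_month[key])
```

With these invented numbers, the pooled version puts every 2007 entry in the lower quintiles (a light map) while the per-month version spreads all five colors across each frame.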

In the end, I think the way this chart is constructed makes sense. My little suggestion is to not mention the quintiles at all and let that work in the background.


When reading complex charts like this, you may not realize the number of decisions that have already been made. This is a great example of how such decisions affect the appearance of the final product.



Come see this panel on data visualization in NYC

The New York Public Library is organizing a panel on "What makes a good data visualization?". This is happening on April 4 in New York City. I'll be on the panel. There are panelists from many different disciplines so I expect a lively dialogue.

The link to register for this free event is here.

High-effort graphics

Jon Quinton made a chart for Cancer Research UK, which is quite an eyeful.


The full infographic is here.

Below is a close-up of the key of this chart:


Where would this chart fall in my "return on effort matrix"? It is an extremely high-effort chart; I got tired trying to figure out what all those dimensions mean.

Is it a high-reward or a low-reward chart? It depends on why you're reading the chart. For most readers, I suspect it's low-reward.


In my view, the best charts are high-reward, low-effort. I'd emphasize that by effort, I mean effort by the reader. In general, the effort by the chart designer is inversely proportional to that by the reader.

In some special cases, high-effort charts may have high reward justifying the destruction of some brain cells.

Low-effort, low-reward charts are harmless.

More on the return-on-effort matrix here.


One simple improvement to a chart like this one is to produce separate charts for men and women. Outside academia, it seems to me almost all use cases for this chart would involve only one gender.


Guess which day I made this chart

ESPN Magazine issued a special analytics edition to ride the Moneyball bandwagon. In an article talking about the disappearing midrange jump shot from college basketball, they put out this chart:


In the caption of the chart, the key conclusion is: "As you can see, threes reign outside the lane." Well, we must be blind, since that conclusion is very difficult to draw from what we see. A number of reasons contribute to this failure:

  • In a chart like this, the reader is cued to the length of the arcs. But the arcs related to three-pointers are all medium length -- they don't stand out, exactly the opposite of what the caption is saying
  • It's impossible to interpret the scale of the chart. Compare the blue line on the left (Missouri around-the-basket attempts) and the yellow line in the middle (Kentucky three-pointer attempts). They both say 357 but the lines are clearly of different lengths.
  • The analyst is attempting to make a general statement about "college hoops" while the data being presented are from six specific teams. This means that readers are spending time digesting the variability between schools rather than understanding the commonality across schools.

The problem with this type of "racetrack graph" has been discussed here before (see here or here). By using ellipses rather than circles, this chart makes things worse. Now, we can't even imagine where the center of the circle is to judge the angles.


The following line chart has a few revelations:


  • The six schools are not all the same in terms of their shot selection. In particular, California is the exception to the rule. Also, Missouri and to some extent Syracuse are extreme examples where their players try about the same numbers of three pointers as around-the-basket shots. In our Trifecta checkup (explanation), this means the data used on the chart is out of sync with the key question being addressed. No amount of graphical wizardry can fix this problem.
  • The new chart uses much more sensible units, attempts per game. The original chart shows total attempts for matches up to the day the chart was prepared. To make matters worse, the designer did not disclose anywhere what day that was, or how many games were included. By looking at season-end statistics (34 total games), it appears to me that the data being plotted are the total attempts in the first 22 games (up to the end of January). No reader can interpret total attempts in the first 22 games. I just divided each number by 22, and for anyone who follows basketball, this unit is much more interpretable.
  • What determined the order of the six schools being plotted? Your guess is as good as mine. In our version, I sorted the schools by the ratio of three-pointers to midrange jump shots. So Missouri and Syracuse came out top because they focus so heavily on three-pointers at the expense of midrange shots. At the other extreme, California uses both types of shots in about equal proportions.
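The two data steps behind the line chart (converting totals to per-game figures, then ordering schools by the ratio of three-pointers to midrange shots) can be sketched in a few lines of Python. The 357 total and the 22-game count come from the discussion above, and the 22 is my own estimate rather than a disclosed figure. The school names and their totals below are placeholders, not the actual numbers from the ESPN chart.

```python
GAMES_PLAYED = 22  # estimated above from season-end statistics, not disclosed


def per_game(total_attempts, games=GAMES_PLAYED):
    """Convert a running total of attempts into attempts per game."""
    return total_attempts / games


def sort_by_three_to_midrange(teams):
    """Return team names ordered by three-point-to-midrange ratio, highest first."""
    return sorted(
        teams,
        key=lambda name: teams[name]["three"] / teams[name]["midrange"],
        reverse=True,
    )


# Hypothetical totals for illustration only.
teams = {
    "School C": {"three": 200, "midrange": 200},
    "School A": {"three": 400, "midrange": 100},
    "School B": {"three": 300, "midrange": 150},
}

print(round(per_game(357), 1))           # the 357 total expressed per game
print(sort_by_three_to_midrange(teams))  # heaviest three-point reliance first
```

A school that takes the two kinds of shots in equal numbers has a ratio of 1 and lands at the bottom of this ordering.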



A reader's guide to a New York Times graphic

The New York Times chose to present the poll results from Super Tuesday in the following chart (link):


It took me a bit of time to take in what this chart has to offer. To save you the trouble, I've drawn up a reader's guide:


The graphic is a disguised scatter plot with one axis being Romney's share minus Santorum's share and the other axis being the total share of all other candidates. This is an "uneven canvas" in the sense that the data are much more likely to fall into a small part of the chart area (the orange shaded region).

If the reader just wants to know which segments of the electorate favor Romney v. Santorum, the chart is pretty effective at pointing to the answer. It is quite challenging to learn much else about the data.


Here are the results for Ohio, plotted as a stacked bar chart, with three segments in each bar (Romney's share, Santorum's share and the share of all other candidates).


 This more standard presentation conveys much more of the underlying information. The trade-off is that the reader has to try harder to figure out the answer for each segment of voters.


[PS: 3/13/2012]:

Thanks to several readers for your comments. I went back to look at the NYT graphic again, and can confirm that it is a ternary chart. The chart area is indeed an equilateral triangle with three equal sides.

What threw me off was the axis labels, particularly the Santorum and Romney labels which give the impression that there is a zero mid-point and some kind of share data along the east-west axis. If this were true, then the chart could not be a ternary plot because Romney and Santorum shares are not mirror images.

In a ternary plot, we must identify Romney, Santorum, and "Other candidates" as the three vertices. The way this chart is labelled, it invites readers to drop a perpendicular line to the horizontal axis to read Santorum's share (e.g.). That doesn't work. Trying to fish the data out of a ternary plot is always challenging. You pick the vertex corresponding to the data series you want, say Romney's share. Then you take the side opposite that vertex. Now draw lines parallel to that side -- as you approach the Romney vertex, Romney's share goes from 0% to 100%. The following chart shows this:


For ternary plots, it's easier to go with the hand-waving principle that the closer you are to the vertex, the greater the weight of that vertex. So with the abortion data point, we see that it is much closer to the Santorum corner than the other two corners.
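The parallel-lines reading can also be written down as arithmetic. The Python sketch below assumes a triangle with side length 1 and places Santorum at the bottom left, Romney at the bottom right and "Other" at the top. That placement, the function names and the sample shares are my assumptions for illustration, not details taken from the NYT graphic. A point is the share-weighted average of the three vertices, and a share is read back as the distance from the side opposite its vertex, divided by the height of the triangle.

```python
import math

HEIGHT = math.sqrt(3) / 2  # height of an equilateral triangle with side 1

# Assumed vertex placement (hypothetical, for illustration).
VERTICES = {
    "santorum": (0.0, 0.0),
    "romney": (1.0, 0.0),
    "other": (0.5, HEIGHT),
}


def to_xy(shares):
    """Place a point as the share-weighted average of the three vertices."""
    total = sum(shares.values())
    x = sum(shares[name] / total * VERTICES[name][0] for name in VERTICES)
    y = sum(shares[name] / total * VERTICES[name][1] for name in VERTICES)
    return x, y


def read_share(point, name):
    """Read one share: distance from the side opposite `name`, over the height."""
    (ax, ay), (bx, by) = [VERTICES[v] for v in VERTICES if v != name]
    px, py = point
    # Perpendicular distance from the point to the line through a and b.
    cross = (bx - ax) * (py - ay) - (by - ay) * (px - ax)
    distance = abs(cross) / math.hypot(bx - ax, by - ay)
    return distance / HEIGHT


# Made-up shares for one voter segment.
shares = {"romney": 0.3, "santorum": 0.5, "other": 0.2}
point = to_xy(shares)
for name in shares:
    print(name, round(read_share(point, name), 3))
```

Moving a point along a line parallel to the side opposite Romney leaves `read_share(point, "romney")` unchanged, which is the property the parallel lines are meant to convey.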

The vertical line for "other candidates" is also misleading. To read the share of votes that went to other candidates, one has to follow either the OR side or the SO side of the triangle. Basic geometry will show that going up the vertical line will not produce the share of "other candidates".


Lastly, here is a scatter plot representation of the data using the Romney-Santorum difference as the horizontal axis and the share of all other candidates as the vertical axis:


The pattern of dots on this chart looks very similar to the ternary chart (that is one other reason why I thought the original graphic was a scatter plot). However, the two plots are distinct entities. For the scatter plot, the horizontal axis goes from -100% to +100% while the vertical axis can only go from 0% to 100%.
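For anyone wanting to reproduce the scatter plot, the coordinates are simple to compute. The Python sketch below uses made-up shares, and the function names are mine. It also checks a constraint that follows from the three shares being non-negative and adding up to 100%: the Romney-Santorum gap can never exceed what is left after the other candidates take their share, so the dots can only land in a triangular part of the rectangular chart area. That may be part of why the two pictures look alike.

```python
def scatter_point(romney, santorum, other):
    """Return (horizontal, vertical) = (Romney minus Santorum, share of all others)."""
    return romney - santorum, other


def is_feasible(x, y, tol=1e-9):
    """Shares are non-negative and sum to 1, so |x| can be at most 1 - y."""
    return 0 <= y <= 1 and abs(x) <= 1 - y + tol


# Made-up shares for one voter segment (fractions that sum to 1).
x, y = scatter_point(romney=0.3, santorum=0.5, other=0.2)
print(round(x, 3), round(y, 3), is_feasible(x, y))

# A corner of the rectangular canvas that no real data point can occupy:
print(is_feasible(1.0, 1.0))
```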


Conceptual colors, negative proportions, mysterious axes, and all that

Reader Jordan G. found a different-looking chart on visualizing.org, of which I excerpted the following:


This part comes from the bottom right corner of an entire page of charts (link). The title of the entire project "Gaps in the U.S. Healthcare System" may give some hints as to what the designer was intending to portray. Looking at this part by itself, the reader is missing some information:

  • What do the pink, orange, dark pink colors mean?
  • What is plotted on the vertical scale, which is in percentages?
  • The horizontal axis may have something to do with distance/location. It's divided into three sections. Is it a continuous scale (say, kilometres or miles) or is it a categorical scale (large, medium, noncore)?

The first question is answered by the legend of the post, situated on the far left. Simply by printing the labels for racial groups on this chart, the designer would have saved readers the effort to look for this information.

The second question is not addressed anywhere on the chart but most likely, the percentages represent proportions of adults over 50 years old who ever received the three types of -scopies. The mirrored nature of the vertical axis is odd. As much of the chart is above the zero-proportion line as exists below the line. What do negative proportions mean?

Because I couldn't figure out the answer to the third question, I can't interpret this chart at all. I see that the proportion of adults fluctuates from left to right for every racial group. But with what is the proportion varying? It also appears as if Asians only live in urban areas.


Jordan asks a question about color choice here that is worth discussing.

In this chart, the author chooses to use a color coding system to represent race (pink squares = African Americans).  I have seen other charts use “actual” colors to represent race (white = Caucasian, black = African American).  I could see some audiences taking offense at these different color representations, especially where the color chosen has been used pejoratively in the past (like “red” for American Indian, “yellow” for Asian).  What would you consider an effective and appropriate way to encode different races on visualizations?