
The snow made me do it - California, here I come

California readers: here's a chance to come meet me. I am giving talks in San Diego (Feb 3) and San Mateo (Feb 5) next week, courtesy of JMP. Free registration is here.

These talks are related to two ongoing projects of mine. The first is to create a theory of data visualization criticism: how can we use precise language to describe our reactions, good and bad, to data visualization work? The second concerns how to find stories in a mass of data.


I'd love to meet some of you on the West Coast who are fans of the blog. Please also forward this announcement to your friends or colleagues who might be interested.

Three short lessons on comparisons

I like this New York Times graphic illustrating the (over-the-top) reaction by the New York police to the Eric Garner-inspired civic protests during the holidays. This is a case where the data told a story that mere eyes and ears couldn't. The semi-strike was clear as day from the visualization.

There are three sections to the graphic, and each displays a different form of comparison.

The first chart is the most straightforward, comparing the number of summonses this year to that of the same time a year ago.


One could choose lines for both data series. The combination of one line and column also works. It creates a sensation that the columns should grow in height to meet last year's level. The traffic cops appear to have returned to work more quickly. That said, I don't care for the shades of brown/orange of the columns.


The second chart accommodates a more complex scenario, one in which the simple year-on-year comparison is regarded as misleading because the overall crime rate materially dropped from 2013 to 2014. In this scenario, a before-after comparison may be more valid.


The chart has multiple sections; I am only showing the section concerning summonses. (The horizontal axis shows time: the first black column covers the first ten months, and the orange columns show individual months since then. The vertical axis shows the percent change from a year ago.)

The chart shows that in the first ten months of 2014, before the semi-strike, the number of summonses issued was already slightly below the same period the year before. Through the dotted line, the reader is invited to compare this level of change against those in the ensuing months. How starkly the summons rate fell!
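The "percent change from a year ago" metric on the vertical axis is simple arithmetic; here is a minimal sketch (the counts below are hypothetical placeholders, not the NYPD data):

```python
def pct_change_from_year_ago(current, year_ago):
    """Percent change versus the same period a year earlier."""
    return 100.0 * (current - year_ago) / year_ago

# Hypothetical counts: 95,000 summonses vs. 100,000 a year earlier
# gives a modest decline even before the semi-strike began.
print(pct_change_from_year_ago(95_000, 100_000))  # → -5.0
```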


The final chart reveals yet another comparison. Geography is introduced here in the form of a proportional-symbol map.


Again, you can't miss the story: across every precinct, summonses have disappeared. This chart is very helpful in making the case that the observed drop is not natural.



Boxes or lines: showing the trend in US adoptions

Time used a pair of area charts (a form of treemap) to illustrate the trend in Americans adopting babies of foreign origin. The data consist of the number of adopted babies, labeled by country of birth, in 1999 and in 2013.


This type of chart fails the self-sufficiency test. The entire dataset is faithfully reproduced on the printed page precisely because readers cannot figure out the relative sizes of the boxes by eye. (Try imagining the charts without the numbers.)

This need to present all of the data creates an additional design challenge: how to place country names where several boxes crowd against one another. The designer here adopts an expand-from-the-middle approach, which takes some getting used to.


In addition, the distance placed between the pair of dates is vast, which is not optimal for a graphic whose primary goal is to elicit a trend.


Here is the Bumps-style chart. These charts are great except where the data are tightly clustered. Recently I have been experimenting with small-multiples as a way to split up the data, which alleviates the labeling challenge.


In this version, the countries are shown as four groups. The countries that show up as significant enough in each year to merit individual labels are shown in the middle, themselves split into two groups: those that have seen their share of adoptions increase versus those that have seen a decrease. The remaining countries show up in only one of the two years. Presumably this means that in the other year, there were zero adoptions from those countries. (However, it is also possible that in the missing year, the numbers were so tiny that they were included in the "Rest of the World" category.)

I also switched to graphing shares of adoptions rather than numbers of adoptions. The total number of adoptions dropped drastically during that period. It is often the share, not the absolute number, that is of interest.
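Switching from counts to shares is a one-line normalization. A minimal Python sketch (the country names and counts below are hypothetical placeholders, not the actual adoption figures):

```python
def to_shares(counts):
    """Convert a dict of counts to each item's share of the total."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

# Hypothetical counts, for illustration only:
adoptions = {"China": 2300, "Ethiopia": 990, "Rest of World": 3710}
shares = to_shares(adoptions)  # shares sum to 1 regardless of the total
```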



Why you need a second pair of eyes

Reader Aaron K. submitted an infographic advertising the upcoming New England Auto Show to be held in Boston (link).

As Aaron pointed out, there are plenty of elementary errors contained in one page. I don't think the designer did these things consciously. I believe in having someone else glance at your work before you publish it. Or take a walk around the house and look at your own work again after clearing your head.

In the following diagram, the graphical elements (stick figures) encode the data labels, rather than the data!


Helping readers figure out which one is male and which one is female seems, hmm, unnecessary.


Placing the above two charts side by side has the effect of suggesting that only male attendees were asked about their age.


Look again: is the proportion of attendees over 18 equal to 4%, 96%, or 100%?



This map irritates me.


Is it because they could have enlarged the frame just a little so as not to have to expel little Rhode Island from New England? Is it because not having the right frame size caused two numbers to sit outside New England when only one should? Is it because having two numbers outside the boundary tempted the designer to single out Rhode Island for the purpose of labeling? Is it because no other state is labeled besides Rhode Island?

Or is it because the land area is vastly disproportional to the data being displayed? Is it because the map construct is a geography lesson and nothing more (something I wrote about years ago)? Is it because the geography lesson is incomplete since only one state is labeled?


According to the text at the bottom, this part of the country is proud of "it's (sic) academia" and has hundreds of thousands of college students, who somehow "contribute $4.8 billion+ to the city's economy," which tells me they are super-productive in the classrooms.

Losing sleep over schedules

A fan of the blog, John H., wrote a JunkCharts-style post about a chart that was picked as a "Best of" 2014 by Fast Company (link). I agree with him: it seems a better fit for a "Worst of" list. Here it is:


As John pointed out, the outside yellow arc (Beethoven) and the inside green arc (Simenon) present, shockingly, the same exact sleep schedule (10 pm to 6 am).

John unrolled the arcs and used R to make this version:


Go here to read John's entire post.
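John's unrolling step amounts to mapping clock-face positions onto a linear time axis, taking care with schedules that wrap past midnight. A minimal re-sketch in Python (John's original is in R; the function name here is mine):

```python
def sleep_duration(start_hour, end_hour):
    """Hours of sleep on a 24-hour clock, handling schedules
    that cross midnight (e.g., 10 pm to 6 am)."""
    return (end_hour - start_hour) % 24

# Beethoven and Simenon both reportedly slept 10 pm to 6 am:
print(sleep_duration(22, 6))  # → 8
```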


Another improvement is to add a "control". One way to understand how unusual these sleep patterns are is to compare them to the average person.

I'm also a little dubious about the reliability of this data. How do we know their sleep schedules? And how variable were their schedules?

If I rate this via the Trifecta Checkup, I'd classify this as Type DV.



A great start to the year

I'd like to start 2015 on a happy note. I enjoyed reading the piece by Steven Rattner in the New York Times called "The Year in Charts". (link)

I particularly like the crisp headers and unfussy language, which place the charts at the center. The components of the story flow nicely.


Here are my notes on some of the charts:


This chart is missing context: performance relative to population growth, or to potential growth. Changing the context changes the implicit yardstick. The implied metric here is merely more-than-zero, or continued, growth.


It took me a while to find the titles and figure out what each section depicts. I'd prefer to put the titles back at the top or in the top left corner. The "information in my head" is making me look in the "wrong" places. But otherwise, this is Tufte goodness.


This innocent thing prompts a host of questions. First, how could a single population have so many "median" values? It would appear that this is an exercise in isolating each quintile (decile in the case of the top 20%) and computing the median within each segment. In other words, the data represent these income percentiles: 95th, 85th, 70th, 50th, 30th and 10th. Given that the income data have already been grouped, computing group averages makes more sense than computing group medians. This is especially so when comparing changes over time: the median is robust, so it suppresses changes.
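The grouping I am describing can be sketched in Python. The cut points reflect my reading of the chart (four quintiles, with the top 20% split into two deciles), and the income vector is a toy placeholder, not the Fed's data:

```python
import statistics

def group_medians(sorted_incomes):
    """Median within each segment: four quintiles plus the top 20%
    split into two deciles. Cut points are illustrative."""
    n = len(sorted_incomes)
    bounds = [0.0, 0.2, 0.4, 0.6, 0.8, 0.9, 1.0]
    groups = [sorted_incomes[int(lo * n):int(hi * n)]
              for lo, hi in zip(bounds, bounds[1:])]
    return [statistics.median(g) for g in groups]

# With toy incomes 1..100, the group medians land at roughly the
# 10th, 30th, 50th, 70th, 85th and 95th percentiles:
print(group_medians(list(range(1, 101))))  # → [10.5, 30.5, 50.5, 70.5, 85.5, 95.5]
```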

The bucketing of income presents another challenge. All buckets except the very top one are essentially bounded: the central buckets have minimum and maximum values, and the bottom bucket is bounded below by zero. The top bucket, however, is basically unbounded, so important features of the data could be lost by summarizing it by its median.

A third problem surfaces if one inquires how the survey collects its data. According to the Federal Reserve description, the data concern "usual income" as opposed to "actual income". Respondents are told to ignore "temporary" conditions in describing their "usual incomes". It is likely that people regard income increases as permanent but getting laid off as temporary, so while usual income solves one problem (the long-term planner's problem), it creates a different one (short-term bias). I particularly don't think it is a good metric for assessing changes around a recession and recovery.

I also wonder about the imputation of missing data. I'd assume there may be a preponderance of missing values among unemployed people. If the imputation cannot predict the employment status of those people, then it would surely have inflated incomes.

I wonder if any of my readers knows details about some of these potential problems. Would love to hear how the Fed's statisticians deal with these issues.


On this chart, the author has found an excellent story, and the graphic is effective. I prefer to see the horizontal axis labeled "More Unequal" rather than "Less Equal" because of the convention that "more" is usually placed to the right of "less" on a horizontal axis. Here is a scatter plot version of the data:


It shows the U.S. is a bit more extreme than all others.


This is another great chart. I like the imagery of the emptying middle. I find the labels a bit too long, requiring too much interpretation. I prefer this: