
Nice chart from the neck down

I was drawn to this Wall Street Journal chart because of the blue columns.


The blue color solves a common problem in time-series plots when the time axis is incomplete. The first quarter of 2015 is dangling. The article is about first-quarter economic performance and so it is appropriate to focus attention on the Q1 columns.

The rest of the chart is filled with Tufte goodies: the clean axes and labels and so on.

The online edition shows a slightly different chart ("Slow Job Growth Tests Economy", April 4, 2015):


This one singles out the last column for attention. Readers are invited to compare the most recent month with any of the other months displayed on the chart. By contrast, in the printed version, readers are guided to compare first-quarters across years. The choice of colors leads readers in different directions.

Another difference between the two charts is the portrayal of missed expectations in the online version. (I am ignoring the vertical line on top of the T, which is just confusing and unnecessary.) This seems to be the main story in the chart. If so, I'd like to see the forecast data displayed for several other months. By doing so, they would drive home the message that this most recent month is uniquely bad.

The footnote is actually very important (I'd place it in a more visible spot). It is because of seasonal adjustment that readers can compare the heights of any column to any column on these two charts. If the data were not adjusted, then it would be difficult to separate the trend from seasonality. For a refresher on seasonal adjustment, see my post here, and the chapters on economic statistics in Numbersense (link).
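
To make the footnote concrete, here is a minimal sketch of what an additive seasonal adjustment does. Everything in it is an assumption for illustration: the quarterly numbers are made up, the function name is hypothetical, and the method (subtracting each season's average deviation from the overall mean) is far cruder than what statistical agencies actually use.

```python
from statistics import mean

def seasonally_adjust(values, period=12):
    """Subtract each season's average deviation from the overall mean.

    A crude additive adjustment: it assumes a stable seasonal pattern and
    ignores any trend when estimating the seasonal effects.
    """
    overall = mean(values)
    # Average deviation from the overall mean for each position in the cycle.
    effects = [mean(values[i::period]) - overall for i in range(period)]
    return [v - effects[i % period] for i, v in enumerate(values)]

# Hypothetical series: a flat level of 100 plus a repeating quarterly pattern.
pattern = [-10, 0, 5, 5]
raw = [100 + pattern[i % 4] for i in range(12)]
adjusted = seasonally_adjust(raw, period=4)

print(raw[:4])       # raw values differ only because of the season
print(adjusted[:4])  # adjusted values can be compared column to column
```

In the raw series, the first quarter always looks weak even though nothing has changed. After adjustment, any remaining gap between two columns reflects something other than the calendar.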


The printed edition above is a great chart from the neck down. The headline of the chart lets it down. "The U.S. economy has seen several disappointing first quarters since the recovery began."

The blue columns do not stand out as particularly bad, especially if one considers that there should be a margin of error around each number in the chart.

The column chart is purely about "nonfarm payrolls" which is only one aspect of the U.S. economy. The other data series, GDP, tucked below the column chart, show a set of positive annual numbers which do not fit the headline either.

Reading between the gridlines

Reader Jamie H. pointed me to the following chart in the Guardian (link), which originated from Spotify.


This chart is likely inspired by the Arctic ice cover chart discussed here last year (link):


Spotify calls its chart "the Coolness Spiral of Death" while the other one is called "Arctic Death Spiral".

The spiral chart has many problems, some of which I discussed in the post from last year. Just take a look at the headline, and then the black dotted spiral. Does the shape evoke the idea of rapid evolution, followed by maturation? Or try to figure out the amount of evolution between ages 18 and 30.


Instead of the V corner of the Trifecta, I'd like to focus on the D corner today. When I look at charts, I'm always imagining the data behind the chart. Here are some questions to ponder:

  • Given that Spotify was founded in 2006 (not quite 10 years ago), how are they able to discern someone's music taste from 14 through 48?
  • The answer to the above question is they don't have a longitudinal view of anyone's music taste. They are comparing today's 14-year-old kid with today's 48-year-old adult. Under what assumptions would such an analysis yield the same outcome as a proper analysis that tracks the same people over time?
  • If the phenomenon under study follows a predictable trend, there will be little difference between the two ways of looking at the data. For example, teeth in the average baby follow a certain sequence of emergence, first incisors at six months, and first molars at 14 months (according to Wikipedia). Observing John's teething at six months and David's at 14 months won't yield much difference from looking at John at six then 14 months. Does music taste evolve like human growth?
  • Unfortunately, no. Imagine that a new genre of music suddenly erupts and it becomes popular among every generation of listeners. This causes the Spotify curve to shift towards the origin at all ages. However, if you take someone who is currently 30 years old, the emergence of the new genre should affect his profile at age 30 but not anytime before. In fact, the new music creates a sharp shift at different locations of everyone's taste profile depending on one's age!
  • Let's re-interpret the chart, and accept that each spoke in the wheel concerns a different cohort of people. So we are looking at generational differences. Is the Spotify audience representative of music listeners? Particularly, is each Spotify cohort representative of all listeners of that age?
  • I find it unlikely since Spotify has that "cool" factor. It is probably more representative for younger age groups. Among older customers, there should be some bias. How does this affect the interpretation of the taste profile?
  • If we find that one cohort differs from another cohort, it is important to establish that the gap is a generational difference and not due to the older age group being biased (self-selected) in some way.
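
The gap between the two ways of looking at the data can be sketched with a toy simulation. Nothing here comes from Spotify: the score formula, the size of the shock, and the survey year are invented purely to illustrate the new-genre thought experiment above.

```python
def taste_score(age, year, shock_year=2015, shock=-20.0):
    """Hypothetical distance of a listener's taste from the chart's origin.

    Age effect: the score grows steadily as listeners get older.
    Period effect: a new genre that is popular with everyone pulls every
    listener toward the origin from shock_year onward, whatever their age.
    """
    age_effect = 2.0 * (age - 14)
    period_effect = shock if year >= shock_year else 0.0
    return age_effect + period_effect

SURVEY_YEAR = 2015

# Cross-sectional view: different people, all observed in the survey year.
cross_section = {age: taste_score(age, SURVEY_YEAR) for age in range(14, 49)}

# Longitudinal view: one person who is 30 in the survey year, tracked from 14.
birth_year = SURVEY_YEAR - 30
longitudinal = {age: taste_score(age, birth_year + age) for age in range(14, 31)}

for age in (14, 20, 29, 30):
    print(age, cross_section[age], longitudinal[age])
```

In the cross-section, the shock lowers the score at every age, so the curve keeps its shape and simply moves. In the tracked listener's history, the same shock shows up as a sudden drop between ages 29 and 30. The two views agree only at the age the person happens to be in the survey year.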



What if the Washington Post did not display all the data

Thanks to reader Charles Chris P., I was able to get the police staffing data to play around with. Recall from the previous post that the Washington Post made the following scatter plot, comparing the proportion of whites among police officers relative to the proportion of whites among all residents, by city.


In the last post, I suggested making a histogram. As you see below, the histogram was not helpful.


The histogram does point out one feature of the data. Despite the appearance of dots scattered about, the slopes (equivalently, angles at the origin) do not vary widely.

This feature causes problems with interpreting the scatter plot. The difficulty arises from the need to estimate dot density everywhere. This difficulty, sad to say, is introduced by the designer. It arises from using overly granular data. In this case, the proportions are recorded to one decimal place. This means that a city with 10% is shown separately from one with 10.1%. The effect is jittering the dots, which muddies up densities.

One way to solve this problem is to use a density chart (heatmap).


You no longer have every city plotted but you have a better view of the landscape. You learn that most of the action occurs on the top row, especially on the top right. It turns out there are lots of cities (22% of the dataset!) with 100% white police forces. This group of mostly small cities is obscuring the rest of the data. Notice that the yellow cells contain very little data, fewer than 10 cities each.
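
The binning behind a density chart can be sketched in a few lines. This is a minimal illustration, not the code used for the charts in this post: the function name bin_counts, the 10-point cell width, and the sample cities are all made up.

```python
from collections import Counter

def bin_counts(points, width=10):
    """Count points per (x_bin, y_bin) cell; coordinates are percentages 0-100.

    Each cell is identified by its lower-left corner. A value of exactly 100
    is placed in the top cell rather than in a cell of its own. The width is
    assumed to divide 100 evenly.
    """
    top = 100 - width
    counts = Counter()
    for x, y in points:
        cell = (min(int(x // width) * width, top),
                min(int(y // width) * width, top))
        counts[cell] += 1
    return counts

# Hypothetical (white residents %, white officers %) pairs, not the Post's data.
cities = [(10.0, 12.3), (10.1, 12.9), (62.4, 100.0), (88.0, 100.0),
          (91.5, 100.0), (45.2, 58.7), (47.9, 51.0)]

cells = bin_counts(cities, width=10)
for cell, n in sorted(cells.items()):
    print(cell, n)

# Share of cities whose police force is reported as 100% white.
all_white_share = sum(1 for _, y in cities if y == 100.0) / len(cities)
print(round(all_white_share, 2))
```

The two cities at 10.0% and 10.1% now land in the same cell instead of appearing as separate dots. Passing a larger width gives coarser cells, which helps when individual cells hold too few cities to be meaningful.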

For the question the reporter is addressing, the subgroup of cities with 100% white police forces is trivially important. Most of these places have at least 60% white residents, frequently much higher. But if every police officer is white, then the racial balance will almost surely be "off". I now remove this subgroup from the heatmap:


Immediately, you are able to see much more. In particular, you see a ridge in the expected direction. The higher the proportion of white residents, the higher the proportion of white officers.

But this view is also too granular. The yellow cells now have only one or two cities. So I collapse the cells.


More of the data lie above the bottom-left-top-right diagonal, indicating that in the U.S., the police force is skewed white on average. When comparing cities, we can take this national bias out. The following view does this.


The circle marks the average city, which sits at relative proportions of zero and zero. Notice that now, the densest regions are clustered around the 45-degree dotted diagonal.
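
One way to take the national bias out is to re-express every city relative to the average city. The sketch below assumes a plain unweighted mean on each axis, which may not match how the chart above was built. The four cities and both function names are hypothetical.

```python
from statistics import mean

def center_on_average(points):
    """Shift points so that the average city sits at (0, 0)."""
    avg_x = mean(x for x, _ in points)
    avg_y = mean(y for _, y in points)
    return [(x - avg_x, y - avg_y) for x, y in points]

def distance_from_diagonal(dx, dy):
    """Signed perpendicular distance from the 45-degree line through (0, 0).

    Positive means the force is whiter than the national pattern would
    suggest for that city; negative means it is less white.
    """
    return (dy - dx) / 2 ** 0.5

# Hypothetical (white residents %, white officers %) pairs.
cities = {"A": (40.0, 60.0), "B": (60.0, 80.0),
          "C": (50.0, 40.0), "D": (70.0, 80.0)}

names = list(cities)
centered = dict(zip(names, center_on_average([cities[n] for n in names])))
gaps = {n: distance_from_diagonal(*centered[n]) for n in names}

# The city farthest from the diagonal is the one most worth a closer look.
outlier = max(gaps, key=lambda n: abs(gaps[n]))
print(outlier, round(gaps[outlier], 1))
```

Before centering, every city except C sits above the diagonal. After centering, a city that merely shares the national gap lands on the diagonal, and only the cities that depart from the national pattern stand out.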

To conclude, the Washington Post data appear to show these insights:

  • There is a national bias of whites being more likely to be in the police force
  • In about one-fifth of the cities, the entire police force is reported to be white. (The following points exclude these cities.)
  • Most cities conform to the national bias, within an acceptable margin of error
  • There are a small number of cities worth investigating further: those that are far away from the 45-degree line through the average city in the final chart shown above.

Showing all the data is not necessarily a good solution. Indeed, it is frequently a suboptimal design choice.

Nice analysis of racial composition of police forces

The Washington Post has a good idea. Using Census data, they computed the proportion of the police force that is white and the corresponding proportion of citizens who are white, in different cities.

In the following scatter plot, they singled out North Charleston, SC where the police force is 85% white but the citizens are only 40% white: (Link to the interactive chart.)


This plot itself is well done, with helpful coloring and labels.

One must be careful about "story time": it's easy to infer from the graph that blue dots mean worse racial tension but that interpretation requires an assumption not proven in the data. (What is missing is the correlation between this data and some other data measuring tension.)

The secret to reading this chart is to look at the slopes of lines from the origin to each point. Above the 45-degree diagonal separating the blue dots from the gray are the cities where the police force is more white than the residents. The steeper the line to the origin, the more unrepresentative. Once you pass the 45-degree line, do the reverse.

The slope is really the metric of X police per Y residents. So the two dimensions can be collapsed into one. With the one dimension, I'd try a histogram view. If you find the data, let me know. Or just post it to the comments.
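
The collapse into one dimension can be sketched as follows. I use the angle at the origin rather than the raw slope because the angle stays between 0 and 90 degrees. Apart from the North Charleston figures quoted above, the cities are made up, and the function names and the 15-degree bin width are arbitrary choices.

```python
import math
from collections import Counter

def angle_at_origin(residents_pct, police_pct):
    """Angle in degrees of the line from the origin to the city's dot.

    45 degrees means the force mirrors the residents; above 45 means the
    force is whiter than the residents.
    """
    return math.degrees(math.atan2(police_pct, residents_pct))

def histogram(values, width=15):
    """Count values per bin of the given width, keyed by the lower edge."""
    return Counter(int(v // width) * width for v in values)

# (white residents %, white officers %); the first pair is North Charleston,
# the rest are hypothetical.
cities = [(40.0, 85.0), (50.0, 52.0), (80.0, 90.0), (70.0, 60.0), (30.0, 45.0)]

angles = [angle_at_origin(x, y) for x, y in cities]
bins = histogram(angles)
for lower, n in sorted(bins.items()):
    print(f"{lower:>2}-{lower + 15:<2} {'#' * n}")
```

Each city is reduced to a single number, so the whole dataset fits on one axis. Cities in bins far from 45 degrees are the unrepresentative ones.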

Hello to St. Louis readers


I'll be hosting a Data Visualization workshop at the Digital Media Marketing Conference in St. Louis, Missouri on Thursday. Here is the link to their website.

The workshop is arranged around three themes: Appreciating, Conceptualizing, and Improving. There will be several hands-on exercises.

If you are a reader in St. Louis, and would like to meet up, email me.


Posting this week will be light because of various commitments. I may put something up later this week.

One of my students pointed me to this Medium article about a NYT chart. Well worth reading.


Planned redundancy

The following Wall Street Journal chart caught my eye the other day: (Link to article)


Looking closely, I realize that the four charts are identical, except for the call-outs. This is a kind of small-multiples in which the same data reside in each panel but the labeling changes. It's planned redundancy but I'm afraid I don't see the point.

The chart compares four different ways to save money by cutting cable. Here is an alternative that places the focus on the number of dollars saved:



Graphical forms impose assumptions on the data

In a comment to my previous post, reader Chris P. pointed me to the following set of maps, also from the New York Times crew, on the legalization of gay marriage in the U.S. (link)



(For those who did not click through, the orange colors represent two types of bans while the dark gray/grey color indicates legalization.)

These maps are pleasing to the eye for sure. By portraying every state as a same-sized square, the presentation avoids the usual areal distortion introduced by the map.

But not so quick. Note that each presentation makes its own assumption on the relative importance of states. The typical map scales weights according to geographical area while this presentation assumes that every state has equal weight. Another typical cartographic display uses squares of different sizes, based on the population of each state.

The locations of states are necessarily distorted. One way to remedy this is to have hover-over state labels. On a browser, such interactivity works better than having to scroll to the top where there is a larger map which doubles as the legend.

It would be interesting to learn also about the future. Is there any legislation in the pipeline either to legalize gay marriage in the remaining orange states or to overturn the legalization laws in the gray states?


PS. [5/6/2015] Here is an alternative presentation of this data by David Mendoza.