
When a chart does nothing for the story

There is some banter on Twitter (@thewhyaxis, @jschwabish, @tealtan) about a chart that appeared in The Atlantic under the headline "Pixar's Sad Decline--in One Chart".

Link to article


It's a bit horrible but not the worst chart ever.

The most offensive aspect is the linear regression line. It's clearly an inappropriate model for this dataset.

I also don't like charts that include impossible values on the axis. In this case, the Rotten Tomatoes score can never go above 100%.

If the chart is turned on its side, the movie titles can be read horizontally.


I am compelled by the story but the chart doesn't help at all. Of course, it would be better if they could find data on the profitability of each movie. Readers should ask how correlated the Rotten Tomatoes score is with box office, and also what the relative costs of producing these different movies are. Jon has a chart of score against profit (link).


Hard work pays off

At the NY Tech Meetup, Andrei Scheinkman showed off some work his team at Huffington Post did relating to gun violence in America.



Interactive version is here. The animation shows day by day, where the victims of gun violence were located. The table below contains the details of each victim, and links to the news story covering the event.


What is not seen on the chart is even more impressive. Andrei described how they looked around for databases that would provide the raw materials for creating this chart, but no timely source exists. This means that a team of 15 (if I heard correctly) spent a month or so manually collecting all the data in a spreadsheet.

It's also the reason why they cannot continue the map indefinitely, as people have other things to do.

Andrei also contrasted this visualization with a text article that describes the state of gun violence in words. You guessed it, the visual presentation is hands-down more compelling.

Back to basics

Today, we review one of the basic principles Ed Tufte very effectively advocated in his famous book: use gridlines and data labels only if absolutely necessary. The enemy is redundancy.

Here is a chart that appeared in the New York Times Real Estate pages: (with this article)


The gridlines serve no purpose. Between the axis labels and the data labels, the designer should pick one. If the data labels are used, then the vertical axis can be removed entirely without affecting our ability to understand the data. One can also argue that the data labels do not convey any real information since the average person is unlikely to be able to process 1004 feet versus 1250 feet. Why not remove the data labels and retain only the axis labels?

I'd be willing to go so far as to remove all data from the chart itself. This is because the Empire State Building has been chosen as the reference point. The assumption behind this choice is that readers have a sense of the "tallness" of the Empire State Building. It is then sufficient to just place columns of different heights next to the Empire State Building. To make the comparison a little easier, one can draw a reference line from the top of the Empire State, like this:



De-noising data

One of the most important steps in analyzing data is to remove noise. First, we have to identify where the noise is, then we find ways to reduce the noise, which has the effect of surfacing the signal.

The labor force participation rate data, discussed here and here, can be decomposed into two components, known as the trend and the residuals. (See right.) The residuals are the raw data minus the trend; in other words, they are the data after removing the trend.

If the purpose of the analysis is to describe the evolution of the labor force participation rate over time, then the trend is the signal we're after.

My purpose here is the opposite. I want to remove the trend in order to surface correlations that are unrelated to time evolution. Thus, the residuals are where the signal is.

Another way to think about the residuals (bottom chart) is that positive values imply the actual data was above trend while negative values imply the actual data was below trend.
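The decomposition described above can be sketched in a few lines. This is only an illustration: the post does not say how the trend was estimated, so a centered moving average stands in for it here, and the numbers in `raw` are made up rather than taken from the labor force data. The names `moving_average_trend` and `decompose` are hypothetical.

```python
def moving_average_trend(values, window=5):
    """Centered moving average; the window shrinks near the ends of the series."""
    half = window // 2
    trend = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        trend.append(sum(values[lo:hi]) / (hi - lo))
    return trend


def decompose(values, window=5):
    """Split a series into (trend, residuals), where residual = raw - trend."""
    trend = moving_average_trend(values, window)
    residuals = [v - t for v, t in zip(values, trend)]
    return trend, residuals


# Made-up numbers that merely resemble a participation rate (assumption).
raw = [66.0, 66.2, 65.9, 65.7, 65.8, 65.3, 65.1, 65.2, 64.8, 64.6]
trend, residuals = decompose(raw, window=3)

for value, t, r in zip(raw, trend, residuals):
    # A positive residual means the raw value sits above trend, negative below.
    position = "above" if r > 0 else "below" if r < 0 else "on"
    print(f"raw={value:5.1f}  trend={t:6.2f}  residual={r:+.2f}  ({position} trend)")
```

Whatever method produces the trend, the identity raw = trend + residual holds by construction, which is why the sign of each residual can be read directly as "above trend" or "below trend".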


After decomposing the miles-driven data in the same way, I obtain two sets of residuals. These were plotted in the last post in a scatter plot.

The lack of correlation is also obvious in the plot below. You can see that the periods when one series of residuals went above trend were not well correlated with the other series being above trend (or below trend).



Once more, superimposing time series creates silly theories

After I wrote the post about superimposing two time series to generate fake correlations, there was a lively discussion in the comments about whether a scatter plot would have done better. Here is the promised follow-up post.

The contentious issue is that X and Y might appear correlated but in fact, what we are observing is that both data series are strongly correlated with time (e.g. population almost always grows with time), and X and Y may not be correlated with each other.

Indeed, the first thing a statistician would do when encountering two data series is to create a scatter plot. Economists, by contrast, seem to prefer two line charts, superimposed.

The reason for looking at the scatter plot is to remove the time component. If X and Y are correlated systematically (and not individually with the time component), then even if we disturb the temporal order, we should still be able to see that correlation. If the correlation goes away in an x-y plot, then we know that the two variables are not correlated, and that the superimposed line charts created an illusion.

The catch is that the scatter plot analysis is necessary but not sufficient. In many cases, we will find strong correlation in the scatter plot. But that does not prove there is X-Y correlation beyond each data series being correlated with time. By plotting X and Y and ignoring time, we introduce time as an omitted variable, which can still be controlling both X and Y series.

The scatter plot (right) shows the per capita miles driven against the civilian labor force participation rate. Having hidden the time dimension, we still see a very strong correlation between the two data series.

This is because time is still the invisible hand. Time still runs from left to right on the chart. This pattern becomes visible if we connect the data points with line segments in temporal order, as in the chart below.
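A small simulation makes the omitted-variable point concrete. The two series below are synthetic (not the miles-driven or labor force data): each is a straight-line time trend plus its own independent noise, so any correlation between them comes through time alone. The helper `pearson` and all parameter values are assumptions chosen for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
periods = list(range(100))

# Each series depends on time and on its own noise, never on the other series.
x = [1.0 * t + rng.gauss(0, 3) for t in periods]
y = [2.0 * t + rng.gauss(0, 6) for t in periods]

r_ordered = pearson(x, y)

# Shuffling the (x, y) pairs together discards temporal order,
# which is what a scatter plot does.
pairs = list(zip(x, y))
rng.shuffle(pairs)
x_shuffled, y_shuffled = zip(*pairs)
r_shuffled = pearson(x_shuffled, y_shuffled)

print(f"correlation in temporal order: {r_ordered:.3f}")
print(f"correlation after shuffling:   {r_shuffled:.3f}")
```

The two numbers agree (up to floating-point rounding) and both are high: hiding the time axis does not remove time's influence, because each pair still carries the period it came from.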




One solution to this problem is to de-trend the data. We want to remove the effect of time from each of the two data series individually, then we plot the residual signals against each other.

Here is the result (right). We now have a random scatter of points that average about zero. If anything, there may be a slightly negative correlation, meaning that when the labor force participation rate is above trend, the per-capita miles driven tend to be slightly below trend; this effect, if it exists, is small.

What I have done here is to establish the trend for each of the two time series. The actual data being plotted is what is above/below trend. What this chart is saying is that when one value is above trend, it gives us little information about whether the other value is above or below trend.
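The de-trending step can be sketched as follows. Two assumptions to flag: the trend here is a least-squares straight line, whereas the post does not specify how its trends were fitted, and the series are synthetic stand-ins rather than the actual data. The functions `linear_fit`, `detrend` and `pearson` are hypothetical helpers written for this example.

```python
import random


def linear_fit(ts, values):
    """Least-squares line through (t, value) points; returns (slope, intercept)."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_v = sum(values) / n
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(ts, values))
    sxx = sum((t - mean_t) ** 2 for t in ts)
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def detrend(ts, values):
    """Residuals left after subtracting the fitted straight-line trend."""
    slope, intercept = linear_fit(ts, values)
    return [v - (slope * t + intercept) for t, v in zip(ts, values)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(1)
periods = list(range(100))

# Synthetic series: shared dependence on time, independent noise (assumption).
x = [1.0 * t + rng.gauss(0, 4) for t in periods]
y = [0.5 * t + rng.gauss(0, 2) for t in periods]

x_resid = detrend(periods, x)
y_resid = detrend(periods, y)

r_raw = pearson(x, y)
r_residual = pearson(x_resid, y_resid)

print(f"raw series correlation:      {r_raw:.3f}")
print(f"residual series correlation: {r_residual:.3f}")
```

With a fitted intercept, the residuals average zero by construction, which matches the "average about zero" scatter described above. The raw correlation is high while the residual correlation is close to zero, since nothing but time linked the two synthetic series.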



Dampened by Google

Robert Kosara has a great summary of the "banking to 45 degrees" practice first proposed by Bill Cleveland (link). Roughly speaking, the idea is that the slope of a line chart should be close to 45 degrees for the best perception. It's not a rule that you see much on Junk Charts because it's one of those rules about which I don't hold a strong opinion.

Here are the examples given by Kosara:

The same data is presented three ways. The slope is a reflection of the scales used on the two axes.
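To see how the axis scales and the plot's shape set the drawn slope, here is a rough sketch of one simple variant of banking: pick the height-to-width ratio that puts the median segment at 45 degrees. This median-absolute-slope variant is an assumption made for illustration and is not necessarily the method Kosara or Cleveland use; the function name `bank_to_45` is hypothetical.

```python
from statistics import median


def bank_to_45(xs, ys):
    """Height/width ratio at which the median absolute segment slope is 45 degrees.

    Assumes xs is sorted in increasing order and the series is not flat.
    """
    x_range = max(xs) - min(xs)
    y_range = max(ys) - min(ys)
    if x_range == 0 or y_range == 0:
        raise ValueError("both axes need a non-zero data range")

    # Absolute slope of each segment after rescaling both axes to a unit square.
    points = list(zip(xs, ys))
    slopes = [
        abs((y1 - y0) / y_range) / ((x1 - x0) / x_range)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
        if x1 != x0
    ]
    mid = median(slopes)
    if mid == 0:
        raise ValueError("median slope is zero; banking is undefined")

    # On a plot with height/width = a, a unit-square slope s is drawn as s * a,
    # so a = 1 / median(s) draws the median segment at 45 degrees.
    return 1.0 / mid


# A zig-zag: every segment spans the full height over a quarter of the width.
aspect = bank_to_45([0, 1, 2, 3, 4], [0, 10, 0, 10, 0])
print(f"suggested height/width ratio: {aspect:.2f}")
```

For the zig-zag, each segment on the unit square has slope 4, so the suggested plot is a quarter as tall as it is wide. Data that barely moves relative to its axis range would push the ratio the other way, toward a taller plot.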

Well, I lied when I said I didn't care. Look at this particular chart below:

Some of you may recognize this style... I'm imitating Google Analytics charts. Several of the other Web charting tools also seem to come up with gems like this. Pretty much every chart you see in the Google Analytics interface looks like a flat line. The chart above looks like nothing more than noisy data from week to week.

But then look at the scale! The leftmost part of the line is a rise over two weeks. The actual rise was 50% or 300,000, i.e. an earth-shattering change.

If you use Google Analytics, you are better off downloading the data to Excel and drawing your own charts.

Maxima and minima

Andrew Sullivan (link) highlights the insanity of the law with this "Chart of the Day", except that chart fails to bring out the message:


For this data set, a Bumps-style chart works very well:



The bar chart uses the wrong minima. Bar charts encode data in the lengths of the bars. When an equal length is chopped off the base of the bars, the relative lengths are distorted.

In the case of Ecuador, it appears as if murderers get half the sentence of drug traffickers, when in fact the difference is 25 percent.
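The distortion is simple arithmetic. In the sketch below, the sentence lengths (12 and 16 years) and the axis starting point (8 years) are hypothetical values chosen only because they reproduce the "half" versus "25 percent" contrast; they are not read off the original chart. The function `apparent_ratio` is likewise made up for this example.

```python
def apparent_ratio(a, b, baseline):
    """Ratio of drawn bar lengths when the axis starts at `baseline` instead of zero."""
    if baseline >= min(a, b):
        raise ValueError("baseline must be below both values")
    return (a - baseline) / (b - baseline)


# Hypothetical maximum sentences in years (assumption, for illustration only).
murder, trafficking = 12, 16
axis_start = 8  # hypothetical truncated baseline

true_ratio = apparent_ratio(murder, trafficking, baseline=0)
drawn_ratio = apparent_ratio(murder, trafficking, baseline=axis_start)

print(f"true ratio of sentences:  {true_ratio:.2f}")   # bars starting at zero
print(f"ratio of drawn bar lengths: {drawn_ratio:.2f}")  # bars starting at 8
```

Chopping the same 8 years off both bars turns a 12-to-16 comparison into a 4-to-8 comparison, so a 25 percent gap is drawn as a 50 percent gap.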

The chart also obstructs readers from comparing sentences across countries. That must be why the 16 years in Ecuador has the same length as 25 years in Bolivia. Either that, or time runs faster in Bolivia (and Mexico).


The Trifecta checkup (link) reveals that the biggest problem is the misalignment between the question being asked and the data used to address that question.

It's hard to imagine why the "maximum" sentence is considered, rather than, say, the average sentence. If the analyst chooses maxima, he/she should assure readers of a couple of things: that the judges in these countries do apply the maximum penalties, and that the proportion of sentences that reach the maxima is roughly similar between drug trafficking and murder cases.

I suspect the use of maxima is related to data availability. To compute the average or median sentence requires data on every conviction that leads to a prison sentence. To find out the maximum sentence only requires consulting a book. Is there any real data on the chart? It depends on whether any of these countries routinely dole out the maximum sentence.

Then, there are cases involving both drug trafficking and murder...