
Figuring out the location (of the data)

When we visualize data, we want to expose the information contained within, or to use the terminology Nate Silver popularized, to expose the signal and leave behind the noise.

When graphs are not done right, sometimes they manage to obscure the information.

Reader John H. found a confusing bar chart while studying a paper (link to PDF) in which the authors compared two algorithms used to determine the position of Wi-Fi access points under various settings.

One's first reaction might be that the researchers are telling us there is no information here. The most important variable in this chart is what they call "datanum", which runs from left to right across the page. A casual glance across the page suggests that nothing much is going on.

Then you look at the row labels, and realize that this dataset is very well structured. The target variable (AP Position Error) is compared along four dimensions: datanum, the algorithm (WCL or GPR+WCL), the number of access points, and the location of these access points (inner, boundary, all).

When the data has a nice structure, there should be better ways to visualize it.

John submitted a much improved version, which he created using ggplot2.


This is essentially a small multiples chart. The key differences between the two charts are:

  • Giving more dimensions a chance to shine
  • Spacing the "datanum" proportional to the sample size (we think "datanum" means the number of sample readings taken from each access point)
  • Using a profile chart, which also allows the y-axis to start from 2
  • One color versus six colors, and no chartjunk
  • Using fewer decimal points
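John built his version in ggplot2; for readers who work in Python, a similar small-multiples layout can be sketched with pandas and matplotlib. The numbers below are invented placeholders, not the paper's actual values:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Hypothetical stand-in data; columns mirror the structure described above.
df = pd.DataFrame({
    "datanum":   [10, 20, 40] * 6,
    "location":  ["inner"] * 6 + ["boundary"] * 6 + ["all"] * 6,
    "algorithm": (["WCL"] * 3 + ["GPR+WCL"] * 3) * 3,
    "error_m":   [2.5, 2.4, 2.3, 3.1, 2.8, 2.6,
                  5.0, 5.1, 4.9, 4.2, 4.0, 3.9,
                  4.1, 4.0, 4.0, 3.8, 3.6, 3.4],
})

# One panel per access-point location, shared y-axis, one line per algorithm.
fig, axes = plt.subplots(1, 3, sharey=True, figsize=(9, 3))
for ax, (loc, grp) in zip(axes, df.groupby("location", sort=False)):
    for alg, sub in grp.groupby("algorithm", sort=False):
        ax.plot(sub["datanum"], sub["error_m"], marker="o", label=alg)
    ax.set_title(loc)
    ax.set_xlabel("datanum")
axes[0].set_ylabel("AP position error (m)")
axes[0].legend(frameon=False)
fig.tight_layout()
```

Because "datanum" is plotted on a numeric axis, the proportional spacing of sample sizes comes for free, and the shared y-axis makes comparisons across panels direct.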

When you read this chart, you finally realize that the experiment has yielded several insights:

  1. Increasing the sample size does not affect the aggregate WCL error rate, but it does reduce the aggregate GPR+WCL error rate.
  2. The improvement of GPR+WCL comes only from the inner access points.
  3. The WCL algorithm performs really well in inner access points but poorly in outer access points.
  4. The addition of GPR to the WCL algorithm improves the performance on outer access points but degrades the performance on inner access points. (In aggregate, it improves the performance... but only because there are almost two outer access points for every inner access point.)

Now, I don't know anything about this position-estimation problem. The chart leaves me wondering why they don't just use WCL on inner access points; the performance under that setting is far and away the best of all the tested settings.


The researchers described their metric as AP Position Error (2drms, 95% confidence). I'm not sure what they mean by that, because when I see 95% confidence, I expect to see a confidence band around the point estimates shown above.

And yet, the data table shows only point estimates -- in fact, estimates to two decimal places of precision. In statistics, the more precision you have, the less confidence.

Spin, spin, spin away

From a purely graphical perspective, the following NYT chart (link) is well executed:


Labeling is always a challenge with scatter plots. Here, they have 54 points, and the chart still doesn't look too crammed. I like the axis labels, and the clear labeling of the four quadrants.

I also like the vertical scale that goes from 4 to 8, despite the scoring range going from 1 to 10. This trims unneeded whitespace and magnifies the differences between nations.


In the Trifecta checkup, we also care about the key question the chart is designed to answer, and how it relates to the graphical elements. According to the subtitle, this chart showed that "the nations with more progressive tax rates had happier citizens."

This conclusion certainly does not jump off the page. Reader Christopher L., who submitted this chart, found "no obvious trend." (Given the source, I suspect it's the researchers who drew that conclusion.)

There are lots of unanswered questions in an international comparison of subjective results of this type:

  • How were the 54 nations chosen?
  • Is the year 2007 representative of the recent situation in every one of these countries? Were there any tax reforms in 2007 in any of these countries?
  • How reliable is the Gallup poll in each of these countries? How large are the sample sizes? Is it the same survey?
  • Why is the difference between the highest and lowest tax burdens the right measure of progressiveness? And are they using the marginal tax rates or the average tax rates? (Judging from the Wikipedia page, there is a lot of arbitrariness in determining a country's tax rate.)
  • Are the two data sources comparable? Happiness is a personal question while the range of tax rates is an aggregate metric, with each individual experiencing only one tax rate.

These are not trivial questions. If the data is bad, no amount of graphical magic can save it.

Most vexing for a display like this is that it forces the reader to look for the impact of tax burden on happiness. That's how the question is framed. There is nothing in this chart, though, that suggests that tax rates can explain happiness, and certainly nothing to suggest that low tax rates cause more happiness.

I call this story time. Put up some data, then spin stories and spin away.




Two unhealthy submissions from readers

Josh hated this "dataless visualization" from ABC (link; warning: ads). Here are his comments:

The report has planes leaving China, landing across the globe, and instantly infecting us all with bird flu. It doesn't do a good job of explaining how, or how quickly, pandemics actually spread. However, it does do a good job of scaring us all.

The entire flu pandemic theater is unscientific. It is based on the 100-year-flood type of argument, with scientists claiming that we are "overdue" for some catastrophe. It reminds me of earthquake forecasting, covered by Nate Silver in his book: it is possible to predict the average frequency of rare natural disasters, but virtually impossible to predict their timing.

The 100-year-flood type of calculation is based on averaging a small number of events over a very long time scale. There is no reason why these events should be spread out evenly over time (i.e., one event every 100 years).

This is a fallacy of the "law of small numbers": if one throws a fair coin 10 times, one shouldn't expect exactly 5 heads; the number of heads follows the binomial distribution shown in the chart on the right. The chance of exactly 5 heads is only about 25%.
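The 25% figure is straightforward binomial arithmetic; a quick sketch using only Python's standard library:

```python
from math import comb

# P(exactly k heads in n tosses of a fair coin) = C(n, k) / 2^n
def prob_heads(n: int, k: int) -> float:
    return comb(n, k) / 2 ** n

p5 = prob_heads(10, 5)  # 252 / 1024, about 0.246, i.e. roughly 25%

# Even landing within one head of the "expected" 5 (i.e. 4, 5, or 6 heads)
# happens only about two-thirds of the time.
p_near = sum(prob_heads(10, k) for k in (4, 5, 6))  # 672 / 1024, about 0.656
```

The same logic applies to the 100-year flood: an average of one event per century says nothing about whether the next century gets zero, one, or three.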


Also (doctors, keep me honest), I believe only one type of mutation, namely the one that makes the virus able to pass from human to human, has a chance of causing a pandemic. So it is wrong to say that "if the virus mutates," a pandemic will result. In addition, some viruses in the past were able to pass from human to human, but the rate of infection was not fast enough, and they failed to cause a pandemic.


Daniel L. did not like the map shown below, from a research article on female mortality rate in the U.S., via Jezebel.

I was amused by what the blogger at Jezebel was able to take in from the map. Her post started with a huge version of the map, under which she said:

Mortality rates are rising in 43% of U.S. counties, as illustrated by this map from health researcher Bill Gardner.


Mortality rate is a statistic about the population. The map is an illustration of geographical area (distorted by the map projection). The map carries no information about population at all. Thus, it is not the right chart to display population data.

The statistic itself is poorly chosen. What does 43% of counties mean? Some counties have few people while others are densely populated. New York County is barely visible on this map, yet it carries a very heavy weight in any population-weighted average.

According to the CDC data, the age-adjusted death rate for women has been decreasing over time. So it appears that the backward motion in those 43% of counties is compensated for by forward progress in the other 57% of counties.


Maybe the average for the whole country masks some local patterns. The cited map doesn't help, because it makes the visual importance of a county's mortality rate proportional to its geographical size, when the right weight is the population of women in the county.
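To make the weighting point concrete, here is a toy calculation with invented numbers (not the CDC data): the rate rises in three of four counties, yet the population-weighted rate for the whole area falls, because the one populous county improved.

```python
# Invented illustration: three small rural counties and one large urban one.
# rate = female deaths per 100,000; pop = number of women in the county.
counties = [
    # (pop, rate_2000, rate_2010)
    (10_000, 300, 320),   # small county, rate rising
    (12_000, 310, 330),   # small county, rate rising
    (8_000, 290, 315),    # small county, rate rising
    (800_000, 330, 290),  # big county, rate falling
]

pops = [pop for pop, _, _ in counties]
rates_2000 = [r for _, r, _ in counties]
rates_2010 = [r for _, _, r in counties]

def weighted_avg(rates):
    """Population-weighted average rate across counties."""
    return sum(p * r for p, r in zip(pops, rates)) / sum(pops)

# County-count view: the rate rose in 3 of 4 counties (75%).
share_rising = sum(b > a for a, b in zip(rates_2000, rates_2010)) / len(counties)

# Population-weighted view: the overall rate actually fell.
avg_2000 = weighted_avg(rates_2000)  # about 329
avg_2010 = weighted_avg(rates_2010)  # about 291
```

Counting counties treats the 8,000-person county and the 800,000-person county as equals; weighting by population does not.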


Doing legwork, doing justice

The New York Times brought attention to the Bronx courtrooms this weekend. (link) The following small-multiples chart effectively illustrates how the Bronx system is uniquely unproductive, compared to the other boroughs:


The above chart shows the outcomes. The next chart shows the possible cause.

It appeared that at any time of the day, at least one-third of the courtrooms were not actively conducting business. In fact, outside of the period between 10:30 and 12:30, and around 2:30, fewer than half of the courtrooms had a judge present.

I want to draw your attention to the caption below the chart. It said: "The Times visited all 47 courtrooms at the Bronx County Hall of Justice in 30-minute intervals, totaling how many were open and actively in session, ..."

Too often, we analyze and plot whatever data has been collected conveniently by some machine. Such data frequently do not address the questions we'd like to answer. We let the data dictate our research question.

Most great work in statistics comes from people who put in the effort to define their research goals first, and then manually collect the specific data needed to accomplish those goals.


Highway Safety Agency goes rogue

A reader sent me to Adam Obeng, who did the dirty work of deconstructing, on his blog, a set of charts by the U.S. National Highway Traffic Safety Administration. Here's an example of these charts:


Aside from the sneaker chart, they concocted a pop stick, a pencil, a tower of Hanoi, etc. I think these objects should be evaluated as art. Adam gamely tells us that the proportions are totally off, and that they are both internally and externally inconsistent.


I'll add two small points to Adam's post.

First, these charts pass my self-sufficiency test; that is to say, they did not print the entire data set (just one number here) on the page. Alas, given the distortion identified by Adam, not printing the data means readers are free to imagine their own data. Herein lies the problem: there is an argument for allowing a small degree of distortion in exchange for "beauty", but these charts, without any data, have gone too far.

Second, see Adam's last point (the footnote). The original data is something quite convoluted: “3 out of 4 kids are not as secure in the car as they should be because their car seats are not being used correctly.” (How would they know this, I wonder.) This is a statistic about kids while the picture shows a statistic about their parents (or drivers).



Bad charts can happen to good people

I shouldn't be surprised by this. No sooner did I sing the praises of Significance magazine (link) than a reader sent me some charts that are not up to its standard.

Here is one such chart (link):

Quite a few problems crop up here. The most damaging is that the context of the chart is left to the text. If you read the paragraph above it, you'll learn that the data represent only a select group of institutions known as the Russell Group; in particular, Cambridge University was omitted because "it did not provide data in 2005". That omission is a curious decision, as the designer weighed one missing year against one missing institution (and a mighty important one at that). This issue is easily fixed by a few choice words.

You will also learn from the text that the author's primary message is that among the elite institutions, little if any improvement has been observed in the enrollment of (disadvantaged) students from "low participation areas". This chart draws our attention to the tangle of up and down segments, giving us the impression that the data is too complicated to extract a clear message.

The decision to use 21 colors for 21 schools is baffling, as surely no one can make out which line is which school. A good tip-off that you have the wrong chart type is needing more than, say, three or four colors.

The order of institutions listed in the legend is approximately the reverse of their order of appearance in the chart. If software can be "intelligent", I'd hope that it could automatically sort the legend entries to match the chart.
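Most plotting libraries do let you reorder legend entries by hand. As a sketch (in Python/matplotlib, with invented series standing in for the institutions), one can sort the legend handles by each line's final value, so the legend reads top to bottom in the same order as the right edge of the chart:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Invented series standing in for the institutions' yearly percentages.
series = {
    "Univ A": [3.0, 3.2, 3.1],
    "Univ B": [6.5, 6.8, 7.0],
    "Univ C": [4.8, 4.5, 4.6],
}

fig, ax = plt.subplots()
for name, ys in series.items():
    ax.plot(range(len(ys)), ys, label=name)

# Reorder legend entries by each line's last (right-edge) y-value,
# highest line first, so the legend matches the chart.
handles, labels = ax.get_legend_handles_labels()
order = sorted(range(len(labels)),
               key=lambda i: handles[i].get_ydata()[-1], reverse=True)
ax.legend([handles[i] for i in order],
          [labels[i] for i in order], frameon=False)
```

Direct labeling at the line ends would be better still, but a sorted legend at least removes the back-and-forth eye movement.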

If the whitespace were removed (I'm talking about the space between 0% and 2.25%, and between 8% and 10%), the lines could be more spread out, and labels could perhaps be placed next to the vertical axes to simplify the presentation. I'd also delete "Univ." with abandon.

The author concludes that nothing has changed among the Russell Group. Here is the untangled version of the same chart. The schools are ordered by their "inclusiveness" from left to right.


This is a case where the "average" obscures a lot of differences between institutions and even within institutions from year to year (witness LSE).

In addition, I see a negative reputation effect: the proportion of students from low-participation areas decreases as reputation increases. I'm basing this on name recognition; perhaps UK readers can confirm whether this is correct. If so, it's a big miss in terms of interesting features in this dataset.