
When simple arithmetic doesn't cut it

DNAInfo made a set of interesting maps using crime data from New York.

The analyst headlined the counter-intuitive insight that the richest neighborhoods in New York are the least safe. In particular, the analysis claimed that Midtown ranks 69th out of 69 neighborhoods while Greenwich Village/Meatpacking is second to last.

According to the analyst, there is no magic -- one only needs to assemble the data, and the insight just drops out:

The formula was simple: divide the number of reported crimes in a neighborhood by the number of people living there, for a per capita crime rate.

By definition, a statistic is an aggregation of data. Aggregation, however, is a tricky business. And this example is a great illustration.


DNAInfo’s finding is captured in the following map of “major crimes”. The deeper the color, the higher the per-capita crime rate. The southern part of Manhattan apparently is less safe than areas to the north like Harlem, which has the reputation of being seedy. Greenwich Village has 1,500 crimes for about 62,000 residents (240 per 10,000) while East Harlem has 900 for about 47,000 residents (190 per 10,000). East Harlem is not marginally safer than Greenwich Village – it is 20% safer according to these crime statistics.
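The quoted figures can be checked against the "simple" formula itself. The crime counts and populations below are the approximate values cited above:

```python
# Per-capita crime rates, using the approximate figures quoted in the post.
neighborhoods = {
    "Greenwich Village": {"crimes": 1500, "population": 62000},
    "East Harlem": {"crimes": 900, "population": 47000},
}

rates = {
    name: d["crimes"] / d["population"] * 10000  # crimes per 10,000 residents
    for name, d in neighborhoods.items()
}

gv, eh = rates["Greenwich Village"], rates["East Harlem"]
print(round(gv), round(eh))         # roughly 240 and 190 per 10,000
print(round((gv - eh) / gv * 100))  # East Harlem's rate is about 20% lower
```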


Major crimes is the aggregate of individual classes of crimes. The following set of maps shows the geographical distribution of each class of crime. It seems rather odd that the south side would bleed deep red in the aggregate map above while by most measures, it is very safe (light hues almost everywhere in the maps below).


Greenwich Village registers among the lowest for rapes, assaults, shooting incidents, murders, etc. The only category for which it has a poor record is "grand larceny". I had to look that one up on Wikipedia: grand larceny is "the common law crime involving theft" above a threshold value. In New York, apparently "grand" means $1,000 or more. That sounds like stealing to me.


How is it that a precinct that is safe from most types of crimes and safe for people who don't carry around $1,000 or more ends up at the bottom of the safety ranking?


The “simple” formula assigns equal weight to every kind of crime, whether a murder or a theft. As shown below, murders occur in single-digit numbers each year while hundreds of thefts happen. It turns out that most of the other crime types also occur in small numbers, so this ranking really only tells us where one is most likely to get robbed if one is carrying more than $1,000.

In the meantime, one might get murdered, or raped, or shot at.
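The weighting problem is easy to sketch. The counts below are hypothetical, not the actual DNAInfo figures, but they show how an equal-weight aggregate is swamped by the most frequent category:

```python
# Hypothetical annual counts for one precinct -- illustrative only,
# not the actual DNAInfo figures.
counts = {"murder": 2, "rape": 10, "assault": 60, "grand_larceny": 1200}

# The "simple" aggregate weights every crime equally...
total = sum(counts.values())

# ...so the rare, severe crimes barely register in the total.
share = counts["grand_larceny"] / total

print(total)            # 1272
print(round(share, 2))  # 0.94 -- the aggregate is essentially a larceny count
```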

Simple analysis can be dangerous.

Motion-sick, or just sick?

Reader Irene R. was asked by a client to emulate this infographic movie, made by UNIQLO, the Japanese clothing store.

Here is one screen shot of the movie:


This is the first screen of a section; from this moment, the globes dissolve into clusters of photographs representing the survey respondents, which then parade across the screen. Irene complains of motion sickness, and I can see why she feels that way.

Here is another screen shot:


Surprisingly, I don't find this effort completely wasteful. This is because I have read my fair share of bore-them-to-tears compilations of survey research results - you know, those presentations with one multi-colored, stacked or grouped bar chart after another, extending for dozens of pages.

There are some interesting ideas in this movie. They have buttons on the lower left that allow users to look at subgroups. You'll quickly find the limitations of such studies by clicking on one or more of those buttons... the sample sizes shrink drastically.

The use of faces animates the survey, reminding viewers that the statistics represent real people. I wonder how they chose which faces to highlight, and in particular, whether the answers thus highlighted represent the average respondent. There is a danger that viewers will remember individual faces and their answers more than they recall the average statistics.


If the choice is between a thick presentation gathering dust on the CEO's desk and this vertigo of a movie that perhaps might get viewed, which one would you pick?


Reading the landscape

Here are some posts I find worth reading on other graphics blogs:

Nick has done wonderful work on the evolution of the rail industry in the U.S., with a flow chart showing how mergers have produced the four giants of today, as well as a set of small-multiple maps showing how they split up the country.

A lovely feature of the flow chart is the use of red lines to let readers see at a glance that Union Pacific is the only rail company that has lasted the entire four decades, while the other three giants came into being within the last 20 years.

On the maps, notice a slight inconsistency between the left and right columns: on the right side, both maps have the same set of anchor cities, which act as "axes" to help readers compare the maps; on the left side, the sets of anchor cities are not identical. It would also be interesting to see a version with all four route maps superimposed and differentiated by color. That may bring out the competitive structure better.


Georgette has a nice post summarizing issues with picking colors when producing charts. Her blog is called Moved by Metrics.


Meanwhile, Martin finds a shockingly poor pie chart here.


There was a time when you'd find the kind of heatmaps featured here by Nathan as wallpaper in my office. They're a great visualization tool for exploring temporal patterns in large data sets. However, I'd never even think of putting these in a presentation. They are a starting point, not an end point, of an analysis project. Some things are wonderful for consumption only in private!






Those prickly eyebrows

Darin M. points us to this speedometer chart, produced by IBM (larger version here). They call it the "Commuter Pain Index". I call it a prickly-eyebrows chart. You be the judge.


The "eyebrows" on this chart are purely ornaments. The only way to read this chart is to read the data labels, so it is a great example of failing the self-sufficiency test.

The simplest way to fix this chart is to unwrap the arc, turning this into a bar chart. The speedometer is a cute idea but very difficult to pull off because the city names are long text fields, and variable in length.
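A minimal sketch of the unwrapped version, in plain text with placeholder index values (not IBM's actual numbers), shows how a bar chart accommodates the long, variable-length city names:

```python
# Placeholder Commuter Pain Index values -- illustrative, not IBM's data.
pain = {"Mexico City": 108, "Shenzhen": 95, "Beijing": 95, "Nairobi": 88,
        "Montreal": 21}

# Sort descending and draw a plain-text bar chart. The variable-length
# city names sit comfortably in a left-aligned column, which the
# speedometer layout cannot offer.
width = max(len(city) for city in pain)
for city, score in sorted(pain.items(), key=lambda kv: -kv[1]):
    print(f"{city:<{width}} | {'#' * (score // 5)} {score}")
```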




Rebirth of the twin towers

Perhaps it's this week's anniversary of the WTC disaster. Perhaps it's the New York-centric viewpoint of Citibank. One wonders what inspired Citibank analysts to make this absurdity.


(Via Business Insider.)

First, we must fix the vertical scale. For column charts, the axis must start at zero, without exception. Not starting at zero chops an equal-length piece off the bottom of each column, which distorts the relative lengths/areas of the columns. The distortion can be very severe. For example, look at the fourth set of columns as shown below:


In both charts, I made the first column the same length so that the two charts are directly comparable. The data plotted are exactly the same; the only difference is that the left chart starts the axis at zero. Notice that the huge difference seen in the right chart for the 4th pair of columns is not nearly as extraordinary when the proper scale is used.
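The distortion can be quantified: with a truncated axis, the visual ratio of two columns is the ratio of their lengths above the baseline, not the ratio of their values. The numbers here are made up for illustration:

```python
# How a truncated axis inflates the visual ratio between two columns.
# Values are invented for illustration.
a, b = 2.10, 1.90   # e.g. vehicles per household in two quarters
baseline = 1.80     # where the truncated axis starts

true_ratio = a / b                              # what the data say
visual_ratio = (a - baseline) / (b - baseline)  # what the chart shows

print(round(true_ratio, 2))    # 1.11 -- an 11% difference in the data
print(round(visual_ratio, 1))  # 3.0 -- but drawn three times as tall
```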

A multitude of other problems exist, not least that the chart is highly redundant. The same data (10 numbers) show up three times: once as data labels, once as column lengths (distorted), and once as positions against the vertical scale.


An alternative way to look at this data is the Bumps chart. Like this:


What this chart brings out is the variability of the estimated vehicle densities. In theory, the density estimate should be quite accurate for the "today" numbers. You'd think that when 2,000+ people are surveyed about how many vehicles they currently own, most should be able to provide accurate counts.

The data paint a different picture. From quarter to quarter, the estimated "today" density shows a range of 1.90x to 2.00x in the 5 periods analyzed, which is roughly 5%, a difference which, according to the analyst, equates to 5 million vehicles!  Given current vehicle sales of about 13 million per year, 5 million is almost 40% of the market.
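The arithmetic behind those figures, using only the numbers stated above:

```python
# Quarter-to-quarter swing in the estimated "today" density.
low, high = 1.90, 2.00  # vehicles per household
swing_pct = (high - low) / low * 100
print(round(swing_pct, 1))  # 5.3 -- roughly 5%

# The analyst's 5 million vehicles, set against annual sales.
implied_vehicles = 5  # millions, per the analyst
annual_sales = 13     # millions of vehicles sold per year
print(round(implied_vehicles / annual_sales * 100))  # 38 -- almost 40%
```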

So, one wonders how this survey was done, and how large the margin of error of this estimate is. I also want to know whether the survey produces estimates of the number of households as well, since the vehicles-per-household metric has two variable components.
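For a rough sense of the sampling error alone, here is a back-of-envelope calculation assuming a simple random sample; the standard deviation of vehicles per household is an assumed value, since the survey's design and variance are not reported:

```python
import math

# Back-of-envelope 95% margin of error for a mean from ~2,000 respondents,
# assuming a simple random sample. The standard deviation (1.1 vehicles
# per household) is an assumption, not a figure from the source.
n = 2000
sd = 1.1
moe = 1.96 * sd / math.sqrt(n)
print(round(moe, 3))  # about 0.048 vehicles per household
```

Under these assumed numbers, the margin of error is about ±0.05, so a quarter-to-quarter swing of 0.10 in the estimated density would be on the order of sampling noise alone.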