Statistical adjustment in charts

On the book blog, I often talk about the reasons why statisticians adjust data, and why adjustment is necessary to paint a proper picture of what the data say. (See here or here.)

On this blog, I have frequently complained about how the "prior information" on maps is too strong - large regions dominate our perception regardless of the data. In the U.S., large but sparsely populated states attain disproportionate attention.

So, why not bring "statistical adjustment" to maps?


That's exactly what cartograms do. For example, look at the following pair of maps created by the people at Leicestershire County Council. (PDF link here)


The map on the left and the cartogram on the right plot identical data. The only difference is that each hexagon on the cartogram represents an equal number of people. The two views give very different impressions: the big dark green patch on the middle-right of the map -- representing a relatively sparse neighborhood -- is shrunk to a single dark green hexagon on the cartogram. Meanwhile, the most deprived areas (dark purple) which look relatively small on the map are expanded to quite a few hexagons.

According to the map, most of the county lives in areas ranked in the less deprived half (green), and that is good news. But wait... there is a lot of purple in the cartogram!

The real piece of news is that the majority of people live in the half of the neighborhoods considered more deprived (purple) but this uncomfortable fact is well-hidden in the mostly green map on the left.

Given that the measures of "deprivation" are about people, not geographical neighborhoods, the cartogram is much closer to the real world experience... notwithstanding the obvious geographical distortion introduced by the statistical adjustment.

According to Alex L., who is part of the team producing these graphics:

LSOAs were created for the 2001 [UK] Census to disseminate the data and are generally considered to represent 'neighbourhoods'. They are created to have a broadly consistent population (approx 1500 people in 2001) and socio-economic traits.


Question: Is there any reason to show the map at all?


The wall of blinking lights

Reader Alex L. submitted this chart showing the evolution of quality of life in Warwickshire in the U.K.


This wall of lights is drawing way too much power. Let's make a list of fixes:

  • Stretch out the hemisphere, turning those arcs into horizontal lines
  • Allow readers to read horizontally, rather than centrifugally (?)
  • Align horizontally all of the labels for the quality of life indicators
  • Allow readers to read indicator labels in one direction, rather than inside-out on the right hemisphere and outside-in on the left hemisphere
  • Assume readers understand that the first year for which there is data is the "baseline year"
  • Remove the space between data points, which makes the white gridlines unnecessary
  • Use rectangles (rather than circles) as they can be packed more tightly
  • Order the indicators in a meaningful way

Eventually this chart reveals itself as a heatmap:


The heatmap is much better, although it still doesn't expose the trends clearly, especially the differences between indicators. The heatmap function in R has a built-in clustering method that automatically groups the indicators by similarity of trends: the bad indicators are clustered at the top, the good ones in the middle, and the neutral ones at the bottom. The color scheme should really be reversed (on this chart, red is good and blue is bad), and the default orientation of the column labels is also annoying.
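The clustering idea is easy to sketch outside R as well. Below is a minimal pure-Python illustration of ordering indicators by similarity of trends, using Pearson correlation as the similarity measure. The indicator names and series are made up for illustration, not the Warwickshire data:

```python
# Sketch: group indicators by similarity of trends -- the idea behind
# the row clustering in R's heatmap(). All data below are made up.

def pearson(x, y):
    """Pearson correlation of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical quality-of-life indicators measured over five years
trends = {
    "crime rate":     [5, 4, 3, 2, 1],   # steadily improving
    "unemployment":   [5, 5, 4, 2, 1],   # also improving
    "waste recycled": [1, 2, 3, 4, 5],   # moving the other way
}

# Order indicators by correlation with a reference indicator, so that
# similar trends end up adjacent -- a crude stand-in for full clustering
ref = trends["crime rate"]
ordered = sorted(trends, key=lambda k: -pearson(trends[k], ref))
print(ordered)
```

Real clustering builds a dendrogram from all pairwise distances, but the sorted-by-correlation trick already puts similarly trending indicators next to each other.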


The next version uses the line chart, in a small-multiples setting. Now we have something to chew on.


Although not done here, we can order this set of charts using the clustering results from the heatmap.

The lesson is that the pretty colors in the heatmap really tell us much less than the plain levels in a line chart.



When simple arithmetic doesn't cut it

DNAInfo made a set of interesting maps using crime data from New York.

The analyst headlined the counter-intuitive insight that the richest neighborhoods in New York are the least safe. In particular, the analysis claimed that Midtown ranks 69th out of 69 neighborhoods while Greenwich Village/Meatpacking is second to last.

According to the analyst, there is no magic -- one only needs to assemble the data, and the insight just drops out:

The formula was simple: divide the number of reported crimes in a neighborhood by the number of people living there, for a per capita crime rate.

By definition, a statistic is an aggregation of data. Aggregation, however, is a tricky business. And this example is a great illustration.


DNAInfo’s finding is captured in the following map of “major crimes”. The deeper the color, the higher the per-capita crime rate. The southern part of Manhattan apparently is less safe than areas to the north like Harlem, which has the reputation of being seedy. Greenwich Village has 1,500 crimes for about 62,000 residents (240 per 10,000) while East Harlem has 900 for about 47,000 residents (190 per 10,000). East Harlem is not marginally safer than Greenwich Village – it is 20% safer according to these crime statistics.
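The per-capita arithmetic behind these figures is trivial to reproduce, using the approximate numbers quoted above:

```python
# Per-capita crime rates, using the approximate figures quoted above
def rate_per_10k(crimes, residents):
    return crimes / residents * 10_000

greenwich = rate_per_10k(1_500, 62_000)   # roughly 242 per 10,000
east_harlem = rate_per_10k(900, 47_000)   # roughly 191 per 10,000

# East Harlem's rate is about 20% lower than Greenwich Village's
print(round(greenwich), round(east_harlem))
print(f"{1 - east_harlem / greenwich:.1%} lower")
```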


“Major crimes” is the aggregate of the individual classes of crime. The following set of maps shows the geographical distribution of each class of crime. It seems rather odd that the south side would bleed deep red in the aggregate map above while, by most measures, it is very safe (light hues almost everywhere in the maps below).


Greenwich Village registers among the lowest for rapes, assaults, shooting incidents, murders, etc. The only category in which it has a poor record is "grand larceny". I had to look that one up: larceny is the common-law crime of theft, and in New York, "grand" apparently means $1,000 or more. That sounds like plain stealing to me.


How is it that a precinct that is safe from most types of crimes and safe for people who don't carry around $1,000 or more ends up at the bottom of the safety ranking?


The “simple” formula assigns equal weight to every kind of crime, whether it is a murder or a theft. As shown below, murders occur in single-digit frequencies while hundreds of thefts happen each year. It turns out that most of the other crime types also occur in small numbers, so this ranking really only tells us where one is most likely to get robbed while carrying $1,000 or more.

In the meantime, one might get murdered, or raped, or shot at.
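To see how equal weighting lets the most frequent category dominate the ranking, consider a small sketch. The counts and severity weights below are hypothetical, chosen only to mimic the shape of the problem (none of these numbers come from the DNAInfo data):

```python
# Sketch: why equal weighting lets the most frequent crime dominate.
# Counts and severity weights below are hypothetical, for illustration.
counts = {"murder": 2, "rape": 5, "assault": 40, "grand larceny": 1_400}

equal_weight_total = sum(counts.values())
larceny_share = counts["grand larceny"] / equal_weight_total
print(f"{larceny_share:.0%} of the 'simple' index is grand larceny")

# A severity-weighted index changes the picture entirely
weights = {"murder": 100, "rape": 80, "assault": 20, "grand larceny": 1}
weighted_total = sum(counts[c] * weights[c] for c in counts)
larceny_weighted = counts["grand larceny"] * weights["grand larceny"] / weighted_total
print(f"{larceny_weighted:.0%} of the severity-weighted index")
```

Any choice of weights is of course debatable, but treating a murder and a purse-snatching as interchangeable units is itself a choice of weights -- just a hidden one.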

Simple analysis can be dangerous.

Drugged-up American graphic

Reader Chris P. found this chart on one of those sites that invite anyone to contribute graphics:


It looks like the designer has taken Tufte's advice of maximizing the data-ink ratio too literally. There are many, many things going on in a tight space, which leaves the reader feeling drugged-up and cloudy.

From a cosmetic standpoint, fixing the following would help a lot:

  • Make fonts 1-2 points larger in all cases, especially the text on the left hand side
  • Use colors judiciously to stress the key data. In this version, the trends, which are more interesting, are shown in pale gray while the raw data, which are not very exciting, are shown in loud red. Just flip the gray with the red. 
  • Rethink the American flag motif: is drug abuse a uniquely American phenomenon? Should data about the American people always be accompanied by the American flag?
  • Present the time-series data on total arrests and the cross-sectional data (2008) in two separate charts

Also, realize that by forcing the data into the 50-star configuration, one arbitrarily decides that the data should be rounded to 2-percent buckets (see right).
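To make the 2-percent-bucket point concrete, here is the rounding a 50-star layout forces on the data (a sketch; the function is mine, not the designer's):

```python
# With 50 stars, each star stands for 2% of arrests, so every value
# must be rounded to the nearest 2-percent bucket before it can be drawn.
def stars_for(pct):
    return int(pct / 2 + 0.5)     # nearest whole number of stars

for pct in (13.0, 27.4, 50.0):
    print(f"{pct}% is drawn as {stars_for(pct) * 2}% ({stars_for(pct)} stars)")
```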

And always ask the fundamental question: what makes this data tick?


As I explored the data, I noticed various arithmetic problems. For example, the arrests-by-race analysis is itself split into two parts: white/black/Native American/Asian add up to 100 percent, and then Hispanic/Latino and non-Hispanic/Latino add up to 100 percent. In some surveys, Hispanics are counted within whites, but that doesn't seem to be the case here. The numbers just don't add up.
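A quick programmatic sanity check catches this class of problem. The percentages below are hypothetical stand-ins, not the chart's actual figures:

```python
# Check whether a published percentage breakdown adds up to 100.
# These figures are hypothetical stand-ins for the chart's numbers.
race_pct = {"white": 64.0, "black": 33.0, "native american": 1.3, "asian": 1.1}
ethnicity_pct = {"hispanic/latino": 20.0, "non-hispanic": 80.0}

def sums_to_100(breakdown, tol=0.5):
    """Allow a small tolerance for rounding in the published figures."""
    return abs(sum(breakdown.values()) - 100) <= tol

print(sums_to_100(race_pct))        # the race breakdown falls short
print(sums_to_100(ethnicity_pct))   # the ethnicity breakdown checks out
```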

Also, adding up the types of drugs involved does not yield the total number of arrests. Perhaps a category of "others" has been omitted without comment. Nonetheless, I closed my eyes and proceeded to make a chart out of this.


The new version focuses on one insight: that certain races seem to get arrested for certain drugs. The relative incidence of arrests is not similar across races for any given drug. Asians and Native Americans appear to have higher proportions of people arrested for marijuana or meth, while blacks are much more likely to be arrested for crack.


You're going to need to click on the chart for the large version to see the text.

Doing this chart gives me another chance to plug the profile chart. We deliberately connect the categorical data with lines. The lines are not meant to mean anything in themselves; they are meant to guide our eyes toward the important features of the chart.

One can sometimes superimpose all the lines onto the same plot, but the canvas clogs up quickly as lines are added, and then a small-multiples presentation like this one is preferred.

There is a temptation to generalize arrest data into statements about drug habits by race, but if you intend to do so, bear in mind that arrests need not correlate strongly with usage.

From text documents to our eyes

Today we look at an example of a powerful visualization of some unstructured data. The data team at the Guardian (UK) organized the Wikileaks data concerning reported IED incidents in Afghanistan.

A scatter plot on a map provides an overview of the intensity of attacks from a spatial perspective. (A part of this map is shown on the right.) The background data -- the relief map of Afghanistan, and the major thoroughfares -- add to our understanding of why attacks were concentrated in certain parts of the country. It is always a great idea to add (con)textual data to help readers grasp the information shown on the chart.

Readers may want to understand the temporal pattern of attacks as well. The designer chose a small-multiples format to show this data, disaggregated by year of occurrence. This graphical construct is very versatile, and illustrates this data well... even though there has been little change over time, apart from a general increase in the number of reported attacks across the country.

It is a good idea to track the total number of attacks over time -- but not with those bubbles! The bubble chart almost always fails the self-sufficiency test: our eyes are not equipped to read relative areas of circles, so any information we obtain about the aggregate number of attacks comes from reading the data directly. Switching to a bar chart, or removing the bubbles and leaving just the data, is recommended.
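The geometry behind this failure is worth spelling out: if a designer maps the value to the bubble's radius, the area (which is what we actually perceive) grows with the square of the value. A quick check:

```python
import math

# If a bubble's RADIUS is proportional to the value, the AREA -- what
# our eyes actually compare -- grows with the square of the value.
def area(radius):
    return math.pi * radius ** 2

# A value twice as large, encoded as radius...
print(area(2) / area(1))             # ...looks four times as big

# Correct encoding: scale the radius by the square root of the value,
# so that the area ratio matches the value ratio
print(area(math.sqrt(2)) / area(1))  # now the area ratio is 2
```

Even with the square-root encoding, readers still judge circle areas poorly, which is why the bar chart remains the safer choice.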

The major problem with a dataset like this is reporting bias: only attacks that were reported by U.S. personnel were included. The following chart helps close the gap a little by also showing the number of defused attacks, reported in the U.S. database. I'd have preferred a stacked column chart here since the total of defused (gray) and detonated (red) IEDs is an interesting statistic.


A stock trading volume type chart would also be nice, something like this:




Making charts beautiful without adding unneeded bits

Reader Dave S. pointed me to some very pretty pictures, published in Wired.

This chart, which shows the distribution of types of 311 calls in New York City by hour of day, is tops in aesthetics. Rarely have I seen a prettier chart.


The problem: no insights.

When you look at this chart, what message do you take away? Furthermore, what message do you take away that is informative, that is, not obvious?

The fact that there are few complaints in the wee hours is obvious.

The fact that "noise" complaints dominate in the night-time hours is obvious.

The fact that complaints about "street lights" happen during the day is obvious.

There are a few not-so-obvious features: that few people call about rodents is surprising; that "chlorofluorocarbon recovery" is a relatively frequent source of complaint is surprising (what is it anyway?); that people call to complain about "property taxes" is surprising; that few moan about taxi drivers is surprising.

But in all these cases, there are no interesting intraday patterns, so there is no need to show the time-of-day dimension. The message can be made more striking by doing away with the time-of-day dimension.

The challenge to the "artistic school" of charting is whether they can make clear charts look appetizing without adding extraneous details.


This is meant as an art piece

Indeed, this set of maps produced by Doug McCune (more here) using publicly available data released by the San Francisco government on its DataSF website is breathtakingly beautiful. Thanks to Rudy R. for bringing this to our attention.


Hate to spoil the fun, but it has to be said that if we apply the Trifecta Checkup, these maps fail at the first question: what is the practical issue being addressed?

As Doug noticed, there is a ridge along Mission Street that appears on pretty much every map regardless of the type of crime. The features on various maps are rather consistent as well -- and I can assure you that those features are consistent with population density.

Alas, if you live in San Francisco and care about crime there, Mission Street is not news. We don't need a sophisticated map to tell us that. The same goes for where prostitution occurs.

What if you are interested in crime in your local neighborhood? These maps won't help there either: in creating the relief, Doug must make approximations; the higher the peak, the more collateral activity is created around it to avoid discontinuities in the surface. This destroys the local detail.


Still, they are gorgeous to look at, and as Doug alluded to in his disclaimer, we just need to remove our junkcharts glasses to appreciate them.

Tiger tiger

Picked up the Metro paper the other day and found them ventilating about the possibility that Tiger Woods used steroids; the news was that a Canadian doctor whom he (and other professional athletes) hired had been caught with HGH and drug equipment. In the section on why Tiger couldn't be doping, the following chart appeared:


According to this line of argument, since steroids should improve driving distances, and since driving distance determines overall performance, the fact that his average driving distance "remained almost constant throughout the years" proved that he did not dope.

Now, I have no idea if he dopes or not. But this particular argument is full of holes. In the modern era, steroids are used not just for enhancing brute strength but also for shortening recovery times, prolonging training, and so on. Also, the argument holds only if overall performance is heavily affected by driving distance.

The bar chart has multiple problems:

  • The choice of starting the vertical scale at 250 is completely arbitrary, and as has been shown before, cutting off the bottoms of bars is a bad idea: the lengths of the remaining parts are no longer proportional to the stated data.
  • The choice of the three years is also unexplained, especially since 2001 is not midway between 1997 and 2009.
  • The horizontal gridlines are totally redundant since all three numbers sit in the very last section (290-300).  
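The distortion from the truncated scale is easy to quantify. The distances below are hypothetical (the article's actual figures aren't reproduced here), but like the chart's bars, they sit in the 290-300 band:

```python
# How a bar chart cut off at 250 distorts a near-tie.
# The distances are hypothetical, in the spirit of the chart.
baseline = 250
a, b = 294, 298                 # two driving distances (yards)

true_ratio = b / a                               # about 1.01: a 1% difference
visible_ratio = (b - baseline) / (a - baseline)  # about 1.09: looks 9% larger

print(f"true: {true_ratio:.2f}, visible: {visible_ratio:.2f}")
```

A 1-percent difference in the data is rendered as a 9-percent difference in bar length; the closer the baseline creeps toward the data, the worse the exaggeration.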

Why were those three years chosen?  The following line chart that plots all the data may give us a clue:


The choice of 2001 and 2009 means we missed the peak of his driving distance performance.  Looking at the standardized units, we see that at its peak, the driving distance was about 2.6 times the standard deviation above his career average (the zero line using the scale on the right). 

The difference between 1997 and the peak was about 20, which looked large compared to the standard deviation of 6 over this entire period. Establishing a reference point is very important to interpreting any observed difference.
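Standardizing is simple to do: subtract the career average from each value and divide by the standard deviation. The series below is made up for illustration (it is not Tiger's actual driving data), but it shows the mechanics:

```python
# Express each year's driving distance in standardized units (z-scores).
# The numbers below are made up for illustration, not Tiger's actual data.
distances = [266, 270, 272, 280, 293, 284, 282, 283, 285, 287, 289, 288]

n = len(distances)
mean = sum(distances) / n
sd = (sum((d - mean) ** 2 for d in distances) / n) ** 0.5   # population SD

z_scores = [(d - mean) / sd for d in distances]
print(f"mean={mean:.1f}, sd={sd:.1f}, peak z={max(z_scores):.1f}")
```

The z-scores carry exactly the same information as the raw distances; the second axis merely relabels the same line in units of standard deviations from the career average.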

This is one of the few occasions where double axes can be recommended.  The two axes in fact plot the same data, only reflecting a difference in scale.

Reference: "Three reasons to believe he's totally clean", Metro USA, Dec 16 2009.

Supplemental reading

What are other graphics blogs talking about recently?

Information Aesthetics highlighted the so-called New York City Subway sparklines. (original site) (Andrew also mentioned it.)

IA said: "The general idea is that the history of subway ridership tells a story about the history of a neighborhood that is much richer than the overall trend."

Okay, but what about these sparklines would clarify that history? From what I can tell, this is a case of making the chart first and making sense of it afterwards.

The chart designer did make a memorable comment in his blog entry: "Hammer in hand, I of course saw this spreadsheet as a bucket of nails."  The hammer is a piece of software he created; the nails, the data of trips taken.

Nathan at FlowingData gave a reluctant passing grade to this Wall Street Journal bubble chart illustrating the recent U.S. bank "stress" test.

One should fight grade inflation with an iron fist. (Hat tip to Dean Malkiel at Princeton.) A simple profile chart would work nicely since the focus is primarily on ranks. The bubbles, as usual, add nothing to the chart, especially since one can create any kind of dramatic effect by scaling them differently.

Nathan also pointed to the maps of the seven sins, which garnered some national attention. This set of maps is a great illustration of the weakness of maps for studying the spatial distribution of anything highly correlated with population distribution. Do cows have envy too? See related discussion at the Gelman blog.

Whither complexity?

The ever-interesting Gelman blog ("Too clever by half") ponders this enterprising NYT chart. Whatever its merits, this is one that requires close study.


Reception has been generally positive. Andrew himself learnt an important fact: that there are still more white people than other races in America! In statistics, we distinguish between two types of errors, the significant kind and the ignorable kind. From this perspective, using admissions counts rather than population-adjusted rates is a gigantic problem; it renders the rest of the chart useless. So I agree with Andrew. As ever, picking the right scale is the beginning of making a nice chart.

We can also use this example to discuss the concept of "interactions". When we go about presenting small multiples, i.e. comparisons of subgroups within a population, it is because we have observed differences between those subgroups; otherwise, it is both simpler and clearer to present the aggregate results. The present chart shows subgroups defined by race, gender, age and substance abused, which is quite a lot of subgroups.

Focusing on the first row (Alcohol), we note that the colored mass has shifted to the right, indicating that more older people abused alcohol. This trend appeared for all races. Scanning the other rows, we discover that only heroin abuse showed a distinctly different pattern, and only among whites. For every other row, the change from 1996 to 2005 appears similar across races.

By breaking out the substance abused, the designer added 21 little charts (7 sets of 3). Only one set (heroin) added information beyond what was true in aggregate, i.e. that substance abusers got older. The incremental gain in information does not justify the added complexity.

Nevertheless, the chart has many positive features, such as the judicious use of axes and gridlines and letting the graphical constructs speak for themselves (without accompanying data labels).


Reference: "Why is Mum in Rehab?",  New York Times, Jun 14 2008.