Making charts beautiful without adding unneeded bits
Nov 20, 2010
Reader Dave S. sent me to some very pretty pictures, published in Wired.
This chart, which shows the distribution of types of 311 calls in New York City by hour of day, is tops in aesthetics. Rarely have I seen a prettier chart.
The problem: no insights.
When you look at this chart, what message are you catching? Furthermore, what message are you catching that is informative, that is, not obvious?
The fact that there are few complaints in the wee hours is obvious.
The fact that "noise" complaints dominate in the night-time hours is obvious.
The fact that complaints about "street lights" happen during the day is obvious.
There are a few not-so-obvious features: that few people call about rodents is surprising; that "chlorofluorocarbon recovery" is a relatively frequent source of complaint is surprising (what is it anyway?); that people call to complain about "property taxes" is surprising; that few moan about taxi drivers is surprising.
But in all these cases, there are no interesting intraday patterns, and so there is no need to show the time-of-day dimension. The message can be made more striking by doing away with the time-of-day dimension.
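To make "doing away with the time-of-day dimension" concrete, here is a minimal sketch of collapsing hourly counts into one total per complaint type and ranking them. The function name `totals_by_type` and the tiny `hourly` table are made up for illustration; they are not the Wired data.

```python
def totals_by_type(hourly_counts):
    """Sum each complaint type over all hours and rank types by total volume."""
    totals = {kind: sum(counts) for kind, counts in hourly_counts.items()}
    # Largest totals first, so the ranking reads like a sorted bar chart.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical counts for three time slots (illustrative numbers only).
hourly = {
    "noise": [5, 1, 9],
    "rodents": [1, 1, 0],
    "taxi": [0, 2, 1],
}
ranked = totals_by_type(hourly)
```

A ranked list like this is what a plain sorted bar chart would show, with no hourly detail to distract from the comparison across types.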
The challenge to the "artistic school" of charting is whether they can make clear charts look appetizing without adding extraneous details.
Chlorofluorocarbon is CFC. Those people are calling about recycling their old refrigerators.
I don't think the street light data are so obvious: why do people call about them so much more often between 10 and noon than at other times? Also, graffiti and dirty conditions both have two clear peaks at 2pm and 7pm. That is weird. There are several more of these one-hour peaks. I'm concerned about the way the data were collected and plotted!
Posted by: Cris | Nov 20, 2010 at 03:51 AM
I think you already phrased the most important issue: "no insights".
From a statistical point of view, we need to ask what model we expect behind the data. Are all the issues people call in about more or less equally distributed, with only the intensity changing over time? This is certainly too simple, as we already know that people are more likely to complain about noise at night.
That will lead us to a model that has certain *expected* intensities of complaints for certain times over the course of one day, estimated from a larger period of time.
To get insight into what is going on on a particular day, we would then need to plot the differences between the "model day" and the actual data.
This difference is something I keep on preaching to business people: "Don't be surprised by the data you look at, but be surprised by the deviation of that data from your expectation!" But for an expectation you need to have at least some kind of (naive) model ...
Posted by: Martin | Nov 20, 2010 at 06:09 AM
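Martin's "model day" idea can be sketched in a few lines. This is only one naive reading of it, assuming the expected count for each hour is the plain average over earlier days. The names `model_day` and `deviations` and all the numbers are hypothetical.

```python
from statistics import mean


def model_day(history):
    """Expected count per hour: the average of that hour across past days.

    `history` is a list of days, each a list of counts with one entry per hour.
    """
    hours = len(history[0])
    return [mean(day[h] for day in history) for h in range(hours)]


def deviations(actual, expected):
    """Actual minus expected, hour by hour; this is the series worth plotting."""
    return [a - e for a, e in zip(actual, expected)]


# Made-up counts for one complaint type over four time slots (illustration only).
history = [
    [2, 10, 8, 20],
    [4, 12, 6, 22],
    [3, 11, 7, 24],
]
today = [3, 11, 15, 22]

expected = model_day(history)
diff = deviations(today, expected)
```

In this toy example only the third slot departs from the model day; the other slots match expectation exactly and would sit on the zero line of a deviation plot.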
'The fact that complaints about "street lights" happen during the day is obvious.'
Not for me. I would have bet a lot that the opposite would have been true.
Sewer maintenance at 3:00 in the morning? Why would that be?
The graffiti complaint at 7:00 pm probably makes sense because that is when people travelling back by train would spot and call to complain about graffiti. But 2:00 pm?
It may not be the best way to display this, but by getting this volume of information across in this format, it makes it easy to look for "deviation from ... [our] expectation" (to borrow Martin's phrase).
Posted by: bv | Nov 20, 2010 at 10:00 AM
While I agree this data is fit for additional analysis, showing the raw data in this way is informative too. For example, who really knows the relative number of complaint x versus complaint y? Or the magnitude of the difference in night volume versus day volume? (Though as I type that last sentence I realize there is no indication of what the horizontal bars indicate!)
Posted by: Zubin | Nov 20, 2010 at 02:43 PM
Sorry, should have said that there are NO horizontal bars.
Posted by: Zubin | Nov 20, 2010 at 02:46 PM
Zubin: this chart would be useful for exploration. It is very dangerous to home in on little bumps and troughs because most of those would be random noise. What the chart designer could have done is to investigate the interesting bits, determine if they are "statistically significant" and then present a chart that draws attention to the real information, and hides the random noise.
Posted by: Kaiser | Dec 01, 2010 at 09:51 PM
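One rough way to separate real bumps from random noise, along the lines Kaiser describes, is to compare each hourly count with its expected value and flag only large standardized gaps. The sketch below assumes counts behave roughly like Poisson counts, so the standard deviation is about the square root of the expected count. That assumption, the threshold of 3, the names `poisson_z` and `flag_hours`, and the numbers are all illustrative choices, not anything taken from the chart.

```python
import math


def poisson_z(observed, expected):
    """Standardized gap between observed and expected counts under a Poisson assumption."""
    return (observed - expected) / math.sqrt(expected)


def flag_hours(observed, expected, threshold=3.0):
    """Return the indices of hours whose gap from expectation exceeds the threshold."""
    return [
        hour
        for hour, (o, e) in enumerate(zip(observed, expected))
        if e > 0 and abs(poisson_z(o, e)) > threshold
    ]


# Made-up expected and observed counts for four time slots (illustration only).
expected_counts = [4.0, 9.0, 16.0, 25.0]
observed_counts = [5, 8, 32, 27]

flagged = flag_hours(observed_counts, expected_counts)
```

With hundreds of hour-by-type cells on a chart like this, a few will cross any fixed threshold by chance alone, so flagged cells are candidates for a closer look rather than findings.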