
This is meant as an art piece

Sf_drugs_500

Indeed, this set of maps produced by Doug McCune (more here) using publicly available data released by the San Francisco government on its DataSF website is breathtakingly beautiful. Thanks to Rudy R for bringing this to our attention.

***

Hate to spoil the fun but it has to be said that if we apply the Trifecta checkup, these maps fail at the first question: what is the practical issue being addressed?

As Doug noticed, there is a ridge along Mission Street that appears on pretty much every map regardless of the type of crime. The features on various maps are rather consistent as well -- and I can assure you that those features are consistent with population density.

Alas, if you live in San Francisco and care about crime there, Mission Street is not news. We don't need a sophisticated map to give us that insight. Same with where prostitution is.

What if you are interested in crime in your local neighborhood? Not these maps either because in creating the relief, Doug must make approximations; the higher the peak, the more collateral activity is created around the peak to avoid discontinuities in the surface. This destroys the local details.

***

Still, they are gorgeous to look at, and as Doug alluded to in his disclaimer, we just need to remove our junkcharts glasses to appreciate them.


Pain, but no gain

Reader Chris P. wasn't amused by this infographic about World Cup economics created by Mint.

Each section has its various problems; I'll take a close look at the middle part titled "No Pain, No Gain".

Mint_worldcup_sm

Let's start with the conclusion: Revenues and costs have increased over the past ten years.

Even without a graph, this statement can at best be based on either two or three data points (depending on the starting year in the decade being referenced). Two or three data points do not a trend make.

When looking at the change in currency amounts over time, we must take inflation into account. Currencies should be expressed in "real terms", or else any "growth" can merely arise from inflation. The US$0.5 billion earned in 1994 is worth about $0.7 billion in 2009 dollars.
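As a rough sketch of that adjustment, the Python snippet below restates a nominal amount in another year's dollars by scaling with the ratio of price indices. The CPI figures are approximate U.S. CPI-U annual averages assumed here for illustration, and the function name `to_real_dollars` is hypothetical; neither comes from the Mint graphic.

```python
# Approximate U.S. CPI-U annual averages (assumed values for illustration;
# check them against the official series before relying on them).
CPI = {1994: 148.2, 2009: 214.5}


def to_real_dollars(amount, from_year, to_year, cpi=CPI):
    """Restate `amount` from from_year dollars into to_year dollars."""
    return amount * cpi[to_year] / cpi[from_year]


revenue_1994 = 0.5  # US$ billions, nominal
revenue_in_2009_dollars = to_real_dollars(revenue_1994, 1994, 2009)
print(f"{revenue_in_2009_dollars:.2f}")  # prints 0.72
```

Under these assumed index values, the 1994 figure comes out at roughly $0.7 billion, in line with the number quoted above.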

For revenues that were earned in foreign currency, we must also be careful with the choice of exchange rates used to convert to US dollars. The designer should definitely explain how this was done in a footnote.

Back to the conclusion: if growth in revenues and costs is the key message, this chart showing drastic ups and downs fails to convey it.

***

The chart itself is all pain, no gain.

  • The two sections should use the same scale: the bars above bear no relationship to the bars below, even though they are paired up in an inviting manner, so that the same height means different values on each chart.
  • Plotted on the same scale, the bars make it clear that hosting the World Cup is a terrifically unprofitable activity: the costs dwarfed the revenues in each case, assuming that we have accounted for all the primary revenues and costs (questionable).
  • Does a football really bounce like that? Perhaps the physicists could tell us.
  • I suspect the distracting diagonal lines running down the canvas reflect something in real life but I'm not sure what... perhaps the way the grass is cut on the pitch? But then the lines were aligned in order to send slanted sun rays down, so perhaps it's sunlight?
  • There is no need to fade out the 2010 projections; we can assume that readers are literate and know what "* PROJECTED" means. Speaking of which, one of my pet peeves is the use of a stranded asterisk not linked to anything or, in this case, confusingly linked to the 1998 revenues number, which clearly could not have been a projection.
***

Here is an alternative:

Redo_worldcupecon

I like to run against convention once in a while; here, I order the host nations from most profitable to least profitable, dropping the standard chronology.

I did this because I don't think the chronology matters. Each World Cup is unique, held in a different country (even continent), with different organizations, under different business cycles.

What we see on this chart is that all World Cups are unprofitable, some vastly loss-making; and the current one is projected to be one of the least profitable of the past two decades.


Head-shaking at the deep hole

Continuing to work through the pile of submissions, here is Jeannie C. recommending one of my favorite economics charts. The economics blogs are generating lots of charts, many of which are uninspiring and run-of-the-mill, but this one about the jobs picture, relative to past recessions, truly paints a harrowing story. (Looks like TPM took the chart from Business Insider but this chart has appeared everywhere.)

Tpm_scaryjobs

The little dotted extension at the end of the current curve (red) indicates the jobs picture after removing Census jobs. I have already explained why this adjustment is necessary (here, and here).

Two small improvements I'd make to the chart:

  • Instead of a rainbow of colors, use a foreground-background concept: have all the past recessions in gray, and the current one in red. This change necessitates a change in curve labeling strategy: affix the year labels directly on the 0% line above the curves. Doing so eliminates the head shakes needed to find the year of the curve.
  • Smoothing out some of the curves will help remove clutter without harming the central message of the chart.


The scatter-plot matrix: a great tool

The scatter-plot matrix is one of the lesser known graphical tools beloved by statisticians. A scatter plot displays the correlation between a pair of variables. Given a set of n variables, there are n-choose-2 pairs of variables, and thus the same number of scatter plots. These scatter plots can be organized into a matrix, making it easy to look at all pairwise correlations in one place.
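To make the counting concrete, here is a minimal Python sketch that enumerates the n-choose-2 pairs and computes a correlation for each, which is the numerical counterpart of one panel in the matrix. The variable names and scores are made up for illustration, and `pearson` is a hand-rolled helper rather than any particular package's function.

```python
from itertools import combinations
from math import comb, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five items on four variables (made-up numbers).
data = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 4, 3, 2, 1],
    "d": [3, 1, 4, 1, 5],
}

# Every unordered pair of variables gets one panel in the matrix.
pairs = list(combinations(data, 2))
assert len(pairs) == comb(len(data), 2)  # 4 variables -> 6 panels

for left, right in pairs:
    print(f"{left} vs {right}: r = {pearson(data[left], data[right]):+.2f}")
```

With 12 variables the same enumeration yields 66 pairs, which is why a compact matrix layout helps.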

***

Since Nate Silver's feature article about New York neighborhoods came out, I have been working on capturing the data because so much was left unsaid in that article.  His ranking formula takes 12 factors (housing affordability, transit, green space, nightlife, etc.) and combines individual scores into an overall score based on chosen weights (e.g. housing affordability counted for 25%). Scores are then converted to ranks.

Silver's discussion focuses on explaining which factors caused which neighborhoods to be ranked high (or low). I'm interested in whether the individual factors are correlated. For example, do neighborhoods with more expensive housing also tend to have higher-quality housing? what about better schools? are more diverse neighborhoods also more creative? and so on. There is really a treasure trove of information locked up in this data.

***

A scatter-plot matrix neatly organizes all of the pairwise correlation information.  See below.

Cat_scatterplot_2

Each small chart shows the correlation between the given pair of variables (one listed on the right, the other listed below). The dots represent the neighborhoods. The pink patch contains the "middle 75%" of the neighborhoods, and we can use the orientation of these patches to get a sense of whether the two variables are positively, negatively or not correlated.

There is a lot to see in this chart. I picked just a few things at random for illustration:

  • In the top left corner, the slant shows that the more affordable the homes are, the worse is the transit.
  • The better the shopping, the better the dining.
  • Interestingly, more diversity seems to mean lower creative capital (although the correlation is only moderate).
  • Wellness scores fall within a rather narrow range compared to other categories, and they seem to be almost completely unrelated to any of the other factors.

***

(Note: I used JMP to generate this matrix. Excel unfortunately does not make scatter-plot matrices natively. JMP is great for such exploration... if the developers are reading this, please make it easier to man-handle the category labels! I made a mess of rotating the text on the right.)

P.S. I had an adventure processing the data from New York magazine. There appear to have been quite a few typos. For more, see my writeup on the book blog.


Oil spills bring out the worst

Infographics types are having a field day telling us how "big" the big BP oil spill truly is. Unfortunately, whenever size is concerned, the specter of pie charts beckons.

Via reader Fia, we are alerted to the discussion at Discover's Cosmic Variance blog about the following chart (the entire poster hosted here at Iglucruise):

Iglucruise_pies1and2

Sean's original post focuses on a fundamental issue in making pie charts: it is the square of the radius (that is, the area), rather than the radius itself, that should be proportional to the data. The current version (shown left below) at Iglucruise has been corrected and the sizes of the pies are much closer together.
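A small Python sketch of the scaling rule, using made-up values rather than the actual spill figures: to make area proportional to the data, the radius has to grow with the square root of the value. The helper name `radius_for` is hypothetical.

```python
from math import sqrt


def radius_for(value, max_value, max_radius=100.0):
    """Radius that makes the circle's AREA proportional to `value`."""
    return max_radius * sqrt(value / max_value)


# Hypothetical sizes in arbitrary units (made-up numbers).
largest, smaller = 100.0, 25.0

r_area = radius_for(smaller, largest)   # 50.0: the circle has a quarter of the area
r_naive = 100.0 * smaller / largest     # 25.0: radius scaled directly
print(r_area, r_naive)
```

Scaling the radius directly would shrink the smaller circle to one sixteenth of the largest circle's area instead of one quarter, which is why the corrected pies look closer in size.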

A more pressing issue with pie charts is that we do poorly in assessing relative sizes of circular areas. Try figuring out how much smaller the BP spill is than the largest circle in the chart...

***

Applying the Trifecta Checkup, we find problems with the choice of chart type, and also the choice of the practical question. The question of comparing this spill to past spills is premature given that the oil is still gushing. In addition, the nature of the 1991 spill (Iraq's deliberate dumping) is in a different class from accidental spillage.

***

Reader Curtis R. pointed us to another infographic about the spill, this one made by Fast Company, a business magazine. Their concept is reminiscent of the racetrack graph.

Fastco_oilspills

In other racetrack graphs, the tracks are close together, and terminate at different angles to the center; the data is encoded in the angles.

In this chart, the angle is everywhere 270 degrees, but the distances between the tracks are varied, and the data is encoded in the circumferences.

Alternatively, this can be thought of as a set of concentric circles with a right angle cut off. Since the circumference is proportional to the radius, the data can be found either in the radii or in the track lengths.
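The proportionality can be checked with a few lines of Python. This is only a sketch with made-up radii: for a fixed sweep of 270 degrees, the length of each track is the radius times the angle in radians, so the ratio of two track lengths equals the ratio of their radii. The name `track_length` is hypothetical.

```python
from math import radians

SWEEP = radians(270)  # every track covers three quarters of a circle


def track_length(radius, sweep=SWEEP):
    """Arc length of a track: radius times the swept angle in radians."""
    return radius * sweep


# Hypothetical radii (made-up numbers).
inner, outer = 2.0, 5.0
ratio = track_length(outer) / track_length(inner)
print(ratio, outer / inner)  # the two ratios agree up to rounding
```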

Judging the relative lengths of the tracks is not any easier than judging the relative areas of circles. One can stare at the orange and green arcs for however long, but can one tell the orange is shorter than the green by 3.3 million?

In other words, this chart is not self-sufficient. If the data is not printed directly onto the chart, it is impossible for readers to interpret it.

***

I would like to see a chart that looks like the following, which emphasizes the open-ended, progressive nature of the spill, using past spills as benchmarks:

Redo_oil spills


Radar charts, Ripleyesque replies, and straight-ticket votes

As you may have noticed, posting has been sluggish lately, and this means I have a backlog of submissions. So please be patient if you sent something to me.

Accenture_gaping

Alex C., at NUS, wasn't so patient so he wrote up his own post on this gaping chart from Accenture reporting on a survey of people's attitudes towards health care. In particular, Alex does a great job taking down the questions in the survey: for appetizers, he noted "So, the interviewees were asked 'Do you think it’s important to focus on delivering real improvements in the overall health of the nation?' I wonder who answered no?"

(Amusing aside: I fibbed. Alex wasn't impatient. He just couldn't believe how rude I was. After spending some unsavory time manually eyeballing the data from these charts, I accidentally emailed the spreadsheet to Alex, thinking that I was emailing myself. So Alex received a reply from me, with the spreadsheet attached but with no comments, and he figured I couldn't be bothered. Perhaps I "gave minimalist, Ripleyesque replies", he wondered.)

Well, I digress. There are a few other problems with this "radar" chart that Alex was too kind to mention:

  • All of the information is in the radius from the center of the circle to individual dots, and yet lines were drawn to connect dots into a ragged circle, drawing attention away from the information.
  • The scale of the radii was provided in only one quadrant, which frustrated me to no end when I tried to "eyelift" the data off the charts.
  • There are slips in craftsmanship as some of the dots seem to fall out of place, e.g. on this particular chart, the two dots for "issue 16" do not seem to be aligned with the label 16; similarly, something seems off with the dots for "issue 13". (These minor slips become very obvious when you are lifting data off the charts.)
  • Each category contains four questions and occupies one quadrant, but the way the information is arranged on this chart type, one cannot visually aggregate the four individual scores to arrive at a category score.

***

Even more fundamentally, what data are being plotted? Turns out this is convoluted:

  • For each of the 16 "issues", respondents are asked to rate the importance and the government's performance on separate 5-point scales.
  • The top 2 points are considered "favorable", and the data depicted are the top-2-point proportions.

So the "gap" that Accenture consultants have stuck their fingers into is the proportion of respondents rating the government's role in an issue as "very important" or "essential" minus the proportion of respondents rating the government performance on that same issue as "fairly well" or "very well".
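To spell out the arithmetic behind the metric, here is a minimal Python sketch of a top-2-point gap for a single issue. The responses are made up, and coding the scales as 1 to 5 with 4 and 5 as the "favorable" points is an assumption for illustration; the Accenture report may code them differently.

```python
def top2_share(ratings):
    """Proportion of 5-point ratings that fall in the top two points (4 or 5)."""
    return sum(1 for r in ratings if r >= 4) / len(ratings)


# Hypothetical responses for one issue on 1-5 scales (made-up numbers).
importance = [5, 5, 4, 4, 4, 3, 5, 4, 2, 5]
performance = [2, 3, 4, 1, 3, 2, 5, 3, 2, 4]

# The "gap" is the difference between the two top-2-point proportions.
gap = top2_share(importance) - top2_share(performance)
print(f"importance {top2_share(importance):.0%}, "
      f"performance {top2_share(performance):.0%}, gap {gap:.0%}")
```

Note that the gap discards everything about how responses are spread within the top two or bottom three points.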

Accenture_gaping2

I just don't share the enthusiasm about this metric. Not merely because we are bound to think an issue is more important if the government has done a poor job at it.

***

Just to surface another problem: from these charts, it's clear that people in different countries approach 5-point scales differently. Perhaps in some countries (see right), they just fill out a "straight-ticket" vote for all issues?!

Thanks Alex for bringing this to our attention.

The PDF of the Accenture report is here.


Self-sufficient charts

A good example showed up in the New York Times recently of a chart that fails the self-sufficiency test that I often speak about here. First, the doctored chart (with the data removed):

Redo_hometeampies
And for comparison, the chart as originally printed (the chart was found only in the print edition, not online):

Nyt_homefield_sm
There is little doubt that the second version, with the data -- all four numbers -- printed on the chart, is much more effective, and that is why the designer thought to include them.

This shows that readers are gravitating to the data rather than the graphical constructs, and thus I consider these types of charts not self-sufficient. The graphical constructs can't stand on their own.

***

The choice of pie charts in a small-multiples arrangement is a mistake for this data set. While indeed in theory the winning percentage could range from 0 to 100%, in practice the winning percentages are rather narrowly dispersed (with the exception of the NFL which has a 16-game regular season).

Just quickly looking up the 2009 regular seasons: MLB teams ranged from 36% (Nationals) to 65% (Yankees); NHL ranged from 32% (Islanders) to 65% (Bruins); NBA from 21% (Sacramento) to 81% (Cleveland).

In order to judge whether 60% or 52% is a large or small number, readers need to have a sense of how teams are dispersed around those averages. A side-by-side boxplot brings this out pretty well (the data is for 2009 seasons).

Redo_homewins

The "box" in a boxplot contains the middle 50% of the teams in each league while the line inside the box depicts the median team (in terms of winning percentage).
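For readers who want the numbers behind a box, the sketch below computes the quartiles and the median with Python's standard `statistics` module. The winning percentages are made up for illustration and are not the 2009 data used in the chart; quartile conventions also vary slightly between software packages.

```python
from statistics import median, quantiles

# Hypothetical winning percentages for the teams in one league (made-up numbers).
win_pct = [0.21, 0.30, 0.37, 0.41, 0.46, 0.50, 0.52, 0.57, 0.61, 0.66, 0.81]

# quantiles(..., n=4) returns the three quartile cut points.
q1, q2, q3 = quantiles(win_pct, n=4)
print(f"box from {q1:.2f} to {q3:.2f}, median line at {median(win_pct):.2f}")

# Teams falling inside the box, i.e. roughly the middle half of the league.
in_box = [p for p in win_pct if q1 <= p <= q3]
print(len(in_box), "of", len(win_pct), "teams inside the box")
```

The wider the box, the more dispersed the league, which is the context needed to judge a gap of a few percentage points.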

The NBA teams showed much higher variability in winning percentages than the NHL or the MLB. Given this fact, a difference in average winning percentage of, say, 2% or 5% from one league to the next is not remarkable.

(The original article did not really pertain to such a comparison so the reason for this chart is not clear.)


A luxury we can afford to miss

Raghu R. alerted us to this bar chart from RealClearWorld.com, reporting survey results in collaboration with Gallup.

  Mali-Tanzania-DRofC  

The poll showed that most of the most pro-U.S. countries are African, and this chart displayed the levels of approval in Mali, Tanzania and the Democratic Republic of the Congo, all ranked fifth.

This effort is an obvious failure, and the resulting chart, particularly the background image, is too much baggage for the simple data set it contained.

***

Junkcharts_trifecta_sm

In our Trifecta Checkup, this chart fails on two fronts: it uses the wrong chart type, and it asks a meaningless question.

  • The percentages are proportions of each country's population, and it is silly to stack them up as if they could be added.
  • Because the three countries are all ranked 5th (in each country, 89% of the respondents approved of the U.S.), there is no point in showing three series of data! Each series says the same thing.

I am not offering up the following profile chart as an alternative but it is here to illustrate why there really are only three data points, and so a chart is a luxury we can afford to miss.

Junkcharts_realclearworld