Reading: WSJ Guide to Information Graphics

Dona Wong, who had stints on the graphics teams at both the Wall Street Journal and the New York Times, has contributed a how-to book on statistical graphics, called "The Wall Street Journal Guide to Information Graphics".

The biggest strength of this book is the material on data collection and selection, which is an overlooked aspect of statistical graphics. The content of p.103, for example, is not typically found in similar books: on this page, Wong works through how to determine the scales for two stock-price charts in such a way that the distances represent relative changes in stock prices (rather than absolute changes). Chapter 3 ("Ready Reference"), which covers this type of material, is almost as big as Chapter 2, which runs through basic rules of making graphs that should be familiar to our readers. Her philosophy, then, leans toward Tukey's as espoused in his seminal book EDA, although Wong keeps to the most basic elements (percentages, indices, log scales, etc.), obviously aiming for a different audience than Tukey.
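Wong's scale-setting exercise can be sketched numerically. Below is a minimal illustration, using made-up prices, of the two standard devices for making distances represent relative changes: indexing both series to 100, or plotting on a log scale.

```python
import math

# Hypothetical closing prices for two stocks (illustrative numbers only)
stock_a = [20.0, 22.0, 24.2]     # +10% each period
stock_b = [200.0, 220.0, 242.0]  # also +10% each period

# Indexing both series to 100 makes relative changes directly comparable:
index_a = [100 * p / stock_a[0] for p in stock_a]
index_b = [100 * p / stock_b[0] for p in stock_b]
print([round(x, 2) for x in index_a])  # [100.0, 110.0, 121.0]
print([round(x, 2) for x in index_b])  # identical, despite a 10x price gap

# Equivalently, on a log scale equal vertical distances mean equal
# percentage changes, regardless of price level; the per-period log
# increments are the same for both stocks:
log_a = [math.log(p) for p in stock_a]
log_b = [math.log(p) for p in stock_b]
print(round(log_a[1] - log_a[0], 4), round(log_b[1] - log_b[0], 4))
```

Either device produces two series whose vertical movements can be compared directly, which is the point of Wong's p.103 exercise.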

The guidelines relating to making charts are prescriptive and concise. The following snippet (pp.72-73) is typical of the style:


Wong focuses on saying what to do, but (usually) not why. Perhaps for this reason, the book has no references or notes, except for mentioning Ed Tufte as Wong's thesis adviser. Almost all the best practices described in the book would meet with our approval. One that has not been featured much on this blog is the preference for shades of the same color over many different colors of the same shade.
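For the color guideline, here is a hypothetical sketch of how one might generate shades of a single color; the function and its white-blending rule are my own illustration, not Wong's.

```python
def shades(base_rgb, n):
    """Generate n shades of one hue by blending the base color toward white."""
    r, g, b = base_rgb
    out = []
    for i in range(n):
        t = i / (n - 1) * 0.75  # 0 = full color, 0.75 = much lighter
        out.append(tuple(round(c + (255 - c) * t) for c in (r, g, b)))
    return out

# Four shades of one blue: an ordered, sequential palette rather than
# four unrelated hues of similar intensity
palette = shades((0, 90, 160), 4)
print(palette)
```

Because the shades are ordered from dark to light, the eye reads them as a progression, which is exactly what a data series with an ordered scale needs.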

Despite the title, the book actually discusses statistical graphics (same as Junk Charts), not "infographics" (as covered by Information Aesthetics, for example). Almost all the graphical examples are conceptual, and not based on real-life examples. This editorial decision has the advantage of sharpening the educational message but the disadvantage of being less engaging.

A unique feature of Wong's book is Chapter 5 ("Charting Your Course"), which covers business charts used to organize operational data, rather than present insights -- things like Gantt charts (which she calls work plans), org charts, flow charts, 2-by-2 matrices, and so on: the toolkit of management consultants. This is an under-studied area that deserves more attention. I am reminded of Tufte's re-design of bus schedules. These charts differ in their need to print every piece of data onto the chart, in the prevalence of text data (and the difficulty of incorporating it), and in having efficient search as a primary goal. And it is in this chapter that the decision to stay conceptual diminishes the impact: it would be very valuable for readers to see a complete Gantt chart based on a real project, and how it evolves over the course of the project. I have always found these charts to start out nicely but gradually sink as details and detours pile up.


There is one chart on p.59 I would like to discuss.

Here, Wong allows the use of double axes in certain cases, basically when the two data series have linearly-related scales. She appends the advice: "Adhere to the correct chart type for each series -- lines for continuous data and bars for discrete quantities... The only exception is when both data series call for a chart with vertical bars. In such instances, convert one to a line." (Regular readers know I don't think much of this rule.)

Based on the chart above, Wong either considers both revenue and market share to be discrete quantities, or considers revenue to be discrete and market share to be continuous. In my mind, both series are continuous data and a chart with two lines is appropriate here.
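For readers who want to draw the two-line version, a minimal matplotlib sketch (with invented revenue and market-share figures) might look like this. I am illustrating the mechanics of a secondary axis, not endorsing double axes in general.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Hypothetical quarterly figures, for illustration only
quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [120, 135, 150, 160]           # $ millions
market_share = [18.0, 18.5, 19.2, 19.0]  # percent

fig, ax1 = plt.subplots()
ax1.plot(quarters, revenue, color="tab:blue", marker="o")
ax1.set_ylabel("Revenue ($M)")

ax2 = ax1.twinx()  # second y-axis sharing the same x-axis
ax2.plot(quarters, market_share, color="tab:red", marker="s")
ax2.set_ylabel("Market share (%)")

fig.savefig("dual_axis.png")
```

The usual caveat applies: with two independent scales, the relative positions of the lines are an artifact of scaling choices, which is one reason this blog treats double axes with suspicion.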

Following one's nose 2

This is the second post on the immigration paradox study, first discussed on the Gelman blog.  My prior post on the graphing aspect is here; this post focuses on the statistical aspects. I am working backwards on Andrew's discussion points.

Which difference is most interesting?

5. Agree with Andrew; they should publish similar analyses on other minority groups as soon as possible.  One thing that strikes me when looking at the interaction plot is that the U.S. born non-Latino whites have a much higher incidence of mental illness.  The difference between subgroups of Latinos paled in comparison to the difference between non-Latinos and Latinos.  This latter difference is more acute among the U.S. born than among the immigrants.  The importance of the Latino analysis hinges upon whether the "paradox" is also found among other minority groups.

(Chris P also pointed this out in his comment on the previous post.)

Disaggregation, Practical Significance, and the Meaning of Not Significant

2. Andrew is also right in expressing moderate skepticism about this sort of disaggregation exercise.  He connects this to the subtle statistical point that "the difference between significant and not significant is not significant."  A related but less abstruse issue is that as one disaggregates any data, the chance of seeing variations that stray from the average gets higher and higher.  This is because the sample size is decreasing, and so the statistical estimates are less reliable.

(To give a flavor of the scale, there were a total of 2500 Latinos in the sample, with 500 Puerto Rican Latinos. The analysis drilled down to the level of different types of mental disorders, subgroups of Latinos, and also adjusted for demographics.  The details of the demographic adjustment are not available but in any case, one should be concerned about whether there were sufficient numbers of say, male immigrant Puerto Rican Latinos age 18-25 with income < $10,000 living in a rental apartment, for such an elaborate exercise.)
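The effect of disaggregation on reliability is easy to simulate. In the sketch below, every subgroup shares the same true 30% prevalence; the only thing that varies is subgroup size (the sizes and seed are arbitrary).

```python
import random
import statistics

random.seed(0)
TRUE_RATE = 0.30  # assume the same underlying prevalence everywhere

def estimated_rates(n_per_group, n_groups=20):
    """Simulate observed prevalence in n_groups subgroups of size n_per_group."""
    return [sum(random.random() < TRUE_RATE for _ in range(n_per_group)) / n_per_group
            for _ in range(n_groups)]

# Aggregate-sized groups vs. heavily disaggregated groups:
big = estimated_rates(2500)
small = estimated_rates(50)

print("spread of estimates with n=2500:", round(statistics.stdev(big), 3))
print("spread of estimates with n=50:  ", round(statistics.stdev(small), 3))
```

Even with no real differences anywhere, the small subgroups throw up estimates far from 30%, which is exactly the "variations that stray from the average" warned about above.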

Expanding on this point further, one observes that the measured gap between U.S. born and immigrant Puerto Rican Latinos was about 5%.  But this 5% is probably of considerable practical significance since the base rate of incidence is about 30% (I say probably since I am not an expert in mental illness).  The current statistical analysis judged this to be insignificant -- if the sample size were larger, this difference could conceivably be statistically significant, and also practically significant.
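To make the sample-size point concrete, here is a back-of-the-envelope two-proportion z-test. The group sizes of 250 are my assumption (roughly splitting the 500 Puerto Rican respondents in half), since the actual sizes are not given.

```python
import math

def two_prop_z(p1, n1, p2, n2):
    """Two-proportion z-test using the pooled standard error; returns z."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

# Assumed group sizes: ~250 immigrant and ~250 U.S.-born Puerto Rican
# respondents, with illustrative rates of 30% vs 35%
z_small = two_prop_z(0.30, 250, 0.35, 250)
# The same 5-point gap with four times the sample:
z_large = two_prop_z(0.30, 1000, 0.35, 1000)

print(round(z_small, 2))  # below the 1.96 cutoff: "not significant"
print(round(z_large, 2))  # clears 1.96: the same gap is now significant
```

The gap itself has not changed between the two calculations; only the sample size has. That is why "not significant" at this scale should read as "inconclusive" rather than "no difference".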

But doesn't the significance test deal with the small-sample problem?  It would, had the authors merely described the Puerto Rican result as inconclusive.  Here, as is very commonly done, insignificance is equated with "no difference": they said

No differences were found in lifetime prevalence rates between migrant and U.S.-born Puerto Rican subjects.

In reality, a difference of 5% was found in the sample that was analyzed.  The statistical procedure found that this difference could have been a result of chance -- notice "could", not "must".  If the measured difference was 0.5% on 30%, then I might be willing to accept a finding of "no difference"; when it was 5% on 30%, I would like to see a larger sample analyzed.

The Meaning of Paradox

1. Andrew was perplexed by why the phenomenon is known as a "paradox". I had the same issue until I read the paper. The authors were a bit sloppy in the abstract. In the paper itself, they explained that the conventional wisdom has it that immigrants should be more likely to have mental illness because of the stress from the immigration process, and yet the statistics showed the exact opposite. That is the paradox.

Publication Bias

I was a little shocked to see the data tables that gave all the estimates of the various effects at the various subgroup levels: shocked because the authors were allowed (or asked) to include only the p-values that were below some unspecified level (which I surmised is 10% although a 5% significance level is used to judge significance as per convention). This is publication bias within publication bias. P-values that are not significant still provide valuable information and should not be omitted. They did provide confidence intervals but for each subgroup separately, rather than for the difference -- and as they noted, such intervals by themselves are inconclusive when they overlap moderately.


Following one's nose 1

Andrew Gelman has a great post about a so-called Immigrant paradox here, which should be interesting to our readers too.

He posed a set of sharp questions.  My read, in reverse order:

6. The graph is pretty effective, I agree.  This is known as an "interaction plot".  The message the authors were trying to send was that the gap between immigrants and U.S. born in terms of prevalence of mental illness is not constant across sub-groups of Latinos.  For example, the gap for Mexicans (light blue) is larger than the gap for Puerto Ricans (pink).  Thus, the authors concluded that one should be careful about speaking of an aggregate (average) gap.

The graph lays this out clearly.  The steeper the line, the bigger the gap between the  immigrants and non-immigrants.

When Andrew showed this, I knew for sure someone would cry foul that a line is drawn between unrelated, discrete things.  Indeed, the very first commenter weighed in with this complaint.  In fact, whenever I show such charts to non-statisticians, a lot of people have this reaction.

So I'll take this as another chance to convince you to release interaction plots from jail.

Typically, a dissenter will offer up a dot plot as an alternative.  So let's look at the same chart without the lines.  Since the reader is supposed to figure out how the gap between U.S. born and immigrant groups varies across different subgroups of Latinos, the proverbial nose ends up tracing a line from a left dot to a right dot.  Thus, to follow one's nose is to mentally draw the lines I just removed.  The chart designer has done us a favor by making the lines explicit.
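For the curious, an interaction plot of this kind takes only a few lines to draw. The prevalence figures below are invented for illustration (the paper's actual rates are not reproduced here), and matplotlib is assumed.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Hypothetical prevalence rates (%), for illustration only
groups = ["Mexican", "Cuban", "Puerto Rican", "Other"]
immigrant = [15, 18, 28, 20]
us_born = [30, 26, 33, 31]

fig, ax = plt.subplots()
for g, imm, usb in zip(groups, immigrant, us_born):
    # One line per subgroup: the slope IS the immigrant/U.S.-born gap
    ax.plot([0, 1], [imm, usb], marker="o")
    # Label the line directly instead of using a legend
    ax.annotate(g, (1, usb), xytext=(5, 0), textcoords="offset points")

ax.set_xticks([0, 1])
ax.set_xticklabels(["Immigrant", "U.S. born"])
ax.set_ylabel("Prevalence of mental illness (%)")
fig.savefig("interaction_plot.png")
```

Delete the `ax.plot` call and keep only dots, and you have the dissenter's dot plot; the lines are what save the reader from drawing them mentally.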

In addition, as Andrew pointed out, it is always better to try to get rid of the legend and put the line labels directly onto the chart.

One shortcoming of the interaction plot is that it does not disclose the relative importance of the different lines, which correspond to the relative proportions of people in these subgroups.  Without this information, the reader will likely assume the lines have equal weight.  This assumption, as I will explain in a future post, may be a problem.

This post dealt with the graphical aspect.  I will have more to say about Andrew's other points on the statistics in a future post.

Seth's Rules

(Via Gelman blog)

Prominent marketer Seth Godin came up with some sensible rules for making "graphs that work".  We agree with most of what he says here, unlike the last time he talked about charting.

One must recognize that he has a very specific type of chart in mind, the purpose of which is to facilitate business decisions.  And not surprisingly, he advocates simple, predictable story-telling.

His first rule: dispense with Excel and Powerpoint.  Agreed but to our dismay, there are not many alternatives out there that sit on corporate computers.  So we need a corollary: assume that Excel will unerringly pick the wrong option, whether it is the gridlines, axis labels, font sizes, colors, etc.  Spend the time to edit every single aspect of the chart!
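As a sketch of that corollary in practice, here is what "editing every single aspect" looks like in matplotlib (the data are made up; the same discipline applies to Excel's defaults).

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May"]
sales = [42, 45, 51, 49, 57]  # made-up numbers

fig, ax = plt.subplots()
ax.plot(months, sales, color="black", marker="o")

# Override every default rather than trusting the tool's choices:
ax.grid(False)                 # no gridlines
for side in ("top", "right"):  # no box around the plot
    ax.spines[side].set_visible(False)
ax.set_ylabel("Sales ($K)")
ax.set_title("Monthly sales", loc="left", fontsize=11)
fig.savefig("edited_chart.png")
```

None of these choices is exotic; the point is that every one of them was made deliberately rather than inherited from the software.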

His second rule: never show a chart for exploration or one that says nothing.  I used to call these charts that murmur but do not opine.  (See here, for example.)  This pretty much condemns the entire class of infographics as graphs that don't work.   This statement will surely drive some mad.  One of the challenges that face infographics is to bridge the gap between exploration and enlightenment, between research and insight.  As I said repeatedly, I value the immense amount of effort taken to impose structure and clarity on massive volumes of data -- but more is needed for these to jump out of the research lab.

In rules 3 and 4, Seth apparently makes a distinction between rules made to be followed and rules made to be broken.  In his view, time going left to right belongs to the former while not using circles belongs to the latter.  He gave a good example of why pictures of white teeth are preferred to pie charts, bravo.  I hope all those marketers are listening.

As readers know, I cannot agree with "don't connect unrelated events".  He's talking about using line charts only for continuous data.  This rule condemns the whole class of profile plots, including interaction charts in which statisticians routinely connect average values across discrete groupings.  The same rule has created the menace of grouped bar charts used almost exclusively to illustrate market research results (dozens to hundreds of pages of these for each study).  I'd file this under rules made to be broken!

What menace?


What menace?


What menace?


What menace?

Alright, I made my point.  If you don't work in market research, the mother lode of cross-tabs and grouped bars, consider yourself lucky.  If you do, will you start making line charts please?

Serving donuts

David Leonhardt's article on the graduation rates of public universities caught my attention for both graphical and statistical reasons.

David gave a partial review of a new book "Crossing The Finish Line", focusing on the authors' conclusion that public universities must improve their 4-year graduation rates in order for education in the U.S. to achieve progress.  This conclusion was arrived at through statistical analysis of detailed longitudinal data (collected since 1999).

This chart is used to illustrate this conclusion.  We will come to the graphical offering later but first I want to fill in some details omitted from David's article by walking through how a statistician would look at this matter, and what it means to "control for" something.

The question at hand is whether public universities, especially less selective ones, have "caused" students to lag behind in graduation rate.  A first-order analysis would immediately find the overall graduation rate at less selective public universities to be lower, about 20% lower, than at more selective public universities.

A doubter appears, and suggests that less selective schools are saddled with lower-ability students, and that would be the "cause" of lower graduation rates, as opposed to anything the schools actually do to students.  Not so fast: the statistician now disaggregates the data and looks at the graduation rates within subgroups of students with comparable ability (in this instance, the researchers used GPA and SAT scores as indicators of ability).  This is known as "controlling for the ability level".  The data now shows that at every ability level, the same gap of about 20% exists: about 20% fewer students graduate at the less selective colleges than at the more selective ones.  This eliminates the mix of abilities as a viable "cause" of lower graduation rates.
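The logic of controlling for ability can be sketched with made-up numbers. In this hypothetical example, the raw comparison overstates the school effect because less selective schools enroll more low-ability students; conditioning on ability isolates a constant 20-point gap.

```python
# Hypothetical graduation rates by ability stratum (illustrative only)
selective = {"high": 0.90, "mid": 0.80, "low": 0.70}
less_sel  = {"high": 0.70, "mid": 0.60, "low": 0.50}

# The student mix differs: less selective schools enroll more
# low-ability students (shares sum to 1 within each school type)
mix_selective = {"high": 0.5, "mid": 0.3, "low": 0.2}
mix_less_sel  = {"high": 0.2, "mid": 0.3, "low": 0.5}

# Raw (first-order) comparison: averages over different student mixes
overall_sel = sum(selective[k] * mix_selective[k] for k in selective)
overall_less = sum(less_sel[k] * mix_less_sel[k] for k in less_sel)
print("overall gap:", round(overall_sel - overall_less, 2))  # 0.26

# Controlling for ability: compare within each stratum instead
for k in selective:
    print(k, "stratum gap:", round(selective[k] - less_sel[k], 2))  # 0.2 each
```

The raw gap (26 points here) mixes the school effect with the ability mix; the within-stratum gap (20 points in every stratum) is what survives after the mix is held constant.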

The researchers now conclude that conditions of the schools (I think they blame the administrators) "caused" the lower graduation rates.  Note, however, that this does not preclude factors other than mix of abilities and school conditions from being the real "cause" of lower graduation rates.  But as far as this analysis goes, it sounds pretty convincing to me.

That is, if I ignore the fact that graduation rates are really artifacts of how much the administrators want to graduate students.  As the book review article pointed out, at the less selective colleges, they may want to reduce graduation rates in order to save money since juniors and seniors are more expensive to support due to smaller class sizes and so on.  On the other hand, the most selective colleges have an incentive to maintain near-perfect graduation rates since the US News and other organizations typically use this metric in their rankings -- if you were the administrator, what would you do?  (You didn't hear it from here.)

Back to the chart, or shall we say the delivery of 16 donuts?

First, it fails the self-sufficiency principle.  If we remove the graphical bits, nothing much is lost from the chart.  Both are equally impenetrable.

A far better alternative is shown below, using a type of profile chart.


Finally, I must mention that in this particular case, there is no need to draw all four lines.  Since the finding of a 20% gap essentially holds for all subgroups, no information is lost by collapsing the subgroups and reporting the average line instead (with a note explaining that the same gap held for every subgroup).

By the way, that is the difference between the statistical grapher - who is always looking to simplify the data - and the information grapher - who is aiming for fidelity. 

Reference: "Colleges are lagging in graduation rates", New York Times, Sept 9, 2009; "Book review: (Not) Crossing the Finish Line", Inside Higher Education, Sept 9 2009.

Supplemental reading

What are other graphics blogs talking about recently?

Information Aesthetics highlighted the so-called New York City Subway sparklines.   (original site)  (Andrew also mentioned it.)

IA said: "The general idea is that the history of subway ridership tells a story about the history of a neighborhood that is much richer than the overall trend."

Okay but what about these sparklines would clarify that history?  From what I can tell, this is a case of making the chart and then making sense of it.

The chart designer did make a memorable comment in his blog entry: "Hammer in hand, I of course saw this spreadsheet as a bucket of nails."  The hammer is a piece of software he created; the nails, the data of trips taken.

Nathan at FlowingData gave a reluctant passing grade to this Wall Street Journal bubbles chart illustrating the recent U.S. bank "stress" test.

One should fight grade inflation with an iron fist.  (Hat tip to Dean Malkiel at Princeton.)  A simple profile chart would work nicely since the focus is primarily on ranks.  The bubbles, as usual, add nothing to the chart, especially where one can create any kind of dramatic effect by scaling them differently.

Nathan also pointed to the maps of the seven sins, which garnered some national attention.  This set of maps is a great illustration of the weakness of using maps to study the spatial distribution of anything that is highly correlated with population distribution.  Do cows have envy too?  See related discussion at the Gelman blog.

The matter of bad choice

Right on the heels of the disastrous bubble chart comes another, courtesy of the NYT Magazine.  Bubble charts are okay for the conceptual ("this is really big, and that is really tiny").  This chart wants readers to compare the sizes of the bubbles, which highlights the worst part of such graphs.

Poor scaling is the huge issue with bubble charts.  They are the prototype of charts that are not "self-sufficient".  Without printing all the data, the chart is unscaled, and thus useless (see below middle).  When all the data is printed (as in the original, below left), the chart is no better than a data table.
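The scaling trap is worth spelling out. Bubble areas, not radii, should be proportional to the data; the sketch below, using the tick-mark values mentioned below, shows how scaling radii instead doubles up the distortion.

```python
import math

# Values to encode as bubbles
values = [10, 20, 34, 41]

# Wrong: radius proportional to value, so a 2x value gets 4x the area
radii_wrong = [v for v in values]
# Right: area proportional to value, so radius goes with the square root
radii_right = [math.sqrt(v) for v in values]

def area(r):
    return math.pi * r * r

ratio_wrong = area(radii_wrong[1]) / area(radii_wrong[0])
ratio_right = area(radii_right[1]) / area(radii_right[0])
print(round(ratio_wrong, 1))  # 4.0: a doubled value shown with 4x the ink
print(round(ratio_right, 1))  # 2.0: area faithfully doubles
```

Even with the square-root correction, readers judge bubble areas poorly, which is why a chart with a common scale (bars or a profile chart) beats bubbles for comparisons.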


In the above right chart, we simulated the situation of a bar or column chart, i.e. we provide a scale.  For this chart, the convenient "tick marks" are at 10, 20, 34, 41.  Unfortunately, this scaled version also fails to amuse.

Note further that the data should have been presented in two sections: the party affiliation analysis and the gender analysis.  Also, it is customary to place "Independents" between "Republicans" and "Democrats" because they are middle-of-the-road.

A profile chart is an attractive way to show this data.  Here, we quickly learn a couple of things obscured in the bubble chart.

On the issue of abortion, Independents are much closer to Democrats than Republicans.  Also, there is barely any difference between the genders, the only difference being the strength of support among those who want to legalize.

Reference: "A matter of Choice", New York Times Magazine, Oct 19 2008.

PS. Based on RichmondTom's suggestion, here are the cumulative profile charts.


Bernard L. suggested a "tornado" chart:

A matter of choice

Two books

Nathan from FlowingData announces a competition to win Tufte's classic book on visual representation of data.   There are still a few days left to participate.  While his more recent books have grown repetitive, this remains one of the most accessible books on the topic.

I also had the pleasure of reading Naomi Robbins' Creating More Effective Graphs.  She adopts a cookbook format providing hints on graphs in one, two and more dimensions, scales, visual clarity and so on.  Since she has already read Cleveland, Tufte, etc., she manages to put all that learning inside one cover.  The page design - with half of every page blank - is refreshingly easy on the eyes.  Inclusion of examples is generous.

Let's review her point of view on some of the topics we discuss frequently on Junk Charts:

Starting axis at zero: she thinks "all bar charts must include zero.  However, the answer is not as clear for line charts or other charts for which we judge positions along a common scale." (p.240)

Jittering: she does not provide a clear guideline but gave an example of a strip chart with jittered dots, commenting that "it gives a much better indication of the distributions than would a plot without jittering" (p.85) so I infer that she's generally in favor.
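Jittering is simple to implement: add a small random horizontal offset to each point so that tied values do not overplot. A minimal sketch with made-up data, assuming matplotlib:

```python
import random
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

random.seed(1)
# Made-up measurements for three groups, with heavy ties
data = {1: [5, 5, 5, 6, 6, 7], 2: [6, 7, 7, 7, 8, 8], 3: [8, 8, 9, 9, 9, 10]}

fig, ax = plt.subplots()
for group, ys in data.items():
    # Jitter: spread tied points horizontally so each dot stays visible
    xs = [group + random.uniform(-0.15, 0.15) for _ in ys]
    ax.scatter(xs, ys, alpha=0.7)

ax.set_xticks(list(data))
ax.set_xlabel("Group")
fig.savefig("jittered_strip.png")
```

Without the jitter, the three 5s in group 1 (and the three 7s in group 2) would collapse into single dots, hiding how many observations sit at each value.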

Parallel coordinates plot / profile plot: she provides an example of such a plot on p.141 and describes how to read such a plot.  Again, I infer she's in favor.

Football rankings 1.1

Long-time reader Jon sent in a different view of the QB data.  He uses a nifty tool in Excel to generate a parallel coordinates plot (also called profile plot) on which pairs of QBs can be highlighted and compared.

This chart exploits the foreground-background concept very nicely.  One way to deal with abundant data is to highlight only those bits that matter to the question at hand, relegating the rest to the background.

The gray lines in the background provide context without grabbing undue attention. He also converted every metric to a scale between 0 and 1, similar to what we did with our version.
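The 0-to-1 conversion Jon used is presumably a min-max rescaling of each metric. Here is a sketch with invented QB statistics; the names and numbers are placeholders, not the actual data.

```python
# Hypothetical QB metrics on very different scales (illustrative numbers)
qbs = {
    "QB A": {"yards": 4000, "td": 28, "rating": 95.0},
    "QB B": {"yards": 3200, "td": 18, "rating": 81.5},
    "QB C": {"yards": 3600, "td": 24, "rating": 88.0},
}
metrics = ["yards", "td", "rating"]

def rescale(qbs, metrics):
    """Min-max rescale each metric to [0, 1] so all axes are comparable."""
    out = {name: {} for name in qbs}
    for m in metrics:
        vals = [qbs[name][m] for name in qbs]
        lo, hi = min(vals), max(vals)
        for name in qbs:
            out[name][m] = (qbs[name][m] - lo) / (hi - lo)
    return out

scaled = rescale(qbs, metrics)
print(scaled["QB C"])  # every value now lies between 0 and 1
```

After rescaling, each vertical axis of the parallel coordinates plot runs from the worst QB (0) to the best (1) on that metric, so line positions are comparable across axes.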

The Eli Manning / Philip Rivers comparison shows that both QBs were below average on most of these metrics, with Manning near the bottom of each.

Noisy subways

This NYC subway report is impossible to read.

However, it is very difficult to find a good way to show the information.  In fact, the data contain very little of it.  Curiously, the ratings are very dispersed, so that each line is graded high on some categories and low on others.  Here's one view of it:


I have grouped the subway lines together (A/C/E, 4/5/6, etc.).  The metrics are plotted left to right in the same order as in the original.  Is it all noise and no signal?

(I just realized the vertical axis is reversed: best ratings are at the bottom, worst ratings at the top.  Doesn't matter anyway since I can't see any patterns.)

Source: "No. 1 Train is Rated Highest by Commuter Advocates", New York Times, July 24 2007.

PS. Two contributions from readers.  Still looking for insight from this data...
