David Leonhardt's article on the graduation rates of public universities caught my attention for both graphical and statistical reasons.
David gave a partial review of a new book "Crossing The Finish Line", focusing on their conclusion that public universities must improve their 4-year graduation rates in order for education in the U.S. to achieve progress. This conclusion was arrived at through statistical analysis of detailed longitudinal data (collected since 1999).
This chart is used to illustrate this conclusion. We will come to the graphical offering later, but first I want to fill in some details omitted from David's article by walking through how a statistician would look at this matter, and what it means to "control for" something.
The question at hand is whether public universities, especially less selective ones, have "caused" students to lag behind in graduation rate. A first-order analysis would immediately find the overall graduation rate at less selective public universities to be lower, about 20% lower, than at more selective public universities.
A doubter appears, and suggests that less selective schools are saddled with lower-ability students, and that would be the "cause" of lower graduation rates, as opposed to anything the schools actually do to students. Not so fast: the statistician now disaggregates the data and looks at the graduation rates within subgroups of students with comparable ability (in this instance, the researchers used GPA and SAT scores as indicators of ability). This is known as "controlling for the ability level". The data now shows that at every ability level, the same gap of about 20% exists: about 20% fewer students graduate at the less selective colleges than at the more selective ones. This eliminates the mix of abilities as a viable "cause" of lower graduation rates.
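To make "controlling for" concrete, here is a minimal sketch in Python. The counts are invented purely for illustration and are not meant to reproduce the book's figures; the two ability bands and the function names are my own assumptions. The idea is simply to compare graduation rates within each ability band rather than across all students at once.

```python
# Hypothetical counts, invented for illustration (not from the book).
# Each record: (selectivity, ability_band, enrolled, graduated)
records = [
    ("more", "high", 800, 720),
    ("more", "low", 200, 140),
    ("less", "high", 200, 140),
    ("less", "low", 800, 400),
]


def overall_rate(rows, selectivity):
    """Graduation rate for one selectivity tier, ignoring ability."""
    enrolled = sum(r[2] for r in rows if r[0] == selectivity)
    graduated = sum(r[3] for r in rows if r[0] == selectivity)
    return graduated / enrolled


def gaps_by_band(rows):
    """Gap (more selective minus less selective) within each ability band."""
    rates = {(sel, band): grad / enrolled for sel, band, enrolled, grad in rows}
    bands = sorted({band for _, band, _, _ in rows})
    return {band: rates[("more", band)] - rates[("less", band)] for band in bands}


raw_gap = overall_rate(records, "more") - overall_rate(records, "less")
controlled_gaps = gaps_by_band(records)
```

In this toy data the raw gap is inflated by the different mix of students at the two tiers, yet a gap of the same size remains inside every ability band. That surviving within-band gap is what cannot be blamed on the mix of abilities.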
The researchers now conclude that conditions of the schools (I think they blame the administrators) "caused" the lower graduation rates. Note, however, that this does not preclude factors other than mix of abilities and school conditions from being the real "cause" of lower graduation rates. But as far as this analysis goes, it sounds pretty convincing to me.
That is, if I ignore the fact that graduation rates are really artifacts of how much the administrators want to graduate students. As the book review article pointed out, at the less selective colleges, they may want to reduce graduation rates in order to save money since juniors and seniors are more expensive to support due to smaller class sizes and so on. On the other hand, the most selective colleges have an incentive to maintain near-perfect graduation rates since U.S. News and other organizations typically use this metric in their rankings -- if you were the administrator, what would you do? (You didn't hear it from here.)
Back to the chart, or shall we say the delivery of 16 donuts?
First, it fails the self-sufficiency principle. If we remove the graphical bits, nothing much is lost from the chart. Both are equally impenetrable.
A far better alternative is shown below, using a type of profile chart.
Finally, I must mention that in this particular case, there is no need to draw all four lines. Since the finding of a 20% gap essentially holds for all subgroups, no information is lost by collapsing the subgroups and reporting the average line instead (with a note explaining that the same effect affected every subgroup).
By the way, that is the difference between the statistical grapher - who is always looking to simplify the data - and the information grapher - who is aiming for fidelity.
Reference: "Colleges are lagging in graduation rates", New York Times, Sept 9, 2009; "Book review: (Not) Crossing the Finish Line", Inside Higher Education, Sept 9 2009.
Here is some interesting reading from other places:
Tag clouds have caught on since we approved them a while ago. One interesting use was at the Life Vicarious blog. They use it to compare the inclinations of three New York-based restaurant reviewers. What they should have done is to remove irrelevant words like "one", "also", "many", "make"/"made", etc. In statistics, this is called removing "noise" which helps bring out the "signal".
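Removing the noise words before counting is simple to sketch. The stop list below is an assumption: it contains the words named above plus a handful of common English words, and a real application would use a fuller list.

```python
from collections import Counter
import re

# Assumed stop list: the words named in the post plus a few common ones.
STOP_WORDS = {"one", "also", "many", "make", "made",
              "the", "a", "and", "of", "is", "was"}


def tag_counts(text, stop_words=STOP_WORDS):
    """Count words for a tag cloud, skipping the stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words)
```

The resulting counts can then drive the font sizes in the cloud, so that only words carrying some signal compete for the reader's attention.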
Andrew Gelman discussed the NYT article that reported the finding of unexpected male bias in the children of Asian American families. He can be counted on to make useful comments on any accompanying graphics. He rightly pointed out that this is one example of not starting at zero: the relevant baseline is 100 since the metric is essentially the over-age of males relative to females. I also agree that a line chart with a longer time series plotting percentages rather than over-age would work better.
The first one is a Marimekko which many would consider to be appropriate for this type of data. It is essentially a stacked bar chart where the width of the bar is scaled to the proportion of the type of gas. Here's what one would be looking at:
Marimekkos (also called Mosaic charts) share many of the problems of pie charts. Note the need to use multi-color, the difficulty in comparing the areas of the pieces (even worse than looking at sectors), and the difficulty in comparing across categories since the pieces float in irregular space (take for example the three pink pieces). My rule is: avoid at all costs. (Well, like the pie chart, when the data is sufficiently simple, with very few pieces and with some outliers, these could be acceptable.)
Secondly, here is a recycled junkart chart, with all white space removed from the interior. (Thanks to Derek for the suggestion.)
Depending on what the purpose of the chart is, one can decide what is the base for the proportions. My version preserves equity between the two dimensions. Anything else will require the designer to make a choice. If, for example, the base is 100% for each type of gas emitted, then the reader could not derive from the same chart the proportion of each source of emission (across all types of gases).
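The effect of the choice of base can be sketched with a small table. The emission numbers below are invented for illustration, and I am assuming that "equity between the two dimensions" amounts to using the grand total as the base.

```python
# Hypothetical emissions table (arbitrary units), invented for illustration.
emissions = {
    "CO2": {"energy": 60.0, "agriculture": 5.0},
    "methane": {"energy": 5.0, "agriculture": 10.0},
}


def share_of_grand_total(table):
    """Each cell as a proportion of all emissions (one base for everything)."""
    total = sum(v for row in table.values() for v in row.values())
    return {gas: {src: v / total for src, v in row.items()}
            for gas, row in table.items()}


def share_within_gas(table):
    """Each cell as a proportion of its own gas (base is 100% per gas)."""
    return {gas: {src: v / sum(row.values()) for src, v in row.items()}
            for gas, row in table.items()}


grand = share_of_grand_total(emissions)
within = share_within_gas(emissions)

# With the grand total as base, a source's overall share is just a sum of cells.
energy_share = sum(row["energy"] for row in grand.values())
```

With the per-gas base, the cells in `within` cannot be added up across gases to recover the overall share of a source, because the relative sizes of the gases have been normalized away.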
A particular genre of graphics is designed to induce awe: certain bits are allowed to stick out like a sore thumb. Via reader Andre L., and an archive of US Army medical photos and illustrations:
This is a small multiples graph designed to display the somewhat seasonal pattern of deaths due to influenza over years. Basically, we see a U shape in almost every year; however, the height of the peak, and the timing of the peak shows quite a lot of variation. Further, some years exhibit more of an L-shape than U-shape.
But the attention grabber here is the massive peak that occurred between 1918 and 1919. It was unusual in many ways... it was the second big peak during 1918, it occurred late in the year and elided with the next year's peak. The designer allowed these two components to bleed into the other charts.
From the perspective of scale, readability, cleanliness, this bit sticks out like a sore thumb! But one has to say it is effective.
A log scale is often used to deal with data containing such outliers. But while this makes neater charts, the impact of the orders-of-magnitude difference is lost on the reader, except in her imagination.
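A quick sketch shows how much a log scale compresses such an outlier. The counts are invented for illustration: three ordinary seasonal peaks and one pandemic-scale peak.

```python
import math

# Hypothetical weekly death counts, invented for illustration.
peaks = [40, 50, 65, 5000]


def log10_positions(values):
    """Positions of strictly positive values on a base-10 log axis."""
    return [math.log10(v) for v in values]


# On a linear axis the outlier is 100 times taller than an ordinary peak.
linear_ratio = peaks[-1] / peaks[1]

# On a log axis the same outlier sits only 2 units above that peak.
positions = log10_positions(peaks)
log_gap = positions[-1] - positions[1]
```

A hundredfold difference becomes two tick marks, which is tidy but leaves the reader to reconstruct the drama from the axis labels.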
Message to readers: I have a large backlog of reader suggestions. Please be patient as I slowly get through them. The frequency of posts will remain lower for the time being as I am busy finalizing a draft of a book. More on that in the near future.
Matt H, a reader, sent in the following entry (with minor edits).
I saw a couple of bad charts on money.cnn.com and thought I'd submit them to you.
They're both part of the same feature on investment bargains caused by the recession.
It seems to me like both charts would have made their points more eloquently by using a much simpler, more common form, like a bar chart.
In Chart A, cubes are used to display the difference between treasury bond yields and AAA municipal bond yields at the two-year horizon and the ten-year horizon. The volume of each cube corresponds to the yield for the given type of bond in the given period (I think), which spreads the one dimension being compared (yields) across three dimensions, making the differences look smaller than they really are. [...] At the two-year horizon, the two yields being compared are 1.16% for Treasury bonds and 3.01% for AAA municipal bonds. The yield for AAA municipal bonds in this case is more than 2.5 times larger than the yield for Treasury bonds, but the difference doesn't look nearly that big in the chart provided. [...]
Time out. Let me point out the inadvertent reference to an optical illusion concerning foreground and background! The "outline only" cube on the left should have approximately the same volume as the "solid red" cube on the right (3.01% versus 3.30%) and yet the red cube appears quite a bit larger because our eyes react to the solid color more than to thin outlines.
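The shrinking effect Matt describes is easy to quantify. Here is a minimal sketch, assuming (as he does) that the yield is encoded as the volume of the cube; the function name is my own. When a value is spread over several dimensions, the visible edge length grows only as the corresponding root of the value.

```python
def linear_scale_ratio(value_ratio, dimensions):
    """Ratio of edge lengths (or radii) when a value is encoded as
    length (1 dimension), area (2) or volume (3)."""
    return value_ratio ** (1.0 / dimensions)


treasury_2yr = 1.16  # two-year Treasury yield, in percent
muni_2yr = 3.01      # two-year AAA municipal yield, in percent

value_ratio = muni_2yr / treasury_2yr              # about 2.59
cube_edge_ratio = linear_scale_ratio(value_ratio, 3)  # about 1.37
```

So a yield that is almost 2.6 times larger is drawn as a cube whose edges are only about 1.37 times longer, which goes a long way toward explaining why the difference looks so modest.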
In Chart B, [...] Again, the metric in question is bond yields: ten-year Treasury bond yields compared to investment-grade corporate bond yields. The 2008 figure for each is shown alongside the five-year average. This chart uses the area of a circle to express these yields, spreading the one-dimensional value across two dimensions. As in Chart A, the result is a chart in which the difference between values does not appear as large as it is.
I will also send a simple bar chart version of each chart -- the bar charts should illustrate the differences in yields more effectively than the charts actually used in this article.
These are his revised charts:
We can do even better by converting the chart on the right to a time-series line chart. Instead of the five-year average, it is better to display the gap between treasury and corporate bonds for each of the five years plus 2008. This should make for a more eye-catching graphic.
Reference: "Investment in the bargain bin", CNN Money.
I ran across this hugely successful chart on Dean Foster's home page (and noted that he and his Wharton colleagues have a nice blog picking apart statistical errors committed in public.)
This is a histogram plotting the historical year-on-year returns of the S&P 500 index, binned into 10%-levels. It succeeds on two levels: the innovation of printing the years inside little blocks provides extra information without distracting from the overall picture; the key message of this plot, that the negative return of 2008 is a negative outlier in the history of returns, is extremely clear.
This, in my mind, is a superior presentation to the usual time-series line chart that we see in every economics publication. For some purposes, it is better to unshackle ourselves from the linear time dimension, and this is a good example.
One question/comment: within each 10% level, the years are arranged in reverse chronological order from top to bottom. This facilitates searching for a particular year. The obvious alternative is to order by the actual level of return, so that the result is akin to a stem-and-leaf plot.
While I like the graphical aspect of the chart, I feel like it has limited function. This graph appears useful to anyone who has a one-year investment horizon. If I want to predict what next year's S&P 500 return is, I might take a random sample from this distribution. However, as a lazy investor, I never look at a one-year horizon so this creates two problems: if I am looking five years out, I can't take five samples from this distribution because there is serial correlation in this data for sure; even if I could take those five samples, it is difficult to compute the five-year return in my head.
So what I did was to take the data and replicate this histogram for 2-year, 3-year, 5-year, 10-year, etc. returns. The results are as follows. I decided to simplify further and use Tukey's boxplot instead of the histogram. The data are real compounded total returns from the S&P 500 for 1910-2008.
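For readers who want to replicate the multi-year figures, here is a minimal sketch of the compounding step. It assumes overlapping windows (every run of consecutive years counts), which is one reasonable choice among several; the annual returns in the example are invented for illustration.

```python
def rolling_compound_returns(annual_returns, years):
    """Compounded return over every consecutive window of `years`
    annual returns (windows overlap)."""
    results = []
    for start in range(len(annual_returns) - years + 1):
        growth = 1.0
        for r in annual_returns[start:start + years]:
            growth *= 1.0 + r  # compound one year at a time
        results.append(growth - 1.0)
    return results


# Hypothetical annual returns: +10%, -20%, +30%, +5%
example = [0.10, -0.20, 0.30, 0.05]
two_year = rolling_compound_returns(example, 2)
```

Working from actual multi-year windows, rather than sampling single years independently, keeps whatever serial correlation exists in the data. The resulting lists are what a boxplot would then summarize.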
The boxplot on the top right shows that there is about a 25% chance that an investment in the S&P 500 will return negative in real terms in any three-year period (below the green line). At the other end, there is a 25% chance of earning more than 50% on the principal during those three years.
The next set of boxplots compares 5-year returns to 10-year returns and 10-year returns to 20-year returns. If we have a 10-year horizon, there is still a positive chance of reaching the end of the decade and finding the investment under water! The median 10-year return is approximately doubling the principal (about 8% per annum compounded).
In a twenty-year period, there is hardly any chance of not making money on the S&P. There were two positive outliers of over 1000% (about 13% per annum compounded over 20 years).
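The per-annum figures come from converting a cumulative return into a compounded annual rate. A minimal sketch, assuming a return of 1000% means the principal ends at 11 times its starting value:

```python
def annualized_rate(total_return, years):
    """Compounded annual rate equivalent to a cumulative return.

    total_return is a fraction: 1.0 means +100% (principal doubles).
    """
    return (1.0 + total_return) ** (1.0 / years) - 1.0


twenty_year_outlier = annualized_rate(10.0, 20)  # +1000% over 20 years
ten_year_doubling = annualized_rate(1.0, 10)     # +100% over 10 years
```

Under this convention the 1000% outlier works out to roughly 12.7% a year, in line with the "about 13%" above. An exact doubling over ten years works out to roughly 7.2% a year, so the "about 8%" figure would correspond to a median somewhat above a pure doubling.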
If you happened to be in a Starbucks recently, you might have picked up some charts, which was what happened to one of our regular readers and commentators, ZBicyclist, who then tried his hand at chart critique here. He is worried they may reach millions soon. So look out!
This graph on the right -- which should rightfully be called a "tree ring graph" -- seems to me a fantastic concept although it is hard to think of data that would deserve this treatment. Certainly not the retail sales series plotted here!
One issue is scale: note the awkward way in which the innermost ring is used to designate the oldest sales data of $375 billion presumably in 1996, and think about how you would decide where to place the 2007 ring. (It's arbitrary.)
Another problem is labeling: when the growth is slow, the rings are close together, and labels have to be jittered (look at 2001 and 2002). In this case, a relatively simple solution is to have the entire series of years run diagonally.
Yet another challenge is relative radii versus relative areas. Inevitably, some readers will respond to the areas while others will respond to the radii. ZBicyclist, for example, belongs to the first group while in this case, I find myself siding with the latter. When the bubbles/rings overlap, it is difficult to assess areas.
Of course, a simple line chart would do the job with minimal fuss. The following chart issued by the National Retail Federation actually plots the growth rates, rather than the annual sales.
The first thought that came to mind after browsing through all the charts was: what a great job they have done to generate interest in food data, which has no right to be entertaining. Specifically, this is a list of things I appreciated:
An obvious effort was undertaken to extract the most thought-provoking data out of a massive amount of statistics collected by various international agencies. None of the charts is overstuffed, which is a common problem.
It would be somewhat inappropriate to use our standard tools to critique these charts. Clearly, the purpose of the designer was to draw readers into statistics that they might otherwise not care for. Moreover, the Wired culture has long traded off efficiency for aesthetics, and this showed in a graph such as this, which is basically a line chart with two lines, and a lot of mysterious meaningless ornaments:
A nice use of a dual-line chart, though. It works because both data series share the same scale and only one vertical axis is necessary, which is very subtly annotated here.
The maintenance of the same motifs across several charts is well done. (See the pages on corn, beef, catfish)
It would be nice if Wired would be brave enough to adopt the self-sufficiency principle, i.e. graphs should not contain a copy of the entire data set being depicted. Otherwise, a data table would suffice. The graphical construct should be self-sufficient. This rule is not often followed because of "loss aversion"; there is the fear that a graph without all the data is like an orphan separated from the parents. Since, as I noted, these graphs are mostly made for awe, there is really no need to print all the underlying data. For instance, these "column"-type charts can stand on their own without the data (adding a scale would help).
Not sure if sorting the categories alphabetically in the column chart is preferred to sorting by size of the category. The side effect of sorting alphabetically is that it spreads out the long and the short chunks, which simplifies labelling and thus reading.
Not a fan of area charts (see below). Although it is labelled properly, it is easy at first glance to focus on the orange line rather than the orange area. That would be a grave mistake. The orange line actually plots the total of the two types of fish rearing, not the aquaculture component. The chart is somewhat misleading because it is difficult to assess the growth rate of aquaculture. Much better to plot the size of both markets as two lines (either indexed or not).
Reference: "The Future of Food", Wired, Oct 20 2008.
When comparing two time series, one typically wants to discuss the size of the gap as it changes over time. This Business Week chart, for example, depicted for readers the expanding gap between intra-day high and low prices of the S&P 500 for 2008.
This chart construct is effective at pointing out large changes but lacks precision in conveying smaller differences, or trends. It is always a good idea to plot the gap directly, as we will show below.
More importantly, a better choice of scale can help a lot. By focusing exclusively on variability (extreme values), this chart hides the relevant information of the closing prices of the S&P. A point spread of 100 points means more when the index is at 800 than at 1200. In order to capture this, we can divide the point spread by the opening price of that day, so that we say the gap is one-eighth or one-twelfth of the opening price.
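The rescaling is a one-line calculation. In this sketch the function name is my own and the high, low and opening prices are hypothetical, chosen only to reproduce the 800 versus 1200 comparison.

```python
def relative_spread(high, low, open_price):
    """Intra-day high-low point spread as a fraction of the opening price."""
    return (high - low) / open_price


# The same 100-point spread at two different index levels (hypothetical days).
spread_at_800 = relative_spread(850.0, 750.0, 800.0)      # one-eighth
spread_at_1200 = relative_spread(1250.0, 1150.0, 1200.0)  # one-twelfth
```

The same 100 points is 12.5% of the opening price in the first case and about 8.3% in the second, which is the distinction the raw point spread hides.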
The junkart version makes both changes. The top chart fixes the scale, plotting the point spread as a percentage of daily opening prices. Relative to the original chart, the variability in the front part of 2008 was muted because the index was at higher levels back then.
The bottom chart plots the gap sizes (lengths of the high-low lines). It is without doubt that directly plotting the gaps showcases the key message. The current level of volatility is more than double what occurred at the beginning of the year.
If one wants to illuminate the trend as opposed to daily fluctuations, a further improvement will be using moving averages.
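A simple trailing moving average is sketched below. The window length is a free choice that the chart designer would have to make, and the function name is my own.

```python
def moving_average(values, window):
    """Trailing moving average; the first value appears once a full
    window of observations is available."""
    if window <= 0:
        raise ValueError("window must be a positive integer")
    return [
        sum(values[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(values))
    ]
```

A longer window gives a smoother line but reacts more slowly to a genuine shift in volatility, so the choice trades off clarity of trend against timeliness.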
For those interested, shown below is a scatter plot that compares the original point spread and the derived point spread, which shows that the change is not trivial.
Reference: "The Market: A Daily Roller Coaster", Business Week, Oct 27 2008.
Frederic M. sent in this chart, together with his commentary.
Bubbles across rows have vastly different numbers but their circles are of identical size (or vice versa). It borders on the ridiculous that all bubbles of the US row have the same size... The question of whether teenage birth rates and teen sex are correlated cannot be eye-balled with this kind of display. The fact that you cannot compare across rows makes this an instance of “chart junk”.
White spaces -- always dangerous. Does lack of bubble imply no data or no abortions/sex?
Sorting -- this is what Howard Wainer called "Arizona first" with a twist (United States)
Loss aversion -- would U.S. readers be resentful if countries like Iceland are excluded? A much reduced version comparing the U.S. to, say, Canada, the U.K., Japan and Germany may yield more information for the reader.
Sufficiency -- if all the data are printed as in a table, why do we need the bubbles?
Reference: "Let's Talk About Sex", New York Times, Sep 6 2008.