David Leonhardt's article on the graduation rates of public universities caught my attention for both graphical and statistical reasons.
David gave a partial review of a new book, "Crossing the Finish Line," focusing on the authors' conclusion that public universities must improve their four-year graduation rates if education in the U.S. is to progress. This conclusion was reached through statistical analysis of detailed longitudinal data (collected since 1999).
This chart is used to illustrate that conclusion. We will come to the graphical offering later, but first I want to fill in some details omitted from David's article by walking through how a statistician would look at this matter, and what it means to "control for" something.
The question at hand is whether public universities, especially less selective ones, have "caused" students to lag behind in graduation rate. A first-order analysis immediately finds that the overall graduation rate at less selective public universities is lower, by about 20%, than at more selective public universities.
A doubter appears, and suggests that less selective schools are saddled with lower-ability students, and that this, rather than anything the schools actually do to students, is the "cause" of lower graduation rates. Not so fast: the statistician now disaggregates the data and looks at the graduation rates within subgroups of students of comparable ability (in this instance, the researchers used GPA and SAT scores as indicators of ability). This is known as "controlling for the ability level". The data now show that at every ability level, the same gap of about 20% exists: about 20% fewer students graduate at the less selective colleges than at the more selective ones. This eliminates the mix of abilities as a viable "cause" of lower graduation rates.
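The mechanics of this disaggregation can be sketched in a few lines of Python. All the counts below are invented for illustration; the point is only how one computes rates within ability bands before comparing school types.

```python
# "Controlling for" ability: compute graduation rates within each
# ability band, then compare school types band by band.
# All counts below are invented for illustration.

# (ability_band, school_type) -> (graduates, enrolled)
counts = {
    ("high",   "more selective"): (900, 1000),
    ("high",   "less selective"): (700, 1000),
    ("medium", "more selective"): (800, 1000),
    ("medium", "less selective"): (600, 1000),
    ("low",    "more selective"): (700, 1000),
    ("low",    "less selective"): (500, 1000),
}

def grad_rate(grads, enrolled):
    return grads / enrolled

# If the gap persists within every band, the mix of abilities
# cannot by itself explain the overall difference.
gaps = {}
for band in ("high", "medium", "low"):
    more = grad_rate(*counts[(band, "more selective")])
    less = grad_rate(*counts[(band, "less selective")])
    gaps[band] = more - less

print(gaps)  # each band shows (up to float error) the same 0.20 gap
```

In the book's data, the within-band gaps came out roughly equal, which is what licenses the researchers' conclusion.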
The researchers now conclude that conditions of the schools (I think they blame the administrators) "caused" the lower graduation rates. Note, however, that this does not preclude factors other than mix of abilities and school conditions from being the real "cause" of lower graduation rates. But as far as this analysis goes, it sounds pretty convincing to me.
That is, if I ignore the fact that graduation rates are really artifacts of how much the administrators want to graduate students. As the book review pointed out, the less selective colleges may want to reduce graduation rates in order to save money, since juniors and seniors are more expensive to support due to smaller class sizes and so on. On the other hand, the most selective colleges have an incentive to maintain a near-perfect graduation rate, since U.S. News and other organizations typically use this metric in their rankings. If you were the administrator, what would you do? (You didn't hear it from here.)
Back to the chart, or shall we say the delivery of 16 donuts?
First, it fails the self-sufficiency principle. If we remove the graphical bits, nothing much is lost from the chart. Both are equally impenetrable.
A far better alternative is shown below, using a type of profile chart.
Finally, I must mention that in this particular case, there is no need to draw all four lines. Since the finding of a 20% gap essentially holds for all subgroups, no information is lost by collapsing the subgroups and reporting the average line instead (with a note explaining that the same gap appeared in every subgroup).
By the way, that is the difference between the statistical grapher, who is always looking to simplify the data, and the information grapher, who aims for fidelity.
Reference: "Colleges Are Lagging in Graduation Rates," New York Times, Sept 9, 2009; "Book Review: (Not) Crossing the Finish Line," Inside Higher Ed, Sept 9, 2009.
IA said, "The general idea is that the history of subway ridership tells a story about the history of a neighborhood that is much richer than the ..."
Okay but what about these sparklines would clarify that history? From what I can tell, this is a case of making the chart and then making sense of it.
The chart designer did make a memorable comment in his blog entry: "Hammer in hand, I of course saw this spreadsheet as a bucket of nails." The hammer is a piece of software he created; the nails, the data of trips taken.
Nathan at FlowingData gave a reluctant passing grade to this Wall Street Journal bubbles chart illustrating the recent U.S. bank "stress" test.
One should fight grade inflation with an iron fist. (Hat tip to Dean Malkiel at Princeton.) A simple profile chart would work nicely since the focus is primarily on ranks. The bubbles, as usual, add nothing to the chart, especially since one can create any desired dramatic effect simply by scaling them differently.
Nathan also pointed to the maps of the seven sins, which garnered some national attention. This set of maps is a great illustration of the weakness of maps for studying the spatial distribution of anything that is highly correlated with population distribution. Do cows have envy too? See related discussion at the Gelman blog.
Right on the heels of the disastrous bubble chart comes another, courtesy of the NYT Magazine. Bubble charts are okay for the conceptual ("this is really big, and that is really tiny"). This chart wants readers to compare the sizes of the bubbles, which highlights the worst part of such graphs.
Scaling is the huge issue with bubble charts. They are the prototype of what I call charts that are not "self-sufficient". Without printing all the data, the chart is unscaled, and thus useless (see below middle). When all the data is printed (as in the original, below left), it is no better than a data table.
In the above right chart, we simulated the situation of a bar or column chart, i.e., we provided a scale. For this chart, the convenient "tick marks" are at 10, 20, 34, 41. Unfortunately, this scaled version also fails to amuse.
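The scaling trap is easy to demonstrate numerically. In the sketch below (reusing the tick-mark values above as made-up bubble data), encoding the value in the radius rather than the area inflates the apparent ratio between the largest and smallest bubbles roughly fourfold.

```python
import math

# Bubble-scaling sketch: values to be encoded as circle sizes.
values = [10, 20, 34, 41]

# Correct encoding: area proportional to value, so radius ~ sqrt(value).
radius_by_area = [math.sqrt(v) for v in values]
# Common mistake: radius proportional to value.
radius_by_value = [float(v) for v in values]

def area_ratio(radii):
    """Largest-to-smallest bubble ratio as the eye judges it (by area)."""
    return (max(radii) / min(radii)) ** 2

print(area_ratio(radius_by_area))   # about 4.1, matching the data ratio 41/10
print(area_ratio(radius_by_value))  # about 16.8, wildly exaggerated
```

Since readers cannot tell from the picture alone which convention the designer used, the bubbles carry no reliable quantitative information.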
Note further that the data should have been presented in two sections: the party affiliation analysis and the gender analysis. Also, it is customary to place "Independents" between "Republicans" and "Democrats" because they are middle-of-the-road.
A profile chart is an attractive way to show this data. Here, we quickly learn a couple of things obscured in the bubble chart.
On the issue of abortion, Independents are much closer to Democrats than Republicans. Also, there is barely any difference between the genders, the only difference being the strength of support among those who want to legalize.
Reference: "A Matter of Choice," New York Times Magazine, Oct 19, 2008.
PS. Based on RichmondTom's suggestion, here are the cumulative profile charts.
Nathan from FlowingData announces a competition to win Tufte's classic book on visual representation of data. There are still a few days left to participate. While Tufte's more recent books get repetitive, his classic remains one of the most accessible books on this topic.
I also had the pleasure of reading Naomi Robbins' Creating More Effective Graphs. She adopts a cookbook format, providing hints on graphs in one, two and more dimensions, scales, visual clarity and so on. Since she has already read Cleveland, Tufte, etc., she manages to put all that learning inside one cover. The page design, with half of every page blank, is refreshingly easy on the eyes, and she includes a generous number of examples.
Let's review her point of view on some of the topics we discuss frequently on Junk Charts:
Starting axis at zero: she thinks "all bar charts must include zero. However, the answer is not as clear for line charts or other charts for which we judge positions along a common scale." (p.240)
Jittering: she does not provide a clear guideline but gives an example of a strip chart with jittered dots, commenting that "it gives a much better indication of the distributions than would a plot without jittering" (p.85), so I infer that she's generally in favor.
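For readers who have not seen jittering before, here is a minimal sketch with invented ratings data: each tied value gets a small random offset so the dots no longer overplot, while staying close enough to their true values to be read against the scale.

```python
import random

random.seed(0)  # reproducible noise for the sketch

# Invented ratings with many ties -- the situation jittering fixes.
ratings = [3, 3, 3, 4, 4, 5, 5, 5, 5, 2]

def jitter(xs, width=0.3):
    """Add small uniform noise so identical values no longer overlap."""
    return [x + random.uniform(-width, width) for x in xs]

jittered = jitter(ratings)
# Every point stays within `width` of its true value, but the ties
# are now visually distinguishable on a strip chart.
```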
Long-time reader Jon sent in a different view of the QB data. He uses a nifty tool in Excel to generate a parallel coordinates plot (also called profile plot) on which pairs of QBs can be highlighted and compared.
This chart exploits the foreground-background concept very nicely. One way to deal with abundant data is to highlight only those bits that matter to the question at hand, and relegate the rest to the background.
The gray lines in the background provide context without grabbing undue attention. He also converted every metric to a scale between 0 and 1, similar to what we did with our version.
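Min-max rescaling is one common way to make that conversion; I don't know exactly which formula Jon used, so treat this as a sketch, with invented passing-yard numbers.

```python
# Min-max rescaling: map each metric onto [0, 1] so different
# QB statistics can share one vertical scale.

def rescale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

passing_yards = [3500, 4100, 2900, 3800]  # invented numbers
print(rescale(passing_yards))  # [0.5, 1.0, 0.0, 0.75]
```

Each metric gets rescaled separately, so the best QB on a metric lands at 1 and the worst at 0.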
The Eli Manning / Philip Rivers comparison shows that both QBs were below average on most of these metrics, with Manning near the bottom of each.
However, it is very difficult to find a good way to show the information; in fact, the data contained very little of it. Curiously, the ratings are very dispersed, so that each line is graded high on some categories and low on others. Here's one view of it:
I have grouped the subway lines together (A/C/E, 4/5/6, etc.). The metrics are plotted left to right in the same order as in the original. Is it all noise and no signal?
(I just realized the vertical axis is reversed: best ratings are at the bottom, worst ratings at the top. Doesn't matter anyway since I can't see any patterns.)
Aleks pointed to an interesting Business Week chart used to explain what people in different age groups are doing on-line. This is a pretty chart that does an admirable job with a difficult data set.
The key to this chart, unfortunately missing, is that the percentages must be read as vertical columns to make sense. So the top left square says 34% of "Young Teens" who answered the survey said they create web pages on-line. In addition, the total of each column can be much more than 100% because multiple responses were allowed.
Realizing the above, we should interpret the bottom (grey) row as saying: "Older boomers" and "seniors" are more likely to be "Inactives" than younger people. A tempting interpretation is: "Inactives" are more likely to be "seniors" and "older boomers". But this is wrong because the chart hides the age distribution. While 70% of "Seniors" are inactive, "Seniors" may represent a small proportion of the population, and thus they may not account for a large proportion of "Inactives". This is the difference between prevalence and incidence rate. (Another way to grasp this is to add the percentages across a row and try and fail to understand what the row sum could mean.)
The construct of the square grids is less damaging than it seems. In effect, the data has been rescaled by dividing by 10. The reader is then forced to apply "rounding". If you are someone who sees $19.95 as $19, then you'd round down the partial rows. If you see $19.95 as $20, you'd round up the partial rows. So the designer has pushed you to think in terms of whole numbers between 0 and 10, in other words, in units of 10%, rather than units of 1% or, horror of horrors, 0.1% or at some other unrealistic precision.
Here's another example where the profile chart shines. Because the percentages don't sum up to 100%, the other alternatives like stacked bar charts and "Merrimeckos"/mosaic charts don't work. (Prior discussion of this issue here.)
This version gives a column view of the data, the lines linking percentages of each age group performing on-line activities. The profiles nicely cluster into three groups: the younger people are more likely to say they are "joiners", "spectators" or "creators" but less likely to be "inactives". We also see that the likelihood of being "Collectors" has little to do with age.
In a previous post, I explained the value of sketching when creating graphs. Today, I will share a few other graphs that plot the same data as we discussed the other day, regarding the proportion of time spent on developing different modules of software.
A stacked column chart, suggested by John J., would look like this:
Compared to the profile chart, this chart has some weaknesses:
it's difficult to read off the proportions for middle blocks like Blinksale-Billing;
because the middle blocks "float", it is impossible to compare them properly;
it requires as many colors as there are variables.
These problems get worse as the data set grows: it becomes more difficult to read off the data, and more colors are needed.
The Merrimecko, suggested by Bernard L., is the same chart as above except that the widths of the columns are made proportional to the relative number of lines of code. However, because these four companies do not make up the entire universe, proportional widths make little sense here.
The profile chart can be drawn up in two ways. These charts typically display the results of cluster analysis, a statistical data-mining technique that discovers groups of like objects within a large data set. Often, the computer will only tell you that these 15 objects belong to Cluster 1, those 22 form Cluster 2, and so on.
To figure out why the 15 belong together, the analyst needs to plot the explanatory variables against cluster index. Now, think of WuFoo, FeedBurner, etc. as clusters, and the proportion of code given to Application, etc. as variables.
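In code, profiling a clustering amounts to averaging each explanatory variable within each cluster and then plotting the resulting rows as lines. A minimal sketch, with invented cluster labels and code-share numbers:

```python
# Average each variable within each cluster; the resulting table is
# exactly what one would draw as a profile chart, one line per cluster.
# Cluster labels and shares below are invented.

rows = [
    # (cluster, app_share, support_share, billing_share)
    (1, 0.90, 0.05, 0.05),
    (1, 0.85, 0.10, 0.05),
    (2, 0.60, 0.20, 0.20),
    (2, 0.55, 0.25, 0.20),
]

def cluster_profiles(rows):
    sums, counts = {}, {}
    for cluster, *vals in rows:
        counts[cluster] = counts.get(cluster, 0) + 1
        acc = sums.setdefault(cluster, [0.0] * len(vals))
        for i, v in enumerate(vals):
            acc[i] += v
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

profiles = cluster_profiles(rows)
print(profiles)  # cluster 1 averages roughly [0.875, 0.075, 0.05]
```

Reading the averages across variables for each cluster is what reveals why the members belong together.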
While the line segments don't signify anything real, they trace out the precise paths our eyes would take when reading the stacked column chart above. Remember, we wanted to compare the number of lines given to each function across companies. If shown the column chart, my eyes would flip across the tops of the Application (blue) blocks from WuFoo to regonline. This path is exactly the brown line on our first profile chart.
The numbers for Marketing, Support and Billing are much easier to read too as they all start from zero for each company.
The right chart is another possibility but for this particular situation, I prefer the left one.
Finally, I am less familiar with the "parallel coordinates plot" that Derek talked about. I believe it is a variant of the profile chart.
Xan G. tells us that these "inconsistent pie charts ... make [his] head hurt". The dizzying array of colors is unfortunate, especially when "Application" gets a medium blue in three of the four pies but an orange-red in the other. Just like the baby names charts, it's important to keep the background constant when constructing small multiples.
"The goal of this section was to uncover any [software development] task that might be overlooked [by these startup companies]. When writing a software product, the tendency is to focus 100% on the application. Items like support, marketing, and especially billing never cross your mind."
The junkart version below is designed to bring out this one message: that Blinksale has distinguished itself from the rest by having spent more time developing code for purposes other than the application itself.
I removed the raw counts of lines of code and focused only on the relative proportions. The raw counts do nothing to advance the author's case.
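Dropping the counts in favor of shares is a one-liner. The line counts below are invented; only the arithmetic is the point.

```python
# Convert raw lines of code per module into each module's share of
# the company's code base. Counts are invented for illustration.
loc = {"Application": 8000, "Support": 600, "Marketing": 400, "Billing": 1000}

total = sum(loc.values())
shares = {module: count / total for module, count in loc.items()}
print(shares)  # Application dominates at 0.8 of the code base
```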
The pie charts fail our self-sufficiency test. The reader must rely on the data table and data labels to understand the chart. If removed, the key message is obscured.
Going public may just be the most important -- and nerve-racking -- decision any company will make. Managing and pricing an IPO is tricky, so picking the right underwriter is crucial. Bankers often boast of their league-table prowess to win mandates, but quantity does not necessarily mean quality.
By quantity, they meant the amount of underwriting fees (revenues) earned; and by quality, the average stock performance of the newly-public companies, as of Feb 16, 2007.
Ten banks were compared on the two Qs using this chart, which is best described as the "file folder chart".
Amusingly, its creator sized the height of each file according to the quality metric, which is the return % listed at the top right corner of each file. The files were sorted by decreasing quality. Since each file is a parallelogram, its area is proportional to quality.
However, the files overlap, preventing us from comparing the areas of the files. Besides, the point made in the article about the importance of both Qs is lost since this chart stressed quality over quantity. Quantity showed up as a low dot on the tallest file and a high dot on the shortest file.
The junkart version restores the balance. The blue lines highlight several banks that scored high on one metric but low on the other. The construct is a profile chart with only two variables.
Curious readers may wonder if there were only ten banks in the IPO underwriting market. Far from it. The chart designer introduced a selection bias: banks were included based on quantity, and then rated on quality. This means there may well be a boutique firm with small revenues but higher quality than any of the ten in the plot.
Furthermore, much useful information is missing, including the dispersion of returns, the number of deals, etc.
Reference: "Grading the IPO Underwriters", Institutional Investor, March 2007.