David Leonhardt's article on the graduation rates of public universities caught my attention for both graphical and statistical reasons.
11:45 PM | Permalink | Comments (13) | TrackBack (0)
We bring attention to a book on graphics written by Bernard Lebelle, a frequent contributor to this blog. The book came out in France earlier this year. The title is "Convaincre avec des graphiques efficaces sous Excel, PowerPoint ...", published by Eyrolles. Thankfully, because the book is largely visual, I don't need to know French to follow most of it. Here, I discuss a few interesting things:
On page 13, he discussed flow diagrams using the energy flow example that led to a long discussion on this blog. He proposed using a Marimekko chart instead.
On page 89, he showed a concentric circle chart (see below). This is a relatively simple train schedule showing the frequency of trains at each hour on each day of the week. It looks interesting because of the allusion to the clock, except that typical clocks have twelve hours rather than 24. I'd create a set of two charts, one for the first twelve hours, one for the second twelve.
This sort of chart is very limited in utility but it works well here because the data is entirely categorical - one or two trains per hour, hour of day, day of week - and in addition, the relationships are very simple. In fact, the reader/user does not need to read any trends, general patterns or estimate the size or shape of anything. The user is performing a simple search operation, that's it.
(The innermost circle is unlabelled so it is unclear what that signifies.)
Lebelle provided an alternative on page 90, which is essentially a data table, with time on the vertical dimension and calendar date on the horizontal, and the frequency inside the cells. This is more straightforward but less interesting.
On page 151, he mentioned the self-sufficiency test that we discussed often here. A graph should do more than just print all the data in the data set.
Lebelle is currently Senior Manager at Deloitte, the management consulting company, and he focuses on graphical construction in Excel. This is both a limitation and an advantage. Excel, of course, has many imperfections (don't get me started on the new and horrid Excel). However, Excel is still the most widely used graphing application, by far.
The book takes a perspective on charting that fits our philosophy very well. Here is a rough summary of the contents of the book (any mistakes are mine):
chapter 1: a summary of the key features of good charts... issues such as clarity and efficiency of the message are addressed
chapter 2: historical perspective, with examples from Playfair, Minard, Nightingale, etc. page 38 has an interesting table comparing the contributions of Bertin, Tukey, Tufte, Ware and Cleveland.
chapter 3: constructs of a chart such as axes, legends, etc. page 43 explains the difference between "information design", "infographics", "charts" and "information visualization". introduces chartjunk, data-ink ratio.
chapter 4: "decoding" of a chart. Discusses optical illusions, which I also consider to be fundamental to understanding the effect of charts on the audience. Talks about how different ways of displaying the same data are perceived differently. Interesting section (starting p.101) considering some quantitative theories about perception, citing Ernst Weber and Stanley Smith Stevens.
chapter 5: process of making a chart. The nitty-gritty things like transforming the data, picking a scale, etc.
chapter 6: examples. Also introduces a classification system for charts. It has one of those flowcharts which is supposed to allow someone to pick a type of chart based on whether the data is numeric or categorical, etc. I know this is very popular in engineering and scientific textbooks but I have never found any use for such flowcharts. There are 30-40 pages of charts here, which make a great resource for getting ideas.
chapter 7: exercises
chapter 8: resources
08:01 AM | Permalink | Comments (7) | TrackBack (0)
Via Gelman, here is a nifty book-buying map from Amazon, displaying the split between "red books" and "blue books" bought by Amazon users in each state in the months leading up to the 2004 and 2008 presidential elections.
Gelman noted the similarity between the Amazon map and the red-blue split of rich voters.
This post is about how to read a graph. Here are some things that come to mind looking at the map:
The more data is used to create a graph, the harder our task is to interpret it. But the pay-off for spending the time is all the sweeter. Happy graph-reading!
One final note: there is no doubt that this interactive map feature is a brilliant marketing move by Amazon. This is a great and fun way for readers to find interesting books.
Reference: "Amazon, U.S.A.", Gelman blog, Oct 5 2008.
01:06 AM | Permalink | Comments (4) | TrackBack (0)
Professor Gelman generally believes the red state, blue state paradigm is too simplistic to describe the American electorate. He has been sharing some of his work on his blog, and has just published a book about this topic. Recently he produced the following chart, which looks gimmicky but is crystal clear in its message.
Here, economic and social ideology are plotted on a scatter chart, with positive values indicating conservatism and negative values liberalism. Further, each state is represented twice on the chart, the red point for the Republicans and the blue for Democrats within the state.
This is a cluster analyst's dream data set. The absolute separation of the Republican cluster and the Democrat cluster is astounding: imagine a diagonal line perfectly classifying all points.
We should not miss a host of details:
Reference: Gelman, "Ranking states by conservatism/liberalism of their voters", June 30 2008.
07:10 PM | Permalink | Comments (6) | TrackBack (0)
Nathan from FlowingData announces a competition to win Tufte's classic book on visual representation of data. There are still a few days left to participate. While Tufte's more recent books have become repetitive, he has still published one of the most accessible books on this topic.
I also had the pleasure of reading Naomi Robbins' Creating More Effective Graphs. She adopts a cookbook format providing hints on graphs in one, two and more dimensions, scales, visual clarity and so on. Since she has already read Cleveland, Tufte, etc., she manages to put all that learning inside one cover. The page design - with half of every page blank - is refreshingly easy on the eyes. Inclusion of examples is generous.
Let's review her point of view on some of the topics we discuss frequently on Junk Charts:
Starting axis at zero: she thinks "all bar charts must include zero. However, the answer is not as clear for line charts or other charts for which we judge positions along a common scale." (p.240)
Jittering: she does not provide a clear guideline but gives an example of a strip chart with jittered dots, commenting that "it gives a much better indication of the distributions than would a plot without jittering" (p.85) so I infer that she's generally in favor.
Parallel coordinates plot / profile plot: she provides an example of such a plot on p.141 and describes how to read such a plot. Again, I infer she's in favor.
12:14 AM | Permalink | Comments (2) | TrackBack (0)
Graduation rates at 47 new small public high schools that have opened since 2002 are substantially higher than the citywide average, an indication that the Bloomberg administration’s decision to break up many large failing high schools has achieved some early success.
Most of the schools have made considerable advances over the low-performing large high schools they replaced. Eight schools out of the 47 small schools graduated more than 90 percent of their students.
This graphic included in the NYT article lent support to the "small schools movement". In particular, note the last sentence of the above quotation: it incorporates the oft-used device of subgroup support of a hypothesis, in this case, the subgroup of eight top-performing schools.
Such analysis is "dangerous", according to Howard Wainer, who discusses this and other examples of misapplication in a recent article in American Scientist, entitled "The Most Dangerous Equation". He alleged that billions have been wasted in the pursuit of small schools.
The issue concerns sample size. Dr. Wainer and associates analyzed math scores from Pennsylvania public schools.
Average scores for smaller schools are based on a smaller number of students, and are therefore less stable (more variable). More variability means more extremes. Thus, by chance alone, we expect small schools to be over-represented among the top performers. Similarly, by chance alone, we expect them to be over-represented among the worst performers.
The scatter plot lays out their argument. Focusing only on the top performers (blue dots), one might conclude that smaller schools do better. However, when the bottom performers (green) are also considered, the story no longer holds. Indeed, the regression line is essentially flat, indicating that scores are not correlated with school size.
This is all nicely explained via the standard error formula (De Moivre's equation) in Dr. Wainer's article. Here is a NYT article from the mid 1990s describing this same phenomenon.
File this as another comparability problem. Because estimates based on smaller samples are less reliable, one must take extra care when comparing small samples to large samples.
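The sample-size effect is easy to reproduce in a small simulation. The sketch below is only an illustration, not Dr. Wainer's analysis: the school sizes (25 and 400 students), the score distribution (mean 50, standard deviation 10) and the function name are all assumptions of mine. Every student is drawn from the same distribution, so school size has no real effect, yet small schools dominate both the top and the bottom of the ranking. De Moivre's equation says the standard error of a mean is sigma / sqrt(n), which the last lines compute for the two sizes.

```python
import random
import statistics

def simulate_school_means(sizes, mu=50.0, sigma=10.0, seed=0):
    """Return (size, mean score) per school.

    All students come from the same distribution, so any difference
    between small and large schools is due to chance alone.
    """
    rng = random.Random(seed)
    results = []
    for n in sizes:
        scores = [rng.gauss(mu, sigma) for _ in range(n)]
        results.append((n, statistics.fmean(scores)))
    return results

# Hypothetical population: 500 small schools and 500 large schools.
SMALL, LARGE = 25, 400
sizes = [SMALL] * 500 + [LARGE] * 500
schools = sorted(simulate_school_means(sizes), key=lambda s: s[1])

# Count small schools among the 50 lowest and 50 highest averages.
k = 50
small_in_bottom = sum(1 for n, _ in schools[:k] if n == SMALL)
small_in_top = sum(1 for n, _ in schools[-k:] if n == SMALL)

# De Moivre's equation: standard error of the mean = sigma / sqrt(n).
se_small = 10.0 / SMALL ** 0.5
se_large = 10.0 / LARGE ** 0.5

print(f"small schools among top {k}: {small_in_top}")
print(f"small schools among bottom {k}: {small_in_bottom}")
print(f"standard error, small vs large: {se_small} vs {se_large}")
```

Looking only at the top of the list, one would credit small schools; looking at both ends shows it is the variability, not the size, that is doing the work.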
Dr. Wainer is publishing a new book next year, called "The Second Watch: navigating the uncertain world". I'm eagerly looking forward to it. His previous books Graphic Discovery and Visual Revelations are both part of the Junk Charts collection.
Sources: "The Most Dangerous Equation", American Scientist, November 2007; "Small Schools Are Ahead in Graduation", New York Times, June 30 2007.
P.S. Referring back to the NYT chart above, one might wonder at the impossible feat of raising graduation rates across the board simply by breaking up large schools into smaller ones. This topic was taken up here, here and here. When evaluating the "small schools" policy, it is a mistake to discuss only the performance of small schools; any responsible analysis must look at improvement over all schools. Otherwise, it's a simple matter of letting small schools skim off the cream from larger schools.
08:57 PM | Permalink | Comments (3) | TrackBack (0)
Here's something different, a mini book review of Ian Ayres's "Super Crunchers". This book can be recommended to anyone interested in what statisticians and data analysts do for a living. Ian is to be congratulated for making an abstruse subject lively.
His main thesis is that data analysis beats intuition and expertise in many decision-making processes; and therefore it is important for everyone to have a basic notion of the two powerful tools of regression and randomization. He correctly points out that the ready availability of large amounts of data in recent times has empowered data analysts.
Regression is a statistical workhorse often used for prediction based on historical data. Randomization refers to assigning subjects at random to multiple groups, and then examining if differential treatment by group leads to differential response. (In particular, the chapter on randomization covers the topic well.) Using regression to analyze data collected from randomized experiments allows one to establish cause-effect.
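To see the two tools working together, here is a minimal sketch with made-up data; it is not an example from the book. The sample size, the response distribution and the true effect of 3.0 are assumptions chosen for illustration. Subjects are assigned to treatment by a coin flip, and a regression of the response on the 0/1 treatment indicator recovers the effect. With a single binary predictor, the regression slope is exactly the difference between the two group means.

```python
import random
import statistics

def randomized_experiment(n=2000, true_effect=3.0, seed=1):
    """Simulate a randomized experiment with a known treatment effect."""
    rng = random.Random(seed)
    # Randomization: a coin flip, unrelated to the subject, decides treatment.
    treated = [rng.random() < 0.5 for _ in range(n)]
    # Hypothetical response: noisy baseline, plus the effect if treated.
    response = [rng.gauss(20.0, 5.0) + (true_effect if t else 0.0)
                for t in treated]
    return treated, response

def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

treated, response = randomized_experiment()
x = [1.0 if t else 0.0 for t in treated]

treated_mean = statistics.fmean([r for r, t in zip(response, treated) if t])
control_mean = statistics.fmean([r for r, t in zip(response, treated) if not t])
diff_in_means = treated_mean - control_mean
slope = ols_slope(x, response)

print(f"difference in group means: {diff_in_means:.3f}")
print(f"regression slope:          {slope:.3f}")
```

Because the coin flip is independent of everything else about the subjects, the estimated slope can be read as a causal effect, which is not true of a regression on observational data.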
In the following, I offer a second helping for those who have tasted Ian's first course:
08:22 AM | Permalink | Comments (7) | TrackBack (0)
I've been reading my friend's anti-smoking tome, and traced this "infographic" back to its source (World Health Organization).
I was very intrigued by the "lines of death" which seemed to make the point that the risk of death had a spatial correlation: specifically, that the death risk for male smokers was higher in the northern hemisphere (above the line), primarily developed countries, as compared to the southern hemisphere, mostly developing nations.
I find that somewhat counter-intuitive but in a fascinating book like this, one that brings together scientific, psychological and societal commentary, I was expecting to learn new things.
Looking at the legend, the red areas were regions in which deaths from tobacco use accounted for over 25% of "total deaths among men and women over 35". This explained some, as perhaps there were more reasons to die (warfare, other diseases, mine accidents, etc.) in developing nations than in developed nations, or that they had larger populations (so more deaths even at lower rates).
However, the description of the "lines of death" raised my eyebrows. It is now claimed that more than 25% of middle-aged people (35-69 years old) die from tobacco use in the red regions.
Did they mean 25% of the dead middle-aged people die from smoking? Or 25% of all middle-aged folks die from smoking? A gigantic difference!
Percentages are very tricky things to use. Every time I see a percentage, the first thing I ask is what is the base population. Here, the baseline appeared to have gotten lost in translation.
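A quick piece of arithmetic shows how far apart the two readings are. The numbers below are hypothetical, picked only to make the point; they are not from the atlas. Take a cohort of 1,000 middle-aged people, of whom 200 die in that age range, 50 of them from tobacco use.

```python
def tobacco_shares(cohort_size, total_deaths, tobacco_deaths):
    """Return the tobacco share under each choice of base population."""
    share_of_deaths = tobacco_deaths / total_deaths   # base: those who died
    share_of_people = tobacco_deaths / cohort_size    # base: everyone in the cohort
    return share_of_deaths, share_of_people

# Hypothetical cohort, for illustration only.
share_of_deaths, share_of_people = tobacco_shares(
    cohort_size=1000, total_deaths=200, tobacco_deaths=50
)

print(f"share of middle-aged deaths due to tobacco: {share_of_deaths:.0%}")
print(f"share of all middle-aged people who die from tobacco: {share_of_people:.0%}")
```

The same 50 deaths read as 25% under one base and 5% under the other, which is why the base population has to be stated.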
This set of maps also shows the peril of focusing too much on entertainment value, and losing the plot.
For those concerned about the effect of smoking on our society and our children, I recommend Dr. Rabinoff's highly readable new book, "Ending the tobacco holocaust". It contains lots of interesting tidbits and really brings together every cogent argument that exists, including the common ones you've heard and others you haven't.
Reference: "Ending the tobacco holocaust" by Michael Rabinoff; The Tobacco Atlas by the World Health Organization
01:32 AM | Permalink | Comments (7) | TrackBack (0)

Behind the smokescreen lies the informative conclusion: among households with smokers, about 40% smoke in residence all the time while about half never smoke in residence.
This graphic, unfortunately chosen, contains many distractions from the main message, including:
The last point merits special attention. The total sample contains households with smokers as well as households without smokers. Any data from the total sample is a weighted average of these two types of households. It is better to directly compare the two household types than to indirectly compare one type to the overall.
Further, households without smokers should be extremely likely to have no smoking in residence all week. And if most households have no smokers (76% of this sample), then the statistics of the total sample will mimic those of no-smoker households. That is to say, the total sample statistics do not add much to the analysis. Our junkart version below corrects for this as well as other things.
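The weighted-average point can be checked with a few lines of arithmetic. In the sketch below, the 76% share of no-smoker households and the "about half" of smoker households that never smoke in residence come from the discussion above; the 95% figure for no-smoker households is my assumption, standing in for "extremely likely".

```python
def total_sample_rate(weight_a, rate_a, rate_b):
    """Weighted average of two group rates; group B has weight 1 - weight_a."""
    return weight_a * rate_a + (1.0 - weight_a) * rate_b

share_no_smoker = 0.76   # share of households without smokers (from the survey)
never_no_smoker = 0.95   # ASSUMED rate of "no smoking in residence" in those households
never_smoker = 0.50      # about half of smoker households never smoke in residence

never_total = total_sample_rate(share_no_smoker, never_no_smoker, never_smoker)

print(f"no-smoker households: {never_no_smoker:.1%}")
print(f"smoker households:    {never_smoker:.1%}")
print(f"total sample:         {never_total:.1%}")
```

Under these assumptions the total-sample figure sits near the no-smoker figure and far from the smoker figure, so comparing smoker households to the total understates the gap between the two household types.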
One of the key functions of a graph is data reduction, i.e. to aggregate data in such a way as to expose the information contained within. Typically, a graph that uses aggregated data is clearer and stronger than one that plots every piece of data. In this example, by combining 1-6 days into a single category ("smokes in residence part of the week"), we have a graph that is much more readable.
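The aggregation step itself is simple. Here is a sketch using hypothetical counts for 100 smoker households (the survey's actual day-by-day counts are not reproduced here); the category labels and the function name are also mine.

```python
def collapse_days(counts_by_days):
    """Collapse counts for 0..7 smoking days per week into three categories."""
    collapsed = {"never": 0, "part of the week": 0, "all week": 0}
    for days, count in counts_by_days.items():
        if days == 0:
            collapsed["never"] += count
        elif days == 7:
            collapsed["all week"] += count
        else:  # 1 to 6 days
            collapsed["part of the week"] += count
    return collapsed

# HYPOTHETICAL counts: number of households by days per week with smoking in residence.
counts = {0: 50, 1: 2, 2: 2, 3: 2, 4: 1, 5: 2, 6: 1, 7: 40}
collapsed = collapse_days(counts)

for label, count in collapsed.items():
    print(f"{label:>16}: {count}")
```

Eight bars, six of them tiny, become three, and the two categories that carry the message stand out.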
I want to thank Dr. Mike Rabinoff for inspiring me to look up these second-hand smoking statistics. Mike recently published a book called "Ending the Tobacco Holocaust", which tells you more than you want to know about the tobacco industry.
Reference: "Second Hand Smoke Survey: Final Report", Madison Department of Public Health, Dec 2003.
01:48 AM | Permalink | Comments (2) | TrackBack (0)
I finally got around to reading "When Genius Failed", Roger Lowenstein's account of the spectacular collapse of LTCM, the hedge fund fronted by Scholes and Merton, Nobel laureates both.
It is a sobering read for anyone in the business of statistical prediction and modeling for sure.
What also caught my eye, and caused dismay, is how Lowenstein got basic statistical principles wrong in the book. He used the bully pulpit to sound the usual alarm against the normality assumption and for fat tails. He began by confusing the law of large numbers (LLN) with the central limit theorem (CLT):
Statisticians have long been aware of the "law of large numbers". Roughly speaking, if you have enough samples of a random event, they will tend to distribute in the familiar bell curve ...
In the same breath, he then equated two different probability distributions:
This is called the normal distribution, or in mathematical terms, the lognormal distribution.
Doesn't this say something about the state of statistical literacy?
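For readers who want the distinctions spelled out, here is a small simulation sketch; the distributions and sample sizes are my choices for illustration. The law of large numbers says the sample average settles down to the true mean. The central limit theorem says that averages of many samples are themselves distributed roughly as a bell curve, even when the raw data are skewed. And a lognormal variable is the exponential of a normal one, so it is positive and right-skewed, not the same thing as a normal.

```python
import math
import random
import statistics

rng = random.Random(42)

def sample_mean(n):
    """Mean of n draws from a skewed distribution (exponential, true mean 1)."""
    return statistics.fmean([rng.expovariate(1.0) for _ in range(n)])

# Law of large numbers: one large sample, and its mean is close to 1.
lln_mean = sample_mean(100_000)

# Central limit theorem: many means of modest samples look bell-shaped.
n = 50
means = [sample_mean(n) for _ in range(2000)]
se = 1.0 / math.sqrt(n)  # theoretical standard error of each mean
frac_within_1se = sum(1 for m in means if abs(m - 1.0) <= se) / len(means)

# Normal versus lognormal: X = exp(Z) with Z standard normal.
z = [rng.gauss(0.0, 1.0) for _ in range(20_000)]
x = [math.exp(v) for v in z]

print(f"LLN: mean of 100,000 draws = {lln_mean:.3f}")
print(f"CLT: share of sample means within one standard error = {frac_within_1se:.2f}")
print(f"normal:    mean {statistics.fmean(z):.2f}, median {statistics.median(z):.2f}")
print(f"lognormal: mean {statistics.fmean(x):.2f}, median {statistics.median(x):.2f}")
```

For a bell curve, roughly 68% of values fall within one standard deviation of the center, which is what the second line should approximate. The lognormal's mean sits well above its median, a sign of the right skew that the normal does not have.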
PS. Here is a link to Dunbar's "Inventing Money" (thanks Marc). It apparently came out before Lowenstein but didn't get as much press.
11:54 PM | Permalink | Comments (4) | TrackBack (0)

