My coworker pointed me to a Huffington Post article carrying a Bill Gates byline that contains some highly dubious analysis and a horrific chart. We presume Gates was fed this information by analysts, but even so, one wishes he wouldn't promote innumeracy. Then again, he has a history: a few years ago, Howard Wainer demolished the analysis his foundation used to channel lots of dollars into the "small schools" movement; I wrote about that before.
First, the offensive chart:
Using double axes earns justified heckles, but using two sets of gridlines is a scandal! A scatter plot is the default for this type of data. (See the next section for why this particular set of data is not informative anyway.)
I can't understand the choice of scale for the score axis. The orange line, for instance, seems to have a positive slope. In any case, since these scores are "scaled" and the "standard error" is about 1 (a number that is surprisingly hard to find, even on Google), the span between 300 and 400 on the score axis covers 100 units of standard error. By convention, an observation three standard errors away from the average is already considered rare. There is no conceivable way the average score could jump by that much.
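To make the arithmetic concrete, here is a back-of-envelope check; the standard error of 1 is the assumed figure cited above, not a number I can vouch for:

```python
# Rough sanity check on the score axis (assumes an SE of ~1 for the
# scaled mean score, as cited above; the exact value is hard to pin down).
se = 1.0                 # assumed standard error of the scaled mean score
axis_span = 400 - 300    # distance between two gridlines on the score axis
print(axis_span / se)    # 100 standard errors; a 3-SE move is already rare
```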
The analysis is also flawed. Here's the key paragraph:
Over the last four decades, the per-student cost of running our K-12 schools has more than doubled, while our student achievement has remained flat, and other countries have raced ahead. The same pattern holds for higher education. Spending has climbed, but our percentage of college graduates has dropped compared to other countries... For more than 30 years, spending has risen while performance stayed flat. Now we need to raise performance without spending a lot more.
This argument contains several statistical fallacies:
Comparing apples and oranges: a glaring piece of missing information is whether other countries have also increased their per-student spending on education, and if so, how fast that spending grew compared with the U.S. Without this, the analysis makes no sense.
Confusing correlation and causation: spending increased while test scores stagnated. To conclude that there is something wrong with the spending, one must first believe that spending has a causal effect on test scores. Observe that this is not a conclusion from the data; it is an assumption going into the analysis, neither supported nor disputed by the data, since the data merely show a (lack of) correlation. This is another instance of "story time": we see data, we see a conclusion, and we are misled into thinking the data supports the conclusion when in fact the data is an irrelevant distraction. (For other instances of "story time", see this link to my book blog.)
Fallacies #1 and #2 combined: even if you believe that spending affects test scores, it is still a stretch to say that spending in U.S. schools affects the gap in test scores between U.S. students and foreign students. In a world where foreign countries are frozen in time, maybe so; but when foreign countries are also investing in education, one can't say anything about the test-score gap without first knowing what's going on overseas.
Assumption invalidating the analysis: within a single breath, the analyst admits the possibility of (a) spending increasing while scores stay flat and (b) scores increasing while spending stays flat. One model under which both of those possibilities coexist is one in which test scores are independent of spending. If so, why would one even look at a plot of these two quantities?
The dilemma of being together (a la Chapter 3 of Numbers Rule Your World): sorry to say, but per-pupil spending is likely to have a highly skewed distribution across school districts. Average test scores are also likely to vary a great deal from district to district. Thus, using a single average for the entire country muddies the water.
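To illustrate the skew problem, here is a small simulation; the spending distribution below is invented for illustration, not actual district data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-pupil spending across 1,000 districts: a right-skewed
# (lognormal) distribution, as budgets tend to be.
spending = rng.lognormal(mean=9.2, sigma=0.5, size=1000)

print(f"mean spending:   ${spending.mean():,.0f}")
print(f"median spending: ${np.median(spending):,.0f}")
# The mean sits well above the median: a single national average is
# pulled up by a minority of high-spending districts and says little
# about the typical district.
```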
Needless to say, test scores are a poor measure of the quality of education, especially in light of the frequent discovery of large-scale, coordinated cheating by principals and teachers, driven by the perverse incentives of the high-stakes testing movement.
In the same article, Gates asserts that the quality of teaching is the most decisive factor explaining student achievement. Which study establishes this, we are not told. How one can measure such an intangible quantity as "excellent teaching", we are not told. How student achievement is defined, well, you guessed it, we are not told.
It's great that the Gates Foundation supports investment in education. Apparently they need some statistical expertise so that they don't waste more money on unproductive projects based on innumerate analyses.
Given the recent post questioning the value of the MBA degree, one would think the powers-that-be at the Economist would not be staffing up on MBAs. But then, if not for useless MBAs, how would the Economist explain this chart, printed next to the said article?
This chart appears to tell us that all the top MBA programs succeed in reducing their students' earning potential. In each case, the "pre-MBA salary" exceeds the "salary on graduation".
More likely, the red part is the incremental salary, possibly explained by the value of the degree, while the gray part is the pre-MBA salary.
However, since the author has few nice words to say about business schools, one can never be 100% sure whether he is presenting some counter-intuitive data.
In the Trifecta checkup, one would find nothing wrong with the chart type, nor is there anything wrong with asking about the return on investment of an MBA degree.
The third component -- having the right data -- is what renders this effort a failure. It is too simplistic to measure the return on investment by the salary upon graduation. Surely one must also include future career paths, the intangible benefits of network relationships, personal development, and so on.
In a prior post, I showed a chart of PISA test scores that can be used to investigate differences between any pair of countries. At least one reader found it confusing, containing too much data. I then realized that if the objective of the chart is re-stated as "How did the UK fare relative to other OECD countries?", which was the intent of the original Guardian chart, the chart can be presented in the following simplified fashion:
Simplification can be achieved in many ways, one of which is simplifying the objective. In fact, I would not be opposed to showing just the left side of the chart, which addresses an even more general question: how the countries fared overall.
While the lines in the Guardian chart display the correlations of math, reading and science scores within specific countries (essentially a parallel coordinates plot), the same correlations can be visualized in a scatterplot matrix (see this post).
Each scatter plot here relates the scores of two subject areas as indicated by the axis labels. The simplest observation is the high degree of positive correlation on all three panels: in other words, countries in general do well in all three subjects, or poorly in all three subjects.
This pattern explains why it isn't very productive to focus readers' attention on this set of correlations when dealing with this data set.
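For readers who want to try this at home, here is a minimal sketch of a scatterplot matrix in pandas; the file name and column names are hypothetical stand-ins for the PISA data:

```python
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

# Hypothetical file layout: one row per country, with columns
# "math", "reading", "science" holding the mean PISA scores.
scores = pd.read_csv("pisa_scores.csv")

# One panel per pair of subjects; the diagonal shows each subject's histogram.
scatter_matrix(scores[["math", "reading", "science"]], figsize=(8, 8))
plt.show()
```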
You'll notice the use of colored dots on the scatter plots. Imagine that I have put the countries into groups based on overall scores (rather than just reading scores) as in my earlier analysis. The dots of the same color represent countries that are deemed to have performed similarly. The black cross indicates the "average country".
Focusing on the colors for the moment, you can confirm yet again that a country doing well in one subject is highly predictive of it doing well in the other subjects.
As I pointed out at the start of the prior post, a little statistical technique allows us to understand the data better, and plotting summaries of the data allows us to draw more interesting conclusions than putting all the data, unperturbed, onto a canvas.
Information graphics is one of many terms used to describe charts showing data -- and a very ambitious one at that. It promises the delivery of "information". Too often, readers are disappointed, sometimes because the "information" cannot be found on the chart, and sometimes because the "information" is resolutely hidden behind thickets.
Statistical techniques are useful to expose the hidden information. They work by getting rid of the extraneous or misleading bits of data, and by accentuating the most informative parts. A statistical graphic distinguishes itself by not showing all the raw data.
Here is the Guardian's take on the OECD PISA scores that were released recently. (Perhaps some of you are playing around with this data, which I featured in the Open Call... alas, no takers so far.) I only excerpted the top part of the chart.
This graphic is not bad; it could have been much worse, and I'm sure much worse examples are out there.
But think about this for a moment: what question did the designer hope to address with this chart? The headline promises a comparison of the UK against other OECD countries, a simple objective that does not justify such a complex chart.
The most noticeable feature is the set of line segments showing the correlation of ranks among the three subject areas within each country. So, South Korea is ranked first in reading and math, and third in science. Equally prominent is the ranking of countries on the left-hand side of the chart (which, on inspection, ranks the reading scores); this ranking also determines the colors, another eye-catching part of this chart. (The thick black UK line is, of course, important as well.)
In my opinion, those are not the three or four most interesting questions about this data set. In such a rich data set, there could be dozens of interesting questions. I'm not arguing that we have to agree on which ones are the most important. I'm saying the designer should be clear in his or her own mind which questions are being answered -- prior to digging around in the data.
***

With that in mind, I decided that a popular question concerns the comparison of scores between any pair of countries. From there, I worked on how to simplify the data to bring out the "information". Specifically, I used a little statistics to classify countries into 7 groups; countries within each group are judged to have performed equally well on the test, and any differences among them could be considered statistical noise. (I will discuss how I put countries into these groups in a future post; a rough sketch appears below. Here I focus on the chart.)
Here is the result: (PS. I just realized the axis should be labelled "PISA Reading Score Differentials from the Reference Country Group", as it shows pairwise differences, not scores.)
Each row uses one of the country groups as the reference level. For example, the first row shows that Finland and South Korea, the two best-performing countries, did significantly better than all other country groups except those in A2. The relative distance of each set of countries from the reference level is meaningful, and indicates how much worse they did.
(The standard error seems to be about 3-6, based on a table I found on the web, which may or may not be correct. This value leads to very high standardized score differentials, indicating that the spread between countries is very wide. I have done this for the reading test only. The test scores were standardized, which is not necessary if we are only concerned with the reading test; but since I was also looking at correlations among the three subjects, I chose to standardize the scores, which is another way of saying I put them on an identical scale.)
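As promised, the details of the grouping will come in a future post. In the meantime, the sketch below shows one plausible way to do it, assuming the scores sit in a pandas Series and using the rule of thumb that a new group starts whenever a country falls more than two standard errors below the best score of the current group; my actual procedure may differ:

```python
import pandas as pd

def group_countries(scores: pd.Series, se: float = 4.5) -> dict:
    """Assign countries to groups; within a group, score differences
    are within noise (here, two standard errors of the group leader)."""
    ordered = scores.sort_values(ascending=False)
    groups, leader, g = {}, None, 0
    for country, score in ordered.items():
        if leader is None or leader - score > 2 * se:
            g += 1            # gap exceeds noise: start a new group
            leader = score    # the new group's best score
        groups[country] = g
    return groups

# If comparing across subjects, standardize first (identical scale):
# z = (scores - scores.mean()) / scores.std()
```

The default se of 4.5 is an assumed value within the 3-6 range mentioned above.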
Before settling on the above chart, I produced this version:
This post is getting too long, so I'll be brief on this next point. You may wonder whether having all 7 rows is redundant. The reason they are all there is that the pairwise comparisons lack "transitivity" once statistical significance enters: e.g., knowing how Finland compares with Sweden and how Sweden compares with the UK does not settle how Finland compares with the UK. The right way to read the chart is to fix on the reference country group, and only look at the differences between the reference group and each of the other groups. Differences between two country groups, neither of which is the reference group, should be ignored in this chart; instead, look up the two rows in which those groups serve as the reference.
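A small numerical example shows why significance does not chain across rows. Suppose the standard error of each group mean is 6 (within the 3-6 range quoted above), so the standard error of a difference between two means is sqrt(2) times that; the group means below are made up:

```python
import math

se_mean = 6.0
se_diff = math.sqrt(2) * se_mean    # ~8.5 for a difference of two means
threshold = 2 * se_diff             # ~17: rough significance cutoff

a, b, c = 540, 530, 520             # hypothetical group means
print(abs(a - b) > threshold)       # False: A vs B within noise
print(abs(b - c) > threshold)       # False: B vs C within noise
print(abs(a - c) > threshold)       # True:  A vs C significant
```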
Before that, I tried a more typical network graph. It looks "sophisticated" and is much more compact, but it contains less information than the previous chart, and it gets murkier as the number of entities increases. Readers have to work hard to dig out the interesting bits.
I noticed a burst of activity on Twitter with "Junk Charts" nominations, too many for me to take care of. So, I'm trying a new feature, the Open Call. It's your chance to start the conversation on these charts.
When Mike K. sent this in, he had a few comments, including "This is, of course, from the Chronicle of Higher Education" and, since it is "talking about a math course", that "very naively" we would "impose a higher standard." Should scientists be held to a higher standard? Lead by example, perhaps? I had the same feeling when I wrote the "Unscientific American" post about charts, published in Scientific American, that would flunk Ed Tufte's intro class.
In one word: confusion. Mike couldn't understand the relationship between the first row of bubbles and the second row of bubbles. It is as if the one course taken at Bronx Community College results in credits recognized everywhere! (You basically have to read all the footnotes to get some clues.)
Also note the usual confusion about areas and diameters.
In addition, the zero bubbles prove themselves to be the nothing that is. They expose the folly of using bubbles when the data series contains zeroes (not to mention negative numbers). We can visualize this problem:
It gets worse.
For those mathematically inclined, we actually have an impossible situation: the size-4 bubble really contains the zero bubble plus a size-4 bubble; that is, 0 + 4 = 4, but if 0 is drawn with positive area, then the area of the 4 on the left-hand side of the equation must be smaller than the area of the 4 on the right-hand side. So, basically, don't use bubble charts if your data contains zeroes.
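For those who insist on bubble charts anyway, here is a minimal sketch of the correct sizing rule -- the area, not the diameter, should be proportional to the value -- and of why a zero breaks it:

```python
import math

def bubble_radius(value: float, scale: float = 10.0) -> float:
    """Encode value in the AREA of the bubble: area = scale * value,
    so the radius grows with the square root, not linearly."""
    if value < 0:
        raise ValueError("bubble charts cannot encode negative values")
    return math.sqrt(scale * value / math.pi)

print(bubble_radius(4))   # ~3.57
print(bubble_radius(16))  # ~7.14: 4x the value, only 2x the radius
print(bubble_radius(0))   # 0.0: a faithful zero bubble is invisible
```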
The Times Higher Education magazine fancies itself an arbiter of good universities, and yet it appears not to have heard of Tufte, or to know why one should never, ever use 3-D pie charts.
Reader Cedric K. sent in this chart, with a note of dismay. Quick, which is most important: the pink, the blue or the green?
Something like the stacked bar chart shown below delivers the information more effectively. The section showing sub-categories can be omitted.
If, in fact, it is crucial for readers to know each weight to the second decimal place, then why not just print a data table? The beauty of a data table is that it can accommodate long text strings, which are needed in this case to explain clearly what the subcategories actually mean.
If one wants bells and whistles, one can add little bars to the right of the proportions to visualize the weights.
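In pandas, for instance, such in-table bars are a one-liner via the Styler; the weights below are hypothetical stand-ins for the actual ranking weights:

```python
import pandas as pd

# Hypothetical ranking weights (percent), stand-ins for the real table.
weights = pd.DataFrame(
    {"weight (%)": [30.0, 32.5, 37.5]},
    index=["Teaching", "Research", "Citations"],
)

# Render an HTML table with a small bar drawn behind each weight.
html = weights.style.bar(color="#9ecae1").to_html()
with open("weights.html", "w") as f:
    f.write(html)
```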
Reader Chris B. pointed us to this unfortunate chart, based on a one-question on-line poll conducted by Reader's Digest.
The data is highly structured: for each country, respondents, identified as male or female, are asked about their favorite methods of disciplining their kids. (At first, I thought "male" and "female" referred to the methods applied to sons versus daughters, but based on the summary paragraph, I now believe they refer to the genders of the respondents.)
The textual summary is extremely well-written, and successfully points to the most salient information (my italics and bolding):
Spare the rod, period. That's what parents across the globe told us when we asked how they discipline their children. Respondents in all 16 countries in this month's global survey picked a good talking-to as the best tactic for teaching a lesson, by a wide margin. Taking away a privilege placed second. Two other traditional forms of discipline - sending kids to their rooms and spanking - were the least favored choices in all but two countries. Among respondents who did favor physical punishment, men outnumbered women in every country except Canada, France, and India. Not a single woman in the United States expressed a preference for spanking.
Unfortunately, the graphical summary is a complete failure.
One feature working against the designer is that the general profiles of the responses are very similar across countries, so the differences are well hidden inside this small-multiples display.
The display also takes an elongated form, making it almost impossible to compare the top two countries with the bottom two.
When data has such strong structure, it is a blessing to the chart designer. In the first chart, I made a set of profile charts, in small multiples. On average, parents everywhere act very similarly. There are some subtle differences: one common pattern, occurring in the Philippines, Malaysia, India, France, Brazil, etc., is a preference for a talking-to over all other methods; another pattern, applying to the Netherlands, Spain, Australia, Canada, etc., is a talking-to, followed by taking away privileges, with sparing use of the other two methods.
In some countries, such as Australia, Brazil, Canada, Spain, and Italy, the gender of the respondents mattered little; but in the United States, for instance, female respondents are more likely to prefer a talking-to while the men liked using sticks.
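For the curious, a small-multiples display of this kind can be sketched quickly in matplotlib; the poll numbers below are made up to illustrate the layout, not the actual survey results:

```python
import matplotlib.pyplot as plt

# Hypothetical poll data: country -> % favoring each method.
methods = ["talking-to", "lose privilege", "room", "spanking"]
poll = {
    "Philippines": [62, 20, 10, 8],
    "Netherlands": [48, 40, 9, 3],
    "United States": [55, 30, 12, 3],
    "Canada": [50, 38, 9, 3],
}

fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharey=True)
for ax, (country, pct) in zip(axes.flat, poll.items()):
    ax.plot(methods, pct, marker="o")          # one profile per panel
    ax.set_title(country)
    ax.tick_params(axis="x", labelrotation=30)  # slant category labels
plt.tight_layout()
plt.show()
```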
Is it really the case that parents punish sons and daughters using the same methods? This poll seems to think so.
If we want to expose the minute differences at the level of country-gender, then something like this would do:
The purpose is to surface any outliers. I really can't say there are any here. The supposed reversal of responses by gender in India, France, and Canada is hardly worth noting, since the physical-punishment category is hardly used. (A reflection of reality, or response bias due to a sensitive subject?)
Notice that these new charts do not have the data printed on them; the graphical elements are sufficient to show what the data is. Readers are not auditors.
Climategate is all the rage at the moment. What interests me about this episode is not the integrity of certain scientists, or science in general, nor the culture of academia, and certainly not the evidence of climate change. For me, the real climategate is the woeful state of statistical education. Let me explain.
Here is the infamous email: (via Nathan Silver, with my highlights)
From: Phil Jones
To: ray bradley, mann@[snipped], mhughes@[snipped]
Subject: Diagram for WMO Statement
Date: Tue, 16 Nov 1999 13:31:15 +0000
Cc: k.briffa@[snipped], t.osborn@[snipped]

Dear Ray, Mike and Malcolm,
Once Tim’s got a diagram here we’ll send that either later today or first thing tomorrow. I’ve just completed Mike’s Nature trick of adding in the real temps to each series for the last 20 years (ie from 1981 onwards) amd [sic] from1961 for Keith’s to hide the decline. Mike’s series got the annual land and marine values while the other two got April-Sept for NH land N of 20N. The latter two are real for 1999, while the estimate for 1999 for NH combined is +0.44C wrt 61-90. The Global estimate for 1999 with data through Oct is +0.35C cf. 0.57 for 1998.
Thanks for the comments, Ray.
What concerns me is Phil Jones describing what he did as a "trick" to "hide the decline". He apparently thought he was doing something shameful. But when is it shameful to extend the plot of a time series so as to display the long-term trend, and not be misled by short-term fluctuations? This is providing statistical context to the data being examined. Lots of people are condemning this as a willful act to mislead the public, but anyone with some statistical literacy will understand that finding the appropriate time scale at which to look at the data is one of the most important tasks in analyzing time-series data. It's a problem when even prominent scientists do not comprehend why they should be doing this.
I have always wondered why, in climatology as well as in economics, we rarely see decomposed time-series plots (at least not in public view).
On the right is a plot I found online of a decomposition of beer sales, separating out the seasonality, the trend, and the other components of the time series. The original data is shown at the top. In practice, newspapers and blogs give us the raw-data plot all the time when they should show us the third plot down (the trend with the seasonal factor removed), unless the story is about seasonality.
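Producing such a decomposition takes only a few lines in most statistical software; here is a sketch using statsmodels on made-up monthly beer sales:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

rng = np.random.default_rng(1)

# Made-up monthly beer sales: upward trend + yearly cycle + noise.
months = pd.date_range("2000-01", periods=96, freq="MS")
t = np.arange(96)
sales = 100 + 0.5 * t + 15 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 3, 96)
series = pd.Series(sales, index=months)

# Split the series into trend, seasonal, and residual components.
result = seasonal_decompose(series, model="additive", period=12)
result.plot()   # four panels: observed, trend, seasonal, residual
# The trend panel is the one newspapers should usually show.
```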
Note to self: should include basic time-series decomposition in the intro stats syllabus; much too important a topic to leave to a second course.