Be guided by the questions
Dec 16, 2010
Information graphics is one of many terms used to describe charts showing data -- and a very ambitious one at that. It promises the delivery of "information". Too often, readers are disappointed, sometimes because the "information" cannot be found on the chart, and sometimes because the "information" is resolutely hidden behind thickets.
Statistical techniques are useful to expose the hidden information. They work by getting rid of the extraneous or misleading bits of data, and by accentuating the most informative parts. A statistical graphic distinguishes itself by not showing all the raw data.
Here is the Guardian's take on the OECD PISA scores that were released recently. (Perhaps some of you are playing around with this data, which I featured in the Open Call... alas, no takers so far.) I only excerpted the top part of the chart.
This graphic is not bad. It could have been much worse, and I'm sure there are much worse ones out there.
But think about this for a moment: what question did the designer hope to address with this chart? The headline says it compares the UK against other OECD countries, which is a simple objective that does not justify such a complex chart.
The most noticeable features are the line segments showing the correlation of ranks among the three subject areas within each country. So, South Korea is ranked first in reading and math, and third in science. Equally prominent is the rank of countries shown on the left-hand side of the chart (which, on inspection, shows the ranking of reading scores); this ranking also determines the colors used, another eye-catching part of this chart. (The thick black UK line is, of course, also important.)
In my opinion, those are not the three or four most interesting questions about this data set. In such a rich data set, there could be dozens of interesting questions. I'm not arguing that we have to agree on which ones are the most prominent. I'm saying the designer should be clear in his or her own mind what questions are being answered -- prior to digging around the data.
With that in mind, I decided that a popular question concerns the comparison of scores between any pair of countries. From there, I worked on how to simplify the data to bring out the "information". Specifically, I used a little statistics to classify countries into 7 groups; countries within each group are judged to have performed equally well in the test and any difference could be considered statistical noise. (I will discuss how I put countries into these groups in a future post, just focusing on the chart here.)
Here is the result: (PS. Just realized the axis should be labelled "PISA Reading Score Differentials from the Reference Country Group" as they show pairwise differences, not scores.)
Each row uses one of the country groups as the reference level. For example, the first row shows that Finland and South Korea, the two best performing countries, did significantly better than all other country groups, except those in A2. The relative distance of each set of countries from the reference level is meaningful, and gives information about how much worse they did.
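The row-by-row reading can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the group labels `G1` to `G3` and their mean scores are placeholders, not PISA results, and `differentials_from_reference` is a hypothetical helper, not the code behind the chart.

```python
def differentials_from_reference(group_means, reference):
    """Return each other group's mean minus the reference group's mean."""
    ref = group_means[reference]
    return {g: m - ref for g, m in group_means.items() if g != reference}

# Placeholder group means on a standardized scale (NOT actual PISA scores).
group_means = {"G1": 2.0, "G2": 1.5, "G3": 0.0}

# One row of the chart per reference group.
rows = {ref: differentials_from_reference(group_means, ref) for ref in group_means}

for ref, diffs in rows.items():
    print(ref, diffs)
```

Each entry of `rows` corresponds to one row of the chart: the reference group sits at zero and every other group is placed at its signed distance from it.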
(The standard error seems to be about 3-6, based on a table I found on the web, which may or may not be correct. This value leads to very high standardized score differentials, indicating that the spread between countries is very wide.
I have done this for the reading test only. The test scores were standardized, which is not necessary if we are only concerned about the reading test. But since I was also looking at correlations between the three subjects, I chose to standardize the scores, which is another way of saying putting them on an identical scale.)
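For readers who want to see the arithmetic, here is a minimal sketch of the two ideas in that aside: putting scores on a common scale, and dividing a score difference by its standard error. The numbers are placeholders, not PISA figures, and the formula for the standard error of a difference assumes the two estimates are independent, which the post does not state.

```python
from math import sqrt
from statistics import mean, pstdev

def standardize(scores):
    """Convert raw scores to z-scores: (score - mean) / standard deviation."""
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation
    return [(s - mu) / sigma for s in scores]

def standardized_differential(score_a, score_b, se_a, se_b):
    """Score difference divided by the standard error of that difference.

    Assumes the two scores are independent estimates.
    """
    return (score_a - score_b) / sqrt(se_a**2 + se_b**2)

# Placeholder values (NOT actual PISA scores or standard errors).
z = standardize([480.0, 500.0, 520.0])
d = standardized_differential(520.0, 480.0, 3.0, 4.0)
print(z, d)
```

With standard errors of only a few points, even a moderate gap in raw scores becomes a large standardized differential, which is the effect described in the aside.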
Before settling on the above chart, I produced this version:
This post is getting too long so I'll be brief on this next point. You may wonder whether having all 7 rows is redundant. The reason they are all there is that the pairwise differences lack "transitivity": e.g., the difference between Finland and the UK is not the difference between Finland and Sweden plus the difference between Sweden and the UK. The right way to read the chart is to cling to the reference country group, and only look at the differences between the reference group and each of the other groups. The difference between two country groups, neither of which is the reference group, should be ignored in that row: instead, look up one of the two rows in which those groups serve as the reference.
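The post does not spell out why the differences fail to add up. One way it can happen, assuming each differential is divided by its own pair-specific standard error, is sketched below. The three entities `x`, `y`, `z` and their values are placeholders, not PISA figures, and `std_diff` is a hypothetical helper.

```python
from math import sqrt

def std_diff(a, b):
    """Standardized difference between two (mean, standard_error) pairs."""
    (mean_a, se_a), (mean_b, se_b) = a, b
    return (mean_a - mean_b) / sqrt(se_a**2 + se_b**2)

# Placeholder (mean, standard error) pairs (NOT actual PISA figures).
x, y, z = (530.0, 3.0), (500.0, 4.0), (490.0, 3.0)

direct = std_diff(x, z)                     # x vs z in one step
chained = std_diff(x, y) + std_diff(y, z)   # x vs y, plus y vs z

print(direct, chained)  # the two values differ
```

Because each pair has its own denominator, the chained value does not equal the direct one, so a differential can only be read against the reference in its own row.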
Before that, I tried a more typical network graph. It looks "sophisticated" and is much more compact but it contains less information than the previous chart, and gets murkier as the number of entities increases. Readers have to work hard to dig out the interesting bits.
Hate to say it, but the more complex chart you criticize gets and holds my attention. It also answers the question in an easy to read format. On the other hand, your attempts at simplification are not user friendly to a non-expert like me and completely lose my interest.
Best wishes for the holidays.
Posted by: Vida_Jay | Dec 16, 2010 at 02:51 AM
Your attempts seem reasonable, but it would be less disorienting if the groups were listed in the other order, i.e., with the higher performing ones to the right. The differences between relative ratings, which necessitate your seven rows, look like a minor effect. Couldn't you use some kind of pooled standard error and show them as a single series?
Posted by: Jon Peltier | Dec 16, 2010 at 07:15 AM
I like your way of looking at the problem. I always find it challenging to visualize such a seemingly easy task as multiple comparisons. Having seven rows makes sense because, depending on the question being asked, the reference category can be different. For me the biggest plus of your chart is having an axis to judge the relative differences between groups.
Posted by: Ульвия Ибрагимова | Dec 17, 2010 at 10:08 AM