Reader Darryl guessed correctly that I'd be interested in this paper, in which the authors assert that chartjunk of the USA Today type is more "useful" than Tuftian "plain" graphics. (Via Information Aesthetics.) I applaud the attempt to put Ed Tufte's theories to a statistical test; I have written before about Bill Cleveland's experiments in the same vein. However, after reading the paper carefully, I must conclude that the design of the statistical experiment contains so many major flaws that it is hard to take the conclusion seriously.
Please see the companion post on the book blog for technical comments. This post focuses on conceptual issues.
RED FLAG 1
The sample size consisted of 20 students. Flip open any elementary statistics textbook and you will find the standard advice to be wary of experiments with fewer than 30 observations.
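For a sense of how little power 20 observations buy, here is a minimal simulation. The setup is mine, not the authors': purely for illustration, assume a two-group comparison with 10 participants per group and a true difference of half a standard deviation (a "medium" effect).

```python
import numpy as np
from scipy import stats

# Illustrative parameters only: 10 participants per group (20 total),
# true difference of 0.5 standard deviations between groups.
rng = np.random.default_rng(0)
n_per_group, effect, n_sims = 10, 0.5, 10_000

significant = 0
for _ in range(n_sims):
    plain = rng.normal(0.0, 1.0, n_per_group)
    chartjunk = rng.normal(effect, 1.0, n_per_group)
    _, p_value = stats.ttest_ind(plain, chartjunk)
    significant += p_value < 0.05

# Power: how often a two-sided t-test detects the effect at alpha = 0.05.
print(f"Estimated power: {significant / n_sims:.2f}")  # roughly 0.18
```

Even if chartjunk had a genuine medium-sized effect, a study this small would detect it less than one time in five.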
RED FLAG 2
No mention of how participants were "recruited". (Or, for that matter, how the experimenters were recruited; see RED FLAG 4.)
RED FLAG 3
The charts used in the study were mind-numbingly simple. The five examples given in the paper all contained data series with exactly five numbers. Many of the charts had little of interest. For example, the chart shown at right, titled "Diamonds were a girl's best friend", showed a rise and then a fall in diamond prices. Huh?
RED FLAG 4
The degree of subjectivity in this experiment is mind-boggling. Instead of using a multiple-choice test, a "single experimenter" conducted interviews with participants, asking open-ended questions. The answers were later scored by a "single experimenter". Whether the interviewer and the scorer were the same person is not stated. Nor are the identity of the experimenter, his/her affiliation, or how he/she was recruited.
RED FLAG 5
The interviewer was allowed, in fact instructed, to "prompt" participants until he/she was satisfied with the final answer. Multiple prompts were allowed, yet the scoring recorded only whether any prompting occurred; the number of prompts was ignored.
RED FLAG 6
Participants' responses were scored against a "checklist", but the checklist was not released with the paper. The guidelines for scoring were described in detail, but they appear to leave room for discretion (e.g., 2 points for "providing most of the relevant information"; what counts as "most"?). The transcripts of the interviews were not published either, so it is hard to assess the effect of (multiple) prompting.
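A standard safeguard against this kind of discretion is to have two or more independent scorers and report their agreement. Nothing of the sort appears in the paper; the sketch below, with rubric scores invented purely for illustration, shows how Cohen's kappa would quantify it.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 0-4 rubric scores from two independent raters for the
# same ten answers; these numbers are invented for illustration.
rater_a = [4, 2, 3, 0, 1, 4, 2, 2, 3, 1]
rater_b = [4, 1, 3, 0, 2, 4, 2, 3, 3, 1]

# Cohen's kappa corrects raw agreement for chance: 1.0 means perfect
# agreement, 0.0 means no better than chance.
print(f"Cohen's kappa: {cohen_kappa_score(rater_a, rater_b):.2f}")
```

With a single scorer and a rubric that leaves room for judgment, no such reliability figure can be computed at all.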
RED FLAG 7
Some of the questions posed to the participants after they viewed the charts were very silly. Q2 (values) was "What are the displayed categories and values?" Is a good chart defined as one that leads readers to retain displayed values? Not in my book.
RED FLAG 8
The participants were asked to inspect a succession of 14 charts "for as long as they needed", and then to answer questions. As a result, the effect being measured is hopelessly confounded with (1) memory capacity and (2) how much time each participant chose to spend reading the charts.
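To make the confounding concrete, here is a toy simulation, with every number invented: suppose embellished charts simply hold attention longer, and recall improves with viewing time, with no direct effect of chart style at all.

```python
import numpy as np

# Toy model; all parameters are invented for illustration.
rng = np.random.default_rng(1)
n = 1_000

# Embellished charts hold attention longer (viewing time is self-selected)...
embellished = rng.integers(0, 2, n)                # 0 = plain, 1 = chartjunk
view_time = 5 + 3 * embellished + rng.normal(0, 1, n)

# ...and recall depends on viewing time, with NO direct style effect.
score = 0.0 * embellished + 0.8 * view_time + rng.normal(0, 1, n)

naive_diff = score[embellished == 1].mean() - score[embellished == 0].mean()
print(f"Naive 'style effect': {naive_diff:.2f}")   # about 2.4, though the true style effect is 0
```

An uncontrolled comparison would happily report a "memorability" advantage for chartjunk that is nothing but extra viewing time.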
Just to underline RED FLAG 4 above, I cite the paragraph in which the researchers describe their subjective "scoring" system. (By "Holmes", they mean the USA Today-style chartjunk.)
To a participant looking at the Holmes 'Monstrous Costs' chart, we would ask question Q3: 'What is the basic trend of the chart?' If the participant responded, 'I don't understand', we would elaborate: 'Tell me whether the chart shows any changes and describe these changes.' The participant might answer 'The teeth get bigger every year.' This answer would score 1 point, as it is not a complete answer (with incorrect information about the period of the data reported) but provides at least some information that the bars increase. The experimenter would then provide additional prompts starting with 'Can you be more specific?' A complete answer scoring four points might be 'The chart shows that campaign expenditures by the house increased by about 50 million dollars every two years, starting in 1972 and ending in 1982.'