The small sample size used in the "useful chartjunk" paper is a major downer. Typically, small samples contain much "noise", making it difficult to find the "signal". (Recall the fallacy discussed by Howard Wainer concerning the small-schools movement.)
The authors, however, found several statistically significant differences. For example, participants were found to have greater ability to describe the "value message" of USA-Today-type (Holmes) charts relative to Tuftian (plain) charts showing the same dataset of 5 numbers. The chart below displays this result:
Even more shocking: the significance threshold was not merely passed but demolished. According to the paper, the p-values for the above tests were 0.003, 0.026 and 0.020 respectively. These are incredibly small p-values, especially when the sample size was only 20. (The p-value of 0.003 or 0.3% means that if both types of chart are equally effective, there is only a 0.3% chance that the 20 participants would do at least as well as they did on the USA-Today charts relative to the Tuftian charts. Thus, the observed result presented an almost bulletproof case that chartjunk was better. For more on how this works, see Chapter 5 of Numbers Rule Your World.)
How did the researchers overcome the small sample size? The short answer is: it appears that the experimenter consistently scored the Holmes charts higher than the plain charts for all participants, so the "signal" was very strong and able to rise above the noise.
It is hard to believe that Tuftian charts are so awful that everyone performs worse on them than on Holmes charts. I'm more inclined to believe that this result is due to too much subjectivity in the design of the experiment.
Warning: the rest of the post is technical.
Fortunately, the authors provided just enough data in the paper to unravel this mystery. I'll focus attention on the description task (the first set of columns in the figure above). Since the sample size is so small, we may suspect that significance is a result of participants being very similar to one another.
The figure above tells us that the metric being evaluated is the difference in sum of scores between the Holmes charts and the plain charts. Recall that each participant saw 6 Holmes charts, 6 plain charts, and 2 training charts (dropped from the analysis). The experimenter gave each chart a score between 0 and 3. Thus, the sum of scores for any one participant and one chart type could range from 0 to 18. The maximum difference in sum of scores would be 18-0 = 18.
Amazingly, the observed difference in sum of scores, averaged across 20 participants, was 1, since on average, the participants scored 5 on the Holmes charts and 4 on the plain charts. Put differently, on average, they scored 0.83 per Holmes chart, and 0.67 per plain chart. According to the scoring criteria, this means they were "mostly" to "all" incorrect for pretty much every chart.
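For readers who want to check the arithmetic, here is a minimal Python sketch of the score range and the per-chart averages. The variable names are mine, and the mean sums of 5 and 4 are simply the values quoted above, not raw data from the paper.

```python
# Scoring setup as described in the post: 6 charts per type, each scored 0 to 3.
CHARTS_PER_TYPE = 6
MAX_SCORE_PER_CHART = 3

# Largest possible sum of scores for one participant and one chart type.
max_sum = CHARTS_PER_TYPE * MAX_SCORE_PER_CHART
# Largest possible difference: a perfect sum on one type, zero on the other.
max_difference = max_sum - 0

# Mean sums of scores across the 20 participants (values quoted in the post).
mean_sum_holmes = 5.0
mean_sum_plain = 4.0

mean_difference = mean_sum_holmes - mean_sum_plain
per_chart_holmes = mean_sum_holmes / CHARTS_PER_TYPE
per_chart_plain = mean_sum_plain / CHARTS_PER_TYPE

print(f"max sum per chart type: {max_sum}")
print(f"max possible difference: {max_difference}")
print(f"observed mean difference: {mean_difference}")
print(f"per-chart average, Holmes: {per_chart_holmes:.2f}")
print(f"per-chart average, plain:  {per_chart_plain:.2f}")
```

Seen against a possible range of 0 to 18, an average difference of 1 is small in absolute terms, which is what makes its statistical significance so surprising.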
Based on the t statistic (t=3.37) provided in the paper, we can also estimate the variability across participants. Since the difference was 1.0, the "standard error" (of the difference) is 1.0/3.37, or about 0.3. Dividing by the square root of 2 (that is, assuming the two chart types have equal and independent variability), the standard deviation of each chart type's sum of scores was approx. 0.21. As a first-order approximation, if we assume the sums were normally distributed and use the 3-sigma rule, this implies that for the Holmes charts, the participants scored between 4.4 and 5.6 while for the plain charts, between 3.4 and 4.6. (This estimate appears to match the SE intervals shown in the figure above.)
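The back-of-envelope calculation can be sketched in a few lines of Python. This is only my reconstruction: it assumes the t statistic is the mean difference divided by its standard error, and that the two chart types contribute equal, independent variability (hence the division by the square root of 2). The helper `three_sigma_range` is a name I made up for illustration.

```python
import math

t_stat = 3.37      # t statistic reported in the paper
mean_diff = 1.0    # mean difference in sum of scores (Holmes minus plain)

# Assumption: t = mean difference / standard error of the difference.
se_diff = mean_diff / t_stat

# Assumption: both chart types have equal, independent variability,
# so each contributes se_diff / sqrt(2).
spread_per_type = se_diff / math.sqrt(2)

def three_sigma_range(center, sigma):
    """Return (low, high) for center +/- 3 * sigma."""
    return center - 3 * sigma, center + 3 * sigma

holmes_low, holmes_high = three_sigma_range(5.0, spread_per_type)
plain_low, plain_high = three_sigma_range(4.0, spread_per_type)

print(f"standard error of the difference: {se_diff:.2f}")
print(f"spread per chart type: {spread_per_type:.2f}")
print(f"Holmes range: {holmes_low:.1f} to {holmes_high:.1f}")
print(f"plain range:  {plain_low:.1f} to {plain_high:.1f}")
```

Under these assumptions the two ranges barely overlap, which is the heart of the puzzle.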
So, incredibly, pretty much everyone did more poorly on the plain charts than on the Holmes charts. Since the difference is so consistent, there is no need to have a large number of participants to prove the case!
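To see how a t statistic of this size turns into such a small p-value with only 20 participants, here is a rough Python sketch. It assumes a two-sided paired t-test with 19 degrees of freedom; the paper's exact test is not described in this post, so treat that as my assumption. The functions `t_pdf` and `two_sided_p` are illustrative helpers that integrate the Student t density numerically using only the standard library.

```python
import math

def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    # Each tail is 0.5 minus the area between 0 and |t|; double it for two sides.
    return 2 * (0.5 - area_zero_to_t)

# Assumption: paired test on 20 participants, so 19 degrees of freedom.
p_value = two_sided_p(3.37, df=19)
print(f"two-sided p-value for t=3.37, df=19: {p_value:.4f}")
```

If the assumption about the test is right, the result should land close to the 0.003 reported in the paper.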
The question is whether we believe in the scoring mechanism.