This post is a companion to my Junk Charts post on why we can't trust the research which purportedly showed that USA-Today chartjunk is "more useful" than Tuftian plain graphics. Here is an example of the two chart types they compared:

In this post, I discuss how to read a paper like this one, which describes a statistical experiment, and how to evaluate its validity.

***

First, note the sample size. They only interviewed 20 participants. This is the first big sign of trouble. Daniel Kahneman calls this the "law of small numbers", the fallacy of generalizing from the limited information in small samples. For a "painless" experiment of this sort, in which subjects are just asked to read a bunch of charts, there is no excuse for using such a small sample.

***

Next, tally up the research questions. At the minimum, the researchers claimed to have answered the following questions:

- Which chart type led to a better description of subject?
- Which chart type led to a better description of categories?
- Which chart type led to a better description of trend?
- Which chart type led to a better description of value message?
- Did chart type affect the total completion time of the description tasks?
- Which chart type led to a better immediate recall of subject?
- Which chart type led to a better immediate recall of categories?
- Which chart type led to a better immediate recall of trend?
- Which chart type led to a better immediate recall of value message?
- Which chart type led to a better long-term recall of subject?
- Which chart type led to a better long-term recall of categories?
- Which chart type led to a better long-term recall of trend?
- Which chart type led to a better long-term recall of value message?
- Which chart type led to more prompting during immediate recall of subject?
- Which chart type led to more prompting during immediate recall of categories?
- Which chart type led to more prompting during immediate recall of trend?
- Which chart type led to more prompting during immediate recall of value message?
- Which chart type led to more prompting during long-term recall of subject?
- Which chart type led to more prompting during long-term recall of categories?
- Which chart type led to more prompting during long-term recall of trend?
- Which chart type led to more prompting during long-term recall of value message?
- Which chart type did subjects prefer more?
- Which chart type did subjects most enjoy?
- Which chart type did subjects find most attractive?
- Which chart type did subjects find easiest to describe?
- Which chart type did subjects find easiest to remember?
- Which chart type did subjects find easiest to remember details?
- Which chart type did subjects find most accurate to describe?
- Which chart type did subjects find most accurate to remember?
- Which chart type did subjects find fastest to describe?
- Which chart type did subjects find fastest to remember?

I think I made my point. There were more research questions than participants. Why is this bad?

Let's do a back-of-the-envelope calculation. First, think about any one of these research questions. For a statistically significant result, we would need roughly 15 or more of the 20 participants to pick one chart type over the other. Now, if the subjects had no preference for one chart type over the other, what is the chance that at least one of the 31 questions above will yield a statistically significant difference? The answer is about 50%! Ouch. In other words, the probability of one or more false positive results in this experiment is roughly 50%.

For those wanting to see some math: Let's say I give you a fair coin for each of the 31 questions. Then, I ask you to flip each coin 20 times. What is the chance that at least one of these coins will show heads 15 or more times out of 20 flips? For any one fair coin, the chance of getting 15 or more heads in 20 flips is very small (about 2%). But if you repeat this with 31 coins, then there is a 47% chance that you will see at least one of the coins showing 15 or more heads out of 20 flips! The probability of at least one 2% event is 1 minus the probability of zero 2% events; assuming the coins are independent, the probability of zero 2% events is the probability of any given coin showing fewer than 15 heads in 20 flips (= 98%) multiplied by itself 31 times.
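The coin-flip arithmetic above can be checked with a few lines of Python. This is only a sketch under the same simplifying assumptions as the text: each question behaves like 20 flips of a fair coin, the threshold is 15 or more heads (one-sided), and the 31 questions are treated as independent. The function names `tail_prob` and `prob_any_false_positive` are made up for this illustration.

```python
from math import comb


def tail_prob(n_flips: int, min_heads: int) -> float:
    """P(at least min_heads heads in n_flips flips of a fair coin)."""
    favorable = sum(comb(n_flips, k) for k in range(min_heads, n_flips + 1))
    return favorable / 2 ** n_flips


def prob_any_false_positive(n_questions: int, per_question: float) -> float:
    """P(at least one of n_questions tests crosses the threshold).

    Assumes the tests are independent, which is a simplification.
    """
    return 1 - (1 - per_question) ** n_questions


# Numbers from the post: 20 participants, threshold of 15, 31 questions.
p_single = tail_prob(20, 15)
p_any = prob_any_false_positive(31, p_single)

print(f"one coin, 15+ heads in 20 flips: {p_single:.4f}")
print(f"at least one of 31 coins:        {p_any:.4f}")
```

Run as written, the single-coin probability comes out at about 2% and the chance of at least one "significant" coin among the 31 lands a little under 48%, in line with the rough 47% and "about 50%" figures quoted above.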

Technically, this is known as the "multiple comparisons" problem, and is particularly bad when a small sample size is juxtaposed with a large number of hypotheses.

***

Another check is needed on the nature of the significance, which I defer to a future post.

I can't bring myself to read the paper, but it sounds like they showed each subject multiple chart pairs, say k pairs, which would mean they had 20*k observations and not just 20. Still a hell of a multiple comparisons problem, though. Of course, the real problem here is that the experimental procedure can be fairly paraphrased as "someone with an ax to grind asked leading questions until he was happy with the answers".

Posted by: Cosma Shalizi | 05/14/2010 at 09:26 AM

Cosma: Not really. They are not randomizing treatment. Every one of 20 participants inspected both sets of charts. They should have used a paired-difference test but they didn't. The comparison is for the difference between the sum total of scores for chart type A and the sum total of scores for chart type B, replicated 20 times. So I think you have to call that 20 observations.

Posted by: Kaiser | 05/14/2010 at 10:24 AM

Are you sure? The paper says, "Each participant saw only one version of each chart, either Holmes or plain."

Posted by: Jerzy | 05/14/2010 at 04:53 PM

Jerzy: I know it's confusing because the way they set it up is super complicated. Each participant saw 14 charts: 7 of the Holmes variety and 7 of the plain variety. They alternate between Holmes and plain charts, and insert a blank chart to provide a "visual break".

So each participant is exposed to both "treatments" in alternating fashion.

However, for any given chart (better described as a given data set), each participant sees only one version. So in the above figure, you either saw the Holmes version or the plain version. But if you saw the Holmes version for the Diamond chart, your next chart would be a plain version.

Posted by: Kaiser | 05/14/2010 at 05:46 PM

"So I think you have to call that 20 observations."

If that's what they did, then yes, it's just n=20. Wow (and not in a good way). "We can perhaps say what the experiment died of."

Posted by: Cosma Shalizi | 05/15/2010 at 09:23 AM