Given several datasets (scatter plots), how does one tell random from non-random? We plot features of the data structure and hope to see abnormal behavior, which we take to indicate the presence of non-randomness.
In the last post, we noticed that the key distinguishing feature is the degree of variability in the data: for purely random data, we expect a certain degree of variability; non-random data betray their character by showing more or less variability than that at different points in the distribution (see right). So each scatter plot shows variability around the mean (0), but the variability apparently has a different shape in Dataset 9 (the real data) than in the other datasets. We saw this in a "cumulative distribution" plot and, less strongly, in a plot of the frequency of extreme values.
Given what we know now, a much simpler plot will suffice. The boxplot, an invention of John Tukey, a giant among statisticians, succinctly summarizes the key features of a distribution, such as the median (bold line), the 25th and 75th percentiles (edges of the box), the spread (height of the box), and outliers (individual dots above or below the box).
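For readers who want to see what goes into a boxplot, here is a minimal sketch in Python. It assumes the common Tukey convention of flagging points more than 1.5 times the box height beyond the box edges as outliers; plotting software may use slightly different quantile or whisker rules. The function name `boxplot_summary` and the counts are made up for illustration and are not the data from this post.

```python
import statistics

def boxplot_summary(values):
    """Return the numbers a boxplot draws, assuming Tukey's 1.5 * IQR outlier rule."""
    # Quartiles: the 25th percentile, the median, and the 75th percentile.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # the height of the box
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Anything beyond the fences would be drawn as an individual dot.
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "outliers": outliers}

# Made-up counts for illustration only (not the data discussed in this post).
counts = [2, 3, 3, 4, 5, 5, 6, 7, 40]
summary = boxplot_summary(counts)
print(summary)
```

With these made-up counts, the single large value falls beyond the upper fence and is the only point reported as an outlier.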
The evidence of non-randomness in Dataset 9 now starkly confronts us. Its box is much taller; its median line is noticeably lower than those of the other 8 datasets; and the extreme value of 61 is clearly out of whack.
Now, let's take a step back and look at what this non-randomness means. We are concluding that the choice of suicide location is not random. Location 69/70 (the outlier, with 61 deaths) is by far the most popular. Partly because of this, many of the other locations have had fewer than 20 deaths, which is fewer than expected if people had randomly selected locations.
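To make "randomly selected locations" concrete, here is a rough sketch of how comparison datasets of this kind could be simulated. It assumes "random" means every location is equally likely to be chosen, which may not be exactly how the random datasets in this series were produced. The values of `N_LOCATIONS` and `N_DEATHS` are hypothetical placeholders, not the real totals.

```python
import random
from collections import Counter

def simulate_random_choice(n_locations, n_deaths, rng):
    """Assign each death to a location uniformly at random; return the count per location."""
    picks = Counter(rng.randrange(n_locations) for _ in range(n_deaths))
    # Include locations that were never picked, so every location has a count.
    return [picks.get(loc, 0) for loc in range(n_locations)]

# Hypothetical sizes for illustration only; substitute the real totals.
N_LOCATIONS = 100
N_DEATHS = 1000

rng = random.Random(0)  # fixed seed so the sketch is reproducible
simulated = [simulate_random_choice(N_LOCATIONS, N_DEATHS, rng) for _ in range(8)]

for i, counts in enumerate(simulated, start=1):
    print(f"simulated dataset {i}: min={min(counts)}, max={max(counts)}")
```

Comparing the smallest and largest counts across the simulated datasets with those in the real data gives a sense of how far a single location's count can stray by chance alone.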
Next time, I will describe another way to compare distributions; this method is more advanced but also more direct.