There is a tendency to mistake complexity for randomness. Faced with lots of data, especially when squeezed into a small area, we often have trouble seeing patterns and so presume randomness -- when, upon careful analysis, distinctive patterns can be recognized.
We encountered this when looking at the "sad tally" of the Golden Gate Bridge suicides (here, here, here, here and here). Robert Kosara's recent work on scribbling maps of zip codes also highlights the hidden patterns behind seemingly random numbers.
Robert found a related example (via Information Aesthetics, originally here): the artist takes random numbers (lottery numbers), and renders them in a highly irrelevant graphical construct, as if to prove that spider webs can be generated randomly.
According to Infosthetics, each color represents a number between 1 and 49, which means the graph contains 49 colored zigzag lines (not counting gridlines and axes). Each point on the year axis represents a frequency of occurrence.
Imagine being tasked with using this chart to ascertain the fairness of the lottery, that is, the randomness of the winning numbers. The complexity of this spider web makes a tough job impossible! We must resist the temptation to conclude randomness from this non-evidence.
In fact, testing for randomness can be done using any of the methods described in the postings on the "Sad Tally" (links above). A first step will be to plot the frequency of occurrence data as a simple column chart with 1 to 49 on the horizontal axis. We'd like to show that the resulting histogram is flat, on average over all years.
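To make the first step concrete, here is a minimal sketch of that flatness check, using simulated lottery draws (the draw counts and the 6-numbers-per-draw format are assumptions for illustration, not the actual lottery history behind the chart). It tallies how often each of the numbers 1 to 49 appears -- the data behind the proposed column chart -- and computes a chi-square statistic measuring departure from a flat histogram:

```python
import random

random.seed(1)

# Hypothetical history: 520 draws, each picking 6 distinct numbers from 1-49.
draws = [random.sample(range(1, 50), 6) for _ in range(520)]

# Frequency of occurrence for each number -- the column chart data.
counts = {n: 0 for n in range(1, 50)}
for draw in draws:
    for n in draw:
        counts[n] += 1

# Chi-square goodness-of-fit statistic against a perfectly flat histogram.
total = sum(counts.values())
expected = total / 49
chi_sq = sum((c - expected) ** 2 / expected for c in counts.values())

print(chi_sq)
```

A large statistic (for 48 degrees of freedom, roughly above the mid-60s at the 5% level) would cast doubt on randomness; note that because numbers within a draw are sampled without replacement, the counts are slightly correlated, so this test is approximate.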
Given several datasets (scatter plots), how does one tell random from non-random? We plot features of the data structure and hope to see abnormal behavior, which we take to indicate the presence of non-randomness.
In the last post, we noticed that the key distinguishing feature is the degree of variability in the data: purely random data show a characteristic degree of variability; non-random data betray their character by showing more or less variability at different points in the distribution (see right). Each scatter plot shows variability around the mean (0), but that variability apparently has a different shape in Dataset 9 (the real data) than in the other datasets. We saw this in a "cumulative distribution" plot, and less strongly in a plot of the frequency of extreme values.
Given what we know now, a much simpler plot will suffice. The boxplot, an invention of John Tukey, a giant among statisticians, succinctly summarizes the key features of a distribution, such as the median (bold line), the 25th and 75th percentiles (edges of the box), the spread (height of the box), and outliers (individual dots above or below the box).
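The quantities a boxplot summarizes are easy to compute directly. Here is a minimal sketch, using Python's standard library on made-up data (the sample itself is an assumption for illustration), that extracts the median, the quartiles forming the box, and the outliers flagged by Tukey's usual 1.5-times-IQR rule:

```python
import random
import statistics

random.seed(2)
data = [random.gauss(0, 1) for _ in range(200)]  # illustrative sample

# The box: 25th percentile, median (bold line), 75th percentile.
q1, median, q3 = statistics.quantiles(data, n=4)

# The spread (height of the box) is the interquartile range.
iqr = q3 - q1

# Tukey's rule: points beyond 1.5 * IQR from the box edges are outliers,
# drawn as individual dots above or below the box.
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
```

Comparing these few numbers across the nine datasets is exactly what the side-by-side boxplots let the eye do at a glance.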
The evidence of non-randomness in Dataset 9 now starkly confronts us: its box is much taller; its median is noticeably lower than those of the other 8 datasets; and the extreme value of 61 is clearly out of whack.
Now, let's take a step back and consider what this non-randomness means. We are concluding that the choice of suicide location is not random. Location 69/70 (the outlier, with 61 deaths) is by far the most popular. Partly because of this, many of the other locations have had fewer than 20 deaths, fewer than we would expect if people had selected locations at random.
Next time, I will describe another way to compare distributions; this method is more advanced but also more direct.
John Shonder, a reader, alerted me to the following unusual chart which identifies the precise locations where people jumped from the Golden Gate Bridge:
He asks: is the choice of location "random"?
This is a very rich question and different statisticians will take different approaches. In this post, I take a purely visual, non-rigorous look at the question; and if I have time (and if other readers haven't commented already), I may discuss more rigorous methods in the future.
First, I restrict my attention to light poles 43 through 112, i.e. the bridge segment that lies above the water. Also, I only consider the north-south locations: in other words, 43 and 44 are counted as one, so are 111 and 112. Otherwise, the distribution is clearly biased (towards the water and the east side).
When we say "random", we usually mean there is equal chance that someone will jump from location 43/44, from location 111/112, or from any location in between. There are 35 locations and 755 documented suicides, averaging 21.6 suicides per location. But 21.6 is an average, which we would never observe directly: even if the choice of location were random, we would not find exactly 21-22 suicides at each location. (Similarly, even though a fair coin has a 50/50 chance of landing heads, in any given run of 100 flips it is very unlikely that we will see exactly 50 heads.)
So, at some locations we will see more than 21.6 deaths; at others, fewer. The question becomes whether the fluctuations are too much to refute the notion that the choice of location is random.
In the following set of graphs, I ran some simulations. Eight of the nine graphs represent scenarios in which I sent 755 people to the bridge and randomly assigned each of them one of the 35 locations to jump from (okay, this is a thought experiment only; please don't try this at home). The x-axis represents locations; the y-axis represents the number of suicides at each location -- but on a standardized scale.
The standardized scale allows us to compare across graphs. The zero line represents the mean number of suicides per location. The number of suicides at most locations is within one standard deviation of this mean (i.e. between -1 and 1 on the y-axis). In some extreme cases, the number of suicides is more than 3 standard deviations above the mean (i.e. greater than 3).
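One of the eight simulated panels can be sketched in a few lines: assign each of the 755 people a random location, tally the counts, then standardize by subtracting the mean and dividing by the standard deviation (the seed and variable names are mine, chosen for illustration):

```python
import random
import statistics

random.seed(4)
N_PEOPLE, N_LOCATIONS = 755, 35

# Each person independently picks one of the 35 locations at random.
counts = [0] * N_LOCATIONS
for _ in range(N_PEOPLE):
    counts[random.randrange(N_LOCATIONS)] += 1

# Standardize: mean maps to the zero line, units are standard deviations.
mean = statistics.mean(counts)       # should be near 755 / 35 = 21.6
sd = statistics.stdev(counts)        # sample standard deviation
standardized = [(c - mean) / sd for c in counts]
```

Repeating this eight times, and standardizing the real data the same way, produces the nine comparable panels; standardization is what puts the real counts and the simulated counts on the same footing.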
Back to randomness: well, one of the 9 graphs is the real data from the map above. If you can guess which of the 9 is real, then the real data is probably not random. If you can't, then the real data may be random!
I will publish the answer tomorrow. In the meantime, feel free to take a guess and/or comment on what other approach you'd take. One take-away from this exercise is that it's very hard to tell non-random from random unless it is very obvious.