Last week, I posed the question of how one can ascertain if a data set is "random" using Golden Gate Bridge suicide data compiled by the SF Chronicle. For review, see here (and less importantly, here). I plan to document my own exploration slowly over a few posts as the material may be a bit heavy for those not trained in statistics. Don't run away yet as I'll try to explain things in simple terms as best I can.
In the original plots, I compared eight randomly generated datasets with the real suicide data (bottom right graph here). The random data imagine a scenario in which death-seekers picked at random the locations from which to jump off the bridge. While in theory there should be about 21.6 suicides at each of the 35 locations, the random data exhibited a wide spread, with the death counts at some locations sitting more than 2 to 3 standard deviations away from the average (norm). Standard deviation is a way to measure the spread of data around the average value.
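For readers who want to try this at home, here is a minimal Python sketch of how such random datasets could be generated. It is not the code behind my plots. The total of 756 deaths is my back-calculation from 21.6 x 35, and the function name simulate_counts and the seed are made up for illustration.

```python
import math
import random
from collections import Counter
from statistics import mean, pstdev

N_LOCATIONS = 35
N_TOTAL = 756  # assumption: back-calculated as 21.6 x 35

def simulate_counts(n_total=N_TOTAL, n_locations=N_LOCATIONS, rng=None):
    """Assign each death to a location chosen uniformly at random
    and return the number of deaths at each location."""
    rng = rng or random.Random()
    picks = rng.choices(range(n_locations), k=n_total)
    tally = Counter(picks)
    return [tally.get(loc, 0) for loc in range(n_locations)]

# Eight random datasets, as in the original plots (the seed is arbitrary).
rng = random.Random(0)
datasets = [simulate_counts(rng=rng) for _ in range(8)]

# Spread expected at a single location under pure chance (binomial model).
p = 1 / N_LOCATIONS
theoretical_sd = math.sqrt(N_TOTAL * p * (1 - p))
print("theoretical sd per location:", round(theoretical_sd, 2))

for i, counts in enumerate(datasets, start=1):
    print(i, "mean", round(mean(counts), 1),
          "min", min(counts), "max", max(counts),
          "sd", round(pstdev(counts), 2))
```

Every simulated dataset has the same average by construction, so what changes from run to run is the minimum, the maximum and the spread.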
Staring at a graph like the one shown above, how does one decide whether the pattern is random or not? I will, for now, stick to visualization methods.
(1) Frequency of extreme values
Intuitively, if the data are random, the number of suicides should be close to the mean at most locations, with few extreme values. If the number of locations having extreme values in the real data is significantly higher (or lower) than the same count in the random data, one might reasonably conclude that the real data is not random.
In my chart, there are 9 lines, each representing a dataset. I highlighted the two unusual lines in red. We have some evidence that the real data (Dataset 9) is not random, since its line deviated from most of the other lines. This evidence is inconclusive, however, because Dataset 4 also strayed even though it was randomly generated.
This illustrates why the question of randomness is tricky: we rely on strange (rare) behavior to detect non-randomness but strange things can happen randomly too. Therefore, it is best to collect many pieces of evidence.
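The counting of extreme values can be sketched in a few lines of Python. This is only an illustration under my own assumptions: I take the standard deviation from a binomial model with 756 deaths in total (21.6 x 35), the helper name count_extremes is hypothetical, and the toy counts are made up rather than taken from the bridge data.

```python
import math

N_LOCATIONS = 35
EXPECTED = 21.6  # average per location quoted in the post

# Assumption: standard deviation at one location under a binomial model.
N_TOTAL = round(EXPECTED * N_LOCATIONS)  # 756
P = 1 / N_LOCATIONS
SD = math.sqrt(N_TOTAL * P * (1 - P))  # roughly 4.58

def count_extremes(counts, expected=EXPECTED, sd=SD, k=2.0):
    """Number of locations whose count lies more than k standard
    deviations away from the expected value."""
    return sum(1 for c in counts if abs(c - expected) > k * sd)

# Made-up counts for five locations, for illustration only.
toy_counts = [21, 22, 35, 5, 20]
print(count_extremes(toy_counts))  # prints 2 (the locations with 35 and 5)
```

Running the same function over many simulated datasets gives a sense of how many extreme locations chance alone tends to produce, which is the yardstick the real data should be held against.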
(2) Cumulative distribution
Like Robert, I plotted the cumulative distribution. This graph shows the % of locations with at most 10, 20, 30, etc. suicides. Again, each "staircase" line represents one of the 9 datasets. They all converge at the point (20, 0.6): that is, in each dataset, about 21 locations (60% x 35 locations = 21) had 20 suicides or fewer.
The red line clearly stood out in this chart. It is Dataset 9. So we have stronger evidence still that it is different from random data.
Observe that for random data (the black lines), a steep rise occurred between 14 and 35 suicides, indicating that if locations are picked at random, most of them will see approximately 14-35 suicides. By contrast, the red line rose fast at about 10 suicides or so, then saw less action in the 25+ range, and had an extreme outlier at 60. In other words, the variance (spread) in the real data was higher than would be expected in random data.
These visualizations, of extreme values and cumulative distribution, helped us compare the real and the random data. The difference showed up more clearly here than in the scatter plots.
In a future post, I will look at other ways, some simpler and some more complex, to get at the important differentiator: variance.
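To make the staircase idea concrete, here is a rough Python sketch of a cumulative distribution and a variance comparison. The two lists are toy numbers I invented to mimic a tight pattern and a spread-out pattern; they are not the bridge data, and the helper name ecdf is hypothetical.

```python
from statistics import pvariance

def ecdf(counts, thresholds):
    """For each threshold t, the fraction of locations with at most t suicides."""
    n = len(counts)
    return [sum(1 for c in counts if c <= t) / n for t in thresholds]

# Invented counts for eight locations, for illustration only.
toy_even = [18, 20, 21, 22, 23, 24, 25, 19]    # tightly clustered
toy_spread = [5, 8, 10, 12, 20, 26, 31, 60]    # early rise, one outlier at 60

thresholds = [10, 20, 30, 60]
print("even:  ", ecdf(toy_even, thresholds))
print("spread:", ecdf(toy_spread, thresholds))

# A wider spread shows up as a larger variance.
print("variance:", pvariance(toy_even), "vs", pvariance(toy_spread))
```

The spread-out list starts climbing at low thresholds and only reaches 100% at the outlier, while the clustered list does all its climbing in a narrow band. That is the same contrast as between the red and black lines.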