The sad tally

John Shonder, a reader, alerted me to the following unusual chart which identifies the precise locations where people jumped from the Golden Gate Bridge:

[Figure: map of the Golden Gate Bridge marking the locations where people jumped]

He asks: is the choice of location "random"?

This is a very rich question, and different statisticians will take different approaches. In this post, I take a purely visual, non-rigorous look at the question; if I have time (and if other readers haven't commented already), I may discuss more rigorous methods in the future.

First, I restrict my attention to light poles 43 through 112, i.e. the bridge segment that lies above the water. Also, I consider only the north-south position: in other words, poles 43 and 44 are counted as one location, as are 111 and 112. Otherwise, the distribution is clearly biased (towards the water and the east side).

When we say "random", we usually mean that someone is equally likely to jump from location 43/44, from location 111/112, or from any location in between. There are 35 locations and 755 documented suicides, which averages to 21.6 suicides per location. But 21.6 is only an average, and cannot itself be observed; even if the choice of location is random, we would not find exactly 21 or 22 suicides at each location. (Similarly, even if there is a 50/50 chance of getting a head when we flip a coin, in any given run of 100 flips, it is very unlikely that we will see exactly 50 heads.)
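To put rough numbers on this, here is a minimal Python sketch. It assumes each of the 755 people picks one of the 35 locations independently and with equal probability, so the count at any single location is binomial; that model is my reading of "random" here, not a result from the data. It also computes the chance of seeing exactly 50 heads in 100 fair coin flips.

```python
from math import comb, sqrt

N_PEOPLE = 755      # documented suicides, from the post
N_LOCATIONS = 35    # north-south locations, from the post

# Expected count per location if every location is equally likely.
p_location = 1 / N_LOCATIONS
expected = N_PEOPLE * p_location

# Assumption: independent choices, so the count at one location is
# Binomial(755, 1/35) with standard deviation sqrt(n * p * (1 - p)).
sd = sqrt(N_PEOPLE * p_location * (1 - p_location))

# Probability of exactly 50 heads in 100 fair coin flips.
p_exactly_50 = comb(100, 50) / 2**100

print(f"expected per location: {expected:.1f}")   # about 21.6
print(f"standard deviation:    {sd:.1f}")         # about 4.6
print(f"P(exactly 50 heads):   {p_exactly_50:.3f}")  # about 0.080
```

So under this model, counts anywhere from the high teens to the mid twenties at a single location would be unremarkable, and even the "most likely" coin-flip outcome happens only about 8% of the time.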

So, at some locations we will see more than 21.6 deaths; at others, fewer. The question becomes whether the fluctuations are large enough to refute the notion that the choice of location is random.

In the following set of graphs, I ran some simulations. Eight of the nine graphs represent scenarios in which I sent 755 people to the bridge and randomly assigned each of them one of the 35 locations to jump from (okay, this is a thought experiment only; please don't do this at home). The x-axis represents the locations; the y-axis represents the number of suicides at each location, but on a standardized scale.
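For readers who want to try the thought experiment themselves, here is a minimal sketch of one way to run it in Python. The function name, the seed, and the use of the standard library's random module are my own choices for illustration; this is not necessarily how the graphs in this post were produced.

```python
import random
from collections import Counter

N_PEOPLE = 755      # from the post
N_LOCATIONS = 35    # from the post
N_SCENARIOS = 8     # eight simulated graphs

def simulate_counts(rng, n_people=N_PEOPLE, n_locations=N_LOCATIONS):
    """Assign each person a location uniformly at random and
    return the number of people at each location."""
    picks = Counter(rng.randrange(n_locations) for _ in range(n_people))
    # Locations nobody picked must still show up, with a count of zero.
    return [picks.get(loc, 0) for loc in range(n_locations)]

rng = random.Random(2024)  # arbitrary seed so the run is repeatable
scenarios = [simulate_counts(rng) for _ in range(N_SCENARIOS)]

for i, counts in enumerate(scenarios, start=1):
    print(f"scenario {i}: min={min(counts)}, max={max(counts)}")
```

Each scenario is a list of 35 counts that sum to 755; the spread between the minimum and maximum gives a feel for how uneven a purely random allocation can look.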

[Figure: nine graphs of standardized suicide counts by location]

The standardized scale allows us to compare across graphs. The zero line represents the mean number of suicides per location. The number of suicides at most locations is within one standard deviation of this mean (i.e. between -1 and 1 on the y-axis). In some extreme cases, the number of suicides is more than 3 standard deviations above the mean (i.e. greater than 3).
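As a sketch of what a standardized scale means, the snippet below converts raw counts to standard-deviation units. It assumes each graph is standardized with its own mean and sample standard deviation, which is one common convention and may differ from what was used for the graphs here; the counts are made up purely for illustration.

```python
from statistics import mean, stdev

def standardize(counts):
    """Return (count - mean) / standard deviation for each location.

    Assumption: the mean and the sample standard deviation are both
    computed from the counts in the same graph.
    """
    m = mean(counts)
    s = stdev(counts)
    return [(c - m) / s for c in counts]

# Made-up counts for five locations, for illustration only.
example_counts = [18, 25, 21, 30, 14]
z_scores = standardize(example_counts)

for count, z in zip(example_counts, z_scores):
    print(f"count={count:2d}  standardized={z:+.2f}")
```

A location sitting exactly at the mean gets a value of 0, and locations above the mean get positive values, which is why the zero line in each graph marks the average.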

Back to randomness: well, one of the 9 graphs is the real data from the map above.  If you can guess which of the 9 is real, then the real data is probably not random.  If you can't, then the real data may be random!

I will publish the answer tomorrow. In the meantime, feel free to take a guess and/or comment on what other approach you'd take. One take-away from this exercise is that it's very hard to tell non-random from random unless it is very obvious.

References: San Francisco Chronicle, Boing Boing


Comments

Mike Anderson

I hope you'll publish AN answer tomorrow; I'm not sure there is only THE answer.

This is great stuff! I'm shamelessly swiping the data and the randomness question to make this into a short project for my statistics undergraduates. At least 4 hypothesis tests come to mind immediately; I suspect my students will think up several more. Thanks!

Kaiser

Let me clarify - the "answer" that I will post here is merely which graph has the real data.

As for randomness, there are good answers and bad answers but hardly one answer. Good luck to your students!

Steve Citron-Pousty

Can you post the data or send it to me? I would love to run some fractal analysis on it. I would also suspect you can do any of the other quadrat analysis techniques. Yippee - it's spatial data, fun fun...

Robert

Bottom right.
