Let the data speak. That's advice we hear all the time. But it is bad advice, especially in this era of Big Data.
While doing research, reader Winnie H. found this webpage (link), apparently generated by automated software that explores every variable in a dataset about people living in the 32257 zip code. How else do we explain the data table shown on the right?
The research question seemed to be how your first name can kill you. The table extracts the most common first names among the residents who died during the year; for each first name, it computes the average age at death of all those who died. (There is another table that does the same thing for last names.)
***
I did a little exploration. First, the data are split by gender, which helps us see the true variability. I then estimated the average age and the standard deviation of the age--but this can be done only for the 585 people who died and had one of those 10 first names. One can make the not-unreasonable assumption that the true average and standard deviation would not be far off.
The standard deviation is about 1.3-1.5 years. Using a conventional 95% interval estimate, we would expect an interval about 5-6 years long. The range of average ages listed in the original table is 3.9 years for females and 4.8 years for males, so in both cases the observed variability is basically noise.
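A quick simulation makes the argument concrete. The 1.4-year standard deviation and the count of 10 names come from the post; everything else (the seed, the number of trials, treating each name's average as pure noise around a common mean) is an illustrative assumption, not the blog's actual computation:

```python
import random
import statistics

random.seed(42)

SD = 1.4          # assumed std. dev. of a group average (post says 1.3-1.5)
N_NAMES = 10      # ten most common first names
TRIALS = 10_000

# If each name's average age at death is pure noise around a common
# mean, how large a spread across 10 names do we see by chance alone?
ranges = []
for _ in range(TRIALS):
    means = [random.gauss(0, SD) for _ in range(N_NAMES)]
    ranges.append(max(means) - min(means))

expected_range = statistics.mean(ranges)
print(f"expected range across {N_NAMES} names under pure noise: "
      f"{expected_range:.1f} years")
```

Under these assumptions the simulated range comes out around 4 years, so observed ranges of 3.9 and 4.8 years are exactly what chance would produce.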
***
It would be scarier if the observed variability were larger than the noise. We'd likely see headlines screaming "Your first name reduces your life expectancy!!", or "Margaret, it's time to become Elizabeth!". And why not "James, if you want to live longer, become Elizabeth now!"
In situations like this, the statistician has to make sure the effect is not an artifact of her methodology. These data certainly do not speak for themselves. These data should probably be muzzled.
Perhaps there's something that escapes me, but I think one can never, or almost never, find that four times the standard deviation is greater than the range (under no hypothesis at all). The same data are used to compute both the range and the standard deviation, so it would be very strange to see one give very different evidence than the other. They are simply two different indexes of the same variability.
Posted by: Antonio | 03/25/2012 at 06:04 AM
even when you have variation larger than the stdev, it only shows correlation, not necessarily causation.
but there is a problem of circular argument in your analysis: you should use this stdev to check the variation in another data set (not this particular one), e.g. a different year in this zip code, or data from another zip code.
btw, a nice blog you have :)
Posted by: dan | 04/30/2012 at 11:17 PM