Let the data speak. That's advice frequently given. But it is bad advice, especially in this Big Data era.
While doing research, reader Winnie H. found this webpage (link), apparently generated by some automated software that explores every variable found in a dataset about people living in the 32257 zip code. How else do we explain the data table shown on the right?
The research question seemed to be how your first name can kill you. The table extracts the most common first names among the residents who died during the year; for each first name, it computes the average age at death of all those who died. (There is another table that does the same thing for last names.)
I did a little exploration. First, the data is split by gender, which helps us see the true variability. I then estimated the average age and the standard deviation of the age, but this can be done only for the 585 people who died and had one of those 10 first names. One can make the not-unreasonable assumption that the true average and standard deviation would not be far off.
The standard deviation is about 1.3-1.5 years. Using a conventional 95% interval estimate, we would expect the length of the interval to be about 5-6 years. The range of the average ages listed in the original table is 3.9 for females and 4.8 for males, so in both cases, the variability observed is basically noise.
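For readers who want to check the arithmetic, here is a minimal sketch in Python. It assumes the 1.3-1.5 figure is the standard deviation that applies to the name-level averages, and that the "conventional" 95% interval is the usual normal approximation of plus or minus 1.96 standard deviations. The function and variable names are mine, made up for illustration. Since the post does not say which end of the 1.3-1.5 range belongs to which gender, the sketch compares both observed ranges against the narrower interval, which is the stricter test.

```python
# Normal-approximation multiplier for a two-sided 95% interval (an assumption;
# the post only says "conventional 95% interval estimate").
Z_95 = 1.96


def interval_width(sd, z=Z_95):
    """Full length of a symmetric interval running from mean - z*sd to mean + z*sd."""
    return 2 * z * sd


# Figures quoted in the post.
sd_low, sd_high = 1.3, 1.5                      # standard deviation, in years
observed_ranges = {"female": 3.9, "male": 4.8}  # max minus min of the listed average ages

width_low = interval_width(sd_low)    # narrowest plausible interval
width_high = interval_width(sd_high)  # widest plausible interval
print(f"Expected 95% interval length: {width_low:.1f} to {width_high:.1f} years")

for group, spread in observed_ranges.items():
    # Compare against the narrower interval so the check is conservative.
    verdict = "within noise" if spread < width_low else "larger than noise"
    print(f"{group}: observed range {spread} years is {verdict}")
```

Both observed ranges come in under even the narrower interval, which is the whole point: the spread across names is no bigger than what chance alone would produce.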
It would be scarier if the observed variability were larger than noise. We'd likely see headlines screaming "Your first name reduces your life expectancy!!", or "Margaret, it's time to become Elizabeth!". And why not "James, if you want to live longer, become Elizabeth now!"
In situations like this, the statistician has to make sure the effect is not an artifact of her methodology. These data certainly do not speak for themselves. These data should probably be muzzled.