In my new book, I have a chapter on interpreting the statistics of obesity. Andrew Sullivan (link) recently pointed to a Nature article discussing an aspect of the controversy around these numbers.
The bone of contention is the shape of the mortality curve. It has been thought that the curve is monotonically increasing, meaning that the higher your BMI, the higher the mortality rate. But survey data in the U.S. now show that the curve is probably U-shaped: mortality rates are high for both obese and thin people. Overweight (less than obese) people paradoxically seemed to live longer than those with "normal" weight. This last observation has driven some people nuts.
The article focused on two Harvard researchers who organized a conference specifically to attack a CDC paper demonstrating the U-shaped curve. This is the crux of their argument:
When the researchers excluded women who had ever smoked and those who died during the first four years of the study (reasoning that these women may have had disease-related weight loss), they found a direct linear relationship between BMI and death, with the lowest mortality at BMIs below 19.
Excluding portions of a sample from analysis is a dangerous game, and should be heavily discouraged. It's one thing to adjust the data; it's another thing to remove data completely. Notice that what was removed weren't outliers, that is, data that might be incorrect and so extreme as to dominate the outcomes. They removed data specifically to conform to their model of the world.
First, they removed smokers because "smokers tend to be leaner and die earlier than non-smokers". This sounded like smokers who die earlier are on the thin side of the curve; removing them has the effect of straightening the curve.
The second cut is even more egregious. How can there be any justification for removing people who died during the first four years when the study's primary metric is death rate? They claimed reverse causality.
The most important reason why you should never drop large chunks of data in a systematic way is that your conclusions are now limited to the group that hasn't been dropped. Since there are no smokers in your sample, you cannot make a statement that applies to the general population. And yet, these researchers seem to have done so.
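To make the point concrete, here is a minimal sketch with invented numbers. The counts below are hypothetical, chosen only so that smokers are leaner and die at higher rates; they are not taken from the CDC paper or the Harvard analysis. The sketch computes the death rate per BMI band twice: once for everyone, and once after dropping smokers.

```python
# Hypothetical cohort, invented for illustration only.
# Each entry is (people, deaths) for one BMI band and smoking status.
COHORT = {
    "underweight": {"smoker": (600, 132), "non_smoker": (400, 36)},
    "normal":      {"smoker": (300, 60),  "non_smoker": (700, 70)},
    "overweight":  {"smoker": (200, 40),  "non_smoker": (800, 84)},
    "obese":       {"smoker": (200, 48),  "non_smoker": (800, 112)},
}


def mortality_by_band(cohort, exclude=()):
    """Death rate per BMI band, optionally dropping whole subgroups."""
    rates = {}
    for band, groups in cohort.items():
        kept = [counts for name, counts in groups.items() if name not in exclude]
        people = sum(n for n, _ in kept)
        deaths = sum(d for _, d in kept)
        rates[band] = deaths / people
    return rates


everyone = mortality_by_band(COHORT)
non_smokers_only = mortality_by_band(COHORT, exclude=("smoker",))

for band in COHORT:
    print(f"{band:12s} all={everyone[band]:.3f}  "
          f"non-smokers={non_smokers_only[band]:.3f}")
```

With these made-up counts, the full sample has its lowest death rate in the overweight band, while the non-smoker subset shows rates that rise steadily with BMI. Both curves are "right" for the group they describe; the second one simply says nothing about a population that includes smokers.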
***
Later on in the article, the journalist repeats the nonsense about how using BMI is a problem. I have previously written about this topic here.
On a related note, a visiting professor at NYU has been making the news, having made insulting comments about "fat PhD applicants". Somehow, the field of evolutionary psychology has attracted many crazies.
I work with BMI data and related health implications. In our work we see a slightly U-shaped impact on the dependent (health outcomes) variable. Because BMI is not that important to the work (we do use it to adjust the impacts), I have simply used BMI categorical variables instead of a falsely linear continuous variable. Does that make sense?
Posted by: Floormaster Squeeze | 06/05/2013 at 01:31 PM
Removing the data altogether does seem odd. Why not model the interactions with smoking and disease?
Posted by: Shampshire | 06/06/2013 at 08:35 AM
"Somehow, the field of evolutionary psychology has attracted many crazies."
:)
Yes. Yes it has...
Posted by: jlbriggs | 06/06/2013 at 12:20 PM
FMS: You are asking about discretizing predictor variables, which is often debated. My standard answer to this is to look at the analysis both ways, discretized and not. If they tell you a similar story, then it is okay to discretize as you are not losing any valuable information. While you might think linearizing is arbitrary, discretizing is another kind of arbitrary! What you're doing is imposing a step function on the curve. That is fine so long as you set the right bounds.
Shampshire: Maybe it wasn't enough to prove their theory :)
Posted by: Kaiser | 06/06/2013 at 08:50 PM
Several studies have concluded that life expectancy is greatest in the slightly overweight group. I believe the standard BMI definitions were developed before the Second World War, when it's thought that most of the population were mal(under)nourished. The categories need re-visiting, but there's a huge vested interest in some parts of the public health industry. Having spent their careers propagating one set of beliefs, many are reluctant to accept they need to change their message.
Posted by: Meic Goodyear | 06/07/2013 at 04:13 AM
Meic: and you're right, it's not that the BMI metric is bad, we can use the metric but interpret it differently.
All: The Typepad spam filter has been churning out false positives lately. If your comment doesn't show up, that means I have to fish it out of the spam folder. My own comment above was deemed "spam".
Posted by: Kaiser | 06/07/2013 at 10:22 AM
Thanks for the response. You are right that it is objectively arbitrary and could make things worse; I think it works for our adjustments better.
Using BMI linearly for us just means weaker or smaller impacts (heavier, worse outcomes generally). I am sure it has some value in our adjustments. However, as noted in the Nature discussion above, the Overweight category generally has as good (sometimes slightly better) outcomes as the Normal weight. The categories allow us to adjust for the worse outcomes of the Underweight (in our data there are very few people in this group) as well as the slightly worse outcomes of the Obese and the markedly worse outcomes of the Morbidly Obese (we use the standard BMI categories and cut-offs).
Also, in one of our outcomes the Obese have it slightly better/"pretty close" to Normal and Overweight, and the categories allow the differences for the Morbidly Obese to be more stark (linearly I believe this relationship is nearly flat).
Posted by: Floormaster Squeeze | 06/07/2013 at 10:40 AM
FMS: Your reasoning seems sound. You need to look at the un-discretized analysis to make sure that there are indeed three groups and get an idea of where the boundaries are. The advantage of discretizing is in the presentation.
Posted by: Kaiser | 06/09/2013 at 12:51 PM
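The advice in this thread, to run the analysis both ways and compare, can be sketched as follows. This is only an illustration under stated assumptions: the outcome curve is a made-up, noise-free U shape (not real mortality data), the category labels follow the thread, and the cut-offs are the standard BMI ones (18.5, 25, 30, 40). It fits a straight line and a step function to the same data and compares how much each leaves unexplained.

```python
from statistics import fmean

# Standard BMI cut-offs (upper bound exclusive); labels follow the thread.
CUTOFFS = [(18.5, "underweight"), (25.0, "normal"), (30.0, "overweight"),
           (40.0, "obese"), (float("inf"), "morbidly obese")]


def bmi_category(bmi):
    """Return the label of the first band whose upper bound exceeds bmi."""
    for upper, label in CUTOFFS:
        if bmi < upper:
            return label


def linear_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope


def step_fit(xs, ys):
    """Mean outcome per BMI category, i.e. a step function."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(bmi_category(x), []).append(y)
    return {label: fmean(vals) for label, vals in groups.items()}


# Hypothetical, noise-free U-shaped outcome; NOT real mortality data.
bmis = list(range(16, 45))
risk = [0.10 + 0.0005 * (b - 26) ** 2 for b in bmis]

intercept, slope = linear_fit(bmis, risk)
steps = step_fit(bmis, risk)

# Sum of squared errors left over by each description of the curve.
sse_linear = sum((y - (intercept + slope * b)) ** 2 for b, y in zip(bmis, risk))
sse_step = sum((y - steps[bmi_category(b)]) ** 2 for b, y in zip(bmis, risk))

print(f"linear: slope={slope:.5f}, SSE={sse_linear:.5f}")
for label, mean in steps.items():
    print(f"{label:15s} mean risk={mean:.4f}")
print(f"step:   SSE={sse_step:.5f}")
```

On a curve like this one, the straight line reports only a mild upward slope and hides the elevated risk at the thin end, while the step function recovers the U shape at the cost of fixed boundaries. If the two fits told a similar story, discretizing would lose little; here they do not, which is the case where the choice of bounds matters.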