
Floormaster Squeeze

I work with BMI data and related health implications. In our work we see a slightly U-shaped impact on the dependent variable (health outcomes). Because BMI is not that important to the work (we do use it to adjust the impacts), I have simply used categorical BMI variables instead of a falsely linear continuous variable. Does that make sense?


Removing the data altogether does seem odd. Why not model the interactions with smoking and disease?


"Somehow, the field of evolutionary psychology has attracted many crazies."


Yes. Yes it has...


FMS: You are asking about discretizing predictor variables, which is often debated. My standard answer is to look at the analysis both ways, discretized and not. If they tell you a similar story, then it is okay to discretize, as you are not losing any valuable information. While you might think linearizing is arbitrary, discretizing is another kind of arbitrary! What you're doing is imposing a step function on the curve. That is fine so long as you set the right bounds.
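To make "look at the analysis both ways" concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the data are synthetic and noise-free, the U-shaped curve with its minimum at BMI 26 is made up, and the cut-offs are my reading of "the standard BMI categories". It fits a straight line and a step function (one mean per category) to the same curve and compares the squared error of each.

```python
from bisect import bisect_right

# Assumed cut-offs for the standard BMI categories (not stated in this thread).
CUTS = [18.5, 25.0, 30.0, 40.0]
LABELS = ["Underweight", "Normal", "Overweight", "Obese", "Morbidly Obese"]

def category(bmi):
    """Index into LABELS of the category that contains `bmi`."""
    return bisect_right(CUTS, bmi)

def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

def step_fit(xs, ys):
    """Mean outcome per category: the step function that discretizing imposes."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(category(x), []).append(y)
    return {k: sum(v) / len(v) for k, v in groups.items()}

def sse(ys, preds):
    """Sum of squared errors between observed and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds))

# Synthetic U-shaped "risk" curve (made up, no noise), BMI 16.0 to 45.5.
bmis = [16 + 0.5 * i for i in range(60)]
risk = [0.05 * (b - 26) ** 2 for b in bmis]

a, b = linear_fit(bmis, risk)
steps = step_fit(bmis, risk)

sse_linear = sse(risk, [a + b * x for x in bmis])
sse_step = sse(risk, [steps[category(x)] for x in bmis])

print(f"linear slope: {b:.3f}, SSE: {sse_linear:.1f}")
print(f"step function SSE: {sse_step:.1f}")
for k in sorted(steps):
    print(f"  {LABELS[k]:<15} mean risk {steps[k]:.2f}")
```

On a curve shaped like this, the line reports only a general upward trend, while the category means show where the low point sits. With real data you would compare the two stories in the same way, and the answer depends on where the cut-offs fall relative to the bend in the curve.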

Shampshire: Maybe it wasn't enough to prove their theory :)

Meic Goodyear

Several studies have concluded that life expectancy is greatest in the slightly overweight group. I believe the standard BMI definitions were developed before the Second World War, when it's thought that most of the population were mal(under)nourished. The categories need revisiting, but there's a huge vested interest in some parts of the public health industry. Having spent their careers propagating one set of beliefs, many are reluctant to accept they need to change their message.


Meic: and you're right, it's not that the BMI metric is bad, we can use the metric but interpret it differently.

All: The Typepad spam filter has been churning out false positives lately. If your comment doesn't show up, that means I have to fish it out of the spam folder. My own comment above was deemed "spam".

Floormaster Squeeze

Thanks for the response. You are right that it is objectively arbitrary and could make things worse; I think it works better for our adjustments.

Using BMI linearly for us just means weaker or smaller impacts (heavier, worse outcomes generally). I am sure it has some value in our adjustments. However, as noted in the Nature discussion above, the Overweight category generally has outcomes as good as (sometimes slightly better than) the Normal weight category. The categories allow us to adjust for the worse outcomes of the Underweight (in our data there are very few people in this group) as well as the slightly worse outcomes of the Obese and the markedly worse outcomes of the Morbidly Obese (we use the standard BMI categories and cut-offs).

Also, in one of our outcomes the Obese do slightly better than, or "pretty close" to, the Normal and Overweight, and the categories allow the differences for the Morbidly Obese to be more stark (linearly, I believe this relationship is nearly flat).


FMS: Your reasoning seems sound. You need to look at the un-discretized analysis to make sure that there are indeed three groups and get an idea of where the boundaries are. The advantage of discretizing is in the presentation.
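One way to get an idea of where the boundaries are is to bin BMI finely before settling on categories. The sketch below is only an illustration: the data are the same kind of made-up, noise-free U-shaped curve, and `binned_means`, `flat_region` and the tolerance of 0.5 are hypothetical choices, not anything from the analysis discussed here. It averages the outcome in one-unit BMI bins and lists the bins whose mean is close to the lowest one, that is, the flat bottom of the U.

```python
import math
from collections import defaultdict

def binned_means(xs, ys, width=1.0):
    """Mean outcome in each bin of the given width, keyed by the bin's lower edge."""
    sums = defaultdict(lambda: [0.0, 0])
    for x, y in zip(xs, ys):
        edge = math.floor(x / width) * width
        sums[edge][0] += y
        sums[edge][1] += 1
    return {edge: s / n for edge, (s, n) in sorted(sums.items())}

def flat_region(means, tolerance):
    """Lower edges of the bins whose mean is within `tolerance` of the lowest bin mean."""
    lowest = min(means.values())
    return [edge for edge, m in means.items() if m - lowest <= tolerance]

# Synthetic U-shaped outcome (made up, no noise), BMI 16.0 to 45.5.
bmis = [16 + 0.5 * i for i in range(60)]
outcome = [0.05 * (b - 26) ** 2 for b in bmis]

means = binned_means(bmis, outcome)
low = flat_region(means, tolerance=0.5)

print("bins with near-minimal outcome start at BMI:", low)
```

If the flat region from a check like this straddles two of the standard categories, that is a hint that the chosen bounds are merging or splitting groups that the data do not distinguish.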

Kaiser Fung. Business analytics and data visualization expert. Author and Speaker.