


Thomas Dietterich

I would like to understand this distinction better.

Variable importance scores only tell us which variables the predictive model found useful. In a problem with many highly correlated predictor variables, you can obviously remove a lot of them at random and still fit a good predictive model.

As an ML person trying to become a better statistician, can you explain to me how this is different from finding "a small set of genes that have high explanatory power"?

Is the real difference that geneticists seek a causal model rather than a predictive one? In that case, we need to do more than just regressions, right?


TD: Yes, that's how I read it as well. With an "estimation" model, the goal is to identify genes that are causally related to the outcomes; it's less about predicting individual outcomes and more about fitting the average outcome. The genes identified may lead to the development of new treatments, but only if they are causal in nature. Modern predictive models are good at aggregating a variety of small signals; each small signal may not be the causal mechanism but merely correlated with it, and such signals, I think, would not be useful for developing new treatments. His bigger point is that both sides have thought their models could serve both functions, but that turns out not to be true, at least for now.
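
To make the point about correlated predictors concrete, here is a small simulated sketch. Everything in it is an assumption for illustration, not anything from the work being discussed: one unobserved driver `z` stands in for the causal mechanism, 50 noisy copies of it stand in for correlated genes, and the helper `holdout_r2` is a made-up name for a plain least-squares fit scored on held-out rows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 50

# Hypothetical setup: z is the (unobserved) causal driver of the outcome.
# Each predictor is only a noisy copy of z, so none of them is causal,
# and all of them are highly correlated with one another.
z = rng.normal(size=n)
X = z[:, None] + 0.5 * rng.normal(size=(n, p))
y = z + 0.5 * rng.normal(size=n)


def holdout_r2(X, y, n_train=1000):
    """Fit ordinary least squares on the first n_train rows; return R^2 on the rest."""
    A = np.column_stack([np.ones(len(X)), X])  # add an intercept column
    coef, *_ = np.linalg.lstsq(A[:n_train], y[:n_train], rcond=None)
    y_test = y[n_train:]
    resid = y_test - A[n_train:] @ coef
    return 1.0 - np.sum(resid**2) / np.sum((y_test - y_test.mean()) ** 2)


# Keep 10 of the 50 predictors, chosen at random, and compare the two fits.
keep = rng.choice(p, size=10, replace=False)
r2_full = holdout_r2(X, y)
r2_subset = holdout_r2(X[:, keep], y)

print(f"held-out R^2, all {p} predictors:   {r2_full:.3f}")
print(f"held-out R^2, random 10 predictors: {r2_subset:.3f}")
```

In this toy setup the two fits should predict about equally well, which is the sense in which a random subset is "good enough" for prediction. Which 10 columns survive is arbitrary, though, so the selected variables say nothing about the causal mechanism.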


Kaiser Fung. Business analytics and data visualization expert. Author and Speaker.