
Antonio Rinaldi

"The new model assumes depression is indicated by a large set of genes each contributing a weak effect."
The new model is worse than the old one. The shift from the old model to the new one merely adapts to the availability of new (big) data.
With a few suspects, it is relatively easy to find the culprit (the one who contributes the most), if he exists. But with a very large number of suspects, it is also relatively easy to find several culprits (the ones that supposedly contribute weakly together), because a sufficiently large body of evidence can always be read as supporting some subset of the multitude of suspects. But, as before, do they exist?
It is an entangled mix of genes and environment that causes depression and other illnesses. Another twenty wasted years await us. Then, when the Internet of Things (smartphones, smartwatches, cars, domestic appliances) produces a new ocean of big data, a new model will replace the current one, stating that depression is caused, beyond genes, by "a large set of environmental factors each contributing a weak effect". And so we will be ready for another twenty years of failures.
Science should go theory -> data, not data -> theory.
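The commenter's point about a multitude of weak suspects can be illustrated with a small simulation (a sketch of my own, not from the comment): give 1,000 candidate "genes" no real effect at all on a depression score, and dozens of them will still look like weak "culprits" at the conventional p < 0.05 threshold.

```python
import random
import math

random.seed(42)

n_subjects = 200
n_genes = 1000

# Depression score: pure noise, independent of every candidate gene.
outcome = [random.gauss(0, 1) for _ in range(n_subjects)]

def corr_pvalue(x, y):
    """Approximate two-sided p-value for a Pearson correlation,
    using the normal approximation to Fisher's z-transform."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - 3)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

false_culprits = 0
for _ in range(n_genes):
    gene = [random.gauss(0, 1) for _ in range(n_subjects)]  # no real effect
    if corr_pvalue(gene, outcome) < 0.05:
        false_culprits += 1

print(false_culprits)  # roughly 5% of 1000, i.e. on the order of 50 "suspects"
```

None of these "culprits" exists, yet the data can always be read as supporting some of them, which is exactly the worry raised above.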

Antonio Rinaldi

Another instance of false positive science:


The original study illustrates a couple of problems in medical research that become even more apparent in genetic research. The first is that the study only produced a p-value of 0.03, which is far from conclusive; the article even suggests that it should have been replicated. Medical researchers tend to believe that any p-value less than 0.05 is certain proof, because that is what is taught in statistics courses. That leads to the second problem: in these early studies there was a lot of dishonesty about the number of tests performed. It is a fundamental problem in epidemiology in general: people collect a lot of baseline data, then perform every cross-sectional test they can, and then every longitudinal test as more data come in. Then they write it up with an introduction that justifies the analysis based on previous research. Genetics makes this even worse because there are so many genes, which is why there are now established methods to make such false positives less likely.
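The arithmetic behind the number-of-tests problem is simple to sketch (my own illustration, not from the comment): with k independent tests each run at the 0.05 level, the chance of at least one false positive is 1 - 0.95**k. A Bonferroni correction, testing each hypothesis at 0.05/k instead, is one example of the established methods mentioned above.

```python
# Family-wise error rate for k independent tests at the 0.05 level,
# uncorrected vs. with a Bonferroni-corrected threshold of 0.05/k.
for k in (1, 10, 100, 1000):
    fwer_uncorrected = 1 - 0.95 ** k
    fwer_bonferroni = 1 - (1 - 0.05 / k) ** k  # stays near 0.05
    print(f"k={k:5d}  uncorrected={fwer_uncorrected:.3f}  "
          f"bonferroni={fwer_bonferroni:.3f}")
```

With 100 tests, a false positive somewhere is nearly guaranteed without correction, which is why an unreported multitude of tests makes a lone p = 0.03 so unconvincing.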

One of the realities of statistics and science is that we will get things wrong sometimes. What is bad is when the errors are not corrected. I expect that when anyone found contradictory results, they decided it was not in their interest to publish them. When you have a grant and a research program, it is not in your best interest to challenge its basis; such results are hard to publish, for a start.



Kaiser Fung. Business analytics and data visualization expert. Author and Speaker.