The Atlantic reports on the dynamics of yet another group of scientists coming to grips with having wasted time and resources chasing down a dead end. (link)
It's a good read but long. Here is the gist of it:
Almost 20 years ago, some researchers made a huge splash by claiming to have discovered the "depression gene". The one gene eventually engendered 450 publications, and when counting related genes, over 1,000 publications. A recent large-scale "validation" study is likely to bring down the entire cottage industry - the depression gene is found to have little explanatory power for depression after all.
Gene data is an example of a type of Big Data. Big Data can be big in terms of the number of individuals in the dataset, or the number of measurements per individual. Two decades ago, the scale came from more measurements, not more individuals. The original study looked at roughly 300 individuals, but each person's genome contributes a vast number of measurements.
The basic analysis is to compare the average depressed individual versus the average not-depressed individual in the sample. The data analyst sifts through large numbers of genes to find one or a few that are highly correlated with having depression. This is a classic fishing expedition, because of the large number of candidate genes, and also because of the large number of ways to define depression.
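To see why, here is a minimal simulation sketch (my own illustration, not the original study's method, and the numbers are made up): take 300 people, assign the depression label completely at random, generate thousands of genetic markers that by construction have nothing to do with the label, and then run the one-gene-at-a-time comparison.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_people, n_genes = 300, 20_000           # few individuals, many measurements
depressed = rng.integers(0, 2, n_people)  # depression label assigned at random
genotypes = rng.integers(0, 3, (n_people, n_genes))  # allele counts: pure noise

# Compare the average genotype of depressed vs. not-depressed, gene by gene
p_values = np.array([
    stats.ttest_ind(genotypes[depressed == 1, g],
                    genotypes[depressed == 0, g]).pvalue
    for g in range(n_genes)
])

hits = int((p_values < 0.05).sum())
print(f"Genes 'significant' at p < 0.05: {hits} (expected about {int(0.05 * n_genes)}; all spurious)")
```

With 20,000 candidate genes and a 5 percent threshold, roughly a thousand genes clear the bar even though none has any real effect. Multiply that by the many ways to define depression, and chance findings are almost guaranteed.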
Such an analysis rides on top of a "model" of the world in which a single gene is responsible for depression. Over the years, the scientific community has discovered that this model is wrong. The new model assumes depression is indicated by a large set of genes each contributing a weak effect.
This type of structure is very hard to elicit from the typical datasets of the past - those with numerous measurements on few individuals. Nowadays, we have data on lots of individuals, but the sourcing of the data and other problems pose formidable challenges. It's also not clear how a model that spreads the blame thinly across a large number of genes can guide treatment.
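Here is a companion sketch (again my own, assuming a simple additive model in which 1,000 genes each nudge the outcome slightly): with only 300 individuals, a one-gene-at-a-time scan with a proper multiple-testing correction detects essentially none of the truly causal genes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_people, n_causal = 300, 1_000            # small sample, many weak contributors
genotypes = rng.integers(0, 3, (n_people, n_causal))
effects = rng.normal(0, 0.02, n_causal)    # every gene matters, but only a little
liability = genotypes @ effects + rng.normal(0, 1, n_people)
depressed = (liability > np.median(liability)).astype(int)

# One-gene-at-a-time scan with a Bonferroni-corrected threshold
threshold = 0.05 / n_causal
detected = sum(
    stats.ttest_ind(genotypes[depressed == 1, g],
                    genotypes[depressed == 0, g]).pvalue < threshold
    for g in range(n_causal)
)
print(f"Truly causal genes detected: {detected} of {n_causal}")
```

Each gene's signal is drowned out by noise at this sample size, which is part of why the field has moved to studies with far more individuals - and why the old small-sample findings were so fragile.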
Science is proceeding as it should - weak theories are overturned with more research. The article laments that it took 20 years to turn the tide, that earlier warnings were ignored, that the publish-or-perish culture in academia creates perverse incentives, that flawed studies are rarely retracted, and so on.
***
I recently wrote about the challenge of Big Data expanding the variety of measurements here. Also, in writing Numbersense (link), I was concerned that the explosion of data collection causes an avalanche of false-positive science.
"The new model assumes depression is indicated by a large set of genes each contributing a weak effect."
The new model is worse than the old one. The shift from the old one to the new one only adapts to the availability of new (big) data.
With a few suspects, it's relatively easy to find the culprit (the one that contributes the most), if he exists. But with a very large number of accused, it's relatively easy (again!) to find several culprits (the ones that contribute weakly together), because a large, large, large pile of evidence can always be read as supporting, or not supporting, some among the multitude of suspects. But, as before, do they exist?
It's an entangled mix of genes and environment that causes depression and other illnesses. Another twenty wasted years await us. Then, when the Internet of Things (smartphones, smartwatches, cars, domestic appliances) produces a new ocean of big, big, big data, a new model will replace the current one, stating that depression is caused, beyond genes, by "a large set of environmental elements each contributing a weak effect". And so we will be ready for another twenty years of failures.
Science should be theory -> data, not data -> theory.
Posted by: Antonio Rinaldi | 05/24/2019 at 03:07 AM
Another instance of false positive science:
https://twitter.com/math_rachel/status/1132326067643879424
Posted by: Antonio Rinaldi | 05/26/2019 at 07:37 AM
The original study shows a couple of problems in medical research that become even more apparent with genetic research. The first is that the study only produced a p-value of 0.03, which is far from conclusive; the article even suggests that it should have been replicated. Medical researchers tend to believe that any p-value less than 0.05 is certain proof, because that is what is taught in statistics courses. That leads us to the second problem. In these early studies there was a lot of dishonesty about the number of tests performed. It is a fundamental problem in epidemiology in general: people collect a lot of baseline data, then perform every cross-sectional test they can, and then every longitudinal test as they collect more data. Then they write it up with an introduction that justifies the analysis based on previous research. Genetics makes this even worse, as there are lots of genes, so these days there are established methods to make this less likely.
One of the realities of statistics and science is that we will get it wrong sometimes. What is bad is when this isn't corrected. I expect that when anyone found contradictory results, they decided that it wasn't in their interest to publish them. When you have a grant and a research program, it isn't in your best interest to challenge its basis. Hard to publish, for a start.
Posted by: Ken | 06/16/2019 at 03:20 AM