A casualty of the Covid-19 emergency is scientific rigor, in particular the long-established statistical principle that the weight of evidence depends on the method of research: on how the data are collected and analyzed.
The media, and the scientists they employ as talking heads, rarely identify the nature of the evidence behind health policies. On air, they make little or no distinction between results obtained from randomized clinical trials (RCTs), observational studies, surveys, lab experiments, animal experiments, etc. Like a drumbeat, they promote the impression that all evidence are created equal, despite their disparate origins.
***
While such an attitude towards scientific evidence is not new, it became normalized during the pandemic. A great illustration is this recent article reporting the results of a clinical trial of screening colonoscopy for colon cancer (link).
In the U.S., the standard of care calls for adults aged 45 and over to get a colonoscopy every 10 years. Colonoscopies are both expensive and invasive. In some other countries, less invasive alternatives like stool testing are preferred.
American doctors have pointed to strong evidence that colonoscopies work. According to the article:
Past research always showed that colonoscopy could put a huge dent, on the order of 70%, in the incidence and mortality from colon cancer.
In other words, we can reduce colon cancer deaths by 70% purely by screening, presumably by catching cancers at an early stage. A researcher was quoted saying the profession had expected colon cancer to become "extinct" if everyone were screened.
***
Chances are patients were not informed that all past evidence for this practice came from observational studies only.
Observational studies compare the outcomes of people who have had colonoscopies and those who haven't. But because people self-select into the procedure, a direct comparison of the two groups is biased: those who choose to be screened likely differ from those who don't in ways that go beyond the decision to be screened. Analyzing observational data is a matter of finding ways to remove these biases.
Take a current example. When comparing the Covid-19 outcomes of vaccinated and unvaccinated people, the two groups are obviously not identical, because self-selection introduces biases: age bias, political bias, and so on. Enumerating all the sources of bias is challenging enough. The hard nut to crack is learning the magnitude of each bias. Even if we know that older people are more likely to be vaccinated, and also more likely to suffer severe consequences from Covid-19, how do we know the degree of the bias? And if we don't know its size, how can we tell whether we have adjusted too much or too little?
It's like asking a tailor to adjust the length of a pair of pants without telling her the height of the person who will wear them. All she's told is: they're just too long.
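Here is that problem in miniature. This is my own sketch with made-up numbers, not taken from any of the studies: screening does nothing at all, yet a single unmeasured trait makes the naive observational comparison look like a big win.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
sigmoid = lambda x: 1 / (1 + np.exp(-x))

# Hypothetical population: a single unobserved "health consciousness" trait
# drives both the decision to get screened and the baseline risk of dying.
health = rng.normal(0, 1, n)
screened = rng.random(n) < sigmoid(health)      # the health-conscious screen more...
death = rng.random(n) < sigmoid(-3 - health)    # ...and die less, screening aside

# Screening has ZERO true effect here, yet the naive comparison
# credits it with a sizable reduction in mortality.
rr = death[screened].mean() / death[~screened].mean()
print(f"naive relative risk: {rr:.2f}")         # comes out well below 1.0
```

Unless the analyst knows how strongly the hidden trait drives both screening and mortality, and that is precisely what we don't know, there is no way to tell whether an adjustment is too big or too small.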
Back to the observational studies of colonoscopies. Doctors have relied on a series of peer-reviewed observational studies that converge on the conclusion that colonoscopies are extremely effective. What can go wrong?
***
The profession ignored the nature of the evidence: the deficiencies of observational studies.
Finally, some researchers in Norway had the good sense to check these results by running the gold standard of medical evidence, a randomized clinical trial. In this type of trial, people are randomly assigned to receive, or not receive, an invitation to take a colonoscopy. The random assignment of treatment automatically adjusts for both known and unknown biases. (Read this previous blog for a primer.)
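To see the "automatic adjustment" at work, re-run the same made-up population from the sketch above, with a coin flip deciding treatment:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
sigmoid = lambda x: 1 / (1 + np.exp(-x))

# Same hypothetical population, but treatment is now assigned by coin flip.
health = rng.normal(0, 1, n)
assigned = rng.random(n) < 0.5                  # randomization ignores 'health'
death = rng.random(n) < sigmoid(-3 - health)    # screening still does nothing

# The hidden trait is balanced across arms, without our knowing it exists...
print(health[assigned].mean(), health[~assigned].mean())  # both near 0

# ...so the simple comparison recovers the true (null) effect.
print(death[assigned].mean() / death[~assigned].mean())   # near 1.0
```

Randomization balances even traits we never thought to measure, which is exactly what no amount of adjustment of observational data can guarantee.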
And... after 10 years of observation, the RCT showed that colonoscopy has essentially zero effect on mortality from colon cancer, and only about a 20% reduction in incidence, both metrics coming in far below the 70% efficacy found in the observational studies. I'll dig deeper into the results in a future post. For this post, I take these findings at face value.
***
What gave me pause when reading the Stat article was the response from various experts, who effectively advocated trashing the RCT result in favor of the observational studies. This reaction is shocking, given the RCT's status as the gold standard of medical evidence.
This RCT is not some summer project by a student intern: it enrolled 85,000 people, with data collected over 10 years. And yet the article ends with a quote declaring that "colonoscopy is still king".
The negative reaction reflects the belief that all evidence are created equal, and illustrates the broader skepticism toward counter-intuitive scientific findings. It's not just in medicine: anyone who has run experiments in industry has encountered this attitude.
The arguments for rejecting the inconvenient results can be classified into the following types:
(a) Some other related evidence exists that contradicts the current study - e.g. a randomized trial of a related procedure has previously shown better results
The problem with this type of argument is that it says nothing about the current study. A Chinese term for this strategy is "raising East to beat West". The strategy works only under the doctrine that all evidence are created equal: the other study concerns a different procedure, and it is brought into the conversation purely because it obtained the more "acceptable" result. Imagine this: if that trial had produced worse outcomes than the current study, would the expert have mentioned it to the reporter?
If we allow prior experiments to overrule new ones whenever their conclusions differ, then what's the point of running new experiments?
(b) The treatment worked better for a subset of the test subjects - e.g. if we restrict the analysis to the older population, the positive effect of the treatment is higher
Whenever the overall results of an experiment disappoint, the sponsor almost always asks for "deep dive" analyses, which typically means hunting for a subset of the population for which the treatment made a difference. What's wrong with this? Imagine that the researchers discovered a subset of patients for which the effect of colonoscopy on cancer incidence was not a 20% reduction but a 5% reduction. That finding would be dismissed and ignored, because it adds "nothing new" to the overall conclusion. The only additional findings that get the sponsor's attention are those that validate the desired outcome.
Besides, any such deep-dive analysis runs into statistical problems. The sample size within a subset is usually small, harming the reliability of the result. And the filtering effect of statistical significance means that any subgroup finding that clears the bar is likely to be an outlier for the experiment.
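A quick simulation shows the filtering effect. In this sketch (hypothetical numbers, nothing to do with the actual trial), the treatment does nothing, yet hunting across 20 subgroups will turn up roughly one "significant" finding by chance alone:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000

# Hypothetical null experiment: the treatment does nothing at all.
treated = rng.random(n) < 0.5
outcome = rng.random(n) < 0.10          # 10% event rate in both arms
segment = rng.integers(0, 20, n)        # 20 arbitrary subgroups (age bands, regions, ...)

# Hunt for a subgroup where the treatment "worked": two-proportion z-test,
# flagging anything past the nominal 5% significance level.
for s in range(20):
    m = segment == s
    a, b = outcome[m & treated], outcome[m & ~treated]
    pool = np.concatenate([a, b]).mean()
    se = np.sqrt(pool * (1 - pool) * (1 / len(a) + 1 / len(b)))
    z = (a.mean() - b.mean()) / se
    if abs(z) > 1.96:
        print(f"segment {s}: treated {a.mean():.3f} vs control {b.mean():.3f}, z = {z:.2f}")
```

Whatever pops out of such a hunt is, by construction, the noisiest corner of the data.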
Worst of all, the sponsor may take the positive outcome for the subset and create the impression that the treatment works for everyone. This happens far more often than one thinks. Take a Covid-19 vaccine study that uses matching to remove biases. While the matching procedure increases our confidence that the two groups being compared are similar, many records are dropped from the analysis because they are unmatchable. Because most older people got vaccinated, and many younger people did not, the matched subpopulation has a different age profile from the full population. The comparison of the matched groups is unbiased, but the extrapolation from the matched groups to the general population is not. Nevertheless, experts used the matching study to proclaim that the vaccine is effective for all, rather than effective for the matchable subpopulation.
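Here is a sketch of the matching problem, again with invented numbers: exact matching on age produces a fair comparison, but of a subpopulation that looks nothing like the whole.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
sigmoid = lambda x: 1 / (1 + np.exp(-x))

# Hypothetical population: vaccination uptake rises steeply with age.
age = rng.integers(20, 90, n)
vaccinated = rng.random(n) < sigmoid((age - 40) / 8)

# Exact matching on age: at each age, keep equal numbers of vaccinated and
# unvaccinated people; everyone else is unmatchable and gets dropped.
kept = []
for a in np.unique(age):
    v = np.where((age == a) & vaccinated)[0]
    u = np.where((age == a) & ~vaccinated)[0]
    k = min(len(v), len(u))
    kept.extend(v[:k])
    kept.extend(u[:k])
kept = np.array(kept)

print("mean age, full population:", round(age.mean(), 1))       # mid-50s
print("mean age, matched sample: ", round(age[kept].mean(), 1))  # much younger
print("share of records retained:", round(len(kept) / n, 2))
```

The matched comparison is unbiased, but it describes a much younger population than the real one; a claim about "everyone" is an extrapolation.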
(c) Our hypotheses are better than your data - e.g. because treatment methods keep improving, we believe the benefit will grow, and in five years' time the outcome will be better
Everyone is entitled to their hypotheses, but only those running the experiment have submitted theirs to testing with real-world data. The experts offer no data to support their alternative hypotheses. They also ignore hypotheses that would reinforce the study's findings. Imagine this: what about the possibility that someone develops a new screening method in the next five years that dominates colonoscopy?
(d) The data should be analyzed in some other way - e.g. we should use a per-protocol analysis
This reaction makes a mockery of pre-registering study methods. We face the same problem of ex-post hunting for a specific outcome: if the per-protocol analysis had yielded similar or "worse" results than the intention-to-treat analysis, the same expert would not have raised the critique.
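The tension between the two analyses is real, which is why the choice should be made before the data come in. In this sketch (made-up numbers once more), screening truly cuts risk by 30% among those who take it, yet the two analyses bracket that truth: intention-to-treat dilutes it with no-shows, while per-protocol flatters it by re-admitting the very self-selection the RCT was designed to remove.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
sigmoid = lambda x: 1 / (1 + np.exp(-x))

# Hypothetical trial: half invited to screening, half usual care.
invited = rng.random(n) < 0.5
health = rng.normal(0, 1, n)                    # the hidden trait again

# Among the invited, the healthier are more likely to show up.
complied = invited & (rng.random(n) < sigmoid(health))

# Suppose screening truly cuts risk by 30%, but only for those who get it.
base = sigmoid(-3 - health)
death = rng.random(n) < np.where(complied, 0.7 * base, base)

itt = death[invited].mean() / death[~invited].mean()
pp = death[complied].mean() / death[~invited].mean()
print(f"intention-to-treat RR: {itt:.2f}")      # diluted by the no-shows
print(f"per-protocol RR:       {pp:.2f}")       # flattered by healthy compliers
```

The per-protocol number looks spectacular for the same reason the observational studies did: the people who show up are simply healthier.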
***
Imagine this: the RCT for colonoscopy had confirmed the profession's longstanding view that the procedure cuts cancer deaths by 70% or more. What would the article have looked like? Would there have been a line-up of experts raising concerns that the RCT outcome might be wrong?
All four types of counter-arguments could still have been raised. There would likely be at least one randomized trial of a related procedure that showed much lower efficacy. A deep-dive analysis would probably have turned up at least one subset of patients who experienced much lower benefits than the average. Various hypotheses could be proposed suggesting that the RCT over-estimated the efficacy. And some other way of analyzing the data would have yielded lower efficacy.
Nevertheless, we wouldn't be hearing these arguments.
***
Lost in this debate is the real question raised by the RCT; in fact, it hasn't even been asked. So I'm asking it here: what went wrong in those prior observational studies for them to over-estimate colonoscopy's effectiveness by such a large margin? How did the machinery that "adjusts" for biases fail?