The New York Times Magazine has a pretty good piece about the use of OCCAM data to solve medical questions, like diagnosis and drug selection. I'm happy that it paints a balanced picture of both the promise and the pitfalls.
Here are some thoughts in my head as I read this piece:
- Small effects require large samples to detect, which poses a design problem for traditional clinical trials. The people profiled in the NYT article claim that OCCAM data can fill this void. But if a treatment is highly effective, even a small clinical trial will find the effect; so the underlying issue is less sample size than effect size.
- Counterfactual evidence is almost always absent from OCCAM data because of the lack of controls (the first “C” in OCCAM). The lede of the story concerns a girl who was given an anti-clotting drug because a doctor suspected she had an elevated risk of blood clots, and the girl did not develop a clot. Statisticians are not impressed by such evidence, because we don’t know whether the drug was truly responsible for the outcome. (It's a correlation until proven guilty.) If the girl had not taken the drug, would she have developed a clot? Chris Longhurst makes this point in the article: “At the end of the day, we don’t know whether it was the right decision.” This ignorance puts us in dangerous territory, making it a challenge to tell the prescient apart from the charlatans.
- The Big Data world is filled with "events data": you have a log of everyone who clicked on a particular button, or of everyone who called your call center, and so on. You have only the cases, not the non-cases (e.g., the unhappy customer who did not call the call center). Heartwarming stories like the girl's avoidance of clotting get repeated (or go viral, in modern terminology), but stories of failure are not usually deemed worth reporting. The following table shows the four possible stories:

|                      | No clot             | Clot       |
|----------------------|---------------------|------------|
| Took the drug        | Reported (the lede) | Unreported |
| Didn't take the drug | Unreported          | Unreported |
The media impose a filter so that only one of these stories gets through. Without mentally accounting for the other three, one can't judge how important the reported story is!
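To make the filter concrete, here is a small simulation (a hypothetical sketch with made-up numbers, not from the article): a drug with zero effect is given to half the patients, all four kinds of stories are generated, and then only the heartwarming cell — took the drug, no clot — is "reported".

```python
import random

random.seed(0)

def simulate_stories(n=100_000, clot_rate=0.05):
    """Simulate n patients. Half take a drug that does NOTHING:
    the clot rate is the same with or without it."""
    stories = {("drug", "clot"): 0, ("drug", "no clot"): 0,
               ("no drug", "clot"): 0, ("no drug", "no clot"): 0}
    for _ in range(n):
        treated = random.random() < 0.5
        clot = random.random() < clot_rate  # independent of treatment
        key = ("drug" if treated else "no drug",
               "clot" if clot else "no clot")
        stories[key] += 1
    return stories

stories = simulate_stories()

# The media filter: only the (drug, no clot) stories are repeated.
reported = stories[("drug", "no clot")]

# Judged from reported stories alone, the drug "worked" every time,
# even though by construction it has no effect at all.
print(f"reported success stories: {reported} of {sum(stories.values())}")
```

Comparing the clot rate among treated and untreated patients — which requires the unreported cells — immediately shows the drug does nothing; the reported cell alone cannot.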
***
In the July issue of Significance, the magazine by RSS and ASA, Julian Champkin contributed a great profile of Iain Chalmers, the founder of the Cochrane Collaboration, the organization that aggregates and summarizes trial results. I saw this fantastic quote, which speaks to the New York Times article:
Dr. Spock’s 1946 book, Baby and Child Care, was ... read by a huge proportion of [parents around the world]; throughout its first 52 years in print, it outsold every other book except the Bible. “It recommended that babies should be laid to sleep on their stomachs. Now we know that doing that increases the risk of cot (crib) death. Tens of thousands of babies died needlessly because of that advice.”
What is OCCAM data? I Googled it, and nothing turned up in the first couple of pages. A scan of the NYT article didn't help either.
Posted by: Andy | 10/09/2014 at 02:49 PM
Andy: see this link.
Posted by: junkcharts | 10/10/2014 at 01:15 AM
Kaiser, I think OCCAM is a great framework for thinking about big data issues, but I would recommend you provide that link every time you mention OCCAM. Not all of us are regular readers.
Posted by: zbicyclist | 10/10/2014 at 09:50 PM
Many of the problems seem to be missing data problems, and that is something that statisticians should be expert in.
Posted by: Ken | 10/11/2014 at 04:41 AM
zbicyclist: Point taken. I was in a bit of a hurry when I wrote that post.
Ken: yes, many issues like biased samples and censoring are nicely modeled as missing-data problems. Another big issue I'm concerned about is having nuisance predictors while missing relevant ones: because the data are observational, we frequently don't have the right set of predictors. A third issue is networks of direct and indirect effects. The general point is that the complexity of the analysis is increased, not decreased, by having "big data".
Posted by: junkcharts | 10/13/2014 at 11:43 AM
To add, the problem with the clotting agent is similar to one that occurs in diagnostic testing. Only those with a positive result on a screening test go on to receive the better (confirmatory) test. This effectively results in lots of missing data, as most subjects never get the better test, which produces severe biases if the data are not treated correctly.
When looking at big data for medicine, we are going to have a mess of different tests and possibly treatments that will obscure any possible diagnosis: if you don't know whether there was a disease in the first place, you will never know whether the disease just went away or was actually fixed. I expect there will be some very nice Bayesian analyses demonstrating that standard data-mining techniques just aren't enough.
Posted by: Ken | 10/15/2014 at 02:03 AM
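Ken's diagnostic-testing point is known as verification (or work-up) bias, and a quick simulation shows how bad it can get (a hypothetical sketch with made-up numbers, not from the thread): when only screen-positives receive the gold-standard test, the screening test's sensitivity estimated from verified patients alone comes out as a perfect 100%, regardless of its true value.

```python
import random

random.seed(1)

def verification_bias(n=50_000, prevalence=0.10, sens=0.80, spec=0.90):
    """Screen everyone; send only screen-positives to the gold-standard
    test. Estimate the screen's sensitivity two ways."""
    tp = fn_verified = 0   # counts among verified patients only
    diseased_total = 0     # the truth, normally unobserved
    tp_total = 0
    for _ in range(n):
        diseased = random.random() < prevalence
        screen_pos = random.random() < (sens if diseased else 1 - spec)
        if diseased:
            diseased_total += 1
            if screen_pos:
                tp_total += 1
        if screen_pos:  # only these get the gold standard
            if diseased:
                tp += 1
            # A diseased screen-negative is a false negative, but it is
            # never verified, so fn_verified stays at 0.
    naive_sens = tp / (tp + fn_verified)   # from verified patients only
    true_sens = tp_total / diseased_total  # needs the missing data
    return naive_sens, true_sens

naive, actual = verification_bias()
print(f"naive sensitivity: {naive:.2f}, true sensitivity: {actual:.2f}")
```

The naive estimate is blind to false negatives because none are ever verified — the same mechanism by which the girl's story in the NYT lede looks like an unqualified success.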