Currently (Tuesday), the top story on the New York Times's website is the one about spinal taps as a predictor of Alzheimer's.
In short, the researchers are making claims of "perfection" (or near-perfection): that the presence of certain proteins in one's spinal fluid is certain proof that one will eventually develop Alzheimer's.
If you've read my book, especially Chapter 4 (and the associated stuff in the Conclusion), you should be able to think statistically about what is being printed on the page.
While I am quite sure that this finding is important (at least in stimulating further research), I don't think there is enough information for readers to be fully convinced. Every time someone trumpets "perfection", and particularly in forecasting and prediction, we ought to start from a position of skepticism. So in this spirit, here I go.
***
In this post, I focus my attention on how the numbers were reported in this article, and these two sentences in particular (I've numbered them for convenience):
[1] The new study included more than 300 patients in their seventies, 114 with normal memories, 200 with memory problems, and 102 with Alzheimer's disease.
...
[2] Nearly every person with Alzheimer's had the characteristic spinal fluid protein levels. [3] Nearly three quarters of people with mild cognitive impairment, a memory impediment that can precede Alzheimer's, had Alzheimer's-like spinal fluid proteins. And every one of those patients developed Alzheimer's within five years. [4] And about a third of people with normal memories had spinal fluid indicating Alzheimer's. Researchers suspect that those people will develop memory problems.
[1] - the numbers don't add up (114 + 200 + 102 = 416), and I'm confused by the placement of the commas. Is it that the "more than 300" were composed of three subgroups? Or were there four subgroups in the experiment, one of which consisted of people in their 70s? This inconsistency, in itself not deadly, can easily be fixed, but it does smack of carelessness.
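For the record, here is the quick check, using only the subgroup counts quoted in [1]:

```python
# Subgroup counts as printed in sentence [1] of the article.
subgroups = {"normal memories": 114, "memory problems": 200, "Alzheimer's": 102}

# If these three subgroups compose the full sample, the total should match
# the stated sample size ("more than 300 patients").
print(sum(subgroups.values()))  # 416
```

At 416, the total sits awkwardly against "more than 300", which is what makes the sentence read as careless.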
But what are the not-to-be-missed words in [1]? It is the qualifier in their seventies. Blink and you may miss it. This is of crucial importance because all of the study's subjects were elderly people, the group most at risk of developing Alzheimer's soon. Why does that matter?
Recall that in Chapter 4, I discussed trying to pick a criminal out of a police line-up of, say, 10 suspects, versus trying to pick a thief out of a large-scale screening of thousands of employees at a company. The latter turns out to be a much more difficult task (because the chance of any one person being the thief is much lower than in the line-up), and predictive accuracy is correspondingly worse, all else being equal.
Applied here, predicting who will develop Alzheimer's among people in their 70s is much easier than predicting among people, say, in their 40s. So we are a long way from full success.
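This base-rate effect can be sketched with a quick calculation. The sensitivity, specificity, and prevalence figures below are made up for illustration; none of them come from the study:

```python
def ppv(prevalence, sensitivity=0.90, specificity=0.90):
    """Positive predictive value: P(disease | positive test)."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# High-risk group (like 70-somethings): suppose half will develop the disease.
print(round(ppv(0.50), 2))  # 0.9
# Lower-risk group (like 40-somethings): suppose 1 in 100 will.
print(round(ppv(0.01), 2))  # 0.08
```

The identical test looks "nearly perfect" in the high-prevalence group and nearly useless in the low-prevalence one, which is why the qualifier in their seventies carries so much weight.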
In reporting the result, the journalist started with [2]. Bad idea. I once took a class from an experienced journalist and learned that newspapers always print the most significant news first: supposedly, if the editor then chops off the bottom of your article, the key points stay intact. The study's purpose is purportedly to find a test that predicts Alzheimer's. But [2] tells us only that if we already know a patient has Alzheimer's, the test will "nearly" always be positive. This is a group of people who cannot benefit from this test, or from any other diagnostic test for that matter.
[3] reads like the clinching argument for the lede ("a spinal-fluid test can be 100 percent accurate in identifying patients with significant memory loss who are on their way to developing Alzheimer's disease."). The second part of [3] could be made clearer, for example by replacing the period with a semicolon, or by stating that it's every one of the nearly three quarters.
Now recall that in Chapter 4, I discussed the inevitable trade-off between false-positive and false-negative errors in any diagnostic system. [3] tells us the positive predictive value (PPV) of this test is close to 100 percent, meaning that someone who tests positive will almost certainly develop Alzheimer's; put differently, the false-positive rate is low.
But this is not enough! One thing we learn from the above statement is that at least 75 percent of 70-somethings with mild memory-loss conditions will develop Alzheimer's. The test will indeed be "perfect" if the chance is exactly 75 percent, that is, if none of the people who tested negative eventually develop Alzheimer's.
What if the chance is 80 percent? (This proportion is known as the prevalence of the disease in the population.) That means 5 percent of the group will develop Alzheimer's despite having a negative spinal-fluid test result, and those 5 percent sit among the 25 percent who test negative. Thus, the negative predictive value (NPV) would be 20/25 = 80 percent, which is good but not perfect. (Someone can look up the journal article when it is published and let us know the actual NPV.)
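The same back-of-the-envelope calculation, in code; the 75 percent and 80 percent figures are the assumptions above, not results from the study:

```python
# Assumptions from the scenario above, not from the published study.
test_positive = 0.75   # fraction testing positive, all of whom develop Alzheimer's
prevalence = 0.80      # assumed fraction of the group who will develop it

test_negative = 1 - test_positive       # 25% test negative
missed = prevalence - test_positive     # 5% develop it despite a negative test
npv = (test_negative - missed) / test_negative
print(round(npv, 2))  # 0.8 -> NPV of 80 percent: good, but not perfect
```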
This then reinforces the earlier point... that when, say, 80 percent of the tested population will develop Alzheimer's, the prediction problem is not as challenging as one might think. Even if the test declared everyone positive, the error rate would be only 20 percent.
The bigger point is that prediction systems must be evaluated on all three legs of a stool: PPV, NPV, and selectivity (how aggressively the test hands out positive results). I wrote about this before in the context of terrorist prediction, reacting to a section in SuperFreakonomics.
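A minimal sketch of the three-legged evaluation, using hypothetical per-100 counts consistent with the scenario above (75 true positives, no false positives, 5 false negatives, 20 true negatives):

```python
def stool_legs(tp, fp, fn, tn):
    """Return (PPV, NPV, positivity rate) from confusion-matrix counts."""
    ppv = tp / (tp + fp)                           # P(disease | positive test)
    npv = tn / (tn + fn)                           # P(no disease | negative test)
    positivity = (tp + fp) / (tp + fp + fn + tn)   # how aggressively it flags
    return ppv, npv, positivity

# Counts are illustrative only, scaled to 100 people.
print(stool_legs(tp=75, fp=0, fn=5, tn=20))  # (1.0, 0.8, 0.75)
```

A perfect PPV on one leg of the stool, as this shows, is entirely compatible with an imperfect NPV on another.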
***
Finally, [4] leaves much to be desired. The group of most interest for the prediction problem is precisely this group of people who are not currently exhibiting anything unusual. I'd be interested in knowing the research design: did they decide, before the experiment was conducted, how long they would track this group? Or are they tracking this group now, waiting for the moment to declare victory or failure? In any case, the verdict is not in yet.
In a future article, I will make some comments about causation vs. correlation, as suggested by this research.