A recent article in USA Today is titled “Many with sudden cardiac arrest had early signs” (link). The signs include shortness of breath, faintness, chest pain, etc. Hold on to the headline because it’s the only thing believable in the entire article.
The words “early signs” imply to readers that, had the men heeded these warnings, they could have prevented the cardiac arrests.
Think about the following two statements:
A1) Many with sudden cardiac arrest previously had symptoms.
B1) Many with symptoms subsequently had sudden cardiac arrest.
These two statements are far from equivalent, even though they describe the same sequence of events.
It’s easier to see the difference if we specify the symptoms:
A2) Many with sudden cardiac arrest had shortness of breath weeks before.
B2) Many with shortness of breath had sudden cardiac arrest weeks later.
It’s even easier to see if we include a number:
A3) 53% of those with sudden cardiac arrest had chest pain, shortness of breath, etc. (a direct quote from the article)
B3) 53% of those with chest pain, shortness of breath, etc. subsequently had sudden cardiac arrest.
B3) is clearly false. The universe of men who suffer from chest pain, shortness of breath, etc. is much larger than the population who have sudden cardiac arrest in any given week. B3) vastly exaggerates the number of sudden cardiac arrests.
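A quick back-of-the-envelope calculation makes the asymmetry concrete. The counts below are illustrative assumptions for a hypothetical population; only the 53% comes from the article.

```python
# Hypothetical counts for a population of 1,000,000 middle-aged men.
# Only the 53% is from the article; the other figures are illustrative.
arrests = 8_000        # sudden cardiac arrests (assumed ~0.8% of population)
symptomatic = 80_000   # men with chest pain, shortness of breath, etc. (assumed ~8%)
both = round(arrests * 0.53)  # arrests that were preceded by symptoms

p_symptoms_given_arrest = both / arrests      # statement A3
p_arrest_given_symptoms = both / symptomatic  # statement B3
print(f"A3: {p_symptoms_given_arrest:.0%}")   # A3: 53%
print(f"B3: {p_arrest_given_symptoms:.1%}")   # B3: 5.3%
```

The same count of men (those with both symptoms and an arrest) yields a very different percentage depending on which group sits in the denominator.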
***
How did the researchers come to make this type of claim? They were looking at a data set with no control group.
“Clugh and colleagues studied medical records of 567 men from Portland, Ore., ages 35 to 65, who had out-of-hospital cardiac arrests between 2002 and 2012… 13% had [prior] shortness of breath… ” We have no way to judge whether 13% is a big or small number unless we know what proportion of comparable middle-aged men who did not have cardiac arrests suffered from shortness of breath.
One of the greatest challenges of the Big Data era is the absence of control groups. Without them, we don’t have a yardstick to judge.
I did some simple research; please test my thinking:
Ca - cardiac arrest
S - symptoms (shortness of breath and chest pains)
P(S|Ca) = 53%, probability of symptoms given cardiac arrest
P(S) = 8% (overall population data, really rough)
P(Ca) = 0.8% (800 per 100,000 people suffer Ca)
Using a Bayesian analysis:
P(Ca|S) = P(S|Ca) * P(Ca) / P(S) = 53% x 0.8% / 8% = 5.3%
Your chance of cardiac arrest given the symptoms is 5.3%, meaning you may not need to run to the hospital. You certainly need a control group to factor out issues such as panic attacks, etc., that can cause the same symptoms.
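The arithmetic is easy to verify in a few lines of Python (a sketch; the 8% and 0.8% inputs are the rough estimates above, not data from the article):

```python
# Bayes' rule with the rough figures above; 8% and 0.8% are assumptions.
p_s_given_ca = 0.53  # P(S|Ca), the 53% quoted in the article
p_s = 0.08           # P(S), overall symptom prevalence (rough)
p_ca = 0.008         # P(Ca), ~800 per 100,000

p_ca_given_s = p_s_given_ca * p_ca / p_s
print(f"P(Ca|S) = {p_ca_given_s:.1%}")  # P(Ca|S) = 5.3%
```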
Posted by: S. Frazier | 12/02/2013 at 07:56 PM
SF: Thanks for your contribution. Always good to do back of the envelope. If we do a similar analysis on the other symptoms, the number would be even smaller given the much weaker correlation.
Posted by: Kaiser | 12/03/2013 at 11:38 PM
Big data doesn't imply a lack of control groups. Lazy analysts don't use the available data to build an appropriate control group.
Lazier journalists re-print this as useful information.
Posted by: Chris | 01/16/2014 at 12:52 PM
Chris: Big data is mostly observational data, and it takes both a lot of time and a lot of statistical expertise to build "appropriate control groups," so I'm not surprised this is not being done. Sometimes you just can't build control groups from existing data. For example, if you launch a new version of an iPhone app, Apple is not going to let you keep both new and old versions in the same store; if you want to measure the impact of the new app, you are forced to perform pre-post analysis. Any creation of a control group would require uncomfortably strong assumptions.
Posted by: Kaiser | 01/17/2014 at 01:07 AM