One of my favorite statistics-related wisecracks is: the plural of anecdote is not data.
In today's world, it should really be: the plural of anecdote is not BIG DATA.
In class this week, we discussed a recent Letter to the Editor of the New England Journal of Medicine, a top journal, featuring a short analysis of weight data from a digital scale that, you guessed it, makes users consent to being research subjects by accepting its Terms and Conditions. (link to NEJM paper, covered by New York Times)
The "analysis" is succinctly summarized by this chart:
Their conclusion is that people gain weight around the major holidays.
How did the researchers come up with such a conclusion? They in essence took the data from the Withings scales, removed a lot of the data based on various criteria (explained in this Supplement), and plotted the average weight changes over time. Ok, ok, I hear the complaint that I'm oversimplifying. They also smoothed (and interpolated) the time series and "de-trended" the data by subtracting a "linear trend". The de-trending accomplished nothing, as evidenced by comparing the de-trended chart in the main article to the unadjusted chart in the Supplement.
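For readers unfamiliar with the jargon, here is a minimal sketch of what "subtracting a linear trend" amounts to, using made-up data rather than the study's:

```python
# De-trending: fit a straight line to the series, then keep the
# residuals. The series below is fabricated for illustration.
import numpy as np

t = np.arange(365)                                       # day index
weight = 0.001 * t + 0.3 * np.sin(2 * np.pi * t / 365)   # fake weight series

slope, intercept = np.polyfit(t, weight, 1)   # fit the linear trend
detrended = weight - (slope * t + intercept)  # subtract it
# If the fitted trend is nearly flat, the de-trended series looks just
# like the original -- consistent with the two charts matching.
```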
Then, the researchers marked out several major holidays - New Year, Christmas, Thanksgiving (U.S.), Easter, and Golden Week (Japan) - and lo and behold, in each case the holidays coincided with a spike in weight gain, ranging from a high of about +0.8% (U.S.) to a low of +0.25% (Easter).
Each peak is an anecdote and the plural of these peaks is BIG DATA!
Why did I say that? Look for July 4th, another important holiday in the States. If this "analysis" is to be believed, July 4th is not a major holiday in the U.S. On average, people tend to lose weight (-0.1%) around Independence Day. There is also no weight change around Labor Day.
In a sense, this chart shows the power of data visualization to shape perception. Labeling those five holidays draws the reader's attention. Not labeling the other major holidays takes them out of the narrative. Part of having numbersense is having the ability and confidence to make our own judgment about the data. Once one notices the glaring problems around July 4th and Labor Day, one can no longer believe the conclusion.
There is also "story time" operating here. The researchers only had data on weight changes. They did not have, nor did they seek, data on food intake. But the whole story is about festive holidays leading to "increased intake of favorite foods" which leads to weight gain. Story time is when you lull readers with a little bit of data, and when they are dozing off, you feed them a huge dose of narrative going much beyond the data.
The real problem here relates to the research process. Traditionally, you come up with a hypothesis, and design an experiment or study to verify it. Nowadays, you start with some found data, you look at the data, you notice some features in the data like the five peaks, and then you create your hypothesis; there really is little need to confirm anything since the hypothesis was suggested by the data. And yet, researchers will still run a t-test and report p-values (in this weight-change study, the p-values were < 0.005).
Even if it's acceptable to form your hypothesis after peeking at the data, the researcher should then have formulated a regression model with all of the major holidays represented; the model would then provide estimates of the direction, magnitude, and statistical significance of each holiday effect.
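To make this concrete, here is a minimal sketch of such a regression on simulated data; the holiday list, dates (Easter in particular moves every year), window width, and effect sizes are all invented for illustration, not taken from the study:

```python
# Regression of daily weight change on one dummy variable per major
# holiday. Data are simulated; only the structure matters here.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
days = pd.date_range("2013-01-01", "2014-12-31", freq="D")

# One 0/1 dummy per holiday: 1 within +/-7 days of a (fixed) date.
holidays = {
    "new_year": "01-01", "easter": "04-20", "july_4": "07-04",
    "labor_day": "09-01", "thanksgiving": "11-27", "christmas": "12-25",
}
X = pd.DataFrame(index=days)
for name, mmdd in holidays.items():
    dates = pd.to_datetime([f"{yr}-{mmdd}" for yr in days.year.unique()])
    X[name] = [int(min(abs((d - dates).days)) <= 7) for d in days]

# Fake outcome: average % weight change -- noise plus one real bump.
y = 0.05 * X["christmas"] + rng.normal(0, 0.3, len(days))

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())  # direction, magnitude, and p-value per holiday
```

With every holiday in the model, a July 4th coefficient near zero (or negative) would be staring the researchers in the face.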
PS. Some will grumble that the analysis is not "big data" since it does not contain gazillions of rows of data, far from it. However, almost all Big Data analyses are done following the blueprint outlined above. Also, I do not define Big Data by its volume. Here is a primer on the OCCAM definition of Big Data. Under the OCCAM framework, the Withings scale data is observational, has no controls, is treated as "complete" by the researchers, and was collected primarily for non-research reasons.
PPS. Those p-values are hilariously tiny. The p-value is a measure of the signal-to-noise ratio in the data, and the noise in this dataset is very high. In the Supplement, the researchers outlined an outlier removal procedure, in which they disclosed that the "allowable variation" is 3% daily plus an extra 0.1% for each following day between two observations. Recall that the "signals" were smaller than 0.8%.
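To see the scale mismatch, here is the allowable-variation threshold as I read the Supplement (the exact formula is my interpretation of their description):

```python
# Outlier rule as I read it: two observations may differ by up to 3%,
# plus an extra 0.1% for each following day between them.
def allowable_variation_pct(gap_days):
    return 3.0 + 0.1 * (gap_days - 1)

for gap in (1, 7, 30):
    print(f"{gap:>2} days apart: up to {allowable_variation_pct(gap):.1f}% allowed")
# Even back-to-back days tolerate 3% of "noise" -- almost four times
# the largest reported "signal" of about 0.8%.
```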
I am traveling, so I have to make this brief. I will likely come back to these stories in the future to give a longer version of these comments.
I want to react to two news items that came out in the past couple of days.
First, Ben Stiller said that prostate cancer screening (the infamous PSA test) "saved his life". (link) So he is out there singing the praises of the PSA test, which has been disavowed even by its inventor (link), although still routinely used by many physicians.
One can't dispute that the PSA test result led Ben Stiller to learn about his cancer, and that he is better today because of that discovery.
However, imagine the following scenario: I invent my own screening test. The test consists of flipping a coin: heads, you have cancer; tails, you don't. Among those people who came up heads, I can find one who truly has cancer. I saved his life because my test alerted him to the disease. Because I saved this person's life, my test must be really good. (If one anecdote is too few, I could find a handful of people whose lives I have saved.)
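A quick simulation, using a made-up 1% prevalence, shows how a useless test mass-produces such life-saving anecdotes:

```python
# Coin-flip "screening test": with enough people tested, some true
# cancer cases land on heads purely by chance. Prevalence is invented.
import random

random.seed(1)
n_people = 100_000
prevalence = 0.01   # hypothetical cancer rate

lives_saved = 0
for _ in range(n_people):
    has_cancer = random.random() < prevalence
    heads = random.random() < 0.5       # the "test": a fair coin
    if has_cancer and heads:
        lives_saved += 1                # a true case the coin "caught"

print(f"'Lives saved' by a coin flip: {lives_saved}")   # ~500 expected
# Anecdotes are guaranteed even when the test carries no information.
```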
Second, the FBI tells reporters that the Minnesota mall attacker "withdrew from friends in months before attack." (link)
Imagine that you are trying to predict who will be the next disgruntled attacker. Based on the FBI statement, you want to round up everyone who "withdrew from friends." How many people would that include? How many of them will eventually be attackers?
The same holds for all the other findings, such as "he converted to Islam recently" and "he posted something hateful on Facebook".
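A back-of-the-envelope calculation, with invented but plausible numbers, shows why rounding up everyone who fits such a profile is hopeless:

```python
# All numbers are invented for illustration.
population = 250_000_000    # U.S. adults, roughly
withdrew_rate = 0.02        # guess: 2% "withdrew from friends" this year
attackers = 10              # guess: number of attackers this year

flagged = population * withdrew_rate    # people who fit the profile
precision = attackers / flagged         # even if every attacker withdrew
print(f"flagged: {flagged:,.0f}; share who attack: {precision:.7f}")
# 5,000,000 flagged; about 1 in 500,000 of them ever attacks.
```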
*** It is precisely when we want something badly, like information that saves our lives or that prevents terrorist attacks, that we become most susceptible to nonsense data analyses.
A GMO labeling law has arrived in the US, albeit one that has no teeth (link). For those who don't want to click on the link: the law was passed in haste to pre-empt a more stringent Vermont law; the federal law defines GMO narrowly; businesses do not need to put word labels on packages (they can, for example, provide an 800-number); and violators will not be punished.
One of the arguments against GMO labeling is that it is unscientific because (some) scientists are 100% certain that GMO foods are safe. (e.g. this Boston Globe editorial)
Any good scientist knows that scientific "truths" are true until they are proven otherwise. Science is a continuous process of making hypotheses, and finding data to confirm or reject them. The Bayesian way of thinking is very useful here. Being true is a matter of probability: more confirmatory data increases the probability that a given hypothesis is true.
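A toy illustration of that updating process, with arbitrary numbers:

```python
# Bayesian updating in odds form: each confirmatory study multiplies
# the odds by a likelihood ratio; P(true) approaches but never hits 1.
prior = 0.5               # starting probability the hypothesis is true
likelihood_ratio = 3.0    # arbitrary: evidence is 3x likelier if true

p = prior
for study in range(1, 6):
    odds = p / (1 - p)
    odds *= likelihood_ratio        # Bayes' rule, odds form
    p = odds / (1 + odds)
    print(f"after study {study}: P(true) = {p:.3f}")
# 0.750, 0.900, 0.964, 0.988, 0.996 -- never 100% certain.
```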
So why is GMO labeling good science?
In fact, I'd go so far as to say that there is no science without GMO labeling.
How is nutritional science done today? What is the research that tells us coffee is good, butter is good, salt is bad, etc.? Granted, this is a shaky field that has issued lots of false results. But the usual form of analysis goes like this: conduct a large survey of consumers and ask them about their diet (e.g. how much red meat do you eat each week?); obtain information about their health status, either through the same survey, a different survey, or direct measurements if they are part of a research study; then correlate the dietary data and the health data.
Now, imagine you want to study whether eating GMO foods affects your health, either positively or negatively. Your survey question will be something along the lines of "How much GMO food did you eat last week?"
Without GMO labeling, there is no way to conduct such research. This is why GMO labeling is good science. Not labeling GMOs is bad science - in effect, it mandates that no science be done.
Theranos (v): to spin stories that appeal to data while not presenting any data
To be Theranosed is to fall for scammers who tell stories appealing to data but do not present any actual data. This is worse than story time, in which the storyteller starts out with real data but veers off mid-stream into unsubstantiated froth, hoping you and I got carried away by the narrative flow.
Theranos (n): From 2003 to 2016, a company in Palo Alto, Ca., the epicenter of venture capital, founded by Elizabeth Holmes, a 19-year-old Stanford University dropout, raised over $70 million to develop and market a "revolutionary" technology for blood testing said to require only a finger-prick of blood. The company grew its valuation to $9 billion without ever publishing any scientific data in a peer-reviewed medical journal. It turned out that the new technology was used in only 12 of the 200 tests on its menu, meaning that the business was based on selling old technology at bargain-basement prices subsidized by the venture-capital money. Further, it emerged that the new technology was not accurate, that it had been shelved since last year, and that in some cases when the old technology was used, the lab personnel improperly handled the machines--all of which eventually led to a blanket retraction of two full years' worth of test results. The company claims that these results have been "corrected" in the last few weeks; it is unclear what "correction" means when the blood was drawn from patients up to two years ago. The company is still in business, and Walgreens, one of its most prominent partners, continues its commercial relations with the company. For many years, the business and technology press issued countless glowing reviews of the company (see this epic list just covering 2013-2015.) Until 2014, the company's board consisted entirely of politicians, former cabinet members, and military leaders. All of these individuals have been Theranosed.
The Wall Street Journal has done an exemplary job following this case, and deserves a Pulitzer for this effort. The latest revelation relating to the full-scale retraction is here.
The fad of standing while working may die hard but science is catching up to it.
The idea that standing at work will make one healthier has always been a tough one to believe. It requires a series of premises:
Using a standing desk increases the amount of standing
Standing longer improves one's health
The health improvement is measurable using a well-defined metric
The incremental standing is of sufficient amount to effect an improvement in health
No other factors are required to attain the said improvement
No other factors offset the said improvement (e.g. standing more may expend more energy, causing one to snack more)
Now, the Cochrane Collaboration has looked at the evidence, and found it lacking (link). The Cochrane researchers write:
The quality of evidence was very low to low for most interventions mainly because studies were very poorly designed and because they had very few participants. We conclude that at present there is very low quality evidence that sit-stand desks can reduce sitting at work at the short term. There is no evidence for other types of interventions. We need research to assess the effectiveness of different types of interventions for decreasing sitting at workplaces in the long term.
Seems like they haven't passed the first hurdle - does using standing desks actually reduce the amount of sitting at work?
In our latest Statbusters column for the Daily Beast, we read the research behind the claim that "standing reduces odds of obesity". Especially at younger companies, it is trendy to work at standing desks because of findings like this. We find a variety of statistical issues calling for better studies.
For example, the observational dataset used provides no clue as to whether sitting causes obesity or obesity leads to more sitting. Further, as explained in the column, what you measure, and even more importantly, what you don't measure, makes or breaks the analysis.
These lessons are highly relevant to anyone working with "big data" studies.
One of the secrets of great data analysis is thoughtful data collection. Great data collection is necessary but not sufficient for great data analysis.
I recently had the unfortunate need to select a new doctor. Every time I have had to do this, it has been an exercise in frustration and desperation. After wasting hours and hours perusing the "data" on doctors, I inevitably give up and just throw a dart at the wall.
Every medical insurer points you to their extensive online resource called the doctors' directory. Apparently, we are supposed to pick a doctor from this directory. There is a lot of data in this directory. A casual search results in hundreds of matches. What are the data available for me to narrow down my selection?
Which school did the doctor graduate from?
When did the doctor graduate?
What was the name of the degree?
How many languages does the doctor speak?
What hospitals is the doctor affiliated with?
Which medical group does the doctor operate within (if any)?
What are the fields of specialization?
What is the address of the office?
Conspicuously absent are any data that measure the quality or outcomes of the doctor. There is neither a quantitative nor a qualitative measure of quality or patient satisfaction. We don't know anything about wait times. It is very challenging to find out how big the doctor's practice is.
The data that are provided are essentially just that--data that convey almost no information. I don't think which school the doctor went to matters, nor the name of the degree. Age might be somewhat useful as it indicates amount of experience but the year of graduation is often suppressed. Ethnicity is perhaps useful but it is not present; in some cases, the name reveals this information but not usually.
Hospital affiliations could have been useful if doctors were not each affiliated with many hospitals. I asked a friend of mine who is a doctor whether there are more "selective" hospitals the way there are more "selective" universities, but he told me hospital affiliation conveys no information.
Fields of specialization are also useless, as I am not looking for a specialist.
Languages spoken is an oddity. If I interpret the data literally, it seems that American doctors have an obsession with learning foreign languages. It is incredible how many of them speak three, four or more languages, including relatively exotic ones. Chances are these doctors have people in the office who speak those foreign languages. In any case, since my primary language is English, I have no inclination to select doctors based on what other languages they (or their staff) speak.
So the only piece of data I can use is the address. Is the doctor close to my home or work?
And that seems to be a poor way of selecting doctors.
PS. While writing this, I am reminded of a continuous stream of useless real-time data: those signal bars on our cellphones. The number of bars and the speed at which a webpage loads are much less correlated than expected.
It's okay if we treat the data as a joke. But somewhere in the world, some data scientists are using the data to do serious work.
My co-columnist Andrew Gelman has been doing some fantastic work digging behind that trendy news story claiming that middle-aged, non-Hispanic, white male Americans are dying at an abnormal rate. See, for example, this New York Times article that not only reports the statistical pattern but also, in its headline, asserts that those additional deaths were due to suicide and substance abuse.
It all began with the chart shown on the right. It appears that something dramatic happened in the late 1990s when the USW (red) line started to diverge from those of all the other countries. The USW line started to creep upwards, meaning that the death rate was increasing for US white non-Hispanic males aged 45-54. (The bold blue line is for US white Hispanic males aged 45-54 and does not look different from those of other countries.)
Prompted by a lively discussion in the comments section, Andrew pursued a deeper analysis of this data. This has led to a series of posts in which he refined the analysis (see here, here, here and here.) I recommend reading the entire series, as it paints a full picture of how statistical thinking works. In the rest of this post, I will present a cleansed summary of his argument while leaving out details.
We first note that the veracity of the data is not at issue. We accept as a starting point the trends shown above to be true; this can easily be verified using public data. The debate is around why.
People who analyze age-group data are particularly sensitive to bias due to discretization. The original analysis, co-authored by Angus Deaton, the recent winner of the Nobel in economics, focuses on the age group 45 to 54. If you compute the average age in this age group over time, you may be shocked that it is not flat; the average age of people aged 45 to 54 has been increasing over time. As the following chart shows, since 1990 or so, the average age in this age group moved up by about half a year. (Data from CDC Wonder.)
Because older people die at a higher rate, the death rate within age group 45 to 54 will increase just because of the increasing average age of this age group--without having to resort to other reasons such as suicides.
Note also that the Baby Boom in the U.S. caused large fluctuations in the age distribution over time. This observation provides nice color on why the average age is increasing, but it is not required for the argument; the general aging of the population is another cause.
What is crucial to the reasoning is the steepness of the increase in death rate with increasing age. Surprisingly, it is not easy to find a chart plotting death rates by age. Wikipedia has this graph shown on the right. This is not empirical data but the Gompertz-Makeham law (link), which is described as accurate for the 30-80-year-old range. The key insight is that mortality rate increases exponentially after age 30.
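For reference (this is standard actuarial material, not from the article), the Gompertz-Makeham hazard has the form

μ(x) = A·e^(Bx) + C

where x is age, the exponential (Gompertz) term captures a roughly constant percentage increase in mortality per extra year of age, and the constant C (the Makeham term) captures age-independent causes of death.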
Having a theory is not enough. In his first post, Andrew tested this theory by pulling a few numbers and working out a back-of-the-envelope calculation. The goal is to estimate the magnitude of this average-age effect. How much of the observed anomalous trend does it explain? Do we need any other reasons?
Andrew estimated that the average age in the 45-54 age group moved up 0.6 year between 1989 and 2013, the period covered by the original study. From life tables, he found that mortality worsens by about 8 percent per extra year lived. Thus, over the research period, the increasing average age inflates the death rate by about 0.6*8 = 4.8 percent.
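Spelled out, with the compounded version for comparison (same inputs as Andrew's):

```python
growth_per_year_of_age = 0.08   # mortality worsens ~8% per extra year of age
avg_age_shift = 0.6             # years, 1989 -> 2013

# Linear approximation used in the post:
print(avg_age_shift * growth_per_year_of_age * 100)               # 4.8 (percent)
# Compounded version -- essentially the same answer:
print(((1 + growth_per_year_of_age) ** avg_age_shift - 1) * 100)  # ~4.7
```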
This level of increase explains most of the trend shown by that red line in the original chart. Thus, Andrew concludes that the data, after adjusting for age, show that the mortality rate among middle-aged, non-Hispanic, white male Americans has been essentially flat.
The original findings that this group behaves differently from those in other countries, and from the U.S. Hispanic male population, are still interesting.
A number of techniques can be used to control for the shift in the underlying age distribution. Disaggregation of the data is one method: the CDC releases data at the single-year-of-age level, and analyzing the data one age at a time is the next step that Andrew undertook.
One result of this finer analysis is that in the years 1999-2013 (i.e. after dropping the first 10 years of the first chart), even after adjusting for age, there is still about a 4 percent increase in mortality rate among the U.S. middle-aged white non-Hispanics, roughly half of the trend shown in the original chart. In other words, in the shortened time frame, age adjustment explains half of the trend, not all of it.
This has led Deaton, one of the original authors, to say "the overall increase in mortality is not due to failure to age adjust."
This statement is a bit too loose for my liking. First, "is not due to" implies that age aggregation has zero effect when it does explain half of the trend. Second, one should always age-adjust if the underlying age distribution is changing. Even if the age adjustment did not explain anything at all, I'd argue one should still age-adjust. Doing so would help eliminate age aggregation as a potential reason for the observed trend.
One argument against age adjustment is that it involves a lot of work - finding the right data, processing the data, merging the data, etc. But unless one does this work, one can't know how strong the aggregation effect is. And if you have done the homework, why not show it?
Disaggregating all the data is annoying because now you have one chart per single age. The next method for age adjustment is "standardization". This requires creating a reference age distribution, which is then applied to all years. In effect, we are artificially holding the age distribution constant so age could no longer explain any effect.
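Here is a minimal sketch of direct standardization, with made-up age-specific rates and age mixes; the reference distribution weights each single year of age equally:

```python
# Crude vs. age-standardized death rates for the 45-54 group.
# Rates and age mixes below are fabricated for illustration.
import numpy as np

ages = np.arange(45, 55)                 # single years of age, 45-54
rates = 0.004 * 1.08 ** (ages - 45)      # fake rates: +8% per year of age

# Two years with different age mixes inside the group (shares sum to 1):
mix_1999 = np.array([12, 11, 11, 10, 10, 10, 9, 9, 9, 9]) / 100
mix_2013 = np.array([9, 9, 9, 9, 10, 10, 10, 11, 11, 12]) / 100
reference = np.full(10, 1 / 10)          # equal weight per year of age

for label, mix in [("1999", mix_1999), ("2013", mix_2013)]:
    crude = (rates * mix).sum()
    standardized = (rates * reference).sum()
    print(f"{label}: crude={crude:.5f}, standardized={standardized:.5f}")
# The crude rate rises with the older 2013 mix even though no
# age-specific rate changed; the standardized rate stays flat.
```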
This is what Andrew's age-standardized rates look like:
For the age-adjusted line (in black), he chose to "weight each year of age equally". The gap between the lines shows that the effect of the increasing average age within this age group has been growing over time.
Then, something really interesting happens when Andrew split the black line by gender:
So it turns out that middle-aged U.S. white non-Hispanic men are not where the story is. The age-adjusted mortality rate for the corresponding women has steadily climbed between 1999 and 2013!
Next, Andrew looked at the other age groups, and found an even more pronounced trend affecting U.S. non-Hispanic whites in the 35-44 age group.
He also looked at Hispanic whites and African Americans, which I won't repeat here. Even after age adjustment, those groups show trends that are more in line with the rest of the world.
Finally, for those wondering how this is relevant to, say, the business world, let me connect the dots for you.
Imagine that you run a startup that sells an annual subscription. One of your key metrics is the churn rate, defined as the number of subscribers who quit during period t divided by the number of paying subscribers at the start of period t. So a monthly churn rate of 5% means that five percent of the paying subscribers quit the service during that month.
There are two reasons to age-adjust this churn rate. First, the shape of the churn rate curve is not smooth. In particular, almost no one churns during the first 12 months. Second, the startup is growing very rapidly. This means that a lot of new customers are being acquired, and each new customer has up to 12 months in which the churn rate is close to zero.
What happens is that the churn rate fluctuates with the monthly growth rate of the subscription service. As the growth rate fluctuates, the average tenure of the user base fluctuates. The more new customers in their first year, the lower the churn rate.
If the churn rate is not age-adjusted, you don't know if customers are increasingly more dissatisfied with your service, or if you just have slower growth which leads to increasing average tenure!
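Here is a sketch of the analogous tenure adjustment, with invented numbers; the mechanics mirror age standardization above:

```python
# Crude vs. tenure-adjusted churn. Hazards and tenure mixes are
# invented: churn is near zero in months 1-12 (annual contract),
# then a constant 5% per month afterwards.
import numpy as np

months = np.arange(1, 37)                         # tenure, in months
hazard = np.where(months <= 12, 0.002, 0.05)      # churn rate by tenure

# Two snapshots with different tenure mixes (fast vs. slow growth):
fast_growth = np.r_[np.full(12, 6.0), np.full(24, 1.0)]   # many new users
slow_growth = np.r_[np.full(12, 1.0), np.full(24, 2.5)]   # aging user base
reference = np.full(36, 1.0)                              # fixed mix

for label, mix in [("fast growth", fast_growth), ("slow growth", slow_growth)]:
    mix = mix / mix.sum()
    crude = (hazard * mix).sum()
    adjusted = (hazard * reference / reference.sum()).sum()
    print(f"{label}: crude churn={crude:.3f}, tenure-adjusted={adjusted:.3f}")
# Crude churn swings from 1.4% to 4.2% purely because the tenure mix
# changed; the adjusted rate (3.4%) is identical in both scenarios.
```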
In the first two chapters of Numbersense, I discuss how people game statistics, and why gaming is inevitable. I have also written about the placebo effect before. Another article, this one from BBC News, has appeared covering the same topic -- the industry doesn't like the fact that more and more drugs fail to clear the "placebo" hurdle, and it thinks the problem is that the placebo effect is mysteriously increasing over time.
What is new in that BBC News item is the extensive conversations with people who run clinical trials. They reveal a variety of tricks they use to game the numbers.
Read our latest Statbusters column in the Daily Beast here.