Just leaving this quote from ASA President Jessica Utts here (Source: Amstat News Dec 2016):
A few days ago, I was in Vietnam and took a four-hour bus ride from Ha Long Bay to Hanoi. When I arrived, my fitness tracker had given me credit for taking 9,124 steps and climbing 81 flights of stairs during those four hours, even though I only left my seat once during a short rest stop. ...
In the opposite extreme, I once walked the full length of the Atlanta airport with my hand on my four-wheeled suitcase and got no credit for any steps. I've noticed a similar lack of credit when wheeling a grocery cart, and pushing a baby stroller allegedly has the same effect.
Great example of how (seemingly) complete data con the analyst. Imagine the data analysts and "scientific" researchers mining and squeezing every ounce of information out of such data with their algorithmic bags of tricks.
And this is not just fun and games, either.
The health plan where [her friend] works sets rates based on data acquired from employees' personal fitness devices!
What causes trouble is the nature of the data. Much of the data we analyze nowadays are "adapted," collected originally for some other purpose. Here, the fitness trackers were conceived as toys with a potential health benefit, an objective for which the devices need only be marginally accurate. The data then get packaged up and eventually end up in some insurance company's database. An analyst now pulls the data out and has a field day revamping the statistical models by adding a new source of data. The models may even improve a little in the aggregate because the data are somewhat accurate on average.
But at the individual level at which the data get utilized, there are many inaccuracies that bias the models in a discriminatory way. For example, people who walk around pushing baby strollers (i.e. people of a certain age and more likely women) are more likely to have underestimates, which in the insurer's new model are regarded as signals of lower enthusiasm for fitness.
Worse than that, if one knows that the health plan sets rates based on the number of steps taken, one can easily hang the device off one's dog, or devise any number of tactics to fool the machine.
Much of the "smarts" in data analyses occur prior to the analyses. Being relentless in understanding how data were collected, especially when they are collected by third parties with different priorities and incentives, goes a long way. Business managers who buy the end products without inquiring about the data sources do so at their own peril. Lots of money can be lost by investing in counterproductive, Big Data-driven smart-playthings.
The Fitbit-type data are a great example of OCCAM data: Observational, no Controls, seemingly Complete, Adapted and Merged datasets are the norm in the Big Data age - and such data should not be analyzed without a ton of thinking!
One of my favorite statistics-related wisecracks is: the plural of anecdote is not data.
In today's world, the saying should really say: the plural of anecdote is not BIG DATA.
In class this week, we discussed a recent Letter to the Editor of the New England Journal of Medicine, a top journal, featuring a short analysis of weight data coming from a digital scale that, you guessed it, makes users consent to being research subjects by accepting its Terms and Conditions. (link to NEJM paper, covered by New York Times)
The "analysis" is succinctly summarized by this chart:
Their conclusion is that people gain weight around the major holidays.
How did the researchers come up with such a conclusion? They in essence took the data from the Withings scales, removed a lot of the data based on various criteria (explained in this Supplement), and plotted the average weight changes over time. Ok, ok, I hear the complaint that I'm oversimplifying. They also smoothed (and interpolated) the time series and "de-trended" the data by subtracting a "linear trend". The de-trending accomplished nothing, as evidenced by comparing the de-trended chart in the main article to the unadjusted chart in the Supplement.
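For readers wondering what "de-trending" means here: fit a straight line by least squares and subtract it from the series. A minimal sketch on made-up numbers shows why a near-zero fitted slope means de-trending barely moves the series:

```python
# Minimal de-trending sketch on made-up daily weight-change data (in %).
# Fit a straight line by least squares, then subtract it from the series.
n = 12
t = list(range(n))
y = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, 0.0, -0.3, 0.1, 0.2, -0.1, 0.0]

t_bar = sum(t) / n
y_bar = sum(y) / n
slope = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y)) \
        / sum((ti - t_bar) ** 2 for ti in t)
intercept = y_bar - slope * t_bar
detrended = [yi - (intercept + slope * ti) for ti, yi in zip(t, y)]

# When the fitted slope is near zero, the de-trended series is nearly
# identical to the original -- de-trending accomplishes little.
print(f"fitted slope: {slope:+.4f} per day")
```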
Then, the researchers marked out several major holidays - New Year, Christmas, Thanksgiving (U.S.), Easter, and Golden Week (Japan) - and lo and behold, in each case the holidays coincided with a spike in weight gain, ranging from a high of about +0.8% (U.S.) to a low of +0.25% (Easter).
Each peak is an anecdote and the plural of these peaks is BIG DATA!
Why did I say that? Look for July 4th, another important holiday in the States. If this "analysis" is to be believed, July 4th is not a major holiday in the U.S. On average, people tend to lose weight (-0.1%) around Independence Day. There is also no weight change around Labor Day.
In a sense, this chart shows the power of data visualization to shape perception. Labeling those five holidays draws the reader's attention. Not labeling the other major holidays takes them out of the narrative. Part of having numbersense is having the ability and confidence to make our own judgments about the data. Once one notices the glaring problems around July 4th and Labor Day, one can no longer believe the conclusion.
There is also "story time" operating here. The researchers only had data on weight changes. They did not have, nor did they seek, data on food intake. But the whole story is about festive holidays leading to "increased intake of favorite foods" which leads to weight gain. Story time is when you lull readers with a little bit of data, and when they are dozing off, you feed them a huge dose of narrative going much beyond the data.
The real problem here relates to the research process. Traditionally, you come up with a hypothesis, and design an experiment or study to verify the hypothesis. Nowadays, you start with some found data, you look at the data, you notice some features in the data like the five peaks, and only then do you create your hypothesis - at which point running a "confirmation" on the same data confirms nothing, since the hypothesis was suggested by those very data. And yet, researchers will run a t-test, and report p-values (in this weight-change study, the p-values were < 0.005).
Even if it's acceptable to form your hypothesis after peeking at the data, the researchers should then have formulated a regression model with all of the major holidays represented; such a model would provide estimates of the direction and magnitude of each holiday effect, along with its statistical significance.
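Here is a minimal sketch of such a model on made-up data. With an intercept and mutually exclusive holiday dummies, each OLS coefficient reduces to the holiday-window mean minus the non-holiday baseline mean, so the sketch computes the coefficients directly (all dates, windows, and effect sizes below are invented):

```python
import random

random.seed(0)
# Hypothetical daily % weight change for one year (day 0 = Jan 1):
# baseline noise plus bumps around two holidays and a dip around July 4.
days = list(range(365))
def true_effect(d):
    if 355 <= d <= 364: return 0.6   # Christmas / New Year window
    if 325 <= d <= 332: return 0.4   # Thanksgiving window
    if 182 <= d <= 189: return -0.1  # July 4 window
    return 0.0
y = [true_effect(d) + random.gauss(0, 0.3) for d in days]

# Dummy-variable regression, solved by hand: with an intercept and
# mutually exclusive holiday dummies, each OLS coefficient equals the
# holiday-window mean minus the baseline (non-holiday) mean.
windows = {"xmas_newyear": range(355, 365),
           "thanksgiving": range(325, 333),
           "july4": range(182, 190)}
flagged = set().union(*windows.values())
baseline = sum(y[d] for d in days if d not in flagged) / (365 - len(flagged))
effects = {name: sum(y[d] for d in w) / len(w) - baseline
           for name, w in windows.items()}
for name, e in effects.items():
    print(f"{name}: {e:+.2f}%")
```

A model like this puts every holiday on the same footing, so a July 4 dip cannot hide while a Christmas peak gets circled.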
PS. Some will grumble that the analysis is not "big data" since it does not contain a gazillion rows of data, far from it. However, almost all Big Data analyses are done following the blueprint outlined above. Also, I do not define Big Data by its volume. Here is a primer to the OCCAM definition of Big Data. Under the OCCAM framework, the Withings scale data is observational, has no controls, is treated as "complete" by the researchers, and was collected primarily for non-research reasons.
PPS. Those p-values are hilariously tiny. The p-value is a measure of the signal-to-noise ratio in the data. The noise in this dataset is very high. In the Supplement, the researchers outlined an outlier-removal procedure, in which they disclosed that the "allowable variation" is 3% daily, plus an extra 0.1% for each additional day between two observations. Recall that the "signals" had sizes of less than 0.8% - well below the allowable daily noise.
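A back-of-the-envelope calculation shows how a large sample mechanically shrinks p-values even when the per-person noise dwarfs the signal (the user count is hypothetical; the 3% and 0.8% figures come from the discussion above):

```python
import math

# One-sample t-statistic for an average weight change. With a large
# enough sample, even a signal far smaller than the daily noise yields
# an enormous t-statistic, and hence a tiny p-value.
signal = 0.8        # average weight change around a holiday, in %
noise_sd = 3.0      # day-to-day "allowable variation" per person, in %
n_users = 3000      # hypothetical number of scale users

std_error = noise_sd / math.sqrt(n_users)
t_stat = signal / std_error
print(f"standard error: {std_error:.3f}%, t-statistic: {t_stat:.1f}")
```

The t-statistic here lands well above any conventional significance threshold, which is why tiny p-values say little when the sample is large.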
I am traveling so have to make this brief. I will likely come back to these stories in the future to give a longer version of these comments.
I want to react to two news items that came out in the past couple of days.
First, Ben Stiller said that prostate cancer screening (the infamous PSA test) "saved his life". (link) So he is out there singing the praises of the PSA test, which has been disavowed even by its inventor (link), although still routinely used by many physicians.
One can't dispute that the PSA test result led Ben Stiller to discover his cancer, and that he is better off today because of that discovery.
However, imagine the following scenario: I invent my own screening test. The test consists of flipping a coin: heads, you have cancer; tails, you don't. Among the people who came up heads, I can find one who truly has cancer. I saved his life because my test alerted him to this fact. Because I saved this person's life, my test must be really good. (If one anecdote is too few, I could find a handful of people whose lives I have saved.)
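The arithmetic behind this parable takes only a few lines; the population and prevalence figures are hypothetical:

```python
# The coin-flip "screening test": flag half the population at random.
# All figures are hypothetical.
population = 100_000
prevalence = 0.01        # 1% of the population truly has the cancer
p_heads = 0.5            # the coin flags half of everyone

flagged = population * p_heads                       # told "you have cancer"
true_positives = population * prevalence * p_heads   # flagged AND truly sick
ppv = true_positives / flagged                       # positive predictive value

# A test independent of the disease has PPV equal to the prevalence:
# plenty of "saved lives" to showcase, zero diagnostic value.
print(f"flagged: {flagged:.0f}; truly sick among them: {true_positives:.0f}; PPV: {ppv:.2%}")
```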
Second, the FBI tells reporters that the Minnesota mall attacker "withdrew from friends in months before attack." (link)
Imagine that you are trying to predict who will be the next disgruntled attacker. Based on the FBI statement, you want to round up everyone who "withdrew from friends." How many people would that include? How many of them will eventually be attackers?
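To put rough numbers on those questions (every figure below is hypothetical), a quick base-rate sketch:

```python
# Base-rate sketch for "round up everyone who withdrew from friends."
# Every figure here is hypothetical.
adults_screened = 50_000_000
p_withdrew = 0.02            # share who recently withdrew from friends
future_attackers = 10        # generously assume all of them withdrew

flagged = adults_screened * p_withdrew
precision = future_attackers / flagged   # share of flagged who are attackers

print(f"people flagged: {flagged:,.0f}")
print(f"share of flagged who become attackers: {precision:.4%}")
```

Even under generous assumptions, the flag sweeps in a vast number of harmless people for every future attacker it catches.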
Same holds for all the other findings, such as "he converted to Islam recently", and "he posted something hateful on Facebook".
*** It is precisely when we want something badly, like information that saves our lives or prevents terrorist attacks, that we become most susceptible to nonsense data analyses.
A GMO labeling law has arrived in the US, albeit one that has no teeth (link). For those who don't want to click on the link: the law was passed in haste to pre-empt a more stringent Vermont law. The federal law defines GMO narrowly, businesses do not need to put word labels on packages (they can, for example, provide an 800-number), and violators will not be punished.
One of the arguments against GMO labeling is that it is unscientific because (some) scientists are 100% certain that GMO foods are safe. (e.g. this Boston Globe editorial)
Any good scientist knows that scientific "truths" are true only until they are proven otherwise. Science is a continuous process of making hypotheses, and finding data to confirm or reject them. The Bayesian way of thinking is very useful here. Being true is a matter of probability - more confirmatory data increases the probability that a given hypothesis is true.
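That Bayesian updating can be sketched in a few lines using the odds form of Bayes' rule (the prior and likelihood ratio are hypothetical):

```python
# Odds-form Bayesian updating: each batch of confirmatory data multiplies
# the odds that the hypothesis is true. Prior and likelihood ratio are
# hypothetical.
prior = 0.5              # initial probability the hypothesis is true
likelihood_ratio = 3.0   # confirming data is 3x likelier if the hypothesis holds

p = prior
for study in range(1, 5):
    odds = p / (1 - p)
    odds *= likelihood_ratio   # Bayes' rule in odds form
    p = odds / (1 + odds)
    print(f"after confirming study {study}: P(hypothesis) = {p:.3f}")
```

The probability climbs with each confirming study but never reaches 1, which is exactly the point: certainty is not on the menu.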
So why is GMO labeling good science?
In fact, I'd go so far as to say that there is no science without GMO labeling.
How is nutritional science done today? What is the research that tells us coffee is good, butter is good, salt is bad, etc.? Granted, this is a shaky field that has issued lots of false results. But the usual form of analysis goes like this: conduct a large survey of consumers and ask them about their diet (e.g. how much red meat do you eat each week?); obtain information about their health status, either through the same survey, a different survey, or direct measurements if they are part of a research study; then correlate the dietary data and the health data.
Now, imagine you want to study whether eating GMO foods affects your health, either positively or negatively. Your survey question will be something along the lines of "How much GMO food did you eat last week?"
Without GMO labeling, there is no way to conduct such research. This is why GMO labeling is good science. Not labeling GMOs is bad science - in effect, it mandates that no science be done.
ABC News reported that Ricky Williams, the former NFL star, proclaimed himself the holder of "the world record for most times drug tested". (link) He said he was tested 500 times.
During his 11-year career, Williams failed the test four times. So there is one thing we know - the drug-testing regime is not much of a deterrent.
Since the athlete knows when he is juicing or not, he is privy to an estimate of the false negative / false positive rates of the testing regime. If someone keeps rolling the dice, he or she probably knows the tests are not that effective. In my book, I showed using some simple math that almost all juicers would pass these tests.
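The flavor of that simple math can be sketched as follows; the detection rates are hypothetical, and they are small precisely because of the protocol issues discussed next:

```python
# If each test catches a doper with probability p -- small, because tests
# are announced in advance and detection windows are short -- the chance
# of evading n tests in a row is (1 - p) ** n. Rates are hypothetical.
def p_evade(p_detect_per_test, n_tests):
    return (1 - p_detect_per_test) ** n_tests

for p in (0.005, 0.01, 0.05):
    print(f"per-test detection {p:.1%}: "
          f"evades 50 tests with probability {p_evade(p, 50):.1%}")
```

With a tiny per-test detection rate, even a doper tested dozens of times is more likely than not to skate through.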
The number 500 itself is useless. It's all about the protocols. Is the testing really random? When are athletes informed about the test? How is the sample collected? (For example, his wife disclosed that the tester left the sample standing for 45 minutes to "go get stickers" to identify the source of the sample. Sure.) What checks are in place to prevent tricks like using other people's urine, diluting, etc.? Is there off-season testing?
It doesn't matter how many times he is tested. What we should care about are the protocols used in these tests.
Theranos (v): to spin stories that appeal to data while not presenting any data
To be Theranosed is to fall for scammers who tell stories appealing to data but do not present any actual data. This is worse than story time, in which the storyteller starts out with real data but veers off mid-stream into unsubstantiated froth, hoping you and I got carried away by the narrative flow.
Theranos (n): From 2003 to 2016, a company in Palo Alto, Calif., the epicenter of venture capital, founded by Elizabeth Holmes, a 19-year-old Stanford University dropout. It raised over $70 million to develop and market a "revolutionary" blood-testing technology said to require only a finger-prick of blood. The company grew its valuation to $9 billion without ever publishing any scientific data in a peer-reviewed medical journal. It turned out that the new technology was used in only 12 of the 200 tests on its menu, meaning that the business was based on selling old technology at bargain-basement prices subsidized by venture-capital money. Further, it emerged that the new technology was not accurate, that it has been shelved since last year, and that in some cases when old technology was used, lab personnel improperly handled the machines - all of which eventually led to a blanket retraction of two full years' worth of test results. The company claimed that these results have been "corrected" in the last few weeks; it is unclear what "correction" means when the blood was drawn from patients up to two years ago. The company is still in business, and Walgreens, one of its most prominent partners, continues its commercial relations with the company. For many years, the business and technology press issued countless glowing reviews of the company (see this epic list just covering 2013-2015). Until 2014, the company's board consisted entirely of politicians, former cabinet members, and military leaders. All these individuals have been Theranosed.
The Wall Street Journal has done an exemplary job following this case, and deserves a Pulitzer for this effort. The latest revelation relating to the full-scale retraction is here.
The fad of standing while working may die hard, but science is catching up with it.
The idea that standing at work will make one healthier has always been a tough one to believe. It requires a series of premises:
Using a standing desk increases the amount of standing
Standing longer improves one's health
The health improvement is measurable using a well-defined metric
The incremental standing is of sufficient amount to effect an improvement in health
No other factors are required to attain the said improvement
No other factors offset the said improvement (e.g. standing more may expend more energy, causing one to snack more)
Now, the Cochrane foundation has looked at the evidence, and found it lacking (link). The Cochrane researchers write:
The quality of evidence was very low to low for most interventions mainly because studies were very poorly designed and because they had very few participants. We conclude that at present there is very low quality evidence that sit-stand desks can reduce sitting at work at the short term. There is no evidence for other types of interventions. We need research to assess the effectiveness of different types of interventions for decreasing sitting at workplaces in the long term.
Seems like they haven't passed the first hurdle - does using standing desks actually reduce the amount of sitting at work?
In our latest Statbusters column for the Daily Beast, we examine the research behind the claim that "standing reduces odds of obesity". Especially at younger companies, it is trendy to work at standing desks because of findings like this. We find a variety of statistical issues calling for better studies.
For example, the observational dataset used provides no clue as to whether sitting causes obesity or obesity leads to more sitting. Further, as explained in the column, what you measure, and even more importantly, what you don't measure makes and breaks the analysis.
These lessons are highly relevant to anyone working with "big data" studies.
One of the secrets of great data analysis is thoughtful data collection. Great data collection is necessary but not sufficient for great data analysis.
I recently had the unfortunate need to select a new doctor. Every time I have had to do this, it has been an exercise in frustration and desperation. And after wasting hours and hours perusing the "data" on doctors, inevitably I give up and just throw a dart at the wall.
Every medical insurer points you to their extensive online resource called the doctors' directory. Apparently, we are supposed to pick a doctor from this directory. There is a lot of data in this directory. A casual search results in hundreds of matches. What are the data available for me to narrow down my selection?
Which school did the doctor graduate from?
When did the doctor graduate?
What was the name of the degree?
How many languages does the doctor speak?
What hospitals is the doctor affiliated with?
Which medical group does the doctor operate within (if any)?
What are the fields of specialization?
What is the address of the office?
Conspicuously absent are any data that measure the quality or outcomes of the doctor's care. There is neither a quantitative nor a qualitative measure of quality or patient satisfaction. We don't know anything about wait times. It is very challenging even to learn how big the doctor's practice is.
The data that are provided are essentially just that--data that convey almost no information. I don't think which school the doctor went to matters, nor the name of the degree. Age might be somewhat useful as it indicates amount of experience but the year of graduation is often suppressed. Ethnicity is perhaps useful but it is not present; in some cases, the name reveals this information but not usually.
Hospital affiliations could have been useful if doctors were not each affiliated with many hospitals. I asked a friend of mine who is a doctor whether some hospitals are more "selective," the way some universities are, and he tells me hospital affiliation conveys no information.
Fields of specialization are also useless to me, as I am not looking for a specialist.
Languages spoken is an oddity. If I interpret the data literally, it seems that American doctors have an obsession with learning foreign languages. It is incredible how many of them speak three, four or more languages, including relatively exotic ones. Chances are these doctors have people in the office who speak those foreign languages. In any case, since my primary language is English, I have no inclination to select doctors based on what other languages they (or their staff) speak.
So the only piece of data I can use is the address. Is the doctor close to my home or work?
And that seems to be a poor way of selecting doctors.
PS. While writing this, I am reminded of a continuous stream of useless real-time data: the signal bars on our cellphones. The number of bars and the speed at which a webpage loads are much less correlated than one would expect.
It's okay if we treat the data as a joke. But somewhere in the world, some data scientists are using the data to do serious work.