Just leaving this quote from ASA President Jessica Utts here (Source: Amstat News Dec 2016):
A few days ago, I was in Vietnam and took a four-hour bus ride from Ha Long Bay to Hanoi. When I arrived, my fitness tracker had given me credit for taking 9,124 steps and climbing 81 flights of stairs during those four hours, even though I only left my seat once during a short rest stop. ...
In the opposite extreme, I once walked the full length of the Atlanta airport with my hand on my four-wheeled suitcase and got no credit for any steps. I've noticed a similar lack of credit when wheeling a grocery cart, and pushing a baby stroller allegedly has the same effect.
Great example of how (seemingly) complete data con the analyst. Imagine the data analysts and "scientific" researchers mining and squeezing every ounce of information out of such data with their algorithmic bags of tricks.
And this is not just fun and games, either.
The health plan where [her friend] works sets rates based on data acquired from employees' personal fitness devices!
What causes trouble is the nature of the data. Much of the data we analyze nowadays are "adapted": collected originally for some other purpose. Here, the fitness trackers were conceived as toys with a potential health benefit, an objective for which the devices need only be marginally accurate. The data then get packaged up and eventually land in some insurance company's database. An analyst pulls the data out and has a field day revamping the statistical models with this new data source. The models may even improve a little in the aggregate, because the data are somewhat accurate on average.
But at the individual level, where the data actually get used, there are many inaccuracies that bias the models in a discriminatory way. For example, people who walk around pushing baby strollers (i.e., people of a certain age, and more likely women) are more likely to have their steps undercounted, which the insurer's new model reads as a signal of lower enthusiasm for fitness.
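The aggregate-vs-individual point is worth making concrete. Here is a minimal sketch of it in Python, with entirely invented numbers (the 2% overcount, the 40% stroller undercount, and the 90/10 population split are all assumptions for illustration, not anything from the Utts quote): the recorded step counts look reasonably accurate on average, yet one subgroup is uniformly penalized.

```python
import random

random.seed(0)

# Hypothetical illustration only -- all rates and group sizes are invented.
# Group A (900 "ordinary walkers"): tracker overcounts slightly (+2%).
# Group B (100 "stroller pushers"): tracker drops ~40% of their steps.
true_a = [random.gauss(8000, 1500) for _ in range(900)]
true_b = [random.gauss(8000, 1500) for _ in range(100)]
rec_a = [s * 1.02 for s in true_a]  # slight overcount
rec_b = [s * 0.60 for s in true_b]  # systematic undercount

true_all = true_a + true_b
rec_all = rec_a + rec_b

mean_true = sum(true_all) / len(true_all)
mean_rec = sum(rec_all) / len(rec_all)

# In aggregate, the recorded mean is only a couple of percent off...
print(f"overall true mean:     {mean_true:.0f} steps/day")
print(f"overall recorded mean: {mean_rec:.0f} steps/day")

# ...but every individual in group B is badly shortchanged.
mean_err_b = sum(r - t for r, t in zip(rec_b, true_b)) / len(true_b)
print(f"group B mean error:    {mean_err_b:.0f} steps/day")
```

A model validated against the aggregate would pass its sanity checks while still mispricing everyone in group B, which is exactly the discriminatory bias described above.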
Worse than that, if one knows that the health plan sets rates based on the number of steps taken, one can easily hang the device off one's dog, or design any number of tactics to fool the machine.
Much of the "smarts" in data analysis occurs before the analysis. Being relentless about understanding how data were collected, especially when they are collected by third parties with different priorities and incentives, goes a long way. Business managers who buy the end products without inquiring about the data sources do so at their own peril. Lots of money can be lost by investing in counterproductive, Big Data-driven smart-playthings.
Fitbit-type data are a great example of OCCAM data: Observational, no Controls, seemingly Complete, Adapted, and Merged datasets that are the norm in the Big Data age - and such data should not be analyzed without a ton of thinking!