The article (link) in Science about the failure of Google Flu Trends is important for many reasons. One is the inexplicable silence in the Big Data community about this little big problem: it's not as if this is breaking news -- it was known as early as 2009 that Flu Trends completely missed the swine flu pandemic (link), underestimating it by 50%, and then in 2013, Nature reported that Flu Trends overestimated a spike in influenza by 50%.
The second reason why this article is important is the additional analysis they conducted (there is extensive supplementary material available from Science). The highlights are:
- The reported over-estimation in Oct 2013 was not a one-time event: in fact, Flu Trends has over-estimated flu prevalence for 100 out of 108 weeks since August 2011 (ouch!).
- A simple model of projecting CDC data on a two-week lag would have done at least as well as Flu Trends, and no "Big Data" is needed for that.
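To make the second highlight concrete, here is a minimal sketch of what a lag-based baseline could look like. It is not the model from the Science article, whose exact specification I am not reproducing here. The function names (`lagged_forecast`, `mean_absolute_error`, `weeks_overestimated`) and all the numbers are made up for illustration, so the output says nothing about the real Flu Trends or CDC series.

```python
from statistics import mean


def lagged_forecast(series, lag=2):
    """Predict each week's value as the value observed `lag` weeks earlier.

    The first `lag` weeks have no prediction and are returned as None.
    """
    if lag < 1:
        raise ValueError("lag must be at least 1")
    return [None] * lag + list(series[:-lag])


def mean_absolute_error(actual, predicted):
    """Average absolute gap, skipping weeks that have no prediction."""
    return mean(abs(a - p) for a, p in zip(actual, predicted) if p is not None)


def weeks_overestimated(actual, predicted):
    """Count the weeks in which the prediction exceeds the actual value."""
    return sum(1 for a, p in zip(actual, predicted) if p is not None and p > a)


# Hypothetical weekly flu rates (% of doctor visits). NOT real CDC data.
cdc = [1.0, 1.2, 1.5, 2.0, 2.6, 3.1, 3.0, 2.4, 1.8, 1.3]

# Hypothetical competing estimate that runs 50% high every week.
inflated = [round(x * 1.5, 2) for x in cdc]

# The baseline needs nothing beyond the CDC series itself.
baseline = lagged_forecast(cdc, lag=2)

# Compare both estimates on the same weeks (those with a baseline value).
print("baseline MAE:", mean_absolute_error(cdc, baseline))
print("inflated MAE:", mean_absolute_error(cdc[2:], inflated[2:]))
print("weeks overestimated:", weeks_overestimated(cdc, inflated), "of", len(cdc))
```

The design point is that the baseline uses only the official series shifted by two weeks, so any fancier estimate has to beat it on the same weeks to justify its extra complexity.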
The researchers further report the difficulty of assessing and replicating what Google researchers did because the information they have released about their algorithm is both incomplete and inaccurate. In reserved, professional language, they noted: "Oddly, the few search terms offered in the papers [by Google researchers explaining their algorithm] do not seem to be strongly related with either GFT or the CDC data--we surmise that the authors felt an unarticulated need to cloak the actual search terms identified."
Well, either the Google researchers made up data in the paper they published and did not disclose this fact, which amounts to fraud, or they didn't make up the data and the model is so inaccurate that the most predictive search terms from a few years ago are no longer predictive. They owe us an explanation.
People who attended my book talks in the last few months, and my students, will not be surprised by the current coverage, as Flu Trends is one of many high-profile Big Data "success" stories for which readers will find very little documented evidence of success. It is as if data-driven decision-making is good for others but not for ourselves. So, my hat is off to these researchers for the courage to put this issue into the public discourse.
It is clear that more data do not lead to better analysis. In fact, I argue at my public events that the revolution in data analytics is about five things, which I group under the acronym OCCAM. See the slide below.
Ask yourself what the key differences are between the datasets that underlie so-called Big Data studies and those we were using, say, 5-10 years ago. For me, the most important differences are that the data are collected without prior design, in an observational manner, usually by third parties, and often for a purpose different from our own ("adapted"). Further, different datasets are merged, exacerbating the issue of lack of definition and misaligned objectives. Controls are typically unavailable, and worse, analysis proceeds without an attempt to manufacture pseudo-controls.
Finally, in some cases, the data are "seemingly complete." This is the so-called "N=All" condition. The danger of this N=All talk is that its proponents confuse assumption for fact. We have seen this story before, in economics: assuming complete data is no different from assuming perfect information. Assuming something to be true doesn't make it true.
Before closing, I add a few words about why defining Big Data using a minimum size of datasets is absurd.
First, the problem of Big Data as defined by the likes of McKinsey is fundamentally unsolvable. If the current threshold is 100 terabytes, and we improve our processing power to tackle datasets of that size, then this definition calls for resetting the threshold of Big Data to 1000 terabytes, ad infinitum.
Second, some problem domains like education and medicine in which the units of measurement (schools, hospitals, students, patients) are upper bounded can never have a Big Data problem. Personal analytics is not a Big Data problem since no single person can produce that much data. And yet, some of the most exciting developments in data analytics are expected to come from those fields.
Third, no consumer will ever care about Big Data since no consumer will be exposed to terabytes or petabytes of data. Consumers (or citizens) will be impacted by this data revolution through better and more data analyses, but not by more data. Knowing the difference between those two things is fundamental to understanding this phenomenon.
Also of interest is my article on Big Data and Big Business, published in Significance last year. (link)
PS. Slate linked to this post. I'm not sure where I said "such 'big data' analyses are currently so abysmal as to be effectively useless." I'm saying there are lots of exciting things happening in data analytics right now but assuming that more data will solve all problems is where we are failing, and I offer an alternative way of framing Big Data which can be more productive.