There is now some serious soul-searching in the mainstream media about their (previously) breathless coverage of the Big Data revolution. I am collecting some useful links here for those interested in learning more.
Here's my Harvard Business Review article, in which I discussed the Science paper disclosing that Google Flu Trends, that key exhibit of the Big Data lobby, has systematically over-estimated flu activity for 100 out of the last 108 weeks. I also wrote about the OCCAM framework, which I find useful for thinking about the "Big Data" datasets we analyze today versus the more traditional datasets of the past.
The HBR article caught the attention of various outlets, including the New York Times and ABC News.
Slate was probably the earliest to react; it noticed a post on this blog that was the precursor to the HBR article.
Readers who are specifically interested in GFT should read the source materials themselves, which are quite accessible. Start with the Science paper. After that, read the original research article by the Google team, hosted at google.org (click on the PDF link in the blue box at the bottom of the page). There are some bold claims in this paper, as well as caveats. The authors seemed concerned about "false alerts" at the time: searches driven by news events rather than by illness. (For those statistically inclined, the underlying model involves only 1,152 data points--128 weekly aggregates in each of nine regions--but a search through 450 million simple logistic models, not only to define which search terms are important but also to determine how many search terms to include in the final regression.)
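To make the scale of that search concrete, here is a minimal sketch on toy data. It is entirely my own construction--random arrays stand in for the query-share and CDC series, and plain correlation stands in for the fit statistic--but the two-stage shape matches what the paper describes: score each candidate term on its own, then grow the regression term by term to find the best size.

```python
# Toy sketch of GFT-style query selection (my own construction, not Google's code).
import numpy as np

rng = np.random.default_rng(0)
n_weeks, n_terms = 128, 10_000            # the real search scored tens of millions of terms
cdc = rng.random(n_weeks)                 # stand-in for weekly CDC ILI percentages
queries = rng.random((n_terms, n_weeks))  # stand-in for weekly query-share series

# Stage 1: fit one simple model per candidate term; correlation stands in for fit.
scores = np.array([np.corrcoef(q, cdc)[0, 1] for q in queries])
ranked = np.argsort(-scores)              # best-fitting terms first

# Stage 2: grow the model term by term, keeping the size that fits best.
best_n, best_fit = 1, -np.inf
for n in range(1, 101):
    combined = queries[ranked[:n]].mean(axis=0)  # aggregate the top-n terms
    fit = np.corrcoef(combined, cdc)[0, 1]
    if fit > best_fit:
        best_n, best_fit = n, fit
print(best_n, round(best_fit, 3))
```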
Then read the article by Cook et al., which covers an update to the model made after the 2009 season, when GFT totally missed the pH1N1 swine flu epidemic. Notice that this Big Miss is the opposite error to the "false alert" problem. (See Chapter 4 of Numbers Rule Your World for a thorough discussion of different types of prediction errors and how to think about them.) From the charts in the Cook article, you can see that in the run-up to the Big Miss, GFT systematically under-estimated flu activity week after week.
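If the distinction between the two error types sounds abstract, a few lines of code make it concrete. The numbers below are invented for illustration only: given weekly predictions and CDC actuals, count how many weeks fall on the over-estimation (false alert) side versus the under-estimation (Big Miss) side.

```python
# Hypothetical illustration of the two error types; all numbers are made up.
def error_profile(predicted, actual):
    over = sum(p > a for p, a in zip(predicted, actual))   # false-alert side
    under = sum(p < a for p, a in zip(predicted, actual))  # big-miss side
    return over, under

predicted = [2.1, 3.5, 4.0, 2.8]  # made-up GFT-style estimates (% ILI)
actual    = [1.8, 2.9, 4.2, 2.5]  # made-up CDC actuals
print(error_profile(predicted, actual))  # -> (3, 1): mostly over-estimation
```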
The overhaul was drastic. The search-term topics that accounted for 70% of the original model were reduced in importance to 6%, while two other topics that originally counted for 8% were inflated to 69% in the updated model. This dramatically improved the "fit" statistic (RMSE) for the "first phase" of the Big Miss, from 0.008 to 0.001.
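For readers who want to see the statistic itself, here is RMSE in a few lines of Python. The numbers are invented, chosen only to reproduce the 0.008 magnitude quoted above; they are not the actual GFT series.

```python
# Root mean squared error, the "fit" statistic quoted above.
import math

def rmse(predicted, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Invented two-week series in which each week's error is 0.008:
print(round(rmse([0.010, 0.020], [0.018, 0.012]), 3))  # -> 0.008
```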
Next, there is Butler's article for Nature (Feb 2013), which precedes the Science article but was the first to point out the over-estimation problem for the 2012 flu season. One possibility is that the model update described above over-compensated for the Big Miss, making it more susceptible to the False Alert.
Other media coverage of Google Flu Trends includes the Guardian (which focuses on the need to understand causality), CRN, and the Economist (which talks mostly about Twitter data, which is much more problematic than Google search data).
***
Tim Harford's piece for the Financial Times is probably the most educational of those revisiting Big Data. (When I wrote this line, the FT link wasn't working; the title of the article is "Big Data: Are we making a big mistake?" if you need to find a different link.) His is the longest, covers a lot of ground, and has great examples, including one of my own. Highly recommended.
***
One of the slogans of the Big Data industry (of which I'm a part) is the push toward "evidence-based" decision-making in place of "gut feelings" or "instincts". Until now, I'm afraid, there has been precious little "evidence" presented to support the assertions of the universal, revolutionary goodness of Big Data (try searching for a quantitative assessment of Big Data projects). I hope we are witnessing the birth of evidence-based decision-making inside the industry of dishing out evidence-based advice.
The reason I put out the OCCAM framework is to steer our community toward a more constructive approach to tackling "Big Data" problems. It requires a fundamental shift in how we define the problem. I offer a moderately more technical take on some of the statistical challenges in an article published in Significance earlier this year; that article discusses six technical challenges where we need substantial progress.
Statisticians sometimes dismiss these as "old news," claiming that the same problems exist in smaller datasets and are well known. A recent example is Jon Schwabish's tweet saying that this discussion induces a "yawn". This reaction feels a bit like Fermat writing in the margin, claiming he had a proof. The rest of the world doesn't want to wait 358 years to find out what goods these statisticians are hiding.
In my view, there has been some interesting work, but nothing that settles the debates. If we had great solutions, we wouldn't be discussing these same problems today.
***
Back to flu prediction. It is well worth pursuing!
It's a well-defined, self-contained problem that has social benefits and whose results can be easily measured. We should be grateful that the Googlers spent time working on it. It's a problem I'd love to work on if I had time and resources on my hands.
The researchers also pioneered this type of research using search-term data. This is highly significant, and the data represent a perfect example of what I call OCCAM data: the data is purely Observational (related to what Harford calls "found data" or what Dean Eckles calls "data exhaust"); it has no Controls; it is seemingly Complete; it wasn't collected for the purpose of predicting flu trends, that is, it is Adapted from other uses; and the search data was Merged with the CDC data (the matching of states and regions, and of weeks, was not exact, as you can tell from the original research article).
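To see why the Merged part matters, here is a hedged sketch of that join; the frame and column names are mine, not Google's or the CDC's. When regions or weeks don't line up exactly, an outer join exposes the gaps as missing values, and the modeler has to decide how to resolve them.

```python
# Toy illustration of merging search aggregates with CDC data; names are mine.
import pandas as pd

search = pd.DataFrame({
    "region": ["HHS Region 1", "HHS Region 1"],
    "week": ["2008-01-06", "2008-01-13"],
    "query_share": [0.012, 0.015],
})
cdc = pd.DataFrame({
    "region": ["HHS Region 1", "HHS Region 1"],
    "week": ["2008-01-06", "2008-01-20"],  # note the week that doesn't line up
    "ili_pct": [2.3, 3.1],
})

merged = search.merge(cdc, on=["region", "week"], how="outer")
print(merged)  # rows with NaN show where the two sources fail to align
```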
The several published versions of the predictive models are clearly failures, but anyone in this business knows that model building is an iterative process. One can learn from these mistakes. I happen to think they need to wipe the slate clean and use an entirely different approach. It's a small price to pay if there is reward down the road.
I sincerely hope that this coverage will lead to improved modeling and analytical techniques rather than a retrenchment.
If you find other related links, please place them in the comments.