The current issue of Significance includes an article by me on the "pending marriage between statistics and Big Data". If you are a member of either the American Statistical Association or the Royal Statistical Society, you should be able to access the article via this link. If you don't belong to one of those, and you have a smartphone or tablet, you can download the Significance app -- because 2013 is the International Year of Statistics, this app is free for all right now.
If you don't belong to the ASA or the RSS, and do not have a smartphone or tablet, I am able to print an excerpt of my article below. The excerpt looks at online experimentation (aka A/B testing), pointing out where current practice runs into trouble and how statisticians can play a role. In the article, I identify several other areas in which statisticians have the potential to help move the Big Data field forward. It won't be easy because, fundamentally, the way computer scientists approach data is at odds with the way statisticians approach data.
David Walker presents the other side of the debate in the same issue.
More effective experiments
In 2012, Wired magazine eulogised the “A/B test”, declaring it to be “the technology that’s changing the rules of business”. The A/B test is known to every introduction to statistics student as the t-test of two means. Yes, the t-test is traced back to Gosset who developed it for the Guinness brewery in the 1900s. In the contemporary setting, a website delivers at random one of two pages to visitors, and measures if one page performs better than the other page, typically in terms of clickthroughs.
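For readers who have not met the test of two means, here is a minimal sketch of how it applies to clickthrough data. The function name and the visitor counts are made up for illustration, and the p-value uses the normal approximation to the t distribution, which is only reasonable when each page gets many visitors.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_test(clicks_a, visitors_a, clicks_b, visitors_b):
    """Compare the clickthrough rates of pages A and B.

    Returns (t_statistic, two_sided_p_value). All names are illustrative.
    """
    rate_a = clicks_a / visitors_a
    rate_b = clicks_b / visitors_b

    # Sample variance of 0/1 click data, with the usual n - 1 denominator.
    var_a = rate_a * (1 - rate_a) * visitors_a / (visitors_a - 1)
    var_b = rate_b * (1 - rate_b) * visitors_b / (visitors_b - 1)

    # Standard error of the difference between the two means.
    std_err = sqrt(var_a / visitors_a + var_b / visitors_b)
    t_stat = (rate_b - rate_a) / std_err

    # Assumption: samples are large, so the normal curve stands in for
    # the t distribution when converting the statistic to a p-value.
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value


# Made-up counts: 10,000 visitors per page.
t_stat, p_value = two_sample_test(500, 10_000, 560, 10_000)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

With these invented counts, page B has the higher clickthrough rate (5.6% against 5.0%), yet the p-value lands slightly above the conventional 0.05 cutoff.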
Brian Christian, the author of the Wired article, asked: “Could the scientific rigor of Google’s A/B ethos start making waves outside the web? Is it possible to A/B the offline world?” Any statistician will answer that many industries long ago implemented randomised, controlled experiments, and did so before the web existed, and at a higher level of sophistication than at most web companies. For example, direct marketers routinely run statistical tests to optimise their marketing vehicles such as catalogues and direct mail.
One of Christian's talking points holds: the web is indeed a nice laboratory in which tests can be executed at scale, and relatively painlessly (though see the section on randomisation below) [Ed: not excerpted]. And yet, in the A/B testing universe, few people are aware of the huge literature on statistical testing, or of [Sir Ronald] Fisher’s monumental contributions. This field is ripe for collaboration between computer scientists and statisticians. A quick flip through the Wired article reveals numerous fallacies about t-tests: fallacies of certainty, of automation, and of false positives among them.
The fallacy of certainty. Again and again, Christian stresses the certainty of test results, using words such as “incontrovertible”. Data from tests end all subjective arguments, we are told. How is it possible to have such definitive results when, as these web businesses claim, they run thousands of tests per year? One expects that most tweaks, such as changing the width of a border on a web page, have inconclusive results. It turns out that most practitioners of A/B tests use point estimates. If the test fails to achieve significance, the variation with the best performance is declared the “directional” winner. Sometimes, a test is run for such a length of time that tiny effects display significance by virtue of sample size.
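A confidence interval makes both problems visible in a way a point estimate cannot. The sketch below uses the standard normal-approximation interval for a difference of two proportions. The function name and all counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def difference_interval(clicks_a, visitors_a, clicks_b, visitors_b, level=0.95):
    """Normal-approximation confidence interval for rate_b - rate_a."""
    rate_a = clicks_a / visitors_a
    rate_b = clicks_b / visitors_b
    std_err = sqrt(rate_a * (1 - rate_a) / visitors_a
                   + rate_b * (1 - rate_b) / visitors_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = rate_b - rate_a
    return diff - z * std_err, diff + z * std_err


# Made-up counts. Case 1: B looks better on the point estimate, but the
# interval still contains zero, so "directional winner" claims too much.
print(difference_interval(500, 10_000, 560, 10_000))

# Case 2: a lift of 0.07 percentage points, run on a million visitors per
# page. The interval excludes zero, yet the effect may be too small to matter.
print(difference_interval(50_000, 1_000_000, 50_700, 1_000_000))
```

In the first case the data cannot rule out that B is slightly worse than A. In the second, significance is mostly a product of the sample size, and whether such a lift is worth acting on is a business question rather than a statistical one.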
The fallacy of automation. In Christian’s world, the summit of A/B testing is “automating the whole process of adjudicating the test, so that the software, when it finds statistical significance, simply diverts all traffic to the better-performing option – no human oversight necessary”. Twinned with this is the fallacy of real time. One of the deepest insights in statistics is the law of large numbers, which requires a sufficient sample size in order to detect a signal to a given precision. Real-time decisions imply undersized samples, and huge error bars. Furthermore, such decisions are biased, as Microsoft scientists explained in an important paper [PDF] on the “novelty effect” and the “primacy effect”, among other things. False positive results abound in small samples, turning statistical testing into witchcraft.
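Two rough sketches may help here. The first uses the textbook normal-approximation formula for the number of visitors each page needs before a given lift can be detected. The second simulates tests in which the two pages are identical, and compares software that stops at the first significant reading with a single test at the planned end. Every name, rate and batch size is an assumption chosen for illustration, not taken from the Wired article or the Microsoft paper.

```python
import random
from math import ceil, sqrt
from statistics import NormalDist


def visitors_needed(base_rate, lift, alpha=0.05, power=0.8):
    """Visitors per page needed to detect an absolute lift in clickthrough rate."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    new_rate = base_rate + lift
    variance = base_rate * (1 - base_rate) + new_rate * (1 - new_rate)
    return ceil((z_alpha + z_power) ** 2 * variance / lift ** 2)


def is_significant(clicks_a, clicks_b, visitors, alpha=0.05):
    """Two-sided z-test on two pages that each received `visitors` visitors."""
    rate_a, rate_b = clicks_a / visitors, clicks_b / visitors
    std_err = sqrt((rate_a * (1 - rate_a) + rate_b * (1 - rate_b)) / visitors)
    if std_err == 0:
        return False
    z = abs(rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(z)) < alpha


def false_positive_rates(runs=400, looks=10, batch=200, rate=0.1, seed=1):
    """Simulate A/A tests (no real difference) under two stopping rules.

    Returns (rate when stopping at the first significant look,
             rate when testing once at the planned end).
    """
    rng = random.Random(seed)
    stopped_early = flagged_at_end = 0
    for _ in range(runs):
        clicks_a = clicks_b = 0
        seen_significant = False
        for look in range(1, looks + 1):
            # Each look adds a fresh batch of visitors to both pages.
            clicks_a += sum(rng.random() < rate for _ in range(batch))
            clicks_b += sum(rng.random() < rate for _ in range(batch))
            significant = is_significant(clicks_a, clicks_b, look * batch)
            seen_significant = seen_significant or significant
        stopped_early += seen_significant
        flagged_at_end += significant  # result of the final look only
    return stopped_early / runs, flagged_at_end / runs


print(visitors_needed(0.05, 0.005))
print(false_positive_rates())
```

Under these assumptions, detecting a lift from 5% to 5.5% takes about 31,000 visitors per page, far more than a real-time decision would wait for. In the simulation, the stop-at-first-significance rule can never flag fewer tests than the single final test, because the final look is one of the looks it checks.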
The plague of multiple comparisons. In the new world of “choose everything”, that is to say, “see what sticks”, Wired reports that “the percentage of users getting some kind of tweak may well approach 100 percent”. Statisticians worry about false positive findings when so many tests are run at the same time. Given the complexity of correcting for multiple comparisons, it is not surprising that the software tools available to conduct A/B tests completely ignore this issue.
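To give a sense of scale, the sketch below computes the chance of at least one false positive when many tests run at once, assuming the tests are independent and none of the tweaks has any real effect. It then applies two of the simpler textbook corrections, Bonferroni and Holm. The p-values are made up, and these two methods are only examples of what a tool could do, not a claim about what any existing A/B testing product does.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent true-null tests."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni(p_values, alpha=0.05):
    """Flag each test whose p-value clears alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm's step-down procedure; returns flags in the original order."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    rejected = [False] * len(p_values)
    for rank, index in enumerate(order):
        if p_values[index] > alpha / (len(p_values) - rank):
            break  # once one test fails, all larger p-values fail too
        rejected[index] = True
    return rejected


# Made-up p-values from five simultaneous tests.
p_values = [0.04, 0.001, 0.2, 0.012, 0.03]
print(familywise_error_rate(20))
print(sum(p <= 0.05 for p in p_values), "significant without correction")
print(sum(bonferroni(p_values)), "after Bonferroni")
print(sum(holm(p_values)), "after Holm")
```

With twenty independent tests at the 5% level, the chance of at least one spurious winner is roughly 64%. On the five invented p-values, four tests look significant uncorrected, one survives Bonferroni and two survive Holm.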
We should be excited that randomised, controlled tests have been embraced by the web community. Regrettably, only a few practitioners, such as Ron Kohavi’s team at Microsoft, and Randall Lewis and Justin Rao (in their work [PDF] at Yahoo!), have reflected on the practical challenges of this enterprise. Statisticians are well equipped to make important contributions to how experiments are designed, executed and analysed.
Cathy O'Neil at MathBabe has some related thoughts. In one post, she makes a distinction between business analytics (statistics) and Big Data. In another post, it's not clear that she thinks Big Data, as defined in the first post, makes sense.