Previously, I examined MailChimp's analysis of the impact of Gmail's new tabbing feature and noted a potential data issue (link). In Part 2, I look at the MailChimp study as a typical example of "Big Data" studies.
The Gmail study has several features that are hallmarks of Big Data. First and foremost, the analyst boasts of a staggering amount of data ("29 billion emails, 4.9 billion opens, 4.2 billion clicks, and 43.5 million unsubscribes"). Because of such "wealth", the analyst feels empowered to ignore statistical variability. This is sometimes known as "N = All", the illusion that the analyst has collected all possible data points. In my talks, I like to call this "seemingly complete" data, a label that calls it what it is. The assumption of "N = All" is akin to the assumption of "perfect information" or "completeness" in economic models, and we know where that led us.
Take this chart from MailChimp's post:
The differences in click rates range from 0.3 to 2.5 percentage points among the top four email providers. This is against a base rate in the 85% range (see Part 1 for why this base rate is probably miscalculated). So we are talking about tiny percentage differences. The provider-to-provider gaps are smaller still: for example, Gmail and Hotmail differ by less than one percentage point.
Now imagine you are tracking the day-to-day changes in click rate for a single email provider (say Gmail, which according to the other chart clocks in at approx. 84%). Do you expect Gmail's click rate to be 84% day after day with no fluctuations? Probably not.
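To make this concrete, here is a minimal simulation, with entirely invented numbers, of what day-to-day tracking might look like. The assumption (mine, not MailChimp's) is that the underlying click rate drifts a little from day to day with campaign mix and day-of-week effects, on top of ordinary sampling error:

```python
import math
import random

random.seed(1)

# Hypothetical illustration: all figures below are made up.
BASE_RATE = 0.84          # assumed long-run click rate for one provider
DAILY_VOLUME = 5_000_000  # assumed emails tracked per day

rates = []
for day in range(30):
    # The "true" rate drifts by up to +/- 1 percentage point each day.
    true_rate = BASE_RATE + random.uniform(-0.01, 0.01)
    # Normal approximation to the binomial sampling error at this volume.
    se = math.sqrt(true_rate * (1 - true_rate) / DAILY_VOLUME)
    rates.append(true_rate + random.gauss(0, se))

spread = max(rates) - min(rates)
print(f"30-day spread in observed click rate: {spread:.4f}")
```

Even under these mild assumptions, the month's high-to-low spread is on the order of the provider-to-provider gaps in the chart, which is exactly why a sub-1-point gap between two providers should not be read as a fact about the providers.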
Here's the problem: the "seemingly complete" paradigm amounts to a claim that there is no noise in the data. And that claim is the entire basis for asserting that email provider A has a worse click rate than email provider B.
I see this as the "moral hazard" problem of Big Data. The comfort of abundance causes us to let our guard down; when we should see fewer false results, we end up seeing many more!
I will quickly run through a few other ingredients of the typical Big Data study. The data is purely observational, has no control group, and is co-opted from data collected for other purposes. Think about other Big Data studies, and you'll notice that they fit most of these criteria.
The MailChimp analyst, to his credit, attempted to manufacture a control group by appealing to other email providers that do not have the Gmail-style tabs. This is a simple and useful control, but readers using numbersense would want to investigate these questions:
- Are there differences in the types of people who have Gmail accounts and those who have AOL accounts?
- Are there differences in the types of email marketers who spend on Gmail and those who spend on Hotmail?
- During the period under study, did any of the other email providers make changes to their email service, apart from Gmail introducing the tabs?
- Did email marketers change their behavior on the Gmail platform in reaction to the changes introduced by Google, changes they presumably would not have replicated on other email platforms?
- Did some email users migrate from Gmail to the other email providers during the period under study?
Any of these factors, and more, can explain a gap in click rates between email providers during the period of the analysis. This is the drawback of not setting up a control group ex ante. In Big Data, analysts often have to construct control groups after the fact. This is one area in which I hope to see theoretical advances.
In the meantime, realize that whenever we create control groups like this, the resulting analysis is heavily laced with assumptions (e.g. that Gmail and Hotmail have the same types of users and attract the same types of advertisers, that none of the other platforms shipped any major releases during the period, that none of the marketers changed their Gmail marketing strategies, and so on).
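The manufactured control described above is, in effect, a difference-in-differences comparison: the change in Gmail's click rate minus the change in a no-tabs provider's click rate over the same period. A minimal sketch, with invented click rates standing in for the real data:

```python
def diff_in_diff(treat_pre, treat_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    return (treat_post - treat_pre) - (control_post - control_pre)

# Invented click rates (as fractions), before and after the tabs launch.
gmail_pre, gmail_post = 0.850, 0.840
hotmail_pre, hotmail_post = 0.848, 0.845

effect = diff_in_diff(gmail_pre, gmail_post, hotmail_pre, hotmail_post)
print(f"estimated tab effect on click rate: {effect:+.4f}")
```

The arithmetic is trivial; the hard part is the fine print. The number this spits out is an "effect" only under every assumption in the parenthetical above, which is precisely why a control group constructed after the fact is no substitute for one designed in advance.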