Previously, I looked at MailChimp's analysis of the impact of Gmail's new tabbing feature and noted a potential data issue (link). In Part 2, I will treat the MailChimp study as a typical example of a "Big Data" study.
The Gmail study has several features that are hallmarks of Big Data. First and foremost, the analyst boasts of a staggering amount of data ("29 billion emails, 4.9 billion opens, 4.2 billion clicks, and 43.5 million unsubscribes"). Because of such "wealth", the analyst feels empowered to ignore statistical variability. This is sometimes known as "N = All", the illusion that the analyst has collected all possible data points. In my talks, I prefer to call it what it is: "seemingly complete" data. The assumption of "N = All" is akin to the assumption of "perfect information" or "completeness" in economic models, and we know where that led us.
Take this chart from MailChimp's post:
The differences in click rates among the top four email providers range from 0.3% to 2.5%. This is against a base rate in the 85% range (see Part 1 for why this base rate is probably miscalculated). So we are talking about tiny percentage differences. Individual provider-to-provider gaps are even smaller: less than 1% between Gmail and Hotmail, for example.
Now imagine you are tracking the day-to-day changes in click rate for a single email provider (say Gmail, which according to the other chart clocks in at approx. 84%). Do you expect Gmail's click rate to be 84% day after day with no fluctuations? Probably not.
Here's the problem: the "seemingly complete" paradigm is, in effect, a claim that there is no noise in the data. And that claim is the basis for concluding that email provider A is worse than email provider B in click rates.
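To see why "no noise" is such a strong claim, here is a minimal simulation with entirely made-up volumes and rates (none of these numbers come from MailChimp's data). Even with millions of emails per day, the observed daily click rate swings by far more than the sub-1% provider gaps once the underlying rate itself is allowed to drift with campaign mix, day of week, and the like.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: one provider, a "true" average click rate of 84%,
# millions of emails per day. The daily rate drifts a little (campaign
# mix, day of week, seasonality), modeled as a 1-point normal wobble.
days = 90
emails_per_day = 5_000_000
base_rate = 0.84
daily_true_rate = np.clip(base_rate + rng.normal(0, 0.01, days), 0, 1)

# Observed daily rates: binomial sampling on top of the drifting rate.
clicks = rng.binomial(emails_per_day, daily_true_rate)
observed_rate = clicks / emails_per_day

print("binomial-only std dev: %.4f%%" %
      (100 * np.sqrt(base_rate * (1 - base_rate) / emails_per_day)))
print("day-to-day std dev:    %.2f%%" % (100 * observed_rate.std()))
print("range of daily rates:  %.1f%% to %.1f%%" %
      (100 * observed_rate.min(), 100 * observed_rate.max()))
```

With these invented numbers, the textbook binomial standard deviation is around 0.02%, yet the daily rates move around by roughly a full percentage point, which is in the same ballpark as the gaps between providers in the chart. "N = All" silences the first kind of noise; it says nothing about the second.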
I see this as the "moral hazard" problem of Big Data. The comfort of abundance causes us to let our guard down; just when we should be seeing fewer false results, we end up seeing many more!
***
I will quickly run through a few other ingredients of the typical Big Data study. The data is purely observational, has no control group, and is co-opted from other studies with different objectives. Just think about other Big Data studies, and you'll notice that they fit most of these criteria.
The MailChimp analyst, to his credit, attempted to manufacture a control group by appealing to other email providers that do not have the Gmail-style tabs. This is a simple and useful control, but readers using numbersense would want to investigate these questions:
- Are there differences in the types of people who have Gmail accounts versus those who have AOL accounts?
- Are there differences in the types of email marketers who spend on Gmail versus those who spend on Hotmail?
- During the period under study, did any of the other email providers make changes to their email service, apart from Gmail introducing the tabs?
- Did email marketers change their behavior on the Gmail platform in reaction to the changes introduced by Google, behavior they presumably would not have replicated on other email platforms?
- Did some email users migrate from Gmail to the other email providers during the period under study?
Any of these factors, and others, could explain a gap in click rates between email providers during the period of the analysis. This is the trouble with not setting up a control group ex ante. In Big Data, analysts often have to construct control groups after the fact. This is one area in which I hope to see theoretical advances.
In the meantime, realize that whenever we create control groups like this, the resulting analysis is heavily laced with assumptions: for example, that Gmail and Hotmail have the same types of users and attract the same types of advertisers, that none of the other platforms shipped major releases during the period, that none of the marketers changed their Gmail marketing strategies, and so on.
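To see how much weight those assumptions carry, here is a sketch of the kind of before-versus-after, Gmail-versus-other-provider comparison that a manufactured control group implies. All of the click rates below are invented for illustration; they are not MailChimp's numbers, and I am not claiming this is exactly the calculation MailChimp ran.

```python
# A before/after, treated-vs-control comparison ("difference in
# differences") with entirely hypothetical click rates. The arithmetic
# is trivial; the assumptions it smuggles in are not: same kinds of
# users and marketers on both platforms, no other product changes,
# no marketer behavior change, no user migration.
rates = {
    # provider: (click rate before tabs, click rate after tabs)
    "gmail":   (0.850, 0.838),   # hypothetical
    "hotmail": (0.845, 0.842),   # hypothetical "control"
}

gmail_change = rates["gmail"][1] - rates["gmail"][0]
control_change = rates["hotmail"][1] - rates["hotmail"][0]
estimated_tab_effect = gmail_change - control_change

print(f"Gmail change:         {gmail_change:+.2%}")
print(f"Control change:       {control_change:+.2%}")
print(f"Implied 'tab effect': {estimated_tab_effect:+.2%}")
```

With these invented numbers the arithmetic spits out a -0.9% "tab effect", but that figure inherits every one of the assumptions listed above, and it only means something if it is large relative to the normal fluctuations discussed earlier.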
Overall, everything you say makes perfect sense!
One small follow-up question: separate and apart from whether or not the base rates are correct, and the lack of attention to the variation in those base rates, how do you know that a 1% or 2% difference in click rates is "tiny"? I don't know anything about e-mail marketing, but in direct (snail) mail marketing, a difference of 1% in response rate is considered quite meaningful.
Posted by: Stephanie | 02/05/2014 at 01:38 PM
Stephanie: it depends on your reference point. 1% over 85% is likely to be noise but 1% over 5% is probably meaningful. One good way to gauge the underlying variability is to look at how much the rate moves around weekly, monthly, etc.
Posted by: Kaiser | 02/05/2014 at 06:13 PM
I still don't see how you could know what is "tiny" unless you have actual information on the variance of these particular data. Some data are quite stable, in which case a 1% change may well not be noise. Seems like you are guessing (perhaps in a reasonable, educated way if you have experience with this type of data), but who knows? I would not assume one way or the other myself. For example, assume most people have very predictable behavior and are either routine clickers or routine non-clickers. If there's something particular about a certain e-mail provider's setup that nudges a subset of people to switch behavior, well, there's a story of how 1 or 2% isn't noise.
Posted by: Stephanie | 02/05/2014 at 06:35 PM
How would I know? It's what I call numbersense. I wrote an entire book about it.
I'll just mention one aspect of it here. Go read the book for the rest.
There is no such thing as certainty in statistics. Every analysis, including the one derived from "actual" data, is part opinion.
Posted by: Kaiser | 02/05/2014 at 08:08 PM
A philosophical question on approach here. Part of the false sense of confidence in the data is supported by the way confidence intervals are created, no? Assuming we're using sqrt(p(1-p)/N), as N grows the confidence interval must shrink, lending an air of statistical significance even if the practical significance of a 1% difference in click rate is small. This assumes any test for statistical significance was done, of course.
Would you advocate for a sampling rather than N=All approach to counteract this? Setting aside the "design" of the study (using someone else's data, lack of control), when is a lot of data a good thing and when is it simply misleading? Or is a lot of data only misleading due to the "design"?
Thanks!
Posted by: Adam Schwartz | 02/06/2014 at 08:41 AM
Subsampling, like you said, is something I do routinely. There is a better method, though it's not always feasible. What those formulas do is use a stochastic model to estimate the systematic variance. No one really checks whether that estimate is accurate. When N is very large, as you noticed, the estimate is almost surely wrong. The solution is to compute empirical estimates of your own systematic variance; that's why I talked about looking at your historical fluctuations. Box, Hunter, and Hunter (1st edition) covers this even before getting to the usual formulas.
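To make that concrete, here is a toy example (the weekly rates below are invented, not anyone's real data) contrasting the formula-based standard error with the empirical week-to-week variability:

```python
import numpy as np

# Invented example: 12 weeks of click rates for one provider, each week
# based on roughly 10 million emails. The sqrt(p(1-p)/N) formula says
# the standard error is tiny; the week-to-week history says the real
# variability is far larger, by more than an order of magnitude.
weekly_rates = np.array([0.842, 0.851, 0.836, 0.848, 0.855, 0.839,
                         0.846, 0.852, 0.833, 0.849, 0.844, 0.857])
n_per_week = 10_000_000
p = weekly_rates.mean()

formula_se = np.sqrt(p * (1 - p) / n_per_week)   # model-based estimate
empirical_sd = weekly_rates.std(ddof=1)          # what the history says

print(f"formula-based standard error: {formula_se:.4%}")   # ~0.011%
print(f"empirical week-to-week sd:    {empirical_sd:.2%}")  # ~0.75%
```

Any "significant" difference computed off the first number is really a claim that the second number does not exist.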
Posted by: Kaiser | 02/06/2014 at 10:43 AM
All great points to consider when looking at any email analysis.
One way to combat cross-time changes for provider comparisons (or any other "segmentation") would be to hold out a random control group. The lift metric would then reflect Gmail's or the other providers' actions independently. MailChimp would probably never recommend this approach because their revenue is tied to volume. Lifts in product sales from email will be small (though likely still positive ROI), which would push their clients to consider optimization and better targeting tactics... and that realization among clients would not be good for MailChimp's business.
Posted by: Jake S | 02/07/2014 at 12:28 PM