In a previous post, I explained the miracle of randomization for causal inference using a simple business example. I left out some side issues there - bad things that can still ruin statistical experiments. These are real issues, though edge cases.
As a recap, we pretend that Amazon is running an email test on Christmas Eve, hoping that an extra email will lead to incremental holiday sales. One half of the targeted customers receives the email while the other half doesn't.
What randomization gives us is covariate balance, which supports an all-else-equal analysis. Accidents like Yahoo's spam filter unexpectedly suppressing our emails (see previous post) affect both arms of the experiment equally; in this sense, the two arms are still comparable despite the accident, even if we are unaware of its occurrence.
One issue arises in that scenario if a majority of our business comes from customers with Yahoo email addresses. The spam filter did not destroy the covariate balance, and the difference between the test and holdout arms is still unbiased. However, the totality of the test and holdout arms no longer adequately represents our research population.
This situation is directly analogous to the problem I've been pointing out regarding the real-world studies of vaccine effectiveness (see this post). Recall that the matching strategy artificially creates a control group that is balanced with the test group (by construction). However, the totality of both groups no longer represents the entire population: if the observational dataset is sufficiently biased, we inevitably fail to find appropriate matches for some people.
Returning to our A/B testing example, we will still have a good read on customers with non-Yahoo email addresses. We just don't know what effect the promotional email has on people with Yahoo email addresses. In the Clalit study, for example, since the researchers excluded everyone living in nursing homes or confined to their homes, and they matched only 15% of the oldest age group, the published vaccine effectiveness doesn't tell us anything about the vaccine's effect on those subgroups.
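To make this concrete, here is a minimal simulation sketch. All of the numbers - the share of Yahoo customers, the baseline purchase rate, the size of the lift - are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2021)

# Invented numbers: 60% of customers use Yahoo, baseline purchase rate 5%,
# and the email's true lift is assumed to be 2 points for everyone.
n = 200_000
is_yahoo = rng.random(n) < 0.60
treated = rng.random(n) < 0.50           # randomized: email vs. holdout

base, lift = 0.05, 0.02
delivered = treated & ~is_yahoo          # spam filter blocks Yahoo delivery
bought = rng.random(n) < base + lift * delivered

# Among non-Yahoo customers, randomization is intact: we get a clean,
# unbiased read of the lift (close to 0.02).
m = ~is_yahoo
print("non-Yahoo lift:",
      bought[m & treated].mean() - bought[m & ~treated].mean())

# Among Yahoo customers, test and holdout look identical; the data say
# nothing about what the email would have done for them (close to 0).
print("Yahoo 'lift':",
      bought[is_yahoo & treated].mean() - bought[is_yahoo & ~treated].mean())
```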
***
A second problem pertains to the nature of the "accident". We are in serious trouble if the accident preferentially impacts one of the two arms.
Let's say Amazon plans to track the performance of the promotional emails by routing customers to a special landing page, which eventually leads to the usual checkout process. Now, it turns out that the server delivering the new landing page suffers intermittent outages during the week of the email promotion. Since only the test group (the one receiving the emails) sees this landing page, the accident affects one arm of the A/B test but not the other.
Randomization cannot save us from this type of operational hiccup.
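A quick simulation sketch, again with invented numbers, shows how badly a one-arm outage can distort the read:

```python
import numpy as np

rng = np.random.default_rng(7)

n = 200_000
treated = rng.random(n) < 0.50

# Assumed truth: 5% baseline purchase rate, a genuine 2-point lift.
would_buy = rng.random(n) < 0.05 + 0.02 * treated

# Intermittent outage: suppose the special landing page is down 30% of
# the time. Only the test arm is routed through it, so only the test
# arm loses sales.
outage = rng.random(n) < 0.30
bought = would_buy & ~(treated & outage)

diff = bought[treated].mean() - bought[~treated].mean()
print("estimated lift:", round(diff, 4))
# Roughly 0.7 * 0.07 - 0.05 = -0.001: a genuinely positive email
# now looks like it does nothing, or even hurts sales.
```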
***
Bonus tip: think about how a testing analyst might discover the operational issue. It will never surface if all you're doing is looking at a dashboard that shows the horse race between the two arms.
It comes down to having the courage to anticipate even unlikely mishaps, designing an auditing process that can identify unexpected problems, and keeping the faith that this work will one day save a wayward test! Ideally, you want to identify the outage during the early phases of the test so you can correct it.
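One possible audit, sketched below with made-up daily counts: compare email clicks against landing pages successfully served. The arm-vs-arm dashboard hides this gap, but a simple delivery check surfaces it immediately:

```python
# Hypothetical daily counts: (day, email clicks, landing pages served)
daily = [
    ("Dec 20", 10_250, 10_190),
    ("Dec 21", 11_040, 7_480),    # the intermittent outage
    ("Dec 22", 10_890, 10_820),
]

for day, clicks, served in daily:
    rate = served / clicks
    if rate < 0.95:               # arbitrary alert threshold
        print(f"{day}: only {rate:.0%} of clicks reached the landing page")
```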