
Rick Wicklin

Well said. Thank you for this insight. It sounds like these tools do not ensure an adequate number of samples prior to modifying the assignment probabilities in these sequential trials. To quote from the SAS documentation for the SEQDESIGN procedure, a well-designed experiment should acknowledge that
"the null hypothesis is more difficult to reject in the early stages than in the later stages. That is, the trial is rejected in the early stages only with overwhelming evidence because in these stages there might not be a sufficient number of responses for a reliable estimate of the treatment effect."
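To make the peeking problem concrete, here is a quick simulation sketch (Python, purely illustrative; it is not any vendor's implementation). It runs A/A trials with no real difference between arms, applies an uncorrected two-proportion z-test after every batch of users, and counts how often at least one interim look falsely declares significance. This is why sequential designs demand much stricter thresholds at early looks.

```python
import math
import random

def two_prop_pvalue(wins_a, wins_b, n):
    """Two-sided p-value for a two-proportion z-test (equal n per arm)."""
    p_pool = (wins_a + wins_b) / (2 * n)
    se = math.sqrt(2 * p_pool * (1 - p_pool) / n)
    if se == 0:
        return 1.0
    z = abs(wins_a - wins_b) / (n * se)  # |p_a - p_b| / se
    return math.erfc(z / math.sqrt(2))

def peeking_trial(rng, peeks=10, batch=100, alpha=0.05):
    """One A/A trial: both arms convert at 50%, so there is no real effect.
    Test after every batch; return True if any interim look (falsely)
    declares significance at the nominal alpha."""
    a = b = 0
    for look in range(1, peeks + 1):
        a += sum(rng.random() < 0.5 for _ in range(batch))
        b += sum(rng.random() < 0.5 for _ in range(batch))
        if two_prop_pvalue(a, b, look * batch) < alpha:
            return True
    return False

rng = random.Random(42)
trials = 1000
rate = sum(peeking_trial(rng) for _ in range(trials)) / trials
# With 10 uncorrected looks, far more than 5% of no-effect trials
# get stopped for "significance" -- the false-positive rate balloons.
```

Each individual test holds its nominal 5% level; it is the repeated looking, with the option to stop at the first significant result, that inflates the overall error rate.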

A second comment: not only do pharmaceutical companies want to stop trials early when a drug shows exceptional promise, but they also want to stop trials for which there is early evidence of possible harm to patients. Thus sequential trials serve an ethical role as well as an economic role.


RW: On your second comment, while that is true, it is also true that clinical trials are designed primarily to measure benefits; the ability to measure harm is weak. In the business world, I like to remind managers that if they want early stopping for benefit, there should be early stopping for harm too. That sometimes stops the request :)

Richard Ward

Might this desire to make a decision early be an example of the heuristic fallacies explored by Kahneman and Tversky?

Kevin Christopher Henry

In an experimental design a subject should see only one of the possible choices. The flaw with the multiple-campaign design is that the users may see more than one of the ads. Perhaps they see ad A, think about it, then see ad B and take the plunge. The A impression will be recorded as a miss and the B as a hit, when in fact they were not independent events.


"Unfortunately significance testing and hypothesis testing are of limited scientific value – they often ask the wrong question and almost always give the wrong answer. And they are widely misinterpreted [...] p tells us the probability of observing the data given that the null hypothesis is true. But most scientists think p tells them the probability the null hypothesis is true given their data." https://theconversation.com/why-hypothesis-and-significance-tests-ask-the-wrong-questions-11583

Surely this is a problem with the whole A/B testing paradigm!
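The quoted misinterpretation can be demonstrated with a small simulation (Python; the effect size, prior mix, and sample size below are arbitrary assumptions for illustration). Among experiments that reach p < .05, the fraction in which the null was actually true depends on power and on how often nulls are true to begin with, and it need not be anywhere near 5%.

```python
import math
import random

def pvalue(sample_mean, n, sigma=1.0):
    """Two-sided z-test p-value for H0: mu = 0, known sigma."""
    z = abs(sample_mean) * math.sqrt(n) / sigma
    return math.erfc(z / math.sqrt(2))

rng = random.Random(7)
n, alpha = 25, 0.05
sig_null = sig_alt = 0
for _ in range(20000):
    null_is_true = rng.random() < 0.5    # assume half of hypotheses are truly null
    mu = 0.0 if null_is_true else 0.3    # assume a modest real effect otherwise
    sample_mean = rng.gauss(mu, 1.0 / math.sqrt(n))  # draw the sample mean directly
    if pvalue(sample_mean, n) < alpha:
        if null_is_true:
            sig_null += 1
        else:
            sig_alt += 1

# Among significant results, the share where the null was actually true:
fdr = sig_null / (sig_null + sig_alt)
# This is NOT 5%; p < .05 does not mean "5% chance the null is true".
```

Changing the assumed base rate of true nulls or the power of the test moves this fraction around freely, which is exactly why p cannot be read as the probability that the null hypothesis is true.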


KCH: Yes that is one of the problems with the alternative design. In the multiple-campaign design, FB sees each campaign as separate and thus there is no tagging of users to keep them in the same test cell. Even after this is solved, there is still one other big problem.

tc: I may write about this more in the future. There are definitely issues with the significance testing framework but most of the critique of it is sloppy and poorly reasoned.


Kaiser: I would love to see a post or two around this subject.

The problems of significance testing will most likely be poorly understood by those using these tools, who are often trained as developers and UX/I designers, not analysts or statisticians.

In my experience A/B testing is often used as a substitute for strategic decision making in start-ups.

Also, as an aside: all things being equal, A/B testing of images, button colours, messages, etc. should converge toward the same optimum over time. In other words, all websites should start to look the same, not different. I therefore see little value in small digital startups doing A/B testing; just make your website look like Amazon, or Facebook, or Wikipedia, or HuffPost, or whatever other company runs millions of A/B tests all the time.

Dror Atariah

I would argue that directly comparing multi-armed bandits and A/B testing is misguided. The two are different creatures, designed for different scenarios.
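The difference in goals can be seen in a toy simulation (Python; conversion rates, horizon, and the epsilon-greedy policy are illustrative assumptions). A fixed 50/50 A/B test maximizes what you learn about both arms; a bandit shifts traffic toward the arm that looks best, maximizing conversions earned during the experiment at the cost of a noisier estimate of the losing arm.

```python
import random

def ab_test(rng, p_a, p_b, n_per_arm):
    """Fixed 50/50 allocation: equal sample size for each arm."""
    wins_a = sum(rng.random() < p_a for _ in range(n_per_arm))
    wins_b = sum(rng.random() < p_b for _ in range(n_per_arm))
    return wins_a + wins_b  # total conversions accrued during the test

def epsilon_greedy(rng, p_a, p_b, n_total, epsilon=0.1):
    """Bandit: explore 10% of the time, otherwise exploit the arm with
    the best empirical conversion rate so far."""
    wins, pulls, total = [0, 0], [0, 0], 0
    probs = (p_a, p_b)
    for _ in range(n_total):
        if rng.random() < epsilon or 0 in pulls:
            arm = rng.randrange(2)  # explore (and seed both arms at the start)
        else:
            arm = 0 if wins[0] / pulls[0] >= wins[1] / pulls[1] else 1
        pulls[arm] += 1
        if rng.random() < probs[arm]:
            wins[arm] += 1
            total += 1
    return total

rng = random.Random(1)
# Arm B converts at 20% vs A's 5% (assumed rates, same total traffic):
ab = ab_test(rng, 0.05, 0.20, 5000)
bandit = epsilon_greedy(rng, 0.05, 0.20, 10000)
# The bandit earns more conversions during the run, but it starves the
# losing arm of samples, so its estimate of that arm is much less precise.
```

Which behavior you want depends on whether the experiment's purpose is a reliable comparison or revenue during the test, which is exactly the point: they are tools for different jobs.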


Using the AdEspresso setup, how does FB know that individuals are to be randomly assigned to one of those four conditions? That is, what does the assignment mechanism default to? Does it just fill each condition sequentially? Without knowing more about the assignment mechanism in this new context, it seems likely that you'd violate the requirement that every participant in the experiment have some probability of being assigned to any one of the conditions. (I'm thinking in terms of Holland's discussion of Rubin's causal model.)


AA: That is the other flaw of the setup as separate campaigns. One of the most important prerequisites of A/B testing (indeed, of any statistical testing) is random assignment of treatment. FB does not promise that simultaneous campaigns to the same target population get randomized. (Tom Diettrich also pointed this out on Twitter: we have no way of telling FB that those campaigns are part of a test.)
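When you do control the assignment mechanism, deterministic randomization is straightforward: hash the user id together with an experiment name and map the result into an arm. A minimal sketch (Python; the experiment and arm names are hypothetical):

```python
import hashlib

def assign_arm(user_id: str, experiment: str, arms: list) -> str:
    """Deterministically map a user to one arm. The same user always
    lands in the same cell, and including the experiment name in the
    hash keeps assignments independent across experiments."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return arms[int(digest, 16) % len(arms)]

arms = ["ad_A", "ad_B", "ad_C", "ad_D"]
# Stable: repeated calls give the same cell for the same user.
assert assign_arm("user-123", "headline-test", arms) == \
       assign_arm("user-123", "headline-test", arms)
# Roughly uniform over many users.
counts = {a: 0 for a in arms}
for i in range(8000):
    counts[assign_arm(f"user-{i}", "headline-test", arms)] += 1
```

This gives exactly what the separate-campaigns setup lacks: each user has a known probability of landing in each condition, and no user sees more than one.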

The comments to this entry are closed.

Kaiser Fung. Business analytics and data visualization expert. Author and Speaker.