I just did a guest lecture at a New School journalism class. While preparing for the class, I pulled the sad stock chart for GRPN (Groupon):
If you bought the hype in 2011, you'd have lost roughly 70% of your investment (the stock fell from $25 to $7).
Given what we know today, it's hard for people to feel the hype that the media helped fuel in those days. As a reminder, here is the New York Times's David Pogue gushing about Groupon, just before its IPO: link. Pogue was one of many such commentators.
Around that time, I had this response to the Groupon boosters. There was a gaping hole in the win-win-win story from the start: retailers give up sure profit for the chance that coupon users are not mere deal-seekers and will come back for repeat business at a higher price.
This is related to my current concern about the so-called "gas price stimulus". The hit to the oil and gas sector is immediate and certain. The shift of spending to other sectors, and the associated "multiplier effects", depends on a chain of uncertain events playing out in the future.
In Part 1, I covered the logic behind recent changes to the statistical analysis used in standard reports by Optimizely.
In Part 2, I ponder what this change means for more sophisticated customers--those who are following the proper protocols for classical design of experiments, such as running tests of predetermined sample sizes, adjusting for multiple comparisons, and constructing and analyzing multivariate tests using regression with interactions.
For these customers, whether to stick with the existing protocol depends on many factors, such as the decision-making culture and corporate priorities. No matter what you do, it is important to realize that improved analysis tools do not obviate the need for careful planning and execution.
Let me start with my advice. Initially, keep running your tests to the usual fixed sample sizes; in essence, ignore the stopping rule suggested by the Stats Engine. Over a series of tests, including some A/A tests, you can measure how often the stopping rule would have ended each test correctly (relative to the fixed-size testing protocol). This allows you to estimate the "time saving" achieved from sequential testing.
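A minimal sketch of this bookkeeping, with invented test records and numbers (this is not Optimizely output, just an illustration of what to log and compute):

```python
# Hypothetical bookkeeping: for each completed test, log what the sequential
# stopping rule would have decided, and when, alongside the decision from the
# fixed-sample protocol you actually followed.
tests = [
    # (sequential_call, n_at_sequential_stop, fixed_sample_call, fixed_n)
    ("winner",         6_000, "winner",        10_000),  # early exit, same conclusion
    ("no difference", 10_000, "no difference", 10_000),  # ran the full length
    ("winner",         4_000, "no difference", 10_000),  # early call that disagrees (e.g. an A/A test)
]

agree = sum(seq == fixed for seq, _, fixed, _ in tests)
avg_ratio = sum(n_seq / n_fixed for _, n_seq, _, n_fixed in tests) / len(tests)

print(f"agreement with the fixed-sample protocol: {agree} of {len(tests)} tests")
print(f"average sample size relative to the fixed design: {avg_ratio:.0%}")
```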
As I pointed out in last year’s presentation at the Optimizely Experience, the testing team should be concerned about what proportion of significant findings are correctly called, and what proportion of non-significant findings are incorrectly called. The “false discovery rate” is the flip side of the first quantity.
A testing program using fixed samples may face one of several problems:
a) Too few tests are called significant.
b) Too many tests are called significant.
c) It takes too long to call a test.
You need to figure out which of these is your biggest problem.
Conceptually, relative to a fixed-size test, a sequential test saves time if the true response rate differs from the design assumption substantially. If you’re testing on a web page for which the response rate is well-known and relatively stable, then there should be hardly any time saving on average. This is why I don’t recommend watching tests like a horse race, minute by minute. (As I said in Part 1, if you are watching a horse race, the Stats Engine will provide some sanity.)
Assume that you underestimated the true effect by, say, 20 percent. The following stylized chart shows my expectation of how the new Stats Engine results compare to the classical results.
The horizontal axis shows the sample size (at which Optimizely calls an end to the sequential test) as a ratio of the fixed sample size (by design). When this is 100%, the sequential test has the same length as the fixed-sample test. Because the true effect is substantially larger than expected, for a large proportion of tests, the sequential procedure calls for an “early” exit. However, there will be a small number of tests for which the sequential test will end much later than a fixed-sample test.
On the other hand, if the design assumption is essentially correct, then I expect the behavior of the new Stats Engine will look something like this.
The line is mostly flat, meaning there is roughly equal probability of the test ending at any sample size, including sample sizes that are multiples of the fixed-sample requirement. This is the "price to pay" for doing sequential testing, i.e., multiple peeking. At the lower end of sample sizes, I expect a slight positive curve, because the Bayesian prior (assuming it is a skeptical prior) will prevent tests from being stopped "too early".
[Thanks to Optimizely’s statistics team for entertaining my inquiries about this intuition.]
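To make the stopping-time intuition concrete, here is a toy simulation. It uses repeated z-tests with a tightened threshold as a crude stand-in for a sequential procedure; it is not the Stats Engine's actual method, and the base rate, lift, and peeking schedule are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def stopping_ratio(p_base, lift, n_fixed, z_crit=2.576, peek_every=500, max_mult=3):
    """Simulate one A/B test, peeking every `peek_every` visitors per arm.
    Returns the stopping sample size as a fraction of the fixed-sample design.
    A plain z-test with a tightened threshold stands in for a real sequential
    procedure; this is NOT how Optimizely's Stats Engine works."""
    n_max = int(n_fixed * max_mult)
    a = np.cumsum(rng.binomial(1, p_base, n_max))
    b = np.cumsum(rng.binomial(1, p_base * (1 + lift), n_max))
    for n in range(peek_every, n_max + 1, peek_every):
        pa, pb = a[n - 1] / n, b[n - 1] / n
        pooled = (pa + pb) / 2
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        if se > 0 and abs(pb - pa) / se > z_crit:
            return n / n_fixed
    return float(max_mult)  # never stopped within the simulated horizon

# A fixed-sample design sized for a 10% relative lift on a 5% base rate (~31,000 per arm).
n_fixed = 31_000
bigger = [stopping_ratio(0.05, 0.12, n_fixed) for _ in range(200)]       # effect 20% larger than assumed
as_designed = [stopping_ratio(0.05, 0.10, n_fixed) for _ in range(200)]  # effect as assumed
print("median stopping ratio, effect 20% larger than assumed:", np.median(bigger))
print("median stopping ratio, effect as designed:            ", np.median(as_designed))
```

The qualitative pattern is the one sketched above: when the true effect beats the design assumption, most runs stop before the fixed-sample size; when the assumption is on target, early exits are much less common.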
How important is saving time for your testing program? This depends on your readiness to move on. My experience is that unexpected time saving, say calling a winner one week before the test was supposed to end, frequently gets eaten up by the organization’s inability to move schedules around. Your IT or web developers may have other projects on their plates.
Further, if you tend to look at data by segments post-hoc, I don't think the current implementation supports that. If you know what segments you care about beforehand, then you can build those into the design.
Most importantly, please don’t fall into the trap of thinking that design and upfront planning become unimportant because of sequential testing and FDR. The design phase is very important in establishing expectations and facilitating communications within the organization.
I also recommend reading this post by Andrew Gelman on data-dependent stopping rules.
In my HBR article about A/B testing (link), I described one of the key managerial problems related to A/B testing--the surplus of “positive” results that don’t quite seem to add up. In particular, I mentioned this issue:
When managers are reading hour-by-hour results, they will sometimes find large gaps between Groups A and B, and demand prompt reaction. Almost all such fluctuations result from temporary imbalance between the two groups, which gets corrected as new samples arrive.
Over the holidays, I paid a visit to the Optimizely team, and learned that they have been developing a solution to this problem. (Optimizely is one of the leading platforms for online A/B testing. They just made an announcement this week about a new feature they are calling “the New Stats Engine”.)
Optimizely also recognizes that their clients face a credibility crisis when the A/B testing tool returns too many "significant" results. Their new tool promises to reduce this false-positive problem. They specifically tackle two sources of the problem:
a) Many clients monitor A/B tests like horse races, and run tests to significance. This is sometimes known as “sampling to a foregone conclusion”.
b) Many clients run many (dozens to hundreds, I imagine) tests simultaneously; here, a test is any pairwise comparison of variations, comparison of variations within segments, or any comparison using multiple goals. This is the “multiple comparisons” problem.
Let me first explain why those are bad practices.
The classical hypothesis test is designed to work with fixed sample sizes, which should be determined prior to the start of the test. The testing protocol then allows up to a 5-percent probability of falsely concluding that there is an effect. (That's the significance level. It is not the same as saying 5 percent of the positive results are false, but that's a different article.) However, if the analyst is peeking at the result multiple times during a test, then the analyst incurs a 5-percent false-positive chance not once, but for every such peek. Thus, at the end of the test (when significance is reached), the probability of a false positive is much, much higher than 5 percent. It can be shown that, in this setting, every A/A test will eventually reach significance.
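If you want to see this for yourself, a quick simulation of A/A tests with repeated peeking makes the point. A minimal sketch, assuming a 5% conversion rate and a peek every 500 visitors per arm (both numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def aa_test_ever_significant(p=0.05, n_total=20_000, peek_every=500, z_crit=1.96):
    """Simulate one A/A test (no true difference between the arms) with a peek
    every `peek_every` visitors per arm. Returns True if any peek crosses the
    nominal 5% significance threshold."""
    a = np.cumsum(rng.binomial(1, p, n_total))
    b = np.cumsum(rng.binomial(1, p, n_total))
    for n in range(peek_every, n_total + 1, peek_every):
        pa, pb = a[n - 1] / n, b[n - 1] / n
        pooled = (pa + pb) / 2
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        if se > 0 and abs(pa - pb) / se > z_crit:
            return True
    return False

runs = 1_000
hits = sum(aa_test_ever_significant() for _ in range(runs))
# With 40 peeks per test, the realized false-positive rate lands well above the nominal 5%.
print(f"A/A tests declared significant at some peek: {hits / runs:.0%}")
```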
In a "multivariate" test, the analyst makes many pairwise comparisons, and each comparison is analogous to a peek at the data. Each comparison incurs a 5-percent false-positive chance, so that across all of the comparisons within one test, the chance of seeing at least one false positive result is much larger. There are many, many different ways to suffer a false positive (an error in comparison 1 only, in comparison 2 only, and so on; in comparisons 1 and 2, in comparisons 1 and 3, and so on).
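A back-of-the-envelope calculation shows how fast this compounds, if we pretend the comparisons are independent (real comparisons within one test are not exactly independent, but the direction is the same):

```python
alpha = 0.05
for k in (1, 5, 10, 20, 50):
    # chance of at least one false positive among k independent comparisons at the 5% level
    fwer = 1 - (1 - alpha) ** k
    print(f"{k:3d} comparisons -> {fwer:.0%} chance of at least one false positive")
```

With 50 comparisons, you are about 92 percent likely to see at least one spurious "winner".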
Now, if the multivariate test is also being run to significance, you are fighting a many-headed hydra.
The Optimizely solution uses two key results from statistics:
a) A sequential testing framework is adopted, in which the analyst is presumed to be peeking at the results. Because of the skeptical prior, the Bayesian analysis in most cases will not declare significance even if the sampling never ends. This line of research started in the 1940s with Wald.
b) All solutions to the multiple comparisons problem involve tightening the threshold of significance at the individual test level. Optimizely adopts the Benjamini-Hochberg approach to controlling the "false discovery rate" (FDR), defined as the proportion of significant results that are in fact false. This line of research is from the 90s and is still very active. One advantage is that the FDR is an intuitive concept.
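For the curious, the Benjamini-Hochberg step-up rule itself fits in a few lines of code. This is the textbook procedure, not a claim about how Optimizely implements it, and the p-values below are made up:

```python
import numpy as np

def benjamini_hochberg(p_values, fdr=0.05):
    """Return a boolean array marking which hypotheses are declared significant
    while controlling the false discovery rate at `fdr` (Benjamini-Hochberg)."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    thresholds = fdr * np.arange(1, m + 1) / m      # i/m * q for the i-th smallest p-value
    below = p[order] <= thresholds
    significant = np.zeros(m, dtype=bool)
    if below.any():
        cutoff = np.max(np.nonzero(below)[0])       # largest i with p_(i) <= i/m * q
        significant[order[: cutoff + 1]] = True
    return significant

# Toy example: ten comparisons, a few of them genuinely small.
p_vals = [0.001, 0.008, 0.012, 0.04, 0.049, 0.2, 0.35, 0.6, 0.75, 0.9]
print(benjamini_hochberg(p_vals))
```

On this toy example, a naive 0.05 cutoff would call five of the ten comparisons significant; the BH rule keeps only three, which is exactly the tightening described above.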
What this means for Optimizely clients is that your winning percentage (i.e., the proportion of tests returning significant results) will plunge! But before you despair: this is actually a great thing. Here's why: in many testing programs, as I pointed out in the HBR article (link), there are too many "positive" findings, which means there are too many false positives. This is fine until management starts asking why those positive findings don't show up in the corporate metrics.
If you currently rely on standard Optimizely reports to read test results, and run tests to significance, then the Stats Engine is surely a no-brainer.
In the next post, I have further thoughts for those customers who have more advanced protocols in place.
PS. This is Optimizely's official explanation of their changes on YouTube.
During my vacation, I had a chance to visit Trifacta, the data-wrangling startup I blogged about last year (link). Wei Zheng, Tye Rattenbury, and Will Davis hosted me, and showed some of the new stuff they are working on. Trifacta is tackling a major Big Data problem, and I remain excited about the direction they are heading.
From the beginning, I have been attracted to Trifacta's user interface. The user in effect assembles the data-cleaning code through visual exploration and suggestions based on past behavior.
Here are some improvements they have made since I last wrote about the tool:
Handling numeric data - Trifacta now generates some advanced statistics, e.g. percentiles, about the columns in the Visual Profiler, whereas in the past every column was summarized as a histogram. I believe there is also some binning functionality.
Moving beyond Top N - I ranted about Top N thinking in the past (link), and I wasn't happy that the Trifacta demo seemed to encourage this bad practice. I'm glad the team heard the complaint and now offers a Random N selection. Eventually, I think Random N should be the default; I don't know why anyone would want to see Top N.
Interactive workflow - Random N is a big step forward, but in the world of data cleaning it's not sufficient. The reason is that many data quality problems are rare cases that don't show up in a random sample. To deal with this, Trifacta has created an interactive workflow. Through the visual exploration paradigm, the software prepares a set of code; when the user applies the code to the entire dataset, the tool automatically checks for further anomalies and reports them to the user. For instance, there may be a handful of email addresses with unusual structures that did not appear in the random sample and thus fall outside all of the data-wrangling rules. These are flagged for further treatment. (A toy sketch of this pattern follows this list.)
Column metadata - Another exciting development is the expanded use of metadata associated with columns. Such metadata is a major difference between an Excel spreadsheet and any sophisticated data table. For instance, the user can now associate labels with values within a column.
New file formats - Trifacta handles many new data formats like JSON. It can, for example, accept a JSON file and parse the nested structure into columns. Very nice addition!
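To make the interactive-workflow idea concrete, here is a toy sketch of the general pattern: infer a cleaning rule from a random sample, apply it to the full data, and flag the values that fall outside the rule. This is my own illustration, not Trifacta's code, and the dataset and regex are invented:

```python
import random
import re

# A toy dataset: 10,000 email addresses, two of them malformed.
emails = [f"user{i}@example.com" for i in range(10_000)]
emails[1234] = "no-reply@@example..com"
emails[8765] = "jane.doe at example dot com"

# Step 1: build a rule by looking at a random sample (what the analyst "sees").
sample = random.sample(emails, 100)
rule = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")
print(sum(not rule.match(e) for e in sample), "anomalies visible in the sample (usually zero)")

# Step 2: apply the rule to the entire dataset; the rare cases surface here.
anomalies = [e for e in emails if not rule.match(e)]
print("flagged for further treatment:", anomalies)
```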
I think Trifacta can gain ground by pushing the envelope on two fronts: more and better visual cues to help users diagnose data-quality problems; and more sophisticated recipes for how to handle such problems, informed by a knowledge base of past user behavior.
Here is something different: I wrote a piece on exam-taking tips. It's on a new website, Cafe, which has lots of good (non-quant) reads. The motivation for the piece is my observation that most American students are not taught how to take exams. As a professor, I notice that many students get lower scores than they deserve because of this.
In this article, I describe five things that are often neglected here, but are common knowledge in exam-heavy cultures.
There have been few updates recently, as I have been working on things for other people. One of those things showed up today. Here is an excerpt from the beginning of my new article on HBR:
For over 10 years and at three companies, I set up and ran A/B testing programs, in which we test a new offer on half a sample against a control group that doesn't get the new offer. Executives quickly pick up on the headline benefit of testing: that A/B tests provide reliable answers to "why" questions. This comes as no surprise, as such testing has long been held up as the "gold standard" for learning cause-and-effect in scientific research, clinical studies and direct marketing. However, many executives eventually reach a mid-life crisis, developing doubts about the direction of the A/B testing program.
From my experience, here are three of the most common questions that arise from those doubts, and how managers should think about them.
What a lucky day: I found time to catch up on some Gelman. He posted about the Facebook research ethics controversy, and I'm glad to see that he and I have pretty much the same attitude (my earlier post is here). It's a storm in a teacup.
Gelman makes two other points about the Facebook study--unrelated to the ethics--which are very important.
First, he said:
if we happen to see an effect of +0.02 in one particular place at one particular time, it could well be -0.02 somewhere else. Don’t get me wrong—I’m not saying that this finding is empty, just that we have to be careful about out-of-sample generalization.
This statement is a reaction to learning that the measured response in the Facebook study, that is, the change in "emotions" of users due to the manipulation of positive/negative newsfeed items, is tiny: on the order of a hundredth of a standard deviation. Put differently, if the sample size of the study (~700K) were smaller, the effect would have been indistinguishable from background noise.
Sadly, this type of thing happens in A/B testing a lot. On a website, it seems as if there is an inexhaustible supply of experimental units. If the test has not "reached" significance, most analysts just keep it running. This is silly in many ways but the key issue is that if you need that many samples to reach significance, it is guaranteed that the measured effect size is tiny, which also means that the business impact is tiny.
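To put rough numbers on that claim, here is a standard sample-size calculation for comparing two proportions. The 5-percent base rate, 80-percent power, and 5-percent significance level are assumptions chosen for illustration:

```python
import math

def n_per_arm(p_base, relative_lift, z_alpha=1.96, z_power=0.84):
    """Approximate visitors needed per arm for a two-sided test at the 5% level
    with 80% power (standard two-proportion formula)."""
    p1, p2 = p_base, p_base * (1 + relative_lift)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# On a 5% conversion rate, halving the detectable lift roughly quadruples the sample.
for lift in (0.10, 0.05, 0.02, 0.01):
    print(f"{lift:.0%} relative lift -> about {n_per_arm(0.05, lift):,} visitors per arm")
```

Read it in reverse: if a test only becomes significant after millions of visitors, the lift it detected is in the one-percent range or below.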
For websites where the typical user does not sign on, tests are run using cookies to track unique users. It is often ignored how poorly cookies map to unique users (that ought to be a separate post). Suppose the same user shows up at different times as different cookies. Then the i.i.d. assumption is violated; the correlation between units causes effect sizes to be over-estimated. The longer one runs a test, the more likely the same user shows up as multiple cookies.
Gelman's other observation is that studies with negligible effects, bolstered by massive samples, can reach p < 0.05 and find their way into top journals. Add that to the pile of reasons why being published in a top journal is not necessarily a reason to trust a study.
On a separate note, I want to respond to a reader who asked me a question a while ago that I haven't answered. In one of my talks about the Netflix Prize, I remarked that the 10% targeted improvement was roughly equivalent to improving the accuracy of predicting the average rating by 1/10 of a star in the 5-star scale. Then, I forgot how I derived that number. Turns out it was in an earlier version of the talk. The accuracy metric used was RMSE which is on the same scale as the ratings data, i.e. the 5-star scale. The 10% improvement was roughly 0.1 on the RMSE scale, which is 1/10th of a star on the 5-star scale.
Pertinent to the tiny-effect issue discussed above, note that in the final phase of the Netflix Prize, several teams were just shy of the 10-percent threshold. The eventual winners improved their RMSE by 0.005 to get over the threshold. That is one half of 1/100th of a star on the 5-star scale. And it took 10 months.
Here are five amazing recommendations by Avinash Kaushik from a post about how to make Web analytics dashboards better by simplifying.
1. Dashboards are not reports. Don't data puke.
2. Include insights. Include recommendations for actions. Include business impact. NEVER leave data interpretation to the executives (let them opine on your recommendations for actions with benefit of their wisdom and awareness of business strategy).
3. When it comes to key performance indicators, segments and your recommendations make sure you cover the end-to-end acquisition, behavior and outcomes.
4. Context is everything. Great dashboards leverage targets, benchmarks and competitive intelligence to deliver context. (You'll see that in above examples.)
5. This will be controversial but let me say it anyway. The primary purpose of a dashboard is not to inform, and it is not to educate. The primary purpose is to drive action!
It's a long post but well worth reading. I also like these sentences:
Somewhere along the way we've lost our way. Dashboards are no longer thoughtfully processed analysis of data relevant to business goals with an included summary of recommended actions. They are data pukes. And data pukes are not dashboards. They are data pukes.
A LinkedIn contact and 538 reader pointed me to this demo video by Joe Hellerstein, from a Bay Area startup called Trifacta. They have a neat product that tries to automate data cleaning/processing tasks for analysts.
I love that people are working on this problem. It's an area that I'm interested in getting involved in. Also, they have a sleek user interface that is well thought out and innovative.
There is still a long way to go. The product is designed by computer scientists, and it shows in several ways:
1. The data is, by and large, accepted as pristine. The tasks Joe chose to show during the 15-minute demo are about transforming or formatting variables. There is no "cleaning". All of the data are presumed correct. There was a brief moment of unease when he found missing values in a date field, which led to a difference in days being recorded as NA. This was quickly "solved" by replacing those differences with zero days! (In reality, this is probably right-censored data for which the missingness is informative.)
2. Visual inspection of the top 10 rows is central to the process. I already ranted about this practice here. In the Trifacta design, the top rows are not just used to check the data; they are also used to generate transformation rules.
I suggest that Trifacta hire a statistician to expand the list of tasks that need to be tackled. This is a good product that can be made great.
Also, the New York Times recently wrote about the "unsexy" part of the job (link).