In Part 1, I covered the logic behind recent changes to the statistical analysis used in standard reports by Optimizely.
In Part 2, I ponder what this change means for more sophisticated customers--those who are following the proper protocols for classical design of experiments, such as running tests of predetermined sample sizes, adjusting for multiple comparisons, and constructing and analyzing multivariate tests using regression with interactions.
For this segment, the choice of sticking with the existing protocol or not depends on many factors, such as the decision-making culture and corporate priorities. No matter what you do, it is important to realize that improved analysis tools do not obviate careful planning and execution.
Let me start with my advice. Initially, keep running your tests to the usual fixed sample sizes. In essence, you ignore the stopping rule suggested by the Stats Engine. Over a series of tests, including some A/A tests, you can measure how likely those stopping rules would have correctly ended the tests (relative to the fixed-size testing protocol). This allows you to estimate the “time saving” achieved from sequential testing.
As I pointed out in last year’s presentation at the Optimizely Experience, the testing team should be concerned about what proportion of significant findings are correctly called, and what proportion of non-significant findings are incorrectly called. The “false discovery rate” is the flip side of the first quantity.
A testing program using fixed samples may face one of several problems:
a) Too few tests are called significant.
b) Too many tests are called significant.
c) It takes too long to call a test.
You need to figure out which of these is your biggest problem.
Conceptually, relative to a fixed-size test, a sequential test saves time if the true response rate differs from the design assumption substantially. If you’re testing on a web page for which the response rate is well-known and relatively stable, then there should be hardly any time saving on average. This is why I don’t recommend watching tests like a horse race, minute by minute. (As I said in Part 1, if you are watching a horse race, the Stats Engine will provide some sanity.)
Assume that you underestimated the true effect by, say, 20 percent. The following stylized chart shows my expectation of how the new Stats Engine results compare to the classical results.
The horizontal axis shows the sample size (at which Optimizely calls an end to the sequential test) as a ratio of the fixed sample size (by design). When this is 100%, the sequential test has the same length as the fixed-sample test. Because the true effect is substantially larger than expected, for a large proportion of tests, the sequential procedure calls for an “early” exit. However, there will be a small number of tests for which the sequential test will end much later than a fixed-sample test.
On the other hand, if the design assumption is essentially correct, then I expect the behavior of the new Stats Engine will look something like this.
The line is mostly flat, meaning there is an equal probability of the test ending at any sample size, including sample sizes that are multiples of the fixed-sample requirement. This is the "price to pay" for doing sequential testing, i.e., multiple peeking. At the lower end of sample sizes, I expect a slight positive curve, because the Bayesian prior (assuming it is a skeptical prior) will prevent tests from being stopped "too early".
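To make this intuition concrete, here is a small simulation sketch. It uses naive repeated z-testing as a stand-in for the stopping rule (it is not Optimizely's actual sequential procedure), and the base rate, lifts, and peeking interval are all made-up numbers:

```python
import math
import random

def fixed_sample_size(base_rate, assumed_lift):
    """Classical per-arm sample size (alpha = 0.05 two-sided, 80% power)
    for detecting a relative lift over base_rate."""
    z_alpha, z_beta = 1.96, 0.84
    d = base_rate * assumed_lift  # absolute difference to detect
    return math.ceil((z_alpha + z_beta) ** 2 * 2 * base_rate * (1 - base_rate) / d ** 2)

def stopping_ratio(true_lift, assumed_lift, base_rate=0.10,
                   peek_every=200, max_mult=3.0, rng=random):
    """One simulated A/B test with a naive peek every `peek_every` visitors
    per arm. Returns the stopping sample size as a ratio of the fixed
    design size (capped near max_mult if significance is never reached)."""
    n_fixed = fixed_sample_size(base_rate, assumed_lift)
    p_a, p_b = base_rate, base_rate * (1 + true_lift)
    s_a = s_b = n = 0
    while n < max_mult * n_fixed:
        for _ in range(peek_every):
            s_a += rng.random() < p_a
            s_b += rng.random() < p_b
        n += peek_every
        pooled = (s_a + s_b) / (2 * n)
        se = math.sqrt(max(pooled * (1 - pooled) * 2 / n, 1e-12))
        if abs(s_b / n - s_a / n) / se > 1.96:
            break  # the peeker declares significance and stops
    return n / n_fixed

rng = random.Random(1)
# True lift (24%) larger than the design assumption (20%):
ratios = sorted(stopping_ratio(0.24, 0.20, rng=rng) for _ in range(200))
early = sum(r < 1.0 for r in ratios) / len(ratios)
print(f"median stopping ratio: {ratios[100]:.2f}; stopped early: {early:.0%}")
```

When the true effect exceeds the design assumption, most simulated tests stop before the fixed sample size, with a small tail of tests running well past it, which is the shape of the first stylized chart.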
[Thanks to Optimizely’s statistics team for entertaining my inquiries about this intuition.]
How important is saving time for your testing program? This depends on your readiness to move on. My experience is that unexpected time saving, say calling a winner one week before the test was supposed to end, frequently gets eaten up by the organization’s inability to move schedules around. Your IT or web developers may have other projects on their plates.
Further, if you tend to look at data by segments post-hoc, I don't think the current implementation supports that. If you know what segments you care about beforehand, then you can build those into the design.
Most importantly, please don’t fall into the trap of thinking that design and upfront planning become unimportant because of sequential testing and FDR. The design phase is very important in establishing expectations and facilitating communications within the organization.
I also recommend reading this post by Andrew Gelman on data-dependent stopping rules.
In my HBR article about A/B testing (link), I described one of the key managerial problems related to A/B testing--the surplus of “positive” results that don’t quite seem to add up. In particular, I mentioned this issue:
When managers are reading hour-by-hour results, they will sometimes find large gaps between Groups A and B, and demand prompt reaction. Almost all such fluctuations result from temporary imbalance between the two groups, which gets corrected as new samples arrive.
Over the holidays, I paid a visit to the Optimizely team, and learned that they have been developing a solution to this problem. (Optimizely is one of the leading platforms for online A/B testing. They just made an announcement this week about a new feature they are calling “the New Stats Engine”.)
Optimizely also recognizes that their clients face a credibility crisis when the A/B testing tool returns too many “significant” results. Their new tool promises to reduce this false-positive problem. They tackle specifically two sources of the problem:
a) Many clients monitor A/B tests like horse races, and run tests to significance. This is sometimes known as “sampling to a foregone conclusion”.
b) Many clients run many (dozens to hundreds, I imagine) tests simultaneously; here, a test is any pairwise comparison of variations, comparison of variations within segments, or any comparison using multiple goals. This is the “multiple comparisons” problem.
Let me first explain why those are bad practices.
The classical hypothesis test is designed to work with a fixed sample size, which should be determined prior to the start of the test. The testing protocol then allows up to a 5-percent probability of falsely concluding that there is an effect. (That's the significance level. It is not the same as saying that 5 percent of the positive results are false, but that's a different article.) However, if the analyst peeks at the result multiple times during a test, then the analyst incurs a 5-percent false-positive chance not once, but for every such peek. Thus, at the end of the test (when significance is reached), the probability of a false positive is much, much higher than 5 percent. It can be shown that, under this protocol, every A/A test will eventually reach significance.
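If you doubt that last claim, a quick simulation makes the point. The sketch below runs A/A tests (identical variations, so every "significant" result is a false positive) and compares a single look at the end against twenty interim peeks; the sample sizes and response rate are illustrative, not from any real test:

```python
import math
import random

def aa_reaches_significance(n_per_arm, looks, rng, rate=0.10):
    """Simulate an A/A test examined at `looks` equally spaced interim
    analyses; True if a naive 5%-level z-test ever crosses |z| > 1.96."""
    step = n_per_arm // looks
    s_a = s_b = n = 0
    for _ in range(looks):
        for _ in range(step):
            s_a += rng.random() < rate
            s_b += rng.random() < rate
        n += step
        p = (s_a + s_b) / (2 * n)
        se = math.sqrt(max(p * (1 - p) * 2 / n, 1e-12))
        if abs(s_b / n - s_a / n) / se > 1.96:
            return True  # a false positive, since A and B are identical
    return False

rng = random.Random(7)
rates = {looks: sum(aa_reaches_significance(2000, looks, rng)
                    for _ in range(500)) / 500
         for looks in (1, 20)}
print(rates)  # one look stays near 5%; twenty peeks inflate it several-fold
```

The single-look false-positive rate hovers around the nominal 5 percent, while peeking twenty times pushes it far above that, and letting the test run indefinitely would push it toward 100 percent.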
In a "multivariate" test, the analyst makes many pairwise comparisons, and each comparison is analogous to a peek at the data. Each comparison incurs a 5-percent false-positive chance, so across all of the comparisons within one test, the chance of seeing at least one false positive compounds rapidly. There are many, many ways to suffer a false positive (an error in comparison 1 only, in comparison 2 only, in comparisons 1 and 2, in comparisons 1 and 3, and so on).
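Under the simplifying assumption that the comparisons are independent, the compounding is easy to quantify: with k comparisons each run at level alpha, the chance of at least one false positive is 1 - (1 - alpha)^k.

```python
alpha = 0.05
for k in (1, 5, 20, 100):
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:3d} comparisons -> P(>=1 false positive) = {p_any:.1%}")
# 5.0%, 22.6%, 64.2%, 99.4%
```

With 20 comparisons, the chance of at least one spurious "winner" is already about two in three.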
Now, if the multivariate test is also being run to significance, you have a hydra-headed problem.
The Optimizely solution uses two key results from statistics:
a) A sequential testing framework is adopted, in which the analyst is presumed to be peeking at the results. Because of the skeptical prior, the Bayesian analysis will in most cases not reach significance even as the sampling continues. This line of research started in the 1940s with Wald.
b) All solutions to the multiple comparisons problem involve tightening the threshold of significance at the individual comparison level. Optimizely adopts the Benjamini-Hochberg approach to controlling the "false discovery rate" (FDR), defined as the proportion of significant results that are in fact false. This line of research dates from the 90s and is still very active. One advantage is that the FDR is an intuitive concept.
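For the curious, here is a minimal sketch of the Benjamini-Hochberg procedure (the p-values below are made up for illustration; Optimizely's actual implementation surely differs in its details):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return indices of discoveries under the Benjamini-Hochberg
    procedure at FDR level q. Reject the k smallest p-values, where k is
    the largest rank such that p_(k) <= q * k / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            k = rank
    return sorted(order[:k])

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.20, 0.74]
print(benjamini_hochberg(pvals, q=0.05))  # [0, 1]
```

Note that a naive 5-percent threshold would have called five of these eight comparisons significant; BH at a 5-percent FDR keeps only the first two. That is the "plunge" in winning percentages discussed below.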
What this means for Optimizely clients is that your winning percentages (i.e., the proportion of tests returning significant results) will plunge! And before you despair, this is actually a great thing. Here’s why: In many testing programs, as I pointed out in the HBR article (link), there are too many “positive” findings, which means there are too many false positives. This is fine until the management starts asking you why those positive findings don’t show up in the corporate metrics.
If you currently rely on standard Optimizely reports to read test results, and run tests to significance, then the Stats Engine is surely a no-brainer.
In the next post, I have further thoughts for those customers who have more advanced protocols in place.
PS. This is Optimizely's official explanation of their changes on YouTube.
There have been few updates lately, as I was working on things for other people. One of those things showed up today. Here is an excerpt from the beginning of my new article on HBR:
For over 10 years and at three companies, I set up and ran A/B testing programs, in which we test a new offer with half a sample against a control group which doesn’t get a new offer. Executives quickly pick up on the headline benefit of testing: that A/B tests provide reliable answers to “why” questions. This comes as no surprise, as such testing has long been held up as the “gold standard” for learning cause-and-effect in scientific research, clinical studies and direct marketing. However, many executives eventually reach a mid-life crisis, developing doubts about the direction of the A/B testing program.
From my experience, here are three of the most common questions that arise from those doubts, and how managers should think about them.
Here are five amazing recommendations by Avinash Kaushik from a post about how to make Web analytics dashboards better by simplifying.
- Dashboards are not reports. Don't data puke.
- Include insights. Include recommendations for actions. Include business impact. NEVER leave data interpretation to the executives (let them opine on your recommendations for actions with the benefit of their wisdom and awareness of business strategy).
- When it comes to key performance indicators, segments and your recommendations, make sure you cover the end-to-end acquisition, behavior and outcomes.
- Context is everything. Great dashboards leverage targets, benchmarks and competitive intelligence to deliver context.
- This will be controversial but let me say it anyway. The primary purpose of a dashboard is not to inform, and it is not to educate. The primary purpose is to drive action!
It's a long post but well worth reading. I also like these sentences:
Somewhere along the way we've lost our way. Dashboards are no longer thoughtfully processed analysis of data relevant to business goals with an included summary of recommended actions. They are data pukes. And data pukes are not dashboards. They are data pukes.
For those in Boston/Cambridge, I will be speaking at the Chief Data Scientist meetup on Wednesday night. See you there.
Warning: this post may be hard to understand if you don't know SQL.
SQL is one of the most fundamental tools in data science; it is used to manipulate data. Its simplicity is a big reason for its popularity. There are lots of things it can't do, but the few tasks it supports cover the majority of what analysts need.
Over the years, I have noticed some bad habits of SQL coders. These habits tend to prevent the coders from “seeing” the imperfections in their data. Here are a few:
“Select top N” to “spot check” the data
Most analysts realize that they need to check the integrity of a data set. The easiest “check” is to eyeball the top N rows of data. In most cases, the data set is ordered in some way, not necessarily known to the analyst, so the top N rows do not form a representative sample of all rows.
Even if those top rows were effectively random, it's not clear what checks the analyst is performing mentally as he or she scrolls up and down a printed list of, say, 100 rows and 20 columns of data. Is the analyst looking for missing data? For extreme values? For discontinuity in the distribution? For out-of-range values? None of these tasks is simple enough to do in one's head.
Further, if there is a problem with the data, it usually comes from extreme or missing values, which are rare. Or if the data contain text, it may be that a few rows contain bad characters that will trip up SQL during a routine task.
Here’s the bottom line: if the data problem affects a huge chunk of the data, you will find it using a spot check, or any kind of checking. But most data problems affect a small corner of the data. A spot check will almost always miss these, leading to a false negative problem. The real trouble is when the analyst issues a bill of health after a spot check.
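What to do instead? Profile the whole column rather than eyeball rows. A sketch using SQLite from Python, on a toy table (the table and column names are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(1, 9.99), (2, 120.0), (3, None), (4, -5.0), (5, 19.5)])

# One pass over ALL rows: row counts, missing values, range, out-of-range.
row = con.execute("""
    SELECT COUNT(*)            AS n_rows,
           SUM(amount IS NULL) AS n_missing,
           MIN(amount), MAX(amount),
           SUM(amount < 0)     AS n_negative
    FROM sales
""").fetchone()
print(row)  # (5, 1, -5.0, 120.0, 1)
```

A query like this catches the rare missing and extreme values that a "select top N" spot check almost always misses.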
Assume that a data table has no duplicate rows
When merging data sets, it's very easy to generate duplicate rows if one or both of the data sets contain duplicate rows with the same match key. For instance, suppose the analyst is merging the customer sales history with the customer contact information. The match key is the customer id number.
It is normal to assume that the contact database has only one row for each customer (who would design this table in any other way?), and nine times out of ten, this assumption will be correct.
The one time it fails is when your CEO needs an updated sales number right now for a board meeting. Oops, the sales number is double the expected value. The culprit: duplicate customer ids made their way into the contact table, so when it is merged with the sales history, each sales record is replicated one or more times.
It may sound like a waste of time but before merging any data, check each table for duplicates.
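Here is a toy illustration, again in SQLite, of both the check and the damage a duplicate key can do (the tables and values are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE contacts (customer_id INTEGER, email TEXT)")
con.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
con.executemany("INSERT INTO contacts VALUES (?, ?)",
                [(1, "a@x.com"), (2, "b@x.com"), (2, "b2@x.com")])  # dup id 2!
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(1, 100.0), (2, 50.0)])

# The check: look for duplicate match keys BEFORE merging.
dups = con.execute("""
    SELECT customer_id, COUNT(*) FROM contacts
    GROUP BY customer_id HAVING COUNT(*) > 1
""").fetchall()
print("duplicate keys:", dups)  # [(2, 2)]

# The damage: the duplicate silently doubles customer 2's sales.
total = con.execute("""
    SELECT SUM(s.amount) FROM sales s
    JOIN contacts c ON s.customer_id = c.customer_id
""").fetchone()[0]
print("merged total:", total)  # 200.0, but the true total is 150.0
```

The GROUP BY ... HAVING COUNT(*) > 1 query takes seconds to write and saves you from reporting an inflated number to the board.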
Use open-ended time windows
In business analytics, we are always counting events over time, be they sales transactions, clickthroughs, responses to offers, etc. I have a pet peeve: code that does not have explicit accounting windows, meaning a starting time and an ending time.
Such code is not auditable. Every time you run the code, it will generate a different count (unless your business has infrequent events). If you wrote the code yesterday, and I ran it today, the counts would be different. How will I know if the difference is entirely due to the longer accounting window or if there are problems with the underlying data?
The usual excuse for this coding practice is that the business wants the “most updated” number, up to the very last microsecond. Let me assure you: a day-old number that has been verified is preferred to a second-old number that cannot be audited.
The other excuse is that the code is hard to maintain because you have to hard-code the ending time. But that is letting the tool limit your analytical ability. There are plenty of tools for which this is not a limitation.
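To illustrate the difference, here is a toy example in SQLite (the table and dates are invented): the first query is auditable because its window is explicit, while the commented-out alternative shifts with every run.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (ts TEXT, kind TEXT)")
con.executemany("INSERT INTO events VALUES (?, ?)", [
    ("2015-03-01 09:00:00", "sale"),
    ("2015-03-07 12:00:00", "sale"),
    ("2015-03-09 08:00:00", "sale"),   # outside the audited window
])

# Auditable: an explicit accounting window with a start AND an end.
# Re-running this query next week returns the same count.
start, end = "2015-03-01 00:00:00", "2015-03-08 00:00:00"
n = con.execute("SELECT COUNT(*) FROM events WHERE ts >= ? AND ts < ?",
                (start, end)).fetchone()[0]
print(n)  # 2

# NOT auditable: "the last 7 days" is a moving target.
# SELECT COUNT(*) FROM events WHERE ts >= datetime('now', '-7 days')
```

If the count changes between runs of the first query, you know the underlying data changed; with the open-ended version, you can never tell.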
When I talk about numbersense, I am also talking about the habits of the analysts. Bad habits doom many analyses before takeoff.
Have you encountered these issues? Do you have your own list of bad habits? Let me know!
It seems like Seth Kugel's article in the New York Times about "Crunching the Numbers to find the Best Airfare" is quite popular. In this article, he said things like this:
The overall take on the best day to book tickets turns out to be somewhat underwhelming, if you look at the country as a whole. Hopper’s data shows it’s actually Thursday, but don’t expect that fact to save you much money. Reserve a domestic flight on Thursday and you’ll spend, on average, $10 less than if you reserve on Saturday, the worst day to book domestic flights. With international flights, you’ll save, on average, $25 over Sunday, the worst day to book flights abroad. (Those are “maximum averages” that assume you would have booked on the worst day and are now booking on the best.)
This is meaningless navel-gazing.
As I explained in my notes to my Kayak article on 538, talking about best or worst fares is meaningless unless one can describe a strategy with which the traveler could attain those fares. This strategy must work in real time, before it is known that a particular fare would be the best or the worst.
Without such a strategy, we are talking about paper gains and losses.
Analysts who follow Kugel's logic, though, rarely realize that they are talking about paper money. So, later in the article, Kugel said this:
For the vast majority of routes,... avoid booking on weekends and try midweek; for the average American flier, those savings will add up in the long run.
What savings? Those would appear to be the "maximum averages" defined above, the difference between the best and the worst days for given routes based on a lot of historical data. But there is no strategy to reliably attain the best fare; in fact, there is no strategy to reliably buy the worst-priced tickets either.
As I said before, if the goal is to gain provable savings, you need to write down how you are making the purchase decision today, then you need to define what your new strategy is--whether it is using Kayak, or using Google (which doesn't do predictions)--and then you should compare the two methods.
I saw Joe N.'s tweet asking me about a study of how professors spend their time, reported by Lisa Wade at Sociological Images. This is an anthropological study, something that I am not at all familiar with although the people in the field seem to believe that they can make statistically valid observations.
I'm glad the author of the study, John Ziker, wrote a (really) long article describing what he was trying to accomplish. The key point is that the study is a preliminary exploration, with important limitations; a follow-up study is planned which may give generalizable conclusions.
Here are some issues with the first study that make a statistician nervous:
- the sample was between 14 and 30 professors (tiny): Wade reported it to be 16. Ziker definitely started with 30.
- the selection was non-random, based on the first 30 people who responded to a school-wide announcement
- about half the initial respondents did not complete the study, and provided only partial data (one to six days)
- despite the tiny sample, some analysis required slicing the data further into four segments by grade level! I wonder how many department chairs were in that sample. (See chart on right)
- each professor is followed for a two-week period but only every other day, thus each professor at most contributed one observation per day of week
- the interviews were every other day "so the time taken for the interview did not appear on the previous day’s report." This is a horrible problem to deal with! Because time allocation is the subject of the study, the measurement method (in-depth interviewing) interferes with the measured outcome. It seems to me impossible to believe that the time spent answering questions every other day did not affect time allocation on the non-interview days.
- Ziker reasoned: "While we cannot make a claim that all faculty have the same work patterns as our initial subject pool — they do not comprise a random sample — the results are highly suggestive because of the consistency across our subjects who did represent." In order not to fall prey to the law of small numbers, a better way to say this is: we make the assumption that the small sample is representative in both mean value and dispersion, which then leads to the assumption that all faculty have consistent work patterns similar to those observed.
- "With our initial 30 Homo academicus subjects, we ended up with a 166-day sample with each day of the week well represented." I am assuming that Ziker did not drop the 16 professors with partial data and made charts like the one on the right by ignoring the identity of the professor and aggregating over days of the week. Let's review what lies behind this chart. Each respondent contributed at most one observation per day of week; about half of the respondents did not even contribute data for all seven days. So the time allocation on any particular day is averaged over anywhere from 14 to 30 professors. These professors span a variety of ranks, departments, tenure, backgrounds, etc. and were not randomly selected. It's hard for me to trust this chart at all.
In general, I am a big fan of shoe-leather research, in which the researcher goes out and gathers the relevant data needed to address a specific research question, rather than picking up whatever data can be found and then tailoring the research question to avoid the imperfections in the data. So I don't want to sound too negative. It's a difficult research problem they are dealing with. What they learned from this first study is useful for informing future explorations, but drawing conclusions at this stage is premature.
At the end of his article, Ziker described the "experience sampling" method that will form the next phase of this study. I am very excited about this methodology.
Roughly speaking, they will ask participants to install a mobile app, which pops questions from time to time asking them what they are doing at that moment. Instead of exhaustively tracking a small number of participants over the course of time, they will get little bits of data, incomplete schedules, for a large number of professors. If the sample is big enough and randomized appropriately, they can analyze the data ignoring the professor identity, and report results for the "average professor". This method also retains the other benefit of the original design, which is that the respondents report their activities close to the time in which they occurred.
Data scientists pay attention! You don't have to collect complete data at the user level to do proper research. Designs like this "experience sampling" approach produce statistically valid findings without the need for complete data. In fact, trying to collect complete data is counterproductive, leading to shaky conclusions as shown above.
MailChimp, a major vendor that companies use to send marketing emails to customers, published an analysis of the effect of Gmail marketing tabs (link). How should you read such a study?
I'd begin by clarifying what problem the analyst is solving. In May, Google rolled out to all Gmail users a tabbed interface, in which the inbox is split into three parts: the regular inbox, a "promotional" email box, and a "social" email box.
Immediately, everyone assumes that this change will hurt email marketers (we are talking about legitimate companies, not spammers here.) The MailChimp analyst is using data to validate this hypothesis, which is a wonderful endeavor in the spirit of this blog.
Next, I'd identify the analysis strategy used to arrive at the answer. This analyst is using a pre-post analysis while controlling (ex-post) for a single factor. In layman's terms, that means the analyst compares the open and click rates before the tabs rollout with those rates after the tabs rollout. But that difference can be misleading because the pre-post analysis by itself does not prove that the tabs rollout was the cause of any observed difference. For example, there may be a seasonal change in open rates regardless of the tabs rollout.
Recognizing this, the analyst used other email providers as a natural "control" for this single factor (seasonality). The idea is that if seasonality were the cause of the change in open rates, then the other email providers should exhibit the same seasonal change over the same time window. This is a reasonable supposition but you might already be questioning... why must the seasonal effect be identical across email providers?
Good question! It doesn't have to be, which tells you that the outcome of this analysis is valid only under the assumption that the seasonal effect is identical across email providers. (See post #2 for other strong assumptions needed due to controlling for only one factor.)
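For readers who want to see the arithmetic, the control-group logic is in effect a difference-in-differences calculation. The rates below are hypothetical numbers made up for illustration, not MailChimp's:

```python
# Hypothetical open rates (percent), before and after the tabs rollout.
gmail_before, gmail_after = 13.0, 12.1     # treated group (Gmail users)
others_before, others_after = 12.5, 12.3   # control (other email providers)

naive_effect = gmail_after - gmail_before    # -0.9: raw pre-post difference
seasonal = others_after - others_before      # -0.2: what would have happened anyway
adjusted_effect = naive_effect - seasonal    # -0.7: estimated tabs effect, valid
# ONLY IF the seasonal change is identical across providers (the key assumption)
print(round(adjusted_effect, 1))  # -0.7
```

Subtracting the control group's change removes the seasonal effect, but only under the identical-seasonality assumption flagged above.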
Once I am satisfied with the analysis strategy, I look at the quality of the data. I did notice one red flag here. Looking at the click rate chart (please imagine that this is a line chart, not a column chart with an axis not at zero!), I am shocked that the average click rate was in the 85% range. This says that almost all of the people who open emails click on something inside the email. Since I have seen email clickthrough data before at various companies, I am skeptical that these rates are correct.
I did leave a comment at the blog asking them to check their data but as of today, it looks like it got lost in cyberspace - or censored. My friend who originally shared the blog post left a comment and it went through.
The analyst seems to have little sense of what real-world clickthrough rates look like! He convinced himself that the rate must be correct since it is what the data say, and then threw in a distraction: there are two ways to measure click rates, one based on the number of emails sent and the other based on the number of emails opened. Not surprisingly, the latter is much higher than the former.
By his count, the ratio of clicks to emails sent is in the 10 to 20 percent range. That too is way too high. If you tell me a few email campaigns achieve such a high rate, I'd believe it. But given that his study is "BIG DATA", with 29 billion emails, 4.9 billion opens, 4.2 billion clicks, and 43.5 million unsubscribes, presumably across a large number of clients and many different industries, it is hard to fathom what it means to say that one out of every five to ten emails sent gets clicked on.
I'm not bashing the analyst here. Every data analyst will encounter this type of situation over and over. You are convinced that your number must be correct - because you know the data, you know the steps you took, you know the care you took to compute the rates.
When someone else points out the rates don't sound right, you're scratching your head. You know it's just a simple formula, the sum of clicks divided by the sum of opens, so you think there are only a few ways it could go wrong. Further, the person raising the doubt has no data so what could he/she know?
In reality, there are many ways to skin the cat of a simple formula. Have the data been cleansed of bots, and suspicious clicks? What are the time windows for counting each item? How are multiple opens or clicks by the same entity treated? etc.
This is the test of how good an analyst someone is. This is when the analyst demonstrates numbersense. How much time does it take to figure out what is driving these numbers crazy?
The reason I'm not bashing the analyst is this: if you tally up each time a person with no data raises doubts about analytics data, probably 80 percent of the time the data turn out fine, and possibly 5 percent of the time the data have serious errors (defined as: the conclusion changes after the fix).
Of course, if you are a manager of a data team, you want to manage to those ratios. If your analysts are wrong much more often, some remedial action should be taken to improve the performance.
In my next post, I'll look at the MailChimp study from the perspective of Big Data.
In theory, the availability of data should improve our ability to measure performance. In reality, the measurement revolution has not taken place. It turns out that measuring performance requires careful design and deliberate collection of the right types of data--while Big Data is the processing and analysis of whatever data drops onto our laps. Ergo, we are far from fulfilling the promise.
This is such an important point that I'm repeating it at the top of this post.
The title of the post is taken from Sean J. Taylor's post on the same topic. Highlights:
Making your own data means you are creating new facts about the world which gives you privileged access to scientific findings.
If you are the creator of your data set, then you are likely to have a great understanding of the data generating process. Blindly downloading someone's CSV file means you are much more likely to make assumptions which do not hold in the data.
The last point is a major takeaway from Numbersense (link); in particular, read the chapters on economic indicators.