I am a guest at the New School's Journalism + Design program this semester.
The students conducted interviews about the question of what makes someone famous. Their interviewees were asked to name five famous people. We had images of these people up on the wall.
Then, we put the pictures into clusters. We tried two different ways of doing it.
At the end, we compared our result to what a computer program generated.
Here are some interesting applications of cluster analysis in the press: FiveThirtyEight made clusters of the intra-season performance profiles of NFL quarterbacks; Pew Research Center made clusters of political values and attitudes.
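For readers who want to replicate the computer's side of our exercise, here is a minimal sketch using scipy's hierarchical clustering. The feature matrix is entirely made up (in class we clustered by judgment rather than measured attributes), so treat this as an illustration of the technique, not our actual analysis.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Hypothetical scores for 20 famous people on three made-up dimensions,
# e.g. (screen time, political mentions, years in the public eye)
features = rng.random((20, 3))

# Ward linkage merges the two closest clusters at each step,
# producing a hierarchy that we then cut into four groups
tree = linkage(features, method="ward")
labels = fcluster(tree, t=4, criterion="maxclust")
print(labels)  # cluster assignment for each of the 20 people
```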
Data Science and Business Analytics are red-hot in the business world -- there are definitely more jobs than there are qualified people. Is this the right field for you? How do you find a job in this field? More importantly, how do you build a lasting career in analytics? I'll be addressing these questions in a new course I'm developing to be held in late March.
Here is the tentative outline of the one-day seminar. Please let me know if you have any suggestions. There is still time to change the contents.
A One-day Seminar, Mar 28, 2015 (Saturday), New York City
By Kaiser Fung
Data Science and Business Analytics have seen tremendous growth in the last few years, and there is fierce competition for the best talent. Numerous academic programs have sprung up, and many workers are making a career transition. The job market, while vibrant, is confusing because the field is fast evolving, and because the field encompasses different job types and career paths. The goal of this seminar is to help students develop a plan for finding a job in Data Science and/or Business Analytics, and building a lasting career in this field.
Specific objectives include: understanding the nature of the job and the state of the job market; determining whether the job is right for you; creating a networking strategy; improving your resume; developing your pitch; deciding whether to invest in further education; and building a lasting career.
The seminar is divided into four sessions, and interaction is encouraged throughout the day. Students complete hands-on exercises, especially in the third session, and receive handouts with tips and resources.
I just did a guest lecture at a New School journalism class. While preparing for the class, I pulled the sad stock chart for GRPN (Groupon):
If you bought the hype in 2011, you'd have lost about 70% of your investment (from $25 to $7).
Given what we know today, it's hard for people to feel the hype that the media helped fuel in those days. As a reminder, here is the New York Times's David Pogue gushing about Groupon, just before its IPO: link. Pogue was one of many such commentators.
Around that time, I wrote this response to the Groupon boosters. There was a gaping hole in the win-win-win story from the start: retailers give up certain profit today for the chance that coupon users are not deal-seekers and will come back for repeat business at a higher price.
This is related to my current concern about the so-called "gas price stimulus". The hit to the oil and gas sector is immediate and certain. The shift of spending to other sectors, and the associated "multiplier effects", depends on a chain of uncertain events unfolding in the future.
In Part 1, I covered the logic behind recent changes to the statistical analysis used in standard reports by Optimizely.
In Part 2, I ponder what this change means for more sophisticated customers--those who are following the proper protocols for classical design of experiments, such as running tests of predetermined sample sizes, adjusting for multiple comparisons, and constructing and analyzing multivariate tests using regression with interactions.
For this segment, the choice of sticking with the existing protocol or not depends on many factors, such as the decision-making culture and corporate priorities. No matter what you do, it is important to realize that improved analysis tools do not obviate careful planning and execution.
Let me start with my advice. Initially, keep running your tests to the usual fixed sample sizes. In essence, you ignore the stopping rule suggested by the Stats Engine. Over a series of tests, including some A/A tests, you can measure how likely those stopping rules would have correctly ended the tests (relative to the fixed-size testing protocol). This allows you to estimate the “time saving” achieved from sequential testing.
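Here is a minimal sketch of the bookkeeping I have in mind, assuming you log three things for each completed test: the Stats Engine's call at its suggested stopping point, the call under your fixed-sample protocol, and the suggested stopping point as a fraction of the planned sample. The field names and values are my own invention, not Optimizely outputs.

```python
# Each record: (engine_call, fixed_call, stopping point as fraction of
# the planned sample size). These values are invented for illustration.
tests = [
    ("win",    "win",    0.6),
    ("win",    "no_win", 0.4),   # stopping rule disagreed with the fixed test
    ("no_win", "no_win", 1.0),   # engine never called it; ran the full length
    ("win",    "win",    0.8),
]

agreement = sum(e == f for e, f, _ in tests) / len(tests)
time_saving = 1 - sum(frac for _, _, frac in tests) / len(tests)
print(f"agreement with fixed-size protocol: {agreement:.0%}")  # 75%
print(f"average sample-size saving: {time_saving:.0%}")        # 30%
```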
As I pointed out in last year’s presentation at the Optimizely Experience, the testing team should be concerned about what proportion of significant findings are correctly called, and what proportion of non-significant findings are incorrectly called. The “false discovery rate” is the flip side of the first quantity.
A testing program using fixed samples may face one of several problems:
a) Too few tests are called significant.
b) Too many tests are called significant.
c) It takes too long to call a test.
You need to figure out which of these is your biggest problem.
Conceptually, relative to a fixed-size test, a sequential test saves time if the true response rate differs from the design assumption substantially. If you’re testing on a web page for which the response rate is well-known and relatively stable, then there should be hardly any time saving on average. This is why I don’t recommend watching tests like a horse race, minute by minute. (As I said in Part 1, if you are watching a horse race, the Stats Engine will provide some sanity.)
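To make the role of the design assumption concrete, here is a sketch of how the planned fixed sample size depends on the assumed lift, using statsmodels with made-up numbers. If the true lift turns out larger than assumed, the planned sample is bigger than needed, which is exactly when a sequential test can exit early.

```python
# Fixed sample size per arm for a two-proportion test, as a function of
# the assumed lift (all numbers here are made up for illustration).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05   # assumed conversion rate of the control page
solver = NormalIndPower()
for lift in (0.10, 0.20, 0.30):
    es = proportion_effectsize(baseline * (1 + lift), baseline)
    n = solver.solve_power(effect_size=es, alpha=0.05, power=0.8,
                           alternative="two-sided")
    print(f"assumed lift {lift:.0%}: about {n:,.0f} visitors per arm")
```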
Suppose you underestimated the true effect by, say, 20 percent. The following stylized chart shows my expectation of how the new Stats Engine results compare to the classical results.
The horizontal axis shows the sample size (at which Optimizely calls an end to the sequential test) as a ratio of the fixed sample size (by design). When this is 100%, the sequential test has the same length as the fixed-sample test. Because the true effect is substantially larger than expected, for a large proportion of tests, the sequential procedure calls for an “early” exit. However, there will be a small number of tests for which the sequential test will end much later than a fixed-sample test.
On the other hand, if the design assumption is essentially correct, then I expect the behavior of the new Stats Engine will look something like this.
The line is mostly flat, meaning there is equal probability of the test ending at any sample size, including sample sizes that are multiples of the fixed-sample requirement. This is the “price to pay” for sequential testing, i.e., multiple peeking. At the lower end of sample sizes, I expect a slight positive curve, because the Bayesian prior (assuming it is a skeptical prior) will prevent tests from being stopped “too early”.
[Thanks to Optimizely’s statistics team for entertaining my inquiries about this intuition.]
How important is saving time for your testing program? This depends on your readiness to move on. My experience is that unexpected time saving, say calling a winner one week before the test was supposed to end, frequently gets eaten up by the organization’s inability to move schedules around. Your IT or web developers may have other projects on their plates.
Further, if you tend to look at data by segments post-hoc, I don't think the current implementation supports that. If you know what segments you care about beforehand, then you can build those into the design.
Most importantly, please don’t fall into the trap of thinking that design and upfront planning become unimportant because of sequential testing and FDR. The design phase is very important in establishing expectations and facilitating communications within the organization.
I also recommend reading this post by Andrew Gelman on data-dependent stopping rules.
I was asked to adapt my earlier post for the HBR audience, and the new version is now up on HBR. Here is the link.
I'm happy that they picked up this post because most business problems concern reverse causation. A small subset of problems can be solved using A/B testing, but only those in which the causes are known in advance and subject to manipulation. Even then, Facebook got into trouble for running such an experiment (though not in my eyes).
Thanks to the editing team at HBR. I like the new version a lot.
Neil Paine at FiveThirtyEight did a round-up of the key articles that attempt to use data to make arguments about Deflategate.
PS. It is a little hard to find the HBR article right now. Seems like it takes time for their website to update the search engine and navigational tools.
PPS. For those in California, come meet me next week. Here are the details (look also on the right sidebar). The talk is about visualizing data.
In my HBR article about A/B testing (link), I described one of the key managerial problems related to A/B testing--the surplus of “positive” results that don’t quite seem to add up. In particular, I mentioned this issue:
When managers are reading hour-by-hour results, they will sometimes find large gaps between Groups A and B, and demand prompt reaction. Almost all such fluctuations result from temporary imbalance between the two groups, which gets corrected as new samples arrive.
Over the holidays, I paid a visit to the Optimizely team, and learned that they have been developing a solution to this problem. (Optimizely is one of the leading platforms for online A/B testing. They just made an announcement this week about a new feature they are calling “the New Stats Engine”.)
Optimizely also recognizes that their clients face a credibility crisis when the A/B testing tool returns too many “significant” results. Their new tool promises to reduce this false-positive problem. They tackle specifically two sources of the problem:
a) Many clients monitor A/B tests like horse races, and run tests to significance. This is sometimes known as “sampling to a foregone conclusion”.
b) Many clients run many (dozens to hundreds, I imagine) tests simultaneously; here, a test is any pairwise comparison of variations, comparison of variations within segments, or any comparison using multiple goals. This is the “multiple comparisons” problem.
Let me first explain why those are bad practices.
The classical hypothesis test is designed to work with fixed sample sizes, which should be determined prior to the start of the test. The testing protocol then allows up to a 5-percent probability of falsely concluding that there is an effect. (That’s the same value as the significance level; it is not the same as saying 5 percent of the positive results are false, but that’s a different article.) However, if the analyst peeks at the result multiple times during a test, then the analyst incurs a 5-percent false-positive chance not once, but at every such peek. Thus, at the end of the test (when significance is reached), the probability of a false positive is much, much higher than 5 percent. It can be shown that, in this setting, every A/A test will eventually reach significance.
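Here is a quick simulation of my own (not Optimizely's analysis) that shows the inflation: run A/A tests, peek after every batch of visitors, and stop at the first "significant" z-score. Any stop is a false positive, because the two arms are identical by construction.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tests, peeks, batch, rate = 2000, 20, 500, 0.05
false_positives = 0
for _ in range(n_tests):
    succ_a = succ_b = n = 0
    for _ in range(peeks):
        succ_a += rng.binomial(batch, rate)  # arm A conversions
        succ_b += rng.binomial(batch, rate)  # arm B, identical by design
        n += batch
        pooled = (succ_a + succ_b) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        if se > 0 and abs(succ_a - succ_b) / n / se > 1.96:
            false_positives += 1   # "significant" at this peek
            break
print(f"A/A tests called significant: {false_positives / n_tests:.0%}")
# well above the nominal 5 percent; more peeks push it higher still
```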
In a “multivariate” test, the analyst makes many pairwise comparisons, and each comparison is analogous to a peek at the data. Each comparison incurs a 5-percent false-positive chance, so across all of the comparisons within one test, the chance of seeing at least one false-positive result is much larger. There are many, many different ways to suffer a false positive (an error in comparison 1 only, in comparison 2 only, in comparisons 1 and 2, in comparisons 1 and 3, and so on).
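If the comparisons were independent (a simplification), the chance of at least one false positive across k comparisons is 1 - 0.95^k:

```python
# Family-wise false-positive chance across k independent comparisons,
# each run at the 5-percent significance level
for k in (1, 5, 14, 50):
    print(k, round(1 - 0.95**k, 2))
# prints: 1 0.05 / 5 0.23 / 14 0.51 / 50 0.92
```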
Now, if the multivariate test is also being run to significance, you are fighting a many-headed hydra.
The Optimizely solution uses two key results from statistics:
a) A sequential testing framework is adopted, in which the analyst is presumed to be peeking at the results. Because of the skeptical prior, the Bayesian analysis will in most cases not reach significance even as sampling continues. This line of research started in the 1940s with Wald.
b) All solutions to the multiple comparisons problem involve tightening the threshold of significance at the individual comparison level. Optimizely adopts the Benjamini-Hochberg approach to controlling the “false discovery rate” (FDR), defined as the proportion of significant results that are in fact false. This line of research dates from the 1990s and is still very active. One advantage is that the FDR is an intuitive concept.
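For the curious, here is a bare-bones version of the Benjamini-Hochberg step-up procedure. Optimizely's production Stats Engine is surely more elaborate, so treat this as illustration only.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean array marking which p-values are 'discoveries'
    while controlling the false discovery rate at level q."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order]
    # find the largest k with p_(k) <= (k/m) * q;
    # that p-value and all smaller ones are declared discoveries
    below = ranked <= (np.arange(1, m + 1) / m) * q
    discoveries = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        discoveries[order[: k + 1]] = True
    return discoveries

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6]))
# -> [ True  True False False False]
```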
What this means for Optimizely clients is that your winning percentages (i.e., the proportion of tests returning significant results) will plunge! And before you despair, this is actually a great thing. Here’s why: In many testing programs, as I pointed out in the HBR article (link), there are too many “positive” findings, which means there are too many false positives. This is fine until the management starts asking you why those positive findings don’t show up in the corporate metrics.
If you currently rely on standard Optimizely reports to read test results, and run tests to significance, then the Stats Engine is surely a no-brainer.
In the next post, I have further thoughts for those customers who have more advanced protocols in place.
PS. This is Optimizely's official explanation of their changes on YouTube.
Dragged along by infectious incuriosity, the financial press ran with the story that falling gasoline prices (a 50% drop in six months) are "the best economic stimulus one can get". See former Deputy Treasury Secretary Roger Altman on CNBC, Business Insider's "cheap gas boost", the Wall Street Journal citing "low oil prices as an effective tax cut for consumers", the New York Times quoting a Citigroup analyst claiming a global stimulus of more than $1 trillion, and so on.
This is the kind of story that one should believe only if half asleep. Here are three reasons why this conjecture is likely to be wrong:
1. Forgetting the big picture
There was a McDonald's next to a Burger King in a small town. The Burger King went out of business, and the McDonald's suddenly did twice its usual business. Surely McDonald's was the winner here, but did the economy of the town expand? Unfortunately not. Consumers merely shifted their spending from Burger King to McDonald's.
Now, consider a household that spends $200 a month on gas before the oil price crash. Let's say the same amount of gas now costs $100. According to those rosy-cheeked economists and journalists, the household now has an extra $100 to spend on other things, and this "extra" spending stimulates the economy.
But the total amount of expenditure is still $200. The only thing that changes is the mix of spending, and GDP is based on total spending, not the mix. Some sectors of the economy will benefit, but at the expense of the oil and gas sector.
2. Imperfect substitution
Consider our household again. The economy stays the same size only if the household spends every dollar of the $100 saved. If the household saves even one of those dollars, the economy shrinks compared to before.
3. Making bad assumptions about the future
It's unclear from any of those articles how the analysts came up with the size of this oil-drop stimulus. Every one of them must make a forecast about future oil prices. I bet many of them take the current price as the new normal, and use that price as the future price.
If I told you that you should not take an extreme value and treat it as the average, you'd scold me for stating the obvious.
As with most economic arguments, one could posit a much more complex chain of relationships to argue how a 50% drop in oil prices becomes trillions in economic stimulus. It is the business journalist's job to explain that complicated chain; the connection is clearly not as simple as reported. If one establishes a chain, such as A up -> B down -> C down -> D up, each of those causal links should be supported with evidence.
The same type of fallacious thinking pervades the business sector. For example, we keep hearing about the growth in retail sales from mobile devices, but we don't know whether consumers are merely shifting from the Web channel to the mobile channel, or how much of the mobile sales are incremental.
During my vacation, I had a chance to visit Trifacta, the data-wrangling startup I blogged about last year (link). Wei Zheng, Tye Rattenbury, and Will Davis hosted me, and showed some of the new stuff they are working on. Trifacta is tackling a major Big Data problem, and I remain excited about the direction they are heading.
From the beginning, I have been attracted to Trifacta's user interface. The user in effect assembles the data-cleaning code through visual exploration, aided by suggestions based on past behavior.
Here are some improvements they have made since I last wrote about the tool:
Handling numeric data - Trifacta now generates some advanced statistics, e.g. percentiles, for the columns in the Visual Profiler, whereas in the past every column was summarized as a histogram. I believe there is also some binning functionality.
Moving beyond Top N - I ranted about Top N thinking in the past (link), and I wasn’t happy that the Trifacta demo seemed to encourage this bad practice. I’m happy that the team heard the complaint and now offer a Random N selection. Eventually, I think Random N should be the default; I don’t know why anyone would want to see Top N.
Interactive workflow - Random N is a big step forward, but in the world of data cleaning it is not sufficient, because many data-quality problems are rare cases that don't show up in a random sample. To deal with this, Trifacta has created an interactive workflow. Through the visual exploration paradigm, the software prepares a set of code; when the user applies the code to the entire dataset, the tool automatically checks for further anomalies and reports them to the user. For instance, a handful of email addresses may have unusual structures not found in the random sample, and thus fall outside all of the data-wrangling rules; these are flagged for further treatment. A sketch of this idea follows below.
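This is not Trifacta's code, just a minimal illustration of the pattern: derive a rule from the sample, apply it to the full data, and flag whatever breaks it.

```python
import re

# rule learned from the random sample: emails look like user@domain.tld
rule = re.compile(r"^[^@\s]+@[^@\s]+\.[a-zA-Z]{2,}$")

full_data = ["ann@example.com", "bob@example", "carol@@example.com",
             "dan@example.org"]
anomalies = [v for v in full_data if not rule.match(v)]
print(anomalies)  # ['bob@example', 'carol@@example.com'] flagged for review
```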
Column metadata - Another exciting development is the expanded use of metadata associated with columns. Such metadata is a major difference between an Excel spreadsheet and any sophisticated data table. For instance, the user can now associate labels with values within a column.
New file formats - Trifacta handles many new data formats like JSON. It can, for example, accept a JSON file and parse the nested structure into columns. Very nice addition!
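As a rough analogue (this is pandas, not Trifacta's implementation), flattening a nested JSON record into columns looks like this:

```python
import pandas as pd

records = [
    {"name": "Ann", "address": {"city": "New York", "zip": "10011"}},
    {"name": "Bob", "address": {"city": "Boston", "zip": "02139"}},
]
df = pd.json_normalize(records)  # nested keys become dotted column names
print(df.columns.tolist())      # ['name', 'address.city', 'address.zip']
```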
I think Trifacta can gain ground by pushing the envelope on two fronts: more and better visual cues to help users diagnose data-quality problems; and more sophisticated recipes for how to handle such problems, informed by a knowledge base of past user behavior.
That is the question in my head when I read an article like USA Today's "Jobless Claims Fall, Suggests Strong Hiring". (link)
The headline makes the connection between newly-released jobless claims data and the conclusion of "strong hiring". But it turns out the new data is merely window-dressing, and the conclusion is based on longer-term trends.
Here is the new data, as reported by the USA Today reporter:
"applications for unemployment benefits fell 4,000 last week to a seasonally adjusted 294,000."
"The four-week average, a less volatile measure, slipped 250 to 290,500."
Without even looking up the source, one should immediately see that a change of 250 in the four-week moving average is just statistical noise. The 4,000 change for the last week is also statistically insignificant because the weekly series is highly volatile. The proper conclusion from this release is that the employment situation was stable from one week to the next.
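One quick back-of-the-envelope check, with a made-up series standing in for the real Department of Labor data: compare the reported change to the typical week-to-week swing.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical year of weekly claims hovering around 295,000;
# the real series is in the Department of Labor's weekly release
weekly_claims = 295_000 + rng.normal(0, 8_000, size=52)
typical_swing = np.std(np.diff(weekly_claims))
print(f"typical weekly change: about {typical_swing:,.0f}")
print(f"a reported change of 4,000 is {4_000 / typical_swing:.1f}x typical")
```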
Now, one could go backwards in time and make an argument for "stronger hiring". This is exactly what the journalist did, by citing "total job growth in 2014 at just shy of 3 million, the best performance since 1999" and "[The 4-week moving] average [of jobless claims] has plunged 16 percent in the past 12 months, as averages have stayed at historically low sub-300,000 levels since September".
Take a look at this chart of the 4-week average in the last five years. The trend has been the same for five years (just draw a straight line through the series) and there is nothing at the right tail of the time series to indicate that the latest data release changed anything:
I'm also amazed that at this point, a journalist can write an article about employment without once mentioning the workforce participation rate. (Anyone who is excluded from the work force is not eligible to be "unemployed". The workforce participation rate has gone down without recovering.)
Notice that this time series was essentially flat until the recession.
I have a whole chapter on employment statistics in Numbersense (link).