Cathy O'Neil may need no introduction to blog readers. She's the author of the hard-hitting MathBabe blog, and she shares my passion for explaining how data analysis really works. She is co-author, with Rachel Schutt, of the recent book Doing Data Science (link). Cathy has had a varied career spanning academia and industry, as she explains below.
KF: How did you pick up your impressive statistical reasoning skills?
Thanks for the flattery, but I wouldn't call my skills impressive. I've always done my best thinking by assuming I understand nothing and starting from scratch. The best I can say about myself is that I have learned how to think abstractly and a few cool methods, or better yet rules of thumb, that help me get at very basic information.
What I know about thinking abstractly happened mostly during my mathematical training, first at math camp in high school, then as an undergrad in a highly welcoming and mathematically vibrant community at UC Berkeley in the early 1990s, and then during grad school at Harvard and to some extent my post-doc at MIT and my two-year Assistant Professor stint at Barnard College.
To be honest most of the last few years of being a "grownup" academic was spent learning non-math stuff like how to teach and write letters of recommendation.
Then I learned a bunch explicitly in the realm of statistical reasoning when I first got to D.E. Shaw, from my boss Steve Strong. Since then I feel like I've just been corroborating what Steve explained to me early on, which is that people fool themselves into thinking they understand stuff they don't.
KF: How would you rate the relative importance of academia and real-world experience in training your data interpretation skills?
I'd say that, on the whole, learning to think abstractly has been at least as important to me as rules of thumb, and certainly more important than any given algorithm or technique. For example, from my experience working in industry, the most common mistake is answering the wrong question, not using the wrong technique.
I routinely tell people that, as a mathematician, you are a professional learner, with an added advantage of getting used to being wrong and feeling stupid. I'm sure the same can be said about other disciplines, but I'll stick to what I think I know on that score.
KF: What advice would you give to a young graduate with a BS in a quantitative field: get an advanced degree in Statistics, or go find a job in analytics?
I don't think it's a waste of time to get a Ph.D. and then an industry job, because although you're not honing specific skills in your future line of work, you are honing brain paths and habits of mind which don't come easy under time pressure and/or with money on the line. And of course, there are some people who love the feeling of getting things to work so much that they don't have patience for the thesis thing, and that makes sense for those people, as long as they don't give short shrift to the high-level perspective.
KF: What is your pet peeve with published data interpretations?
Better question: what isn't my pet peeve with published data interpretations?
I'm a huge complainer about everyone and everything, in spite of the fact that I think data and data analysis techniques are powerful and can and should be used for good.
I guess if I had to pinpoint my single most massive peeve, which really cannot be termed "pet," it would have to be hiding perverse incentives (and almost all incentives are perverse in some way) behind what people present as "objective truth". In my experience, outside of the world of sports where everything is transparent (except steroid use), there is always some opacity and gaming going on, and someone is either making money off of it, gaining status from its publication, or wielding power through it.
And come to think of it, you've asked me the wrong question altogether. My biggest peeve with data interpretations is how many aren't published at all. For example, the Value-Added Model for teachers is being used to evaluate teachers but I can't seem to get my hands on the source code to save my life. Not to mention the NSA models.
[Edit: David Spiegelhalter also complained about which studies don't get published, though out of a different concern. See this interview. In the recent furor over Google Flu Trends, the researchers expressed dissatisfaction that the underlying algorithm isn't properly documented in the public domain.]
KF: Which source(s) do you turn to for reliable data analysis?
I don't trust anything or anyone, including my own analysis. Everything comes with caveats. Having said that I usually trust people more when they are open about their caveats. On the other hand, even admitting that opens me up to being fooled by people who write up fake caveats to seem trustworthy. It's really an endless loop.
So, for example, I like raw data, especially when I know how the data was gathered. Look at this gif, which shows a map of death penalty executions. In some sense that's as good as it gets, but it is also misleading: there are way more people in, say, California (38M in 2013) than in Nevada (3M in 2013), so two states that look similar on the map can have very different per-capita rates.
Bottom line is, never trust anything until you've checked it, and even then only trust your own memory of it for about 20 minutes.
KF: What advice do you have for the average reader of this blog? Surely, checking everything they read is not too realistic.
Of course, we don't have time to check everything. My suggestion is to remain skeptical of anything that you haven't checked through. And of course, don't confuse skepticism with cynicism, but also don't confuse skepticism with evangelism.
KF: Thank you so much. I've really enjoyed our conversation.
PS. I subsequently wrote about the chart that Cathy referenced in this interview. See here.
This article published by VentureBeat is too much. The title claims "The Internet is killing off marketing surveys & it's for the best". This article is tagged as "Big Data". Big delusion is what it is.
This is a great example of the kind of revisionist history that is practiced in the name of Big Data. You'll also notice that there is no data or evidence presented to support any of its many far-reaching claims.
First comes the howler:
about eight years ago, people started raising concerns about respondent quality [of traditional marketing surveys]; and as social media took off, some dared to wonder aloud whether online ratings and reviews were eliminating the need for surveys altogether.
Eight years ago would be 2006-2008. What happened in 2006-2008 that made people question instruments that have been used for decades? What kinds of concerns? How do "social media" and online ratings and reviews solve those problems? We will not learn.
Apparently, those eight-year-old concerns were not sufficient to sink the sorry enterprise until "recently". We're told -- again with no data -- that "we’re witnessing the demise of the lengthy, grid-question littered, rating-scale driven survey as we know it."
Raise your hand if you have done an online survey for a market research company. Is the survey "lengthy"? Does it contain a "grid"? Does it ask you to use a "rating scale"? Well, I thought so. Just in the last week, I reviewed an online survey design submitted by an outside agency which contains thirty questions, replete with multiple grids and multiple rating scales.
Later on, the author attacks blinded designs. I'm not kidding. This is the charge: '[Consumers] ask themselves, “Why should I invest time and candor in responding to questions posed by some person or entity that won’t even reveal their identity, let alone respond?”'
Apparently, tweets and online reviews are the new way. Don't worry that the tweets are unrelated to your research question, or that most of those online reviews were purchased by your social media marketing agency and written by people who have never used your product.
And may we ask how they propose to measure "unaided brand awareness"? It is well known that the following two questions lead to vastly different responses:
(A) Name three services which can be used to create an online survey
(B) Have you heard of the following companies that provide online survey tools? SurveyMonkey, Zoomerang,...
Next comes the obligatory Big Data moment: "Big Data is also reducing the need for quantitatively rigorous, predictive surveys." What could that even mean? We now prefer quantitatively weak, unpredictive surveys?
The author explains that today we can just "harvest and analyze the masses of behavioral data already available." From where? Do log data or tweets inform us about attitudes, motivation, and psychology? As usual, we are asked to assume Big Data solves some problems, therefore Big Data solves those problems.
Now I have no doubt that "Big Data" will impact the market research field. This does not excuse poor arguments presented without evidence.
Making Big Data add value is not as simple as "harvesting". Tweets and reviews have all the characteristics of the OCCAM framework: they are observational in nature, lack controls, are adapted from their original purpose, are merged with other datasets, and, to the deluded, are complete (N=All).
I am excited to chat with Professor David Spiegelhalter, who is no stranger to our UK audience, or to our statistics colleagues. Perhaps his most well-known contribution is the DIC criterion for model selection, introduced in a paper he wrote with collaborators. He holds the impressive title of Winton Professor for the Public Understanding of Risk at the University of Cambridge (link). He also writes a blog called Understanding Uncertainty (link), and as the accompanying photos show, is someone who knows how to enjoy life.
I mean, a statistician who appeared on Winter Wipeout (link to YouTube for spectacular splashes at 15:12 and 16:10), who'd have thought? Yes, Wipeout is that obstacle course show held over a pool of water. He also made this rather more educational YouTube video (link).
KF: How did you pick up your impressive statistical reasoning skills?
Well that's your label and not mine. I started off doing pure maths, but that got too hard, and then I did mathematical statistics, and that got both too hard and too boring, and so for years now I have preferred getting involved in real problems that people are trying to handle using data.
But generally the data are messy, incomplete, and not as relevant as desired. While some technical insights are vital, I think any skill comes mainly from an apprenticeship of dealing with many problems, making many mistakes, trying to explain things, and far too much time spent critiquing studies.
KF: What is your pet peeve with published data interpretations?
That's easy to identify - it's a non-scientific approach to science reporting. I have a naive view that scientists should do investigations to answer a question, and they should be pleased whatever the answer. But it seems clear from many publications that some researchers set out to prove a point, and do everything they can to do so: in the worst instances they write an inflated abstract, the journal puts out a press release, and the media lap it up.
I feel the public get fed a diet of highly selected and biased studies (often, ironically, on diet) that have gone through so many filters that they become very unrepresentative of the bulk of research conducted. In my more cynical old-man moments, I would say that the very fact that a study is reported in the media is a reason to ignore it - almost certainly you would not have heard about it if the results had been different.
KF: That last point sounds counterintuitive. Let's take a diet example. The media has been telling us new research suggests that four or more cups of coffee each day is great for you. If the research result were null, surely it wouldn't get picked up by the media. Why would that be a bad thing?
Say the media tells us four cups or more of coffee every day is great for you, and I judge that, if the study had shown no effect of coffee, it would not have been press-released and the media would not have picked it up. This probably means there is an unknown number of studies out there that showed the opposite to what I am being told by the media, but I am not hearing about them because they are not newsworthy enough. Therefore ignore the media. It also saves a lot of time.
KF: That's rather sobering. Which source(s) do you turn to for reliable data analysis?
These would tend to be individuals and teams that I know and trust: Andrew Gelman (link to my interview) comes to mind, and there are other great scientists whose opinions I value. I also respect people who are trying to produce good odds for future events, without pushing for one side or another. A purely financial interest produces objectivity, and so sports-betting sites are good examples - it will be interesting to see how 538 develops.
KF: Thank you very much for sharing your insights.
David and Michael Blastland just published a new book called "The Norm Chronicles", which I had a chance to preview. It's an idiosyncratic look at the idiosyncratic risks of modern living.
Reader Daniel T. is unhappy about this analysis of the intraday Internet usage by OS and device types. He doesn't like their choice of index, which I'll get to in a second post. (Link appears here when ready.)
There is something else wrong with this type of analysis.
Let's do a thought experiment. If you are a marketer interested in the diurnal variability in Internet usage, what are some of the factors you might investigate? My list would include whether the user is logging in from work or from home; whether the user is working or unemployed or on vacation; whether the user is male or female, young or old, a student, retired, etc.
Does your list include operating system? (e.g., Apple, Microsoft, Unix, ...) I bet not.
But OS is exactly what the blogger analyzed, and thousands of marketers around the world do so on a daily basis. That's because they are using what data they can get their hands on. Web log data are adapted, that is to say, they were collected by engineers for the purpose of debugging, and now they are used by marketers to explain consumer behavior. It's not hard to see why such data cannot tell the full story.
This goes back to the O and the A in my OCCAM framework for Big Data (link). Web log data are the prototypical example of data collected indiscriminately by tracking devices without purposeful design, and then adapted to marketing applications.
One way to cope with using adapted data is to be clear about our model of the world. Assume OS really does affect Internet usage. How does OS affect Internet usage? Are you assuming that the features of an OS directly condition a user's behavior? Or are you assuming that the choice of OS is an indicator of the type of user?
Another way to cope with adapted data is to find or collect the data you really want (e.g. demographics, occupation) rather than analyzing data you don't understand. Recall Sean Taylor's advice to collect your own data (link).
A revised version of my previous post was picked up by Harvard Business Review (link). The post introduces the OCCAM framework for Big Data that I have been speaking about at my book talks (upcoming events are listed on the right column of the blog). The OCCAM framework identifies five elements of today's data sets that present challenges compared to traditional data sets. These challenges are not new, but they are amplified by the nature of Big Data. Little attention has been paid to them by the Big Data industry, but as we hear about more Big Data fails, we will surely hear more about these elements.
On Wednesday, I'm participating in a free Webinar organized by Agilone on data-driven marketing.
I'm excited that Nate Silver has finally relaunched fivethirtyeight.com. (He announced his move from NYTimes to ESPN last year.) The site has a clean look and is easy to navigate. He has pieces on the NCAA bracket, an early discussion of the Senate race, the time-wasting statisticians' debate over data is/are, this nice piece about reading economic data, and some others.
Apparently a bunch of economists have already pronounced its death (Krugman's "Tarnished Silver" is an example of these pieces). I find these reactions premature and immature. The sample size of articles right now is very small - and the site needs time to find its audience.
The challenge for Nate is to figure out a balance between long articles based on thorough statistical analyses, and shorter pieces that get to the point quickly but are still fundamentally driven by quantitative thinking. For example, I thought the piece about economic data was useful for the average reader: that rising median job tenure does not necessarily mean workers are keeping their jobs, but can result from less-experienced workers being selectively laid off, is not something an average reader thinks about on an average day. Tyler Cowen (Marginal Revolution) snarked: "What it says is fine, but it won't interest me." But how many of Nate's readers are professors of economics?
Writing data-driven stories is a challenge that I am familiar with. In fact, some critics think I have failed in this regard. The editor of Significance, in a generally favorable review, praises the front part of Numbersense, which is extremely restrained in the mathematical background it requires, but considers the last two chapters of the book a "sudden death ending". Those chapters, on the other hand, have been received favorably by data analysts. It's hard to satisfy both constituencies.
I sense that Nate and I share a common mission, which is to show the general reader how to interpret data and data analyses. This means that what we show must be replicable by the reader. By this criterion, the analytical work would not be publishable in an academic journal. It might be a partial analysis, for example, based on a sample that is feasible for readers to collect, or a simplification of a model that is too complex to describe in 1,000 words. We also assume that the general reader is not inclined to plod through academic journals looking for the supreme proof of something.
The big change to 538 is that it is now a team production. It was easy to maintain consistent quality and a niche when Nate was the only contributor. Other writers have their own styles and inclinations. It remains to be seen whether this effort will develop an identity.
I fail to understand how economists can complain that the 538 pieces on economics are not informed by expert opinion while professing to be fans of Nate's work in election forecasting, a field that also has a history of academic work. It sounds like "not in my backyard".
It's Spring Break at NYU, which, for professors, is not a break. I have been marking midterms for my business analytics class. Since I like to set open-ended questions (is there anything else in statistics?), I get a variety of answers. One of the questions helps clarify what I mean by numbersense.
The question asks students to comment on the distribution of a variable (median income) in a dataset of customers. Every student should know how to generate a histogram and a boxplot, plus summary statistics and percentiles for this data. The figure below shows what each student was looking at. Before you read further, think about what features of this distribution attract your attention.
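For readers who want to reproduce the students' starting point, here is a minimal sketch in Python. The file name and the column name "median_income" are my own inventions, not the actual class dataset:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical customer file; "median_income" stands in for the variable
# the students were asked to examine.
df = pd.read_csv("customers.csv")
income = df["median_income"]

# Summary statistics and percentiles.
print(income.describe(percentiles=[0.25, 0.5, 0.75]))

# Histogram and boxplot side by side.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
income.plot.hist(bins=50, ax=ax1, title="Histogram of median income")
income.plot.box(ax=ax2, title="Boxplot of median income")
plt.show()
```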
The responses I received fell into several categories. Let me list them out:
1. The mean is $40,369 and the median is $43,174. Most of the customers have median income between $26,083 and $56,897.
2. The mean is $40,369 and the median is $43,174. Most of the customers have median income between $26,083 and $56,897. There is a large range of incomes from $0 to $200,001, with a lot of high outliers.
3. The median is $43,174, about the same as the mean. Most of the customers have median income between $26,083 and $56,897. There are a lot of high outliers. Almost a quarter of the sample has $0. Based on the age distribution (skewing toward older people), I think these may be retirees.
4. The median is $43,174, about the same as the mean. Most of the customers have median income between $26,083 and $56,897. There are a lot of high outliers. There appear to be two types of customers, those with zero income and those with a standard distribution. Some of the entries with zero income may have been missing values coded as zeroes, because they correlate with unknowns or zeroes in other variables.
5. The median is $43,174, about the same as the mean. Most of the customers have median income between $26,083 and $56,897. There are a lot of high outliers. There appear to be two types of customers, those with zero income and those with a standard distribution. Since the data are not collected at the individual level but at the Zip+9 level, meaning they measure the median income of the residential blocks around each customer, $0 surely does not mean zero. The zero-income segment has average values of other variables not too different from the positive-income segment, so most likely, zero means unknown.
These answers are ordered from demonstrating the least numbersense to the most. Response #1 makes no mention of the spike of zeroes despite the strong hint in the question: "Give plausible explanations for any parts of the distribution that are not smooth". Response #2 notices the zeroes but is not bothered enough to explain them.
Responses #3-#5 all attempt to explain the observed anomaly. Response #3 has a good theory ("retirees") but somehow looks past the fact that the zero-income segment spans a wide age range. (The highlighted parts of the histogram below are the zero-income customers.)
In fact, this chart was used by several students to prove that retirees accounted for the zero-income segment. This is a "strong priors" problem: it's all too easy to accept weak evidence when it fits a strong theory.
One student divided the customers into zero-income versus not. This allows us to examine the distribution of other variables. For example, the median home value of those with "zero income" is almost the same as those with positive income.
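That check takes only a few lines. A sketch, using the same invented file and column names as before ("home_value" stands in for the home value variable):

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical file, as before

# Split on the suspicious zeroes and compare another variable across segments.
zero = df[df["median_income"] == 0]
positive = df[df["median_income"] > 0]

# Similar home values in both segments suggest the zeroes are coded missing
# values rather than genuinely zero income.
print("Zero-income segment, median home value:    ", zero["home_value"].median())
print("Positive-income segment, median home value:", positive["home_value"].median())
```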
Think about the people you hire to do analytics. While any of the answers above is acceptable, if you find someone who can give you Responses #3-#5, you are in much better shape. That's what I mean by hiring for numbersense.
The article (link) in Science about the failure of Google Flu Trends is important for many reasons. One is the inexplicable silence in the Big Data community about this little big problem: it's not as if this is breaking news -- it was known as early as 2009 that Flu Trends completely missed the swine flu pandemic (link), underestimating it by 50%, and then in 2013, Nature reported that Flu Trends overestimated a spike in influenza by 50%.
The second reason why this article is important is the additional analysis they conducted (there is extensive supplementary material available from Science). The highlights are:
The reported over-estimation in Oct 2013 was not a one-time event: Flu Trends has over-estimated flu prevalence in 100 out of 108 weeks since August 2011 (ouch!).
A simple model of projecting CDC data on a two-week lag would have done at least as well as Flu Trends, and no "Big Data" is needed for that. (A sketch of this baseline appears below.)
The researchers further report the difficulty of assessing and replicating what Google researchers did because the information they have released about their algorithm is both incomplete and inaccurate. In reserved, professional language, they noted: "Oddly, the few search terms offered in the papers [by Google researchers explaining their algorithm] do not seem to be strongly related with either GFT or the CDC data--we surmise that the authors felt an unarticulated need to cloak the actual search terms identified."
Well, either the researchers made up data in the paper they published and did not disclose this fact, which amounts to fraud, or they didn't make up the data and the model is so inaccurate that the most predictive search terms from a few years ago are no longer predictive. They owe us an explanation.
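For the curious, the two-week-lag baseline mentioned above takes only a few lines. This is a sketch with hypothetical file and column names, not the researchers' actual code:

```python
import pandas as pd

# Hypothetical weekly series: CDC's reported flu (ILI) rate and the Google
# Flu Trends estimate, indexed by week.
df = pd.read_csv("flu_weekly.csv", parse_dates=["week"], index_col="week")

# Naive baseline: predict this week's flu prevalence with the CDC figure
# from two weeks ago (the most recent one available, given reporting lag).
df["lag2_baseline"] = df["cdc_ili"].shift(2)

# Compare mean absolute error of the baseline against Flu Trends.
for col in ["lag2_baseline", "gft_estimate"]:
    mae = (df[col] - df["cdc_ili"]).abs().mean()
    print(f"{col}: MAE = {mae:.3f}")
```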
People who attended my book talks in the last few months, and my students, will not be surprised by the current coverage, as Flu Trends is one of many high-profile Big Data "success" stories for which readers will find very little documented evidence of success. It is as if data-driven decision-making is good for others but not for ourselves. So, my hat is off to these researchers for the courage to put this issue into the public discourse.
It is clear that more data do not automatically lead to better analysis. In fact, I argue at my public events that the revolution in data analytics is about five things, which I group under the acronym OCCAM. See the slide below.
Ask yourself what the key differences are between the datasets that underlie so-called Big Data studies and those we were using, say, 5-10 years ago. For me, the most important differences are that the data are collected without prior design, in an observational manner, usually by third parties, and often for a purpose different from our own ("adapted"). Further, different datasets are merged, exacerbating the issues of poor definitions and misaligned objectives. Controls are typically unavailable, and worse, analysis proceeds without an attempt to manufacture pseudo-controls.
Finally, in some cases, the data are "seemingly complete." This is the so-called "N=All" condition. The danger of this N=All talk is that its proponents mistake assumption for fact. We have seen this story before, in economics: assuming complete data is no different from assuming perfect information. Assuming something to be true doesn't make it true.
Before closing, I add a few words about why defining Big Data using a minimum size of datasets is absurd.
First, the problem of Big Data as defined by the likes of McKinsey is fundamentally unsolvable. If the current threshold is 100 terabytes, and we improve our processing power to tackle datasets of that size, then this definition calls for resetting the threshold of Big Data to 1000 terabytes, ad infinitum.
Second, some problem domains like education and medicine in which the units of measurement (schools, hospitals, students, patients) are upper bounded can never have a Big Data problem. Personal analytics is not a Big Data problem since no single person can produce that much data. And yet, some of the most exciting developments in data analytics are expected to come from those fields.
Third, no consumer will ever care about Big Data since no consumer will be exposed to terabytes or petabytes of data. Consumers (or citizens) will be impacted by this data revolution through better and more data analyses, but not by more data. Knowing the difference between those two things is fundamental to understanding this phenomenon.
Also of interest is my article on Big Data and Big Business, published in Significance last year. (link)
PS. Slate linked to this post. I'm not sure where I said "such 'big data' analyses are currently so abysmal as to be effectively useless." I'm saying there are lots of exciting things happening in data analytics right now but assuming that more data will solve all problems is where we are failing, and I offer an alternative way of framing Big Data which can be more productive.
Some time ago, there was a lot of hype about how new tech would demolish the superstar effect in entertainment sales because all the little titles in the long tail would be exposed to consumers. I recall Amazon being labeled the shining example of a company that made profits off the long tail (as opposed to the boring top of the distribution). I still remember this graphic from Wired (link):
A reader, Patrick S., pointed me to a study of music services that pronounces "the death of the long tail". (Warning: they want your email address in order to read the full report. The gist of the report was written up in this other blog.) Reading these pieces, one wonders whether this long-tail miracle even existed in the first place. The main thrust of the argument is that the new digital subscription/music services have not changed the allocation of spoils amongst artists. The little guys out in the long tail are still earning much less of a (shrinking) pie.
The long tail is an example of those intuitive, elegant scientific concepts that are much less impactful in the real world than claimed. Here is what I think caught some smart people on the wrong foot:
The distribution of profits has always been much more extreme than ballpark graphics (like the Wired chart above) show. The new study, for example, suggests that the top 1 percent earned 77 percent of all the money. This is much more extreme than the 80/20 rule. From the graphical perspective, you can think of the distribution as one very tall spike and a very flat, very long tail.
The cumulative weight of the very flat, very long tail is still not that heavy compared to the one spike. Even if you manage to increase the size of the tail by 10 percent, it still amounts to a small number. (See the quick calculation after this list.)
The above assumes you can increase the size of the tail. But it is quite hard to do. One reason is that the tail consists of millions of little pieces, which don't necessarily move in sync.
The second, and more important reason, is that titles or artists don't randomly end up in the tail. If a title is in the tail, it's an indicator that the artist or title is not appealing to the mass audience.
We fall prey to the romantic notion that there are some unjustly neglected artists, and rejoice in the idea that the long-tail effect may allow a few of them to reverse their fortunes. But a few outliers do not change the overall distribution.
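Here is the quick calculation promised above. The 77 percent figure comes from the study; the 10 percent boost to the tail is my hypothetical:

```python
# If the top 1% of artists take 77% of revenue, the whole tail takes 23%.
head_share = 0.77
tail_share = 1 - head_share               # 0.23

# Grow the tail's earnings by a (hard-won) 10%, holding the head fixed.
boosted_tail = tail_share * 1.10
new_tail_share = boosted_tail / (head_share + boosted_tail)
print(f"Tail share: {tail_share:.1%} -> {new_tail_share:.1%}")  # 23.0% -> 24.7%
```

Even under that generous assumption, the tail's share of the pie barely moves.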
The report's authors also make this observation:
Ultimately it is the relatively niche group of engaged music aficionados that have most interest in discovering as diverse a range of music as possible. Most mainstream consumers want leading by the hand to the very top slither of music catalogue. This is why radio has held its own for so long and why curated and programmed music services are so important for engaging the masses with digital.
While I believe this story, I should note that there is no quantitative evidence provided (at least not in the summary). If this is true, it has important implications for anyone in the business of "personalizing" marketing to consumers.
At the start of the year, The Atlantic published a very nice, long article about Netflix's movie recommendation algorithm. You may remember this algorithm (internally known as Cinematch) received a $1 million makeover several years ago (the Netflix Prize), except that the prize-winning entry was deemed too complex--and did not generate sufficient incremental value--to be put into production.
The reporter, Alexis Madrigal, noticed that Netflix has shifted attention from the queue of recommended movies to providing (micro-)genres of movies you might be interested in. His article is a great example of powerful data journalism: he reverse-engineered the internal structure of Netflix's new algorithm by extracting all of the keywords ("About Horses", "Critically Acclaimed", "Visually Striking", to name a few), and then creating all sensible combinations of these keywords (e.g. "Critically Acclaimed, Visually Striking Movies About Horses"), producing the roughly 80,000 possible microgenres used by Netflix. (It's clear that Netflix management endorsed this exercise and article but it's not clear how much proactive support they provided.)
One of my favorite columnists, Felix Salmon, reacted negatively to the change in algorithms, titling his post "Netflix's Dumbed-Down Algorithm". He interpreted the change as foreshadowing the day when Netflix can no longer offer every movie a user places in his/her queue, because third-party content providers have ratcheted costs too high. It's a longstanding weakness in Netflix's streaming business model.
Felix lamented that the genre-driven recommendations would be far inferior to the original recommendations:
The original Netflix prediction algorithm — the one which guessed how much you’d like a movie based on your ratings of other movies — was an amazing piece of computer technology, precisely because it managed to find things you didn’t know that you’d love. More than once I would order a movie based on a high predicted rating...
The next generation of Netflix personalization, by contrast, ratchets the sophistication down a few dozen notches: at this point, it’s just saying “well, you watched one of these Period Pieces About Royalty Based on Real Life, here’s a bunch more”.
Felix is right on the business model but misses the mark on the analytics. As someone who builds predictive models, I had the opposite reaction when reading The Atlantic's piece. I thought Netflix's data engineers learned something from the Netflix Prize "fiasco".
The major change to the analytical approach is shifting from predicting whether you'd like a movie to whether you'd watch a movie. This shift makes a lot of sense to Netflix as a business. It is sensible even from the user's perspective: who among us never watches a bad movie? (Even the movies we place in the queue ourselves could turn out to be bad.)
One big problem with the Netflix Prize was its singular focus on the RMSE metric, which, roughly speaking, measures the average error of the predicted ratings against actual ratings. The ratings data, though, are extremely skewed, making an average error criterion worse than misleading. By skew, I mean (a) a very small number of popular movies receives the majority of the ratings and (b) a small number of highly active users contributes the majority of movie ratings. Put differently, missing data is far and away the most important feature of the data.
Because of missing data, it is next to impossible to get good predictions for niche movies (with few ratings) or for users who do not actively feed signals into the algorithm. Improving RMSE by 10 percent does not mean every user's predictions improved by 10 percent. The improvement is likely concentrated in user-movie pairings for which there is sufficient data to work with. It would be enlightening if someone did an analysis of the performance of the winning algorithms by segments of users (based on the amount of prior data to work with).
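Such a segment-level audit is straightforward to describe. A minimal sketch, assuming a hypothetical table of (user, movie, actual, predicted) rows rather than the real Netflix Prize data:

```python
import pandas as pd

# Assumed columns: user, movie, actual, predicted.
ratings = pd.read_csv("predictions.csv")

# Bucket users by how many ratings they contributed; an overall RMSE gain
# can hide flat or worse performance for low-activity users.
counts = ratings.groupby("user")["movie"].transform("count")
ratings["segment"] = pd.cut(counts, bins=[0, 10, 100, float("inf")],
                            labels=["light", "medium", "heavy"])

# RMSE within each segment.
sq_err = (ratings["actual"] - ratings["predicted"]) ** 2
print(sq_err.groupby(ratings["segment"]).mean() ** 0.5)
```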
Now, consider predicting what you'd watch next based on the viewing behavior of you (and other users). For every user and movie combination, the user either has or has not watched the movie. Just like that, the missing-data issue vanishes. What Felix sees as "dumbing down" may in fact be a smartening up.
As I pointed out in Chapter 5 of Numbersense (in talking about Groupon's bid to personalize offers; link), every business faces a set of conflicting objectives when trying to "personalize" marketing to customers. I believe this Netflix shift shows they have found a good balanced solution.
I will be speaking at the Agilone Data Driven Marketing Summit (link) in San Francisco on Thursday. I will be talking about hiring for numbersense. Drop by if you are in the area. Future events are listed on the right column of the blog >>>
I feel bad piling on the "good guys" in the sports doping spectacle but sometimes, you need someone to point you to the mirror.
Here are the breathtaking first sentences from an article in Canada's The Globe and Mail about the scarcity of positive doping results in Sochi 2014:
At the midpoint of the Sochi Games, not yet marred by a single case of doping, the IOC’s top medical official said its efforts to catch drug cheats were so successful they had scared them all away.
A week later, after the disclosure of a fifth doping case on the final day of the games, IOC president Thomas Bach cited the positive tests as the sign of success.
If you have been reading this blog, you already know the people in the anti-doping business set themselves a really low bar. The title of Chapter 4 of Numbers Rule Your World (link) contains the phrase "timid testers" for a reason.
The statement by the unnamed "top medical official" is the more shocking. If there are no positive test results, and this is considered an accurate portrayal of the doping situation, then we must believe that there are no dopers. Apparently, this official believes no athlete who has been tested doped. Not a single one.
Who’s right? To [IOC president] Bach, it doesn’t much matter.
“The number of the cases for me is not really relevant,” Bach said. “What is important is that we see the system works.”
Now, it's Bach's turn to display his ignorance of the statistics of anti-doping. As I explained years ago in the book, and also on this blog, the proportion of tests that come back positive is one of the most important numbers to look at when judging the success of an anti-doping program. So far, we know that six out of 2,630 athletes tested positive, meaning the rate of testing positive is 0.23%. (A positive rate of much less than 1 percent is the norm at all large international events.)
What does that mean? If one percent of athletes doped, then we should expect about 26 positives if the tests were 100% accurate. Since they only caught six, at least 20 of the 26 dopers passed the test. Yes, that means more than three-quarters of dopers passed. (And I'm only assuming one percent doping, and not allowing the possibility of false positives.)
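Here is that back-of-the-envelope calculation as a sketch, with the one percent doping rate clearly labeled as an assumption:

```python
tested = 2630
positives = 6
assumed_doping_rate = 0.01                       # an assumption, not a measured fact

expected_dopers = tested * assumed_doping_rate   # about 26
missed = expected_dopers - positives             # about 20 dopers passed
print(f"Positive rate: {positives / tested:.2%}")                      # 0.23%
print(f"Implied false-negative rate: {missed / expected_dopers:.0%}")  # ~77%
```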
This leads me to the as-yet unrecognized scandal. Lance Armstrong, Ryan Braun, Mark McGwire, Alex Rodriguez, etc., etc. None of these confirmed dopers was caught by steroid tests. In fact, all of them boasted at one point or another that a long string of negative test findings proved their innocence.
Rather than gloating about the "success" of anti-doping measures, they should try explaining how the most notorious dopers in sports were repeatedly given a clean bill of health.
I am a supporter of anti-doping. I just want some discussion of the false negative problem.