Cathy O'Neil may need no introduction to blog readers. She's the author of the hard-hitting MathBabe blog, and she shares my passion for explaining how data analysis really works. She is the co-author, with Rachel Schutt, of the recent book Doing Data Science (link). Cathy has had a varied career spanning academia and industry, as she explains below.
***
KF: How did you pick up your impressive statistical reasoning skills?
CO:
Thanks for the flattery, but I wouldn't call my skills impressive. I've always done my best thinking by assuming I understand nothing and starting from scratch. The best I can say about myself is that I have learned how to think abstractly and a few cool methods, or better yet rules of thumb, that help me get at very basic information.
What I know about thinking abstractly I picked up mostly during my mathematical training, first at math camp in high school, then as an undergrad in a highly welcoming and mathematically vibrant community at UC Berkeley in the early 1990s, and then during grad school at Harvard and, to some extent, my post-doc at MIT and my two-year Assistant Professor stint at Barnard College.
To be honest, most of the last few years of being a "grownup" academic were spent learning non-math stuff like how to teach and write letters of recommendation.
Then I learned a bunch explicitly in the realm of statistical reasoning when I first got to D.E. Shaw, from my boss Steve Strong. Since then I feel like I've just been corroborating what Steve explained to me early on, which is that people fool themselves into thinking they understand stuff they don't.
KF: How would you rate the relative importance of academia and real-world experience in training your data interpretation skills?
CO:
I'd say that, on the whole, learning to think abstractly has been at least as important to me as rules of thumb, and certainly more important than any given algorithm or technique. For example, from my experience working in industry, the most common mistake is answering the wrong question, not using the wrong technique.
I routinely tell people that, as a mathematician, you are a professional learner, with an added advantage of getting used to being wrong and feeling stupid. I'm sure the same can be said about other disciplines, but I'll stick to what I think I know on that score.
KF: What advice would you give to a young graduate with a BS in a quantitative field: get an advanced degree in Statistics, or go find a job in analytics?
CO:
I don't think it's a waste of time to get a Ph.D. and then an industry job, because although you're not honing specific skills in your future line of work, you are honing brain paths and habits of mind which don't come easy under time pressure and/or with money on the line. And of course, there are some people who love the feeling of getting things to work so much that they don't have patience for the thesis thing, and that makes sense for those people, as long as they don't give short shrift to the high-level perspective.
***
KF: What is your pet peeve with published data interpretations?
CO:
Better question: what isn't my pet peeve with published data interpretations?
I'm a huge complainer about everyone and everything, in spite of the fact that I think data and data analysis techniques are powerful and can and should be used for good.
I guess if I had to pinpoint my single most massive peeve, which really cannot be termed "pet," it would have to be hiding perverse incentives (and almost all incentives are perverse in some way) behind what people present as "objective truth". In my experience, outside of the world of sports where everything is transparent (except steroid use), there is always some opacity and gaming going on, and someone's either making money off of it, gaining status from its publication, or wielding power through it.
And come to think of it, you've asked me the wrong question altogether. My biggest peeve with data interpretations is how many aren't published at all. For example, the Value-Added Model for teachers is being used to evaluate teachers, but I can't seem to get my hands on the source code to save my life. Not to mention the NSA models.
[Edit: David Spiegelhalter also complained about what studies don't get published, but for a different kind of concern; see this interview. In the recent furor over Google Flu Trends, the researchers expressed dissatisfaction that the underlying algorithm isn't properly documented in the public domain.]
***
KF: Which source(s) do you turn to for reliable data analysis?
CO:
I don't trust anything or anyone, including my own analysis. Everything comes with caveats. Having said that, I usually trust people more when they are open about their caveats. On the other hand, even admitting that opens me up to being fooled by people who write up fake caveats to seem trustworthy. It's really an endless loop.
So, for example, I like raw data, especially when I know how the data was gathered. Look at this gif, which shows a map of death penalty executions. In some sense that's as good as it gets, but it can also mislead, since there are way more people in, say, California (38M in 2013) than in Nevada (3M in 2013), so even though the two states can look similar on the map, the underlying rates aren't.
Bottom line is, never trust anything until you've checked it, and even then only trust your own memory of it for about 20 minutes.
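[Edit: to make the population point concrete, here is a minimal sketch of the per-capita adjustment Cathy is alluding to. Only the 2013 population figures come from her example; the execution counts in the snippet are hypothetical placeholders, not real data.]

```python
# Hypothetical illustration: why raw counts on a map can mislead.
# The 2013 population figures come from Cathy's example; the execution
# counts below are made-up placeholders, not real data.

populations = {"California": 38_000_000, "Nevada": 3_000_000}
raw_counts = {"California": 12, "Nevada": 12}  # identical raw counts "look similar" on a map

for state, pop in populations.items():
    per_million = raw_counts[state] / pop * 1_000_000
    print(f"{state}: {raw_counts[state]} executions (placeholder), "
          f"{per_million:.2f} per million residents")

# With identical raw counts, Nevada's per-capita rate is ~12.7x California's
# (38M / 3M), so shading states by raw counts hides the real difference.
```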
KF: What advice do you have for the average reader of this blog? Surely, checking everything they read is not too realistic.
CO:
Of course, we don't have time to check everything. My suggestion is to remain skeptical of anything that you haven't checked through. And of course, don't confuse skepticism with cynicism, but also don't confuse skepticism with evangelism.
KF: Thank you so much. I've really enjoyed our conversation.
P.S. I subsequently wrote about the chart that Cathy referenced in this interview. See here.