There was a lively, fun discussion after my talk last night in New York. For those who couldn't attend, let me review some of the conversation. Here you go:
Q: Tell us more about the chapter in Numbersense titled "Are They New Jobs When No One Can Apply?" Related to economic data, can you talk about the idea that we still need to import foreign workers because there isn't enough skilled labor available domestically?
A: That chapter is really about the use of statistical adjustments. Some people say they are skeptical of the official unemployment numbers, the seasonally adjusted numbers, because they include jobs that are made up--there isn't a job posting that you can apply to. These people think that raw data is always better than adjusted data, because statisticians are doing naughty things to the data. In reality, it's the opposite. Raw data is bad and adjusted data is better.
In terms of the skilled labor issue, this is related to the argument about structural unemployment, the claim that the unemployment problem cannot be solved because people need retraining. The proposed solution is to send more people to college. This is usually supported by data showing that the unemployment rate of college graduates is much lower than that of non-college graduates. However, there is a cohort problem here. If we drill down to recent college graduates, there are many who aren't finding jobs. So if we make more college graduates, there will be even more competition for those jobs, and it will suppress the income of those jobs.
The unemployment rate and other economic indicators are complex things. The chapter [Chapters 6-7 in Numbersense] goes into quite a bit of detail to address the question of why they seem out of touch with reality.
Q: How is Big Data going to impact the movie industry? Can you predict what people will watch?
A: To some extent. But remember the example of the Target model predicting pregnancy I just described [Chapter 5 in Numbersense]. If you use a statistical criterion, like the predictive lift, you can congratulate yourself on having built a great model. But then if you apply a more common-sense metric, like the hit rate, you notice the really high proportion of errors even when the lift is high.
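To make the lift-versus-hit-rate point concrete, here is a minimal sketch in Python. All of the numbers are made up for illustration; they are not Target's figures and they do not come from the book. I take "lift" to mean the hit rate among flagged customers divided by the base rate in the whole population, which is one common definition.

```python
def lift_and_hit_rate(population, positives, flagged, true_positives):
    """Return (lift, hit_rate) for a model that flags a subset of customers.

    hit_rate: share of flagged customers who are truly positive.
    lift: hit_rate divided by the base rate in the whole population.
    """
    base_rate = positives / population
    hit_rate = true_positives / flagged
    return hit_rate / base_rate, hit_rate


# Hypothetical numbers: 10,000 shoppers, 100 of them pregnant;
# the model flags 200 shoppers, and 40 of those are truly pregnant.
lift, hit_rate = lift_and_hit_rate(
    population=10_000, positives=100, flagged=200, true_positives=40
)

print(f"lift: {lift:.0f}x better than picking at random")
print(f"hit rate: {hit_rate:.0%} of flagged shoppers are correct")
print(f"errors: {1 - hit_rate:.0%} of flagged shoppers are wrong")
```

With these invented inputs, the model is 20 times better than random selection, yet four out of five flagged shoppers are still mistakes. Both statements describe the same model.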
In social-science problems like this, I always advocate the combination of quantitative and qualitative data. It is not sufficient to just use frequencies. They don't tell you why someone watched something. [Then I went on a rant about why correlation is still not causation in the Big Data era.]
Q: What can you say about how the Democratic and Republican parties used Big Data in the recent presidential elections? Can you talk about the likely voter models?
A: Somewhere in the book (Prologue), I discussed the likely voter models. I used this to illustrate the point I was making earlier about understanding what part of the argument is data-driven and which part is theory-driven. In the book, I compared the work of Nate Silver and the guy who created the UnSkewedPolls.com website. They both used the same data set of the poll results but came out with really different projections.
This points to the important idea that any data analysis involves theory--you can't avoid it. There is this myth out there that says when you have loads of data, the data itself is objective and stops all debate. That's so far from the truth. Anyone who has worked with data knows that you have to make assumptions. In the case of likely voters, the Republican guy made assumptions about how their party members would be motivated to come out to vote. This is a necessary assumption because if you just use the data, you will always predict that the future is the past, and you will never be able to predict a surge in interest. In this case, the theory part of the analysis turned out to be spectacularly wrong.
Q: Big Data is impacting the education sector in many ways, such as Value Added Models being applied to evaluate teachers. What do you think of these models?
A: Education is a good example of Big Data. It fits the five criteria that I just laid out for what is a Big Data study. The value-added models for instance are silly, partly because the data is co-opted--test scores are originally intended to measure student performance but then they're being used to evaluate teachers. Also, the whole pay-for-performance concept doesn't work when you can't measure performance well; it backfires and causes rampant cheating. The first chapter of Numbersense was originally going to be about how teachers and principals all over the country were turned into cheaters; but then I ended up writing about a different kind of fraud--how schools game the school rankings.
Q: On one of your slides, you mentioned eHarmony trying to port its algorithm from matchmaking to hooking up employees and employers. That sounds like a promising thing. What do you think of it?
A: Just like all predictive models, you have to be careful in understanding how accurate the predictions are. The media do a poor job of reporting the accuracy. The eHarmony model can be evaluated in the same way that I evaluated the Target model for predicting pregnancy. You have to think about both false positives and false negatives, and the fact that those two trade off each other. In the media, you often read about one of those two metrics, and they hide the other one.
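Here is a small Python sketch of that trade-off. The scores and labels are invented for illustration and have nothing to do with eHarmony's actual model; the only assumption is that a model outputs a score and you declare a "match" when the score clears some threshold.

```python
# Hypothetical model scores for ten candidate-employer pairs, and whether
# each pair truly turned out to be a good match (1) or not (0).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]


def count_errors(threshold):
    """Return (false_positives, false_negatives) at a given score cutoff."""
    pairs = list(zip(scores, labels))
    # False positive: predicted a match (score >= cutoff) but it was not one.
    false_positives = sum(1 for s, y in pairs if s >= threshold and y == 0)
    # False negative: predicted no match (score < cutoff) but it was one.
    false_negatives = sum(1 for s, y in pairs if s < threshold and y == 1)
    return false_positives, false_negatives


for threshold in (0.25, 0.55, 0.85):
    fp, fn = count_errors(threshold)
    print(f"cutoff {threshold:.2f}: {fp} false positives, {fn} false negatives")
```

As the cutoff rises, the false positives fall from 4 to 2 to 0 while the false negatives climb from 1 to 2 to 3. A report that quotes only one of the two columns tells you nothing about the price paid in the other.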
Q: Did the Gates Foundation stop funding the "small schools movement" because of [Howard Wainer's] statistical analysis? [See Prologue of Numbersense.]
A: Not really. However, they did use statistics to come to that conclusion. To their credit, the people in the Gates Foundation ran some rigorous analyses comparing the small schools they funded with larger schools. And they learned that the small schools were not better than the larger schools, and in some cases, even worse. So they started to spend the money on other things like curriculum development.
Q: Do you have an opinion on the recent Malcolm Gladwell piece about doping in sports?
A: I do have reactions to the Gladwell piece. Those of you who came to this talk three years ago for my other book remember that anti-doping testing was one of the main examples I spoke about. In fact, other people have requested that I write about Gladwell's assertion that doping should be made legal. I will be putting up my response hopefully in the next few days. Look for it on my blog.
I'm sure there were more questions, so my apologies to any attendee whose question I could not recall. Thanks for attending.
Find out more about the book here.