In Chicago, I spoke about the impact of Big Data on marketing. I was going to summarize the key points here, but then I noticed that Chris Rollyson has already done the work, and did a much better job than I could. Here are his copious notes.
One question from the audience that I didn't address fully was the use of data in education. I got as far as the farce of "value-added models" (VAM) in measuring teacher performance. In addition to numerous statistical problems (see here for example), the most basic problem is the use of inappropriate data. Standardized tests are designed specifically to measure student aptitude, not teacher performance. One of my points yesterday was that bad data is worse than no data.
VAM is a great case study for how not to use Big Data. The logic of the proponents seems to be: we need to measure teacher performance. We have some observational data (test scores) lying around that may be somewhat related. It's better than nothing. It's a huge data set so it must be useful. We know it's observational data, but if we control for every conceivable thing, we should be fine.
Unfortunately, this kind of attitude is prevalent in the Big Data industry: if there is data, there must be something interesting in it. Except that what's typically interesting is the realization that the way the data was collected has so many flaws that nothing useful can be extracted from it.
My advice is to make your own data. In the case of measuring teacher performance, rather than take useless test score data, come up with some hypotheses, and go out and collect the data needed to prove/disprove those.
Measuring teacher performance can be framed as a reverse causal problem (see my post here and Gelman's original post). If I hold my nose and accept test scores as a measure of the output of teachers, then the first-order problem is to explain why performance varies. We can come up with a lot of different hypotheses, and test them.
Alternatively, to the extent possible, we should run experiments to learn things like whether the average student learns more from the traditional, boring method of teaching math or from New Math. This is the forward causal question.
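To make the forward causal question concrete, here is a minimal sketch of how such an experiment might be analyzed. Everything in it is an assumption for illustration: the scores are simulated rather than real, the function name `permutation_test` is made up, and the permutation test is just one reasonable way to compare two randomly assigned groups, not a prescribed method.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in mean scores.

    Returns the observed difference (mean of A minus mean of B) and an
    estimated p-value.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Re-deal the method labels at random, as if method made no difference.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_permutations + 1)


# Simulated experiment: 200 hypothetical students, randomly assigned.
rng = random.Random(42)
students = list(range(200))
rng.shuffle(students)  # random assignment is what makes this an experiment
traditional_ids, new_math_ids = students[:100], students[100:]

# Made-up scores; the 3-point gap is an assumption baked into the simulation.
traditional = [rng.gauss(70, 10) for _ in traditional_ids]
new_math = [rng.gauss(73, 10) for _ in new_math_ids]

diff, p_value = permutation_test(new_math, traditional)
print(f"difference in mean score: {diff:.2f}, p-value: {p_value:.4f}")
```

The key design choice is the random assignment: because the students did not choose their teaching method, there is no need to "control for every conceivable thing" after the fact.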
The good folks at Observational Epidemiology may have misunderstood my Moneyball analogy (link). They promise to clarify what they mean later. I'll respond when that happens.