Stephen Stigler, the preeminent historian of statistics, gave a great talk on Monday afternoon at JSM, the annual gathering of statisticians, in Boston. He outlined seven core ideas ("pillars of wisdom") in statistical research that set the field apart; these are ideas developed by statisticians that represent significant advances to science and to human knowledge.
As he remarked, each of these advances overturned then-established science, but even today, many people outside statistics are not aware of these lessons.
I will briefly recount three of these discredited beliefs about data and uncertainty:
1. Fallacy: Throwing individual-level data away reduces the amount of information. What Statisticians Learned: Throwing individual-level data away, for example by aggregating or averaging, can increase information.
2. Fallacy: Information increases linearly and proportionately with the number of samples, i.e. small data, a few insights; big data, a lot of insights. What Statisticians Learned: Information increases only at the rate of the square root of the sample size, meaning there are diminishing returns to increasing sample size.
A corollary of this is that it will take a lot more effort to squeeze out ever-decreasing amounts of marginal information in big data.
3. Fallacy: The only correct way to run an experiment is to alter one factor at a time while keeping everything else unchanged. What Statisticians Learned: The one-factor-at-a-time dogma is wrong; we should ask Nature many questions at the same time.
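To make the second point concrete, here is a minimal sketch of the square-root law for the simplest case I can think of: the standard error of a sample mean. The function names, the normal distribution, and the sample sizes are my own choices for illustration, not anything Stigler showed.

```python
import math
import random
import statistics


def standard_error(sigma, n):
    """Theoretical standard error of the mean of n independent draws."""
    return sigma / math.sqrt(n)


def simulated_standard_error(n, trials=1000, sigma=1.0, seed=0):
    """Empirical spread of the sample mean over repeated samples of size n.

    Assumes normally distributed data with mean 0 (an illustrative choice).
    """
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, sigma) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.stdev(means)


if __name__ == "__main__":
    # Each row uses four times the data of the previous row.
    for n in (25, 100, 400):
        print(n, round(standard_error(1.0, n), 3), round(simulated_standard_error(n), 3))
```

In this setting, going from 100 to 400 observations only halves the standard error, so each extra digit of precision costs roughly a hundred times more data.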
I loved Stigler's talk. I do wonder if we also need to look at ourselves in the mirror. If we were to test our students on the seven pillars of wisdom after they took Intro Stats, I suspect we would learn a very unfortunate result: that they have not appreciated most, if any, of these points. The intro curriculum is much too focused on mechanics.
Here are some of the talks I have attended so far:
- David Banks showed an application of LDA (topic models) to a corpus of posts from political blogs. The topic distribution is allowed to vary with time. They attempted a semi-successful automated naming of topics by matching words to Wikipedia articles.
- Madeleine Cule described how Google uses a mixture of experiments and statistical adjustments to compute the effect of display advertising on brand interest. Brand interest, since this is Google, is defined as searching for branded terms on Google. How times have changed... digital advertisers have returned to the old world of measuring indirect brand metrics, abandoning clicks and direct responses.
- Phillip Yelland showed how Google created a machine that generates sales pitches for its sales team, whose objective is to increase advertisers' spending with Google. Like Cule's talk, this methodology is heavily influenced by end-user input: in this case, the inputs are explicitly represented in a Bayes Belief Network. For those paying attention, the Google researchers discussed how they do not have all of the data, how true "controls" are almost impossible on the Web, how they are restricted by data collection practices of third parties (i.e. adapted data), and how they write "lots of SQL".
- Phillip Yu gave an overview of statistical models of ranking data.
- Cynthia Rudin described a novel predictive model to find clusters (in space and time) of crimes in Cambridge, MA. Good questions from the floor.
- Someone from Nielsen (I only heard the second half of this talk) mentioned a lot of practical problems with set-top box data. In this setting, you are supposed to have "all the data". In reality, you don't, and what you have is problematic. Just one example: lots of people turn off their TV but not their set-top boxes, and it's hard to know if they are still watching the same channel or have gone to bed. Also, there is the really tricky business of adjusting such data: you need different models at the user level from the aggregate level, but those models are then inconsistent with each other.
- R. Mazumder offered a reformulation of Factor Analysis as an optimization problem.
- Pouliot tries to predict which restaurants in San Francisco may have health problems using text analysis of Yelp reviews. Another application of LDA, although in my mind, not a successful one. It was first attempted as an unsupervised problem, then as a supervised problem using past inspections as the training data. The problem with using past inspections is that the model is now conditioned by past rules. Not sure if Yelp reviews are the way to go, but this is an interesting open problem.
- Twitter was supposed to give a talk but cancelled.
- Facebook is, as far as I know, absent, perhaps because of the weird controversy about scientific research. If so, it's tragic.
Here are other sessions I attended, including one on statistical graphics I somehow missed in the first roundup:
- Grace Wahba gave the Fisher Lecture. Her research from decades ago on kernels and splines has made a great impact on the field. It's one of those ideas that took off once we reached a certain level of computational power. The support vector machine has been found to be a specific implementation of her more general ideas. She presented a paper on using smoothing-spline ANOVA to look at mortality within family members.
- There was an important session on reproducibility of statistical research, and I heard talks by Philip Stark and Y. Benjamini. Stark lays out the issues nicely, and calls a spade a spade: a lot of today's research, especially but not limited to computational statistics, amounts to "hearsay" or "advertising" for unpublished content. This is because referees or readers do not have the ability to replicate such findings. He introduces the Berkeley Common Environment (BCE), which is a set of tools that his and other research teams use to keep track of code, environments, etc. to make research replicable. Sounds interesting to me.
- Benjamini's talk is primarily about what he calls "selective inference". One way to describe this is that one should not use the same data both to parametrize one's model and to generate estimates from that model. If this is ignored, the estimates will very likely be too optimistic (i.e. over-fitted to the observed data). Perhaps this is better understood in the context of modeling, but the same double use of data is also widely practised in statistical testing, where we first select significant effects based on p-values, and then issue interval estimates on those selected effects. As his provocative subtitle asserts, "it's not the p-value's fault". He then ran out of time but was getting into another issue: the need to estimate interaction effects due to clusters, e.g. medical centers in a clinical trial.
- Another session brings together the business, legal and ethical dimensions of data privacy. I wish this session had been structured more like a conversation than a series of presentations with a few questions at the end. The lawyer, Paul Ohm, contends that anonymization is impossible, and the notion of PII (personally identifiable information) is useless. I don't think I heard what the solution is. Dick De Veaux, the statistician, reviews recent controversies, such as those involving Google Maps, the Facebook research, and the OkCupid experiments. Clearly our community is ill-prepared to make constructive contributions right now to this debate.
- Leland Wilkinson discussed "scagnostics", which apparently originated with John Tukey. Scagnostics are summary statistics of scatter plots. Once you reduce those plots to numbers, you can then classify and cluster these plots, and Wilkinson shows how this is done and then how to explore a corpus of scatter plots in this way. Useful if you have that type of data.
- Max Ghenis presented Glassbox, an R package created by Google's HR analytics group, for visualizing the response surface of models. I have used this sort of tool for years. Definitely a worthy project.
- Andrea Kaplan demonstrated Gravicom, a tool to manually generate clusters of nodes from network data. This is still preliminary work. Would like to see it again once she incorporates algorithmic approaches to complement the manual approach.
- Github and Shiny win the Supporting Actors awards for that section of Statistical Graphics that I attended.
- Carson Sievert presents a way to explore topics and words in LDA topic models. I like what I saw and think it would be very useful to people building these models. I hope he will expand this project from purely exploratory to allowing users to take actions based on what they see.
- M. Majumder's talk was entitled Human Factors Influencing Visual Statistical Inference. It's really about "line-up plots", which is a brilliant concept by A. Buja. You generate multiple sets of simulated data from the null distribution and insert the observed data into this set at a random position. Then the analyst tries to pick out the observed data from the noise. If the analyst can't do it, then we don't have a signal. This research used Mechanical Turk workers to test which covariates affect the performance of this task. I was a bit disturbed by the extreme variance of the "percent accurate" metric: the boxes in many of the boxplots went from almost 0 percent to almost 100 percent. I believe Majumder's conclusion is that the variance is almost entirely explained by the difficulty level of the tasks at hand. I would like to see this work repeated with a choice of tasks that do not range so much in difficulty level.
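For readers who have not seen a line-up before, the protocol described in the last item can be sketched in a few lines. This is only my illustration, not Majumder's or Buja's code: `make_lineup` is a hypothetical helper, and I am assuming the null data are generated by shuffling y against x (which breaks any association between them); other null-generating schemes are possible.

```python
import random


def make_lineup(x, y, n_panels=20, seed=None):
    """Hide the observed (x, y) data among null panels.

    Returns (panels, true_position). Each null panel pairs x with a
    shuffled copy of y, an assumed null model of "no association".
    """
    rng = random.Random(seed)
    true_position = rng.randrange(n_panels)
    panels = []
    for i in range(n_panels):
        if i == true_position:
            # The one panel holding the real data.
            panels.append(list(zip(x, y)))
        else:
            shuffled = list(y)
            rng.shuffle(shuffled)
            panels.append(list(zip(x, shuffled)))
    return panels, true_position


if __name__ == "__main__":
    x = list(range(30))
    y = [2 * v + random.gauss(0, 5) for v in x]
    panels, answer = make_lineup(x, y, seed=1)
    # Plot each panel as a small scatter plot, show them to a viewer,
    # and only then reveal `answer`.
    print(len(panels), answer)
```

With 20 panels, a viewer who is purely guessing picks the real data one time in 20, which is what lets a correct pick be read as evidence of a signal.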