Here is how I spent last weekend:
At college reunions in beautiful Princeton on a glorious sunny day.
I also spoke about data science at a Faculty-Alumni panel titled "Science Under Attack!". Here is what I said:
In the past five to ten years, there has been an explosion of interest in using data in business decision-making. What happens when business executives learn that the data do not support their theories? It turns out that the reaction is similar to what other panelists have described - science under attack! When I bring data into the boardroom, the data are measuring something, which means the data are measuring someone; and you can bet that someone isn't too happy about being measured. My analysts face endless debates, wild goose chases, and requests to conduct one analysis after another until the managers find the story they like.
I think there are two reasons for the gap between data analysts and business managers, who are often non-technical people: (a) a communications gap and (b) the nature of statistics as a discipline.
Imagine you have to sell a product to Koreans in Korea. You don't speak a word of Korean and your counterpart does not speak English. What would you do? You'd probably hire a translator who would deliver your sales pitch in Korean. What you wouldn't do is stay in Korea for a year, teach the counterpart English, and then give your original pitch in English. But that is exactly what many data analysts are doing today. When challenged about our findings, we try to explain the minute details of how the statistical output is generated, effectively teaching managers math. And we are not succeeding. I have spent much of my career thinking about how to bridge this gap, how to convey technical knowledge to the non-technical audience.
The second reason for the gap is the peculiar nature of statistical science. What we offer are educated guesses based on a pile of assumptions. This is because statistics is a science of incomplete information. We can never produce a definitive answer because we simply do not have all the data we need. But this creates an opening for people who are pre-disposed to oppose our conclusions to nitpick our assumptions.
I also want to bring up a different threat to science: the era of Big Data that is now upon us. This is a threat from within, not from without.
The vast quantity of data is generating lots of analyses by lots of people, and most of the findings are false. A nice illustration of this is the website tylervigen.com. This guy dumped a lot of publicly available data into a database, then had the computer select random pairs of variables and compute the correlation between them. For example, one variable might be U.S. spending on science, space and technology, and the other suicides by hanging, strangulation or suffocation. You know what? Those two variables are very highly correlated, to the tune of 99.8%.
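To see why a search like this is bound to turn up striking correlations, here is a minimal Python sketch. It is not the website's actual code or data: the series are made-up random walks standing in for unrelated annual statistics, and the names `random_walk` and `strongest_pair`, the count of 200 series, and the 11-point length are all assumptions chosen for illustration.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def random_walk(rng, length):
    """A made-up 'annual statistic': running sum of independent Gaussian steps."""
    level, series = 0.0, []
    for _ in range(length):
        level += rng.gauss(0.0, 1.0)
        series.append(level)
    return series


def strongest_pair(n_series=200, length=11, seed=0):
    """Generate unrelated series; return (|r|, i, j) for the most correlated pair."""
    rng = random.Random(seed)
    data = [random_walk(rng, length) for _ in range(n_series)]
    return max(
        (abs(pearson(data[i], data[j])), i, j)
        for i, j in itertools.combinations(range(n_series), 2)
    )


if __name__ == "__main__":
    r, i, j = strongest_pair()
    print(f"series {i} and {j}: |r| = {r:.3f}")
```

With 200 series there are 19,900 pairs to compare, so even though every series is generated independently, the best-looking pair will typically show a correlation that appears far too strong to be chance.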
Another aspect of Big Data analysis deserves attention: many of these analyses do not have a correct answer. Take Google's PageRank algorithm, which is behind the famous search engine. PageRank is supposed to measure the "authority" of a webpage. The model behind the algorithm assumes that the network of hyperlinks between webpages provides all the information needed to measure authority. But no one can verify how accurate the PageRank metric is because no one can tell us the true value of authority.
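For readers who want to see what "authority from hyperlinks alone" looks like in practice, here is a minimal sketch of the simplified textbook form of the algorithm, computed by repeated redistribution of scores along links. It is not Google's production system. The damping value of 0.85, the four-page graph, and the assumption that every linked page also appears as a key in `links` are all illustrative choices.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Simplified textbook PageRank.

    links maps each page to the list of pages it links to; every target
    is assumed to also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            # A page with no outgoing links spreads its score over all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: the only input is who links to whom.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_web)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

The function returns a score for every page, but nothing in the code, or anywhere else, supplies a true "authority" value to compare those scores against.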
In the case of PageRank, we may be willing to look past our inability to scientifically validate the method because the search engine is clearly useful and successful. But I'd submit that many Big Data analyses are just as impossible to verify, and in many cases they may not be useful; in the worst cases, they may even be harmful.