I was at college reunions in beautiful Princeton on a glorious sunny day.
I also spoke about data science at a Faculty-Alumni panel titled "Science Under Attack!". Here is what I said:
In the past five to ten years, there has been an explosion of interest in using data in business decision-making. What happens when business executives learn that the data do not support their theories? It turns out that the reaction is similar to what other panelists have described: science under attack! When I bring data into the boardroom, the data are measuring something, which means the data are measuring someone; and you can bet that someone isn't too happy about being measured. My analysts encounter endless debates, wild goose chases, and requests to run one analysis after another until the managers find the story they like.
I think there are two reasons for the gap between data analysts and business managers, who are often non-technical people: (a) a communications gap, and (b) the nature of statistics as a discipline.
Imagine you have to sell a product to Koreans in Korea. You don't speak a word of Korean and your counterpart does not speak English. What would you do? You'd probably hire a translator who would deliver your sales pitch in Korean. What you wouldn't do is stay in Korea for a year, teach your counterpart English, and then give your original pitch in English. But that is exactly what many data analysts are doing today. When our findings are challenged, we try to explain the minute details of how the statistical output is generated, effectively teaching managers math. And we are not succeeding. I have spent much of my career thinking about how to bridge this gap, how to convey technical knowledge to a non-technical audience.
The second reason for the gap is the peculiar nature of statistical science. What we offer are educated guesses based on a pile of assumptions. This is because statistics is a science of incomplete information. We can never produce a definitive answer because we simply do not have all the data we need. But this creates an opening for people who are predisposed to oppose our conclusions to nitpick our assumptions.
I also want to bring up a different threat to science, one that arrives with the era of Big Data. This is a threat from within, not from without.
The vast quantity of data is generating lots of analyses by lots of people, most of which are false. A nice illustration of this is the website tylervigen.com. This guy dumped a lot of publicly available data into a database, asked the computer to select random pairs of variables, and computed the correlation between them. For example, one variable might be U.S. spending on science, space and technology, and the other, suicides by hanging, strangulation or suffocation. You know what, those two variables are correlated to the tune of 99.8%.
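Here is a minimal sketch of that recipe, using simulated series rather than his actual database: generate unrelated random walks, scan random pairs, and a spectacular correlation always turns up.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 unrelated "annual time series", each a 20-year random walk
series = rng.normal(size=(200, 20)).cumsum(axis=1)

# scan random pairs for the most impressive-looking correlation
best_r, best_pair = 0.0, None
for _ in range(10_000):
    i, j = rng.choice(200, size=2, replace=False)
    r = np.corrcoef(series[i], series[j])[0, 1]
    if abs(r) > abs(best_r):
        best_r, best_pair = r, (i, j)

print(f"pair {best_pair}: correlation {best_r:.3f}")  # routinely above 0.95
```

None of these series has anything to do with any other; the eye-popping correlation is pure selection.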
Another aspect of Big Data analysis deserves attention: many of these analyses do not have a correct answer. Take Google's PageRank algorithm, which is behind the famous search engine. PageRank is supposed to measure the "authority" of a webpage. The model behind the algorithm assumes that the network of hyperlinks between webpages provides all the information needed to measure authority. But no one can verify how accurate the PageRank metric is because no one can tell us the true value of authority.
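For readers curious about the mechanics, here is a bare-bones sketch of the idea, the textbook power-iteration version rather than Google's production system: authority is the stationary weight of a random surfer who mostly follows links and occasionally jumps to a random page.

```python
import numpy as np

def pagerank(adj, damping=0.85, iters=100):
    """adj[i][j] = 1 if page i links to page j."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    out = adj.sum(axis=1, keepdims=True)
    out[out == 0] = 1                    # sidestep divide-by-zero for dangling pages
    transition = adj / out               # row-stochastic link-following matrix
    rank = np.full(n, 1 / n)             # start with uniform authority
    for _ in range(iters):
        rank = (1 - damping) / n + damping * rank @ transition
    return rank

# three pages: 0 and 1 both link to 2; 2 links back to 0
print(pagerank([[0, 0, 1], [0, 0, 1], [1, 0, 0]]))
```

The output is a set of authority scores, but notice that nothing in the computation can tell us whether those scores match the "true" authority of the pages.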
In the case of PageRank, we may be willing to look past our inability to scientifically validate the method because the search engine is clearly useful and successful. But I'd submit that many other Big Data analyses are equally impossible to verify; in many cases, they are not useful, and in the worst cases, they may even be harmful.
I only read nutrition studies in the service of this blog but otherwise, I don't trust them or care. Nevertheless, the health beat of most media outlets is obsessed with printing the latest research on coffee or eggs or fats or alcohol or what have you.
Now, the estimable John Ioannidis has published an editorial in BMJ titled "Implausible Results in Human Nutrition Research". John previously told us about the crisis of false positives in medical research.
Oops, here are some statistics on nutrition "science":
In 52 attempts at using randomized experiments to validate findings from observational studies, the number of times the findings were replicated: 0
In the NHANES questionnaire (the basis of all those findings), two-thirds of the participants provided answers that imply an energy intake that is "incompatible with life". I haven't read this paper; seems like worthwhile reading.
There are at least 34,000 papers on PubMed with keywords "coffee OR caffeine", which means this one nutrient has been associated with almost any interesting outcome.
Almost every single nutrient imaginable has peer reviewed publications associating it with almost any outcome. A statistician should never give the advice "If at first you don't succeed,..."
Many findings are entirely implausible (and still get published in top journals)... for example, the idea that a couple of servings a day of a single nutrient will halve the burden of cancer is clearly "too good to be true," even more so for anyone who is familiar with this literature
"Big datasets just confer spurious precision status to noise"
Randomized experiments offer hope but are woefully undersized (like requiring 10 times the current sample).
Just to nail home the point, John concludes: "Definitive solutions will not come from another million observational papers or a few small randomized trials."
Last time we heard about Deflategate on this blog, Warren Sharp compiled some statistics on fumble rates, showing that the Patriots were unusually good at avoiding fumbles. (link, link) I thought the level of analysis was "above average" and remarked that statistical evidence of this type can only get you so far. The metric is indirect, and it does not speak to causation.
The official investigators have now issued their report. The New York Times has its coverage here. As one reader commented, this article, currently nearing 800 comments, has drawn more comments than most articles on more serious subject matter. The NYT article is one of the better ones out there on this subject.
Two sets of new evidence have emerged.
The first, which is getting most of the headlines and attention, is a set of text messages between two Patriots employees who discussed their deflating operation. These text messages are highly incriminating for the two involved and, for me, also incriminating for Tom Brady, the team's superstar quarterback (who refused to release his own text messages or other correspondence to the investigators). The text-message evidence shores up the causal case in a way that numbers by themselves could never accomplish.
The takeaway from the text evidence is the power of "metadata". Metadata is data about the text messages (sender, recipient, date and time of sending, length, etc.), as distinct from the content of the texts. Metadata went mainstream when the U.S. government was revealed to have been massively scooping up metadata on domestic phone calls, while denying that it collected the contents of said calls (see this coverage, for example). The investigators can use metadata to learn who else is in the circle of insiders, how often they communicate, when they communicate, and so on. Notice that these pertinent questions do not require knowing the contents of the texts themselves. (This is not to say the contents are unimportant--at the minimum, they are needed to zoom in on the relevant texts.)
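To make the point concrete, here is a toy illustration with made-up records (not the Wells report's actual evidence) of the kind of question metadata alone can answer.

```python
from collections import Counter

# (sender, recipient, timestamp) -- message contents deliberately absent
metadata = [
    ("employee_a", "employee_b", "2014-10-17 13:02"),
    ("employee_b", "employee_a", "2014-10-17 13:05"),
    ("employee_a", "employee_b", "2014-10-23 09:41"),
]

# who talks to whom, and how often, without reading a single message body
pair_counts = Counter(tuple(sorted((s, r))) for s, r, _ in metadata)
for pair, n in pair_counts.most_common():
    print(pair, n)   # contact frequency sketches out the circle of insiders
```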
But the investigators could not determine when the deflation operation started, how often it occurred, or the full scope of the operation. This likely has to do with selective disclosure of the text messages by selected parties (e.g. none from Brady).
Another takeaway is the inherent bias in surveillance data. Simply put, you only know what you can measure, and there is much that is not being measured. To get the "full scope", the investigators would need phone records, emails, and even wiretap evidence following the key players around (just kidding).
The second set of evidence is also extremely important to the story but it has received far less attention. One reason I like the NYT coverage is that the reporter gets to this evidence before talking about the text messages. For the first time, I see direct evidence of football tampering. The NFL rule requires footballs to be inflated to between 12.5 and 13.5 pounds per square inch (psi). According to the NYT report, after the Colts raised suspicion at half-time of the Patriots-Colts matchup, all of the footballs were found to be underinflated (below 12.5 psi), with a minimum value of 10.5 psi.
This is the first time I have seen a clear admission that all of the footballs were underinflated. It is much more convincing evidence that someone tampered with the footballs than any of the fumble analyses.
Further, the referee had already checked the balls' pressure before the game and, at that time, found all of the Colts-supplied footballs to be at about 13 psi, and only two of the Patriots-supplied footballs to be under-inflated.
Once tampering is established, the investigators can move on to finding the cause. Here, they are helped by videotapes from surveillance cameras, and also the texts.
One nitpick about the sentence: 'The report uses the nebulous phrase “more probable than not” several times in making its conclusions.' To a statistician, this is a very precise statement, not nebulous at all! I interpret the investigators to mean there is more than 50% chance. That is the standard of "preponderance of evidence."
FiveThirtyEight has a lengthy discussion of the report. They helpfully showed a screenshot of the measured ball pressures:
For those who have found it tough to keep up with Andrew Gelman's prolificacy, here are some brief summaries of several recent posts:
On people obsessed with proving the statistical significance of tiny effects: "they are trying to use a bathroom scale to weigh a feather—and the feather is resting loosely in the pouch of a kangaroo that is vigorously jumping up and down." (link)
[I left a comment. In Big Data, we have thousands, no, millions, of kangaroos jumping out of sync, but still one feather.]
On people testing a zillion things hoping to land on the one that "works": "I suggest you should fit a hierarchical model including all comparisons and then there will be no need for such a corrections." (link)
[This is something Andrew has been advocating for a while. The idea is that such models have, in some sense, a built-in correction for the multiple comparisons problem. Unfortunately, some researchers are wrongly interpreting Gelman. I recently read a report that cites Gelman's paper as evidence that "multiple comparisons" is not a real problem, and then proceeds to fit dozens of regressions without any mechanism to control for multiple comparisons!]
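For the curious, here is a toy numerical illustration of the partial-pooling idea (made-up estimates and variances, not from any real study): every group estimate is shrunk toward the grand mean, and the noisiest extremes are pulled in hardest, which is exactly what tempers the multiple-comparisons problem.

```python
import numpy as np

group_means = np.array([0.8, -0.5, 0.1, 1.9, -1.2])  # raw effect estimates
n_per_group = np.array([10, 12, 50, 5, 8])
sigma2 = 1.0    # assumed within-group variance
tau2 = 0.25    # assumed between-group variance

grand = np.average(group_means, weights=n_per_group)
shrinkage = tau2 / (tau2 + sigma2 / n_per_group)      # between 0 and 1
pooled = grand + shrinkage * (group_means - grand)    # partially pooled estimates

print(np.round(pooled, 2))   # the flashy 1.9 estimate (n=5) is reined in the most
```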
On when to throw out all your data, the lot of it: "Sure, he could do all this without ever seeing data at all—indeed, the data are, in reality, so noisy as to have no bearing on his theorizing—but the theories could still be valuable." (link)
I discovered Hans Rosling's Gapminder work when I first started Junk Charts almost ten years ago, with this series of posts. So I was very excited to meet Hans yesterday at the Data, Children and Post-2015 Agenda Event hosted by the UNICEF Data and Analytics Section. He gave a marvellous talk. I came away touched in equal parts by his humanity, his animated passion for his subject, and his insatiable desire to communicate.
Before getting to Hans, the event's host also made an impression. The UN has put a lot of effort into the Open Data movement. They revamped the website that hosts data from MICS (Multiple Indicator Cluster Surveys), which can be a good source of data for classes and projects. An older resource called DevInfo also appears to be very useful for data about the plight of children (link). The home page for UNICEF data is here.
Hans is a straight talker. Here are a few zingers from his talk.
It's a personality disorder for someone to be interested only in the data. Data is not enough.
He came back to this point at the end of the talk, pointing out that great work comes from people who understand the statistical reasoning and how the data is collected.
We don't need Big Data. We need Basic Data.
Here is an example of Basic Data, presented in the simplest possible way:
You have all the granular data and yet the majority of people continue to harbor myths about world social statistics. The above chart, for example, makes the point that in the last thirty years, Asia (which holds more than half of the world's population) has dramatically reduced fertility rates to reach the same level as the Americas. And yet, when Rosling quizzes his audience about world population growth, 80 to 90 percent still hold the impression that global population will continue to grow at either a linear or sub-linear rate.
Throughout the presentation, I noticed a further cleansing of his visual palette. This leads to another provocation:
The passion of the people plus Excel is all you need. You don't need fancy software.
He was talking about the Ebola crisis in Liberia, where he worked with locals to help measure and staunch the emerging epidemic. Many Western news outlets did not do enough homework and reported vastly inflated numbers during the course of the epidemic. As of yesterday, there are no known cases in Liberia. Hurray!
Saving the best for last. My favorite quote of the evening:
Big Data is a big bag of numerators without denominators.
This gets at the heart of the first C in OCCAM datasets: we are in desperate need of controls.
Meanwhile, Gelman found an elegant way to describe the mentality of statisticians:
I talk about Big Data, statistics education, and business analytics in the second part of my interview with KDnuggets, which has just come online. See here.
I argue that introductory statistics should be taught as a liberal arts course. Reflecting on Rosling's disappointment that the majority of highly-educated people are so ignorant of basic world facts, I also wonder whether the education sector will find a way to teach students these facts. Thinking back to my own college days, the introductory courses in statistics, economics, psychology, etc., were great at training me how to think theoretically but none bothered to connect the theories with any real-world statistics! Here is a past post about the dearth of a Census 101 class.
One of the questions I pose most frequently to my team members is: Do you think or do you know? In the spirit of stacking this post with quotes, I offer:
Thinking comes before knowing but knowing doesn't come from thinking.
The first part of the KDnuggets interview is here.
Ben Alamar reflects on the rise of data analytics in the NBA (link).
I like this passage very much, which really nails home the point that good analytics requires intuition:
The hours of waiting [during draft meetings] were often filled with watching film of prospects. It helped me refine my analysis, as I soaked up details from scouts that I never would have seen on my own. ("Rewind that. ... Did you see his foot placement there, getting ready for the rebound? That's NBA ready.") During one of these sessions, we were watching film of Syracuse point guard Jonny Flynn. I mentioned that, based on the rate at which he collected steals, he was likely a good defender. But one of the scouts explained that Flynn's steal total was likely higher than other point guards' because Syracuse played mostly zone defense, which allowed guards to attack the ball more. I checked that insight against the data and it seemed true, so I adjusted my defensive statistics to account for the dominant style of defense used by a player's team.
I'm glad to hear that the style of play is included in the models. I cringe every time I hear a (usually English) football (i.e. soccer) commentator claiming that a team "deserves" to be in the lead because it is dominating the time of possession when in fact, the other team is using a counter-attack strategy. When the other team ekes out a 1-0 victory on a sneak attack, the commentator loses his wit.
Alamar sees the next big challenge in NBA analytics as deriving value from the SportVU data. What is SportVU? Alamar tells us they installed cameras everywhere that "capture the coordinates of 10 players plus the ball 25 times every second." This is the typical "Big Data" scenario--data is collected without any design or any research question in mind. It raises a few intriguing questions:
The granularity of such data (here, 25 times a second, that is to say, four hundredths of a second apart) can be arbitrarily small. When have we reached the point of picking up just background noise? (A small simulation after this list illustrates the worry.)
The very act of relating such data as "predictors" to outcomes such as scoring statistics presupposes a model in which the precise movements of the players or the ball are correlated with those outcomes. Whether we like it or not, any resulting analysis will take on a causal interpretation--this is what separates trivia from an actionable insight. Is this type of predictor the most relevant to explaining outcomes? If we are not careful, we may come to believe this story just because it's the one we started with.
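On the first question, here is a sketch of the noise worry using simulated coordinates (not actual SportVU output): differencing positions to estimate speed amplifies small measurement errors, and the finer the time step, the worse it gets.

```python
import numpy as np

rng = np.random.default_rng(1)
dt = 1 / 25                                  # 25 samples per second
t = np.arange(0, 10, dt)
true_x = 3.0 * t                             # a player moving at a steady 3 m/s
measured_x = true_x + rng.normal(0, 0.1, t.size)   # assume ~10 cm of camera jitter

speed = np.diff(measured_x) / dt             # naive frame-to-frame speed estimate
print(f"true 3.0 m/s; estimated mean {speed.mean():.2f}, sd {speed.std():.2f}")
# the sd is enormous: 10 cm of jitter over 0.04 s masquerades as ~3.5 m/s of speed
```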
Last week, I pointed out the futility of using data as proof or disproof in Deflate-gate. Emphatically, a case of "N=All" does not make things better. I later edited the post for HBR (link).
In this post, I want to address a couple of more subtle technical issues related to the Sharp analysis, which can be summarized as follows:
1. New England is an outlier in the plays per fumbles lost metric, performing far better than any other team (1.8x above league average).
2. Different ways of visualizing and re-stating the metric yield the same conclusion that New England is the outlier.
3. There is a dome effect of about 10 plays per total fumbles, meaning that teams who play indoors ("dome") typically suffer 10 fewer fumbles than teams who play outdoors ("non-dome"). New England is an outdoor team that performs better than most dome teams on the plays per total fumble metric. If dome teams are removed from the analysis, New England is an outlier.
4. Assuming that the distribution of the metric by team is a bell curve, the chance that New England could have achieved such an extraordinary level of plays per fumbles lost is extremely remote.
5. Therefore, it is "nearly impossible" for any team to have the New England type ability to prevent fumbles... unless the team is cheating.
Focus on Point 4 for the moment. This is a standard technique used by statisticians, and the basis of any analysis of "statistical significance". In statistical significance testing, we appeal to the normal distribution (bell curve) to estimate how close the observed sample is to the "average sample". The big question being addressed is: IS THIS AVERAGE?
Let's say we want to measure the effect of genetic modification on the size of fish. If the Franken-fish sample is far from the average of natural fish samples, we conclude that Franken-fish are statistically different from (larger than) natural fish. A crucial requirement of this analysis is that the samples are randomly drawn.
But for Deflate-gate, the big question is: IS THIS EXTREME? The statistical significance tool is not designed to answer this question. The analysis tells us that the Patriots do not look like the average random sample from the NFL. Saying that something is not average is far from saying that it is an outlier! Indeed, statistical significance testing is frequently (and controversially) used to detect "small effects".
If the Patriots sample were randomly drawn from the NFL, then Point 4 would have provided evidence of an extreme value, but there is no random selection here. This takes us back to the point of my first post: the Patriots could belong to a group of elite NFL teams that have more "skill" at preventing fumbles, or there could be many other possibilities.
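To put rough numbers on the selection issue, here is a back-of-envelope calculation with a stylized z-score (not Sharp's actual figures): a result that looks damning for one pre-selected team is far less surprising once we remember that the most extreme of 32 teams was singled out after the fact.

```python
from scipy.stats import norm

z = 2.8                                     # hypothetical: a team 2.8 sd above average
p_one_team = norm.sf(z)                     # chance ONE pre-specified team is this extreme
p_best_of_32 = 1 - (1 - p_one_team) ** 32   # chance the league's most extreme team is

print(f"one pre-specified team: {p_one_team:.4f}")    # about 0.003
print(f"most extreme of 32:     {p_best_of_32:.4f}")  # about 0.08, roughly 30x larger
```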
The other point of interest is that Points 1-4 say essentially the same thing: that the Patriots are far different from the rest of the NFL on the plays per fumbles metric. Point 2 is the visual equivalent of the mathematics of Point 4.
Point 3 sounds different but it really isn't. Points 2 and 4 say the Patriots don't fumble much. But dome teams fumble less because they play indoors; their presence in the analysis makes the advantage of the Patriots (a non-dome team) less pronounced. Thus, in constructing Point 3, Sharp removed dome teams. It's the same data, viewed through a different lens.
Restating the same statistic over and over does not make an argument. I'm not saying Sharp should not have performed these steps; I'd have done many of these analyses myself. But they play the role of quality control. The reiterations don't strengthen the argument, and they sound a bit like Sunday morning talk shows.
During my vacation, I had a chance to visit Trifacta, the data-wrangling startup I blogged about last year (link). Wei Zheng, Tye Rattenbury, and Will Davis hosted me, and showed some of the new stuff they are working on. Trifacta is tackling a major Big Data problem, and I remain excited about the direction they are heading.
From the beginning, I have been attracted by Trifacta’s user interface. The user in effect assembles the data-cleaning code through visual exploration and through suggestions based on past behavior.
Here are some improvements they have made since I last wrote about the tool:
Handling numeric data - Trifacta now generates some advanced statistics, e.g. percentiles, about the columns in the Visual Profiler, whereas in the past, every column was summarized as a histogram. I believe there is also some binning functionality.
Moving beyond Top N - I ranted about Top N thinking in the past (link), and I wasn’t happy that the Trifacta demo seemed to encourage this bad practice. I’m happy that the team heard the complaint and now offers a Random N selection. Eventually, I think Random N should be the default; I don’t know why anyone would want to see Top N.
Interactive workflow - Random N is a big step forward but in the world of data cleaning, it’s not sufficient. The reason is that many data quality problems are rare cases that don’t show up in a random sample. To deal with this, Trifacta has created an interactive workflow. Through the visual exploration paradigm, the software prepares a set of cleaning code; when the user applies the code to the entire dataset, the tool automatically checks for further anomalies and reports them to the user. For instance, there may be a handful of email addresses with unusual structures that do not appear in the random sample, and thus fall outside all of the data-wrangling rules. These are flagged for further treatment.
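As a rough analogy (my own sketch, not Trifacta’s code), the workflow looks something like this: derive a rule from a random sample, apply it to the full data, and surface the rows the rule never anticipated.

```python
import random
import re

emails = ["ann@x.com", "bob@y.org", "cho@z.net"] * 1000 + ["weird@@z..net"]

# build the cleaning rule from a random sample of the clean-looking majority;
# like most samples, this one misses the one-in-three-thousand oddity
sample = random.sample(emails[:3000], 100)
rule = re.compile(r"^[\w.]+@\w+\.[a-z]+$")   # fits everything in the sample

# the full pass is where the rare anomaly finally surfaces
flagged = [e for e in emails if not rule.match(e)]
print(flagged)   # ['weird@@z..net'] -- reported back to the user for treatment
```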
Column metadata - Another exciting development is the expanded use of metadata associated with columns. Such metadata is a major difference between an Excel spreadsheet and any sophisticated data table. For instance, the user can now associate labels with values within a column.
New file formats - Trifacta handles many new data formats, such as JSON. It can, for example, accept a JSON file and parse the nested structure into columns. A very nice addition!
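Outside Trifacta, a rough analogue of that JSON-to-columns step is pandas’s json_normalize (a toy record below; Trifacta’s own parsing is interactive and more capable):

```python
import pandas as pd

records = [
    {"id": 1, "user": {"name": "ann", "geo": {"city": "NYC"}}},
    {"id": 2, "user": {"name": "bob", "geo": {"city": "SF"}}},
]

df = pd.json_normalize(records)   # nested objects become dot-named columns
print(df.columns.tolist())        # ['id', 'user.name', 'user.geo.city']
```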
I think Trifacta can gain ground by pushing the envelope on two fronts: more and better visual cues to help users diagnose data-quality problems; and more sophisticated recipes for how to handle such problems, informed by a knowledge base of past user behavior.