Last year, Gizmodo capitalized on the fallout from the Ashley Madison hacking scandal and published a sensational article claiming that the website, which, if you haven't heard, promotes adultery, has "almost no" real women on it. The subtext is that millions of gullible, disloyal men were paying monthly fees to the website to do nothing or, cue the laugh track, to converse with "badly-designed robots." These men, according to Gizmodo, were buying a "fantasy," and "almost no" hookups were ever consummated.
That conclusion was ridiculous on its face. It assumes that men have no common sense: not one man, but over 30 million men with zero common sense. (Ashley Madison has been in business for over a decade.)
That didn't, however, stop the journalist from all kinds of emoting, such as:
the more I examined those 5.5 million female profiles, the more obvious it became that none of them had ever talked to men on the site, or even used the site at all after creating a profile [italics from the original]
In case that isn't extreme enough, she elaborated:
Actually, scratch that. As I’ll explain below, there’s a good chance that about 12,000 of the profiles out of millions belonged to actual, real women who were active users of Ashley Madison.
In casual conversations, I keep hearing this story, even though it was debunked within a week of its publication. As is typical of the media today, the debunking got a fraction of the press lavished on the original, dreadful piece of data journalism. Most of the outlets that helped spread the initial nonsense never bothered to print the retraction.
What the journalist faced was reality. As soon as the piece was published, a number of readers, both male and female, commented on their personal experiences with the website. There were couples who found love and eventually got married. There were female users who refuted the conclusion that Ashley Madison was "a science fictional future where every woman on Earth is dead." In addition, people with inside knowledge pointed out how the data had been completely misinterpreted.
For those interested in "numbersense" in data analysis, it is very instructive to read both the original article and the retraction, which is thinly disguised as a further juicy finding. How can a data analyst avoid falling into the traps that lead to utterly invalid results?
A lot of "numbersense" has to do with how you process the information you have, and the information you don't. Of the data that you have, what do you believe and what you don't. Of the data you don't have, what assumptions you make.
The journalist believed the hackers when they boasted that the data dump, 20 compressed gigabytes and all, contained all of Ashley Madison's customer data. This turned out to be wrong. Belatedly, it was shown that the data infrastructure contains at least 550 tables, and the journalist analyzed just four of them! Tellingly, this additional knowledge did not stop her from issuing even more "insights" in the second article of the series.
Further, people with inside knowledge gave the reporter the hint that the information she needed was hiding in plain sight. There was a column in the data table called "ishost" (i.e., "is this user a host?"), and a "host" is internal jargon for a "chat bot" (incidentally, chat bots have been making the news by way of Facebook and Microsoft). According to the ishost column, there were only 70,000 or so bot accounts, far fewer than the millions of fake accounts implied by the journalist!
In the original article, the journalist cheerfully related her process of discovering that no real women used the website. The high point was: "three data fields changed everything." These fields supposedly measured the frequency of specific actions on the site, such as sending emails to other users. It turned out that the columns did not measure any human activity at all. They recorded bot activity, thus invalidating her entire analysis.
For example, "mail_last_time" did not mean "a timestamp indicating the last time a member checked the messages in their Ashley Madison inbox," as asserted in the original article. In fact, insiders told the journalist it indicated the last time a bot sent an email to an Ashley Madison member.
This is amateur hour: inferring the content of a data column from the name of the column. One can never guess the intention of the developer who named the column, let alone know whether the column's contents have drifted from its initial definition over time.
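To make this concrete, here is a minimal sketch, in pandas, of the kind of sanity check that was skipped. The file name is made up; the column names ishost and mail_last_time come from the story above, and their meanings would still need to be confirmed with people who know the system.

```python
import pandas as pd

# Hypothetical file name; the column names follow the leaked schema as
# described above, but what they actually measure must be verified.
members = pd.read_csv("am_members.csv")

# How many accounts are flagged as bots ("hosts")?
print(members["ishost"].value_counts(dropna=False))

# What does the supposed "activity" field actually contain: real timestamps,
# empty values, or a suspicious default filled in at account creation?
print(members["mail_last_time"].describe())
print(members["mail_last_time"].isna().mean())

# Is the "activity" concentrated in the bot accounts rather than the humans?
print(members.groupby("ishost")["mail_last_time"].count())
```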
And this little situation illustrates perfectly why analysts of Big Data owe it to consumers to be extra careful. Much of Big Data is observational, which means the origin of the data is obscure, or obscured by organizational layering, or washed away by time. The current practice in database development scoffs at data dictionaries, data flow diagrams, or any kind of documentation. The spirit of "agile" development devalues stability in the data environment. So it has become more arduous than ever to understand one's data.
The essence of numbersense is captured here. Are you the analyst who looks at "mail_last_time" and convinces yourself that it measures human activity and thus proves that no female humans exist on the website? Or are you the analyst who ceaselessly asks questions to get to the bottom of what that column measures?
Andrew Gelman and I have published a piece in Slate discussing the failure to replicate scientific findings, using the recent example of the so-called power pose. The claim is that people who strike a "power pose" before walking into a business meeting experience psychological and hormonal changes, and that these changes make them more powerful.
As you often read here and at Gelman's blog, the fact that someone got a paper published in a scientific journal, based on a statistically significant result, doesn't automatically make it a believable result. Here, a different group of scientists tried to replicate the finding, with a sample size five times larger, and their replication did not come close to being statistically significant.
The original researchers wrote a response detailing differences between the two studies, which misses the point. While there are differences, as there would be in any replication attempt, the key issue is whether readers should believe a study result that is so fragile. For example, the year in which the study was conducted is described as a difference, as is the proportion of females in the study (62% in the original study versus 49% in the replication). There are also differences of execution, such as how many minutes the pose was held and what type of regression was used in the analysis. Even if these differences explain the inability to replicate the original finding, they would imply that the conclusion depends on those very conditions, which does not engender trust in its generalizability.
This situation is not unique to the "power pose" study. Over the years, Andrew and I have discussed many other studies with similar problems. This is one of the few for which a replication has been attempted.
One of the points made in the Slate article is important to reiterate:
Through the mechanism called p-hacking or the garden of forking paths, any specific reported claim typically represents only one of many analyses that could have been performed on a dataset. A replication is cleaner: When an outside team is focusing on a particular comparison known ahead of time, there is less wiggle room, and results can be more clearly interpreted at face value.
This is a subtle point often missed by non-statisticians. There is a huge difference between a replication study for which researchers know in advance what is being analyzed, and a typical scientific study for which researchers may have measured an array of metrics, and then selectively reported ones that are "statistically significant."
Here is an analogy that may help in understanding it:
You wanted to buy a woolen sweater during the after-Christmas sales at Macy's Times Square. At the store, you discovered that woolen sweaters were hard to come by but cashmere scarves were the deal of the decade; in addition, Macy's was running a buy-one, get-two-free promotion on dress pants. So when you checked out, you purchased one cashmere scarf and three pairs of pants (none of which you had intended to buy). We can now ask whether the shopping trip was successful.
Imagine you have a tradition of going to Macy's every year to get a new sweater. If your metric of success is whether you purchased a sweater, your success rate would not be high. However, if your metric is whether you purchased something, your success rate would be much higher. A replication study has a fixed metric, fixed by the prior study: it is like measuring success based on whether you purchased a sweater.
It is much easier to prove that one of many things could happen than to prove that one specific thing would happen. The trouble is that in the reporting of scientific findings, one of many things is typically presented as one specific thing. This means replication is important: until we know that the one specific thing can be reliably replicated, we really don't have solid science.
Here is the link to our Slate article.
PS. This phenomenon is always hard to get across to students. I am not totally satisfied with this analogy. If you know of different ways to explain this, let me know.
PPS. This phenomenon is especially tricky with "big data" style studies. For example, many people run "A/B tests" in which they simultaneously track hundreds if not thousands of metrics, and then selectively report the differences that are statistically significant.
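To see how easily this goes wrong, here is a small simulation of my own (all parameters invented): an A/B test with no true effect on any of 100 metrics still produces at least one "statistically significant" difference almost every time.

```python
import numpy as np

rng = np.random.default_rng(0)

def sham_ab_test(n_metrics=100, n_per_arm=1000, n_sims=500):
    """Simulate A/B tests where the two arms are identical on every metric,
    and count how often at least one metric clears the 5% significance bar."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(size=(n_metrics, n_per_arm))
        b = rng.normal(size=(n_metrics, n_per_arm))
        diff = a.mean(axis=1) - b.mean(axis=1)
        se = np.sqrt(a.var(axis=1, ddof=1) / n_per_arm +
                     b.var(axis=1, ddof=1) / n_per_arm)
        if (np.abs(diff / se) > 1.96).any():
            hits += 1
    return hits / n_sims

print(sham_ab_test())  # close to 1.0: some metric almost always "wins"
```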
The news is out that Uber got fined by the New York Attorney General's office for data breaches and privacy concerns. The headline writer for ZDNet nailed this one: "Uber fined peanuts in God View surveillance" (link). And the sub-lead has the kicker: "For a company with a valuation of over $50 billion, a $20,000 fine over user data protection is laughable."
This settlement tells us that one of the following (or both) is not serious: the NYAG's attitude toward protecting consumer data privacy, or the valuation of the so-called unicorn.
But let's review the events that got Uber in trouble, and what they say about the state of ethics in data science.
As any Uber user knows, the service runs on an app, which means that the company has data tracking every Uber ride you took. Many users are fine with having such data collected and compiled by Uber, probably with the understanding that such data would not be used for purposes contrary to the consumer interest. Because the app runs on the smartphone, and payment is via credit cards, Uber certainly knows your identity, and also any additional information you provide on your profile. (Such data don't need to be merged with your travel logs but they almost surely will be merged by data scientists.)
At Uber offices, they are apparently very proud of a data visualization tool called God View. This is an aerial view of where all the Uber cars are, and where the Uber riders are.
There is a legitimate use of such a tool for managing the supply and demand of Uber cars, and for routing of cars.
Then the story gets muddier. Ethical issues often arise not because someone deliberately did something bad, but because something built for a legitimate purpose is used, perhaps by other people, for more dubious purposes.
In November 2014, a Buzzfeed News reporter hired an Uber car to go meet with the GM of Uber NYC, Josh Mohrer. Upon her arrival, Mohrer held out his iPhone and remarked: "There you are, I was tracking you."
The underlying data processing technology is the same but two important distinctions must be made between God View and Mohrer's view.
First, for operations management, there is absolutely no need to use or even record any customer information--it doesn't matter whether it is John or Jane who is waiting for an Uber car at the corner of 14th and Madison; it only matters that John or Jane is one of 15 people all looking for rides within a two-block radius. But when Mohrer tracked the reporter, he was looking for a specific person, and this incident reveals that the travel log data are not anonymized.
Second, it is one thing for Uber to use the data internally for legitimate business reasons, such as managing supply and demand; it is a different thing to access a customer's travel log, especially without first asking for that person's permission. The fact that a third party is able to do this without permission is quite disturbing. There are not many consumer-friendly use cases I can think of that require looking up someone's past Uber rides.
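For what it's worth, here is a hypothetical sketch of what a supply-and-demand view needs, and does not need, to know. The file and column names are invented for illustration; the point is that identities can be dropped before the data ever reach a dashboard.

```python
import pandas as pd

# Invented column names: rider_id, lat, lng. The dashboard only needs counts
# per map cell, so identifying fields can be dropped at the source.
requests = pd.read_csv("open_ride_requests.csv")

requests["cell"] = (requests["lat"].round(3).astype(str) + "," +
                    requests["lng"].round(3).astype(str))
demand = requests.groupby("cell").size().rename("waiting_riders")

print(demand.sort_values(ascending=False).head())
# "15 people looking for rides near 14th and Madison" is all the operations
# team needs; no rider_id, no travel history.
```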
Even before the Buzzfeed controversy, Peter Sims, a venture capitalist, had already filed a similar complaint. Here's what he had to put up with: he was in an Uber car in New York City when an industry acquaintance, whom he "barely know," texted him from Chicago and proceeded to trace his whereabouts during the trip.
It turned out that Uber was holding a gathering in Chicago in which someone was using Sims to demonstrate the power of Uber's data.
Remarkably, the informant showed no understanding of the privacy intrusion. She said the Uber Chicago event was "cool" and that Sims should be honored to have been selected for the demo.
Just think for a second: in his line of business, Sims could have been going to a meeting to strike an important business deal, one which, for both financial and legal reasons, has to remain private until it is announced publicly.
One of the most concerning aspects of massive data collection is that once we allow data to be collected and stored in the "cloud" or on corporate servers, we lose control of our own data. The idea that corporations are benevolent does not, and will not, stand the test of time.
And now I come to the juiciest part of this Uber story, what really got the NYAG involved. A Buzzfeed reporter was invited to a meeting between Uber SVP Emil Michael, and "an influential New York crowd."
During this meeting, Michael suggested that Uber would spend "a million dollars" to hire four top "opposition researchers" and four journalists in order to "help Uber fight against the press."
Specifically, Michael was unhappy with Sarah Lacy, a website editor critical of Uber. Michael claimed that Uber's team "could prove a particular and very specific claim about her personal life."
People at the meeting suggested that such a move might present publicity problems, to which Michael said, "Nobody would know it was us."
After the Buzzfeed report, Michael and Uber's PR team said that he had made a mistake, and that the company "does not do oppo research of any sort on journalists" and "has never considered doing it."
Regardless of whether they have done it, it is clear that the data are available to dirt-diggers.
Uber also protested that the meeting was a "private dinner" and remarks were supposed to be "off the record." I am not sure how to interpret this. Are they saying it would be okay to carry out that plan so long as no one knows about it?
Look, these are not simple issues, and Uber is not an outlier. The answers to these questions go to the values of our society, and involve complicated trade-offs between conflicting goals. In particular, there are many things that corporations say they will not do which are well within their capability of doing. Unfortunately, in this case, the NYAG is setting a precedent that even when a company is found to have done something improper, it gets a light slap on the wrist.
In this week's Statbusters, my column with Andrew Gelman in the Daily Beast, we take note of Slate's recent rant about "wasteful" anti-smoking advertising, and demonstrate how to think about cost-benefit analysis. The key point is: if you are going to make an extreme claim, you better have some numbers to back it up.
These numbers can be approximate, and based on (potentially dubious) Googled data. Not every analysis needs to be super precise.
The column is here.
My co-columnist Andrew Gelman has been doing some fantastic work, digging behind the trendy news story claiming that middle-aged, non-Hispanic, white male Americans are dying at an abnormal rate. See, for example, this New York Times article that not only reports the statistical pattern but also, in its headline, asserts that those additional deaths were due to suicide and substance abuse.
It all began with the chart shown on the right. It appears that something dramatic happened in the late 1990s when the USW (red) line started to diverge from those of all the other countries. The USW line started to creep upwards, meaning that the death rate is increasing for US white non-Hispanic males aged 45-54. (The bolded blue line is for US white Hispanic males aged 45-54 and does not look different from those of other countries.)
Prompted by a lively discussion in the comments section, Andrew pursued a deeper analysis of these data. This has led to a series of posts in which he refined the analysis (see here, here, here and here). I recommend reading the entire series, as it paints a full picture of how statistical thinking works. In the rest of this post, I will present a cleaned-up summary of his argument while leaving out details.
We first note that the veracity of the data is not at issue. We accept as a starting point that the trends shown above are true; this can easily be verified using public data. The debate is about why.
People who analyze age-group data are particularly sensitive to bias due to discretization. The original analysis, co-authored by Angus Deaton, the recent winner of the economics Nobel prize, focuses on the age group 45 to 54. If you compute the average age within this age group over time, you may be shocked to find that it is not flat; the average age of people aged 45 to 54 has been increasing over time. As the following chart shows, since 1990 or so, the average age in this age group has moved up by about half a year. (Data from CDC Wonder.)
Because older people die at a higher rate, the death rate within age group 45 to 54 will increase just because of the increasing average age of this age group--without having to resort to other reasons such as suicides.
Note also that the Baby Boom in the U.S. caused large fluctuations in the age distribution over time. This observation provides nice color on why the average age within the group has been increasing, but it is not required for the argument; the general aging of the population is another cause.
What is crucial to the reasoning is the steepness of the increase in death rate with increasing age. Surprisingly, it is not easy to find a chart plotting death rates by age. Wikipedia has the graph shown on the right. This is not empirical data but the Gompertz-Makeham law (link), which is described as accurate for the 30-to-80-year-old range. The key insight is that the mortality rate increases exponentially after age 30.
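For reference, the Gompertz-Makeham law writes the mortality hazard at age x as an exponential age-dependent term plus a constant background term:

\[ h(x) = \alpha e^{\beta x} + \lambda \]

The exponential term dominates after about age 30, which is what makes the curve so steep through middle age.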
Having a theory is not enough. In his first post, Andrew tested this theory by pulling a few numbers and working out a back-of-the-envelope calculation. The goal is to estimate the magnitude of this average-age effect. How much of the observed anomalous trend does it explain? Do we need any other reasons?
Andrew estimated that the average age in the 45-54 age group moved up by 0.6 years between 1989 and 2013, the period covered by the original study. From life tables, he found that mortality worsens by about 8 percent per extra year of age. Thus, over the research period, the increase in average age contributes roughly 0.6 x 8, or about 5 percent, to the group's death rate.
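Here is that back-of-the-envelope arithmetic written out (the 8 percent and 0.6 years are rough inputs, not precise estimates):

```python
# Back-of-the-envelope: how much does a 0.6-year rise in average age inflate
# the raw death rate if mortality worsens ~8% per extra year of age?
mortality_growth_per_year = 0.08   # rough figure from life tables
age_shift_years = 0.6              # rise in average age within the 45-54 group

linear_approx = age_shift_years * mortality_growth_per_year          # ~4.8%
compounded = (1 + mortality_growth_per_year) ** age_shift_years - 1  # ~4.7%

print(f"{linear_approx:.1%}, {compounded:.1%}")
```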
This level of increase explains most of the trend shown by that red line in the original chart. Thus, Andrew concludes that, after adjusting for age, the data show that the mortality rate among middle-aged, non-Hispanic, white male Americans has been essentially flat.
The original findings that this group behaves differently from its counterparts in other countries, and from the US Hispanic male population, are still interesting.
A number of techniques can be used to control for the shift in the underlying age distribution. Disaggregating the data is one method. The CDC releases data at the single-age level, and analyzing the data one year of age at a time was the next step that Andrew undertook.
One result of this finer analysis is that in the years 1999-2013 (i.e. after dropping the first 10 years of the first chart), even after adjusting for age, there is still about a 4 percent increase in mortality rate among the U.S. middle-aged white non-Hispanics, roughly half of the trend shown in the original chart. In other words, in the shortened time frame, age adjustment explains half of the trend, not all of it.
This has led Deaton, one of the original authors, to say that "the overall increase in mortality is not due to failure to age adjust."
This statement is a bit too loose for my liking. First, "is not due to" implies that age aggregation has zero effect, when in fact it explains half of the trend. Second, one should always age-adjust if the underlying age distribution is changing. Even if the age adjustment did not explain anything at all, I'd argue one should still age-adjust; doing so would help eliminate age aggregation as a potential reason for the observed trend.
One argument against age adjustment is that it involves a lot of work - finding the right data, processing the data, merging the data, etc. But unless one does this work, one can't know how strong the aggregation effect is. And if you have done the homework, why not show it?
Disaggregating all the data is annoying because you end up with one chart per single year of age. The next method for age adjustment is "standardization." This requires creating a reference age distribution, which is then applied to all years. In effect, we artificially hold the age distribution constant so that age can no longer explain any of the effect.
This is what Andrew's age standardized rates look like:
For the age-adjusted line (in black), what he did was to "weight each year of age equally." The gap between the two lines shows that the effect of the shifting ages within this age group has been growing over time.
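Here is a minimal sketch of that standardization, assuming a table with one row per calendar year and single year of age (the column names are my own, not the CDC's):

```python
import pandas as pd

def age_standardized_rate(df, ages=range(45, 55)):
    """df has columns: year, age, deaths, population (names assumed).
    Weight each single year of age equally, per the adjustment above."""
    sub = df[df["age"].isin(ages)].copy()
    sub["rate"] = sub["deaths"] / sub["population"]
    # Equal weights freeze the within-group age distribution, so a drift
    # toward older ages can no longer move the trend line.
    return sub.groupby("year")["rate"].mean()
```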
Then, something really interesting happens when Andrew split the black line by gender:
So it turns out that middle-aged U.S. white non-Hispanic men are not where the story is. The age-adjusted mortality rate for the corresponding women has steadily climbed between 1999 and 2013!
Next, Andrew looked at the other age groups and found an even more pronounced trend affecting U.S. non-Hispanic whites in the 35-44 age group.
He also looked at Hispanic whites and African Americans, whose charts I won't repeat here. Even after age adjustment, those groups show trends that are more in line with the rest of the world.
Finally, for those wondering how this is relevant to, say, the business world, let me connect the dots for you.
Imagine that you run a startup that sells an annual subscription. One of your key metrics is the churn rate, defined as the number of subscribers who quit during period t divided by the number of paying subscribers at the start of period t. So a monthly churn rate of 5% means that five percent of the paying subscribers quit the service during that month.
There are two reasons to age-adjust this churn rate. First, the churn rate is not constant over a customer's tenure. In particular, almost no one churns during the first 12 months. Second, the startup is growing very rapidly. This means that a lot of new customers are being acquired, and each new customer has up to 12 months during which the churn rate is close to zero.
What happens is that the churn rate will fluctuate with the monthly growth rate of the subscription service. As the growth rate fluctuates, the average tenure of the user base fluctuates. The more new customers in their first year, the lower the churn rate.
If the churn rate is not age-adjusted, you don't know if customers are increasingly more dissatisfied with your service, or if you just have slower growth which leads to increasing average tenure!
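A minimal sketch of the same adjustment for churn, assuming monthly data broken out by tenure bucket (the column names and bucketing are invented for illustration):

```python
import pandas as pd

def tenure_adjusted_churn(df, ref_weights):
    """df has one row per (month, tenure_bucket) with 'subscribers' and
    'churners'; ref_weights is a fixed reference mix of tenure buckets
    (a dict summing to 1). All names are hypothetical."""
    rates = df.assign(rate=df["churners"] / df["subscribers"])
    by_tenure = rates.pivot(index="month", columns="tenure_bucket", values="rate")
    # Weight each bucket's churn by the fixed reference mix, so the headline
    # rate no longer drops just because a flood of new, low-churn customers
    # arrived this month.
    return by_tenure.mul(pd.Series(ref_weights), axis=1).sum(axis=1)
```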
It appears that Google Flu Trends (GFT) has slipped quietly into the night. In a short post to the Google Research blog, the team behind GFT announced that they are "no longer publishing" flu estimates, effectively ending the seven-year-old experiment. The GFT home page now links to some historical datasets.
The post was dated August 15, and it appears that mainstream media completely missed it.
GFT was one of the canonical examples of "Big Data" at work. You have unimaginably massive amounts of search data that are accumulating in real time. The GFT project was a bold attempt to turn this data into a public good. However, the experiment encountered a lot of technical problems. As documented here, GFT suffered from crippling inaccuracies which necessitated several overhauls in its brief life.
Despite the teething problems, I had hoped that the GFT team would continue to work on this problem. Google is one of the few companies that can afford the resources required to push ahead with this project. It appears that the engineers working on it found the challenges too hard or not interesting enough to overcome.
This development does not bode well for Big Data projects. The OCCAM nature of the data presents a host of challenges, and until we find ways to cope with them, we will continue to have false starts.
I have really been enjoying ProPublica's pieces lately. There are several articles about topics of great interest to me, and those who have read my books will be familiar with these themes.
My favorite is an article that speaks a truth about data projects: much as we sweat over data collection, data integrity, and statistical models, the true challenge is in persuading the rest of the world to adopt our end products. The title of the piece says it all: "The FBI built a database that can catch rapists--and almost nobody uses it" (link).
The data project in question is an early effort to link data from multiple sources, leveraging correlations to identify serial offenders. However, less than 10% of local police departments contribute data to the system, rendering it toothless. In my experience, it is common to find data projects stuck in first gear, failing to make any real-world impact.
Kudos to the authors for asking the dirty question of the return on investment of such a system. It is believed that in 12 years, the system may have helped solve 33 crimes. It costs $800,000 per year to maintain (most likely, contractor expenses). You do the math!
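To spell out the math, taking those figures at face value: 12 years at $800,000 per year is about $9.6 million, which works out to roughly $290,000 for each crime the system may have helped solve.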
For managers, the key is to diagnose properly the reasons for inaction. Lack of adoption is frequently blamed on technology but the reality is much more complicated.
A must-read article.
David Epstein reports on raids on steroid labs (link). Law enforcement is the most effective way to catch cheaters in sports. In Numbers Rule Your World, I explained why anti-doping tests are ineffective, in the sense that false negatives are rampant, letting lots of dopers off the hook. This conclusion comes from a simple statistical calculation. In the chapter on a lottery cheat, I described how statistics can be used to prove that "someone has beyond reasonable doubt cheated" but physical evidence is required to nail the perpetrator.
Epstein then expanded the conversation: "World-class athletes are merely the fine layer of frost atop the iceberg’s tip when it comes to the steroid economy." The headline of the piece is "Everyone's Juicing".
I find it interesting that Epstein said "In years of reporting on performance-enhancing drugs, I’ve frequently been asked why athletes in smaller sports or facing lower stakes would dope, given that there’s little money in it for them." This feels odd because when I was researching my book five or six years ago, I heard the opposite claim, that elite athletes couldn't possibly be doping because they don't need steroids. (Think Barry Bonds back in the days.) This tells me that (a) public opinion has shifted due to the Armstrong revelations and (b) the human mind will rationalize any story even if the story flips.
Epstein has another article in August about false negatives, which should be familiar territory for my readers (link).
Joaquin Sapien reports on the case of one Ruddy Quezada, who was released after spending 24 years in prison for murder. This case reminds me of the Innocence Project, whose amazing work I featured in Numbers Rule Your World. In the current case, though, we don't know whether Quezada was innocent, only that the prosecution lied about how it coerced the witness to testify. The witness testimony was the only piece of evidence in the case, which means that the prosecution is left with no avenue to re-try it.
The case I used in my book concerns false confessions so both cases deal with coerced evidence.
Gabe Murray wrote to Andrew Gelman, asking for comments about the accusations hurled at the current Tour de France front-runner, Chris Froome.
Andrew Gelman has a great post about the concept of statistical significance, starting with a published definition by the Department of Health that is technically wrong on many levels. (link)
Statistical significance is one of the most important concepts in statistics. In recent years, a vocal group has claimed that the idea is misguided and/or useless. But what they are really angry about is the use (and frequently, misuse) of the p-value, which is one way to measure statistical significance. In my view, this concept has never been as important as it is today, in the world of Big Data.
Statistical significance codifies a core principle: that the observed dataset is not sufficient to answer your question (no matter how big the dataset is). It says that your observed sample is only one of many possibilities. If you could repeat your data collection, your dataset would look different. It might look only slightly different, or it might look vastly different. This "chance" that people keep talking about is merely the variation you get from one sample to another. By its very nature, this is a thought experiment: you only have one observed sample.
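If you want to see the thought experiment in miniature, here is a tiny simulation (all numbers invented): draw ten samples from the same population and watch the estimates wobble.

```python
import numpy as np

rng = np.random.default_rng(1)

true_rate = 0.10   # the "true" population rate, which the analyst never sees
n = 200            # sample size, made up for illustration

# Ten hypothetical re-runs of the same data collection
observed_rates = rng.binomial(n=n, p=true_rate, size=10) / n
print(observed_rates)
# Each draw tells a slightly different story; in real life, you get to see
# exactly one of them.
```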
But that is the point. Sound statistical reasoning requires you to think beyond your one observed sample.