This piece is part of the StatBusters column, written jointly with Andrew Gelman. (I hope they fix the labeling soon.) In it, we talk about two recent studies on data privacy that lead to contradictory conclusions. How should the media report such surveys? Is the brand name of the organization enough? In addition, we debunk the notion that consumers will definitely get something valuable out of sharing their data.
In the last installment, I embarked on a project--perhaps only a task--to assemble a membership list for an organization. It sounded simple: how hard could it be to merge two lists of people? Of course, I couldn't just stitch one list on top of the other, since some members subscribed to the newsletter as well as joined the Facebook group. These duplicate rows had to be merged so that each individual occupies one row of data.
With barely a sweat, I blew past my initial budget of two hours.
After a half day, I produced a merged list by matching Facebook usernames to email usernames. It felt like running an obstacle course, with one annoying issue popping up as soon as another was resolved. Stray punctuation, ambiguous names, case sensitivity, and so on. Most of these problems lacked clear-cut solutions. Some periods (full stops) were redundant but not all; some middle names were part of the last name but not all. Tick, tick, tick, tick. These data issues demanded consideration, and considerable time.
At the start of Day 2, I executed a planned U-turn. Starting with the two lists of people, I attempted to match first and last names. I tried usernames as the key first because only a small portion of the Email list included names. However, a match of first and last names is a more confident result than a match of usernames.
Immediately, I stepped into text-matching quicksand. I had to process the Facebook names (previously scraped) the same way I had fixed up the names in the Email list.
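The post doesn't say which tool I used; here is a minimal sketch of that kind of clean-up in Python/pandas, assuming two DataFrames named `fb` and `email`, each with a free-text `name` column. The column names and the specific rules are illustrative assumptions, not the actual code.

```python
import re
import pandas as pd

def clean_name(raw):
    """Normalize a free-text name for matching: drop stray punctuation,
    collapse whitespace, and standardize case."""
    if pd.isna(raw):
        return None
    name = re.sub(r"[.,]", " ", str(raw))     # stray periods and commas become spaces
    name = re.sub(r"\s+", " ", name).strip()  # collapse repeated whitespace
    return name.upper() or None               # an all-blank result counts as missing

# Apply the same treatment to both lists so the keys are actually comparable
fb["clean_name"] = fb["name"].map(clean_name)
email["clean_name"] = email["name"].map(clean_name)
```

Of course, as noted above, the real decisions (which periods are redundant, which middle names belong to the surname) resist simple rules like these.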
As before, I tried a “full outer join.” Disaster. The output data had a crazy number of rows. I sensed missing values. Sure enough, there were some Facebook members for whom I did not have names (for example, they provided names in Chinese or Korean characters). Each of these members with missing names matched, erroneously, the whole set of email subscribers who also did not provide names.
One way out of this mess was to extract only people with non-missing names from either list, and then merge those subsets. This path was not easy either. It created four types of members: those with matching names on both lists; those with a Facebook name that didn't match any email name; those with an email name that didn't match any Facebook name; and those who provided no usable names in either list.
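Continuing the hedged pandas sketch from above, the subsetting step might look like this:

```python
# Split each list into rows with a usable name key and rows without one,
# so that blank names can never match other blank names in the join
has_name = fb["clean_name"].notna()
fb_named, fb_unnamed = fb[has_name], fb[~has_name]

has_name = email["clean_name"].notna()
email_named, email_unnamed = email[has_name], email[~has_name]
```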
The challenge was to combine those four groups of members in such a way that each unique member is just one row of data. For each such member, I also wanted to gather all other information from both the Facebook and Email lists. This required defining a number of indicator columns, as well as columns recording which list each piece of data came from.
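And the reassembly, still with invented column names. The `indicator` argument (here given the column name `source`) records which list each row came from, which doubles as the indicator columns mentioned above. This is a sketch of one way to do it, not a claim about what I actually ran.

```python
# Full outer join of the rows that have names, tagging the source of each row:
# "both", "left_only" (Facebook only), or "right_only" (Email only)
by_name = fb_named.merge(
    email_named, on="clean_name", how="outer",
    suffixes=("_fb", "_em"), indicator="source",
)
by_name["on_facebook"] = by_name["source"].isin(["both", "left_only"])
by_name["on_email"] = by_name["source"].isin(["both", "right_only"])

# Prefer the Facebook spelling of the name, fall back to the Email spelling
by_name["display_name"] = by_name["name_fb"].fillna(by_name["name_em"])

# Stack on the fourth group -- members with no usable name on either list --
# so that every member still ends up as one row
members = pd.concat(
    [
        by_name,
        fb_unnamed.rename(columns={"name": "name_fb"}).assign(on_facebook=True, on_email=False),
        email_unnamed.rename(columns={"name": "name_em"}).assign(on_facebook=False, on_email=True),
    ],
    ignore_index=True,
)
```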
I experienced a soothing satisfaction when the output data appeared as expected.
But the job was not yet finished. I ended up with two merged lists, one based on username matching and the other on name matching. It was time to merge the merged. I'll spare you the details, most of which resembled the above.
Knowing my client’s name was on the list, I looked him up. There he was, again and again, occupying four or five rows. This might make your heart sink since I had tried so hard to maintain one row per member. But don’t worry. I was simplifying things a little bit. If someone provided multiple email addresses, as my client did, I had decided to keep all of them.
At long last, the master list of members was born. This exercise bore instant rewards. It is very useful to know which members are on both lists and which members are on just one. We have a rough measure of how involved a member is. The hard work lies ahead since our goal is to gain a much deeper understanding of the members.
An organization wanted to understand its base of members, so the first order of business was constructing a database of all the people who could be considered members. We decided to define membership broadly: members included those who joined the Facebook group and those who subscribed to the newsletter.
The organization kept two separate lists which I would merge to create a master list. For simplicity, I’ll call them the FB list, and the Email list. In merging, the key is the key. Let me explain. The simplest key is an email address. If someone’s email address shows up on both lists, then I infer that those entries concern the same person, and combine them. My goal is to remove double counting of anyone who appears on both lists.
Sounds simple enough.
But never that simple, right? First, the Facebook group is the graveyard of data. Facebook provides zero statistics on group members and activities. Yes, the company that makes a business out of data does not hear the data-deprived group owners who have been pleading for years.
What is a data scientist to do? Scrape, that’s what. Members can find out who else is in the group by the scroll-wait-reset-scroll routine. You know that feeling. I know you do. You scroll to the bottom of a web page. Your browser gets the hint. It loads a few more items, while the slider floats away, usually to the wrong spot. You re-set the position, and scroll some more. After much scrolling, I scraped that page to compile the Facebook list. It’s got the name of the person, their Facebook username, and their location (when available).
Notice I didn't say email address. So the FB list did not contain the all-important key. Another possible key is first and last names. Reviewing the Email list, I realized that newsletter subscribers are not required to provide names, so matching names to the FB list would yield few hits. The third candidate is not as accurate: I tried matching the Facebook username to the email username.
The client furnished an Excel file, which I’ve been calling the Email list. Upon opening the list, I turned the email address into all uppercase letters. I have matched enough text data to know that people are hardly in control of their fingers when they type text into web forms. “John”, “JOHN”, “joHN”, “JOhn”, and so on typically mean the same thing, regardless of case. (The occasional sadist offers “J0hn,” or “Jhon,” or “Jo hn.”)
Meanwhile, the client wondered if email addresses are really case-insensitive. I suggested asking Google. The search engine gave an ambiguous answer. The part after the @ sign is case-insensitive whereas the part before @ is case-sensitive, but then most email providers treat both parts as case-insensitive.
It's rare that Google complicates your life. I fished out the UPPERCASE(email_address) formula, deleted it, broke the email address into the username and domain-name parts, upper-cased the domain name, and reconnected the two parts, re-inserting the @ sign. The machine must follow these steps but a human being instinctively knows where to apply the cut. Some researchers believe the brain executes those steps at warp speed, but I don't buy it.
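In spreadsheet terms, that's a split at the @ sign, an UPPER on the right-hand piece, and a re-concatenation. Here is the same idea as a small, hedged Python sketch; the function name is mine.

```python
def normalize_email(raw):
    """Uppercase only the domain of an email address; the part before the
    @ sign is, strictly speaking, case-sensitive, so leave it alone."""
    if not isinstance(raw, str) or "@" not in raw:
        return None
    local, _, domain = raw.strip().partition("@")
    return f"{local}@{domain.upper()}"

normalize_email("John.Smith@Example.com")   # -> "John.Smith@EXAMPLE.COM"
```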
Next, I dropped the domain names from the split-and-spliced email addresses to get ready to match to Facebook usernames. Sheesh, the client did not ask if Facebook usernames are case sensitive or not. (They aren’t.) I proceeded to merge the two lists.
I executed a “full outer join.” With this procedure, any username that appears in one or both of the lists will find its way to the output dataset. On this first attempt, nothing merged. Even though username “davidcolumbus,” say, lived on both lists, the computer did not combine the data; the two matches sat one on top of the other.
I took a deep breath, for I had reached a point where I must be honest with myself. This project was sure to bust the two hours I originally allotted. The merge could easily take another hour, maybe two, if no new issues emerged.
The matching rows did not combine because the computer only joins eponymous columns. Since the Facebook and email usernames are different entities, those columns carry different labels.
But syncing those labels solves one problem while creating another! Members who appear on only one list have only one of the usernames. Besides, Facebook usernames are unique while email usernames, when detached from their domains, are not. A better solution is to set up a third username column in both lists, whose purpose in life is to be the matching key.
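A sketch of that third column, again with invented column names. Neither original username column is touched; the new column exists solely to drive the join.

```python
# Facebook usernames are not case-sensitive; email usernames are the part of
# the address before the @ sign (and, detached from their domains, not unique).
# The same missing-key caveat as before applies here too.
fb["match_key"] = fb["fb_username"].str.lower()
email["match_key"] = email["email_address"].str.split("@").str[0].str.lower()

by_username = fb.merge(
    email, on="match_key", how="outer",
    suffixes=("_fb", "_em"), indicator="source",
)
```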
What about the other columns? Did I want them combined or not? Take first and last names, which show up on both lists. If I standardized the labels of these columns, the computer would attempt to merge them. What if David Columbus appeared as Dave Columbus on the other list, with matching usernames? Forcibly combining the name columns would cause one of these variations to be dropped. If I wanted to keep both spellings, I had to retain all the name columns, which happens when they carry distinct labels, exactly the opposite of what I did with the username columns.
If that isn’t confusing enough, I stumbled upon another issue. In the Email list, while most names appeared as “First <space> Last,” there were examples of “Last <space> First”, and “Last <comma> First”, and “First Initial <space> Last,” and so on. As an analyst, your first thought is “What’s wrong with our designers? Why didn’t they create separate text boxes for first and last names?” Then, you accept that blame gets you nowhere; you still have to fix what’s broken.
A soft voice enters your head. You wish you hadn’t seen the problem. You hope it was just a bad dream. But you wake up.
In front of me I had two paths. I could follow path A, and that meant developing code to automatically detect the various anomalies and fixing them. This path would take hours. Which is the first name in “Scott Lewis”? How would a computer figure this out? What rule could apply generally?
And then, there was path B, better known as handcrafting. If I had 1,000 rows of data, and if it took two seconds to scan a name and determine the type of anomaly, I would have completed the exercise in 30 minutes or so.
I chose path B. It was ugly and unsexy but more of a sure thing.
I wish I could tell you I stopped looking. But I couldn't help it. Some cultures embrace double surnames, like "De" something or "Von" something. My code was parsing "Chris De Jong" as first name Chris, and last name Jong. I needed a more complex rule, something like: if the name has three words, take the first as the first name and the last two as the surname. This rule runs afoul of someone like "Mary Anne Rutherford." At a crossroads again. I could teach a computer how to lump the middle name, or I could exercise my brain some more.
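Here is what the "teach the computer" branch might look like, as a hedged sketch; the particle list is mine and far from complete.

```python
SURNAME_PARTICLES = {"de", "van", "von", "del", "della", "da", "la"}

def split_name(full_name):
    """Naive first/last split with a special case for surname particles.
    Gets 'Chris De Jong' right, but plenty of names still defeat it."""
    parts = full_name.split()
    if len(parts) < 2:
        return full_name, None
    if len(parts) > 2 and parts[-2].lower() in SURNAME_PARTICLES:
        # Keep the particle with the surname: Chris / De Jong
        return " ".join(parts[:-2]), " ".join(parts[-2:])
    # Default: everything before the last word is the first name(s)
    return " ".join(parts[:-1]), parts[-1]

split_name("Chris De Jong")         # -> ("Chris", "De Jong")
split_name("Mary Anne Rutherford")  # -> ("Mary Anne", "Rutherford")
split_name("Scott Lewis")           # -> ("Scott", "Lewis") -- or is it Lewis, Scott?
```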
By this time, I was exhausted. If you have followed me to this point, you have my admiration. In the next installment, I shall finish the assignment.
I was at college reunions in beautiful Princeton on a glorious sunny day.
I also spoke about data science at a Faculty-Alumni panel titled "Science Under Attack!". Here is what I said:
In the past five to 10 years, there has been an explosion of interest in using data in business decision-making. What happens when business executives learn that the data do not support their theories? It turns out that the reaction is similar to what other panelists have described - science under attack! When I bring data into the boardroom, the data are measuring something, which means the data are measuring someone; and you can bet that someone isn't too happy about being measured. My analysts face endless debates, wild goose chases, and requests to conduct one analysis after another until the managers find the story they like.
I think there are two reasons for the gap between data analysts and business managers, who are often non-technical people: (a) a communications gap and (b) the nature of statistics as a discipline.
Imagine you have to sell a product to Koreans in Korea. You don't speak a word of Korean and your counterpart does not speak English. What would you do? You'd probably hire a translator to deliver your sales pitch in Korean. What you wouldn't do is stay in Korea for a year, teach your counterpart English, and then give your original pitch in English. But that is exactly what many data analysts are doing today. When challenged about our findings, we try to explain the minute details of how the statistical output is generated, effectively teaching managers math. And we are not succeeding. I have spent much of my career thinking about how to bridge this gap, how to convey technical knowledge to a non-technical audience.
The second reason for the gap is the peculiar nature of statistical science. What we offer are educated guesses based on a pile of assumptions. This is because statistics is a science of incomplete information. We can never produce a definitive answer because we simply do not have all the data we need. But this creates an opening for people who are predisposed to oppose our conclusions to nitpick our assumptions.
I also want to bring up a different threat to science, one that arrives with the era of Big Data. This is a threat from within, not from without.
The vast quantity of data is generating lots of analyses by lots of people, most of which are false. A nice illustration of this is the website tylervigen.com. This guy dumped a lot of publicly available data into a database, and asked the computer to select random pairs of variables and compute the correlation between them. For example, one variable might be U.S. spending on science, space and technology, and the other suicides by hanging, strangulation or suffocation. You know what, those two variables are correlated to the tune of 99.8%.
Another aspect of Big Data analysis deserves attention: many of these analyses do not have a correct answer. Take Google's Pagerank algorithm, which powers the famous search engine. Pagerank is supposed to measure the "authority" of a webpage. The model behind the algorithm assumes that the network of hyperlinks between webpages provides all the information needed to measure authority. But no one can verify how accurate the Pagerank metric is, because no one can tell us the true value of authority.
In the case of Pagerank, we may be willing to look past our inability to scientifically validate the method because the search engine is clearly useful and successful. But I'd submit that many Big Data analyses are equally impossible to verify; in many cases, they may not be useful, and in the worst cases, they may even be harmful.
I only read nutrition studies in the service of this blog; otherwise, I neither trust them nor care much about them. Nevertheless, the health beat of most media outlets is obsessed with printing the latest research on coffee or eggs or fats or alcohol or what have you.
Now, the estimable John Ioannidis has published an editorial in BMJ titled "Implausible Results in Human Nutrition Research". John previously told us about the crisis of false positives in medical research.
Oops, here are some statistics on nutrition "science":
In 52 attempts at using randomized experiments to validate findings from observational studies, the number of times the findings were replicated: 0
In the NHANES questionnaire (the basis of all those findings), two-thirds of the participants provided answers implying an energy intake that is "incompatible with life". I haven't read that paper; it seems worth reading.
There are at least 34,000 papers on PubMed with the keywords "coffee OR caffeine", which means this one nutrient has been associated with almost any interesting outcome.
Almost every single nutrient imaginable has peer-reviewed publications associating it with almost any outcome. A statistician should never give the advice "If at first you don't succeed,..."
Many findings are entirely implausible (and still get published in top journals)... for example, the idea that a couple of servings a day of a single nutrient will halve the burden of cancer is clearly "too good to be true," even more so for anyone who is familiar with this literature.
"Big datasets just confer spurious precision status to noise"
Randomized experiments offer hope but are woefully undersized (like requiring 10 times the current sample).
Just to nail home the point, John concludes: "Definitive solutions will not come from another million observational papers or a few small randomized trials."
I mentioned the Harvard Business Review article on business use of customer data in the "Big Data" era. In the previous post, I looked at the nature of the evidence used by the authors. In this post, setting aside my discomfort with some of that evidence, I examine the conclusions of the article.
The report has a three-part structure: the first section describes the issues; the second section reports results from a few surveys conducted by frog, a global strategy and design agency, on various issues related to data privacy; and the third section presents examples of the recommendations frog makes to its clients, offered here more generally to businesses that collect and monetize customer data.
The survey results are revealing (although the sample size of 900 across five countries is tiny, so I'm not sure how much you should believe them). The agency found that 97% of the people surveyed are concerned about businesses and governments mis-using their data. Seventy-two percent of Americans are reluctant to share information with businesses because they "just want to maintain their privacy".
The authors also learned that consumers have grossly under-estimated the extent of data collection. Only 25% of the respondents said they knew businesses tracked their location, and only 14% said they knew businesses shared their web-surfing history. Finally, their analysts attached dollar values to the privacy of different types of data.
I follow them up to this point. In fact, the authors summed it up very nicely at the beginning of the article: most [companies] "prefer to keep consumers in the dark, choose control over sharing, and ask for forgiveness rather than permission."
Unfortunately, I am let down by the list of recommendations that follows. They feel to me like tweaks on failed ideas rather than paradigm shifts.
The first recommendation is "educate the consumers". The authors gave an example of one of their own consulting clients who required "customers" to watch a video and give preliminary consent before sharing their own (genomic) data. And the personal data is withheld until the "customer" returns a hard-copy agreement.
We don't need to be reminded that every day, we "voluntarily" sign Terms and Conditions which no ordinary person actually reads. Frequently, we are told not to use a website if we don't agree with any part of a lengthy agreement written in one-sided language favoring the business.
The "new" solution doesn't change the status quo. In fact, it gives businesses a stronger case for arguing that their users have voluntarily given up the right to their own data. In my view, until businesses confront the issue of properly disclosing how they collect data, what information is being collected, and how such data are being sold or traded, consumers will continue to find such practices creepy.
The second recommendation looks good on paper but is impractical. Another one of frog's clients is featured here. This client allows customers to specify which pieces of data can go to whom.
Assume there are 100 variables (only!) being collected and five levels of access control. That amounts to 500 yes/no questions each user is required to answer in order to gain full control of the data. In practice, most users will decide not to bother because it is too complex and time-consuming. The solution is a form of suffocation by paperwork.
For the data analysts, such a solution creates headaches. It generates self-selected data of the worst kind. Each variable has its own source of bias as different subsets of users decide to withhold their data for their own reasons.
To implement such a system properly requires a herculean effort. Say I reviewed the list of 100 variables and divided them into five groups of 20, using the five levels of control (from allowing anyone to see my gender to hiding my age from everyone). Two months later, I changed my mind and removed everyone's access to 80 of the 100 variables. Now, the database administrator should find all instances of those 80 variables and delete them. Some of the data may already have been sold to other entities, and what if those other entities re-sell my data after I asked the original source to delete it?
The last recommendation is an argument that businesses should not need to pay users for their data. Given the finding in the second section that users assign meaningful dollar values to their data, this seems to be a solution for businesses rather than for consumers.
Pandora's free advertising-supported service is used as an example of customers' willingness to exchange their privacy for "in-kind value". The article failed to mention just how much money Pandora has been paying to provide that "in-kind value"! As this other HBR article tells us, Pandora is "13 years, 175 million users, little profit". It has never been able to establish a profitable business model: while 80% of its revenues come from advertising to those "free" accounts, 60% of its revenues immediately go out the door as royalty payments for the "free" music. It's not surprising that many consumers willingly engage in this lop-sided exchange with Pandora.
I often wonder: if consumers realized that over-sharing their data works to their disadvantage, would they become more interested in how businesses use their data?
For instance, insurance companies will be very interested in acquiring data from personal analytics devices, like Fitbit. They will use the data to predict whether you have health risks, and they will charge you more for insurance. Everyone is at risk for something.
The Uber app gives its users the ability to track their drivers -- in Manhattan, it's like watching a horse race as your driver negotiates the city gridlock. The same data are used by Uber to get an accurate picture of supply and demand, which drives its surge-pricing algorithm. That's how you end up paying five to ten times the normal cab rate.
Businesses use personal data to reduce information asymmetry, which in the past prevented them from extracting maximum value from consumers.
Today, the data privacy question is phrased as "Company X would like to collect information about your heart rate and in exchange, you will get notified if any irregularity is detected. Are you willing to share such data with Company X?"
Imagine you are asked a different question: "Company X would like to collect information about your heart rate and in exchange, you will get notified if any irregularity is detected. Being notified of heart-rate irregularity may help you but 80% of the warnings will be false alarms. Also, your heart rate data will be used by our insurance arm to adjust your insurance premiums. There is a 50% chance that your premium will increase after sharing your data. Are you willing to share such data with Company X?"
For those who have found it tough to keep up with Andrew Gelman's prolificacy, here are some brief summaries of several recent posts:
On people obsessed with proving the statistical significance of tiny effects: "they are trying to use a bathroom scale to weigh a feather—and the feather is resting loosely in the pouch of a kangaroo that is vigorously jumping up and down." (link)
[I left a comment. In Big Data, we have thousands, no millions, of kangaroos jumping out of sync, but still one feather.]
On people testing a zillion things hoping to land on the one that "works": "I suggest you should fit a hierarchical model including all comparisons and then there will be no need for such a corrections." (link)
[This is something Andrew has been advocating for a while. The idea is that such models have, in some sense, a built-in correction for the multiple comparisons problem. Unfortunately, some researchers are wrongly interpreting Gelman. I recently read a report that cites Gelman's paper as evidence that "multiple comparisons" is not a real problem, and then proceeds to fit dozens of regressions without any mechanism to control for multiple comparisons!]
On when to throw out all your data, the lot of it: "Sure, he could do all this without ever seeing data at all—indeed, the data are, in reality, so noisy as to have no bearing on his theorizing—but the theories could still be valuable." (link)
In my latest piece for Harvard Business Review (link), I tackle this common problem in the interactions between data scientists and business managers:
A typical big data analysis goes like this: First, a data scientist finds some obscure data accumulating in a server. Next, he or she spends days or weeks slicing and dicing the numbers, eventually stumbling upon some unusual insights. Then, a meeting is organized to present the findings to business managers, after which, the scientist feels disgruntled or even disrespected while the managers wish they could take the time back.
Using analyses of the popular baby names dataset as an example, I contrast the kind of analysis that generates click bait (e.g. the most "poisoned" names, the most "trendy" names) with the kind of analysis that generates potentially real business value.
Are science journalists required to take even one good statistics course? That was the question in my head as I read this Science Times article, titled "One Cup of Coffee Could Offset Three Drinks a Day" (link).
We are used to seeing rather tenuous conclusions such as "Four Cups of Coffee Reduces Your Risk of X". This headline takes it up another notch. A result is claimed about the substitution effect of two beverages. Such a result is highly unlikely to be obtained in the kind of observational studies used in nutrition research. And indeed, a glance at the source materials published by the World Cancer Research Fund (WCRF) confirms that they made no such claim.
The headline effect is pure imagination by the reporter, and a horrible misinterpretation of the report's conclusions. Here is a key table from the report:
The conclusion on alcoholic drinks and on coffee comes from different underlying studies. Even if they had come from the same study, you cannot take different regression effects and stack them up. The effect of coffee is estimated for someone who is average on all other variables. The effect of alcohol is estimated for someone who is average on all other variables. The average person in the former case is not identical to the average person in the latter case. So if you add (or multiply, depending on your scale) the effects, the total effect is not well-defined.
In addition, you can only add (or multiply) effects if you first demonstrate that the two factors do not interact. If there is interaction, the effect of alcohol is different for people who drink less coffee relative to those who drink more. The alcohol effect stated in the table above, as I already pointed out, is for an average coffee drinker. Conversely, the protective effect of coffee may well vary with alcohol consumption.
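To make the point concrete, here is a minimal sketch of a risk model with an interaction term; the symbols are illustrative and this is not the model behind the WCRF report:

log(relative risk) = b1 x alcohol + b2 x coffee + b3 x (alcohol x coffee)

Under this model, the alcohol effect is b1 + b3 x coffee, which shifts with how much coffee a person drinks. Stacking a headline alcohol effect on top of a headline coffee effect implicitly assumes that b3 is zero, and that both numbers came out of the same model fitted to the same people. Neither assumption holds here.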
The reporter also misrepresented the nature of the analysis. We are told: "In the study of 8 million people, cancer risk increased when they consumed three drinks per day. However, the study also found that people who also drank coffee, offset some of the negative effects of alcohol."
The reporter made it sound like a gigantic randomized controlled study was conducted. This is a horrible misjudgment. WCRF did not do any study at all, and certainly no researcher asked anyone to drink specific amounts of alcohol or coffee. The worst is the comment on people who drank coffee as well as alcohol. I can't find a statement in the WCRF report about such people. It's simply made up based on the false logic described above.
At one level, the journalist misquoted a scientific report. At another level, the WCRF report is rather disappointing.
The authors of the executive summary repeatedly use the language of causation. For example, "There is strong evidence that being overweight or obese is a cause of liver cancer." Really? Which study shows that obesity "causes" liver cancer?
Take one of their most "convincing" findings: "Aflatoxins: Higher exposure to aflatoxins and consumption of aflatoxin-contaminated foods are convincing causes of liver cancer." The causation is purely an assumption of the panel that reviewed prior studies. In Section 7.1, readers learn that this cause-effect conclusion comes from "four nested case-control studies and cohort studies" for which "meta-analyses were not possible". So: not a single randomized trial, and no estimation of the pooled effect.
What is nicely done in the report is the inclusion of "mechanisms" which are speculative explanations for the claimed causal effects. It's great to have thought carefully about the biological mechanisms. Nevertheless, these sections are basically "story time" unless researchers succeed in establishing those unproven links.
The American Association for Public Opinion Research (AAPOR) put out its Big Data report last month (link). This one is worth reading. It has some of the most current citations, and readers of this blog will be very receptive to its core messages. The team that wrote the report is a mix of academics and practitioners.
In Big Data, there are many self-evident truths, according to the people who talk about Big Data. One of these is the idea that Big Data will make surveys obsolete. How could it not, since Big Data means hundreds if not thousands of times more "respondents", the ability to track trends in "real time", and the ability to evolve your survey questions? This AAPOR report, I suppose, is a response to such claims.
Then there are those who say surveys are merely a suboptimal stopgap from the old days of "small data". They explain that surveys measure "stated preferences" while Big Data (i.e. found data, observational data) measures "revealed preferences", and that aiming for the latter is self-evidently better. Revealed preferences are closer to "the truth"; surveys merely represent an approximation, and with Big Data, we no longer need to run surveys.
This topic nicely ties in with Chapter 1 of Numbers Rule Your World (link). In that chapter, I explain the success of Disney in keeping customers happy despite having to wait two hours for rides that last two minutes. The "imagineers" realized that managing perception is even more important than optimizing reality. Customer happiness improves even though measured waiting times (i.e. revealed information) worsen or stay the same.
If Disney relied solely on revealed metrics, the data would say customers are waiting longer than, or just as long as, before. When Disney conducts surveys and asks people how they feel, they say they are waiting less and so are happier. Feelings are crucial data that are not revealed by any observation. To the extent that feelings can be "revealed", it requires an assumption on the part of the observer, along the lines of: the measured waiting time went down, so the customers must be feeling happier.
On a website, the web log has measured data that reveal the paths of users through the website. One can observe where traffic drops off. But that is not enough. In order to reduce attrition, the designer needs to understand why users exit. Surveys provide the answer here.
What if you learn that users exit by clicking on the home page icon? So you test a version of the design in which the home page icon is not clickable. You observe that the exit rate has fallen significantly in the new version. The problem is that users now find a different way to exit. Relying only on revealed preferences frequently leads to superficial actions that cure symptoms but not root causes.
Revealed preferences and stated preferences are two different dimensions and they both have strengths and weaknesses. Logs are bigger and faster but researchers have no control over the composition of the responders. Neither is a substitute for the other. I am interested in seeing work on integrating the two approaches. The AAPOR report has a good discussion of this subject plus a few references to new work on integration.