This piece is part of the StatBusters column written jointly with Andrew Gelman. (Hope they fix the labeling soon.) In it, we talk about two recent studies on data privacy that lead to contradictory conclusions. How should the media report such surveys? Is the brand name of the organization enough? In addition, we debunk the notion that consumers will definitely get something valuable out of sharing their data.
In the last installment, I embarked on a project--perhaps only a task--to assemble a membership list for an organization. It sounded simple: how hard could it be to merge two lists of people? Of course, I couldn't just stitch one list on top of the other, as some members both subscribed to the newsletter and joined the Facebook group. These duplicate rows had to be merged so that each individual occupies one row of data.
With barely a sweat, I blew past my initial budget of two hours.
After a half day, I produced a merged list by matching Facebook usernames to email usernames. It felt like running an obstacle course, with one annoying issue popping up after another was resolved. Stray punctuation, ambiguous names, case sensitivity, and so on. Most of these problems lacked clear-cut solutions. Some periods (full stops) were redundant but not all; some middle names were part of the last name but not all. Tick, tick, tick, tick. These data issues demanded consideration, and considerable time.
At the start of Day 2, I executed a planned U-turn. Starting with the two lists of people, I attempted to match first and last names. I tried usernames as the key first because only a small portion of the Email list included names. However, a match of first and last names is a more confident result than a match of usernames.
Immediately, I stepped into text-matching quicksand. I had to process the Facebook names (previously scraped) the same way I had fixed up the names in the Email list.
As before, I tried a “full outer join.” Disaster. The output data had a crazy number of rows. I sensed missing values. Sure enough, there were some Facebook members for whom I did not have names (for example, they provided names in Chinese or Korean characters). Each of these members with missing names matched, erroneously, the whole set of email subscribers who also did not provide names.
One way out of this mess was to extract only the people with non-missing names from each list, and then merge those subsets. This path was not easy though. The split created four types of members: those with matching names on both lists; those with a Facebook name that didn’t match anyone’s email name; those with an email name that didn’t match anyone’s Facebook name; and those who provided no usable name in either list.
The challenge was to combine those four groups of members in such a way that each unique member occupies just one row of data. For each such member, I also wanted to gather all other information from both the Facebook and Email lists. This required defining a number of indicator columns, as well as columns recording which list each piece of data came from.
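To make this concrete, here is a minimal sketch of the approach in pandas, using tiny made-up lists and invented column names; the real data and cleanup were of course messier.

```python
import pandas as pd

# Tiny made-up versions of the two lists; the real column names differed.
fb = pd.DataFrame({
    "first_name": ["David", "Mary", None],
    "last_name": ["Columbus", "Rutherford", None],
    "fb_username": ["davidcolumbus", "maryrutherford", "kim.lee"],
})
em = pd.DataFrame({
    "first_name": ["David", None],
    "last_name": ["Columbus", None],
    "email": ["dcolumbus@example.com", "noname@example.com"],
})

# A naive outer join would match every nameless Facebook member to every
# nameless subscriber, so keep only rows with usable names on each side.
fb_named = fb.dropna(subset=["first_name", "last_name"])
em_named = em.dropna(subset=["first_name", "last_name"])

merged = fb_named.merge(em_named, on=["first_name", "last_name"],
                        how="outer", indicator=True)
# The _merge column flags 'both', 'left_only' (Facebook only), and
# 'right_only' (email only): three of the four member types; the fourth
# (no usable name in either list) is handled separately.
print(merged)
```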
I experienced a soothing satisfaction when the output data appeared as expected.
But the job was not yet finished. I ended up with two merged lists, one based on username matching and the other on name matching. It was time to merge the merged. I’ll spare you the details, most of which resembled the above.
Knowing my client’s name was on the list, I looked him up. There he was, again and again, occupying four or five rows. This might make your heart sink since I had tried so hard to maintain one row per member. But don’t worry. I was simplifying things a little bit. If someone provided multiple email addresses, as my client did, I had decided to keep all of them.
At long last, the master list of members was born. This exercise bore instant rewards. It is very useful to know which members are on both lists and which members are on just one. We have a rough measure of how involved a member is. The hard work lies ahead since our goal is to gain a much deeper understanding of the members.
An organization wanted to understand its base of members, so the first order of business was constructing a database of all the people who could be considered members. We decided to define membership broadly: members included those who joined the Facebook group and those who subscribed to the newsletter.
The organization kept two separate lists which I would merge to create a master list. For simplicity, I’ll call them the FB list, and the Email list. In merging, the key is the key. Let me explain. The simplest key is an email address. If someone’s email address shows up on both lists, then I infer that those entries concern the same person, and combine them. My goal is to remove double counting of anyone who appears on both lists.
Sounds simple enough.
But it’s never that simple, right? First, the Facebook group is a graveyard of data. Facebook provides zero statistics on group members and activities. Yes, the company that makes a business out of data does not listen to the data-deprived group owners who have been pleading for years.
What is a data scientist to do? Scrape, that’s what. Members can find out who else is in the group by the scroll-wait-reset-scroll routine. You know that feeling. I know you do. You scroll to the bottom of a web page. Your browser gets the hint. It loads a few more items, while the slider floats away, usually to the wrong spot. You reset the position, and scroll some more. After much scrolling, I scraped that page to compile the Facebook list. It’s got the name of the person, their Facebook username, and their location (when available).
Notice I didn’t say email address. So the FB list did not contain the all-important key. Another possible key is first and last names. Reviewing the Email list, I realized that newsletter subscribers are not required to provide names, so matching names to the FB list would yield few hits. The third candidate is not as accurate: I tried matching the Facebook username to the email username.
The client furnished an Excel file, which I’ve been calling the Email list. Upon opening the list, I turned the email addresses into all uppercase letters. I have matched enough text data to know that people are hardly in control of their fingers when they type text into web forms. “John”, “JOHN”, “joHN”, “JOhn”, and so on typically mean the same thing, regardless of case. (The occasional sadist offers “J0hn,” or “Jhon,” or “Jo hn.”)
Meanwhile, the client wondered if email addresses are really case-insensitive. I suggested asking Google. The search engine gave an ambiguous answer. The part after the @ sign is case-insensitive whereas the part before @ is case-sensitive, but then most email providers treat both parts as case-insensitive.
It’s rare when Google complicates your life. I fished out the UPPERCASE(email_address) formula, deleted it, broke up the email address into the user name and domain name parts, upper-cased the domain name, and reconnected the two parts, re-inserting the @ sign. The machine must follow these steps but a human being instinctively knows where to apply the cut. Some researchers believe the brain executes those steps at warp speed but I don’t buy it.
Next, I dropped the domain names from the split-and-spliced email addresses, to get ready to match against Facebook usernames. Sheesh, the client did not ask whether Facebook usernames are case-sensitive. (They aren’t.) I proceeded to merge the two lists.
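For the curious, here is roughly what that split-and-splice step looks like in code; this is a sketch with a made-up helper and sample addresses, not my actual script.

```python
# A sketch of the compromise: upper-case only the domain (case-insensitive by
# the standard), keep the user part as typed, then drop the domain to form a
# username for matching. The helper name and sample addresses are made up.
def split_email(address: str):
    user, _, domain = address.partition("@")
    return user, domain.upper()

for addr in ["John.Smith@Gmail.com", "davidcolumbus@Yahoo.com"]:
    user, domain = split_email(addr)
    print(f"{user}@{domain}  ->  match key: {user.lower()}")
# Facebook usernames are not case-sensitive, so lower-casing the user part
# before comparing it against the scraped usernames is safe.
```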
I executed a “full outer join.” With this procedure, any username that appears on one or both of the lists finds its way to the output dataset. On this first attempt, nothing merged. Even though username “davidcolumbus,” say, lived on both lists, the computer did not combine the data; the two rows sat one on top of the other.
I took a deep breath, for I had reached a point where I must be honest with myself. This project was sure to bust the two hours I originally allotted. The merge could easily take another hour, maybe two, if no new issues emerged.
The matching rows did not combine because the computer only joins identically named columns. Since the Facebook and email usernames are different entities, those columns carry different labels.
But syncing those labels solves one problem while creating another! Members who appear on only one list have only one of the usernames. Besides, Facebook usernames are unique while email usernames, when detached from their domains, are not. A better solution is to set up a third username column in both lists, whose purpose in life is to be the matching key.
What about the other columns? Did I want them combined or not? Take, as an example, first and last names, which show up on both lists. If I standardized the labels of these columns, the computer would attempt to merge them. What if David Columbus appeared as Dave Columbus on the other list, with matching usernames? Forcibly combining the name columns would cause one of these variations to be dropped. If I wanted to keep both spellings, I had to retain all the name columns, which happens if I assign them distinct labels, exactly the opposite of what I did with the username columns.
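Here is a small sketch, with invented column names, of both choices side by side: a shared key column so the usernames do match, and suffixed name columns so both spellings survive.

```python
import pandas as pd

# Invented miniature lists, one member on each, with differing name spellings.
fb = pd.DataFrame({"fb_username": ["davidcolumbus"], "name": ["Dave Columbus"]})
em = pd.DataFrame({"email_username": ["davidcolumbus"], "name": ["David Columbus"]})

# The third column whose purpose in life is to be the matching key.
fb["match_key"] = fb["fb_username"]
em["match_key"] = em["email_username"]

# Distinct suffixes keep the two name columns (and both spellings) separate.
merged = fb.merge(em, on="match_key", how="outer", suffixes=("_fb", "_email"))
print(merged[["match_key", "name_fb", "name_email"]])
```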
If that isn’t confusing enough, I stumbled upon another issue. In the Email list, while most names appeared as “First <space> Last,” there were examples of “Last <space> First”, and “Last <comma> First”, and “First Initial <space> Last,” and so on. As an analyst, your first thought is “What’s wrong with our designers? Why didn’t they create separate text boxes for first and last names?” Then, you accept that blame gets you nowhere; you still have to fix what’s broken.
A soft voice enters your head. You wish you hadn’t seen the problem. You hope it was just a bad dream. But you wake up.
In front of me I had two paths. I could follow path A, which meant developing code to automatically detect the various anomalies and fix them. This path would take hours. Which is the first name in “Scott Lewis”? How would a computer figure this out? What rule could apply generally?
And then, there was path B, better known as handcrafting. If I had 1,000 rows of data, and if it took two seconds to scan a name and determine the type of anomaly, I would have completed the exercise in 30 minutes or so.
I chose path B. It was ugly and unsexy but more of a sure thing.
I wish I could tell you I stopped looking. But I couldn’t help it. Some cultures embrace compound surnames, like “De” something or “Von” something. My code was parsing “Chris De Jong” as first name Chris and last name Jong. I needed a more complex rule. Something like “If the name has three words, take the first as the first name, and the last two as the surname.” This rule runs afoul of someone like “Mary Anne Rutherford.” At a crossroads again. I could teach the computer how to lump in the middle name, or I could exercise my brain some more.
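For illustration only, here is the kind of rule I was contemplating. The particle list and the fallback are my guesses, and the sketch still gets plenty of real names wrong, which is why handcrafting won.

```python
# The particle list and the fallback rule are assumptions; names such as
# "Jo Anne van der Berg" would still defeat this sketch.
PARTICLES = {"de", "van", "von", "der", "da", "di"}

def split_name(full_name: str):
    parts = full_name.split()
    if len(parts) == 2:
        return parts[0], parts[1]
    if len(parts) >= 3 and parts[-2].lower() in PARTICLES:
        # "Chris De Jong" -> ("Chris", "De Jong")
        return " ".join(parts[:-2]), " ".join(parts[-2:])
    # Fallback guess: last word is the surname, the rest is the given name.
    # "Mary Anne Rutherford" -> ("Mary Anne", "Rutherford")
    return " ".join(parts[:-1]), parts[-1]

for name in ["Scott Lewis", "Chris De Jong", "Mary Anne Rutherford"]:
    print(name, "->", split_name(name))
```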
By this time, I was exhausted. If you have followed me to this point, you have my admiration. In the next installment, I shall finish the assignment.
At college reunions in beautiful Princeton on a glorious sunny day.
I also spoke about data science at a Faculty-Alumni panel titled "Science Under Attack!". Here is what I said:
In the past five to 10 years, there has been an explosion of interest in using data in business decision-making. What happens when business executives learn that the data do not support their theories? It turns out that the reaction is similar to what other panelists have described: science under attack! When I bring data into the boardroom, the data are measuring something, which means the data are measuring someone; and you can bet that someone isn't too happy about being measured. My analysts face endless debates, wild goose chases, and requests to conduct one analysis after another until the managers find the story they like.
I think two reasons for the gap between data analysts and business managers, who are often non-technical people, are (a) a communications gap and (b) the nature of statistics as a discipline.
Imagine you have to sell a product to Koreans in Korea. You don't speak a word of Korean and your counterpart does not speak English. What would you do? You'd probably hire a translator to deliver your sales pitch in Korean. What you wouldn't do is stay in Korea for a year, teach the counterpart English, and then give your original pitch in English. But that is exactly what many data analysts are doing today. When challenged about our findings, we try to explain the minute details of how the statistical output is generated, effectively teaching managers math. And we are not succeeding. I have spent much of my career thinking about how to bridge this gap, how to convey technical knowledge to a non-technical audience.
The second reason for the gap is the peculiar nature of statistical science. What we offer are educated guesses based on a pile of assumptions. This is because statistics is a science of incomplete information. We can never produce a definitive answer because we simply do not have all the data we need. But this creates an opening for people who are pre-disposed to oppose our conclusions to nitpick our assumptions.
I also want to bring up a different threat to science, one that comes with the era of Big Data. This is a threat from within, not from without.
The vast quantity of data is generating lots of analyses by lots of people, most of which are false. A nice illustration of this is the website tylervigen.com. This guy dumped a lot of publicly available data into a database, and asked the computer to select random pairs of variables and compute the correlation between them. For example, one variable might be U.S. spending on science, space and technology, and the other suicides by hanging, strangulation or suffocation. You know what, those two variables are extremely correlated, to the tune of 99.8%.
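You can reproduce the phenomenon with a few lines of simulation: generate many unrelated random series and screen for the most correlated pair. The specific numbers below are arbitrary; the point is that the winning correlation is routinely huge even though nothing is related to anything.

```python
# A toy illustration of the screening effect behind spurious correlations.
import numpy as np

rng = np.random.default_rng(0)
n_series, n_years = 200, 10
series = rng.normal(size=(n_series, n_years)).cumsum(axis=1)  # unrelated random walks

corr = np.corrcoef(series)
np.fill_diagonal(corr, 0)           # ignore each series' correlation with itself
print(f"Highest correlation among {n_series * (n_series - 1) // 2} pairs: "
      f"{np.abs(corr).max():.3f}")
```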
Another aspect of Big Data analysis deserves attention: many of these analyses do not have a correct answer. Take Google's PageRank algorithm, which powers the famous search engine. PageRank is supposed to measure the "authority" of a webpage. The model behind the algorithm assumes that the network of hyperlinks between webpages provides all the information needed to measure authority. But no one can verify how accurate the PageRank metric is because no one can tell us the true value of authority.
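For readers who want to see the mechanics, here is a toy version of the PageRank computation on a made-up four-page link graph. The scores come entirely from the hyperlink structure, and there is no external "authority" to check them against.

```python
# Minimal PageRank sketch (power iteration) on a hypothetical link graph.
import numpy as np

# links[i, j] = 1 if page i links to page j (made-up graph, no dangling pages)
links = np.array([[0, 1, 1, 0],
                  [0, 0, 1, 0],
                  [1, 0, 0, 1],
                  [0, 0, 1, 0]], dtype=float)

transition = links / links.sum(axis=1, keepdims=True)  # row-stochastic matrix
d, n = 0.85, links.shape[0]                             # standard damping factor

rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = (1 - d) / n + d * rank @ transition

print(rank)  # higher value = more "authority", with no ground truth to verify
```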
In the case of PageRank, we may be willing to look past our inability to scientifically validate the method because the search engine is clearly useful and successful. But I'd submit that many Big Data analyses are similarly impossible to verify; in many cases, they may not be useful, and in the worst cases, they may even be harmful.
Last time we heard about Deflategate on this blog, Warren Sharp compiled some statistics on fumble rates, showing that the Patriots were unusually good at avoiding fumbles. (link, link) I thought the level of analysis was "above average" and remarked that statistical evidence of this type can only get you so far. The metric is indirect, and it does not speak to causation.
The official investigators have now issued their report. The New York Times has its coverage here. As one reader commented, this article, currently nearing 800 comments, has drawn more comments than most articles on far weightier subjects. The NYT article is one of the better ones out there on this subject.
Two sets of new evidence have emerged.
The first, which is getting most of the headlines and attention, is a set of text messages between two Patriots employees who discussed their deflating operation. These text messages are highly incriminating for the two involved and, for me, also incriminating for Tom Brady, the team's superstar quarterback (who refused to release his own text messages or other correspondence to the investigators). The text-messaging evidence shores up the causal case in a way that numbers by themselves could never accomplish.
The takeaway from the text evidence is the power of "metadata." Metadata is data about the text messages (sender, recipient, date and time of sending, length, etc.), as distinct from the content of the texts. Metadata went mainstream when the U.S. government was revealed to have been massively scooping up metadata on domestic phone calls while denying that it collected the contents of said phone calls (see this coverage, for example). The investigators can use metadata to learn who else is in the circle of insiders, how often they communicate, when they communicate, and so on. Notice that these pertinent questions do not require knowing the contents of the texts themselves. (This is not to say the contents are unimportant; at a minimum, they are needed to zoom in on the relevant texts.)
But the investigators could not determine when the deflation operation started, how often it occurred, or its full scope. This likely has to do with selective disclosure of text messages by selected parties (e.g., none from Brady).
Another takeaway is the inherent bias in surveillance data. Simply put, you only know what you can measure, and there is much that is not being measured. To get the "full scope," the investigators would need phone records, emails, and even wiretap evidence following the key players around (just kidding).
The second set of evidence is also extremely important to the story, but it has received far less attention. One reason I like the NYT coverage is that the reporter gets to this evidence before talking about the text messages. For the first time, I see direct evidence of football tampering. The NFL rule requires footballs to be inflated to between 12.5 and 13.5 pounds per square inch. According to the NYT report, after the Colts raised suspicion at half-time of the Patriots-Colts matchup, all of the footballs were found to be underinflated (below 12.5 psi), with a minimum value of 10.5.
This is the first time I have seen a clear admission that all of the footballs were underinflated. This is much more convincing evidence that someone tampered with the footballs than any of the fumble analyses.
Further, the referee had already gauged the balls before the game, and at that time found all of the Colts-supplied footballs to be at about 13 psi, with only two of the Patriots-supplied footballs under-inflated.
Once tampering is established, the investigators can move on to finding the cause. Here, they are helped by videotapes from surveillance cameras, and also the texts.
One nitpick about this sentence from the article: 'The report uses the nebulous phrase “more probable than not” several times in making its conclusions.' To a statistician, this is a very precise statement, not nebulous at all! I interpret the investigators to mean there is a more than 50% chance. That is the standard of "preponderance of evidence."
FiveThirtyEight has a lengthy discussion of the report. They helpfully showed a screenshot of the measured ball pressures:
Harvard Business Review devotes a long article to customer data privacy in the May issue (link). The article raises important issues, such as the low degree of knowledge about what data are being collected and traded, the value people place on their data privacy, and so on. In a separate post, I will discuss why I don't think the recommendations issued by the authors will resolve the issues they raised. In this post, I focus my comments on an instance of "story time", some questions about the underlying survey, and thoughts about the endowment effect.
Much of the power of this article comes from its reliance on survey data. The main survey used here was conducted in 2014 by frog, the "global product strategy and design agency" that employs the authors. They "surveyed 900 people in five countries -- the United States, the United Kingdom, Germany, China, and India -- whose demographic mix represented the general online population". (At other points in the article, the authors reference different surveys, although no survey other than this one is explicitly described.)
Story time is the moment in a report on data analysis when the author deftly moves from reporting a finding from the data to telling stories based on assumptions that do not come from the data. Some degree of story-telling is required in any data analysis, so readers must be alert to when "story time" begins. Conclusions based on data carry different weight from stories based on assumptions. In the HBR article, story time begins right below the large graphic titled "Putting a Price on Data".
The graphic presented the authors' computation of how much people in the five nations value their privacy. They remarked that the valuations have very high variance. Then they said:
We don't believe this spectrum represents a "maturity model," in which attitudes in a country predictably shift in a given direction over time (say, from less privacy conscious to more). Rather, our findings reflect fundamental dissimilarities among cultures. The cultures of India and China, for example, are considered more hierarchical and collectivist, while Germany, the United States and the United Kingdom are more individualistic, which may account for their citizens' stronger feelings about personal information.
Their theory that there are cultural causes for the differential valuation may or may not be right. The maturity model may or may not be right. Their survey data cannot tell us whether there is a cultural basis for the observed gap. This is classic "story time."
I wonder if the HBR editors reviewed the full survey results. As a statistician, I think the authors did not disclose enough details about how their survey was conducted. There are lots of known unknowns: we don't know the margins of error on anything, we don't know the statistical significance of anything, we don't know whether the survey was conducted online, we don't know how most of the questions were phrased, and we don't know how respondents were selected.
What we do know about the survey raises questions. Nine hundred respondents spread over five countries is a tiny poll. Gallup surveys 1,000 people in the U.S. alone. If the 900 were spread evenly across the five countries, the survey would have fewer than 200 respondents per country. A rough calculation gives a margin of error of at least plus or minus 7 percent. If the sample is proportional to population size, then the margin of error for a smaller country like the U.K. would be even wider.
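Here is the rough arithmetic behind that plus-or-minus 7 percent, using the standard worst-case (p = 0.5) margin of error for a proportion at the 95% level.

```python
# Rough margin-of-error calculation for ~180 respondents per country.
import math

n = 900 // 5                              # ~180 respondents if evenly split
moe = 1.96 * math.sqrt(0.5 * 0.5 / n)     # worst-case proportion p = 0.5
print(f"n = {n}, margin of error = +/-{moe:.1%}")   # about +/-7.3%
```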
The authors also claim that their sample is representative of the "demographic mix" of the "general online population." This is hard to believe since they have no one from South America, Africa, the Middle East, Australia, and so on.
The graphic referenced above, "Putting a Price on Data," supposedly gives a dollar amount for the value of different types of data. Here is the top of the chart to give you an idea.
The article said "To see how much consumers valued their data, we did conjoint analysis to determine what amount survey participants would be willing to pay to protect different types of information." Maybe my readers can help me understand how conjoint analysis is utilized for this problem.
A typical usage of conjoint is for pricing new products. The product is decomposed into attributes so for example, the Apple Watch may be thought of as a bundle of fashion, thickness, accuracy of reported time, etc. Different watch prototypes are created based on bundling different amounts of those attributes. Then people are asked how much they are willing to pay for different prototypes. The goal is to put a value on the composite product, not the individual attributes.
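To make the mechanics concrete, here is a toy ratings-based conjoint with invented attributes and made-up ratings; a linear model recovers the implied value (part-worth) of each attribute. How the authors got from something like this to a dollar value for each type of personal data is the part I'd like explained.

```python
# Toy ratings-based conjoint: respondents rate bundles of attributes, and a
# linear model recovers each attribute's part-worth. All numbers are invented.
import numpy as np

# Each row is a hypothetical prototype: [premium design, extra battery, GPS]
profiles = np.array([[1, 0, 0],
                     [0, 1, 0],
                     [0, 0, 1],
                     [1, 1, 0],
                     [1, 0, 1],
                     [0, 1, 1]], dtype=float)
# Made-up willingness-to-pay ratings for each prototype (dollars)
ratings = np.array([120, 90, 60, 200, 170, 140], dtype=float)

X = np.column_stack([np.ones(len(profiles)), profiles])   # intercept + attributes
coef, *_ = np.linalg.lstsq(X, ratings, rcond=None)
print(dict(zip(["base", "design", "battery", "gps"], coef.round(1))))
```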
Also interesting is the possibility of an "endowment effect" in the analysis of the value of privacy. We'd really need to know the exact questions that the survey respondents were asked to be sure. It seems that people were asked how much they would pay to protect their data, i.e. to acquire privacy. In this setting, you don't have privacy and you have to buy it. A different way of assessing the same issue is to ask how much money you would accept to sell your data. That is, you own your privacy to start with. The psychologist Daniel Kahneman and his associates pioneered research showing that the values obtained by these two methods are frequently far apart!
In a classic paper (1990), Kahneman et al. told one group of people that they had been given a mug, and asked how much money they would accept in exchange for it (the median was about $7). Another group of people were asked how much they were willing to pay to acquire a mug; the median was below $3.
Is this the reason why businesses keep telling the press that we don't have privacy and have to buy it, as opposed to saying we have privacy and can sell it at the right price?
Despite my reservations, the HBR piece is well worth your time. It raises many issues about data collection that you should be paying attention to. Read the whole article here.
This is a supplement to the previous post about a new research paper on the effect of Alcoholics Anonymous and the New York Times exposition that I commented on. A misreading of that article led me to complain about per-protocol analysis, which wasn't the methodology behind the Humphreys et al. research. In this post, I will explain their methodology, known as instrumental variables analysis.
In the last post, I showed this hypothetical situation, involving patients who "cross over" (disobey treatment assignment) in a randomized experiment.
In the paper, actual treatment is measured by the change in frequency of attending AA meetings (relative to baseline).
Because initial treatment assignment (rows) is random, one expects that equal proportions of people would have moved out of state, got married, got divorced, etc. Similarly, one expects equal proportions of people to have increased AA attendance. But in the table above, 90% of people in the treatment arm upped attendance while only 60% of those assigned to no treatment increased attendance. (The researchers use a continuous scale of frequency rather than proportions, but the concept is the same.)
Of course, the random assignment to treatment itself is a cause of higher relative attendance. People are told to go to AA meetings. But there are other reasons for increased attendance, such as self-motivation leading those in the no-treatment arm to cross over.
In ITT analysis, you ignore the actual attendance, and analyze how treatment assignment affects the amount of drinking.
Alternatively, one can run a regression of amount of drinking (relative to baseline) on frequency of AA meetings. This will yield a result such as "the more meetings someone attends, the less they drink". The problem with this analysis is that while the initial assignment is random, the actual attendance is tainted by selection bias.
Instead of using the actual frequency of AA meetings as a regressor, the instrumental variables (IV) analysis uses a predicted frequency of AA meetings. The prediction is itself a regression of the actual frequency of AA meetings on treatment assignment and demographic variables. In other words, we only care about the portion of the variability in AA attendance that can be explained by the random assignment (controlling for the demographic variables). The remaining variability (due to self-motivation, etc.) is left on the table.
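Here is a sketch of that two-stage logic on simulated data. The variable names and effect sizes are made up, and the paper's actual models are more elaborate; the point is only to show why the IV estimate differs from the naive regression.

```python
# Two-stage (IV-style) sketch on simulated data; everything here is invented.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
assigned = rng.integers(0, 2, n)          # random assignment to AA referral
motivation = rng.normal(size=n)           # unobserved self-motivation

# Actual change in attendance depends on assignment AND motivation (selection)
attendance = 1.0 * assigned + 0.8 * motivation + rng.normal(size=n)
# Drinking falls with attendance, and also falls with motivation directly
drinking = -0.5 * attendance - 0.7 * motivation + rng.normal(size=n)

def ols(y, x):
    X = np.column_stack([np.ones(len(y)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]   # [intercept, slope]

naive = ols(drinking, attendance)[1]              # biased: picks up motivation
stage1 = ols(attendance, assigned)
predicted = stage1[0] + stage1[1] * assigned      # only assignment-driven variation
iv = ols(drinking, predicted)[1]
print(f"naive slope {naive:.2f}, IV slope {iv:.2f} (true effect -0.5)")
```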
This is the "correction" that Frakt inferred in the New York Times article. I think Frakt is correct that the conclusion can be applied only to those who obey the protocol but I don't think the researchers drop all non-compliers from the dataset.
Also, Humphreys et al. seem to be at odds with the author of The Atlantic article, as they say "The long-established positive association between AA involvement and better outcomes was therefore consistent with, but did not prove, causation."
[After communicating with Frakt, Humphreys, and Dean Eckles, I realize that I was confused by Frakt's description of the Humphreys paper, which does not perform PP analysis. So when reading this post, consider it a discussion of ITT versus PP analysis. I will post about the Humphreys methodology separately.]
The New York Times plugged a study of the effectiveness of Alcoholics Anonymous (AA) (link). The author (Austin Frakt) used this occasion to advocate "per-protocol" (PP) analysis over "intent-to-treat" (ITT) analysis. He does a good job explaining the potential downside of ITT, but gets into a mess explaining PP and never properly addresses its downside. It's an opportunity missed because I fear the article confuses readers even more on an important topic.
The key issue at play is non-compliance in a randomized experiment. If some patients are assigned to AA treatment and others are assigned to some other treatment, typically some subset of patients will "cross over" (or drop out altogether), and usually such cross-over is associated with the outcome being measured. For example, a patient assigned to AA treatment felt that AA was not working and switched to the other treatment on his or her own; or vice versa.
ITT and PP differ in how they deal with the subset of non-compliers. In ITT, you analyze everyone in the experiment based on their initial assignment, ignoring non-compliance. In PP, you drop all non-compliers from the study, and analyze the subset of compliers only. (Each analysis is "extreme" in its own way.)
Between these two, I usually prefer ITT. The PP analysis answers the question: "If everyone complied with the treatment, what would be its effect?" I don't find the assumption of zero non-compliance realistic. ITT answers a different question: "Of those who are assigned the treatment, what is the expected effect?" This effect is an average over those who complied and those who did not, weighted by the proportion of compliers.
Frakt lost me when he said:
In a hypothetical example, imagine that 50 percent of the sample receive treatment regardless of which group they've been assigned to. And likewise imagine that 25 percent are not treated no matter their assignment. In this imaginary experiment, only 25 percent would actually be affected by random assignment.
First of all, the arithmetic does not work. If we ignore assignment as he suggested in the first two sentences, then the patients can either have received treatment or not. But 50 percent plus 25 percent leaves 25 percent of the patients unaccounted for.
Here is an illustration of what I think Frakt wanted to get across:
Of the 50% assigned to the treatment, 90% (45 out of 50) complied and 10% crossed over. Of the other half initially assigned to no treatment, 60% (30 out of 50) crossed over to the treatment. All in all, 75% of the study population received treatment and 25% did not... regardless of their initial assignment.
In an ITT analysis, all patients in the table are analyzed. We compare the top row with the bottom row. By contrast, in a PP analysis, we only analyze the patients along the top-left, bottom-right diagonal, namely, the 65% of the patients who complied with the assigned treatment. So, we compare the top left corner with the bottom right corner.
The important question is whether this 65% subset constitutes a random sample. Frakt implies it is: "only 25 percent [i.e. 65 percent in my example] would actually be affected by random assignment." Maybe when he said "affected by," he didn't really mean random, because it should be obvious that treatment is no longer randomized within the 65% subset.
If the 65% subset were randomly drawn from the initial population, we should still have equal proportions of treated versus non-treated but in fact, we have 70% treated versus 30% not treated. Said differently, the not-treated patients are more likely to cross over than the treated patients.
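To pin down the two analyses, here is the hypothetical table in code, with entirely made-up success rates added so you can see how the ITT and PP comparisons are formed, and why the PP number comes out rosier.

```python
import numpy as np

# Counts per 100 patients: rows = assigned (treatment, control),
# columns = actually treated (yes, no).
counts = np.array([[45,  5],    # assigned treatment: 45 complied, 5 crossed over
                   [30, 20]])   # assigned control: 30 crossed over, 20 complied

compliers = counts[0, 0] + counts[1, 1]
print(f"compliers: {compliers}%")                                   # 65%
print(f"treated among compliers: {counts[0, 0] / compliers:.0%}")   # ~70%, not 50%

# Invented success rates for each of the four cells, purely for illustration.
success = np.array([[0.60, 0.30],
                    [0.55, 0.40]])

# ITT: compare the rows (everyone, grouped by initial assignment).
itt = ((success[0] * counts[0]).sum() / counts[0].sum()
       - (success[1] * counts[1]).sum() / counts[1].sum())
# PP: compare only the two complier cells on the diagonal.
pp = success[0, 0] - success[1, 1]
print(f"ITT effect: {itt:.2f}, PP effect: {pp:.2f}")   # PP looks rosier
```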
Cross-over isn't something that happens randomly. Patients are assessing their own health during the experiment, and thus, the opting out is frequently related to the observed (albeit incomplete) outcome.
In the article, Frakt states that the study by Humphreys et al. "corrects for crossover by focusing on the subset of participants who do comply with their random assignment". I call this "filtering" rather than "correcting".
Does analyzing this subset lead to an accurate estimate of the treatment effect? I don't think so.
By filtering out the cross-overs, the researchers introduce a survivorship bias. If patients cross over because they are unhappy with their assigned treatment, then these patients, had they been forced to continue the original treatment, would likely have below-par outcomes compared with those who did not cross over. In a PP analysis, this subset is removed. Practically, this means that the treatment effect estimated by the PP analysis is too optimistic.
Frakt is careless with his language when it comes to discussing the downside of PP analysis. He says (my italics):
it’s not always the case that the resulting treatment effect is the same as one would obtain from an ideal randomized controlled trial in which every patient complied with assignment and no crossover occurred. Marginal patients may be different from other patients...Despite the limitation, analysis of marginal patients reflects real-world behavior, too.
"Not always" leaves the impression that PP analysis is usually right except for rare situations. Note how he uses the word "limitation" above (paired with "despite"), and below, when discussing ITT analysis:
For a study with crossover, comparing treatment and control outcomes reflects the combined, real-world effects of treatment and the extent to which people comply with it or receive it even when it’s not explicitly offered. (If you want to toss around jargon, this type of analysis is known as “intention to treat.”) A limitation is that the selection effects introduced by crossover can obscure genuine treatment effects.
The choice of words leaves the impression that ITT is more limited than PP when both analyses suffer from problems arising from the same source: patients with worse outcomes are more likely to cross over.
Many readers of the NYT article linked to a much longer article in The Atlantic. It appears that the scientific evidence on AA is very weak.
In my latest piece for Harvard Business Review (link), I tackle this common problem in the interactions between data scientists and business managers:
A typical big data analysis goes like this: First, a data scientist finds some obscure data accumulating in a server. Next, he or she spends days or weeks slicing and dicing the numbers, eventually stumbling upon some unusual insights. Then, a meeting is organized to present the findings to business managers, after which the scientist feels disgruntled or even disrespected while the managers wish they could have the time back.
Using analyses of the popular baby names dataset as an example, I contrast the kind of analysis that generates clickbait (e.g. the most "poisoned" names, the most "trendy" names) with the kind of analysis that generates potentially real business value.
The American Association for Public Opinion Research (AAPOR) put out its Big Data report last month (link). This one is worth reading. It has some of the most current citations, and readers of this blog will be very receptive to its core messages. The team that wrote the report is a mix of academics and practitioners.
In Big Data, there are many self-evident truths, according to the people who talk about Big Data. One of these is the idea that Big Data will make surveys obsolete. How could it not, when Big Data means you have hundreds if not thousands of times more "respondents", the ability to track trends in "real time", and the ability to evolve your survey questions? This AAPOR report, I suppose, is a response to such claims.
Then there are those who say surveys were merely a suboptimal stopgap in the old days of "small data". They explain that surveys measure "stated preferences" while Big Data (i.e. found data, observational data) measures "revealed preferences", and that aiming for the latter is self-evidently better. Revealed preferences are closer to "the truth". Surveys merely represent an approximation, and with Big Data, we no longer need to run them.
This topic nicely ties in with Chapter 1 of Numbers Rule Your World (link). In that chapter, I explain the success of Disney in keeping customers happy despite having to wait two hours for rides that last two minutes. The "imagineers" realized that managing perception is even more important than optimizing reality. Customer happiness improves even though measured waiting times (i.e. revealed information) worsen or stay the same.
If Disney relied solely on revealed metrics, the data would say customers are waiting longer, or just as long, as before. When Disney conducts surveys and asks people how they feel, they say they are waiting less and so are happier. Feelings are crucial data that are not revealed by any observations. To the extent that feelings can be "revealed", it requires an assumption on the part of the observer, for example: because measured waiting time went down, customers must be feeling happier.
On a website, the web log has measured data that reveal the paths of users through the website. One can observe where traffic drops off. But that is not enough. In order to reduce attrition, the designer needs to understand why users exit. Surveys provide the answer here.
What if you learn that users exit by clicking on the home page icon? So you test a version of the design in which the home page icon is not clickable. You observe that the exit rate has fallen significantly in the new version. The problem is that users now find a different way to exit. Relying only on revealed preferences frequently leads to superficial actions that cure symptoms but not root causes.
Revealed preferences and stated preferences are two different dimensions, and both have strengths and weaknesses. Logs are bigger and faster, but researchers have no control over the composition of the responders. Neither is a substitute for the other. I am interested in seeing work on integrating the two approaches. The AAPOR report has a good discussion of this subject, plus a few references to new work on integration.