This piece is part of the StatBusters column written jointly with Andrew Gelman. (I hope they fix the labeling soon.) In it, we discuss two recent studies on data privacy that lead to contradictory conclusions. How should the media report such surveys? Is the brand name of the organization enough? In addition, we debunk the notion that consumers will definitely get something valuable out of sharing their data.
In the last installment, I embarked on a project--perhaps only a task--to assemble a membership list for an organization. It sounded simple: how hard could it be to merge two lists of people? Of course, I couldn't just stitch one list on top of the other, since some members subscribed to the newsletter as well as joined the Facebook group. Those duplicate rows had to be merged so that each individual occupies exactly one row of data.
With barely a sweat, I blew past my initial budget of two hours.
After a half day, I produced a merged list by matching Facebook usernames to email usernames. It felt like running an obstacle course, with one annoying issue popping up as soon as another was resolved: stray punctuation, ambiguous names, case sensitivity, and so on. Most of these problems lacked clear-cut solutions. Some periods (full stops) were redundant, but not all; some middle names were part of the last name, but not all. Tick, tick, tick, tick. These data issues demanded consideration, and considerable time.
At the start of Day 2, I executed a planned U-turn. Starting again with the two lists of people, I attempted to match on first and last names. I had tried usernames as the key first because only a small portion of the Email list included names; however, a match on first and last names inspires more confidence than a match on usernames.
Immediately, I stepped into text-matching quicksand. I had to process the Facebook names (previously scraped) the same way I had fixed up the names in the Email list.
As before, I tried a “full outer join.” Disaster. The output data had a crazy number of rows. I sensed missing values. Sure enough, there were some Facebook members for whom I did not have names (for example, they provided names in Chinese or Korean characters.) Each of these members with missing names matched, erroneously, the whole set of email subscribers who also did not provide names.
One way out of this mess was to extract only people with non-missing names from each list, and then merge those subsets. This path was not easy, though. In effect, I had created four types of members: those with matching names on both lists; those with a Facebook name that matched no email name; those with an email name that matched no Facebook name; and those who provided no usable name on either list.
The challenge was to combine those four groups of members in such a way that each unique member occupies just one row of data. For each such member, I also wanted to gather all other information from both the Facebook and Email lists. This required defining a number of dummy (indicator) columns, as well as columns indicating the source of each piece of data.
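For the curious, here is a minimal sketch of this filter-then-merge step in pandas. The column names and the tiny sample lists are made up for illustration; the real lists were, of course, larger and messier.

```python
import pandas as pd

# Hypothetical miniature versions of the two lists (real column names differed).
fb = pd.DataFrame({
    "fb_username": ["dcolumbus", "mrutherford", "user123"],
    "full_name":   ["David Columbus", "Mary Anne Rutherford", None],  # None: no usable name
})
email = pd.DataFrame({
    "email_username": ["davidcolumbus", "chrisdejong", "subscriber9"],
    "full_name":      ["David Columbus", "Chris De Jong", None],      # None: no name supplied
})

# Merge only rows that actually have names. A full outer join on a key with
# missing values would cross-match every nameless Facebook member against every
# nameless email subscriber, which is what inflated the row count earlier.
fb_named = fb.dropna(subset=["full_name"])
email_named = email.dropna(subset=["full_name"])
merged = fb_named.merge(email_named, on="full_name", how="outer", indicator=True)

# The indicator column labels the first three groups: "both", "left_only"
# (Facebook name with no email match), and "right_only" (email name with no
# Facebook match). The fourth group -- no usable name on either list -- is
# appended without matching.
master = pd.concat(
    [merged.drop(columns="_merge"),
     fb[fb["full_name"].isna()],
     email[email["full_name"].isna()]],
    ignore_index=True,
)
print(master)
```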
I experienced a soothing satisfaction when the output data appeared as expected.
But the job was not yet finished. I ended up with two merged lists, one based on username matching, and the other on name matching. It was time to merge the merged. I'll spare you the details, most of which resembled the above.
Knowing my client’s name was on the list, I looked him up. There he was, again and again, occupying four or five rows. This might make your heart sink since I had tried so hard to maintain one row per member. But don’t worry. I was simplifying things a little bit. If someone provided multiple email addresses, as my client did, I had decided to keep all of them.
At long last, the master list of members was born. This exercise bore instant rewards. It is very useful to know which members are on both lists and which members are on just one. We have a rough measure of how involved a member is. The hard work lies ahead since our goal is to gain a much deeper understanding of the members.
An organization wanted to understand its base of members, so the first order of business was constructing a database of all people who could be considered members. We decided to define membership broadly: members included those who joined the Facebook group and those who subscribed to the newsletter.
The organization kept two separate lists which I would merge to create a master list. For simplicity, I’ll call them the FB list, and the Email list. In merging, the key is the key. Let me explain. The simplest key is an email address. If someone’s email address shows up on both lists, then I infer that those entries concern the same person, and combine them. My goal is to remove double counting of anyone who appears on both lists.
Sounds simple enough.
But it's never that simple, right? First, the Facebook group is a graveyard of data. Facebook provides zero statistics on group members and activities. Yes, the company that makes a business out of data turns a deaf ear to the data-deprived group owners who have been pleading for years.
What is a data scientist to do? Scrape, that’s what. Members can find out who else is in the group by the scroll-wait-reset-scroll routine. You know that feeling. I know you do. You scroll to the bottom of a web page. Your browser gets the hint. It loads a few more items, while the slider floats away, usually to the wrong spot. You re-set the position, and scroll some more. After much scrolling, I scraped that page to compile the Facebook list. It’s got the name of the person, their Facebook username, and their location (when available).
Notice I didn't say email address. So the FB list did not contain the all-important key. Another possible key is first and last names. Reviewing the Email list, I realized that newsletter subscribers are not required to provide names, so matching names to the FB list would yield few hits. The third candidate is not as accurate: I tried matching the Facebook username to the email username.
The client furnished an Excel file, which I've been calling the Email list. Upon opening the list, I turned the email addresses into all uppercase letters. I have matched enough text data to know that people are hardly in control of their fingers when they type into web forms. "John", "JOHN", "joHN", "JOhn", and so on typically mean the same thing, regardless of case. (The occasional sadist offers "J0hn," or "Jhon," or "Jo hn.")
Meanwhile, the client wondered if email addresses are really case-insensitive. I suggested asking Google. The search engine gave an ambiguous answer. The part after the @ sign is case-insensitive whereas the part before @ is case-sensitive, but then most email providers treat both parts as case-insensitive.
It's rare that Google complicates your life. I fished out the UPPERCASE(email_address) formula, deleted it, broke each email address into its username and domain parts, upper-cased the domain, and reconnected the two parts, re-inserting the @ sign. The machine must follow these steps explicitly, whereas a human being instinctively knows where to apply the cut. Some researchers believe the brain executes those steps at warp speed, but I don't buy it.
Next, I dropped the domain names from the split-and-spliced email addresses to get ready to match against Facebook usernames. Sheesh, the client did not ask whether Facebook usernames are case-sensitive. (They aren't.) I proceeded to merge the two lists.
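To make the split-and-splice concrete, here is a small sketch in Python/pandas. The sample addresses are invented; the point is that only the domain is case-normalized, and the local part is reused (case-folded) as the matching key against Facebook usernames.

```python
import pandas as pd

def split_and_splice(address: str) -> str:
    """Upper-case only the domain part of an email address.

    The part before the @ is technically case-sensitive per the standard,
    even though most providers ignore case, so it is left untouched here.
    """
    local, _, domain = address.partition("@")
    return local + "@" + domain.upper()

emails = pd.Series(["John.Smith@Gmail.com", "DCOLUMBUS@yahoo.COM"])  # made-up examples
normalized = emails.map(split_and_splice)

# Drop the domain to get the email username -- the candidate key for matching
# against Facebook usernames. Since Facebook usernames are not case-sensitive,
# both sides of the match are lower-cased for comparison.
email_usernames = normalized.str.split("@").str[0].str.lower()
print(email_usernames.tolist())   # ['john.smith', 'dcolumbus']
```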
I executed a "full outer join." With this procedure, any username that appears in one or both of the lists finds its way to the output dataset. On this first attempt, nothing merged. Even though username "davidcolumbus," say, lived on both lists, the computer did not combine the data; the two records sat one on top of the other.
I took a deep breath, for I had reached a point where I must be honest with myself. This project was sure to bust the two hours I originally allotted. The merge could easily take another hour, maybe two, if no new issues emerged.
The matching rows did not combine because the computer only joins identically named columns. Since the Facebook and email usernames are different entities, those columns carry different labels.
But syncing those labels solves one problem while creating another! Members who appear on only one list have only one of the usernames. Besides, Facebook usernames are unique while email usernames, when detached from their domains, are not. A better solution is to set up a third username column in both lists, whose purpose in life is to be the matching key.
What about the other columns? Did I want them combined or not? Take, for example, the first and last names, which show up on both lists. If I standardized the labels of these columns, the computer would attempt to merge them. What if David Columbus appeared as Dave Columbus on the other list, with matching usernames? Forcibly combining the name columns would cause one of these variations to be dropped. If I wanted to keep both spellings, I had to retain all the name columns, which happens when I assign them distinct labels, which is exactly the opposite of what I did with the username columns.
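Here is a sketch, again with made-up data, of how those two opposite decisions play out in pandas: a shared match_key column lines the rows up, while suffixes keep both spellings of the name.

```python
import pandas as pd

fb = pd.DataFrame({"fb_username": ["davidcolumbus", "mruth"],
                   "name": ["Dave Columbus", "Mary Anne Rutherford"]})
email = pd.DataFrame({"email_username": ["davidcolumbus", "cdejong"],
                      "name": ["David Columbus", "Chris De Jong"]})

# A third column whose only purpose in life is to be the matching key; it has
# the same label in both tables, so the outer join lines the rows up.
fb["match_key"] = fb["fb_username"].str.lower()
email["match_key"] = email["email_username"].str.lower()

# The name columns keep distinct labels via suffixes, so "Dave Columbus" and
# "David Columbus" both survive instead of one overwriting the other.
merged = fb.merge(email, on="match_key", how="outer", suffixes=("_fb", "_email"))
print(merged[["match_key", "name_fb", "name_email"]])
```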
If that isn’t confusing enough, I stumbled upon another issue. In the Email list, while most names appeared as “First <space> Last,” there were examples of “Last <space> First”, and “Last <comma> First”, and “First Initial <space> Last,” and so on. As an analyst, your first thought is “What’s wrong with our designers? Why didn’t they create separate text boxes for first and last names?” Then, you accept that blame gets you nowhere; you still have to fix what’s broken.
A soft voice enters your head. You wish you hadn’t seen the problem. You hope it was just a bad dream. But you wake up.
In front of me I had two paths. I could follow path A: developing code to automatically detect the various anomalies and fix them. This path would take hours. Which is the first name in "Scott Lewis"? How would a computer figure this out? What rule could apply generally?
And then, there was path B, better known as handcrafting. If I had 1,000 rows of data, and if it took two seconds to scan a name and determine the type of anomaly, I would have completed the exercise in 30 minutes or so.
I chose path B. It was ugly and unsexy but more of a sure thing.
I wish I could tell you I stopped looking. But I couldn't help it. Some cultures embrace multi-word surnames, like "De" something or "Von" something. My code was parsing "Chris De Jong" as first name Chris and last name Jong. I needed a more complex rule, something like: "If the name has three words, take the first as the first name and the last two as the surname." This rule runs afoul of someone like "Mary Anne Rutherford." At a crossroads again. I could teach the computer how to handle the middle name, or I could exercise my brain some more.
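Something like the following sketch captures the rule I was toying with; it is illustrative only, and it deliberately flags the ambiguous cases for manual review rather than pretending to solve them.

```python
def split_name(full_name: str) -> tuple[str, str, bool]:
    """Heuristically split a full name into (first, last, needs_review).

    Two-word names are split the obvious way. For three-word names, guess that
    the last two words form the surname (handles "Chris De Jong"), but flag the
    row for review, since the same rule misfires on "Mary Anne Rutherford".
    """
    words = full_name.split()
    if len(words) == 2:
        return words[0], words[1], False
    if len(words) == 3:
        return words[0], " ".join(words[1:]), True   # a guess -- ask a human
    return full_name, "", True                        # anything else: punt to a human

for name in ["Scott Lewis", "Chris De Jong", "Mary Anne Rutherford"]:
    print(name, "->", split_name(name))
```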
By this time, I was exhausted. If you have followed me to this point, you have my admiration. In the next installment, I shall finish the assignment.
I spent a glorious sunny day at college reunions in beautiful Princeton.
I also spoke about data science at a Faculty-Alumni panel titled "Science Under Attack!". Here is what I said:
In the past five to 10 years, there has been an explosion of interest in using data in business decision-making. What happens when business executives learn that the data do not support their theories? It turns out that the reaction is similar to what other panelists have described - science under attack! When I bring data into the boardroom, the data are measuring something, which means the data are measuring someone; and you can bet that someone isn't too happy about being measured. My analysts face endless debates, wild goose chases, and requests to run one analysis after another until the managers find the story they like.
I think the gap between data analysts and business managers, who are often non-technical people, has two causes: (a) a communications gap and (b) the nature of statistics as a discipline.
Imagine you have to sell a product to Koreans in Korea. You don't speak a word of Korean and your counterpart does not speak English. What would you do? You'd probably hire a translator who would deliver your sales pitch in Korean. What you wouldn't do is stay in Korea for a year, teach the counterpart English, and then give your original pitch in English. But that is exactly what many data analysts are doing today. When challenged about our findings, we try to explain the minute details of how the statistical output is generated, effectively teaching managers math. And we are not succeeding. I have spent much of my career thinking about how to bridge this gap, that is, how to convey technical knowledge to a non-technical audience.
The second reason for the gap is the peculiar nature of statistical science. What we offer are educated guesses based on a pile of assumptions. This is because statistics is a science of incomplete information. We can never produce a definitive answer because we simply do not have all the data we need. But this creates an opening for people who are predisposed to oppose our conclusions to nitpick our assumptions.
I also want to bring up a different threat to science, one that comes with the era of Big Data. This is a threat from within, not from without.
The vast quantity of data is generating lots of analyses by lots of people, most of which are false. A nice illustration of this is the website tylervigen.com. This guy dumped a lot of publicly available data into a database, and asked the computer to select random pairs of variables and compute the correlation between them. For example, one variable might be U.S. spending on science, space and technology, and the other suicides by hanging, strangulation or suffocation. You know what, those two variables are extremely correlated, to the tune of 99.8%.
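This is not Tyler Vigen's actual code or data, but a quick simulation shows how easily such correlations arise by chance: generate many pairs of unrelated trending series (random walks, which many annual statistics resemble) and keep the strongest correlation found.

```python
import numpy as np

rng = np.random.default_rng(0)
n_years, n_pairs = 15, 1000   # e.g., 15 annual observations, 1000 random pairings

best = 0.0
for _ in range(n_pairs):
    x = np.cumsum(rng.normal(size=n_years))   # two independent random walks,
    y = np.cumsum(rng.normal(size=n_years))   # i.e., unrelated trending series
    best = max(best, abs(np.corrcoef(x, y)[0, 1]))

print(f"Strongest correlation among {n_pairs} unrelated pairs: {best:.3f}")
# With enough unrelated pairs, correlations above 0.9 turn up routinely.
```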
Another aspect of Big Data analysis deserves attention: many of these analyses do not have a correct answer. Take Google's PageRank algorithm, which powers the famous search engine. PageRank is supposed to measure the "authority" of a webpage. The model behind the algorithm assumes that the network of hyperlinks between webpages provides all the information needed to measure authority. But no one can verify how accurate the PageRank metric is because no one can tell us the true value of authority.
In the case of PageRank, we may be willing to look past our inability to scientifically validate the method because the search engine is clearly useful and successful. But I'd submit that many Big Data analyses are similarly impossible to verify; in many cases, they may not be useful, and in the worst cases, they may even be harmful.
I mentioned the Harvard Business Review article on business use of customer data in the "Big Data" era. In the previous post, I looked at the nature of the evidence used by the authors. In this post, setting aside my discomfort with some of that evidence, I examine the article's conclusions.
The article has a three-part structure: the first section describes the issues; the second section reports results from a few surveys conducted by frog - a global strategy and design agency - on various issues related to data privacy; and the third section presents examples of the recommendations they make to clients, offered here generally to businesses involved in collecting and monetizing customer data.
The survey results are revealing (although the sample size of 900 in five countries is tiny so I'm not sure you should believe them). The agency found that 97% of the people surveyed are concerned about businesses and governments mis-using their data. Seventy-two percent of Americans are reluctant to share information with businesses because they "just want to maintain their privacy".
The authors also learned that consumers have grossly under-estimated the extent of data collection. Only 25% of the respondents said they knew businesses tracked their location, and only 14% said they knew businesses shared their web-surfing history. Finally, the analysts attached dollar values to the privacy of different types of data.
I follow them up to this point. In fact, the authors summed it up very nicely at the beginning of the article: most [companies] prefer to keep consumers in the dark, choose control over sharing, and ask for forgiveness rather than permission.
Unfortunately, I am let down by the list of recommendations that follows. They feel to me like tweaks on failed ideas rather than paradigm shifts.
The first recommendation is to "educate the consumers". The authors gave the example of one of their own consulting clients, which requires "customers" to watch a video and give preliminary consent before sharing their own (genomic) data. The personal data are withheld until the "customer" returns a hard-copy agreement.
We don't need to be reminded that every day, we "voluntarily" sign Terms and Conditions which no ordinary person actually reads. Frequently, we are told not to use a website if we don't agree with any part of a lengthy agreement written in one-sided language favoring the business.
The "new" solution doesn't change the status quo. In fact, it gives businesses a stronger case for arguing that their users have voluntarily given up the right to their own data. In my view, until businesses confront the issue of properly disclosing how they collect data, what information is being collected, and how such data are being sold or traded, consumers will continue to find such practices creepy.
The second recommendation looks good on paper but is impractical. Another of frog's clients is featured here. This client allows customers to specify which pieces of data can go to whom.
Assume there are 100 variables (only!) being collected and five levels of access control. That amounts to 500 yes/no questions each user is required to answer in order to gain full control of the data. In practice, most users will decide not to bother because it is too complex and time-consuming. The solution is a form of suffocation by paperwork.
For the data analysts, such a solution creates headaches. It generates self-selected data of the worst kind. Each variable has its own source of bias as different subsets of users decide to withhold their data for their own reasons.
To implement such a system properly requires a herculean effort. Say I reviewed the list of 100 variables and divided them into five groups of 20 variables according to the five levels of control (from allowing anyone to see my gender to hiding my age from everyone). Two months later, I changed my mind and removed access to 80 of the 100 variables from everyone. Now, the database administrator should find all instances of those 80 variables and delete them. Some of the data may already have been sold to other entities, and what if those other entities re-sell my data after I asked for the data to be deleted by the original source?
The last recommendation is an argument that businesses should not need to pay users for their data. Given the finding in the second section that users assign meaningful dollar values to their data, this seems to be a solution for businesses rather than for consumers.
Pandora's free advertising-supported service is used as an example of customers' willingness to exchange their privacy for "in-kind value". The article failed to mention just how much money Pandora has been paying for such data! As this other HBR article tells us, Pandora is "13 years, 175 million users, little profit". It has never been able to establish a profitable business model because, while 80% of its revenues come from advertising to those "free" accounts, 60% of its revenues immediately go out the door as royalty payments for the "free" music! It's not surprising that many consumers willingly engage in this lop-sided exchange with Pandora.
I often wonder: if consumers realized that over-sharing their data works to their disadvantage, would they become more interested in how businesses use their data?
For instance, insurance companies will be very interested in acquiring data from personal analytics devices, like Fitbit. They will use the data to predict whether you have health risks, and they will charge you more for insurance. Everyone is at risk for something.
The Uber app gives its users the ability to track their drivers -- in Manhattan, it's like watching a horse-race when your driver tries to negotiate the city gridlock. The same data is used by Uber to get an accurate picture of supply and demand, which drives their surge-pricing algorithms. That's how you end up paying five to ten times the normal cab rate.
Businesses use personal data to reduce information asymmetry, which in the past prevented them from extracting maximum value from consumers.
Today, the data privacy question is phrased as "Company X would like to collect information about your heart rate and in exchange, you will get notified if any irregularity is detected. Are you willing to share such data with Company X?"
Imagine you are asked a different question: "Company X would like to collect information about your heart rate and in exchange, you will get notified if any irregularity is detected. Being notified of heart-rate irregularity may help you but 80% of the warnings will be false alarms. Also, your heart rate data will be used by our insurance arm to adjust your insurance premiums. There is a 50% chance that your premium will increase after sharing your data. Are you willing to share such data with Company X?"
I am a guest at the New School's Journalism + Design program this semester.
The students conducted interviews about the question of what makes someone famous. Their interviewees were asked to name five famous people. We had images of these people up on the wall.
Then, we put the pictures into clusters. We tried two different ways of doing it.
At the end, we compared our result to what a computer program generated.
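For readers wondering what the computer's side of the exercise might look like, here is a minimal sketch of hierarchical clustering, with entirely hypothetical people and feature scores; the class used whatever attributes the students chose.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical scores for each famous person on (entertainment, politics, sports).
people = ["Person A", "Person B", "Person C", "Person D", "Person E"]
features = np.array([
    [9, 1, 0],
    [8, 2, 1],
    [0, 9, 0],
    [1, 8, 1],
    [0, 1, 9],
])

# Hierarchical clustering groups people whose profiles sit close together.
tree = linkage(features, method="ward")
labels = fcluster(tree, t=3, criterion="maxclust")
for person, label in zip(people, labels):
    print(person, "-> cluster", label)
```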
Here are some interesting applications of cluster analysis in the press: FiveThirtyEight made clusters of the intra-season performance profiles of NFL quarterbacks; Pew Research Center made clusters of political values and attitudes.
In Part 1, I covered the logic behind recent changes to the statistical analysis used in standard reports by Optimizely.
In Part 2, I ponder what this change means for more sophisticated customers--those who are following the proper protocols for classical design of experiments, such as running tests of predetermined sample sizes, adjusting for multiple comparisons, and constructing and analyzing multivariate tests using regression with interactions.
For this segment, the choice of sticking with the existing protocol or not depends on many factors, such as the decision-making culture and corporate priorities. No matter what you do, it is important to realize that improved analysis tools do not obviate careful planning and execution.
Let me start with my advice: initially, keep running your tests to the usual fixed sample sizes. In essence, you ignore the stopping rule suggested by the Stats Engine. Over a series of tests, including some A/A tests, you can measure how often those stopping rules would have correctly ended the tests (relative to the fixed-size testing protocol). This allows you to estimate the "time saving" achieved from sequential testing.
As I pointed out in last year’s presentation at the Optimizely Experience, the testing team should be concerned about what proportion of significant findings are correctly called, and what proportion of non-significant findings are incorrectly called. The “false discovery rate” is the flip side of the first quantity.
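To make these two error rates concrete, here is a toy simulation of A/A tests (no true effect). It compares a fixed-horizon analysis with a naive always-peeking rule; it is emphatically not the Stats Engine, just an illustration of why uncorrected peeking inflates the share of "significant" findings that are wrong.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tests, n_per_arm, n_looks = 2000, 10_000, 20
p = 0.05   # true conversion rate in BOTH arms: an A/A test, so any "winner" is false

fixed_hits = peeking_hits = 0
for _ in range(n_tests):
    a = rng.binomial(1, p, n_per_arm)
    b = rng.binomial(1, p, n_per_arm)

    # Fixed-horizon protocol: a single z-test at the planned sample size.
    se = np.sqrt(2 * p * (1 - p) / n_per_arm)
    fixed_hits += abs(b.mean() - a.mean()) / se > 1.96

    # Naive peeking: 20 interim looks, stop at the first "significant" one.
    for k in np.linspace(n_per_arm / n_looks, n_per_arm, n_looks).astype(int):
        se_k = np.sqrt(2 * p * (1 - p) / k)
        if abs(b[:k].mean() - a[:k].mean()) / se_k > 1.96:
            peeking_hits += 1
            break

print(f"False positive rate, fixed horizon: {fixed_hits / n_tests:.3f}")   # about 0.05
print(f"False positive rate, naive peeking: {peeking_hits / n_tests:.3f}") # far above 0.05
```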
A testing program using fixed samples may face one of several problems:
a) Too few tests are called significant.
b) Too many tests are called significant.
c) It takes too long to call a test.
You need to figure out which of these is your biggest problem.
Conceptually, relative to a fixed-size test, a sequential test saves time if the true response rate differs from the design assumption substantially. If you’re testing on a web page for which the response rate is well-known and relatively stable, then there should be hardly any time saving on average. This is why I don’t recommend watching tests like a horse race, minute by minute. (As I said in Part 1, if you are watching a horse race, the Stats Engine will provide some sanity.)
Suppose you underestimated the true effect by, say, 20 percent. The following stylized chart shows my expectation of how the new Stats Engine results compare to the classical results.
The horizontal axis shows the sample size (at which Optimizely calls an end to the sequential test) as a ratio of the fixed sample size (by design). When this is 100%, the sequential test has the same length as the fixed-sample test. Because the true effect is substantially larger than expected, for a large proportion of tests, the sequential procedure calls for an “early” exit. However, there will be a small number of tests for which the sequential test will end much later than a fixed-sample test.
On the other hand, if the design assumption is essentially correct, then I expect the behavior of the new Stats Engine will look something like this.
The line is mostly flat, meaning there is roughly equal probability of the test ending at any sample size, including sample sizes that are multiples of the fixed-sample requirement. This is the "price to pay" for sequential testing, i.e., multiple peeking. At the lower end of sample sizes, I expect a slight positive curve, because the Bayesian prior (assuming it is a skeptical prior) will prevent tests from being stopped "too early".
[Thanks to Optimizely’s statistics team for entertaining my inquiries about this intuition.]
How important is saving time for your testing program? This depends on your readiness to move on. My experience is that unexpected time saving, say calling a winner one week before the test was supposed to end, frequently gets eaten up by the organization’s inability to move schedules around. Your IT or web developers may have other projects on their plates.
Further, if you tend to look at data by segments post-hoc, I don't think the current implementation supports that. If you know what segments you care about beforehand, then you can build those into the design.
Most importantly, please don’t fall into the trap of thinking that design and upfront planning become unimportant because of sequential testing and FDR. The design phase is very important in establishing expectations and facilitating communications within the organization.
I also recommend reading this post by Andrew Gelman on data-dependent stopping rules.
During my vacation, I had a chance to visit Trifacta, the data-wrangling startup I blogged about last year (link). Wei Zheng, Tye Rattenbury, and Will Davis hosted me, and showed some of the new stuff they are working on. Trifacta is tackling a major Big Data problem, and I remain excited about the direction they are heading.
From the beginning, I have been attracted to Trifacta's user interface. The user in effect assembles the data-cleaning code through visual exploration, aided by suggestions based on past behavior.
Here are some improvements they have made since I last wrote about the tool:
Handling numeric data - Trifacta now generates some advanced statistics, e.g. percentiles, about the columns in the Visual Profiler, whereas in the past every column was summarized as a histogram. I believe there is also some binning functionality.
Moving beyond Top N - I ranted about Top N thinking in the past (link), and I wasn’t happy that the Trifacta demo seemed to encourage this bad practice. I’m happy that the team heard the complaint and now offer a Random N selection. Eventually, I think Random N should be the default; I don’t know why anyone would want to see Top N.
Interactive workflow - Random N is a big step forward, but in the world of data cleaning it's not sufficient. The reason is that many data quality problems are rare cases that don't show up in a random sample. To deal with this, Trifacta has created an interactive workflow. Through the visual exploration paradigm, the software prepares a set of code; when the user applies the code to the entire dataset, the tool automatically checks for further anomalies and reports them to the user. For instance, there may be a handful of email addresses with unusual structures that did not appear in the random sample and thus fall outside the data-wrangling rules. These are flagged for further treatment. (A small sketch of this sample-then-flag idea appears after this list.)
Column metadata - Another exciting development is the expanded use of metadata associated with columns. Such metadata is a major difference between an Excel spreadsheet and any sophisticated data table. For instance, the user can now associate labels with values within a column.
New file formats - Trifacta handles many new data formats like JSON. It can, for example, accept a JSON file and parse the nested structure into columns. Very nice addition!
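The sketch below is not Trifacta's code; it just illustrates the sample-then-flag idea mentioned under "Interactive workflow": a rule written against a well-behaved random sample is applied to the full data, and whatever falls outside the rule is surfaced for further treatment. The addresses and the regex are made up.

```python
import pandas as pd

# Full dataset; suppose the random sample used to write the wrangling rule
# happened to contain only well-formed addresses.
emails = pd.Series([
    "alice@example.com",
    "bob.smith@mail.example.org",
    "weird@@example..com",        # rare malformed case, absent from the sample
    "no-at-sign.example.com",     # another rare case
])

# Rule derived from the sample: a simple user@domain.tld pattern.
pattern = r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+$"

# Apply the rule to the entire data and flag whatever falls outside it, so the
# rare cases get surfaced for treatment instead of silently slipping through.
flagged = emails[~emails.str.match(pattern)]
print(flagged.tolist())   # ['weird@@example..com', 'no-at-sign.example.com']
```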
I think Trifacta can gain ground by pushing the envelope on two fronts: more and better visual cues to help users diagnose data-quality problems; and more sophisticated recipes for how to handle such problems, informed by a knowledge base of past user behavior.
I have been traveling quite a bit lately, and last week, I went to Rome for a few days, and spent time at the KDIR conference. Rome is one of my favorite destinations and apart from the architecture and museums, and the restaurants, I also enjoy shopping there.
To my dismay, a gray cloud followed me around this entire trip - in the form of a misfiring fraud detection algorithm. On foreign trips, I always prefer to spend cash via my ATM card to avoid those ridiculous credit card surcharges. And I have used my Schwab card without any issues for many trips.
This time, however, my card was blocked after one withdrawal. This means I had to call back to the States to unblock it. Then, the next day, it happened again. And this continued every day until the last day of the short four-day trip.
Needless to say, I was getting more and more irate as the trip progressed. It makes me wonder about fraud detection algorithms. Are they any good? If they are tuned to be very risk-averse, then you can always prove to your boss that you have prevented a lot of fraud. The flip side is that you have also caused a lot of hassle for your good customers. In technical terms, each time my card was blocked, the algorithm committed a false-positive error.
The service reps had not a clue how these algorithms work. On the first day, I was told that it blocked me because I didn't give them warning that I'd be traveling. That made sense until it blocked me again the next day... after I told the rep the exact days I'd be in Rome. The next rep explained that I was getting blocked because Rome is a high-fraud zone, and I was using certain ATM machines. That sounds reasonable, except if those were the reasons, then I might as well throw the card away. The experience got me thinking about the challenges of making a good fraud detection algorithm.
Clearly, when I am traveling, my habits don't match what is in my customer history. I'm going to be engaging in a series of transactions that might look suspicious - like taking more cash out than usual, taking cash from places and machines that I have never used, taking cash out multiple times a day (because there is a per-transaction limit on most ATMs), taking cash out from machines all over town, etc. How can a computer figure out if those transactions are legitimate?
When the algorithm gets it seriously wrong, it can be very annoying. On one of those days, I had put money down on a suit, with an hour to go before the store's closing time. Because the problem could not be resolved in time, I had to go back the next day, which meant cutting other things out of the itinerary. Had it happened on the last day of the trip, it would have been a lot more trouble. I racked up probably $100 of international roaming charges for all the calls I had to make to unblock the card repeatedly. There were several moments when I stood on the street, the phone in one hand, the other hand operating the ATM, testing the machine, pulling out cash, and so on. Those moments felt very ironic, because the blocking of my card was supposed to make me feel secure.
As a statistician, I want to know the probability of falling victim to the kind of fraud Schwab's algorithm is trying to prevent, and the average cost of such fraud (bearing in mind that you can only take 250 euros per transaction). I suspect that the cost of the inconvenience, both tangible and intangible, may outweigh the potential benefit.
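A back-of-the-envelope version of that comparison, with every number below invented purely for illustration:

```python
# All figures are hypothetical -- the point is the structure of the comparison.
p_fraud = 0.001          # chance that a given foreign ATM withdrawal is fraudulent
loss_if_fraud = 250      # euros; the per-transaction cap bounds the worst case
withdrawals = 8          # withdrawals over a short trip

expected_fraud_loss = p_fraud * loss_if_fraud * withdrawals   # 2.00 EUR

blocks = 4               # times the card was blocked on the trip
cost_per_block = 30      # roaming calls, lost time, a missed purchase (euros)
inconvenience_cost = blocks * cost_per_block                  # 120.00 EUR

print(f"Expected fraud loss prevented (at most): {expected_fraud_loss:.2f} EUR")
print(f"Cost of the false positives:             {inconvenience_cost:.2f} EUR")
# Under these made-up numbers, the hassle dwarfs the protection; the bank would
# need to redo the comparison with its real figures.
```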
Tom Davenport is one of the leading voices on business analytics, and he has a new piece titled "Why are most 'targeted' marketing offers so bad?" in which he expanded on a question I raised in my HBR article. Tom's book Competing on Analytics is a classic. He has a great appreciation for the business of the data business.
In the new feature, Davenport classifies marketing offers he gets into five types:
a) retargeted offers;
b) well-meaning but poorly-targeted offers;
c) offers that benefit the offerer rather than the potential consumer;
d) offers that are OK except for the context; and
e) well-targeted offers that benefit you.
He certainly speaks some truths.
On retargeted offers, he reminds marketers "for the most part, if we abandon a search or purchase, we intended to do so."
On well-meaning and poorly-targeted offers (like sending men offers for women's clothing), he suspects that the retailers didn't try hard enough to mine their data.
I think there are some technical deficiencies partially responsible for these issues.
Firstly, human behavior and preferences can never and will never be reduced to a set of equations. Thus, every targeting algorithm has to balance false positives and false negatives. I have written about this a lot. Start with Chapter 4 of Numbers Rule Your World or the Groupon and Target chapters in Numbersense.
Secondly, the existence of "retargeting" as a business is entirely due to a perversion of measurement, which I address in Chapters 1-2 of Numbersense. I also wrote about how online marketing is measured here. Briefly, the more you flood customers with impressions, the more likely one of your impressions lands close in time to a purchase event, and the more credit you get for "influence".
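A toy simulation of this measurement problem (not any vendor's actual attribution model): purchases happen on their own schedule, impressions are sprayed at random and influence nothing, yet the share of purchases "credited" to an impression climbs with impression volume.

```python
import numpy as np

rng = np.random.default_rng(7)
window_days = 1.0        # an impression within this many days before purchase gets credit
horizon_days = 30.0      # observation period
n_customers = 100_000    # purchases occur independently of the ads

for n_impressions in (1, 5, 20):
    purchase = rng.uniform(0, horizon_days, n_customers)
    impressions = rng.uniform(0, horizon_days, (n_customers, n_impressions))
    credited = ((impressions <= purchase[:, None]) &
                (impressions >= purchase[:, None] - window_days)).any(axis=1)
    print(f"{n_impressions:2d} impressions per customer -> credited for "
          f"{credited.mean():.0%} of purchases")
# The ads influence nothing, yet the credited share grows with impression volume.
```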
Thirdly, the data are noisy, and few are investing any time in getting rid of bad data. Just think about it for a second. Let's say you are a guy. If your son lets his classmate use your iPad to buy something from a girls' clothing site just once, you are forever tagged as a buyer of girls' clothing.
The Big Data mindset for solving this problem is to get even more creepy: they want "all of your data". But if everything is tracked by hundreds or thousands of different entities, that doesn't work either, so the Big Data end game is one all-knowing monopolist holding all of your data.
But this path is entirely a dead end. Here's something to ponder - the fact that you visited a particular website is today equated with an expression of interest in that website. The data measure what you do, not why you do it.
The solution is humility, and accepting a level of uncertainty. Enhance observed data with more direct, even qualitative data. Remove noise, which is a way of managing the uncertainty.