This piece is part of the StatBusters column, written jointly with Andrew Gelman. Hope they fix the labeling soon. In it, we discuss two recent studies on data privacy that lead to contradictory conclusions. How should the media report such surveys? Is the brand name of the sponsoring organization enough? In addition, we debunk the notion that consumers will definitely get something valuable out of sharing their data.
Harvard Business Review devotes a long article to customer data privacy in the May issue (link). The article raises important issues, such as the low degree of knowledge about what data are being collected and traded, the value people place on their data privacy, and so on. In a separate post, I will discuss why I don't think the recommendations issued by the authors will resolve the issues they raised. In this post, I focus my comments on an instance of "story time", some questions about the underlying survey, and thoughts about the endowment effect.
Much of the power of this article comes from its reliance on survey data. The main survey used here is one conducted in 2014 by frog, the "global product strategy and design agency" that employs the authors. They "surveyed 900 people in five countries -- the United States, the United Kingdom, Germany, China, and India -- whose demographic mix represented the general online population". (At other points in the article, the authors reference different surveys, although this is the only one explicitly described.)
Story time is the moment in a report on data analysis when the author deftly moves from reporting findings of the data to telling stories based on assumptions that do not come from the data. Some degree of story-telling is required in any data analysis, so readers must be alert to when "story time" begins. Conclusions based on data carry different weight from stories based on assumptions. In the HBR article, story time begins just below the large graphic titled "Putting a Price on Data".
The graphic presented the authors' computation of how much people in the five nations value their privacy. They remarked that the valuations have very high variance. Then they said:
We don't believe this spectrum represents a "maturity model," in which attitudes in a country predictably shift in a given direction over time (say, from less privacy conscious to more). Rather, our findings reflect fundamental dissimilarities among cultures. The cultures of India and China, for example, are considered more hierarchical and collectivist, while Germany, the United States and the United Kingdom are more individualistic, which may account for their citizens' stronger feelings about personal information.
Their theory that there are cultural causes for differential valuation may or may not be right. The maturity model may or may not be right. Their survey data do not suggest that there is a cultural basis for the observed gap. This is classic "story time."
I wonder if the HBR editors reviewed the full survey results. As a statistician, I think the authors did not disclose enough details about how their survey was conducted. There are lots of known unknowns: we don't know the margins of error on anything, we don't know the statistical significance on anything, we don't know whether the survey was online or not, we don't know how most of the questions were phrased, and we don't know how respondents were selected.
What we do know about the survey raises questions. Nine hundred respondents spread out over five countries is a tiny poll. Gallup surveys 1,000 people in the U.S. alone. If the 900 were spread evenly across the five countries, their survey has fewer than 200 respondents per country. A rough calculation gives a margin of error of at least plus/minus 7 percent. If the sample is proportional to population size, then the margin of error for a smaller country like the U.K. will be even wider.
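The plus/minus 7 percent figure can be checked with the standard worst-case formula for a simple random sample (assuming, as the article does not confirm, that the sampling was anything like random):

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Worst-case 95% margin of error for a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

# 900 respondents spread evenly across five countries = 180 per country
print(round(100 * margin_of_error(180), 1))  # about 7.3 percentage points
```

Note that online panels typically have wider effective margins than this formula suggests, since respondents are not randomly selected.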
The authors also claim that their sample represents the "demographic mix" of the "general online population." This is hard to believe, since no one from South America, Africa, the Middle East, Australia, and so on was surveyed.
The graphic referenced above, "Putting a Price on Data," supposedly gives a dollar amount for the value of different types of data. Here is the top of the chart to give you an idea.
The article said "To see how much consumers valued their data, we did conjoint analysis to determine what amount survey participants would be willing to pay to protect different types of information." Maybe my readers can help me understand how conjoint analysis is utilized for this problem.
A typical usage of conjoint is for pricing new products. The product is decomposed into attributes; for example, the Apple Watch may be thought of as a bundle of fashion, thickness, accuracy of reported time, etc. Different watch prototypes are created by bundling different amounts of those attributes. Then people are asked how much they are willing to pay for different prototypes. The goal is to put a value on the composite product, not the individual attributes.
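To make that concrete, here is a minimal sketch of the typical usage, with invented attributes and prices: respondents price whole prototypes, and a regression recovers the implied "part-worth" of each attribute.

```python
import numpy as np

# Toy conjoint sketch: three binary attributes of a hypothetical watch.
# Each row of X is a prototype (1 = attribute present); y is the price
# a respondent says they would pay for that prototype (all numbers made up).
X = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
    [0, 0, 0],
])
y = np.array([120, 80, 60, 190, 170, 130, 240, 10])

# Least squares recovers the implied part-worth of each attribute,
# even though respondents only priced whole bundles.
design = np.column_stack([np.ones(len(X)), X])
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
print(coefs)  # intercept, then one part-worth per attribute
```

This is how conjoint normally values a composite product; how the HBR authors turned it into a price for a single data type is exactly what the article leaves unexplained.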
Also interesting is the possibility of an "endowment effect" in the analysis of the value of privacy. We'd need to know the exact questions the survey respondents were asked to be sure. It seems that people were asked how much they would pay to protect their data, i.e. to acquire privacy. In this framing, you don't have privacy and you have to buy it. A different way of assessing the same issue is to ask how much money you would accept to sell your data; that is, you own your privacy to start with. The behavioral psychologist Daniel Kahneman and his associates pioneered research showing that the values obtained by those two methods are frequently far apart!
In a classic paper (1990), Kahneman et al. told one group of people that they had been given a mug, and asked how much money they would accept in exchange for it (the median was about $7). Another group was asked how much they would be willing to pay to acquire the same mug; the median was below $3.
Is this the reason why businesses keep telling the press we don't have privacy and we have to buy it? As opposed to we have privacy and we can sell it at the right price?
Despite my reservations, the HBR piece is well worth your time. It raises many issues about data collection that you should be paying attention to. Read the whole article here.
It's a good thing that the FTC is making some noise about regulating the snooping done by online services. (link) It's not a good thing that the measures described in the article ("tools to view, suppress and fix the information") do not solve the fundamental problem, and are likely counter-productive.
What's the fundamental problem?
Imagine a world in which you walk into your supermarket. When you check out, you are required to provide your home address and phone number; otherwise, you can shop at the store down the block. Imagine a world in which they take a snapshot of your face at the checkout counter. Imagine a world in which every package you send through UPS is scanned and the contents recorded.
Not long ago, these actions would have been considered invasions of privacy. Today, websites large and small, and mobile/tablet apps, are doing all of the above and more.
The FTC thinks the problem is "transparency." Not quite: the fundamental issue is propriety.
Why are the proposed measures inadequate?
There is a big difference between providing tools to suppress information and not collecting the information in the first place. The word "suppress" is ambiguous; it can mean a lot of things. Many websites do not actually delete user information: "deleting" your information merely severs your future access to it. Archived copies of past data are easy to find. Besides, your data are likely still sitting on the website's servers, and possibly still being fed to algorithms.
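A minimal sketch of the "soft delete" pattern behind this behavior (the class and field names are hypothetical, not any real site's schema):

```python
from dataclasses import dataclass


@dataclass
class UserRecord:
    user_id: int
    email: str
    deleted: bool = False  # flipped on "delete"; the row itself is never removed


class UserStore:
    """Sketch of a data store that only ever soft-deletes."""

    def __init__(self):
        self._rows = {}

    def add(self, rec):
        self._rows[rec.user_id] = rec

    def delete(self, user_id):
        # "Delete" merely hides the record from normal lookups...
        self._rows[user_id].deleted = True

    def get(self, user_id):
        rec = self._rows.get(user_id)
        return rec if rec and not rec.deleted else None

    def all_rows_for_analytics(self):
        # ...while back-end jobs can still read everything.
        return list(self._rows.values())
```

After `delete()`, the user sees nothing, but the analytics path still returns the full record: suppression, not deletion.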
A simple blanket do-not-track option should be made available, but websites have moved in the opposite direction. Yahoo! recently informed users that it will no longer honor do-not-track requests; other major websites never bothered with do-not-track in the first place.
The blanket option is the only option that is viable. Otherwise, consumers will be forced to scroll through thousands of attributes to check off everything that they want suppressed.
The other measure of allowing consumers to correct the data is even worse. By correcting the data, the individual has given implicit consent to its collection and continual usage.
Why are the measures counter-productive?
Being able to correct data sounds like a good idea but the devil is in the details.
Let's say John's age is currently listed as 16, and John corrects it to 25. Now what? The data vendor will continue to receive information about John's age, from public records and so on, and the new data may contradict John's correction. If the vendor overwrites John's entry, does this violate the law? Who's to arbitrate which version of the "truth" is the truth? If the vendor is not allowed to overwrite John's entry, then the data become "dead" each time a consumer "corrects" them.
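The arbitration problem can be made concrete with a toy record holding two competing sources for the same field (the data and policy names are invented for illustration):

```python
# Two sources disagree about the same field; which one wins is a
# policy decision, not a fact the vendor can compute.
record = {"age": {"user_corrected": 25, "public_records": 16}}

def resolve_age(rec, policy):
    sources = rec["age"]
    if policy == "user_wins":
        # the field is frozen against any future evidence
        return sources["user_corrected"]
    if policy == "latest_source_wins":
        # the user's correction may be silently overwritten
        return sources["public_records"]
    raise ValueError(f"unknown policy: {policy}")

print(resolve_age(record, "user_wins"), resolve_age(record, "latest_source_wins"))
```

Neither policy resolves the underlying question of which version is true; each just picks a loser.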
The biggest problem with correcting data is gaming. I discussed this years ago in my first book, Numbers Rule Your World (Chapter 2), in the context of credit bureau data. It is naive to think that consumers will provide truthful data about themselves. Since we know that businesses use these data to set insurance rates, decide who gets a marketing offer, and so on, our incentive is to create a better version of ourselves. The only errors that would be corrected are those that paint the consumer in a worse light (think late payments misattributed to you). If the errors artificially inflate the consumer's positive qualities (think true late payments never reported to the bureau), why would anyone fix them?
In addition, there is an unavoidable bias in who corrects the data. In the case of credit bureau data, for example, people with bad credit scores are much more likely to want to review and correct their data than people with great credit scores.
The end result of such fixing is to take the data further from reality, add human mischief to the mix, and deteriorate the performance of algorithms. That's why open data correction is counterproductive.
The data industry wants to self-regulate. I agree with this stance but the substance is lacking. What has the industry done to regulate themselves?
The industry should be talking about the blanket do-not-track option. It should be defining the minimum set of data it needs. It should be quantifying the incremental value of collecting more data. That means not collecting data that are of no utility.
The industry is confident that the additional data bring tremendous benefits to consumers; we keep hearing about more relevant ads! more personalized marketing! better user experience! better customer service! etc. For all the confidence, it is afraid of do-not-track. Why should that be?
Note: Act quickly. Looks like you can still get a free book courtesy of SAS from here.
The New York Times features Acxiom, one of several data vendors that purportedly know a lot about you and me. Other key names in this sector include Experian and Equifax. What's new is that Acxiom will allow consumers to proactively "correct errors", or at least learn what is being bought and sold behind our backs.
I have dealt with all three of these companies. They are pretty much the same business. Here are some points to note about this news:
1. AboutTheData.com only deals with marketing data. Acxiom sells data to all kinds of parties, such as credit card companies, insurance companies, and presumably government agencies. You cannot edit those data. The article gives examples of data that won't be made available: whether a person is a “potential inheritor” or an “adult with senior parent,” or whether a household has a “diabetic focus” or “senior needs.”
2. Marketing data are mostly harmless. When they get it wrong, you may have different kinds of offers or marketing materials sent to you, or you may receive a different price on a good than intended. The data that are sold to insurers and government entities are likely to impact your life more heavily but they are not being exposed in this program.
3. You are allowed to edit your data but the option to delete your profile is not available. Opting out of sharing is not the same as deleting the data.
4. Just like every report I have seen about "correcting data", the reporter just assumes that the only things being edited are "errors". Not true. Since the data represent you to external parties, there is an incentive to create a fictional self. This type of error correction measure is very likely to cause even more distortion in the data. It encourages gaming. I last wrote about this issue here.
5. At the individual level, much of these data is guesswork. In technical terms, they are "modeled": the data provider has actual data on a sample of people, and then tries to guess what everyone else is like. Even something as basic as gender might be guessed. Your name is used as a clue, as are your neighborhood and other attributes such as religion. If you have an unusual name, the guess can be wrong. This type of data is valuable when used at some level of aggregation; it is not to be trusted at the individual level.
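A toy illustration of what "modeled" means here, using a made-up name-to-gender lookup table (the names, labels, and rates are invented, not any vendor's actual method):

```python
# Hypothetical lookup: first name -> (most likely gender, confidence).
NAME_STATS = {
    "james": ("M", 0.99),
    "mary": ("F", 0.99),
    "taylor": ("F", 0.55),  # ambiguous name: barely better than a coin flip
}

def guess_gender(first_name):
    """Return (guess, confidence); unseen names get a pure coin flip."""
    return NAME_STATS.get(first_name.lower(), ("?", 0.50))

print(guess_gender("Taylor"))
```

Averaged over millions of records, such guesses are useful; applied to any one "Taylor", the entry in your profile is close to a coin flip.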
I have been highlighting some cases of analysts doing a masterful job in taking apart claims made in the media. Here is another example from Bruce Schneier, the national security expert.
In June 2013, a Who’s Who list of telecom and technology companies, such as Verizon, AT&T, Google and Facebook, was busted for passing customer data to the National Security Agency in its domestic surveillance program. This episode invites scrutiny of one of the tenets of Big Data, the business practice of tracking customers. One of the articles I read at the time mentioned TrackMeNot as a way to foil the spies. TrackMeNot is a browser plug-in developed at New York University that sends randomly generated search queries from your browser. This strategy has a scientific basis: those who read Nate Silver’s book, The Signal and the Noise, can appreciate how adding noise can diffuse a signal.
Despite the association with a reputable institution and the scientific principle, one would be mistaken to think that TrackMeNot protects one from snooping!
A search for “TrackMeNot” brought me to a blog post from 2006 (!) by Bruce Schneier (link), whose valuable work on security technology I came across while researching Numbers Rule Your World. In fewer than 1,000 words, he turned what sounded like a promising concept into half-baked pseudoscience. The secret is to read the fine print, something you should always do before believing statistical claims.
It turns out that the “random” search queries generated by the tool (circa 2006) consisted of two-word combinations, chosen from a dictionary of 1,673 words, sent to one of four search engines at regular twelve-second intervals. Imagine how easy it is for an analyst to remove such noise from the web logs. It is obvious from reading the comments that Schneier’s audience is well informed about the Achilles’ heel of all terrorist-prediction methodologies, the problem of false positives. One reader simply said “WakeMeNot.”
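To see why such clockwork noise is easy to strip out, here is a sketch of an interval test an analyst might run over query timestamps (the timestamps and threshold are made up for illustration):

```python
# Queries arriving at near-constant 12-second intervals stand out
# from bursty human search behavior.
decoy = [0, 12, 24, 36, 48, 60]   # TrackMeNot-style clockwork spacing
human = [0, 3, 41, 44, 45, 200]   # human-style bursty spacing

def looks_automated(timestamps, period=12, tol=0.5):
    """Flag a query stream whose gaps all sit near a fixed period."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return all(abs(g - period) <= tol for g in gaps)

print(looks_automated(decoy), looks_automated(human))  # True False
```

Real filtering would be more sophisticated, but the point stands: noise with a rigid signature is not noise for long.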
Chapters 6 and 7 of Numbersense, on economic indicators, highlight the importance of learning the formulas behind metrics, and how data are processed. In short, details matter.
You have till Saturday to enter the book giveaway quiz. Click here to participate.
I will never understand why people believe Snapchat when they say the photos will disappear after a set time. Nothing digital ever really disappears. It has long been known that deleting a file on your PC does not eliminate it. It is also known that the Internet is built on replication--copies of the same file can be found on a lot of servers.
My colleague showed me this Valleywag post in which they dug through Snapchat's "law enforcement guide". What did they find?
Since Snapchat is a mobile application, the company can tie each photo to you personally: it has your telephone number, email address, and username (through the carrier). This is more than sufficient to identify you by name, then go to any data vendor or credit bureau and pick up everything else. Wait, there's more: the company keeps "a log of the last 200 snaps that have been sent and received."
Like Eric Schmidt said, you really shouldn't be doing things you don't want other people to know about.
TechCrunch has a great piece on how Facebook tracks you even if you don't give them data. (link; be careful, opening this link drags my browser to a crawl.) Here's my take on the issue:
I have always been disturbed by the complicity in invading other people's privacy that is forced upon us when we use a service like Facebook (or Google, or you name it). For those of you who allow these websites to import your address book: have you thought about whether those friends and acquaintances of yours want their private information handed over to a private company of your choosing? Have you considered whether you need their permission to pass along their contact details, their birthdays, and so on?
Perhaps you are not aware of how the data in your address book are being used. Try setting up a Facebook account and *not* importing your address book. If you register with an existing email address, upon signing in you will find a gigantic list of people that Facebook thinks you know and suggests you "friend". How does Facebook figure this out? You can bet that almost every one of those people has shared their email or address book with Facebook, and since you appear in their address books, it is logical to conclude that you may be friends with them.
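The inference above amounts to inverting a contact graph. A sketch, with entirely hypothetical data and logic (this is not Facebook's actual algorithm):

```python
# Address books uploaded by other users, keyed by the uploader.
uploaded_books = {
    "alice@x.com": ["you@x.com", "bob@x.com"],
    "bob@x.com":   ["you@x.com"],
    "carol@x.com": ["dave@x.com"],
}

def people_who_know(email, books):
    """Everyone whose uploaded address book contains this email address."""
    return sorted(owner for owner, contacts in books.items() if email in contacts)

print(people_who_know("you@x.com", uploaded_books))  # ['alice@x.com', 'bob@x.com']
```

You never uploaded anything, yet the service can still place you in a social graph built from other people's uploads.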
If you have Gmail, the situation may be worse. Gmail has a function that saves every address that appears on your incoming email, so all kinds of people end up in your address book, including spammers who happen to get past the spam filters. When Gmail first started out, they told us they would mine the emails for data to help target ads. I haven't checked their terms and conditions lately; it wouldn't surprise me if they now use those data to "improve their service". For example, the problem of wrongly assuming every contact in your address book is a friend can be mitigated by counting the number of emails you received from each contact.
From a user's perspective, we should think twice before sharing other people's data.
What about the data analyst? I think all data analysts should be required to take a course on privacy law. There should be a document outlining what kinds of data and what types of analyses are compliant. Data exist in many different locations, collected for many different purposes. Today, most analysts would mine anything they can get their hands on. They would merge data from different sources. There is no thought given to whether the data is approved for use in the way the analyst is intending, or whether it is legitimate to merge certain types of data. (Not every business does this type of data harvesting. I happen to work for one that does not copy your address book.)
If we don't police ourselves, we will be policed, and the rules would be stricter.
Felix Salmon hates the 401(k), and he explains his reasoning here. His strongest argument is the data, which show that the first generation of retirees who grew up with these individual retirement savings accounts find themselves with meager retirement savings (average: $120,000, excluding those with zero).
I have always disliked 401(k), and here are some reasons:
I hate the myth of individual control. These accounts (just like health savings accounts and other similar things) are sold to us as an expression of individual freedom and choice. But it is a mirage.
First, it is never a good idea to chop up money into little pockets. The same banks that every day praise the benefits of diversifying our portfolios also lobby the government to establish these individual accounts. If you earn $60,000 a year and are paid twice a month, then each paycheck is $2,500 less tax withholding. If you put 5% of your earnings into the 401(k), then every pay period you put away $125. Imagine then splitting that $125 into five or ten investments. It just doesn't make much sense.
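The arithmetic, spelled out (assuming 24 semi-monthly pay periods and a 5% contribution rate, before any tax withholding):

```python
# Per-period 401(k) contribution for a hypothetical $60,000 salary.
salary = 60_000
periods = 24                       # paid twice a month
paycheck = salary / periods        # 2500.0 per pay period
contribution = 0.05 * paycheck     # 125.0 into the 401(k)
per_fund = contribution / 10       # 12.50 if split across ten investments
print(paycheck, contribution, per_fund)
```

Twelve dollars and fifty cents per fund per pay period is the "diversified portfolio" being sold.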
And this doesn't even consider the per-trade fees that Charles Schwab, or E*Trade, or any of those firms charge us, which loom large as a percentage of the expected gain on such small trades. Transaction costs are where individuals will always lose to funds.
Second, how many of us can truly beat investment professionals at their own game? Most of us don't have the time or inclination to monitor our portfolios 24/7. The professionals have access to better information, and they can react faster when something unexpected happens. In fact, our too-big-to-fail banks routinely report zero days of trading losses in a given quarter when trading on their own accounts. (This does not mean these banks perform the same when it comes to our money.) I'm aware that most professional fund managers are shown to do no better than simple-minded simulated strategies. But is there a study that takes the investment performance of the average Joe and compares it to that of the average fund manager?
Third, moving from a defined-benefit plan to a 401(k) shifts risk from institutions to individuals. When we were sold the myth of individual control, we were also handed the Trojan horse of financial risk: now we are responsible for our own individual bad decisions. This is an important point. More people in the pool lowers the variance of outcomes.
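The pooling point can be seen in a stylized simulation (the return distribution is invented; the point is only that averaging over more people shrinks the spread of outcomes):

```python
import random

random.seed(0)

def avg_outcome(pool_size):
    # Each person's return drawn from the same noisy distribution.
    return sum(random.gauss(0.05, 0.2) for _ in range(pool_size)) / pool_size

solo = [avg_outcome(1) for _ in range(2000)]     # individuals on their own
pooled = [avg_outcome(100) for _ in range(2000)] # 100-person pools

def spread(xs):
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

print(spread(solo), spread(pooled))  # pooled spread is roughly 10x smaller
```

Same average return, far less variance for the pool: that is the risk the 401(k) hands back to each individual.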
Fourth, the 401(k) is regressive. The lower your salary, the smaller your biweekly contribution. The risk borne by the individual is higher when the principal is smaller.
If those are not enough, individual control is simply a lie, and becoming more so each year. Most of us find that the banks running our 401(k) only allow us to put our money into the funds they administer, and charge a fee to put the money elsewhere. In one case, I was told that if I really wanted to invest in funds not on the approved list, I had to first buy one of the designated funds and then transfer the money to the fund I really wanted. Needless to say, the in-and-out is not free. Another company matches my contribution, but in company stock, which is the opposite of diversification; and if I want to move that money to a different asset, I have to pay fees.
It's a steep price to pay to enjoy that "individual control."
Mashable (link) made a big fuss out of a new technology: mannequins that have cameras and data processing software, which allows the retailers to discern age, gender, and race of shoppers. This allows the shops to observe shopping patterns, and they will use this knowledge to change the store layouts, product placements, and so on to improve the shopping experience.
I see nothing wrong with this. The software as described is much more benign than the face recognition software deployed by Facebook for example. The Facebook version links a face to a name while this software only places one into broad demographic segments. A store has to be designed for the "average" shopper, and so it is useless for the shop to identify individuals. Moreover, shops have a large number of transient customers, say tourists, who will only show up once in their lives.
Mashable has a quote from Nordstrom claiming that this technology "crosses privacy boundaries with customers" and that they would never use it. That's a nice sound bite, but I wonder if Nordstrom, or any large retailer for that matter, would own up to the spying on shoppers it already does. Retailers have pretend-shoppers in their stores taking detailed notes on how individual shoppers behave. A number of books have been written on this topic, such as Paco Underhill's Why We Buy: The Science of Shopping (link).
I admire the work Underhill and associates have been doing. Store statistics can only tell you what but never why. In order to understand why shoppers behave in certain ways, these researchers are doing the legwork of painstakingly collecting data. This is one of the lessons of Chapter 2 of Numbers Rule Your World. In epidemiology, it's called shoe leather. In the world of Nate Silver, it's called polling -- what Nate does isn't possible without pollsters interviewing real people, gathering the data, compiling and analyzing them. One can say the new mannequin technology streamlines the data collection a bit.
In fact, these mannequins are much less intrusive than online or mobile shopping. In those arenas, retailers can absolutely link our every click to our name and address. There are those who claim they receive "permission" to collect data but no online retailer I know of would let me buy something if I reject their data sharing agreement.
I'm waiting for Mashable to correct their story. In a bid to find a villain, they claimed that Benetton is one of the big-name customers of this technology--both in the lead and in the article itself. However, if you click to the original Bloomberg article they cited (link), you'd be surprised to learn that "Benetton Group SpA said it’s not using EyeSee or comparable technology."
This lapse of reporting accuracy is much more serious than the alleged privacy intrusion.
A researcher demonstrates that a smartphone has an embedded piece of software that has the ability to track what its user is doing on the phone, things such as keystrokes and details of text messages received. There is no question that the software on at least some phones is sending all this data back to Carrier IQ, which is the company that produces this software. (See, for example, Gizmodo articles about Carrier IQ here and here.)
A host of companies, from Carrier IQ to carriers like AT&T to handset manufacturers like Apple, line up to deny any malicious intent. Some say they are not aware of how the software ended up in the phones, some say the software is resident but not activated, some say it used to be activated but not in current versions, some say the software is turned on but collects only "innocuous" data like call quality, some say the data is collected but no one's privacy is being violated (because the information is aggregated, etc. etc.)
These things are for sure: users are not made aware of such tracking; no one is asked to opt in or opt out of it; the application runs in the background but is not listed alongside other running apps; and if the user is savvy enough to find the application, the button that should disable it does nothing when clicked.
It's the wild, wild west. If the industry does not exercise self-restraint, the path ahead will be rocky. Mom knew it best: there are things in life you can have but shouldn't.