This piece is part of the StatBusters column written jointly with Andrew Gelman. Hope they fix the labeling soon. In it, we talk about two recent studies on data privacy, which lead to contradictory conclusions. How should the media report such surveys? Is the brand name of the organization enough? In addition, we debunk the notion that consumers will definitely get something valuable out of sharing their data.
I read nutrition studies only in the service of this blog; otherwise, I don't trust them or pay them much attention. Nevertheless, the health beat of most media outlets is obsessed with printing the latest research on coffee or eggs or fats or alcohol or what have you.
Now, the estimable John Ioannidis has published an editorial in BMJ titled "Implausible Results in Human Nutrition Research". John previously told us about the crisis of false positives in medical research.
Oops, here are some statistics on nutrition "science":
In 52 attempts at using randomized experiments to validate findings from observational studies, the number of times the findings were replicated: 0
In the NHANES questionnaire (the basis of all those findings), two-thirds of the participants provided answers that imply an energy intake that is "incompatible with life". I haven't read this paper; seems like worthwhile reading.
There are at least 34,000 papers on PubMed with keywords "coffee OR caffeine" which means this one nutrient has been associated with almost any interesting outcome.
Almost every single nutrient imaginable has peer reviewed publications associating it with almost any outcome. A statistician should never give the advice "If at first you don't succeed,..."
Many findings are entirely implausible (and still get published in top journals)... for example, the idea that a couple of servings a day of a single nutrient will halve the burden of cancer is clearly "too good to be true," even more so for anyone who is familiar with this literature
"Big datasets just confer spurious precision status to noise"
Randomized experiments offer hope but are woefully undersized (like requiring 10 times the current sample).
Just to nail home the point, John concludes: "Definitive solutions will not come from another million observational papers or a few small randomized trials."
I mentioned the Harvard Business Review article on business use of customer data in the "Big Data" era. In the previous post, I looked at the nature of the evidence used by the authors. In this post, ignoring my discomfort with some of the evidence, I examine the conclusions of the article.
The report has a three-part structure: the first section describes the issues; the second section communicates results from a few surveys conducted by frog - a global strategy and design agency - on various issues related to data privacy; and the third section presents examples of their recommendations for clients, which they offer generally to businesses involved in collecting and monetizing customer data.
The survey results are revealing (although the sample size of 900 in five countries is tiny so I'm not sure you should believe them). The agency found that 97% of the people surveyed are concerned about businesses and governments mis-using their data. Seventy-two percent of Americans are reluctant to share information with businesses because they "just want to maintain their privacy".
The authors also learned that consumers have grossly under-estimated the extent of data collection. Only 25% of the respondents said they knew businesses tracked their location, and only 14% said they knew businesses shared their web-surfing history. Finally, their analysts attributed dollar value to the privacy of different types of data.
I follow them up to this point. In fact, the authors summed it up very nicely at the beginning of the article: "most [companies] prefer to keep consumers in the dark, choose control over sharing, and ask for forgiveness rather than permission."
Unfortunately, I am let down by the list of recommendations that follow. They feel to me like tweaks on failed ideas, rather than paradigm shifts.
The first recommendation is "educate the consumers". The authors gave an example of one of their own consulting clients who required "customers" to watch a video and give preliminary consent before sharing their own (genomic) data. And the personal data is withheld until the "customer" returns a hard-copy agreement.
We don't need to be reminded that every day, we "voluntarily" sign Terms and Conditions which no ordinary person actually reads. Frequently, we are told not to use a website if we don't agree with any part of a lengthy agreement written in one-sided language favoring the business.
The "new" solution doesn't change the status quo. In fact, it gives businesses a stronger case for arguing that their users have voluntarily given up the right to their own data. In my view, until businesses confront the issue of properly disclosing how they collect data, what information is being collected, and how such data are being sold or traded, consumers will continue to find such practices creepy.
The second recommendation looks good on paper but is impractical. Another one of frog's clients is featured here. This client allows customers to specify which pieces of data can go to whom.
Assume there are 100 variables (only!) being collected and five levels of access control. That amounts to 500 yes/no questions each user is required to answer in order to gain full control of the data. In practice, most users will decide not to bother because it is too complex and time-consuming. The solution is a form of suffocation by paperwork.
For the data analysts, such a solution creates headaches. It generates self-selected data of the worst kind. Each variable has its own source of bias as different subsets of users decide to withhold their data for their own reasons.
To implement such a system properly requires a herculean effort. Say I reviewed the list of 100 variables and divided them into five groups of 20 variables using the five levels of control (from allowing anyone to see my gender to hiding my age from everyone). Two months later, I changed my mind. I removed access to 80 of the 100 variables from everyone. Now, the database administrator should find all instances of those 80 variables and delete them. Some of the data may already have been sold to other entities, and what if those other entities re-sell my data after I asked for the data to be deleted by the original source?
The last recommendation is an argument that businesses should not need to pay users for their data. Given the finding in the second section that users assign meaningful dollar values to their data, this seems to be a solution for businesses rather than for consumers.
Pandora's free advertising-supported service is used as an example of customers' willingness to exchange their privacy for "in-kind value". The article fails to mention just how much money Pandora has been paying to deliver that in-kind value! As this other HBR article tells us, Pandora is "13 years, 175 million users, little profit". It has never been able to establish a profitable business model: while 80% of its revenues come from advertising to those "free" accounts, 60% of its revenues immediately go out the door as royalty payments for the "free" music. It's not surprising that many consumers willingly engage in this lop-sided exchange with Pandora.
I often wonder: if consumers realized that over-sharing their data works to their disadvantage, would they become more interested in how businesses use their data?
For instance, insurance companies will be very interested in acquiring data from personal analytics devices, like Fitbit. They will use the data to predict whether you have health risks, and they will charge you more for insurance. Everyone is at risk for something.
The Uber app gives its users the ability to track their drivers -- in Manhattan, it's like watching a horse-race when your driver tries to negotiate the city gridlock. The same data is used by Uber to get an accurate picture of supply and demand, which drives their surge-pricing algorithms. That's how you end up paying five to ten times the normal cab rate.
Businesses use personal data to reduce information asymmetry, which in the past prevented them from extracting maximum value from consumers.
Today, the data privacy question is phrased as "Company X would like to collect information about your heart rate and in exchange, you will get notified if any irregularity is detected. Are you willing to share such data with Company X?"
Imagine you are asked a different question: "Company X would like to collect information about your heart rate and in exchange, you will get notified if any irregularity is detected. Being notified of heart-rate irregularity may help you but 80% of the warnings will be false alarms. Also, your heart rate data will be used by our insurance arm to adjust your insurance premiums. There is a 50% chance that your premium will increase after sharing your data. Are you willing to share such data with Company X?"
Last time we heard about Deflategate on this blog, Warren Sharp compiled some statistics on fumble rates, showing that the Patriots were unusually good at avoiding fumbles. (link, link) I thought the level of analysis was "above average" and remarked that statistical evidence of this type can only get you so far. The metric is indirect, and it does not speak to causation.
The official investigators have now issued their report. New York Times has its coverage here. As one reader commented, this article, currently nearing 800 comments, has more comments than most articles with more serious subject matter. The NYT article is one of the better ones out there on this subject.
Two sets of new evidence have emerged.
The first, which is getting most of the headlines and attention, are text messages involving two Patriots employees who discussed their deflating operation. These text messages are highly incriminating for the two involved and for me, also incriminating for Tom Brady, the team's superstar quarterback (who refused to release his own text messages or other correspondence to the investigators). The text messaging evidence shores up the causal evidence in a way that numbers by themselves could never accomplish.
The takeaway from the text evidence is the power of "metadata". Metadata is data about the text messages (sender, recipient, date and time of sending, length, etc.), as distinct from the content of the texts. Metadata went mainstream when the U.S. government was revealed to have been massively scooping up metadata on domestic phone calls, while denying that it collected the contents of said phone calls (see this coverage, for example). The investigators can use metadata to learn who else is in the circle of insiders, how often they communicate, when they communicate, etc. Notice that these pertinent questions do not require knowing the contents of the texts themselves. (This is not to say that knowing the contents of at least some of the text messages is unimportant--at the minimum, the contents are needed to zoom in on the relevant texts.)
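To make this concrete, here is a minimal sketch of the kind of summary one could build from metadata alone; the names, fields, and records are invented for illustration, not taken from the investigation.

```python
from collections import Counter
from datetime import datetime

# Invented metadata records: sender, recipient, timestamp -- no message content.
messages = [
    {"from": "equipment_asst", "to": "locker_attendant", "sent": "2015-01-17 09:12"},
    {"from": "locker_attendant", "to": "equipment_asst", "sent": "2015-01-17 09:15"},
    {"from": "equipment_asst", "to": "locker_attendant", "sent": "2015-01-18 14:02"},
]

# Who talks to whom, and how often -- answerable without reading a single text.
pair_counts = Counter(tuple(sorted((m["from"], m["to"]))) for m in messages)

# When they talk: an hour-of-day profile of the correspondence.
hour_counts = Counter(datetime.strptime(m["sent"], "%Y-%m-%d %H:%M").hour for m in messages)

print(pair_counts.most_common())
print(sorted(hour_counts.items()))
```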
But the investigators could not determine when the deflator operation started, how often it occurred, or the full scope of the operation. This likely has to do with selective disclosure of the text messages by selected parties (e.g. none from Brady).
Another takeaway is the inherent bias in surveillance data. Simply put, you only know what you can measure, and there is much that is not being measured. To get the "full scope", the investigators would need phone records, emails, and even wiretap evidence following the key players around (just kidding).
The second set of evidence is also extremely important to the story but it has received far less attention. One reason I like the NYT coverage is that the reporter gets to this evidence before talking about the text messages. For the first time, I see direct evidence of football tampering. The NFL rule requires footballs to be inflated to between 12.5 and 13.5 pounds per square inch. According to the NYT report, after the Colts raised suspicion at half-time of the Patriots-Colts matchup, all of the footballs were found to be underinflated (below 12.5 psi), with a minimum value of 10.5 psi.
This is the first time I see a clear admission that all of the footballs were underinflated. This is much more convincing evidence that someone tampered with the footballs than any of the fumble analysis.
Further, the referee had already checked the balls before the game, and at that time, found all of the Colts-supplied footballs to be at about 13 psi, and only two of the Patriots-supplied footballs to be under-inflated.
Once tampering is established, the investigators can move on to finding the cause. Here, they are helped by videotapes from surveillance cameras, and also the texts.
One nitpick about the sentence: 'The report uses the nebulous phrase “more probable than not” several times in making its conclusions.' To a statistician, this is a very precise statement, not nebulous at all! I interpret the investigators to mean there is more than 50% chance. That is the standard of "preponderance of evidence."
FiveThirtyEight has a lengthy discussion of the report. They helpfully showed a screenshot of the measured ball pressures.
Harvard Business Review devotes a long article to customer data privacy in the May issue (link). The article raises important issues, such as the low degree of knowledge about what data are being collected and traded, the value people place on their data privacy, and so on. In a separate post, I will discuss why I don't think the recommendations issued by the authors will resolve the issues they raised. In this post, I focus my comments on an instance of "story time", some questions about the underlying survey, and thoughts about the endowment effect.
Much of the power of this article comes from its reliance on survey data. The main survey used here is one conducted in 2014 by frog, the "global product strategy and design agency" that employs the authors. They "surveyed 900 people in five countries -- the United States, the United Kingdom, Germany, China, and India -- whose demographic mix represented the general online population". (At other points in the article, the authors reference different surveys although no other survey was explicitly described other than this one.)
Story time is the moment in a report on data analysis when the author deftly moves from reporting findings from the data to telling stories based on assumptions that do not come from the data. Some degree of story-telling is required in any data analysis, so readers must be alert to when "story time" begins. Conclusions based on data carry different weight from stories based on assumptions. In the HBR article, story time arrives just below the large graphic titled "Putting a Price on Data".
The graphic presented the authors' computation of how much people in the five nations value their privacy. They remarked that the valuations have very high variance. Then they said:
We don't believe this spectrum represents a "maturity model," in which attitudes in a country predictably shift in a given direction over time (say, from less privacy conscious to more). Rather, our findings reflect fundamental dissimilarities among cultures. The cultures of India and China, for example, are considered more hierarchical and collectivist, while Germany, the United States and the United Kingdom are more individualistic, which may account for their citizens' stronger feelings about personal information.
Their theory that there are cultural causes for differential valuation may or may not be right. The maturity model may or may not be right. Their survey data do not suggest that there is a cultural basis for the observed gap. This is classic "story time."
I wonder if the HBR editors reviewed the full survey results. As a statistician, I think the authors did not disclose enough details about how their survey was conducted. There are lots of known unknowns: we don't know the margins of error on anything, we don't know the statistical significance on anything, we don't know whether the survey was online or not, we don't know how most of the questions were phrased, and we don't know how respondents were selected.
What we do know about the survey raises questions. Nine hundred respondents spread out over five countries is a tiny poll. Gallup surveys 1,000 people in the U.S. alone. If the 900 were spread evenly across the five countries, their survey has fewer than 200 respondents per country. A rough calculation gives a margin of error of at least plus/minus 7 percent. If the sample is proportional to population size, then the margin of error for a smaller country like the U.K. will be even wider.
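As a rough check on that calculation, here is a sketch of the standard margin-of-error formula, assuming simple random sampling and the worst case of a 50% proportion (neither of which is stated in the article):

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p * (1 - p) / n)

# 900 respondents spread evenly across five countries is 180 per country.
print(round(100 * margin_of_error(180), 1))  # about +/- 7.3 percentage points per country
print(round(100 * margin_of_error(900), 1))  # about +/- 3.3 points for the pooled sample
```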
The authors also claim that their sample is representative of the "demographic mix" of the "general online population." This is hard to believe since they have no one from South America, Africa, the Middle East, Australia, and so on.
The graphic referenced above, "Putting a Price on Data," supposedly gives a dollar amount for the value of different types of data. Here is the top of the chart to give you an idea.
The article said "To see how much consumers valued their data, we did conjoint analysis to determine what amount survey participants would be willing to pay to protect different types of information." Maybe my readers can help me understand how conjoint analysis is utilized for this problem.
A typical usage of conjoint is for pricing new products. The product is decomposed into attributes so for example, the Apple Watch may be thought of as a bundle of fashion, thickness, accuracy of reported time, etc. Different watch prototypes are created based on bundling different amounts of those attributes. Then people are asked how much they are willing to pay for different prototypes. The goal is to put a value on the composite product, not the individual attributes.
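Mechanically, a simple ratings-based conjoint boils down to a regression of stated willingness to pay on dummy-coded attributes, with the fitted coefficients read as "part-worths". Here is a bare-bones sketch with made-up smartwatch prototypes and prices, just to show the shape of the calculation; it is not the authors' procedure, which they did not disclose:

```python
import numpy as np

# Made-up prototypes: columns are [intercept, premium design, thin case, GPS] (0/1),
# and each row is a bundle shown to respondents.
profiles = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 1, 0],
])
# Made-up average willingness to pay (in dollars) for each prototype.
willingness_to_pay = np.array([310, 330, 260, 150, 400, 210])

# Least squares recovers an implied dollar value ("part-worth") per attribute.
part_worths, *_ = np.linalg.lstsq(profiles, willingness_to_pay, rcond=None)
for name, value in zip(["baseline", "premium design", "thin case", "GPS"], part_worths):
    print(f"{name}: ${value:,.0f}")
```

Presumably the authors did something analogous, treating "keep my location private", "keep my browsing history private", and so on as the attributes, but without the questionnaire it is hard to know.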
Also interesting is the possibility of an "endowment effect" in the analysis of the value of privacy. We'd really need to know the exact questions that the survey respondents were asked to be sure. It seems that people were asked how much they would pay to protect their data, i.e. to acquire privacy. In this setting, you don't have privacy and you have to buy it. A different way of assessing the same issue is to ask how much money you would accept to sell your data. That is, you own your privacy to start with. The psychologist Daniel Kahneman and his associates pioneered research showing that the values obtained by those two methods are frequently far apart!
In a classic paper (1990), Kahneman et al. told one group of people that they had been gifted a mug, and asked them how much money they would accept in exchange for it (the median was about $7). Another group of people were asked how much they were willing to pay to acquire a mug; the median was below $3.
Is this the reason why businesses keep telling the press that we don't have privacy and have to buy it, as opposed to that we have privacy and can sell it at the right price?
Despite my reservations, the HBR piece is well worth your time. It raises many issues about data collection that you should be paying attention to. Read the whole article here.
This is a supplement to the previous post about a new research paper on the effect of Alcoholics Anonymous, and an NY Times exposition that I commented on. A misreading of that article led me to complain about per-protocol analysis, which wasn't the methodology behind the Humphreys et al. research. In this post, I will explain their methodology, known as instrumental variables analysis.
In the last post, I showed this hypothetical situation, involving patients who "cross over" (disobey treatment assignment) in a randomized experiment.
In the paper, actual treatment is measured by the change in frequency of attending AA meetings (relative to baseline).
Because initial treatment assignment (rows) is random, one expects that equal proportions of people would have moved out of state, got married, got divorced, etc. Similarly, one expects that equal proportions of people would have increased AA attendance. But in the table above, 90% of people in the treatment arm upped attendance while only 60% of those assigned to no treatment increased attendance. (The researchers use a continuous scale of frequency rather than a proportion but the concept is the same.)
Of course, the random assignment to treatment itself is a cause of higher relative attendance. People are told to go to AA meetings. But there are other reasons for increased attendance, such as self-motivation leading those in the no-treatment arm to cross over.
In ITT analysis, you ignore the actual attendance, and analyze how treatment assignment affects the amount of drinking.
Alternatively, one can run a regression of amount of drinking (relative to baseline) on frequency of AA meetings. This will yield a result such as "the more meetings someone attends, the less they drink". The problem with this analysis is that while the initial assignment is random, the actual attendance is tainted by selection bias.
Instead of using the actual frequency of AA meetings as a regressor, the instrumental variables (IV) analysis uses a predicted frequency of AA meetings. The prediction is itself a regression of the actual frequency of AA meetings on treatment assignment and demographic variables. In other words, we only care about the portion of the variability in AA attendance that can be explained by the random assignment (controlling for the demographic variables). The remaining variability (due to self-motivation, etc.) is left on the table.
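Here is a schematic of that two-stage idea on simulated data (not the study's data; the coefficients and variable names are invented). An unobserved "motivation" factor biases the naive regression, but using only the assignment-driven part of attendance recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated data: random assignment, an unobserved motivation factor,
# attendance driven by both, and drinking driven by attendance and motivation.
assigned = rng.integers(0, 2, n)                 # 1 = assigned to the AA-facilitation arm
motivation = rng.normal(size=n)                  # unobserved; the source of selection bias
attendance = 2.0 * assigned + 1.5 * motivation + rng.normal(size=n)
drinking = -1.0 * attendance - 2.0 * motivation + rng.normal(size=n)

def ols(X, y):
    """Ordinary least-squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Naive regression of drinking on actual attendance: biased by motivation (about -1.7 here).
print(ols(np.column_stack([np.ones(n), attendance]), drinking)[1])

# Stage 1: predict attendance from the random assignment (the instrument).
Z = np.column_stack([np.ones(n), assigned])
attendance_hat = Z @ ols(Z, attendance)

# Stage 2: regress drinking on the predicted attendance; close to the true effect of -1.0.
print(ols(np.column_stack([np.ones(n), attendance_hat]), drinking)[1])
```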
This is the "correction" that Frakt inferred in the New York Times article. I think Frakt is correct that the conclusion can be applied only to those who obey the protocol but I don't think the researchers drop all non-compliers from the dataset.
Also, Humphreys et al. seem to be at odds with the author of The Atlantic article, as they say "The long-established positive association between AA involvement and better outcomes was therefore consistent with, but did not prove, causation."
[After communicating with Frakt, Humphreys and Dean Eckles, I realize that I was confused about Frakt's description of the Humphreys paper, which does not perform PP analysis. So when reading this post, consider it a discussion of ITT versus PP analysis. I will post about Humphreys's methodology separately.]
The New York Times plugged a study of the effectiveness of Alcoholics Anonymous (AA) (link). The author (Austin Frakt) used this occasion to advocate "per-protocol" (PP) analysis over "intent-to-treat" (ITT) analysis. He does a good job explaining the potential downside of ITT, but got into a mess explaining PP and never properly addressed the downside of PP. It's an opportunity missed because I fear the article confuses readers even more on an important topic.
The key issue at play is non-compliance in a randomized experiment. If some patients are assigned to AA treatment and others are assigned to some other treatment, typically some subset of patients will "cross over" (or drop out altogether), and usually such cross-over is associated with the outcome being measured--for example, a patient assigned to AA treatment felt that AA was not working and unilaterally switched to the other treatment; or vice versa.
ITT and PP differ in how they deal with the subset of non-compliers. In ITT, you analyze everyone in the experiment based on their initial assignment, ignoring non-compliance. In PP, you drop all non-compliers from the study, and analyze the subset of compliers only. (Each analysis is "extreme" in its own way.)
Between the two, I usually prefer ITT. The PP analysis answers the question: "If everyone complied with the treatment, what would be its effect?" I don't find the assumption of zero non-compliance realistic. ITT answers a different question: "Of those who are assigned to the treatment, what would be the expected effect?" This effect is an average of those who complied and those who did not comply, weighted by the proportion of compliers.
Frakt lost me when he said:
In a hypothetical example, imagine that 50 percent of the sample receive treatment regardless of which group they've been assigned to. And likewise imagine that 25 percent are not treated no matter their assignment. In this imaginary experiment, only 25 percent would actually be affected by random assignment.
First of all, the arithmetic does not work. If we ignore assignment as he suggested in the first two sentences, then the patients can either have received treatment or not. But 50 percent plus 25 percent leaves 25 percent of the patients unaccounted for.
Here is an illustration of what I think Frakt wanted to get across:
Of the 50% assigned to the treatment, 90% (45 out of 50) complied and 10% crossed over. Of the other half initially assigned to no treatment, 60% (30 out of 50) crossed over to the treatment. All in all, 75% of the study population received treatment and 25% did not... regardless of their initial assignment.
In an ITT analysis, all patients in the table are analyzed. We compare the top row with the bottom row. By contrast, in a PP analysis, we only analyze the patients along the top-left, bottom-right diagonal, namely, the 65% of the patients who complied with the assigned treatment. So, we compare the top left corner with the bottom right corner.
The important question is whether this 65% subset constitutes a random sample. Frakt implies it is: "only 25 percent [i.e. 65 percent in my example] would actually be affected by random assignment." Maybe when he said "affected by", he didn't really mean random; because it should be obvious that treatment is no longer randomized within the 65% subset.
If the 65% subset were randomly drawn from the initial population, we should still have equal proportions of treated versus non-treated but in fact, we have 70% treated versus 30% not treated. Said differently, the not-treated patients are more likely to cross over than the treated patients.
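To make the arithmetic above easy to check, here is the hypothetical table in code (the counts are my illustration, not data from any study):

```python
# Hypothetical 100-patient experiment: rows are the random assignment,
# columns are the treatment actually received.
table = {
    ("assigned_treatment", "received_treatment"): 45,  # compliers
    ("assigned_treatment", "received_none"):       5,  # crossed over
    ("assigned_none",      "received_treatment"): 30,  # crossed over
    ("assigned_none",      "received_none"):      20,  # compliers
}

total = sum(table.values())
treated = sum(v for (a, r), v in table.items() if r == "received_treatment")
print("share receiving treatment:", treated / total)   # 0.75, regardless of assignment

# ITT keeps all 100 patients, compared by assignment (50 vs 50).
# PP keeps only the diagonal of compliers.
compliers = table[("assigned_treatment", "received_treatment")] + \
            table[("assigned_none", "received_none")]
print("share analyzed under PP:", compliers / total)    # 0.65
print("treated share within the PP subset:",
      table[("assigned_treatment", "received_treatment")] / compliers)  # about 0.69, not 0.50
```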
Cross-over isn't something that happens randomly. Patients are assessing their own health during the experiment, and thus, the opting out is frequently related to the observed (albeit incomplete) outcome.
In the article, Frakt states that the study by Humphreys et al. "corrects for crossover by focusing on the subset of participants who do comply with their random assignment". I call this "filtering" rather than "correcting".
Does analyzing this subset lead to an accurate estimate of the treatment effect? I don't think so.
By filtering out the cross-overs, the researchers introduce a survivorship bias. If patients cross over because they are unhappy with their assigned treatment, then these patients, if forced to continue the original treatment, are likely to have below-par outcomes compared to those who did not cross over. In a PP analysis, this subset is removed. Practically, this means that the treatment effect estimated by the PP analysis is too optimistic.
Frakt is careless with his language when it comes to discussing the downside of PP analysis. He says (my italics):
it’s not always the case that the resulting treatment effect is the same as one would obtain from an ideal randomized controlled trial in which every patient complied with assignment and no crossover occurred. Marginal patients may be different from other patients...Despite the limitation, analysis of marginal patients reflects real-world behavior, too.
"Not always" leaves the impression that PP analysis is usually right except for rare situations. Note how he uses the word "limitation" above (paired with "despite"), and below, when discussing ITT analysis:
For a study with crossover, comparing treatment and control outcomes reflects the combined, real-world effects of treatment and the extent to which people comply with it or receive it even when it’s not explicitly offered. (If you want to toss around jargon, this type of analysis is known as “intention to treat.”) A limitation is that the selection effects introduced by crossover can obscure genuine treatment effects.
The choice of words leaves the impression that ITT is more limited than PP when both analyses suffer from problems arising from the same source: patients with worse outcomes are more likely to cross over.
Many readers of the NYT article pointed to a much longer article in The Atlantic. It appears that the scientific evidence on AA is very weak.
I was creating an online survey using Surveymonkey earlier this week. They asked me to try their new design, and so I did. There appeared to be a bug in one of the features: it kept preventing me from displaying the questions in a certain way. I tried a bunch of tricks but after ten minutes, decided to switch back to the old design. I clicked on their Feedback link, and after I described my problem, they asked me to answer a few questions.
Here is one question:
This question is as standard as they come in a customer satisfaction survey.
My mood at the time was slightly unhappy. Just as I was about to click on that second-to-last radio button, I stopped. Can you see why? (Look at the choices more carefully.)
The fourth button is labelled "Slightly Satisfied". I was expecting it to say "Slightly Dissatisfied"!
Then I realized Surveymonkey is using a unipolar scale. All five answers are varying levels of satisfaction. I'm more used to a bipolar scale, such as:
Very satisfied
Slightly satisfied
Neither satisfied nor dissatisfied
Slightly dissatisfied
Very dissatisfied
The bipolar scale is centered in the middle and allows answers in both positive and negative directions.
I was debating between the last two choices. Was I "slightly satisfied" or "not at all satisfied"? Surely, I wasn't 100 percent unhappy, far from it. But "slightly" was also inappropriate. The mirror image of "slightly dissatisfied" should be "mostly satisfied", which meant I should be debating between the second button and the last.
However, "very satisfied" didn't fit with my mood, even though technically it was the mirror image of it. I wanted to express a negative sentiment, albeit minor, not a positive sentiment, albeit qualified. (Since I couldn't bear to pick either, I abandoned the survey at that point.)
I am not a fan of unipolar scales for many applications. For example, if you are measuring political attitudes (conservative and liberal), would your choices be:
Very conservative
Moderately conservative
Slightly conservative
Not at all conservative
or would they be
Very conservative
Somewhat conservative
Neither conservative nor liberal
Somewhat liberal
Very liberal
The unipolar scale automatically creates the problem of which pole to feature in those answers. Conservatives probably won't have an issue with that unipolar conservative scale but it's difficult for a "somewhat liberal" person to think he/she is "moderately conservative" or "very conservative"; vice versa.
The criticism of bipolar scales is that people (and I think this means Americans; I doubt it generalizes to other cultures) tend to bias toward the positive direction relative to the negative. I don't see that as a big problem if a 7-point scale is used, or if the scale is re-centered.
The American Association for Public Opinion Research (AAPOR) put out its Big Data report last month (link). This one is worth reading. It has some of the most current citations, and readers of this blog will be very receptive to its core messages. The team that wrote the report is a mix of academics and practitioners.
According to the people who talk about Big Data, the field holds many self-evident truths. One of these is the idea that Big Data will make surveys obsolete. How could it not, when Big Data means you have hundreds if not thousands of times more "respondents", the ability to track trends in "real time", and the ability to evolve your survey questions? This AAPOR report, I suppose, is a response to such claims.
Then there are those who say surveys were merely a suboptimal stopgap in the old days of "small data". They explain that surveys measure "stated preferences" while Big Data (i.e. found data, observational data) measures "revealed preferences", and that aiming for the latter is self-evidently better. Revealed preferences are closer to "the truth". Surveys merely represent an approximation, and with Big Data, we no longer need to run them.
This topic nicely ties in with Chapter 1 of Numbers Rule Your World (link). In that chapter, I explain the success of Disney in keeping customers happy despite having to wait two hours for rides that last two minutes. The "imagineers" realized that managing perception is even more important than optimizing reality. Customer happiness improves even though measured waiting times (i.e. revealed information) worsen or stay the same.
If Disney relied solely on revealed metrics, the data would say customers are waiting longer, or just as long, as before. When Disney conducts surveys and asks people how they feel, they say they are waiting less and so are happier. Feelings are crucial data that are not revealed by any observations. To the extent that feelings can be "revealed", it requires an assumption on the part of the observer, such as: since the measured waiting time is reduced, the customers must be feeling happier.
On a website, the web log has measured data that reveal the paths of users through the website. One can observe where traffic drops off. But that is not enough. In order to reduce attrition, the designer needs to understand why users exit. Surveys provide the answer here.
What if you learn that users exit by clicking on the home page icon? So you test a version of the design in which the home page icon is not clickable. You observe that the exit rate has significantly fallen in the new version. The problem is that users now find a different way to exit. Relying only on revealed preferences frequently leads to superficial actions that cure symptoms but not root causes.
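Here is a toy illustration of what the logs alone can show; the session paths are made up. The "home icon" exit disappears in the new design, but the attrition simply shows up under a different exit route:

```python
from collections import Counter

# Made-up session paths; the last element is how the session ended.
old_design = [
    ["search", "product", "home_icon_exit"],
    ["search", "product", "checkout"],
    ["search", "home_icon_exit"],
]
new_design = [
    ["search", "product", "back_button_exit"],
    ["search", "product", "checkout"],
    ["search", "browser_close_exit"],
]

def exit_counts(sessions):
    """Tally the final action of each session -- where users drop off."""
    return Counter(path[-1] for path in sessions)

print("old design:", exit_counts(old_design))
print("new design:", exit_counts(new_design))  # same attrition, different exit route
```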
Revealed preferences and stated preferences are two different dimensions and they both have strengths and weaknesses. Logs are bigger and faster but researchers have no control over the composition of the responders. Neither is a substitute for the other. I am interested in seeing work on integrating the two approaches. The AAPOR report has a good discussion of this subject plus a few references to new work on integration.
Chapter 1 of Numbersense (link) uses the example of U.S. News ranking of law schools to explore the national pastime of ranking almost anything. Since there is no objective standard for the "correct" ranking, it is pointless to complain about "arbitrary" weighting and so on. Every replacement has its own assumptions.
A more productive path forward is to understand how the composite ranking is created, and shine a light on the underlying assumptions.
The New York Times recently published an article entitled "What's the Matter with Eastern Kentucky?" (link). The problem with Eastern Kentucky, as the reporter saw it, is that those counties rank at the bottom of their list. Here is their ranking methodology:
The team at The Upshot, a Times news and data-analysis venture, compiled six basic metrics to give a picture of the quality and longevity of life in each county of the nation: educational attainment, household income, jobless rate, disability rate, life expectancy and obesity rate. Weighting each equally, six counties in eastern Kentucky’s coal country (Breathitt, Clay, Jackson, Lee, Leslie and Magoffin) rank among the bottom 10.
There is a companion blog at The Upshot, giving more context, and a county-level map of the ranking (link). Here are the relevant sentences.
The Upshot came to this conclusion by looking at six data points for each county in the United States: education (percentage of residents with at least a bachelor’s degree), median household income, unemployment rate, disability rate, life expectancy and obesity. We then averaged each county’s relative rank in these categories to create an overall ranking.
(We tried to include other factors, including income mobility and measures of environmental quality, but we were not able to find data sets covering all counties in the United States.)
We used disability — the percentage of the population collecting federal disability benefits but not also collecting Social Security retirement benefits — as a proxy for the number of working-age people who don’t have jobs but are not counted as unemployed.
How should we read this article?
What is this a ranking of? What is the research question? The answer is "how hard it is to live in specific counties". Right away, we know any answer is subjective, even if data is proffered.
Look out for the relative weights. The authors tell us it's equally weighted. "Equal weighting" implies fairness but frequently hides the inequity. Are those six factors equally important? Are there strong correlations among some of those factors?
The blog post discloses that each of the six metrics is first converted to ranks before being averaged. This means we need to worry about how much each metric varies from county to county. Take obesity rate for example. Here is a map of obesity at the county level published by the CDC, based on a model estimate (link).
The people who made this map placed the counties into five groups. The middle groups are narrowly defined, for example, 29.2% to 30.8%. Any analyst who converts the county-level obesity rates to ranks creates over 3,000 gradations of obesity rate. Said differently, the worst county is placed more than 3,000 rank positions below the best. In the case of obesity, the medical community would consider most of these counties unhealthy.
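A small sketch of the rank-then-average step, using made-up numbers for five counties, shows how ranking stretches tiny differences in a narrow-range metric like obesity just as much as large differences in a wide-range metric:

```python
import numpy as np

# Made-up county values: obesity rates (%) in a narrow band, median income ($000s) with real spread.
obesity = np.array([29.3, 29.4, 29.6, 30.1, 30.7])
income = np.array([62.0, 35.0, 48.0, 41.0, 55.0])

def to_ranks(values, higher_is_worse=True):
    """Rank counties from 1 (best) to n (worst)."""
    order = np.argsort(values if higher_is_worse else -values)
    ranks = np.empty(len(values), dtype=int)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks

obesity_rank = to_ranks(obesity, higher_is_worse=True)
income_rank = to_ranks(income, higher_is_worse=False)   # low income counts against a county

# Equal-weight average of ranks, as in The Upshot's method (here with two metrics, not six).
overall = (obesity_rank + income_rank) / 2
print(obesity_rank)   # [1 2 3 4 5]: a 0.1-point gap in obesity costs a full rank
print(income_rank)
print(overall)
```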
This is an example of how too much granularity can hurt you, a core insight of statistics that may seem counterintuitive.
Ultimately, it's for you to decide whether you believe this ranking makes sense or not. I'm not here to dismiss it because as I said in Numbersense (link), you can replace this methodology with something else, but the new method will also have its own assumptions.