The Wall Street Journal has been doing an excellent job, in a series of articles, describing the data-mining activities undertaken by a variety of companies. Readers can judge for themselves whether these activities are a breach of privacy, or a reasonable trade-off that provides more benefits than harms.
The latest article focuses on an experiment in which a life insurer attempts to use consumer marketing data instead of direct medical test results to assess the risk of applicants. This type of model is essentially the same as the credit-scoring models described in Chapter 2 of Numbers Rule Your World; and in Chapter 3, I discussed the basic problem facing insurers of separating the high-risk and the low-risk subgroups of customers.
One particular sentence in the article requires some comments:
The goal of Aviva's test: With 60,000 actual insurance applicants, figure out how to use the marketing databases and other information to reach the same underwriting conclusions that Aviva reached using traditional methods such as blood work. The 60,000 people were applicants Aviva had already judged.
The stated objective is by no means the only natural objective for such a pilot test. The end goal of any predictive model is to accept a subgroup of applicants who on average will generate enough revenues to cover the payouts plus the insurer's desired profit margin. Two types of models are being compared here, one using "traditional" medical test data, the other using marketing data (collected both online and offline). Instead of directly testing whether the marketing-data-driven model accurately predicts the profitability of the approved applicants, the stated goal only measures how well the marketing-data-driven model mimics the predictions of the medical-data model.
There is an underlying assumption that the medical-data model is accurate enough. In other words, the insurer's objective is purely to reduce the cost of prediction, because buying marketing data is much cheaper than conducting medical tests. This test cannot improve the accuracy of the prediction, and it is possible that the marketing-data-driven model may in fact reduce the accuracy. The insurer might very well accept a less accurate model because, according to the Journal, the cost differential is very large: $125 per head (I think this is the right number; the article is a bit confusing on this point), which might cover the incremental cost of lesser accuracy.
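To see why high agreement with the medical-data model is not the same as high accuracy, here is a minimal simulation sketch. All names and numbers are invented for illustration (a 20% true high-risk rate, a 90%-accurate medical model, and a marketing model that matches the medical model's decision 85% of the time); they do not come from the article.

```python
import random

random.seed(0)

# Each hypothetical applicant has a true risk label, a decision from the
# medical-data model, and a decision from the cheaper marketing-data model.
n = 10_000
true_risk = [random.random() < 0.2 for _ in range(n)]                   # 20% truly high-risk
medical = [t if random.random() < 0.9 else not t for t in true_risk]    # 90% accurate vs. truth
marketing = [m if random.random() < 0.85 else not m for m in medical]   # 85% agreement with medical

agreement = sum(a == b for a, b in zip(marketing, medical)) / n
accuracy_medical = sum(a == b for a, b in zip(medical, true_risk)) / n
accuracy_marketing = sum(a == b for a, b in zip(marketing, true_risk)) / n

print(f"agreement with medical model: {agreement:.1%}")
print(f"medical model accuracy:       {accuracy_medical:.1%}")
print(f"marketing model accuracy:     {accuracy_marketing:.1%}")
```

Under these assumed numbers the marketing model agrees with the medical model about 85% of the time, yet its accuracy against the true risk labels is noticeably lower than the medical model's (roughly 0.9 × 0.85 + 0.1 × 0.15 ≈ 78%): every disagreement with an already-imperfect reference model compounds the error.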
The purported benefit to potential customers is less hassle and lower cost (assuming that the insurer in fact passes on the cost savings). However, the insurers are understandably cautious about this because of something called adverse selection: the customers most likely to benefit from skipping medical tests are those who have private knowledge about their own longevity, so removing the testing requirement will likely attract this type of applicant disproportionately.
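The mechanics of adverse selection can be sketched in a few lines. This is a hypothetical illustration, not data from the article: assume 20% of applicants are high-risk, and that high-risk applicants (who know their own status) choose the no-test channel 80% of the time while low-risk applicants choose it only 30% of the time.

```python
import random

random.seed(1)

# Adverse-selection sketch: high-risk applicants who know their own risk
# prefer the channel with no medical test. All probabilities are invented.
population = 100_000
base_rate = 0.2  # 20% of all applicants are high-risk

no_test_pool = []
for _ in range(population):
    high_risk = random.random() < base_rate
    # Assumed opt-in rates: 80% for high-risk, 30% for low-risk applicants.
    opts_in = random.random() < (0.8 if high_risk else 0.3)
    if opts_in:
        no_test_pool.append(high_risk)

pool_rate = sum(no_test_pool) / len(no_test_pool)
print(f"high-risk share in population:   {base_rate:.0%}")
print(f"high-risk share in no-test pool: {pool_rate:.0%}")
```

With these assumed opt-in rates, the no-test pool is about 40% high-risk (0.16 / (0.16 + 0.24)), double the population rate, which is exactly why the insurers reserve the fast track for people who already look like good risks.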
This explains why the insurers say:
The process would simply speed up applications from people who look like good risks. Other people would go through the traditional assessment process.
In Chapter 2, I also described various credit-repair scams, including piggybacking on other people's good credit history. A fundamental flaw of using online data, especially using social-media data, is the ease with which any customer can create fake personas. If you know that insurers are tracking you on Facebook, you can easily paint yourself as their ideal customer, making yourself look like a "good risk" when you aren't. I'd love to hear what the data miners have to say about this.