This Wall Street Journal article about predictive modeling in the workplace is well worth reading. Your reaction may vary from excitement to stomach churning. (link)
As with most reporting on analytics, the author never addresses the issue of model errors. For example, this article started out with:
> When looking for workers to staff its call centers, Xerox Corp. used to pay lots of attention to applicants who had done the job before. Then, a computer program told the printer and outsourcing company that experience doesn't matter.
The fun may be squeezed out if we want a more balanced story. I get that. But surely, someone should ask if this "computer program" makes no errors. Because if you read this style of reporting, you would think every prediction is correct.
The reporter does get into the creepiness and the legal murkiness of using such models, but those discussions hover over the unspoken assumption that the predictions issued by these computer algorithms using "big data" are infallible. It's one thing if the model excludes, say, women from being hired because, well, every woman fails in that particular job; it's another thing if the model excludes women erroneously! (I can already see the "I took steroids but look, it didn't help" excuse.) Later in the article, we are told that a startup company has created the model of "an ideal call-center worker". The notion that there could be an "ideal" reflects a level of over-confidence in models that borders on insanity. (And I'm someone who builds models for a living.)
As you read the article, bear these points in mind:
- In Chapter 2 of Numbers Rule Your World, I talk about models of correlations versus models of causation. In my view, when people's livelihood is at stake, a correlational model is not good enough. All the models described in the article are correlation-based. The factors they use are things like commuting distance and social networks. There is a serious danger of "causation creep" here.
- An analogy: if you look at crime statistics, it is certainly true that African Americans are more likely to be criminals. If you build a crime prediction model, race will be a strong predictor even if it's not the only predictor. A lot of us are uncomfortable with this kind of racial profiling.
- A technical note: this type of model suffers from "rejection inference" (terminology from the credit scoring world). I didn't find space to talk about rejection inference in Chapter 2, but I might as well say something here. The output of such a predictive model is used to make a hiring decision: the candidate is either hired or rejected. If the candidate is hired, the company can track his or her performance and determine, ex post, whether the hiring decision was good or bad. If the candidate is rejected, the company has no way of knowing whether the rejection was justified. So when any company collects data to build a predictive hiring model, the dataset is already biased, because it contains no outcomes for the candidates who were rejected.
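To make the rejection-inference point concrete, here is a minimal simulation sketch. The setup is entirely hypothetical (the "experience" cutoff and the numbers are invented for illustration): a company only hires applicants above some screening threshold, so any training dataset built from observed job performance covers only the hired slice of the applicant pool.

```python
# Hypothetical illustration of "rejection inference" bias in hiring data.
# Assumption: the company screens on a single feature ("experience") with
# a hard cutoff; only hired candidates ever generate performance data.
import random

random.seed(0)

# Simulated applicant pool: prior experience in years, uniform over 0-10.
applicants = [{"experience": random.uniform(0, 10)} for _ in range(10_000)]

# The screening rule: hire only applicants with experience >= 6.
hired = [a for a in applicants if a["experience"] >= 6]

mean_exp_pool = sum(a["experience"] for a in applicants) / len(applicants)
mean_exp_hired = sum(a["experience"] for a in hired) / len(hired)

print(f"Mean experience, full pool:  {mean_exp_pool:.2f}")  # roughly 5
print(f"Mean experience, hired only: {mean_exp_hired:.2f}")  # roughly 8

# Any model trained on the `hired` records never observes low-experience
# candidates, so it cannot tell whether rejecting them was justified.
```

The printout shows the training data is drawn from a systematically different population than the one the model will be applied to, which is exactly the bias described above.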