An organization wanted to understand its base of members, so the first order of business was constructing a database of all people who could be considered members. We decided to define membership broadly: members included those who joined the Facebook group and those who subscribed to the newsletter.
The organization kept two separate lists, which I would merge to create a master list. For simplicity, I’ll call them the FB list and the Email list. In merging, the key is the key. Let me explain. The simplest key is an email address. If someone’s email address shows up on both lists, then I infer that those entries concern the same person, and combine them. My goal was to eliminate double counting of anyone who appeared on both lists.
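To see the idea in miniature, here is a toy sketch in pandas; the data is invented and the column names are placeholders. Matching on the email key collapses the person who appears on both lists into a single row.

```python
# Toy illustration of the ideal case: email address as the matching key.
# All data invented.
import pandas as pd

fb = pd.DataFrame({"email": ["ann@x.com", "bob@y.com"],
                   "fb_name": ["Ann", "Bob"]})
news = pd.DataFrame({"email": ["bob@y.com", "cam@z.com"],
                     "subscribed": [True, True]})

# Bob appears on both lists; matching on the email key combines
# his two records into one row.
master = fb.merge(news, on="email", how="outer")
print(master)  # three people, not four
```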
Sounds simple enough.
But never that simple, right? First, the Facebook group is the graveyard of data. Facebook provides zero statistics on group members and activities. Yes, the company that makes a business out of data does not hear the data-deprived group owners who have been pleading for years.
What is a data scientist to do? Scrape, that’s what. Members can find out who else is in the group by the scroll-wait-reset-scroll routine. You know that feeling. I know you do. You scroll to the bottom of a web page. Your browser gets the hint. It loads a few more items, while the slider floats away, usually to the wrong spot. You reset the position, and scroll some more. After much scrolling, I scraped that page to compile the Facebook list. It contains each person’s name, Facebook username, and location (when available).
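For the curious, the scroll-wait-load cycle can be automated. Here is a minimal sketch using Selenium; the group URL and the CSS selector are placeholders, not Facebook’s actual markup, which is undocumented and changes constantly.

```python
# A minimal sketch of the scroll-and-scrape routine, using Selenium.
# The URL and the CSS selector below are placeholders.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://www.facebook.com/groups/EXAMPLE/members/")  # placeholder URL

# Keep scrolling to the bottom until the page stops growing.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the lazy loader time to fetch more members
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# With everything loaded, collect each member's displayed name and link.
cards = driver.find_elements(By.CSS_SELECTOR, "div.member-card a")  # placeholder selector
members = [(c.text, c.get_attribute("href")) for c in cards]
driver.quit()
```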
Notice I didn’t say email address. So the FB list did not contain the all-important key. Another possible key is first and last names. Reviewing the Email list, I realized that newsletter subscribers were not required to provide names, so matching names to the FB list would yield few hits. The third candidate was less accurate, but it was what I had left: matching the Facebook username to the email username, the part of the address before the @ sign.
The client furnished an Excel file, which I’ve been calling the Email list. Upon opening the list, I turned the email addresses into all uppercase letters. I have matched enough text data to know that people are hardly in control of their fingers when they type text into web forms. “John,” “JOHN,” “joHN,” “JOhn,” and so on typically mean the same thing, regardless of case. (The occasional sadist offers “J0hn,” or “Jhon,” or “Jo hn.”)
Meanwhile, the client wondered whether email addresses are really case-insensitive. I suggested asking Google. The search engine gave an ambiguous answer: the part after the @ sign (the domain) is case-insensitive, whereas the part before it (the local part) is technically case-sensitive, but most email providers treat both parts as case-insensitive.
It’s rare that Google complicates your life. I fished out the UPPER(email_address) formula, deleted it, broke up the email address into the username and domain parts, upper-cased the domain, and reconnected the two parts, re-inserting the @ sign. The machine must follow these steps, but a human being instinctively knows where to apply the cut. Some researchers believe the brain executes those steps at warp speed, but I don’t buy it.
Next, I dropped the domain names from the split-and-spliced email addresses, to get ready to match against Facebook usernames. Sheesh, the client did not ask whether Facebook usernames are case-sensitive. (They aren’t.) I proceeded to merge the two lists.
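In code, the whole split-uppercase-splice dance amounts to a few lines. A minimal pandas sketch, with invented addresses and a made-up column name:

```python
# A minimal sketch of the email clean-up. Data and column names invented.
import pandas as pd

email_list = pd.DataFrame({"email": ["John.Smith@Gmail.com",
                                     "DAVIDCOLUMBUS@yahoo.com"]})

# Split each address at the @ into local part and domain.
parts = email_list["email"].str.rsplit("@", n=1, expand=True)
local, domain = parts[0], parts[1]

# Only the domain is case-insensitive by the standard: normalize it alone,
# then splice the two halves back together around the @ sign.
email_list["email_clean"] = local + "@" + domain.str.upper()

# For matching against Facebook usernames (not case-sensitive), keep just
# the local part, folded to lowercase.
email_list["match_key"] = local.str.lower()
print(email_list)
```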
I executed a “full outer join.” With this procedure, any username that appears on one or both of the lists will find its way to the output dataset. On this first attempt, nothing merged. Even though the username “davidcolumbus,” say, lived on both lists, the computer did not combine the data; the two rows sat one on top of the other.
I took a deep breath, for I had reached a point where I must be honest with myself. This project was sure to bust the two hours I originally allotted. The merge could easily take another hour, maybe two, if no new issues emerged.
The matching rows did not combine because the software only joins identically named columns. Since the Facebook and email usernames are different entities, those columns carried different labels on the two lists.
But syncing those labels solves one problem while creating another! Members who appear on only one list have only one of the usernames. Besides, Facebook usernames are unique while email usernames, when detached from their domains, are not. A better solution is to set up a third username column in both lists, whose purpose in life is to be the matching key.
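Here is that fix in miniature, again in pandas with invented data: a shared match_key column is created in both lists, and the full outer join runs on it.

```python
# A toy version of the fix: build a shared "match_key" column in both
# lists, then full-outer-join on it. All data and column names invented.
import pandas as pd

fb_list = pd.DataFrame({
    "fb_username": ["DavidColumbus", "mrutherford"],
    "fb_name": ["David Columbus", "Mary Anne Rutherford"],
})
email_list = pd.DataFrame({
    "email": ["davidcolumbus@GMAIL.COM", "cdejong@EXAMPLE.COM"],
})

# The third column exists solely to serve as the matching key.
fb_list["match_key"] = fb_list["fb_username"].str.lower()
email_list["match_key"] = email_list["email"].str.split("@").str[0].str.lower()

# how="outer" keeps members who appear on either list, matched where possible.
master = fb_list.merge(email_list, on="match_key", how="outer")
print(master)  # three rows: one matched, two unmatched
```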
What about the other columns? Did I want them combined or not? Take, as an example, first and last names, which show up on both lists. If I standardized the labels of these columns, the computer would attempt to merge them. What if David Columbus appeared as Dave Columbus on the other list, with matching usernames? Forcibly combining the name columns would cause one of these variations to be dropped. If I wanted to keep both spellings, I had to retain all the name columns, which meant assigning them distinct labels: exactly the opposite of what I did with the username columns.
If that isn’t confusing enough, I stumbled upon another issue. In the Email list, while most names appeared as “First <space> Last,” there were examples of “Last <space> First,” and “Last <comma> First,” and “First Initial <space> Last,” and so on. As an analyst, your first thought is “What’s wrong with our designers? Why didn’t they create separate text boxes for first and last names?” Then you accept that blame gets you nowhere; you still have to fix what’s broken.
A soft voice enters your head. You wish you hadn’t seen the problem. You hope it was just a bad dream. But you wake up.
In front of me were two paths. I could follow path A, which meant developing code to automatically detect the various anomalies and fix them. This path would take hours. Which is the first name in “Scott Lewis”? How would a computer figure this out? What rule could apply generally?
And then, there was path B, better known as handcrafting. If I had 1,000 rows of data, and if it took two seconds to scan a name and determine the type of anomaly, I would have completed the exercise in 30 minutes or so.
I chose path B. It was ugly and unsexy but more of a sure thing.
I wish I could tell you I stopped looking. But I couldn’t help it. Some cultures embrace double surnames, like “De” something or “Von” something. My code was parsing “Chris De Jong” as first name Chris and last name Jong. I needed a more complex rule, something like: “If the name has three words, take the first as the first name, and the last two as the surname.” This rule runs afoul of someone like “Mary Anne Rutherford.” At a crossroads again: I could teach the computer where to lump the middle name, or I could exercise my brain some more.
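For a sense of what path A demands, here is a hedged sketch of such a rule set. The particle list is illustrative and far from exhaustive, and the fallback rule punts ambiguous cases back to a human, which is path B sneaking in through the side door.

```python
# A sketch of rule-based name parsing. The particle list is illustrative,
# not exhaustive; real names are messier than any fixed rule set.
SURNAME_PARTICLES = {"de", "van", "von", "della", "di", "la"}

def split_name(raw):
    """Return (first, last, needs_review) for a raw name string."""
    raw = raw.strip()
    if "," in raw:
        # "Last, First" is the one unambiguous format: the comma fixes the order.
        last, first = [p.strip() for p in raw.split(",", 1)]
        return first, last, False
    words = raw.split()
    if len(words) == 2:
        # Two words: assume "First Last". "Scott Lewis" breaks this silently.
        return words[0], words[1], False
    if len(words) == 3 and words[1].lower() in SURNAME_PARTICLES:
        # "Chris De Jong": fold the particle into the surname.
        return words[0], " ".join(words[1:]), False
    if len(words) == 3:
        # "Mary Anne Rutherford": two-word first name or two-word surname?
        # No general rule can tell; flag for a human.
        return words[0], " ".join(words[1:]), True
    return raw, "", True  # anything else goes to manual review

print(split_name("Chris De Jong"))         # ('Chris', 'De Jong', False)
print(split_name("Mary Anne Rutherford"))  # ('Mary', 'Anne Rutherford', True)
print(split_name("Columbus, David"))       # ('David', 'Columbus', False)
```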
By this time, I was exhausted. If you have followed me to this point, you have my admiration. In the next installment, I shall finish the assignment.
Next time try mergic! It's pretty cool for these sorts of messy, real-world string matching tasks: https://github.com/ajschumacher/mergic
Posted by: Zach | 06/11/2015 at 12:08 PM
Zach: I went to a talk about mergic recently. How does it deal with merging columns other than the match key?
The larger issue is whether this is a problem of tools. Ideally, if there is a tool, I'd like it to solve the problem completely. Tools like mergic solve part of the problem; then, the analyst must review the output to solve the rest of it. The time invested in learning the new tool often doesn't pay off.
But I'm being too demanding, I think. The underlying issue is incomplete information. There isn't a clear-cut answer to whether Scott in "Scott Lewis" is a first or a last name. There is no hope that a tool can give a sure answer. So tool developers give probabilities. Then people have to interpret those probabilities and make decisions.
The other challenge is that the issues are not known in advance. Solutions often solve one problem while exacerbating a different one.
Posted by: Kaiser | 06/11/2015 at 02:51 PM
Mergic only creates the match key. After joining the match key to each of the parent tables, you do the join as you normally would. This means you can run mergic on multiple pairs of columns between the 2 tables, join each match key to both tables, and then merge the 2 tables on multiple keys in the final step.
What I personally like about mergic is that it explicitly includes a "human makes edits" step. So the algorithm does 80% of the work (grabbing all the easy cases) and you can focus your manual work on the 20% of cases like "Scott Lewis" vs. "Lewis Scott."
Posted by: Zach | 06/16/2015 at 04:55 PM
Here's an excellent mergic tutorial:
https://github.com/ajschumacher/mergic/tree/master/tennis
Posted by: Zach | 06/16/2015 at 05:05 PM
Kaiser - welcome to the big bad world of data quality, where people think computers can perform magic! Did you consider "Jr." or "III" as well? No tool can handle bad data. This is one of many reasons why, for all the hype about "big data," it's small data that matters. If the developers had used an RDBMS with constraints, and had determined how to identify people (email address in both places, etc.), much of this cleanup would never have been needed.
Also, a lot of older folks share an email account (usually the one their ISP sets up for them). So even email can't be relied upon to reach a single person.
Posted by: Nate | 06/17/2015 at 12:19 PM