An organization wanted to understand its base of members so the first order of business was constructing a database of all people who can be considered members. We decided to define membership broadly. Members included those who join the Facebook group, and those who subscribed to the newsletter.
The organization kept two separate lists which I would merge to create a master list. For simplicity, I’ll call them the FB list, and the Email list. In merging, the key is the key. Let me explain. The simplest key is an email address. If someone’s email address shows up on both lists, then I infer that those entries concern the same person, and combine them. My goal is to remove double counting of anyone who appears on both lists.
Sounds simple enough.
But never that simple, right? First, the Facebook group is the graveyard of data. Facebook provides zero statistics on group members and activities. Yes, the company that makes a business out of data does not hear the data-deprived group owners who have been pleading for years.
What is a data scientist to do? Scrape, that’s what. Members can find out who else is in the group by the scroll-wait-reset-scroll routine. You know that feeling. I know you do. You scroll to the bottom of a web page. Your browser gets the hint. It loads a few more items, while the slider floats away, usually to the wrong spot. You re-set the position, and scroll some more. After much scrolling, I scraped that page to compile the Facebook list. It’s got the name of the person, their Facebook username, and their location (when available).
Notice I didn’t say email address. So the FB list did not contain the all-important key. Another possible key is first and last names. Reviewing the email list, I realized that newsletter subscribers are not required to provide names so matching names to the FB list will yield few hits. The third candidate is not as accurate; I tried matching the Facebook username to the email username.
The client furnished an Excel file, which I’ve been calling the Email list. Upon opening the list, I turned the email address into all uppercase letters. I have matched enough text data to know that people are hardly in control of their fingers when they type text into web forms. “John”, “JOHN”, “joHN”, “JOhn”, and so on typically mean the same thing, regardless of case. (The occasional sadist offers “J0hn,” or “Jhon,” or “Jo hn.”)
Meanwhile, the client wondered if email addresses are really case-insensitive. I suggested asking Google. The search engine gave an ambiguous answer. The part after the @ sign is case-insensitive whereas the part before @ is case-sensitive, but then most email providers treat both parts as case-insensitive.
It’s rare when Google complicates your life. I fished out the UPPERCASE(email_address) formula, deleted it, broke up the email address into the user name and domain name parts, upper-cased the domain name, and reconnected the two parts, re-inserting the @ sign. The machine must follow these steps but a human being instinctively knows where to apply the cut. Some researchers believe the brain executes those steps at warp speed but I don’t buy it.
Next, I dropped the domain names from the split-and-spliced email addresses to get ready to match to Facebook usernames. Sheesh, the client did not ask if Facebook usernames are case sensitive or not. (They aren’t.) I proceeded to merge the two lists.
I executed a “full outer join.” With this procedure, any username that appears in one or both of the lists will find its way to the output dataset. On this first attempt, nothing merged. Even though username “davidcolumbus,” say, lived on both lists, the computer did not combine the data; the two matches sat one on top of the other.
I took a deep breath, for I had reached a point where I must be honest with myself. This project was sure to bust the two hours I originally allotted. The merge could easily take another hour, maybe two, if no new issues emerged.
The matching rows did not combine because the computer only joins eponymous columns. Since the Facebook and email usernames are different entities, those columns carry different labels.
But syncing those labels solves one problem while creating another! Members who appear on only one list have only one of the usernames. Besides, Facebook usernames are unique while email usernames, when detached from their domains, are not. A better solution is to set up a third username column in both lists, whose purpose in life is to be the matching key.
What about the other columns? Did I want them combined or not? Take as an example first and last names which show up on both lists. If I standardized the labels of these columns, the computer would attempt to merge them. What if David Columbus appeared as Dave Columbus on the other list with matching usernames? Forcibly combining the name columns would cause one of these variations to be dropped. If I wanted to keep both spellings, I must retain all name columns, which happens should I assign distinct labels, which is exactly the opposite of what I did with the username columns.
If that isn’t confusing enough, I stumbled upon another issue. In the Email list, while most names appeared as “First <space> Last,” there were examples of “Last <space> First”, and “Last <comma> First”, and “First Initial <space> Last,” and so on. As an analyst, your first thought is “What’s wrong with our designers? Why didn’t they create separate text boxes for first and last names?” Then, you accept that blame gets you nowhere; you still have to fix what’s broken.
A soft voice enters your head. You wish you hadn’t seen the problem. You hope it was just a bad dream. But you wake up.
In front of me I had two paths. I could follow path A, and that meant developing code to automatically detect the various anomalies and fixing them. This path would take hours. Which is the first name in “Scott Lewis”? How would a computer figure this out? What rule could apply generally?
And then, there was path B, better known as handcrafting. If I had 1,000 rows of data, and if it took two seconds to scan a name and determine the type of anomaly, I would have completed the exercise in 30 minutes or so.
I chose path B. It was ugly and unsexy but more of a sure thing.
I wish I could tell you I stopped looking. But I couldn’t help it. Some cultures embrace double surnames, like “De” something or “Von” something. My code was parsing “Chris De Jong” as first name Chris, and last name Jong. I needed a more complex rule. Something like “If the name has three words, take the first as first name, and the last two as the surname.” This rule runs afoul of someone like “Mary Anne Rutherford.” At a crossroad again. I could teach a computer how to lump the middle name, or I could exercise my brain some more.
By this time, I was exhausted. If you have followed me to this point, you have my admiration. In the next installment, I shall finish the assignment.