In the last installment, I embarked on a project (perhaps only a task) to assemble a membership list for an organization. It sounded simple: how hard could it be to merge two lists of people? Of course, I couldn’t just stitch one list on top of the other, as some members had subscribed to the newsletter as well as joined the Facebook group. Those duplicate rows had to be merged so that each individual occupies exactly one row of data.
With barely a sweat, I blew past my initial budget of two hours.
After a half day, I produced a merged list by matching Facebook usernames to email usernames. It felt like running an obstacle course: one annoying issue popped up as soon as another was resolved. Stray punctuation, ambiguous names, case sensitivity, and so on. Most of these problems lacked clear-cut solutions. Some periods (full stops) were redundant, but not all; some middle names were part of the last name, but not all. Tick, tick, tick, tick. These data issues demanded consideration, and considerable time.
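The post doesn't show any code, but the kind of cleanup involved can be sketched in a few lines of Python. This is a minimal illustration, not the author's actual logic; the specific rules here (lowercasing, dropping periods and apostrophes, collapsing whitespace) are assumptions for the sake of example:

```python
import re

def normalize_name(raw):
    """Illustrative name cleanup: lowercase, strip stray punctuation,
    collapse runs of whitespace. Real data needed case-by-case judgment,
    since some periods and middle names were meaningful."""
    name = raw.lower()
    name = re.sub(r"[.'\"]", "", name)        # drop periods, apostrophes, quotes
    name = re.sub(r"\s+", " ", name).strip()  # collapse whitespace
    return name

print(normalize_name("  J.  Q.  Public "))  # j q public
```

A rule like this is easy to write and hard to trust: it silently treats every period as redundant, which the post notes is not always true.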
At the start of Day 2, I executed a planned U-turn. Starting again from the two lists of people, I attempted to match first and last names instead. I had tried usernames as the key first because only a small portion of the Email list included names; however, a match on first and last name is a more confident result than a match on username.
Immediately, I stepped into text-matching quicksand. I had to process the Facebook names (previously scraped) the same way I had fixed up the names in the Email list.
As before, I tried a “full outer join.” Disaster. The output had a crazy number of rows. I suspected missing values were to blame. Sure enough, there were some Facebook members for whom I did not have names (for example, because they provided names in Chinese or Korean characters). Each of these members with missing names matched, erroneously, the whole set of email subscribers who also did not provide names.
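To see why missing values wreck a join, consider this toy sketch (the names and list sizes are invented). Any join that treats two blank keys as equal produces a cross product of all the blank rows:

```python
# Toy lists: an empty string stands for a missing name.
facebook = [{"name": ""}, {"name": ""}, {"name": "alice"}]
email = [{"name": ""}, {"name": ""}, {"name": ""}, {"name": "alice"}]

# A naive join that treats any equal keys, blanks included, as a match:
matches = [(f, e) for f in facebook for e in email
           if f["name"] == e["name"]]

# 2 blank Facebook rows x 3 blank email rows = 6 spurious pairs,
# plus 1 genuine match on "alice": 7 output rows from 3 + 4 inputs.
print(len(matches))  # 7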
One way out of this mess was to extract only the people with non-missing names from each list, and then merge those subsets. This path was not easy either. The matching produced four types of members: those with matching names on both lists; those with a Facebook name that matched no one on the email list; those with an email name that matched no one on the Facebook list; and those who provided no usable name on either list.
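Set operations give a quick sketch of how the first three groups fall out once the missing names are set aside (the sample names here are invented for illustration):

```python
# Invented sample names; "" marks a member with no usable name.
fb_names = {"alice", "bob", ""}
em_names = {"alice", "carol", ""}

fb_named = {n for n in fb_names if n}  # drop missing names before matching
em_named = {n for n in em_names if n}

both = fb_named & em_named      # names found on both lists
fb_only = fb_named - em_named   # Facebook name with no email match
em_only = em_named - fb_named   # email name with no Facebook match
# The fourth group, members with no usable name on either list,
# has to be carried along separately.

print(both, fb_only, em_only)
```

With blank keys removed up front, the cross-product explosion from the earlier join cannot happen.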
The challenge was to combine those four groups of members in such a way that each unique member occupies just one row of data. For each such member, I also wanted to gather all other information from both the Facebook and email lists. This required defining a number of indicator columns, plus various columns that pull their data from one list or the other.
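Here is one way such a combination step might look in Python; the field names (`name`, `city`) and the rule of preferring the Facebook value are assumptions for illustration, not the author's actual scheme:

```python
def combine(fb_row, em_row):
    """Merge one member's Facebook and email records into a single row,
    with indicator columns recording which list(s) the member is on.
    Field names are hypothetical, for illustration only."""
    row = {
        "on_facebook": fb_row is not None,
        "on_email": em_row is not None,
    }
    for field in ("name", "city"):
        fb_val = fb_row.get(field) if fb_row else None
        em_val = em_row.get(field) if em_row else None
        row[field] = fb_val or em_val  # prefer Facebook, fall back to email
    return row

print(combine({"name": "alice", "city": None},
              {"name": "alice", "city": "NYC"}))
# {'on_facebook': True, 'on_email': True, 'name': 'alice', 'city': 'NYC'}
```

Passing `None` for one of the arguments handles the one-list-only groups, so the same function covers three of the four member types.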
I experienced a soothing satisfaction when the output data appeared as expected.
But the job was not yet finished. I had ended up with two merged lists: one based on username matching, the other on name matching. It was time to merge the merged. I’ll spare you the details, most of which resembled the above.
Knowing my client’s name was on the list, I looked him up. There he was, again and again, occupying four or five rows. This might make your heart sink since I had tried so hard to maintain one row per member. But don’t worry. I was simplifying things a little bit. If someone provided multiple email addresses, as my client did, I had decided to keep all of them.
At long last, the master list of members was born. This exercise bore instant rewards. It is very useful to know which members are on both lists and which are on just one: it gives us a rough measure of how involved a member is. The hard work lies ahead, since our goal is to gain a much deeper understanding of the members.
Some years ago I had a medical-treatment data set in which some patients moved centres, but because of privacy concerns we didn't get names or even ID numbers. So I resorted to dates of birth. Then I found that someone hadn't implemented a check for a valid date of birth, so some had day and month transposed, and others were missing the 19 at the start of the year. I ended up with a lot of merges to test each possibility when the correction wasn't obvious. It was a constant stream of e-mails back to the source of the data asking them what was correct. Eventually we ended up checking for patients who suddenly disappeared, and then seeing if a similar patient suddenly appeared at another site.
I think there is now software for medical data that will score how close a match is to perfect, and will allow a decision on how close matches need to be in order to be included.
Posted by: Ken | 06/16/2015 at 07:35 PM
@Ken - those might have been European folks with the day and month transposed. I try to impress on people the need to use a standard date field: YYYYMMDD. It's sortable, easy to remember, and easy to enter.
Posted by: Nate | 06/17/2015 at 12:11 PM