I haven't written a despatch from the desk of the data scientist in a while, but I had quite the stereotypical data science moment yesterday, so I'm sharing with you how I spent three hours of my life.
For a consulting project, I'm using a list of last names compiled by the government. As in all data science stories, I must jump through a series of hurdles. The hurdles are set up in a dark tunnel; I can't see the next gate until I've cleared the previous one.
The file of names comes in Excel and csv (text) formats. I spend some time investigating the data in Excel. Then, I load the csv file into Python, where I do some heavy lifting. Once satisfied, I convert the data table back to csv, which I intend to send to my client together with the code for the meat of the project. Processing the file of names is but a tiny step in a longer journey.
In the main code, the first step imports the modified csv file. Now, I'm merging this dataset with other files. To give you a flavor of this step, imagine the following made-up task. A third party has built a model that predicts someone's birthplace from their family name. For each name on my list, I want to attach the likely birthplace. This task is similar to a VLOOKUP in Excel.
Merging datasets is rarely uneventful. You need a match key: a column that exists in both datasets, and in the same format. If one column is in lowercase and the other in uppercase, nothing will match, and the resulting merged dataset is empty. If one dataset writes compound last names with spaces (say, Smith Johnson) while the other uses hyphens (Smith-Johnson), those entries will not connect. These issues are easily spotted and fixed because you notice that data have disappeared after the merge.
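To make this concrete, here is a minimal sketch of such a merge; the names, column labels, and birthplaces are all invented, and this is not the actual project code:

```python
import pandas as pd

# Made-up example: attach a predicted birthplace to each last name,
# the way a VLOOKUP would in Excel. All values here are invented.
names = pd.DataFrame({"name": ["ADAMS", "Smith Johnson", "WILLIAMS"]})
birthplaces = pd.DataFrame({
    "name": ["adams", "smith-johnson", "williams"],
    "birthplace": ["Ohio", "Texas", "Maine"],
})

# Normalize the match key on both sides: same case, same treatment of
# compound names. Otherwise rows silently fail to match and vanish.
for df in (names, birthplaces):
    df["key"] = df["name"].str.upper().str.replace(r"[\s-]+", " ", regex=True)

merged = names.merge(birthplaces[["key", "birthplace"]], on="key", how="left")
print(merged[["name", "birthplace"]])
```

With a left join, unmatched names at least stay in the table with a blank birthplace, which is easier to spot than rows vanishing outright.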
What I'm battling is more wicked. Perhaps once in hundreds of entries, I find some wild results: values attached to entries in ways that do not seem possible given the code I've written. Combing through the code reveals no obvious bugs. The results are just too outlandish.
From past experience with Python, I try rewriting the code fragment using a simpler data container (from data frames to lists, for those who care). A different error pops up. I feel as if another block has been pulled from my Jenga tower. Please don't let it collapse! The new error is a missing value in the list of names; in Python syntax, a "nan".
Ordinarily, blank data is an easy error to fix. Here, I'm mystified because the original source, the Excel sheet, does not contain any blanks. Perhaps I've erred when dealing with a catch-all entry called "All Other Names" (a bundle of the most infrequent names, obscured for privacy reasons). So I trace the data all the way back to the import step. I almost wish I had erred... but that is not to be.
An hour into this work, I am downtrodden. I have a list of non-repeating names, e.g. [ADAMS, SMITH, WILLIAMS, ...], and when I turn this list into a "set", a missing value appears, e.g. {nan, ALSTON, APPLETON, ...}. The set operator should have just changed the data container; it should not have inserted a missing value!
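To show why this is so baffling, here is a minimal sketch (with made-up names again): turning a list into a set only changes the container, so any nan that shows up in the set must have been lurking in the list already.

```python
import math
import pandas as pd

# Suppose one entry in the column is already missing (NaN), hidden among
# thousands of legitimate names.
names = pd.Series(["ADAMS", "SMITH", float("nan"), "WILLIAMS"])

as_list = names.tolist()   # ['ADAMS', 'SMITH', nan, 'WILLIAMS']
as_set = set(as_list)      # {nan, 'ADAMS', 'SMITH', 'WILLIAMS'}

# set() cannot invent a missing value; it only de-duplicates what it is given.
print(any(isinstance(x, float) and math.isnan(x) for x in as_set))  # True
```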
Let's take stock of my situation. I discover some wild values in my dataset after merging a file of names with some other dataset. While investigating it, I learn that somewhere in my code, one name has been converted to missing. I have yet to figure out how that missing value has crept into the data. But I know that merging datasets with missing values can create wayward output.
I plunge into the rabbit hole yet again. I pinpoint the location of the empty value inside my list of names. The list has come from a data frame. I determine the offending row, which shows a blank in the name column. The data frame is a copy of the modified csv file, which leads back to the original csv from the government site. On that original file, I strike gold.
The gold isn't the blank that I've been chasing.
Pulling out that particular row, I find that the "name" field is not blank at all!! Its value is the family name "NULL". A light bulb flashes. Some procedure in my code must have interpreted "NULL" as signifying an absent value. It's automation gone astray. In most contexts, NULL is a placeholder for missing data, but in a file of last names, "NULL" is valid data.
Which procedure is my nemesis? It has to be the file import function that reads the csv file into a Python data frame. Consulting the manual, I learn that by default, the function "handles" potential missing values by treating NULL, along with other values such as NA, N/A, and #N/A, as placeholders for missing data. Users can turn off this default. I immediately, excitedly edit the code to circumvent the automated conversion.
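For the record, here is roughly what that edit looks like, assuming the import function is pandas' read_csv (the file name is made up):

```python
import pandas as pd

# Default behavior: read_csv quietly converts "NULL", "NA", "N/A", "#N/A",
# empty strings, and similar tokens into NaN.
# df = pd.read_csv("names.csv")

# Turning off the default keeps "NULL" as the literal last name it is.
df = pd.read_csv("names.csv", keep_default_na=False)
# (na_filter=False is an even blunter switch that disables the detection
# of missing values altogether.)
```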
The blank is still there! I shift the order of the parameters. Nope, still there. I try editing other parameters that may affect how missing values are processed. The blank entry stubbornly refuses to go away. Perhaps, like a few other Python functions I've encountered, some of the functionality has not been implemented. What now?
I've found the culprit. I just can't fix it. I've spent way more time than expected on this little step. So I bow to reality and "hard-code" the fix. This means I add a line of code that overwrites the "name" field in row X with the value "NULL". This is la brutta figura.
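The patch amounts to something like this (the row index and file name are invented for illustration):

```python
import pandas as pd

df = pd.read_csv("names.csv")  # the import, as before

# Brute-force patch: restore the family name "NULL" in the one row where
# it was wiped out. The row index 4821 is made up.
df.loc[4821, "name"] = "NULL"
```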
***
The next morning, I open up the modified csv file using Excel. I am shocked to find the blank entry there as well. So, it isn't the file import procedure that has converted the name "NULL" to blank. That's why manipulating the import procedure has not fixed the problem.
I recall that the imported csv file isn't the original csv file from the government's site; it's the one I modified earlier using some Python code. Finally, I solve the puzzle. The file import procedure is indeed the villain, but the null-to-missing conversion happened the first time I used it, when I read in the original government file, not in the subsequent step. By then, the blank had already been baked into the modified csv.
Half the life of a data scientist is expended on little, exasperating puzzles. Many of these riddles are side effects of previous automation, frequently regarded as "smart". At the end of these three hours, I want to berate myself for obsessing over an issue of such world-conquering significance as the rehabilitation of the NULL clan from the ignominy of being cancelled by software, a group numbering fewer than 8,000 in a country of 330 million. For my sanity, I tell myself that the mangled data merge salvages the worthiness of this little issue.