For those in Boston/Cambridge, I will be speaking at the Chief Data Scientist meetup on Wednesday night. See you there.
Warning: this post may be hard to understand if you don't know SQL.
SQL is one of the most fundamental tools in data science: it is the language used to manipulate data, and its simplicity is a big reason for its popularity. There are lots of things it can't do, but the few tasks it supports cover the majority of what analysts need.
Over the years, I have noticed some bad habits of SQL coders. These habits tend to prevent the coders from “seeing” the imperfections in their data. Here are a few:
“Select top N” to “spot check” the data
Most analysts realize that they need to check the integrity of a data set. The easiest “check” is to eyeball the top N rows of data. In most cases, the data set is ordered in some way, not necessarily known to the analyst, so the top N rows do not form a representative sample of all rows.
Even if those top rows were effectively random, it's not clear what checks the analyst is performing mentally while scrolling up and down a printed list of, say, 100 rows and 20 columns of data. Is the analyst looking for missing data? For extreme values? For discontinuities in the distribution? For out-of-range values? None of these tasks is simple enough to do in one's head.
Further, if there is a problem with the data, it usually comes from extreme or missing values, which are rare. Or if the data contain text, it may be that a few rows contain bad characters that will trip up SQL during a routine task.
Here’s the bottom line: if the data problem affects a huge chunk of the data, you will find it using a spot check, or any kind of checking. But most data problems affect a small corner of the data. A spot check will almost always miss these, leading to a false negative: the check passes while the problem persists. The real trouble comes when the analyst issues a clean bill of health after a spot check.
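Instead of eyeballing the top N rows, the same effort can go into a handful of explicit integrity queries. Here is a minimal sketch using Python's sqlite3 with a hypothetical sales table; the table and column names are illustrative, not from the original post.

```python
import sqlite3

# Hypothetical sales table with one missing value and one extreme value --
# exactly the kinds of rare problems a "select top N" scan tends to miss.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 19.99), (2, None), (3, 2500.0), (4, 24.50)])

# One pass over ALL rows: row count, missing-value count, and extremes.
row = conn.execute("""
    SELECT COUNT(*)            AS n_rows,
           SUM(amount IS NULL) AS n_missing,
           MIN(amount)         AS min_amount,
           MAX(amount)         AS max_amount
    FROM sales
""").fetchone()
print(row)  # (4, 1, 19.99, 2500.0)
```

Unlike a visual scan, these checks cover every row, and the results (one missing amount, a suspicious $2,500 maximum) are concrete facts you can act on.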
Assume that a data table has no duplicate rows
When merging data sets, it’s very easy to generate duplicate rows if one or both of the data sets contain duplicate rows with the same match key. For instance, the analyst is merging the customer sales history with the customer contact information. The match key is the customer ID number.
It is normal to assume that the contact database has only one row for each customer (who would design this table in any other way?), and nine times out of ten, this assumption will be correct.
The one time is when your CEO needs an updated sales number right now for a board meeting. Oops, the sales number is double the expected value. The culprit: duplicate customer IDs made their way into the contact table, so that when it is merged with the transaction history, each sales record is replicated one or more times.
It may sound like a waste of time, but before merging any data, check each table for duplicates.
Use open-ended time windows
In business analytics, we are always counting events over time, be they sales transactions, clickthroughs, responses to offers, etc. I have a pet peeve: code that does not have explicit accounting windows, meaning a starting time and an ending time.
Such code is not auditable. Every time you run the code, it will generate a different count (unless your business has infrequent events). If you wrote the code yesterday, and I ran it today, the counts would be different. How will I know if the difference is entirely due to the longer accounting window or if there are problems with the underlying data?
The usual excuse for this coding practice is that the business wants the “most updated” number, up to the very last microsecond. Let me assure you: a day-old number that has been verified is preferred to a second-old number that cannot be audited.
The other excuse is that the code is hard to maintain since you have to hard-code the ending time. But that is letting the tool limit your analytical ability; there are plenty of tools for which this is not a limitation.
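The fix is to bound the query with an explicit start and end, typically a half-open window [start, end), so rerunning the code yields the same count no matter when it runs. A minimal sketch with a hypothetical events table:

```python
import sqlite3

# Hypothetical event log: two events in January, one in February.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT)")
conn.executemany("INSERT INTO events VALUES (?)",
                 [("2024-01-05",), ("2024-01-20",), ("2024-02-02",)])

# Explicit half-open accounting window [2024-01-01, 2024-02-01):
# the count is auditable and identical whether it runs today or next year.
n = conn.execute("""
    SELECT COUNT(*) FROM events
    WHERE ts >= '2024-01-01' AND ts < '2024-02-01'
""").fetchone()[0]
print(n)  # 2
```

Contrast this with an open-ended `WHERE ts >= '2024-01-01'`, which returns a different count every time new events arrive, making any discrepancy impossible to attribute.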
When I talk about numbersense, I am also talking about the habits of the analysts. Bad habits doom many analyses before takeoff.
Have you encountered these issues? Do you have your own list of bad habits? Let me know!