



"It may sound like a waste of time but before merging any data, check each table for duplicates."

Even better: enforce uniqueness for things like customer IDs.
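The duplicate check the post recommends can be sketched in pandas; the table, the "customer_id" column, and the data are invented for illustration:

```python
# Sketch: check a table for duplicate keys before merging.
# The "customer_id" column and the values are made up for illustration.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],               # id 2 appears twice
    "region": ["east", "west", "west", "north"],
})

# Flag every row whose key appears more than once.
dupes = customers[customers.duplicated("customer_id", keep=False)]
print(len(dupes))  # 2
```

pandas can also enforce uniqueness at merge time: passing `validate="one_to_many"` to `merge` raises a `MergeError` if the left-side key is not unique, which catches the problem before the merge silently fans out rows.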

Ben Graham

I've often seen examples of outer joins where the restriction is placed in the WHERE clause instead of the join's ON clause. That instantly turns the outer join into an inner one.
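Ben's trap can be reproduced with Python's built-in sqlite3; the tables, columns, and data here are made up for illustration:

```python
# Sketch of the WHERE-vs-ON trap: the same filter in ON keeps unmatched
# rows, while in WHERE it discards them. Tables and data are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 100);
""")

# Restriction in ON: Bob survives with NULL order columns.
on_rows = con.execute("""
    SELECT c.name, o.amount FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id AND o.amount > 50
""").fetchall()

# Restriction in WHERE: NULL fails the comparison, Bob disappears,
# and the outer join silently behaves like an inner join.
where_rows = con.execute("""
    SELECT c.name, o.amount FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    WHERE o.amount > 50
""").fetchall()

print(len(on_rows), len(where_rows))  # 2 1
```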

Another danger area is dates: assuming that BETWEEN 2014/01/01 AND 2014/01/02 includes two full days when it may include only one.
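The date-range trap can also be demonstrated with sqlite3; the events table and timestamps are invented. With timestamp data, BETWEEN the two date strings covers one full day plus a single midnight instant, not two full days:

```python
# Sketch: BETWEEN '2014-01-01' AND '2014-01-02' on timestamps.
# Data values are made up for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (ts TEXT)")
con.executemany("INSERT INTO events VALUES (?)", [
    ("2014-01-01 09:00:00",),
    ("2014-01-02 09:00:00",),   # silently excluded by BETWEEN below
])

n_between = con.execute("""
    SELECT COUNT(*) FROM events
    WHERE ts BETWEEN '2014-01-01' AND '2014-01-02'
""").fetchone()[0]

# Safer: a half-open interval [start, end_exclusive).
n_halfopen = con.execute("""
    SELECT COUNT(*) FROM events
    WHERE ts >= '2014-01-01' AND ts < '2014-01-03'
""").fetchone()[0]

print(n_between, n_halfopen)  # 1 2
```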


Ben: Yes, I have encountered those two as well. Dates and times in particular have all sorts of issues.

Zach: In theory, your solution works. In practice, not so much, because the problem has not gone away; it has just been pushed upstream. Whoever enforces uniqueness has to deal with the dupes, and because the analytics team is at the end of the assembly line, we are caught holding the bag.

Jordan Erickson

I like your solutions to #2 and #3. Do you have any suggestions for #1, "Select top N" to "spot check" the data?
As you effectively argue, even random selection may not be effective.
Missing data, extreme values, discontinuities in the distribution, out-of-range values... one way to (maybe?) comprehensively check for these undesirables is through summary statistics and graphical displays: is.na(), summary(), quantile(x, c(.90, .99)), class(), histograms, box plots, time series, qq plots, etc. This of course takes time, but reporting half the actual revenue for a month because of a missing value and then fixing it (after an executive catches it!) takes more time, and diminishes people's confidence in your analysis.
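Jordan's list is written in R; a pandas analogue looks like this (the revenue figures are invented, and the series name is an assumption):

```python
# A pandas analogue of the R checks above: is.na(), summary(), quantile().
# The revenue values are made up for illustration (one NA, one outlier).
import pandas as pd

revenue = pd.Series([120.0, 135.0, None, 9999.0, 128.0])

n_missing = revenue.isna().sum()           # ~ is.na() in R
summary = revenue.describe()               # ~ summary() in R
tails = revenue.quantile([0.90, 0.99])     # ~ quantile(x, c(.90, .99))

print(n_missing)       # 1
print(summary["max"])  # 9999.0 -> flags the out-of-range value
```

A single `describe()` call already surfaces the count (which reveals missing values), the extremes, and the quartiles, so a suspicious maximum or a short count jumps out before any merge or report is built.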


Not sure I can describe this concisely, but something on the order of "non-robust validation checks." Example: I recently had a client with a data set of fuel supply, consumption, and inventory stocks. They had never checked these against each other, i.e., run a programmatic check that (1) inventory stocks at time n = inventory stocks at time n-1 + supply - consumption [the difference is losses, which should be small] and (2) inventory stocks (expressed as days of consumption) conformed to expected industry standards. The supply and consumption data were pretty reliable and could be cross-checked, and running the tests described above showed convincingly that much of the inventory stocks data was nonsensical.
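Dale's balance identity can be sketched in a few lines; the numbers and the 5% loss tolerance are invented for illustration:

```python
# Sketch of check (1): stock[n] should equal
# stock[n-1] + supply[n] - consumption[n], up to small losses.
# Data and the 5% tolerance are made up for illustration.
supply      = [0, 50, 40, 60]
consumption = [0, 30, 35, 45]
stock       = [100, 120, 140, 155]   # 140 at n=2 breaks the identity

violations = []
for n in range(1, len(stock)):
    implied = stock[n - 1] + supply[n] - consumption[n]
    loss = implied - stock[n]          # should be small
    if abs(loss) > 0.05 * implied:     # tolerance for normal losses
        violations.append(n)

print(violations)  # [2]
```

The same loop, pointed at the real series, is the kind of programmatic cross-check Dale describes: it flags exactly which periods fail the accounting identity instead of relying on eyeballing.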

In general, I think the analyst should write the validation tests before even looking at the data, based on the description of data from the source--much like unit testing of methods/functions in programming. It's difficult to do that in the time-crunch of a project, but it sure helps prevent problems later on.


Jordan: Yes, for #1, you have to use different software, like R, SAS, etc. Look at distributions and summary statistics. Also, visualization is often very useful.

Dale: Actually yes, I encountered this with accounting data, which also has inventory-like equations. I think this item is really about checking your computations. A simple analogy: say you compute market share by dividing product A's units by total units. You can check this with other formulas, such as (a) multiply product A's share by total units and subtract product A's units (the result should be zero), or (b) add up all products' market shares (the total should be 100%). In general, try two different ways of computing the same variable and see if they match up.
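The two cross-checks Kaiser describes fit in a few lines; the unit counts are invented for illustration:

```python
# Sketch of the market-share cross-checks; unit counts are made up.
units = {"A": 200, "B": 300, "C": 500}
total = sum(units.values())

share_a = units["A"] / total

# Check (a): share * total should reproduce the unit count.
assert abs(share_a * total - units["A"]) < 1e-9

# Check (b): all shares should sum to 100%.
shares = {k: v / total for k, v in units.items()}
assert abs(sum(shares.values()) - 1.0) < 1e-9

print(round(share_a, 2))  # 0.2
```

Computing the same quantity two independent ways and asserting agreement is cheap, and when the assertion fails it localizes the bug to one of the two formulas.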

Jordan Erickson

From the new Setosa blog, a way to visualize mistakes in CSV files: http://setosa.io/blog/2014/08/03/csv-fingerprints/
A good way to see what's missing (empty cells) and what type each value is (string, decimal, integer).


Jordan: Thanks for the note. I asked them a question; they set up the problem right, but the example they give is trivial.
Also, the page is now at: http://setosa.io/blog/2014/08/02/csv-fingerprints/


Kaiser Fung. Business analytics and data visualization expert. Author and Speaker.