



"It may sound like a waste of time but before merging any data, check each table for duplicates."

Even better: enforce uniqueness for things like customer IDs.
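The duplicate check described above can be sketched in a few lines of pandas -- the table and the `customer_id` key name are made up for illustration:

```python
import pandas as pd

# Toy table with a duplicated customer ID (illustrative data only)
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "name": ["Ann", "Bob", "Bob", "Cat"],
})

# Rows whose key appears more than once
dupes = customers[customers["customer_id"].duplicated(keep=False)]
print(len(dupes))                          # 2 -- both rows with customer_id 2
print(customers["customer_id"].is_unique)  # False
```

Running this before every merge is cheap; a non-unique key silently multiplies rows in the joined result.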

Ben Graham

I've often seen examples of outer joins where the restriction is placed in the WHERE clause instead of the join's ON clause. It instantly turns an outer join into an inner one.
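The same pitfall shows up outside SQL. A sketch of the pandas analog, with made-up tables: filtering on a right-table column *after* a left merge silently drops unmatched rows, just like putting the restriction in WHERE does.

```python
import pandas as pd

orders = pd.DataFrame({"cust": [1, 2], "amount": [10, 20]})
payments = pd.DataFrame({"cust": [1], "status": ["paid"]})

# Left (outer) join keeps both customers; cust 2 gets NaN status
left = orders.merge(payments, on="cust", how="left")
print(len(left))  # 2

# Filtering afterwards on the right table's column -- the analog of a
# restriction in WHERE -- drops the unmatched row: outer join becomes inner
filtered = left[left["status"] == "paid"]
print(len(filtered))  # 1

# Restricting the right table *before* merging (the analog of putting the
# condition in the join's ON clause) preserves the outer join
safe = orders.merge(payments[payments["status"] == "paid"],
                    on="cust", how="left")
print(len(safe))  # 2
```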

Another danger point is dates - assuming that between 2014/01/01 and 2014/01/02 includes two full days when it may include only one.
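The date-boundary trap is easy to demonstrate with timestamped data. A sketch with hypothetical hourly records: an inclusive upper bound of "2014/01/02" means midnight of that day, so the filter captures only one full day plus a single instant.

```python
import pandas as pd

# Hourly timestamps spanning two full days (illustrative data)
ts = pd.DataFrame(
    {"t": pd.date_range("2014-01-01", "2014-01-02 23:00", freq="h")}
)

# Naive BETWEEN-style filter: "<= 2014-01-02" stops at midnight,
# keeping one full day plus one instant of the second day
naive = ts[(ts["t"] >= "2014-01-01") & (ts["t"] <= "2014-01-02")]
print(len(naive))  # 25 of the 48 hours

# A half-open interval with an exclusive upper bound captures both days
both_days = ts[(ts["t"] >= "2014-01-01") & (ts["t"] < "2014-01-03")]
print(len(both_days))  # 48
```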


Ben: Yes, I have encountered both of those as well. Dates and times in particular have all sorts of issues.

Zach: In theory, your solution works. In practice, not so much, because the problem has not gone away; you have just pushed it upstream. Whoever enforces uniqueness has to deal with the dupes. Because the analytics team is at the end of the assembly line, we are caught holding the bag.

Jordan Erickson

I like your solutions to #2 and #3. Do you have any suggestions for #1--“Select top N” to “spot check” the data?
As you effectively argue, even random selection may not be effective.
Missing data, extreme values, discontinuities in the distribution, out-of-range values... one way to (maybe?) comprehensively check for these undesirables is through summary statistics and graphical displays -- is.na(), summary(), quantile(x, c(.90, .99)), class(), histograms, box plots, time series, qq plots, etc. This of course takes time, but reporting half the actual revenue for a month because of a missing value and then fixing it (after an executive catches it!) takes more time, and diminishes people's confidence in your analysis.
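The R checks listed above have direct pandas analogs. A minimal sketch, with a made-up revenue column containing one missing value and one extreme value:

```python
import numpy as np
import pandas as pd

# Toy revenue column with one missing value and one outlier (illustrative)
df = pd.DataFrame({"revenue": [100, 120, 95, np.nan, 110, 9999]})

print(df["revenue"].isna().sum())            # count of missing values
print(df["revenue"].describe())              # count, mean, min, max at a glance
print(df["revenue"].quantile([0.90, 0.99]))  # tail quantiles flag the 9999
print(df["revenue"].dtype)                   # analogous to class() in R
```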


Not sure I can describe this concisely, but it's something on the order of "non-robust validation checks." Example: I recently had a client with a data set of fuel supply, consumption, and inventory stocks. They had never checked these against each other, i.e., a programmatic check that (1) inventory stocks at time n = inventory stocks at time n-1 + supply - consumption [the difference is losses, which should be small] and (2) inventory stocks (expressed as days of consumption) conformed to expected industry standards. The supply and consumption data were pretty reliable and could be cross-checked, and running the tests described above showed convincingly that much of the inventory stocks data was nonsensical.
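Balance check (1) can be written as a programmatic test. A sketch with hypothetical monthly figures and a made-up tolerance:

```python
import pandas as pd

# Hypothetical monthly fuel balances (made-up numbers)
d = pd.DataFrame({
    "stock_open":  [500, 530, 545],
    "supply":      [100,  90,  80],
    "consumption": [ 70,  75, 500],
    "stock_close": [530, 545, 400],
})

# Implied losses: open + supply - consumption - close; should be near zero
d["loss"] = d["stock_open"] + d["supply"] - d["consumption"] - d["stock_close"]

# Flag months where the balance fails beyond a small tolerance
TOLERANCE = 5
bad = d[d["loss"].abs() > TOLERANCE]
print(len(bad))  # the third month's figures do not balance
```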

In general, I think the analyst should write the validation tests before even looking at the data, based on the description of data from the source--much like unit testing of methods/functions in programming. It's difficult to do that in the time-crunch of a project, but it sure helps prevent problems later on.


Jordan: Yes, for #1, you have to use different software, like R, SAS, etc. Look at distributions and summary statistics. Also, visualization is often very useful.

Dale: Actually yes, I encountered this with accounting data, which also has inventory-type balance equations. I think this item is really about checking your computations. A simple analogy: you compute market share by dividing product A's units by total units. You can then check this with other formulas, such as (a) multiply product A's share by total units and subtract product A's units, which should leave zero, or (b) add up all products' market shares, which should sum to 100%. Or, try two different ways of computing the same variable and see if they match up.
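The cross-checks in this reply can be sketched in a few lines, using made-up unit counts:

```python
# Hypothetical unit sales by product
units = {"A": 30, "B": 50, "C": 20}
total = sum(units.values())

share_A = units["A"] / total  # direct computation of product A's share

# Cross-check (a): share times total units should recover the unit count
assert abs(share_A * total - units["A"]) < 1e-9

# Cross-check (b): shares across all products should sum to 100%
shares = {p: n / total for p, n in units.items()}
assert abs(sum(shares.values()) - 1.0) < 1e-9

print(round(share_A, 2))  # 0.3
```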

Jordan Erickson

From the new Setosa blog, a way to visualize mistakes in CSV files: http://setosa.io/blog/2014/08/03/csv-fingerprints/
A good way to see what's missing (empty) and classes (string, decimal, integer).


Jordan: Thanks for the note. I asked them a question. They set up the problem right but then the example they give is trivial.
Also, the page is now at: http://setosa.io/blog/2014/08/02/csv-fingerprints/
