



"It may sound like a waste of time but before merging any data, check each table for duplicates."

Even better: enforce uniqueness for things like customer IDs.
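The duplicate check the post recommends can be sketched in pandas; the table, the "customer_id" column, and the data are invented for illustration:

```python
# Sketch: check a table for duplicate keys before merging.
# The "customer_id" column and the values are made up for illustration.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],               # id 2 appears twice
    "region": ["east", "west", "west", "north"],
})

# Flag every row whose key appears more than once.
dupes = customers[customers.duplicated("customer_id", keep=False)]
print(len(dupes))  # 2
```

pandas can also enforce uniqueness at merge time: passing `validate="one_to_many"` to `merge` raises a `MergeError` if the left-side key is not unique, which catches the problem before the merge silently fans out rows.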

Ben Graham

I've often seen examples of outer joins where the restriction is placed in the WHERE clause instead of the join's ON clause. That instantly turns the outer join into an inner one.
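Ben's trap can be reproduced with Python's built-in sqlite3; the tables, columns, and data here are made up for illustration:

```python
# Sketch of the WHERE-vs-ON trap: the same filter in ON keeps unmatched
# rows, while in WHERE it discards them. Tables and data are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 100);
""")

# Restriction in ON: Bob survives with NULL order columns.
on_rows = con.execute("""
    SELECT c.name, o.amount FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id AND o.amount > 50
""").fetchall()

# Restriction in WHERE: NULL fails the comparison, Bob disappears,
# and the outer join silently behaves like an inner join.
where_rows = con.execute("""
    SELECT c.name, o.amount FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    WHERE o.amount > 50
""").fetchall()

print(len(on_rows), len(where_rows))  # 2 1
```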

Another danger area is dates: assuming that BETWEEN 2014/01/01 AND 2014/01/02 includes two full days when it may include only one.
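The date-range trap can also be demonstrated with sqlite3; the events table and timestamps are invented. With timestamp data, BETWEEN the two date strings covers one full day plus a single midnight instant, not two full days:

```python
# Sketch: BETWEEN '2014-01-01' AND '2014-01-02' on timestamps.
# Data values are made up for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (ts TEXT)")
con.executemany("INSERT INTO events VALUES (?)", [
    ("2014-01-01 09:00:00",),
    ("2014-01-02 09:00:00",),   # silently excluded by BETWEEN below
])

n_between = con.execute("""
    SELECT COUNT(*) FROM events
    WHERE ts BETWEEN '2014-01-01' AND '2014-01-02'
""").fetchone()[0]

# Safer: a half-open interval [start, end_exclusive).
n_halfopen = con.execute("""
    SELECT COUNT(*) FROM events
    WHERE ts >= '2014-01-01' AND ts < '2014-01-03'
""").fetchone()[0]

print(n_between, n_halfopen)  # 1 2
```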


Ben: Yes, I have encountered those two as well. Dates and times in particular have all sorts of issues.

Zach: In theory, your solution works. In practice, not so much, because the problem has not gone away; it has just been pushed upstream. Whoever enforces uniqueness has to deal with the dupes, and because the analytics team is at the end of the assembly line, we are caught holding the bag.

Jordan Erickson

I like your solutions to #2 and #3. Do you have any suggestions for #1, "Select top N" to "spot check" the data?
As you effectively argue, even random selection may not be effective.
Missing data, extreme values, discontinuities in the distribution, out-of-range values... one way to (maybe?) comprehensively check for these undesirables is through summary statistics and graphical displays: is.na(), summary(), quantile(x, c(.90, .99)), class(), histograms, box plots, time series, qq plots, etc. This of course takes time, but reporting half the actual revenue for a month because of a missing value and then fixing it (after an executive catches it!) takes more time, and diminishes people's confidence in your analysis.
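Jordan's list is written in R; a pandas analogue looks like this (the revenue figures are invented, and the series name is an assumption):

```python
# A pandas analogue of the R checks above: is.na(), summary(), quantile().
# The revenue values are made up for illustration (one NA, one outlier).
import pandas as pd

revenue = pd.Series([120.0, 135.0, None, 9999.0, 128.0])

n_missing = revenue.isna().sum()           # ~ is.na() in R
summary = revenue.describe()               # ~ summary() in R
tails = revenue.quantile([0.90, 0.99])     # ~ quantile(x, c(.90, .99))

print(n_missing)       # 1
print(summary["max"])  # 9999.0 -> flags the out-of-range value
```

A single `describe()` call already surfaces the count (which reveals missing values), the extremes, and the quartiles, so a suspicious maximum or a short count jumps out before any merge or report is built.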


Not sure I can describe this concisely, but something on the order of "non-robust validation checks." Example: I recently had a client with a data set of fuel supply, consumption, and inventory stocks. They had never checked these against each other, i.e., run a programmatic check that (1) inventory stocks at time n = inventory stocks at time n-1 + supply - consumption [the difference is losses, which should be small] and (2) inventory stocks (expressed as days of consumption) conformed to expected industry standards. The supply and consumption data were pretty reliable and could be cross-checked, and running the tests described above showed convincingly that much of the inventory stocks data was nonsensical.
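Dale's balance identity can be sketched in a few lines; the numbers and the 5% loss tolerance are invented for illustration:

```python
# Sketch of check (1): stock[n] should equal
# stock[n-1] + supply[n] - consumption[n], up to small losses.
# Data and the 5% tolerance are made up for illustration.
supply      = [0, 50, 40, 60]
consumption = [0, 30, 35, 45]
stock       = [100, 120, 140, 155]   # 140 at n=2 breaks the identity

violations = []
for n in range(1, len(stock)):
    implied = stock[n - 1] + supply[n] - consumption[n]
    loss = implied - stock[n]          # should be small
    if abs(loss) > 0.05 * implied:     # tolerance for normal losses
        violations.append(n)

print(violations)  # [2]
```

The same loop, pointed at the real series, is the kind of programmatic cross-check Dale describes: it flags exactly which periods fail the accounting identity instead of relying on eyeballing.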

In general, I think the analyst should write the validation tests before even looking at the data, based on the description of data from the source--much like unit testing of methods/functions in programming. It's difficult to do that in the time-crunch of a project, but it sure helps prevent problems later on.


Jordan: Yes, for #1, you have to use different software, like R, SAS, etc. Look at distributions and summary statistics. Also, visualization is often very useful.

Dale: Actually yes, I encountered this with accounting data, which also has inventory-like equations. I think this item is really about checking your computations. A simple analogy: say you compute market share by dividing product A's units by total units. You can check this with other formulas, such as (a) multiply product A's share by total units and subtract product A's units (the result should be zero), or (b) add up all products' market shares (the total should be 100%). In general, try two different ways of computing the same variable and see if they match up.
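The two cross-checks Kaiser describes fit in a few lines; the unit counts are invented for illustration:

```python
# Sketch of the market-share cross-checks; unit counts are made up.
units = {"A": 200, "B": 300, "C": 500}
total = sum(units.values())

share_a = units["A"] / total

# Check (a): share * total should reproduce the unit count.
assert abs(share_a * total - units["A"]) < 1e-9

# Check (b): all shares should sum to 100%.
shares = {k: v / total for k, v in units.items()}
assert abs(sum(shares.values()) - 1.0) < 1e-9

print(round(share_a, 2))  # 0.2
```

Computing the same quantity two independent ways and asserting agreement is cheap, and when the assertion fails it localizes the bug to one of the two formulas.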

Jordan Erickson

From the new Setosa blog, a way to visualize mistakes in CSV files: http://setosa.io/blog/2014/08/03/csv-fingerprints/
A good way to see what's missing (empty cells) and what type each value is (string, decimal, integer).


Jordan: Thanks for the note. I asked them a question; they set up the problem right, but the example they give is trivial.
Also, the page is now at: http://setosa.io/blog/2014/08/02/csv-fingerprints/


Kaiser Fung. Business analytics and data visualization expert. Author and Speaker.