A LinkedIn contact and 538 reader pointed me to this demo video by Joe Hellerstein, from a Bay Area startup called Trifacta. They have a neat product that tries to automate data cleaning/processing tasks for analysts.
I love that people are working on this problem. It's an area that I'm interested in getting involved in. They also have a sleek user interface that is well thought out and innovative.
There is a long way to go still. The product is designed by computer scientists and it shows in several ways:
1. The data are, by and large, accepted as pristine. The tasks Joe chose to show during the 15-minute demo are about transforming or formatting variables. There is no "cleaning". All of the data are presumed correct. There was a brief moment of unease when he found missing values in a date field, which led to a difference in days being recorded as NA. This was quickly "solved" by replacing those differences with zero days! (In reality, this is probably right-censored data, for which the missingness is informative.)
2. Visual inspection of the top 10 rows is central to the process. I already ranted about this practice here. In the Trifacta design, the top rows are not just used to check the data; they are also used to generate transformation rules.
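To make the first point concrete, here is a minimal sketch in Python of the difference between zero-filling a missing duration and treating it as right-censored. The records, the field meanings, and the as_of extraction date are all made up for illustration; none of this comes from the Trifacta demo or its data.

```python
from datetime import date

# Hypothetical records: (start_date, end_date).
# An end_date of None means the end event has not been observed yet.
records = [
    (date(2014, 1, 1), date(2014, 1, 11)),
    (date(2014, 2, 1), None),
    (date(2014, 3, 1), date(2014, 3, 31)),
]

# Assumed date on which the data were extracted.
as_of = date(2014, 9, 1)


def duration_zero_fill(start, end):
    """The shortcut: a missing end date becomes a duration of zero days."""
    return (end - start).days if end is not None else 0


def duration_censored(start, end, as_of):
    """Keep the information: return (days observed so far, censored flag)."""
    if end is None:
        return (as_of - start).days, True
    return (end - start).days, False


zero_filled = [duration_zero_fill(s, e) for s, e in records]
censored = [duration_censored(s, e, as_of) for s, e in records]

print(zero_filled)
print(censored)
```

With zero-filling, the second record looks like the shortest duration in the file. With the censoring flag, it is the longest one observed so far, and any analysis downstream can decide how to handle it instead of silently averaging in a zero.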
I suggest that Trifacta hire a statistician to expand the list of tasks that need to be tackled. This is a good product that can be made great.
***
Also, the New York Times recently wrote about the "unsexy" part of the job. (link)
Here's my take from a few years back (link).
Plus, I have more in the Epilogue of Numbersense (link).
You nailed the problem! Data scientists and others need help finding the correct version of the correct file. Here's one startup that is addressing that problem with a data inventory for Hadoop: www.waterlinedata.com.
Posted by: Denise Sparks | 09/15/2014 at 02:06 PM