During my vacation, I had a chance to visit Trifacta, the data-wrangling startup I blogged about last year (link). Wei Zheng, Tye Rattenbury, and Will Davis hosted me and showed me some of the new things they are working on. Trifacta is tackling a major Big Data problem, and I remain excited about the direction they are heading.
From the beginning, I have been attracted to Trifacta’s user interface. The user in effect assembles the data-cleaning code through visual exploration, guided by suggestions based on past behavior.
Here are some improvements they have made since I last wrote about the tool:
Handling numeric data - Trifacta now generates some more advanced statistics about the columns, e.g. percentiles, in the Visual Profiler, whereas in the past every column was summarized only as a histogram. I believe there is also some binning functionality.
Moving beyond Top N - I ranted about Top N thinking in the past (link), and I wasn’t happy that the Trifacta demo seemed to encourage this bad practice. I’m happy that the team heard the complaint and now offers a Random N selection. Eventually, I think Random N should be the default; I don’t know why anyone would want to see Top N.
Interactive workflow - Random N is a big step forward, but in the world of data cleaning, it’s not sufficient. The reason is that many data-quality problems are rare cases that don’t show up in a random sample. To deal with this, Trifacta has created an interactive workflow. Through the visual exploration paradigm, the software prepares a set of data-cleaning code; when the user applies the code to the entire data set, the tool automatically checks for further anomalies and reports them to the user. For instance, there may be a handful of email addresses with unusual structures that were not found in the random sample and thus fall outside all of the data-wrangling rules. These are flagged for further treatment.
Column metadata - Another exciting development is the expanded use of metadata associated with columns. Such metadata is a major difference between an Excel spreadsheet and any sophisticated data table. For instance, the user can now associate labels with values within a column.
New file formats - Trifacta handles many new data formats like JSON. It can, for example, accept a JSON file and parse the nested structure into columns. Very nice addition!
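To make the Random N and interactive workflow items above concrete, here is a minimal Python sketch. It is my own illustration, not Trifacta's code: the email pattern, the function names `random_n` and `flag_anomalies`, and the made-up data are all assumptions chosen for the example. The idea is simply that a rule which looks fine on the first 100 rows, or on a random 100 rows, can still miss rare cases that only a pass over the full data will surface.

```python
import random
import re

# Hypothetical rule a user might settle on after browsing a sample.
# It is a deliberately simple pattern, not a complete email validator.
EMAIL_RULE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def random_n(rows, n, seed=0):
    """Return n rows drawn uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))


def flag_anomalies(rows, rule):
    """Return (position, value) for every value the rule fails to match."""
    return [(i, v) for i, v in enumerate(rows) if not rule.match(v)]


def make_demo_rows(total=10_000):
    """Made-up data: well-formed addresses plus two rare oddities."""
    rows = [f"user{i}@example.com" for i in range(total)]
    rows[1234] = "user1234 at example dot com"
    rows[8765] = "user8765@@example"
    return rows


rows = make_demo_rows()
top_n = rows[:100]            # Top N: only the first rows of the file
sample = random_n(rows, 100)  # Random N: a uniform random sample

print("Top N anomalies:   ", flag_anomalies(top_n, EMAIL_RULE))
print("Random N anomalies:", flag_anomalies(sample, EMAIL_RULE))
print("Full-pass anomalies:", flag_anomalies(rows, EMAIL_RULE))
```

With two bad rows out of 10,000, a random sample of 100 will usually contain neither of them, which is why the check against the entire data set matters: it is the only step guaranteed to report both.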
I think Trifacta can gain ground by pushing the envelope on two fronts: more and better visual cues to help users diagnose data-quality problems; and more sophisticated recipes for how to handle such problems, informed by a knowledge base of past user behavior.