
Comments


Zach

Next time try mergic! It's pretty cool for these sorts of messy, real-world string matching tasks: https://github.com/ajschumacher/mergic

Kaiser

Zach: I went to a talk about mergic recently. How does it deal with merging columns other than the match key?

The larger issue is whether this is a problem of tools. Ideally, if there is a tool, I'd like it to solve the problem completely. Tools like mergic solve part of the problem; then, the analyst must review the output to solve the rest of it. The time invested in learning the new tool often doesn't pay off.

But I'm being too demanding, I think. The underlying issue is incomplete information. There isn't a clear-cut answer to whether Scott in "Scott Lewis" is a first or last name, so there is no hope that a tool can give a sure answer. Instead, tool developers give probabilities, and people then have to interpret those probabilities and make decisions.

The other challenge is that the issues are not known in advance. Solutions often solve one problem while exacerbating a different one.
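
To make the point about probabilities concrete, here is a minimal sketch that uses Python's standard `difflib` to score two names and then sorts the pair into "match", "no match", or "review". The function names and the two thresholds are assumptions chosen for illustration; they are not taken from mergic or any other tool.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-1 character similarity ratio between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def triage(a: str, b: str, accept: float = 0.9, reject: float = 0.5) -> str:
    """Turn a similarity score into a decision.

    The accept/reject thresholds are arbitrary assumptions; anything that
    falls between them is handed to a person.
    """
    score = similarity(a, b)
    if score >= accept:
        return "match"
    if score < reject:
        return "no match"
    return "review"


if __name__ == "__main__":
    pairs = [
        ("Scott Lewis", "scott lewis"),
        ("Scott Lewis", "Scott Louis"),
        ("Scott Lewis", "Lewis Scott"),
    ]
    for left, right in pairs:
        print(left, "|", right, "|", round(similarity(left, right), 3), triage(left, right))
```

The score says only how alike two strings look. Where to put the thresholds, and what to do with the pairs in the middle band, is still the analyst's call.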

Zach

Mergic only creates the match key. After joining the match key to each of the parent tables, you do the join as you normally would. This means you can run mergic on multiple pairs of columns between the 2 tables, join each match key to both tables, and then merge the 2 tables on multiple keys in the final step.

What I personally like about mergic is that it explicitly includes a "human makes edits" step. The algorithm does 80% of the work (grabbing all the easy cases), and you can focus your manual work on the remaining 20%, cases like "Scott Lewis" vs. "Lewis Scott."
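
The workflow described above (attach the match key to each parent table, then join on it) can be sketched in plain Python. The tables, the names in them, and the match-key mapping below are made up for illustration, and the mapping is written by hand rather than produced by mergic, so this shows only the join step and not the tool's own interface.

```python
# Hypothetical parent tables whose name columns do not agree exactly.
donors = [
    {"name": "Scott Lewis", "gift": 100},
    {"name": "Ann Lee", "gift": 250},
]
attendees = [
    {"name": "Lewis, Scott", "seat": "A1"},
    {"name": "Ann  Lee", "seat": "B2"},
]

# Hypothetical match-key table: every original spelling mapped to one key.
# In practice this would come from a matching tool followed by manual edits.
match_key = {
    "Scott Lewis": "scott lewis",
    "Lewis, Scott": "scott lewis",
    "Ann Lee": "ann lee",
    "Ann  Lee": "ann lee",
}


def attach_key(rows, key_table):
    """Return copies of the rows with a 'key' field looked up from 'name'."""
    return [{**row, "key": key_table.get(row["name"])} for row in rows]


def join_on_key(left, right):
    """Inner-join two keyed tables on 'key'; rows without a key are dropped."""
    right_by_key = {}
    for row in right:
        if row["key"] is not None:
            right_by_key.setdefault(row["key"], []).append(row)
    joined = []
    for l_row in left:
        if l_row["key"] is None:
            continue
        for r_row in right_by_key.get(l_row["key"], []):
            joined.append({
                "key": l_row["key"],
                **{f"left_{k}": v for k, v in l_row.items() if k != "key"},
                **{f"right_{k}": v for k, v in r_row.items() if k != "key"},
            })
    return joined


merged = join_on_key(attach_key(donors, match_key), attach_key(attendees, match_key))

if __name__ == "__main__":
    for row in merged:
        print(row)
```

Because the key lives in its own table, the manual corrections stay separate from the source data, and the same pattern can be repeated for a second pair of columns before the final merge.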

Zach

Here's an excellent mergic tutorial:
https://github.com/ajschumacher/mergic/tree/master/tennis

Nate

Kaiser - welcome to the big bad world of data quality, where people think computers can perform magic! Did you consider the "Jr." or "III" suffixes as well? No tool can handle bad data. This is one of many reasons why, for all the hype about "big data", it's small data that matters. If the developers had used an RDBMS with constraints, had determined how to identify people (email address in both places, etc.)...

Also, a lot of older folks share an email account (usually the one their ISP sets up for them). So even email can't be relied upon to reach a single person.


Kaiser Fung. Business analytics and data visualization expert. Author and Speaker.