It appears that the IT world today considers spam detection to be a solved problem. One doesn't hear many complaints about spam. Spam used to infiltrate only our email; it now appears in many formats, including spam websites that adulterate search engine results and spam comments that pollute blogging communities.
Many methods have been deployed to fight spam. Fighting spam is a statistical problem: algorithms try to predict whether each incoming email or comment is spam or not.
Bayesian spam filters, which determine what features of a piece of text are predictive of spam, are popular. The buttons that allow users to "report spam" or "report not spam" are integral to such algorithms, enabling them to learn from experience. Blacklists that block IP addresses are another method. A blacklist is a type of heuristic, essentially a set of fixed rules.
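To make the "learn by experience" idea concrete, here is a minimal sketch of a word-level naive Bayes filter in Python. It is a toy under stated assumptions, not how Typepad or any commercial service works: the class name `NaiveBayesSpamFilter`, the `report` method (standing in for the "report spam" / "report not spam" buttons), and the sample comments are all made up for illustration.

```python
import math
import re
from collections import Counter


class NaiveBayesSpamFilter:
    """Toy word-level naive Bayes classifier with add-one smoothing.

    Hypothetical illustration only; real filters use many more features.
    """

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Lowercase and split into simple word tokens.
        return re.findall(r"[a-z0-9']+", text.lower())

    def report(self, text, label):
        """Record one user report; label is 'spam' or 'ham' (not spam)."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(self.tokenize(text))

    def spam_probability(self, text):
        """Estimate P(spam | text) from the reports seen so far."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Smoothed class prior.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            n_words = sum(self.word_counts[label].values())
            for word in self.tokenize(text):
                count = self.word_counts[label][word]
                # Add-one smoothing, with one extra slot for unseen words.
                score += math.log((count + 1) / (n_words + len(vocab) + 1))
            log_scores[label] = score
        # Normalize the two log scores into a probability.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])


spam_filter = NaiveBayesSpamFilter()
spam_filter.report("cheap viagra buy now", "spam")
spam_filter.report("cheap jordans sale buy shoes", "spam")
spam_filter.report("interesting point about false positives", "ham")
spam_filter.report("I agree with the analysis in this post", "ham")

print(spam_filter.spam_probability("buy cheap shoes now"))
print(spam_filter.spam_probability("interesting analysis"))
```

Each new report shifts the word counts, so the same comment can score differently after users flag more examples. That is the sense in which the buttons let the filter learn.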
***
I have been bothered by Typepad's seeming ineptitude in spam filtering for a long time. Certain comments that are obviously spam do not get filtered automatically. These comments appear to be extremely easy to identify via simple heuristics. For example: if the name of the person writing the comment is the name of a pharmaceutical drug or a brand of shoes or fashion, classify the comment as spam.
Imposing such rules will get rid of the following types of spam comments that recently have been "false negatives" under Typepad's so-called spam filtering product:
Cheap Jordans said:
If a thing is worth doing it is worth doing well.
The blog recently got flooded with 55 spam comments in a single day, coming from "louboutin", "louboutin shoes", and "christian louboutin sale". I just don't understand how any spam filtering algorithm worth its name can miss such obvious cases. It is extremely unusual for someone to have the first name louboutin and the last name shoes. If you do have such a name, I'd argue (and sorry to be mean) that your comment deserves to be expunged.
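As a rough illustration, a name-based rule of this kind could look like the Python sketch below. The term list and the function name `name_looks_like_spam` are hypothetical, built only from the commenter names quoted in this post; a real list would need curation.

```python
import re

# Hypothetical brand/drug terms, taken from the spam names quoted in this post.
SUSPICIOUS_NAME_TERMS = {"viagra", "jordans", "louboutin"}


def name_looks_like_spam(author_name, terms=SUSPICIOUS_NAME_TERMS):
    """Return True if any whole word in the commenter's name is a listed term."""
    tokens = re.findall(r"[a-z]+", author_name.lower())
    return any(token in terms for token in tokens)


for name in ["Cheap Jordans", "christian louboutin sale", "cheap viagra",
             "Michael Jordan", "Andrew Gelman"]:
    print(name, "->", name_looks_like_spam(name))
```

Matching whole words rather than substrings is a deliberate choice here: it keeps "Michael Jordan" from tripping the "jordans" rule. Generic words such as "cheap", "sale", or "shoes" are left out of the list because they are more likely to catch legitimate commenters.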
***
Every time I provide feedback to Typepad, their IT staff recommends that I manually block those IP addresses (the blacklist method). It appears that they don't see this as an obvious failure. They certainly don't seem to realize that by the time I block an address, the failure to identify the spam has already occurred. So I'm mystified. If you know anything about spam filters, please address these questions:
- Is it operationally difficult to add some heuristics to the algorithm?
- Is there something wrong with rules that assume no one is named "viagra" or "cheap viagra" or "cheap jordans"?
- Are there any new developments in spam filtering worth our attention?
Spam is a solved problem ... unless you're on Typepad. There are great services out there like Akismet and Mollom that take care of spam entirely, without even bugging the user with CAPTCHAs. I haven't seen a single spam comment in a year if not more (thanks to Mollom, in my case). You should check if you can add one of these services to your blog here.
Adding rules manually may sound obvious when you get a number of similar spam comments, but in practice it's not feasible. There are so many different things that spammers post about, and so many variations that the only reasonable defense is to have something that notices similarities across lots of websites. Also, adding some brand names is one thing, but once you add words like shoes and jewelry and drugs and lots of synonyms, you start creating false positives.
Posted by: Robert Kosara | 05/20/2011 at 08:55 AM
Agreed. Unfortunately, it's sometimes hard to filter on names. For example "Medford Risk Management" sounds legit, but is actually spam. There seems to be a subtle combination of name, website, and content that enables an observant human to detect spam, whereas algorithms might miss it.
And, of course, no IT staff wants to deploy an algorithm that has false positives (blocking valid comments) since that angers the users. But as you point out in your book, balancing the false positives and false negatives is a tricky business that necessarily requires trade-offs.
Posted by: Rick Wicklin | 05/20/2011 at 09:00 AM
My Movable Type blog gets tons of spam that's not caught by the filter (and, of course, the filter sometimes traps real comments). I'm switching to Wordpress.
Posted by: Andrew Gelman | 05/21/2011 at 04:35 PM
I have a Typepad blog and long ago I stopped expecting any results from their filters (if they have any). I've been manually blocking for a long time and gave up on getting any results from sending suggestions to Typepad. I got one reply to a suggestion that essentially said "Too bad, that's just the way it is."
Posted by: Ron's Log | 05/25/2011 at 12:38 AM
Ron: I have also voiced my discontent many times, so I thought maybe I could shame them into doing something about it. Apparently not. On Twitter, they told me to block those IP addresses manually, which is the same robotic response they give every time I point this out.
Since Robert tells us spam filtering is a solved problem, how difficult is it for Typepad to buy the technology to resolve this issue? The fact that they haven't is an indication that they don't care. This attitude is shocking since unlike others, they charge money for their software.
One day, I'll take Andrew's advice and quit.
Posted by: Kaisr | 05/26/2011 at 12:31 AM
The only way is to restrict the comment form to Facebook accounts and other social networks that offer a login API that can be integrated with your blog. Then, when spammers come to your blog to leave a comment, they'll think twice. They can't put their link in your comment form, only the profile page of their social site.
Posted by: vimax | 05/27/2011 at 12:15 AM
Originally posted at : http://www.iipmthinktank-kksrivastava.com/slumbering-smugness-eats.html
Of course some realise their mistake and attempt mid-course correction – Mercedes (now following lower price high volume strategy), Star bouquet of channels (which is regaining its number one position through revamped programming and positioning, after it had lost to Colors), Coke (which realised that every experiment to undermine Thums Up was actually benefiting Pepsi more than Coke), and many others.
So, let complacency not consume your brand. You may not get a second chance to regain lost ground.
Posted by: manjeet kumar | 07/23/2011 at 06:55 PM
Comprehensive! And very well researched. Thanks for sharing informative information.
Posted by: Tom Albas | 08/16/2011 at 03:43 AM