It appears that the IT world today considers spam detection to be a solved problem. One doesn't hear many complaints about spam. Spam used to infiltrate only our email; it now appears in many formats, including spam websites that adulterate search engine results and spam comments that pollute blogging communities.
Many methods have been deployed to fight spam. Fighting spam is a statistical problem: algorithms try to predict whether each incoming email or comment is spam or not.
Bayesian spam filters, which determine what features of a piece of text are predictive of spam, are popular. The buttons that allow users to "report spam" or "report not spam" are integral to such algorithms, enabling them to learn by experience. Blacklists that block IP addresses are another method. A blacklist is a type of heuristic, essentially a set of fixed rules.
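To make the "learn by experience" idea concrete, here is a minimal sketch of a word-based naive Bayes filter with add-one smoothing. It is a toy illustration, not how Typepad or any particular product works; the class name, the `report` method (standing in for the "report spam" / "report not spam" buttons), and the training comments are all made up for the example.

```python
import math
import re
from collections import Counter


class NaiveBayesSpamFilter:
    """Toy naive Bayes text classifier (hypothetical, for illustration only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Lowercase the text and split it into simple word tokens.
        return re.findall(r"[a-z']+", text.lower())

    def report(self, text, label):
        """Record user feedback; label is 'spam' or 'ham' (not spam)."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(self.tokenize(text))

    def spam_log_odds(self, text):
        """Return log P(spam | text) - log P(ham | text)."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in ("spam", "ham"):
            # Smoothed prior probability of the class.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            for word in self.tokenize(text):
                count = self.word_counts[label][word]
                # Add-one smoothing, with one extra slot for unseen words.
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            scores[label] = score
        return scores["spam"] - scores["ham"]

    def is_spam(self, text):
        return self.spam_log_odds(text) > 0


# Made-up feedback, as if users had pressed the report buttons.
spam_filter = NaiveBayesSpamFilter()
spam_filter.report("cheap louboutin shoes sale", "spam")
spam_filter.report("buy cheap viagra online", "spam")
spam_filter.report("interesting point about the regression chart", "ham")
spam_filter.report("the chart in this post is misleading", "ham")

print(spam_filter.is_spam("cheap shoes sale online"))   # True
print(spam_filter.is_spam("nice chart in this post"))   # False
```

Each button press only updates word counts, so the filter's notion of which words predict spam shifts as feedback accumulates. A fixed blacklist has no such mechanism.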
I have been bothered by Typepad's seeming ineptitude in spam filtering for a long time. Certain comments that are obviously spam do not get filtered automatically. These are comments that appear to be extremely easy to identify via some simple heuristics. Namely, if the name of the person writing the comment is the name of a pharmaceutical drug or a brand of shoes or fashion, classify the comment as spam.
Imposing such rules will get rid of the following types of spam comments that recently have been "false negatives" under Typepad's so-called spam filtering product:
Cheap Jordans said:
If a thing is worth doing it is worth doing well.
The blog recently got flooded with 55 spam comments on a single day, coming from "louboutin", "louboutin shoes", and "christian louboutin sale". I just don't understand how any spam filtering algorithm worth its name can miss such obvious cases. It would be extremely unusual for someone to have the first name louboutin and the last name shoes. If someone does have such a name, I'd argue (and sorry to be mean) that their comment deserves to be expunged.
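The kind of rule I have in mind could be sketched in a few lines. This is only an illustration under my own assumptions: the keyword list and the function name `name_looks_like_spam` are hypothetical, and a real filter would need a maintained list of drug and brand names.

```python
import re

# Hypothetical keyword list of drug names, brand names, and sales terms.
COMMERCIAL_KEYWORDS = {"viagra", "louboutin", "jordans", "shoes", "sale", "cheap"}


def name_looks_like_spam(commenter_name, keywords=COMMERCIAL_KEYWORDS):
    """Return True if any word in the commenter's name is a listed keyword."""
    tokens = re.findall(r"[a-z0-9]+", commenter_name.lower())
    return any(token in keywords for token in tokens)


# Commenter names taken from the spam described above.
for name in ["Cheap Jordans", "louboutin", "louboutin shoes",
             "christian louboutin sale"]:
    print(name, name_looks_like_spam(name))  # each prints True
```

The match is on whole words, so a commenter named "Michael Jordan" is not caught by the "jordans" keyword. How aggressive the keyword list should be is a judgment call for whoever maintains it.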
Every time I provide feedback to Typepad, their IT staff recommends that I manually block those IP addresses (the blacklist method). It appears that they don't see this as an obvious failure. They certainly don't seem to realize that by the time I block an address, the failure to identify the spam has already occurred. So I'm mystified. If you know anything about spam filters, please address these questions:
- Is it operationally difficult to add some heuristics to the algorithm?
- Is there something wrong with rules that assume no one is named "viagra" or "cheap viagra" or "cheap jordans"?
- Are there any new developments in spam filtering worth our attention?