
This issue arose in conversation recently, thus I thought it worth asking here.
Which tools are state of the art these days in statistical spam filtering?
Not the question you asked, but I don't use statistical spam filtering, mainly because my spam filters are all front ends to other servers, and customers would rather the occasional extra spam get through than go to the trouble of doing any training.

My approach (parts of which go against the flow) is:

. Never use a Junk Mail folder. Either deliver the email to the inbox or don't accept it (maybe causing the sender to get an NDR, but that's the responsibility of the sending server). This requires filtering at SMTP time, but that's how I do it anyway.

. I used to do sender callouts: test the sender's email address against the sender's MX. Some people howl with dismay at this idea ("won't somebody _please_ think of the bandwidth/cpu cycles"), but if you look at the big picture it's still a net win. A quick VRFY and then trivially rejecting email because the sender's address isn't valid is _way_ cheaper than the subsequent spam processing to determine that the email was actually spam (especially when using statistical analysis), or missing that it was spam, delivering it to a mailbox somewhere and having the user deal with it. Unfortunately some people still think this is a bad idea (read http://www.backscatterer.org/?target=sendercallouts - it's a hoot!), and doing it gets you blacklisted, so I don't anymore.

. Do recipient callouts. My spam servers are basically just relays that forward to a server somewhere, which is normally Exchange. Verify that the recipient is valid on the target server before doing any further processing.

. Use spamassassin (including RBLs).

. Use greylisting. I wrote my own here that has some smarts about trusting domains (eg bigpond) once a certain number of senders have been seen. I used to greylist for an hour, but now only for 15 minutes, and only for email with a spamassassin score above some threshold. The idea is that if I am the recipient of a new spam run, waiting a bit gives the sender time to get blacklisted.

. The first time an email is seen, only reject it after DATA (except for sender callout failures, which used to be rejected immediately), and keep a copy of the email on the server for a short time. The users have a mechanism to retrieve emails from this quarantine, which is useful when a password reset email is greylisted.

One idea I had for statistical spam filtering was to train based on RBLs: if the sender IP is blacklisted in an RBL (and not in my greylist app's whitelist, which covers most false positives), the email would be trained as 'spam'. I think that part would work, but I'd be concerned about training for non-spam, as plenty of spam comes from non-RBL'd addresses. I thought of using the spamassassin score, so that a really low score could be used to nominate non-spam for training, but that may end up with a filter that is too polarised...

Is there any filtering app that can better detect phishing? So many times I see things like <A HREF="stealyourpassword.com.ru/somebank">www.somebank.com.au</A>, which should be an immediate red flag. I always read emails in plain text so I don't even see the original link, but it's not me I'm concerned about.

James
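The greylisting policy described in the list above could be sketched roughly as follows. This is not the actual greylisting app mentioned in the post, only an illustration. The class name `Greylist`, the score threshold of 3.0, the `trust_after` count of 5, keying on (IP, sender, recipient), and counting only senders that have already passed greylisting towards domain trust are all assumptions made for the example.

```python
import time


class Greylist:
    """Hypothetical sketch: defer high-scoring mail for a short delay."""

    def __init__(self, delay=15 * 60, score_threshold=3.0, trust_after=5,
                 clock=time.time):
        self.delay = delay                      # seconds to keep deferring
        self.score_threshold = score_threshold  # assumed spamassassin cut-off
        self.trust_after = trust_after          # senders needed to trust a domain
        self.clock = clock                      # injectable for testing
        self.first_seen = {}                    # (ip, sender, rcpt) -> timestamp
        self.passed_senders = {}                # domain -> senders that passed

    def check(self, ip, sender, recipient, score):
        """Return 'accept' or 'defer' for one delivery attempt."""
        sender = sender.lower()
        domain = sender.rpartition("@")[2]

        # Only mail scoring above the threshold is greylisted at all.
        if score <= self.score_threshold:
            return "accept"

        # Domains with enough previously passed senders skip greylisting.
        if len(self.passed_senders.get(domain, ())) >= self.trust_after:
            return "accept"

        key = (ip, sender, recipient.lower())
        now = self.clock()
        first = self.first_seen.setdefault(key, now)
        if now - first < self.delay:
            return "defer"  # an SMTP front end would answer with a 4xx here

        # The sender retried after the delay: remember it for domain trust.
        self.passed_senders.setdefault(domain, set()).add(sender)
        return "accept"
```

A real deployment would also need persistent storage and expiry of old entries, and could re-check the RBLs on the retry, since the point of the delay is to give blacklists time to catch a new spam run.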
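The phishing red flag mentioned above (link text that shows one domain while the HREF points at another) can be checked mechanically. The following is a minimal sketch using only the Python standard library, not a description of any existing filtering app. The names `host_of`, `LinkAuditor` and `suspicious_links` are made up for the example, and the hostname matching is deliberately naive: it only strips a leading "www." before comparing.

```python
import re
from html.parser import HTMLParser

# Optional scheme, then something that looks like a dotted hostname.
HOST_RE = re.compile(
    r"^(?:[a-z][a-z0-9+.-]*://)?([a-z0-9-]+(?:\.[a-z0-9-]+)+)", re.I)


def host_of(text):
    """Return the lower-cased hostname if text looks like a URL, else None."""
    match = HOST_RE.match(text.strip())
    return match.group(1).lower() if match else None


def strip_www(host):
    return host[4:] if host.startswith("www.") else host


class LinkAuditor(HTMLParser):
    """Collect (shown_host, actual_host) pairs where the two disagree."""

    def __init__(self):
        super().__init__()
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>
        self.suspicious = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":      # HTMLParser lower-cases tag and attribute names
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            shown = host_of("".join(self._text))
            actual = host_of(self._href)
            # Flag only when the visible text itself looks like a hostname.
            if shown and actual and strip_www(shown) != strip_www(actual):
                self.suspicious.append((shown, actual))
            self._href = None


def suspicious_links(html):
    auditor = LinkAuditor()
    auditor.feed(html)
    auditor.close()
    return auditor.suspicious
```

For the example from the post, `suspicious_links('<A HREF="stealyourpassword.com.ru/somebank">www.somebank.com.au</A>')` reports the pair `("www.somebank.com.au", "stealyourpassword.com.ru")`. Links whose visible text is not a hostname (for example "click here") are ignored, so this catches only the specific trick described, not phishing in general.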