
This issue arose in conversation recently, thus I thought it worth asking here. Which tools are state of the art these days in statistical spam filtering? (Free/open-source and running on Linux are both assumed.)

When I last looked (some years ago), the top options appeared to be, in no particular order:

- Spamassassin (rules + statistical classifier). It works well for many people; the statistical classifier used to be very memory-intensive, and I know people for whom SpamAssassin didn't give accurate results even after training.

- CRM114: this is what I am currently using for my incoming mail. It appears to be in the midst of a rewrite as a library with support for various scripting languages. My initial experiences with it weren't good, but I tried it again several years ago and, this time, it quickly surpassed SpamAssassin when trained to classify my mail.

- Dspam: also has a good reputation and seems to be maintained to some extent. When I last looked at it in detail, a number of years ago, there were plans to add interesting features allowing users to share filters, so that a new user wouldn't have to train it from an empty database and one user's training could affect other users' filters.

There were other projects around, but the above appeared to be the most sophisticated. So it's now 2013... Any changes? Comments?

This issue arose in conversation recently, thus I thought it worth asking here.
Which tools are state of the art these days in statistical spam filtering?
Not the question you asked, but I don't use statistical spam filtering, mainly because my spam filters are all front ends to other servers, and customers would rather the occasional extra spam get through than go to the trouble of doing any training. My approach (parts of which go against the flow) is:

- Never use a Junk Mail folder. Either deliver the email to the inbox or don't accept it (maybe causing the sender to get an NDR, but that's the responsibility of the sending server). This requires filtering at SMTP time, but that's how I do it anyway.

- I used to do sender callouts - test the sender's email address against the sender's MX. Some people howl with dismay at this idea ("won't somebody _please_ think of the bandwidth/cpu cycles"), but if you look at the big picture it's still a net win. A quick VRFY and then trivially rejecting email because the sender's address isn't valid is _way_ cheaper than the subsequent spam processing to determine that the email was actually spam (especially when using statistical analysis), or missing that it was spam and delivering it to a mailbox somewhere and having the user deal with it. Unfortunately there are some people who still think this is a bad idea (read http://www.backscatterer.org/?target=sendercallouts - it's a hoot!), and doing it gets you blacklisted, so I don't any more.

- Do recipient callouts. My spam servers are basically just relays that forward to a server somewhere, which is normally Exchange. Verify that the recipient is valid on the target server before doing any further processing.

- Use spamassassin (including RBLs).

- Use greylisting. I wrote my own here that has some smarts about trusting domains (eg bigpond) once a certain number of senders have been seen. I used to greylist for an hour, but only 15 minutes now, and only for email with a spamassassin score above some threshold. The idea is that by waiting a bit, the sender may get blacklisted in that time if I am the recipient of a new spam run.

- Only reject the email after DATA the first time it has been seen (except for sender callouts, which used to be rejected immediately), and keep a copy of the email on the server for a short time. The users have a mechanism to retrieve emails from this quarantine, which is useful when a password reset email is greylisted.

One idea I had for statistical spam filtering was to train based on RBLs: if the sender IP is blacklisted in an RBL (and not in my greylist app's whitelist, which covers most false positives), the email would be trained as 'spam'. I think that part would work, but I'd be concerned about training for non-spam, as plenty of spam comes from non-RBL'd addresses. I thought of using the spamassassin score, so that a really low score could nominate non-spam for training, but that may end up with a filter that is too polarised...

Is there any filtering app that can better detect phishing? So many times I see things like <A HREF="stealyourpassword.com.ru/somebank">www.somebank.com.au</A>, which should be an immediate red flag. I always read emails in plain text so I don't even see the original link, but it's not me I'm concerned about.

James
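A minimal sketch of that phishing check - flag mail where an anchor's visible text looks like a hostname that doesn't match the host its href actually points at. This is an illustration of the idea only, not a feature of any filter mentioned in this thread, and the class and function names are made up:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class AnchorCollector(HTMLParser):
    """Collect (href, visible text) pairs from an HTML message body."""
    def __init__(self):
        super().__init__()
        self._href = None
        self._text = []
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href", "")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None

def host_of(s):
    """Hostname from a URL or from bare link text like 'www.example.com'."""
    if "://" not in s:
        s = "http://" + s
    host = (urlparse(s).hostname or "").lower()
    return host.removeprefix("www.")

def looks_phishy(href, text):
    """The visible text names one host but the link goes to another."""
    t = host_of(text)
    if "." not in t:          # visible text isn't host-like, nothing to compare
        return False
    h = host_of(href)
    # allow subdomain differences either way; deliberately naive
    return not (h.endswith(t) or t.endswith(h))

body = '<A HREF="stealyourpassword.com.ru/somebank">www.somebank.com.au</A>'
p = AnchorCollector()
p.feed(body)
href, text = p.anchors[0]
assert looks_phishy(href, text)   # the mismatch is the red flag
```

A real filter would also need to handle redirectors, IDN homoglyphs and lookalike domains (evil-somebank.com slips past this suffix check), but the core signal is exactly the one described above.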

On Thu, 2 May 2013, James Harper <james.harper@bendigoit.com.au> wrote:
. Never use a Junk Mail folder. Either deliver the email to the inbox or don't accept it (maybe causing the sender to get an NDR, but that's the responsibility of the sending server). This requires filtering at SMTP time but that's how I do it anyway.
I agree. I currently only run one server with a junk folder (as far as I recall), and that is a "pending" folder for mail which has a challenge-response message sent out (not my choice, I'm just paid to do sysadmin work).
. Use greylisting. I wrote my own here that has some smarts about trusting domains (eg bigpond) once a certain number of senders have been seen. I used to greylist for an hour but only 15 minutes now, and only for email with a spamassassin score above some threshold. The idea being that by waiting a bit the sender may get blacklisted in that time if I am the recipient of a new spam run.
Sounds nice, can you release it under the GPL?

One problem with statistical anti-spam measures is users who blindly put their "spam" into it as training without review. So when (not if) a legitimate message is classified as spam, the statistical system is trained to do that again... -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

On Thu, May 02, 2013 at 11:04:23AM +1000, Russell Coker wrote:
I agree. I currently only run one server with a junk folder (as far as I recall), and that is a "pending" folder for mail which has a challenge-response message sent out (not my choice, I'm just paid to do sysadmin work).
sucks to have to do something so useless and crappy.

years ago (when the challenge-response idiocy was first starting), i wrote some procmail+perl code so that if the message was a C-R request, procmail would pipe it into my perl script, which would extract the confirmation URL and fetch it with lynx or curl or something. i.e. ALL C-R requests were automatically confirmed.

my attitude was that if backscattering bastards are going to outsource their spam checking to me just because my address had been used as the sender by some spamming scumbag, then i was going to make sure THEY got all their spam rather than me having to see it or deal with it for them.

fortunately, challenge-response was only a short-lived fad, and never became common. most people realised it was just backscatter spam that only made the spam problem worse.

i just looked for it now, but can't find it in my procmail scripts directory... i must have deleted it or lost it. i do recall that it was trivial to write, as C-R messages tend to be consistent and easily parsed.

craig -- craig sanders <cas@taz.net.au>
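The lost script is easy enough to re-imagine. A rough sketch of the same trick - not the original, and both the regex and the "first URL in the body is the confirmation link" assumption are mine:

```python
import re
import urllib.request

URL_RE = re.compile(r'https?://\S+')

def confirmation_url(message_body):
    """Pull the first URL out of a challenge-response message body.
    C-R messages tend to be consistent and easily parsed, so a crude
    'first link wins' rule is usually enough."""
    m = URL_RE.search(message_body)
    return m.group(0) if m else None

def auto_confirm(message_body):
    """Fetch the confirmation URL, i.e. confirm on the challenger's behalf."""
    url = confirmation_url(message_body)
    if url:
        urllib.request.urlopen(url)   # the lynx/curl step
```

procmail would pipe the body of any message matching a C-R signature into a tiny wrapper that calls auto_confirm(sys.stdin.read()).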

Thanks to all who have responded so far for the informative and insightful comments. This will certainly help when the issue arises for discussion again. There are no non-technical end-users of machines whose administrators are likely to discuss this with me, and all of the people involved use Linux mailers, so I am spared some of the complexities that have been discussed. As to my own set-up, Postfix rejects most of the spam during the SMTP negotiation (I have various standard checks in place), and CRM114 catches almost all the rest as part of my Procmail configuration. I haven't implemented grey-listing but I know people who have found it very effective. Block lists are also included in my Postfix configuration.

On Thu, 2 May 2013, Jason White <jason@jasonjgw.net> wrote:
There are no non-technical end-users of machines whose administrators are likely to discuss this with me, and all of the people involved use Linux mailers, so I am spared some of the complexities that have been discussed.
If you think that means you won't have users feeding unread messages from their "spam" folder into the statistical training system then you're wrong. I can think of one lead developer of a major Linux software development project who uses unread messages from his "spam" folder to train a Bayesian filter. I tried to explain why this is wrong, but he wasn't interested in learning. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

James Harper <james.harper@bendigoit.com.au> writes:
. Use greylisting. I wrote my own here that has some smarts about trusting domains (eg bigpond) once a certain number of senders have been seen. I used to greylist for an hour but only 15 minutes now, and only for email with a spamassassin score above some threshold. The idea being that by waiting a bit the sender may get blacklisted in that time if I am the recipient of a new spam run.
IIRC we greylist for one second. The fact that they're retrying *at all* shows they're not spammers. We also have to whitelist bigpond :-/

Other things you didn't mention are:

Laying your MXs out like this stops spammers that don't try >1 MX and spammers that try MXes in reverse order:

    10 null-mx.cyber.com.au.         <--- port 25 always closed
    20 mail.cyber.com.au.            <--- one of the middle pair
    30 exetel.cyber.com.au.          <--- ought to always work
    40 tarbaby.junkemailfilter.com.  <--- teergrube

We also use reject_unauth_pipelining to throw away peers if they don't wait for the server's response when they should.

We also use the spamhaus.org DNS RBL.

James Harper <james.harper@bendigoit.com.au> writes:
. Use greylisting. I wrote my own here that has some smarts about trusting domains (eg bigpond) once a certain number of senders have been seen. I used to greylist for an hour but only 15 minutes now, and only for email with a spamassassin score above some threshold. The idea being that by waiting a bit the sender may get blacklisted in that time if I am the recipient of a new spam run.
IIRC we greylist for one second. The fact that they're retrying *at all* shows they're not spammers. We also have to whitelist bigpond :-/
My solution doesn't require whitelisting bigpond, because it sees enough 'good' emails - enough with low spamassassin scores get whitelisted directly - that it sorts itself out within a week or so, probably less. Optus is (was?) the same in that they'd retry from different IP addresses.

My reasoning for greylisting for longer is that a new spam run can take a while to appear on the blacklists and other checksum validation sites, so delaying suspect email helps a bit, although I haven't done any measurement on this in years.
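The trust mechanism described here - stop greylisting a domain once enough distinct senders from it have delivered low-scoring mail - could be sketched like this. The thresholds and the shape of the code are my guesses, not the actual implementation:

```python
from collections import defaultdict

TRUST_SENDERS = 10   # distinct good senders before a domain is trusted (assumed)
SCORE_CUTOFF = 2.0   # spamassassin score below which mail counts as 'good' (assumed)

class DomainTrust:
    def __init__(self):
        # domain -> set of senders seen delivering low-scoring mail
        self.good_senders = defaultdict(set)

    def record(self, sender, sa_score):
        """Remember senders whose mail scored low."""
        domain = sender.rsplit("@", 1)[-1].lower()
        if sa_score < SCORE_CUTOFF:
            self.good_senders[domain].add(sender.lower())

    def should_greylist(self, sender, sa_score):
        """Greylist only suspect mail from domains we don't yet trust."""
        domain = sender.rsplit("@", 1)[-1].lower()
        if len(self.good_senders[domain]) >= TRUST_SENDERS:
            return False   # trusted domain; handles bigpond-style retries from new IPs
        return sa_score >= SCORE_CUTOFF   # only delay mail that already looks suspect
```

The point of keying trust on the domain rather than the connecting IP is exactly the Optus/bigpond case above: retries can come from a different IP, but the envelope domain stays the same.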
Other things you didn't mention are:
Laying your MXs out like this stops spammers that don't try >1 MX and that try MXs in reverse order.
    10 null-mx.cyber.com.au.         <--- port 25 always closed
    20 mail.cyber.com.au.            <--- one of the middle pair
    30 exetel.cyber.com.au.          <--- ought to always work
    40 tarbaby.junkemailfilter.com.  <--- teergrube
I did that in the late 90's, mainly because we were on a crap ISDN connection and Telstra (with no spam protection at all) was our secondary MX, so all the spam just went there. My greylist filter communicates between the primary and the secondary too, so the databases keep in sync.

One addition I have wanted to make for a while is, like your setup above, to track the connections between the MXes. So if I had a setup like yours:

- 10 then 20 = good (maybe reduce the spam score by a bit)
- 20 or 30 without trying 10 first = bad (maybe increase the spam score a bit)
- 40 without 10-30 = bad (maybe add to a blacklist score in the greylist database)

That by itself would be easy enough to implement given that I already communicate between them, but it's the exceptions that make it hard:

1. some MXes remember that the primary is down, so they go straight to the secondary for a while until the negative cache entry times out
2. what if 10 is broken, so I don't see that it hit 10 first and then 20?
3. what if 10-30 are all unreachable?

MXes that violate the standards are the main frustration I'm seeing. I'd love to say that "people who violate RFCs get what they deserve", but when the RFC violators are big companies like Telstra (for example; I think they've been pretty good lately though), your users aren't interested in detailed explanations about standards and why sticking to them is a good idea - they just want their email.
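The three rules above could be sketched as a score adjustment over the sequence of MX preferences a given client was seen to try. A hypothetical helper; the magnitudes are placeholders, not tuned values:

```python
def mx_path_score_delta(prefs_tried):
    """Spam-score adjustment from the ordered list of MX preferences a
    client connected to. Layout assumed: 10 = always closed primary,
    20/30 = real servers, 40 = teergrube."""
    if not prefs_tried:
        return 0.0
    if prefs_tried[0] == 10 and 20 in prefs_tried:
        return -1.0   # tried the (closed) primary first, then fell back: good
    if prefs_tried[0] in (20, 30):
        return +1.0   # skipped the primary entirely: suspicious
    if 40 in prefs_tried and not any(p in prefs_tried for p in (10, 20, 30)):
        return +3.0   # went straight for the teergrube: almost certainly a spambot
    return 0.0
```

The exceptions listed above (negative caching, a genuinely broken primary) are exactly why these deltas should nudge a score rather than reject outright.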
We also use reject_unauth_pipelining to throw away peers if they don't wait for the server's response when they should.
Yes, not waiting for a response is a big giveaway that you're talking to a spambot! James
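For reference, in Postfix that check is a single restriction. A hedged sketch of where it typically goes - the parameter is real, the placement is just the common recommendation:

```
# main.cf: reject clients that send SMTP commands without waiting for our
# responses (pipelining is only legal after ESMTP PIPELINING is offered)
smtpd_data_restrictions = reject_unauth_pipelining
```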

James Harper wrote:
James Harper <james.harper@bendigoit.com.au> writes:
Use greylisting. I wrote my own here that has some smarts about trusting domains (eg bigpond) once a certain number of senders have been seen. I used to greylist for an hour but only 15 minutes now, and only for email with a spamassassin score above some threshold. The idea being that by waiting a bit the sender may get blacklisted in that time if I am the recipient of a new spam run.
IIRC we greylist for one second. The fact that they're retrying *at all* shows they're not spammers. We also have to whitelist bigpond :-/
My solution doesn't require whitelisting bigpond, because it sees enough 'good' emails - enough with low spamassassin scores get whitelisted directly - that it sorts itself out within a week or so, probably less. Optus is (was?) the same in that they'd retry from different IP addresses.
Cool; I wasn't entirely sure that's what you were saying. Since it's so, I'd be interested in details/source code.

Jason White <jason@jasonjgw.net> writes:
CRM114: this is what I am currently using for my incoming mail. It appears to be in the midst of a rewrite as a library with support for various scripting languages. My initial experiences with it weren't good, but I tried it again several years ago and, this time, it quickly surpassed SpamAssassin when trained to classify my mail.
[Not really helpful for your question, but I'll brain-dump what I have.]

This is what we've been using. Because we have mutt users, and mutt doesn't implement IMAP COPY, it's impossible to trigger it via a dovecot hook. So we run it nightly and pass it find -type f -mtime 3 or thereabouts (so that users have a couple of days to classify the message). It was working MUCH worse before I changed -mtime +3 to -mtime 3, because in the old case it was retraining every day on old emails, so it ended up being VERY VERY certain about things it shouldn't.

I don't remember why we picked crm114, but we're deliberately using it only for managers -- the engineers don't get spam in the first place, so it avoids having to piss about checking =Spam occasionally for false positives. RSN I'll write down the various pre-body things we do to reduce spam. We're also running it on an Ubuntu 10.04 stack, so we're not up to speed with recent developments. :-)
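The -mtime subtlety is worth spelling out: with GNU find, -mtime 3 matches files whose age in whole days is exactly 3, while -mtime +3 matches everything older, so a nightly +3 run re-feeds the same old mail to the trainer night after night. A small demonstration (GNU find/touch assumed; the actual crm114 invocation is elided):

```shell
cd "$(mktemp -d)"
touch -d '80 hours ago' three-days-old   # age ~3.3 days -> whole-day age of 3
touch -d '10 days ago'  ten-days-old

find . -type f -mtime 3    # matches only ./three-days-old: each message trains once
find . -type f -mtime +3   # matches ./ten-days-old -- and would match it again
                           # every following night: the repeat-training trap
```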

On 01/05/13 17:58, Jason White wrote:
So it's now 2013... Any changes? Comments?
Last time I rebuilt my mail server I let my spam filtering configuration get very simple; it's now:

1. Postfix's internal sanity checks (eg, real hostname, real address, etc.)
2. A custom access list I wrote many years ago (essentially a DUL)
3. A set of access lists equivalent to SPF for major webmail domains
4. A few RBLs: Spamhaus Zen and the Barracuda RBL (almost all are blocked by the first, very few by the latter)

That combination results in almost as little spam as a full spamassassin setup used to, and has the benefit of rejecting everything at SMTP time, not elsewhere. The only maintenance required is some very occasional (~1/year) work to delete now-obsolete entries from the custom access list. -- Julien Goodwin Studio442 "Blue Sky Solutioneering"
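Item 4 in Postfix terms might look like the fragment below. A sketch only: the RBL names are the ones mentioned, the surrounding restrictions are illustrative, and list order matters (Zen first, so the Barracuda lookup only fires for the few that get past it):

```
# main.cf (fragment)
smtpd_recipient_restrictions =
    permit_mynetworks,
    reject_unauth_destination,
    reject_rbl_client zen.spamhaus.org,
    reject_rbl_client b.barracudacentral.org
```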

On Wed, 1 May 2013 05:58:31 PM Jason White wrote:
Which tools are state of the art these days in statistical spam filtering?
Not really state of the art, but I'm using Maia Mailguard (basically amavisd-new, SpamAssassin and ClamAV with a web front end). I occasionally write custom rules for SA. Works for me and Donna. ;-) cheers, Chris -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP
participants (9)
-
Chris Samuel
-
Craig Sanders
-
James Harper
-
Jason White
-
Julien Goodwin
-
Julien Goodwin
-
Russell Coker
-
Trent W. Buck
-
trentbuck@gmail.com