[Techtalk] spam filters

Tue Oct 24 23:00:20 UTC 2006

Maria McKinley writes:
> I am currently using spamassassin to filter spam on our mail server.
> Lately it has not been doing a very good job. I keep trying to update 
> it, but I always seem to be at the latest release. Does anyone have any 
> suggestions? Either an alternate spam filter or some secret knob to turn 
> to make it work better? ;-) I have the spam level at 5 right now, but I 
>  don't think turning it further down would help much, because most of 
> the spam getting through seems to be a 2.5 or less (and a depressing 
> amount of them are at 0!). Any advice? Has anyone tried turning it down 
> as low as a 2? What sort of false positives do you get?

I switched to spamassassin recently. Previously I'd used a mostly
homegrown set of procmail filters which made decisions based on
pattern matches in the subject, from, to/cc, content-type, etc.
That worked but I was constantly needing to add new patterns
and that got really tiresome (and the list of matches it had
to check for every mail message was huge).

With the default settings, I found that spamassassin filtered
almost nothing. I would get messages like
"Subject: ENLARGE YOUR PEN!S"
and they would score zero! The only messages that got any score
were ones that matched a blacklist, like sorbs or spamcop. (It checks
quite a few of them by default, at least on Ubuntu. I'm not sure how
I feel about that, but at least it isn't completely rejecting
messages based on blacklists, like some sites do.)

I lowered the thresholds a lot: instead of sending messages with
15 or more asterisks to the "almost-certainly" folder, I cut that
down to 8 and have never seen a false positive there (though I've
mostly stopped checking that folder; I only use it for training).
I set the score at which mail gets tagged as spam (and moved to the
"probably" folder) to 3.4, and I almost never see a false positive
there either.
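
To make the two cutoffs concrete, here is a rough Python sketch of
the sorting. It is only an illustration: the real sorting happens on
the headers spamassassin adds, the function and folder names are made
up for the example, and it assumes the asterisk count is the whole-number
part of the score.

```python
ALMOST_CERTAIN_STARS = 8   # lowered from 15
REQUIRED_SCORE = 3.4       # score at which mail is tagged as spam


def route(score):
    """Return the folder a message with this score would land in."""
    # Assumption: one asterisk per whole point of score, never negative.
    stars = max(0, int(score))
    if stars >= ALMOST_CERTAIN_STARS:
        return "almost-certainly"
    if score >= REQUIRED_SCORE:
        return "probably"
    return "inbox"


for s in (2.1, 3.4, 7.9, 8.0, 15.2):
    print(s, route(s))
```

Anything scoring between 3.4 and just under 8 lands in "probably",
which is the folder that still needs a human look.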

But I was still seeing lots of spam in my inbox. I edited
.spamassassin/user_prefs to adjust some of the default scores:

score DRUGS_ERECTILE 3
score IMPOTENCE 3

score HTML_SHOUTING3    .3
score HTML_MESSAGE      .1
score HTML_FONT_INVISIBLE       1.5

# The ALL_TRUSTED test seems to trust everybody, including dynamic
# IPs from random ISPs like c-71-201-212-73.hsd1.il.comcast.net.
score ALL_TRUSTED       0

# The autolearning thresholds don't work: it sometimes decides
# to autolearn (wrongly) from messages outside the thresholds.
# Turn autolearning off completely:
bayes_auto_learn        0

Then I trained it on all my ham and spam for a while, and after
about a week it started getting quite a bit better.

But I was still getting tons of Asian-charset spams: none of
spamassassin's default settings seemed to help with that, nor
did the Bayesian training. There were quite a few rules that
looked relevant, so I tried setting them:

score HTML_COMMENT_8BITS        1
score UPPERCASE_25_50           1
score UPPERCASE_50_75           1
score UPPERCASE_75_100          1
score BODY_8BITS                1.7
score HEADER_8BITS              3.2
score SUBJ_ILLEGAL_CHARS        3
score CHARSET_FARAWAY_HEADER    .5
score NONEXISTENT_CHARSET       2
score CHARSET_FARAWAY           3.2
score HTML_CHARSET_FARAWAY      2.5

but none of them made a bit of difference: I was still being flooded
with Asian-charset spams, and none of them ever triggered any of
these rules. Eventually I gave up on ever getting spamassassin to
block those, and added a procmail rule I'd used with my old system:

# Hiding the subject in another charset so it's not filterable:
:0:
*^Subject: =\?..*-..*\?
spam/charsets
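
For anyone wondering what that pattern does and doesn't catch, here is
a small Python sketch of the same regular expression. The sample
subjects are made up for illustration, and the Python spelling is my
translation of the procmail condition ("..*" is "one or more
characters", and procmail matches case-insensitively by default).

```python
import re

# Python spelling of the procmail condition ^Subject: =\?..*-..*\?
ENCODED_SUBJECT = re.compile(
    r"^Subject: =\?.+-.+\?", re.IGNORECASE | re.MULTILINE
)


def is_charset_subject(header_line):
    """True if the Subject starts with an encoded word the rule would catch."""
    return ENCODED_SUBJECT.search(header_line) is not None


# Made-up sample headers, for illustration only.
samples = [
    "Subject: =?euc-kr?B?wM/H2A==?=",
    "Subject: =?iso-2022-jp?B?GyRCJWEhPCVrGyhC?=",
    "Subject: =?iso-8859-1?Q?Caf=E9?=",
    "Subject: =?gb2312?B?1tDOxA==?=",
    "Subject: Rock hard erections",
]

for line in samples:
    print(is_charset_subject(line), line)
```

The pattern keys on a hyphen in the charset name, so it catches
subjects encoded as iso-2022-jp or euc-kr, but it would equally catch
an iso-8859-1 or utf-8 subject, and it usually misses charset names
without a hyphen, such as gb2312. That trade-off may not suit everyone.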

I also added a rule to check for the To line pointing to three names
which spammers have somehow associated with my address (I have no
idea who Lindsay Hursty is or why anyone would think she's at my
address, but I get tons of spam that obviously thinks I'm her,
or two other people I also haven't heard of).

Now, training spamassassin regularly (on ham as well as spam),
my daily spam load looks something like:

  5-10 false negatives landing in Inbox
  50 in "probably" (have to check in case of the rare false positives)
  300 in "almost-certainly" (never seen a false positive there yet)
  100 caught by the charset rule
  75 caught by the duplicates rule

The false negatives are sometimes the "random text with an attached
gif", but sometimes they're subjects you'd really think any sensible
spam filter should have caught, like "w From  Rolex, Cartier,
Breitling, Bvlgari, Omega, Patek Philippe, Tag Heuer, Officine
Panerai, Audemars Piguet, Franck Muller purpose matter" or
"Rock hard erections". I mean, come on, how is it that a spam
system can't recognize those? But the only rules triggered by the
"rock hard" one were:

score=2.1 required=3.4 tests=BAYES_60,HTML_MESSAGE,URIBL_SBL

I really wonder who chooses these rules, sometimes.
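
When chasing false negatives like this, it can help to pull the rule
names and the shortfall out of the status string. A minimal Python
sketch, assuming the "score=... required=... tests=..." layout quoted
above (the function name is made up for the example):

```python
import re


def parse_status(status):
    """Split a 'score=... required=... tests=A,B,C' string into its parts."""
    fields = dict(re.findall(r"(\w+)=(\S+)", status))
    tests = fields.get("tests", "")
    return {
        "score": float(fields["score"]),
        "required": float(fields["required"]),
        # Treat a missing or "none" tests field as no rules triggered.
        "tests": [] if tests in ("", "none") else tests.split(","),
    }


status = "score=2.1 required=3.4 tests=BAYES_60,HTML_MESSAGE,URIBL_SBL"
info = parse_status(status)

# How many more points the message needed to be tagged as spam.
shortfall = info["required"] - info["score"]
print(info["tests"], round(shortfall, 1))
```

Here the message fell about 1.3 points short, and none of the three
rules that triggered look at the subject text at all.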

	...Akkana