Saturday, June 5, 2010

Efficiently learn spam and ham with Dovecot and virtual users

I guess nearly every admin of a webserver knows the article "ISP-style Email Server with Debian-Etch and Postfix 2.3" from workaround.org. I use a similar setup (with different paths) for my private mailserver. I have a number of users and accounts, and on some days I get a pretty high amount of spam. Step 11 of the tutorial shows an example of how to automatically learn spam and ham. But I think one can do this better. There's nothing wrong with the script. It just has some drawbacks:
As my setup is very small, performance is not that important. Also, I can trust all my users, so they won't try to learn ham as spam and vice versa. So my solution looks like this (I used the original path names here):
Okay, what happens here? Lines 7 and 8 define patterns. IGNOREPATTERN means: don't learn mails that are in "trash" folders. That means "trash" and "my/dir/trash" are ignored, so spam and ham can safely be deleted the way Thunderbird & co. do it. Line 8 defines the folders that contain spam. This is my blacklist. Some users prefer "spam", others "junk". Note that "my/dir/junk" would not be found. Lines 9 and 10 wrap the patterns into grep-compatible regexes.

Lines 12-18 and 20-28 are nearly identical. First, /var/vmail is scanned for mails that were created in the last 24 hours. This assumes that the script runs once per day (e.g. as a cron job). Lines 13 to 14 filter out directories that do not contain mails. Folder names may contain whitespace, so a null character is used to separate them. Line 16 returns only spam mails. Note the "-v" in line 24: here, only ham is returned. This is one of the two differences. Next, ignored folders are removed, because we can't say whether they contain spam or ham. After this, xargs is used to call sa-learn. This is a performance boost, since sa-learn is not called separately for each mail. Here is the second difference: in line 19, the parameter is "--spam"; in line 26, it's "--ham".

If a mail was learned as ham and is later classified as spam (or the other way around), SpamAssassin automatically forgets the mail before it re-learns it, so there is no need to do this manually. Last but not least, the correct owner is set.
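As a rough sketch of the pipeline described above (not the original script), the same steps could look like the following. It assumes GNU find, grep and xargs, and a Maildir++ layout in which a folder "my/dir/junk" is stored as a directory named ".my.dir.junk" next to the inbox. The names MAILROOT, SA_LEARN, MAILREGEX and learn_mails, the default DBPATH, and the use of -ctime to approximate "created in the last 24 hours" are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: the paths and names below are assumptions, adjust them to your setup.
MAILROOT="${MAILROOT:-/var/vmail}"
DBPATH="${DBPATH:-/var/lib/amavis/.spamassassin}"   # hypothetical default
SA_LEARN="${SA_LEARN:-sa-learn}"

IGNOREPATTERN='trash'       # folders that are never learned
SPAMPATTERN='spam|junk'     # top-level folders that contain spam

# Wrap the patterns into extended regexes that match the full path of a mail file.
MAILREGEX='/(cur|new)/[^/]+$'
IGNOREREGEX="/\\.([^/]+\\.)?(${IGNOREPATTERN})/(cur|new)/[^/]+\$"
SPAMREGEX="/\\.(${SPAMPATTERN})/(cur|new)/[^/]+\$"

# learn_mails --spam | --ham
learn_mails() {
    case "$1" in
        --spam) invert="" ;;     # keep only paths inside spam folders
        --ham)  invert="-v" ;;   # keep everything that is NOT in a spam folder
        *) echo "usage: learn_mails --spam|--ham" >&2; return 2 ;;
    esac
    # NUL-separated paths survive whitespace in folder names.
    # $invert is intentionally unquoted so that an empty value disappears.
    find "$MAILROOT" -type f -ctime -1 -print0 \
        | grep -z -E "$MAILREGEX" \
        | grep -z -i -E $invert "$SPAMREGEX" \
        | grep -z -i -E -v "$IGNOREREGEX" \
        | xargs -0 -r "$SA_LEARN" --dbpath "$DBPATH" "$1"
}
```

Run daily as root, this would be called as `learn_mails --spam` followed by `learn_mails --ham`, and then something like `chown -R amavis:amavis "$DBPATH"` to restore the owner (the user name amavis is an assumption here). Because xargs passes many file names to each sa-learn call, sa-learn is started only a handful of times instead of once per mail.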
The script runs as root, because it has to read all mails. For me, it's perfect. Perhaps there is someone out there who also likes it.
Update: There are some cases where it's perhaps better not to use this approach. If you have a large number of users, learning from all mails on the server may slow things down. In this case, it's perhaps better to let the users scan their mails individually. A second reason to avoid automatic learning is when different users write mails about totally different concerns. One user may report mails as spam whose content another user would like to read. Thanks to Christoph for this idea.
Update 2: Learning spam and ham is much faster when the --no-sync and --sync parameters are used. Thanks to Alexey Vazhnov for this very valuable approach. Sometimes you should just read the manpage more than once.
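A minimal sketch of that batch pattern, assuming the same kind of setup: every sa-learn call skips the slow synchronization, and the database is synced once at the end. The function names learn_folders and sync_db, the SA_LEARN variable and the default DBPATH are hypothetical; --no-sync, --sync, --dbpath, --spam and --ham are the sa-learn options mentioned in this article.

```shell
#!/bin/sh
# Sketch only: names and the default path are assumptions.
SA_LEARN="${SA_LEARN:-sa-learn}"
DBPATH="${DBPATH:-/var/lib/amavis/.spamassassin}"   # hypothetical default

# learn_folders --spam|--ham DIR...
# Learn each folder without the synchronization step.
learn_folders() {
    mode="$1"
    shift
    for dir in "$@"; do
        "$SA_LEARN" --no-sync --dbpath "$DBPATH" "$mode" "$dir"
    done
}

# Run the synchronization once, after all folders have been scanned.
sync_db() {
    "$SA_LEARN" --sync --dbpath "$DBPATH"
}
```

A call such as `learn_folders --spam /path/to/.Junk/cur` for each spam folder, the same with --ham for the ham folders, and a single `sync_db` at the very end keeps the expensive step out of the loop.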
Comments
Change
    sa-learn --spam
to
    sa-learn --no-sync --spam
and
    sa-learn --ham
to
    sa-learn --no-sync --ham
and add this line to the end of the script:
    sa-learn --sync --dbpath $DBPATH

From "man sa-learn":
    --no-sync
    Skip the slow synchronization step which normally takes place after changing database entries. If you plan to learn from many folders in a batch, or to learn many individual messages one-by-one, it is faster to use this switch and run "sa-learn --sync" once all the folders have been scanned.