Recently I reinstalled my Debian GNU/Linux1 machine and reestablished my mail setup, which uses Postfix as the MTA, Mutt as the email cilent, Procmail for mail sorting and preprocessing, and Bogofilter for spam identification. A key part of the anti-spam setup is enabling bogofilter’s post-identification message text ham/spam classification, so that bogofilter, which presumably guesses correctly most of the time, will teach itself. However, as bogofilter occasionally makes mistakes, I need a mechanism by which I can identify and inform bogofilter when it has misclassified a piece of mail. Per bogofilter’s man page, I created a set of mutt macros to enable reclassification which looked something like:
macro index S "<enter-command>unset wait_keyn
<pipe-entry>bogofilter -Snn
<enter-command>set wait_keyn
<delete-message>"
"mark message as spam when misclassified as ham"
macro index N "<enter-command>unset wait_keyn
<pipe-entry>bogofilter -Nsn
<enter-command>set wait_keyn
<delete-message>"
"mark message as ham when misclassified as spam"
The purpose of the macro is to send the selected message to bogofilter, telling it to unregister all the message tokens as ham and register them all as spam (or spam and ham respectively). However, I soon noticed that bogofilter was having horrible classification success, i.e., it was nearly always wrong! Debugging, primarily using bogoutil, eventually led me to the discovery that mutt’s <pipe-entry> does not send the entire message to the provided process — it only sends the set of headers which are displayed by default to the user (configurable using the ignore and unignore parameters in mutt). This meant that many headers, which bogofilter has incorrectly classified using its self-teaching mechanism, were not being corrected!
I did not see an easy mechanism by which I could fix this (absurdly stupid) pipe-entry behavior, so I decided to solve the problem in a different way. Now, instead of sending the message to bogofilter, I save the message to a folder named spam-false-positives or spam-false-negatives depending on the situation. Then I wrote a cron job which runs every hour and checks to see if any messages are in these folders — if so, I send the entire folder’s contents to bogofilter for reclassification. The reason why this works is that when you save a message to a separate folder, you keep the full headers intact, and bogofilter is able to read them directly from the mbox file.
For reference, my mutt macros now look like:
macro index N "<save-message>=spam-false-positivesny" macro pager N "<save-message>=spam-false-positivesny" macro index S "<save-message>=spam-false-negativesny" macro pager S "<save-message>=spam-false-negativesny"
My cron job script looks like:
#!/bin/sh
FALSE_NEGATIVES=$HOME/.mail/mailboxes/spam-false-negatives
FALSE_POSITIVES=$HOME/.mail/mailboxes/spam-false-positives
if [ -f $FALSE_NEGATIVES ]; then
bogofilter -MNs -I $FALSE_NEGATIVES
rm $FALSE_NEGATIVES
fi
if [ -f $FALSE_POSITIVES ]; then
bogofilter -MSn -I $FALSE_POSITIVES
rm $FALSE_POSITIVES
fi
One advantage of this new setup is that reclassifying wrongly-identified mail is faster than before. While I don’t notice too much difference with bogofilter, as bogofilter is rather quick, I’m sure I would notice an enormous difference using this setup with SpamAssassin, as sa-learn is ungodly slow.
Recent Comments