Archiving Historical Email Using Procmail

To quickly summarize, my email setup is a combination of fetchmail, procmail, postfix, bogofilter, and mutt (with my libESMTP patch) using mbox files on a colocated Sun Fire V100 running OpenBSD/sparc64. For the most part I very much like this setup. In the past I have tinkered with SpamAssassin, maildir, and IMAP; I may end up setting up IMAP again in the future. BTW, I have an old entry describing how to properly integrate bogofilter and mutt.

The purpose of this post is to describe how I archive a copy of every piece of incoming email using these tools. I find these archives extremely valuable, both for the occasional historical search and because they let me be highly aggressive about deleting messages to keep my mailboxes clean. Recently I have been contemplating performing statistical analysis of received spam based on these archives.

The first step to take when setting up a historical email archive is to consider how the email will be stored. For my regular email, I use the system-default mailbox for my inbox (/var/mail/sengelha) and store all my sorted email in mbox files in ~/.mail/mailboxes/. For backups, I keep mail in monthly archives in the mbox file ~/.mail/backups/YYYY/YYYY-MM, where YYYY is the current year and MM is the two-digit month (01 = January and so forth). For all past months I compress the mail using bzip2 -9 to save hard drive space. I used to keep daily archives but per the suggestion of Dan Sachs I recently moved to monthly to achieve better compression ratios. To give you an idea of the sizes that are involved, my October 2004 email archive is 21,947,942 bytes (7,734,053 bytes compressed). Undoubtedly much of this is spam.
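
To illustrate why one monthly file tends to compress better than many daily files, here is a minimal sketch in Python. The messages are synthetic (made up for this example, not taken from my archives), and it uses Python's bz2 module rather than the bzip2 command-line tool, so the exact numbers will differ from real mail.

```python
import bz2


def make_day(day: int, messages: int = 20) -> bytes:
    """Build one synthetic 'daily' mbox chunk (made-up data, for illustration)."""
    parts = []
    for n in range(messages):
        parts.append(
            f"From sender{n}@example.com Mon Oct {day:02d} 12:00:00 2004\n"
            f"Subject: message {n} of day {day}\n"
            "\n"
            "The quick brown fox jumps over the lazy dog.\n\n"
        )
    return "".join(parts).encode("ascii")


days = [make_day(d) for d in range(1, 31)]

# Daily archives: every file is compressed on its own.
daily_total = sum(len(bz2.compress(day, 9)) for day in days)

# Monthly archive: the same bytes compressed as a single stream.
monthly_total = len(bz2.compress(b"".join(days), 9))

print(f"30 daily archives: {daily_total} bytes")
print(f"1 monthly archive: {monthly_total} bytes")
```

Each separately compressed file pays its own fixed overhead and cannot reuse redundancy found in the other files, which is the effect the move to monthly archives takes advantage of.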

My setup has two major parts:

  1. Save a copy of every incoming email to the ~/.mail/backups/YYYY/YYYY-MM file.
  2. Compress all past monthly archives using bzip2 -9.

Save a copy of every incoming email

To save a copy, I must first ensure that the directory ~/.mail/backups/YYYY/ already exists. In fact, I lost some email in January one year because the directory didn’t exist, so the email backups were never saved. I added the following rule to my .procmailrc to ensure the directory always exists:

:0wc
* ! ? test -d $HOME/.mail/backups/`date +%Y`
| mkdir -p $HOME/.mail/backups/`date +%Y`

I then use the following rule to save a copy of every incoming email:

# Make backups of all mail received in format YYYY/YYYY-MM
:0c
$HOME/.mail/backups/`date +%Y`/`date +%Y-%m`
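
For anyone who wants to reuse the same layout outside of procmail (for example, in a script that searches the archives), the naming scheme can be sketched in Python as follows. The function names backup_path and ensure_backup_dir are hypothetical, chosen for this example; only the YYYY/YYYY-MM layout comes from the rules above.

```python
from datetime import date
from pathlib import Path


def backup_path(root: Path, when: date) -> Path:
    """Return root/YYYY/YYYY-MM for the given date."""
    return root / when.strftime("%Y") / when.strftime("%Y-%m")


def ensure_backup_dir(root: Path, when: date) -> Path:
    """Create the year directory if needed (like mkdir -p) and return the mbox path."""
    path = backup_path(root, when)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```

Calling ensure_backup_dir(Path.home() / ".mail" / "backups", date.today()) would create this year's directory if it is missing and return the path of the current month's archive.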

Compress all past monthly archives

I wrote the scripts bz2compressdir and compress-mail-backups (see below) to compress all past monthly email archives. I execute compress-mail-backups nightly using cron. (Monthly would suffice but there is effectively no penalty for running compress-mail-backups too frequently.)

bz2compressdir:

#!/usr/bin/perl -w
#
# bz2compressdir: Compress all files in a directory with bzip2
#
# $Id: bz2compressdir,v 1.2 2004/02/19 05:42:35 sengelha Exp $

use strict;
use File::Find;
use Getopt::Std;

my $usage = <<EOF;
Usage: $0 [options] dir1 [dir2 dir3 ...]

-e <regexp>  exclude <regexp> from files to compress
-h           print usage and exit
EOF
my %opts = ();

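sub handleFile {
    # File::Find sets $_ to the current file's name; skip anything already
    # compressed or matching the -e exclusion pattern.
    if (-f && !/\.bz2$/ && (!$opts{'e'} || !/$opts{'e'}/)) {
        # List form of system() avoids passing the filename through a shell.
        system('bzip2', '-9', $_);
    }
}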

getopts('he:', \%opts) or die $usage;

die $usage if ($opts{'h'});
die $usage if ($#ARGV == -1);

foreach my $dirname (@ARGV) {
    die "$dirname: Not a directory" if (! -d $dirname);

    find(\&handleFile, $dirname);
}

compress-mail-backups:

#!/bin/sh
#
# compress-mail-backups: Go through every mail backup directory,
# compressing with bz2compressdir
#
# $Id: compress-mail-backups,v 1.2 2004/02/19 05:42:35 sengelha Exp $

exec nice bz2compressdir -e `date +%Y-%m` $HOME/.mail/backups/[0-9][0-9][0-9][0-9]
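
The selection logic the two scripts implement together (skip files that are already compressed, skip the current month) can be sketched in Python as follows. This is an illustration of the filtering only, not a replacement for the scripts; the function name files_to_compress is hypothetical, and it only lists candidates rather than invoking bzip2.

```python
import re
from pathlib import Path
from typing import Iterator, Optional


def files_to_compress(root: Path, exclude: Optional[str] = None) -> Iterator[Path]:
    """Yield regular files under root that are not .bz2 and do not match exclude."""
    pattern = re.compile(exclude) if exclude else None
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if path.name.endswith(".bz2"):
            continue  # already compressed
        if pattern and pattern.search(path.name):
            continue  # e.g. the current month's archive, still being appended to
        yield path
```

Because compressed files are skipped, running the job repeatedly is harmless, which is why nightly execution costs effectively nothing.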

Proper integration of bogofilter and mutt

Recently I reinstalled my Debian GNU/Linux[1] machine and reestablished my mail setup, which uses Postfix as the MTA, Mutt as the email client, Procmail for mail sorting and preprocessing, and Bogofilter for spam identification. A key part of the anti-spam setup is having bogofilter register each message’s text as ham or spam once it has classified it, so that bogofilter, which presumably guesses correctly most of the time, teaches itself. However, as bogofilter occasionally makes mistakes, I need a mechanism by which I can tell bogofilter when it has misclassified a piece of mail. Per bogofilter’s man page, I created a set of mutt macros to enable reclassification which looked something like:

macro index S "<enter-command>unset wait_key\n\
<pipe-entry>bogofilter -Ns\n\
<enter-command>set wait_key\n\
<delete-message>" "mark message as spam when misclassified as ham"
macro index N "<enter-command>unset wait_key\n\
<pipe-entry>bogofilter -Sn\n\
<enter-command>set wait_key\n\
<delete-message>" "mark message as ham when misclassified as spam"

The purpose of these macros is to send the selected message to bogofilter, telling it to unregister all the message tokens as ham and register them as spam (or the reverse, respectively). However, I soon noticed that bogofilter’s classifications were horribly unreliable, i.e., it was nearly always wrong! Debugging, primarily using bogoutil, eventually led me to discover that mutt’s <pipe-entry> does not send the entire message to the provided process; it sends only the set of headers that are displayed to the user by default (configurable using the ignore and unignore commands in mutt). This meant that many headers, which bogofilter had incorrectly classified using its self-teaching mechanism, were never being corrected!

I did not see an easy mechanism by which I could fix this (absurdly stupid) pipe-entry behavior, so I decided to solve the problem in a different way. Now, instead of sending the message to bogofilter, I save the message to a folder named spam-false-positives or spam-false-negatives depending on the situation. Then I wrote a cron job which runs every hour and checks to see if any messages are in these folders — if so, I send the entire folder’s contents to bogofilter for reclassification. The reason why this works is that when you save a message to a separate folder, you keep the full headers intact, and bogofilter is able to read them directly from the mbox file.

For reference, my mutt macros now look like:

macro index N "<save-message>=spam-false-positives\ny"
macro pager N "<save-message>=spam-false-positives\ny"
macro index S "<save-message>=spam-false-negatives\ny"
macro pager S "<save-message>=spam-false-negatives\ny"

My cron job script looks like:

#!/bin/sh

FALSE_NEGATIVES=$HOME/.mail/mailboxes/spam-false-negatives
FALSE_POSITIVES=$HOME/.mail/mailboxes/spam-false-positives

# Spam that was filed as ham: unregister as ham, register as spam.
if [ -f "$FALSE_NEGATIVES" ]; then
    bogofilter -MNs -I "$FALSE_NEGATIVES"
    rm "$FALSE_NEGATIVES"
fi

# Ham that was filed as spam: unregister as spam, register as ham.
if [ -f "$FALSE_POSITIVES" ]; then
    bogofilter -MSn -I "$FALSE_POSITIVES"
    rm "$FALSE_POSITIVES"
fi
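
The same check-train-delete cycle can be sketched in Python, which makes the logic easy to exercise without touching a real bogofilter database. The function name retrain and its run parameter are hypothetical, introduced only for this example; the bogofilter flags are the ones used in the script above. Like the shell version, this sketch does not inspect bogofilter's exit status.

```python
import subprocess
from pathlib import Path
from typing import Callable, Sequence


def retrain(folder: Path, flags: str,
            run: Callable[[Sequence[str]], object] = subprocess.run) -> bool:
    """Feed an mbox folder to bogofilter, then delete it.

    Returns True if the folder existed and was processed, False otherwise.
    """
    if not folder.is_file():
        return False
    # -M tells bogofilter the input is an mbox; -I names the input file.
    run(["bogofilter", flags, "-I", str(folder)])
    folder.unlink()
    return True
```

A cron-driven script would call retrain(mailboxes / "spam-false-negatives", "-MNs") and retrain(mailboxes / "spam-false-positives", "-MSn"), where mailboxes is Path.home() / ".mail" / "mailboxes". Passing a different run callable lets the logic be tested without invoking bogofilter at all.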

One advantage of this new setup is that reclassifying wrongly-identified mail is faster than before. While I don’t notice too much difference with bogofilter, as bogofilter is rather quick, I’m sure I would notice an enormous difference using this setup with SpamAssassin, as sa-learn is ungodly slow.

[1] I defer to the silly moniker “GNU/Linux” rather than simply “Linux” as that is how Debian refers to itself.

Aha! A reason why my mail may not be getting through to AOL!

I use mutt as my e-mail reader, with a patch I wrote that makes mutt use libESMTP to send mail. This ensures that my e-mail goes through Yahoo’s e-mail servers, the proper behavior for my Yahoo account. However, it seems that there is a bug somewhere, as my e-mail’s envelope From address is being set to MAILER-DAEMON. I’m not sure if it is doing this for all e-mails, or on all e-mail servers, but I wouldn’t be surprised if this is why my e-mail to AOL addresses isn’t getting through.

Ouch, my brain hurts

OK, so a few days ago I decided that the current IMAP servers out there (namely Courier) were a pain in the ass, so being a maniac, I decided to write my own. I figured if I implemented it in Python, took a minimal approach to security, made the server read-only, and took a few other shortcuts, I could get the server done rather quickly, and I could read my e-mail with any client I wish.

Well, the protocol is complicated enough that it is quite hard to do any serious processing with simple regular expressions. I first tried incrementally building regular expressions (like the following):

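import re

# Definitions are listed bottom-up so each name exists before it is used.
_atom_specials = r"(?:[a-zA-Z0-9/])" # HACK
_ATOM_CHAR = r"(?:%s)" % (_atom_specials)
_atom = r"(?:%s+)" % (_ATOM_CHAR)
# ... (_string and _password are built in the same style; not shown here)
_astring = r"(?:%s|%s)" % (_atom, _string)
_userid = r"(?:%s)" % (_astring)
_loginargs = r"(?:%s %s)" % (_userid, _password)
reloginargs = re.compile(_loginargs)
...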
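
For readers who want to run something, here is a self-contained sketch of the same incremental technique. It is not the code from my server: the _string rule is a simplifying assumption (plain quoted strings only, no escapes and no IMAP literals), the character set for atoms is the same hack as above, and the named groups are added for this example as one way to pull the fields back out of a match.

```python
import re

# Leaf rules first, so every name is defined before it is used.
_ATOM_CHAR = r"[a-zA-Z0-9/]"            # simplified character set (a hack)
_atom = r"(?:%s+)" % _ATOM_CHAR
_string = r'(?:"[^"\\\r\n]*")'          # assumption: quoted strings without escapes
_astring = r"(?:%s|%s)" % (_atom, _string)
_userid = r"(?P<userid>%s)" % _astring
_password = r"(?P<password>%s)" % _astring
_loginargs = r"%s %s" % (_userid, _password)

reloginargs = re.compile(_loginargs)

# Arguments of a LOGIN command: an atom followed by a quoted string.
m = reloginargs.fullmatch('fred "secret word"')
if m:
    print(m.group("userid"), m.group("password"))
```

Even in this tiny example the composed pattern is hard to read once expanded, and every additional rule of the grammar makes it worse.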

This worked reasonably well, except that extracting the data from the matched string was very difficult. I finally decided to bite the bullet and try to use a full parser generator. I initially tried to get reaccustomed to lex/yacc, just to see if it was possible, but I couldn’t quite get my brain around them. Eventually I found an LL(1) parser generator, Yapps (Yet Another Python Parser System), which is written in Python and produces Python code. After looking at the examples for a little bit, along with some experimentation, I was finally able to start on the IMAP grammar file. I spent a few hours, learnt some LL(1) tricks, and I actually have a grammar which understands a good portion of the IMAP protocol (as much as or more than the regular expression-based one, and certainly more RFC compliant). It is especially cool because the grammar closely mirrors the EBNF in the “Formal Syntax” portion of the IMAP RFC.

Since I use various clients (Outlook Express and mutt primarily) to test the current state of the IMAP server (and also to see what commands they use, to minimize the commands I must implement), and I’ve pretty much exhausted simply parsing the commands I know, I must move on to actually acting on a parsed command. This should be fairly simple once I think about the right way to do it.

Well, even if this is a big waste of time, I think I’ve learned quite a bit.
