Archiving Historical Email Using Procmail

To quickly summarize, my email setup is a combination of fetchmail, procmail, postfix, bogofilter, and mutt (with my libESMTP patch) using mbox files on a colocated Sun Fire V100 running OpenBSD/sparc64. For the most part I very much like this setup. In the past I have tinkered with SpamAssassin, maildir, and IMAP; I may end up setting up IMAP again in the future. BTW, I have an old entry describing how to properly integrate bogofilter and mutt.

The purpose of this post is to describe how I archive a copy of every piece of incoming email using these tools. I find these archives extremely valuable: they support the occasional historical search, and they let me be highly aggressive about deleting messages to keep my mailboxes clean. Recently I have been contemplating performing statistical analysis of received spam based on these archives.

The first step to take when setting up a historical email archive is to consider how the email will be stored. For my regular email, I use the system-default mailbox for my inbox (/var/mail/sengelha) and store all my sorted email in mbox files in ~/.mail/mailboxes/. For backups, I keep mail in monthly archives in the mbox file ~/.mail/backups/YYYY/YYYY-MM, where YYYY is the current year and MM is the two-digit month (01 = January and so forth). For all past months I compress the mail using bzip2 -9 to save hard drive space. I used to keep daily archives but per the suggestion of Dan Sachs I recently moved to monthly to achieve better compression ratios. To give you an idea of the sizes that are involved, my October 2004 email archive is 21,947,942 bytes (7,734,053 bytes compressed). Undoubtedly much of this is spam.
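
The naming scheme above can be sketched in shell. The helper name archive_path is hypothetical and only illustrates the layout; it is not part of my actual setup:

```shell
#!/bin/sh
# Sketch: build the monthly archive path ~/.mail/backups/YYYY/YYYY-MM.
# archive_path is a hypothetical helper used only for illustration.
archive_path() {
    # $1 = four-digit year, $2 = two-digit month (01 = January)
    printf '%s/.mail/backups/%s/%s-%s\n' "$HOME" "$1" "$1" "$2"
}

# Print the archive path for the current month.
archive_path "$(date +%Y)" "$(date +%m)"
```

Passing the year and month as arguments, rather than calling date inside the function, keeps the helper usable for past months as well.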

My setup has two major parts:

  1. Save a copy of every incoming email to the ~/.mail/backups/YYYY/YYYY-MM file.
  2. Compress all past monthly archives using bzip2 -9.

Save a copy of every incoming email

To save a copy, I must first ensure that the directory ~/.mail/backups/YYYY/ already exists. In fact, I lost some email one January because the directory didn’t exist, so the backups were never saved. I added the following rule to my .procmailrc to ensure the directory always exists:

:0wc
* ! ? test -d $HOME/.mail/backups/`date +%Y`
| mkdir -p $HOME/.mail/backups/`date +%Y`

I then use the following rule to save a copy of every incoming email:

# Make backups of all mail received in format YYYY/YYYY-MM
:0c
$HOME/.mail/backups/`date +%Y`/`date +%Y-%m`
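
Given the January incident, it may be worth checking now and then that the current month's archive is actually growing. A minimal sketch, assuming the archive is a plain mbox file in which each message begins with a "From " separator line; count_messages is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: count the messages in an mbox file.
# count_messages is a hypothetical helper, not part of the original setup.
count_messages() {
    # $1 = path to an mbox file; each message starts with a "From " line
    grep -c '^From ' "$1"
}

# Report on the current month's archive, if it exists yet.
archive="$HOME/.mail/backups/$(date +%Y)/$(date +%Y-%m)"
if [ -f "$archive" ]; then
    count_messages "$archive"
fi
```

The count is approximate: it relies on body lines that begin with "From " having been escaped when the message was written to the mbox.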

Compress all past monthly archives

I wrote the scripts bz2compressdir and compress-mail-backups (see below) to compress all past monthly email archives. I execute compress-mail-backups nightly using cron. (Monthly would suffice but there is effectively no penalty for running compress-mail-backups too frequently.)

bz2compressdir:

#!/usr/bin/perl -w
#
# bz2compressdir: Compress all files in a directory with bzip2
#
# $Id: bz2compressdir,v 1.2 2004/02/19 05:42:35 sengelha Exp $

use strict;
use File::Find;
use Getopt::Std;

my $usage = <<EOF;
Usage: $0 [options] dir1 [dir2 dir3 ...]

-e <regexp>  exclude <regexp> from files to compress
-h           print usage and exit
EOF
my %opts = ();

sub handleFile {
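    if (-f && !/\.bz2$/ && (!$opts{'e'} || !/$opts{'e'}/)) {
        # List form bypasses the shell, so unusual filenames are passed safely
        system('bzip2', '-9', $_) == 0
            or warn "bzip2 failed on $File::Find::name\n";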
    }
}

getopts('he:', \%opts) or die $usage;

die $usage if ($opts{'h'});
die $usage if ($#ARGV == -1);

foreach my $dirname (@ARGV) {
    die "$dirname: Not a directory" if (! -d $dirname);

    find(\&handleFile, $dirname);
}

compress-mail-backups:

#!/bin/sh
#
# compress-mail-backups: Go through every mail backup directory,
# compressing with bz2compressdir
#
# $Id: compress-mail-backups,v 1.2 2004/02/19 05:42:35 sengelha Exp $

# Match every four-digit year directory, not just 2000-2009
exec nice bz2compressdir -e `date +%Y-%m` $HOME/.mail/backups/[0-9][0-9][0-9][0-9]
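
Before trusting a nightly job with the archives, a dry run can help. The following sketch only lists the files that the selection logic above would pick: regular files that do not end in .bz2 and whose names do not contain the current month. It uses find instead of the Perl script, and list_compressible is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: list the files that would be compressed, without compressing them.
# list_compressible is a hypothetical helper, not part of the original scripts.
list_compressible() {
    # $1 = backup root directory, $2 = month to skip, as YYYY-MM
    find "$1" -type f ! -name '*.bz2' ! -name "*$2*" -print
}

# Show what is pending for the real backup tree, if it exists.
backups="$HOME/.mail/backups"
if [ -d "$backups" ]; then
    list_compressible "$backups" "$(date +%Y-%m)"
fi
```

If the list contains only past months, the exclusion pattern is doing its job.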