Nightly MySQL Backups

Unix

Courtesy of Dave, here’s a script I put into crontab to perform nightly full backups of my WordPress MySQL database:

#!/bin/sh

DATE=$(date +%Y-%m-%d)
BACKUPDIR="$HOME/.mysql-backup"

USERNAME=XXXXX # TODO
PASSWORD=XXXXX # TODO
DATABASE=XXXXX # TODO

# Create the per-database directory (and any missing parents)
mkdir -p "$BACKUPDIR/$DATABASE" || exit 1

# Dump the database, then compress the dump only if mysqldump succeeded
mysqldump --user="$USERNAME" --password="$PASSWORD" "$DATABASE" > "$BACKUPDIR/$DATABASE/$DATE.dump" &&
    bzip2 -9 "$BACKUPDIR/$DATABASE/$DATE.dump"
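
For the crontab side, here is a minimal sketch. The install path $HOME/bin/mysql-backup and the 02:30 run time are assumptions made up for illustration; substitute wherever the script actually lives and whatever time suits you.

```sh
#!/bin/sh
# Sketch only: the script location and the schedule are hypothetical.
BACKUP_SCRIPT="$HOME/bin/mysql-backup"

# crontab fields: minute hour day-of-month month day-of-week command
CRON_LINE="30 2 * * * $BACKUP_SCRIPT"

# Print the entry. To install it, append this line to the output of
# `crontab -l` and feed the combined result back to `crontab -`.
printf '%s\n' "$CRON_LINE"
```

Since the dump file name contains the date, each nightly run leaves a new compressed file next to the earlier ones rather than overwriting them.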

Advanced Bash Scripting Guide

Shell Scripting, Unix

This looks quite interesting: Advanced Bash Scripting Guide: An in-depth exploration of the art of shell scripting.

Fringe Hardware and Software Platforms

C++, Unix

[14 sengelha@dt]% cat test.cc                                               ~/t
#include <string>

int main(void)
{
    try
    {
        throw std::string("sigh");
    }
    catch (...)
    {
        // Do nothing
    }

    return 0;
}
[15 sengelha@dt]% g++ -g -o test test.cc                                    ~/t
[16 sengelha@dt]% ./test                                                    ~/t
zsh: 24046 abort (core dumped)  ./test
[17 sengelha@dt]% gdb ./test test.core                              < 134 > ~/t
GNU gdb 4.16.1
Copyright 1996 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "sparc64-unknown-openbsd3.5"...
Core was generated by `test'.
Program terminated with signal 6, Abort trap.
#0  0x43b7aa40 in ?? ()
(gdb) bt
#0  0x43b7aa40 in ?? ()
#1  0x102a84 in uw_init_context_1 (context=0xfffffffffffeedb0,
    outer_cfa=0xfffffffffffef130, outer_ra=0x44208088)
    at /usr/src/gnu/usr.bin/gcc/gcc/unwind-pe.h:77
#2  0x102e14 in _Unwind_RaiseException (exc=0x643d5c455b4a3a63)
    at /usr/src/gnu/usr.bin/gcc/gcc/unwind-pe.h:77
#3  0x44208090 in ?? ()
#4  0x1012c4 in main () at test.cc:3
#5  0x101054 in ___start ()
(gdb) quit
The program is running.  Quit anyway (and kill it)? (y or n) y
[18 sengelha@dt]% g++ --version                                             ~/t
g++ (GCC) 3.3.2 (propolice)
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[19 sengelha@dt]% uname -a                                                  ~/t
OpenBSD dt.xxxxxxx.xxx 3.5 GENERIC#123 sparc64

Such are the perils of running the fairly uncommon hardware/software combination of OpenBSD/sparc64. *sigh*

Archiving Historical Email Using Procmail

Email, Unix

To quickly summarize, my email setup is a combination of fetchmail, procmail, postfix, bogofilter, and mutt (with my libESMTP patch) using mbox files on a colocated Sun Fire V100 running OpenBSD/sparc64. For the most part I very much like this setup. In the past I have tinkered with SpamAssassin, maildir, and IMAP; I may end up setting up IMAP again in the future. BTW, I have an old entry describing how to properly integrate bogofilter and mutt.

The purpose of this post is to describe how I archive a copy of every piece of incoming email using these tools. I find these archives extremely valuable, both for the occasional historical search and because they let me be highly aggressive about deleting messages to keep my mailboxes clean. Recently I have been contemplating performing statistical analysis of received spam based on these archives.

The first step to take when setting up a historical email archive is to consider how the email will be stored. For my regular email, I use the system-default mailbox for my inbox (/var/mail/sengelha) and store all my sorted email in mbox files in ~/.mail/mailboxes/. For backups, I keep mail in monthly archives in the mbox file ~/.mail/backups/YYYY/YYYY-MM, where YYYY is the current year and MM is the two-digit month (01 = January and so forth). For all past months I compress the mail using bzip2 -9 to save hard drive space. I used to keep daily archives but per the suggestion of Dan Sachs I recently moved to monthly to achieve better compression ratios. To give you an idea of the sizes that are involved, my October 2004 email archive is 21,947,942 bytes (7,734,053 bytes compressed). Undoubtedly much of this is spam.
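
To make the naming scheme concrete, here is a small sketch that prints the archive path for the current month. The function name archive_path is hypothetical (it is not part of my setup), and the optional base-directory argument exists only so the function is easy to try out.

```sh
#!/bin/sh
# archive_path [BASE]: print the monthly mbox path BASE/YYYY/YYYY-MM.
# BASE defaults to the backup directory described above.
# (archive_path is a hypothetical helper, for illustration only.)
archive_path() {
    base=${1:-$HOME/.mail/backups}
    ym=$(date +%Y-%m)      # e.g. 2004-10
    year=${ym%-*}          # strip the "-MM" suffix, leaving 2004
    printf '%s/%s/%s\n' "$base" "$year" "$ym"
}

archive_path /tmp/backups
```

Called with no argument, it falls back to ~/.mail/backups. The date is read once so the year and month cannot disagree if the call happens to straddle midnight on New Year's Eve.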

My setup has two major parts:

  1. Save a copy of every incoming email to the ~/.mail/backups/YYYY/YYYY-MM file.
  2. Compress all past monthly archives using bzip2 -9.

Save a copy of every incoming email

Before a copy can be saved, the directory ~/.mail/backups/YYYY/ must already exist. In fact, I lost some email one January because the directory didn’t exist, so the backups were never saved. I added the following rule to my .procmailrc to ensure the directory always exists:

:0wc
* ! ? test -d $HOME/.mail/backups/`date +%Y`
| mkdir -p $HOME/.mail/backups/`date +%Y`

I then use the following rule to save a copy of every incoming email:

# Make backups of all mail received in format YYYY/YYYY-MM
:0c
$HOME/.mail/backups/`date +%Y`/`date +%Y-%m`

Compress all past monthly archives

I wrote the scripts bz2compressdir and compress-mail-backups (see below) to compress all past monthly email archives. I execute compress-mail-backups nightly using cron. (Monthly would suffice but there is effectively no penalty for running compress-mail-backups too frequently.)

bz2compressdir:

#!/usr/bin/perl -w
#
# bz2compressdir: Compress all files in a directory with bzip2
#
# $Id: bz2compressdir,v 1.2 2004/02/19 05:42:35 sengelha Exp $

use strict;
use File::Find;
use Getopt::Std;

my $usage = <<EOF;
Usage: $0 [options] dir1 [dir2 dir3 ...]

-e <regexp>  exclude <regexp> from files to compress
-h           print usage and exit
EOF
my %opts = ();

sub handleFile {
    if (-f && !/\.bz2$/ && (!$opts{'e'} || !/$opts{'e'}/)) {
        # List form avoids the shell, so odd file names are passed safely
        system('bzip2', '-9', $_);
    }
}

getopts('he:', \%opts) or die $usage;

die $usage if ($opts{'h'});
die $usage if ($#ARGV == -1);

foreach my $dirname (@ARGV) {
    die "$dirname: Not a directory" if (! -d $dirname);

    find(\&handleFile, $dirname);
}

compress-mail-backups:

#!/bin/sh
#
# compress-mail-backups: Go through every mail backup directory,
# compressing with bz2compressdir
#
# $Id: compress-mail-backups,v 1.2 2004/02/19 05:42:35 sengelha Exp $

exec nice bz2compressdir -e "$(date +%Y-%m)" "$HOME"/.mail/backups/[0-9][0-9][0-9][0-9]
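
For comparison, the same selection logic can be sketched with find(1) alone. This is an approximation rather than a drop-in replacement: list_compressible is a hypothetical name, and it excludes files whose names start with the given month, whereas bz2compressdir's -e option takes a regular expression that may match anywhere in the name.

```sh
#!/bin/sh
# list_compressible DIR MONTH: print regular files under DIR that are not
# already compressed (*.bz2) and whose names do not start with MONTH,
# given as YYYY-MM. (Hypothetical helper, for illustration only.)
list_compressible() {
    find "$1" -type f ! -name '*.bz2' ! -name "$2*" -print
}

# Example (not run here): compress everything except the current month.
# list_compressible "$HOME/.mail/backups" "$(date +%Y-%m)" |
#     while IFS= read -r f; do nice bzip2 -9 "$f"; done
```

Listing the candidates separately from compressing them makes it easy to do a dry run before letting bzip2 loose on the archives.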

Proper integration of bogofilter and mutt

Email, Mutt, Unix

Recently I reinstalled my Debian GNU/Linux[1] machine and reestablished my mail setup, which uses Postfix as the MTA, Mutt as the email client, Procmail for mail sorting and preprocessing, and Bogofilter for spam identification. A key part of the anti-spam setup is having bogofilter register each message’s text as ham or spam right after it classifies the message, so that bogofilter, which presumably guesses correctly most of the time, teaches itself. However, as bogofilter occasionally makes mistakes, I need a mechanism by which I can tell bogofilter when it has misclassified a piece of mail. Per bogofilter’s man page, I created a set of mutt macros to enable reclassification which looked something like:

macro index S "<enter-command>unset wait_key\n\
<pipe-entry>bogofilter -Ns\n\
<enter-command>set wait_key\n\
<delete-message>" \
      "mark message as spam when misclassified as ham"
macro index N "<enter-command>unset wait_key\n\
<pipe-entry>bogofilter -Sn\n\
<enter-command>set wait_key\n\
<delete-message>" \
      "mark message as ham when misclassified as spam"

The purpose of each macro is to send the selected message to bogofilter, telling it to unregister all the message tokens as ham and register them as spam (or the reverse, for the second macro). However, I soon noticed that bogofilter’s classification accuracy was horrible, i.e., it was nearly always wrong! Debugging, primarily using bogoutil, eventually led me to the discovery that mutt’s <pipe-entry> does not send the entire message to the provided process: it only sends the set of headers which are displayed by default to the user (configurable using the ignore and unignore parameters in mutt). This meant that the many hidden headers, which bogofilter had incorrectly classified using its self-teaching mechanism, were never being corrected!

I did not see an easy mechanism by which I could fix this (absurdly stupid) pipe-entry behavior, so I decided to solve the problem in a different way. Now, instead of sending the message to bogofilter, I save the message to a folder named spam-false-positives or spam-false-negatives depending on the situation. Then I wrote a cron job which runs every hour and checks to see if any messages are in these folders — if so, I send the entire folder’s contents to bogofilter for reclassification. The reason why this works is that when you save a message to a separate folder, you keep the full headers intact, and bogofilter is able to read them directly from the mbox file.

For reference, my mutt macros now look like:

macro index N "<save-message>=spam-false-positives\ny"
macro pager N "<save-message>=spam-false-positives\ny"
macro index S "<save-message>=spam-false-negatives\ny"
macro pager S "<save-message>=spam-false-negatives\ny"

My cron job script looks like:

#!/bin/sh

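FALSE_NEGATIVES="$HOME/.mail/mailboxes/spam-false-negatives"
FALSE_POSITIVES="$HOME/.mail/mailboxes/spam-false-positives"

# Spam misclassified as ham: unregister as ham (-N), register as spam (-s).
# Remove the mbox only if bogofilter succeeded.
if [ -f "$FALSE_NEGATIVES" ]; then
    bogofilter -MNs -I "$FALSE_NEGATIVES" && rm "$FALSE_NEGATIVES"
fi

# Ham misclassified as spam: unregister as spam (-S), register as ham (-n).
if [ -f "$FALSE_POSITIVES" ]; then
    bogofilter -MSn -I "$FALSE_POSITIVES" && rm "$FALSE_POSITIVES"
fi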

One advantage of this new setup is that reclassifying wrongly-identified mail is faster than before. While I don’t notice too much difference with bogofilter, as bogofilter is rather quick, I’m sure I would notice an enormous difference using this setup with SpamAssassin, as sa-learn is ungodly slow.

[1] I defer to the silly moniker “GNU/Linux” rather than simply “Linux” as that is how Debian refers to itself.

Memory Leak Tracing in Linux Using mtrace

C, C++, Unix

Today I ran across an article on DevX.com entitled Identifying Memory Leaks in Linux for C++ Programs. This article describes a utility called mtrace which, in concert with the mtrace() function in the GNU libc, allows one to easily identify memory leaks in C programs.

To use mtrace, follow these steps:

  • Set the environment variable MALLOC_TRACE to point to a file where mtrace will log memory allocations.
  • Insert a call to mtrace() within your code before any memory is allocated.
  • Compile the program with debugging options set (GCC’s -g flag).
  • Execute the program.
  • Use mtrace(1) on the trace log file to view the memory leaks.
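
The steps above can be sketched as a single shell session. Everything named here (leak.c, leak, leak.trace, the temporary directory) is made up for illustration, and the sketch assumes a glibc system with gcc and the mtrace script installed. On recent glibc releases the tracing code reportedly lives in a separate debugging library that has to be preloaded, so check your libc's documentation if the log comes out empty.

```sh
#!/bin/sh
# Sketch only: all file names below are hypothetical.
workdir=$(mktemp -d)

# A tiny C program with one deliberate leak.
cat > "$workdir/leak.c" <<'EOF'
#include <mcheck.h>
#include <stdlib.h>

int main(void)
{
    mtrace();                  /* start logging before any allocation */
    void *kept = malloc(32);   /* never freed: this is the leak */
    void *freed = malloc(64);
    free(freed);
    (void)kept;
    return 0;
}
EOF

# Tell glibc where to write the allocation log.
MALLOC_TRACE="$workdir/leak.trace"
export MALLOC_TRACE

# Compile with -g, run the program, then summarize the log with mtrace(1).
# Skipped entirely if either tool is missing.
if command -v gcc >/dev/null 2>&1 && command -v mtrace >/dev/null 2>&1; then
    gcc -g -o "$workdir/leak" "$workdir/leak.c" &&
        "$workdir/leak" &&
        mtrace "$workdir/leak" "$MALLOC_TRACE"
fi
echo "trace log (if any): $MALLOC_TRACE"
```

Passing the binary as the first argument to mtrace lets it translate addresses into source locations, which is why the -g flag matters.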

I shall have to look at the GNU libc source code to see how mtrace is implemented. I have a few educated guesses, but I can’t seem to piece together the whole picture.

Apache Content-Type nightmare

Apache, HTML, Unix

The problem:

  1. Debian has configured Apache to add Content-Type: … charset=iso-8859-1 to the HTTP response headers of all files with unknown types. This overrides the <meta http-equiv…charset=utf-8> line in my pages, which sets the character set to UTF-8, and thus breaks the handling of non-ASCII characters (making résumé appear incorrectly). I would consider disabling it, but it does exist for a reason. It is also the default for Apache 2.0.
  2. My XSLTs are configured NOT to include the <?xml version="1.0" encoding="…"?> stanza at the beginning of my webpages (more below). When I make my XSLTs produce ISO-8859-1 output without this stanza, my output validation stage fails because the document is not UTF-8. The validator suggests using the <?xml…?> stanza to specify the character set.
  3. When I output the <?xml…?> stanza, Opera and IE do not display the page correctly. IE also has a bug where it won’t turn on strict conformance mode (to eliminate CSS bugs) if the <?xml…?> stanza exists.

The solutions seem to be:

  1. Eliminate all extended ASCII from output, and replace it with character references (such as &eacute;). This is obviously evil.
  2. Disable Apache’s Content-Type HTTP header crap. This is evil: see above.
  3. Forget about the output validation stage. Evil.
  4. Generate ISO-8859-1 with the XML stanza and add an extra stage after output validation that strips off the <?xml…?> stanza. Evil.
  5. Try to find a way to set the content-type of files so Apache sends the proper content-type in the HTTP headers. If done with .htaccess files, it will be a big PITA.
  6. Eliminate non-7bit ASCII altogether. Ugh.
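
As a rough sketch of option 5, a single .htaccess file at the top of the site could override the default charset for everything beneath it. This assumes the server permits the override (AllowOverride FileInfo) and that the pages go out as text/html. DOCROOT is a hypothetical variable, and a temporary directory stands in for the real site directory so the example is harmless to run.

```sh
#!/bin/sh
# Sketch only: DOCROOT is a hypothetical site directory; a temporary
# directory is used when it is unset so nothing real is overwritten.
DOCROOT=${DOCROOT:-$(mktemp -d)}

# AddDefaultCharset sets the charset parameter Apache adds to text/plain
# and text/html responses that do not already specify one.
cat > "$DOCROOT/.htaccess" <<'EOF'
AddDefaultCharset UTF-8
EOF

cat "$DOCROOT/.htaccess"
```

Whether one top-level file is enough depends on how the site is laid out; per-directory overrides would need their own files, which is where the PITA comes in.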

What a mess.
