Take the spice out of spam

What follows is a recipe to extract the finest ham from spam. (Note, however, that the author no longer uses this recipe as published.)

Ingredients

  • procmail
  • spamassassin
  • DreamHost Unix shell account
  • Gmail account

Summary

We’ll forward your public email address (public@mydomain.com) to your server shell account (shellaccount@sjov.dreamhost.com), where we’ll filter your mail through procmail and SpamAssassin, and depending on the results, forward the mail to a private Gmail account (private@gmail.com) where we will do a little more filtering. The result, after proper training, will be absolutely zero spam.

Skipping Gmail is an option and simplifies the procmail recipes, but that is not described here (I happen to like Gmail — it even has vi key bindings).

This recipe assumes we are using DreamHost. Elsewhere you may be required to configure the various components (forwarding, procmail, spamassassin, etc) yourself and while I’ve tried to make the scripts as generic as possible, you’ll probably still have to make some heavy modifications to the procmail scripts below. Whether or not you’re using DreamHost, this recipe will focus only on procmail and Gmail configurations, gloss over SpamAssassin, and assume you can set up and forward email. You must be comfortable using the command line, most likely POSIX, and willing to train your spam filters a few minutes a week.

Procedures

We will perform the configuration somewhat backwards so as not to lose email while we set things up and test. You can use an existing Gmail account, although a new secret account is recommended. You must use a shell account from which you can send and receive mail. Your shell account should not already be receiving email unless you can handle the required procmail filtering modifications yourself or can accept losing a few pieces of mail before you complete this recipe.

Collect some info

You’ll need to know your shell account name and full host and domain (what we refer to as shellaccount@sjov.dreamhost.com).

Configure Gmail

Whether you use an existing or a new Gmail account (private@gmail.com), you should eventually disseminate your public email address (public@mydomain.com) and may want to discourage use of (or keep secret) your Gmail address (private@gmail.com).

Below we will configure Gmail to leave known addresses (white list) in your Inbox as expected. Mail from new contacts (low but positive spam scores) will have the “new” label and likely spam will be archived with the “Block” label (something like a folder outside of the Inbox). Aliasdomain.com and old@address.com are additional features found in the scripts below (but not described here). You may leave them or remove them with no loss of functionality.

Select “Settings” in the top right corner. Select “Filters”.

Create a new filter matching From: -googlemail.com and Subject: (0K|1K|2K) which only applies the label “new”. Then click “Create Filter” or “Update Filter”. This will result in something like this:

Matches: from:(-googlemail.com) subject:((0K|1K|2K))
Do this: Apply label "new"

Similarly, you’ll need to create the following two filters yourself (note that both of these skip the Inbox):

Matches: from:(-googlemail.com) subject:(3K|4K|5K|6K|7K|8K|9K)
Do this: Skip Inbox, Apply label "Block"

Matches: from:(-public@mydomain.com -aliasdomain.com -googlemail.com)
         to:(-public@mydomain.com -old@address.com -aliasdomain.com)
         subject:(-bcc)
Do this: Skip Inbox, Apply label "Block"

Test Gmail

Send an email to your private@gmail.com from the same Gmail account. Unless you have other filters, your email should appear in the Gmail Inbox as normal. Note that Gmail does not necessarily send the mail through the Internet when Gmail can handle the mail with its own machines.

Send another mail from some other email host account to private@gmail.com. This mail should skip the Inbox and find its way into the “Block” folder (or label). This will be the fate of nearly all of your emails if you stop now.

Note, I will refer to Gmail labels as folders when a mail skips the Inbox. On the left side, you should see a labels box. Mails with the “Block” label will not appear in the Inbox. So essentially, they are archived in a separate folder, namely the “Block” folder.
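
The combined effect of the three filters can be hard to picture, so here is a rough shell model of them. It is only a sketch: Gmail evaluates the real filters, the function names (has_word, gmail_outcome) are made up for illustration, and it assumes that the score tag shows up in the Subject as its own word (0K, [1K] and so on) and that the mail arrives from outside Gmail.

```shell
# True when pattern $2 occurs in text $1 as a whole word.
# Case-insensitive, so "[1K]", "1K" and "1k" all count.
has_word() {
    printf '%s\n' "$1" | grep -Eiq "(^|[^[:alnum:]])($2)([^[:alnum:]]|\$)"
}

# Usage: gmail_outcome FROM TO SUBJECT
# Prints "Inbox" or "Block", followed by " new" when the "new" label applies.
gmail_outcome() {
    from=$1 to=$2 subject=$3
    place=Inbox
    label=""

    # Every filter carries from:(-googlemail.com), so such mail is untouched.
    case $from in
        *googlemail.com*) echo "Inbox"; return 0 ;;
    esac

    # Filter 1: low scores only get the "new" label.
    if has_word "$subject" '[0-2]K'; then label=" new"; fi

    # Filter 2: high scores skip the Inbox.
    if has_word "$subject" '[3-9]K'; then place=Block; fi

    # Filter 3: mail that involves none of the known addresses (and is not
    # tagged bcc) skips the Inbox.
    from_known=no
    to_known=no
    case $from in
        *public@mydomain.com*|*aliasdomain.com*) from_known=yes ;;
    esac
    case $to in
        *public@mydomain.com*|*old@address.com*|*aliasdomain.com*) to_known=yes ;;
    esac
    if [ "$from_known" = no ] && [ "$to_known" = no ] && ! has_word "$subject" 'bcc'; then
        place=Block
    fi

    echo "$place$label"
}
```

For example, gmail_outcome friend@example.org private@gmail.com "1K hello" prints "Block new", while the same message addressed to public@mydomain.com prints "Inbox new".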

Configure Procmail

Log in to your shell account. Make sure you do not already have a .forward.postfix file, a .procmailrc file, or a .procmail directory. If any of those exist and you are unwilling to overwrite or delete them, stop here and send me an email or a comment (if you really quit, you might want to remove the Gmail filters). Otherwise type the following:

   cd
   wget http://genaud.net/uploads/2007/05/procmailrc
   mv procmailrc .procmailrc
   mkdir .procmail
   cd .procmail
   touch list.black
   echo "my email address has changed" > msg.oldaddress
   wget http://genaud.net/uploads/2007/05/rc.block
   wget http://genaud.net/uploads/2007/05/rc.subjectexe
   wget http://genaud.net/uploads/2007/05/rc.spamtraining
   wget http://genaud.net/uploads/2007/05/rc.spamassassin
   wget http://genaud.net/uploads/2007/05/rc.forward
   cd
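
Before running the commands above, the check for existing files can be scripted rather than eyeballed. A minimal sketch, where preflight is a name invented here and the three file names are the ones listed in the previous paragraph:

```shell
# Report any procmail-related files that the installation would overwrite.
# Usage: preflight [DIR]   (DIR defaults to $HOME)
# Returns 0 when none of the files exist, 1 otherwise.
preflight() {
    dir=${1:-$HOME}
    found=0
    for name in .forward.postfix .procmailrc .procmail; do
        if [ -e "$dir/$name" ]; then
            echo "exists: $dir/$name"
            found=1
        fi
    done
    return "$found"
}
```

Running preflight with no arguments checks your home directory. No output and a zero exit status mean nothing would be overwritten.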

You must change the ~/.procmailrc file, modifying most/all of the six EMAIL lines. The lines match the example naming convention in this document. Change CODEWORD to something unique. EMAIL_REGEX is a regular expression that should match your own email addresses.
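
It is worth trying EMAIL_REGEX against your addresses before any mail depends on it. The sketch below uses grep -Ei as a stand-in for procmail, whose conditions are egrep-style and case-insensitive by default. The regex value is a made-up example built from the placeholder addresses in this document, and matches_own_address is a hypothetical helper name.

```shell
# Hypothetical value; substitute the EMAIL_REGEX from your own ~/.procmailrc.
EMAIL_REGEX='(public|shellaccount)@(mydomain\.com|sjov\.dreamhost\.com)|private@gmail\.com'

# Succeeds when the given address matches EMAIL_REGEX.
matches_own_address() {
    printf '%s\n' "$1" | grep -Eiq -- "$EMAIL_REGEX"
}

# Try a few addresses: two of mine and one stranger.
for addr in public@mydomain.com private@gmail.com stranger@example.org; do
    if matches_own_address "$addr"; then
        echo "own:   $addr"
    else
        echo "other: $addr"
    fi
done
```

Remember to escape the dots. An unescaped "." matches any character, so a sloppy regex may treat a stranger's look-alike address as one of yours.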

Install SpamAssassin

An old version is already running on DreamHost, which is perfectly fine. Most likely you can skip ahead.

The adventurous may want to compile the latest and greatest. Edit the rc.spamassassin recipe if your spamassassin is not located in /usr/bin/spamassassin. You may have to (want to) run spamc instead.
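
To find out which path to put in rc.spamassassin, you can probe a few likely locations. This is only a sketch: first_executable is a name invented here, and apart from /usr/bin/spamassassin the candidate paths are guesses at where a self-compiled copy or spamc might live.

```shell
# Print the first executable file among the candidate paths.
# Usage: first_executable PATH...
# Returns 1 (and prints nothing) when none of them is executable.
first_executable() {
    for candidate in "$@"; do
        if [ -x "$candidate" ] && [ ! -d "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# /usr/bin/spamassassin is the location the recipe expects; the other
# paths are assumptions, so adjust them to your own installation.
first_executable /usr/bin/spamassassin "$HOME/bin/spamassassin" /usr/bin/spamc \
    || echo "no SpamAssassin binary found in the candidate paths"
```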

Procmail Launch

If you are already using your shell account email you may not want to turn procmail on already, but you’re on your own. Otherwise, we will now filter all email to your shellaccount@sjov.dreamhost.com email address through procmail. Type the following command:

   echo "\"|/usr/bin/procmail -t\"" > ~/.forward.postfix

Test Procmail

Send an email to shellaccount@sjov.dreamhost.com from the shell or another email account. This mail should eventually end up in the Gmail “Block” folder (perhaps with a [bcc] in the Subject, depending on your EMAIL_REGEX variable, but this is not important). You should see that this email was indeed forwarded in your ~/Maildir/log.* file (the log file name has an ISO date appended).

Send an email from your private Gmail address to your secure shell address. This is just a test that nothing wildly strange occurs. This message may be consumed by Gmail (because it is a short loop bounced straight back to Gmail). Whether or not you receive the message again in Gmail, the log should indicate that the forward left your web server (again ~/Maildir/log.*).
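
Both tests above come down to finding the right entry in the newest log file. A small sketch follows. It assumes the log contains procmail's usual "Subject:" abstract lines and relies on the ISO date suffix, which sorts chronologically as plain text. The helper names latest_log and find_subject are invented for this example.

```shell
# Print the most recent procmail log in a Maildir.
# Usage: latest_log [MAILDIR]   (defaults to ~/Maildir)
latest_log() {
    dir=${1:-$HOME/Maildir}
    ls "$dir"/log.* 2>/dev/null | sort | tail -n 1
}

# Print the Subject lines in the newest log that contain the given text.
# Usage: find_subject TEXT [MAILDIR]
# Returns non-zero when there is no log or no matching Subject line.
find_subject() {
    log=$(latest_log "$2")
    [ -n "$log" ] || { echo "no log file found" >&2; return 1; }
    grep -F -- "$1" "$log" | grep "Subject:"
}
```

Put something distinctive in the Subject of each test mail, then run find_subject with that text to confirm the message passed through procmail.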

You may notice a “0K” (zero K) or some other number in the Subject. This just means that the message is spam neutral (or more likely, your spam filter is not activated yet). The higher the number, the more likely it is spam. Let’s play with this number a bit.

Send another mail from some other email host account to private@gmail.com with “1K” in the Subject. This message should appear in the “Block” folder with a “new” label. Finally, send another email just like before except with “6K” in the Subject. The last email should also appear in the “Block” folder but without the “new” label.

It is a good idea to add addresses to your white list (below) and to remove the 0K, 1K, … or 5K from the Subject when replying to friends and associates.

Nearly all mail is now blocked because our filters now want all mail to be sent via the public email address. That way we guarantee all mail has passed through our spam filters (coming up soon).

Configure Email Forwarding

If all the tests above worked as planned, it should now be safe to forward your public email address to your secure shell address. If you are using DreamHost, you can do this via the panel. You do not need to set up a mailbox, just mail forwarding (you’ll keep the mail in Gmail).

Test Round Trip

From another email address (and/or your shell account), send an email to your public email address. Unless it looks like spam, it should end up in your Gmail Inbox in a moment. Check the log in the shell account. Gmail sometimes takes some time before displaying the email.

SpamAssassin Training

You will definitely want to add many/all of your addresses from your address book to ~/.spamassassin/user_prefs. This will ensure that your buddies do not get trapped by your zealous spam filter. Be sure to add all of your own email addresses, including the public, private, secure, old, aliases, everything. The list can handle simple wildcards (*):

whitelist_from        some*one@somewhere.com
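
Typing an address book in by hand gets old quickly. If you can export your contacts as a plain list, one address per line, a short filter can produce the whitelist_from lines for you. This is a sketch: make_whitelist is a name invented here and addresses.txt is a hypothetical export file.

```shell
# Turn a list of addresses (one per line on stdin) into whitelist_from lines.
# Surrounding whitespace, blank lines and duplicates are dropped.
make_whitelist() {
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d' |
        LC_ALL=C sort -u |
        sed -e 's/^/whitelist_from /'
}
```

Review the output first, then append it with something like make_whitelist < addresses.txt >> ~/.spamassassin/user_prefs.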

After adding your own addresses, you can run all the tests again. You should no longer get 0K, etc. in the Subject from known addresses. White-listed addresses get a spam score of negative 100 plus or minus the calculated score (which usually varies by +/- 5, or 20 for obvious spam), so even the most outrageous pornographic, Viagra-pushing, get-rich-quick scam from your mother is likely to end up in your Inbox.

Currently, we have a procmail recipe running called rc.spamtraining. This filters out some of the more obvious spam while we train SpamAssassin. rc.spamtraining places its caught spam in ~/Maildir/delete.* (date suffix). The rc.spamassassin recipe divides mail between ~/Maildir/ham.* and ~/Maildir/spam.* . None of these files is guaranteed to be perfectly accurate, but generally delete.* and spam.* are unsolicited mails and ham.* are the mails you wanted. Periodically you should go through these files separating the bona fide spam into one mbox file and the legitimate ham in another mbox file. You may want to limit the ham to personal mail rather than mailing lists, daemon errors, etc.

If you don’t know how to read/write the mbox format you may open the files in elm, mutt or some other mail program and save the files appropriately. For example:

   mutt -f spam_mbox_file

Once you have the mbox files clearly divided between spam and ham, you can train SpamAssassin:

   sa-learn --mbox --spam    my_spam_file

   sa-learn --mbox --ham    my_ham_file

The SpamAssassin documentation highly recommends that you teach SpamAssassin with more ham than spam. After 2000 messages you should have a very learned spam filter and might consider removing rc.spamtraining from your ~/.procmailrc file.
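
To keep an eye on both rules of thumb (more ham than spam, and roughly 2000 messages), you can count the messages in each mbox file before feeding it to sa-learn. The sketch below counts "From " separator lines, which is a reasonable approximation for well-formed mbox files. The names mbox_count and corpus_report are invented for this example.

```shell
# Count the messages in an mbox file by counting "From " separator lines.
# Usage: mbox_count FILE   (prints 0 for a missing or empty file)
mbox_count() {
    if [ -r "$1" ]; then
        grep -c '^From ' "$1" || true
    else
        echo 0
    fi
}

# Summarise a training corpus and flag the two rules of thumb.
# Usage: corpus_report SPAM_MBOX HAM_MBOX
corpus_report() {
    spam=$(mbox_count "$1")
    ham=$(mbox_count "$2")
    echo "spam=$spam ham=$ham total=$((spam + ham))"
    [ "$ham" -ge "$spam" ] || echo "consider adding more ham"
    [ $((spam + ham)) -ge 2000 ] || echo "fewer than 2000 messages so far"
}
```

For example, corpus_report my_spam_file my_ham_file prints the counts on one line, followed by a hint for each rule that is not yet met.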

Advanced Teasers

Once you’ve trained SpamAssassin with thousands of your own emails, you can do some more clever stuff, for example lowering the threshold and executing more sophisticated algorithms.

Gmail lets you change the sender/from address. You should add your public email address and make that the default. Once that is done, you can automatically add to your white list by sending an email (from your public Gmail address) to your secure shell address with a Subject like:

Subject: CODEWORD.EXE.ADD myfriend@example.com
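
The real handling lives in the downloaded rc.subjectexe, which is not reproduced here. Purely as an illustration of the kind of parsing such a command needs, here is a shell sketch. The function parse_add_command is hypothetical, CODEWORD stands in for your own secret, and a real handler would also have to check that the mail really came from you.

```shell
# Hypothetical stand-in for the secret chosen in ~/.procmailrc.
# Keep it to letters and digits, since it is used inside a regex below.
CODEWORD=CODEWORD

# Read a Subject header line and, if it is a well-formed ADD command,
# print the matching whitelist_from line. Anything else prints nothing
# and returns 1.
# Usage: parse_add_command "Subject: CODEWORD.EXE.ADD someone@example.com"
parse_add_command() {
    address=$(printf '%s\n' "$1" |
        sed -n "s/^Subject: $CODEWORD\.EXE\.ADD \([^ @]*@[^ @]*\)\$/\1/p")
    [ -n "$address" ] || return 1
    echo "whitelist_from $address"
}
```

With the Subject shown above, parse_add_command prints "whitelist_from myfriend@example.com". A wrong codeword or a malformed address produces no output at all.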

The downloaded scripts are prepared to handle multiple accounts, automatically reply with address change messages, update your white list by email (from you), a black list, and a few more features that would be too lengthy and boring to mention here. But I’m always up for a chat. The devil is always in the details; I’m game to take up case by case, as long as you can use ‘vi’. ;-)

Comments (6)

  1. Lars Jensen wrote::

    Have you considered using “only” Gmail for blocking spam? In the Gmail account there is a “Report Spam” button, and every time you click it, Gmail gets better at filtering out spam.

    If I get these instructions correct, you will be adding an additional filter to your mails, and you would probably never get to press the “Report Spam” button, thus actually cheating Gmail for improving its spam fighting abilities?

    Actually thinking about it, can you tell me how this is exactly the same as not stealing and not cheating? Stealing from other Gmail users, and cheating Gmail by depriving it of means to combat spam. This is moral decay… ;-)

    And we keep improving our sophisticated spam filter every day. You can help by using the Report Spam button, which removes spam from your inbox and automatically improves spam filtering in the future.

    http://mail.google.com/mail/help/about.html#spam

    Wednesday, June 13, 2007 at 0:25 #
  2. Alex wrote::

    “Beware of the man who works hard to learn something, learns it, and finds himself no wiser than before…He is full of murderous resentment of people who are ignorant without having come by their ignorance the hard way.”
    -Kurt Vonnegut, Cat’s Cradle

    I am starting to think you are correct. I’ve learned the ins-and-outs of the otherwise completely asinine streaming ‘programming language’ procmail. While I like the greater white list control, address expiration messages, and the ability to create numerous aliases (alex.youtube@whatever.com), teaching spam assassin through ssh is an infinitely bigger pain in the butt than the “Report Spam” button. However, Gmail does not allow one to filter on the message headers nor up the spam aggressiveness. I have a higher tolerance for false positives (deleted ham) than false negatives (spam).

    But Lars, consider well the words of the late Kurt and be scared. Be scared.

    Wednesday, June 13, 2007 at 1:36 #
  3. Lars Jensen wrote::

    Oh, I had no intentions of “belittling” your effort in any way, or to mock you. Well okay, maybe I wanted to tease you just a little bit - but with the best of intentions - or, no wait a minute - no maybe I only wanted to mock you. Muhahaha…

    No, okay - all silliness aside, I was actually thinking about the approach you had about white-listing people by giving them a reply-message and letting them answer a question to prove they were sincere emailers and not a spam-bot. Is that approach described anywhere by you?

    I know I thought that approach was cumbersome, but it would definitely sort out most spammers and also the ignorant people who wouldn’t be able to write you. Totally unlike this blog, where any ass-wife can leave a comment. ;-)

    “To understand everything is to forgive everything”

    Hindu Prince Gautama Siddharta, the founder of Buddhism, 563-483 B.C.

    Wednesday, June 13, 2007 at 23:52 #
  4. Alex wrote::

    I started all this to help an elder friend of mine to block *ALL* spam including the ’solicited spam’ that she seems to generate. Because this individual had the habit of clicking and replying to shady parties, I needed to err on the side of blocking legit mail with all but the squeakiest emails getting through. In the end (today), I have given up. There is nothing I can do for someone who replies thus “…Dear Gracious Controller of FreeLotto.Com, please update your records with the following corrected details…”

    Of course, I did most of my experiments on my own email. Lars, you were exposed to the first phase, which essentially required every user to register on a grey list pending white listing. This required aliases for computer-generated mail such as from banks and the like but produced quite some infinite mail loop headaches. Eventually, it seemed to work well for me. However, I have literally annoyed more people than not. Many close friends not already white listed made comment. I pulled the plug when my own father gave me a call. It’s a damn shame though. I really expected people would ask for the same system themselves. Why is it so hard to reply to an email with a certain subject line or click on a link?

    The second phase kept the white and black lists, aliases, but included, as you say, a second spam filter (actually two spam filters). This required constantly teaching of SpamAssassin. I suppose SpamAssassin has become the benchmark upon which spammers test their works because just recently spam started to come through despite fine tuned conditioning.

    The third phase as (or perhaps because) you suggest is quite similar to phase zero. Phase Four is out of my hands: DomainKeys Identified Mail.

    Thursday, June 14, 2007 at 0:56 #
  5. Paul wrote::

    I’m real fond of this: http://crm114.sourceforge.net/ with procmail as a sending and catching mechanism as per the default install. Does what it says on the label. It made two mistakes in the first few weeks and hasn’t made one since. YMMV.

    Monday, July 2, 2007 at 21:45 #
  6. Alex wrote::

    [Image: CRM114 logo by Liz Manicatide]

    Cool logo! So, do you use this just for spam or also launching rockets? It’s gotta be better than procmail. Thanks for the tip.

    Monday, July 2, 2007 at 22:15 #