Scanning for Spam, Worms, and Viruses

Unwanted email is arguably the worst problem facing email administration today. Two types of unwanted email are common: spam and worms/viruses. Spam is unsolicited bulk email, usually commercial in nature. Most spam markets worthless body-enhancement products, questionable financial advice, and so on, but it is more of a nuisance than a threat, at least if you ignore the substantial network bandwidth that spam consumes. Worms and viruses, on the other hand, are malicious computer code that, if executed on an unprotected computer, can spread and cause damage. Although spam differs from worms and viruses in intent, the two classes of junk email can be combated in similar ways.

Tip

The distinction between worms and viruses is a tricky one to define and depends on who you ask. Thus, I don’t try to distinguish the two types of menaces in this chapter, and hereafter I use the word worm to refer to both types of program. Sometimes I refer to “spam-fighting tools” or the like. Such tools can often be used to fight worms, as well, but such phrases omit this detail for brevity’s sake.

Dealing with spam and worms requires first knowing a bit about the types of approaches to dealing with the problem. One of the tools that can be used to directly combat spam and worms is Procmail, so I describe it shortly. Procmail can also be used to invoke other spam-fighting tools. SpamAssassin and Bogofilter are two such antispam tools. Finally, as a site policy issue, you may want to place suspicious attachments in a special holding area until you can examine them.

An Antispam and Antivirus Tool Rundown

Spam and worms are difficult to detect. This is particularly true of spam, because spam identification is somewhat subjective: one person’s spam may be another person’s desirable commercial communication. The line between worms and non-worms is clearer, but worms can also be difficult to distinguish from legitimate email attachments, particularly in some environments (for instance, if you have a legitimate business reason to send or receive executable files). For this reason, the number of spam-fighting tools available is quite large. Indeed, the number of approaches to fighting spam and worms is large. Here are some general methods:

Blackhole lists: This approach, described in the earlier sections on sendmail and Postfix, relies on central authorities maintaining databases of IP addresses from which messages shouldn’t be accepted or should be accepted only with caution. Typically, these databases are updated frequently, based on spam reports from their users. This method is best implemented in receiving SMTP servers because they receive direct connections from the sending systems and therefore aren’t easily tricked into believing the message originated from a false IP address. (Headers are easily forged, so the originating IP address can be obfuscated by clever spammers if another system does this check.) Note that this approach doesn’t test the message’s content; it’s based solely on the IP address and so is susceptible to false alarms should an address send both spam and nonspam messages.
Distributed hashes: Some network databases work on more than the originating IP address; they store hashes of entire spam messages. When your server receives a message, it can hash the message (minus its headers) and query a network server for the presence of this hash. If it’s present, it means that somebody else has received an identical message and entered it as spam in the hash database. This approach is a potentially powerful one, but it can be easily “poisoned” with respect to legitimate mailing lists; that is, individuals can classify mailing list messages as spam, which can then cause these legitimate messages to be misclassified as spam. You can work around this problem by creating a “white list” (see entry later in this list) of addresses that aren’t tested against a distributed hash system.
Simple pattern matches: Examining the message’s content is the most reliable way to identify spam. The simplest type of examination relies on simple pattern matches. For instance, you might decide that any message containing the word Viagra is spam, and discard it. This approach can be implemented in either the SMTP server or in add-on software, such as Procmail. It has the disadvantage of great potential for false alarms, particularly if your rules are too broad. For instance, if you discard all messages containing the word Viagra, you may catch a lot of spam, but you’ll also discard legitimate email to people who are actually corresponding with others (perhaps their doctors) about this drug. Maintaining a good set of pattern match rules can also be quite time-consuming, although some packages, such as SpamAssassin, aim to minimize this problem by providing frequent updates to a general rule set.
White lists: A white list is a list of addresses or keywords that trigger automatic acceptance of a message. They’re frequently used with simple pattern matches or other spam-catching tools in order to minimize the risk of discarding important messages. Typically, you add your regular correspondents to your white list, and their messages get through even if another rule would reject them. They’re usually implemented using the same tools that can perform simple pattern match rejections.
Challenge-response tests: A challenge-response system is a variant on white lists. When a message arrives from a source that isn’t on the white list, the recipient’s system automatically sends a challenge to the message source. This challenge is a message asking the sender to perform some action to prove that the message isn’t spam, such as responding with a keyword. Automated spamming systems can’t cope with this request, but humans can. Once a response is received, the original message is delivered, and the sender is usually added to the white list. This method of spam fighting can be quite effective, but it generates more traffic and places an extra burden on senders, who must respond to challenges. A poor implementation can also result in a continuous loop of challenges to challenges, should two sites use similar systems that don’t exempt challenges to their own challenges.
Statistical tests: A spam-catching tool that emerged on the scene in 2002 involves statistical tests (often called Bayesian tests, after Bayes’ Rule, a statistical principle they employ). These tests use a database of words, word pairs, and other message features. Typically, you feed the software a sample of spam and another sample of nonspam, and the software adds up the number of times a word appears in each category. For instance, Viagra might appear 50 times in spam and once in nonspam, whereas Linux might appear 50 times in nonspam and once in spam. If a message with the word Viagra is analyzed, then, a statistical filter will give it a high probability of being spam. The analysis is typically based on many words, though, so a single word isn’t likely to “poison” an analysis, as can happen with simple pattern matches. One statistical spam filter, Bogofilter, is described in more detail later. Some tools, such as SpamAssassin, employ statistical tests as part of their overall operation.
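
The word-counting idea behind statistical tests can be sketched in a few lines of Python. This is a simplified illustration only, not how Bogofilter or SpamAssassin actually compute their scores. The counts are the Viagra and Linux figures from the example above; the function names, the neutral value for unknown words, the clamping limits, and the log-odds combination rule are all assumptions made for the sketch.

```python
import math

# Counts from the example in the text: (times seen in spam, times seen in nonspam).
COUNTS = {
    "viagra": (50, 1),
    "linux": (1, 50),
}


def word_spam_probability(word, counts=COUNTS):
    """Estimate how strongly one word suggests spam, from 0.0 to 1.0.

    Unknown words return 0.5 (no evidence either way); that default is an
    assumption of this sketch.
    """
    spam, ham = counts.get(word.lower(), (0, 0))
    if spam + ham == 0:
        return 0.5
    return spam / (spam + ham)


def message_spam_probability(text, counts=COUNTS):
    """Combine per-word estimates by summing log-odds (a naive Bayes-style rule)."""
    log_odds = 0.0
    for word in text.split():
        p = word_spam_probability(word.strip(".,!?"), counts)
        # Clamp so a word seen in only one category cannot cause division by zero.
        p = min(max(p, 0.01), 0.99)
        log_odds += math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-log_odds))
```

With these counts, a message containing Viagra and no other known word scores close to 1, whereas a message that mentions both Viagra and Linux comes out neutral, which illustrates why a single word is unlikely to decide the outcome on its own.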

These same tools can detect worms, although some worm-detection tools rely on an analysis of the binary file that’s attached to the message rather than English words in the message body. (Some worms can also be reliably identified by their message texts.)

Some tools are hard to classify in just one way. For instance, Procmail directly implements pattern-matching tests but can call other tools that use other methods. The upcoming sections describe Procmail, SpamAssassin, and Bogofilter in more detail.
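
The distributed-hash method from the list above lends itself to a brief illustration. The following Python sketch hashes a message minus its headers and checks the result against a set of reported hashes. It rests on several assumptions: SHA-256 stands in for whatever signature a real service uses (real services may use fuzzier signatures that tolerate small changes), the function names are hypothetical, and a local set replaces the network database.

```python
import hashlib


def body_digest(raw_message):
    """Return a hex digest of the message body, ignoring the headers.

    Headers end at the first blank line. Line endings are normalized so the
    same body hashes identically whether it arrived with CRLF or LF.
    """
    normalized = raw_message.replace("\r\n", "\n")
    _headers, _sep, body = normalized.partition("\n\n")
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def is_known_spam(raw_message, reported_digests):
    """Check the body digest against a set of digests reported as spam.

    `reported_digests` stands in for the network database; a real system
    would query a remote server instead.
    """
    return body_digest(raw_message) in reported_digests
```

Because the headers are excluded, two copies of the same spam sent through different relays produce the same digest, which is what lets one recipient's report protect another.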

Sendmail Antispam Options

One way to deal with spam and worms is to use SMTP server features. One of these features in sendmail has already been described: the access.db file, in conjunction with the FEATURE(`access_db') option in your sendmail .mc file. You can block mail from sites known to send nothing but spam using this technique. Unfortunately, the world of spam is a fast-changing one, so by the time you add a hostname or address to this list, chances are the spammer will have started using another. The sheer quantity of spam also makes this approach an awkward one. Nonetheless, you can use this method for some particularly persistent offenders.

Another spam-fighting approach is to use a blackhole list, which is a frequently updated list of sites that are known or suspected spam sources or that shouldn’t be sending email directly. Blackhole lists work as services, much like DNS: your mail server queries the blackhole list with the IP address of a server that’s trying to connect, and the blackhole list server returns a value that indicates the sender’s status. To use a blackhole list, you enter a line like the following in your sendmail .mc file:

FEATURE(`dnsbl', `relays.ordb.org', `"550 Email rejected due to sending server misconfiguration - see http://www.ordb.org/faq/\#why_rejected"')

This line tells sendmail to use the blackhole list at relays.ordb.org and to include a message with a URL in bounced emails. (This enables senders to investigate the rejection, should nonspam messages be bounced.) Of course, this raises a question: how do you know which blackhole list to use? Many are available. You may want to peruse http://www.declude.com/Articles.asp?ID=97 or http://www.moensted.dk/spam/ for pointers to more than 100 blackhole databases with varying criteria for inclusion and other features. Some are free; others require you to pay for the privilege of using them. If you like, you can include multiple blackhole list definitions, each on its own line.
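
To make the blackhole list query more concrete, the following Python sketch builds the DNS name that a mail server would look up for a connecting client. It relies on the common convention for DNS-based lists, which this chapter doesn’t spell out: the IPv4 octets are reversed and the list’s zone is appended, and an address answer means the client is listed. The function name and the dnsbl.example.org zone are hypothetical, and no network lookup is performed.

```python
import ipaddress


def dnsbl_query_name(client_ip, zone):
    """Build the DNS name a server would look up for an IPv4 client.

    The octets are reversed and the blackhole list's zone is appended, so
    192.0.2.99 checked against dnsbl.example.org becomes
    99.2.0.192.dnsbl.example.org.
    """
    address = ipaddress.IPv4Address(client_ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(address).split(".")))
    return f"{reversed_octets}.{zone}"
```

Sendmail and Postfix perform this lookup themselves, so a sketch like this is mainly useful for understanding log entries or for checking a suspect address by hand with a DNS query tool.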

More sophisticated spam-fighting techniques require additional software. In particular, you can add Procmail to the mix to filter on keywords or to call other programs to check your incoming email in various ways. This topic is covered in a later section. If the sendmail server is an intermediary system, you may want to call Procmail as part of the forwarding configuration, as described earlier, in Section 13.2.3.3.

Postfix Antispam Options

Postfix provides a number of antispam options, some of them quite sophisticated. In addition, you can use Procmail as a delivery agent to call external programs or perform checks Postfix alone can’t handle.

One of the simpler Postfix antispam configurations is to use a blackhole list. One main.cf option enables this feature:

smtpd_client_restrictions = reject_rbl_client relays.ordb.org

The smtpd_client_restrictions option tells Postfix when to reject mail. The reject_rbl_client value corresponds to a positive lookup in the blackhole list database specified after this value (relays.ordb.org in this example). Postfix can use the same blackhole lists as sendmail; consult http://www.declude.com/Articles.asp?ID=97 or http://www.moensted.dk/spam/ for pointers to more than 100 blackhole databases. Other values can be added to this line, separated by commas, to reject mail from systems that don’t have matching DNS A records for their PTR records (reject_unknown_client), to check an external database for rejection rules (check_client_access type:table), and so on. Consult the Postfix documentation for details.

Tip

Prior to Version 2.0, Postfix used a pair of options to achieve the effect described here. Specifically, maps_rbl_domains contained a comma-separated list of blackhole list servers; these were used only if the reject_maps_rbl option was passed to smtpd_client_restrictions.

Spam and worms can often be identified by the presence of strings in message headers or bodies. For instance, you might know from experience that any message with a subject of earn $$$ is spam and can be discarded. Postfix includes several options that check message headers and bodies for such content:

header_checks: This option points to a file that contains checks that are applied to message headers—the parts of a message that contain the subject, the return address, etc. Typically, you’ll check headers for suspicious email subjects, senders, and perhaps recipients.
mime_header_checks: Increasingly, email messages use Multipurpose Internet Mail Extension (MIME) to encode special formatting and nontextual data. MIME extensions are also loved by spammers and worm authors because they can deliver text that’s harder to identify as spam or malicious computer code. You can use this option to point to a file that matches suspicious MIME headers. This option is available in Postfix 2.0 and later, and defaults to $header_checks.
nested_header_checks: Users and programs sometimes attach one email message to another. To search such attached messages’ headers, you can use this option, which is available only in Postfix 2.0 and later, and defaults to $header_checks.
body_checks: This option searches email messages’ bodies—the parts of the message that users read, as opposed to the headers. Scanning message bodies can be a good way to identify worms and spam. This option is available only in Postfix 2.0 and later.

All of these options take an external filename, along with a code for the file’s format, as an option. This file is typically a plain-text file or a database file that’s derived from a plain-text file. The resulting entry in main.cf looks something like this:

header_checks = pcre:/etc/postfix/header_checks

The pcre code stands for Perl compatible regular expression. Alternatively, you can employ regexp to use non-Perl regular expressions. In either case, lines in the original text file take the specified form followed by one of the following action codes:

DISCARD optional text: Accepts the message as if for delivery but quietly discards it. If optional text is present, it’s entered in the mail logs; otherwise, a generic message is logged.
DUNNO: Moves on to the next input line. This option is synonymous with OK.
FILTER transport:destination: Passes the message through the external content filter, as specified by the transport method (smtp, procmail, and so on) and destination (a hostname or filename, typically). The filter receives the message only after Postfix has examined all the message’s lines, so the message can be rejected before the filter is called.
HOLD: Places the message in the hold queue, which is a sort of limbo in which the message is neither delivered nor discarded. A system administrator can examine the hold queue using the postcat command and release messages from the queue or destroy them using postsuper.
IGNORE: Deletes the current line of input from the message and moves to the next one.
PREPEND text: Places the specified text on a line of its own, immediately before the input line that matched. This can flag messages for further spam processing.
REDIRECT user@domain: Sends the message to the specified user rather than the recipient specified by the mail’s envelope. This feature can be used to forward mail for users who have moved elsewhere, as an alternative method of forwarding mail to internal servers, and in other ways. However, many potential uses of this action are better achieved through other means.
REJECT optional text: Rejects delivery of the message. If you specify optional text, it’s passed to the sender; if not, a generic error message is delivered to the sender.
WARN optional text: Logs a warning with the specified optional text in the mail log file. This action is intended primarily for testing new rules before implementing them.

Many of these action codes are available only in Postfix 2.0, 2.1, or later. As an example of their use, consider the following entries:

### Subject headers indicative of spam
/^Subject: ADV:/ REJECT
/^Subject: Accept Credit Cards/ DISCARD
### Additional header checks
/^(From|Received):.*iamspam\.biz/ REJECT
/^From: spammer@abigisp\.net/ FILTER procmail:/etc/procmailrcs/maybespam

This set of rules rejects mail whose Subject header begins with ADV: or whose From or Received headers include the string iamspam.biz. It also discards mail whose Subject header begins with Accept Credit Cards and passes mail from spammer@abigisp.net through a Procmail filter, /etc/procmailrcs/maybespam. This filter presumably performs additional checks that are too complex for Postfix to handle by itself.
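
The way these entries are evaluated can be approximated in Python. The sketch below checks one header line against the sample patterns in order and returns the action of the first match. It is an illustration under stated assumptions rather than a description of Postfix internals: Python’s re module stands in for the pcre table type, case-insensitive matching is assumed, and the function name is hypothetical.

```python
import re

# (pattern, action) pairs mirroring the sample header_checks entries above.
RULES = [
    (r"^Subject: ADV:", "REJECT"),
    (r"^Subject: Accept Credit Cards", "DISCARD"),
    (r"^(From|Received):.*iamspam\.biz", "REJECT"),
    (r"^From: spammer@abigisp\.net", "FILTER procmail:/etc/procmailrcs/maybespam"),
]


def first_action(header_line, rules=RULES):
    """Return the action of the first rule matching one header line, or None.

    Case-insensitive matching is an assumption of this sketch.
    """
    for pattern, action in rules:
        if re.search(pattern, header_line, re.IGNORECASE):
            return action
    return None
```

Running a handful of real headers from your own mail through a function like this is a cheap way to spot a pattern that is broader than intended before it goes into a live header_checks file.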

In addition to its own checks, Postfix can send mail through Procmail for processing. In fact, using Procmail is usually the default. If in doubt, check your main.cf file for a line like the following:

mailbox_command = /usr/bin/procmail

When called in this way, Procmail is used for final message delivery. You can call it in other ways, such as in a FILTER action in a header check. Broadly speaking, Procmail is a more powerful way of looking for suspicious patterns in email than Postfix’s own rules. Procmail can also be customized on a user-by-user basis, which is harder with Postfix’s rules. Thus, you may prefer to use Procmail alone, rather than use Postfix’s pattern matching tools. The main advantage of Postfix’s rules is that they can be used to reject messages before they’re fully received. In particular, if a header check causes a message to be rejected, Postfix refuses delivery before many bytes are transferred. This feature can help conserve bandwidth, at least if you can devise rules that correctly identify large spams or worms from their headers alone. Procmail delivery rules, by contrast, operate only after the mail server has accepted the mail for delivery. Unfortunately, spammers and worm writers have become very good at disguising their unwanted emails’ headers, so you may have no choice but to accept the entire email in order to properly identify it. The topic of spam and worm control is covered in more detail later in this chapter.

Using Procmail

Procmail is a very powerful mail processing tool. It does far more than spam filtering; it can redirect mail based on nonspam criteria, sort mail into folders, copy messages for archival purposes, pass mail through arbitrary external programs, and more. Still, one of Procmail’s main applications is as a spam-fighting tool; you can use its native pattern-matching features to discard mail or shunt it into a suspected spam folder. You can also pass messages to external programs for tests that Procmail can’t handle by itself.

Using Procmail requires calling it in some way. Typically, you do so by configuring your SMTP server to call Procmail as part of its mail delivery process. You can then move on to Procmail configuration. To configure Procmail you need to understand the Procmail configuration file format and be able to create Procmail recipes, which are the rules used to direct mail in Procmail.

Calling Procmail

The first step in Procmail use is to ensure that your mail system uses it. Most Linux SMTP server configurations use Procmail by default, so you may not need to change anything about your basic SMTP configuration to use Procmail. If you’re in doubt, though, or if you want to fine-tune the configuration, you can check some settings:

Sendmail: You should set three options in the sendmail .mc file to use Procmail. The first of these is:

define(`PROCMAIL_MAILER_PATH', `/usr/bin/procmail')

This tells sendmail where to find the Procmail binary. (Some configurations put this option in another configuration file, but you can override it in your sendmail .mc file if you need to do so.) The remaining options are FEATURE(`local_procmail') and MAILER(procmail), which collectively tell sendmail to use Procmail for local deliveries. As described in the earlier Section 13.3.3.3, you can also call Procmail in other ways, such as in a forwarding configuration.

Postfix: To call Procmail as part of the Postfix delivery rules, you must tell Postfix to use the Procmail binary as part of its delivery system: mailbox_command = /usr/bin/procmail. As described in an earlier section, you can also tell Postfix to use Procmail in mail forwarding configurations.

The Procmail configuration file

Procmail can use one or more of several configuration files:

/etc/procmailrc: This file is the global Procmail configuration file. It’s called as root to process all the mail that the SMTP server handles. For spam-control purposes, you use this file to apply rules you want to use on all the email that’s delivered to your local users. Typically, this means you use it to apply rules that are very unlikely to result in false alarms.
~/.procmailrc: Individual users can create .procmailrc files in their home directories. These files have the same format as /etc/procmailrc, but they’re applied only to email directed to specific users. This enables users to apply their own customized Procmail rules. Alternatively, you can provide some standard configuration files in specific locations and allow users to create symbolic links to those files to achieve preset effects.
Other configuration files: Some methods of calling Procmail, such as those that use Procmail as part of mail forwarding schemes, enable you to pass the name of a configuration file to Procmail. Sometimes these reside in a directory such as /etc/procmailrcs, but that location is arbitrary.

Warning

Procmail runs as the user who calls it, although when it’s called as root, it can drop its privileges under some circumstances. A rule that works well in ~/.procmailrc (when Procmail is called as the end user) may not work well when placed in /etc/procmailrc (when Procmail is called as root), or vice versa. Typically, you must be more careful about file permissions when calling Procmail as root, because writing to or creating a file (such as a mail folder) as root can make that file inaccessible to ordinary users, such as the mail’s intended recipient.

Whatever its name, a Procmail configuration file consists of three parts: comments (denoted by hash marks), environment variable assignments (similar to those in bash, such as MAILDIR = $HOME/Mail), and recipes (described next). The bulk of most Procmail configuration files consists of its recipes.

Creating Procmail recipes

Procmail recipes consist of three parts: the identification line, the conditions, and the action. The idea is that the action is initiated when the conditions are met. For instance, a condition might be that the string Viagra appear in the message body, and the action might be that the message is sent to /dev/null—that is, that the message be discarded. The form of the recipe is as follows:

:0 [flags] [:[lockfile]]
[conditions]
action

The identification line always begins with :0; that’s just the convention. The flags are described shortly; they specify where Procmail looks for condition matches, how it matches, and so on. The lockfile is a file that controls access to a mail file. If a file is locked, Procmail defers operating on it. Normally, a single colon (:) is sufficient, but you can specify the filename, if necessary. The conditions are technically optional, but in practice, most recipes have at least one condition line. (A recipe with no condition lines matches all mail messages.) Including multiple conditions causes Procmail to require all of them to match before the action line is carried out. Precisely one action is required for each recipe.

Procmail’s default behavior is to match conditions against message headers in a case-insensitive way. Several flags are available to change how Procmail handles these matches, though. Here are the more common:

H: This value does matches on message headers, which is the default.
B: This value does matches on message bodies.
D: This value does a case-sensitive pattern match, as opposed to the normal case-insensitive match.
c: Ordinarily, if a recipe matches, it’s passed to the action, which may discard it, alter it, or otherwise make the original inaccessible. This option causes the action to act on a “carbon copy” of the original message, which is useful if you want to, for example, send a duplicate copy of a message to another account or mail folder.
w: This value causes Procmail to wait for the action to complete. If the action fails, Procmail leaves the message in the queue for other recipes.
W: This option is similar to w, but it suppresses error messages.
f: This option pipes a message through another program, treating that program as a filter.
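
The effect of the H, B, and D flags, together with the rule that every condition must match, can be modeled in Python. This is a rough sketch and not Procmail’s own logic: Python’s re module stands in for Procmail’s matcher, the function and parameter names are hypothetical, and the treatment of HB as searching both areas is an assumption that goes beyond what this section states.

```python
import re


def recipe_matches(flags, conditions, headers, body):
    """Decide whether every condition of a simplified recipe matches.

    flags      -- string of flag letters, such as "", "B", or "BD"
    conditions -- list of regular expressions (without the leading "* ")
    headers    -- the message headers as one string
    body       -- the message body as one string
    """
    # H is the default search area; B switches to the body; HB searches both.
    areas = []
    if "H" in flags or "B" not in flags:
        areas.append(headers)
    if "B" in flags:
        areas.append(body)
    text = "\n".join(areas)

    # D requests case-sensitive matching; otherwise case is ignored.
    re_flags = re.MULTILINE if "D" in flags else re.MULTILINE | re.IGNORECASE

    # All conditions must match; an empty list therefore matches everything.
    return all(re.search(cond, text, re_flags) for cond in conditions)
```

For example, a condition of viagra fails against a message whose body contains VIAGRA when no flags are given (only headers are searched), succeeds with B, and fails again with BD because of the case difference.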

The Procmail recipe conditions can look like Greek to the uninitiated. Each begins with an asterisk (*), followed by a regular expression. At its simplest, a regular expression is simply a string that must match exactly. For instance, the regular expression Viagra matches the word Viagra in the input. Many characters have special meanings, though, such as:

^: A caret symbol indicates the start of a line; for instance, ^Viagra denotes the string Viagra, but only at the start of a line. Many conditions begin with a caret.
$: This character signifies the end of a line.
.: A period matches any single character except for an end-of-line character. For instance, h.t matches hat, hut, hot, or any other similar string.
x*: This string (where x is any single character) matches any number of x characters, including none. This is often combined with a dot (.), as in .*, to match any arbitrary group of characters.
x+: This expression works much like x*, but matches any occurrence of one or more x characters, rather than 0 or more.
x?: This string matches zero or one x characters.
(string1|string2): This expression matches one of two strings by separating them by a vertical bar within parentheses. This principle can be extended to more than two strings, as well.
(string)*: This expression matches zero or more instances of the specified string.
[chars]: Placing characters within square brackets causes Procmail to match any one of the enclosed characters. For instance, [abcz] matches any one of the characters a, b, c, or z. You can specify a range of characters by using a dash, as in [c-j] to indicate any letter between c and j.
\: The backslash character removes the special meaning from the subsequent character. For instance, to match a dot, you enter \. in the conditions.
!: This character appears only at the start of a conditions line and reverses its meaning; that is, if the regular expression matches, the recipe does not match.
?: Like !, this character appears only at the start of a conditions line. It tells Procmail to use the exit code of the specified program.

Regular expressions can be extremely complex, so you may need to consult the Procmail manpage or another source of information on regular expressions to learn more. The next section provides some examples.
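
If you want to experiment with the special characters listed above before writing recipes, Python’s re module treats these basic constructs the same way. The sketch below is only an aid for experimentation: Python’s engine differs from Procmail’s in other respects, and the sample strings and the function name are made up for illustration.

```python
import re

# Each entry: (pattern, text that should match, text that should not match).
EXAMPLES = [
    (r"^Viagra", "Viagra for sale", "Buy Viagra"),     # ^ anchors to line start
    (r"sale$", "Viagra for sale", "sale ends soon"),   # $ anchors to line end
    (r"h.t", "hot", "ht"),                             # . is any one character
    (r"^ab*c$", "ac", "adc"),                          # b* is zero or more b
    (r"^ab+c$", "abbc", "ac"),                         # b+ is one or more b
    (r"^ab?c$", "abc", "abbc"),                        # b? is zero or one b
    (r"^(spam|worm)$", "worm", "virus"),               # alternation
    (r"^(ab)*$", "abab", "aba"),                       # repeated group
    (r"^[c-j]$", "d", "k"),                            # character range
    (r"example\.com", "example.com", "exampleXcom"),   # escaped dot
]


def check_examples(examples=EXAMPLES):
    """Return True if every pattern matches its first text and not its second."""
    return all(
        re.search(pattern, good) and not re.search(pattern, bad)
        for pattern, good, bad in examples
    )
```

The escaped-dot entry is worth a second look: without the backslash, example.com would also match exampleXcom, which is a common source of overly broad conditions.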

Finally, each Procmail recipe ends with an action. Each action can take any of several forms:

A filename: An action that takes the form of a filename indicates that the message is to be stored in the specified file, which is treated as an mbox mail folder.
A subdirectory name: A filename that ends in a slash (/) is interpreted as a subdirectory name, in which case Procmail stores the message in this subdirectory in maildir format.
!: An exclamation mark denotes a list of email addresses to which the message should be forwarded. This can be useful for setting up individual mail forwarding to another system.
|: Procmail treats a vertical bar as a pipe character, much like bash. Its presence at the start of an action tells Procmail to pass the message to an external program for further processing.
{: You can nest multiple tests by using a left curly brace as the action line; subsequent lines, until a right curly brace (}), constitute one or more additional recipes that are used only if the initial recipe matches. You can use this feature to control whether or not to perform certain tests; for instance, to perform spam checks only if mail doesn’t come from certain addresses (that is, to implement a white list).

Because Procmail supports just one action per recipe, you may need to create an external script if you want to perform some complex action. Be sure your external script reads the entire message. If it doesn’t, Procmail may send the message through additional recipes, which can result in duplicate deliveries.
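
The advice about reading the entire message can be illustrated with a small Python function of the kind an external script might be built around. This is a hypothetical sketch: the function name and the keyword test are invented for illustration, and the choice of exit code 0 to signal a match is an assumption about how you would want a calling recipe to interpret the result.

```python
def classify_stream(stream, keyword="viagra"):
    """Read the whole message from `stream` and return an exit code.

    Returns 0 if the (hypothetical) keyword appears anywhere in the message
    and 1 otherwise. The entire input is consumed before deciding, even
    though the keyword might appear early in the message.
    """
    message = stream.read()  # consume everything that is piped in
    return 0 if keyword.lower() in message.lower() else 1
```

A script built on this function would finish with sys.exit(classify_stream(sys.stdin)), so that the calling recipe can act on the exit code. The important point is the single read() call, which drains the input rather than stopping at the first interesting line.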

Examples of Procmail recipes

Example 13-1 shows a sample Procmail recipe file intended for use by individuals. (When used by the system, some file ownership issues can arise. This problem can be avoided by adding a DROPPRIVS = yes line to the start of the file.) This example illustrates several useful techniques:

Nesting: The first rule contains two nested subrules. The intent is to exempt mail from two regular correspondents from the spam checks, which make up the nested rules. The nested rules are indented to set them off, but this indentation isn’t required.
Spam checks: The two spam-check rules look for strings that are indicative of spam. The first searches message bodies for the string 301 followed by 0 or more characters, followed by S, 0 or more characters, and 1618. This string is found in some spams that reference a failed piece of U.S. legislation, S.1618, which dealt with spam. The legislation failed years ago, but spam still references it, as if to legitimize itself. The second spam check looks for a string in subject headers that identifies messages encoded using a system that’s common for certain Asian languages. Most non-Asian users seldom or never receive nonspam mail with such subject headers, but a lot of spam uses them.
Flags: Several rules use flags to search the text of messages or to create carbon copies.
Mail sorting: The spam messages are “sorted” to /dev/null, which effectively discards the messages. The last rule saves mail from a mailing list (identified by a unique “to” header) into the genetics-list mbox mail folder in the $MAILDIR subdirectory, which is identified on the first line of the recipe file.

Example 13-1. Sample Procmail recipe file

MAILDIR = $HOME/Mail

# Do some spam checks, but exclude anything from good addresses
:0
*! ^From:.*(goodguy@pangaea\.edu|linnaeus@example\.com)
{
  :0 B
  * ^.*301.*S.*1618
  /dev/null

  :0
  * ^Subject:.*=\?big5\?*
  /dev/null
}

# Forward mail from goodguy@pangaea.edu with "peas" in the
# subject line to mendel@luna.edu
:0 c
* ^From:.*goodguy@pangaea\.edu
* ^Subject:.*peas
! mendel@luna.edu

# Shunt mail from a genetics mailing list into its own folder
:0:
* ^To:.*genetics@mailer\.example\.org
$MAILDIR/genetics-list
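
One way to gain confidence in conditions like those in Example 13-1 is to try them against sample text outside of Procmail. The Python sketch below does so for the white list and the two spam checks. It is an approximation: Python’s re module stands in for Procmail’s matcher, the function name is hypothetical, and the sample messages used to exercise it are made up.

```python
import re

# Conditions copied from Example 13-1.
WHITELIST = r"^From:.*(goodguy@pangaea\.edu|linnaeus@example\.com)"
BODY_SPAM = r"^.*301.*S.*1618"
SUBJECT_SPAM = r"^Subject:.*=\?big5\?*"

FLAGS = re.MULTILINE | re.IGNORECASE  # headers and bodies span many lines


def would_discard(headers, body):
    """Return True if the nested spam checks would send the message to /dev/null."""
    if re.search(WHITELIST, headers, FLAGS):
        return False  # the leading "!" skips the spam checks for these senders
    return bool(
        re.search(BODY_SPAM, body, FLAGS) or re.search(SUBJECT_SPAM, headers, FLAGS)
    )
```

A check like this doesn’t replace a trial run through Procmail itself, but it catches mistyped patterns quickly.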

One of the major problems with using Procmail alone as a spam-control tool is that creating and maintaining a set of Procmail rules can be quite labor-intensive. This is particularly true because spam and worms are constantly changing, so a good set of rules for today may be inadequate tomorrow. You may want to search for a ready-made set of Procmail recipes, such as SpamBouncer (http://www.spambouncer.org) or the Sample Procmail Recipes with Comments (http://handsonhowto.com/pmail102.html). The first of these is specifically intended as an antispam tool, whereas the second is a practical teaching tool. If you periodically check back with such pages and update your filters, you can keep a reasonably up-to-date Procmail antispam configuration. On the other hand, rules created by somebody else are more likely to miss spam or, worse, falsely identify nonspam as spam.

Tip

Before deploying a new Procmail recipe, or especially an extensive set of recipe changes, try testing it on a small scale. You can create a test account, place your new recipe in its .procmailrc file, and send test messages—both spam and nonspam—to that account.

Using SpamAssassin

SpamAssassin (http://spamassassin.apache.org) is an antispam tool based on a large number of tests. Each test changes the score of the message. SpamAssassin doesn’t actually delete messages; instead, it adds headers identifying likely spam as such. The idea is that you’ll call SpamAssassin from Procmail, a mail server, or a mail reader, and then use that calling tool to detect the SpamAssassin spam report and delete or redirect messages based on that report.

Tip

SpamAssassin has grown into quite a large tool. In fact, it’s complex enough that it’s spawned its very own book: SpamAssassin (O’Reilly). If you need to perform complex tasks or configure SpamAssassin as part of a mail server for a large site, it’s worthwhile to read this or other SpamAssassin-specific documentation.

SpamAssassin basics

The SpamAssassin software comes with most major distributions, so installing it from your distribution medium is usually the simplest course of action. If you can’t find SpamAssassin with your distribution, go to the main SpamAssassin site, and download it. SpamAssassin is actually a Perl script and relies on several Perl modules, so you may need to install additional packages that hold these modules.

Once SpamAssassin is installed, you should test its operation by manually feeding it a few spam and nonspam messages. You do this by redirecting a message in a file into the spamassassin command. Adding the -t option adds an extra report to the end of the output, which appears on the screen:

$ spamassassin -t < message.txt

The message.txt file should contain a complete message, including full headers. Most mail readers have an option to save messages to disk with full headers, so use that option to get your samples. The SpamAssassin output includes two additions to the message. The first addition appears at the end of the message headers and constitutes SpamAssassin’s report, as intended for subsequent mail processing tools, such as Procmail or an email reader. For a nonspam message, this addition is likely to resemble the following:

X-Spam-Checker-Version: SpamAssassin 3.0.0-g3.0.0 (2004-09-13) on
        mail.example.com
X-Spam-Level:
X-Spam-Status: No, score=0.1 required=5.0 tests=RCVD_IN_SORBS
        autolearn=unavailable version=3.0.0-g3.0.0

The first line simply identifies the version of SpamAssassin and the computer on which it’s running. The second line holds the spam level, which is expressed as a number of asterisks (*). Because this is an innocuous nonspam message, no asterisks are displayed; however, some nonspam messages will have a small number of asterisks (five is the typical cutoff point for spam, although you can use something else if you like). The third line, which typically extends across multiple lines, summarizes the tests that raised alarms. In this case, the total spam score is 0.1 (score=0.1). That 0.1 value came from the RCVD_IN_SORBS test, which the headers themselves don’t explain. The -t option to spamassassin, though, adds extra lines at the end of the message:

Content analysis details:   (0.1 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.1 RCVD_IN_SORBS          RBL: SORBS: sender is listed in SORBS
                            [172.24.98.102 listed in dnsbl.sorbs.net]

This text identifies the RCVD_IN_SORBS flag as meaning that the sender address is listed in the SORBS blackhole list. This information can help you understand what SpamAssassin is doing right (or wrong), but it’s not provided in normal operation. You can, of course, consult the SpamAssassin documentation to learn more about specific tests.

When you test a spam message, the spam headers added to the message are likely to report more serious problems:

X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 3.0.0-g3.0.0 (2004-09-13) on
        mail.example.com
X-Spam-Level: ******
X-Spam-Status: Yes, score=6.8 required=5.0 tests=FORGED_MUA_OUTLOOK,
        FORGED_OUTLOOK_TAGS,HTML_40_50,HTML_FONTCOLOR_UNSAFE,HTML_MESSAGE,
        HTML_TAG_EXISTS_TBODY,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_DSBL,
        RCVD_IN_SORBS autolearn=spam version=3.0.0-g3.0.0

This output includes one header line that’s not present in the nonspam output: X-Spam-Flag: YES. You can search for this line using a Procmail recipe, as described shortly, to detect spam after messages have been processed with SpamAssassin. The X-Spam-Level header shows six stars, corresponding to the score of 6.8 (rounded down) reported in the X-Spam-Status line. This line also shows quite a few hits on individual spam tests. These are reported in greater detail at the end of the message if you use the -t option to spamassassin.
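When you test many sample messages, it can help to pull just the numeric score out of each result. The following is a minimal sketch, not part of SpamAssassin: spam_score is a hypothetical helper name, and the function assumes the message has already been processed so that an X-Spam-Status header is present. It accepts both the score= label and the hits= label used by older SpamAssassin releases.

```shell
# Hypothetical helper: print the score from the X-Spam-Status header
# of a message read on standard input.
# Only the header block is examined; sed quits at the first blank line.
spam_score() {
  sed -n -E \
    -e '/^$/q' \
    -e 's/^X-Spam-Status:.*[ ,](score|hits)=(-?[0-9.]+).*/\2/p'
}

# Example use (assumes spamassassin is installed):
#   spamassassin < message.txt | spam_score
```

Because the function only reads text, you can also run it on messages saved after delivery to review how close borderline mail came to the cutoff.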

You should run several spam and several nonspam messages through SpamAssassin. You should verify that none of the nonspam messages are flagged as spam and that a significant number of spams are flagged. SpamAssassin might not detect all of your spams, though. You can take the time to fine-tune its operation by changing the points assigned to individual rules or by enabling its auto-learning feature, which enables it to update its rules on the fly. You can also combine SpamAssassin with other tools, such as your own custom Procmail filters.

Calling SpamAssassin from Procmail

You can call SpamAssassin in various ways. One is to use Procmail for local mail delivery. (Calling SpamAssassin as part of a mail gateway system is described next.) Add the following recipes to the start of your Procmail configuration file to call SpamAssassin and sort suspected spam into two folders, almost-certainly-spam and probably-spam:

:0fw
* < 256000
| spamassassin

:0:
* ^X-Spam-Level: \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*
almost-certainly-spam

:0:
* ^X-Spam-Status: Yes
probably-spam

:0
* ^^rom[ ]
{
  LOG="*** Dropped F off From_ header! Fixing up. "

  :0 fhw
  | sed -e '1s/^/F/'
}

These rules are taken from the procmailrc.example file that ships with SpamAssassin. That file also includes several comments that describe its rules. In short, the first recipe passes messages that are smaller than 256,000 bytes through SpamAssassin, which adds its headers to the messages. (Larger messages are almost certainly not spam, although they can contain worms. SpamAssassin doesn’t cope well with very large messages, hence this size limitation.) The second recipe dumps messages with a spam score of 15 or higher into the almost-certainly-spam folder, while the third recipe places messages that are flagged as spam but that weren’t caught by the second recipe into the probably-spam folder. The final recipe works around a Procmail bug that can cause the leading F to be dropped from the mbox From line (the envelope line that begins with From followed by a space, not the From: header). (This bug has been fixed, but the workaround is included in case you’re running an old version of Procmail.)

Of course, you can change these rules if you like. For instance, you can send suspected spam to /dev/null, but doing so means that if any such messages really aren’t spam, you won’t be able to retrieve them. Placing suspected spam in folders means that you can open those folders and recover any misclassified messages.
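To check how a saved, already-scanned message would be sorted without delivering it, you can mirror the logic of the second and third recipes in a small shell function. This is only a sketch: spam_folder is a hypothetical name, normal-delivery is a placeholder label for mail that falls through both tests, and grep is case-sensitive here, whereas Procmail matches without regard to case by default.

```shell
# Hypothetical helper: report which folder the recipes above would
# choose for a message on standard input that already carries
# SpamAssassin headers. The tests run in the same order as the recipes.
spam_folder() {
  headers=$(sed '/^$/q')   # keep only the header block
  if printf '%s\n' "$headers" | grep -q '^X-Spam-Level: \*\{15\}'; then
    echo almost-certainly-spam
  elif printf '%s\n' "$headers" | grep -q '^X-Spam-Status: Yes'; then
    echo probably-spam
  else
    echo normal-delivery   # placeholder: message continues to later recipes
  fi
}
```

The order matters: a message with 15 or more asterisks also carries X-Spam-Status: Yes, so the stricter test must come first, just as it does in the Procmail configuration.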

Calling SpamAssassin from sendmail

Calling SpamAssassin from Procmail is fine for local mail delivery, but it doesn’t work well for a mail server that should operate as a spam filter for another server, such as a Microsoft Exchange server. For this configuration, you need a way to call SpamAssassin more directly as part of the mail relay process; the MIMEDefang tool (http://www.mimedefang.org) can do so. Although a complete description of MIMEDefang and the sendmail features it uses is beyond the scope of this book, a brief description should get you started.

The key to the process is to use the sendmail INPUT_MAIL_FILTER configuration line to call MIMEDefang, which in turn is configured to pass incoming messages through SpamAssassin and take actions accordingly. A full sendmail .mc file that implements these features appears in Example 13-2.

Example 13-2. Sample sendmail configuration with SpamAssassin

divert(-1)
#
# Spam-checking gateway configuration
#
divert(0)dnl
VERSIONID(`Spam-checking gateway')
OSTYPE(linux)dnl
DOMAIN(generic)dnl
FEATURE(virtusertable)dnl
FEATURE(mailertable)dnl
FEATURE(access_db)dnl
FEATURE(always_add_domain)dnl
FEATURE(nouucp,`reject')dnl
FEATURE(`relay_based_on_MX')dnl
define(`confDEF_USER_ID',``8:12'')dnl
define(`confPRIVACY_FLAGS', \
  `goaway,noreceipts,restrictmailq,restrictqrun,noetrn')dnl
define(`confTO_QUEUERETURN',`7d')dnl
define(`confTO_QUEUEWARN_NORMAL',`1h')dnl
define(`confMAX_DAEMON_CHILDREN',`60')dnl
define(`confMAX_MESSAGE_SIZE',`10000000')dnl
define(`confMAX_CONNECTION_RATE_THROTTLE',`10')dnl
define(`confMAX_RCPTS_PER_MESSAGE',`500')dnl
INPUT_MAIL_FILTER(`mimedefang',`S=unix:/var/spool/MIMEDefang/mimedefang.sock, \
   F=T, T=S:60s;R:60s;E:5m')dnl
MAILER(smtp)dnl
MAILER(local)dnl
MAILER(procmail)dnl

Tip

A couple of lines in Example 13-2 are very long; they’re denoted by trailing backslashes (\) at the end of the first line, but should be entered on single lines without the backslashes.

This configuration also requires you to set up the sendmail mailer table file (typically /etc/mail/mailertable) that was described earlier. It must include a line that points the system to an internal server that will receive the spam-filtered messages:

pangaea.edu   esmtp:internal.pangaea.edu

In addition to the sendmail configuration, you must configure MIMEDefang. This tool requires three directories: /var/spool/MIMEDefang, /var/spool/MD-Quarantine, and /var/spool/MD-Bayes. Assign ownership of these directories to the account used to run MIMEDefang (typically defang). Once this is done, edit mimedefang-filter (usually stored in /usr/local/etc/mimedefang). Set the $AdminAddress, $AdminName, and $DaemonAddress lines to point to your local postmaster’s email address, the postmaster’s name (often your domain’s name and Postmaster), and the email address used in messages MIMEDefang generates. You should also set the $SALocalTestsOnly item to 1 to restrict SpamAssassin to local tests or to 0 to allow it to use network-based tests.

Configure the internal server computer (internal.pangaea.edu in this example) to accept mail only from the spam-filtering gateway, from the internal server itself, and from any local systems that should be able to relay outgoing mail. This server shouldn’t accept mail directly from the outside. Certainly it shouldn’t be listed as an MX server in your domain’s DNS configuration; only the spam-filtering mail gateway should be listed in this capacity.

Using Bogofilter

Unlike SpamAssassin, which combines many different spam-fighting tools in one system, Bogofilter (http://bogofilter.sourceforge.net) takes a single approach to spam fighting. It’s an implementation of a statistical spam filter. As such, it requires training on a corpus of both spam and nonspam messages before it can work. Thus, you may need to save your spam for a few days before you can effectively use Bogofilter.

Tip

SpamAssassin can use a statistical filter as part of its rule set. To do so, you must give it sample messages to train it, using its sa-learn command. Consult this command’s manpage for details; the training process is similar to that for Bogofilter, although the command details differ.

Bogofilter can be installed like most other packages; check your distribution to see if a version is available with it. If not, go to the project’s home page, and download a binary or source code version from there. Like SpamAssassin, Bogofilter is called from Procmail or a mail reader program. Before you do that, though, you must train Bogofilter.

The training procedure requires examples of both spam and nonspam messages—the more, the better. (A collection of several thousand messages is not excessive, but Bogofilter can do some good with just a few dozen.) Ideally, these messages should be typical of spam and nonspam messages that you receive; you want Bogofilter to learn to differentiate your spam from your nonspam. Although you can find spam collections on the Internet, using them for Bogofilter training can cause problems, because other people may receive different types of spam, or because you might not classify everything in such collections as spam. The simplest way to train Bogofilter is to place all your spam messages in one file and all your nonspam messages in another file, both of which should be in mbox format. In subsequent examples, I refer to these as spam.mbox and nonspam.mbox, respectively.

Conceptually, the simplest way to train Bogofilter is to pass the spam and nonspam messages through the bogofilter command using the -s and -n options, respectively:

$ bogofilter -s < spam.mbox
$ bogofilter -n < nonspam.mbox

These commands create a database file, ~/.bogofilter/wordlist.db, which records the words found in the messages, along with counts of how often each word appears in spam and nonspam messages. When Bogofilter later encounters a new message, it uses these counts to estimate the probability that the message is spam.
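To make the role of the word counts concrete, the following sketch computes a spam probability for a single word from its counts. It is a deliberately simplified illustration, not Bogofilter’s actual computation, which combines evidence from many words in a message. The function name word_spamicity and its argument order are assumptions made for this example, and both message totals are assumed to be nonzero.

```shell
# Hypothetical helper: estimate how "spammy" one word is.
#   $1 = number of spam messages containing the word
#   $2 = total spam messages in the training set (assumed nonzero)
#   $3 = number of nonspam messages containing the word
#   $4 = total nonspam messages in the training set (assumed nonzero)
word_spamicity() {
  awk -v sc="$1" -v sm="$2" -v hc="$3" -v hm="$4" 'BEGIN {
    ps = sc / sm    # fraction of spam messages that contain the word
    ph = hc / hm    # fraction of nonspam messages that contain the word
    if (ps + ph == 0) { print "0.50"; exit }   # unseen word: no evidence
    printf "%.2f\n", ps / (ps + ph)
  }'
}
```

A word that appears in 30 of 100 spam messages but only 5 of 100 nonspam messages scores well above 0.5 under this scheme, whereas a word seen only in nonspam scores 0.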

Tip

Because the Bogofilter database file is stored in the user’s home directory, you should create the Bogofilter database file by running the program as that user. This user can conceivably be root, but for security reasons, it’s best if you find a way to run Bogofilter as a non-root user. If necessary, you can create an initial database, place the mail classification call to bogofilter in users’ individual ~/.procmailrc files, and modify the global configuration to use the global word files in addition to individual users’ word lists.

Another approach to Bogofilter training is to use a training script, such as bogominitrain.pl or randomtrain. These scripts might or might not be shipped with a distribution-provided Bogofilter package. If they’re not on your system, consult the main Bogofilter site. These scripts perform more sophisticated training; namely, they use the bogofilter command to classify each message and perform training only if the message isn’t classified correctly by Bogofilter. If necessary, this process is repeated until Bogofilter classifies every message correctly. The result tends to be smaller databases, and often more accurate results, but initial training takes longer. Consult the documentation that comes with the training script for details. Typically, you pass the script the names of both the spam and the nonspam files, and perhaps additional parameters:

$ bogominitrain.pl -fnv ~/.bogofilter nonspam.mbox spam.mbox '-o 0.9,0.3'

This example passes the location of the word list, the nonspam and spam files, and classification parameters (described in more detail shortly).

Whatever training method you use, you should also examine, and perhaps modify, the Bogofilter configuration file. By default, /etc/bogofilter.cf provides systemwide defaults, but individual users can override these by creating a configuration file called ~/.bogofilter.cf (this filename can be set in /etc/bogofilter.cf). Options in this file are well commented, so perusing it will give you some idea of what you can change. Some options you may want to modify include:

bogofilter_dir: This option points to the word list directory. Changing it is one way ordinary users can access a global word list; however, doing so may make it impossible for individuals to change that word list.
ignore_case: Ordinarily, Bogofilter pays attention to case; Viagra is distinct from viagra. You can set ignore_case=yes to have Bogofilter convert all words to lowercase, though. This can help overcome attempts to confuse antispam tools by mixing up case in words, but it can also reduce Bogofilter’s sensitivity to strings for which case can be important.
algorithm: Bogofilter can use several different algorithms for determining the spamicity of a message (that is, the probability that a message is spam). These algorithms are graham, robinson, and fisher. The default is fisher, which generates a three-way classification: spam, nonspam, or unsure.
ham_cutoff: This option sets the maximum spamicity score (between 0.0 and 1.0) at which a message is still classified as nonspam. A value of 0.10 is typical and usually works well.
spam_cutoff: This option sets the minimum spamicity score (between 0.0 and 1.0) that’s required for a spam classification. A value of 0.95 is typical and usually works well.
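The way ham_cutoff and spam_cutoff divide the spamicity range into three classes can be sketched as follows. The function classify_spamicity is hypothetical and exists only to illustrate the comparison; the defaults of 0.10 and 0.95 are the typical values mentioned above, and the labels mirror those Bogofilter writes in its X-Bogosity header.

```shell
# Hypothetical helper: map a spamicity score to a three-way verdict.
#   $1 = spamicity (0.0 to 1.0)
#   $2 = ham_cutoff  (optional, defaults to 0.10)
#   $3 = spam_cutoff (optional, defaults to 0.95)
classify_spamicity() {
  awk -v s="$1" -v ham="${2:-0.10}" -v spam="${3:-0.95}" 'BEGIN {
    if (s + 0 >= spam + 0)       print "Yes"      # at or above spam_cutoff
    else if (s + 0 <= ham + 0)   print "No"       # at or below ham_cutoff
    else                         print "Unsure"   # anything in between
  }'
}
```

Raising ham_cutoff or lowering spam_cutoff shrinks the "Unsure" band, at the cost of more confident misclassifications.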

Once you’ve set these values and trained Bogofilter, you should test its operation by passing spam and nonspam messages through the bogofilter command. Ideally, you should use messages that you held back from the training so that you can judge how Bogofilter handles messages it’s never seen. Use the -v option to have the program generate a verbose report on the input message, which you supply by redirecting a file to the command’s standard input:

$ bogofilter -v < message.txt

X-Bogosity: Yes, tests=bogofilter, spamicity=1.000000, version=0.16.4

This result shows a classification of the message as spam (X-Bogosity: Yes), with a very high spamicity score (1.000000). A nonspam message is likely to generate a much lower score:

$ bogofilter -v < message.txt

X-Bogosity: No, tests=bogofilter, spamicity=0.000000, version=0.16.4

Because of its three-way output, Bogofilter can also tell you that it’s unsure of the status of the message:

$ bogofilter -v < message.txt

X-Bogosity: Unsure, tests=bogofilter, spamicity=0.500008, version=0.16.4

If you find that Bogofilter isn’t classifying your messages correctly, you should revisit your training procedures. Perhaps you didn’t classify enough messages or delivered them with the wrong parameters (confusing spam and nonspam messages, for instance). Note that a classification of “unsure” works like a nonspam classification in most respects, so you shouldn’t be too concerned if some of your nonspam messages are classified in this way, unless the spamicity ratings are very close to the spam cutoff point. If you have classification problems, you might also consider fine-tuning the Bogofilter cutoff criteria (ham_cutoff and spam_cutoff). You can increase or decrease these values, but with certain risks; if you make either the nonspam or spam category too large, you’ll risk misclassifying messages.

Tip

Numerically, the largest range of spamicity values is above the ham_cutoff value but below the spam_cutoff value. Thus, you might expect that most messages will end up classified as “unsure.” In practice, though, most messages achieve very high (close to 1.0) or very low (close to 0.0) spamicity ratings.

With Bogofilter now correctly classifying at least most of your messages, it’s time to integrate it into your mail delivery system. One way to do this is by calling Bogofilter in Procmail. The following Procmail recipe will do this:

:0HB:
* ? bogofilter -u -l
probably-spam

This recipe passes the message through the bogofilter command. The -u option tells Bogofilter to automatically add messages that it classifies as spam or nonspam to the appropriate word lists. This option is both potentially useful and potentially dangerous; it’s useful because it can help keep your spam database updated, but it’s dangerous because if Bogofilter misclassifies a message, that misclassification can lead to more misclassifications. (If a message is classified as “unsure,” it won’t be added to the database.) The -l option logs Bogofilter activity. This recipe stores spam messages in the probably-spam folder; nonspam messages go on for normal delivery.
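The ? in the recipe’s condition tells Procmail to run the command and treat an exit status of 0 as a match. Bogofilter reports its verdict through that exit status: 0 for spam, 1 for nonspam, 2 for unsure, and 3 for an error. The small function below, whose name bogo_verdict is made up for this illustration, translates a status value into a label so you can see what Procmail is reacting to.

```shell
# Hypothetical helper: translate a bogofilter exit status into a label.
bogo_verdict() {
  case $1 in
    0) echo spam ;;      # the only status that makes the "* ?" condition match
    1) echo nonspam ;;
    2) echo unsure ;;
    *) echo error ;;     # 3 (I/O or other error) and anything unexpected
  esac
}

# Example use (assumes bogofilter is installed and trained):
#   bogofilter < message.txt; bogo_verdict $?
```

Because only status 0 matches, messages that Bogofilter rates as unsure, or that it fails to process, continue to normal delivery rather than landing in probably-spam.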

If you use the -u option, and Bogofilter misclassifies a message, you should correct the problem. You can do this with the -N and -S options, which undo previous registrations of a message as nonspam and spam, respectively. You can combine these options with -s and -n to reregister the messages correctly. For instance, if Bogofilter has registered a message as nonspam but in fact it’s spam, you can extract the message to a file (complete with its headers) and type the following command:

$ bogofilter -Ns < message.spam

To test that it’s worked correctly, pass the message through bogofilter again, using -v rather than -Ns; Bogofilter should now classify the message as spam, or at least give it a much higher spamicity score. (Register it again with bogofilter -s to strengthen Bogofilter’s tendency to classify the message as spam, if desired.) Use -Sn rather than -Ns to undo an incorrect classification of a nonspam message as spam.

Discarding or Quarantining Suspicious Attachments

The vast majority of email worms released over the years have been written for Windows systems. Any of the antispam tools described here can be used to locate and deal with worms. Using a Linux system for this task ensures that the mail server itself can’t become infected, even through gross negligence. (At least, assuming Windows worms are in play; theoretically, Linux worms could be written to take advantage of flaws in Linux software.)

The threat of Windows worms is such that many sites have taken drastic measures to protect themselves: they reject all mail carrying certain types of attachments, or even all email attachments. The reasoning is that nobody has a valid reason to email, say, Windows .exe executables, so any such executable must be a worm. The validity of such reasoning is uncertain, but it may be so close to the truth for certain sites that discarding or quarantining messages with such attachments may be worthwhile. Example 13-3 shows a couple of Procmail recipes that discard certain suspicious messages.

Example 13-3. Procmail recipes to discard suspicious attachments

:0 B
* ^Content-Type: audio/x-(wav|midi);
/dev/null

:0
* ^Content-Type: multipart/(mixed|alternate|alternative|related)
{
  :0 B
  * ^.*name=.*\.(bat|com|exe|pif|scr|vbs|zip)
  /dev/null
}

The first of these rules discards everything with a Content-Type line of audio/x-wav or audio/x-midi. Theoretically, these lines identify certain types of audio files, which might be legitimate attachments in some environments; however, in practice, worms often try to masquerade as these file types. The second rule looks for any of several content types in the header and, if found, searches for a line that includes name= followed by any of several filename extensions. Some of these, such as .bat, .com, and .exe, identify Windows executables. Others don’t, but again, Windows worms frequently try to masquerade as files of these types.

Unfortunately, rules such as these are likely to produce false alarms. The second rule is particularly overzealous because it discards messages with attached Zip files. You can, of course, eliminate some of these filename extensions, but that reduces the effectiveness of the tests. Alternatively, you can enclose the test in a white-list test to enable trusted senders to deliver mail containing these attachments. Another option is to rename attachments rather than discard them; for instance, rename a .zip file to .zip.txt. This enables users to access the files, but makes it harder for worms that are named in this way to do harm automatically.
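The renaming approach can be sketched in shell. The function below only computes the new name; it doesn’t rewrite a MIME message, which requires a tool that understands message structure. The name safe_attachment_name is hypothetical, and the extension list is copied from Example 13-3.

```shell
# Hypothetical helper: print a defanged name for a suspicious attachment.
# Filenames whose extension is on the list gain a .txt suffix;
# all other names are printed unchanged.
safe_attachment_name() {
  name=$1
  case $name in
    *.*) ;;                                # has an extension, keep checking
    *) printf '%s\n' "$name"; return ;;    # no extension, leave alone
  esac
  # Compare the extension without regard to case.
  ext=$(printf '%s' "${name##*.}" | tr 'A-Z' 'a-z')
  case $ext in
    bat|com|exe|pif|scr|vbs|zip) printf '%s.txt\n' "$name" ;;
    *) printf '%s\n' "$name" ;;
  esac
}
```

For instance, safe_attachment_name report.zip prints report.zip.txt, while a name such as notes.pdf passes through untouched.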

These rules, as shown in Example 13-3, are also potentially deleterious because they discard the messages by sending them to /dev/null. Placing the messages in a folder to hold suspected worms might be a good alternative. Users can then open the messages only with extreme caution. If you place these rules in a systemwide Procmail configuration file, you can even send the suspect messages to a mail folder that only root can read.
