Unwanted email is arguably the worst problem facing email administration today. Two types of unwanted email are common: spam and worms/viruses. Spam is unsolicited bulk email, usually commercial in nature. Most spam markets worthless body-enhancement products, questionable financial advice, and so on, but it is more of a nuisance than a threat, at least if you ignore the substantial network bandwidth that spam consumes. Worms and viruses, on the other hand, are malicious computer code that, if executed on an unprotected computer, can spread and cause damage. Although spam is quite different from worms or viruses in intent, the two classes of junk email can be combated in similar ways.
The distinction between worms and viruses is a tricky one to define and depends on who you ask. Thus, I don’t try to distinguish the two types of menaces in this chapter, and hereafter I use the word worm to refer to both types of program. Sometimes I refer to “spam-fighting tools” or the like. Such tools can often be used to fight worms, as well, but such phrases omit this detail for brevity’s sake.
Dealing with spam and worms requires first knowing a bit about the available approaches to the problem. One of the tools that can be used to directly combat spam and worms is Procmail, so I describe it shortly. Procmail can also be used to invoke other spam-fighting tools. SpamAssassin and Bogofilter are two such antispam tools. Finally, as a site policy issue, you may want to place suspicious attachments in a special holding area until you can examine them.
Spam and worms are difficult to detect. This is particularly true of spam, because spam identification is somewhat subjective: one person’s spam may be another person’s desirable commercial communication. The line between worms and non-worms is clearer, but worms can also be difficult to distinguish from legitimate email attachments, particularly in some environments (for instance, if you have a legitimate business reason to send or receive executable files). For this reason, the number of spam-fighting tools available is quite large. Indeed, the number of approaches to fighting spam and worms is large. Here are some general methods:
Blackhole lists, described in the earlier sections on sendmail and Postfix, rely on central authorities maintaining databases of IP addresses from which messages shouldn’t be accepted or should be accepted only with caution. Typically, these databases are updated frequently, based on spam reports from their users. This method is best implemented in receiving SMTP servers because they receive direct connections from the sending systems and therefore aren’t easily tricked into believing the message originated from a false IP address. (Headers are easily forged, so the originating IP address can be obfuscated by clever spammers if another system does this check.) Note that this approach doesn’t test the message’s content; it’s based solely on the IP address and so is susceptible to false alarms should an address send both spam and nonspam messages.
Some network databases work on more than the originating IP address; they store hashes of entire spam messages. When your server receives a message, it can hash the message (minus its headers) and query a network server for the presence of this hash. If it’s present, it means that somebody else has received an identical message and entered it as spam in the hash database. This approach is a potentially powerful one, but it can be easily “poisoned” with respect to legitimate mailing lists; that is, individuals can classify mailing list messages as spam, which can then cause these legitimate messages to be misclassified as spam. You can work around this problem by creating a “white list” (see entry later in this list) of addresses that aren’t tested against a distributed hash system.
Examining the message’s content is the most reliable way to identify spam. The simplest type of examination relies on simple pattern matches. For instance, you might decide that any message containing the word Viagra is spam, and discard it. This approach can be implemented in either the SMTP server or in add-on software, such as Procmail. It has the disadvantage of great potential for false alarms, particularly if your rules are too broad. For instance, if you discard all messages containing the word Viagra, you may catch a lot of spam, but you’ll also discard legitimate email to people who are actually corresponding with others (perhaps their doctors) about this drug. Maintaining a good set of pattern match rules can also be quite time-consuming, although some packages, such as SpamAssassin, aim to minimize this problem by providing frequent updates to a general rule set.
A white list is a list of addresses or keywords that trigger automatic acceptance of a message. White lists are frequently used with simple pattern matches or other spam-catching tools in order to minimize the risk of discarding important messages. Typically, you add your regular correspondents to your white list, and their messages get through even if another rule would reject them. White lists are usually implemented using the same tools that can perform simple pattern match rejections.
A challenge-response system is a variant on white lists. When a message arrives from a source other than one that’s on the white list, the recipient automatically sends a challenge to the message source. This challenge is a message asking the sender to perform some action to prove that the message isn’t spam, such as to respond with a keyword. Automated spamming systems can’t cope with this request, but humans can. Once a response is received, the original message is delivered, and the sender is usually added to the white list. This method of spam fighting can be quite effective, but it generates more traffic and places an extra burden on those who send mail, because they must respond to challenges. A poor implementation can also result in a continuous loop of challenges to challenges, should two sites both use challenge-response systems that don’t exempt each other’s challenges.
A spam-catching tool that emerged on the scene in 2002 involves statistical tests (often called Bayesian tests, after Bayes’ Rule, a statistical principle they employ). These tests use a database of words, word pairs, and other message features. Typically, you feed the software a sample of spam and another sample of nonspam, and the software adds up the number of times a word appears in each category. For instance, Viagra might appear 50 times in spam and once in nonspam, whereas Linux might appear 50 times in nonspam and once in spam. If a message with the word Viagra is analyzed, then, a statistical filter will give it a high probability of being spam. The analysis is typically based on many words, though, so a single word isn’t likely to “poison” an analysis, as can happen with simple pattern matches. One statistical spam filter, Bogofilter, is described in more detail later. Some tools, such as SpamAssassin, employ statistical tests as part of their overall operation.
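The word-counting idea behind statistical filters can be sketched in a few lines of Python. This is only a toy illustration, not how Bogofilter or SpamAssassin actually work: the class name, the crude tokenizer, the add-one smoothing, and the assumption of equal prior odds for spam and nonspam are all mine.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9$']+", text.lower())


class ToyStatisticalFilter:
    """Counts words in spam and nonspam samples, then scores new messages."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()

    def train(self, text, is_spam):
        """Add one sample message to the spam or nonspam word counts."""
        counts = self.spam_counts if is_spam else self.ham_counts
        counts.update(tokenize(text))

    def spam_probability(self, text):
        """Combine per-word evidence into a single probability of spam."""
        spam_total = sum(self.spam_counts.values())
        ham_total = sum(self.ham_counts.values())
        # The +1 reserves a slot for never-seen words and avoids division by zero.
        vocab = len(set(self.spam_counts) | set(self.ham_counts)) + 1
        log_odds = 0.0  # assumption: spam and nonspam are equally likely a priori
        for word in tokenize(text):
            p_spam = (self.spam_counts[word] + 1) / (spam_total + vocab)
            p_ham = (self.ham_counts[word] + 1) / (ham_total + vocab)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-50.0, min(50.0, log_odds))  # keep exp() in range
        return 1.0 / (1.0 + math.exp(-log_odds))
```

Because every word in the message contributes to the total, one spammy word in an otherwise ordinary message shifts the score without deciding it outright, which is the property that makes these filters harder to “poison” than a single pattern match.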
These same tools can detect worms, although some worm-detection tools rely on an analysis of the binary file that’s attached to the message rather than English words in the message body. (Some worms can also be reliably identified by their message texts.)
Some tools are hard to classify in just one way. For instance, Procmail directly implements pattern-matching tests but can call other tools that use other methods. The upcoming sections describe Procmail, SpamAssassin, and Bogofilter in more detail.
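Of the general methods listed earlier, the hash-database check with a white-list exemption is easy to picture in code. The following Python sketch makes several assumptions: a local set stands in for the network hash server, SHA-256 stands in for whatever checksum a real service uses (real services tend to use fuzzier checksums that survive small changes to the message), and the function names are hypothetical.

```python
import hashlib


def body_hash(raw_message):
    """Hash only the message body (everything after the first blank line)."""
    normalized = raw_message.replace("\r\n", "\n")
    _headers, _sep, body = normalized.partition("\n\n")
    return hashlib.sha256(body.strip().encode("utf-8")).hexdigest()


def is_known_spam(raw_message, sender, spam_hashes, white_list):
    """Return True if the body hash was reported as spam by somebody else.

    spam_hashes is a set standing in for the remote hash database;
    white_list is a set of lowercase sender addresses that skip the check.
    """
    if sender.lower() in white_list:
        return False  # white-listed senders (mailing lists, say) are never tested
    return body_hash(raw_message) in spam_hashes
```

Hashing the body without its headers is what lets two copies of the same spam, delivered along different routes, produce the same hash.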
One way to deal with spam and worms is
to use SMTP server features. One of these features in sendmail has
already been described: the access.db file, in
conjunction with the
FEATURE(`access_db') option in
your sendmail .mc file. You can block mail from
sites known to send nothing but spam using this technique.
Unfortunately, the world of spam is a fast-changing one, so by the
time you add a hostname or address to this list, chances are the
spammer will have started using another. The sheer quantity of spam
also makes this approach an awkward one. Nonetheless, you can use
this method for some particularly persistent offenders.
Another spam-fighting approach is to use a blackhole
list, which is a frequently updated list of sites that
are known or suspected spam sources or that
shouldn’t be sending email directly. Blackhole lists
work as services, much like DNS: your mail server queries the
blackhole list with the IP address of a connecting server
that’s trying to initiate a connection, and the
blackhole list server returns a value that indicates the
sender’s status. To use a blackhole list, you enter
a line like the following in your sendmail .mc
file:
FEATURE(`dnsbl', `relays.ordb.org', `"550 Email rejected due to sending server misconfiguration - see http://www.ordb.org/faq/\#why_rejected"')
This line tells sendmail to use the blackhole list at relays.ordb.org and to include a message with a URL in bounced emails. (This enables senders to find out why their mail was rejected, should nonspam messages be bounced.) Of course, this raises a question: how do you know which blackhole list to use? Many are available. You may want to peruse http://www.declude.com/Articles.asp?ID=97 or http://www.moensted.dk/spam/ for pointers to more than 100 blackhole databases with varying criteria for inclusion and other features. Some are free; others require you to pay for the privilege of using them. If you like, you can include multiple blackhole list definitions, each on its own line.
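To see what the mail server does behind the scenes, it may help to sketch a blackhole list query by hand. DNS-based lists are conventionally queried by reversing the octets of the connecting IPv4 address and appending the list’s zone; an answer means the address is listed, and a “no such name” reply means it isn’t. The Python below is a minimal sketch under that assumption; the function names are hypothetical, and dnsbl.example.org is a placeholder zone rather than a real list.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up: reversed IPv4 octets plus the list's zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone):
    """Return True if the blackhole list returns an address record for ip.

    Any lookup failure (including timeouts) is treated as "not listed",
    which errs on the side of accepting mail.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False
```

For example, `dnsbl_query_name("192.0.2.10", "dnsbl.example.org")` builds the name that would be looked up when a server at 192.0.2.10 connects.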
More sophisticated spam-fighting techniques require additional software. In particular, you can add Procmail to the mix to filter on keywords or to call other programs to check your incoming email in various ways. This topic is covered in a later section. If the sendmail server is an intermediary system, you may want to call Procmail as part of the forwarding configuration, as described earlier, in Section 13.2.3.3.
Postfix provides a number of antispam options, some of them quite sophisticated. In addition, you can use Procmail as a delivery agent to call external programs or perform checks Postfix alone can’t handle.
One of the simpler Postfix antispam configurations is to use a
blackhole list. One main.cf option enables this
feature:
smtpd_client_restrictions = reject_rbl_client relays.ordb.org
The smtpd_client_restrictions option tells Postfix
when to reject mail. The reject_rbl_client value
corresponds to a positive lookup in the blackhole list database
specified after this value
(relays.ordb.org in this example). Postfix
can use the same blackhole lists as sendmail; consult http://www.declude.com/Articles.asp?ID=97 or
http://www.moensted.dk/spam/ for
pointers to more than 100 blackhole databases. Other values can be
added to this line, separated by commas, to reject mail from systems
that don’t have matching DNS A records for their PTR
records (reject_unknown_client), to check an
external database for rejection rules
(check_client_access type:table), and so on. Consult the Postfix
documentation for details.
Prior to Version 2.0, Postfix used a pair of options to achieve the
effect described here. Specifically,
maps_rbl_domains contained a comma-separated list
of blackhole list servers; these were used only if the
reject_maps_rbl option was passed to
smtpd_client_restrictions.
Spam and worms can often be identified by the presence of strings in
message headers or bodies. For instance, you might know from
experience that any message with a subject of earn
$$$ is spam and can be discarded. Postfix includes
several options that check message headers and bodies for such
content:
header_checks
This option points to a file that contains checks that are applied to message headers—the parts of a message that contain the subject, the return address, etc. Typically, you’ll check headers for suspicious email subjects, senders, and perhaps recipients.
mime_header_checks
Increasingly, email messages use Multipurpose Internet Mail Extension
(MIME) to encode special formatting and nontextual data. MIME
extensions are also loved by spammers and worm authors because they
can deliver text that’s harder to identify as spam
or malicious computer code. You can use this option to point to a
file that matches suspicious MIME headers. This option is available
in Postfix 2.0 and later, and defaults to
$header_checks.
nested_header_checks
Users and programs sometimes attach one email message to another. To
search such attached messages’ headers, you can use
this option, which is available only in Postfix 2.0 and later, and
defaults to $header_checks.
body_checks
This option searches email messages’ bodies—the parts of the message that users read, as opposed to the headers. Scanning message bodies can be a good way to identify worms and spam. This option is available only in Postfix 2.0 and later.
All of these options take an external filename, along with a code for
the file’s format, as an option. This file is
typically a plain-text file or a database file
that’s derived from a plain-text file. The resulting
entry in main.cf looks something like this:
header_checks = pcre:/etc/postfix/header_checks
The pcre code stands for Perl
compatible regular expression. Alternatively, you can
employ regexp to use non-Perl regular expressions.
In either case, lines in the original text file take the specified
form followed by one of the following action codes:
DISCARD
optional text
Claims successful delivery but quietly discards the message. If
optional text is present, it’s entered in the mail logs; otherwise, a generic
message is logged.
DUNNO
Moves on to the next input line. This option is synonymous with
OK.
FILTER
transport:destination
Passes the message through the external content filter, as specified
by the transport method
(smtp, procmail, and so on) and
destination (a hostname or filename,
typically). The filter receives the message only after Postfix has
examined all the message’s lines, so the message can
be rejected before the filter is called.
HOLD
Places the message in the hold queue, which is a sort of limbo in which the message is neither delivered nor discarded. A system administrator can examine the hold queue using the postcat command and release messages from the queue or destroy them using postsuper.
IGNORE
Deletes the current line from the input and moves to the next one.
PREPEND
text
Inserts a line containing the specified text ahead of
the input line. This can flag messages for further spam processing.
REDIRECT
user@domain
Sends the message to the specified user rather than the recipient specified by the mail’s envelope. This feature can be used to forward mail for users who have moved elsewhere, as an alternative method of forwarding mail to internal servers, and in other ways. However, many potential uses of this action are better achieved through other means.
REJECT
optional text
Rejects delivery of the message. If you specify optional text,
it’s passed to the sender; if not, a generic error
message is delivered to the sender.
WARN
optional text
Logs a warning with the specified optional text in the mail log file. This action is intended
primarily for testing new rules before implementing them.
Many of these action codes are available only in Postfix 2.0, 2.1, or later. As an example of their use, consider the following entries:
### Subject headers indicative of spam
/^Subject: ADV:/                      REJECT
/^Subject: Accept Credit Cards/       DISCARD
### Additional header checks
/^(From|Received):.*iamspam\.biz/     REJECT
/^From: spammer@abigisp\.net/         FILTER procmail:/etc/procmailrcs/maybespam
This set of rules rejects mail with a subject header of
ADV: or with from or received headers that include
the string iamspam.biz. It also discards mail with
a subject header of Accept Credit Cards and passes
mail from spammer@abigisp.net through a Procmail
filter, /etc/procmailrcs/maybespam. This filter
presumably performs additional checks that are too complex for
Postfix to handle by itself.
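If you want to reason about how such a table behaves before putting it into production, you can mimic it outside Postfix. The Python sketch below is only an approximation of Postfix’s behavior under two assumptions: each header is tested against the patterns in order, with the first match deciding the action, and matching is case-insensitive. Python’s re module stands in for PCRE, and the function name is hypothetical.

```python
import re

# The same four rules as in the header_checks example, in the same order.
RULES = [
    (re.compile(r"^Subject: ADV:", re.IGNORECASE), "REJECT"),
    (re.compile(r"^Subject: Accept Credit Cards", re.IGNORECASE), "DISCARD"),
    (re.compile(r"^(From|Received):.*iamspam\.biz", re.IGNORECASE), "REJECT"),
    (re.compile(r"^From: spammer@abigisp\.net", re.IGNORECASE),
     "FILTER procmail:/etc/procmailrcs/maybespam"),
]


def check_header(line):
    """Return the action of the first rule matching this header, or None."""
    for pattern, action in RULES:
        if pattern.search(line):
            return action
    return None  # no rule matched; the header passes unchallenged
```

Feeding a handful of real headers from your own mail through a function like this is a quick way to spot rules that are too broad. Once the rules are installed in Postfix, the WARN action serves a similar purpose.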
In addition to its own checks, Postfix can send mail through Procmail
for processing. In fact, using Procmail is usually the default. If in
doubt, check your main.cf file for a line like
the following:
mailbox_command = /usr/bin/procmail
When called in this way, Procmail is used for final message delivery.
You can call it in other ways, such as in a FILTER
action in a header check. Broadly speaking, Procmail is a more
powerful way of looking for suspicious patterns in email than
Postfix’s own rules. Procmail can also be customized
on a user-by-user basis, which is harder with
Postfix’s rules. Thus, you may prefer to use
Procmail alone, rather than use Postfix’s pattern
matching tools. The main advantage of Postfix’s
rules is that they can be used to reject messages before
they’re fully received. In particular, if a header
check causes a message to be rejected, Postfix refuses delivery
before many bytes are transferred. This feature can help conserve
bandwidth, at least if you can devise rules that correctly identify
large spams or worms from their headers alone. Procmail delivery
rules, by contrast, operate only after the mail server has accepted
the mail for delivery. Unfortunately, spammers and worm writers have
become very good at disguising their unwanted
emails’ headers, so you may have no choice but to
accept the entire email in order to properly identify it. The topic
of spam and worm control is covered in more detail later in this
chapter.
Procmail is a very powerful mail processing tool. It does far more than spam filtering; it can redirect mail based on nonspam criteria, sort mail into folders, copy messages for archival purposes, pass mail through arbitrary external programs, and more. Still, one of Procmail’s main applications is as a spam-fighting tool; you can use its native pattern-matching features to discard mail or shunt it into a suspected spam folder. You can also pass messages to external programs for tests that Procmail can’t handle by itself.
Using Procmail requires calling it in some way. Typically, you do so by configuring your SMTP server to call Procmail as part of its mail delivery process. You can then move on to Procmail configuration. To configure Procmail you need to understand the Procmail configuration file format and be able to create Procmail recipes, which are the rules used to direct mail in Procmail.
The first step in Procmail use is to ensure that your mail system uses it. Most Linux SMTP server configurations use Procmail by default, so you may not need to change anything about your basic SMTP configuration to use Procmail. If you’re in doubt, though, or if you want to fine-tune the configuration, you can check some settings:
You should set three options in the sendmail .mc
file to use Procmail. The first of these is:
define(`PROCMAIL_MAILER_PATH', `/usr/bin/procmail')
This tells sendmail where to find the Procmail binary. (Some
configurations put this option in another configuration file, but you
can override it in your sendmail .mc file if you
need to do so.) The remaining options are
FEATURE(`local_procmail') and
MAILER(procmail), which collectively tell sendmail
to use Procmail for local deliveries. As described in the earlier
Section 13.3.3.3, you can also
call Procmail in other ways, such as in a forwarding configuration.
To call Procmail as part of the Postfix delivery rules, you must tell
Postfix to use the Procmail binary as part of its delivery system:
mailbox_command = /usr/bin/procmail. As described in an earlier
section, you can also tell Postfix to use Procmail in mail forwarding
configurations.
Procmail can use one or more of several configuration files:
/etc/procmailrc
This file is the global Procmail configuration file. It’s called as root to process all the mail that the SMTP server handles. For spam-control purposes, you use this file to apply rules you want to use on all the email that’s delivered to your local users. Typically, this means you use it to apply rules that are very unlikely to result in false alarms.
~/.procmailrc
Individual users can create .procmailrc files in
their home directories. These files have the same format as
/etc/procmailrc, but they’re
applied only to email directed to specific users. This enables users
to apply their own customized Procmail rules. Alternatively, you can
provide some standard configuration files in specific locations and
allow users to create symbolic links to those files to achieve preset
effects.
Some methods of calling Procmail, such as those that use Procmail as
part of mail forwarding schemes, enable you to pass the name of a
configuration file to Procmail. Sometimes these reside in a directory
such as /etc/procmailrcs, but that location is
arbitrary.
Procmail runs as the user who calls it,
although when it’s called as root, it can drop its privileges under some
circumstances. A rule that works well in
~/.procmailrc (when Procmail is called as the
end user) may not work well when placed in
/etc/procmailrc (when Procmail is called as
root), or vice versa. Typically,
you must be more careful about file permissions when calling Procmail
as root, because writing to or
creating a file (such as a mail folder) as root can make that file inaccessible to
ordinary users, such as the mail’s intended
recipient.
Whatever its name, a Procmail configuration file consists of three
parts: comments (denoted by hash marks), environment variable
assignments (similar to those in bash, such as
MAILDIR = $HOME/Mail), and recipes (described next). The
bulk of most Procmail configuration files consists of its recipes.
Procmail
recipes consist of three parts: the identification line, the
conditions, and the action. The idea is that the action is initiated
when the conditions are met. For instance, a condition might be that
the string Viagra appear in the message body, and
the action might be that the message is sent to
/dev/null—that is, that the message be
discarded. The form of the recipe is as follows:
:0 [flags] [:[lockfile]]
[conditions]
action
The identification line always begins with :0;
that’s just the convention. The
flags
are described shortly; they specify where Procmail looks for
condition matches, how it matches, and so on. The
lockfile
is a file that controls access to a mail file. If a file is locked,
Procmail defers operating on it. Normally, a single colon
(:) is sufficient, but you can specify the
filename, if necessary. The
conditions
are technically optional, but in practice, most recipes have at least
one condition line. (A recipe with no
conditions lines matches all mail
messages.) Including multiple conditions causes Procmail to require
all of them to match before an action line
is implemented. Precisely one action is
required for each recipe.
Procmail’s default behavior is to match
conditions against message headers in a
case-insensitive way. Several flags are
available to change how Procmail handles these matches, though. Here
are the more common:
H
This value does matches on message headers, which is the default.
B
This value does matches on message bodies.
D
This value does a case-sensitive pattern match, as opposed to the normal case-insensitive match.
c
Ordinarily, if a recipe matches, it’s passed to the
action, which may discard it, alter it, or
otherwise make the original inaccessible. This option causes the
action to act on a
“carbon copy” of the original
message, which is useful if you want to, for example, send a
duplicate copy of a message to another account or mail folder.
w
This value causes Procmail to wait for the
action to complete. If the
action fails, Procmail leaves the message
in the queue for other recipes.
W
This option is similar to w, but it suppresses
error messages.
f
This option pipes a message through another program, treating that program as a filter.
The Procmail recipe conditions can look
like Greek to the uninitiated. Each begins with an asterisk
(*), followed by a regular expression. At its simplest, a regular
expression is simply a string that must match exactly. For instance,
the regular expression Viagra matches the word
Viagra in the input. Many characters have
special meanings, though, such as:
^
A caret symbol indicates the start of a line; for instance,
^Viagra denotes the string
Viagra, but only at the start of a line. Many
conditions begin with a caret.
$
This character signifies the end of a line.
.
A period matches any single character except for an end-of-line
character. For instance, h.t matches
hat, hut,
hot, or any other similar string.
x*
This string (where x is any single
character) matches any number of x
characters, including none. This is often combined with a dot
(.), as in .*, to match any
arbitrary group of characters.
x+
This expression works much like x*, but
matches any occurrence of one or more x
characters, rather than 0 or more.
x?
This string matches zero or one x
characters.
(string1|string2)
This expression matches one of two strings by separating them by a vertical bar within parentheses. This principle can be extended to more than two strings, as well.
(string)*
This expression matches zero or more instances of the specified
string.
[chars]
Placing characters within square brackets causes Procmail to match
any one of the enclosed characters. For instance,
[abcz] matches any one of the characters
a, b, c, or
z. You can specify a range of characters by using
a dash, as in [c-j] to indicate any letter between
c and j.
\
The backslash character removes the special meaning from the
subsequent character. For instance, to match a dot, you enter
\. in the conditions.
!
This character appears only at the start of a
conditions line and reverses its meaning;
that is, if the regular expression matches, the recipe does
not match.
?
Like !, this character appears only at the start
of a conditions line. It tells Procmail to
use the exit code of the specified program.
Regular expressions can be extremely complex, so you may need to consult the Procmail manpage or another source of information on regular expressions to learn more. The next section provides some examples.
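One low-risk way to experiment with the special characters just described is to try them in an interpreter before committing them to a recipe. The sketch below uses Python’s re module as a stand-in; that is an assumption of convenience, because Procmail’s regular expression dialect is not identical to Python’s, although the characters covered here behave alike in both. The helper name is hypothetical, and its case_sensitive parameter imitates the effect of Procmail’s D flag.

```python
import re


def condition_matches(regex, text, case_sensitive=False):
    """Roughly imitate a Procmail condition: search text line by line.

    Matching is case-insensitive unless case_sensitive is True, which
    mirrors Procmail's default behavior and its D flag, respectively.
    """
    flags = re.MULTILINE  # let ^ and $ match at every line start and end
    if not case_sensitive:
        flags |= re.IGNORECASE
    return re.search(regex, text, flags) is not None


HEADERS = "From: goodguy@pangaea.edu\nSubject: Cheap VIAGRA today\n"

# ^ anchors to the start of a line; .* skips over arbitrary characters.
assert condition_matches(r"^Subject:.*viagra", HEADERS)
assert not condition_matches(r"^Subject:.*viagra", HEADERS, case_sensitive=True)
assert not condition_matches(r"^Viagra", HEADERS)  # not at the start of a line

# $ anchors to the end of a line; \. matches a literal dot.
assert condition_matches(r"today$", HEADERS)
assert condition_matches(r"pangaea\.edu", HEADERS)

# Alternation and character classes.
assert condition_matches(r"^From:.*(goodguy|linnaeus)@", HEADERS)
assert condition_matches(r"[c-j]at", "hat") and not condition_matches(r"[c-j]at", "bat")

# x*, x+, and x? differ in how many repetitions they accept.
assert condition_matches(r"ab*c", "ac") and not condition_matches(r"ab+c", "ac")
assert condition_matches(r"colou?r", "color") and condition_matches(r"colou?r", "colour")
```

The leading ! and ? characters aren’t part of the regular expression itself; Procmail interprets them before the pattern is applied, so they have no counterpart in this sketch.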
Finally, each
Procmail
recipe ends with an action. Each
action can take any of several forms:
An action that takes the form of a
filename indicates that the message is to be stored in the specified
file, which is treated as an mbox mail folder.
A filename that ends in a slash (/) is interpreted
as a subdirectory name, in which case Procmail stores the message in
this subdirectory in maildir format.
!
An exclamation mark denotes a list of email addresses to which the message should be forwarded. This can be useful for setting up individual mail forwarding to another system.
|
Procmail treats a vertical bar as a pipe character, much like
bash. Its presence at the start of an
action tells Procmail to pass the message
to an external program for further processing.
{
You can nest multiple tests by using a left curly brace as the
action line; subsequent lines, until a
right curly brace (}), constitute one or more
additional recipes that are used only if the initial recipe matches.
You can use this feature to control whether or not to perform certain
tests; for instance, to perform spam checks only if mail
doesn’t come from certain addresses (that is, to
implement a white list).
Because Procmail supports just one action
per recipe, you may need to create an external script if you want to
perform some complex action. Be sure your external script reads the
entire message. If it doesn’t, Procmail may send the
message through additional recipes, which can result in duplicate
deliveries.
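As a minimal sketch of an external script that honors this advice, the Python below reads its entire input before doing anything else, then writes the message back out with one extra header. The function names and the X-Example-Filter header are hypothetical, invented for illustration; a real script would perform whatever test or transformation you need in place of the tagging step.

```python
import io
import sys
from email import message_from_string


def tag_message(raw_message, marker="checked"):
    """Parse the message, add an illustrative header, and return it as text."""
    msg = message_from_string(raw_message)
    msg["X-Example-Filter"] = marker  # hypothetical header name
    return msg.as_string()


def run_filter(infile, outfile):
    """Consume ALL of infile before writing anything to outfile."""
    raw = infile.read()  # read to end of input, not just the headers
    outfile.write(tag_message(raw))


# Small self-contained demonstration using in-memory streams.
demo_in = io.StringIO("From: a@example.com\nSubject: hi\n\nbody text\n")
demo_out = io.StringIO()
run_filter(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

In a script called from a recipe, you would drop the demonstration lines and call run_filter(sys.stdin, sys.stdout) instead, so that the script reads the message Procmail pipes to it. Whether Procmail takes the script’s output as the new message depends on the recipe’s flags, such as f.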
Example 13-1 shows a sample Procmail recipe file intended for use by individuals. (When
used by the system, some file ownership issues can arise. This
problem can be avoided by adding a DROPPRIVS = yes line to the start of the
file.) This example illustrates several useful
techniques:
The first rule contains two nested subrules that perform the spam checks. Because the outer rule’s condition is negated, mail from two regular correspondents skips those checks entirely. The nested rules are indented to set them off, but this indentation isn’t required.
The two spam-check rules look for strings that are indicative of
spam. The first searches message bodies for the string
301 followed by 0 or more characters, followed by
S, 0 or more characters, and
1618. This string is found in some spams that
reference a failed piece of U.S. legislation, S.1618, which dealt
with spam. The legislation failed years ago, but spam still
references it, as if to legitimize itself. The second spam check
looks for a string in subject headers that identifies messages
encoded using a system that’s common for certain
Asian languages. Most non-Asian users seldom or never receive nonspam
mail with such subject headers, but a lot of spam uses them.
Several rules use flags to search the text of messages or to create carbon copies.
The spam messages are “sorted” to
/dev/null, which effectively discards the
messages. The last rule saves mail from a mailing list (identified by
a unique “to” header) into the
genetics-list mbox mail folder in the
$MAILDIR subdirectory, which is identified on the
first line of the recipe file.
Example 13-1. Sample Procmail recipe file
MAILDIR = $HOME/Mail
# Do some spam checks, but exclude anything from good addresses
:0
*! ^From:.*(goodguy@pangaea\.edu|linnaeus@example\.com)
{
   :0 B
   * ^.*301.*S.*1618
   /dev/null
   :0
   * ^Subject:.*=\?big5\?*
   /dev/null
}
# Forward mail from goodguy@pangaea.edu with "peas" in the
# subject line to mendel@luna.edu
:0 c
* ^From:.*goodguy@pangaea\.edu
* ^Subject:.*peas
! mendel@luna.edu
# Shunt mail from a genetics mailing list into its own folder
:0:
* ^To:.*genetics@mailer\.example\.org
$MAILDIR/genetics-list
One of the major problems with using Procmail alone as a spam-control tool is that creating and maintaining a set of Procmail rules can be quite labor-intensive. This is particularly true because spam and worms are constantly changing, so a good set of rules for today may be inadequate tomorrow. You may want to search for a ready-made set of Procmail recipes, such as SpamBouncer (http://www.spambouncer.org) or the Sample Procmail Recipes with Comments (http://handsonhowto.com/pmail102.html). The first of these is specifically intended as an antispam tool, whereas the second is a practical teaching tool. If you periodically check back with such pages and update your filters, you can keep a reasonably up-to-date Procmail antispam configuration. On the other hand, rules created by somebody else are more likely to miss spam or, worse, falsely identify nonspam as spam.
SpamAssassin (http://spamassassin.apache.org) is an antispam tool based on a large number of tests. Each test changes the score of the message. SpamAssassin doesn’t actually delete messages; instead, it adds headers identifying likely spam as such. The idea is that you’ll call SpamAssassin from Procmail, a mail server, or a mail reader and then use that tool to detect the SpamAssassin spam report and delete or redirect messages based on that report.
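To make the division of labor concrete, here is a rough Python sketch of the kind of check a downstream tool might perform on a message that SpamAssassin has already tagged. It assumes the X-Spam-Status header begins with Yes or No followed by score= and required= fields, as SpamAssassin 3.x writes it; the pattern also accepts hits=, which I believe older releases used in place of score=. The function name is hypothetical.

```python
import re
from email import message_from_string

# Matches the start of an X-Spam-Status value, e.g. "No, score=0.1 required=5.0".
STATUS_RE = re.compile(
    r"^(Yes|No),\s+(?:score|hits)=(-?\d+(?:\.\d+)?)\s+required=(-?\d+(?:\.\d+)?)"
)


def spam_verdict(raw_message):
    """Return (is_spam, score, required), or None if the header is missing."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status")
    if status is None:
        return None  # the message never passed through SpamAssassin
    match = STATUS_RE.match(status.strip())
    if match is None:
        return None  # unrecognized format; leave the decision to something else
    return (match.group(1) == "Yes", float(match.group(2)), float(match.group(3)))
```

A delivery tool could file the message in a suspected-spam folder when the first element of the result is True, and deliver it normally otherwise.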
SpamAssassin has grown into quite a large tool. In fact, it’s complex enough that it’s spawned its very own book: SpamAssassin (O’Reilly). If you need to perform complex tasks or configure SpamAssassin as part of a mail server for a large site, it’s worthwhile to read this or other SpamAssassin-specific documentation.
The SpamAssassin software comes with most major distributions, so installing it from your distribution medium is usually the simplest course of action. If you can’t find SpamAssassin with your distribution, go to the main SpamAssassin site, and download it. SpamAssassin is actually a Perl script and relies on several Perl modules, so you may need to install additional packages that hold these modules.
Once SpamAssassin is
installed, you should test its operation by manually feeding it a few
spam and nonspam messages. You do this by redirecting a message in a
file into the spamassassin command. Adding the
-t option adds an extra report to the end of the
output, which appears on the screen:
$ spamassassin -t < message.txt
The message.txt file should contain a
complete message, including full headers. Most mail readers have an
option to save messages to disk with full headers, so use that option
to get your samples. The SpamAssassin output includes two additions
to the message. The first addition appears at the end of the message
headers, and constitutes SpamAssassin’s report, as
intended for subsequent mail processing tools, such as Procmail or an
email reader. For a nonspam message, this addition is likely to
resemble the following:
X-Spam-Checker-Version: SpamAssassin 3.0.0-g3.0.0 (2004-09-13) on mail.example.com
X-Spam-Level:
X-Spam-Status: No, score=0.1 required=5.0 tests=RCVD_IN_SORBS autolearn=unavailable
        version=3.0.0-g3.0.0
The first line simply identifies the version of SpamAssassin and the
computer on which it’s running. The second line
holds the spam level, which is expressed as a number of asterisks
(*). Because this is an innocuous nonspam message,
no asterisks are displayed; however, some nonspam messages will have
a small number of asterisks (five is the typical cutoff point for
spam, although you can use something else if you like). The third
line, which typically extends across multiple lines, summarizes the
tests that raised alarms. In this case, the total spam score is 0.1
(score=0.1). That 0.1 value came from the
RCVD_IN_SORBS test, which the header alone doesn’t
explain. The -t option to
spamassassin, though, adds extra lines at the end
of the message:
Content analysis details:   (0.1 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.1 RCVD_IN_SORBS          RBL: SORBS: sender is listed in SORBS
                            [172.24.98.102 listed in dnsbl.sorbs.net]
This text identifies the RCVD_IN_SORBS flag as
meaning that the sender address is listed in the SORBS blackhole
list. This information can help you understand what SpamAssassin is
doing right (or wrong), but it’s not provided in
normal operation. You can, of course, consult the SpamAssassin
documentation to learn more about specific tests.
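As a rough illustration of how a downstream tool might read this report, the following Python sketch pulls the verdict, score, threshold, and test names out of an X-Spam-Status value. The function name parse_spam_status is hypothetical and not part of SpamAssassin; the sketch assumes the header layout shown in this section's samples and accepts either score= or hits= as the label for the total.

```python
import re

def parse_spam_status(value):
    """Parse the value of an X-Spam-Status header into a small dict."""
    # Folded headers carry newlines and indentation; collapse them first.
    flat = " ".join(value.split())
    verdict = flat.split(",", 1)[0].strip().lower()
    # Accept both score= and hits=, since sample headers use either spelling.
    score = re.search(r"\b(?:score|hits)=(-?\d+(?:\.\d+)?)", flat)
    required = re.search(r"\brequired=(-?\d+(?:\.\d+)?)", flat)
    # The test list runs until the next " key=" field or the end of the value.
    tests = re.search(r"\btests=(.*?)(?=\s+\w+=|$)", flat)
    names = [t.strip() for t in tests.group(1).split(",")] if tests else []
    return {
        "is_spam": verdict == "yes",
        "score": float(score.group(1)) if score else None,
        "required": float(required.group(1)) if required else None,
        "tests": [n for n in names if n],
    }

report = parse_spam_status(
    "No, score=0.1 required=5.0 tests=RCVD_IN_SORBS "
    "autolearn=unavailable version=3.0.0-g3.0.0"
)
print(report)
```

A parser like this is only a convenience for scripts; Procmail and most mail readers can match the headers directly with a pattern.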
When you test a spam message, the spam headers added to the message are likely to report more serious problems:
X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 3.0.0-g3.0.0 (2004-09-13) on
mail.example.com
X-Spam-Level: ******
X-Spam-Status: Yes, hits=6.8 required=5.0 tests=FORGED_MUA_OUTLOOK,
FORGED_OUTLOOK_TAGS,HTML_40_50,HTML_FONTCOLOR_UNSAFE,HTML_MESSAGE,
HTML_TAG_EXISTS_TBODY,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_DSBL,
RCVD_IN_SORBS autolearn=spam version=3.0.0-g3.0.0
This output includes one header line that’s not
present in the nonspam output: X-Spam-Flag: YES.
You can search for this line using a Procmail recipe, as described
shortly, to detect spam after messages have been processed with
SpamAssassin. The X-Spam-Level header shows six
stars, corresponding to the 6.8 hit rating reported in the
X-Spam-Status line. This line also shows quite a
few hits on individual spam tests. These are reported in greater
detail at the end of the message if you use the -t
option to spamassassin.
You should run several spam and several nonspam messages through SpamAssassin. Verify that none of the nonspam messages are flagged as spam and that a significant number of the spam messages are flagged. SpamAssassin might not detect all of your spam, though. You can take the time to fine-tune its operation by changing the points assigned to individual rules or by enabling its auto-learning feature, which enables it to update its rules on the fly. You can also combine SpamAssassin with other tools, such as your own custom Procmail filters.
You can call SpamAssassin in
various ways. One is to use Procmail for local mail delivery.
(Calling SpamAssassin as part of a mail gateway system is described
next.) Add the following recipes to the start of your Procmail
configuration file to call SpamAssassin and sort suspected spam into
two folders, almost-certainly-spam and
probably-spam:
:0fw
* < 256000
| spamassassin
:0:
* ^X-Spam-Level: \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*
almost-certainly-spam
:0:
* ^X-Spam-Status: Yes
probably-spam
:0
* ^^rom[ ]
{
LOG="*** Dropped F off From_ header! Fixing up. "
:0 fhw
| sed -e '1s/^/F/'
}
These rules are taken from the procmail.example
file that ships with SpamAssassin. That file also includes several
comments that describe its rules. In short, the first recipe passes
messages that are smaller than 256,000 bytes through SpamAssassin,
which adds its headers to the messages. (Larger messages are almost
certainly not spam, although they can contain worms. SpamAssassin
doesn’t cope well with very large messages, hence
this size limitation.) The second recipe dumps messages with a spam
score of 15 or higher into the
almost-certainly-spam folder, while the third
recipe places messages that are flagged as spam but that
weren’t caught by the second recipe into the
probably-spam folder. The final recipe works around a
Procmail bug that can cause the leading F in the mbox
From_ line (the envelope line that begins with From and a space,
not the From: header) to be dropped. (This bug has been fixed, but
the recipe is included in case you’re
running an old version of Procmail.)
Of course, you can change these rules if you like. For instance, you
can send suspected spam to /dev/null, but doing
so means that if any such messages really
aren’t spam, you
won’t be able to retrieve them. Placing suspected
spam in folders means that you can open those folders and recover any
misclassified messages.
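The sorting logic of the second and third recipes can be restated outside Procmail. The Python sketch below is an illustration only: the function choose_folder and its dictionary input are assumptions made for this example, and the 15-star threshold simply mirrors the recipe shown above.

```python
def choose_folder(headers, certain_stars=15):
    """Return the folder the recipes above would pick, or None for normal delivery.

    headers is a dict mapping header names to their values, as added by
    SpamAssassin (hypothetical input format for this sketch).
    """
    level = headers.get("X-Spam-Level", "")
    status = headers.get("X-Spam-Status", "")
    # Second recipe: 15 or more asterisks at the start of X-Spam-Level.
    if level.startswith("*" * certain_stars):
        return "almost-certainly-spam"
    # Third recipe: anything else that SpamAssassin flagged as spam.
    if status.startswith("Yes"):
        return "probably-spam"
    return None

print(choose_folder({"X-Spam-Level": "******", "X-Spam-Status": "Yes, hits=6.8"}))
```

Because the checks run in order, a message with 15 or more stars never reaches the second test, just as a delivering Procmail recipe stops further processing.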
Calling SpamAssassin from Procmail is fine for local mail delivery, but it doesn’t work well for a mail server that should operate as a spam filter for another server, such as a Microsoft Exchange server. For this configuration, you need a way to call SpamAssassin more directly as part of the mail relay process; the MIMEDefang tool (http://www.mimedefang.org) can do so. Although a complete description of MIMEDefang and the sendmail features it uses is beyond the scope of this book, a brief description should get you started.
The key to the process is to use the sendmail
INPUT_MAIL_FILTER configuration line to call
MIMEDefang, which in turn is configured to pass incoming messages
through SpamAssassin and take actions accordingly. A full sendmail
.mc file that implements these features appears
in Example 13-2.
Example 13-2. Sample sendmail configuration with SpamAssassin
divert(-1)
#
# Spam-checking gateway configuration
#
divert(0)dnl
VERSIONID(`Spam-checking gateway')
OSTYPE(linux)dnl
DOMAIN(generic)dnl
FEATURE(virtusertable)dnl
FEATURE(mailertable)dnl
FEATURE(access_db)dnl
FEATURE(always_add_domain)dnl
FEATURE(nouucp,`reject')dnl
FEATURE(`relay_based_on_MX')dnl
define(`confDEF_USER_ID',``8:12'')dnl
define(`confPRIVACY_FLAGS', \
`goaway,noreceipts,restrictmailq,restrictqrun,noetrn')dnl
define(`confTO_QUEUERETURN',`7d')dnl
define(`confTO_QUEUEWARN_NORMAL',`1h')dnl
define(`confMAX_DAEMON_CHILDREN',`60')dnl
define(`confMAX_MESSAGE_SIZE',`10000000')dnl
define(`confMAX_CONNECTION_RATE_THROTTLE',`10')dnl
define(`confMAX_RCPTS_PER_MESSAGE',`500')dnl
INPUT_MAIL_FILTER(`mimedefang',`S=unix:/var/spool/MIMEDefang/mimedefang.sock, \
F=T, T=S:60s;R:60s;E:5m')dnl
MAILER(smtp)dnl
MAILER(local)dnl
MAILER(procmail)dnl
A couple of lines in Example 13-2 are very long;
they’re denoted by trailing backslashes
(\) at the end of the first line, but should be
entered on single lines without the backslashes.
This configuration also requires you to set up the sendmail mailer
table file (typically /etc/mail/mailertable)
that was described earlier. It must include a line that points the
system to an internal server that will receive the spam-filtered
messages:
pangaea.edu esmtp:internal.pangaea.edu
In addition to the sendmail configuration, you must configure
MIMEDefang. This tool requires three directories,
/var/spool/MIMEDefang,
/var/spool/MD-Quarantine, and
/var/spool/MD-Bayes. Assign ownership of these
directories to the account used to run MIMEDefang (typically
defang). Once this is done, edit
mimedefang-filter (usually stored in
/usr/local/etc/mimedefang). Set the
$AdminAddress, $AdminName, and
$DaemonAddress lines to point to your local
postmaster’s email address, the
postmaster’s name (often your
domain’s name and Postmaster),
and the email address used in messages MIMEDefang generates. You
should also set the $SALocalTestsOnly item to
1 or 0 to forbid or allow
SpamAssassin to use network-based tests.
Configure the internal server computer (internal.pangaea.edu in this example) to accept mail only from the spam-filtering gateway or from this system and any local systems that should be able to relay outgoing mail. This server shouldn’t accept mail directly from the outside. Certainly it shouldn’t be listed as an MX server in your domain’s DNS configuration; only the spam-filtering mail gateway should be listed in this capacity.
Unlike SpamAssassin, which combines many different spam-fighting tools in one system, Bogofilter (http://bogofilter.sourceforge.net) takes a single approach to spam fighting. It’s an implementation of a statistical spam filter. As such, it requires training on a corpus of both spam and nonspam messages before it can work. Thus, you may need to save your spam for a few days before you can effectively use Bogofilter.
SpamAssassin can use a statistical filter as part of its rule set. To do so, you must give it sample messages to train it, using its sa-learn command. Consult this command’s manpage for details; the training process is similar to that for Bogofilter, although the command details differ.
Bogofilter can be installed like most other packages; check your distribution to see if a version is available with it. If not, go to the project’s home page, and download a binary or source code version from there. Like SpamAssassin, Bogofilter is called from Procmail or a mail reader program. Before you do that, though, you must train Bogofilter.
The training procedure requires examples of both spam and nonspam
messages—the more, the better. (A collection of several
thousand messages is not excessive, but Bogofilter can do some good
with just a few dozen.) Ideally, these messages should be typical of
spam and nonspam messages that you receive; you
want Bogofilter to learn to differentiate your spam from your
nonspam. Although you can find spam collections on the Internet,
using them for Bogofilter training can cause problems, because other
people may receive different types of spam, or because you might not
classify everything in such collections as spam. The simplest way to
train Bogofilter is to place all your spam messages in one file and
all your nonspam messages in another file, both of which should be in
mbox format. In subsequent examples, I refer to these as
spam.mbox and nonspam.mbox,
respectively.
Conceptually, the simplest way to train Bogofilter is to pass the
spam and nonspam messages through the bogofilter
command using the -s and -n
options, respectively:
$ bogofilter -s < spam.mbox
$ bogofilter -n < nonspam.mbox
These commands create a database file,
~/.bogofilter/wordlist.db, which contains all
the words contained in all the messages, along with counts of how
often they appear in spam and nonspam messages. When Bogofilter later
encounters a new message, it can use these counts to estimate
the probability that the message is spam or nonspam.
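To make the idea of a word list concrete, here is a toy Python sketch of per-word spam and nonspam counts. It is not Bogofilter's implementation: the class WordList, its tokenizer, the smoothing constant, and the simple averaging of word scores are all assumptions chosen for brevity, whereas Bogofilter combines word probabilities with more careful statistics.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z0-9$'-]+")

class WordList:
    """Toy stand-in for a word list: counts of messages containing each word."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.spam_msgs = 0
        self.ham_msgs = 0

    def register(self, text, is_spam):
        # Count each distinct word once per message.
        words = set(TOKEN.findall(text))
        if is_spam:
            self.spam.update(words)
            self.spam_msgs += 1
        else:
            self.ham.update(words)
            self.ham_msgs += 1

    def word_score(self, word, strength=1.0):
        # Relative frequency of the word in spam versus nonspam messages.
        s = self.spam[word] / max(self.spam_msgs, 1)
        h = self.ham[word] / max(self.ham_msgs, 1)
        n = self.spam[word] + self.ham[word]
        if n == 0:
            return 0.5  # unseen words carry no evidence either way
        p = s / (s + h)
        # Pull rarely seen words toward the neutral value 0.5.
        return (strength * 0.5 + n * p) / (strength + n)

    def spamicity(self, text):
        # Naive combination: plain average of the word scores.
        words = set(TOKEN.findall(text))
        if not words:
            return 0.5
        return sum(self.word_score(w) for w in words) / len(words)

wl = WordList()
wl.register("buy cheap pills now", is_spam=True)
wl.register("meeting agenda for monday", is_spam=False)
print(wl.spamicity("cheap pills"), wl.spamicity("monday meeting"))
```

Even this crude version shows why training on your own mail matters: the scores depend entirely on which words appeared in which pile.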
Because the Bogofilter database file is stored in the
user’s home directory, you should create the
Bogofilter database file by running the program as that user. This
user can conceivably be root, but
for security reasons, it’s best if you find a way to
run Bogofilter as a non-root
user. If necessary, you can create an initial database, place the
mail classification call to bogofilter in
users’ individual ~/.procmailrc
files, and modify the global configuration to use the global word
files in addition to individual users’ word lists.
Another approach to Bogofilter training is to use a training script, such as bogominitrain.pl or randomtrain. These scripts might or might not be shipped with a distribution-provided Bogofilter package. If they’re not on your system, consult the main Bogofilter site. These scripts perform more sophisticated training; namely, they use the bogofilter command to classify each message and perform training only if the message isn’t classified correctly by Bogofilter. If necessary, this process is repeated until Bogofilter classifies every message correctly. The result tends to be smaller databases, and often more accurate results, but initial training takes longer. Consult the documentation that comes with the training script for details. Typically, you pass the script the names of both the spam and the nonspam files, and perhaps additional parameters:
$ bogominitrain.pl -fnv ~/.bogofilter nonspam.mbox spam.mbox '-o 0.9,0.3'
This example passes the location of the word list, the nonspam and spam files, and classification parameters (described in more detail shortly).
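The train-only-on-errors strategy that these scripts use can be sketched in a few lines of Python. This is an illustration of the general loop, not the code of bogominitrain.pl or randomtrain; the ToyFilter class is a deliberately simple stand-in for the real classifier, and all names here are hypothetical.

```python
def train_on_error(messages, classify, register, max_passes=10):
    """Register only misclassified messages; repeat until a pass has no errors.

    messages is a list of (text, is_spam) pairs. Returns the number of passes
    needed, or None if the classifier still made errors after max_passes.
    """
    for passes in range(1, max_passes + 1):
        errors = 0
        for text, is_spam in messages:
            if classify(text) != is_spam:
                register(text, is_spam)
                errors += 1
        if errors == 0:
            return passes
    return None

class ToyFilter:
    """Stand-in classifier: each word carries a net spam-minus-nonspam weight."""

    def __init__(self):
        self.weights = {}

    def classify(self, text):
        return sum(self.weights.get(w, 0) for w in text.split()) > 0

    def register(self, text, is_spam):
        for w in text.split():
            self.weights[w] = self.weights.get(w, 0) + (1 if is_spam else -1)

corpus = [
    ("cheap pills", True),
    ("meeting agenda", False),
    ("cheap meeting room", False),
]
f = ToyFilter()
print(train_on_error(corpus, f.classify, f.register))
```

Messages the filter already handles correctly are never registered, which is why this approach tends to produce smaller databases at the cost of extra passes over the corpus.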
Whatever training method you use, you should also examine, and
perhaps modify, the Bogofilter configuration file. By default,
/etc/bogofilter.cf provides systemwide defaults,
but individual users can override these by creating a configuration
file called ~/.bogofilter.cf (this filename can
be set in /etc/bogofilter.cf). Options in this
file are well commented, so perusing it will give you some idea of
what you can change. Some options you may want to modify include:
bogofilter_dir
This option points to the word list directory. Changing it is one way ordinary users can access a global word list; however, doing so may make it impossible for individuals to change that word list.
ignore_case
Ordinarily, Bogofilter pays attention to case;
Viagra is distinct from viagra.
You can set ignore_case=yes to have Bogofilter
convert all words to lowercase, though. This can help overcome
attempts to confuse antispam tools by mixing up case in words, but it
can also reduce Bogofilter’s sensitivity to strings
for which case can be important.
algorithm
Bogofilter can use several different algorithms for determining the
spamicity of a message (that is, the
probability that a message is spam). These algorithms are
graham, robinson, and
fisher. The default is fisher,
which generates a three-way classification: spam, nonspam, or unsure.
ham_cutoff
This option sets the maximum spamicity score (between 0.0 and 1.0)
at which a message is still classified as nonspam. A
value of 0.10 is typical and usually works well.
spam_cutoff
This option sets the minimum spamicity score (between 0.0 and 1.0)
that’s required for a spam classification. A value
of 0.95 is typical and usually works well.
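The way the two cutoffs divide the spamicity range into three classes can be shown with a short Python sketch. The function classify is hypothetical and only mirrors the description above; the default values are the typical cutoffs just mentioned, and the labels follow the X-Bogosity values that Bogofilter reports.

```python
def classify(spamicity, ham_cutoff=0.10, spam_cutoff=0.95):
    """Map a spamicity score to a three-way label using the two cutoffs."""
    if not 0.0 <= spamicity <= 1.0:
        raise ValueError("spamicity must lie between 0.0 and 1.0")
    if spamicity >= spam_cutoff:
        return "Yes"      # spam
    if spamicity <= ham_cutoff:
        return "No"       # nonspam
    return "Unsure"       # between the two cutoffs

for score in (1.0, 0.0, 0.500008):
    print(score, classify(score))
```

Raising ham_cutoff or lowering spam_cutoff shrinks the unsure band, which trades fewer undecided messages for a higher risk of misclassification.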
Once you’ve set these values and trained Bogofilter,
you should test its operation by passing spam and nonspam messages
through the bogofilter command. Ideally, you
should use messages that you held back from the training so that you
can judge how Bogofilter handles messages it’s never
seen. Use the -v option to have the program
generate a verbose report of the input messages, which you redirect
as input:
$ bogofilter -v < message.txt
X-Bogosity: Yes, tests=bogofilter, spamicity=1.000000, version=0.16.4
This result shows a classification of the message as spam
(X-Bogosity: Yes), with a very high spamicity
score (1.000000). A nonspam message is likely to generate a much
lower score:
$ bogofilter -v < message.txt
X-Bogosity: No, tests=bogofilter, spamicity=0.000000, version=0.16.4
Because of its three-way output, Bogofilter can also tell you that it’s unsure of the status of the message:
$ bogofilter -v < message.txt
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.500008, version=0.16.4
If you find that Bogofilter isn’t classifying your
messages correctly, you should revisit your training procedures.
Perhaps you didn’t train it on enough messages or
registered them with the wrong options (confusing spam and nonspam
messages, for instance).
Note that a classification of
“unsure” works like a nonspam
classification in most respects, so you shouldn’t be
too concerned if some of your nonspam messages are classified in this
way, unless the spamicity ratings are very close to the spam cutoff
point. If you have classification problems, you might also consider
fine-tuning the Bogofilter cutoff criteria
(ham_cutoff and spam_cutoff).
You can increase or decrease these values, but with certain risks; if
you make either the nonspam or spam category too large,
you’ll risk misclassifying messages.
Numerically, the largest range of spamicity values is above the
ham_cutoff value but below the
spam_cutoff value. Thus, you might expect that
most messages will end up classified as
“unsure.” In practice, though, most
messages achieve very high (close to 1.0) or very low (close to 0.0)
spamicity ratings.
With Bogofilter now correctly classifying at least most of your messages, it’s time to integrate it into your mail delivery system. One way to do this is by calling Bogofilter in Procmail. The following Procmail recipe will do this:
:0HB:
* ? bogofilter -u -l
probably-spam
This recipe passes the message through the
bogofilter command. The -u
option tells Bogofilter to automatically add messages that it
classifies as spam or nonspam to the appropriate word lists. This
option is both potentially useful and potentially dangerous;
it’s useful because it can help keep your spam
database updated, but it’s dangerous because if
Bogofilter misclassifies a message, that misclassification can lead
to more misclassifications. (If a message is classified as
“unsure,” it won’t
be added to the database.) The -l option logs
Bogofilter activity. This recipe stores spam messages in the
probably-spam folder; nonspam messages go on for
normal delivery.
If you use the -u option, and Bogofilter
misclassifies a message, you should correct the problem. You can do
this with the -N and -S
options, which undo previous registrations of a message as nonspam
and spam, respectively. You can combine these options with
-s and -n to reregister the
messages correctly. For instance, if Bogofilter has registered a
message as nonspam but in fact it’s spam, you can
extract the message to a file (complete with its headers) and type
the following command:
$ bogofilter -Ns < message.spam
To test that it’s worked correctly, pass the message
through bogofilter again, using
-v rather than -Ns; Bogofilter
should now classify the message as spam, or at least give it a much
higher spamicity score. (Register it again with
bogofilter -s to
strengthen Bogofilter’s tendency to classify the
message as spam, if desired.) Use -Sn rather than
-Ns to undo an incorrect classification of a
nonspam message as spam.
The vast majority of email worms released over the years have been written for Windows systems. Any of the antispam tools described here can be used to locate and deal with worms. Using a Linux system for this task ensures that the mail server itself can’t become infected, even through gross negligence. (At least, assuming Windows worms are in play; theoretically, Linux worms could be written to take advantage of flaws in Linux software.)
The threat of Windows worms is such that many sites have taken
drastic measures to protect themselves: they reject all mail carrying
certain types of attachments, or even all email attachments. The
reasoning is that nobody has a valid reason to email, say, Windows
.exe executables, so any such executable must be
a worm. The validity of such reasoning is uncertain, but it may be so
close to the truth for certain sites that discarding or quarantining
messages with such attachments may be worthwhile. Example 13-3 shows a couple of Procmail recipes that
discard certain suspicious messages.
Example 13-3. Procmail recipes to discard suspicious attachments
:0 B
* ^Content-Type: audio/x-(wav|midi);
/dev/null
:0
* ^Content-Type: multipart/(mixed|alternate|alternative|related)
{
:0 B
* ^.*name=.*\.(bat|com|exe|pif|scr|vbs|zip)
/dev/null
}
The first of these rules discards everything with a
Content-Type line of
audio/x-wav or audio/x-midi.
Theoretically, these lines identify certain types of audio files,
which might be legitimate attachments in some environments; however,
in practice, worms often try to masquerade as these file types. The
second rule looks for any of several content types in the header and,
if found, searches for a line that includes name=
followed by any of several filename extensions. Some of these, such
as .bat, .com, and
.exe, identify Windows executables. Others
don’t, but again, Windows worms frequently try to
masquerade as files of these types.
Unfortunately, rules such as these are likely to produce false
alarms. The second rule is particularly overzealous because it
discards messages with attached Zip files. You can, of course,
eliminate some of these filename extensions, but that reduces the
effectiveness of the tests. Alternatively, you can enclose the test
in a white-list test to enable trusted senders to deliver mail
containing these attachments. Another option is to rename attachments
rather than discard them; for instance, rename a
.zip file to .zip.txt. This
enables users to access the files, but makes it harder for worms that
are named in this way to do harm automatically.
These rules, as shown in Example 13-3, are also
potentially deleterious because they discard the messages by sending
them to /dev/null. Placing the messages in a
folder to hold suspected worms might be a good alternative. Users can
then open the messages only with extreme caution. If you place these
rules in a systemwide Procmail configuration file, you can even send
the suspect messages to a mail folder that only root can read.
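The alternatives to outright deletion can be sketched in Python with the standard email package. This is a minimal illustration under stated assumptions, not a replacement for the Procmail recipes: the function names and the suspected-worms folder name are hypothetical, and the extension list is copied from Example 13-3.

```python
import os
from email.message import EmailMessage

# Extensions taken from the second recipe in Example 13-3.
SUSPICIOUS = {".bat", ".com", ".exe", ".pif", ".scr", ".vbs", ".zip"}

def suspicious_attachments(msg):
    """Return the filenames of attachments whose extension looks risky."""
    names = []
    for part in msg.walk():
        name = part.get_filename()
        if name and os.path.splitext(name.lower())[1] in SUSPICIOUS:
            names.append(name)
    return names

def defang_name(name):
    """Rename a risky attachment so it is not run or opened automatically."""
    if os.path.splitext(name.lower())[1] in SUSPICIOUS:
        return name + ".txt"
    return name

def holding_folder(msg, folder="suspected-worms"):
    """Pick a holding folder instead of discarding; None means deliver normally."""
    return folder if suspicious_attachments(msg) else None

msg = EmailMessage()
msg["Subject"] = "Your invoice"
msg.set_content("See attached.")
msg.add_attachment(b"MZ", maintype="application",
                   subtype="octet-stream", filename="invoice.exe")
print(suspicious_attachments(msg), holding_folder(msg), defang_name("archive.zip"))
```

Unlike the recipes, which match name= anywhere in the body, this sketch inspects only the filenames that the MIME structure declares, so the two approaches can disagree on malformed messages.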