Squid: The Definitive Guide

Chapter 11. Redirectors

A redirector is an external process that rewrites URIs from client requests. For example, although a user requests the page http://www.example.com/page1.html, a redirector can change the request to something else, such as http://www.example.com/page2.html. Squid fetches the new URI automatically, as though the client originally requested it. If the response is cachable, Squid stores it under the new URI.

The redirector feature allows you to implement a number of interesting things with Squid. Many sites use redirectors for access controls, removing advertisements, local mirrors, or even working around browser bugs.

One of the nice things about using a redirector for access control is that you can send the user to a page that explains exactly why her request is denied. You may also find that a redirector offers more flexibility than Squid’s built-in access controls. As you’ll see shortly, however, a redirector doesn’t have access to the full spectrum of information contained in a client’s request.

Many people use a redirector to filter out web page advertisements. In most cases, this involves changing a request for a GIF or JPEG advertisement image into a request for a small, blank image, located on a local server. Thus, the advertisement just “disappears” and doesn’t interfere with the page layout.

In essence, a redirector is just a program that reads a URI and other information from its input and writes a new URI on its output. Perl and Python are popular languages for redirectors, although some authors use compiled languages such as C for better performance.

The Squid source code doesn’t come with any redirector programs. As an administrator, you are responsible for writing your own or downloading one written by someone else. The first part of this chapter describes the interface between Squid and a redirector process. I also provide a couple of simple redirector examples in Perl. If you’re interested in using someone else’s redirector, rather than programming your own, skip ahead to Section 11.5.

The Redirector Interface

A redirector receives data from Squid on stdin one line at a time. Each line contains the following four tokens separated by whitespace:

Request-URI
Client IP address and fully qualified domain name
User’s name, via either RFC 1413 ident or proxy authentication
HTTP request method

For example:

http://www.example.com/page1.html 192.168.2.3/user.host.name jabroni GET

The Request-URI is taken from the client’s request, including query terms, if any. Fragment identifier components (e.g., the # character and subsequent text) are removed, however.

The second token contains the client IP address and, optionally, its fully qualified domain name (FQDN). The FQDN is set only if you enable the log_fqdn directive or use a srcdomain ACL element. Even then, the FQDN may be unknown because the client’s network administrators didn’t properly set up the reverse pointer zones in their DNS. If Squid doesn’t know the client’s FQDN, it places a hyphen (-) in the field. For example:

http://www.example.com/page1.html 192.168.2.3/- jabroni GET

The client ident field is set if Squid knows the name of the user behind the request. This happens if you use proxy authentication or ident ACL elements, or if you enable ident_lookup_access. Remember, however, that the ident_lookup_access directive doesn’t cause Squid to delay request processing. In other words, if you enable that directive but don’t use ident ACL elements in your access controls, Squid may not yet know the username when writing to the redirector process. If Squid doesn’t know the username, it places a hyphen (-) in the field. For example:

http://www.example.com/page1.html 192.168.2.3/- - GET

Squid reads back one token from the redirector process: a URI. If Squid reads a blank line, the original URI remains unchanged.

A redirector program should never exit until end-of-file occurs on stdin. If the process does exit prematurely, Squid writes a warning to cache.log:

WARNING: redirector #2 (FD 18) exited

If 50% of the redirector processes exit prematurely, Squid aborts with a fatal error message.
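The interface described above can be sketched as a short Python skeleton. This is only an illustration, not code from Squid: the names Request, parse_line, and serve are hypothetical, the sketch assumes the Request-URI contains no whitespace, and it always answers with a blank line, which leaves the URI unchanged.

```python
import sys
from typing import NamedTuple, Optional, TextIO


class Request(NamedTuple):
    uri: str
    client_ip: str
    fqdn: Optional[str]    # None when Squid sent "-"
    ident: Optional[str]   # None when Squid sent "-"
    method: str


def parse_line(line: str) -> Optional[Request]:
    """Split one input line into its four whitespace-separated tokens."""
    tokens = line.split()
    if len(tokens) != 4:
        return None
    uri, client, ident, method = tokens
    # The second token has the form "ip/fqdn"; "-" marks an unknown FQDN.
    client_ip, _, fqdn = client.partition("/")
    return Request(
        uri,
        client_ip,
        None if fqdn in ("", "-") else fqdn,
        None if ident == "-" else ident,
        method,
    )


def serve(inp: TextIO = sys.stdin, out: TextIO = sys.stdout) -> None:
    """Answer every request with a blank line until end-of-file on the input."""
    for line in inp:
        request = parse_line(line)   # a real redirector would inspect this
        out.write("\n")              # blank line: leave the URI unchanged
        out.flush()                  # don't let replies sit in a buffer
```

Calling serve() with no arguments runs the loop against stdin and stdout. The loop ends only when the input reaches end-of-file, so the process doesn’t exit prematurely.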

Handling URIs That Contain Whitespace

If the Request-URI contains whitespace, and the uri_whitespace directive is set to allow, any whitespace in the URI is passed to the redirector. A redirector with a simple parser may become confused in this case. You have two options for handling whitespace in URIs when using a redirector.

One option is to set the uri_whitespace directive to anything except allow. The default setting, strip, is probably a good choice in most situations because Squid simply removes the whitespace from the URI when it parses the HTTP request. See Appendix A for information on the other values for this directive.

If that isn’t an option, you need to make sure the redirector’s parser is smart enough to detect the extra tokens. For example, if it finds more than four tokens in the line received from Squid, it can assume that the last three are the IP address, ident, and request method. Everything before the third-to-last token makes up the Request-URI.
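The last-three-tokens approach might look like the following Python sketch. The function name split_request is hypothetical, and the sketch assumes that the client, ident, and method fields themselves never contain whitespace.

```python
from typing import Optional, Tuple


def split_request(line: str) -> Optional[Tuple[str, str, str, str]]:
    """Return (uri, client, ident, method), or None if the line is too short.

    The last three whitespace-separated tokens are taken as the client,
    ident, and method fields. Whatever precedes them is the Request-URI,
    embedded whitespace included.
    """
    # Split from the right, at most three times, so the URI stays in one piece.
    parts = line.strip().rsplit(None, 3)
    if len(parts) != 4:
        return None
    uri, client, ident, method = parts
    return uri, client, ident, method
```

Splitting from the right means the parser never has to guess where the URI ends; it needs only the field count at the end of the line to stay fixed.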

Generating HTTP Redirect Messages

When a redirector changes the client’s URI, the client normally doesn’t know that Squid decided to fetch a different resource. This is, in all likelihood, a gross violation of the HTTP RFC. If you want to be nicer, and remain compliant, there is a little trick that makes Squid return an HTTP redirect message instead. Simply have the redirector insert 301:, 302:, 303:, or 307: before the new URI.

For example, if a redirector writes this line on its stdout:

301:http://www.example.com/page2.html

Squid sends a response like this back to the client:

HTTP/1.0 301 Moved Permanently
Server: squid/2.5.STABLE4
Date: Mon, 29 Sep 2003 04:06:23 GMT
Content-Length: 0
Location: http://www.example.com/page2.html
X-Cache: MISS from zoidberg
Proxy-Connection: close
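A redirector might build such a reply line with a small helper like the one below. The helper name redirect_reply and its default status of 302 are my own assumptions for illustration; only the four status prefixes come from the description above.

```python
# Status codes that can prefix the new URI, as listed in the text.
REDIRECT_CODES = (301, 302, 303, 307)


def redirect_reply(uri: str, status: int = 302) -> str:
    """Format a reply line that asks Squid to send an HTTP redirect."""
    if status not in REDIRECT_CODES:
        raise ValueError(f"unsupported redirect status: {status}")
    return f"{status}:{uri}"
```

For instance, redirect_reply("http://www.example.com/page2.html", 301) produces the 301: line shown in the example above.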

Some Sample Redirectors

Example 11-1 is a very simple redirector written in Perl. Its purpose is to send HTTP requests for the squid-cache.org site to a local mirror site in Australia. If the requested URI looks like it is for www.squid-cache.org or one of its mirror sites, this script outputs a new URI with the hostname set to www1.au.squid-cache.org.

A common problem first-time redirector writers encounter is buffered I/O. Note that here I make sure stdout is unbuffered.

Example 11-1. A simple redirector in Perl

#!/usr/bin/perl -wl
$|=1;   # don't buffer the output
while (<>) {
        ($uri,$client,$ident,$method) = ( );
        ($uri,$client,$ident,$method) = split;
        next unless ($uri =~ m,^http://.*\.squid-cache\.org(\S*),);
        $uri = "http://www1.au.squid-cache.org$1";
} continue {
        print "$uri";
}
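Buffered output is just as much of a trap in other languages. As a rough Python counterpart to Example 11-1 (a sketch for comparison, not a tested replacement), the following applies the same pattern and flushes after every reply. The names rewrite and main are hypothetical.

```python
import re
import sys

# Mirror host and pattern carried over from Example 11-1.
MIRROR = "http://www1.au.squid-cache.org"
PATTERN = re.compile(r"^http://.*\.squid-cache\.org(\S*)")


def rewrite(line: str) -> str:
    """Return the mirror URI for squid-cache.org requests, else the original URI."""
    fields = line.split()
    if not fields:
        return ""                       # blank reply: leave the request alone
    uri = fields[0]
    match = PATTERN.match(uri)
    return MIRROR + match.group(1) if match else uri


def main() -> None:
    for line in sys.stdin:
        # flush=True pushes each reply out at once. When stdout is a pipe,
        # Python would otherwise buffer it in blocks and Squid would wait.
        print(rewrite(line), flush=True)
```

Keeping the rewrite logic in its own function makes it easy to try from an interactive session before handing the script to Squid; main() is the loop Squid would actually talk to.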

Example 11-2 is another, somewhat more complicated, example. Here I make a feeble attempt to deny requests when the URI contains “bad words.” This script demonstrates an alternative way to parse the input fields. If I don’t get all five required fields, the redirector returns a blank line, leaving the request unchanged.

This example also gives preferential treatment to some users. If the ident string is equal to “TheBoss,” or the request comes from the 192.168.4.0/24 subnet, the request is passed through. Finally, I use the 301: trick to make Squid return an HTTP redirect to the client. Note that this program is neither efficient nor smart enough to correctly deny so-called bad requests.

Example 11-2. A slightly less simple redirector in Perl

#!/usr/bin/perl -wl
$|=1;   # don't buffer the output

$DENIED = "http://www.example.com/denied.html";
&load_word_list( );

while (<>) {
        unless (m,(\S+) (\S+)/(\S+) (\S+) (\S+),) {
                $uri = '';
                next;
        }
        $uri = $1;
        $ipaddr = $2;
        #$fqdn = $3;
        $ident = $4;
        #$method = $5;
        next if ($ident eq 'TheBoss');
        next if ($ipaddr =~ /^192\.168\.4\./);
        $uri = "301:$DENIED" if &word_match($uri);
} continue {
        print "$uri";
}

sub load_word_list {
        @words = qw(sex drugs rock roll);
}

sub word_match {
        my $uri = shift;
        foreach $w (@words) { return 1 if ($uri =~ /$w/); }
        return 0;
}

For more ideas about writing your own redirector, I recommend reading the source code for the redirectors mentioned in Section 11.5.

The Redirector Pool

A redirector can take an arbitrarily long time to return its answer. For example, it may need to make a database query, search through long lists of regular expressions, or make some complex computations. Squid uses a pool of redirector processes so that they can all work in parallel. While one is busy, Squid hands a new request off to another.

For each new request, Squid examines the pool of redirector processes in order. It submits the request to the first idle process. If your request rate is very low, the first redirector may be able to handle all requests itself.
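The first-idle selection described above can be modeled in a few lines of Python. This is a toy simulation under simplifying assumptions (a fixed service time, no queueing), not Squid’s actual scheduling code, and the function names are hypothetical.

```python
from typing import List, Optional


def first_idle(busy_until: List[float], now: float) -> Optional[int]:
    """Return the index of the first helper that is idle at time `now`."""
    for index, free_at in enumerate(busy_until):
        if free_at <= now:
            return index
    return None


def simulate(arrivals: List[float], service_time: float, pool_size: int) -> List[int]:
    """Count how many requests each helper handles under first-idle selection."""
    busy_until = [0.0] * pool_size
    handled = [0] * pool_size
    for now in arrivals:
        index = first_idle(busy_until, now)
        if index is None:
            continue  # all helpers busy; queueing is left out of this sketch
        busy_until[index] = now + service_time
        handled[index] += 1
    return handled
```

With one request per second and a half-second service time, simulate([0, 1, 2, 3], 0.5, 3) returns [4, 0, 0]: the first process handles everything. Later processes pick up work only when requests overlap, which is why the per-process request counts reveal how heavily the pool is loaded.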

You can control the size of the redirector pool with the redirect_children directive. The default value is five processes. Note that Squid doesn’t dynamically increase or decrease the size of the pool depending on the load. Thus, it is a good idea to be a little liberal when choosing the pool size. If all redirectors are busy, Squid queues pending requests. If the queue becomes too large (bigger than twice the pool size), Squid exits with a fatal error message:

FATAL: Too many queued redirector requests

In this case, you need to increase the size of the redirector pool or change something so that the redirectors can process requests faster. You can use the cache manager’s redirector page to find out if you have too few or too many redirectors running. For example:

% squidclient mgr:redirector
...
Redirector Statistics:
program: /usr/local/squid/bin/myredir
number running: 5 of 5
requests sent: 147
replies received: 142
queue length: 2
avg service time: 953.83 msec

      #      FD     PID  # Requests     Flags      Time  Offset Request
      1      10   35200          46     AB        0.902       0 http://...
      2      11   35201          29     AB        0.401       0 http://...
      3      12   35202          25     AB        1.009       1 cache_o...
      4      14   35203          25     AB        0.555       0 http://...
      5      15   35204          21     AB        0.222       0 http://...

If, as in this example, you see that the last redirector has almost as many requests as the second to last, you should probably increase the size of the redirector pool. If, on the other hand, you see many redirectors with no requests, you can probably decrease the pool size.

Configuring Squid

The following five squid.conf directives control the behavior of redirectors in Squid.

redirect_program

The redirect_program directive specifies the command line for the redirector program. For example:

redirect_program /usr/local/squid/bin/my_redirector -xyz

Note, the redirector program must be executable by the Squid user ID. If, for some reason, Squid can’t execute the redirector, you should see an error message in cache.log.^[1] For example:

ipcCreate: /usr/local/squid/bin/my_redirector: (13) Permission denied

Due to the way Squid works, the main Squid process may be unaware of problems executing the redirector program. Squid doesn’t detect the error until it tries to write a request and read a response. It then prints:

WARNING: redirector #1 (FD 6) exited

Thus, if you see such a message for the first request sent to Squid, check cache.log closely for other errors, and make sure the program is executable by Squid.

redirect_children

The redirect_children directive specifies how many redirector processes Squid should start. For example:

redirect_children 20

Squid warns you (via cache.log) when all redirectors are simultaneously busy:

WARNING: All redirector processes are busy.
WARNING: 1 pending requests queued.

If you see this warning, you should increase the number of child processes and restart (or reconfigure) Squid. If the queue grows beyond twice the number of redirectors, Squid aborts with a fatal message.

Don’t attempt to disable Squid’s use of the redirectors by setting redirect_children to 0. Instead, simply remove the redirect_program line from squid.conf.

redirect_rewrites_host_header

Squid normally updates a request’s Host header when using a redirector. That is, if the redirector returns a new URI with a different hostname, Squid puts the new hostname in the Host header. If you use Squid as a surrogate (see Chapter 15), you might want to disable this behavior by setting the redirect_rewrites_host_header directive to off:

redirect_rewrites_host_header off

redirector_access

Squid normally sends every request through a redirector. However, you can use the redirector_access rules to send only certain requests through the redirector. The syntax is identical to http_access:

redirector_access allow|deny [!]ACLname ...

For example:

acl Foo src 192.168.1.0/24
acl All src 0/0
redirector_access deny Foo
redirector_access allow All

In this case, Squid skips the redirector for any request that matches the Foo ACL.

redirector_bypass

If you enable the redirector_bypass directive, Squid bypasses the redirectors when all of them are busy. Normally, Squid queues pending requests until a redirector process becomes available. If this queue grows too large, Squid exits with a fatal error message. Enabling this directive ensures that Squid never reaches that state.

The tradeoff, of course, is that some user requests may not be redirected when the load is high. If that’s all right with you, simply enable the directive with this line:

redirector_bypass on
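To make the tradeoff concrete, here is a small Python model of what happens to a single request under the behavior described in this chapter. It is a sketch of the documented behavior only, not Squid’s source; FatalError and submit are hypothetical names, and the queue limit of twice the pool size is taken from the text.

```python
from collections import deque
from typing import Deque


class FatalError(Exception):
    """Stands in for Squid exiting with a fatal error."""


def submit(request: str, idle_helpers: int, queue: Deque[str],
           pool_size: int, bypass: bool) -> str:
    """Decide what happens to one request: 'sent', 'queued', or 'bypassed'."""
    if idle_helpers > 0:
        return "sent"
    if bypass:
        return "bypassed"            # request proceeds with its original URI
    queue.append(request)
    if len(queue) > 2 * pool_size:   # limit described in the text
        raise FatalError("Too many queued redirector requests")
    return "queued"
```

In this model, the queue never grows when bypass is true, so the fatal condition can’t be reached; the cost is that bypassed requests are never rewritten.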

Popular Redirectors

As I already mentioned, the Squid source code doesn’t include any redirectors. However, you can find a number of useful third-party redirectors linked from the Related Software page on http://www.squid-cache.org. Here are some of the more popular offerings:

Squirm

http://squirm.foote.com.au/

Squirm comes from Chris Foote. It is written in C and distributed as source code under the GNU General Public License (GPL). Squirm’s features include:

Being very fast with minimal memory usage
Full regular expression pattern matching and replacement
Ability to apply different redirection lists to different client groups
Interactive mode for testing on the command line
Fail-safe mode passes requests through unchanged in the event that configuration files contain errors
Writing debugging, errors, and more to various log files

Jesred

http://www.linofee.org/~elkner/webtools/jesred/

Jesred comes from Jens Elkner. It is written in C, based on Squirm, and also released under the GNU GPL. Its features include:

Being faster than Squirm, with slightly more memory usage
Ability to reread its configuration files while running
Full regular expression pattern matching and replacement
Fail-safe mode passes requests through unchanged in the event that configuration files contain errors
Optionally logging rewritten requests to a log file

squidGuard

http://www.squidguard.org/

squidGuard comes from Pål Baltzersen and Lars Erik Håland at Tele Danmark InterNordia. It is released under the GNU GPL. The authors also make sure squidGuard compiles easily on modern Unix systems. Their site contains a lot of good documentation. Here are some of squidGuard’s features:

Highly configurable; you can apply different rules to different groups of clients or users and at different times or days
URI substitution, not just replacement, à la sed
printf-like substitutions allow passing parameters to CGI scripts for customized messages
Support for the 301/302/303/307 HTTP redirect status code feature for redirectors
Selective logging for rewrite rule sets

At the squidGuard site, you can also find a blacklist of more than 100,000 sites categorized as porn, aggressive, drugs, hacking, ads, and more.

AdZapper

http://www.adzapper.sourceforge.net

AdZapper is a popular redirector because it specifically targets removal of advertisements from HTML pages. It is a Perl script written by Cameron Simpson. AdZapper can block banners (images), pop-up windows, flash animations, page counters, and web bugs. The script includes a list of regular expressions that match URIs known to contain ads, pop-ups, etc. Cameron updates the script periodically with new patterns. You can also maintain your own list of patterns.

Exercises

Write a redirector that never changes the requested URI and configure Squid to use it.
While running tail -f cache.log, kill Squid’s redirector processes one by one until something interesting happens.
Download and install one of the redirectors mentioned in the previous section.

^[1] This message appears only in cache.log. It doesn’t appear on stdout if you use the -d option, nor in syslog if you use the -s option.
