Squid: The Definitive Guide

Chapter 13. Log Files

Log files are the primary sources of persistent information about Squid’s operation. In other words, they provide a record of what Squid has been doing. This includes URIs requested by users, objects that have been saved to disk, and various warnings and errors. When Squid appears to be malfunctioning, you’ll want to check the log files first. By the end of this chapter, you’ll know how to interpret and manage all of Squid’s various log files.

Depending on your configuration, Squid maintains, at most, seven log files. The three primary files are: cache.log, access.log, and store.log. Two optional log files, useragent.log and referer.log, are similar to access.log but contain additional information. I’ll also talk about the swap.state and netdb_state files. These are databases, used by Squid when it restarts.

Note that the filenames, such as access.log, are the default values. You can change most of the log file names with various squid.conf directives.

The following list contains a brief description of each log file:

cache.log: This log file contains human-oriented, informational messages about Squid’s operation. The filename is defined by the cache_log directive. Under normal conditions, the file grows by about 10-100 KB per day.
access.log: This log file contains an entry for every HTTP and (optionally) ICP transaction made by Squid’s clients. The filename is defined by the cache_access_log directive. It grows at a rate of 100-200 bytes per transaction.
store.log: This log file contains low-level information about objects that enter and leave the cache. The filename is defined by the cache_store_log directive. It grows at a rate of about 150 bytes per transaction.
referer.log: This optional log file contains HTTP Referer ^[1] headers for each client request. You must enable referer logging with the —enable-referer-log option when running ./configure. The filename is defined by the referer_log directive. It grows at a rate of about 80 bytes per transaction.
useragent.log: This optional log file contains HTTP User-Agent headers for each client request. You must enable user-agent logging with the —enable-useragent-log option when running ./configure. The filename is defined by the useragent_log directive. It grows at a rate of about 75 bytes per transaction.
swap.state: These files contain internal metadata about the objects stored on disk. Squid uses them to reconstruct the cache upon startup. By default, they are located in the cache_dir directories. However, you can change the location with the cache_swap_log directive. They grow at a rate of 100 bytes per cache miss.
netdb_state: This file holds the contents of the Network Measurement Database (see Section 10.5). It is always located in the first cache_dir directory. Its size is determined by the netdb_high value.

If Squid receives an error while writing a log file, it doesn’t silently continue. Instead, it exits with a fatal error message to get your attention. Make sure that you periodically rotate the log files, as described in Section 13.7, to reduce the possibility of filling your disks. For the same reason, I also recommend placing your log files on a partition separate from any of your cache directories.

cache.log

cache.log contains various messages such as information about Squid’s configuration, warnings about possible performance problems, and serious errors. Here is some sample cache.log output:

2003/09/29 12:09:45| Starting Squid Cache version 2.5.STABLE4 for i386-
unknown-freebsd4.8...
2003/09/29 12:09:45| Process ID 18990
2003/09/29 12:09:45| With 1064 file descriptors available
2003/09/29 12:09:45| Performing DNS Tests...
2003/09/29 12:09:45| Successful DNS name lookup tests...
2003/09/29 12:09:45| DNS Socket created at 0.0.0.0, port 1154, FD 5
2003/09/29 12:09:45| Adding nameserver 24.221.192.5 from /etc/resolv.conf
2003/09/29 12:09:45| Adding nameserver 24.221.208.5 from /etc/resolv.conf
2003/09/29 12:09:45| helperOpenServers: Starting 5 'redirector.pl' processes
2003/09/29 12:09:45| Unlinkd pipe opened on FD 15
2003/09/29 12:09:45| Swap maxSize 10240 KB, estimated 787 objects
2003/09/29 12:09:45| Target number of buckets: 39
2003/09/29 12:09:45| Using 8192 Store buckets
2003/09/29 12:09:45| Max Mem  size: 8192 KB
2003/09/29 12:09:45| Max Swap size: 10240 KB
2003/09/29 12:09:45| Rebuilding storage in /usr/local/squid/var/cache (CLEAN)
2003/09/29 12:09:45| Using Least Load store dir selection
2003/09/29 12:09:45| Set Current Directory to /usr/local/squid/var/cache
2003/09/29 12:09:45| Loaded Icons.
2003/09/29 12:09:45| Accepting HTTP connections at 0.0.0.0, port 3128, FD 16.
2003/09/29 12:09:45| Accepting ICP messages at 0.0.0.0, port 3130, FD 17.
2003/09/29 12:09:45| WCCP Disabled.
2003/09/29 12:09:45| Ready to serve requests.

Each cache.log entry starts with a timestamp showing when the message was generated. The very first entry in this sample reports the Squid version (2.5.STABLE4) and a string identifying the operating system for which Squid was configured (i386-unknown-freebsd4.8). The process ID (18990) follows. Many cache.log entries may look cryptic (Target number of buckets: 39). In most cases, under normal conditions, you can ignore entries you don’t understand. On the other hand, you may want to look over essential configuration details such as name-server addresses or HTTP server address. This sample output ends with a statement that Squid is ready to serve requests. At this point, Squid can accept HTTP connections from clients.

Usually, the cache.log file grows slowly. However, an unusual HTTP transaction or similar event may cause Squid to emit a debugging message. If such an event happens often (e.g., a DoS attack, a new virus, or sudden disk failure), the log file may grow quickly. Rotating log files reduces the chance that you’ll run out of disk space.

Major errors and abnormal conditions are likely to be reported in cache.log. I recommend archiving these logs so that it is possible to go back and find the first occurrence of an unusual event. When describing a particular Squid problem on the mailing list or a similar forum, the relevant lines from cache.log may be very useful. You may also want to increase debugging levels for some sections so that others can better understand and fix your problem.

Debugging Levels

The debug_options directive controls the level of detail for cache.log messages. The default value (ALL,1) is usually the best choice. At higher levels, the unimportant messages make it harder to find the important ones. Refer to Section 16.2 for a thorough description of the debug_options directive.

Note that debugging at the highest levels (9 or 10) may add thousands of lines for each request, quickly consuming disk space and significantly degrading Squid’s performance.

You can use Squid’s -X command-line option to enable full debugging for all sections. This mode is particularly useful if Squid refuses to start, and the debugging levels in squid.conf are insufficient to diagnose the problem. This is also a good way to enable full debugging of the configuration file parser, before it gets to the debug_options directive. You should never use the -X when Squid is operating properly.

You can use Squid’s -k debug command-line option to enable full debugging immediately on a running Squid process. This command operates as a toggle: the first invocation turns on full debugging, and the second invocation turns it off. See Chapter 5 for a general discussion about the -k option.

As I already mentioned, full debugging generates an overwhelming amount of data. This can make Squid, and the operating system, very slow. In extreme cases, you may find your terminal session becomes unresponsive after executing the first squid -k debug command. Locking yourself out while Squid is spitting megabytes of cache.log entries per second is an unpleasant experience. I find the following trick useful to get a compact, five-second debugging snapshot with less risk:

% squid -k debug; sleep 5; squid -k debug

Forwarding cache.log Messages to the System Log

To have Squid send copies of cache.log messages to the system log, use the -s command-line option. Only messages with debugging levels 0 and 1 are forwarded. Level 0 messages are logged with syslog level LOG_WARNING. Level 1 messages use syslog level LOG_NOTICE. All messages use the LOCAL4 syslog facility. Here is one way to configure syslogd so that these messages are saved:

local4.warning                           /var/log/squid.log

Using syslog in addition to cache.log is especially handy when you maintain several Squid boxes. You can configure each local syslog daemon to forward these messages to a central host and enjoy a unified view of all caches in one location. For example, you might use this entry in /etc/syslogd.conf:

local4.notice                            @192.168.45.1

Dumping cache.log Messages to Your Terminal

The -d level command-line option instructs Squid to dump cache.log messages to your terminal (i.e., stderr). The level argument specifies the maximum level for messages that are dumped. Note that you’ll see only messages that would appear in cache.log, subject to the debug_options setting. For example, if you have debug_options ALL,1, and run squid -d2, you won’t see any level 2 debugging messages.

The -d level and -N options are most useful for debugging Squid problems or quickly testing a change to the configuration file. They allow you to start Squid easily and see the cache.log messages. This option may also be useful when Squid starts from cron or a similar facility that automatically captures a program’s standard error output and reports it back to the user. For example, you may have a cron job that automatically reconfigures the running Squid process:

15 */4 * * * /usr/local/squid/sbin/squid -d1 -k reconfigure

access.log

Squid saves key information about HTTP transactions in access.log. This file is line-based, such that each line corresponds to one client request. Squid records the client IP address (or hostname), requested URI, response size, and other information.

Squid records all HTTP accesses in access.log, except for those that disconnect before sending any data. Squid also records all ICP (but not HTCP) transactions unless you disable them with the log_icp_queries directive. Section 13.2.4 describes the other squid.conf directives that affect the access log.

The default access.log format contains 10 fields. Here are some examples, with long lines split and indented:

1066037222.011  126389 9.121.105.207 TCP_MISS/503 1055
        GET http://home.gigigaga.com/n8342133/Miho.DAT.019 -
        DIRECT/203.187.1.180 -
1066037222.011   19120 12.83.179.11 TCP_MISS/200 359
        GET http://ads.x10.com/720x300/Z2FtZ3JlZXRpbmcxLmRhd/7/AMG -
        DIRECT/63.211.210.20 text/html
1066037222.011   34173 166.181.33.71 TCP_MISS/200 559
        GET http://coursesites.blackboard.com:8081/service/collab/../1010706448190/ -
        DIRECT/216.200.107.101 application/octet-stream
1066037222.011   19287 41.51.105.27 TCP_REFRESH_MISS/200 500
        GET http://fn.yam.com/include/tsemark/show.js -
        DIRECT/210.59.224.59 application/x-javascript
1066037222.011   19395 41.51.105.27 TCP_MISS/304 274
        GET http://fnasp.yam.com/image/coin3.gif -
        DIRECT/211.72.254.133 -
1066037222.011   19074 30.208.85.76 TCP_CLIENT_REFRESH_MISS/304 197
        GET http://ads.icq.com/content/B0/0/..bC6GygEYNeHGjBUin5Azfe68m5hD1jLk$/aol -
        DIRECT/64.12.184.121 -
1066037222.011   19048 12.83.179.11 TCP_MISS/200 261
        GET http://ads.adsag.com/js.ng/...ne&cat=friendship&subcat=girltalk -
        DIRECT/209.225.54.119 application/x-javascript
1066037222.118     106 41.51.105.27 TCP_HIT/200 536
        GET http://rcm-images.amazon.com/images/G/01/rcm/privacy.gif -
        NONE/- image/gif
1066037222.352   19475 27.34.49.248 TCP_MISS/200 12387
        GET http://espanol.geocities.com/lebastias/divulgacion/budismo-tarot.html -
        DIRECT/209.1.225.139 text/html
1066037222.352     132 144.157.100.17 TCP_MISS/504 1293
        GET http://ar.atwola.com/image/93101912/aol -
        NONE/- -

Here are the definitions for all fields:

1: timestamp

The completion time of the request, expressed as the number of seconds since the Unix epoch (Thu Jan 1 00:00:00 UTC 1970), with millisecond resolution. Squid uses this format, instead of something more human-friendly, to simplify the work of various log file processing programs.

You can use a simple Perl command to convert the Unix timestamps into local time. For example:

perl -pe 's/^\d+\.\d+/localtime($&)/e;' access.log

2: response time

For HTTP transactions, this field indicates how much time it took to process the request. The timer starts when Squid receives the HTTP request and stops when the response has been fully delivered. The response time is given in milliseconds.

The response time is usually 0 for ICP queries. This is because Squid answers ICP queries very quickly. Furthermore, Squid doesn’t update the process clock between receiving an ICP query and sending the reply.

While time values are reported with millisecond resolution, the precision of those entries is probably about 10 milliseconds. Timing becomes even less precise when Squid is heavily loaded.

3: client address

This field contains the client’s IP address, or hostname if you enable log_fqdn. For security or privacy reasons, you may want to mask a part of client’s address out using the client_netmask directive. However, that also makes it impossible to group requests coming from the same client.

4: result/status codes

This field consists of two tokens separated by a slash. The first token, result code, classifies the protocol and the result of a transaction (e.g., TCP_HIT or UDP_DENIED). These are Squid-specific codes, defined in Section 13.2.1. The codes that begin with TCP_ refer to HTTP requests, while UDP_ refers to ICP queries.

The second token is the HTTP response status code (e.g, 200, 304, 404, etc.). The status code normally comes from the origin server. In some cases, however, Squid may be responsible for selecting the status code. These codes, defined by the HTTP RFC, are summarized later in Table 13-1.

5: transfer size

This field indicates the number of bytes transferred to the client. Strictly speaking, it is the number of bytes that Squid told the TCP/IP stack to send to the client. Thus, it doesn’t include overheads from TCP/IP headers. Also note that the transfer size is normally larger than the response’s Content-Length. This value includes the HTTP response headers, while Content-Length does not.

These properties make the transfer size field useful for approximate bandwidth usage analysis but not for exact HTTP entity size calculations. If you need to know a response’s Content-Length, you can find it in the store.log file.

6: request method

This field contains the request method. Because Squid clients may use ICP or HTTP, the request method is either HTTP- or ICP-specific. The most common HTTP request method is GET. ICP queries are always logged with ICP_QUERY. See Section 6.1.2.8 for a list of HTTP methods Squid knows about.

7: URI

This field contains the URI from the client’s request. The vast majority of logged URIs are actually URLs (i.e., they have hostnames).

Squid uses a special format for certain failures. These are cases when Squid can’t parse the HTTP request or otherwise determine the URI. Instead of a URI/URL, you’ll see a string such as “error:invalid-request.” For example:

1066036250.603 310 192.0.34.70 NONE/400 1203 GET error:invalid-request - NONE/- -

Also in this field look out for whitespace characters in the URI. Depending on your uri_whitespace setting, Squid may print the URI in the log file with whitespace characters. When this happens, the tools that read access.log files may become confused by the extra fields.

When logging, Squid strips all URI characters after the first question mark unless the strip_query_terms directive is disabled.

8: client identity

Squid can determine a user’s identity in two different ways. One is with the RFC 1413 ident protocol; the other is from HTTP authentication headers.

Squid attempts ident lookups based on the ident_lookup_access rules, if any (see Section 6.2). Alternatively, if you use proxy authentication (or regular server authentication in surrogate mode), Squid places the given username in this field. If both methods provide Squid with a username, and you’re using the native access.log format, the HTTP authentication name is logged, and the RFC 1413 name is ignored. The common log file format has separate fields for both names.

9: peering code/peerhost

The peering information consists of two tokens, separated by a slash. It is relevant only for requests that are cache misses. The first token indicates how the next hop was chosen. The second token is the address of that next hop. The peering codes are listed in Section 13.2.3.

When Squid sends a request to a neighbor cache, the peerhost address is the neighbor’s hostname. If the request is sent directly to the origin server, however, Squid writes the origin server’s IP address or its hostname if log_ip_on_direct is disabled. The value NONE/- indicates that Squid didn’t forward this request to any other servers.

10: content type

The final field of the default, native access.log is the content type of the HTTP response. Squid obtains the content type value from the response’s Content-Type header. If that header is missing, Squid uses a hyphen (-).

If you enable the log_mime_headers directive, Squid appends two additional fields to each line:

11: HTTP request headers

Squid encodes the HTTP request headers and prints them between a pair of square brackets. The brackets are necessary because Squid doesn’t encode space characters. The encoding scheme is a little strange. Carriage return (ASCII 13) and newline (ASCII 10) are printed as \r and \n, respectively. Other non-printable characters are encoded with the RFC 1738 style, such that Tab (ASCII 9) becomes %09.

12: HTTP response headers

Squid encodes the HTTP response headers and prints them between a pair of square brackets. Note that these are the headers sent to the client, which may be different from headers received from the origin server.

Squid writes to access.log only after the entire response has been sent to the client. This allows Squid to include both request and response information in the log file. However, transactions that take minutes, or even hours, to complete aren’t visible in access.log at the time of the request. When these types of transactions present a performance or policy concern, the access.log may be unable help you. Instead, use the cache manager to view a list of pending transactions (see Section 14.2.1.37).

access.log Result Codes

The following labels may appear in the fourth field of the access.log file in response to HTTP requests:

TCP_HIT: Squid found a likely fresh copy of the requested resource and sent it immediately to the client.
TCP_MISS: Squid didn’t have a cached copy of the requested resource.
TCP_REFRESH_HIT: Squid found a likely stale copy of the requested resource and sent a validation request to the origin server. The origin server sent a 304 (Not Modified) response, indicating that Squid’s copy is still fresh.
TCP_REF_FAIL_HIT: Squid found a likely stale copy of the requested resource and sent a validation request to the origin server. However, the origin server failed to respond or sent a response that Squid didn’t understand. In any case, Squid sent the cached (and likely stale) copy to the client.
TCP_REFRESH_MISS: Squid found a likely stale copy of the requested resource and sent a validation request to the origin server. The server responded with new content, indicating the cached response was indeed stale.
TCP_CLIENT_REFRESH_MISS: Squid found a copy of the requested resource, but the client’s request included a Cache-Control: no-cache directive. Squid forwarded the client’s request to the origin server, forcing a cache validation.
TCP_IMS_HIT: The client sent a validation request, and Squid found a more recent, and likely fresh, copy of the requested resource. Squid sent the newer content to the client, without contacting the origin server.
TCP_SWAPFAIL_MISS: Squid found a valid copy of the requested resource but failed to load it from disk. Squid then sent the request to the origin server as though it were a cache miss.
TCP_NEGATIVE_HIT: When a request to an origin server results in an HTTP error, Squid may cache the response anyway. Repeated requests for these resources, within a short amount of time, result in negative hits. The negative_ttl directive controls the amount of time these errors may be cached. Also note that errors are cached only in memory and never written to disk. The following HTTP status codes may be negatively cached, subject to additional constraints: 204, 305, 400, 403, 404, 405, 414, 500, 501, 502, 503, 504.
TCP_MEM_HIT: Squid found a valid copy of the requested resource in the memory cache and sent it immediately to the client. Note that this doesn’t accurately represent all responses served from memory. For example, responses that are cached in memory, but require validation, are logged with TCP_REFRESH_HIT, TCP_REFRESH_MISS, etc.
TCP_DENIED: The client’s request was denied, due to either the http_access or http_reply_access rules. Note that requests denied by http_access have NONE/- in the ninth field, whereas those denied by http_reply_access have a valid entry.
TCP_OFFLINE_HIT: When offline_mode is enabled, Squid returns cache hits for almost any cached response, without considering its freshness.
TCP_REDIRECT: A redirector program told Squid to generate an HTTP redirect to a new URI (see Section 11.1). Normally, Squid doesn’t log these redirects. To do so, you must manually define the LOG_TCP_REDIRECTS preprocessor directive before compiling Squid.
NONE: Unclassified result used for certain errors, such as invalid hostnames.

The following labels may appear in the fourth field of the access.log file in response to ICP queries:

UDP_HIT: Squid found a likely fresh copy of the requested resource in the cache.
UDP_MISS: Squid didn’t find a likely fresh copy of the requested resource in the cache. If the same object is requested via HTTP, it would probably be a cache miss. Compare with UDP_MISS_NOFETCH.
UDP_MISS_NOFETCH: Like UDP_MISS, except that this also indicates Squid’s reluctance to handle the corresponding HTTP request. If you use the -Y command-line option, Squid returns this, instead of UDP_MISS, while rebuilding its in-memory indexes at startup.
UDP_DENIED: The ICP query is denied due to the icp_access rules. If more than 95% of the ICP replies to a client are UDP_DENIED, and the client database is enabled (see Appendix A), Squid stops sending any ICP replies to the client for an hour. When this happens you’ll also see a warning in cache.log.
UDP_INVALID: Squid received an invalid query (e.g., truncated message, invalid protocol version, whitespace in the URI, etc.). Squid sent an ICP_INVALID reply back to the client.

HTTP Response Status Codes

Table 13-1 lists the numerical HTTP response codes and reason phrases. Note that Squid and other HTTP agents care only about the numeric value. The reason phrase is purely informational and doesn’t affect the meaning of the response. For each status code, I also provide a reference to the particular section in RFC 2616 that describes it. Note that status codes 0 and 600 are nonstandard values used by Squid, and aren’t mentioned in the RFC.

Table 13-1. HTTP response status codes

Code	Reason phrase	RFC 2616 section
0	No Response Received (Squid-specific)	N/A
1xx	Informational	10.1
100	Continue	10.1.1
101	Switching Protocols	10.1.2
2xx	Successful	10.2
200	OK	10.2.1
201	Created	10.2.2
202	Accepted	10.2.3
203	Non-Authoritative Information	10.2.4
204	No Content	10.2.5
205	Reset Content	10.2.6
206	Partial Content	10.2.7
3xx	Redirection	10.3
300	Multiple Choices	10.3.1
301	Moved Permanently	10.3.2
302	Found	10.3.3
303	See Other	10.3.4
304	Not Modified	10.3.5
305	Use Proxy	10.3.6
306	(Unused)	10.3.7
307	Temporary Redirect	10.3.8
4xx	Client Error	10.4
400	Bad Request	10.4.1
401	Unauthorized	10.4.2
402	Payment Required	10.4.3
403	Forbidden	10.4.4
404	Not Found	10.4.5
405	Method Not Allowed	10.4.6
406	Not Acceptable	10.4.7
407	Proxy Authentication Required	10.4.8
408	Request Timeout	10.4.9
409	Conflict	10.4.10
410	Gone	10.4.11
411	Length Required	10.4.12
412	Precondition Failed	10.4.13
413	Request Entity Too Large	10.4.14
414	Request-URI Too Long	10.4.15
415	Unsupported Media Type	10.4.16
416	Requested Range Not Satisfiable	10.4.17
417	Expectation Failed	10.4.18
5xx	Server Error	10.5
500	Internal Server Error	10.5.1
501	Not Implemented	10.5.2
502	Bad Gateway	10.5.3
503	Service Unavailable	10.5.4
504	Gateway Timeout	10.5.5
505	HTTP Version Not Supported	10.5.6
6xx	Proxy Error	N/A
600	Unparseable Response Headers (Squid-specific)	N/A

You’ll see status code 0 in the access.log if Squid doesn’t receive any response from the origin server. You’ll see status code 600 if Squid received a response but couldn’t find any HTTP headers. In a small fraction of cases, certain origin servers send only the response body and omit any headers.

access.log Peering Codes

The following codes may appear in the ninth field of the access.log. Refer to Section 10.10 for a description of how Squid selects the next-hop for cache misses.

NONE

This indicates that Squid didn’t communicate with any other servers (neighbors, origin) for this request. You’ll see it in association with various types of cache hits, denied requests, cache manager requests, errors, and all ICP queries.

DIRECT

Squid forwarded the request directly to the origin server. The second half of the field shows the origin server’s IP address, or hostname if you’ve disabled log_ip_on_direct.

SIBLING_HIT

Squid sent the request to this sibling cache after the sibling returned an ICP or HTCP hit.

PARENT_HIT

Squid sent the request to this parent cache after the parent returned an ICP or HTCP hit.

DEFAULT_PARENT

Squid selected this parent because it was marked as default on the cache_peer line in squid.conf.

FIRST_UP_PARENT

Squid forwarded the request to this parent because it is the first parent in the list known to be alive.

FIRST_PARENT_MISS

Squid forwarded the request to the parent cache that was first to respond with an ICP/HTCP miss message. In other words, for this particular ICP/HTCP query, at this particular time, the selected parent had the best round-trip time. Note that measured RTTs may be artificially adjusted by the weight option to the cache_peer directive.

CLOSEST_PARENT_MISS

Squid selected this parent because it reports the lowest RTT to the origin server. This occurs only if both caches have netdb enabled (see Section 10.5), and the origin server (or other servers on its subnet) returns ICMP pings.

CLOSEST_PARENT

This is similar to CLOSEST_PARENT_MISS, except that the RTT measurements don’t come from the ICP/HTCP reply messages. Instead, they come from older measurements saved by Squid, such as the netdb exchange feature.

CLOSEST_DIRECT

Squid forwarded the request to the origin server based on netdb measurements. This happens if any of these conditions occur:

The RTT between Squid and the origin server is less than the configured minimum_direct_rtt value.
The measured number of router hops between Squid and the origin server is less than the configured minimum_direct_hops value.
The RTT values returned in ICP/HTCP replies indicate that Squid is closer to the origin server than any of its neighbors.

ROUNDROBIN_PARENT

Squid forwarded the request to this parent because the round-robin option was set, and it had the lowest usage counter.

CD_PARENT_HIT

Squid forwarded the request to this parent based on the Cache Digest algorithm (see Section 10.7).

CD_SIBLING_HIT

Squid forwarded the request to this sibling based on the Cache Digest algorithm.

CARP

Squid selected this parent based on the Cache Array Routing Protocol algorithm (see Section 10.9).

ANY_PARENT

Squid selected this parent as a last resort because none of the other methods resulted in a viable next-hop.

Note that most of these codes may be preceded by TIMEOUT_ to indicate that a timeout occurred while waiting for ICP/HTCP replies. For example:

1066038165.382    345 193.233.46.21 TCP_MISS/200 2836
        GET http://www.caida.org/home/images/home.jpg
        TIMEOUT_CLOSEST_DIRECT/213.219.122.19 image/jpeg

You can adjust the timeout with the icp_query_timeout directive.

Configuration Directives That Affect access.log

Following are the configuration file directives that affect the access.log in one way or another.

log_icp_queries

This directive, enabled by default, causes Squid to log all ICP queries. If you’re running a busy parent cache, this may make your access.log files huge. To save space, disable this directive:

log_icp_queries off

If you disable ICP query logging, I suggest that you monitor the number of queries, either through the cache manager or with SNMP.

emulate_httpd_log

The access.log file has two formats: common and native. The common format is the same as most HTTP servers (e.g., Apache) use. It contains less information than Squid’s native format. However, you might want to use the common log-file format if you use Squid as a surrogate (see Chapter 15). The common format may also be useful if you have log-file analysis tools that know how to parse it. Use this directive to enable the common format:

emulate_httpd_log on

See the site http://www.w3.org/Daemon/User/Config/Logging.html#common-logfile-format, for a description of this format.

log_mime_hdrs

Use the log_mime_hdrs directive to make Squid log the HTTP request and response headers:

log_mime_headers on

When enabled, Squid appends the request and response headers to access.log. This adds two fields to each line. Each field is surrounded by square brackets to make parsing easier. Certain characters are encoded to keep the log file readable. Table 13-2 shows the encoding scheme.

Table 13-2. Character encoding rules for HTTP headers in access.log

Character	Encoding
Newline	\n
Carriage return	\r
Backslash	\\
[	%5b
]	%5d
%	%25
ASCII 0-31	`%xx` (hexadecimal value)
ASCII 127-255	`%xx` (hexadecimal value)

log_fqdn

By default, Squid puts client IP addresses in the access.log. You can record hostnames, when available, by enabling this directive:

log_fqdn on

This causes Squid to make reverse DNS lookups for the client’s address when it receives a request. If an answer is available by the time the request is complete, Squid places it in the third field.

ident_lookup_access

This access rule list determines whether or not Squid makes an RFC 1413 ident query for the client’s TCP connection. By default, Squid doesn’t issue ident queries. To enable this feature, simply add one or more rules:

acl All src 0/0
ident_lookup_access allow All

If an answer is available by the time the request is complete, Squid places it in the eighth field. If you are also using HTTP authentication, that username is written instead of the ident answer.

log_ip_on_direct

When Squid forwards a cache miss to an origin server, it records the origin server’s IP address in the ninth field. You can disable this directive so that Squid writes the hostname instead:

log_ip_on_direct off

In this case, the hostname comes from the URI. If the URI contains an IP address, Squid doesn’t convert it to a hostname.

client_netmask

This directive exists to provide some level of privacy for your users. Rather than logging the entire client IP address, you can mask off some bits. For example:

client_netmask 255.255.255.0

With this setting, all client IP addresses in access.log have 0 as the last octet:

1066036246.918     35 163.11.255.0 TCP_IMS_HIT/304 266 GET http://...
1066036246.932     16 163.11.255.0 TCP_IMS_HIT/304 266 GET http://...
1066036247.616    313 140.132.252.0 TCP_MISS/200 1079 GET http://...
1066036248.598  44459 140.132.252.0 TCP_MISS/500 1531 GET http://...
1066036249.230     17 170.210.173.0 TCP_IMS_HIT/304 265 GET http://...
1066036249.752   2135 140.132.252.0 TCP_MISS/200 50230 GET http://...
1066036250.467      4 170.210.173.0 TCP_IMS_HIT/304 265 GET http://...
1066036250.762    102 163.11.255.0 TCP_IMS_HIT/304 265 GET http://...
1066036250.832     20 163.11.255.0 TCP_IMS_HIT/304 266 GET http://...
1066036251.026     74 203.91.150.0 TCP_CLIENT_REFRESH_MISS/304 267 GET http://...

strip_query_terms

This directive is another privacy feature. It removes query terms from URIs before logging them. If your log files somehow fall into the wrong hands, they won’t be able to find any usernames and passwords. When this directive is enabled, all characters after a question mark (?) are removed. For example, a URI like this:

http://auto.search.msn.com/response.asp?MT=www.kimo.com.yw&srch=3
&prov=&utf8

is logged like this:

http://auto.search.msn.com/response.asp?

uri_whitespace

Earlier, I mentioned the problem with whitespace appearing in some URIs. The RFCs state that URIs must not contain whitespace, but in reality it happens all too often. The uri_whitespace directive dictates how Squid should handle such cases. The allowed settings are: strip (default), deny, allow, encode, and chop. Of these, strip, encode, and chop ensure that the URI field doesn’t contain any whitespace (thus adding more fields to access.log).

The allow setting allows the request to pass through Squid unmodified. It is likely to cause trouble for redirectors and log file parsers. The deny setting, on the other hand, causes Squid to deny the request. The user receives an error message, but the request is still written to access.log with the whitespace characters.

If you set it to encode, Squid changes the whitespace characters to their RFC 1738 equivalents. This is probably what the user-agent should have done in the first place. The chop setting causes Squid to cut off the URI at the first whitespace character.

The default setting is strip, which makes Squid remove the whitespace characters from the URI. It ensures that your log-file parsers and redirectors will be happy, but it might break certain things, such as improperly encoded search engine queries.

buffered_logs

By default, Squid disables buffering for the cache.log file, which allows you to run tail -f and watch log file entries appear in real time. If you think this will cause an unnecessary overhead, you can disable buffering:

buffered_logs off

However, it probably doesn’t matter unless you are running Squid with full debugging. Note that this option affects only cache.log. The others always use unbuffered writes.

access.log Analysis Tools

The access.log file contains a wealth of information, much more than you can see by just browsing through it. In order to get the big picture view, you’ll need to use a third-party log-file analysis package. You can find a long list of them linked from the Squid web page, or by going directly to http://www.squid-cache.org/Scripts/.

One of the most popular tools is Calamaris—a Perl script that parses the log file and generates either text or HTML-based reports. It provides a breakdown of traffic by request method, client IP address, origin server domain name, content types, filename extensions, reply size, and more. Calamaris also reports on ICP query traffic and even understands log files from other caching products. Check it out by visiting http://calamaris.cord.de/.

Squeezer, and its derivative, Squeezer2, are Squid-specific analysis tools. They provide many statistics that can help you understand Squid’s performance, especially when you have neighbors. Both generate HTML pages as output. Visit the Logfile Analysis page on the squid-cache.org site for links to these programs.

Webalyzer is another good utility. It is designed to be fast and produces HTML pages with tables and bar charts. It was originally designed for origin server access logs. Although it can parse Squid’s logs, it doesn’t report on such things as hit ratios and response times. It also uses some terms differently than I do. For example, Webalyzer calls any request a “hit,” which isn’t the same as a cache hit. It also makes a distinction between “pages” and “files.” For more information, visit the Webalyzer home page at http://www.mrunix.net/webalyzer/.

store.log

The store.log is a record of Squid’s decisions to store and remove objects from the cache. Squid creates an entry for each object it stores in the cache, each uncachable object, and each object that is removed by the replacement policy. The log file covers both in-memory and on-disk caches.

The store.log provides the following you can’t get from access.log:

Whether or not a particular response was cached.
The file number for cached objects. For UFS-based storage schemes, you can convert this to a pathname and examine the contents of the cache file.
The response’s content length: the Content-Length value, and the actual body length.
Values for the Date, Last-Modified, and Expires headers.
The response’s cache key (i.e., MD5 hash value).

As you can see, this is mostly low-level information you won’t need on a daily basis. Unless you do sophisticated analyses, or wish to debug a problem, you can probably get by without the store.log. You can disable it with a special setting:

cache_store_log none

As with other log files, Squid appends new store.log entries to the end of the file. A given URI may appear in the file more than once. For example, it gets cached, then released, then cached again. Only the most recent entry reflects the object’s current status.

The store.log is text-based and looks something like this:

1067299212.411 RELEASE -1 FFFFFFFF A5964B32245AC98592D83F9B6EA10B8D 206
    1067299212 1064287906 -1 application/octet-stream 6840/6840
    GET http://download.windowsupdate.com/msdownload/update/v3-19990518/cab...
1067299212.422 SWAPOUT 02 0005FD5F 6F34570785CACABC8DD01ABA5D73B392 200
    1067299210 1057899600 -1 image/gif 1125/1125
    GET http://forum.topsportsnet.com/shfimages/nav_members1.gif
1067299212.641 RELEASE -1 FFFFFFFF B0616CB4B7280F67672A40647DD08474 200
    1067299212 -1 -1 text/html -1/67191
    GET http://www.tlava.com/
1067299212.671 RELEASE -1 FFFFFFFF 5ECD93934257594825659B596D9444BC 200
    1067299023 1034873897 1067299023 image/jpeg 3386/3386
    GET http://ebiz0.ipixmedia.com/abc/ebiz/_EBIZ_3922eabf57d44e2a4c3e7cd234a...
1067299212.786 RELEASE -1 FFFFFFFF B388F7B766B307ADEC044A4099946A21 200
    1067297755 -1 -1 text/html -1/566
    GET http://www.evenflowrocks.com/pages/100303pic15.cfm
1067299212.837 RELEASE -1 FFFFFFFF ABC862C7107F3B7E9FC2D7CA01C8E6A1 304
    1067299212 -1 1067299212 unknown -1/0
    GET http://ebiz0.ipixmedia.com/abc/ebiz/_EBIZ_3922eabf57d44e2a4c3e7cd234a...
1067299212.859 RELEASE -1 FFFFFFFF 5ED2726D4A3AD83CACC8A01CFDD6082B 304
    1066940882 1065063803 -1 application/x-javascript -1/0
    GET http://www.bellsouth.com/scripts/header_footer.js

Each entry contains the following 13 fields:

1: timestamp

The timestamp when the event took place, expressed as seconds since the Unix epoch with millisecond resolution.

2: action

The action taken on the object. This field has three possible values: SWAPOUT, RELEASE, and SO_FAIL.

A SWAPOUT occurs when Squid successfully completes saving the object data to disk. Some objects, such as those that are negatively cached, are kept in memory, but not on disk. Squid doesn’t make a store.log entry for them.
A SO_FAIL entry indicates that Squid could not completely save the object to disk. Most likely it means that the storage scheme implementation refused to open a new disk file for writing.
A RELEASE occurs when Squid removes an object from the cache, or decides that the response isn’t cachable in the first place.

3: directory number

The directory number is a 7-bit index to the list of cache directories that’s written as a decimal number. For objects that aren’t saved to disk, this field contains the value -1.

4: file number

The file number is a 25-bit identifier used internally by Squid. It is written as an 8-character hexadecimal number. The UFS-based storage schemes have an algorithm for mapping file numbers to pathnames (see Section 13.3.1).

Objects that aren’t saved to disk don’t have a valid file number. For these, the file number field contains FFFFFFFF. This value appears only for RELEASE and SO_FAIL entries.

5: cache key

Squid uses MD5 hash values for the primary index to locate cached objects. The key is based on the request method, URI, and possibly other information.

You might be able to use the cache key to match up store.log entries. Note, however, that an object’s cache key can change. This happens, for example, whenever Squid logs a TCP_REFRESH_MISS request in access.log. It looks like this:

1065837334.045 SWAPOUT ... 554BACBD2CB2A0C38FF9BF4B2239A9E5 ... http://blah
1066031047.925 RELEASE ... 92AE17121926106EB12FA8054064CABA ... http://blah
1066031048.074 SWAPOUT ... 554BACBD2CB2A0C38FF9BF4B2239A9E5 ... http://blah

So what’s going on? The object is originally cached under one key (554B...). Some time later, Squid receives another request for the object and forwards a validation request to the origin server. When the response comes back with new content, Squid changes the cache key of the old object (to 92AE...) so that it can give the new object the correct key (554B...). The old object is then removed, and the new object is saved to disk.

6: status code

This field shows the HTTP status code of the response, just like access.log. See Table 13-1 for a list of status codes.

7: date

The value of the Date header in the HTTP response, expressed as seconds since the Unix epoch. The value -1 indicates an unparseable Date header, and -2 means the header was entirely absent.

8: last-modified

The value of the Last-Modified header in the HTTP response, expressed as seconds since the Unix epoch. The value -1 indicates an unparseable Last-Modified header, and -2 means the header was entirely absent.

9: expires

The value of the Expires header in the HTTP response, expressed as seconds since the Unix epoch. The value -1 indicates an unparseable Expires header, and -2 means the header was entirely absent.

10: content-type

The value of the Content-Type header in the HTTP response, excluding any media-type parameters. Squid inserts the value unknown if the Content-Type is missing.

11: content-length/size

This field contains two numbers, separated by a slash. The first is the value of the Content-Length header. A -1 indicates the Content-Length header is absent. The second is the actual size of the HTTP message body. You can use these two numbers to identify partially received responses and origin servers that incorrectly calculate the content length. In most cases, the two numbers are the same.

12: method

The HTTP request method for the object, as in access.log.

13: URI

The final field is the requested URI, as in access.log. This field also has the whitespace problem mentioned in the previous section. However, it is less worrisome here because you can safely ignore any extra fields.

For many of the RELEASE entries, you’ll see question marks (?) for the last eight fields. This is because most of those field values come from what Squid calls the MemObject structure. This structure is present only for objects that have just been received, or are being stored entirely in memory. Most of the objects in Squid’s cache don’t have a MemObject because they exist only on disk. For these, Squid puts question marks in the fields with missing information.

Mapping File Numbers to Pathnames

If you find you need to examine a particular cache file, you can, with some effort, turn a file number into a pathname. You’ll also need the directory number, and L1 and L2 values. In the Squid source code, the storeUfsDirFullPath( ) function does this. You can find it in the src/fs/ufs/store_dir_ufs.c file. This short Perl program mimics the current algorithm:

#!/usr/bin/perl
$L1 = 16;
$L2 = 256;
while (<>) {
    $filn = hex($_);
    printf("%02X/%02X/%08X\n",
        (($filn / $L2) / $L2) % $L1,
        ($filn / $L2) % $L2,
        $filn);
}

And here’s how you can use it:

% echo 000DCD06 | ./fileno-to-pathname.pl
0D/CD/000DCD06

To find this file in the Nth cache_dir, simply go to the corresponding directory and list or view the file:

% cd /cache2
% ls -l 0D/CD/000DCD06
-rw-------  1 squid  squid  391 Jun  3 12:40 0D/CD/000DCD06
% less 0D/CD/000DCD06

referer.log

The optional referer.log contains Referer header values from client requests. To use this feature, you must run ./configure with the —enable-referer-log option. You must also enter a pathname for the referer_log directive. For example:

referer_log /usr/local/squid/var/logs/referer.log

Set the filename to none if you want to disable referer logging.

The Referer header normally contains the URI from which the request was obtained (see Section 14.36 of RFC 2616). For example, when a web browser issues a request for an embedded image, the Referer header is set to the URI of the (HTML) page containing the images. It is also set when you click on an HTML link. Some web site operators use Referer values to find so-called dead links. You may find referer.log particularly useful if you use Squid as a surrogate.

The referer.log has a simple format, with only four fields. Here are a few examples:

1068047502.377 3.0.168.206
    http://www.amazon.com/exec/obidos/search-handle-form/002-7230223-8205634
    http://www.amazon.com/exec/obidos/ASIN/0596001622/qid=1068047396/sr=2-1/...
1068047503.109 3.0.168.206
    http://www.amazon.com/exec/obidos/ASIN/0596001622/qid=1068047396/sr=2-1/...
    http://g-images.amazon.com/images/G/01/gourmet/gourmet-segway.gif
1068047503.196 3.0.168.206
    http://www.amazon.com/exec/obidos/ASIN/0596001622/qid=1068047396/sr=2-1/...
    http://g-images.amazon.com/images/G/01/marketing/cross-shop/arnold/appar...
1068047503.198 3.0.168.206
    http://www.amazon.com/exec/obidos/ASIN/0596001622/qid=1068047396/sr=2-1/...
    http://g-images.amazon.com/images/G/01/marketing/cross-shop/arnold/appar...
1068047503.825 3.0.168.206
    http://www.amazon.com/exec/obidos/ASIN/0596001622/qid=1068047396/sr=2-1/...
    http://images.amazon.com/images/P/B00005R8BC.01.TZZZZZZZ.jpg
1068047503.842 3.0.168.206
    http://www.amazon.com/exec/obidos/ASIN/0596001622/qid=1068047396/sr=2-1/...
    http://images.amazon.com/images/P/0596001622.01._PE_PI_SCMZZZZZZZ_.jpg

Note that requests that lack a Referer header aren’t logged. The four fields are as follows:

1: timestamp: The time of the request, expressed as the number of seconds since Unix epoch with millisecond resolution.

Note that, unlike access.log, a referer.log entry is made as soon as Squid receives the complete request. Thus, the referer.log entry occurs before the access.log, which waits for the end of the response.

2: client address: The same as the client address in access.log. The log_fqdn and client_netmask directives affect this log file as well.
3: referer: The value of the Referer header from the client’s request. Note that the referer value might have whitespace (or any other) characters. Squid doesn’t encode the value before writing to referer.log.
4: URI: The URI that the client is requesting. It matches the URI in access.log.

useragent.log

The optional useragent.log contains User-Agent header values from client requests. To use this feature, you must supply the —enable-useragent-log option when running ./configure. You also must enter a pathname for the useragent_log directive. For example:

useragent_log /usr/local/squid/var/logs/useragent.log

The User-Agent header normally contains a description of the agent that made the request. In most cases, the description is simply a list of product names with optional version information. You should be aware that applications can easily provide false user-agent information. Modern user-agents provide a way to customize the description. Even Squid can alter the User-Agent header in forwarded requests.

The useragent.log format is relatively simple. It looks like this:

3.0.168.206 [05/Nov/2003:08:51:43 -0700]
    "Mozilla/5.0 (compatible; Konqueror/3; FreeBSD)"
3.0.168.207 [05/Nov/2003:08:52:18 -0700]
    "Opera/7.21 (X11; FreeBSD i386; U)  [en]"
4.241.144.204 [05/Nov/2003:08:55:11 -0700]
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us) AppleWebKit/103u (KHTM..."
3.0.168.206 [05/Nov/2003:08:51:43 -0700]
    "Java1.3.1_01"
64.68.82.28 [05/Nov/2003:08:52:50 -0700]
    "Googlebot/2.1 (http://www.googlebot.com/bot.html)"
3.0.168.205 [05/Nov/2003:08:52:50 -0700]
    "WebZIP/4.1 (http://www.spidersoft.com)"
4.241.144.201 [05/Nov/2003:08:52:50 -0700]
    "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt; Hotbar 3.0)"
3.0.168.206 [05/Nov/2003:08:54:40 -0700]
    "Bookmark Renewal Check Agent [http://www.bookmark.ne.jp/] (Version 2.0..."

Unlike the other log files, it has just three fields:

1: client address

The same as the client address in access.log. The log_fqdn and client_netmask directives affect this log file as well.

2: timestamp

Unlike the other log files, which represent the time as seconds since the Unix epoch, this one uses a human-readable format. It is the HTTP common log-file format timestamp, which looks like this:

[10/Jun/2003:22:38:36 -0600]

Note that the square brackets delimit the timestamp, which includes a space character. Also note that, like referer.log, these entries are created as soon as Squid receives the complete request.

3: user-agent

The value of the User-Agent header. These strings almost always contain whitespace. Squid doesn’t encode User-Agent values before writing them in this log file.

swap.state

A swap.state file is a journal of objects that have been added to, and removed from, a cache directory. Each cache_dir has its own swap.state file. When Squid starts up, it reads the swap.state files to rebuild its in-memory indexes of cached objects. These files are a relatively critical part of Squid’s operation.

By default, each swap.state file is located in its corresponding cache directory. Thus, each state file automatically stays with each cache_dir. This is important if you ever decide to reorder your cache_dir lines or if you remove one or more from the list.

If you prefer to put them in a different location, you can use the cache_swap_log directive:

cache_swap_log /usr/local/squid/var/logs/swap.state

In this case, Squid creates a swap.state file for each directory by appending a numeric suffix. For example, if you have four cache directories, Squid creates the following:

/usr/local/squid/var/logs/swap.state.00
/usr/local/squid/var/logs/swap.state.01
/usr/local/squid/var/logs/swap.state.02
/usr/local/squid/var/logs/swap.state.03

In this situation, if you add, remove, or rearrange cache_dir lines, you may need to rename the swap.state files manually to keep everything consistent.

Technically, the swap.state format is storage scheme-dependent. However, all storage schemes use the same format in the current versions of Squid. The swap.state file uses a fixed-size (48-byte) binary format. Fields are written in host-byte order and are thus not necessarily portable between different operating systems. Table 13-3 describes the fields of a swap.state entry.

Table 13-3. swap.state entry fields

Name	Size, in bytes	Description
op	1	Operation on the entry: added (1) or deleted (2).
file number	4	Same as the fourth field of store.log, except it is stored in binary.
timestamp	4	A timestamp corresponding to the time when the response was generated or last validated. Taken from the `Date` header for responses that have one. Stored as the number of seconds since the Unix epoch.
lastref	4	A timestamp corresponding to the most recent access to the object.
expires	4	The object’s expiration time, taken from an `Expires` header or `Cache-Control` `max-age` directive.
last-modified	4	The object’s `Last-Modified` value.
swap file size	4	The amount of space the object occupies on disk. This includes HTTP headers and other Squid-specific meta-information.
refcount	2	The number of times this object has been requested.
flags	2	Various internal flags used by Squid.
key	16	The MD5 hash of the corresponding URI. Same as the key in store.log, except this one is stored in binary.

Rotating the Log Files

Squid always appends new entries to its log files. If your cache is busy, some of these files can become very large after a few days. Some operating systems even place limits on the size of a file (e.g., 2 GB) and return an error for writes beyond that size. To keep your log files manageable, and Squid happy, you must regularly rotate them.

Squid has a built-in feature for rotating log files. You can invoke it with the squid -k rotate command. You then tell Squid how many old copies of each file to keep with the logfile_rotate directive. For example, if you set it to 7, you’ll have eight versions of each log file: the current file and seven old ones.

Old log files are renamed with numeric extensions. For example, when you execute a rotation, Squid renames log.6 to log.7, then log.5 to log.6, and so on. The current log becomes log.0, and Squid creates a new, empty file named log.

Each time you execute squid -k rotate, Squid rotates the following files: cache.log, access.log, store.log, useragent.log (if enabled), and referer.log (if enabled). Squid also creates up-to-date versions of the swap.state files. Note, however, that swap.state isn’t archived with numeric extensions.

Squid doesn’t rotate the log files automatically. The best way to make it happen is with a daily cron job. For example:

0 0 * * * /usr/local/squid/sbin/squid -k rotate

If you’d rather write your own scripts to manage the log files, Squid has a special mode that you’ll find useful. Simply set the logfile_rotate directive to 0. Then, when you run squid -k rotate, Squid simply closes the current log files and opens new ones. This is very useful when the operating system allows you to rename files opened by another process. The following shell script illustrates the idea:

#!/bin/sh
set -e

yesterday_secs=`perl -e 'print time -43200'`
yesterday_date=`date -r $yesterday_secs +%Y%m%d`

cd /usr/local/squid/var/logs

# rename the current log file without interrupting the logging process
mv access.log access.log.$yesterday_date

# tell Squid to close the current logs and open new ones
/usr/local/squid/sbin/squid -k rotate

# give Squid some time to finish writing swap.state files
sleep 60

mv access.log.$yesterday_date /archive/location/
gzip -9 /archive/location/access.log.$yesterday_date

Privacy and Security

Squid’s log files, especially access.log, contain a record of users’ activities and, hence, are subject to privacy concerns. As the Squid administrator, you should take every precaution to keep the log files safe and secure. One of the best ways to do that is limit the number of people who have access to the system on which Squid runs. If that isn’t possible, carefully examine the file and directory permissions to make sure they can’t be viewed by untrusted or unauthorized users.

You can also help protect your users’ privacy by taking advantage of the client_netmask and strip_query_terms directives. The former makes it harder to identify individual users in the access.log; the latter removes URI query terms that may contain personal information. See Section 13.2.4 for more information.

You may also want to develop a policy for keeping old log files. Obviously access.log helps keep users accountable for their activities, but how far back would you ever need to go searching for something? A week? A year? What would you do if presented with a court order to hand over your log files for the last three months?

If you like to keep historical data for a long time, perhaps you can make the log files anonymous or somehow reduce the dataset. If you are interested only in which URIs were accessed, but not by whom, you can extract only that field from access.log. This not only makes the file smaller, it also reduces the risk of a privacy violation. Another technique is to randomize the client IP addresses. In other words, create a filter that maps real IP addresses to fake ones, such that the same real address is always changed to the same fake address. If you are using RFC 1413 identification or HTTP authentication, consider making those fields anonymous as well.

Exercises

Configure Squid so that it doesn’t create any log files, except for the swap.state file(s).
Write a simple Perl or awk script to calculate your cache hit ratio from access.log.
How does an “access denied” response appear in the access.log?
Does store.log have the same number of, more, or fewer, entries than access.log?
Take a file number from store.log and find the corresponding file in the disk cache. Examine the file and make sure you’ve found the correct response.
Develop and implement a policy for archiving old cache log files. Consider where and how they will be stored, for how long, and who has permission to access them.

^[1]No, this isn’t a typo. “Referer” has been historically misspelled by HTTP developers.

Previous Chapter

12. Authentication Helpers

Next Chapter

14. Monitoring Squid

Table of Contents for Squid: The Definitive Guide