Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Regular Expressions Cookbook, 2nd Edition by Steven Levithan Published by O'Reilly Media, Inc., 2012

Discussion

When a visitor clicks a link on your website that points to a file on your own site that does not exist, the visitor gets a “page not found” error. Your web server will write an entry in its log that contains the file that does not exist as the requested object, status code 404, and the page that contains the broken link as the referrer. So you need to extract the requested object and the referrer from log entries that have status code 404 and a referring URL on your own website.

One way to do this would be to use your favorite programming language to write a script that applies the regular expression from the Combined Log Format recipe to the log. While iterating over all the matches, check whether the “status” group captured 404 and whether the “referrer” group’s match begins with http://www.yoursite.com. If both checks pass, output the text matched by the “referrer” and “request” groups to indicate the broken link. This is a perfectly good solution. The benefit is that you can expand the script to do any other checks you may want to perform.
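
The scripted approach might look like the following minimal Python sketch. It assumes a general Combined Log Format regex with the same group names used in this recipe, written with Python-style named capture and with the status and referrer left unrestricted. The function name `find_broken_links` and the sample log lines are made up for illustration.

```python
import re

# Assumed general Combined Log Format regex: same group names as this recipe,
# but status and referrer are unrestricted so the script does the filtering.
LOG_ENTRY = re.compile(
    r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)"',
    re.MULTILINE,  # ^ matches at line breaks
)

SITE_PREFIX = "http://www.yoursite.com"  # placeholder domain from the text


def find_broken_links(log_text):
    """Return (referrer, request) pairs for 404s referred from our own site."""
    broken = []
    for match in LOG_ENTRY.finditer(log_text):
        if (match.group("status") == "404"
                and match.group("referrer").startswith(SITE_PREFIX)):
            broken.append((match.group("referrer"), match.group("request")))
    return broken


# Hypothetical log entries: only the first is a broken link on our own site.
sample_log = "\n".join([
    '192.0.2.1 - - [10/Oct/2012:13:55:36 -0700] "GET /missing.html HTTP/1.1" '
    '404 512 "http://www.yoursite.com/index.html" "ExampleBrowser/1.0"',
    '192.0.2.2 - - [10/Oct/2012:13:56:01 -0700] "GET /index.html HTTP/1.1" '
    '200 2326 "http://www.yoursite.com/" "ExampleBrowser/1.0"',
    '192.0.2.3 - - [10/Oct/2012:13:57:12 -0700] "GET /old.html HTTP/1.1" '
    '404 512 "http://www.example.org/links.html" "ExampleBrowser/1.0"',
])

for referrer, request in find_broken_links(sample_log):
    print(f"{referrer} links to missing {request}")
```

Keeping the two checks in procedural code is what makes this variant easy to extend, for example to report other status codes as well.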

The stated problem for this recipe, however, can be handled easily with just one regular expression, without any procedural code. You could open the log file in the same text editor you use to edit your website and use the regular expression presented as the solution to find the 404 errors that indicate broken links on your own site. This regular expression is derived from the regex shown in the Combined Log Format recipe. We’ll explain the process for the variant using .NET-style named capture. The variants using Python-style named capture and numbered capture are the same, except for the syntax used for the capturing groups.

We only had to restrict the “status” group to 404 errors and make the “referrer” group check that the domain is on your own site:

^(?<client>\S+)●\S+●(?<userid>\S+)●\[(?<datetime>[^\]]+)\]↵ ●"(?<method>[A-Z]+)●(?<request>[^●"]+)?●HTTP/[0-9.]+"●(?<status>404)↵ ●(?<size>[0-9]+|-)●"(?<referrer>http://www\.yoursite\.com[^"]*)"↵ ●"(?<useragent>[^"]*)"
Regex options: ^ and $ match at line breaks
Regex flavors: .NET, Java 7, XRegExp, PCRE 7, Perl 5.10, Ruby 1.9
The regular expression just shown already solves the problem. But it is not as efficient as it could be. It matches the entire log entry, but we only need the “request,” “status,” and “referrer” groups. The “useragent” group does not affect the match at all, so we can easily cut that off:
^(?<client>\S+)●\S+●(?<userid>\S+)●\[(?<datetime>[^\]]+)\]↵ ●"(?<method>[A-Z]+)●(?<request>[^●"]+)?●HTTP/[0-9.]+"●(?<status>404)↵ ●(?<size>[0-9]+|-)●"(?<referrer>http://www\.yoursite\.com[^"]*)"
Regex options: ^ and $ match at line breaks
Regex flavors: .NET, Java 7, XRegExp, PCRE 7, Perl 5.10, Ruby 1.9
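
To see that dropping the “useragent” group leaves the fields we care about untouched, here is a small Python sketch. It rewrites both regexes with Python-style named capture, `(?P<name>...)`, and with the ● space markers typed as ordinary spaces. The log entry is a made-up example.

```python
import re

# The regex without the "useragent" group, in Python-style named capture.
HEAD = (
    r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" (?P<status>404) '
    r'(?P<size>[0-9]+|-) "(?P<referrer>http://www\.yoursite\.com[^"]*)"'
)
WITHOUT_USERAGENT = re.compile(HEAD, re.MULTILINE)
WITH_USERAGENT = re.compile(HEAD + r' "(?P<useragent>[^"]*)"', re.MULTILINE)

# Hypothetical log entry for a broken link on our own site.
entry = (
    '192.0.2.1 - - [10/Oct/2012:13:55:36 -0700] "GET /missing.html HTTP/1.1" '
    '404 512 "http://www.yoursite.com/index.html" "ExampleBrowser/1.0"'
)

full = WITH_USERAGENT.search(entry)
short = WITHOUT_USERAGENT.search(entry)

# Compare only the three groups the recipe actually needs.
fields = ("request", "status", "referrer")
same = all(full.group(name) == short.group(name) for name in fields)
print(same)
```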
We cannot cut off the groups “client” through “method” so easily. These groups anchor the regex to the start of the line, making sure that the “request” through “referrer” groups match the right fields in the log. If we want to remove some of the groups at the start of the regex, we need to make sure that the regex will still match only the fields that we want. For our web logs, this is not a big issue. Most of the fields have unique content, and our regular expression is sufficiently detailed. Our regular expression explicitly requires enclosing brackets and quotes for the entries that have them, allows only numbers for numeric fields, matches fixed text such as “HTTP” exactly, and so on. Had we been lazy and used ‹\S+› to match all of the fields, then we would not be able to efficiently shorten the regex any further, as ‹\S+› matches pretty much anything.
We also need to make sure the regular expression remains efficient. The caret at the start of the regex makes sure that the regex is attempted only at the start of each line. If it fails to match a line, because the status code is not 404 or the referrer is on another domain, the regex immediately skips ahead to the next line in the log. If we were to cut off everything before the ‹(?<request>[^●"]+)?› group, our regex would begin with ‹[^●"]+›. The regex engine would go through its matching process at every character in the whole log file that is not a space or a double quote. That would make the regex very slow on large log files.
A good point to trim this regex is before ‹"(?<method>[A-Z]+)›. To further enhance efficiency, we also spell out the two request methods we’re interested in:
"(?<method>GET|POST)●(?<request>[^●"]+)?●HTTP/[0-9.]+"●(?<status>404)↵ ●(?<size>[0-9]+|-)●"(?<referrer>http://www\.yoursite\.com[^"]*)"
Regex options: ^ and $ match at line breaks
Regex flavors: .NET, Java 7, XRegExp, PCRE 7, Perl 5.10, Ruby 1.9
This regular expression begins with a literal double quote. Regular expressions that begin with literal text tend to be very efficient because regular expression engines are usually optimized for this case. Each entry in our log has six double-quote characters. Thus the regular expression will be attempted only six times on each log entry that is not a 404 error. Five times out of six, the attempt will fail almost immediately when ‹GET|POST› fails to match right after the double quote. Though six match attempts per line may seem less efficient than one match attempt, immediately failing with ‹GET|POST› is quicker than having to match ‹^(?<client>\S+)●\S+●(?<userid>\S+)●\[(?<datetime>[^\]]+)\]●›.
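
The counting argument can be checked with a short Python sketch. It treats every double quote as a possible starting position for the trimmed regex, which is a simplified model of what an optimized engine does rather than a description of any particular engine. The log entry is made up, and the regex uses Python-style named capture.

```python
import re

TRIMMED = re.compile(
    r'"(?P<method>GET|POST) (?P<request>[^ "]+)? HTTP/[0-9.]+" (?P<status>404) '
    r'(?P<size>[0-9]+|-) "(?P<referrer>http://www\.yoursite\.com[^"]*)"'
)

# Hypothetical log entry with status 200, so the regex must fail on it.
entry = (
    '192.0.2.2 - - [10/Oct/2012:13:56:01 -0700] "GET /index.html HTTP/1.1" '
    '200 2326 "http://www.yoursite.com/" "ExampleBrowser/1.0"'
)

# Every double quote is a position where a match attempt can start.
quote_positions = [i for i, ch in enumerate(entry) if ch == '"']

# Attempts that get past GET|POST right after the opening quote.
survivors = [i for i in quote_positions
             if entry.startswith(("GET", "POST"), i + 1)]

print(len(quote_positions))   # double quotes in the entry
print(len(survivors))         # attempts that do not fail at GET|POST
print(TRIMMED.search(entry))  # None: the status is 200, not 404
```

For this entry the sketch finds six double quotes, and only the one that opens the request field is followed by GET or POST. The other five attempts are rejected after looking at a single character.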
The last optimization is to eliminate the capturing groups that we do not use. Some can be removed completely. The ones containing an alternation operator can be replaced with noncapturing groups. This gives us the regular expression presented in the Solution section.
We left the “file” and “referrer” capturing groups in the final regular expression. When using this regular expression in a text editor or grep tool that can collect the text matched by capturing groups in a regular expression, you can set your tool to collect just the text matched by the “file” and “referrer” groups. That will give you a list of broken links and the pages on which they occur, without any unnecessary information.
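
If your tool is a script rather than an editor, collecting just those two groups might look like this Python sketch. The pattern is an assumption: it is derived from the trimmed regex in this discussion by removing the unused groups, making the alternations noncapturing, and naming the request group “file”, so it may differ in detail from the regex in the Solution. The function name and log entries are made up.

```python
import re

# Assumption: trimmed regex with unused groups removed and the alternations
# turned into noncapturing groups; only "file" and "referrer" capture.
BROKEN_LINK = re.compile(
    r'"(?:GET|POST) (?P<file>[^ "]+)? HTTP/[0-9.]+" 404 '
    r'(?:[0-9]+|-) "(?P<referrer>http://www\.yoursite\.com[^"]*)"'
)


def broken_links(log_text):
    """Collect (file, referrer) pairs and discard the rest of each match."""
    return [(m.group("file"), m.group("referrer"))
            for m in BROKEN_LINK.finditer(log_text)]


# Hypothetical log: two broken links on our own site and one successful hit.
sample_log = "\n".join([
    '192.0.2.1 - - [10/Oct/2012:13:55:36 -0700] "GET /missing.html HTTP/1.1" '
    '404 512 "http://www.yoursite.com/index.html" "ExampleBrowser/1.0"',
    '192.0.2.2 - - [10/Oct/2012:13:56:01 -0700] "GET /index.html HTTP/1.1" '
    '200 2326 "http://www.yoursite.com/" "ExampleBrowser/1.0"',
    '192.0.2.4 - - [10/Oct/2012:13:58:40 -0700] "GET /gone.png HTTP/1.1" '
    '404 - "http://www.yoursite.com/about.html" "ExampleBrowser/1.0"',
])

for file, referrer in broken_links(sample_log):
    print(f"{referrer} -> {file}")
```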
