Regular Expressions Cookbook, 2nd Edition
by Steven Levithan
Published by
O'Reilly Media, Inc., 2012
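"(?:GET|POST)●(?<file>[^#?●"]+)(?:[#?][^●"]*)?●HTTP/[0-9.]+"●404●↵
(?:[0-9]+|-)●"(?<referrer>http://www\.yoursite\.com[^"]*)"
| Regex options: None |
| Regex flavors: .NET, Java 7, XRegExp, PCRE 7, Perl 5.10, Ruby 1.9 |
"(?:GET|POST)●(?P<file>[^#?●"]+)(?:[#?][^●"]*)?●HTTP/[0-9.]+"●404●↵
(?:[0-9]+|-)●"(?P<referrer>http://www\.yoursite\.com[^"]*)"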
| Regex options: None |
| Regex flavors: PCRE 4, Perl 5.10, Python |
"(?:GET|POST)●([^#?●"]+)(?:[#?][^●"]*)?●HTTP/[0-9.]+"●404●↵
(?:[0-9]+|-)●"(http://www\.yoursite\.com[^"]*)"
| Regex options: None |
| Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
When a visitor clicks a link on your website that points to a file on your own site that does not exist, the visitor gets a “page not found” error. Your web server will write an entry in its log that contains the file that does not exist as the requested object, status code 404, and the page that contains the broken link as the referrer. So you need to extract the requested object and the referrer from log entries that have status code 404 and a referring URL on your own website.
One way to do this would be to use your favorite programming
language to write a script that applies the regular expression from the
“Combined Log Format” recipe to the log. While iterating over all the
matches, check whether the “status” group captured 404 and whether the
“referrer” group’s match begins with http://www.yoursite.com. If both are true, output
the text matched by the “referrer” and “request” groups to indicate the
broken link. This is a perfectly good solution. The benefit is that you
can expand the script to do any other checks you may want to
perform.
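As a rough illustration of that script-based approach, here is a minimal Python sketch. The general Combined Log Format pattern in it is an assumption: it is reconstructed from the field layout used in this recipe, with any three-digit status and any referrer allowed, and rewritten with Python-style named capture and literal spaces in place of ●. The function name and the sample log entries are invented for illustration.

```python
import re

# Assumed reconstruction of a general Combined Log Format regex: the same
# fields as in this recipe, but accepting any status code and any referrer.
COMBINED = re.compile(
    r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)"',
    re.MULTILINE,  # ^ matches at the start of every line
)

SITE = "http://www.yoursite.com"  # placeholder domain used in this recipe


def broken_links(log_text):
    """Yield (referrer, request) pairs for 404s referred from our own site."""
    for m in COMBINED.finditer(log_text):
        if m.group("status") == "404" and m.group("referrer").startswith(SITE):
            yield m.group("referrer"), m.group("request")


# Hypothetical sample entries, invented for illustration.
SAMPLE_LOG = (
    '192.0.2.1 - - [10/Oct/2012:13:55:36 -0700] "GET /missing.html HTTP/1.1" '
    '404 512 "http://www.yoursite.com/index.html" "Mozilla/5.0"\n'
    '192.0.2.2 - - [10/Oct/2012:13:56:01 -0700] "GET /index.html HTTP/1.1" '
    '200 2326 "http://www.example.org/" "Mozilla/5.0"\n'
)

for referrer, request in broken_links(SAMPLE_LOG):
    print(referrer, "->", request)
```

Only the first sample entry passes both checks. Further checks, such as filtering by date or user agent, could be added inside the loop without touching the regex.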
The stated problem for this recipe, however, can be handled easily with just one regular expression, without any procedural code. You could open the log file in the same text editor you use to edit your website and use the regular expression presented as the solution to find the 404 errors that indicate broken links on your own site. This regular expression is derived from the regex shown in the “Combined Log Format” recipe. We’ll explain the process for the variant using .NET-style named capture. The variants using Python-style named capture and numbered capture are the same, except for the syntax used for the capturing groups.
Starting from that regex, we only had to make the “status” group match nothing but 404 and make the “referrer” group require a URL on your own site:
^(?<client>\S+)●\S+●(?<userid>\S+)●\[(?<datetime>[^\]]+)\]↵
●"(?<method>[A-Z]+)●(?<request>[^●"]+)?●HTTP/[0-9.]+"●(?<status>404)↵
●(?<size>[0-9]+|-)●"(?<referrer>http://www\.yoursite\.com[^"]*)"↵
●"(?<useragent>[^"]*)"
| Regex options: ^ and $ match at line breaks |
| Regex flavors: .NET, Java 7, XRegExp, PCRE 7, Perl 5.10, Ruby 1.9 |
The regular expression just shown already solves the problem. But it is not as efficient as it could be. It matches the entire log entry, but we only need the “request,” “status,” and “referrer” groups. The “useragent” group does not affect the match at all, so we can easily cut that off:
^(?<client>\S+)●\S+●(?<userid>\S+)●\[(?<datetime>[^\]]+)\]↵
●"(?<method>[A-Z]+)●(?<request>[^●"]+)?●HTTP/[0-9.]+"●(?<status>404)↵
●(?<size>[0-9]+|-)●"(?<referrer>http://www\.yoursite\.com[^"]*)"
| Regex options: ^ and $ match at line breaks |
| Regex flavors: .NET, Java 7, XRegExp, PCRE 7, Perl 5.10, Ruby 1.9 |
We cannot cut off the groups “client” through “method” so easily.
These groups anchor the regex to the start of the line, making sure that
the “request” through “referrer” groups match the right fields in the
log. If we want to remove some of the groups at the start of the regex,
we need to make sure that the regex will still match only the fields
that we want. For our web logs, this is not a big issue. Most of the
fields have unique content, and our regular expression is sufficiently
detailed. Our regular expression explicitly requires enclosing brackets
and quotes for the entries that have them, allows only numbers for
numeric fields, matches fixed text such as “HTTP” exactly, and so on.
Had we been lazy and used ‹\S+› to match all of the fields, then we would not
be able to efficiently shorten the regex any further, as ‹\S+› matches pretty much
anything.
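To see why loosely specified fields make trimming risky, consider the following Python sketch. The `LAZY` pattern is a hypothetical front-trimmed regex that uses ‹\S+› for every field. It is not taken from this recipe. The `DETAILED` pattern is the numbered-capture solution with literal spaces in place of ●. The sample entry is invented: it has status 200 and a response size that happens to be 404.

```python
import re

# Hypothetical "lazy" pattern: every field is just a run of non-whitespace.
LAZY = re.compile(r'(\S+) \S+ 404 \S+ "(\S+)"')

# Numbered-capture solution from this recipe, with literal spaces for ●.
DETAILED = re.compile(
    r'"(?:GET|POST) ([^#? "]+)(?:[#?][^ "]*)? HTTP/[0-9.]+" 404 '
    r'(?:[0-9]+|-) "(http://www\.yoursite\.com[^"]*)"'
)

# Invented entry: the status is 200, but the size field is 404.
LINE = (
    '192.0.2.7 - - [10/Oct/2012:14:02:11 -0700] "GET /index.html HTTP/1.1" '
    '200 404 "http://www.yoursite.com/" "Mozilla/5.0"'
)

lazy_match = LAZY.search(LINE)
print(lazy_match.groups())    # the protocol and user agent, not request and referrer
print(DETAILED.search(LINE))  # None: this entry is not a 404 error
```

Without the leading fields to hold it in place, the lazy pattern slides one field to the right and reports a broken link that does not exist. The detailed pattern rejects the entry because the quote, the protocol, and the status code must each appear in the right position.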
We also need to make sure the regular expression remains
efficient. The caret at the start of the regex makes sure that the regex
is attempted only at the start of each line. If it fails to match a
line, because the status code is not 404 or the referrer is on another
domain, the regex immediately skips ahead to the next line in the log.
If we were to cut off everything before the ‹(?<request>[^●"]+)?› group, our regex would begin with
‹[^●"]+›. The regex engine would go through
its matching process at every character in the whole log file that is
not a space or a double quote. That would make the regex very slow on
large log files.
A good point to trim this regex is before ‹"(?<method>[A-Z]+)›. To further enhance
efficiency, we also spell out the two request methods we’re interested
in:
"(?<method>GET|POST)●(?<request>[^●"]+)?●HTTP/[0-9.]+"●(?<status>404)↵
●(?<size>[0-9]+|-)●"(?<referrer>http://www\.yoursite\.com[^"]*)"
| Regex options: ^ and $ match at line breaks |
| Regex flavors: .NET, Java 7, XRegExp, PCRE 7, Perl 5.10, Ruby 1.9 |
This regular expression begins with literal double quotes. Regular
expressions that begin with literal text tend to be very efficient
because regular expression engines are usually optimized for this case.
Each entry in our log has six double-quote characters. Thus the regular
expression will be attempted only six times on each log entry that is
not a 404 error. Five times out of six, the attempt will fail almost
immediately when ‹GET|POST› fails to match right after the double
quote. Though six match attempts per line may seem less efficient than
one match attempt, immediately failing with ‹GET|POST› is quicker than having to match ‹^(?<client>\S+)●\S+●(?<userid>\S+)●\[(?<datetime>[^\]]+)\]●›.
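The quote-counting argument can be checked on a single entry. The Python sketch below only models the reasoning: it lists the positions where a regex starting with a literal double quote could begin, then tests which of them survive the ‹GET|POST› check. It does not measure what a regex engine does internally, and the sample entry is invented for illustration.

```python
import re

# Invented log entry in Combined Log Format.
LINE = (
    '192.0.2.1 - - [10/Oct/2012:13:55:36 -0700] "GET /missing.html HTTP/1.1" '
    '404 512 "http://www.yoursite.com/index.html" "Mozilla/5.0"'
)

# Every double quote is a position where the trimmed regex could start.
quote_positions = [i for i, ch in enumerate(LINE) if ch == '"']

# Of those, keep the positions where a request method follows the quote.
prefix = re.compile(r'"(?:GET|POST)')
surviving = [i for i in quote_positions if prefix.match(LINE, i)]

print(len(quote_positions))  # 6 candidate start positions
print(len(surviving))        # 1 position gets past GET|POST
```

The other five candidates are rejected after examining at most one character beyond the quote, which is far less work than matching the client, user ID, and timestamp fields first.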
The last optimization is to eliminate the capturing groups that we do not use. Some can be removed completely. The ones containing an alternation operator can be replaced with noncapturing groups. This gives us the regular expression presented as the solution at the start of this recipe.
We left the “file” and “referrer” capturing groups in the final regular expression. When using this regular expression in a text editor or grep tool that can collect the text matched by capturing groups in a regular expression, you can set your tool to collect just the text matched by the “file” and “referrer” groups. That will give you a list of broken links and the pages on which they occur, without any unnecessary information.
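If your tool of choice is a scripting language rather than an editor or grep tool, the same collection step might look like the following Python sketch. It uses the Python-style named-capture variant of the solution, with literal spaces in place of ●. The function name and the sample entries are invented for illustration.

```python
import re

# Python-style named-capture variant of the solution regex.
BROKEN_LINK = re.compile(
    r'"(?:GET|POST) (?P<file>[^#? "]+)(?:[#?][^ "]*)? HTTP/[0-9.]+" 404 '
    r'(?:[0-9]+|-) "(?P<referrer>http://www\.yoursite\.com[^"]*)"'
)


def collect_broken_links(log_text):
    """Return a list of (file, referrer) pairs, one per matching log entry."""
    return [
        (m.group("file"), m.group("referrer"))
        for m in BROKEN_LINK.finditer(log_text)
    ]


# Hypothetical sample entries, invented for illustration.
SAMPLE_LOG = (
    '192.0.2.1 - - [10/Oct/2012:13:55:36 -0700] "GET /old/page.html?id=7 HTTP/1.1" '
    '404 - "http://www.yoursite.com/links.html" "Mozilla/5.0"\n'
    '192.0.2.3 - - [10/Oct/2012:13:57:12 -0700] "GET /gone.html HTTP/1.1" '
    '404 512 "http://www.example.org/" "Mozilla/5.0"\n'
)

for file, referrer in collect_broken_links(SAMPLE_LOG):
    print(file, "linked from", referrer)
```

The second entry is skipped because its referrer is on another domain. In the first entry, the “file” group stops before the query string, so only the path of the missing file is collected.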
The “Common Log Format” recipe explains how to match web log entries with a regular expression. It also has references to Chapter 2, where you can find explanations of the regex syntax used in this recipe.