Regular Expressions Cookbook, 2nd Edition
by Steven Levithan
Published by O'Reilly Media, Inc., 2012
| Regex options: None |
| Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
To match previously matched text later in a regex, we
first have to capture the previous text. We do that with a capturing
group, as shown in Recipe 2.9. After that, we can
match the same text anywhere in the regex using a backreference. You can reference
the first nine capturing groups with a backslash followed by a single
digit one through nine. For groups 10 through 99, use ‹\10› to ‹\99›.
Do not use ‹\01›.
That is either an octal escape or an error. We don’t use octal escapes
in this book at all, because the ‹\xFF› hexadecimal escapes are much easier to
understand.
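As a minimal sketch of the numbering rules above, the following Python snippet uses the standard `re` module. The ten-group pattern and the test strings are made up for illustration. It shows ‹\10› referring to the tenth group, and ‹\01› being read by Python as an octal escape for the character U+0001 rather than as a backreference.

```python
import re

# Ten capturing groups followed by a backreference to group 10.
ten_groups = re.compile(r"(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\10")
m = ten_groups.fullmatch("abcdefghijj")
tenth = m.group(10)  # text captured by the tenth group

# In Python, \01 is an octal escape (the character U+0001),
# not a backreference to group 1.
octal = re.compile(r"(a)\01")
matches_control_char = octal.fullmatch("a\x01") is not None
matches_doubled_a = octal.fullmatch("aa") is not None

print(tenth, matches_control_char, matches_doubled_a)
```

Other flavors may treat ‹\01› differently, which is one more reason to avoid it.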
When the regular expression ‹\b\d\d(\d\d)-\1-\1\b› encounters 2008-08-08, the first
‹\d\d› matches 20. The regex engine then
enters the capturing group, noting the position reached in the subject
text.
The ‹\d\d› inside
the capturing group matches 08, and the engine reaches the group’s
closing parenthesis. At this point, the partial match 08 is stored in capturing
group 1.
The next token is the hyphen, which matches literally. Then comes
the backreference. The regex engine checks the contents of the first
capturing group: 08. The engine tries to match this text
literally. If the regular expression is case-insensitive, the captured
text is also matched case-insensitively. Here, the backreference succeeds. The next
hyphen and backreference also succeed. Finally, the word boundary
matches at the end of the subject text, and an overall match is found:
2008-08-08. The
capturing group still holds 08.
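The walkthrough above can be reproduced with a short Python sketch using the standard `re` module. The second pattern and its subject string are assumptions added only to illustrate how a case-insensitive regex also compares the backreference without regard to case.

```python
import re

date_regex = re.compile(r"\b\d\d(\d\d)-\1-\1\b")

m = date_regex.search("2008-08-08")
whole = m.group(0)     # the overall match
captured = m.group(1)  # what capturing group 1 still holds

# Hypothetical illustration: with IGNORECASE, \1 accepts "the"
# even though the group captured "The".
doubled_word = re.compile(r"\b([a-z]+) \1\b", re.IGNORECASE)
ci_match = doubled_word.search("The the end")

print(whole, captured, ci_match.group(0))
```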
If a capturing group is repeated, either by a quantifier (Recipe 2.12) or by backtracking (Recipe 2.13), the stored match is overwritten each time the capturing group matches something. A backreference to the group matches only the text that was last captured by the group.
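To illustrate the overwriting, here is a small Python sketch. The pattern and subject strings are invented for this example and do not come from the recipe. The group sits inside a repeated noncapturing group, so the backreference only sees the digit captured by the final iteration.

```python
import re

# (\d) is captured once per iteration of the repeated group.
repeated = re.compile(r"(?:(\d)-)+\1")

# Iterations capture "1", "2", then "3"; \1 must match "3".
m = repeated.fullmatch("1-2-3-3")
last_capture = m.group(1)

# Ending in "1" fails: the earlier capture "1" was overwritten.
no_match = repeated.fullmatch("1-2-3-1")

print(last_capture, no_match)
```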
If the same regex encounters 2008-05-24 2007-07-07, the first time the
group captures something is when ‹\b\d\d(\d\d)› matches 2008, storing 08 for the first (and only) capturing
group. Next, the hyphen matches itself. The backreference, which tries
to match ‹08›, fails
against 05.
Since there are no other alternatives in the regular expression,
the engine gives up the match attempt. This involves clearing all the
capturing groups. When the engine tries again, starting at the first
0 in the
subject, ‹\1› holds no
text at all.
Still processing 2008-05-24 2007-07-07, the next time the
group captures something is when ‹\b\d\d(\d\d)› matches 2007, storing 07. Next, the hyphen matches itself. Now
the backreference tries to match ‹07›. This succeeds, as do the next hyphen,
backreference, and word boundary. 2007-07-07 has been found.
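The same search can be checked in Python as a quick sketch. The variable names are illustrative. The failed attempt at 2008-05-24 leaves nothing behind, and the reported match is the second date.

```python
import re

date_regex = re.compile(r"\b\d\d(\d\d)-\1-\1\b")
subject = "2008-05-24 2007-07-07"

m = date_regex.search(subject)
found = m.group(0)     # the date whose year, month, and day digits agree
start = m.start()      # index in subject where the match begins
captured = m.group(1)  # group 1 holds the capture from the successful attempt

print(found, start, captured)
```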
Because the regex engine proceeds from start to end, you should
put the capturing parentheses before the backreference. The regular
expressions ‹\b\d\d\1-(\d\d)-\1› and ‹\b\d\d\1-\1-(\d\d)\b› could never match anything.
Since the backreference is encountered before the capturing group, it
has not captured anything yet. Unless you’re using JavaScript, a
backreference always fails if it points to a group that hasn’t already
participated in the match attempt.
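Flavors differ in how they report this. As a sketch of Python's behavior specifically, the `re` module refuses to compile a backreference that appears before its group, and a backreference to a group that was skipped during the match attempt simply fails. The alternation pattern below is a made-up example.

```python
import re

# Python rejects a backreference that precedes its group at compile time.
try:
    re.compile(r"\b\d\d\1-(\d\d)-\1")
    forward_reference_compiles = True
except re.error:
    forward_reference_compiles = False

# Group 1 sits in an alternative that is not taken for "b",
# so it never participates and \1 fails.
skipped_group = re.compile(r"(?:(a)|b)\1")
taken = skipped_group.fullmatch("aa")   # group 1 captured "a"
not_taken = skipped_group.search("b")   # group 1 did not participate

print(forward_reference_compiles, taken.group(1), not_taken)
```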
A group that hasn’t participated is not the same as a group that
has captured a zero-length match. A backreference to a group with a
zero-length capture always succeeds. When ‹(^)\1› matches at the start of the string, the
first capturing group captures the caret’s zero-length match, causing
‹\1› to succeed. In
practice, this can happen when the contents of the capturing group are
all optional.
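The difference can be sketched in Python, where moving the question mark inside or outside the parentheses changes whether the group participates. The two patterns with the optional ‹a› are illustrative assumptions, not taken from the recipe.

```python
import re

# The caret's zero-length match is captured, so \1 succeeds.
anchor = re.compile(r"(^)\1")
anchor_match = anchor.match("abc")

# Optional content inside the group: the group always participates,
# possibly with a zero-length capture.
optional_inside = re.compile(r"(a?)\1b")
inside = optional_inside.fullmatch("b")

# Optional group: for "b" the group is skipped, so \1 fails in Python.
optional_outside = re.compile(r"(a)?\1b")
outside = optional_outside.fullmatch("b")

print(repr(anchor_match.group(1)), repr(inside.group(1)), outside)
```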
JavaScript is the only flavor we know of that goes against
decades of backreference tradition in regular expressions. In
JavaScript, or at least in implementations that follow the JavaScript
standard, a backreference to a group that hasn’t participated always
succeeds, just like a backreference to a group that captured a
zero-length match. So, in JavaScript, ‹\b\d\d\1-\1-(\d\d)\b› can match 12--34.
Recipe 2.9 explains the capturing groups that backreferences refer to.
Recipe 2.11 explains named capturing groups and named backreferences. Naming the groups and backreferences in your regex makes the regex easier to read and maintain.
Recipe 2.21 explains how to make the replacement text reinsert text matched by a capturing group when doing a search-and-replace.
Recipe 3.9 explains how to retrieve the text matched by a capturing group in procedural code.
Recipes 5.8 and 5.9, among others, show how you can solve some real-world problems using backreferences.