Regular Expressions Cookbook, 2nd Edition
by Steven Levithan
Published by
O'Reilly Media, Inc., 2012
| Regex options: Dot matches line breaks |
| Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby |
Standard JavaScript doesn’t have a “dot matches line breaks” option, but you can use an all-inclusive character class in place of the dot, as follows:
<!--[\s\S]*?-->
| Regex options: None |
| Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
For each comment you find using one of the regexes just shown,
you can then search within the matched text for the literal characters
‹TODO›. If you prefer,
you can make it a case-insensitive regex with word boundaries on each
end to make sure that only the complete word TODO is matched, like
so:
\bTODO\b
| Regex options: Case insensitive |
| Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
Follow the code in Recipe 3.13 to search within matches of an outer regex.
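As a rough illustration of the two-step approach, here is a minimal Python sketch. It is not the code from Recipe 3.13; the names COMMENT, TODO, and find_todos_in_comments are hypothetical and chosen only for this example.

```python
import re

# Outer regex: an XML-style comment. DOTALL is Python's "dot matches line breaks".
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Inner regex: the complete word TODO, case insensitive.
TODO = re.compile(r"\bTODO\b", re.IGNORECASE)

def find_todos_in_comments(text):
    """Return (start, end) offsets of each TODO that lies inside a comment."""
    spans = []
    for comment in COMMENT.finditer(text):
        # Restrict the inner search to the span of the matched comment.
        for todo in TODO.finditer(text, comment.start(), comment.end()):
            spans.append(todo.span())
    return spans

sample = "TODO outside <!-- todo: fix\nthis --> and <!-- done -->"
print(find_todos_in_comments(sample))
```

Only the lowercase todo inside the first comment is reported; the TODO at the start of the string is skipped because it is not part of any comment match.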
Lookahead (described in Recipe 2.16)
lets you solve this problem with a single regex, albeit less
efficiently. In the following regex, positive lookahead is used to
make sure that the word TODO is followed by the closing comment
delimiter -->. On its own, that doesn’t tell you
whether the word appears within a comment or is simply followed by a
comment, so a nested negative lookahead is used to ensure that the
opening comment delimiter <!-- does not appear before the
-->:
\bTODO\b(?=(?:(?!<!--).)*?-->)
| Regex options: Case insensitive, dot matches line breaks |
| Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby |
Since standard JavaScript doesn’t have a “dot matches line
breaks” option, use ‹[\s\S]› in place of the dot:
\bTODO\b(?=(?:(?!<!--)[\s\S])*?-->)
| Regex options: Case insensitive |
| Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
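The single-regex approach can be tried out in Python as follows. This is a small sketch under the stated options (case insensitive, dot matches line breaks); the name SINGLE_STEP and the sample text are made up for illustration.

```python
import re

# Case insensitive, with the dot matching line breaks.
SINGLE_STEP = re.compile(
    r"\bTODO\b(?=(?:(?!<!--).)*?-->)",
    re.IGNORECASE | re.DOTALL,
)

text = "TODO first <!-- todo: second\nline --> tail"
hits = [m.span() for m in SINGLE_STEP.finditer(text)]
print(hits)
```

The first TODO is rejected because <!-- appears before the next -->, while the word inside the comment is matched even though the comment spans two lines.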
Recipe 3.13 shows the code you need to
search within matches of another regex. It takes an inner and outer
regex. The comment regex serves as the outer regex, and ‹\bTODO\b› as the inner regex.
The main thing to note here is the lazy ‹*?›
quantifier that follows the dot or character class in the comment
regex. As explained in Recipe 2.13, that
lets you match up to the first --> (the one that ends the comment),
rather than the very last occurrence of --> in your subject string.
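The difference between the lazy and greedy quantifiers is easy to see on a string that holds two comments. The following Python sketch uses a made-up subject string; the variable names are illustrative only.

```python
import re

text = "<!-- one --> between <!-- two -->"

# Lazy: each match stops at the first "-->" after its "<!--".
lazy = re.findall(r"<!--.*?-->", text, re.DOTALL)
# Greedy: the single match runs to the last "-->" in the string.
greedy = re.findall(r"<!--.*-->", text, re.DOTALL)

print(lazy)
print(greedy)
```

With the greedy version, the word "between" would be treated as comment text, so an inner search could report words that are not inside any comment.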
The single-step solution is more complex and slower than the two-step approach. On the plus side, it combines the two steps of the previous approach into one regex. Thus, it can be used when working with a text editor, IDE, or other tool that doesn’t allow searching within matches of another regex.
Let’s break this regex down in free-spacing mode, and take a closer look at each part:
\b TODO \b         # Match the characters "TODO", as a complete word
(?=                # Followed by:
  (?:              #   Group but don't capture:
    (?! <!-- )     #     Not followed by: "<!--"
    .              #     Match any single character
  )*?              #   Repeat zero or more times, as few as possible (lazy)
  -->              #   Match the characters "-->"
)
| Regex options: Dot matches line breaks, free-spacing |
| Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby |
This commented version of the regex doesn’t work in JavaScript unless you use the XRegExp library, since standard JavaScript lacks both “free-spacing” and “dot matches line breaks” modes.
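In Python, free-spacing mode corresponds to the re.VERBOSE flag, so the commented regex can be used nearly as written. The sketch below assumes only the two options listed for the free-spacing version (no case insensitivity); the name TODO_IN_COMMENT is hypothetical.

```python
import re

TODO_IN_COMMENT = re.compile(r"""
    \b TODO \b        # The characters "TODO", as a complete word
    (?=               # Followed by:
      (?:             #   Group but don't capture:
        (?! <!-- )    #     Not followed by "<!--"
        .             #     Any single character
      )*?             #   Repeated as few times as possible (lazy)
      -->             #   The characters "-->"
    )
""", re.VERBOSE | re.DOTALL)

inside = TODO_IN_COMMENT.search("<!-- TODO: x -->")
outside = TODO_IN_COMMENT.search("TODO <!-- x -->")
print(inside, outside)
```

Whitespace and everything after an unescaped # are ignored by the regex engine in this mode, which is why the pattern behaves the same as the compact one-line version.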
Notice that the regex contains a negative lookahead nested
within an outer, positive lookahead. That lets you require that any
match of TODO is followed by --> and that
<!-- does
not occur in between.
If it’s clear to you how all of this works together, great: you can skip the rest of this section. But in case it’s still a little hazy, let’s take another step back and build the outer, positive lookahead in this regex step by step.
Let’s say for a moment that we simply want to match occurrences
of the word TODO that are followed at some point in
the string by -->. That gives us the regex
‹\bTODO\b(?=.*?-->)›
(with “dot matches line breaks” enabled), which matches the word TODO
in <!--TODO--> just fine. We need the ‹.*?› at the beginning of the
lookahead, because otherwise the regex would match only when
TODO is
immediately followed by -->, with no characters in between.
The ‹*?› quantifier
repeats the dot zero or more times, as few times as possible, which is
great since we only want to match until the first following -->.
As an aside, the regex so far could be rewritten as ‹\bTODO(?=.*?-->)\b› (with the
second ‹\b› moved
after the lookahead) without any effect on the text that is matched.
That’s because both the word boundary and the lookahead are
zero-length assertions (see Lookaround). However, it’s
better to place the word boundary first for readability and
efficiency. In the middle of a partial match, the regex engine can
more quickly test a word boundary, fail, and move forward to try the
regex again at the next character in the string without having to
spend time testing the lookahead when it isn’t necessary.
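A quick way to convince yourself that the two placements of the word boundary match the same text is to compare them on a sample string. This Python sketch uses invented names and test data, and it checks only the matched spans, not the relative speed.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL
# Word boundary before the lookahead (the recommended order).
boundary_first = re.compile(r"\bTODO\b(?=.*?-->)", FLAGS)
# Word boundary after the lookahead (same matches, more work on failures).
boundary_last = re.compile(r"\bTODO(?=.*?-->)\b", FLAGS)

text = "TODOS <!-- todo, TODO --> todo"
spans_first = [m.span() for m in boundary_first.finditer(text)]
spans_last = [m.span() for m in boundary_last.finditer(text)]
print(spans_first == spans_last, spans_first)
```

In both cases TODOS is rejected by the word boundary and the final todo is rejected because no --> follows it.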
OK, so the regex ‹\bTODO\b(?=.*?-->)› seems to work fine so
far, but what about when it’s applied to the subject string TODO <!--separate
comment-->? The regex still matches TODO since it’s followed
by -->,
even though TODO is not within a comment this time.
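The problem with the simple lookahead can be reproduced in a few lines of Python. The variable names below are illustrative, and the second subject string is the one discussed above.

```python
import re

naive = re.compile(r"\bTODO\b(?=.*?-->)", re.IGNORECASE | re.DOTALL)

inside = naive.search("<!--TODO-->")
outside = naive.search("TODO <!--separate comment-->")

# Both searches succeed, although only the first TODO is inside a comment.
print(inside.span(), outside.span())
```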
We therefore need to change the dot within the lookahead from matching
any character to matching any character that is not part of the string
<!--,
since that would indicate the start of a new comment. We can’t use a
negated character class such as ‹[^<!-]›, because we want to allow <, !, and - characters that are
not grouped into the exact sequence <!--.
That’s where the nested negative lookahead comes in. ‹(?!<!--).› matches any single
character that is not part of an opening comment delimiter. Placing
that pattern within a noncapturing group as ‹(?:(?!<!--).)› allows us to repeat the whole
sequence with the lazy ‹*?› quantifier we’d previously applied to just
the dot.
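To see why the negated character class is too strict, compare it with the lookahead-guarded dot on a comment that contains stray <, !, and - characters. This is a Python sketch with hypothetical names and a made-up comment.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL
# Too strict: refuses every "<", "!", and "-" before the closing delimiter.
negated_class = re.compile(r"\bTODO\b(?=[^<!-]*?-->)", FLAGS)
# Refuses only the exact sequence "<!--".
guarded_dot = re.compile(r"\bTODO\b(?=(?:(?!<!--).)*?-->)", FLAGS)

comment = "<!-- TODO: re-check a < b! -->"
strict_hit = negated_class.search(comment)
guarded_hit = guarded_dot.search(comment)
print(strict_hit, guarded_hit.span())
```

The negated class gives up at the hyphen in "re-check", so the TODO is missed, while the guarded dot passes over the hyphen, <, and ! and reaches the closing -->.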
Putting it all together, we get the final regex that was listed
as the solution for this
problem: ‹\bTODO\b(?=(?:(?!<!--).)*?-->)›. In
JavaScript, which lacks the necessary “dot matches line breaks”
option, ‹\bTODO\b(?=(?:(?!<!--)[\s\S])*?-->)› is
equivalent.
Although the “single-step approach” regex ensures that any match
of TODO is
followed by --> without <!-- occurring in between, it doesn’t
check the reverse: that the target word is also preceded by <!-- without
--> in
between. There are several reasons we left that rule out:
You can usually get away with not doing this double-check, especially since the single-step regex is meant to be used with text editors and the like, where you can visually verify your results.
Having less to verify means less time spent performing the verification. In other words, the regex is faster when the extra check is left out.
Most importantly, since you don’t know how far back the comment may have started, looking backward like this requires infinite-length lookbehind, which is supported by the .NET regex flavor only.
If you’re working with .NET and want to include this added check, use the following regex:
(?<=<!--(?:(?!-->).)*?)\bTODO\b(?=(?:(?!<!--).)*?-->)
| Regex options: Case insensitive, dot matches line breaks |
| Regex flavor: .NET |
This stricter, .NET-only regex adds a positive lookbehind at the
front, which works just like the lookahead at the end but in reverse.
Because the lookbehind works forward from the position where it finds
<!--, the
lookbehind contains a nested negative lookahead that lets it match any
characters that are not part of the sequence -->.
Since the leading lookbehind and trailing lookahead are both
zero-length assertions, the final match is just the word TODO. The strings matched
within the lookarounds do not become a part of the final matched
text.
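Python's re module requires fixed-width lookbehind, so the stricter regex cannot be used there as written. As one possible workaround (an assumption of this sketch, not a technique from the recipe), the backward check can be done in code after the single-step regex matches. The function name todos_with_backward_check is hypothetical, and the check only roughly mirrors the .NET lookbehind.

```python
import re

SINGLE_STEP = re.compile(
    r"\bTODO\b(?=(?:(?!<!--).)*?-->)",
    re.IGNORECASE | re.DOTALL,
)

def todos_with_backward_check(text):
    """Keep single-step matches whose nearest preceding delimiter is "<!--"."""
    spans = []
    for m in SINGLE_STEP.finditer(text):
        last_open = text.rfind("<!--", 0, m.start())
        last_close = text.rfind("-->", 0, m.start())
        # Inside a comment only if an opener precedes the word
        # and no closer starts after that opener.
        if last_open > last_close:
            spans.append(m.span())
    return spans

text = "<!-- done --> TODO stray --> <!-- todo real -->"
unchecked = [m.span() for m in SINGLE_STEP.finditer(text)]
checked = todos_with_backward_check(text)
print(unchecked, checked)
```

The single-step regex alone accepts the TODO that sits between a closed comment and a stray -->, while the added backward check rejects it and keeps only the word inside the second comment.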
Recipe 9.9 includes a detailed discussion of how to match XML-style comments.
Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.4 explains that the dot matches any character. Recipe 2.6 explains word boundaries. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.13 explains how greedy and lazy quantifiers backtrack. Recipe 2.16 explains lookaround.
[23] PowerGREP—described in Tools for Working with Regular Expressions in Chapter 1—is one tool that’s able to search within matches.