Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Regular Expressions Cookbook, 2nd Edition by Steven Levithan Published by O'Reilly Media, Inc., 2012

Single-step approach

Lookahead (described in Recipe 2.16) lets you solve this problem with a single regex, albeit less efficiently. In the following regex, positive lookahead is used to make sure that the word TODO is followed by the closing comment delimiter -->. On its own, that doesn’t tell whether the word appears within a comment or is simply followed by a comment, so a nested negative lookahead is used to ensure that the opening comment delimiter <!-- does not appear before the -->:

\bTODO\b(?=(?:(?!<!--).)*?-->)
Regex options: Case insensitive, dot matches line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
Since standard JavaScript doesn’t have a “dot matches line breaks” option, use ‹[\s\S]› in place of the dot:
\bTODO\b(?=(?:(?!<!--)[\s\S])*?-->)
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
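
As a rough illustration of the single-step regex in one of the listed flavors, here is a minimal Python sketch. The sample text and the variable names are made up for this example; `re.DOTALL` is Python's name for the “dot matches line breaks” option.

```python
import re

# Single-step regex: TODO followed by "-->" with no "<!--" in between.
TODO_IN_COMMENT = re.compile(
    r"\bTODO\b(?=(?:(?!<!--).)*?-->)",
    re.IGNORECASE | re.DOTALL,
)

# Hypothetical sample text (not from the recipe).
sample = "TODO outside <!-- first\nTODO inside --> trailing todo text"

# Collect the start offset of every match.
starts = [m.start() for m in TODO_IN_COMMENT.finditer(sample)]
print(starts)
```

Only the TODO on the second line is reported: the first one reaches <!-- before any -->, and the trailing “todo” is never followed by -->.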

Discussion

Two-step approach

Recipe 3.13 shows the code you need to search within matches of another regex. It takes an inner and outer regex. The comment regex serves as the outer regex, and ‹\bTODO\b› as the inner regex. The main thing to note here is the lazy ‹*?› quantifier that follows the dot or character class in the comment regex. As explained in Recipe 2.13, that lets you match up to the first --> (the one that ends the comment), rather than the very last occurrence of --> in your subject string.
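
The two-step idea can be sketched in Python roughly as follows. The outer comment regex `<!--.*?-->` is an assumption here (this section does not show it), and the function name is hypothetical.

```python
import re

# Assumed outer regex: from "<!--" up to the first "-->" (lazy quantifier).
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Inner regex from the recipe.
TODO = re.compile(r"\bTODO\b", re.IGNORECASE)

def find_todos_in_comments(text):
    """Return offsets (relative to text) of TODO words found inside comments."""
    offsets = []
    for comment in COMMENT.finditer(text):
        # Search only within the text matched by the outer regex.
        for todo in TODO.finditer(comment.group()):
            offsets.append(comment.start() + todo.start())
    return offsets

# Hypothetical sample text.
text = "TODO a <!-- TODO b --> c <!-- x --> TODO d -->"
print(find_todos_in_comments(text))
```

With a greedy ‹.*› in the outer regex, the first comment match would run to the last --> in the string and wrongly include “TODO d”; the lazy form stops at the first -->.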

Single-step approach

This solution is more complex, and slower. On the plus side, it combines the two steps of the previous approach into one regex. Thus, it can be used when working with a text editor, IDE, or other tool that doesn’t allow searching within matches of another regex.

Let’s break this regex down in free-spacing mode, and take a closer look at each part:

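\b TODO \b      # Match the characters "TODO", as a complete word
(?=             # Followed by:
  (?:           #   Group but don't capture:
    (?! <!-- )  #     Not followed by: "<!--"
    .           #     Match any single character
  )*?           #   Repeat zero or more times, as few as possible
  -->           #   Match the characters "-->"
)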
Regex options: Dot matches line breaks, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
This commented version of the regex doesn’t work in JavaScript unless you use the XRegExp library, since standard JavaScript lacks both “free-spacing” and “dot matches line breaks” modes.
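
For readers using Python, free-spacing mode is available through `re.VERBOSE`. The following sketch (sample text and helper name are illustrative only) checks that a commented version of the regex behaves like the compact one on one sample.

```python
import re

# Free-spacing form. re.VERBOSE ignores unescaped whitespace in the pattern
# and treats "#" as the start of a comment; re.DOTALL lets "." match newlines.
TODO_VERBOSE = re.compile(
    r"""
    \b TODO \b      # "TODO" as a complete word
    (?=             # followed by:
      (?:           #   group, no capture:
        (?! <!-- )  #     not at the start of an opening delimiter
        .           #     any single character
      )*?           #   repeated as few times as possible
      -->           #   then the closing delimiter
    )
    """,
    re.VERBOSE | re.DOTALL,
)

# Compact form, for comparison.
TODO_COMPACT = re.compile(r"\bTODO\b(?=(?:(?!<!--).)*?-->)", re.DOTALL)

def spans(pattern, text):
    """Return the (start, end) span of every match of pattern in text."""
    return [m.span() for m in pattern.finditer(text)]

# Hypothetical sample text.
sample = "TODO a <!-- x\nTODO b --> TODO c"
print(spans(TODO_VERBOSE, sample) == spans(TODO_COMPACT, sample))
```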
Notice that the regex contains a negative lookahead nested within an outer, positive lookahead. That lets you require that any match of TODO is followed by --> and that <!-- does not occur in between. To see how the pieces fit together, let’s build the outer, positive lookahead step by step. Suppose for a moment that we simply want to match occurrences of the word TODO that are followed at some point in the string by -->. That gives us the regex ‹\bTODO\b(?=.*?-->)› (with “dot matches line breaks” enabled), which matches the word TODO in <!--TODO--> just fine. We need the ‹.*?› at the beginning of the lookahead, because otherwise the regex would match only when TODO is immediately followed by -->, with no characters in between. The ‹*?› quantifier repeats the dot zero or more times, as few times as possible, which is great since we only want to match until the first following -->.
As an aside, the regex so far could be rewritten as ‹\bTODO(?=.*?-->)\b› (with the second ‹\b› moved after the lookahead) without any effect on the text that is matched. That’s because both the word boundary and the lookahead are zero-length assertions (see Lookaround). However, it’s better to place the word boundary first for readability and efficiency. In the middle of a partial match, the regex engine can more quickly test a word boundary, fail, and move forward to try the regex again at the next character in the string without having to spend time testing the lookahead when it isn’t necessary.
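OK, so the regex ‹\bTODO\b(?=.*?-->)› seems to work fine so far, but what about when it’s applied to the subject string TODO <!--separate comment-->? The regex still matches TODO since it’s followed by -->, even though TODO is not within a comment this time. We therefore need to change the dot within the lookahead from matching any character to matching any character that is not part of the string <!--, since that would indicate the start of a new comment. A negated character class such as ‹[^<!-]› won’t do, because the characters <, !, and - must still be allowed when they don’t form the exact sequence <!--. That’s where the nested negative lookahead comes in: ‹(?!<!--).› matches any single character that does not begin an opening comment delimiter. Wrapping that pattern in a noncapturing group, as ‹(?:(?!<!--).)›, lets us repeat the whole sequence with the lazy ‹*?› quantifier that was previously applied to just the dot. Putting it all together gives the regex listed in the solution: ‹\bTODO\b(?=(?:(?!<!--).)*?-->)›. In JavaScript, which lacks the necessary “dot matches line breaks” option, ‹\bTODO\b(?=(?:(?!<!--)[\s\S])*?-->)› is equivalent.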
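
The difference between the simple lookahead and the final regex can be checked with a short Python sketch. The two subject strings below are illustrative assumptions, chosen so that TODO sits inside a comment in one and in front of a separate comment in the other.

```python
import re

# Intermediate regex: TODO followed, at some point, by "-->".
naive = re.compile(r"\bTODO\b(?=.*?-->)", re.DOTALL)
# Final regex: same, but "<!--" must not occur before that "-->".
strict = re.compile(r"\bTODO\b(?=(?:(?!<!--).)*?-->)", re.DOTALL)

inside = "<!--TODO-->"                     # TODO within a comment
outside = "TODO <!--separate comment-->"   # TODO before an unrelated comment

for name, pattern in (("naive", naive), ("strict", strict)):
    print(name, bool(pattern.search(inside)), bool(pattern.search(outside)))
```

Both regexes match the first string, but only the version with the nested negative lookahead rejects the second.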

Variations

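Although the “single-step approach” regex ensures that any match of TODO is followed by --> without <!-- occurring in between, it doesn’t check the reverse: that the match is also preceded by <!-- without --> in between. There are several reasons we left that rule out: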

You can usually get away with not doing this double-check, especially since the single-step regex is meant to be used with text editors and the like, where you can visually verify your results.
Having less to verify means less time spent performing the verification. In other words, the regex is faster when the extra check is left out.
Most importantly, since you don’t know how far back the comment may have started, looking backward like this requires infinite-length lookbehind, which is supported by the .NET regex flavor only.

If you’re working with .NET and want to include this added check, use the following regex:

(?<=<!--(?:(?!-->).)*?)\bTODO\b(?=(?:(?!<!--).)*?-->)
Regex options: Case insensitive, dot matches line breaks
Regex flavor: .NET
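This stricter, .NET-only regex adds a positive lookbehind at the front, which works just like the lookahead at the end but in reverse. Because the lookbehind works forward from the position where it finds <!--, it contains a nested negative lookahead that lets it match any characters that are not part of the sequence -->.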
Since the leading lookbehind and trailing lookahead are both zero-length assertions, the final match is just the word TODO. The strings matched within the lookarounds do not become a part of the final matched text.
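
Python’s `re` module only accepts fixed-width lookbehind, so the .NET regex above cannot be used there as written. One possible workaround, which is a different technique from the recipe’s and is sketched here under that assumption, is to keep the single-step regex and do the backward check in code with `str.rfind` and `str.find`. The function name and sample text are hypothetical.

```python
import re

# Single-step regex: checks only what follows TODO.
TODO = re.compile(r"\bTODO\b(?=(?:(?!<!--).)*?-->)", re.IGNORECASE | re.DOTALL)

def todos_strictly_in_comments(text):
    """Return offsets of TODO matches that are also preceded by an
    opening delimiter with no closing delimiter in between."""
    offsets = []
    for m in TODO.finditer(text):
        # Nearest "<!--" that ends before the match.
        last_open = text.rfind("<!--", 0, m.start())
        if last_open == -1:
            continue  # no opening delimiter before the word
        # Reject if a "-->" occurs between that opener and the word.
        if text.find("-->", last_open + 4, m.start()) == -1:
            offsets.append(m.start())
    return offsets

# Hypothetical sample text.
text = "TODO stray --> <!-- TODO real --> TODO after -->"
print(len(TODO.findall(text)), todos_strictly_in_comments(text))
```

On this sample the single-step regex alone reports all three words, while the added backward check keeps only the one that is actually enclosed by <!-- and -->.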

9.10. Find Words Within XML-Style Comments

Problem

Solution

Two-step approach

Single-step approach

Discussion

Two-step approach

Single-step approach

Variations

See Also
