Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Regular Expressions Cookbook, 2nd Edition by Steven Levithan Published by O'Reilly Media, Inc., 2012

Discussion

All four solutions use ‹/[^/\\\r\n]*(?:\\.[^/\\\r\n]*)*/› to match the regular expression. This is the same regular expression that was the Solution to Strings with Escapes, except that it has forward slashes instead of quotes. A literal regular expression really is just a string quoted with forward slashes that can contain forward slashes if escaped with a backslash.
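As a minimal sketch of what this regex does on its own, the following Python snippet applies it without any check on the preceding character. The helper name `find_literal` and the sample strings are ours, chosen only for illustration. The second sample shows why the context check is needed: two division operators look just like a regex literal.

```python
import re

# The literal-matching regex from the text, unchanged.
REGEX_LITERAL = re.compile(r'/[^/\\\r\n]*(?:\\.[^/\\\r\n]*)*/')

def find_literal(text):
    """Return the first substring shaped like a regex literal, or None."""
    m = REGEX_LITERAL.search(text)
    return m.group(0) if m else None

# An escaped forward slash inside the literal does not end the match.
print(find_literal(r'var re = /a\/b[0-9]+/;'))  # /a\/b[0-9]+/

# Without a check on what precedes the slash, division is a false positive.
print(find_literal('x = a / b / c'))  # / b /
```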

The difference between the four solutions is how they check whether the regex is preceded by an equals sign, a colon, an opening parenthesis, or a comma, possibly with an exclamation point between that character and the regular expression. We could easily do that with lookbehind if we didn’t also want to allow any amount of whitespace between the regex and the preceding character. That complicates matters because the regex flavors in this book vary widely in their support for lookbehind.
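To make the lookbehind limitation concrete, here is a small sketch using Python's standard `re` module, which accepts only fixed-width lookbehind. The helper `compiles` is a hypothetical name introduced for this illustration; the two patterns are simplified versions of the check described above.

```python
import re

def compiles(pattern):
    """Report whether Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# A single character inside lookbehind has a fixed width, so it is accepted.
print(compiles(r'(?<=[=:(,])/'))     # True

# Adding \s* makes the lookbehind variable-width, which re rejects.
print(compiles(r'(?<=[=:(,]\s*)/'))  # False
```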

The .NET regex flavor is the only one in this book that allows infinite repetition inside lookbehind. So for .NET we have a perfect solution: ‹(?<=[=:(,](?:\s*!)?\s*)›. The character class ‹[=:(,]› checks for the presence of any of the four characters. ‹(?:\s*!)?› allows the character to be followed by an exclamation point, with any amount of whitespace between the character and the exclamation point. The second ‹\s*› allows any amount of whitespace before the forward slash that opens the regex.

Perl and PCRE do not allow repetition inside lookbehind, so a solution using lookbehind wouldn’t be flexible enough in those flavors. But Perl 5.10 and PCRE 7.2 added a new regex token, ‹\K›, that we can use instead. We use ‹[=:(,](?:\s*!)?\s*› to match any of the four characters, optionally followed by any amount of whitespace and an exclamation point, and then by any amount of whitespace. After the regex has matched this, ‹\K› tells the regex engine to keep what it has just matched out of the overall match: the punctuation and whitespace matched so far will not be included in the match result. The matching process then continues normally with ‹/[^/\\\r\n]*(?:\\.[^/\\\r\n]*)*/› to match the regular expression.

Java does not allow infinite repetition in lookbehind, but does allow finite repetition. So instead of using ‹\s*› to check for absolutely any amount of whitespace, we use ‹\s{0,10}› to check for up to 10 whitespace characters. The number 10 is arbitrary; we just need something sufficiently large to make sure we don’t miss any regexes that are deeply indented. We also need to keep the number reasonably small to make sure we don’t needlessly slow down the regular expression. The greater the number of repetitions we allow, the more characters Java will scan while looking for a match to what’s inside the lookbehind.

The other regex flavors either don’t support repetition inside lookbehind or don’t support lookbehind or ‹\K› at all. For these flavors, we simply use ‹[=:(,](?:\s*!)?\s*› to match the punctuation we want before the regex, and ‹(/[^/\\\r\n]*(?:\\.[^/\\\r\n]*)*/)› to match the regex itself and store it in a capturing group. The overall regex match will include both the punctuation and the regex, so the capturing group is what lets us retrieve just the regex. This solution will work only if the application with which you’ll use this regex can work on the text matched by a capturing group rather than the whole regex match.
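A minimal Python sketch of the capturing-group approach follows. The function name `regex_literals` and the sample source text are assumptions made for illustration. The pattern uses the ordinary greedy ‹?› on the exclamation-point group so that it compiles on any Python 3 version.

```python
import re

# Punctuation and optional "!" first, then the regex literal in group 1.
PATTERN = re.compile(
    r'[=:(,](?:\s*!)?\s*(/[^/\\\r\n]*(?:\\.[^/\\\r\n]*)*/)'
)

def regex_literals(source):
    """Return the text of capturing group 1 for every match."""
    return [m.group(1) for m in PATTERN.finditer(source)]

code = r'var a = /x+/; if (!/y\/z/.test(s)) {} var q = n / 2 / m;'
for literal in regex_literals(code):
    print(literal)
# /x+/
# /y\/z/

# The overall match still includes the leading punctuation and whitespace.
print(PATTERN.search('a = /x/').group(0))  # = /x/
```

The division in `n / 2 / m` is skipped because its first slash is not preceded by one of the four punctuation characters.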
