Regular Expressions Cookbook, 2nd Edition
by Jan Goyvaerts and Steven Levithan
Published by O'Reilly Media, Inc., 2012
| Regex options: Free-spacing |
| Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby |
Regular expressions can quickly become complicated and difficult to understand. Just as you should comment source code, you should comment all but the most trivial regular expressions.
All regular expression flavors in this book, except JavaScript, offer an alternative regular expression syntax that makes it very easy to clearly comment your regular expressions. You can enable this syntax by turning on the free-spacing option. It has different names in various programming languages.
In .NET, set the RegexOptions.IgnorePatternWhitespace option. In Java, pass the Pattern.COMMENTS flag. Python expects re.VERBOSE. PHP, Perl, and Ruby use the /x flag.
Though standard JavaScript does not support free-spacing regular expressions, the XRegExp library adds that option. Simply add 'x' to the flags passed as the second parameter to the XRegExp() constructor.
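As a minimal sketch of turning the option on from code, the Python snippet below compiles the date regex used in this recipe with re.VERBOSE. The names DATE_PATTERN and is_date are hypothetical, chosen only for illustration.

```python
import re

# Free-spacing version of the date regex from this recipe.
# re.VERBOSE makes the engine skip the whitespace and the # comments.
DATE_PATTERN = re.compile(r"""
    \d{4}   # Year
    -       # Separator
    \d{2}   # Month
    -       # Separator
    \d{2}   # Day
""", re.VERBOSE)


def is_date(text):
    """Hypothetical helper: True if text is exactly four digits, two, two."""
    return DATE_PATTERN.fullmatch(text) is not None
```

The pattern only checks the shape of the text; it does not validate that the month or day is in range.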
Turning on free-spacing mode has two effects. First, it turns the hash symbol (#) into a metacharacter outside character classes. The hash starts a comment that runs until the end of the line or the end of the regex (whichever comes first). The hash and everything after it are simply ignored by the regular expression engine. To match a literal hash sign, either place it inside a character class ‹[#]› or escape it ‹\#›.
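The two ways of matching a literal hash can be sketched in Python as follows. The "issue number" scenario and the names ISSUE_REF_CLASS and ISSUE_REF_ESCAPED are assumptions made up for this example.

```python
import re

# Hypothetical pattern: a hash sign followed by a number, as in "#42".
ISSUE_REF_CLASS = re.compile(r"""
    [#]     # literal hash, protected by a character class
    (\d+)   # the number after the hash
""", re.VERBOSE)

ISSUE_REF_ESCAPED = re.compile(r"""
    \#      # literal hash, protected by a backslash
    (\d+)   # the number after the hash
""", re.VERBOSE)
```

In both patterns, the first hash on each line is part of the regex, and the second one starts a comment.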
The other effect is that whitespace, which includes spaces, tabs, and line breaks, is also ignored outside character classes. To match a literal space, either place it inside a character class ‹[●]› or escape it ‹\●›. If you’re concerned about readability, you could use the hexadecimal escape ‹\x20› or the Unicode escape ‹\u0020› or ‹\x{0020}› instead. To match a tab, use ‹\t›. For line breaks, use ‹\r\n› (Windows) or ‹\n› (Unix/Linux/OS X).
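A short Python sketch of matching a literal space in free-spacing mode, using the character class and the hexadecimal escape. The two-word scenario and the names TWO_WORDS_CLASS and TWO_WORDS_HEX are hypothetical.

```python
import re

# Hypothetical pattern: two words separated by exactly one space.
TWO_WORDS_CLASS = re.compile(r"""
    (\w+)   # first word
    [ ]     # literal space inside a character class
    (\w+)   # second word
""", re.VERBOSE)

TWO_WORDS_HEX = re.compile(r"""
    (\w+)   # first word
    \x20    # literal space written as a hexadecimal escape
    (\w+)   # second word
""", re.VERBOSE)
```

The hexadecimal form has the practical advantage that the space stays visible in the source and cannot be lost to an editor that trims or reflows whitespace.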
Free-spacing mode does not change anything inside character classes. A character class is a single token. Any whitespace characters or hashes inside character classes are literal characters that are added to the character class. You cannot break up character classes to comment their parts.
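The following Python sketch illustrates that a space and a hash inside a character class stay literal under re.VERBOSE. The name SPACE_OR_HASH and the sample string are assumptions for this example.

```python
import re

# Inside the class, the space and the hash are ordinary members,
# even though free-spacing mode is on.
SPACE_OR_HASH = re.compile(r"[ #]+", re.VERBOSE)

runs = SPACE_OR_HASH.findall("a # b#c")
print(runs)
```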
Regular expressions wouldn’t live up to their reputation unless at least one flavor was incompatible with the others. In this case, Java is the odd one out.
In Java, character classes are not parsed as single tokens. If you turn on free-spacing mode, Java ignores whitespace in character classes, and hashes inside character classes do start comments. This means you cannot use ‹[●]› and ‹[#]› to match these characters literally. Use ‹\u0020› and ‹\#› instead.
(?#Year)\d{4}(?#Separator)-(?#Month)\d{2}-(?#Day)\d{2}
| Regex options: None |
| Regex flavors: .NET, XRegExp, PCRE, Perl, Python, Ruby |
If, for some reason, you can’t or don’t want to use free-spacing syntax, you can still add comments by way of ‹(?#comment)›. All characters between ‹(?#› and ‹)› are ignored.
Unfortunately, JavaScript, the only flavor in this book that doesn’t support free-spacing, also doesn’t support this comment syntax. XRegExp, which adds support for free-spacing regular expressions to JavaScript, also adds support for the comment syntax. While Java supports comments in free-spacing regular expressions, it does not support the ‹(?#comment)› syntax.
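As an illustration of the ‹(?#comment)› syntax in one of the flavors that supports it, this Python sketch compiles the commented date regex without any options. The name DATE_INLINE_COMMENTS is hypothetical.

```python
import re

# No flags are passed: the (?#...) groups are comments in any mode.
DATE_INLINE_COMMENTS = re.compile(
    r"(?#Year)\d{4}(?#Separator)-(?#Month)\d{2}-(?#Day)\d{2}"
)

match = DATE_INLINE_COMMENTS.fullmatch("2012-08-27")
```

The comment groups match nothing and capture nothing, so the regex behaves like the uncommented ‹\d{4}-\d{2}-\d{2}›.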
(?x)\d{4} # Year
- # Separator
\d{2} # Month
- # Separator
\d{2} # Day
| Regex options: None |
| Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby |
If you cannot turn on free-spacing mode outside the regular expression, you can place the mode modifier ‹(?x)› at the very start of the regular expression. Make sure there’s no whitespace before the ‹(?x)›. Free-spacing mode begins only at this mode modifier; any whitespace before it is significant.
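A minimal Python sketch of the mode modifier approach: the regex is compiled without flags, and ‹(?x)› sits directly after the opening quotes so that nothing precedes it. The name DATE_INLINE_MODE is hypothetical.

```python
import re

# No re.VERBOSE flag here; the (?x) at the very start of the
# pattern turns on free-spacing mode from inside the regex.
DATE_INLINE_MODE = re.compile(r"""(?x)\d{4} # Year
- # Separator
\d{2} # Month
- # Separator
\d{2} # Day""")

match = DATE_INLINE_MODE.fullmatch("2012-08-27")
```

The final comment has no line break after it, so it simply runs to the end of the regex.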
Mode modifiers are explained in detail in Case-insensitive matching, a subsection of Recipe 2.1.