Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Regular Expressions Cookbook, 2nd Edition by Steven Levithan Published by O'Reilly Media, Inc., 2012

Discussion

How it works

At the beginning and end of this regular expression are the literal character sequences ‹<!--› and ‹-->›. Since none of those characters are special in regex syntax (except within character classes, where hyphens create ranges), they don’t need to be escaped. That just leaves the ‹.*?› or ‹[\s\S]*?› in the middle of the regex to examine further.

Thanks to the “dot matches line breaks” option, the dot in the regex shown first matches any single character. In the JavaScript version, the character class ‹[\s\S]› takes its place, and the two regexes are exactly equivalent: ‹\s› matches any whitespace character, and ‹\S› matches everything else, so combined they match any character.
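As a quick check of that equivalence, here is a minimal Python sketch. The sample string and variable names are made up for illustration; `re.DOTALL` is Python's name for the “dot matches line breaks” option.

```python
import re

# Hypothetical sample: one comment that spans a line break.
sample = "a\n<!-- spans\ntwo lines -->\nb"

# Dot with the "dot matches line breaks" option (re.DOTALL in Python).
with_dotall = re.compile(r"<!--.*?-->", re.DOTALL)
# Character-class form that needs no option at all.
with_class = re.compile(r"<!--[\s\S]*?-->")

dotall_matches = with_dotall.findall(sample)
class_matches = with_class.findall(sample)

# Without the option, the dot cannot cross the line break,
# so the multi-line comment is not matched.
no_flag_matches = re.findall(r"<!--.*?-->", sample)
```

Both compiled patterns find the same single comment, while the version without the option finds nothing in this sample.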

The lazy ‹*?› quantifier repeats its preceding “any character” element zero or more times, as few times as possible. Thus, the preceding token is repeated only until the first occurrence of -->, rather than matching all the way to the end of the subject string, and then backtracking until the last -->. (See Recipe 2.13 for more on how backtracking works with lazy and greedy quantifiers.) This simple strategy works well since XML-style comments cannot be nested within each other. In other words, they always end at the first (leftmost) occurrence of -->.
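The difference between the lazy and greedy quantifier is easy to see on a string that holds two comments. The following Python sketch uses an invented sample string; only the lazy pattern comes from this recipe, and the greedy one is shown purely for contrast.

```python
import re

# Hypothetical sample with two separate comments.
text = "<p>keep</p><!-- one --><p>also keep</p><!-- two -->"

# Lazy: each match stops at the first --> it can reach.
lazy = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)

# Greedy (for contrast only): one match runs to the last -->,
# swallowing the markup between the two comments.
greedy = re.sub(r"<!--.*-->", "", text, flags=re.DOTALL)
```

Here `lazy` keeps both paragraphs, whereas `greedy` loses the second one along with the comments.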

When comments can’t be removed

Most web developers are familiar with using HTML comments within <script> and <style> elements for backward compatibility with ancient browsers. These days, it’s just a meaningless incantation, but its use lives on in part because of copy-and-paste coding. We’re going to assume that when you remove comments from an (X)HTML document, you don’t want to strip out embedded JavaScript and CSS. You probably also want to leave the contents of <textarea> elements, CDATA sections, and the values of attributes within tags alone.

Earlier, we said removing comments wasn’t a difficult task. As it turns out, that was only true if you ignore some of the tricky areas of (X)HTML or XML where the syntax rules change. In other words, if you ignore the hard parts of the problem, it’s easy.

Of course, in some cases you might evaluate the markup you’re dealing with and decide it’s OK to ignore these problem cases, maybe because you wrote the markup yourself and know what to expect. It might also be OK if you’re doing a search-and-replace in a text editor and are able to manually inspect each match before removing it.

To work around these issues, recall that “Skip Tricky (X)HTML and XML Sections” discussed some of the same problems in the context of matching XML-style tags. We can use a similar line of attack when searching for comments. Use the code in Recipe 3.18 to first search for tricky sections using the regular expression shown next, and then replace comments found between matches with the empty string (in other words, remove the comments):

<(script|style|textarea|title|xmp)\b(?:[^>"']|"[^"]*"|'[^']*')*>.*?</\1\s*>|<plaintext\b(?:[^>"']|"[^"]*"|'[^']*')*>.*|<[a-z](?:[^>"']|"[^"]*"|'[^']*')*>|<!\[CDATA\[.*?]]>
Regex options: Case insensitive, dot matches line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

Adding some whitespace and a few comments to the regex in free-spacing mode makes this a lot easier to follow:

# Special element: tag and its content
<( script | style | textarea | title | xmp )\b
    (?:[^>"']|"[^"]*"|'[^']*')*
>
.*?
</\1\s*>
|
# <plaintext/> continues until the end of the string
<plaintext\b
    (?:[^>"']|"[^"]*"|'[^']*')*
>
.*
|
# Standard element: tag only
<[a-z]  # Tag name initial character
    (?:[^>"']|"[^"]*"|'[^']*')*
>
|
# CDATA section
<!\[CDATA\[ .*? ]]>
Regex options: Case insensitive, dot matches line breaks, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

Here’s an equivalent version for standard JavaScript, which lacks both “dot matches line breaks” and “free-spacing” options:

<(script|style|textarea|title|xmp)\b(?:[^>"']|"[^"]*"|'[^']*')*>[\s\S]*?</\1\s*>|<plaintext\b(?:[^>"']|"[^"]*"|'[^']*')*>[\s\S]*|<[a-z](?:[^>"']|"[^"]*"|'[^']*')*>|<!\[CDATA\[[\s\S]*?]]>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
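
The “search for tricky sections, then replace between matches” approach might look roughly like this in Python. This is a sketch, not the Recipe 3.18 code: the names `TRICKY`, `COMMENT`, and `remove_comments` and the sample markup are invented for illustration. One limitation to be aware of: a comment that itself contains an opening tag is split by the outer search and left in place.

```python
import re

# The "tricky sections" regex from above, with case-insensitive and
# "dot matches line breaks" options. Names here are illustrative only.
TRICKY = re.compile(
    r"""<(script|style|textarea|title|xmp)\b(?:[^>"']|"[^"]*"|'[^']*')*>.*?</\1\s*>"""
    r"""|<plaintext\b(?:[^>"']|"[^"]*"|'[^']*')*>.*"""
    r"""|<[a-z](?:[^>"']|"[^"]*"|'[^']*')*>"""
    r"""|<!\[CDATA\[.*?]]>""",
    re.IGNORECASE | re.DOTALL,
)
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def remove_comments(markup: str) -> str:
    """Remove comments only from the text between tricky sections."""
    pieces = []
    pos = 0
    for match in TRICKY.finditer(markup):
        # Text between the previous tricky section and this one:
        # comments here are safe to strip.
        pieces.append(COMMENT.sub("", markup[pos:match.start()]))
        # The tricky section itself is copied through untouched.
        pieces.append(match.group(0))
        pos = match.end()
    # Whatever follows the last tricky section.
    pieces.append(COMMENT.sub("", markup[pos:]))
    return "".join(pieces)


markup = (
    '<p title="<!-- not a comment -->">Hi</p><!-- gone -->'
    "<script><!--\nvar x = 1;\n//--></script>"
    "<![CDATA[<!-- kept -->]]>"
)
cleaned = remove_comments(markup)
naive = COMMENT.sub("", markup)
```

In this sample, `cleaned` drops only the standalone `<!-- gone -->` comment, while `naive` also empties the attribute value, the script body, and the CDATA section.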

Variations

Find valid XML comments

There are in fact a few syntax rules for XML comments that go beyond simply starting with <!-- and ending with -->. Specifically:

Two hyphens cannot appear in a row within a comment. For example, <!-- com--ment --> is invalid because of the two hyphens in the middle.
The closing delimiter cannot be preceded by a hyphen that is part of the comment. For example, <!-- comment---> is invalid, but the completely empty comment <!----> is allowed.
Whitespace may occur between the closing -- and >. For example, <!-- comment -- > is a valid, complete comment.

It’s not hard to work these rules into a regex:

<!--[^-]*(?:-[^-]+)*--\s*>
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Everything between the opening and closing comment delimiters is still optional, so the regex matches the completely empty comment <!---->. However, if a hyphen occurs between the delimiters, it must be followed by at least one nonhyphen character. And since the inner portion of the regex can no longer match two hyphens in a row, the lazy quantifier from the regexes at the beginning of this recipe has been replaced with greedy quantifiers. Lazy quantifiers would still work fine, but sticking with them here would result in unnecessary backtracking (see Recipe 2.13).

Some readers might look at this new regex and wonder why the ‹[^-]› negated character class is used twice, rather than just making the hyphen inside the noncapturing group optional (i.e., ‹<!--(?:-?[^-]+)*--\s*>›). The reason is the risk of catastrophic backtracking. In ‹(?:-?[^-]+)*›, the ‹?› and ‹+› quantifiers are nested inside a group that is itself repeated by ‹*›, so a run of nonhyphen characters can be divided among the repetitions in many different ways. If the regex engine cannot find --> at the end of a partial match (as is required when you plug this pattern segment into the comment-matching regex), the engine must try all possible repetition combinations before failing the match attempt and moving on. This number of options expands extremely rapidly with each additional character that the engine must try to match. However, there is nothing dangerous about the nested quantifiers if this situation is avoided. For example, the pattern ‹(?:-[^-]+)*› does not pose a risk even though it contains a nested ‹+› quantifier, because now that exactly one hyphen must be matched per repetition of the group, the potential number of backtracking points increases linearly with the length of the subject string.

Another way to avoid the potential backtracking problem we’ve just described is to use an atomic group. The following is equivalent to the first regex shown in this section, but it’s a few characters shorter and isn’t supported by JavaScript or, before version 3.11, by Python’s re module:

<!--(?>-?[^-]+)*--\s*>
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Ruby

See Recipe 2.14 for the details about how atomic groups (and their counterpart, possessive quantifiers) work.
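
To see the XML comment rules in action, the following Python sketch checks a few strings against the nonatomic pattern ‹<!--[^-]*(?:-[^-]+)*--\s*>›, the form that uses ‹[^-]› twice. The helper name `is_valid_xml_comment` and the sample strings are assumptions made for illustration.

```python
import re

# Nonatomic form of the valid-XML-comment regex: [^-] appears twice,
# and every hyphen inside the comment must be followed by a nonhyphen.
XML_COMMENT = re.compile(r"<!--[^-]*(?:-[^-]+)*--\s*>")


def is_valid_xml_comment(text: str) -> bool:
    """Return True if the whole string is one valid XML comment."""
    return XML_COMMENT.fullmatch(text) is not None


results = {
    "<!-- comment -->": is_valid_xml_comment("<!-- comment -->"),
    "<!---->": is_valid_xml_comment("<!---->"),
    "<!-- comment -- >": is_valid_xml_comment("<!-- comment -- >"),
    "<!-- a-b -->": is_valid_xml_comment("<!-- a-b -->"),
    "<!-- com--ment -->": is_valid_xml_comment("<!-- com--ment -->"),
    "<!-- comment --->": is_valid_xml_comment("<!-- comment --->"),
}
```

The first four strings are accepted. The last two are rejected because of the doubled hyphen and the hyphen directly before the closing delimiter, respectively.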

Find valid HTML comments

HTML 4.01 officially used the XML comment rules we described earlier, but web browsers never paid much attention to the finer points. HTML5 comment syntax has two differences from XML, which brings it closer to what web browsers actually implement. First, whitespace is not allowed between the closing -- and >. Second, the text within comments is not allowed to start with > or -> (in web browsers, that ends the comment early).

Here are the HTML5 comment rules translated into regex:

<!--(?!-?>)[^-]*(?:-[^-]+)*-->
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Compared to the earlier regex for matching valid XML comments, this one doesn’t include ‹\s*› before the trailing ‹>›, and adds the negative lookahead ‹(?!-?>)› just after the opening ‹<!--›. […] ‹<!--.*?-->› (with “dot matches line breaks”) or ‹<!--[\s\S]*?-->› regexes shown in this recipe’s main section.
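
The two HTML5 differences can be checked the same way. This Python sketch assumes the pattern ‹<!--(?!-?>)[^-]*(?:-[^-]+)*-->›, which is the XML pattern without ‹\s*› and with the negative lookahead described above. The helper name and sample strings are illustrative.

```python
import re

# HTML5 variant: no whitespace allowed before the final >, and the
# comment text may not start with > or -> (negative lookahead).
HTML5_COMMENT = re.compile(r"<!--(?!-?>)[^-]*(?:-[^-]+)*-->")


def is_valid_html5_comment(text: str) -> bool:
    """Return True if the whole string is one valid HTML5 comment."""
    return HTML5_COMMENT.fullmatch(text) is not None


html5_results = {
    "<!-- comment -->": is_valid_html5_comment("<!-- comment -->"),
    "<!---->": is_valid_html5_comment("<!---->"),
    "<!-- comment -- >": is_valid_html5_comment("<!-- comment -- >"),
    "<!--> text -->": is_valid_html5_comment("<!--> text -->"),
    "<!---> text -->": is_valid_html5_comment("<!---> text -->"),
}
```

Only the first two strings pass. The third fails because of the space before the final ‹>›, and the last two fail because their text starts with ‹>› or ‹->›.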

Table of Contents for
Regular Expressions Cookbook, 2nd Edition

9.9. Remove XML-Style Comments

Problem

Solution

Discussion

How it works

When comments can’t be removed

Variations

Find valid XML comments

Find valid HTML comments

Tip

See Also
