Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Regular Expressions Cookbook, 2nd Edition by Steven Levithan Published by O'Reilly Media, Inc., 2012

Solution 2: Match tags except <em> and <strong>, and any tags that contain attributes

With one change (replacing the ‹\b› with ‹\s*>›), you can make the regex also match any <em> and <strong> tags that contain attributes:

</?(?!(?:em|strong)\s*>)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Once again, the same regex in free-spacing mode:
<
/?                    # Permit closing tags
(?!
  (?: em | strong )   # List of tags to avoid matching
  \s* >               # Only avoid tags if they contain no attributes
)
[a-z]                 # Tag name initial character must be a-z
(?: [^>"']            # Any character except >, ", or '
  | "[^"]*"           # Double-quoted attribute value
  | '[^']*'           # Single-quoted attribute value
)*
>
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
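
As a rough illustration of how the free-spacing form might be used in one of the listed flavors, here is a minimal Python sketch. The names `TAG_PATTERN` and `strip_tags` are made up for this example; the pattern itself is the Solution 2 regex, compiled with `re.VERBOSE` (Python's free-spacing option) and `re.IGNORECASE`.

```python
import re

# Solution 2 in free-spacing form. With re.VERBOSE, unescaped whitespace
# outside character classes is ignored and "#" starts a comment.
TAG_PATTERN = re.compile(r"""
    < /?                   # Permit closing tags
    (?!
        (?: em | strong )  # List of tags to avoid matching
        \s* >              # Only avoid tags if they contain no attributes
    )
    [a-z]                  # Tag name initial character must be a-z
    (?: [^>"']             # Any character except >, ", or '
      | "[^"]*"            # Double-quoted attribute value
      | '[^']*'            # Single-quoted attribute value
    )*
    >
""", re.IGNORECASE | re.VERBOSE)


def strip_tags(markup):
    """Remove every tag the pattern matches (hypothetical helper)."""
    return TAG_PATTERN.sub("", markup)


print(strip_tags('<p>Some <em>important</em> <b class="x">text</b></p>'))
```

Attribute-free `<em>` and `<strong>` tags survive because the negative lookahead rejects them; everything else the pattern recognizes as a tag is replaced with an empty string.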

Discussion

This recipe’s regular expressions have a lot in common with those we’ve included earlier in this chapter for matching XML-style tags. Apart from the negative lookahead added to prevent some tags from being matched, these regexes are nearly equivalent to the “(X)HTML tags (loose)” regex from Recipe 9.1. The other main difference here is that we’re not capturing the tag name to backreference 1.

So let’s look more closely at what’s new in this recipe. Solution 1 never matches <em> or <strong> tags, regardless of whether they have any attributes, but matches all other tags. Solution 2 matches all the same tags as Solution 1, and additionally matches <em> and <strong> tags that contain one or more attributes. Table 9-2 shows a few example subject strings that illustrate this.

Table 9-2. A few example subject strings

Subject string	Solution 1	Solution 2
`<i>`	Match	Match
`</i>`	Match	Match
`<i style="font-size:500%; color:red;">`	Match	Match
`<em>`	No match	No match
`</em>`	No match	No match
`<em style="font-size:500%; color:red;">`	No match	Match

Since the point of these regexes is to replace matches with empty strings (in other words, remove the tags), Solution 2 is less prone to abuse of the allowed <em> and <strong> tags to provide unexpected formatting or other shenanigans.
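
The rows of Table 9-2 can be checked with a short Python sketch. One assumption to flag: the Solution 1 regex is not shown in this excerpt, so it is reconstructed here by undoing the "one change" described above (putting ‹\b› back in place of ‹\s*>›). The names `SOLUTION_1`, `SOLUTION_2`, and `classify` are illustrative only.

```python
import re

# Shared tail: attribute text followed by the closing ">".
REST_OF_TAG = r"""(?:[^>"']|"[^"]*"|'[^']*')*>"""

# Assumption: Solution 1 reconstructed from the text (\b instead of \s*>).
SOLUTION_1 = re.compile(r"</?(?!(?:em|strong)\b)[a-z]" + REST_OF_TAG, re.IGNORECASE)
SOLUTION_2 = re.compile(r"</?(?!(?:em|strong)\s*>)[a-z]" + REST_OF_TAG, re.IGNORECASE)

SUBJECTS = [
    '<i>',
    '</i>',
    '<i style="font-size:500%; color:red;">',
    '<em>',
    '</em>',
    '<em style="font-size:500%; color:red;">',
]


def classify(subject):
    """Return (matched by Solution 1, matched by Solution 2) for one tag."""
    return (
        SOLUTION_1.fullmatch(subject) is not None,
        SOLUTION_2.fullmatch(subject) is not None,
    )


for subject in SUBJECTS:
    s1, s2 = classify(subject)
    label_1 = "Match" if s1 else "No match"
    label_2 = "Match" if s2 else "No match"
    print(f"{subject:45} {label_1:10} {label_2}")
```

Only the last subject string separates the two solutions: the ‹\b› version skips any tag named em, while the ‹\s*>› version skips it only when nothing but optional whitespace precedes the closing `>`.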

Caution

This recipe has (until now) intentionally avoided the word “whitelist” when describing how only a few tags are left in place, since that word has security connotations. There are a variety of ways to work around this pattern’s constraints using specially crafted, malicious HTML strings. If you’re worried about malicious HTML and cross-site scripting (XSS) attacks, your safest bet is to convert all <, >, and & characters to their corresponding named character references (&lt;, &gt;, and &amp;), then bring back tags that are known to be safe (as long as they contain no attributes or only use those within a select list of approved attributes). style is an example of an attribute that is not safe, since some browsers let you embed scripting language code in your CSS. To bring back <em> and <strong> tags with no attributes after replacing <, >, and & with character references, search case-insensitively using the regex ‹&lt;(/?)(em|strong)&gt;› and replace matches with «<$1$2>» (or in Python and Ruby, «<\1\2>»).
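
The escape-then-restore approach might look like the following in Python. This is a sketch under stated assumptions, not a vetted sanitizer: `sanitize` and `SAFE_TAG` are names invented for the example, and the standard library's `html.escape` (with `quote=False`, so only &, <, and > are converted) stands in for the character-reference step.

```python
import html
import re

# Escaped form of attribute-free <em>, </em>, <strong>, and </strong>.
SAFE_TAG = re.compile(r"&lt;(/?)(em|strong)&gt;", re.IGNORECASE)


def sanitize(text):
    """Escape all markup, then restore attribute-free <em>/<strong> tags."""
    # Step 1: convert &, <, and > to &amp;, &lt;, and &gt;.
    escaped = html.escape(text, quote=False)
    # Step 2: bring back only the tags known to be safe.
    return SAFE_TAG.sub(r"<\1\2>", escaped)


print(sanitize('<em>Hi</em> <script>alert(1)</script> & more'))
```

Because opening and closing tags are restored independently, input with unbalanced or attribute-bearing `<em>` tags can leave a stray closing tag in the output; the sketch makes no attempt to balance tags.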

Variations

Whitelist specific attributes

Consider these new requirements: you need to match all tags except <a>, <em>, and <strong>, with two exceptions. Any <a> tags that have attributes other than href or title should be matched, and if <em> or <strong> tags have any attributes at all, match them too. All matched strings will be removed.

In other words, you want to remove all tags except those on your whitelist (<a>, <em>, and <strong>). The only whitelisted attributes are href and title, and they are allowed only within <a> tags. If a nonwhitelisted attribute appears in any tag, the entire tag should be removed.

Here’s a regex that can get the job done:

</?(?!(?:em|strong|a(?:\s+(?:href|title)\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*>)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
With free-spacing:
<
/?                         # Permit closing tags
(?!
  (?: em                   # Don't match <em>
    | strong               #   or <strong>
    | a                    #   or <a>
      (?:                  # Only avoid matching <a> tags that use only
        \s+                #   href and/or title attributes
        (?:href|title)
        \s*=\s*
        (?:"[^"]*"|'[^']*')  # Quoted attribute value
      )*
  )
  \s* >                    # Only avoid matching these tags when they're
)                          #   limited to any attributes permitted above
[a-z]                      # Tag name initial character must be a-z
(?: [^>"']                 # Any character except >, ", or '
  | "[^"]*"                # Double-quoted attribute value
  | '[^']*'                # Single-quoted attribute value
)*
>
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
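
To see which tags this pattern would remove, here is a small Python sketch. It uses the form with the leading ‹/?› so that closing tags are handled as in the free-spacing listing; `WHITELIST_TAGS` and `is_removed` are hypothetical names chosen for the example.

```python
import re

# The whitelist regex, split across two adjacent string literals for width.
WHITELIST_TAGS = re.compile(
    r"""</?(?!(?:em|strong|a(?:\s+(?:href|title)\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*>)"""
    r"""[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)


def is_removed(tag):
    """True if the regex matches the whole tag, so it would be stripped."""
    return WHITELIST_TAGS.fullmatch(tag) is not None


for tag in ['<a href="/x" title=\'t\'>', '<a href="/x" onclick="e()">',
            '</a>', '<em>', '<em class="c">', '<abbr>', '<p>']:
    print(f"{tag:32} {'removed' if is_removed(tag) else 'kept'}")
```

Note that an unquoted attribute value such as `<a href=/x>` is also removed, since the lookahead only recognizes quoted values.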
This pushes the boundary of where it makes sense to use such a complicated regex. If your rules get any more complex than this, it would probably be better to write some code based on Recipe 3.11 or 3.16 that checks the value of each matched tag to determine how to process it (based on the tag name, included attributes, or whatever else is needed).
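
One possible shape for that code-based alternative is sketched below in Python: a loose regex finds every tag, and a callback decides tag by tag what to keep. Everything here (`ALLOWED`, `filter_tag`, `filter_tags`, and the helper patterns) is an assumption made for illustration, not code from the referenced recipes, and its behavior is close to but not identical with the single regex above (for example, it removes closing tags that carry attributes).

```python
import re

# Tag name -> set of attribute names permitted on that tag.
ALLOWED = {"em": set(), "strong": set(), "a": {"href", "title"}}

# Group 1: "/" for closing tags; group 2: tag name; group 3: attribute text.
TAG = re.compile(
    r"""<(/?)([a-z][a-z0-9]*)((?:[^>"']|"[^"]*"|'[^']*')*)>""", re.IGNORECASE
)
# One attribute with a quoted value; group 1 captures the attribute name.
ATTR = r"""\s+([a-z][a-z0-9-]*)\s*=\s*(?:"[^"]*"|'[^']*')"""
ATTR_NAME = re.compile(ATTR, re.IGNORECASE)
ATTR_LIST = re.compile("(?:" + ATTR + r")*\s*", re.IGNORECASE)


def filter_tag(match):
    """Return the tag unchanged if it passes the whitelist, else ''."""
    closing, name, attrs = match.group(1), match.group(2).lower(), match.group(3)
    allowed = ALLOWED.get(name)
    if allowed is None:
        return ""                      # Tag name not on the whitelist
    if closing:
        return match.group(0) if not attrs.strip() else ""
    if not ATTR_LIST.fullmatch(attrs):
        return ""                      # Unquoted or malformed attributes
    names = {n.lower() for n in ATTR_NAME.findall(attrs)}
    return match.group(0) if names <= allowed else ""


def filter_tags(markup):
    return TAG.sub(filter_tag, markup)


print(filter_tags('<p><a href="/x" title="T">ok</a> <span class="c">text</span> '
                  '<em>e</em> <strong>s</strong></p>'))
```

Keeping the rules in a dictionary means adding a tag or attribute is a data change rather than another branch in an already dense regex.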

Table of Contents for
Regular Expressions Cookbook, 2nd Edition

9.3. Remove All XML-Style Tags Except <em> and <strong>

Problem

Solution

Solution 1: Match tags except <em> and <strong>

Solution 2: Match tags except <em> and <strong>, and any tags that contain attributes

Discussion

Caution

Variations

Whitelist specific attributes

See Also
