Table of Contents for
Regular Expressions Cookbook, 2nd Edition


Regular Expressions Cookbook, 2nd Edition by Steven Levithan Published by O'Reilly Media, Inc., 2012

Discussion

The previous recipe (9.1) included a detailed discussion of many ways to match any XML-style tag. That frees this recipe to focus on a straightforward approach to searching for a specific type of tag. <b> and its replacement <strong> are offered as examples, but you can substitute any two other tag names.

The regex starts by matching a literal ‹<›—the first character of any tag. It then optionally matches the forward slash found in closing tags using ‹/?›, within capturing parentheses. Capturing the result of this pattern (which will be either an empty string or a forward slash) allows you to easily restore the forward slash in the replacement string, without any conditional logic.

Next, we match the tag name itself, ‹b›. You could use any other tag name instead if you wanted to. Use the case-insensitive option to make sure that you also match an uppercase B.

The word boundary (‹\b›) that follows the tag name is easy to forget, but it’s one of the most important pieces of this regex. The word boundary lets us match only <b> tags, and not <br>, <body>, <blockquote>, or any other tags that merely start with the letter “b.” We could alternatively match a whitespace token (‹\s›) after the name as a safeguard against this same problem, but that wouldn’t work for tags that have no attributes and thus might not have any whitespace following their tag name. The word boundary solves this problem simply and elegantly.
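
The effect of the word boundary can be checked with a small Python sketch. The pattern fragment below is assembled from the pieces described so far (the literal <, the optional slash, the tag name, and the boundary); the sample tag list and the variable names are made up for illustration.

```python
import re

# Fragment assembled from the discussion: "<", optional "/", "b", word boundary.
with_boundary = re.compile(r"<(/?)b\b", re.IGNORECASE)
# The same fragment without the boundary, for comparison only.
without_boundary = re.compile(r"<(/?)b", re.IGNORECASE)

# Illustrative sample tags (not taken from the recipe).
samples = ["<b>", "</B>", "<b class='x'>", "<br>", "<body>", "<blockquote>"]

matched_with = [s for s in samples if with_boundary.match(s)]
matched_without = [s for s in samples if without_boundary.match(s)]

print(matched_with)     # only the three real <b> tags
print(matched_without)  # every sample, including <br>, <body>, <blockquote>
```

Without the boundary, every tag whose name merely starts with "b" is treated as a <b> tag.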

Tip

When working with XML and XHTML, be aware that the colon used for namespaces, as well as hyphens and some other characters allowed as part of XML names, create a word boundary. For example, the regex could end up matching something like <b-sharp>. If you’re worried about this, you might want to use the lookahead ‹(?=[\s/>])› instead of a word boundary. It achieves the same result of ensuring that we do not match partial tag names, and does so more reliably.
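
As a rough comparison of the two approaches in this tip, the following Python sketch runs the word-boundary version and the lookahead version against a few names. The <b:note> sample is a hypothetical namespaced-looking name added for illustration; only <b-sharp> comes from the text above.

```python
import re

# Word-boundary version of the start of the regex.
boundary = re.compile(r"<(/?)b\b", re.IGNORECASE)
# Lookahead version suggested in the tip: the name must be followed by
# whitespace, a slash, or the closing angle bracket.
lookahead = re.compile(r"<(/?)b(?=[\s/>])", re.IGNORECASE)

# "<b:note>" is a hypothetical example; "<b-sharp>" is from the tip.
samples = ["<b>", "<b id='x'>", "<b/>", "<b-sharp>", "<b:note>"]

boundary_hits = [s for s in samples if boundary.match(s)]
lookahead_hits = [s for s in samples if lookahead.match(s)]

print(boundary_hits)   # all five samples: "-" and ":" create a word boundary
print(lookahead_hits)  # only the three genuine <b> tags
```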

After the tag name, the pattern ‹((?:[^>"']|"[^"]*"|'[^']*')*)› is used to match anything remaining within the tag up until the closing right angle bracket. Wrapping this pattern in a capturing group as we’ve done here lets us easily bring back any attributes and other characters (such as the trailing slash for singleton tags) in our replacement string. Within the capturing parentheses, the pattern repeats a noncapturing group with three alternatives. The first, ‹[^>"']›, matches any single character except >, ", or '. The remaining two alternatives match an entire double- or single-quoted string, which lets you match attribute values that contain right angle brackets without having the regex think it has found the end of the tag.
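
Putting the pieces of this discussion together, here is a minimal Python sketch of the whole replacement. The regex is assembled from the parts described above, and the replacement string uses Python's \1 and \2 backreference syntax. The sample HTML and the names B_TAG, html, and result are assumptions for illustration.

```python
import re

# Assembled from the discussion: "<", captured optional slash, "b",
# word boundary, captured remainder of the tag, ">".
B_TAG = re.compile(
    r"""<(/?)b\b((?:[^>"']|"[^"]*"|'[^']*')*)>""",
    re.IGNORECASE,
)

# Illustrative input: an attribute value containing ">", a <br/> tag that
# must be left alone, and an uppercase <B> tag.
html = '<b title="a > b">bold</b> and <br/> <B>caps</B>'

# \1 restores the optional slash, \2 restores attributes and anything else
# that appeared before the closing ">".
result = B_TAG.sub(r"<\1strong\2>", html)

print(result)
# <strong title="a > b">bold</strong> and <br/> <strong>caps</strong>
```

The quoted-string alternatives are what keep the > inside the title value from being mistaken for the end of the tag.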

Variations

Replace a list of tags

If you want to match any tag from a list of tag names, a simple change is needed. Place all of the desired tag names within a group, and alternate between them.

The following regex matches opening and closing <b>, <i>, <em>, and <big> tags. The replacement text shown later replaces all of them with a corresponding <strong> or </strong> tag, while preserving any attributes:

<(/?)([bi]|em|big)\b((?:[^>"']|"[^"]*"|'[^']*')*)>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Here’s the same regex in free-spacing mode:
<
(/?)             # Capture the optional leading slash to backreference 1
([bi]|em|big) \b # Capture the tag name to backreference 2
(                # Capture any attributes, etc. to backreference 3
  (?: [^>"']     # Any character except >, ", or '
    | "[^"]*"    # Double-quoted attribute value
    | '[^']*'    # Single-quoted attribute value
  )*
)
>
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
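
In Python, one way to use the free-spacing form is the re.VERBOSE flag. The sketch below is an illustration under that assumption; the name TAG_LIST and the sample tags are hypothetical.

```python
import re

# Free-spacing version of the regex, compiled with re.VERBOSE so that
# whitespace and "#" comments outside character classes are ignored.
TAG_LIST = re.compile(r"""
    <
    (/?)                 # Group 1: optional leading slash
    ([bi]|em|big) \b     # Group 2: tag name
    (                    # Group 3: attributes, etc.
      (?: [^>"']         # Any character except >, ", or '
        | "[^"]*"        # Double-quoted attribute value
        | '[^']*'        # Single-quoted attribute value
      )*
    )
    >
""", re.IGNORECASE | re.VERBOSE)

# Illustrative checks: one tag from the list, one that only looks similar.
em_groups = TAG_LIST.match('<EM class="x">').groups()
img_match = TAG_LIST.match("<img src='x.png'>")

print(em_groups)  # ('', 'EM', ' class="x"')
print(img_match)  # None
```
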
We’ve used the character class ‹[bi]› to match both <b> and <i> tags, rather than separating them with the alternation metacharacter ‹|› as we’ve done for <em> and <big>. Character classes are faster than alternation because they are implemented using bit vectors (or other fast implementations) rather than backtracking. When the difference between two options is a single character, use a character class.
We’ve also added a capturing group for the tag name, which shifted the group that matches attributes, etc. to store its match as backreference 3. Although there’s no need to refer back to the tag name if you’re just going to replace all matches with <strong> tags, storing the tag name in its own backreference can help you check what type of tag was matched, when needed.
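
As one possible use of the tag-name group, this Python sketch tallies which tags were matched by reading backreference 2. The sample HTML and the names TAG_LIST, html, and counts are made up for illustration.

```python
import re
from collections import Counter

# The tag-list regex from this variation, written on a single line.
TAG_LIST = re.compile(
    r"""<(/?)([bi]|em|big)\b((?:[^>"']|"[^"]*"|'[^']*')*)>""",
    re.IGNORECASE,
)

# Illustrative input; <img> starts with "i" but is not in the tag list.
html = "<i>one</i> <B>two</B> <big>three</big> <img src='x.png'>"

# Group 2 holds the tag name; lowercase it so <B> and <b> count together.
counts = Counter(m.group(2).lower() for m in TAG_LIST.finditer(html))

print(dict(counts))  # {'i': 2, 'b': 2, 'big': 2}
```
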
To preserve all attributes while replacing the tag name, use the following replacement text:
<$1strong$3>
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
<\1strong\3>
Replacement text flavors: Python, Ruby
Omit backreference 3 in the replacement string if you want to discard attributes for matched tags as part of the same process:
<$1strong>
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
<\1strong>
Replacement text flavors: Python, Ruby
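
The two replacement strings above can be tried side by side in Python. This is a sketch with invented sample input; TAG_LIST, html, kept, and dropped are illustrative names, and <bigger> is a hypothetical tag included to show that it is left untouched.

```python
import re

# The tag-list regex from this variation.
TAG_LIST = re.compile(
    r"""<(/?)([bi]|em|big)\b((?:[^>"']|"[^"]*"|'[^']*')*)>""",
    re.IGNORECASE,
)

# Illustrative input; <bigger> is not in the tag list and should not change.
html = '<i class="k">x</i> <em>y</em> <bigger>z</bigger>'

# Keep attributes: reinsert backreference 3 after the new tag name.
kept = TAG_LIST.sub(r"<\1strong\3>", html)
# Discard attributes: leave backreference 3 out of the replacement.
dropped = TAG_LIST.sub(r"<\1strong>", html)

print(kept)     # <strong class="k">x</strong> <strong>y</strong> <bigger>z</bigger>
print(dropped)  # <strong>x</strong> <strong>y</strong> <bigger>z</bigger>
```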

Table of Contents for
Regular Expressions Cookbook, 2nd Edition

9.2. Replace <b> Tags with <strong>

Problem

Solution

Discussion

Tip

Variations

Replace a list of tags

See Also
