Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Regular Expressions Cookbook, 2nd Edition by Steven Levithan Published by O'Reilly Media, Inc., 2012

Tags that contain an id attribute (more reliable)

Unlike the regex just shown, this next take on the same problem supports quoted attribute values that contain literal > characters, and it doesn’t match tags that merely contain the word id within one of their attributes’ values:

<(?:[^>"']|"[^"]*"|'[^']*')+?\sid\s*=\s*("[^"]*"|'[^']*')(?:[^>"']|"[^"]*"|'[^']*')*>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
In free-spacing mode:
<
(?: [^>"']     # Tag and attribute names, etc.
  | "[^"]*"    # and quoted attribute values
  | '[^']*'
)+?
\s id          # The target attribute name, as a whole word
\s* = \s*      # Attribute name-value delimiter
( "[^"]*" | '[^']*' )  # Capture the attribute value to backreference 1
(?: [^>"']     # Any remaining characters
  | "[^"]*"    # and quoted attribute values
  | '[^']*'
)*
>
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
This regex captures the id attribute’s value and surrounding quote marks to backreference 1. This allows you to use the value in code outside of the regex or in a replacement string. If you don’t need to reuse the value, you can switch to a noncapturing group or replace the entire ‹\s*=\s*("[^"]*"|'[^']*')› sequence with ‹\b›. The remainder of the regex will pick up the slack and match the id attribute’s value.
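As a rough illustration of using backreference 1 in code outside of the regex, here is a minimal Python sketch. The function name id_values and the sample markup are our own inventions for this example; the pattern is the one shown above, compiled with the case insensitive option.

```python
import re

# The "more reliable" pattern from this recipe, split over two string
# literals only to keep the source lines short.
ID_TAG = re.compile(
    r"""<(?:[^>"']|"[^"]*"|'[^']*')+?\sid\s*=\s*("[^"]*"|'[^']*')"""
    r"""(?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)

def id_values(markup):
    """Return the id value of each matching tag, minus the quote marks."""
    # Group 1 holds the value together with its surrounding quotes,
    # so slice off the first and last character.
    return [m.group(1)[1:-1] for m in ID_TAG.finditer(markup)]

# Hypothetical sample: a quoted ">" before the id, a tag that only has
# "id" inside another attribute's value, and an uppercase ID attribute.
sample = (
    '<p title="a > b" id="intro">text</p>'
    '<span data-x="id=nope">x</span>'
    '<img ID=\'logo\' src="l.png">'
)

print(id_values(sample))  # ['intro', 'logo']
```

The <span> tag is skipped because its only "id" sits inside a quoted value, which the regex consumes in a single step.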

<div> tags that contain an id attribute

To search for a specific tag type, you need to add its name to the beginning of the regex and make a couple of other minor changes. In the following regex, we’ve added ‹div\s› after the opening ‹<›. The ‹\s› (whitespace) token ensures that we don’t match tags whose names merely start with the letters “div.” We know there will be a whitespace character following the tag name because the tags we’re searching for have at least one attribute (id). Additionally, the ‹+?\sid› sequence has been changed to ‹*?\bid›, so that the regex works when id is the first attribute within the tag and there are no additional separating characters (beyond the initial space) after the tag name:

<div\s(?:[^>"']|"[^"]*"|'[^']*')*?\bid\s*=\s*("[^"]*"|'[^']*')(?:[^>"']|"[^"]*"|'[^']*')*>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Here is the same thing in free-spacing mode:
<div \s        # Tag name and following whitespace character
(?: [^>"']     # Tag and attribute names, etc.
  | "[^"]*"    # and quoted attribute values
  | '[^']*'
)*?
\b id          # The target attribute name, as a whole word
\s* = \s*      # Attribute name-value delimiter
( "[^"]*" | '[^']*' )  # Capture the attribute value to backreference 1
(?: [^>"']     # Any remaining characters
  | "[^"]*"    # and quoted attribute values
  | '[^']*'
)*
>
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
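To see the effect of the ‹div\s› prefix, the following Python sketch compiles the free-spacing version with re.VERBOSE and runs it over a small sample. The function name div_ids and the sample markup are assumptions made for illustration only.

```python
import re

# Free-spacing form of the <div> regex; re.VERBOSE ignores the layout
# whitespace and the "#" comments inside the pattern.
DIV_WITH_ID = re.compile(
    r"""
    <div \s                 # Tag name and following whitespace character
    (?: [^>"']              # Tag and attribute names, etc.
      | "[^"]*"             # and quoted attribute values
      | '[^']*'
    )*?
    \b id                   # The target attribute name, as a whole word
    \s* = \s*               # Attribute name-value delimiter
    ( "[^"]*" | '[^']*' )   # Capture the attribute value to group 1
    (?: [^>"']              # Any remaining characters
      | "[^"]*"             # and quoted attribute values
      | '[^']*'
    )*
    >
    """,
    re.IGNORECASE | re.VERBOSE,
)

def div_ids(markup):
    """Return the id value of each <div> tag that has one, without quotes."""
    return [m.group(1)[1:-1] for m in DIV_WITH_ID.finditer(markup)]

# Hypothetical sample: <divider> starts with "div" but is a different tag.
sample = '<divider id="no"><div id="main"><DIV class="box" id=\'side\'>'

print(div_ids(sample))  # ['main', 'side']
```

The first <div> has id as its first attribute, which is the case the ‹*?\bid› change is meant to handle.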

Tags that contain an id attribute with the value “my-id”

Compared to the regex in “Tags that contain an id attribute (more reliable),” this time we’ll remove the capturing group around the id attribute’s value since we know the value in advance. Specifically, the subpattern ‹("[^"]*"|'[^']*')› has been replaced with ‹(?:"my-id"|'my-id')›:

<(?:[^>"']|"[^"]*"|'[^']*')+?\sid\s*=\s*(?:"my-id"|'my-id')(?:[^>"']|"[^"]*"|'[^']*')*>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
And the free-spacing version:
<
(?: [^>"']     # Tag and attribute names, etc.
  | "[^"]*"    # and quoted attribute values
  | '[^']*'
)+?
\s id          # The target attribute name, as a whole word
\s* = \s*      # Attribute name-value delimiter
(?: "my-id"    # The target attribute value
  | 'my-id' )  # surrounded by single or double quotes
(?: [^>"']     # Any remaining characters
  | "[^"]*"    # and quoted attribute values
  | '[^']*'
)*
>
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
Going back to the ‹(?:"my-id"|'my-id')› subpattern for a second, you could alternatively avoid repeating “my-id” (at the cost of some efficiency) by using ‹(["'])my-id\1›. That uses a capturing group and backreference to ensure that the value starts and ends with the same type of quote mark.
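The backreference variant can be tried out with a short Python sketch like the one below. It assumes the ‹(["'])my-id\1› subpattern is dropped into the full regex in place of ‹(?:"my-id"|'my-id')›; the function name has_my_id and the test tags are hypothetical.

```python
import re

# Same regex as above, but the value is matched with a capturing group
# for the opening quote and a backreference (\1) for the closing quote.
MY_ID_TAG = re.compile(
    r"""<(?:[^>"']|"[^"]*"|'[^']*')+?\sid\s*=\s*(["'])my-id\1"""
    r"""(?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)

def has_my_id(markup):
    """Report whether any tag in markup has id set to my-id."""
    return MY_ID_TAG.search(markup) is not None

print(has_my_id('<p id="my-id">'))            # True
print(has_my_id("<p id='my-id' class=\"x\">"))  # True
print(has_my_id('<p id="my-id\'>'))           # False: quote marks differ
print(has_my_id('<p id="my-id-2">'))          # False: value is longer
```

Because the case insensitive option is on, this sketch would also accept a value such as MY-ID; drop re.IGNORECASE if that is not what you want.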

Tags that contain “my-class” within their class attribute value

If the previous regular expressions haven’t already passed this threshold, this is where it becomes obvious that we’re pushing the boundary of what can sensibly be accomplished using a single regex. Splitting up the process using multiple regexes helps, so we’ll split this search into three parts. The first regex will match tags, the next will find the class attribute within each tag (and store its value in a backreference), and finally we’ll search within that value for my-class.

Find tags:

<(?:[^>"']|"[^"]*"|'[^']*')+>
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Tip
Recipe 9.1 is dedicated to matching XML-style tags. It explains how the regex just shown works, and provides a number of alternatives with varying degrees of complexity and accuracy.
Next, follow the code in Recipe 3.13 to search within each match for a class attribute using the following regex:
^(?:[^>"']|"[^"]*"|'[^']*')+?\sclass\s*=\s*("[^"]*"|'[^']*')
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
This captures the entire class value and its surrounding quote marks to backreference 1. Everything before the class attribute is matched using ‹^(?:[^>"']|"[^"]*"|'[^']*')+?›, which matches quoted values in single steps to avoid finding the word “class” inside another attribute’s value. On the right side of the pattern, the match ends as soon as we reach the end of the class attribute’s value. Nothing after that is relevant to our search, so there’s no reason to match all the way to the end of the tag within which we’re searching.
The caret at the beginning of the regex anchors it to the start of the subject string. This doesn’t change what is matched, but it’s there so that if the regex engine can’t find a match starting at the beginning of the string, it doesn’t try again (and inevitably fail) at each subsequent character position.
Finally, if both of the previous regexes found matches, use the following pattern to search within backreference 1 of each match found by the second regex:
["'\s]my-class["'\s]
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Since classes are separated by whitespace, my-class must be bordered on both ends by either whitespace or a quote mark. If it weren’t for the fact that class names can include hyphens, you could use word boundary tokens instead of the two character classes here. However, hyphens create word boundaries, and thus ‹\bmy-class\b› would match within not-my-class.
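The three steps can be chained together along the following lines. This is only a sketch in Python under our own assumptions: the function name tags_with_my_class and the sample markup are invented for the example, and the nested-search approach loosely follows the "match within another match" idea rather than reproducing any particular code from Recipe 3.13.

```python
import re

# Step 1: find tags.
TAG = re.compile(r"""<(?:[^>"']|"[^"]*"|'[^']*')+>""")

# Step 2: find the class attribute within a tag; group 1 is the quoted value.
CLASS_ATTR = re.compile(
    r"""^(?:[^>"']|"[^"]*"|'[^']*')+?\sclass\s*=\s*("[^"]*"|'[^']*')""",
    re.IGNORECASE,
)

# Step 3: find my-class bordered by whitespace or a quote mark.
MY_CLASS = re.compile(r"""["'\s]my-class["'\s]""")

def tags_with_my_class(markup):
    """Return each tag whose class attribute value contains my-class."""
    found = []
    for tag in TAG.finditer(markup):
        attr = CLASS_ATTR.search(tag.group())
        # Search the captured value, quote marks included, so that a class
        # at the very start or end of the value still has a border character.
        if attr and MY_CLASS.search(attr.group(1)):
            found.append(tag.group())
    return found

# Hypothetical sample markup.
sample = (
    '<p class="note my-class">a</p>'
    '<p class="not-my-class">b</p>'
    '<a title="class=my-class" href=\'#\'>c</a>'
)

print(tags_with_my_class(sample))  # ['<p class="note my-class">']
```

The second tag is rejected in step 3 because my-class is preceded by a hyphen, and the third is rejected in step 2 because its only "class" text sits inside the title value.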

Discussion

The “Solution” section of this recipe already covers the details of how these regular expressions work, so we’ll avoid rehashing it all here. Remember that regular expressions are often not the ideal solution for markup searches, especially those that reach the complexity described in this recipe. Before using these regular expressions, consider whether you’d be better served by an alternative solution, such as XPath, a SAX parser, or a DOM. We’ve included these regexes since it’s not uncommon for people to try to pull off this kind of thing, but don’t say you weren’t warned. Hopefully this has at least helped to show some of the issues involved in markup searches, and helped you avoid even more naïve solutions.
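For comparison, here is roughly what the parser route might look like in Python. Note the substitution: the sketch uses the standard library’s html.parser module, an event-driven parser in the spirit of SAX, rather than any of the specific tools named above. The class name IdCollector and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

class IdCollector(HTMLParser):
    """Collect (tag name, id value) pairs from start tags."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser has already
        # lowercased the names and removed any surrounding quote marks.
        for name, value in attrs:
            if name == "id":
                self.ids.append((tag, value))

collector = IdCollector()
# Hypothetical sample; the second id value is deliberately unquoted.
collector.feed('<div ID="main"><p class="x" id=intro>text</p></div>')

print(collector.ids)  # [('div', 'main'), ('p', 'intro')]
```

The parser takes care of attribute quoting and name case for you, which is much of what the regexes in this recipe spend their complexity on.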

Tip

The regular expressions in this recipe are written with the expectation that attribute values are always enclosed in single or double quotes. Unquoted attribute values are not supported.

9.7. Find a Specific Attribute in XML-Style Tags

Problem

Solution

Tags that contain an id attribute (quick and dirty)

Tags that contain an id attribute (more reliable)

<div> tags that contain an id attribute

Tags that contain an id attribute with the value “my-id”

Tags that contain “my-class” within their class attribute value

Tip

Discussion

Tip

See Also
