Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Regular Expressions Cookbook, 2nd Edition by Steven Levithan Published by O'Reilly Media, Inc., 2012

Hexadecimal number

\b[a-f0-9]{1,8}\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Hexadecimal number with optional suffix

\b[a-f0-9]{1,8}h?\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Floating-point number

\d*\.\d+(e\d+)?
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
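As a rough illustration, the three solution regexes above can be tried with Python's standard `re` module. The sample strings and variable names are assumptions made up for this sketch, and `re.IGNORECASE` stands in for the "Case insensitive" option.

```python
import re

# The three solution regexes, compiled with case insensitivity turned on.
HEX = re.compile(r"\b[a-f0-9]{1,8}\b", re.IGNORECASE)
HEX_SUFFIX = re.compile(r"\b[a-f0-9]{1,8}h?\b", re.IGNORECASE)
FLOAT = re.compile(r"\d*\.\d+(e\d+)?", re.IGNORECASE)

# A 9-digit run is rejected: no word boundary falls inside the run,
# so the {1,8} limit cannot be satisfied anywhere in it.
hex_matches = HEX.findall("DEADBEEF ff 123456789")

# The trailing h is optional, so both spellings match.
suffix_matches = HEX_SUFFIX.findall("1Fh 1F")

# FLOAT contains a capturing group, so use finditer() to get whole matches.
float_matches = [m.group() for m in FLOAT.finditer("3.14 .5 1.5E10")]

print(hex_matches)
print(suffix_matches)
print(float_matches)
```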

Discussion

Fixed repetition

The quantifier ‹{n}›, where n is a nonnegative integer, repeats the preceding regex token exactly n times. The ‹\d{100}› in ‹\b\d{100}\b› matches a string of 100 digits. You could achieve the same by typing ‹\d› 100 times.

‹{1}› repeats the preceding token once, as it would without any quantifier. ‹ab{1}c› is the same regex as ‹abc›.

‹{0}› repeats the preceding token zero times, essentially deleting it from the regular expression. ‹ab{0}c› is the same regex as ‹ac›.
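The equivalences described above can be checked with a short Python sketch. The test strings are invented for the example; only the regexes come from the text.

```python
import re

fixed = re.compile(r"\b\d{100}\b")
# The same regex with \d typed out 100 times instead of using {100}.
spelled_out = re.compile(r"\b" + r"\d" * 100 + r"\b")

hundred_digits = "7" * 100
too_long = "7" * 101

# Both forms accept exactly 100 digits and reject 101.
matches_100 = [bool(p.search(hundred_digits)) for p in (fixed, spelled_out)]
matches_101 = [bool(p.search(too_long)) for p in (fixed, spelled_out)]

# {1} changes nothing, and {0} effectively deletes the token.
one_is_noop = bool(re.fullmatch(r"ab{1}c", "abc"))
zero_deletes = bool(re.fullmatch(r"ab{0}c", "ac"))
zero_rejects_b = re.fullmatch(r"ab{0}c", "abc") is None

print(matches_100, matches_101, one_is_noop, zero_deletes, zero_rejects_b)
```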

Variable repetition

For variable repetition, we use the quantifier ‹{n,m}›, where n is a nonnegative integer and m is greater than or equal to n. ‹\b[a-f0-9]{1,8}\b› matches a hexadecimal number with one to eight digits. With variable repetition, the order in which the alternatives are attempted comes into play. Recipe 2.13 explains that in detail.

If n and m are equal, we have fixed repetition. ‹\b\d{100,100}\b› is the same regex as ‹\b\d{100}\b›.
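A small Python sketch of the range behavior, under the assumption that we test whole strings with `fullmatch()` rather than relying on the word boundaries; the inputs are made up.

```python
import re

hex_digits = re.compile(r"[a-f0-9]{1,8}", re.IGNORECASE)

# Collect the string lengths (0 through 10) that the quantifier accepts.
accepted_lengths = [n for n in range(11) if hex_digits.fullmatch("a" * n)]

# Without boundaries, the quantifier takes as many digits as it is allowed.
longest_prefix = hex_digits.match("0123456789").group()

# {100,100} behaves like {100}.
same_as_fixed = (
    bool(re.fullmatch(r"\d{100,100}", "1" * 100))
    and re.fullmatch(r"\d{100,100}", "1" * 99) is None
)

print(accepted_lengths, longest_prefix, same_as_fixed)
```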

Infinite repetition

The quantifier ‹{n,}›, where n is a nonnegative integer, allows for infinite repetition. Essentially, infinite repetition is variable repetition without an upper limit.

‹\d{1,}› matches one or more digits, and ‹\d+› does the same. A plus after a regex token that’s not a quantifier means “one or more.” Recipe 2.13 shows the meaning of a plus after a quantifier.

‹\d{0,}› matches zero or more digits, and ‹\d*› does the same. The asterisk always means “zero or more.” In addition to allowing infinite repetition, ‹{0,}› and the asterisk also make the preceding token optional.
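To see that the brace forms and the shorthand forms behave alike, one can compare their results in Python. The subject string is an assumption chosen for illustration.

```python
import re

text = "a1b22c333"

# {1,} and + both mean "one or more".
one_or_more_braces = re.findall(r"\d{1,}", text)
one_or_more_plus = re.findall(r"\d+", text)

# {0,} and * both mean "zero or more", so they also produce empty matches
# wherever no digit is present.
zero_or_more_braces = re.findall(r"\d{0,}", text)
zero_or_more_star = re.findall(r"\d*", text)

print(one_or_more_plus)
print(zero_or_more_star == zero_or_more_braces)
```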

Making something optional

If we use variable repetition with n set to zero, we’re effectively making the token that precedes the quantifier optional. ‹h{0,1}› matches the ‹h› once or not at all. If there is no h, ‹h{0,1}› results in a zero-length match. If you use ‹h{0,1}› as a regular expression all by itself, it will find a zero-length match before each character in the subject text that is not an h. Each h will result in a match of one character (the h).

‹h?› does the same as ‹h{0,1}›. A question mark after a valid and complete regex token that is not a quantifier means “zero or once.” The next recipe shows the meaning of a question mark after a quantifier.
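The zero-length matches are easier to see when the match positions are printed. This sketch assumes Python 3.7 or later and uses an invented three-character subject.

```python
import re

subject = "ahb"

# (start, end) pairs for every match of each form of the optional h.
spans_braces = [m.span() for m in re.finditer(r"h{0,1}", subject)]
spans_question = [m.span() for m in re.finditer(r"h?", subject)]

# Zero-length matches have start == end; the h itself yields a span of length 1.
print(spans_braces)
print(spans_question)
```

Note that Python also reports a zero-length match at the very end of the subject, after the last character.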

Tip

A question mark, or any other quantifier, directly after an opening parenthesis would be a syntax error, because there is nothing for it to repeat. Perl and the flavors that copy it take advantage of this to add “Perl extensions” to the regex syntax. Preceding recipes show noncapturing groups and named capturing groups, both of which use a question mark after the opening parenthesis. These question marks are not quantifiers at all; they are simply part of the group syntax. Following recipes will show more styles of groups using the ‹(?› syntax.

Repeating groups

If you place a quantifier after the closing parenthesis of a group, the whole group is repeated. ‹(?:abc){3}› is the same as ‹abcabcabc›.

Quantifiers can be nested. ‹(e\d+)?› matches an e followed by one or more digits, or a zero-length match. In our floating-point regular expression, this is the optional exponent.
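A brief Python check of both points: a quantified group repeats as a unit, and the optional exponent group either participates in the match or does not. The sample numbers are assumptions for the sketch.

```python
import re

# The quantifier applies to the whole noncapturing group.
group_repeated = bool(re.fullmatch(r"(?:abc){3}", "abcabcabc"))
two_is_too_few = re.fullmatch(r"(?:abc){3}", "abcabc") is None

FLOAT = re.compile(r"\d*\.\d+(e\d+)?", re.IGNORECASE)

# With an exponent, group 1 holds it; without one, the group did not
# participate in the match and Python reports None.
exponent_present = FLOAT.fullmatch("1.5e10").group(1)
exponent_absent = FLOAT.fullmatch("1.5").group(1)

print(group_repeated, two_is_too_few, exponent_present, exponent_absent)
```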

Capturing groups can be repeated. As explained in Recipe 2.9, the group’s match is captured each time the engine exits the group, overwriting any text previously matched by the group. ‹(\d\d){1,3}› matches a string of two, four, or six digits. When this regex matches 123456, the engine exits the group three times, and the capturing group holds 56, because 56 was stored by the last iteration of the group. The other two matches by the group, 12 and 34, cannot be retrieved.

‹(\d\d){3}› captures the same text as ‹\d\d\d\d(\d\d)›. If you want the capturing group to capture all two, four, or six digits rather than just the last two, you have to place the capturing group around the quantifier instead of repeating the capturing group: ‹((?:\d\d){1,3})›. Here we used a noncapturing group to take over the grouping function from the capturing group. We also could have used two capturing groups: ‹((\d\d){1,3})›. When this last regex matches 123456, ‹\1› holds 123456 and ‹\2› holds 56.
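The difference between repeating a capturing group and capturing a repeated group can be verified in Python, which follows the overwrite behavior described above. The variable names are illustrative only.

```python
import re

subject = "123456"

# Repeated capturing group: only the last iteration is kept.
last_only = re.fullmatch(r"(\d\d){1,3}", subject).group(1)

# Same captured text as writing out the first two pairs without a group.
written_out = re.fullmatch(r"\d\d\d\d(\d\d)", subject).group(1)

# Capturing group around a repeated noncapturing group: everything is kept.
everything = re.fullmatch(r"((?:\d\d){1,3})", subject).group(1)

# Two capturing groups: group 1 holds all digits, group 2 the last pair.
nested = re.fullmatch(r"((\d\d){1,3})", subject)
outer, inner = nested.group(1), nested.group(2)

print(last_only, written_out, everything, outer, inner)
```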

.NET’s regular expression engine is the only one among the flavors discussed here that allows you to retrieve all the iterations of a repeated capturing group. If you directly query the group’s Value property, which returns a string, you’ll get 56, as with every other regular expression engine. Backreferences in the regular expression and replacement text also substitute 56, but if you use the group’s CaptureCollection, you’ll get a stack with 56, 34, and 12.

2.12. Repeat Part of the Regex a Certain Number of Times

Problem

Solution

Googol

Hexadecimal number

Hexadecimal number with optional suffix

Floating-point number

Discussion

Fixed repetition

Variable repetition

Infinite repetition

Making something optional

Tip

Repeating groups

See Also
