Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Cover image for bash Cookbook, 2nd Edition

Regular Expressions Cookbook, 2nd Edition by Steven Levithan Published by O'Reilly Media, Inc., 2012

Discussion

.NET, PCRE, Perl, and Python support conditionals using numbered capturing groups. ‹(?(1)then|else)› is a conditional that checks whether the first capturing group has already matched something. If it has, the regex engine attempts to match ‹then›. If the capturing group has not participated in the match attempt thus far, the ‹else› part is attempted.

The parentheses, question mark, and vertical bar are all part of the syntax for the conditional. They don’t have their usual meaning. You can use any kind of regular expression for the ‹then› and ‹else› parts. The only restriction is that if you want to use alternation for one of the parts, you have to use a group to keep it together. Only one vertical bar is permitted directly in the conditional.

If you want, you can omit either the ‹then› or ‹else› part. The empty regex always finds a zero-length match. The solution for this recipe uses three conditionals that have an empty ‹then› part. If the capturing group participated, the conditional simply matches.

An empty negative lookahead, ‹(?!)›, fills the ‹else› part. Since the empty regex always matches, a negative lookahead containing the empty regex always fails. Thus, the conditional ‹(?(1)|(?!))› always fails when the first capturing group did not match anything.

By placing each of the three required alternatives in their own capturing group, we can use three conditionals at the end of the regex to test if all the capturing groups captured something. If one of the words was not matched, the conditional referencing its capturing group will evaluate the “else” part, which will cause the conditional to fail to match because of our empty negative lookahead. Thus the regex will fail to match if one of the words was not matched.

To allow the words to appear in any order and any number of times, we place the words inside a group using alternation, and repeat this group with a quantifier. Since we have three words, and we require each word to be matched at least once, we know the group has to be repeated at least three times.

.NET, Python, and PCRE 6.7 allow you to specify the name of a capturing group in a conditional. ‹(?(name)then|else)› checks whether the named capturing group name participated in the match attempt thus far. Perl 5.10 and later also support named conditionals. But Perl requires angle brackets or quotes around the name, as in ‹(?(<name>)then|else)› or ‹(?('name')then|else)›. PCRE 7.0 and later also supports Perl’s syntax for named conditional, while also supporting the syntax used by .NET and Python.

To better understand how conditionals work, let’s examine the regular expression ‹(a)?b(?(1)c|d)›. This is essentially a complicated way of writing ‹abc|bd›.

If the subject text starts with an a, this is captured in the first capturing group. If not, the first capturing group does not participate in the match attempt at all. It is important that the question mark is outside the capturing group because this makes the whole group optional. If there is no a, the group is repeated zero times, and never gets the chance to capture anything at all. It can’t capture a zero-length string.

If you use ‹(a?)›, the group always participates in the match attempt. There’s no quantifier after the group, so it is repeated exactly once. The group will either capture a or capture nothing.

Regardless of whether ‹a› was matched, the next token is ‹b›. The conditional is next. If the capturing group participated in the match attempt, even if it captured the zero-length string (not possible here), ‹c› will be attempted. If not, ‹d› will be attempted.

In English, ‹(a)?b(?(1)c|d)› either matches ab followed by c, or matches b followed by d.

With .NET, PCRE, and Perl, but not with Python, conditionals can also use lookaround. ‹(?(?=if)then|else)› first tests ‹(?=if)› as a normal lookahead. Recipe 2.16 explains how this works. If the lookaround succeeds, the ‹then› part is attempted. If not, the ‹else› part is attempted. Since lookaround is zero-width, the ‹then› and ‹else› regexes are attempted at the same position in the subject text where ‹if› either matched or failed.

You can use lookbehind instead of lookahead in the conditional. You can also use negative lookaround, though we recommend against it, as it only confuses things by reversing the meaning of “then” and “else.”

Tip

A conditional using lookaround can be written without the conditional as ‹(?=if)then|(?!if)else›. If the positive lookahead succeeds, the ‹then› part is attempted. If the positive lookahead fails, the alternation kicks in. The negative lookahead then does the same test. The negative lookahead succeeds when ‹if› fails, which is already guaranteed because ‹(?=if)› failed. Thus, ‹else› is attempted. Placing the lookahead in a conditional saves time, as the conditional attempts ‹if› only once.

Table of Contents for
Regular Expressions Cookbook, 2nd Edition

2.17. Match One of Two Alternatives Based on a Condition

Problem

Solution

Discussion

Tip

See Also

Table of Contents for Regular Expressions Cookbook, 2nd Edition

2.17. Match One of Two Alternatives Based on a Condition

Problem

Solution

Discussion

Tip

See Also

Table of Contents for
Regular Expressions Cookbook, 2nd Edition