Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Regular Expressions Cookbook, 2nd Edition by Steven Levithan Published by O'Reilly Media, Inc., 2012

Discussion

Although a negated character class (written as ‹[^⋯]›) makes it easy to match anything except a specific character, you can’t just write ‹[^cat]› to match anything except the word cat. ‹[^cat]› is a valid regex, but it matches any character except c, a, or t. Hence, although ‹\b[^cat]+\b› would avoid matching the word cat, it wouldn’t match the word time either, because time contains the forbidden letter t. The regular expression ‹\b[^c][^a][^t]\w*› is no good either, because it would reject any word with c as its first letter, a as its second letter, or t as its third. Furthermore, that regex doesn’t restrict the first three characters to word characters, and it only matches words with at least three characters, since none of the negated character classes are optional.
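The shortcomings of both negated-class attempts can be checked with a short Python sketch. The helper name `accepts` and the test words other than cat and time are illustrative assumptions, not part of the recipe, and `fullmatch` is used here only to test one word at a time.

```python
import re

# The two flawed patterns discussed above (case insensitive, as in the recipe).
letters_excluded = re.compile(r"\b[^cat]+\b", re.IGNORECASE)
positions_excluded = re.compile(r"\b[^c][^a][^t]\w*", re.IGNORECASE)


def accepts(pattern, word):
    """Hypothetical helper: True if the pattern matches the whole of `word`."""
    return pattern.fullmatch(word) is not None


# [^cat] excludes the letters c, a, and t, not the word "cat",
# so "time" is rejected along with "cat".
for word in ["cat", "time", "word"]:
    print(word, accepts(letters_excluded, word))

# Three negated classes reject c as first letter, a as second, or t as third,
# require at least three characters, and allow non-word characters ("d-g").
for word in ["cat", "cup", "bat", "is", "dog", "d-g"]:
    print(word, accepts(positions_excluded, word))
```

Only word and dog are ordinary words that both patterns treat sensibly; the other results show each pattern rejecting or accepting too much.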

With all that in mind, let’s take another look at how the regular expression shown at the beginning of this recipe solved the problem:

\b        # Assert position at a word boundary.
(?!       # Not followed by:
  cat     #   Match "cat".
  \b      #   Assert position at a word boundary.
)         # End the negative lookahead.
\w+       # Match one or more word characters.
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
The key to this pattern is its negative lookahead, ‹(?!⋯)›. The negative lookahead disallows the sequence cat followed by a word boundary, without preventing the use of those letters when they do not appear in that exact sequence, or when they appear as part of a longer or shorter word. There’s no word boundary at the very end of the regular expression, because it wouldn’t change what the regex matches. The ‹+› quantifier in ‹\w+› repeats the word character token as many times as possible, which means that it will always match until the next word boundary.
When applied to the subject string “categorically match any word except cat”, the regex will find five matches: categorically, match, any, word, and except.
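As a quick check of the behavior described above, here is a minimal Python sketch that applies the free-spacing pattern to the same subject string. The variable names are illustrative assumptions; `re.VERBOSE` is Python's name for free-spacing mode.

```python
import re

# The recipe's pattern in free-spacing (verbose) mode, case insensitive.
not_cat = re.compile(
    r"""
    \b        # Assert position at a word boundary.
    (?!       # Not followed by:
      cat     #   Match "cat".
      \b      #   Assert position at a word boundary.
    )         # End the negative lookahead.
    \w+       # Match one or more word characters.
    """,
    re.VERBOSE | re.IGNORECASE,
)

# Same pattern with a trailing word boundary, to show it changes nothing.
not_cat_terminated = re.compile(r"\b(?!cat\b)\w+\b", re.IGNORECASE)

subject = "categorically match any word except cat"
matches = not_cat.findall(subject)
matches_terminated = not_cat_terminated.findall(subject)

print(matches)
print(matches == matches_terminated)
```

Because the greedy ‹\w+› always runs to the next word boundary, both patterns return the same list for this subject.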

Variations

Find words that don’t contain another word

If, instead of trying to match any word that is not cat, you are trying to match any word that does not contain cat, a slightly different approach is needed:

\b(?:(?!cat)\w)+\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
In the earlier section of this recipe, the word boundary at the beginning of the regular expression provided a convenient anchor that allowed us to simply place the negative lookahead at the beginning of the word. The solution used here is not as efficient, but it’s nevertheless a commonly used construct that allows you to match something other than a particular word or pattern. It does this by repeating a group containing a negative lookahead and a single word character. Before matching each character, the regex engine makes sure that the word cat cannot be matched starting at the current position.
Unlike the previous regular expression, this one requires a terminating word boundary. Otherwise, it could match just the first part of a word, up to where cat appears within it.
When applied to the subject string “categorically match any word except cat”, the regex will find four matches: match, any, word, and except.
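The following Python sketch applies this variation to the same subject string, and also shows why the terminating word boundary matters. The word concatenate and the variable names are illustrative assumptions, not taken from the recipe.

```python
import re

subject = "categorically match any word except cat"

# Before each word character, check that "cat" does not start here.
no_cat_inside = re.compile(r"\b(?:(?!cat)\w)+\b", re.IGNORECASE)
matches = no_cat_inside.findall(subject)
print(matches)

# Without the trailing \b, the regex can stop just before an embedded "cat"
# and report only the leading part of the word.
unterminated = re.compile(r"\b(?:(?!cat)\w)+", re.IGNORECASE)
partial = unterminated.findall("concatenate")
whole = no_cat_inside.findall("concatenate")
print(partial, whole)
```

With the trailing boundary in place, concatenate yields no match at all, which is the intended result for a word that contains cat.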
