Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Cover image for bash Cookbook, 2nd Edition

Regular Expressions Cookbook, 2nd Edition by Steven Levithan Published by O'Reilly Media, Inc., 2012

Discussion

The word boundaries at both ends of the regular expression ensure that cat is matched only when it appears as a complete word. More precisely, the word boundaries require that cat is set apart from other text by the beginning or end of the string, whitespace, punctuation, or other nonword characters.

Regular expression engines consider letters, numbers, and underscores to all be word characters. Recipe 2.6 is where we first talked about word boundaries, and covers them in greater detail.

A problem can occur when working with international text in JavaScript, PCRE, and Ruby, since those regular expression flavors only consider letters in the ASCII table to create a word boundary. In other words, word boundaries are found only at the positions between a match of ‹[^A-Za-z0-9_]|^› and ‹[A-Za-z0-9_]›, or between ‹[A-Za-z0-9_]› and ‹[^A-Za-z0-9_]|$›. The same is true in Python when the UNICODE or U flag is not set. This prevents ‹\b› from being useful for a “whole word only” search within text that contains accented letters or words that use non-Latin scripts. For example, in JavaScript, PCRE, and Ruby, ‹\büber\b› will find a match within darüber, but not within dar über. In most cases, this is the exact opposite of what you would want. The problem occurs because ü is considered a nonword character, and a word boundary is therefore found between the two characters rü. No word boundary is found between a space character and ü, because they create a contiguous sequence of nonword characters.

You can deal with this problem by using lookahead and lookbehind (collectively, lookaround—see Recipe 2.16) instead of word boundaries. Like word boundaries, lookarounds match zero-width positions. In PCRE (when compiled with UTF-8 support) and Ruby 1.9, you can emulate Unicode-based word boundaries using, for example, ‹(?<=[^\p{L}\p{M}]|^)cat(?=[^\p{L}\p{M}]|$)›. This regular expression also uses Unicode Letter and Mark category tokens (‹\p{L}› and ‹\p{M}›), which are discussed in Recipe 2.7. If you want the lookarounds to also treat any Unicode decimal numbers and connector punctuation (underscore and similar) as word characters, like ‹\b› does in regex flavors that correctly support Unicode, replace the two instances of ‹[^\p{L}\p{M}]› with ‹[^\p{L}\p{M}\p{Nd}\p{Pc}]›.

JavaScript and Ruby 1.8 support neither lookbehind nor Unicode categories. You can work around the lack of lookbehind support by matching the nonword character preceding each match, and then either removing it from each match using procedural code or putting it back into the string when replacing matches (see the examples of using parts of a match in a replacement string in Recipe 3.15). The additional lack of support for matching Unicode categories (coupled with the fact that both programming languages’ ‹\w› and ‹\W› tokens consider only ASCII word characters) means you might need to make do with a more restrictive solution. Code points in the Letter and Mark categories are scattered throughout Unicode’s character set, so it would take thousands of characters to emulate ‹[^\p{L}\p{M}]› using Unicode escape sequences and character class ranges. A good compromise might be ‹[^A-Za-z\xAA\xB5\xBA\xC0-\xD6\xD8-\xF6\xF8-\xFF]›, which matches all except Unicode letter characters in eight-bit address space (i.e., the first 256 Unicode code points, from positions 0x00 to 0xFF). There are no code points in the Mark category within this range. See Figure 5-1 for the list of nonmatched characters. This negated character class lets you exclude (or in nonnegated form, match) some of the most commonly used, accented characters.

Figure 5-1. Unicode letter characters in eight-bit address space

Following is an example of how to replace all instances of the word “cat” with “dog” in JavaScript. It correctly accounts for common, accented characters, so écat is not altered. To do this, you’ll need to construct your own character class instead of relying on the built-in ‹\b› or ‹\w›:

// 8-bit-wide letter characters var pL = "A-Za-z\xAA\xB5\xBA\xC0-\xD6\xD8-\xF6\xF8-\xFF", pattern = "([^{L}]|^)cat([^{L}]|$)".replace(/{L}/g, pL), regex = new RegExp(pattern, "gi"); // replace cat with dog, and put back any // additional matched characters subject = subject.replace(regex, "$1dog$2");
Note that JavaScript string literals use \xHH (where HH is a two-digit hexadecimal number) to insert special characters. Hence, the pL variable that is passed to the regular expression actually ends up containing the literal versions of the characters. If you wanted the \xHH metasequences to be passed through to the regex itself, you would have to escape the backslashes in the string literal (i.e., "\\xHH"). However, in this case it doesn’t matter and will not change what the regular expression matches.

Table of Contents for
Regular Expressions Cookbook, 2nd Edition

5.1. Find a Specific Word

Problem

Solution

Discussion

See Also

Table of Contents for Regular Expressions Cookbook, 2nd Edition

5.1. Find a Specific Word

Problem

Solution

Discussion

See Also

Table of Contents for
Regular Expressions Cookbook, 2nd Edition