Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Regular Expressions Cookbook, 2nd Edition by Steven Levithan Published by O'Reilly Media, Inc., 2012

Discussion

There are two things needed to match something that was previously matched: a capturing group and a backreference. Place the thing you want to match more than once inside a capturing group, and then match it again using a backreference. This works differently from simply repeating a token or group using a quantifier. Consider the difference between the simplified regular expressions ‹(\w)\1› and ‹\w{2}›. The first regex uses a capturing group and backreference to match the same word character twice, whereas the latter uses a quantifier to match any two word characters. Recipe 2.10 discusses the magic of backreferences in greater depth.
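To make the contrast concrete, here is a minimal sketch in Python (chosen only for illustration; the sample string is an assumption, not part of the recipe). It runs both simplified regexes over the same text:

```python
import re

text = "book keeper"  # hypothetical sample text

# (\w)\1: a word character followed by the SAME character again.
doubled = [m.group(0) for m in re.finditer(r"(\w)\1", text)]

# \w{2}: any two consecutive word characters, equal or not.
pairs = [m.group(0) for m in re.finditer(r"\w{2}", text)]

print(doubled)  # ['oo', 'ee']
print(pairs)    # ['bo', 'ok', 'ke', 'ep', 'er']
```

The backreference version finds only doubled characters, while the quantifier version consumes every pair of word characters it encounters.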

Back to the problem at hand. This recipe only finds repeated words that are composed of letters from A to Z and a to z (since the case insensitive option is enabled). To also allow accented letters and letters from other scripts, you can use the Unicode Letter category ‹\p{L}› if your regex flavor supports it (see Unicode category).

Between the capturing group and backreference, ‹\s+› matches one or more whitespace characters, such as spaces, tabs, or line breaks. If you want to restrict the characters that can separate repeated words to horizontal whitespace (i.e., no line breaks), replace the ‹\s› with ‹[●\t\xA0]›. This prevents matching repeated words that appear across multiple lines. The ‹\xA0› in the character class matches a no-break space, which is sometimes found in text copied and pasted from the Web (most web developers are familiar with using &nbsp; to insert a no-break space in their content). PCRE 7.2 and Perl 5.10 include the shorthand character class ‹\h›, which you might prefer here because it is specifically designed to match horizontal whitespace and also matches some additional, more esoteric horizontal whitespace characters.
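The effect of swapping ‹\s› for the horizontal-whitespace class can be sketched in Python. The Solution regex is not reproduced in this section, so the pattern below is an assumed form reconstructed from the description (word boundaries, a group of A to Z letters, whitespace, a backreference); the sample strings are also assumptions:

```python
import re

# Assumed form of the simple repeated-word regex.
ANY_WHITESPACE = re.compile(r"\b([A-Z]+)\s+\1\b", re.IGNORECASE)
# Same regex, but only space, tab, or no-break space may separate the words.
HORIZONTAL_ONLY = re.compile(r"\b([A-Z]+)[ \t\xA0]+\1\b", re.IGNORECASE)

across_lines = "the\nthe"      # repeated word split over two lines
with_nbsp = "the\u00a0the"     # repeated word separated by a no-break space

print(ANY_WHITESPACE.search(across_lines) is not None)   # True
print(HORIZONTAL_ONLY.search(across_lines) is not None)  # False
print(HORIZONTAL_ONLY.search(with_nbsp) is not None)     # True
```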

Finally, the word boundaries at the beginning and end of the regular expression ensure that it doesn’t match within other words (e.g., with “this thistle”).
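A short Python sketch of the “this thistle” case shows what the boundaries buy you. Both patterns are assumed forms based on the description in this section, not the recipe's verbatim solution:

```python
import re

WITH_BOUNDARIES = re.compile(r"\b([A-Z]+)\s+\1\b", re.IGNORECASE)
WITHOUT_BOUNDARIES = re.compile(r"([A-Z]+)\s+\1", re.IGNORECASE)

text = "this thistle"

# Without \b, the backreference happily matches the start of "thistle".
loose = WITHOUT_BOUNDARIES.search(text)
print(loose.group(0))                # this this

# With \b, the match must end at a word boundary, so nothing is found.
print(WITH_BOUNDARIES.search(text))  # None
```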

Note that the use of repeated words is not always incorrect, so simply removing them without examination is potentially dangerous. For example, the constructions “that that” and “had had” are generally accepted in colloquial English. Homonyms, names, onomatopoeic words (such as “oink oink” or “ha ha”), and some other constructions also occasionally result in intentionally repeated words. In most cases you should visually examine each match.
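One way to follow that advice is to report candidates for review instead of deleting them. The following Python sketch assumes the simple form of the regex described above; the helper name `find_repeats` and the sample text are hypothetical:

```python
import re

# Assumed form of the simple repeated-word regex.
REPEATED_WORD = re.compile(r"\b([A-Z]+)\s+\1\b", re.IGNORECASE)

def find_repeats(text):
    """Return (line_number, matched_text) pairs for manual review."""
    hits = []
    for m in REPEATED_WORD.finditer(text):
        # Count the newlines before the match to get a 1-based line number.
        line_number = text.count("\n", 0, m.start()) + 1
        hits.append((line_number, m.group(0)))
    return hits

sample = "He said that that was fine.\nThe pig went oink oink.\nIt is is broken."
for line_number, matched in find_repeats(sample):
    print(f"line {line_number}: {matched!r}")
```

In this sample only “is is” is a genuine error; “that that” and “oink oink” are intentional, which is why each hit is listed rather than removed automatically.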

Variations

The solution shown earlier was intentionally kept simple. That simplicity came at the cost of not accounting for a variety of special cases:

Repeated words that use letters with accents or other diacritical marks, such as “café café” or “naïve naïve.”
Repeated words that include hyphens, single quotes, or right single quotes, such as “co-chair co-chair,” “don’t don’t,” or “rollin’ rollin.’”
Repeated words written in a non-English alphabet, such as the Russian words “друзья друзья.”

Dealing with these issues prevents us from relying on the ‹\b› word boundary token, which we previously used to ensure that only complete words are matched. There are two reasons ‹\b› won’t work when accounting for the special cases just mentioned. First, hyphens and apostrophes are not word characters, so there is no word boundary to match between the whitespace or punctuation that separates words and a hyphen or apostrophe that appears at the beginning or end of a word. Second, ‹\b› is not Unicode aware in some regex flavors (see Word Characters in Recipe 2.6), so it won’t always work correctly if your data uses letters other than A to Z without diacritics.
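The first reason can be seen in a small Python sketch. The pattern is a hypothetical naive extension (it is not from the recipe) that allows hyphens and apostrophes inside the group but keeps the ‹\b› anchors:

```python
import re

# Hypothetical naive attempt: widen the character class, keep \b.
NAIVE = re.compile(r"\b([A-Z\-'\u2019]+)\s+\1\b", re.IGNORECASE)

# Ends in a letter, so the trailing \b still finds a boundary.
print(NAIVE.search("co-chair co-chair") is not None)          # True

# Ends in a right single quote: no word boundary exists between
# the quote and the end of the text, so the match fails.
print(NAIVE.search("rollin\u2019 rollin\u2019") is not None)  # False
```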

Instead of ‹\b›, we’ll therefore need to use lookahead and lookbehind (see Recipe 2.16) to make sure that we still match complete words only. We’ll also use Unicode categories (see Recipe 2.7) to match letters (‹\p{L}›) and diacritical marks (‹\p{M}›) in any alphabet or script:

(?<![\p{L}\p{M}\-'\u2019])([\-'\u2019]?(?:[\p{L}\p{M}][\-'\u2019]?)+)\s+\1(?![\p{L}\p{M}\-'\u2019])
Regex options: Case insensitive
Regex flavors: .NET, Java, Ruby 1.9
Even though ‹\p{L}› matches letters in any casing, you still need to enable the “case insensitive” option, because the backreference matched by ‹\1› might use different casing than the initially matched word.
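The point about the backreference can be checked with a short Python sketch. Python's built-in `re` module has no ‹\p{L}›, so `[^\W\d_]` is used here as an approximate stand-in for "any letter"; that substitution is an assumption made for illustration only:

```python
import re

PATTERN = r"\b([^\W\d_]+)\s+\1\b"  # [^\W\d_] approximates \p{L}

# The group matches "The" either way, but \1 then demands an exact "The".
case_sensitive = re.search(PATTERN, "The the")
case_insensitive = re.search(PATTERN, "The the", re.IGNORECASE)

print(case_sensitive)             # None
print(case_insensitive.group(0))  # The the
```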
The ‹\u2019› tokens in the regular expression match a right single quote mark (’). Perl and PCRE use a different syntax for matching individual Unicode code points, so we need to change the regex slightly for them:
(?<![\p{L}\p{M}\-'\x{2019}])([\-'\x{2019}]?(?:[\p{L}\p{M}][\-'\x{2019}]?)+)\s+\1(?![\p{L}\p{M}\-'\x{2019}])
Regex options: Case insensitive
Regex flavors: Java 7, PCRE, Perl
Neither of these regexes works in JavaScript, Python, or Ruby 1.8, because those flavors lack support for Unicode categories like ‹\p{L}›. JavaScript and Ruby 1.8 additionally lack support for lookbehind.
Following are several examples of repeated words that these regexes will match:
The the
café café
друзья друзья
don't don't
rollin’ rollin’
O’Keeffe’s O’Keeffe’s
co-chair co-chair
devil-may-care devil-may-care
Here are some examples of strings that are not matched:
hello, hello
1000 1000
- -
test’’ing test’’ing
one--two one--two
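For readers who want to experiment with these examples in Python, the following is a rough approximation rather than a port. It assumes that `[^\W\d_]` is an acceptable stand-in for ‹\p{L}› and that NFC normalization can partly compensate for the missing ‹\p{M}›; the names `LETTER`, `JOINER`, and `has_repeated_word` are hypothetical. (The third-party `regex` module does support ‹\p{L}› and ‹\p{M}› if an exact translation is needed.)

```python
import re
import unicodedata

LETTER = r"[^\W\d_]"     # approximate stand-in for \p{L}
JOINER = r"[\-'\u2019]"  # hyphen, apostrophe, right single quote

REPEATED_WORD = re.compile(
    rf"(?<!{LETTER})(?<!{JOINER})"         # not preceded by a word part
    rf"({JOINER}?(?:{LETTER}{JOINER}?)+)"  # the word itself
    rf"\s+\1"                              # whitespace, then the same word
    rf"(?!{LETTER})(?!{JOINER})",          # not followed by a word part
    re.IGNORECASE,
)

def has_repeated_word(text):
    # NFC composes a letter plus combining mark into one code point
    # where possible, since this sketch has no equivalent of \p{M}.
    normalized = unicodedata.normalize("NFC", text)
    return REPEATED_WORD.search(normalized) is not None

matched = ["The the", "café café", "друзья друзья", "don't don't",
           "rollin\u2019 rollin\u2019", "O\u2019Keeffe\u2019s O\u2019Keeffe\u2019s",
           "co-chair co-chair", "devil-may-care devil-may-care"]
not_matched = ["hello, hello", "1000 1000", "- -",
               "test\u2019\u2019ing test\u2019\u2019ing", "one--two one--two"]

print(all(has_repeated_word(s) for s in matched))      # True
print(any(has_repeated_word(s) for s in not_matched))  # False
```

Two consecutive negative lookbehinds (and lookaheads) are used because a negated class such as `[^\W\d_]` cannot be merged with extra characters inside a single character class.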

5.8. Find Repeated Words

Problem

Solution

Discussion

Variations

See Also
