Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Regular Expressions Cookbook, 2nd Edition by Steven Levithan Published by O'Reilly Media, Inc., 2012

Option 2: Keep the last occurrence of each duplicate line in an unsorted file

If you are using a text editor that does not have the built-in ability to sort lines, or if it is important to preserve the original line order, the following solution lets you remove duplicates even when they are separated by other lines:

^([^\r\n]*)(?:\r?\n|\r)(?=.*^\1$)
Regex options: Dot matches line breaks, ^ and $ match at line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
Here’s the same thing as a regex compatible with standard JavaScript, without the requirement for the “dot matches line breaks” option:
^(.*)(?:\r?\n|\r)(?=[\s\S]*^\1$)
Regex options: ^ and $ match at line breaks (“dot matches line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Replace with:
(The empty string—that is, nothing.)
Replacement text flavors: N/A
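
As a minimal sketch of how the first Option 2 regex might be applied in code, the Python snippet below enables both required options through `re.DOTALL` and `re.MULTILINE`. The function name `keep_last_occurrence` and the sample text are illustrative assumptions, and the sample uses `\n` line breaks only:

```python
import re

# Option 2 regex: "dot matches line breaks" and "^ and $ match at line breaks".
KEEP_LAST = re.compile(
    r"^([^\r\n]*)(?:\r?\n|\r)(?=.*^\1$)",
    re.DOTALL | re.MULTILINE,
)

def keep_last_occurrence(text):
    """Remove every line that appears again later in the text (hypothetical helper)."""
    return KEEP_LAST.sub("", text)

sample = "value1\nvalue2\nvalue2\nvalue3\nvalue3\nvalue1\nvalue2"
print(keep_last_occurrence(sample))
# value3
# value1
# value2
```

A single `sub` call is enough here because the lookahead inspects the rest of the original string without consuming it.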

Option 3: Keep the first occurrence of each duplicate line in an unsorted file

If you want to preserve the first occurrence of each duplicate line, you’ll need to use a somewhat different approach. First, here is the regular expression and replacement string we will use:

^([^\r\n]*)$(.*?)(?:(?:\r?\n|\r)\1$)+
Regex options: Dot matches line breaks, ^ and $ match at line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
Once again, we need to make a couple of changes to make this compatible with JavaScript-flavor regexes, since standard JavaScript doesn’t have a “dot matches line breaks” option.
^(.*)$([\s\S]*?)(?:(?:\r?\n|\r)\1$)+
Regex options: ^ and $ match at line breaks (“dot matches line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Replace with:
$1$2
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
\1\2
Replacement text flavors: Python, Ruby
Unlike the Option 1 and 2 regexes, this version cannot remove all duplicate lines with one search-and-replace operation. You’ll need to continually apply “replace all” until the regex no longer matches your string, meaning that there are no more duplicates to remove. See the “Discussion” section of this recipe for further details.
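
The repeated “replace all” could be sketched in Python roughly as follows. The helper name `keep_first_occurrence` is an assumption for illustration; `re.subn` is used because it returns the number of replacements, which tells us when to stop:

```python
import re

# Option 3 regex with the Python-style replacement text \1\2.
KEEP_FIRST = re.compile(
    r"^([^\r\n]*)$(.*?)(?:(?:\r?\n|\r)\1$)+",
    re.DOTALL | re.MULTILINE,
)

def keep_first_occurrence(text):
    """Apply "replace all" until the regex no longer matches (hypothetical helper)."""
    while True:
        text, count = KEEP_FIRST.subn(r"\1\2", text)
        if count == 0:
            # No match means no duplicates remain.
            return text

print(keep_first_occurrence("b\na\nb\nc\na\nb"))
# b
# a
# c
```

The loop always terminates, since every successful replacement removes at least one line break from the string.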

Discussion

Option 1: Sort lines and remove adjacent duplicates

This regex removes all but the first of duplicate lines that appear next to each other. It does not remove duplicates that are separated by other lines. Let’s step through the process.

First, the ‹^› at the front of the regular expression matches the start of a line. Normally it would only match at the beginning of the subject string, so you need to make sure that the option to let ^ and $ match at line breaks is enabled (Recipe 3.4 shows you how to set regex options in code). Next, the ‹.*› within the capturing parentheses matches the entire contents of a line (even if it’s blank), and the value is stored as backreference 1. For this to work correctly, the “dot matches line breaks” option must not be set; otherwise, the dot-asterisk combination would match until the end of the string.

Within an outer, noncapturing group, we’ve used ‹(?:\r?\n|\r)› to match a line separator used in Windows/MS-DOS (‹\r\n›), Unix/Linux/BSD/OS X (‹\n›), or legacy Mac OS (‹\r›) text files. The backreference ‹\1› then tries to match the line we just finished matching. If the same line isn’t found at that position, the match attempt fails and the regex engine moves on. If it matches, we repeat the group (composed of a line break sequence and backreference 1) using the ‹+› quantifier to match any immediately following duplicate lines.

Finally, we use the dollar sign at the end of the regex to assert position at the end of the line. This ensures that we only match identical lines, and not lines that merely start with the same characters as a previous line.

Because we’re doing a search-and-replace, each entire match (including the original line and line breaks) is removed from the string. We replace this with backreference 1 to put the original line back in.
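
To make the walkthrough concrete, here is a small Python sketch. The pattern is reconstructed from the description above rather than copied from the Solution, so treat it as an assumption; the helper names and the sorting step via `sorted` are also illustrative, and the input is assumed to use `\n` line breaks:

```python
import re

# Reconstructed from the discussion: a line, then one or more repetitions of
# (line break + backreference 1), then $. Only "^ and $ match at line breaks"
# is enabled; "dot matches line breaks" is deliberately left off.
ADJACENT_DUPLICATES = re.compile(r"^(.*)(?:(?:\r?\n|\r)\1)+$", re.MULTILINE)

def remove_adjacent_duplicates(text):
    """Collapse each run of identical adjacent lines into one line."""
    return ADJACENT_DUPLICATES.sub(r"\1", text)

def sort_and_dedupe(text):
    """Sort the lines first so that all duplicates become adjacent."""
    sorted_text = "\n".join(sorted(text.split("\n")))
    return remove_adjacent_duplicates(sorted_text)

print(sort_and_dedupe("value2\nvalue1\nvalue2\nvalue1"))
# value1
# value2

# The trailing $ keeps a line from matching a longer line that merely
# starts with the same characters, so this text is left unchanged.
print(remove_adjacent_duplicates("value\nvalue1"))
# value
# value1
```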

Option 2: Keep the last occurrence of each duplicate line in an unsorted file

There are several changes here compared to the Option 1 regex that finds duplicate lines only when they appear next to each other. First, in the non-JavaScript version of the Option 2 regex, the dot within the capturing group has been replaced with ‹[^\r\n]› (any character except a line break), and the “dot matches line breaks” option has been enabled. That’s because a dot is used later in the regex to match any character, including line breaks. Second, a lookahead has been added to scan for duplicate lines at any position further along in the string. Since the lookahead does not consume any characters, the text matched by the regex is always a single line (along with its following line break) that is known to appear again later in the string. Replacing all matches with the empty string removes the duplicate lines, leaving behind only the last occurrence of each.
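
The claim that each match is a single line plus its line break can be checked with a short Python sketch. It lists the matches of both Option 2 variants on a sample text; the variable names are illustrative assumptions, and the sample uses `\n` line breaks only:

```python
import re

# Variant 1: the dot in the lookahead needs "dot matches line breaks".
DOTALL_VARIANT = re.compile(
    r"^([^\r\n]*)(?:\r?\n|\r)(?=.*^\1$)",
    re.DOTALL | re.MULTILINE,
)

# Variant 2 (JavaScript-compatible): [\s\S] stands in for the dot, so
# "dot matches line breaks" must stay off.
CLASS_VARIANT = re.compile(
    r"^(.*)(?:\r?\n|\r)(?=[\s\S]*^\1$)",
    re.MULTILINE,
)

sample = "value1\nvalue2\nvalue2\nvalue3\nvalue3\nvalue1\nvalue2"

dotall_matches = [m.group(0) for m in DOTALL_VARIANT.finditer(sample)]
class_matches = [m.group(0) for m in CLASS_VARIANT.finditer(sample)]

print(dotall_matches)
# ['value1\n', 'value2\n', 'value2\n', 'value3\n']
print(dotall_matches == class_matches)
# True
```

Every match covers exactly one line that occurs again later; the last occurrence of each value is never matched, which is why it survives the replacement.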

Option 3: Keep the first occurrence of each duplicate line in an unsorted file

Lookbehind is not as widely supported as lookahead, and where it is supported, you still may not be able to look as far backward as you need to. Thus, the Option 3 regex is conceptually different from Option 2. Instead of matching lines that are known to be repeated earlier in the string (which would be comparable to Option 2’s tactic), this regex matches a line, the first duplicate of that line that occurs later in the string, and all the lines in between. The original line is stored as backreference 1, and the lines in between (if any) as backreference 2. By replacing each match with both backreference 1 and 2, you put back the parts you want to keep, leaving out the trailing, duplicate line and its preceding line break.

This alternative approach presents a couple of issues. First, because each match of a set of duplicate lines may include other lines in between, it’s possible that there are duplicates of a different value within your matched text, and those will be skipped over during a “replace all” operation. Second, if a line is repeated more than twice, the regex will first match duplicates one and two, but after that, it will take another set of duplicates to get the regex to match again as it advances through the string. Thus, a single “replace all” action will at best remove only every other duplicate of any specific line. To solve both of these problems and make sure that all duplicates are removed, you’ll need to continually apply the search-and-replace operation to your entire subject string until the regex no longer matches within it. Consider how this regex will work when applied to the following text:

value1
value2
value2
value3
value3
value1
value2
Removing all duplicate lines from this string will take three passes. Table 5-1 shows the result of each pass.
Table 5-1. Replacement passes

Pass one                   Pass two                   Pass three                 Final string
One match/replacement      Two matches/replacements   One match/replacement      No duplicates remain
「 value1                  value1                     value1                     value1
   value2                  「 value2                  「 value2                  value2
   value2                     value2 」                  value3                  value3
   value3                  「 value3                     value2 」
   value3                     value3 」
   value1 」               value2
value2
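
The passes in Table 5-1 can be reproduced with a short Python sketch. The function name `trace_passes` is a hypothetical helper; it records the number of replacements and the resulting string for each pass of the Option 3 regex:

```python
import re

KEEP_FIRST = re.compile(
    r"^([^\r\n]*)$(.*?)(?:(?:\r?\n|\r)\1$)+",
    re.DOTALL | re.MULTILINE,
)

def trace_passes(text):
    """Return a list of (replacement count, resulting text), one entry per pass."""
    passes = []
    while True:
        new_text, count = KEEP_FIRST.subn(r"\1\2", text)
        if count == 0:
            return passes
        passes.append((count, new_text))
        text = new_text

sample = "value1\nvalue2\nvalue2\nvalue3\nvalue3\nvalue1\nvalue2"
for number, (count, result) in enumerate(trace_passes(sample), start=1):
    print(f"Pass {number}: {count} replacement(s) -> {result.split(chr(10))}")
# Pass 1: 1 replacement(s) -> ['value1', 'value2', 'value2', 'value3', 'value3', 'value2']
# Pass 2: 2 replacement(s) -> ['value1', 'value2', 'value3', 'value2']
# Pass 3: 1 replacement(s) -> ['value1', 'value2', 'value3']
```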

Table of Contents for
Regular Expressions Cookbook, 2nd Edition

5.9. Remove Duplicate Lines

Problem

Solution

Option 1: Sort lines and remove adjacent duplicates

Option 2: Keep the last occurrence of each duplicate line in an unsorted file

Option 3: Keep the first occurrence of each duplicate line in an unsorted file

Discussion

Option 1: Sort lines and remove adjacent duplicates

Option 2: Keep the last occurrence of each duplicate line in an unsorted file

Option 3: Keep the first occurrence of each duplicate line in an unsorted file

See Also
