Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Regular Expressions Cookbook, 2nd Edition by Steven Levithan Published by O'Reilly Media, Inc., 2012

Discussion

Matching a string that cannot contain quotes or line breaks would be easy with ‹"[^\r\n"]*"›. Double quotes are literal characters in regular expressions, and we can easily match a sequence of characters that are not quotes or line breaks with a negated character class.

But our strings can contain quotes if they are specified as two consecutive quotes. Matching these is not much more difficult if we handle the quotes separately. After the opening quote, we use ‹[^\r\n"]*› to match anything but quotes and line breaks. This may be followed by zero or more pairs of double quotes. We could match those with ‹(?:"")*›, but after each pair of double quotes, the string can have more characters that are not quotes or line breaks. So we match one pair of double quotes and following nonquote, nonbreak characters with ‹""[^\r\n"]*›, or all the pairs with ‹(?:""[^\r\n"]*)*›. We end the regex with the double quote that closes the string.
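As an illustration, the pieces described above can be assembled one at a time. The following is a minimal sketch using Python's `re` module; the variable names and the sample text are invented for this example and are not part of the recipe.

```python
import re

# Pieces of the regex, in the order the discussion introduces them.
OPENING = r'"'
PLAIN = r'[^\r\n"]*'            # anything but quotes and line breaks
ESCAPED = r'(?:""[^\r\n"]*)*'   # each pair of quotes plus the plain text after it
CLOSING = r'"'

STRING_RE = re.compile(OPENING + PLAIN + ESCAPED + CLOSING)

# Hypothetical input: one plain string and one containing doubled quotes.
sample = 'x = "plain"; y = "say ""hi"" now"'
matches = STRING_RE.findall(sample)
print(matches)
```

Because the only group is noncapturing, `findall` returns the whole of each match, enclosing quotes included.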

The match returned by this regex will be the whole string, including enclosing quotes, and pairs of quotes inside the string. To get only the contents of the string, the code that processes the regex match needs to do some extra work. First, it should strip off the quotes at the start and the end of the match. Then it should search for all pairs of double quotes and replace them with individual double quotes.
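The two post-processing steps might look like this in Python. This is only a sketch: the helper name `string_contents` and the sample line are assumptions made for the example.

```python
import re

STRING_RE = re.compile(r'"[^\r\n"]*(?:""[^\r\n"]*)*"')

def string_contents(match_text):
    """Turn a matched string, quotes included, into its contents."""
    inner = match_text[1:-1]          # strip the opening and closing quotes
    return inner.replace('""', '"')   # collapse each pair of quotes into one

# Hypothetical source line containing one string with doubled quotes.
sample = 'print "She said ""hello"" twice"'
for m in STRING_RE.finditer(sample):
    print(string_contents(m.group()))
```

Stripping the enclosing quotes first matters: an empty string is matched as `""`, which must not be mistaken for an escaped quote.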

You may wonder why we don’t simply use ‹"(?:[^"\r\n]|"")*"› to match our strings. This regex matches a pair of quotes containing ‹(?:[^"\r\n]|"")*›, which matches zero or more occurrences of any combination of two alternatives. ‹[^"\r\n]› matches a character that isn’t a double quote or a line break. ‹""› matches a pair of double quotes. Put together, the overall regex matches a pair of double quotes containing zero or more characters that aren’t quotes or line breaks or that are a pair of double quotes. This is the definition of a string in the stated problem. This regex does match the strings we want, but it is not very efficient. The regular expression engine has to enter a group with two alternatives for each character in the string. With the regex from the Solution section, the regex engine only enters a group for each pair of double quotes in the string, which is a rare occurrence.
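A quick way to convince yourself that the two regexes find the same strings is to run both over a handful of inputs. The sketch below does that in Python. The sample lines are invented, and agreement on a few samples is a sanity check rather than a proof.

```python
import re

UNROLLED = re.compile(r'"[^\r\n"]*(?:""[^\r\n"]*)*"')
ALTERNATION = re.compile(r'"(?:[^"\r\n]|"")*"')

# Hypothetical test lines: plain strings, doubled quotes, empty strings.
samples = [
    'a = "plain"',
    'b = "with ""pairs"" inside"',
    'c = ""',
    'd = """"',
    'no strings here',
    'e = "first" + "second"',
]

for text in samples:
    # Both regexes should report exactly the same matches.
    assert UNROLLED.findall(text) == ALTERNATION.findall(text)

print("same matches on all samples")
```

The difference between the two is in how much work the engine does to get there, not in what they match.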

You could try to optimize the inefficient regex as ‹"(?:[^"\r\n]+|"")*"›. The idea is that this regex only enters the group for each pair of double quotes and for each sequence of characters without quotes or line breaks. That is true, as long as the regex encounters only valid strings. But if this regex is ever used on a file that contains a string without the closing quote, this will lead to catastrophic backtracking. When the closing quote fails to match, the regex engine will try each and every permutation of the plus and the asterisk in the regex to match all the characters between the string’s opening quote and the end of the line.

Table 7-1 shows how this regex attempts all different ways of matching "abcd. The cells in the table show the text matched by ‹[^"\r\n]+›. At first, it matches abcd, but when the closing quote fails to match, the ‹+› will backtrack, giving up part of its match. When it does, the ‹*› will repeat the group, causing the next iteration of ‹[^"\r\n]+› to match the remaining characters. Now we have two iterations that will backtrack. This continues until each iteration of ‹[^"\r\n]+› matches a single character, and ‹*› has repeated the group as many times as there are characters on the line.

Table 7-1. Line separators
Permutation	1st ‹[^"\r\n]+›	2nd ‹[^"\r\n]+›	3rd ‹[^"\r\n]+›	4th ‹[^"\r\n]+›
1	abcd	n/a	n/a	n/a
2	abc	d	n/a	n/a
3	ab	cd	n/a	n/a
4	ab	c	d	n/a
5	a	bcd	n/a	n/a
6	a	bc	d	n/a
7	a	b	cd	n/a
8	a	b	c	d

As you can see, the number of permutations grows exponentially^[10] with the number of characters after the opening double quote. For a file with short lines, this will result in your application running slowly. For a file with very long lines, your application may lock up or crash. If you use the variant ‹"(?:[^"]+|"")*"› to match multiline strings, the permutations may run all the way to the end of the file if there are no further double quotes in the file.
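The rows of Table 7-1 can be reproduced with a few lines of Python. The sketch below is a model of the backtracking order, not a trace of any real regex engine. The function name `splits` is made up for the example, and actual engines may apply optimizations that this model ignores.

```python
def splits(text):
    """Yield every way to cut text into consecutive nonempty pieces.

    The first piece is tried longest first, which mirrors how a greedy
    quantifier gives up one character at a time when it backtracks.
    """
    for end in range(len(text), 0, -1):
        head, rest = text[:end], text[end:]
        if not rest:
            yield [head]
        else:
            for tail in splits(rest):
                yield [head] + tail

# Reproduce the eight permutations for the characters after the opening quote.
for number, pieces in enumerate(splits("abcd"), start=1):
    print(number, pieces)

# Each added character doubles the count: 2 ** (n - 1) ways for n characters.
for n in (4, 10, 20):
    print(n, 2 ** (n - 1))
```

For four characters this yields the 8 permutations in the table. For 20 characters the count is already 524,288, which is why long lines are so much worse than short ones.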

You could prevent that backtracking with an atomic group, as in ‹"(?>[^"\r\n]+|"")*"›, or with possessive quantifiers, as in ‹"(?:[^"\r\n]++|"")*+"›, if your regex flavor supports either of these features. But having to resort to special features defeats the purpose of trying to come up with something simpler than the regex presented in the Solution section.
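By contrast, the regex that this discussion builds up needs nothing beyond a negated character class and a noncapturing group to cope with an unclosed string. The following Python sketch feeds it a deliberately long, invented worst case. The length of 100,000 characters is an arbitrary choice for the example.

```python
import re

UNROLLED = re.compile(r'"[^\r\n"]*(?:""[^\r\n"]*)*"')

# Hypothetical worst case for the "plus inside asterisk" regex: an opening
# quote followed by a long run of characters and no closing quote.
unclosed = '"' + 'a' * 100_000

# There is no closing quote anywhere, so no match is possible.
result = UNROLLED.search(unclosed)
print(result)
```

The search reports no match. Only one quantifier can give back characters here, so the number of backtracking steps grows with the length of the line rather than doubling with each character.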

Variations

Strings delimited with single quotes can be matched just as easily:

'[^'\r\n]*(?:''[^'\r\n]*)*'
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

If your language supports both single-quoted and double-quoted strings, you’ll need to handle those as separate alternatives:

"[^"\r\n]*(?:""[^"\r\n]*)*"|'[^'\r\n]*(?:''[^'\r\n]*)*'
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

If strings can include line breaks, simply remove them from the negated character classes:

"[^"]*(?:""[^"]*)*"
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

If the regex will be used in a system that needs to deal with source code files while they’re being edited, you may want to make the closing quote optional. Then everything until the end of the line will be matched as a string while it is being typed in, until the closing quote has been typed in. Syntax coloring in text editors, for example, usually works this way. Making the closing quote optional does not change how this regex works on files that only have properly closed strings. The quantifier for the closing quote is greedy, so the quote will be matched if present. The negated character classes make sure that the regex does not incorrectly match closing quotes as part of the string.

"[^"\r\n]*(?:""[^"\r\n]*)*"?
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
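
The variations above can be tried out side by side. The following Python sketch is one way to do so. The constant names and the sample inputs are assumptions made for the example, and the regexes themselves are copied from the variations.

```python
import re

DOUBLE = r'"[^"\r\n]*(?:""[^"\r\n]*)*"'
SINGLE = r"'[^'\r\n]*(?:''[^'\r\n]*)*'"
MULTILINE = r'"[^"]*(?:""[^"]*)*"'
UNFINISHED = DOUBLE + "?"            # closing quote made optional

# Either kind of string: the two regexes as separate alternatives.
either = re.compile(DOUBLE + "|" + SINGLE)
print(either.findall("a = \"it's\"; b = 'say \"hi\"'"))

# A string spanning two lines is found only once line breaks are allowed.
two_lines = '"line one\nline two"'
print(re.findall(DOUBLE, two_lines))
print(re.findall(MULTILINE, two_lines))

# While a string is still being typed, the optional closing quote lets the
# match run to the end of the line; a closed string is matched as before.
print(re.findall(UNFINISHED, 'x = "still typing'))
print(re.findall(UNFINISHED, 'x = "done" + y'))
```

Note that a single quote inside a double-quoted string, and vice versa, needs no escaping: each alternative only treats its own delimiter as special.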
