Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Regular Expressions Cookbook, 2nd Edition by Steven Levithan. Published by O'Reilly Media, Inc., 2012
  1. Cover
  2. Regular Expressions Cookbook
  3. Preface
  4. Caught in the Snarls of Different Versions
  5. Intended Audience
  6. Technology Covered
  7. Organization of This Book
  8. Conventions Used in This Book
  9. Using Code Examples
  10. Safari® Books Online
  11. How to Contact Us
  12. Acknowledgments
  13. 1. Introduction to Regular Expressions
  14. Regular Expressions Defined
  15. Search and Replace with Regular Expressions
  16. Tools for Working with Regular Expressions
  17. 2. Basic Regular Expression Skills
  18. 2.1. Match Literal Text
  19. 2.2. Match Nonprintable Characters
  20. 2.3. Match One of Many Characters
  21. 2.4. Match Any Character
  22. 2.5. Match Something at the Start and/or the End of a Line
  23. 2.6. Match Whole Words
  24. 2.7. Unicode Code Points, Categories, Blocks, and Scripts
  25. 2.8. Match One of Several Alternatives
  26. 2.9. Group and Capture Parts of the Match
  27. 2.10. Match Previously Matched Text Again
  28. 2.11. Capture and Name Parts of the Match
  29. 2.12. Repeat Part of the Regex a Certain Number of Times
  30. 2.13. Choose Minimal or Maximal Repetition
  31. 2.14. Eliminate Needless Backtracking
  32. 2.15. Prevent Runaway Repetition
  33. 2.16. Test for a Match Without Adding It to the Overall Match
  34. 2.17. Match One of Two Alternatives Based on a Condition
  35. 2.18. Add Comments to a Regular Expression
  36. 2.19. Insert Literal Text into the Replacement Text
  37. 2.20. Insert the Regex Match into the Replacement Text
  38. 2.21. Insert Part of the Regex Match into the Replacement Text
  39. 2.22. Insert Match Context into the Replacement Text
  40. 3. Programming with Regular Expressions
  41. Programming Languages and Regex Flavors
  42. 3.1. Literal Regular Expressions in Source Code
  43. 3.2. Import the Regular Expression Library
  44. 3.3. Create Regular Expression Objects
  45. 3.4. Set Regular Expression Options
  46. 3.5. Test If a Match Can Be Found Within a Subject String
  47. 3.6. Test Whether a Regex Matches the Subject String Entirely
  48. 3.7. Retrieve the Matched Text
  49. 3.8. Determine the Position and Length of the Match
  50. 3.9. Retrieve Part of the Matched Text
  51. 3.10. Retrieve a List of All Matches
  52. 3.11. Iterate over All Matches
  53. 3.12. Validate Matches in Procedural Code
  54. 3.13. Find a Match Within Another Match
  55. 3.14. Replace All Matches
  56. 3.15. Replace Matches Reusing Parts of the Match
  57. 3.16. Replace Matches with Replacements Generated in Code
  58. 3.17. Replace All Matches Within the Matches of Another Regex
  59. 3.18. Replace All Matches Between the Matches of Another Regex
  60. 3.19. Split a String
  61. 3.20. Split a String, Keeping the Regex Matches
  62. 3.21. Search Line by Line
  63. Construct a Parser
  64. 4. Validation and Formatting
  65. 4.1. Validate Email Addresses
  66. 4.2. Validate and Format North American Phone Numbers
  67. 4.3. Validate International Phone Numbers
  68. 4.4. Validate Traditional Date Formats
  69. 4.5. Validate Traditional Date Formats, Excluding Invalid Dates
  70. 4.6. Validate Traditional Time Formats
  71. 4.7. Validate ISO 8601 Dates and Times
  72. 4.8. Limit Input to Alphanumeric Characters
  73. 4.9. Limit the Length of Text
  74. 4.10. Limit the Number of Lines in Text
  75. 4.11. Validate Affirmative Responses
  76. 4.12. Validate Social Security Numbers
  77. 4.13. Validate ISBNs
  78. 4.14. Validate ZIP Codes
  79. 4.15. Validate Canadian Postal Codes
  80. 4.16. Validate U.K. Postcodes
  81. 4.17. Find Addresses with Post Office Boxes
  82. 4.18. Reformat Names From “FirstName LastName” to “LastName, FirstName”
  83. 4.19. Validate Password Complexity
  84. 4.20. Validate Credit Card Numbers
  85. 4.21. European VAT Numbers
  86. 5. Words, Lines, and Special Characters
  87. 5.1. Find a Specific Word
  88. 5.2. Find Any of Multiple Words
  89. 5.3. Find Similar Words
  90. 5.4. Find All Except a Specific Word
  91. 5.5. Find Any Word Not Followed by a Specific Word
  92. 5.6. Find Any Word Not Preceded by a Specific Word
  93. 5.7. Find Words Near Each Other
  94. 5.8. Find Repeated Words
  95. 5.9. Remove Duplicate Lines
  96. 5.10. Match Complete Lines That Contain a Word
  97. 5.11. Match Complete Lines That Do Not Contain a Word
  98. 5.12. Trim Leading and Trailing Whitespace
  99. 5.13. Replace Repeated Whitespace with a Single Space
  100. 5.14. Escape Regular Expression Metacharacters
  101. 6. Numbers
  102. 6.1. Integer Numbers
  103. 6.2. Hexadecimal Numbers
  104. 6.3. Binary Numbers
  105. 6.4. Octal Numbers
  106. 6.5. Decimal Numbers
  107. 6.6. Strip Leading Zeros
  108. 6.7. Numbers Within a Certain Range
  109. 6.8. Hexadecimal Numbers Within a Certain Range
  110. 6.9. Integer Numbers with Separators
  111. 6.10. Floating-Point Numbers
  112. 6.11. Numbers with Thousand Separators
  113. 6.12. Add Thousand Separators to Numbers
  114. 6.13. Roman Numerals
  115. 7. Source Code and Log Files
  116. Keywords
  117. Identifiers
  118. Numeric Constants
  119. Operators
  120. Single-Line Comments
  121. Multiline Comments
  122. All Comments
  123. Strings
  124. Strings with Escapes
  125. Regex Literals
  126. Here Documents
  127. Common Log Format
  128. Combined Log Format
  129. Broken Links Reported in Web Logs
  130. 8. URLs, Paths, and Internet Addresses
  131. 8.1. Validating URLs
  132. 8.2. Finding URLs Within Full Text
  133. 8.3. Finding Quoted URLs in Full Text
  134. 8.4. Finding URLs with Parentheses in Full Text
  135. 8.5. Turn URLs into Links
  136. 8.6. Validating URNs
  137. 8.7. Validating Generic URLs
  138. 8.8. Extracting the Scheme from a URL
  139. 8.9. Extracting the User from a URL
  140. 8.10. Extracting the Host from a URL
  141. 8.11. Extracting the Port from a URL
  142. 8.12. Extracting the Path from a URL
  143. 8.13. Extracting the Query from a URL
  144. 8.14. Extracting the Fragment from a URL
  145. 8.15. Validating Domain Names
  146. 8.16. Matching IPv4 Addresses
  147. 8.17. Matching IPv6 Addresses
  148. 8.18. Validate Windows Paths
  149. 8.19. Split Windows Paths into Their Parts
  150. 8.20. Extract the Drive Letter from a Windows Path
  151. 8.21. Extract the Server and Share from a UNC Path
  152. 8.22. Extract the Folder from a Windows Path
  153. 8.23. Extract the Filename from a Windows Path
  154. 8.24. Extract the File Extension from a Windows Path
  155. 8.25. Strip Invalid Characters from Filenames
  156. 9. Markup and Data Formats
  157. Processing Markup and Data Formats with Regular Expressions
  158. 9.1. Find XML-Style Tags
  159. 9.2. Replace Tags with
  160. 9.3. Remove All XML-Style Tags Except and
  161. 9.4. Match XML Names
  162. 9.5. Convert Plain Text to HTML by Adding and Tags
  163. 9.6. Decode XML Entities
  164. 9.7. Find a Specific Attribute in XML-Style Tags
  165. 9.8. Add a cellspacing Attribute to Tags That Do Not Already Include It
  166. 9.9. Remove XML-Style Comments
  167. 9.10. Find Words Within XML-Style Comments
  168. 9.11. Change the Delimiter Used in CSV Files
  169. 9.12. Extract CSV Fields from a Specific Column
  170. 9.13. Match INI Section Headers
  171. 9.14. Match INI Section Blocks
  172. 9.15. Match INI Name-Value Pairs
  173. Index
  174. About the Authors
  175. Colophon
  176. Copyright

5.6. Find Any Word Not Preceded by a Specific Word

    Problem

    You want to match any word that is not immediately preceded by the word cat, ignoring any whitespace, punctuation, or other nonword characters that come between.

    Solution

    Lookbehind you

    Lookbehind lets you check if text appears before a given position. It works by instructing the regex engine to temporarily step backward in the string, checking whether something can be found ending at the position where you placed the lookbehind. See Recipe 2.16 if you need to brush up on the details of lookbehind.

    The following regexes use negative lookbehind, (?<!...). Unfortunately, the regex flavors covered by this book differ in what kinds of patterns they allow you to place within lookbehind. The solutions therefore end up working a bit differently in each case. Read on to the Discussion section of this recipe for further details.

    Words not preceded by “cat”

    Any number of separating nonword characters:

    (?<!\bcat\W+)\b\w+
    Regex options: Case insensitive
    Regex flavor: .NET

    Limited number of separating nonword characters:

    (?<!\bcat\W{1,9})\b\w+
    Regex options: Case insensitive
    Regex flavors: .NET, Java

    Single separating nonword character:

    (?<!\bcat\W)\b\w+
    Regex options: Case insensitive
    Regex flavors: .NET, Java, PCRE, Perl, Python

    (?<!\Wcat\W)(?<!^cat\W)\b\w+
    Regex options: Case insensitive
    Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9

Simulate lookbehind

JavaScript (before ECMAScript 2018) and Ruby 1.8 do not support lookbehind at all, even though they do support lookahead. However, because the lookbehind for this problem appears at the very beginning of the regex, it’s possible to simulate the lookbehind by splitting the regex into two parts, as demonstrated in the following JavaScript example:

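var subject = "My cat is fluffy.",
    mainRegex = /\b\w+/g,      // pattern that follows the lookbehind; needs /g
    lookbehind = /\bcat\W+$/i, // pattern inside the lookbehind, anchored with $
    lookbehindType = false, // false for negative, true for positive
    matches = [],
    match,
    leftContext;

while ((match = mainRegex.exec(subject)) !== null) {
    // Everything to the left of the current match
    leftContext = subject.substring(0, match.index);

    if (lookbehindType === lookbehind.test(leftContext)) {
        matches.push(match[0]);
    } else {
        // Rejected: retry one character after the start of this match
        mainRegex.lastIndex = match.index + 1;
    }
}

// matches:  ["My", "cat", "fluffy"]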

Discussion

Fixed, finite, and infinite length lookbehind

The first regular expression uses the negative lookbehind (?<!\bcat\W+). Because the + quantifier used inside the lookbehind has no upper limit on how many characters it can match, this version works with the .NET regular expression flavor only. All of the other regular expression flavors covered by this book require a fixed or maximum (finite) length for lookbehind patterns.

The second regular expression replaces the + within the lookbehind with {1,9}. As a result, it can be used with .NET and Java, both of which support variable-length lookbehind when there is a known upper limit to how many characters can be matched within them. I’ve arbitrarily chosen a maximum length of nine nonword characters to separate the words. That allows a bit of punctuation and a few blank lines to separate the words. Unless you’re working with unusual subject text, this will probably end up working exactly like the previous .NET-only solution. Even in .NET, however, providing a reasonable repetition limit for any quantifiers inside lookbehind is a good safety practice since it reduces the amount of unanticipated backtracking that can potentially occur within the lookbehind.

The third regular expression drops the quantifier after the \W nonword character inside the lookbehind entirely. Doing so lets the lookbehind test a fixed-length string, thereby adding support for PCRE, Perl, and Python. But it’s a steep price to pay: the regular expression now only avoids matching words that are preceded by “cat” and exactly one separating character. The regex correctly matches only cat in the string cat fluff, but it matches both cat and fluff in the string cat, fluff.
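To make the trade-off between the three lookbehind lengths concrete, here is a minimal sketch that reuses the two-regex loop from the Solution, so it does not depend on native lookbehind support. The helper name wordsNotPrecededBy and the three variable names are hypothetical, chosen only for this illustration; each anchored pattern stands in for the lookbehind of the corresponding solution.

```javascript
// Hypothetical helper (not from the recipe): returns the words in `subject`
// that are NOT preceded by text matching `lookbehind`. The `lookbehind`
// regex must end with $ and must not use the /g flag.
function wordsNotPrecededBy(subject, lookbehind) {
    var mainRegex = /\b\w+/g,
        matches = [],
        match;

    while ((match = mainRegex.exec(subject)) !== null) {
        if (lookbehind.test(subject.substring(0, match.index))) {
            // Rejected: retry one character after the start of this match
            mainRegex.lastIndex = match.index + 1;
        } else {
            matches.push(match[0]);
        }
    }
    return matches;
}

var anyNumber  = /\bcat\W+$/i,     // stands in for (?<!\bcat\W+)
    upToNine   = /\bcat\W{1,9}$/i, // stands in for (?<!\bcat\W{1,9})
    exactlyOne = /\bcat\W$/i;      // stands in for (?<!\bcat\W)

// "cat" followed by ten dots, then "fluff"
var tenSeparators = "cat" + new Array(11).join(".") + "fluff";

wordsNotPrecededBy("cat fluff", exactlyOne);   // ["cat"]
wordsNotPrecededBy("cat, fluff", exactlyOne);  // ["cat", "fluff"]
wordsNotPrecededBy("cat, fluff", upToNine);    // ["cat"]
wordsNotPrecededBy(tenSeparators, upToNine);   // ["cat", "fluff"]
wordsNotPrecededBy(tenSeparators, anyNumber);  // ["cat"]
```

The second and fourth calls show where the shorter lookbehinds let fluff slip through: two separators defeat the single-character version, and ten defeat the {1,9} version.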

Since Ruby 1.9 doesn’t allow \b word boundaries in lookbehind, the fourth regular expression uses two separate lookbehinds. The first lookbehind prevents cat as the preceding word when it is itself preceded by a nonword character such as whitespace or punctuation. The second uses the ^ anchor to prevent cat as the preceding word when it appears at the start of the string.

Simulate lookbehind

JavaScript before ECMAScript 2018 does not support lookbehind, but the JavaScript example code shows how you can simulate lookbehind that appears at the beginning of a regex. It doesn’t impose any restrictions on the length of the text matched by the (simulated) lookbehind.

We start by splitting the (?<!\bcat\W+)\b\w+ regular expression from the first solution into two pieces: the pattern inside the lookbehind (\bcat\W+) and the pattern that comes after it (\b\w+). Append a $ to the end of the lookbehind pattern. If you need to use the “^ and $ match at line breaks” option (/m) with the lookbehind regex, use $(?!\s) instead of $ at the end of the lookbehind pattern to ensure that it can match only at the very end of its subject text. The lookbehindType variable controls whether we’re emulating positive or negative lookbehind. Use true for positive and false for negative.
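The note about the /m option can be checked in isolation. The sketch below assumes a hypothetical lookbehind pattern, ^cat\W, that needs /m so that ^ can match at the start of any line; the variable names and sample strings are made up for this illustration.

```javascript
// Left context in which "cat " ends the FIRST line, not the whole string
var leftContext = "cat \nwas ";

// With /m, a plain $ also matches just before the line break
var plainDollar = /^cat\W$/im;

// $(?!\s) rejects that position because a line break counts as whitespace
var strictDollar = /^cat\W$(?!\s)/im;

var plainResult = plainDollar.test(leftContext);   // true  (false positive)
var strictResult = strictDollar.test(leftContext); // false

// When "cat " really is at the very end, the strict version still matches
var atEndResult = strictDollar.test("dog\ncat ");  // true
```

Only the strict version answers the question the simulation needs answered: does the lookbehind pattern match at the very end of the left context?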

After the variables are set up, we use mainRegex and the exec() method to iterate over the subject string (see Recipe 3.11 for a description of this process). When a match is found, the part of the subject text before the match is copied into a new string variable (leftContext), and we test whether the lookbehind regex matches that string. Because of the anchor we appended to the end of lookbehind, this can only match immediately to the left of the match found by mainRegex, or in other words, at the end of leftContext. By comparing the result of the lookbehind test to lookbehindType, we can determine whether the match meets the complete criteria for a successful match.

Finally, we take one of two steps. If we have a successful match, append the text matched by mainRegex to the matches array. If not, change the position at which to continue searching for a match (using mainRegex.lastIndex) to the position one character after the starting position of mainRegex’s last match, rather than letting the next iteration of the exec() method start at the end of the current match.

Whew! We’re done.

This is an advanced trick that takes advantage of the lastIndex property that is dynamically updated with JavaScript regular expressions that use the /g (global) flag. Usually, updating and resetting lastIndex is something that happens automagically. Here, we use it to take control of the regex’s path through the subject string, moving forward and backward as necessary. This trick only lets you emulate lookbehind that appears at the beginning of a regex. With a few changes, the code could also be used to emulate lookbehind at the very end of a regex. However, it does not serve as a full substitute for lookbehind support. Due to the interplay of lookbehind and backtracking, this approach cannot help you accurately emulate the behavior of a lookbehind that appears in the middle of a regex.
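The following sketch wraps the loop in a function to show why moving lastIndex back matters. The function name simulateLeadingLookbehind and its rewind parameter are hypothetical; rewind exists only so the two behaviors can be compared, and the \d+ and \$ patterns are made-up examples rather than part of this recipe. The sketch assumes mainRegex uses /g and never matches an empty string, and that lookbehind ends with $ and does not use /g.

```javascript
// Hypothetical wrapper around the loop from the Solution.
// positive: true emulates (?<=...), false emulates (?<!...)
// rewind:   demonstration switch; the recipe always rewinds
function simulateLeadingLookbehind(subject, mainRegex, lookbehind, positive, rewind) {
    var matches = [],
        match,
        leftContext;

    mainRegex.lastIndex = 0; // always start at the beginning of the subject

    while ((match = mainRegex.exec(subject)) !== null) {
        leftContext = subject.substring(0, match.index);

        if (positive === lookbehind.test(leftContext)) {
            matches.push(match[0]);
        } else if (rewind) {
            // Let the next attempt start inside the rejected match
            mainRegex.lastIndex = match.index + 1;
        }
    }
    return matches;
}

// Emulating (?<!\$)\d+ : digits not immediately preceded by a dollar sign
var withRewind = simulateLeadingLookbehind("$100", /\d+/g, /\$$/, false, true);
var withoutRewind = simulateLeadingLookbehind("$100", /\d+/g, /\$$/, false, false);
// withRewind:    ["00"]
// withoutRewind: []

// Emulating the positive lookbehind (?<=\bcat\W+)\w+
var afterCat = simulateLeadingLookbehind("My cat is fluffy.", /\w+/g, /\bcat\W+$/i, true, true);
// afterCat: ["is"]
```

Without the rewind, the rejected match 100 is skipped as a whole and the 00 that starts inside it is never tried. With the rewind, the second attempt starts at the first 0, where the preceding character is 1 rather than a dollar sign, which is also where an engine with native negative lookbehind would find its match.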

Variations

If you want to match words that are preceded by cat (without including the word cat and its following nonword characters as part of the matched text), change the negative lookbehind to positive lookbehind, as shown next.

Any number of separating nonword characters:

(?<=\bcat\W+)\w+
Regex options: Case insensitive
Regex flavor: .NET

Limited number of separating nonword characters:

(?<=\bcat\W{1,9})\w+
Regex options: Case insensitive
Regex flavors: .NET, Java

Single separating nonword character:

(?<=\bcat\W)\w+
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python

(?:(?<=\Wcat\W)|(?<=^cat\W))\w+
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9

These adapted versions of the regexes no longer include a \b word boundary before the \w+ at the end because the positive lookbehinds already ensure that any match is preceded by a nonword character. The last regex (which adds support for Ruby 1.9) wraps its two positive lookbehinds in (?:...|...), since only one of the lookbehinds can match at a given position.

PCRE 7.2 and Perl 5.10 support the fancy \K or keep operator that resets the starting position for the part of a match that is returned in the match result (see Alternative to Lookbehind for more details). We can use this to come close to emulating leading infinite-length positive lookbehind, as shown in the next regex:

\bcat\W+\K\w+
Regex options: Case insensitive
Regex flavors: PCRE 7.2, Perl 5.10

There is a subtle but important difference between this and the .NET-only regex that allowed any number of separating nonword characters. Unlike with lookbehind, the text matched to the left of the \K is consumed by the match even though it is not included in the match result. You can see this difference by comparing the results of the regexes with \K and positive lookbehind when they’re applied to the subject string cat cat cat cat. In Perl and PHP, if you replace all matches of (?<=\bcat\W)\w+ with «dog», you’ll get the result cat dog dog dog, since only the first word is not itself preceded by cat. If you use the regex \bcat\W+\K\w+ to perform the same replacement, the result will be cat dog cat dog. After matching the leading cat cat (and replacing it with cat dog), the next match attempt can’t peek to the left of its starting position like lookbehind does. The regex matches the second cat cat, which is again replaced with cat dog.
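JavaScript has no \K, so the sketch below only approximates the two behaviors in order to reproduce the cat cat cat cat results. The consuming version swaps in a capturing group that is reinserted with $1, which consumes the prefix the way text to the left of \K is consumed. The lookbehind version uses the two-regex simulation from earlier in this recipe. Both function names are hypothetical.

```javascript
// Stand-in for \bcat\W+\K\w+ : the prefix is consumed by the match and
// put back unchanged through the $1 backreference.
function replaceConsuming(subject) {
    return subject.replace(/(\bcat\W+)\w+/gi, "$1dog");
}

// Stand-in for (?<=\bcat\W)\w+ : the prefix is only tested, never consumed.
function replaceWithLookbehind(subject) {
    var mainRegex = /\w+/g,
        lookbehind = /\bcat\W$/i,
        result = "",
        copiedUpTo = 0, // end of the text already copied into result
        match;

    while ((match = mainRegex.exec(subject)) !== null) {
        if (lookbehind.test(subject.substring(0, match.index))) {
            // Copy the untouched text, then the replacement
            result += subject.substring(copiedUpTo, match.index) + "dog";
            copiedUpTo = match.index + match[0].length;
        } else {
            mainRegex.lastIndex = match.index + 1;
        }
    }
    return result + subject.substring(copiedUpTo);
}

var consumed = replaceConsuming("cat cat cat cat");        // "cat dog cat dog"
var lookedBehind = replaceWithLookbehind("cat cat cat cat"); // "cat dog dog dog"
```

The lookbehind version always tests against the original subject, so the third and fourth cat still see a cat to their left. The consuming version has already used up the second cat as part of the first match, so the third cat has to start a new match of its own.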

See Also

Recipe 5.4 explains how to find all except a specific word. Recipe 5.5 explains how to find any word not followed by a specific word.

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.6 explains word boundaries. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround. It also explains \K, in the section Alternative to Lookbehind.