Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Regular Expressions Cookbook, 2nd Edition, by Steven Levithan. Published by O'Reilly Media, Inc., 2012.
  1. Cover
  2. Regular Expressions Cookbook
  3. Preface
  4. Caught in the Snarls of Different Versions
  5. Intended Audience
  6. Technology Covered
  7. Organization of This Book
  8. Conventions Used in This Book
  9. Using Code Examples
  10. Safari® Books Online
  11. How to Contact Us
  12. Acknowledgments
  13. 1. Introduction to Regular Expressions
  14. Regular Expressions Defined
  15. Search and Replace with Regular Expressions
  16. Tools for Working with Regular Expressions
  17. 2. Basic Regular Expression Skills
  18. 2.1. Match Literal Text
  19. 2.2. Match Nonprintable Characters
  20. 2.3. Match One of Many Characters
  21. 2.4. Match Any Character
  22. 2.5. Match Something at the Start and/or the End of a Line
  23. 2.6. Match Whole Words
  24. 2.7. Unicode Code Points, Categories, Blocks, and Scripts
  25. 2.8. Match One of Several Alternatives
  26. 2.9. Group and Capture Parts of the Match
  27. 2.10. Match Previously Matched Text Again
  28. 2.11. Capture and Name Parts of the Match
  29. 2.12. Repeat Part of the Regex a Certain Number of Times
  30. 2.13. Choose Minimal or Maximal Repetition
  31. 2.14. Eliminate Needless Backtracking
  32. 2.15. Prevent Runaway Repetition
  33. 2.16. Test for a Match Without Adding It to the Overall Match
  34. 2.17. Match One of Two Alternatives Based on a Condition
  35. 2.18. Add Comments to a Regular Expression
  36. 2.19. Insert Literal Text into the Replacement Text
  37. 2.20. Insert the Regex Match into the Replacement Text
  38. 2.21. Insert Part of the Regex Match into the Replacement Text
  39. 2.22. Insert Match Context into the Replacement Text
  40. 3. Programming with Regular Expressions
  41. Programming Languages and Regex Flavors
  42. 3.1. Literal Regular Expressions in Source Code
  43. 3.2. Import the Regular Expression Library
  44. 3.3. Create Regular Expression Objects
  45. 3.4. Set Regular Expression Options
  46. 3.5. Test If a Match Can Be Found Within a Subject String
  47. 3.6. Test Whether a Regex Matches the Subject String Entirely
  48. 3.7. Retrieve the Matched Text
  49. 3.8. Determine the Position and Length of the Match
  50. 3.9. Retrieve Part of the Matched Text
  51. 3.10. Retrieve a List of All Matches
  52. 3.11. Iterate over All Matches
  53. 3.12. Validate Matches in Procedural Code
  54. 3.13. Find a Match Within Another Match
  55. 3.14. Replace All Matches
  56. 3.15. Replace Matches Reusing Parts of the Match
  57. 3.16. Replace Matches with Replacements Generated in Code
  58. 3.17. Replace All Matches Within the Matches of Another Regex
  59. 3.18. Replace All Matches Between the Matches of Another Regex
  60. 3.19. Split a String
  61. 3.20. Split a String, Keeping the Regex Matches
  62. 3.21. Search Line by Line
  63. Construct a Parser
  64. 4. Validation and Formatting
  65. 4.1. Validate Email Addresses
  66. 4.2. Validate and Format North American Phone Numbers
  67. 4.3. Validate International Phone Numbers
  68. 4.4. Validate Traditional Date Formats
  69. 4.5. Validate Traditional Date Formats, Excluding Invalid Dates
  70. 4.6. Validate Traditional Time Formats
  71. 4.7. Validate ISO 8601 Dates and Times
  72. 4.8. Limit Input to Alphanumeric Characters
  73. 4.9. Limit the Length of Text
  74. 4.10. Limit the Number of Lines in Text
  75. 4.11. Validate Affirmative Responses
  76. 4.12. Validate Social Security Numbers
  77. 4.13. Validate ISBNs
  78. 4.14. Validate ZIP Codes
  79. 4.15. Validate Canadian Postal Codes
  80. 4.16. Validate U.K. Postcodes
  81. 4.17. Find Addresses with Post Office Boxes
  82. 4.18. Reformat Names From “FirstName LastName” to “LastName, FirstName”
  83. 4.19. Validate Password Complexity
  84. 4.20. Validate Credit Card Numbers
  85. 4.21. European VAT Numbers
  86. 5. Words, Lines, and Special Characters
  87. 5.1. Find a Specific Word
  88. 5.2. Find Any of Multiple Words
  89. 5.3. Find Similar Words
  90. 5.4. Find All Except a Specific Word
  91. 5.5. Find Any Word Not Followed by a Specific Word
  92. 5.6. Find Any Word Not Preceded by a Specific Word
  93. 5.7. Find Words Near Each Other
  94. 5.8. Find Repeated Words
  95. 5.9. Remove Duplicate Lines
  96. 5.10. Match Complete Lines That Contain a Word
  97. 5.11. Match Complete Lines That Do Not Contain a Word
  98. 5.12. Trim Leading and Trailing Whitespace
  99. 5.13. Replace Repeated Whitespace with a Single Space
  100. 5.14. Escape Regular Expression Metacharacters
  101. 6. Numbers
  102. 6.1. Integer Numbers
  103. 6.2. Hexadecimal Numbers
  104. 6.3. Binary Numbers
  105. 6.4. Octal Numbers
  106. 6.5. Decimal Numbers
  107. 6.6. Strip Leading Zeros
  108. 6.7. Numbers Within a Certain Range
  109. 6.8. Hexadecimal Numbers Within a Certain Range
  110. 6.9. Integer Numbers with Separators
  111. 6.10. Floating-Point Numbers
  112. 6.11. Numbers with Thousand Separators
  113. 6.12. Add Thousand Separators to Numbers
  114. 6.13. Roman Numerals
  115. 7. Source Code and Log Files
  116. Keywords
  117. Identifiers
  118. Numeric Constants
  119. Operators
  120. Single-Line Comments
  121. Multiline Comments
  122. All Comments
  123. Strings
  124. Strings with Escapes
  125. Regex Literals
  126. Here Documents
  127. Common Log Format
  128. Combined Log Format
  129. Broken Links Reported in Web Logs
  130. 8. URLs, Paths, and Internet Addresses
  131. 8.1. Validating URLs
  132. 8.2. Finding URLs Within Full Text
  133. 8.3. Finding Quoted URLs in Full Text
  134. 8.4. Finding URLs with Parentheses in Full Text
  135. 8.5. Turn URLs into Links
  136. 8.6. Validating URNs
  137. 8.7. Validating Generic URLs
  138. 8.8. Extracting the Scheme from a URL
  139. 8.9. Extracting the User from a URL
  140. 8.10. Extracting the Host from a URL
  141. 8.11. Extracting the Port from a URL
  142. 8.12. Extracting the Path from a URL
  143. 8.13. Extracting the Query from a URL
  144. 8.14. Extracting the Fragment from a URL
  145. 8.15. Validating Domain Names
  146. 8.16. Matching IPv4 Addresses
  147. 8.17. Matching IPv6 Addresses
  148. 8.18. Validate Windows Paths
  149. 8.19. Split Windows Paths into Their Parts
  150. 8.20. Extract the Drive Letter from a Windows Path
  151. 8.21. Extract the Server and Share from a UNC Path
  152. 8.22. Extract the Folder from a Windows Path
  153. 8.23. Extract the Filename from a Windows Path
  154. 8.24. Extract the File Extension from a Windows Path
  155. 8.25. Strip Invalid Characters from Filenames
  156. 9. Markup and Data Formats
  157. Processing Markup and Data Formats with Regular Expressions
  158. 9.1. Find XML-Style Tags
  159. 9.2. Replace Tags with
  160. 9.3. Remove All XML-Style Tags Except and
  161. 9.4. Match XML Names
  162. 9.5. Convert Plain Text to HTML by Adding and Tags
  163. 9.6. Decode XML Entities
  164. 9.7. Find a Specific Attribute in XML-Style Tags
  165. 9.8. Add a cellspacing Attribute to Tags That Do Not Already Include It
  166. 9.9. Remove XML-Style Comments
  167. 9.10. Find Words Within XML-Style Comments
  168. 9.11. Change the Delimiter Used in CSV Files
  169. 9.12. Extract CSV Fields from a Specific Column
  170. 9.13. Match INI Section Headers
  171. 9.14. Match INI Section Blocks
  172. 9.15. Match INI Name-Value Pairs
  173. Index
  174. About the Authors
  175. Colophon
  176. Copyright
  3.16. Replace Matches with Replacements Generated in Code

    Problem

    You want to replace all matches of a regular expression with a new string that you build up in procedural code. You want to be able to replace each match with a different string, based on the text that was actually matched.

    For example, suppose you want to replace all numbers in a string with the number multiplied by two.

    Solution

    C#

    You can use the static call when you process only a small number of strings with the same regular expression:

    string resultString = Regex.Replace(subjectString, @"\d+",
                          new MatchEvaluator(ComputeReplacement));

    Construct a Regex object if you want to use the same regular expression with a large number of strings:

    Regex regexObj = new Regex(@"\d+");
    string resultString = regexObj.Replace(subjectString,
                          new MatchEvaluator(ComputeReplacement));

    Both code snippets call the function ComputeReplacement. You should add this method to the class in which you’re implementing this solution:

    public String ComputeReplacement(Match matchResult) {
        int twiceasmuch = int.Parse(matchResult.Value) * 2;
        return twiceasmuch.ToString();
    }

    VB.NET

    You can use the static call when you process only a small number of strings with the same regular expression:

    Dim MyMatchEvaluator As New MatchEvaluator(AddressOf ComputeReplacement)
    Dim ResultString = Regex.Replace(SubjectString, "\d+", MyMatchEvaluator)

    Construct a Regex object if you want to use the same regular expression with a large number of strings:

    Dim RegexObj As New Regex("\d+")
    Dim MyMatchEvaluator As New MatchEvaluator(AddressOf ComputeReplacement)
    Dim ResultString = RegexObj.Replace(SubjectString, MyMatchEvaluator)

    Both code snippets call the function ComputeReplacement. You should add this method to the class in which you’re implementing this solution:

    Public Function ComputeReplacement(ByVal MatchResult As Match) As String
        Dim TwiceAsMuch = Integer.Parse(MatchResult.Value) * 2
        Return TwiceAsMuch.ToString()
    End Function

    Java

    StringBuffer resultString = new StringBuffer();
    Pattern regex = Pattern.compile("\\d+");
    Matcher regexMatcher = regex.matcher(subjectString);
    while (regexMatcher.find()) {
        Integer twiceasmuch = Integer.parseInt(regexMatcher.group()) * 2;
        regexMatcher.appendReplacement(resultString, twiceasmuch.toString());
    }
    regexMatcher.appendTail(resultString);

    JavaScript

    var result = subject.replace(/\d+/g, function(match) {
        return match * 2;
    });

    PHP

    Using a declared callback function:

    $result = preg_replace_callback('/\d+/', 'compute_replacement', $subject);
    
    function compute_replacement($groups) {
        return $groups[0] * 2;
    }

    Using an anonymous callback function:

    $result = preg_replace_callback(
        '/\d+/',
        // A closure; create_function() was removed in PHP 8.0.
        function ($groups) {
            return $groups[0] * 2;
        },
        $subject
    );

    Perl

    $subject =~ s/\d+/$& * 2/eg;

    Python

    If you have only a few strings to process, you can use the global function:

    result = re.sub(r"\d+", computereplacement, subject)

    To use the same regex repeatedly, use a compiled object:

    reobj = re.compile(r"\d+")
    result = reobj.sub(computereplacement, subject)

    Both code snippets call the function computereplacement. This function needs to be declared before you can pass it to sub().

    def computereplacement(matchobj):
        return str(int(matchobj.group()) * 2)

    Ruby

    result = subject.gsub(/\d+/) {|match|
        Integer(match) * 2
    }

    Discussion

    When using a string as the replacement text, you can do only basic text substitution. To replace each match with something totally different that varies along with the match being replaced, you need to create the replacement text in your own code.
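
    The difference between the two kinds of replacement can be seen side by side. The following is a minimal Python sketch, not part of the recipe's solutions; the sample subject string and the variable names wrapped and doubled are made up for illustration. A replacement string can only reinsert text that was matched, while a replacement function can compute something new from it.

```python
import re

subject = "3 apples and 12 pears"  # hypothetical sample text

# A replacement string can only reinsert matched text, here wrapped in brackets.
wrapped = re.sub(r"\d+", r"[\g<0>]", subject)

# A replacement function can compute new text from each match.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), subject)

print(wrapped)  # [3] apples and [12] pears
print(doubled)  # 6 apples and 24 pears
```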

    C#

    Recipe 3.14 discusses the various ways in which you can call the Regex.Replace() method, passing a string as the replacement text. When using a static call, the replacement is the third parameter, after the subject and the regular expression. If you passed the regular expression to the Regex() constructor, you can call Replace() on that object with the replacement as the second parameter.

    Instead of passing a string as the second or third parameter, you can pass a MatchEvaluator delegate. This delegate is a reference to a member function that you add to the class where you’re doing the search-and-replace. To create the delegate, use the new keyword to call the MatchEvaluator() constructor. Pass your member function as the only parameter to MatchEvaluator().

    The function you want to use for the delegate should return a string and take one parameter of class System.Text.RegularExpressions.Match. This is the same Match class returned by the Regex.Match() member used in nearly all the previous recipes in this chapter.

    When you call Replace() with a MatchEvaluator as the replacement, your function will be called for each regular expression match that needs to be replaced. Your function needs to return the replacement text. You can use any of the properties of the Match object to build your replacement text. The example shown earlier uses matchResult.Value to retrieve the string with the whole regex match. Often, you’ll use matchResult.Groups[] to build up your replacement text from the capturing groups in your regular expression.

    If you do not want to replace certain regex matches, your function should return matchResult.Value. If you return null or an empty string, the regex match is replaced with nothing (i.e., deleted).

    VB.NET

    Recipe 3.14 discusses the various ways in which you can call the Regex.Replace() method, passing a string as the replacement text. When using a static call, the replacement text is the third parameter, after the subject and the regular expression. If you used the Dim keyword to create a variable with your regular expression, you can call Replace() on that object with the replacement as the second parameter.

    Instead of passing a string as the second or third parameter, you can pass a MatchEvaluator object. This object holds a reference to a function that you add to the class where you’re doing the search-and-replace. Use the Dim keyword to create a new variable of type MatchEvaluator. Pass one parameter with the AddressOf keyword followed by the name of your member function. The AddressOf operator returns a reference to your function, without actually calling the function at that point.

    The function you want to use for MatchEvaluator should return a string and should take one parameter of class System.Text.RegularExpressions.Match. This is the same Match class returned by the Regex.Match() member used in nearly all the previous recipes in this chapter. The parameter will be passed by value, so you have to declare it with ByVal.

    When you call Replace() with a MatchEvaluator as the replacement, your function will be called for each regular expression match that needs to be replaced. Your function needs to return the replacement text. You can use any of the properties of the Match object to build your replacement text. The example uses MatchResult.Value to retrieve the string with the whole regex match. Often, you’ll use MatchResult.Groups() to build up your replacement text from the capturing groups in your regular expression.

    If you do not want to replace certain regex matches, your function should return MatchResult.Value. If you return Nothing or an empty string, the regex match is replaced with nothing (i.e., deleted).

    Java

    The Java solution is very straightforward. We iterate over all the regex matches as explained in Recipe 3.11. Inside the loop, we call appendReplacement() on our Matcher object. When find() fails to find any further matches, we call appendTail(). The two methods appendReplacement() and appendTail() make it very easy to use a different replacement text for each regex match.

    appendReplacement() takes two parameters. The first is the StringBuffer where you’re (temporarily) storing the result of the search-and-replace in progress. The second is the replacement text to be used for the last match found by find(). This replacement text can include references to capturing groups, such as "$1". If there is a syntax error in your replacement text, an IllegalArgumentException is thrown. If the replacement text references a capturing group that does not exist, an IndexOutOfBoundsException is thrown instead. If you call appendReplacement() without a prior successful call to find(), it throws an IllegalStateException.

    If you call appendReplacement() correctly, it does two things. First, it copies the text located between the previous and current regex match to the string buffer, without making any modifications to the text. If the current match is the first one, it copies all the text before that match. After that, it appends your replacement text, substituting any backreferences in it with the text matched by the referenced capturing groups.

    If you want to delete a particular match, simply replace it with an empty string. If you want to leave a match in the string unchanged, you can omit the call to appendReplacement() for that match. In the description above, “previous regex match” means the previous match for which you called appendReplacement(). If you don’t call appendReplacement() for certain matches, those become part of the text between the matches that you do replace, which is copied unchanged into the target string buffer.

    When you’re done replacing matches, call appendTail(). That copies the text at the end of the string after the last regex match for which you called appendReplacement().
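
    The copying that appendReplacement() and appendTail() perform can be sketched in a few lines. The following Python sketch only illustrates the bookkeeping described above and is not Java's implementation. The names replace_with_callback and double_large are hypothetical, and returning None to skip a match is a convention assumed for this sketch (it stands in for omitting the appendReplacement() call).

```python
import re

def replace_with_callback(pattern, subject, compute):
    """Rebuild subject, asking compute(match) for each replacement text.

    compute may return None to leave that match untouched (sketch convention).
    """
    pieces = []
    last_end = 0  # end of the last match that was actually replaced
    for match in re.finditer(pattern, subject):
        replacement = compute(match)
        if replacement is None:
            continue  # skipped matches are copied later as ordinary text
        # Copy the text between the previously replaced match and this one.
        pieces.append(subject[last_end:match.start()])
        pieces.append(replacement)
        last_end = match.end()
    # The "tail": everything after the last replaced match.
    pieces.append(subject[last_end:])
    return "".join(pieces)

def double_large(match):
    """Double numbers of 10 or more; leave smaller numbers alone."""
    value = int(match.group())
    return str(value * 2) if value >= 10 else None

result = replace_with_callback(r"\d+", "3 apples, 12 pears, 40 plums", double_large)
print(result)  # 3 apples, 24 pears, 80 plums
```

    Because the skipped match "3" never advances last_end, it is copied along with the text that precedes the next replaced match, which mirrors the behavior described for Java.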

    JavaScript

    In JavaScript, a function is really just another object that can be assigned to a variable. Instead of passing a literal string or a variable that holds a string to the string.replace() function, we can pass a function that returns a string. This function is then called each time a replacement needs to be made.

    You can make your replacement function accept one or more parameters. If you do, the first parameter will be set to the text matched by the regular expression. If your regular expression has capturing groups, the second parameter will hold the text matched by the first capturing group, the third parameter gives you the text of the second capturing group, and so on. You can use these parameters to compose the replacement from parts of the regular expression match.

    The replacement function in the JavaScript solution for this recipe simply takes the text matched by the regular expression, and returns it multiplied by two. JavaScript handles the string-to-number and number-to-string conversions implicitly.

    PHP

    The preg_replace_callback() function works just like the preg_replace() function described in Recipe 3.14. It takes a regular expression, replacement, subject string, optional replacement limit, and optional replacement count. The regular expression and subject string can be single strings or arrays.

    The difference is that preg_replace_callback() expects the second parameter to be a function rather than the actual replacement text. If you declare the function in your code, then the name of the function must be passed as a string. Alternatively, you can pass an anonymous function. Older code builds one with create_function(), which was deprecated in PHP 7.2 and removed in PHP 8.0; PHP 5.3 and later accept a closure written with the function keyword instead. Either way, your replacement function should take one parameter and return a string (or something that can be coerced into a string).

    Each time preg_replace_callback() finds a regex match, it will call your callback function. The parameter will be filled with an array of strings. Element zero holds the overall regex match, and elements one and beyond hold the text matched by capturing groups one and beyond. You can use this array to build up your replacement text using the text matched by the regular expression or one or more capturing groups.

    Perl

    The s/// operator supports one extra modifier that is ignored by the m// operator: /e. The /e, or “evaluate,” modifier tells the substitution operator to evaluate the replacement part as Perl code, instead of interpreting it as the contents of a double-quoted string. Using this modifier, we can easily retrieve the matched text with the $& variable, and then multiply it by two. The result of the code is used as the replacement string.

    Python

    Python’s sub() function allows you to pass the name of a function instead of a string as the replacement text. This function is then called for each regex match to be replaced.

    You need to declare this function before you can reference it. It should take one parameter to receive a match object (called MatchObject in older Python documentation and re.Match in current Python 3), which is the same type of object returned by the search() function. You can use it to retrieve (part of) the regex match to build your replacement. See Recipe 3.7 and Recipe 3.9 for details.

    Your function should return a string with the replacement text.
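
    As a further illustration, a replacement function can build its result from capturing groups and can leave selected matches unchanged by returning the matched text itself. This is a minimal sketch under assumed inputs: the function name double_pixels, the regex, and the sample string are made up for this example and do not come from the recipe.

```python
import re

def double_pixels(matchobj):
    """Double pixel values; return other matches exactly as they were found."""
    number, unit = matchobj.group(1), matchobj.group(2)
    if unit != "px":
        return matchobj.group()  # returning the whole match leaves it unchanged
    return str(int(number) * 2) + unit

result = re.sub(r"(\d+)(px|em)", double_pixels, "margin: 10px 2em 5px")
print(result)  # margin: 20px 2em 10px
```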

    Ruby

    The previous two recipes called the gsub() method of the String class with two parameters: the regex and the replacement text. This method also exists in block form.

    In block form, gsub() takes your regular expression as its only parameter. It fills one iterator variable with a string that holds the text matched by the regular expression. If you supply additional iterator variables, they are set to nil, even if your regular expression has capturing groups.

    Inside the block, place an expression that evaluates to the string that you want to use as the replacement text. You can use the special regex match variables, such as $~, $&, and $1, inside the block. Their values change each time the block is evaluated to make another replacement. See Recipes 3.7, 3.8, and 3.9 for details.

    You cannot use replacement text tokens such as «\1». Those remain as literal text.

    See Also

    Recipe 3.9 shows code to get the text matched by a particular part (capturing group) of a regex.

    Recipe 3.15 shows code to make a search-and-replace reinsert parts of the text matched by the regular expression.