Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Regular Expressions Cookbook, 2nd Edition, by Steven Levithan. Published by O'Reilly Media, Inc., 2012.
  1. Cover
  2. Regular Expressions Cookbook
  3. Preface
  4. Caught in the Snarls of Different Versions
  5. Intended Audience
  6. Technology Covered
  7. Organization of This Book
  8. Conventions Used in This Book
  9. Using Code Examples
  10. Safari® Books Online
  11. How to Contact Us
  12. Acknowledgments
  13. 1. Introduction to Regular Expressions
  14. Regular Expressions Defined
  15. Search and Replace with Regular Expressions
  16. Tools for Working with Regular Expressions
  17. 2. Basic Regular Expression Skills
  18. 2.1. Match Literal Text
  19. 2.2. Match Nonprintable Characters
  20. 2.3. Match One of Many Characters
  21. 2.4. Match Any Character
  22. 2.5. Match Something at the Start and/or the End of a Line
  23. 2.6. Match Whole Words
  24. 2.7. Unicode Code Points, Categories, Blocks, and Scripts
  25. 2.8. Match One of Several Alternatives
  26. 2.9. Group and Capture Parts of the Match
  27. 2.10. Match Previously Matched Text Again
  28. 2.11. Capture and Name Parts of the Match
  29. 2.12. Repeat Part of the Regex a Certain Number of Times
  30. 2.13. Choose Minimal or Maximal Repetition
  31. 2.14. Eliminate Needless Backtracking
  32. 2.15. Prevent Runaway Repetition
  33. 2.16. Test for a Match Without Adding It to the Overall Match
  34. 2.17. Match One of Two Alternatives Based on a Condition
  35. 2.18. Add Comments to a Regular Expression
  36. 2.19. Insert Literal Text into the Replacement Text
  37. 2.20. Insert the Regex Match into the Replacement Text
  38. 2.21. Insert Part of the Regex Match into the Replacement Text
  39. 2.22. Insert Match Context into the Replacement Text
  40. 3. Programming with Regular Expressions
  41. Programming Languages and Regex Flavors
  42. 3.1. Literal Regular Expressions in Source Code
  43. 3.2. Import the Regular Expression Library
  44. 3.3. Create Regular Expression Objects
  45. 3.4. Set Regular Expression Options
  46. 3.5. Test If a Match Can Be Found Within a Subject String
  47. 3.6. Test Whether a Regex Matches the Subject String Entirely
  48. 3.7. Retrieve the Matched Text
  49. 3.8. Determine the Position and Length of the Match
  50. 3.9. Retrieve Part of the Matched Text
  51. 3.10. Retrieve a List of All Matches
  52. 3.11. Iterate over All Matches
  53. 3.12. Validate Matches in Procedural Code
  54. 3.13. Find a Match Within Another Match
  55. 3.14. Replace All Matches
  56. 3.15. Replace Matches Reusing Parts of the Match
  57. 3.16. Replace Matches with Replacements Generated in Code
  58. 3.17. Replace All Matches Within the Matches of Another Regex
  59. 3.18. Replace All Matches Between the Matches of Another Regex
  60. 3.19. Split a String
  61. 3.20. Split a String, Keeping the Regex Matches
  62. 3.21. Search Line by Line
  63. Construct a Parser
  64. 4. Validation and Formatting
  65. 4.1. Validate Email Addresses
  66. 4.2. Validate and Format North American Phone Numbers
  67. 4.3. Validate International Phone Numbers
  68. 4.4. Validate Traditional Date Formats
  69. 4.5. Validate Traditional Date Formats, Excluding Invalid Dates
  70. 4.6. Validate Traditional Time Formats
  71. 4.7. Validate ISO 8601 Dates and Times
  72. 4.8. Limit Input to Alphanumeric Characters
  73. 4.9. Limit the Length of Text
  74. 4.10. Limit the Number of Lines in Text
  75. 4.11. Validate Affirmative Responses
  76. 4.12. Validate Social Security Numbers
  77. 4.13. Validate ISBNs
  78. 4.14. Validate ZIP Codes
  79. 4.15. Validate Canadian Postal Codes
  80. 4.16. Validate U.K. Postcodes
  81. 4.17. Find Addresses with Post Office Boxes
  82. 4.18. Reformat Names From “FirstName LastName” to “LastName, FirstName”
  83. 4.19. Validate Password Complexity
  84. 4.20. Validate Credit Card Numbers
  85. 4.21. European VAT Numbers
  86. 5. Words, Lines, and Special Characters
  87. 5.1. Find a Specific Word
  88. 5.2. Find Any of Multiple Words
  89. 5.3. Find Similar Words
  90. 5.4. Find All Except a Specific Word
  91. 5.5. Find Any Word Not Followed by a Specific Word
  92. 5.6. Find Any Word Not Preceded by a Specific Word
  93. 5.7. Find Words Near Each Other
  94. 5.8. Find Repeated Words
  95. 5.9. Remove Duplicate Lines
  96. 5.10. Match Complete Lines That Contain a Word
  97. 5.11. Match Complete Lines That Do Not Contain a Word
  98. 5.12. Trim Leading and Trailing Whitespace
  99. 5.13. Replace Repeated Whitespace with a Single Space
  100. 5.14. Escape Regular Expression Metacharacters
  101. 6. Numbers
  102. 6.1. Integer Numbers
  103. 6.2. Hexadecimal Numbers
  104. 6.3. Binary Numbers
  105. 6.4. Octal Numbers
  106. 6.5. Decimal Numbers
  107. 6.6. Strip Leading Zeros
  108. 6.7. Numbers Within a Certain Range
  109. 6.8. Hexadecimal Numbers Within a Certain Range
  110. 6.9. Integer Numbers with Separators
  111. 6.10. Floating-Point Numbers
  112. 6.11. Numbers with Thousand Separators
  113. 6.12. Add Thousand Separators to Numbers
  114. 6.13. Roman Numerals
  115. 7. Source Code and Log Files
  116. Keywords
  117. Identifiers
  118. Numeric Constants
  119. Operators
  120. Single-Line Comments
  121. Multiline Comments
  122. All Comments
  123. Strings
  124. Strings with Escapes
  125. Regex Literals
  126. Here Documents
  127. Common Log Format
  128. Combined Log Format
  129. Broken Links Reported in Web Logs
  130. 8. URLs, Paths, and Internet Addresses
  131. 8.1. Validating URLs
  132. 8.2. Finding URLs Within Full Text
  133. 8.3. Finding Quoted URLs in Full Text
  134. 8.4. Finding URLs with Parentheses in Full Text
  135. 8.5. Turn URLs into Links
  136. 8.6. Validating URNs
  137. 8.7. Validating Generic URLs
  138. 8.8. Extracting the Scheme from a URL
  139. 8.9. Extracting the User from a URL
  140. 8.10. Extracting the Host from a URL
  141. 8.11. Extracting the Port from a URL
  142. 8.12. Extracting the Path from a URL
  143. 8.13. Extracting the Query from a URL
  144. 8.14. Extracting the Fragment from a URL
  145. 8.15. Validating Domain Names
  146. 8.16. Matching IPv4 Addresses
  147. 8.17. Matching IPv6 Addresses
  148. 8.18. Validate Windows Paths
  149. 8.19. Split Windows Paths into Their Parts
  150. 8.20. Extract the Drive Letter from a Windows Path
  151. 8.21. Extract the Server and Share from a UNC Path
  152. 8.22. Extract the Folder from a Windows Path
  153. 8.23. Extract the Filename from a Windows Path
  154. 8.24. Extract the File Extension from a Windows Path
  155. 8.25. Strip Invalid Characters from Filenames
  156. 9. Markup and Data Formats
  157. Processing Markup and Data Formats with Regular Expressions
  158. 9.1. Find XML-Style Tags
  159. 9.2. Replace Tags with
  160. 9.3. Remove All XML-Style Tags Except and
  161. 9.4. Match XML Names
  162. 9.5. Convert Plain Text to HTML by Adding and Tags
  163. 9.6. Decode XML Entities
  164. 9.7. Find a Specific Attribute in XML-Style Tags
  165. 9.8. Add a cellspacing Attribute to Tags That Do Not Already Include It
  166. 9.9. Remove XML-Style Comments
  167. 9.10. Find Words Within XML-Style Comments
  168. 9.11. Change the Delimiter Used in CSV Files
  169. 9.12. Extract CSV Fields from a Specific Column
  170. 9.13. Match INI Section Headers
  171. 9.14. Match INI Section Blocks
  172. 9.15. Match INI Name-Value Pairs
  173. Index
  174. About the Authors
  175. Colophon
  176. Copyright
  3.5. Test If a Match Can Be Found Within a Subject String

    Problem

    You want to check whether a match can be found for a particular regular expression in a particular string. A partial match is sufficient. For instance, the regex “regex pattern” partially matches the string “The regex pattern can be found”. You don’t care about any of the details of the match. You just want to know whether the regex matches the string.

    Solution

    C#

    For quick one-off tests, you can use the static call:

    bool foundMatch = Regex.IsMatch(subjectString, "regex pattern");

    If the regex is provided by the end user, you should use the static call with full exception handling:

    bool foundMatch = false;
    try {
        foundMatch = Regex.IsMatch(subjectString, UserInput);
    } catch (ArgumentNullException ex) {
        // Cannot pass null as the regular expression or subject string
    } catch (ArgumentException ex) {
        // Syntax error in the regular expression
    }

    To use the same regex repeatedly, construct a Regex object:

    Regex regexObj = new Regex("regex pattern");
    bool foundMatch = regexObj.IsMatch(subjectString);

    If the regex is provided by the end user, you should use the Regex object with full exception handling:

    bool foundMatch = false;
    try {
        Regex regexObj = new Regex(UserInput);
        try {
            foundMatch = regexObj.IsMatch(subjectString);
        } catch (ArgumentNullException ex) {
            // Cannot pass null as the regular expression or subject string
        }
    } catch (ArgumentException ex) {
        // Syntax error in the regular expression
    }

    VB.NET

    For quick one-off tests, you can use the static call:

    Dim FoundMatch = Regex.IsMatch(SubjectString, "regex pattern")

    If the regex is provided by the end user, you should use the static call with full exception handling:

    Dim FoundMatch As Boolean
    Try
        FoundMatch = Regex.IsMatch(SubjectString, UserInput)
    Catch ex As ArgumentNullException
        'Cannot pass Nothing as the regular expression or subject string
    Catch ex As ArgumentException
        'Syntax error in the regular expression
    End Try

    To use the same regex repeatedly, construct a Regex object:

    Dim RegexObj As New Regex("regex pattern")
    Dim FoundMatch = RegexObj.IsMatch(SubjectString)

    If the regex is provided by the end user, you should use the Regex object with full exception handling:

    Dim FoundMatch As Boolean
    Try
        Dim RegexObj As New Regex(UserInput)
        Try
            FoundMatch = RegexObj.IsMatch(SubjectString)
        Catch ex As ArgumentNullException
            'Cannot pass Nothing as the regular expression or subject string
        End Try
    Catch ex As ArgumentException
        'Syntax error in the regular expression
    End Try

    Java

    The only way to test for a partial match is to create a Matcher:

    Pattern regex = Pattern.compile("regex pattern");
    Matcher regexMatcher = regex.matcher(subjectString);
    boolean foundMatch = regexMatcher.find();

    If the regex is provided by the end user, you should use exception handling:

    boolean foundMatch = false;
    try {
        Pattern regex = Pattern.compile(UserInput);
        Matcher regexMatcher = regex.matcher(subjectString);
        foundMatch = regexMatcher.find();
    } catch (PatternSyntaxException ex) {
        // Syntax error in the regular expression
    }

    JavaScript

    if (/regex pattern/.test(subject)) {
        // Successful match
    } else {
        // Match attempt failed
    }

    PHP

    if (preg_match('/regex pattern/', $subject)) {
        # Successful match
    } else {
        # Match attempt failed
    }

    Perl

    With the subject string held in the special variable $_:

    if (m/regex pattern/) {
        # Successful match
    } else {
        # Match attempt failed
    }

    With the subject string held in the variable $subject:

    if ($subject =~ m/regex pattern/) {
        # Successful match
    } else {
        # Match attempt failed
    }

    Using a precompiled regular expression:

    $regex = qr/regex pattern/;
    if ($subject =~ $regex) {
        # Successful match
    } else {
        # Match attempt failed
    }

    Python

    For quick one-off tests, you can use the global function:

    if re.search("regex pattern", subject):
        # Successful match
        pass
    else:
        # Match attempt failed
        pass

    To use the same regex repeatedly, use a compiled object:

    reobj = re.compile("regex pattern")
    if reobj.search(subject):
        # Successful match
        pass
    else:
        # Match attempt failed
        pass
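
    The C#, VB.NET, and Java solutions show exception handling for a regex supplied by the end user. A comparable Python sketch might look like the following. The helper name contains_match is hypothetical, and treating an invalid pattern as "no match" is an assumption made for brevity; real code may prefer to report the syntax error to the user instead.

```python
import re

def contains_match(user_pattern, subject):
    """Return True if user_pattern matches anywhere in subject.

    Hypothetical helper: returns False both when the pattern has a
    syntax error and when it simply does not match.
    """
    try:
        regex = re.compile(user_pattern)
    except re.error:
        # Syntax error in the regular expression
        return False
    return regex.search(subject) is not None

print(contains_match("regex pattern", "The regex pattern can be found"))  # True
print(contains_match("(unclosed", "any subject"))                         # False
```

    Note that this sketch does not guard against a pattern or subject of None, which raises TypeError rather than re.error.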

    Ruby

    if subject =~ /regex pattern/
        # Successful match
    else
        # Match attempt failed
    end

    This code does exactly the same thing:

    if /regex pattern/ =~ subject
        # Successful match
    else
        # Match attempt failed
    end

    Discussion

    The most basic task for a regular expression is to check whether a string matches the regex. In most programming languages, a partial match is sufficient for the match function to return true. The match function will scan through the entire subject string to see whether the regular expression matches any part of it. The function returns true as soon as a match is found. It returns false only when it reaches the end of the string without finding any matches.

    The code examples in this recipe are useful for checking whether a string contains certain data. If you want to check whether a string fits a certain pattern in its entirety (e.g., for input validation), use the next recipe instead.
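
    To illustrate the difference between "contains a match" and "matches in its entirety", here is a small Python sketch. It assumes Python 3.4 or later, where re.fullmatch() is available; that function is used here only for contrast and is not necessarily what the next recipe uses.

```python
import re

subject = "The regex pattern can be found"

# search() succeeds if the regex matches anywhere in the subject.
found_anywhere = re.search("regex pattern", subject) is not None

# fullmatch() succeeds only if the regex matches the subject from start to end.
matches_entirely = re.fullmatch("regex pattern", subject) is not None

print(found_anywhere)    # True
print(matches_entirely)  # False
```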

    C# and VB.NET

    The Regex class provides four overloaded versions of the IsMatch() method, two of which are static. This makes it possible to call IsMatch() with different parameters. The subject string is always the first parameter. This is the string in which the regular expression will try to find a match. The first parameter must not be null. Otherwise, IsMatch() will throw an ArgumentNullException.

    You can perform the test in a single line of code by calling Regex.IsMatch() without constructing a Regex object. Simply pass the regular expression as the second parameter and pass regex options as an optional third parameter. If your regular expression has a syntax error, an ArgumentException will be thrown by IsMatch(). If your regex is valid, the call will return true if a partial match was found, or false if no match could be found at all.

    If you want to use the same regular expression on many strings, you can make your code more efficient by constructing a Regex object first, and calling IsMatch() on that object. The first parameter, which holds the subject string, is then the only required parameter. You can specify an optional second parameter to indicate the character index at which the regular expression should begin the check. Essentially, the number you pass as the second parameter is the number of characters at the start of your subject string that the regular expression should ignore. This can be useful when you’ve already processed the string up to a point, and you want to check whether the remainder should be processed further. If you specify a number, it must be greater than or equal to zero and less than or equal to the length of the subject string. Otherwise, IsMatch() throws an ArgumentOutOfRangeException.

    Java

    To test whether a regex matches a string partially or entirely, instantiate a Matcher object as explained in Recipe 3.3. Then call the find() method on your newly created or newly reset matcher.

    Do not call String.matches(), Pattern.matches(), or Matcher.matches(). Those all require the regex to match the whole string.

    JavaScript

    To test whether a regular expression can match part of a string, call the test() method on your regular expression. Pass the subject string as the only parameter.

    regexp.test() returns true if the regular expression matches part or all of the subject string, and false if it does not.

    PHP

    The preg_match() function can be used for a variety of purposes. The most basic way to call it is with only the two required parameters: the string with your regular expression, and the string with the subject text you want the regex to search through. preg_match() returns 1 if a match can be found and 0 when the regex cannot match the subject at all.

    Later recipes in this chapter explain the optional parameters you can pass to preg_match().

    Perl

    In Perl, m// is in fact a regular expression operator, not a mere regular expression container. If you use m// by itself, it uses the $_ variable as the subject string.

    If you want to use the matching operator on the contents of another variable, use the =~ binding operator to associate the regex operator with your variable. Binding the regex to a string immediately executes the regex. The pattern-matching operator returns true if the regex matches part of the subject string, and false if it doesn’t match at all.

    If you want to check whether a regular expression does not match a string, you can use !~, which is the negated version of =~.

    Python

    The search() function in the re module searches through a string to find whether the regular expression matches part of it. Pass your regular expression as the first parameter and the subject string as the second parameter. You can pass the regular expression options in the optional third parameter.
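
    As a brief sketch of the optional third parameter, the example below passes re.IGNORECASE so that the same regex matches regardless of case. The subject string is made up for illustration.

```python
import re

subject = "The REGEX PATTERN can be found"

# Without options, matching is case sensitive.
case_sensitive = re.search("regex pattern", subject)

# re.IGNORECASE is passed as the optional third parameter (flags).
case_insensitive = re.search("regex pattern", subject, re.IGNORECASE)

print(case_sensitive is not None)    # False
print(case_insensitive is not None)  # True
```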

    The re.search() function calls re.compile(), and then calls the search() method on the compiled regular expression object. That method requires just one parameter: the subject string.
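
    Besides the subject string, the search() method of a compiled object also accepts an optional starting position, which plays a role similar to the start index described for .NET above. The following sketch, with a made-up subject string, uses it to look for a second match after the first one.

```python
import re

subject = "regex pattern, then another regex pattern"
reobj = re.compile("regex pattern")

# Same result as re.search("regex pattern", subject)
first = reobj.search(subject)

# The optional second argument (pos) is the index at which the search starts.
second = reobj.search(subject, first.end())
nothing = reobj.search(subject, second.end())

print(first.start())   # 0
print(second.start())  # 28
print(nothing)         # None
```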

    If the regular expression finds a match, search() returns a MatchObject instance. If the regex fails to match, search() returns None. When you evaluate the returned value in an if statement, the MatchObject evaluates to True, whereas None evaluates to False. Later recipes in this chapter show how to use the information stored by MatchObject.
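
    The truth-value behavior described above can be checked directly. This is only a small sketch with made-up subject strings:

```python
import re

hit = re.search("regex pattern", "The regex pattern can be found")
miss = re.search("regex pattern", "No such text here")

print(bool(hit))     # True: a match object is always truthy
print(miss is None)  # True: a failed search returns None
print(hit.group())   # regex pattern
```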

    Tip

    Don’t confuse search() with match(). You cannot use match() to find a match in the middle of a string. The next recipe uses match().
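
    A quick sketch of the difference, using the subject string from the Problem statement: match() only attempts a match at the very start of the subject, whereas search() scans the whole string.

```python
import re

subject = "The regex pattern can be found"

# match() fails because the subject does not begin with "regex pattern".
via_match = re.match("regex pattern", subject)

# search() finds the same text in the middle of the subject.
via_search = re.search("regex pattern", subject)

# match() succeeds when the regex matches at the start of the subject.
at_start = re.match("The regex", subject)

print(via_match)               # None
print(via_search is not None)  # True
print(at_start is not None)    # True
```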

    Ruby

    The =~ operator is the pattern-matching operator. Place it between a regular expression and a string to find the first regular expression match. The operator returns an integer with the position at which the regex match begins in the string. It returns nil if no match can be found.

    This operator is implemented in both the Regexp and String classes. In Ruby 1.8, it doesn’t matter which class you place to the left and which to the right of the operator. In Ruby 1.9, placing a literal regular expression to the left of the operator has a special side effect involving named capturing groups. Recipe 3.9 explains this.

    Tip

    In all the other Ruby code snippets in this book, we place the subject string to the left of the =~ operator and the regular expression to the right. This maintains consistency with Perl, from which Ruby borrowed the =~ syntax, and avoids the Ruby 1.9 magic with named capturing groups that people might not expect.

    See Also

    Recipe 3.6 shows code to test whether a regex matches a subject string entirely.

    Recipe 3.7 shows code to get the text that was actually matched by the regex.