Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Regular Expressions Cookbook, 2nd Edition, by Steven Levithan. Published by O'Reilly Media, Inc., 2012.
  1. Cover
  2. Regular Expressions Cookbook
  3. Preface
  4. Caught in the Snarls of Different Versions
  5. Intended Audience
  6. Technology Covered
  7. Organization of This Book
  8. Conventions Used in This Book
  9. Using Code Examples
  10. Safari® Books Online
  11. How to Contact Us
  12. Acknowledgments
  13. 1. Introduction to Regular Expressions
  14. Regular Expressions Defined
  15. Search and Replace with Regular Expressions
  16. Tools for Working with Regular Expressions
  17. 2. Basic Regular Expression Skills
  18. 2.1. Match Literal Text
  19. 2.2. Match Nonprintable Characters
  20. 2.3. Match One of Many Characters
  21. 2.4. Match Any Character
  22. 2.5. Match Something at the Start and/or the End of a Line
  23. 2.6. Match Whole Words
  24. 2.7. Unicode Code Points, Categories, Blocks, and Scripts
  25. 2.8. Match One of Several Alternatives
  26. 2.9. Group and Capture Parts of the Match
  27. 2.10. Match Previously Matched Text Again
  28. 2.11. Capture and Name Parts of the Match
  29. 2.12. Repeat Part of the Regex a Certain Number of Times
  30. 2.13. Choose Minimal or Maximal Repetition
  31. 2.14. Eliminate Needless Backtracking
  32. 2.15. Prevent Runaway Repetition
  33. 2.16. Test for a Match Without Adding It to the Overall Match
  34. 2.17. Match One of Two Alternatives Based on a Condition
  35. 2.18. Add Comments to a Regular Expression
  36. 2.19. Insert Literal Text into the Replacement Text
  37. 2.20. Insert the Regex Match into the Replacement Text
  38. 2.21. Insert Part of the Regex Match into the Replacement Text
  39. 2.22. Insert Match Context into the Replacement Text
  40. 3. Programming with Regular Expressions
  41. Programming Languages and Regex Flavors
  42. 3.1. Literal Regular Expressions in Source Code
  43. 3.2. Import the Regular Expression Library
  44. 3.3. Create Regular Expression Objects
  45. 3.4. Set Regular Expression Options
  46. 3.5. Test If a Match Can Be Found Within a Subject String
  47. 3.6. Test Whether a Regex Matches the Subject String Entirely
  48. 3.7. Retrieve the Matched Text
  49. 3.8. Determine the Position and Length of the Match
  50. 3.9. Retrieve Part of the Matched Text
  51. 3.10. Retrieve a List of All Matches
  52. 3.11. Iterate over All Matches
  53. 3.12. Validate Matches in Procedural Code
  54. 3.13. Find a Match Within Another Match
  55. 3.14. Replace All Matches
  56. 3.15. Replace Matches Reusing Parts of the Match
  57. 3.16. Replace Matches with Replacements Generated in Code
  58. 3.17. Replace All Matches Within the Matches of Another Regex
  59. 3.18. Replace All Matches Between the Matches of Another Regex
  60. 3.19. Split a String
  61. 3.20. Split a String, Keeping the Regex Matches
  62. 3.21. Search Line by Line
  63. Construct a Parser
  64. 4. Validation and Formatting
  65. 4.1. Validate Email Addresses
  66. 4.2. Validate and Format North American Phone Numbers
  67. 4.3. Validate International Phone Numbers
  68. 4.4. Validate Traditional Date Formats
  69. 4.5. Validate Traditional Date Formats, Excluding Invalid Dates
  70. 4.6. Validate Traditional Time Formats
  71. 4.7. Validate ISO 8601 Dates and Times
  72. 4.8. Limit Input to Alphanumeric Characters
  73. 4.9. Limit the Length of Text
  74. 4.10. Limit the Number of Lines in Text
  75. 4.11. Validate Affirmative Responses
  76. 4.12. Validate Social Security Numbers
  77. 4.13. Validate ISBNs
  78. 4.14. Validate ZIP Codes
  79. 4.15. Validate Canadian Postal Codes
  80. 4.16. Validate U.K. Postcodes
  81. 4.17. Find Addresses with Post Office Boxes
  82. 4.18. Reformat Names From “FirstName LastName” to “LastName, FirstName”
  83. 4.19. Validate Password Complexity
  84. 4.20. Validate Credit Card Numbers
  85. 4.21. European VAT Numbers
  86. 5. Words, Lines, and Special Characters
  87. 5.1. Find a Specific Word
  88. 5.2. Find Any of Multiple Words
  89. 5.3. Find Similar Words
  90. 5.4. Find All Except a Specific Word
  91. 5.5. Find Any Word Not Followed by a Specific Word
  92. 5.6. Find Any Word Not Preceded by a Specific Word
  93. 5.7. Find Words Near Each Other
  94. 5.8. Find Repeated Words
  95. 5.9. Remove Duplicate Lines
  96. 5.10. Match Complete Lines That Contain a Word
  97. 5.11. Match Complete Lines That Do Not Contain a Word
  98. 5.12. Trim Leading and Trailing Whitespace
  99. 5.13. Replace Repeated Whitespace with a Single Space
  100. 5.14. Escape Regular Expression Metacharacters
  101. 6. Numbers
  102. 6.1. Integer Numbers
  103. 6.2. Hexadecimal Numbers
  104. 6.3. Binary Numbers
  105. 6.4. Octal Numbers
  106. 6.5. Decimal Numbers
  107. 6.6. Strip Leading Zeros
  108. 6.7. Numbers Within a Certain Range
  109. 6.8. Hexadecimal Numbers Within a Certain Range
  110. 6.9. Integer Numbers with Separators
  111. 6.10. Floating-Point Numbers
  112. 6.11. Numbers with Thousand Separators
  113. 6.12. Add Thousand Separators to Numbers
  114. 6.13. Roman Numerals
  115. 7. Source Code and Log Files
  116. Keywords
  117. Identifiers
  118. Numeric Constants
  119. Operators
  120. Single-Line Comments
  121. Multiline Comments
  122. All Comments
  123. Strings
  124. Strings with Escapes
  125. Regex Literals
  126. Here Documents
  127. Common Log Format
  128. Combined Log Format
  129. Broken Links Reported in Web Logs
  130. 8. URLs, Paths, and Internet Addresses
  131. 8.1. Validating URLs
  132. 8.2. Finding URLs Within Full Text
  133. 8.3. Finding Quoted URLs in Full Text
  134. 8.4. Finding URLs with Parentheses in Full Text
  135. 8.5. Turn URLs into Links
  136. 8.6. Validating URNs
  137. 8.7. Validating Generic URLs
  138. 8.8. Extracting the Scheme from a URL
  139. 8.9. Extracting the User from a URL
  140. 8.10. Extracting the Host from a URL
  141. 8.11. Extracting the Port from a URL
  142. 8.12. Extracting the Path from a URL
  143. 8.13. Extracting the Query from a URL
  144. 8.14. Extracting the Fragment from a URL
  145. 8.15. Validating Domain Names
  146. 8.16. Matching IPv4 Addresses
  147. 8.17. Matching IPv6 Addresses
  148. 8.18. Validate Windows Paths
  149. 8.19. Split Windows Paths into Their Parts
  150. 8.20. Extract the Drive Letter from a Windows Path
  151. 8.21. Extract the Server and Share from a UNC Path
  152. 8.22. Extract the Folder from a Windows Path
  153. 8.23. Extract the Filename from a Windows Path
  154. 8.24. Extract the File Extension from a Windows Path
  155. 8.25. Strip Invalid Characters from Filenames
  156. 9. Markup and Data Formats
  157. Processing Markup and Data Formats with Regular Expressions
  158. 9.1. Find XML-Style Tags
  159. 9.2. Replace <b> Tags with <strong>
  160. 9.3. Remove All XML-Style Tags Except <em> and <strong>
  161. 9.4. Match XML Names
  162. 9.5. Convert Plain Text to HTML by Adding <p> and <br> Tags
  163. 9.6. Decode XML Entities
  164. 9.7. Find a Specific Attribute in XML-Style Tags
  165. 9.8. Add a cellspacing Attribute to <table> Tags That Do Not Already Include It
  166. 9.9. Remove XML-Style Comments
  167. 9.10. Find Words Within XML-Style Comments
  168. 9.11. Change the Delimiter Used in CSV Files
  169. 9.12. Extract CSV Fields from a Specific Column
  170. 9.13. Match INI Section Headers
  171. 9.14. Match INI Section Blocks
  172. 9.15. Match INI Name-Value Pairs
  173. Index
  174. About the Authors
  175. Colophon
  176. Copyright

3.4. Set Regular Expression Options

    Problem

    You want to compile a regular expression with all of the available matching modes: free-spacing, case insensitive, dot matches line breaks, and “^ and $ match at line breaks.”

    Solution

    C#

    Regex regexObj = new Regex("regex pattern",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase |
        RegexOptions.Singleline | RegexOptions.Multiline);

    VB.NET

    Dim RegexObj As New Regex("regex pattern",
        RegexOptions.IgnorePatternWhitespace Or RegexOptions.IgnoreCase Or
        RegexOptions.Singleline Or RegexOptions.Multiline)

    Java

    Pattern regex = Pattern.compile("regex pattern",
        Pattern.COMMENTS | Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE |
        Pattern.DOTALL | Pattern.MULTILINE);

    JavaScript

    Literal regular expression in your code:

    var myregexp = /regex pattern/im;

    Regular expression retrieved from user input, as a string:

    var myregexp = new RegExp(userinput, "im");

    XRegExp

    var myregexp = XRegExp("regex pattern", "xism");

    PHP

    $regexstring = '/regex pattern/xism';

    Perl

    m/regex pattern/xism;

    Python

    reobj = re.compile("regex pattern",
        re.VERBOSE | re.IGNORECASE |
        re.DOTALL | re.MULTILINE)

    Ruby

    Literal regular expression in your code:

    myregexp = /regex pattern/xim;

    Regular expression retrieved from user input, as a string:

    myregexp = Regexp.new(userinput,
        Regexp::EXTENDED | Regexp::IGNORECASE |
        Regexp::MULTILINE);

    Discussion

    Many of the regular expressions in this book, and those that you find elsewhere, are written to be used with certain regex matching modes. There are four basic modes that nearly all modern regex flavors support. Unfortunately, some flavors use inconsistent and confusing names for the options that implement the modes. Using the wrong modes usually breaks the regular expression.

    All the solutions in this recipe use flags or options provided by the programming language or regular expression class to set the modes. Another way to set modes is to use mode modifiers within the regular expression. Mode modifiers within the regex always override options or flags set outside the regular expression.
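As a hedged illustration in Python (assuming Python 3.6 or later, where scoped modifiers such as (?-i:...) are available), the sketch below sets case insensitivity outside the regex and then switches it off for part of the pattern with a mode modifier. The pattern and subject strings are invented for the example.

```python
import re

# Case insensitivity is requested outside the regex, via a flag.
# The scoped modifier (?-i:...) switches it off again for "abc" only.
regex = re.compile(r"(?-i:abc)def", re.IGNORECASE)

# "def" is still matched case insensitively; "abc" is not.
matches_lower_abc = regex.fullmatch("abcDEF") is not None
matches_upper_abc = regex.fullmatch("ABCdef") is not None

# A modifier at the very start of the pattern turns a mode on for the
# whole regex, with no flag passed to compile().
inline_only = re.compile(r"(?i)abcdef")
matches_inline = inline_only.fullmatch("AbCdEf") is not None
```

Here the modifier inside the regex wins over the flag passed to compile() for the part of the pattern it covers.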

    .NET

    The Regex() constructor takes an optional second parameter with regular expression options. You can find the available options in the RegexOptions enumeration.

    Free-spacing: RegexOptions.IgnorePatternWhitespace
    Case insensitive: RegexOptions.IgnoreCase
    Dot matches line breaks: RegexOptions.Singleline
    ^ and $ match at line breaks: RegexOptions.Multiline

Java

The Pattern.compile() class factory takes an optional second parameter with regular expression options. The Pattern class defines several constants that set the various options. You can set multiple options by combining them with the bitwise inclusive or operator |.

Free-spacing: Pattern.COMMENTS
Case insensitive: Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE
Dot matches line breaks: Pattern.DOTALL
^ and $ match at line breaks: Pattern.MULTILINE

There are indeed two options for case insensitivity, and you have to set both for full case insensitivity. If you set only Pattern.CASE_INSENSITIVE, only the English letters A to Z are matched case insensitively. If you set both options, all characters from all scripts are matched case insensitively. The only reason not to use Pattern.UNICODE_CASE is performance, in case you know in advance you’ll be dealing with ASCII text only. When using mode modifiers inside your regular expression, use (?i) for ASCII-only case insensitivity and (?iu) for full case insensitivity.

JavaScript

In JavaScript, you can specify options by appending one or more single-letter flags to the RegExp literal, after the forward slash that terminates the regular expression. Documentation usually writes these flags as /i and /m, even though each flag is only one letter. No additional slashes are added to specify regex mode flags.

When using the RegExp() constructor to compile a string into a regular expression, you can pass an optional second parameter with flags to the constructor. The second parameter should be a string with the letters of the options you want to set. Do not put any slashes into the string.

Free-spacing: Not supported by JavaScript.
Case insensitive: /i
Dot matches line breaks: Not supported by JavaScript.
^ and $ match at line breaks: /m

XRegExp

XRegExp extends JavaScript’s regular expression syntax, adding support for the “free-spacing” and “dot matches line breaks” modes with the letters “x” and “s” commonly used by other regular expression flavors. Pass these letters in the string with the flags in the second parameter to the XRegExp() constructor.

Free-spacing: "x"
Case insensitive: "i"
Dot matches line breaks: "s"
^ and $ match at line breaks: "m"

PHP

Recipe 3.1 explains that the PHP preg functions require literal regular expressions to be delimited with two punctuation characters, usually forward slashes, and the whole lot formatted as a string literal. You can specify regular expression options by appending one or more single-letter modifiers to the end of the string. That is, the modifier letters come after the closing regex delimiter, but still inside the string’s single or double quotes. Documentation usually writes these modifiers with a leading slash, as in /x, even though each modifier is only one letter, and even though the delimiter between the regex and the modifiers doesn’t have to be a forward slash.

Free-spacing: /x
Case insensitive: /i
Dot matches line breaks: /s
^ and $ match at line breaks: /m

Perl

You can specify regular expression options by appending one or more single-letter modifiers to the end of the pattern-matching or substitution operator. Documentation usually writes these modifiers with a leading slash, as in /x, even though each modifier is only one letter, and even though the delimiter between the regex and the modifiers doesn’t have to be a forward slash.

Free-spacing: /x
Case insensitive: /i
Dot matches line breaks: /s
^ and $ match at line breaks: /m

Python

The compile() function (explained in the previous recipe) takes an optional second parameter with regular expression options. You can build up this parameter by using the | operator to combine the constants defined in the re module. Many of the other functions in the re module that take a literal regular expression as a parameter also accept regular expression options as a final and optional parameter.

The constants for the regular expression options come in pairs. Each option can be represented either as a constant with a full name or as just a single letter. Their functionality is equivalent. The only difference is that the full name makes your code easier to read by developers who aren’t familiar with the alphabet soup of regular expression options. The basic single-letter options listed in this section are the same as in Perl.

Free-spacing: re.VERBOSE or re.X
Case insensitive: re.IGNORECASE or re.I
Dot matches line breaks: re.DOTALL or re.S
^ and $ match at line breaks: re.MULTILINE or re.M
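
The following is a minimal sketch of the four modes working together in Python. The pattern and the subject string are invented for illustration; only the re constants come from the list above.

```python
import re

# Free-spacing mode: whitespace and # comments in the pattern are ignored.
regex = re.compile(r"""
    ^ start   # 'start' at the beginning of any line (MULTILINE)
    .         # any character, line breaks included (DOTALL)
    end $     # 'end' at the end of any line (MULTILINE)
    """, re.VERBOSE | re.IGNORECASE | re.DOTALL | re.MULTILINE)

subject = "intro\nSTART\nEND\noutro"
match = regex.search(subject)
matched_text = match.group() if match else None

# The single-letter constants are aliases for the full names.
same_flags = (re.X | re.I | re.S | re.M) == (
    re.VERBOSE | re.IGNORECASE | re.DOTALL | re.MULTILINE)
```

With all four options set, the match spans the two middle lines of the subject, uppercase letters included. Dropping any one option makes the search fail on this subject.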

Ruby

In Ruby, you can specify options by appending one or more single-letter flags to the Regexp literal, after the forward slash that terminates the regular expression. Documentation usually writes these flags as /i and /m, even though each flag is only one letter. No additional slashes are added to specify regex mode flags.

When using the Regexp.new() factory to compile a string into a regular expression, you can pass an optional second parameter with flags to the constructor. The second parameter should be either nil to turn off all options, or a combination of constants from the Regexp class combined with the bitwise or operator |.

Free-spacing: /x or Regexp::EXTENDED
Case insensitive: /i or Regexp::IGNORECASE
Dot matches line breaks: /m or Regexp::MULTILINE. Ruby indeed uses “m” and “multiline” here, whereas all the other flavors use “s” or “singleline” for “dot matches line breaks.”
^ and $ match at line breaks: The caret and dollar always match at line breaks in Ruby. You cannot turn this off. Use \A and \Z to match at the start or end of the subject string.

Additional Language-Specific Options

.NET

RegexOptions.ExplicitCapture makes all groups, except named groups, noncapturing. With this option, () is the same as (?:). If you always name your capturing groups, turn on this option to make your regular expression more efficient without the need to use the (?:) syntax. Instead of using RegexOptions.ExplicitCapture, you can turn on this option by putting (?n) at the start of your regular expression. See Recipe 2.9 to learn about grouping. Recipe 2.11 explains named groups.

Specify RegexOptions.ECMAScript if you’re using the same regular expression in your .NET code and in JavaScript code, and you want to make sure it behaves in the same way. This is particularly useful when you’re developing the client side of a web application in JavaScript and the server side in ASP.NET. The most important effect is that with this option, \w and \d are restricted to ASCII characters, as they are in JavaScript.

Java

An option unique to Java is Pattern.CANON_EQ, which enables “canonical equivalence.” As explained in the “Unicode grapheme” discussion in Recipe 2.7, Unicode provides different ways to represent characters with diacritics. When you turn on this option, your regex will match a character, even if it is encoded differently in the subject string. For instance, the regex \u00E0 will match both "\u00E0" and "\u0061\u0300", because they are canonically equivalent. They both appear as “à” when displayed on screen, indistinguishable to the end user. Without canonical equivalence, the regex \u00E0 does not match the string "\u0061\u0300". This is how all other regex flavors discussed in this book behave.
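
Python's re module is one of the flavors without canonical equivalence, so it can illustrate the mismatch described above. The workaround shown, normalizing the subject with unicodedata.normalize() before matching, is an assumption of this sketch rather than something the recipe prescribes.

```python
import re
import unicodedata

precomposed = "\u00E0"       # a with grave as a single code point
decomposed = "\u0061\u0300"  # letter a followed by a combining grave accent

# Without canonical equivalence, the two encodings do not match each other.
plain_match = re.search(precomposed, decomposed)

# Normalizing the subject to NFC composes the pair into U+00E0 first.
normalized = unicodedata.normalize("NFC", decomposed)
normalized_match = re.search(precomposed, normalized)
```

Normalizing both the pattern and the subject to the same form gives a similar effect for simple literal text, though it is not a full substitute for Pattern.CANON_EQ.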

In Java 7, you can set Pattern.UNICODE_CHARACTER_CLASS to make shorthand character classes match Unicode characters rather than just ASCII characters. See Shorthands in Recipe 2.3 for details.

Finally, Pattern.UNIX_LINES tells Java to treat only \n as a line break character for the dot, caret, and dollar. By default, all Unicode line breaks are treated as line break characters.

JavaScript

If you want to apply a regular expression repeatedly to the same string (e.g., to iterate over all matches or to search and replace all matches instead of just the first), specify the /g or “global” flag.

XRegExp

XRegExp needs the “g” flag if you want to apply a regular expression repeatedly to the same string just as standard JavaScript does. XRegExp also adds the “n” flag which makes all groups, except named groups, noncapturing. With this option, () is the same as (?:). If you always name your capturing groups, turn on this option to make your regular expression more efficient without the need to use the (?:) syntax. See Recipe 2.9 to learn about grouping. Recipe 2.11 explains named groups.

PHP

/u tells PCRE to interpret both the regular expression and the subject string as UTF-8 strings. This modifier also enables Unicode regex tokens such as \x{FFFF} and \p{L}. These are explained in Recipe 2.7. Without this modifier, PCRE treats each byte as a separate character, and Unicode regex tokens cause an error.

/U flips the “greedy” and “lazy” behavior of adding an extra question mark to a quantifier. Normally, .* is greedy and .*? is lazy. With /U, .* is lazy and .*? is greedy. We strongly recommend that you never use this flag, as it will confuse programmers who read your code later and miss the extra /U modifier, which is unique to PHP. Also, don’t confuse /U with /u if you encounter it in somebody else’s code. Regex modifiers are case sensitive.
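
For readers unsure what is being flipped, the sketch below shows the normal greedy and lazy behavior in Python, which has no counterpart to /U. The subject string is invented for the example.

```python
import re

subject = "<a><b>"

# Greedy: .* takes as much as it can, so the match runs to the last ">".
greedy = re.search(r"<.*>", subject).group()

# Lazy: .*? takes as little as it can, so the match stops at the first ">".
lazy = re.search(r"<.*?>", subject).group()
```

Under PHP's /U modifier, the same two patterns would swap results.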

Perl

If you want to apply a regular expression repeatedly to the same string (e.g., to iterate over all matches or to search-and-replace all matches instead of just the first one), specify the /g (“global”) flag.

If you interpolate a variable in a regex as in m/I am $name/ then Perl will recompile the regular expression each time it needs to be used, because the contents of $name may have changed. You can suppress this with the /o modifier. m/I am $name/o is compiled the first time Perl needs to use it, and then reused the way it is after that. If the contents of $name change, the regex will not reflect the change. See Recipe 3.3 if you want to control when the regex is recompiled.

If your regex uses shorthand character classes or word boundaries, you can specify one of the /d, /u, /a, or /l flags to control whether the shorthands and word boundaries will match only ASCII characters, or whether they use Unicode or the current locale. The “Variations” sections in Recipes 2.3 and 2.6 have more details on what these flags do in Perl.

Python

Python has two extra options that change the meaning of word boundaries (see Recipe 2.6) and the shorthand character classes \w, \d, and \s, as well as their negated counterparts (see Recipe 2.3). By default, these tokens deal only with ASCII letters, digits, and whitespace.

The re.LOCALE or re.L option makes these tokens dependent on the current locale. The locale then determines which characters are treated as letters, digits, and whitespace by these regex tokens. You should specify this option when the subject string is not a Unicode string and you want characters such as letters with diacritics to be treated as such.

The re.UNICODE or re.U option makes these tokens dependent on the Unicode standard. All characters defined by Unicode as letters, digits, and whitespace are then treated as such by these regex tokens. You should specify this option when the subject string you’re applying the regular expression to is a Unicode string.
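
The ASCII default described above applies to Python 2 and to bytes patterns in Python 3. The sketch below assumes Python 3 with str patterns, where Unicode matching is already the default and the re.ASCII option (not covered in this recipe) restores ASCII-only behavior. The subject string is invented for the example.

```python
import re

subject = "na\u00EFve caf\u00E9"  # "naïve café" with precomposed letters

# In Python 3, a str pattern uses Unicode rules for \w by default,
# so re.UNICODE is accepted but redundant here.
unicode_words = re.findall(r"\w+", subject)
explicit_unicode_words = re.findall(r"\w+", subject, re.UNICODE)

# re.ASCII restricts \w to [a-zA-Z0-9_], splitting at the accented letters.
ascii_words = re.findall(r"\w+", subject, re.ASCII)
```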

Ruby

The Regexp.new() factory takes an optional third parameter to select the string encoding your regular expression supports. If you do not specify an encoding for your regular expression, it will use the same encoding as your source file. Most of the time, using the source file’s encoding is the right thing to do.

To select an encoding explicitly, pass a single character for this parameter. The parameter is case-insensitive. Possible values are:

n

This stands for “None.” Each byte in your string is treated as one character. Use this for ASCII text.

e

Enables the “EUC” encoding for Far East languages.

s

Enables the Japanese “Shift-JIS” encoding.

u

Enables UTF-8, which uses one to four bytes per character and supports all languages in the Unicode standard (which includes all living languages of any significance).

When using a literal regular expression, you can set the encoding with the modifiers /n, /e, /s, and /u. Only one of these modifiers can be used for a single regular expression. They can be used in combination with any or all of the /x, /i, and /m modifiers.

Caution

Do not mistake Ruby’s /s for that of Perl, Java, or .NET. In Ruby, /s forces the Shift-JIS encoding. In Perl and most other regex flavors, it turns on “dot matches line breaks” mode. In Ruby, you can do that with /m.

See Also

The effects of the matching modes are explained in detail in Chapter 2. Those sections also explain the use of mode modifiers within the regular expression.

Free-spacing: Recipe 2.18
Case insensitive: Case-insensitive matching in Recipe 2.1
Dot matches line breaks: Recipe 2.4
^ and $ match at line breaks: Recipe 2.5

Recipes 3.1 and 3.3 explain how to use literal regular expressions in your source code and how to create regular expression objects. You set the regular expression options while creating a regular expression.