Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Version ebook / Retour

Cover image for bash Cookbook, 2nd Edition Regular Expressions Cookbook, 2nd Edition by Steven Levithan Published by O'Reilly Media, Inc., 2012
  1. Cover
  2. Regular Expressions Cookbook
  3. Preface
  4. Caught in the Snarls of Different Versions
  5. Intended Audience
  6. Technology Covered
  7. Organization of This Book
  8. Conventions Used in This Book
  9. Using Code Examples
  10. Safari® Books Online
  11. How to Contact Us
  12. Acknowledgments
  13. 1. Introduction to Regular Expressions
  14. Regular Expressions Defined
  15. Search and Replace with Regular Expressions
  16. Tools for Working with Regular Expressions
  17. 2. Basic Regular Expression Skills
  18. 2.1. Match Literal Text
  19. 2.2. Match Nonprintable Characters
  20. 2.3. Match One of Many Characters
  21. 2.4. Match Any Character
  22. 2.5. Match Something at the Start and/or the End of a Line
  23. 2.6. Match Whole Words
  24. 2.7. Unicode Code Points, Categories, Blocks, and Scripts
  25. 2.8. Match One of Several Alternatives
  26. 2.9. Group and Capture Parts of the Match
  27. 2.10. Match Previously Matched Text Again
  28. 2.11. Capture and Name Parts of the Match
  29. 2.12. Repeat Part of the Regex a Certain Number of Times
  30. 2.13. Choose Minimal or Maximal Repetition
  31. 2.14. Eliminate Needless Backtracking
  32. 2.15. Prevent Runaway Repetition
  33. 2.16. Test for a Match Without Adding It to the Overall Match
  34. 2.17. Match One of Two Alternatives Based on a Condition
  35. 2.18. Add Comments to a Regular Expression
  36. 2.19. Insert Literal Text into the Replacement Text
  37. 2.20. Insert the Regex Match into the Replacement Text
  38. 2.21. Insert Part of the Regex Match into the Replacement Text
  39. 2.22. Insert Match Context into the Replacement Text
  40. 3. Programming with Regular Expressions
  41. Programming Languages and Regex Flavors
  42. 3.1. Literal Regular Expressions in Source Code
  43. 3.2. Import the Regular Expression Library
  44. 3.3. Create Regular Expression Objects
  45. 3.4. Set Regular Expression Options
  46. 3.5. Test If a Match Can Be Found Within a Subject String
  47. 3.6. Test Whether a Regex Matches the Subject String Entirely
  48. 3.7. Retrieve the Matched Text
  49. 3.8. Determine the Position and Length of the Match
  50. 3.9. Retrieve Part of the Matched Text
  51. 3.10. Retrieve a List of All Matches
  52. 3.11. Iterate over All Matches
  53. 3.12. Validate Matches in Procedural Code
  54. 3.13. Find a Match Within Another Match
  55. 3.14. Replace All Matches
  56. 3.15. Replace Matches Reusing Parts of the Match
  57. 3.16. Replace Matches with Replacements Generated in Code
  58. 3.17. Replace All Matches Within the Matches of Another Regex
  59. 3.18. Replace All Matches Between the Matches of Another Regex
  60. 3.19. Split a String
  61. 3.20. Split a String, Keeping the Regex Matches
  62. 3.21. Search Line by Line
  63. Construct a Parser
  64. 4. Validation and Formatting
  65. 4.1. Validate Email Addresses
  66. 4.2. Validate and Format North American Phone Numbers
  67. 4.3. Validate International Phone Numbers
  68. 4.4. Validate Traditional Date Formats
  69. 4.5. Validate Traditional Date Formats, Excluding Invalid Dates
  70. 4.6. Validate Traditional Time Formats
  71. 4.7. Validate ISO 8601 Dates and Times
  72. 4.8. Limit Input to Alphanumeric Characters
  73. 4.9. Limit the Length of Text
  74. 4.10. Limit the Number of Lines in Text
  75. 4.11. Validate Affirmative Responses
  76. 4.12. Validate Social Security Numbers
  77. 4.13. Validate ISBNs
  78. 4.14. Validate ZIP Codes
  79. 4.15. Validate Canadian Postal Codes
  80. 4.16. Validate U.K. Postcodes
  81. 4.17. Find Addresses with Post Office Boxes
  82. 4.18. Reformat Names From “FirstName LastName” to “LastName, FirstName”
  83. 4.19. Validate Password Complexity
  84. 4.20. Validate Credit Card Numbers
  85. 4.21. European VAT Numbers
  86. 5. Words, Lines, and Special Characters
  87. 5.1. Find a Specific Word
  88. 5.2. Find Any of Multiple Words
  89. 5.3. Find Similar Words
  90. 5.4. Find All Except a Specific Word
  91. 5.5. Find Any Word Not Followed by a Specific Word
  92. 5.6. Find Any Word Not Preceded by a Specific Word
  93. 5.7. Find Words Near Each Other
  94. 5.8. Find Repeated Words
  95. 5.9. Remove Duplicate Lines
  96. 5.10. Match Complete Lines That Contain a Word
  97. 5.11. Match Complete Lines That Do Not Contain a Word
  98. 5.12. Trim Leading and Trailing Whitespace
  99. 5.13. Replace Repeated Whitespace with a Single Space
  100. 5.14. Escape Regular Expression Metacharacters
  101. 6. Numbers
  102. 6.1. Integer Numbers
  103. 6.2. Hexadecimal Numbers
  104. 6.3. Binary Numbers
  105. 6.4. Octal Numbers
  106. 6.5. Decimal Numbers
  107. 6.6. Strip Leading Zeros
  108. 6.7. Numbers Within a Certain Range
  109. 6.8. Hexadecimal Numbers Within a Certain Range
  110. 6.9. Integer Numbers with Separators
  111. 6.10. Floating-Point Numbers
  112. 6.11. Numbers with Thousand Separators
  113. 6.12. Add Thousand Separators to Numbers
  114. 6.13. Roman Numerals
  115. 7. Source Code and Log Files
  116. Keywords
  117. Identifiers
  118. Numeric Constants
  119. Operators
  120. Single-Line Comments
  121. Multiline Comments
  122. All Comments
  123. Strings
  124. Strings with Escapes
  125. Regex Literals
  126. Here Documents
  127. Common Log Format
  128. Combined Log Format
  129. Broken Links Reported in Web Logs
  130. 8. URLs, Paths, and Internet Addresses
  131. 8.1. Validating URLs
  132. 8.2. Finding URLs Within Full Text
  133. 8.3. Finding Quoted URLs in Full Text
  134. 8.4. Finding URLs with Parentheses in Full Text
  135. 8.5. Turn URLs into Links
  136. 8.6. Validating URNs
  137. 8.7. Validating Generic URLs
  138. 8.8. Extracting the Scheme from a URL
  139. 8.9. Extracting the User from a URL
  140. 8.10. Extracting the Host from a URL
  141. 8.11. Extracting the Port from a URL
  142. 8.12. Extracting the Path from a URL
  143. 8.13. Extracting the Query from a URL
  144. 8.14. Extracting the Fragment from a URL
  145. 8.15. Validating Domain Names
  146. 8.16. Matching IPv4 Addresses
  147. 8.17. Matching IPv6 Addresses
  148. 8.18. Validate Windows Paths
  149. 8.19. Split Windows Paths into Their Parts
  150. 8.20. Extract the Drive Letter from a Windows Path
  151. 8.21. Extract the Server and Share from a UNC Path
  152. 8.22. Extract the Folder from a Windows Path
  153. 8.23. Extract the Filename from a Windows Path
  154. 8.24. Extract the File Extension from a Windows Path
  155. 8.25. Strip Invalid Characters from Filenames
  156. 9. Markup and Data Formats
  157. Processing Markup and Data Formats with Regular Expressions
  158. 9.1. Find XML-Style Tags
  159. 9.2. Replace Tags with
  160. 9.3. Remove All XML-Style Tags Except and
  161. 9.4. Match XML Names
  162. 9.5. Convert Plain Text to HTML by Adding

    and
    Tags

  163. 9.6. Decode XML Entities
  164. 9.7. Find a Specific Attribute in XML-Style Tags
  165. 9.8. Add a cellspacing Attribute to Tags That Do Not Already Include It
  166. 9.9. Remove XML-Style Comments
  167. 9.10. Find Words Within XML-Style Comments
  168. 9.11. Change the Delimiter Used in CSV Files
  169. 9.12. Extract CSV Fields from a Specific Column
  170. 9.13. Match INI Section Headers
  171. 9.14. Match INI Section Blocks
  172. 9.15. Match INI Name-Value Pairs
  173. Index
  174. Index
  175. Index
  176. Index
  177. Index
  178. Index
  179. Index
  180. Index
  181. Index
  182. Index
  183. Index
  184. Index
  185. Index
  186. Index
  187. Index
  188. Index
  189. Index
  190. Index
  191. Index
  192. Index
  193. Index
  194. Index
  195. Index
  196. Index
  197. Index
  198. Index
  199. About the Authors
  200. Colophon
  201. Copyright
  202. 8.17. Matching IPv6 Addresses

    Problem

    You want to check whether a string represents a valid IPv6 address using the standard, compact, and/or mixed notations.

    Solution

    Standard notation

    Match an IPv6 address in standard notation, which consists of eight 16-bit words using hexadecimal notation, delimited by colons (e.g.: 1762:0:0:0:0:B03:1:AF18). Leading zeros are optional.

    Check whether the whole subject text is an IPv6 address using standard notation:

    ^(?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}$
    Regex options: Case insensitive
    Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python
    \A(?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}\Z
    Regex options: Case insensitive
    Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby

    Find an IPv6 address using standard notation within a larger collection of text:

    (?<![:.\w])(?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}(?![:.\w])
    Regex options: Case insensitive
    Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9

    JavaScript and Ruby 1.8 don’t support lookbehind. We have to remove the check at the start of the regex that keeps it from finding IPv6 addresses within longer sequences of hexadecimal digits and colons. A word boundary performs part of the test:

    \b(?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}\b
    Regex options: Case insensitive
    Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Mixed notation

Match an IPv6 address in mixed notation, which consists of six 16-bit words using hexadecimal notation, followed by four bytes using decimal notation. The words are delimited with colons, and the bytes with dots. A colon separates the words from the bytes. Leading zeros are optional for both the hexadecimal words and the decimal bytes. This notation is used in situations where IPv4 and IPv6 are mixed, and the IPv6 addresses are extensions of the IPv4 addresses. 1762:0:0:0:0:B03:127.32.67.15 is an example of an IPv6 address in mixed notation.

Check whether the whole subject text is an IPv6 address using mixed notation:

^(?:[A-F0-9]{1,4}:){6}(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])↵
\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Find IPv6 address using mixed notation within a larger collection of text:

(?<![:.\w])(?:[A-F0-9]{1,4}:){6}↵
(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}↵
(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])(?![:.\w])
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

JavaScript and Ruby 1.8 don’t support lookbehind. We have to remove the check at the start of the regex that keeps it from finding IPv6 addresses within longer sequences of hexadecimal digits and colons. A word boundary performs part of the test:

\b(?:[A-F0-9]{1,4}:){6}(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])↵
\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Standard or mixed notation

Match an IPv6 address using standard or mixed notation.

Check whether the whole subject text is an IPv6 address using standard or mixed notation:

\A                                                       # Start of string
(?:[A-F0-9]{1,4}:){6}                                        # 6 words
(?:[A-F0-9]{1,4}:[A-F0-9]{1,4}                               # 2 words
|  (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}  # or 4 bytes
   (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])
)\Z                                                      # End of string
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^(?:[A-F0-9]{1,4}:){6}(?:[A-F0-9]{1,4}:[A-F0-9]{1,4}|↵
(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}↵
(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]))$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python

Find IPv6 address using standard or mixed notation within a larger collection of text:

(?<![:.\w])                                              # Anchor address
(?:[A-F0-9]{1,4}:){6}                                        # 6 words
(?:[A-F0-9]{1,4}:[A-F0-9]{1,4}                               # 2 words
|  (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}  # or 4 bytes
   (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])
)(?![:.\w])                                              # Anchor address
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9

JavaScript and Ruby 1.8 don’t support lookbehind. We have to remove the check at the start of the regex that keeps it from finding IPv6 addresses within longer sequences of hexadecimal digits and colons. A word boundary performs part of the test:

\b                                                       # Word boundary
(?:[A-F0-9]{1,4}:){6}                                        # 6 words
(?:[A-F0-9]{1,4}:[A-F0-9]{1,4}                               # 2 words
|  (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}  # or 4 bytes
   (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])
)\b                                                      # Word boundary
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
\b(?:[A-F0-9]{1,4}:){6}(?:[A-F0-9]{1,4}:[A-F0-9]{1,4}|↵
(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}↵
(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]))\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Compressed notation

Match an IPv6 address using compressed notation. Compressed notation is the same as standard notation, except that one sequence of one or more words that are zero may be omitted, leaving only the colons before and after the omitted zeros. Addresses using compressed notation can be recognized by the occurrence of two adjacent colons in the address. Only one sequence of zeros may be omitted; otherwise, it would be impossible to determine how many words have been omitted in each sequence. If the omitted sequence of zeros is at the start or the end of the IP address, it will begin or end with two colons. If all numbers are zero, the compressed IPv6 address consists of just two colons, without any digits.

For example, 1762::B03:1:AF18 is the compressed form of 1762:0:0:0:0:B03:1:AF18. The regular expressions in this section will match both the compressed and the standard form of the IPv6 address. Check whether the whole subject text is an IPv6 address using standard or compressed notation:

\A(?:
 # Standard
 (?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}
 # Compressed with at most 7 colons
|(?=(?:[A-F0-9]{0,4}:){0,7}[A-F0-9]{0,4}
    \Z) # and anchored
 # and at most 1 double colon
 (([0-9A-F]{1,4}:){1,7}|:)((:[0-9A-F]{1,4}){1,7}|:)
 # Compressed with 8 colons
|(?:[A-F0-9]{1,4}:){7}:|:(:[A-F0-9]{1,4}){7}
)\Z
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^(?:(?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}|(?=(?:[A-F0-9]{0,4}:){0,7}↵
[A-F0-9]{0,4}$)(([0-9A-F]{1,4}:){1,7}|:)((:[0-9A-F]{1,4}){1,7}|:)↵
|(?:[A-F0-9]{1,4}:){7}:|:(:[A-F0-9]{1,4}){7})$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python

Find IPv6 address using standard or compressed notation within a larger collection of text:

(?<![:.\w])(?:
 # Standard
 (?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}
 # Compressed with at most 7 colons
|(?=(?:[A-F0-9]{0,4}:){0,7}[A-F0-9]{0,4}
    (?![:.\w])) # and anchored
 # and at most 1 double colon
 (([0-9A-F]{1,4}:){1,7}|:)((:[0-9A-F]{1,4}){1,7}|:)
 # Compressed with 8 colons
|(?:[A-F0-9]{1,4}:){7}:|:(:[A-F0-9]{1,4}){7}
)(?![:.\w])
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9

JavaScript and Ruby 1.8 don’t support lookbehind, so we have to remove the check at the start of the regex that keeps it from finding IPv6 addresses within longer sequences of hexadecimal digits and colons. We cannot use a word boundary, because the address may start with a colon, which is not a word character:

(?:
 # Standard
 (?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}
 # Compressed with at most 7 colons
|(?=(?:[A-F0-9]{0,4}:){0,7}[A-F0-9]{0,4}
    (?![:.\w])) # and anchored
 # and at most 1 double colon
 (([0-9A-F]{1,4}:){1,7}|:)((:[0-9A-F]{1,4}){1,7}|:)
 # Compressed with 8 colons
|(?:[A-F0-9]{1,4}:){7}:|:(:[A-F0-9]{1,4}){7}
)(?![:.\w])
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
(?:(?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}|(?=(?:[A-F0-9]{0,4}:){0,7}↵
[A-F0-9]{0,4}(?![:.\w]))(([0-9A-F]{1,4}:){1,7}|:)((:[0-9A-F]{1,4}){1,7}|:)↵
|(?:[A-F0-9]{1,4}:){7}:|:(:[A-F0-9]{1,4}){7})(?![:.\w])
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Compressed mixed notation

Match an IPv6 address using compressed mixed notation. Compressed mixed notation is the same as mixed notation, except that one sequence of one or more words that are zero may be omitted, leaving only the colons before and after the omitted zeros. The four decimal bytes must all be specified, even if they are zero. Addresses using compressed mixed notation can be recognized by the occurrence of two adjacent colons in the first part of the address and the three dots in the second part. Only one sequence of zeros may be omitted; otherwise, it would be impossible to determine how many words have been omitted in each sequence. If the omitted sequence of zeros is at the start of the IP address, it will begin with two colons rather than with a digit.

For example, the IPv6 address 1762::B03:127.32.67.15 is the compressed form of 1762:0:0:0:0:B03:127.32.67.15. The regular expressions in this section will match both compressed and noncompressed IPv6 address using mixed notation.

Check whether the whole subject text is an IPv6 address using compressed or noncompressed mixed notation:

\A
(?:
 # Non-compressed
 (?:[A-F0-9]{1,4}:){6}
 # Compressed with at most 6 colons
|(?=(?:[A-F0-9]{0,4}:){0,6}
    (?:[0-9]{1,3}\.){3}[0-9]{1,3}  # and 4 bytes
    \Z)                            # and anchored
 # and at most 1 double colon
 (([0-9A-F]{1,4}:){0,5}|:)((:[0-9A-F]{1,4}){1,5}:|:)
 # Compressed with 7 colons and 5 numbers
|::(?:[A-F0-9]{1,4}:){5}
)
# 255.255.255.
(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}
# 255
(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])
\Z
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^(?:(?:[A-F0-9]{1,4}:){6}|(?=(?:[A-F0-9]{0,4}:){0,6}(?:[0-9]{1,3}\.)↵
{3}[0-9]{1,3}$)(([0-9A-F]{1,4}:){0,5}|:)((:[0-9A-F]{1,4}){1,5}:|:)↵
|::(?:[A-F0-9]{1,4}:){5})(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|↵
[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python

Find IPv6 address using compressed or noncompressed mixed notation within a larger collection of text:

(?<![:.\w])
(?:
 # Non-compressed
 (?:[A-F0-9]{1,4}:){6}
 # Compressed with at most 6 colons
|(?=(?:[A-F0-9]{0,4}:){0,6}
    (?:[0-9]{1,3}\.){3}[0-9]{1,3}  # and 4 bytes
    (?![:.\w]))                    # and anchored
 # and at most 1 double colon
 (([0-9A-F]{1,4}:){0,5}|:)((:[0-9A-F]{1,4}){1,5}:|:)
 # Compressed with 7 colons and 5 numbers
|::(?:[A-F0-9]{1,4}:){5}
)
# 255.255.255.
(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}
# 255
(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])
(?![:.\w])
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9

JavaScript and Ruby 1.8 don’t support lookbehind, so we have to remove the check at the start of the regex that keeps it from finding IPv6 addresses within longer sequences of hexadecimal digits and colons. We cannot use a word boundary, because the address may start with a colon, which is not a word character.

(?:
 # Non-compressed
 (?:[A-F0-9]{1,4}:){6}
 # Compressed with at most 6 colons
|(?=(?:[A-F0-9]{0,4}:){0,6}
    (?:[0-9]{1,3}\.){3}[0-9]{1,3}  # and 4 bytes
    (?![:.\w]))                    # and anchored
 # and at most 1 double colon
 (([0-9A-F]{1,4}:){0,5}|:)((:[0-9A-F]{1,4}){1,5}:|:)
 # Compressed with 7 colons and 5 numbers
|::(?:[A-F0-9]{1,4}:){5}
)
# 255.255.255.
(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}
# 255
(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])
(?![:.\w])
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
(?:(?:[A-F0-9]{1,4}:){6}|(?=(?:[A-F0-9]{0,4}:){0,6}(?:[0-9]{1,3}\.){3}↵
[0-9]{1,3}(?![:.\w]))(([0-9A-F]{1,4}:){0,5}|:)((:[0-9A-F]{1,4}){1,5}:|:)↵
|::(?:[A-F0-9]{1,4}:){5})(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?↵
[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])(?![:.\w])
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Standard, mixed, or compressed notation

Match an IPv6 address using any of the notations explained earlier: standard, mixed, compressed, and compressed mixed.

Check whether the whole subject text is an IPv6 address:

\A(?:
# Mixed
 (?:
  # Non-compressed
  (?:[A-F0-9]{1,4}:){6}
  # Compressed with at most 6 colons
 |(?=(?:[A-F0-9]{0,4}:){0,6}
     (?:[0-9]{1,3}\.){3}[0-9]{1,3}  # and 4 bytes
     \Z)                            # and anchored
  # and at most 1 double colon
  (([0-9A-F]{1,4}:){0,5}|:)((:[0-9A-F]{1,4}){1,5}:|:)
  # Compressed with 7 colons and 5 numbers
 |::(?:[A-F0-9]{1,4}:){5}
 )
 # 255.255.255.
 (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}
 # 255
 (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])
|# Standard
 (?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}
|# Compressed with at most 7 colons
 (?=(?:[A-F0-9]{0,4}:){0,7}[A-F0-9]{0,4}
    \Z)  # and anchored
 # and at most 1 double colon
 (([0-9A-F]{1,4}:){1,7}|:)((:[0-9A-F]{1,4}){1,7}|:)
 # Compressed with 8 colons
|(?:[A-F0-9]{1,4}:){7}:|:(:[A-F0-9]{1,4}){7}
)\Z
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^(?:(?:(?:[A-F0-9]{1,4}:){6}|(?=(?:[A-F0-9]{0,4}:){0,6}(?:[0-9]{1,3}↵
\.){3}[0-9]{1,3}$)(([0-9A-F]{1,4}:){0,5}|:)((:[0-9A-F]{1,4}){1,5}:|:)↵
|::(?:[A-F0-9]{1,4}:){5})(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|↵
[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])|↵
(?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}|(?=(?:[A-F0-9]{0,4}:){0,7}↵
[A-F0-9]{0,4}$)(([0-9A-F]{1,4}:){1,7}|:)((:[0-9A-F]{1,4}){1,7}|:)|↵
(?:[A-F0-9]{1,4}:){7}:|:(:[A-F0-9]{1,4}){7})$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python

Find an IPv6 address using standard or mixed notation within a larger collection of text:

(?<![:.\w])(?:
# Mixed
 (?:
  # Non-compressed
  (?:[A-F0-9]{1,4}:){6}
  # Compressed with at most 6 colons
 |(?=(?:[A-F0-9]{0,4}:){0,6}
     (?:[0-9]{1,3}\.){3}[0-9]{1,3}  # and 4 bytes
     (?![:.\w]))                    # and anchored
  # and at most 1 double colon
  (([0-9A-F]{1,4}:){0,5}|:)((:[0-9A-F]{1,4}){1,5}:|:)
  # Compressed with 7 colons and 5 numbers
 |::(?:[A-F0-9]{1,4}:){5}
 )
 # 255.255.255.
 (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}
 # 255
 (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])
|# Standard
 (?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}
|# Compressed with at most 7 colons
 (?=(?:[A-F0-9]{0,4}:){0,7}[A-F0-9]{0,4}
    (?![:.\w]))  # and anchored
 # and at most 1 double colon
 (([0-9A-F]{1,4}:){1,7}|:)((:[0-9A-F]{1,4}){1,7}|:)
 # Compressed with 8 colons
|(?:[A-F0-9]{1,4}:){7}:|:(:[A-F0-9]{1,4}){7}
)(?![:.\w])
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9

JavaScript and Ruby 1.8 don’t support lookbehind, so we have to remove the check at the start of the regex that keeps it from finding IPv6 addresses within longer sequences of hexadecimal digits and colons. We cannot use a word boundary, because the address may start with a colon, which is not a word character.

(?:
 # Mixed
 (?:
  # Non-compressed
  (?:[A-F0-9]{1,4}:){6}
  # Compressed with at most 6 colons
 |(?=(?:[A-F0-9]{0,4}:){0,6}
     (?:[0-9]{1,3}\.){3}[0-9]{1,3}  # and 4 bytes
     (?![:.\w]))                    # and anchored
  # and at most 1 double colon
  (([0-9A-F]{1,4}:){0,5}|:)((:[0-9A-F]{1,4}){1,5}:|:)
  # Compressed with 7 colons and 5 numbers
 |::(?:[A-F0-9]{1,4}:){5}
 )
 # 255.255.255.
 (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}
 # 255
 (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])
|# Standard
 (?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}
|# Compressed with at most 7 colons
 (?=(?:[A-F0-9]{0,4}:){0,7}[A-F0-9]{0,4}
    (?![:.\w]))  # and anchored
 # and at most 1 double colon
 (([0-9A-F]{1,4}:){1,7}|:)((:[0-9A-F]{1,4}){1,7}|:)
 # Compressed with 8 colons
|(?:[A-F0-9]{1,4}:){7}:|:(:[A-F0-9]{1,4}){7}
)(?![:.\w])
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
(?:(?:(?:[A-F0-9]{1,4}:){6}|(?=(?:[A-F0-9]{0,4}:){0,6}(?:[0-9]{1,3}\.){3}↵
[0-9]{1,3}(?![:.\w]))(([0-9A-F]{1,4}:){0,5}|:)((:[0-9A-F]{1,4}){1,5}:|:)↵
|::(?:[A-F0-9]{1,4}:){5})(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|↵
[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|↵
(?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}|(?=(?:[A-F0-9]{0,4}:){0,7}↵
[A-F0-9]{0,4}(?![:.\w]))(([0-9A-F]{1,4}:){1,7}|:)((:[0-9A-F]{1,4}){1,7}↵
|:)|(?:[A-F0-9]{1,4}:){7}:|:(:[A-F0-9]{1,4}){7})(?![:.\w])
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Discussion

Because of the different notations, matching an IPv6 address isn’t nearly as simple as matching an IPv4 address. Which notations you want to accept will greatly impact the complexity of your regular expression. Basically, there are two notations: standard and mixed. You can decide to allow only one of the two notations, or both. That gives us three sets of regular expressions.

Both the standard and mixed notations have a compressed form that omits zeros. Allowing compressed notation gives us another three sets of regular expressions.

You’ll need slightly different regexes depending on whether you want to check if a given string is a valid IPv6 address, or whether you want to find IP addresses in a larger body of text. To validate the IP address, we use anchors, as Recipe 2.5 explains. JavaScript uses the ^ and $ anchors, whereas Ruby uses \A and \Z. All other flavors support both. Ruby also supports ^ and $, but allows them to match at embedded line breaks in the string as well. You should use the caret and dollar in Ruby only if you know your string doesn’t have any embedded line breaks.

To find IPv6 addresses within larger text, we use negative lookbehind (?<![:.\w]) and negative lookahead (?![:.\w]) to make sure the address isn’t preceded or followed by a word character (letter, digit, or underscore) or by a dot or colon. This makes sure we don’t match parts of longer sequences of digits and colons. Recipe 2.16 explains how lookbehind and lookahead work. If lookaround isn’t available, word boundaries can check that the address isn’t preceded or followed by a word character, but only if the first and last character in the address are sure to be (hexadecimal) digits. Compressed notation allows addresses that start and end with a colon. If we were to put a word boundary before or after a colon, it would require an adjacent letter or digit, which isn’t what we want. Recipe 2.6 explains everything about word boundaries.

Standard notation

Standard IPv6 notation is very straightforward to handle with a regular expression. We need to match eight words in hexadecimal notation, delimited by seven colons. [A-F0-9]{1,4} matches 1 to 4 hexadecimal characters, which is what we need for a 16-bit word with optional leading zeros. The character class (Recipe 2.3) lists only the uppercase letters. The case-insensitive matching mode takes care of the lowercase letters. See Recipe 3.4 to learn how to set matching modes in your programming language.

The noncapturing group (?:[A-F0-9]{1,4}:){7} matches a hexadecimal word followed by a literal colon. The quantifier repeats the group seven times. The first colon in this regex is part of the regex syntax for noncapturing groups, as Recipe 2.9 explains, and the second is a literal colon. The colon is not a metacharacter in regular expressions, except in a few very specific situations as part of a larger regex token. Therefore, we don’t need to use backslashes to escape literal colons in our regular expressions. We could escape them, but it would only make the regex harder to read.

Mixed notation

The regex for the mixed IPv6 notation consists of two parts. (?:[A-F0-9]{1,4}:){6} matches six hexadecimal words, each followed by a literal colon, just like we have a sequence of seven such words in the regex for the standard IPv6 notation.

Instead of having two more hexadecimal words at the end, we now have a full IPv4 address at the end. We match this using the “accurate” regex that disallows leading zeros shown in Recipe 8.16.

Standard or mixed notation

Allowing both standard and mixed notation requires a slightly longer regular expression. The two notations differ only in their representation of the last 32 bits of the IPv6 address. Standard notation uses two 16-bit words, whereas mixed notation uses 4 decimal bytes, as with IPv4.

The first part of the regex matches six hexadecimal words, as in the regex that supports mixed notation only. The second part of the regex is now a noncapturing group with the two alternatives for the last 32 bits. As Recipe 2.8 explains, the alternation operator (vertical bar) has the lowest precedence of all regex operators. Thus, we need the noncapturing group to exclude the six words from the alternation.

The first alternative, located to the left of the vertical bar, matches two hexadecimal words with a literal colon in between. The second alternative matches an IPv4 address.

Compressed notation

Things get quite a bit more complicated when we allow compressed notation. The reason is that compressed notation allows a variable number of zeros to be omitted. 1:0:0:0:0:6:0:0, 1::6:0:0, and 1:0:0:0:0:6:: are three ways of writing the same IPv6 address. The address may have at most eight words, but it needn’t have any. If it has less than eight, it must have one double-colon sequence that represents the omitted zeros.

Variable repetition is easy with regular expressions. If an IPv6 address has a double colon, there can be at most seven words before and after the double colon. We could easily write this as:

(
  ([0-9A-F]{1,4}:){1,7}  # 1 to 7 words to the left
| :                      # or a double colon at the start
)
(
  (:[0-9A-F]{1,4}){1,7}  # 1 to 7 words to the right
| :                      # or a double colon at the end
)
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby

Tip

This regular expression and the ones that follow in this discussion also work with JavaScript if you eliminate the comments and extra whitespace. JavaScript supports all the features used in these regexes, except free-spacing, which we use here to make these regexes easier to understand. Or, you can use the XRegExp library which enables free-spacing regular expressions in JavaScript, among other regex syntax enhancements.

This regular expression matches all compressed IPv6 addresses, but it doesn’t match any addresses that use noncompressed standard notation.

This regex is quite simple. The first part matches 1 to 7 words followed by a colon, or just the colon for addresses that don’t have any words to the left of the double colon. The second part matches 1 to 7 words preceded by a colon, or just the colon for addresses that don’t have any words to the right of the double colon. Put together, valid matches are a double colon by itself, a double colon with 1 to 7 words at the left only, a double colon with 1 to 7 words at the right only, and a double colon with 1 to 7 words at both the left and the right.

It’s the last part that is troublesome. The regex allows 1 to 7 words at both the left and the right, as it should, but it doesn’t specify that the total number of words at the left and right must be 7 or less. An IPv6 address has 8 words. The double colon indicates we’re omitting at least one word, so at most 7 remain.

Regular expressions don’t do math. They can count if something occurs between 1 and 7 times. But they cannot count if two things occur for a total of 7 times, splitting those 7 times between the two things in any combination.

To understand this problem better, let’s examine a simple analog. Say we want to match something in the form of aaaaxbbb. The string must be between 1 and 8 characters long and consist of 0 to 7 times a, exactly one x, and 0 to 7 times b.

There are two ways to solve this problem with a regular expression. One way is to spell out all the alternatives. The next section discussing compressed mixed notation uses this. It can result in a long-winded regex, but it will be easy to understand.

\A(?:a{7}x
 |  a{6}xb?
 |  a{5}xb{0,2}
 |  a{4}xb{0,3}
 |  a{3}xb{0,4}
 |  a{2}xb{0,5}
 |  axb{0,6}
 |  xb{0,7}
)\Z
Regex options: Free-spacing
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby

This regular expression has one alternative for each of the possible number of letters a. Each alternative spells out how many letters b are allowed after the given number of letters a and the x have been matched.

The other solution is to use lookahead. This is the method used for the regex within the section that matches an IPv6 address using compressed notation. If you’re not familiar with lookahead, see Recipe 2.16 first. Using lookahead, we can essentially match the same text twice, checking it for two conditions.

\A
  (?=[abx]{1,8}\Z)
  a{0,7}xb{0,7}
\Z
Regex options: Free-spacing
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby

The \A at the start of the regex anchors it to the start of the subject text. Then the positive lookahead kicks in. It checks whether a series of 1 to 8 letters a, b, and/or x can be matched, and that the end of the string is reached when those 1 to 8 letters have been matched. The \Z inside the lookahead is crucial. In order to limit the regex to strings of eight characters or less, the lookahead must test that there aren’t any further characters after those that it matched.

In a different scenario, you might use another kind of delimiter instead of \A and \Z. If you wanted to do a “whole words only” search for aaaaxbbb and friends, you would use word boundaries. But to restrict the regex match to the right length, you have to use some kind of delimiter, and you have to put the delimiter that matches the end of the string both inside the lookahead and at the end of the regular expression. If you don’t, the regular expression will partly match a string that has too many characters.

When the lookahead has satisfied its requirement, it gives up the characters that it has matched. Thus, when the regex engine attempts a{0,7}, it is back at the start of the string. The fact that the lookahead doesn’t consume the text that it matched is the key difference between a lookahead and a noncapturing group, and is what allows us to apply two patterns to a single piece of text.

Although a{0,7}xb{0,7} on its own could match up to 15 letters, in this case it can match only 8, because the lookahead already made sure there are only 8 letters. All a{0,7}xb{0,7} has to do is to check that they appear in the right order. In fact, a*xb* would have the exact same effect as a{0,7}xb{0,7} in this regular expression.

The second \Z at the end of the regex is also essential. Just like the lookahead needs to make sure there aren’t too many letters, the second test after the lookahead needs to make sure that all the letters are in the right order. This makes sure we don’t match something like axba, even though it satisfies the lookahead by being between 1 and 8 characters long.

Compressed mixed notation

Mixed notation can be compressed just like standard notation. Although the four bytes at the end must always be specified, even when they are zero, the number of hexadecimal words before them again becomes variable. If all the hexadecimal words are zero, the IPv6 address could end up looking like an IPv4 address with two colons before it.

Creating a regex for compressed mixed notation involves solving the same issues as for compressed standard notation. The previous section explains all this.

The main difference between the regex for compressed mixed notation and the regex for compressed (standard) notation is that the one for compressed mixed notation needs to check for the IPv4 address after the six hexadecimal words. We do this check at the end of the regex, using the same regex for accurate IPv4 addresses from Recipe 8.16 that we used in this recipe for noncompressed mixed notation.

We have to match the IPv4 part of the address at the end of the regex, but we also have to check for it inside the lookahead that makes sure we have no more than six colons or six hexadecimal words in the IPv6 address. Since we’re already doing an accurate test at the end of the regex, the lookahead can suffice with a simple IPv4 check. The lookahead doesn’t need to validate the IPv4 part, as the main regex already does that. But it does have to match the IPv4 part, so that the end-of-string anchor at the end of the lookahead can do its job.

Standard, mixed, or compressed notation

The final set of regular expressions puts it all together. These match an IPv6 address in any notation: standard or mixed, compressed or not.

These regular expressions are formed by alternating the ones for compressed mixed notation and compressed (standard) notation. These regexes already use alternation to match both the compressed and noncompressed variety of the IPv6 notation they support.

The result is a regular expression with three top-level alternatives, with the first alternative consisting of two alternatives of its own. The first alternative matches an IPv6 address using mixed notation, either noncompressed or compressed. The second alternative matches an IPv6 address using standard notation. The third alternative covers the compressed (standard) notation.

We have three top-level alternatives instead of two alternatives that each contain their own two alternatives because there’s no particular reason to group the alternatives for standard and compressed notation. For mixed notation, we do keep the compressed and noncompressed alternatives together, because it saves us having to spell out the IPv4 part twice.

Essentially, we combined this regex:

^(6words|compressed6words)ip4$

and this regex:

^(8words|compressed8words)$

into:

^((6words|compressed6words)ip4|8words|compressed8words)$

rather than:

^((6words|compressed6words)ip4|(8words|compressed8words))$

See Also

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.1 explains which special characters need to be escaped. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.6 explains word boundaries. Recipe 2.9 explains grouping. Recipe 2.8 explains alternation. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround. Recipe 2.18 explains how to add comments.