Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Regular Expressions Cookbook, 2nd Edition, by Steven Levithan. Published by O'Reilly Media, Inc., 2012
  1. Cover
  2. Regular Expressions Cookbook
  3. Preface
  4. Caught in the Snarls of Different Versions
  5. Intended Audience
  6. Technology Covered
  7. Organization of This Book
  8. Conventions Used in This Book
  9. Using Code Examples
  10. Safari® Books Online
  11. How to Contact Us
  12. Acknowledgments
  13. 1. Introduction to Regular Expressions
  14. Regular Expressions Defined
  15. Search and Replace with Regular Expressions
  16. Tools for Working with Regular Expressions
  17. 2. Basic Regular Expression Skills
  18. 2.1. Match Literal Text
  19. 2.2. Match Nonprintable Characters
  20. 2.3. Match One of Many Characters
  21. 2.4. Match Any Character
  22. 2.5. Match Something at the Start and/or the End of a Line
  23. 2.6. Match Whole Words
  24. 2.7. Unicode Code Points, Categories, Blocks, and Scripts
  25. 2.8. Match One of Several Alternatives
  26. 2.9. Group and Capture Parts of the Match
  27. 2.10. Match Previously Matched Text Again
  28. 2.11. Capture and Name Parts of the Match
  29. 2.12. Repeat Part of the Regex a Certain Number of Times
  30. 2.13. Choose Minimal or Maximal Repetition
  31. 2.14. Eliminate Needless Backtracking
  32. 2.15. Prevent Runaway Repetition
  33. 2.16. Test for a Match Without Adding It to the Overall Match
  34. 2.17. Match One of Two Alternatives Based on a Condition
  35. 2.18. Add Comments to a Regular Expression
  36. 2.19. Insert Literal Text into the Replacement Text
  37. 2.20. Insert the Regex Match into the Replacement Text
  38. 2.21. Insert Part of the Regex Match into the Replacement Text
  39. 2.22. Insert Match Context into the Replacement Text
  40. 3. Programming with Regular Expressions
  41. Programming Languages and Regex Flavors
  42. 3.1. Literal Regular Expressions in Source Code
  43. 3.2. Import the Regular Expression Library
  44. 3.3. Create Regular Expression Objects
  45. 3.4. Set Regular Expression Options
  46. 3.5. Test If a Match Can Be Found Within a Subject String
  47. 3.6. Test Whether a Regex Matches the Subject String Entirely
  48. 3.7. Retrieve the Matched Text
  49. 3.8. Determine the Position and Length of the Match
  50. 3.9. Retrieve Part of the Matched Text
  51. 3.10. Retrieve a List of All Matches
  52. 3.11. Iterate over All Matches
  53. 3.12. Validate Matches in Procedural Code
  54. 3.13. Find a Match Within Another Match
  55. 3.14. Replace All Matches
  56. 3.15. Replace Matches Reusing Parts of the Match
  57. 3.16. Replace Matches with Replacements Generated in Code
  58. 3.17. Replace All Matches Within the Matches of Another Regex
  59. 3.18. Replace All Matches Between the Matches of Another Regex
  60. 3.19. Split a String
  61. 3.20. Split a String, Keeping the Regex Matches
  62. 3.21. Search Line by Line
  63. Construct a Parser
  64. 4. Validation and Formatting
  65. 4.1. Validate Email Addresses
  66. 4.2. Validate and Format North American Phone Numbers
  67. 4.3. Validate International Phone Numbers
  68. 4.4. Validate Traditional Date Formats
  69. 4.5. Validate Traditional Date Formats, Excluding Invalid Dates
  70. 4.6. Validate Traditional Time Formats
  71. 4.7. Validate ISO 8601 Dates and Times
  72. 4.8. Limit Input to Alphanumeric Characters
  73. 4.9. Limit the Length of Text
  74. 4.10. Limit the Number of Lines in Text
  75. 4.11. Validate Affirmative Responses
  76. 4.12. Validate Social Security Numbers
  77. 4.13. Validate ISBNs
  78. 4.14. Validate ZIP Codes
  79. 4.15. Validate Canadian Postal Codes
  80. 4.16. Validate U.K. Postcodes
  81. 4.17. Find Addresses with Post Office Boxes
  82. 4.18. Reformat Names From “FirstName LastName” to “LastName, FirstName”
  83. 4.19. Validate Password Complexity
  84. 4.20. Validate Credit Card Numbers
  85. 4.21. European VAT Numbers
  86. 5. Words, Lines, and Special Characters
  87. 5.1. Find a Specific Word
  88. 5.2. Find Any of Multiple Words
  89. 5.3. Find Similar Words
  90. 5.4. Find All Except a Specific Word
  91. 5.5. Find Any Word Not Followed by a Specific Word
  92. 5.6. Find Any Word Not Preceded by a Specific Word
  93. 5.7. Find Words Near Each Other
  94. 5.8. Find Repeated Words
  95. 5.9. Remove Duplicate Lines
  96. 5.10. Match Complete Lines That Contain a Word
  97. 5.11. Match Complete Lines That Do Not Contain a Word
  98. 5.12. Trim Leading and Trailing Whitespace
  99. 5.13. Replace Repeated Whitespace with a Single Space
  100. 5.14. Escape Regular Expression Metacharacters
  101. 6. Numbers
  102. 6.1. Integer Numbers
  103. 6.2. Hexadecimal Numbers
  104. 6.3. Binary Numbers
  105. 6.4. Octal Numbers
  106. 6.5. Decimal Numbers
  107. 6.6. Strip Leading Zeros
  108. 6.7. Numbers Within a Certain Range
  109. 6.8. Hexadecimal Numbers Within a Certain Range
  110. 6.9. Integer Numbers with Separators
  111. 6.10. Floating-Point Numbers
  112. 6.11. Numbers with Thousand Separators
  113. 6.12. Add Thousand Separators to Numbers
  114. 6.13. Roman Numerals
  115. 7. Source Code and Log Files
  116. Keywords
  117. Identifiers
  118. Numeric Constants
  119. Operators
  120. Single-Line Comments
  121. Multiline Comments
  122. All Comments
  123. Strings
  124. Strings with Escapes
  125. Regex Literals
  126. Here Documents
  127. Common Log Format
  128. Combined Log Format
  129. Broken Links Reported in Web Logs
  130. 8. URLs, Paths, and Internet Addresses
  131. 8.1. Validating URLs
  132. 8.2. Finding URLs Within Full Text
  133. 8.3. Finding Quoted URLs in Full Text
  134. 8.4. Finding URLs with Parentheses in Full Text
  135. 8.5. Turn URLs into Links
  136. 8.6. Validating URNs
  137. 8.7. Validating Generic URLs
  138. 8.8. Extracting the Scheme from a URL
  139. 8.9. Extracting the User from a URL
  140. 8.10. Extracting the Host from a URL
  141. 8.11. Extracting the Port from a URL
  142. 8.12. Extracting the Path from a URL
  143. 8.13. Extracting the Query from a URL
  144. 8.14. Extracting the Fragment from a URL
  145. 8.15. Validating Domain Names
  146. 8.16. Matching IPv4 Addresses
  147. 8.17. Matching IPv6 Addresses
  148. 8.18. Validate Windows Paths
  149. 8.19. Split Windows Paths into Their Parts
  150. 8.20. Extract the Drive Letter from a Windows Path
  151. 8.21. Extract the Server and Share from a UNC Path
  152. 8.22. Extract the Folder from a Windows Path
  153. 8.23. Extract the Filename from a Windows Path
  154. 8.24. Extract the File Extension from a Windows Path
  155. 8.25. Strip Invalid Characters from Filenames
  156. 9. Markup and Data Formats
  157. Processing Markup and Data Formats with Regular Expressions
  158. 9.1. Find XML-Style Tags
  159. 9.2. Replace <b> Tags with <strong>
  160. 9.3. Remove All XML-Style Tags Except <em> and <strong>
  161. 9.4. Match XML Names
  162. 9.5. Convert Plain Text to HTML by Adding <p> and <br> Tags
  163. 9.6. Decode XML Entities
  164. 9.7. Find a Specific Attribute in XML-Style Tags
  165. 9.8. Add a cellspacing Attribute to <table> Tags That Do Not Already Include It
  166. 9.9. Remove XML-Style Comments
  167. 9.10. Find Words Within XML-Style Comments
  168. 9.11. Change the Delimiter Used in CSV Files
  169. 9.12. Extract CSV Fields from a Specific Column
  170. 9.13. Match INI Section Headers
  171. 9.14. Match INI Section Blocks
  172. 9.15. Match INI Name-Value Pairs
  173. Index
  199. About the Authors
  200. Colophon
  201. Copyright

4.19. Validate Password Complexity

    Problem

    You’re tasked with ensuring that any passwords chosen by your website users meet your organization’s minimum complexity requirements.

    Solution

    The following regular expressions check many individual conditions, and can be mixed and matched as necessary to meet your business requirements. At the end of this section, we’ve included several JavaScript code examples that show how you can tie these regular expressions together as part of a password security validation routine.

    Length between 8 and 32 characters

    ^.{8,32}$
    Regex options: Dot matches line breaks (“^ and $ match at line breaks” must not be set)
    Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

    Standard JavaScript doesn’t have a “dot matches line breaks” option. Use [\s\S] instead of a dot in JavaScript to ensure that the regex works correctly even for crazy passwords that include line breaks:

    ^[\s\S]{8,32}$
    Regex options: None (“^ and $ match at line breaks” must not be set)
    Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

ASCII visible and space characters only

If this next regex matches a password, you can be sure it includes only the characters A–Z, a–z, 0–9, space, and ASCII punctuation. No control characters, line breaks, or characters outside of the ASCII table are allowed:

^[\x20-\x7E]+$
Regex options: None (“^ and $ match at line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

If you want to additionally prevent the use of spaces, use ^[\x21-\x7E]+$ instead.
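
As a quick illustration of the difference between the two character ranges, here is a minimal sketch. The variable names asciiVisible and asciiNoSpace are ours, chosen only for this example:

```javascript
// Illustrative names; the regexes are the two shown above.
var asciiVisible = /^[\x20-\x7E]+$/; // letters, digits, punctuation, space
var asciiNoSpace = /^[\x21-\x7E]+$/; // same set, but space is rejected

console.log(asciiVisible.test("correct horse 42!")); // true
console.log(asciiNoSpace.test("correct horse 42!")); // false (contains spaces)
console.log(asciiVisible.test("caf\u00E9"));         // false (é is outside ASCII)
console.log(asciiVisible.test("tab\tseparated"));    // false (tab is a control character)
```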

One or more uppercase letters

ASCII uppercase letters only:

[A-Z]
Regex options: None (“case insensitive” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Any Unicode uppercase letter:

\p{Lu}
Regex options: None (“case insensitive” must not be set)
Regex flavors: .NET, Java, PCRE, Perl, Ruby 1.9

If you want to check for the presence of any letter character (not limited to uppercase), enable the “case insensitive” option or use [A-Za-z]. For the Unicode case, you can use \p{L}, which matches any kind of letter from any language.
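
The flavor lists in this recipe reflect the engines available when it was written. JavaScript engines that implement ECMAScript 2018 also accept Unicode property escapes when the regex carries the u flag, so something along the following lines may work in current browsers and Node.js. Treat this as a sketch that assumes such a runtime; older engines will throw a syntax error on these literals. The variable names are illustrative:

```javascript
// Assumption: the runtime supports ES2018 Unicode property escapes (u flag).
var anyUpper = /\p{Lu}/u;  // any Unicode uppercase letter
var anyLetter = /\p{L}/u;  // any kind of letter from any language

console.log(anyUpper.test("\u00DCbung"));   // true  (Ü is an uppercase letter)
console.log(anyUpper.test("stra\u00DFe"));  // false (no uppercase letters)
console.log(anyLetter.test("1234\u00E9"));  // true  (é is a letter)
console.log(anyLetter.test("1234!"));       // false (digits and punctuation only)
```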

One or more lowercase letters

ASCII lowercase letters only:

[a-z]
Regex options: None (“case insensitive” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Any Unicode lowercase letter:

\p{Ll}
Regex options: None (“case insensitive” must not be set)
Regex flavors: .NET, Java, PCRE, Perl, Ruby 1.9

One or more numbers

[0-9]
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

One or more special characters

ASCII punctuation and spaces only:

[ !"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~]
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Anything other than ASCII letters and numbers:

[^A-Za-z0-9]
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Disallow three or more sequential identical characters

This next regex is intended to rule out passwords like 111111. It works in the opposite way of the others in this recipe. If it matches, the password doesn’t meet the condition. In other words, the regex only matches strings that repeat a character three times in a row.

(.)\1\1
Regex options: Dot matches line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

([\s\S])\1\1
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
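
Because this regex matches the case you want to reject, the result of the test has to be negated. A minimal sketch of that inversion follows; the function name hasNoTripleRepeat is hypothetical:

```javascript
// Hypothetical helper: returns true when the password contains NO run
// of three identical characters in a row.
function hasNoTripleRepeat(password) {
    var tripleRepeat = /([\s\S])\1\1/;
    // The regex matches the bad case, so negate the result
    return !tripleRepeat.test(password);
}

console.log(hasNoTripleRepeat("aab1ccd"));     // true
console.log(hasNoTripleRepeat("pass111word")); // false
```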

Example JavaScript solution, basic

The following code combines five password requirements:

  • Length between 8 and 32 characters.

  • One or more uppercase letters.

  • One or more lowercase letters.

  • One or more numbers.

  • One or more special characters (ASCII punctuation or space characters).

function validate(password) {
    var minMaxLength = /^[\s\S]{8,32}$/,
        upper = /[A-Z]/,
        lower = /[a-z]/,
        number = /[0-9]/,
        special = /[ !"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~]/;

    if (minMaxLength.test(password) &&
        upper.test(password) &&
        lower.test(password) &&
        number.test(password) &&
        special.test(password)
    ) {
        return true;
    }

    return false;
}

The validate function just shown returns true if the provided string meets the password requirements. Otherwise, false is returned.
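
Since each rule lives in its own regex, the same checks can also report which requirement failed rather than returning a bare true or false. The following is a sketch under our own assumptions: the rules table, the listFailures name, and the message wording are all illustrative and not part of the original listing.

```javascript
// Hypothetical rule table: each entry pairs one of this recipe's regexes
// with a message. The messages are illustrative only.
var rules = [
    { regex: /^[\s\S]{8,32}$/, message: "Must be 8 to 32 characters long" },
    { regex: /[A-Z]/, message: "Must contain an uppercase letter" },
    { regex: /[a-z]/, message: "Must contain a lowercase letter" },
    { regex: /[0-9]/, message: "Must contain a number" },
    { regex: /[ !"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~]/,
      message: "Must contain a special character" }
];

// Returns an array with the message of every rule the password fails.
// An empty array means the password meets all five requirements.
function listFailures(password) {
    var failures = [];
    for (var i = 0; i < rules.length; i++) {
        if (!rules[i].regex.test(password)) {
            failures.push(rules[i].message);
        }
    }
    return failures;
}

console.log(listFailures("Passw0rd!")); // []
console.log(listFailures("password").length); // 3
```

Calling listFailures("password") reports the missing uppercase letter, number, and special character, which a form could show next to the password field.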

Example JavaScript solution, with x out of y validation

This next example enforces a minimum and maximum password length (8–32 characters), and additionally requires that at least three of the following four character types are present:

  • One or more uppercase letters.

  • One or more lowercase letters.

  • One or more numbers.

  • One or more special characters (anything other than ASCII letters and numbers).

function validate(password) {
    var minMaxLength = /^[\s\S]{8,32}$/,
        upper = /[A-Z]/,
        lower = /[a-z]/,
        number = /[0-9]/,
        special = /[^A-Za-z0-9]/,
        count = 0;

    if (minMaxLength.test(password)) {
        // Only need 3 out of 4 of these to match
        if (upper.test(password)) count++;
        if (lower.test(password)) count++;
        if (number.test(password)) count++;
        if (special.test(password)) count++;
    }

    return count >= 3;
}

As before, this modified validate function returns true if the provided password meets the overall requirements. If not, it returns false.

Example JavaScript solution, with password security ranking

This final code example is the most complicated of the bunch. It assigns a positive or negative score to various conditions, and uses the regexes we’ve been looking at to help calculate an overall score for the provided password. The rankPassword function returns a number from 0–4 that corresponds to the password rankings “Too Short,” “Weak,” “Medium,” “Strong,” and “Very Strong”:

var rank = {
    TOO_SHORT: 0,
    WEAK: 1,
    MEDIUM: 2,
    STRONG: 3,
    VERY_STRONG: 4
};

function rankPassword(password) {
    var upper = /[A-Z]/,
        lower = /[a-z]/,
        number = /[0-9]/,
        special = /[^A-Za-z0-9]/,
        minLength = 8,
        score = 0;

    if (password.length < minLength) {
        return rank.TOO_SHORT; // End early
    }

    // Increment the score for each of these conditions
    if (upper.test(password)) score++;
    if (lower.test(password)) score++;
    if (number.test(password)) score++;
    if (special.test(password)) score++;

    // Penalize if there aren't at least three char types
    if (score < 3) score--;

    if (password.length > minLength) {
        // Increment the score for every 2 chars longer than the minimum
        score += Math.floor((password.length - minLength) / 2);
    }

    // Return a ranking based on the calculated score
    if (score < 3) return rank.WEAK; // score is 2 or lower
    if (score < 4) return rank.MEDIUM; // score is 3
    if (score < 6) return rank.STRONG; // score is 4 or 5
    return rank.VERY_STRONG; // score is 6 or higher
}

// Test it...
var result = rankPassword("password1"),
    labels = ["Too Short", "Weak", "Medium", "Strong", "Very Strong"];

alert(labels[result]); // -> Weak

Because of how this password ranking algorithm is designed, it can serve two purposes equally well. First, it can be used to give users guidance about the quality of their password while they’re still typing it. Second, it lets you easily reject passwords that don’t rank at whatever you choose as your minimum security threshold. For example, the condition if(result <= rank.MEDIUM) can be used to reject any password that isn’t ranked as “Strong” or “Very Strong.”

Discussion

Users are notorious for choosing simple or common passwords that are easy to remember. But easy to remember doesn’t necessarily translate into something that keeps their account and your company’s information safe. It’s therefore typically necessary to protect users from themselves by enforcing minimum password complexity rules. However, the exact rules to use can vary widely between businesses and systems, which is why this recipe includes numerous regexes that serve as the raw ingredients to help you cook up whatever combination of validation rules you choose.

Limiting each regex to a specific rule brings the additional benefit of simplicity. As a result, all of the regexes shown thus far are fairly straightforward. Following are a few additional notes on each of them:

Length between 8 and 32 characters

To require a different minimum or maximum length, change the numbers used as the upper and lower bounds for the quantifier {8,32}. If you don’t want to specify a maximum, use {8,}, or remove the $ anchor and change the quantifier to {8}.

All of the programming languages covered by this book provide a simple and efficient way to determine the length of a string. However, using a regex allows you to test both the minimum and maximum length at the same time, and makes it easier to mix and match password complexity rules by choosing from a list of regexes.
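
To make the trade-off concrete, here is a small sketch that places the regex variants next to a plain length comparison. The names lengthRegex, minOnlyRegex, minOnlyNoAnchor, and lengthOk are ours. In JavaScript, both approaches count UTF-16 code units, so they should agree on any given string:

```javascript
var lengthRegex = /^[\s\S]{8,32}$/;   // minimum and maximum
var minOnlyRegex = /^[\s\S]{8,}$/;    // minimum only, no upper bound
var minOnlyNoAnchor = /^[\s\S]{8}/;   // same effect: $ removed, quantifier {8}

// Equivalent check using the string's length property
function lengthOk(password) {
    return password.length >= 8 && password.length <= 32;
}

var tooLong = new Array(34).join("a"); // 33 characters

console.log(lengthRegex.test("short"));      // false
console.log(lengthRegex.test("longenough")); // true
console.log(lengthRegex.test(tooLong));      // false
console.log(minOnlyRegex.test(tooLong));     // true
console.log(minOnlyNoAnchor.test(tooLong));  // true
console.log(lengthOk(tooLong));              // false
```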

ASCII visible and space characters only

As mentioned earlier, this regex allows the characters A–Z, a–z, 0–9, space, and ASCII punctuation only. To be more specific about the allowed punctuation characters, they are !, ", #, $, %, &, ', (, ), *, +, -, ., /, :, ;, <, =, >, ?, @, [, \, ], ^, _, `, {, |, }, ~, and comma. In other words, all the punctuation you can type using a standard U.S. keyboard.

Limiting passwords to these characters can help avoid character encoding related issues, but keep in mind that it also limits the potential complexity of your passwords.

Uppercase letters

To check whether the password contains two or more uppercase letters, use [A-Z].*[A-Z]. For three or more, use [A-Z].*[A-Z].*[A-Z] or (?:[A-Z].*){3}. If you’re allowing any Unicode uppercase letters, just change each [A-Z] in the preceding examples to \p{Lu}. In JavaScript, replace the dots with [\s\S].
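
The “N or more” pattern can be generated rather than written out by hand. The sketch below builds the (?:[A-Z].*){3} form with the dots already replaced by [\s\S] for JavaScript. The atLeast helper is hypothetical, and it assumes charClass is a trusted character class string such as "[A-Z]", "[a-z]", or "[0-9]":

```javascript
// Hypothetical helper: builds a regex that requires at least `count`
// characters matching `charClass` anywhere in the subject string.
// charClass is inserted as-is, so pass only trusted class strings.
function atLeast(charClass, count) {
    return new RegExp("(?:" + charClass + "[\\s\\S]*){" + count + "}");
}

var twoUpper = atLeast("[A-Z]", 2);

console.log(twoUpper.source);           // (?:[A-Z][\s\S]*){2}
console.log(twoUpper.test("PassWord")); // true
console.log(twoUpper.test("Password")); // false
```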

Lowercase letters

As with the “uppercase letters” regex, you can check whether the password contains at least two lowercase letters using [a-z].*[a-z]. For three or more, use [a-z].*[a-z].*[a-z] or (?:[a-z].*){3}. If you’re allowing any Unicode lowercase letters, change each [a-z] to \p{Ll}. In JavaScript, replace the dots with [\s\S].

Numbers

You can check whether the password contains two or more numbers using [0-9].*[0-9], and [0-9].*[0-9].*[0-9] or (?:[0-9].*){3} for three or more. In JavaScript, replace the dots with [\s\S].

We didn’t include a listing for matching any Unicode decimal digit (\p{Nd}), because it’s uncommon to treat characters other than 0–9 as numbers (although readers who speak Arabic or Hindi might disagree!).

Special characters

Use the same principles shown for letters and numbers if you want to require more than one special character. For instance, using [^A-Za-z0-9].*[^A-Za-z0-9] would require the password to contain at least two special characters.

Note that [^A-Za-z0-9] is different than \W (the negated version of the \w shorthand for word characters). \W goes beyond [^A-Za-z0-9] by additionally excluding the underscore, which we don’t want to do here. In some regex flavors, \W also excludes any Unicode letter or decimal digit from any language.
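
The underscore difference is easy to see in JavaScript, where \w is limited to ASCII letters, digits, and the underscore. A brief sketch, with variable names of our own choosing:

```javascript
var special = /[^A-Za-z0-9]/; // treats underscore as a special character
var nonWord = /\W/;           // does not, because _ is a word character

console.log(special.test("pass_word1")); // true
console.log(nonWord.test("pass_word1")); // false
console.log(nonWord.test("pass-word1")); // true (hyphen is not a word character)
```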

Disallow three or more sequential identical characters

This regex matches repeated characters using backreferences to a previously matched character. Recipe 2.10 explains how backreferences work. If you want to disallow any use of repeated characters, change the regex to (.)\1. To allow up to three repeated characters but not four, use (.)\1\1\1 or (.)\1{3}.

Remember that you need to check whether this regular expression doesn’t match your subject text. A match would indicate that repeated characters are present.
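
If the allowed run length is a policy setting, the backreference quantifier can be filled in at runtime. This is a sketch only; hasRun is a hypothetical name, and it assumes runLength is an integer of 2 or more:

```javascript
// Hypothetical helper: returns true if the password contains at least
// runLength identical characters in a row. Assumes runLength >= 2.
function hasRun(password, runLength) {
    // One captured character followed by (runLength - 1) backreferences
    var run = new RegExp("([\\s\\S])\\1{" + (runLength - 1) + "}");
    return run.test(password);
}

console.log(hasRun("aabbcc", 2)); // true  (any repeat at all)
console.log(hasRun("aabbcc", 3)); // false
console.log(hasRun("aaab", 3));   // true
console.log(hasRun("aaab", 4));   // false (three in a row is allowed)
```

A password passes the rule when hasRun returns false for the shortest run you want to forbid.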

Example JavaScript solutions

The three blocks of JavaScript example code each use this recipe’s regular expressions a bit differently.

The first example requires all conditions to be met or else the password fails. In the second example, acing the password test requires three out of four conditional requirements to be met. The third example, titled “Example JavaScript solution, with password security ranking,” is probably the most interesting. It includes a function called rankPassword that does what it says on the tin and ranks passwords by how secure they are. It can thus help provide a more user-friendly experience and encourage users to choose strong passwords.

The rankPassword function’s password ranking algorithm increments and decrements an internal password score based on multiple conditions. If the password’s length is less than the specified minimum of eight characters, the function returns early with the numeric equivalent of “Too Short.” Not including at least three character types incurs a one-point penalty, but this can be balanced out because every two additional characters after the minimum of eight adds a point to the running score.

The code can of course be customized to further improve it or to meet your particular requirements. However, it works quite well as-is, regardless of what you throw at it. As a sanity check, we ran it against several hundred of the known most common (and therefore most insecure) user passwords. All came out ranked as either “Too Short” or “Weak,” which is exactly what we were hoping for.

Caution

Using JavaScript to validate passwords in a web browser can be very beneficial for your users, but make sure to also implement your validation routine on the server. If you don’t, it won’t work for users who disable JavaScript or use custom scripts to circumvent your client-side validation.

Variations

Validate multiple password rules with a single regex

Up to this point, we’ve split password validation into discrete rules that can be tested using simple regexes. That’s usually the best approach. It keeps the regexes readable, and makes it easier to provide error messages that identify why a password isn’t up to code. It can even help you rank a password’s complexity, as we’ve seen. However, there may be times when you don’t care about all that, or when one regex is all you can use. In any case, it’s common for people to want to validate multiple password rules using a single regex, so let’s take a look at how it can be done. We’ll use the following requirements:

  • Length between 8 and 32 characters.

  • One or more uppercase letters.

  • One or more lowercase letters.

  • One or more numbers.

Here’s a regex that pulls it all off:

^(?=.{8,32}$)(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9]).*
Regex options: Dot matches line breaks (“^ and $ match at line breaks” must not be set)
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

This regex can be used with standard JavaScript (which doesn’t have a “dot matches line breaks” option) if you replace each of the five dots with [\s\S]. Otherwise, you might fail to match some valid passwords that contain line breaks. Either way, though, the regex won’t match any invalid passwords.

Notice how this regular expression puts each validation rule into its own lookahead group at the beginning of the regex. Because lookahead does not consume any characters as part of a match (see Recipe 2.16), each lookahead test runs from the very beginning of the string. When a lookahead succeeds, the regex moves along to test the next one, starting from the same position. Any lookahead that fails to find a match causes the overall match to fail.

The first lookahead, (?=.{8,32}$), ensures that any match is between 8 and 32 characters long. Make sure to keep the $ anchor after {8,32}, otherwise the match will succeed even when there are more than 32 characters. The next three lookaheads search one by one for an uppercase letter, lowercase letter, and digit. Because each lookahead searches from the beginning of the string, they use .* before their respective character classes. This allows other characters to appear before the character type that they’re searching for.

By following the approach shown here, it’s possible to add as many lookahead-based password tests as you want to a single regex, so long as all of the conditions are always required.

The .* at the very end of this regex is not actually required. Without it, though, the regex would return a zero-length empty string when it successfully matches. The trailing .* lets the regex include the password itself in successful match results.
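
For standard JavaScript, the combined regex with all five dots replaced by [\s\S] might be used as in the sketch below. The names allRules and validateSingle are ours, and the second half shows the matched text that the trailing [\s\S]* makes available:

```javascript
// The single-regex version with each dot replaced by [\s\S]
var allRules = /^(?=[\s\S]{8,32}$)(?=[\s\S]*[A-Z])(?=[\s\S]*[a-z])(?=[\s\S]*[0-9])[\s\S]*/;

// Hypothetical wrapper around the combined regex
function validateSingle(password) {
    return allRules.test(password);
}

console.log(validateSingle("Passw0rd"));   // true
console.log(validateSingle("password1"));  // false (no uppercase letter)
console.log(validateSingle("Pw0"));        // false (too short)
console.log(validateSingle("Pass\nw0rd")); // true  (line breaks are allowed)

// Thanks to the trailing [\s\S]*, the match includes the password itself
var match = allRules.exec("Passw0rdLong");
console.log(match[0]); // Passw0rdLong
```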

Caution

It’s equally valid to write this regex as ^(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9]).{8,32}$, with the length test coming after the lookaheads. Unfortunately, writing it this way triggers a bug in Internet Explorer 5.5–8 that prevents it from working correctly. Microsoft fixed the bug in the new regex engine included in IE9.

See Also

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.2 explains how to match nonprinting characters. Recipe 2.3 explains character classes. Recipe 2.4 explains that the dot matches any character. Recipe 2.5 explains anchors. Recipe 2.7 explains how to match Unicode characters. Recipe 2.9 explains grouping. Recipe 2.10 explains backreferences. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround.