Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Regular Expressions Cookbook, 2nd Edition, by Steven Levithan. Published by O'Reilly Media, Inc., 2012.
  1. Cover
  2. Regular Expressions Cookbook
  3. Preface
  4. Caught in the Snarls of Different Versions
  5. Intended Audience
  6. Technology Covered
  7. Organization of This Book
  8. Conventions Used in This Book
  9. Using Code Examples
  10. Safari® Books Online
  11. How to Contact Us
  12. Acknowledgments
  13. 1. Introduction to Regular Expressions
  14. Regular Expressions Defined
  15. Search and Replace with Regular Expressions
  16. Tools for Working with Regular Expressions
  17. 2. Basic Regular Expression Skills
  18. 2.1. Match Literal Text
  19. 2.2. Match Nonprintable Characters
  20. 2.3. Match One of Many Characters
  21. 2.4. Match Any Character
  22. 2.5. Match Something at the Start and/or the End of a Line
  23. 2.6. Match Whole Words
  24. 2.7. Unicode Code Points, Categories, Blocks, and Scripts
  25. 2.8. Match One of Several Alternatives
  26. 2.9. Group and Capture Parts of the Match
  27. 2.10. Match Previously Matched Text Again
  28. 2.11. Capture and Name Parts of the Match
  29. 2.12. Repeat Part of the Regex a Certain Number of Times
  30. 2.13. Choose Minimal or Maximal Repetition
  31. 2.14. Eliminate Needless Backtracking
  32. 2.15. Prevent Runaway Repetition
  33. 2.16. Test for a Match Without Adding It to the Overall Match
  34. 2.17. Match One of Two Alternatives Based on a Condition
  35. 2.18. Add Comments to a Regular Expression
  36. 2.19. Insert Literal Text into the Replacement Text
  37. 2.20. Insert the Regex Match into the Replacement Text
  38. 2.21. Insert Part of the Regex Match into the Replacement Text
  39. 2.22. Insert Match Context into the Replacement Text
  40. 3. Programming with Regular Expressions
  41. Programming Languages and Regex Flavors
  42. 3.1. Literal Regular Expressions in Source Code
  43. 3.2. Import the Regular Expression Library
  44. 3.3. Create Regular Expression Objects
  45. 3.4. Set Regular Expression Options
  46. 3.5. Test If a Match Can Be Found Within a Subject String
  47. 3.6. Test Whether a Regex Matches the Subject String Entirely
  48. 3.7. Retrieve the Matched Text
  49. 3.8. Determine the Position and Length of the Match
  50. 3.9. Retrieve Part of the Matched Text
  51. 3.10. Retrieve a List of All Matches
  52. 3.11. Iterate over All Matches
  53. 3.12. Validate Matches in Procedural Code
  54. 3.13. Find a Match Within Another Match
  55. 3.14. Replace All Matches
  56. 3.15. Replace Matches Reusing Parts of the Match
  57. 3.16. Replace Matches with Replacements Generated in Code
  58. 3.17. Replace All Matches Within the Matches of Another Regex
  59. 3.18. Replace All Matches Between the Matches of Another Regex
  60. 3.19. Split a String
  61. 3.20. Split a String, Keeping the Regex Matches
  62. 3.21. Search Line by Line
  63. Construct a Parser
  64. 4. Validation and Formatting
  65. 4.1. Validate Email Addresses
  66. 4.2. Validate and Format North American Phone Numbers
  67. 4.3. Validate International Phone Numbers
  68. 4.4. Validate Traditional Date Formats
  69. 4.5. Validate Traditional Date Formats, Excluding Invalid Dates
  70. 4.6. Validate Traditional Time Formats
  71. 4.7. Validate ISO 8601 Dates and Times
  72. 4.8. Limit Input to Alphanumeric Characters
  73. 4.9. Limit the Length of Text
  74. 4.10. Limit the Number of Lines in Text
  75. 4.11. Validate Affirmative Responses
  76. 4.12. Validate Social Security Numbers
  77. 4.13. Validate ISBNs
  78. 4.14. Validate ZIP Codes
  79. 4.15. Validate Canadian Postal Codes
  80. 4.16. Validate U.K. Postcodes
  81. 4.17. Find Addresses with Post Office Boxes
  82. 4.18. Reformat Names From “FirstName LastName” to “LastName, FirstName”
  83. 4.19. Validate Password Complexity
  84. 4.20. Validate Credit Card Numbers
  85. 4.21. European VAT Numbers
  86. 5. Words, Lines, and Special Characters
  87. 5.1. Find a Specific Word
  88. 5.2. Find Any of Multiple Words
  89. 5.3. Find Similar Words
  90. 5.4. Find All Except a Specific Word
  91. 5.5. Find Any Word Not Followed by a Specific Word
  92. 5.6. Find Any Word Not Preceded by a Specific Word
  93. 5.7. Find Words Near Each Other
  94. 5.8. Find Repeated Words
  95. 5.9. Remove Duplicate Lines
  96. 5.10. Match Complete Lines That Contain a Word
  97. 5.11. Match Complete Lines That Do Not Contain a Word
  98. 5.12. Trim Leading and Trailing Whitespace
  99. 5.13. Replace Repeated Whitespace with a Single Space
  100. 5.14. Escape Regular Expression Metacharacters
  101. 6. Numbers
  102. 6.1. Integer Numbers
  103. 6.2. Hexadecimal Numbers
  104. 6.3. Binary Numbers
  105. 6.4. Octal Numbers
  106. 6.5. Decimal Numbers
  107. 6.6. Strip Leading Zeros
  108. 6.7. Numbers Within a Certain Range
  109. 6.8. Hexadecimal Numbers Within a Certain Range
  110. 6.9. Integer Numbers with Separators
  111. 6.10. Floating-Point Numbers
  112. 6.11. Numbers with Thousand Separators
  113. 6.12. Add Thousand Separators to Numbers
  114. 6.13. Roman Numerals
  115. 7. Source Code and Log Files
  116. Keywords
  117. Identifiers
  118. Numeric Constants
  119. Operators
  120. Single-Line Comments
  121. Multiline Comments
  122. All Comments
  123. Strings
  124. Strings with Escapes
  125. Regex Literals
  126. Here Documents
  127. Common Log Format
  128. Combined Log Format
  129. Broken Links Reported in Web Logs
  130. 8. URLs, Paths, and Internet Addresses
  131. 8.1. Validating URLs
  132. 8.2. Finding URLs Within Full Text
  133. 8.3. Finding Quoted URLs in Full Text
  134. 8.4. Finding URLs with Parentheses in Full Text
  135. 8.5. Turn URLs into Links
  136. 8.6. Validating URNs
  137. 8.7. Validating Generic URLs
  138. 8.8. Extracting the Scheme from a URL
  139. 8.9. Extracting the User from a URL
  140. 8.10. Extracting the Host from a URL
  141. 8.11. Extracting the Port from a URL
  142. 8.12. Extracting the Path from a URL
  143. 8.13. Extracting the Query from a URL
  144. 8.14. Extracting the Fragment from a URL
  145. 8.15. Validating Domain Names
  146. 8.16. Matching IPv4 Addresses
  147. 8.17. Matching IPv6 Addresses
  148. 8.18. Validate Windows Paths
  149. 8.19. Split Windows Paths into Their Parts
  150. 8.20. Extract the Drive Letter from a Windows Path
  151. 8.21. Extract the Server and Share from a UNC Path
  152. 8.22. Extract the Folder from a Windows Path
  153. 8.23. Extract the Filename from a Windows Path
  154. 8.24. Extract the File Extension from a Windows Path
  155. 8.25. Strip Invalid Characters from Filenames
  156. 9. Markup and Data Formats
  157. Processing Markup and Data Formats with Regular Expressions
  158. 9.1. Find XML-Style Tags
  159. 9.2. Replace <b> Tags with <strong>
  160. 9.3. Remove All XML-Style Tags Except <em> and <strong>
  161. 9.4. Match XML Names
  162. 9.5. Convert Plain Text to HTML by Adding <p> and <br> Tags
  163. 9.6. Decode XML Entities
  164. 9.7. Find a Specific Attribute in XML-Style Tags
  165. 9.8. Add a cellspacing Attribute to <table> Tags That Do Not Already Include It
  166. 9.9. Remove XML-Style Comments
  167. 9.10. Find Words Within XML-Style Comments
  168. 9.11. Change the Delimiter Used in CSV Files
  169. 9.12. Extract CSV Fields from a Specific Column
  170. 9.13. Match INI Section Headers
  171. 9.14. Match INI Section Blocks
  172. 9.15. Match INI Name-Value Pairs
  173. Index
  199. About the Authors
  200. Colophon
  201. Copyright

6.12. Add Thousand Separators to Numbers

    Problem

    You want to add commas as the thousand separator to numbers with four or more digits. You want to do this both for individual numbers and for any numbers in a string or file.

    For example, you’d like to convert this:

    There are more than 7000000000 people in the world today.

    To this:

    There are more than 7,000,000,000 people in the world today.

    Tip

    Not all countries and written languages use the same character as the thousand separator. The solutions here use a comma, but some people use dots, underscores, apostrophes, or spaces for the same purpose. If you want, you can replace the commas in this recipe’s replacement strings with one of these other characters.

    Solution

    The following solutions work both for individual numbers and for all numbers in a given string. They’re designed to be used in a search-and-replace for all matches.

    Basic solution

    Regular expression:

    [0-9](?=(?:[0-9]{3})+(?![0-9]))
    Regex options: None
    Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

    Although this regular expression works equally well with all of the flavors covered by this book, the accompanying replacement text is decidedly less portable.

    Replacement:

    $&,
    Replacement text flavors: .NET, JavaScript, Perl
    $0,
    Replacement text flavors: .NET, Java, XRegExp, PHP
    \0,
    Replacement text flavors: PHP, Ruby
    \&,
    Replacement text flavor: Ruby
    \g<0>,
    Replacement text flavor: Python

    These replacement strings all put the matched number back using backreference zero (the entire match, which in this case is a single digit), followed by a comma. When programming, you can implement this regular expression search-and-replace as explained in Recipe 3.15.
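
As a minimal sketch of this search-and-replace in Python (one of the flavors listed above), the basic regex can be paired with the \g<0>, replacement text through re.sub. The function name add_commas is ours, chosen only for illustration.

```python
import re

# Basic solution: a digit followed by digits in exact sets of three.
COMMA_DIGIT = re.compile(r"[0-9](?=(?:[0-9]{3})+(?![0-9]))")

def add_commas(text):
    """Insert a comma after every digit matched by the basic regex."""
    # \g<0> puts the matched digit back; the comma follows it.
    return COMMA_DIGIT.sub(r"\g<0>,", text)

print(add_commas("There are more than 7000000000 people in the world today."))
# There are more than 7,000,000,000 people in the world today.
```

Because re.sub replaces every match, the same call handles a single number or a whole document.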

Match separator positions only, using lookbehind

Regular expression:

(?<=[0-9])(?=(?:[0-9]{3})+(?![0-9]))
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9

Replacement:

,
Replacement text flavors: .NET, Java, Perl, PHP, Python, Ruby

Recipe 3.14 explains how you can implement this basic regular expression search-and-replace when programming.

This version doesn’t work with Ruby 1.8, or with JavaScript engines that predate ECMAScript 2018, because they don’t support any type of lookbehind. This time around, however, we need only one version of the replacement text because we’re simply using a comma without any backreference as the replacement.

Discussion

Introduction

Adding thousand separators to numbers in your documents, data, and program output is a simple but effective way to improve their readability and appearance.

Some of the programming languages covered by this book provide built-in methods to add locale-aware thousand separators to numbers. For instance, in Python you can use locale.format_string('%d', 1000000, grouping=True) to convert the number 1000000 to the string '1,000,000', assuming you’ve previously set your program to use a locale that uses commas as the thousand separator. (Older code used locale.format('%d', 1000000, True), which was removed in Python 3.12.) For other locales, the number might be separated using dots, underscores, apostrophes, or spaces.
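
As a side note, and only as a sketch: when locale awareness is not needed at all, Python's format specification language also accepts a "," option that always groups with commas, whatever locale is set. The variable names below are illustrative.

```python
# The "," format option groups digits with commas and ignores the locale.
population = 7000000000

grouped = format(population, ",")
also_grouped = f"{population:,}"

print(grouped)       # 7,000,000,000
print(also_grouped)  # 7,000,000,000
```

This only helps when you already hold the value as a number. For numbers embedded in text, a regex-based search-and-replace is still the tool to reach for.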

However, locale-aware processing is not always available, reliable, or appropriate. In the finance world, for example, using commas as thousand separators is the norm, regardless of location. Internationalization might not be a relevant issue to begin with when working in a text editor rather than programming. For these reasons, and for simplicity, in this recipe we’ve assumed you always want to use commas as the thousand separator. In the upcoming section, we’ve also assumed you want to use dots as decimal points. If you need to use other characters, feel free to swap them in.

Caution

Although adding thousand separators to all numbers in a file or string can improve the presentation of your data, it’s important to understand what kind of content you’re dealing with before doing so. For instance, you probably don’t want to add commas to IDs, four-digit years, and ZIP codes. Documents and data that include these kinds of numbers might not be good candidates for automated comma insertion.

Basic solution

This regular expression matches any single digit that has digits on the right in exact sets of three. It therefore matches twice in the string 12345678, finding the digits 2 and 5. All the other digits are not followed by an exact multiple of three digits.

The accompanying replacement text puts back the matched digit using backreference zero (the entire match), and follows it with a comma. That leaves us with 12,345,678. Voilà!

To explain how the regex determines which digits to match, we’ll split it into two parts. The first part is the leading character class [0-9] that matches any single digit. The second part is the positive lookahead (?=(?:[0-9]{3})+(?![0-9])) that causes the match attempt to fail unless it’s at a position followed by digits in exact sets of three. In other words, the lookahead ensures that the regex matches only the digits that should be followed by a comma. Recipe 2.16 explains how lookahead works.

The (?:[0-9]{3})+ within the lookahead matches digits in sets of three. The negative lookahead (?![0-9]) that follows is there to ensure that no digits come immediately after the digits we matched in sets of three. Otherwise, the outer positive lookahead would be satisfied by any number of following digits, so long as there were at least three.
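
To make the role of the negative lookahead concrete, here is a small Python sketch that lists the digits matched in 12345678, with and without the trailing (?![0-9]). The names exact and loose are ours. The second pattern exists only for comparison and is not part of the recipe.

```python
import re

subject = "12345678"

# Full regex: the trailing (?![0-9]) forces exact sets of three digits.
exact = re.compile(r"[0-9](?=(?:[0-9]{3})+(?![0-9]))")
# Same regex without the negative lookahead, for comparison only.
loose = re.compile(r"[0-9](?=(?:[0-9]{3})+)")

exact_digits = [m.group() for m in exact.finditer(subject)]
loose_digits = [m.group() for m in loose.finditer(subject)]

print(exact_digits)  # ['2', '5']
print(loose_digits)  # ['1', '2', '3', '4', '5']
```

Without the negative lookahead, every digit that has at least three digits after it qualifies, which would put commas in the wrong places.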

Match separator positions only, using lookbehind

This adaptation of the previous regex doesn’t match any digits at all. Instead, it matches only the positions where we want to insert commas within numbers. These positions are wherever there are digits on the right in exact sets of three, and at least one digit on the left.

The lookahead used to search for sets of exactly three digits on the right is the same as in the last regex. The difference here is that, instead of starting the regex with [0-9] to match a digit, we instead assert that there is at least one digit to the left by using the positive lookbehind (?<=[0-9]). Without the lookbehind, the regex would match the position to the left of 123 and therefore the search-and-replace would convert it to ,123. Lookbehind is explained together with lookahead in Recipe 2.16.

Ruby 1.8 and JavaScript engines that predate ECMAScript 2018 don’t support lookbehind, so they cannot use this version of the regular expression.
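
The effect of the lookbehind can be checked with a short Python sketch that runs the position-only regex with and without it. The variable names are ours, and the version without lookbehind is shown only to illustrate the failure.

```python
import re

# Lookahead shared by both patterns: digits on the right in exact sets of three.
ahead = r"(?=(?:[0-9]{3})+(?![0-9]))"

with_lookbehind = re.compile(r"(?<=[0-9])" + ahead)
without_lookbehind = re.compile(ahead)

print(with_lookbehind.sub(",", "123"))        # 123
print(with_lookbehind.sub(",", "123456"))     # 123,456
print(without_lookbehind.sub(",", "123"))     # ,123
print(without_lookbehind.sub(",", "123456"))  # ,123,456
```

Since the matches are zero-length, the replacement text is just the comma, with no backreference needed.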

Variations

Don’t add commas after a decimal point

The preceding regexes add commas to any sequence of four or more digits. A rather glaring issue with this basic approach is that it can add commas to digits that come after a dot as the decimal separator, so long as there are at least four digits after the dot. Following are two ways to fix this.
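
A quick Python sketch shows the issue with the basic regex. The sample numbers are ours.

```python
import re

basic = re.compile(r"[0-9](?=(?:[0-9]{3})+(?![0-9]))")

# Three digits after the dot: the fractional part is left alone.
print(basic.sub(r"\g<0>,", "1234567.891"))  # 1,234,567.891
# Four or more digits after the dot: commas leak into the fraction.
print(basic.sub(r"\g<0>,", "3.14159265"))   # 3.14,159,265
```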

Use infinite lookbehind

The problem is easy to solve if you’re able to use an infinite-length quantifier like + or at least a long finite-length quantifier like {1,100} within lookbehind.

Regular expression:

[0-9](?=(?:[0-9]{3})+(?![0-9]))(?<!\.[0-9]+)
Regex options: None
Regex flavors: .NET
[0-9](?=(?:[0-9]{3})+(?![0-9]))(?<!\.[0-9]{1,100})
Regex options: None
Regex flavors: .NET, Java

Replacement:

$0,
Replacement text flavors: .NET, Java

The first regex here works only in .NET, because of the + in the lookbehind. The second regex works in both .NET and Java, because Java supports any finite-length quantifier inside lookbehind, even arbitrarily long interval quantifiers like {1,100}. The .NET-only version therefore works correctly with any number, whereas the Java version avoids adding commas to digits after a decimal point only when there are 100 or fewer digits after the dot. You can bump up the second number in the {1,100} quantifier if you want to support even longer numbers to the right of a decimal separator.

With both regexes, we’ve put the new lookbehind at the end of the pattern. The regexes could be restructured to add the lookbehind at the front, as you might intuitively expect, but we’ve done it this way to optimize efficiency. Since the lookbehind is the slowest part of the regex, putting it at the end lets the regex fail more quickly at positions within the subject string where the lookbehind doesn’t need to be evaluated in order to rule out a match.
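
To illustrate why the flavor lists for these two regexes are so short, here is a small check against Python's standard re module, which accepts only fixed-width lookbehind. The helper name compiles is ours, and the last pattern is a fixed-width stand-in shown only for contrast, not a fix for the decimal problem.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

digit = r"[0-9](?=(?:[0-9]{3})+(?![0-9]))"

print(compiles(digit + r"(?<!\.[0-9]+)"))        # False: + inside lookbehind
print(compiles(digit + r"(?<!\.[0-9]{1,100})"))  # False: variable-length interval
print(compiles(digit + r"(?<!\.[0-9])"))         # True: fixed width
```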

Search-and-replace within matched numbers

If you’re not working with .NET or Java and therefore can’t look as far back into the subject string as you want, you can still use fixed-length lookbehind to help match entire numbers that aren’t preceded by a dot. That lets you identify the numbers that qualify for having commas added (and correctly exclude any digits that come after a decimal point), but because it matches entire numbers, you can’t simply include a comma in the replacement string and be done with it.

Completing the solution requires using two regexes: an outer regex to match the numbers that should have commas added to them, and an inner regex that searches within the qualifying numbers as part of a search-and-replace that inserts the commas.

Outer regex:

\b(?<!\.)[0-9]{4,}
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9

This matches any entire number with four or more digits that is not preceded by a dot. The word boundary at the beginning of the regex ensures that any matched numbers start at the beginning of the string or are separate from other numbers and words. Otherwise, the regex could match the 2345 from 0.12345. In other words, without the word boundary, matches could start from the second digit after a decimal point, since a dot is no longer the preceding character at that point.
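
The effect of the word boundary can be seen in a short Python sketch that runs the outer regex with and without \b on a sample string of our own.

```python
import re

with_boundary = re.compile(r"\b(?<!\.)[0-9]{4,}")
# Same regex without the word boundary, for comparison only.
without_boundary = re.compile(r"(?<!\.)[0-9]{4,}")

sample = "0.12345 and 67890"

print(with_boundary.findall(sample))     # ['67890']
print(without_boundary.findall(sample))  # ['2345', '67890']
```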

The inner regex and replacement text to go with this are the same as in the basic solution.

In order to apply the inner regex’s generated replacement values to each match of the outer regex, we need to replace matches of the outer regex with values generated in code, rather than using a simple string replacement. That way we can run the inner regex within the code that generates the outer regex’s replacement value. This may sound complicated, but the programming languages covered by this book all make it fairly straightforward.

Here’s the complete solution for Ruby 1.9:

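# gsub returns a new string; subject itself is left unchanged
result = subject.gsub(/\b(?<!\.)[0-9]{4,}/) {|match|
    # Insert commas only within the qualifying number
    match.gsub(/[0-9](?=(?:[0-9]{3})+(?![0-9]))/, '\0,')
}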

The subject variable in this code holds the string to commafy. Ruby’s gsub string method performs a global search-and-replace. For other programming languages, follow Recipe 3.16, which explains how to replace matches with replacements generated in code. It includes examples that show this technique in action for each language.
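
For comparison, the same two-regex approach might look like this in Python, where re.sub accepts a function that generates each replacement. This is a sketch under the assumptions of this recipe. The names OUTER, INNER, and commafy are ours.

```python
import re

# Outer regex: whole numbers of four or more digits not preceded by a dot.
OUTER = re.compile(r"\b(?<!\.)[0-9]{4,}")
# Inner regex: the basic solution, applied within each qualifying number.
INNER = re.compile(r"[0-9](?=(?:[0-9]{3})+(?![0-9]))")

def commafy(subject):
    """Add commas to qualifying numbers, skipping digits after a decimal point."""
    def add_commas(match):
        # Run the inner search-and-replace on the text of the outer match.
        return INNER.sub(r"\g<0>,", match.group())
    return OUTER.sub(add_commas, subject)

print(commafy("1234567.891011 costs 0.12345 of 98765"))
# 1,234,567.891011 costs 0.12345 of 98,765
```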

The lack of lookbehind support in Ruby 1.8 and in JavaScript engines that predate ECMAScript 2018 prevents this solution from being fully portable, since we used lookbehind in the outer regex. We can work around this by including the character, if any, that precedes a number as part of the match, and requiring that it be something other than a digit or dot. We can then put the nondigit/nondot character back using a backreference in the generated replacement text.

Here’s the JavaScript code to pull this off:

// $1 is the preceding nondigit/nondot character (empty at the start of the string)
var result = subject.replace(/(^|[^0-9.])([0-9]{4,})/g, function($0, $1, $2) {
    return $1 + $2.replace(/[0-9](?=(?:[0-9]{3})+(?![0-9]))/g, "$&,");
});

See Also

Recipe 6.11 explains how to match numbers that already include commas within them.

All the other recipes in this chapter show more ways of matching different kinds of numbers with a regular expression.