Regular Expressions Cookbook, 2nd Edition by Steven Levithan Published by O'Reilly Media, Inc., 2012

3.13. Find a Match Within Another Match
Problem
You want to find all the matches of a particular regular expression, but only within certain sections of the subject string. Another regular expression matches each of the sections in the string.
Suppose you have an HTML file in which various passages are marked as bold with <b> tags. You want to find all numbers marked as bold. If some bold text contains multiple numbers, you want to match all of them separately. For example, when processing the string 1 <b>2</b> 3 4 <b>5 6 7</b>, you want to find four matches: 2, 5, 6, and 7.
Solution
C#
StringCollection resultList = new StringCollection();
Regex outerRegex = new Regex("<b>(.*?)</b>", RegexOptions.Singleline);
Regex innerRegex = new Regex(@"\d+");
// Find the first section
Match outerMatch = outerRegex.Match(subjectString);
while (outerMatch.Success) {
    // Get the matches within the section
    Match innerMatch = innerRegex.Match(outerMatch.Groups[1].Value);
    while (innerMatch.Success) {
        resultList.Add(innerMatch.Value);
        innerMatch = innerMatch.NextMatch();
    }
    // Find the next section
    outerMatch = outerMatch.NextMatch();
}
VB.NET
Dim ResultList = New StringCollection
Dim OuterRegex As New Regex("<b>(.*?)</b>", RegexOptions.Singleline)
Dim InnerRegex As New Regex("\d+")
'Find the first section
Dim OuterMatch = OuterRegex.Match(SubjectString)
While OuterMatch.Success
    'Get the matches within the section
    Dim InnerMatch = InnerRegex.Match(OuterMatch.Groups(1).Value)
    While InnerMatch.Success
        ResultList.Add(InnerMatch.Value)
        InnerMatch = InnerMatch.NextMatch
    End While
    'Find the next section
    OuterMatch = OuterMatch.NextMatch
End While
Java
Iterating using two matchers is easy, and works with Java 4 and later:
List<String> resultList = new ArrayList<String>();
Pattern outerRegex = Pattern.compile("<b>(.*?)</b>", Pattern.DOTALL);
Pattern innerRegex = Pattern.compile("\\d+");
Matcher outerMatcher = outerRegex.matcher(subjectString);
while (outerMatcher.find()) {
    // Apply the inner regex to a copy of the text captured by group 1
    Matcher innerMatcher = innerRegex.matcher(outerMatcher.group(1));
    while (innerMatcher.find()) {
        resultList.add(innerMatcher.group());
    }
}
The following code is more efficient (because innerMatcher is created only once), but requires Java 5 or later:
List<String> resultList = new ArrayList<String>();
Pattern outerRegex = Pattern.compile("<b>(.*?)</b>", Pattern.DOTALL);
Pattern innerRegex = Pattern.compile("\\d+");
Matcher outerMatcher = outerRegex.matcher(subjectString);
Matcher innerMatcher = innerRegex.matcher(subjectString);
while (outerMatcher.find()) {
    // Restrict the inner matcher to the span of group 1; no substring copy
    innerMatcher.region(outerMatcher.start(1), outerMatcher.end(1));
    while (innerMatcher.find()) {
        resultList.add(innerMatcher.group());
    }
}
JavaScript
var result = [];
var outerRegex = /<b>([\s\S]*?)<\/b>/g;
var innerRegex = /\d+/g;
var outerMatch;
var innerMatches;
while (outerMatch = outerRegex.exec(subject)) {
    // Guard against an infinite loop on zero-length matches
    if (outerMatch.index == outerRegex.lastIndex) outerRegex.lastIndex++;
    innerMatches = outerMatch[1].match(innerRegex);
    if (innerMatches) {
        result = result.concat(innerMatches);
    }
}
XRegExp
XRegExp has a matchChain() method that is specifically designed to get the matches of one regex within the matches of another regex:
var result = XRegExp.matchChain(subject, [
    {regex: XRegExp("<b>(.*?)</b>", "s"), backref: 1},
    /\d+/
]);
Alternatively, you can use XRegExp.forEach() for a solution similar to the standard JavaScript solution:
var result = [];
var outerRegex = XRegExp("<b>(.*?)</b>", "s");
var innerRegex = /\d+/g;
XRegExp.forEach(subject, outerRegex, function(outerMatch) {
    var innerMatches = outerMatch[1].match(innerRegex);
    if (innerMatches) {
        result = result.concat(innerMatches);
    }
});
PHP
$list = array();
preg_match_all('%<b>(.*?)</b>%s', $subject, $outermatches, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($outermatches[0]); $i++) {
    // $outermatches[1][$i] holds the text captured by group 1 of match $i
    if (preg_match_all('/\d+/', $outermatches[1][$i], $innermatches, PREG_PATTERN_ORDER)) {
        $list = array_merge($list, $innermatches[0]);
    }
}
Perl
while ($subject =~ m!<b>(.*?)</b>!gs) {
    push(@list, ($1 =~ m/\d+/g));
}
This only works if the inner regular expression (‹\d+›, in this example) doesn’t have any capturing groups, so use noncapturing groups instead. See Recipe 2.9 for details.
Python
import re

results = []
innerre = re.compile(r"\d+")
for outermatch in re.finditer("(?s)<b>(.*?)</b>", subject):
    # Search only the text captured between the bold tags
    results.extend(innerre.findall(outermatch.group(1)))
Ruby
list = []
subject.scan(/<b>(.*?)<\/b>/m) {|outergroups|
    # The block receives an array of the captured groups, so group 1 is at index 0
    list += outergroups[0].scan(/\d+/)
}
Discussion
Regular expressions are well suited for tokenizing input, but they are not well suited for parsing input. Tokenizing means to identify different parts of a string, such as numbers, words, symbols, tags, comments, etc. It involves scanning the text from left to right, trying different alternatives and quantities of characters to be matched. Regular expressions handle this very well.
Parsing means to process the relationship between those tokens. For example, in a programming language, combinations of such tokens form statements, functions, classes, namespaces, etc. Keeping track of the meaning of the tokens within the larger context of the input is best left to procedural code. In particular, regular expressions cannot keep track of nonlinear context, such as nested constructs.^[6]
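The division of labor described above can be sketched in Python as follows. This is a minimal illustration, not code from the recipe: the `TOKEN` pattern and the `numbers_with_depth` helper are hypothetical names, and parentheses stand in for any nested construct. The regex only tokenizes; the nesting depth is tracked by ordinary procedural code.

```python
import re

# Tokenizer: a regex that recognizes numbers and parentheses (illustrative only).
TOKEN = re.compile(r"\d+|[()]")

def numbers_with_depth(text):
    """Return (number, nesting depth) pairs; the depth is tracked in code, not in the regex."""
    depth = 0
    found = []
    for match in TOKEN.finditer(text):
        token = match.group()
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
        else:
            found.append((token, depth))
    return found

print(numbers_with_depth("1 (2 (3) 4) 5"))
```

Each number is reported together with how many parentheses enclose it, which is the kind of context the regex alone does not track.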
Trying to find one kind of token within another kind of token is a task that people commonly try to tackle with regular expressions. A pair of HTML bold tags is easily matched with the regular expression ‹<b>(.*?)</b>›.^[7] A number is even more easily matched with the regex ‹\d+›. But if you try to combine these into a single regex, you’ll end up with something rather different:
\d+(?=(?:(?!<b>).)*</b>)

Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
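To see what the single-regex approach does in practice, the following Python sketch applies the combined pattern (a number followed, before any new <b> tag opens, by a closing </b>) to the sample string from the Problem section. The variable names are illustrative.

```python
import re

# A number that is followed by a closing </b> before any opening <b> appears.
combined = re.compile(r"\d+(?=(?:(?!<b>).)*</b>)")

subject = "1 <b>2</b> 3 4 <b>5 6 7</b>"
matches = combined.findall(subject)
print(matches)  # ['2', '5', '6', '7']
```

The lookahead is re-evaluated for every number in the subject, including 1, 3, and 4, which lie outside the bold sections.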
Though the regular expression just shown is a solution to the problem posed by this recipe, it is hardly intuitive. Even a regular expression expert will have to carefully scrutinize the regex to determine what it does, or perhaps resort to a tool to highlight the matches. And this is the combination of just two simple regexes.
A better solution is to keep the two regular expressions as they are and use procedural code to combine them. The resulting code, while a bit longer, is much easier to understand and maintain, and creating simple code is the reason for using regular expressions in the first place. A regex such as ‹<b>(.*?)</b>› is easy to understand by anyone with a modicum of regex experience, and quickly does what would otherwise take many more lines of code that are harder to maintain.
Though the solutions for this recipe are some of the most complex ones in this chapter, they’re very straightforward. Two regular expressions are used. The “outer” regular expression matches the HTML bold tags and the text between them, and the text in between is captured by the first capturing group. This regular expression is implemented with the same code shown in Recipe 3.11. The only difference is that the placeholder comment saying where to use the match has been replaced with code that lets the “inner” regular expression do its job.
The second regular expression matches a number, that is, a sequence of one or more digits. This regex is implemented with the same code as shown in Recipe 3.10. The only difference is that instead of processing the subject string entirely, the second regex is applied only to the part of the subject string matched by the first capturing group of the outer regular expression.
There are two ways to restrict the inner regular expressions to the text matched by (a capturing group of) the outer regular expressions. Some languages provide a function that allows the regular expression to be applied to part of a string. That can save an extra string copy if the match function doesn’t automatically fill a structure with the text matched by the capturing groups. We can always simply retrieve the substring matched by the capturing group and apply the inner regex to that.
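As a sketch of the first approach in Python: compiled patterns accept optional `pos` and `endpos` arguments, so the inner regex can be confined to the span of the outer capturing group without slicing out a substring. The `find_inner_matches` helper is a hypothetical name used for illustration. A side effect of this approach is that the reported offsets are relative to the whole subject string.

```python
import re

outer_regex = re.compile(r"<b>(.*?)</b>", re.DOTALL)
inner_regex = re.compile(r"\d+")

def find_inner_matches(subject):
    """Return (offset, text) for each number inside a bold section."""
    results = []
    for outer in outer_regex.finditer(subject):
        # Search only between the start and end of capturing group 1.
        for inner in inner_regex.finditer(subject, outer.start(1), outer.end(1)):
            results.append((inner.start(), inner.group()))
    return results

print(find_inner_matches("1 <b>2</b> 3 4 <b>5 6 7</b>"))
```

The Python code in the Solution section takes the second approach instead: it retrieves the captured substring with `group(1)` and searches that.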
Either way, using two regular expressions together in a loop will be faster than using the one regular expression with its nested lookahead groups. The latter requires the regex engine to do a whole lot of backtracking. On large files, using just one regex will be much slower, as it needs to determine the section boundaries (HTML bold tags) for each number in the subject string, including numbers that are not between <b> tags. The solution that uses two regular expressions doesn’t even begin to look for numbers until it has found the section boundaries, which it does in linear time.
The XRegExp library for JavaScript has a special matchChain() method that is specifically designed to get the matches of one regex within the matches of another regex. This method takes an array of regexes as its second parameter. You can add as many regexes to the array as you want. You can find the matches of a regex within the matches of another regex, within the matches of other regexes, as many levels deep as you want. This recipe only uses two regexes, so our array only needs two elements. If you want the next regex to search within the text matched by a particular capturing group of a regex, add that regex as an object to the array. The object should have a regex property with the regular expression, and a backref property with the name or number of the capturing group. If you specify the last regex in the array as an object with a regex and a backref property, then the returned array will contain the matches of that capturing group in the final regex.
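The chaining idea is not tied to JavaScript. The following Python sketch is a rough analogue written for illustration; it is not XRegExp's implementation, and the `match_chain` name and the (pattern, group) tuple format are assumptions of this example. Each level searches only within the text produced by the previous level, and group 0 means the whole match.

```python
import re

def match_chain(subject, chain):
    """Apply each (pattern, group) pair to the texts produced by the previous pair."""
    texts = [subject]
    for pattern, group in chain:
        regex = re.compile(pattern)
        next_texts = []
        for text in texts:
            for match in regex.finditer(text):
                captured = match.group(group)
                # A group that did not participate in the match yields None; skip it.
                if captured is not None:
                    next_texts.append(captured)
        texts = next_texts
    return texts

result = match_chain(
    "1 <b>2</b> 3 4 <b>5 6 7</b>",
    [(r"(?s)<b>(.*?)</b>", 1), (r"\d+", 0)],
)
print(result)  # ['2', '5', '6', '7']
```

Adding a third tuple to the list would search within the numbers themselves, in the same way that matchChain() accepts as many levels as you want.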
