Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Regular Expressions Cookbook, 2nd Edition, by Jan Goyvaerts and Steven Levithan. Published by O'Reilly Media, Inc., 2012.

Cover
Regular Expressions Cookbook
Preface
Caught in the Snarls of Different Versions
Intended Audience
Technology Covered
Organization of This Book
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
1. Introduction to Regular Expressions
Regular Expressions Defined
Search and Replace with Regular Expressions
Tools for Working with Regular Expressions
2. Basic Regular Expression Skills
2.1. Match Literal Text
2.2. Match Nonprintable Characters
2.3. Match One of Many Characters
2.4. Match Any Character
2.5. Match Something at the Start and/or the End of a Line
2.6. Match Whole Words
2.7. Unicode Code Points, Categories, Blocks, and Scripts
2.8. Match One of Several Alternatives
2.9. Group and Capture Parts of the Match
2.10. Match Previously Matched Text Again
2.11. Capture and Name Parts of the Match
2.12. Repeat Part of the Regex a Certain Number of Times
2.13. Choose Minimal or Maximal Repetition
2.14. Eliminate Needless Backtracking
2.15. Prevent Runaway Repetition
2.16. Test for a Match Without Adding It to the Overall Match
2.17. Match One of Two Alternatives Based on a Condition
2.18. Add Comments to a Regular Expression
2.19. Insert Literal Text into the Replacement Text
2.20. Insert the Regex Match into the Replacement Text
2.21. Insert Part of the Regex Match into the Replacement Text
2.22. Insert Match Context into the Replacement Text
3. Programming with Regular Expressions
Programming Languages and Regex Flavors
3.1. Literal Regular Expressions in Source Code
3.2. Import the Regular Expression Library
3.3. Create Regular Expression Objects
3.4. Set Regular Expression Options
3.5. Test If a Match Can Be Found Within a Subject String
3.6. Test Whether a Regex Matches the Subject String Entirely
3.7. Retrieve the Matched Text
3.8. Determine the Position and Length of the Match
3.9. Retrieve Part of the Matched Text
3.10. Retrieve a List of All Matches
3.11. Iterate over All Matches
3.12. Validate Matches in Procedural Code
3.13. Find a Match Within Another Match
3.14. Replace All Matches
3.15. Replace Matches Reusing Parts of the Match
3.16. Replace Matches with Replacements Generated in Code
3.17. Replace All Matches Within the Matches of Another Regex
3.18. Replace All Matches Between the Matches of Another Regex
3.19. Split a String
3.20. Split a String, Keeping the Regex Matches
3.21. Search Line by Line
Construct a Parser
4. Validation and Formatting
4.1. Validate Email Addresses
4.2. Validate and Format North American Phone Numbers
4.3. Validate International Phone Numbers
4.4. Validate Traditional Date Formats
4.5. Validate Traditional Date Formats, Excluding Invalid Dates
4.6. Validate Traditional Time Formats
4.7. Validate ISO 8601 Dates and Times
4.8. Limit Input to Alphanumeric Characters
4.9. Limit the Length of Text
4.10. Limit the Number of Lines in Text
4.11. Validate Affirmative Responses
4.12. Validate Social Security Numbers
4.13. Validate ISBNs
4.14. Validate ZIP Codes
4.15. Validate Canadian Postal Codes
4.16. Validate U.K. Postcodes
4.17. Find Addresses with Post Office Boxes
4.18. Reformat Names From “FirstName LastName” to “LastName, FirstName”
4.19. Validate Password Complexity
4.20. Validate Credit Card Numbers
4.21. European VAT Numbers
5. Words, Lines, and Special Characters
5.1. Find a Specific Word
5.2. Find Any of Multiple Words
5.3. Find Similar Words
5.4. Find All Except a Specific Word
5.5. Find Any Word Not Followed by a Specific Word
5.6. Find Any Word Not Preceded by a Specific Word
5.7. Find Words Near Each Other
5.8. Find Repeated Words
5.9. Remove Duplicate Lines
5.10. Match Complete Lines That Contain a Word
5.11. Match Complete Lines That Do Not Contain a Word
5.12. Trim Leading and Trailing Whitespace
5.13. Replace Repeated Whitespace with a Single Space
5.14. Escape Regular Expression Metacharacters
6. Numbers
6.1. Integer Numbers
6.2. Hexadecimal Numbers
6.3. Binary Numbers
6.4. Octal Numbers
6.5. Decimal Numbers
6.6. Strip Leading Zeros
6.7. Numbers Within a Certain Range
6.8. Hexadecimal Numbers Within a Certain Range
6.9. Integer Numbers with Separators
6.10. Floating-Point Numbers
6.11. Numbers with Thousand Separators
6.12. Add Thousand Separators to Numbers
6.13. Roman Numerals
7. Source Code and Log Files
Keywords
Identifiers
Numeric Constants
Operators
Single-Line Comments
Multiline Comments
All Comments
Strings
Strings with Escapes
Regex Literals
Here Documents
Common Log Format
Combined Log Format
Broken Links Reported in Web Logs
8. URLs, Paths, and Internet Addresses
8.1. Validating URLs
8.2. Finding URLs Within Full Text
8.3. Finding Quoted URLs in Full Text
8.4. Finding URLs with Parentheses in Full Text
8.5. Turn URLs into Links
8.6. Validating URNs
8.7. Validating Generic URLs
8.8. Extracting the Scheme from a URL
8.9. Extracting the User from a URL
8.10. Extracting the Host from a URL
8.11. Extracting the Port from a URL
8.12. Extracting the Path from a URL
8.13. Extracting the Query from a URL
8.14. Extracting the Fragment from a URL
8.15. Validating Domain Names
8.16. Matching IPv4 Addresses
8.17. Matching IPv6 Addresses
8.18. Validate Windows Paths
8.19. Split Windows Paths into Their Parts
8.20. Extract the Drive Letter from a Windows Path
8.21. Extract the Server and Share from a UNC Path
8.22. Extract the Folder from a Windows Path
8.23. Extract the Filename from a Windows Path
8.24. Extract the File Extension from a Windows Path
8.25. Strip Invalid Characters from Filenames
9. Markup and Data Formats
Processing Markup and Data Formats with Regular Expressions
9.1. Find XML-Style Tags
9.2. Replace <b> Tags with <strong>
9.3. Remove All XML-Style Tags Except <em> and <strong>
9.4. Match XML Names
9.5. Convert Plain Text to HTML by Adding <p> and <br> Tags
9.6. Decode XML Entities
9.7. Find a Specific Attribute in XML-Style Tags
9.8. Add a cellspacing Attribute to <table> Tags That Do Not Already Include It
9.9. Remove XML-Style Comments
9.10. Find Words Within XML-Style Comments
9.11. Change the Delimiter Used in CSV Files
9.12. Extract CSV Fields from a Specific Column
9.13. Match INI Section Headers
9.14. Match INI Section Blocks
9.15. Match INI Name-Value Pairs
Index
About the Authors
Colophon
Copyright

9.11. Change the Delimiter Used in CSV Files
Problem
You want to change all field-delimiting commas in a CSV file to tabs. Commas that occur within double-quoted values should be left alone.
Solution
The following regular expression matches an individual CSV field along with its preceding delimiter, if any. The preceding delimiter is usually a comma, but can also be an empty string (i.e., nothing) when matching the first field of the first record, or a line break when matching the first field of any subsequent record. Every time a match is found, the field itself, including the double quotes that may surround it, is captured to backreference 2, and its preceding delimiter is captured to backreference 1.
Tip
The regular expressions in this recipe are designed to work correctly with valid CSV files only, according to the format rules discussed in “Comma-Separated Values (CSV).”
(,|\r?\n|^)([^",\r\n]+|"(?:[^"]|"")*")?

Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Here is the same regular expression again in free-spacing mode:
( , | \r?\n | ^ )   # Capture the leading field delimiter to backref 1
(                   # Capture a single field to backref 2:
  [^",\r\n]+        #   Unquoted field
 |                  #  Or:
  " (?:[^"]|"")* "  #   Quoted field (may contain escaped double quotes)
)?                  # The group is optional because fields may be empty

Regex options: Free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
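
Plain JavaScript regex literals have no free-spacing option, which is why the flavor list above names XRegExp. As a rough sketch of one alternative that needs no extra library, the same pattern can be assembled from commented string pieces. The variable names here are illustrative and do not come from the recipe:

```javascript
// String.raw keeps backslashes literal, so \r and \n reach the regex engine
// as escape sequences instead of being interpreted by the string literal.
const delimiter = String.raw`(,|\r?\n|^)`;    // group 1: leading field delimiter
const unquoted  = String.raw`[^",\r\n]+`;     // unquoted field
const quoted    = String.raw`"(?:[^"]|"")*"`; // quoted field; "" is an escaped quote

// Group 2 wraps both field alternatives and is optional (fields may be empty).
const csvField = new RegExp(delimiter + "(" + unquoted + "|" + quoted + ")?", "g");

console.log(csvField.source);
// (,|\r?\n|^)([^",\r\n]+|"(?:[^"]|"")*")?
```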
Using this regex and the code in Recipe 3.11, you can iterate over your CSV file and check the value of backreference 1 after each match. The necessary replacement string for each match depends on the value of this backreference. If it’s a comma, replace it with a tab character. If the backreference is empty or contains a line break, leave the value in place (i.e., do nothing, or put it back as part of a replacement string). Since CSV fields are captured to backreference 2 as part of each match, you’ll also have to put that back as part of each replacement string. The only things you’re actually replacing are the commas that are captured to backreference 1.
Example web page with JavaScript
The following code is a complete web page that includes two multiline text input fields, with a button labeled Replace between them. Clicking the button takes whatever string you put into the first text box (labeled Input), converts any comma delimiters to tabs with the help of the regular expression just shown, then puts the new string into the second text box (labeled Output). If you use valid CSV content as your input, it should show up in the second text box with all comma delimiters replaced with tabs. To test it, save this code into a file with the .html extension and open it in your favorite web browser:
<html>
<head>
<title>Change CSV delimiters from commas to tabs</title>
</head>
<body>
<p>Input:</p>
<textarea id="input" rows="5" cols="75"></textarea>
<p><input type="button" value="Replace" onclick="commasToTabs()"></p>
<p>Output:</p>
<textarea id="output" rows="5" cols="75"></textarea>
<script>
function commasToTabs() {
    var input = document.getElementById("input"),
        output = document.getElementById("output"),
        regex = /(,|\r?\n|^)([^",\r\n]+|"(?:[^"]|"")*")?/g,
        result = "",
        match;

    while (match = regex.exec(input.value)) {
        // Check the value of backreference 1
        if (match[1] == ",") {
            // Add a tab (in place of the matched comma) and backreference
            // 2 to the result. If backreference 2 is undefined (because
            // the optional, second capturing group did not participate in
            // the match), use an empty string instead.
            result += "\t" + (match[2] || "");
        } else {
            // Add the entire match to the result
            result += match[0];
        }
        // If there is an empty match, prevent some browsers from getting
        // stuck in an infinite loop
        if (match.index == regex.lastIndex) {
            regex.lastIndex++;
        }
    }

    output.value = result;
}
</script>
</body>
</html>
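
For use outside a web page, the same logic can be wrapped in a standalone function. The following is a minimal sketch, not the recipe's own code: it swaps the explicit exec loop for String.prototype.replace with a callback, and the function name csvCommasToTabs is made up for illustration.

```javascript
// Hypothetical helper: convert field-delimiting commas to tabs in a CSV string.
function csvCommasToTabs(csv) {
  const fieldRegex = /(,|\r?\n|^)([^",\r\n]+|"(?:[^"]|"")*")?/g;
  return csv.replace(fieldRegex, (match, delimiter, field) =>
    // Only a comma delimiter is swapped for a tab. A line break or the
    // start-of-string match is put back unchanged. `field` is undefined
    // when the optional group 2 did not participate (empty field).
    delimiter === "," ? "\t" + (field || "") : match
  );
}

console.log(JSON.stringify(csvCommasToTabs('a,"b,c",d\n1,,3')));
// "a\t\"b,c\"\td\n1\t\t3"
```

Because replace advances past empty matches on its own, this variant does not need the manual lastIndex adjustment used in the loop above.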

Discussion

The approach prescribed by this recipe allows you to pass over each complete CSV field (including any embedded line breaks, escaped double quotes, and commas) one at a time. Each match then starts just before the next field delimiter.
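
To make the field-by-field behavior concrete, here is a small sketch that collects the text captured by group 2 for each match. The sample data is invented for illustration; it includes a quoted comma, escaped double quotes, an embedded line break, and an empty field.

```javascript
const fieldRegex = /(,|\r?\n|^)([^",\r\n]+|"(?:[^"]|"")*")?/g;

// Invented sample: two records, the first ending in a multiline quoted field.
const sample = 'id,"Smith, Jane","said ""hi""\nthen left"\n2,,x';

// Group 2 is undefined for empty fields, so fall back to an empty string.
const fields = Array.from(sample.matchAll(fieldRegex), m => m[2] || "");

console.log(fields.length); // 6
console.log(fields[1]);     // "Smith, Jane"   (quotes included, comma intact)
```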

The first capturing group in the regex, ‹(,|\r?\n|^)›, matches a comma, line break, or the position at the beginning of the subject string. Since the regex engine will attempt alternatives from left to right, these options are listed in the order in which they will most frequently occur in the average CSV file. This capturing group is the only part of the regex that is required to match. Therefore, the complete regex can match an empty string: the ‹^› anchor matches once, at the start of the subject string, without consuming any characters. The value matched by this first capturing group must be checked in the code outside of the regex that replaces commas with your substitute delimiters (in this case, tabs).
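
The three possible values of group 1, and the empty match on an empty subject, can be observed directly. This is an illustrative sketch with made-up input strings:

```javascript
const fieldRegex = /(,|\r?\n|^)([^",\r\n]+|"(?:[^"]|"")*")?/g;

// Group 1 for each field of "a,b\r\nc": start of string, comma, line break.
const delimiters = Array.from("a,b\r\nc".matchAll(fieldRegex), m => m[1]);
console.log(JSON.stringify(delimiters)); // ["",",","\r\n"]

// On an empty subject, ^ matches once and group 2 does not participate,
// so the overall match is a single empty string.
const emptyMatches = Array.from("".matchAll(fieldRegex), m => m[0]);
console.log(JSON.stringify(emptyMatches)); // [""]
```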

We haven’t yet gotten through the entire regex, but the approach described so far is already somewhat convoluted. You might be wondering why the regex is not written to match only the commas that should be replaced with tabs. If you could do that, a simple substitution of all matched text would avoid the need for code outside of the regex to check whether capturing group 1 matched a comma or some other string. After all, it should be possible to use lookahead and lookbehind to determine whether a comma is inside or outside a quoted CSV field, right?

Unfortunately, in order for such an approach to accurately determine which commas are outside of double-quoted fields, you’d need infinite-length lookbehind, which is available in the .NET regex flavor only (see “Different levels of lookbehind” for a discussion of the varying lookbehind limitations). Even .NET developers should avoid a lookaround-based approach since it would add significant complexity and also make the regex slower.
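
The problem with matching commas on their own is easy to demonstrate. The sketch below, using an invented sample row, contrasts a blind comma replacement with the field-aware replacement this recipe describes:

```javascript
const row = '1,"Doe, John",NY';

// Blind replacement: also rewrites the comma inside the quoted field.
const naive = row.replace(/,/g, "\t");
console.log(JSON.stringify(naive));      // "1\t\"Doe\t John\"\tNY"

// Field-aware replacement: each match consumes a whole field, so a comma
// inside quotes is never seen as a delimiter.
const fieldRegex = /(,|\r?\n|^)([^",\r\n]+|"(?:[^"]|"")*")?/g;
const fieldAware = row.replace(fieldRegex, (match, delimiter, field) =>
  delimiter === "," ? "\t" + (field || "") : match
);
console.log(JSON.stringify(fieldAware)); // "1\t\"Doe, John\"\tNY"
```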

Getting back to how the regex works, most of the pattern appears within the next set of parentheses: capturing group 2. This second group matches a single CSV field, including any surrounding double quotes. Unlike the previous capturing group, this one is optional in order to allow matching empty fields.

Note that group 2 within the regex contains two alternative patterns separated by the ‹|› metacharacter. The first alternative, ‹[^",\r\n]+›, is a negated character class followed by a one-or-more quantifier (‹+›) that, together, match an entire unquoted field. For this to match, the field cannot contain any double quotes, commas, or line breaks.

The second alternative within group 2, ‹"(?:[^"]|"")*"›, matches a field surrounded by double quotes. More precisely, it matches a double quote character, followed by zero or more non-double-quote characters or repeated (escaped) double quotes, followed by a closing double quote. The ‹*› quantifier at the end of the noncapturing group continues repeating the two options within the group until it reaches a double quote that is not repeated and therefore ends the field.
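
Each of the two alternatives in group 2 can be tried in isolation. In this sketch the ‹^› and ‹$› anchors are added only so that each test must match the whole string; they are not part of the recipe's regex, and the test strings are invented.

```javascript
const unquotedField = /^[^",\r\n]+$/;     // first alternative, anchored
const quotedField   = /^"(?:[^"]|"")*"$/; // second alternative, anchored

console.log(unquotedField.test("plain text"));   // true
console.log(unquotedField.test("a,b"));          // false: contains a comma
console.log(unquotedField.test('has "quote"'));  // false: contains double quotes

console.log(quotedField.test('"a,b"'));          // true: comma allowed inside quotes
console.log(quotedField.test('"say ""hi"""'));   // true: "" is an escaped quote
console.log(quotedField.test('"bad " quote"'));  // false: lone quote ends the field early
console.log(quotedField.test('"unterminated'));  // false: no closing quote
```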

Assuming you’re working with a valid CSV file, the first match found by this regex should occur at the beginning of the subject string, and each subsequent match should occur immediately after the end of the last match.
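
This contiguity property can double as a rough sanity check on the input. The following sketch (the function name matchesAreContiguous is hypothetical) verifies that every match begins where the previous one ended and that the matches cover the whole string. It is only a plausibility test under this recipe's assumptions, not a full CSV validator.

```javascript
// Hypothetical helper: true if the regex matches tile the input end to end.
function matchesAreContiguous(csv) {
  const fieldRegex = /(,|\r?\n|^)([^",\r\n]+|"(?:[^"]|"")*")?/g;
  let expectedIndex = 0;
  for (const match of csv.matchAll(fieldRegex)) {
    // A gap means some text could not be read as delimiter plus field.
    if (match.index !== expectedIndex) return false;
    expectedIndex = match.index + match[0].length;
  }
  // The last match must end exactly at the end of the input.
  return expectedIndex === csv.length;
}

console.log(matchesAreContiguous('a,"b,c"\nd,')); // true
console.log(matchesAreContiguous('a,b"c'));       // false: stray quote in unquoted field
```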
