Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Regular Expressions Cookbook, 2nd Edition, by Steven Levithan. Published by O'Reilly Media, Inc., 2012.

Replacement

$2,●$1$3
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
\2,●\1\3
Replacement text flavors: Python, Ruby

JavaScript example

function formatName(name) {
    return name.replace(/^(.+?) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i, "$2, $1$3");
}

Recipe 3.15 has code listings that will help you add this regex search-and-replace to programs written in other languages. Recipe 3.4 shows how to set the “case insensitive” option used here.

Discussion

First, let’s take a look at this regular expression piece by piece. Higher-level comments are provided afterward to help explain which parts of a name are being matched by various segments of the regex. Since the regex is written here in free-spacing mode, the literal space characters have been escaped with backslashes:

^               # Assert position at the beginning of the string.
(               # Capture the enclosed match to backreference 1:
  .+?           #   Match one or more characters, as few times as possible.
)               # End the capturing group.
\               # Match a literal space character.
(               # Capture the enclosed match to backreference 2:
  [^\s,]+       #   Match one or more non-whitespace/comma characters.
)               # End the capturing group.
(               # Capture the enclosed match to backreference 3:
  ,?\           #   Match ", " or " ".
  (?:           #   Group but don't capture:
    [JS]r\.?    #     Match "Jr", "Jr.", "Sr", or "Sr.".
   |            #    Or:
    III?        #     Match "II" or "III".
   |            #    Or:
    IV          #     Match "IV".
  )             #   End the noncapturing group.
)?              # Make the group optional.
$               # Assert position at the end of the string.
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
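
Native JavaScript regex literals have no free-spacing mode, which is why the flavor list above names XRegExp rather than JavaScript. One way to keep per-piece comments without an extra library is to assemble the pattern from string fragments. The following is only a minimal sketch of that idea; the names nameParts, nameRegex, and formatNameCommented are hypothetical and do not come from the recipe.

```javascript
// Sketch only: nameParts, nameRegex, and formatNameCommented are illustrative names.
// Backslashes are doubled because the fragments are string literals.
const nameParts = [
  "^",                            // Assert position at the beginning of the string.
  "(.+?)",                        // Group 1: first and middle names, as few characters as possible.
  " ",                            // A literal space.
  "([^\\s,]+)",                   // Group 2: last name (no whitespace or commas).
  "(,? (?:[JS]r\\.?|III?|IV))?",  // Group 3: optional suffix with an optional comma.
  "$"                             // Assert position at the end of the string.
];

// The "i" flag is the case-insensitive option used by the recipe.
const nameRegex = new RegExp(nameParts.join(""), "i");

function formatNameCommented(name) {
  return name.replace(nameRegex, "$2, $1$3");
}

console.log(formatNameCommented("John F. Kennedy")); // Kennedy, John F.
```

The joined fragments produce the same pattern as the one-line regex in the JavaScript example, so the two should behave identically.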
This regular expression makes the following assumptions about the subject data:
It contains at least one first name and one last name (other name parts are optional).
The first name is listed before the last name (not the norm with some national conventions).
If the name contains a suffix, it is one of the values “Jr”, “Jr.”, “Sr”, “Sr.”, “II”, “III”, or “IV”, with an optional preceding comma.
A few more issues to consider:
The regular expression cannot identify compound surnames that don’t use hyphens. For example, Sacha Baron Cohen would be replaced with Cohen, Sacha Baron, rather than the correct listing, Baron Cohen, Sacha.
It does not keep particles in front of the family name, although this is sometimes called for by convention or personal preference (for example, the correct alphabetical listing of “Charles de Gaulle” is “de Gaulle, Charles” according to the Chicago Manual of Style, 16th Edition, which contradicts Merriam-Webster’s Biographical Dictionary on this particular name).
Because of the ‹^› and ‹$› anchors that bind the match to the beginning and end of the string, no replacement can be made if the entire subject text does not fit the pattern. Hence, if no suitable match is found (for example, if the subject text contains only one name), the name is left unaltered.
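
The issues listed above can be reproduced with the formatName function from the JavaScript example. This is a small illustrative sketch; the sample input “Madonna” is an assumption chosen to show a single-name subject and is not taken from the recipe.

```javascript
function formatName(name) {
  return name.replace(/^(.+?) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i, "$2, $1$3");
}

// Only one name: the anchored pattern cannot match, so the input is returned unchanged.
console.log(formatName("Madonna"));            // Madonna

// Unhyphenated compound surname: only the final word is treated as the last name.
console.log(formatName("Sacha Baron Cohen"));  // Cohen, Sacha Baron

// Surname particle: "de" stays with the given names instead of the family name.
console.log(formatName("Charles de Gaulle"));  // Gaulle, Charles de
```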
As for how the regular expression works, it uses three capturing groups to split up the name. The pieces are then reassembled in the desired order via backreferences in the replacement string. Capturing group 1 uses the maximally flexible ‹.+?› pattern to grab the first name along with any number of middle names and surname particles, such as the German “von” or the French, Portuguese, and Spanish “de.” These name parts are handled together because they are listed sequentially in the output. Lumping the first and middle names together also helps avoid errors, because the regular expression cannot distinguish between a compound first name, such as “Mary Lou” or “Norma Jeane,” and a first name plus middle name. Even humans cannot accurately make the distinction just by visual examination.
Capturing group 2 matches the last name using ‹[^\s,]+›. Like the dot used in capturing group 1, the flexibility of this character class allows it to match accented characters and any other non-Latin characters. Capturing group 3 matches an optional suffix, such as “Jr.” or “III,” from a predefined list of possible values. The suffix is handled separately from the last name because it should continue to appear at the end of the reformatted name.
Let’s go back for a minute to capturing group 1. Why was the dot within group 1 followed by the lazy ‹+?› quantifier, whereas the character class in group 2 was followed by the greedy ‹+› quantifier? If group 1 (which handles a variable number of elements and therefore needs to go as far as it can into the name) used a greedy quantifier, capturing group 3 (which attempts to match a suffix) wouldn’t have a shot at participating in the match. The dot from group 1 would match until the end of the string, and since capturing group 3 is optional, the regex engine would only backtrack enough to find a match for group 2 before declaring success. Capturing group 2 can use a greedy quantifier because its more restrictive character class only allows it to match one name.
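
To see the effect of the quantifier choice, the sketch below compares the recipe's regex with a hypothetical variant in which group 1 is greedy. The variable names lazy and greedy are illustrative, and the greedy pattern is shown only as a counterexample.

```javascript
// The recipe's pattern: group 1 uses the lazy quantifier.
const lazy = /^(.+?) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i;
// Hypothetical variant for comparison: group 1 uses the greedy quantifier.
const greedy = /^(.+) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i;

const name = "Robert Downey, Jr.";

console.log(name.replace(lazy, "$2, $1$3"));   // Downey, Robert, Jr.
console.log(name.replace(greedy, "$2, $1$3")); // Jr., Robert Downey,

// With the greedy variant, group 3 never participates in the match,
// so the suffix ends up captured by group 2 as if it were the last name.
console.log(lazy.exec(name)[3]);   // ", Jr."
console.log(greedy.exec(name)[3]); // undefined
```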
Table 4-2 shows some examples of how names are formatted using this regular expression and replacement string.
Table 4-2. Formatted names
Input	Output
`Robert Downey, Jr.`	`Downey, Robert, Jr.`
`John F. Kennedy`	`Kennedy, John F.`
`Scarlett O’Hara`	`O’Hara, Scarlett`
`Pepé Le Pew`	`Pew, Pepé Le`
`J.R.R. Tolkien`	`Tolkien, J.R.R.`
`Catherine Zeta-Jones`	`Zeta-Jones, Catherine`
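
The rows of Table 4-2 can be checked against the formatName function from the JavaScript example. The following is a minimal sketch; the array name table42 is hypothetical, and the loop simply reports any row whose actual output differs from the table.

```javascript
function formatName(name) {
  return name.replace(/^(.+?) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i, "$2, $1$3");
}

// Each entry is [input, expected output], copied from Table 4-2.
const table42 = [
  ["Robert Downey, Jr.", "Downey, Robert, Jr."],
  ["John F. Kennedy", "Kennedy, John F."],
  ["Scarlett O’Hara", "O’Hara, Scarlett"],
  ["Pepé Le Pew", "Pew, Pepé Le"],
  ["J.R.R. Tolkien", "Tolkien, J.R.R."],
  ["Catherine Zeta-Jones", "Zeta-Jones, Catherine"]
];

for (const [input, expected] of table42) {
  const actual = formatName(input);
  const note = actual === expected ? "" : "  (differs from table)";
  console.log(input + " -> " + actual + note);
}
```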

Variations

List surname particles at the beginning of the name

An added segment in the following regular expression allows you to output surname particles from a predefined list in front of the last name. Specifically, this regular expression accounts for the values “de”, “du”, “la”, “le”, “St”, “St.”, “Ste”, “Ste.”, “van”, and “von”. Any number of these values are allowed in sequence (for example, “de la”):

^(.+?)●((?:(?:d[eu]|l[ae]|Ste?\.?|v[ao]n)●)*[^\s,]+)(,?●(?:[JS]r\.?|III?|IV))?$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
$2,●$1$3
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
\2,●\1\3
Replacement text flavors: Python, Ruby
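
As a rough sketch of how this variation might look in JavaScript, the function below mirrors the earlier JavaScript example with the particle-aware regex substituted in. The function name formatNameWithParticles and the sample names other than “Charles de Gaulle”, “Pepé Le Pew”, and “John F. Kennedy” are assumptions made for illustration.

```javascript
function formatNameWithParticles(name) {
  return name.replace(
    /^(.+?) ((?:(?:d[eu]|l[ae]|Ste?\.?|v[ao]n) )*[^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i,
    "$2, $1$3"
  );
}

// Listed particles now travel with the last name.
console.log(formatNameWithParticles("Charles de Gaulle")); // de Gaulle, Charles
console.log(formatNameWithParticles("Oscar de la Renta")); // de la Renta, Oscar

// Because matching is case insensitive, "Le" is also treated as a particle.
console.log(formatNameWithParticles("Pepé Le Pew"));       // Le Pew, Pepé

// Names without a listed particle are formatted as before.
console.log(formatNameWithParticles("John F. Kennedy"));   // Kennedy, John F.
```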
