Regular Expressions Cookbook, 2nd Edition
by Steven Levithan
Published by O'Reilly Media, Inc., 2012
^(.+?)●([^\s,]+)(,?●(?:[JS]r\.?|III?|IV))?$
| Regex options: Case insensitive |
| Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
$2,●$1$3
| Replacement text flavors: .NET, Java, JavaScript, Perl, PHP |
\2,●\1\3
| Replacement text flavors: Python, Ruby |
function formatName(name) {
    return name.replace(/^(.+?) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i,
                        "$2, $1$3");
}
Recipe 3.15 has code listings that will help you add this regex search-and-replace to programs written in other languages. Recipe 3.4 shows how to set the “case insensitive” option used here.
First, let’s take a look at this regular expression piece by piece. Higher-level comments are provided afterward to help explain which parts of a name are being matched by various segments of the regex. Since the regex is written here in free-spacing mode, the literal space characters have been escaped with backslashes:
^               # Assert position at the beginning of the string.
(               # Capture the enclosed match to backreference 1:
  .+?           #   Match one or more characters, as few times as possible.
)               # End the capturing group.
\               # Match a literal space character.
(               # Capture the enclosed match to backreference 2:
  [^\s,]+       #   Match one or more non-whitespace/comma characters.
)               # End the capturing group.
(               # Capture the enclosed match to backreference 3:
  ,?\           #   Match ", " or " ".
  (?:           #   Group but don't capture:
    [JS]r\.?    #     Match "Jr", "Jr.", "Sr", or "Sr.".
   |            #    Or:
    III?        #     Match "II" or "III".
   |            #    Or:
    IV          #     Match "IV".
  )             #   End the noncapturing group.
)?              # Make the group optional.
$               # Assert position at the end of the string.
| Regex options: Case insensitive, free-spacing |
| Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby |
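Native JavaScript regular expressions have no free-spacing option, which is presumably why XRegExp rather than JavaScript appears in the flavor list above. As a rough sketch, a similar commented layout can be approximated in plain JavaScript by assembling the pattern from string fragments. The names nameParts and nameRegex are illustrative and do not come from the recipe.

```javascript
// Each fragment is one piece of the recipe's regex; the trailing comments
// stand in for the free-spacing annotations. Backslashes are doubled
// because these are string literals, not regex literals.
const nameParts = [
  "^",                            // start of string
  "(.+?)",                        // group 1: first and middle names (lazy)
  " ",                            // literal space
  "([^\\s,]+)",                   // group 2: last name
  "(,? (?:[JS]r\\.?|III?|IV))?",  // group 3: optional suffix
  "$",                            // end of string
];

// "i" turns on case-insensitive matching, as the recipe requires.
const nameRegex = new RegExp(nameParts.join(""), "i");

console.log("Robert Downey, Jr.".replace(nameRegex, "$2, $1$3"));
// prints "Downey, Robert, Jr."
```

Because the fragments are joined without separators, the literal spaces do not need to be escaped here, unlike in true free-spacing mode.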
This regular expression makes the following assumptions about the subject data:
It contains at least one first name and one last name (other name parts are optional).
The first name is listed before the last name (not the norm with some national conventions).
If the name contains a suffix, it is one of the values “Jr”, “Jr.”, “Sr”, “Sr.”, “II”, “III”, or “IV”, with an optional preceding comma.
A few more issues to consider:
The regular expression cannot identify compound surnames that don’t use hyphens. For example, “Sacha Baron Cohen” would be replaced with “Cohen, Sacha Baron”, rather than the correct listing, “Baron Cohen, Sacha”.
It does not keep particles in front of the family name, although this is sometimes called for by convention or personal preference (for example, the correct alphabetical listing of “Charles de Gaulle” is “de Gaulle, Charles” according to the Chicago Manual of Style, 16th Edition, which contradicts Merriam-Webster’s Biographical Dictionary on this particular name).
Because of the ‹^› and ‹$› anchors that bind the match to the
beginning and end of the string, no replacement can be made if the
entire subject text does not fit the pattern. Hence, if no suitable
match is found (for example, if the subject text contains only one
name), the name is left unaltered.
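The limitations above can be checked with a short script. This is only an illustrative sketch: it reuses the recipe's formatName function, and the sample names are the ones mentioned in the text plus a hypothetical single-name input.

```javascript
function formatName(name) {
  return name.replace(/^(.+?) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i,
                      "$2, $1$3");
}

// A single name cannot match (the pattern requires at least one space),
// so replace() returns the input unchanged.
console.log(formatName("Madonna"));            // prints "Madonna"

// Only the final word is treated as the surname, so an unhyphenated
// compound surname is split and a particle stays with the given names.
console.log(formatName("Sacha Baron Cohen"));  // prints "Cohen, Sacha Baron"
console.log(formatName("Charles de Gaulle"));  // prints "Gaulle, Charles de"
```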
As for how the regular expression works, it uses three capturing
groups to split up the name. The pieces are then reassembled in the
desired order via backreferences in the replacement string. Capturing
group 1 uses the maximally flexible ‹.+?› pattern to grab the first name along with any
number of middle names and surname particles, such as the German “von”
or the French, Portuguese, and Spanish “de.” These name parts are
handled together because they are listed sequentially in the output.
Lumping the first and middle names together also helps avoid errors,
because the regular expression cannot distinguish between a compound
first name, such as “Mary Lou” or “Norma Jeane,” and a first name plus
middle name. Even humans cannot accurately make the distinction just by
visual examination.
Capturing group 2 matches the last name using ‹[^\s,]+›. Like the dot used in
capturing group 1, this character class is flexible enough to match
accented letters and characters from non-Latin scripts. Capturing
group 3 matches an optional suffix, such as “Jr.” or “III,” from a
predefined list of possible values. The suffix is handled separately
from the last name because it should continue to appear at the end of
the reformatted name.
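To see what each capturing group actually holds, the match can be inspected directly instead of being fed to a replacement. The helper below is a hypothetical sketch; splitName and its property names (given, family, suffix) are not part of the recipe.

```javascript
const namePattern = /^(.+?) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i;

// Returns the three captured pieces, or null when the name does not match.
function splitName(name) {
  const m = name.match(namePattern);
  if (m === null) return null;
  return {
    given: m[1],         // group 1: first name plus any middle names
    family: m[2],        // group 2: last name
    suffix: m[3] || "",  // group 3: undefined when no suffix is present
  };
}

console.log(splitName("Norma Jeane Mortenson"));
// given "Norma Jeane", family "Mortenson", suffix ""
console.log(splitName("Martin Luther King, Jr."));
// given "Martin Luther", family "King", suffix ", Jr."
```

Note that group 3 includes the separator that precedes the suffix (a space, with or without a comma), which is why the replacement text places $3 directly after $1 with nothing in between.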
Let’s go back for a minute to capturing group 1. Why was the dot
within group 1 followed by the lazy ‹+?›
quantifier, whereas the character class in group 2 was followed by the
greedy ‹+›
quantifier? If group 1 (which handles a variable number of elements and
therefore needs to go as far as it can into the name) used a greedy
quantifier, capturing group 3 (which attempts to match a suffix)
wouldn’t have a shot at participating in the match. The dot from group 1
would match until the end of the string, and since capturing group 3 is
optional, the regex engine would only backtrack enough to find a match
for group 2 before declaring success. Capturing group 2 can use a greedy
quantifier because its more restrictive character class only allows it
to match one name.
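The effect of the quantifier choice is easy to observe by running both variants side by side. In this sketch, greedyPattern is a deliberately altered copy of the recipe's regex (‹.+?› changed to ‹.+›) used only for comparison.

```javascript
const lazyPattern   = /^(.+?) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i;
const greedyPattern = /^(.+) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i;

const sample = "Robert Downey, Jr.";

// Lazy group 1 stops early, leaving the suffix for group 3.
console.log(sample.replace(lazyPattern, "$2, $1$3"));
// prints "Downey, Robert, Jr."

// Greedy group 1 swallows everything up to the last space, so group 2
// ends up matching "Jr." and the optional group 3 matches nothing.
console.log(sample.replace(greedyPattern, "$2, $1$3"));
// prints "Jr., Robert Downey,"
```

In the greedy case the suffix is mistaken for the last name, and the comma that belonged before the suffix is left dangling at the end of group 1.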
Table 4-2 shows some examples of how names are formatted using this regular expression and replacement string.
An added segment in the following regular expression allows you to output surname particles from a predefined list in front of the last name. Specifically, this regular expression accounts for the values “de”, “du”, “la”, “le”, “St”, “St.”, “Ste”, “Ste.”, “van”, and “von”. Any number of these values are allowed in sequence (for example, “de la”):
^(.+?)●((?:(?:d[eu]|l[ae]|Ste?\.?|v[ao]n)●)*[^\s,]+)(,?●(?:[JS]r\.?|III?|IV))?$
| Regex options: Case insensitive |
| Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
$2,●$1$3
| Replacement text flavors: .NET, Java, JavaScript, Perl, PHP |
\2,●\1\3
| Replacement text flavors: Python, Ruby |
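A JavaScript version of this variation might look like the following sketch. The function name formatNameWithParticles is made up for illustration; the regex and replacement text are the ones shown above, with each ● written as an ordinary space.

```javascript
function formatNameWithParticles(name) {
  return name.replace(
    /^(.+?) ((?:(?:d[eu]|l[ae]|Ste?\.?|v[ao]n) )*[^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i,
    "$2, $1$3"
  );
}

// Listed particles now travel with the last name in group 2.
console.log(formatNameWithParticles("Charles de Gaulle"));
// prints "de Gaulle, Charles"
console.log(formatNameWithParticles("Oscar de la Renta"));
// prints "de la Renta, Oscar"

// Names without a listed particle behave as before.
console.log(formatNameWithParticles("John Smith III"));
// prints "Smith, John III"
```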
Techniques used in the regular expressions and replacement text in this recipe are discussed in Chapter 2. Recipe 2.1 explains which special characters need to be escaped. Recipe 2.3 explains character classes. Recipe 2.4 explains that the dot matches any character. Recipe 2.5 explains anchors. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.13 explains how greedy and lazy quantifiers backtrack. Recipe 2.21 explains how to insert text matched by capturing groups into the replacement text.