Regular Expressions Cookbook, 2nd Edition
by Steven Levithan
Published by O'Reilly Media, Inc., 2012
[0-9](?=(?:[0-9]{3})+(?![0-9]))
| Regex options: None |
| Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
Although this regular expression works equally well with all of the flavors covered by this book, the accompanying replacement text is decidedly less portable.
Replacement:
$&,
| Replacement text flavors: .NET, JavaScript, Perl |
$0,
| Replacement text flavors: .NET, Java, XRegExp, PHP |
\0,
| Replacement text flavors: PHP, Ruby |
\&,
| Replacement text flavor: Ruby |
\g<0>,
| Replacement text flavor: Python |
These replacement strings all reinsert the matched text using backreference zero (the entire match, which in this case is a single digit), followed by a comma. When programming, you can implement this regular expression search-and-replace as explained in Recipe 3.15.
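As a minimal sketch of this search-and-replace, assuming Python's standard re module and a helper name (add_commas) invented here for illustration, the \g<0> form of the replacement text can be used like this:

```python
import re

# Basic regex from this recipe: one digit followed by exact sets of three digits.
DIGIT_BEFORE_GROUPS = re.compile(r"[0-9](?=(?:[0-9]{3})+(?![0-9]))")

def add_commas(text):
    """Hypothetical helper: put each matched digit back, followed by a comma."""
    return DIGIT_BEFORE_GROUPS.sub(r"\g<0>,", text)

print(add_commas("12345678"))     # 12,345,678
print(add_commas("42 and 1000"))  # 42 and 1,000
```

Numbers with fewer than four digits are left alone because no digit in them is followed by a complete set of three.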
(?<=[0-9])(?=(?:[0-9]{3})+(?![0-9]))
| Regex options: None |
| Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9 |
Replacement:
,
| Replacement text flavors: .NET, Java, Perl, PHP, Python, Ruby |
Recipe 3.14 explains how you can implement this basic regular expression search-and-replace when programming.
This version doesn’t work with JavaScript or Ruby 1.8, because they don’t support any type of lookbehind. This time around, however, we need only one version of the replacement text because we’re simply using a comma without any backreference as the replacement.
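A sketch of the lookbehind variant, again assuming Python's re module (which supports fixed-length lookbehind) and an illustrative function name. Because the regex matches positions rather than characters, the replacement is a bare comma:

```python
import re

# Zero-width match: a digit on the left, exact sets of three digits on the right.
COMMA_POSITION = re.compile(r"(?<=[0-9])(?=(?:[0-9]{3})+(?![0-9]))")

def insert_commas(text):
    """Hypothetical helper: insert a comma at every matched position."""
    return COMMA_POSITION.sub(",", text)

print(insert_commas("Population: 7654321"))  # Population: 7,654,321
```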
Adding thousand separators to numbers in your documents, data, and program output is a simple but effective way to improve their readability and appearance.
Some of the programming languages covered by this book provide built-in methods to add locale-aware thousand separators to numbers. For instance, in Python you can use locale.format_string('%d', 1000000, grouping=True) to convert the number 1000000 to the string '1,000,000', assuming you’ve previously set your program to use a locale that uses commas as the thousand separator. (Older Python versions also offered locale.format('%d', 1000000, True), which was removed in Python 3.12.) For other locales, the number might be separated using dots, underscores, apostrophes, or spaces.
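The following sketch contrasts the locale-aware call with Python's locale-independent format specifier. The function names are invented for illustration, and which locales are installed depends on the system, so the locale-aware helper returns None when the requested locale is unavailable:

```python
import locale

def group_locale_aware(number, locale_name=""):
    """Hypothetical helper: group digits using the given locale's separator.

    An empty locale_name asks for the user's default locale. Returns None
    if the locale cannot be set on this system.
    """
    try:
        locale.setlocale(locale.LC_NUMERIC, locale_name)
    except locale.Error:
        return None
    return locale.format_string("%d", number, grouping=True)

def group_with_commas(number):
    """Hypothetical helper: always use commas, regardless of locale."""
    return format(number, ",d")

print(group_with_commas(1000000))  # 1,000,000
```

The output of group_locale_aware varies with the active locale, which is exactly why a fixed-comma approach can be preferable in some settings.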
However, locale-aware processing is not always available, reliable, or appropriate. In the finance world, for example, using commas as thousand separators is the norm, regardless of location. Internationalization might not be a relevant issue to begin with when working in a text editor rather than programming. For these reasons, and for simplicity, in this recipe we’ve assumed you always want to use commas as the thousand separator. In the upcoming section, we’ve also assumed you want to use dots as decimal points. If you need to use other characters, feel free to swap them in.
Although adding thousand separators to all numbers in a file or string can improve the presentation of your data, it’s important to understand what kind of content you’re dealing with before doing so. For instance, you probably don’t want to add commas to IDs, four-digit years, and ZIP codes. Documents and data that include these kinds of numbers might not be good candidates for automated comma insertion.
This regular expression matches any single digit that has digits
on the right in exact sets of three. It therefore matches twice in the
string 12345678, finding the digits 2 and 5. All the other digits
are not followed by an exact multiple of three digits.
The accompanying replacement text puts back the matched digit
using backreference zero (the entire match), and follows it with a
comma. That leaves us with 12,345,678. Voilà!
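To see which digits are matched, here is a small sketch (assuming Python's re module) that lists each match in 12345678 along with its zero-based position:

```python
import re

pattern = re.compile(r"[0-9](?=(?:[0-9]{3})+(?![0-9]))")

# Each match is a single digit; record where it was found and what it was.
matches = [(m.start(), m.group()) for m in pattern.finditer("12345678")]
print(matches)  # [(1, '2'), (4, '5')]
```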
To explain how the regex determines which digits to match, we’ll
split it into two parts. The first part is the leading character class
‹[0-9]› that matches any
single digit. The second part is the positive lookahead ‹(?=(?:[0-9]{3})+(?![0-9]))› that
causes the match attempt to fail unless it’s at a position followed by
digits in exact sets of three. In other words, the lookahead ensures
that the regex matches only the digits that should be followed by a
comma. Recipe 2.16 explains how lookahead
works.
The ‹(?:[0-9]{3})+› within the lookahead matches
digits in sets of three. The negative lookahead ‹(?![0-9])› that follows is there
to ensure that no digits come immediately after the digits we matched
in sets of three. Otherwise, the outer positive lookahead would be
satisfied by any number of following digits, so long as there were at
least three.
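The effect of the negative lookahead is easiest to see by removing it. This sketch, assuming Python's re module, runs the same replacement with and without the ‹(?![0-9])› guard:

```python
import re

subject = "12345678"
with_guard = r"[0-9](?=(?:[0-9]{3})+(?![0-9]))"
without_guard = r"[0-9](?=(?:[0-9]{3})+)"  # negative lookahead removed

guarded = re.sub(with_guard, r"\g<0>,", subject)
unguarded = re.sub(without_guard, r"\g<0>,", subject)

print(guarded)    # 12,345,678
print(unguarded)  # 1,2,3,4,5,678
```

Without the guard, every digit that has at least three digits after it receives a comma.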
This adaptation of the previous regex doesn’t match any digits at all. Instead, it matches only the positions where we want to insert commas within numbers. These positions are wherever there are digits on the right in exact sets of three, and at least one digit on the left.
The lookahead used to search for sets of exactly three digits on
the right is the same as in the last regex. The difference here is
that, instead of starting the regex with ‹[0-9]› to match a digit, we assert that
there is at least one digit to the left by using the positive
lookbehind ‹(?<=[0-9])›. Without the lookbehind, the
regex would match the position to the left of 123, and the
search-and-replace would therefore convert it to ,123. Lookbehind is explained together with
lookahead in Recipe 2.16.
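A brief sketch of that failure mode, assuming Python's re module, compares the lookahead on its own with the full pattern:

```python
import re

lookahead_only = r"(?=(?:[0-9]{3})+(?![0-9]))"
with_lookbehind = r"(?<=[0-9])" + lookahead_only

leading_comma = re.sub(lookahead_only, ",", "123")
unchanged = re.sub(with_lookbehind, ",", "123")
six_digits = re.sub(with_lookbehind, ",", "123456")

print(leading_comma)  # ,123
print(unchanged)      # 123
print(six_digits)     # 123,456
```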
JavaScript and Ruby 1.8 don’t support lookbehind, so they cannot use this version of the regular expression.
The preceding regexes add commas to any sequence of four or more digits. A rather glaring issue with this basic approach is that it can add commas to digits that come after a dot as the decimal separator, so long as there are at least four digits after the dot. Following are two ways to fix this.
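The issue can be reproduced with a short sketch, assuming Python's re module and the basic regex from this recipe:

```python
import re

basic = re.compile(r"[0-9](?=(?:[0-9]{3})+(?![0-9]))")

pi_digits = basic.sub(r"\g<0>,", "3.14159265")
mixed = basic.sub(r"\g<0>,", "1234.5678")

print(pi_digits)  # 3.14,159,265
print(mixed)      # 1,234.5,678
```

The integer part is handled correctly, but the fractional digits are treated as if they were a separate number.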
The problem is easy to solve if you’re able to use an
infinite-length quantifier like ‹+› or at least a long finite-length quantifier
like ‹{1,100}› within
lookbehind.
Regular expression:
[0-9](?=(?:[0-9]{3})+(?![0-9]))(?<!\.[0-9]+)
| Regex options: None |
| Regex flavors: .NET |
[0-9](?=(?:[0-9]{3})+(?![0-9]))(?<!\.[0-9]{1,100})
| Regex options: None |
| Regex flavors: .NET, Java |
Replacement:
$0,
| Replacement text flavors: .NET, Java |
The first regex here works only in .NET, because of the
‹+› in
the lookbehind. The second regex works in both .NET and Java,
because Java supports any finite-length quantifier inside
lookbehind, even arbitrarily long interval quantifiers like ‹{1,100}›. The .NET-only version
therefore works correctly with any number, whereas the Java version
avoids adding commas to digits after a decimal point only when
there are 100 or fewer digits after the dot. You can bump up the
second number in the ‹{1,100}› quantifier if you want to support
even longer runs of digits to the right of a decimal separator.
With both regexes, we’ve put the new lookbehind at the end of the pattern. The regexes could be restructured to add the lookbehind at the front, as you might intuitively expect, but we’ve done it this way to optimize efficiency. Since the lookbehind is the slowest part of the regex, putting it at the end lets the regex fail more quickly at positions within the subject string where the lookbehind doesn’t need to be evaluated in order to rule out a match.
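Flavors other than .NET and Java generally reject these patterns outright. As a quick check, assuming Python's standard re module (which requires fixed-width lookbehind), the sketch below tries to compile both regexes; the compiles helper is invented for illustration:

```python
import re

candidates = [
    r"[0-9](?=(?:[0-9]{3})+(?![0-9]))(?<!\.[0-9]+)",        # .NET only
    r"[0-9](?=(?:[0-9]{3})+(?![0-9]))(?<!\.[0-9]{1,100})",  # .NET and Java
]

def compiles(pattern):
    """Hypothetical helper: report whether Python's re accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Python's re raises an error for variable-width lookbehind.
        return False

results = [compiles(p) for p in candidates]
print(results)  # [False, False]
```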
If you’re not working with .NET or Java and therefore can’t look as far back into the subject string as you want, you can still use fixed-length lookbehind to help match entire numbers that aren’t preceded by a dot. That lets you identify the numbers that qualify for having commas added (and correctly exclude any digits that come after a decimal point), but because it matches entire numbers, you can’t simply include a comma in the replacement string and be done with it.
Completing the solution requires using two regexes: an outer regex to match the numbers that should have commas added to them, and an inner regex that searches within the qualifying numbers as part of a search-and-replace that inserts the commas.
Outer regex:
\b(?<!\.)[0-9]{4,}
| Regex options: None |
| Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9 |
This matches any entire number with four or more digits that
is not preceded by a dot. The word boundary at the beginning of the
regex ensures that any matched numbers start at the beginning of the
string or are separate from other numbers and words. Otherwise, the
regex could match the 2345 from 0.12345. In other words, without the
word boundary, matches could start from the second digit after a
decimal point, since a dot is no longer the preceding character at
that point.
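The role of the word boundary can be checked with a short sketch, assuming Python's re module and a made-up sample string:

```python
import re

subject = "0.12345 and 98765"

with_boundary = re.findall(r"\b(?<!\.)[0-9]{4,}", subject)
without_boundary = re.findall(r"(?<!\.)[0-9]{4,}", subject)

print(with_boundary)     # ['98765']
print(without_boundary)  # ['2345', '98765']
```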
The inner regex and replacement text to go with this are the same as the Basic solution.
In order to apply the inner regex’s generated replacement values to each match of the outer regex, we need to replace matches of the outer regex with values generated in code, rather than using a simple string replacement. That way we can run the inner regex within the code that generates the outer regex’s replacement value. This may sound complicated, but the programming languages covered by this book all make it fairly straightforward.
Here’s the complete solution for Ruby 1.9:
subject.gsub(/\b(?<!\.)[0-9]{4,}/) {|match|
  match.gsub(/[0-9](?=(?:[0-9]{3})+(?![0-9]))/, '\0,')
}

The subject variable in this code holds the string to commafy. Ruby’s gsub string method performs a
global search-and-replace. For other programming languages, follow
Recipe 3.16, which explains how
to replace matches with replacements generated in code. It includes
examples that show this technique in action for each
language.
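For comparison, here is a rough Python equivalent of the Ruby code. It is a sketch rather than the book's own listing: the commafy function name is invented, and it relies on re.sub accepting a function that generates each replacement.

```python
import re

# Outer regex: whole numbers of four or more digits not preceded by a dot.
OUTER = re.compile(r"\b(?<!\.)[0-9]{4,}")
# Inner regex: the basic solution, applied only inside qualifying numbers.
INNER = re.compile(r"[0-9](?=(?:[0-9]{3})+(?![0-9]))")

def commafy(subject):
    """Hypothetical helper: add commas to integer parts, skip decimals."""
    return OUTER.sub(lambda m: INNER.sub(r"\g<0>,", m.group()), subject)

print(commafy("1234567.1234567"))  # 1,234,567.1234567
```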
The lack of lookbehind support in JavaScript and Ruby 1.8 prevents this solution from being fully portable, since we used lookbehind in the outer regex. We can work around this in JavaScript and Ruby 1.8 by including the character, if any, that precedes a number as part of the match, and requiring that it be something other than a digit or dot. We can then put the nondigit/nondot character back using a backreference in the generated replacement text.
Here’s the JavaScript code to pull this off:
subject.replace(/(^|[^0-9.])([0-9]{4,})/g, function($0, $1, $2) {
  return $1 + $2.replace(/[0-9](?=(?:[0-9]{3})+(?![0-9]))/g, "$&,");
});

Recipe 6.11 explains how to match numbers that already include commas within them.
All the other recipes in this chapter show more ways of matching different kinds of numbers with a regular expression.