Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Regular Expressions Cookbook, 2nd Edition by Steven Levithan Published by O'Reilly Media, Inc., 2012

Discussion

Removing leading and trailing whitespace is a simple but common task. The regular expressions just shown contain three parts each: the shorthand character class to match any whitespace character (‹\s›), a quantifier to repeat the class one or more times (‹+›), and an anchor to assert position at the beginning or end of the string. ‹\A› and ‹^› match at the beginning; ‹\Z› and ‹$› at the end.
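As a minimal sketch of how the three parts combine, the following JavaScript applies ‹^\s+› and ‹\s+$› to a sample string. The input and variable names are made up for illustration.

```javascript
// Illustrative input; the names here are assumptions, not part of the recipe.
const padded = "\t  Hello, world \n";

// ^ anchors at the start of the string; \s+ matches one or more whitespace characters.
const withoutLeading = padded.replace(/^\s+/, "");

// \s+ followed by $ matches the run of whitespace that ends the string.
const withoutTrailing = padded.replace(/\s+$/, "");

// Applying both substitutions in sequence trims both ends.
const trimmedBoth = padded.replace(/^\s+/, "").replace(/\s+$/, "");

console.log(JSON.stringify(withoutLeading));
console.log(JSON.stringify(withoutTrailing));
console.log(JSON.stringify(trimmedBoth));
```

The whitespace between "Hello," and "world" is untouched, because neither anchor can match in the middle of the string.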

We’ve included two options for matching both leading and trailing whitespace because of incompatibilities between Ruby and JavaScript. With the other regex flavors, you can choose either option. The versions with ‹^› and ‹$› don’t work correctly in Ruby, because Ruby always lets these anchors match at the beginning and end of any line. JavaScript doesn’t support the ‹\A› and ‹\Z› anchors.
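To see why line-based anchors are a problem for trimming, you can imitate that behavior in JavaScript by turning on the /m (multiline) flag. This is only an analogy written in JavaScript, not Ruby code, and the sample string is made up.

```javascript
// Illustrative two-line input with whitespace around each line.
const multiLine = "  first line  \n  second line  ";

// Without /m, ^ and $ match only at the very start and end of the string.
const stringAnchored = multiLine.replace(/^\s+/, "").replace(/\s+$/, "");

// With /m (plus /g to replace every match), ^ and $ also match at each line
// break, so whitespace around the interior line boundary is removed as well.
const lineAnchored = multiLine.replace(/^\s+/gm, "").replace(/\s+$/gm, "");

console.log(JSON.stringify(stringAnchored));
console.log(JSON.stringify(lineAnchored));
```

The first result keeps the spaces on either side of the line break; the second strips them, which is more than a trim function should do.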

Many programming languages provide a function, usually called trim or strip, that can remove leading and trailing whitespace for you. Table 5-2 shows how to use this built-in function or method in a variety of programming languages.

Table 5-2. Standard functions to remove leading and trailing whitespace
Language	Function
C#, VB.NET	`String.Trim([Chars])`
Java, JavaScript	`string.trim()`
PHP	`trim($string)`
Python, Ruby	`string.strip()`
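As a brief illustration of the JavaScript row, the built-in method returns a new string and leaves the original unchanged. The sample value is hypothetical.

```javascript
// Hypothetical user input with stray whitespace at both ends.
const rawInput = "   user@example.com \t";

// trim() returns a new string; strings in JavaScript are immutable.
const cleaned = rawInput.trim();

console.log(JSON.stringify(cleaned));
console.log(rawInput.length, cleaned.length);
```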

Perl does not have an equivalent function in its standard library, but you can create your own by using the regular expressions shown earlier in this recipe:

sub trim {
    my $string = shift;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    return $string;
}

JavaScript’s string.trim() method is a recent addition to the language. For older browsers (prior to Internet Explorer 9 and Firefox 3.5), you can add it like this:

// Add the trim method for browsers that don't already include it
if (!String.prototype.trim) {
    String.prototype.trim = function() {
        return this.replace(/^\s+/, "").replace(/\s+$/, "");
    };
}

Tip
In both Perl and JavaScript, ‹\s› matches any character defined as whitespace by the Unicode standard, in addition to the space, tab, line feed, and carriage return characters that are most commonly considered whitespace.
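A small JavaScript check of this behavior follows, using three Unicode space characters chosen only as examples. The ASCII-only character class is included purely for contrast.

```javascript
// U+00A0 no-break space, U+2003 em space, U+3000 ideographic space.
const unicodePadded = "\u00A0\u2003value\u3000";

// \s covers these Unicode whitespace characters in JavaScript.
const unicodeTrimmed = unicodePadded.replace(/^\s+/, "").replace(/\s+$/, "");

// An explicit ASCII-only class does not match them, so nothing is removed.
const asciiOnly = unicodePadded
  .replace(/^[ \t\r\n]+/, "")
  .replace(/[ \t\r\n]+$/, "");

console.log(JSON.stringify(unicodeTrimmed));
console.log(asciiOnly === unicodePadded);
```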

Variations

There are in fact many different ways you can write a regular expression to help you trim a string. However, the alternatives are usually slower than using two simple substitutions when working with long strings (when performance matters most). Following are some of the more common alternative solutions you might encounter. They are all written in JavaScript, and since standard JavaScript doesn’t have a “dot matches line breaks” option, the regular expressions use ‹[\s\S]› to match any single character, including line breaks. In other programming languages, use a dot instead, and enable the “dot matches line breaks” option.

string.replace(/^\s+|\s+$/g, "");

This is probably the most common solution. It combines the two simple regexes via alternation (see Recipe 2.8), and uses the /g (global) flag to replace all matches rather than just the first (it will match twice when its target contains both leading and trailing whitespace). This isn’t a terrible approach, but it’s slower than using two simple substitutions when working with long strings since the two alternation options need to be tested at every character position.
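The role of the /g flag can be seen in a short sketch (the sample string is illustrative). Without it, only the first match, the leading whitespace, is replaced.

```javascript
const sample = "  both ends  ";

// With /g, both the leading and the trailing match are replaced.
const globalResult = sample.replace(/^\s+|\s+$/g, "");

// Without /g, replace() stops after the first match (the leading whitespace).
const firstOnly = sample.replace(/^\s+|\s+$/, "");

// The two simple substitutions give the same result as the /g version.
const twoStep = sample.replace(/^\s+/, "").replace(/\s+$/, "");

console.log(JSON.stringify(globalResult));
console.log(JSON.stringify(firstOnly));
console.log(globalResult === twoStep);
```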

string.replace(/^\s*([\s\S]*?)\s*$/, "$1")

This regex works by matching the entire string and capturing the sequence from the first to the last nonwhitespace characters (if any) to backreference 1. By replacing the entire string with backreference 1, you’re left with a trimmed version of the string.

This approach is conceptually simple, but the lazy quantifier inside the capturing group makes the regex do a lot of extra work (i.e., backtracking), and therefore tends to make this option slow with long target strings.

Let’s step back to look at how this actually works. After the regex enters the capturing group, the ‹[\s\S]› class’s lazy ‹*?› quantifier requires that it be repeated as few times as possible. Thus, the regex matches one character at a time, stopping after each character to try to match the remaining ‹\s*$› pattern. If that fails because nonwhitespace characters remain somewhere after the current position in the string, the regex matches one more character, updates the backreference, and then tries the remainder of the pattern again.
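The process just described can be approximated with a small simulation. This is a simplified model written for illustration, not how any regex engine is actually implemented; the function name and the attempt counter are our own inventions.

```javascript
// Simplified model of /^\s*([\s\S]*?)\s*$/ for illustration only.
function simulateLazyTrim(str) {
  // ^\s* greedily consumes the leading whitespace.
  let start = 0;
  while (start < str.length && /\s/.test(str[start])) {
    start++;
  }

  let attempts = 0;
  // The lazy group starts out empty and grows by one character per failed attempt.
  for (let end = start; end <= str.length; end++) {
    attempts++;
    // Try the rest of the pattern, \s*$, from the current position.
    if (/^\s*$/.test(str.slice(end))) {
      return { trimmed: str.slice(start, end), attempts };
    }
  }
  // Not reached: the empty remainder at the end of the string always matches \s*$.
  return { trimmed: "", attempts };
}

const lazyShort = simulateLazyTrim("  ab c  ");
const lazyLong = simulateLazyTrim("x".repeat(1000) + " ");
console.log(lazyShort, lazyLong.attempts);
```

In this model the number of attempts grows with the length of the text being kept, which is why long strings are where the lazy version suffers.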

string.replace(/^\s*([\s\S]*\S)?\s*$/, "$1")

This is similar to the last regex, but it replaces the lazy quantifier with a greedy one for performance reasons. To make sure that the capturing group still only matches up to the last nonwhitespace character, a trailing ‹\S› is required. However, since the regex must be able to match whitespace-only strings, the entire capturing group is made optional by adding a trailing question mark quantifier.

Here, the greedy asterisk in ‹[\s\S]*› repeats its any-character pattern to the end of the string. The regex then backtracks one character at a time until it’s able to match the following ‹\S›, or until it backtracks to the first character matched within the group (after which it skips the group).

Unless there’s more trailing whitespace than other text, this generally ends up being faster than the previous solution that used a lazy quantifier. Still, it doesn’t hold up to the consistent performance of using two simple substitutions.
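A rough model of the greedy version's backtracking shows where its cost comes from. As before, this is a simplified sketch under our own assumptions (the function name and counter are hypothetical), not a description of a real engine's internals.

```javascript
// Simplified model of /^\s*([\s\S]*\S)?\s*$/ for illustration only.
function simulateGreedyTrim(str) {
  // ^\s* greedily consumes the leading whitespace.
  let start = 0;
  while (start < str.length && /\s/.test(str[start])) {
    start++;
  }

  // [\s\S]* first runs to the end of the string, then gives back one
  // character at a time until the character it gave back satisfies \S.
  let backtracks = 0;
  let pos = str.length;
  while (pos > start) {
    pos--;
    backtracks++;
    if (/\S/.test(str[pos])) {
      return { trimmed: str.slice(start, pos + 1), backtracks };
    }
  }
  // Whitespace-only or empty input: the optional group is skipped.
  return { trimmed: "", backtracks };
}

const mostlyText = simulateGreedyTrim("x".repeat(1000) + " ");
const mostlySpace = simulateGreedyTrim("text" + " ".repeat(1000));
console.log(mostlyText.backtracks, mostlySpace.backtracks);
```

In this model the work is proportional to the amount of trailing whitespace rather than to the length of the kept text, so the greedy version only loses out when the trailing whitespace is the longer part.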

string.replace(/^\s*(\S*(?:\s+\S+)*)\s*$/, "$1")

This is a relatively common approach, but there’s no good reason to use it since it’s consistently one of the slowest of the options shown here. It’s similar to the last two regexes in that it matches the entire string and replaces it with the part you want to keep, but because the inner, noncapturing group matches only one word at a time, there are a lot of discrete steps the regex must take. The performance hit may be unnoticeable when trimming short strings, but with long strings that contain many words, this regex can become a performance problem.

Some regular expression implementations contain clever optimizations that alter the internal matching processes described here, and therefore make some of these options perform a bit better or worse than we’ve suggested. Nevertheless, the simplicity of using two substitutions provides consistently respectable performance with different string lengths and varying string contents, and it’s therefore the best all-around solution.
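Whatever their speed, all of the variations in this section should return the same result. The sketch below checks that on a handful of made-up inputs, using the built-in trim() as the reference; the object and variable names are ours.

```javascript
// The five approaches discussed in this recipe, keyed by descriptive (invented) names.
const trimVariants = {
  twoSubstitutions: (s) => s.replace(/^\s+/, "").replace(/\s+$/, ""),
  alternation: (s) => s.replace(/^\s+|\s+$/g, ""),
  lazyCapture: (s) => s.replace(/^\s*([\s\S]*?)\s*$/, "$1"),
  greedyCapture: (s) => s.replace(/^\s*([\s\S]*\S)?\s*$/, "$1"),
  wordAtATime: (s) => s.replace(/^\s*(\S*(?:\s+\S+)*)\s*$/, "$1"),
};

// Illustrative inputs: empty, whitespace-only, no padding, and multiline.
const sampleInputs = ["", "   ", "word", "  two words\t", "\n line one\n line two \n"];

// Collect any variant whose output differs from the built-in trim().
const disagreements = [];
for (const input of sampleInputs) {
  const expected = input.trim();
  for (const [name, trimFn] of Object.entries(trimVariants)) {
    if (trimFn(input) !== expected) {
      disagreements.push({ name, input });
    }
  }
}

console.log(disagreements.length === 0 ? "all variants agree" : disagreements);
```

A check like this confirms only correctness. To compare speed, you would need to time each variant on long strings in the engine you actually target.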

