Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Regular Expressions Cookbook, 2nd Edition by Steven Levithan Published by O'Reilly Media, Inc., 2012

Discussion

Seven of the most commonly used ASCII control characters have dedicated escape sequences. These all consist of a backslash followed by a letter. This is the same syntax that is used by string literals in many programming languages. Table 2-1 shows the common nonprinting characters and how they are represented.

Table 2-1. Nonprinting characters
Representation	Meaning	Hexadecimal representation	Regex flavors
‹\a›	bell	0x07	.NET, Java, PCRE, Perl, Python, Ruby
‹\e›	escape	0x1B	.NET, Java, PCRE, Perl, Ruby
‹\f›	form feed	0x0C	.NET, Java, JavaScript, PCRE, Perl, Python, Ruby
‹\n›	line feed (newline)	0x0A	.NET, Java, JavaScript, PCRE, Perl, Python, Ruby
‹\r›	carriage return	0x0D	.NET, Java, JavaScript, PCRE, Perl, Python, Ruby
‹\t›	horizontal tab	0x09	.NET, Java, JavaScript, PCRE, Perl, Python, Ruby
‹\v›	vertical tab	0x0B	.NET, Java, JavaScript, Python, Ruby

In Perl 5.10 and later, and PCRE 7.2 and later, ‹\v› does match the vertical tab, but not only the vertical tab. In these flavors ‹\v› matches all vertical whitespace, which includes the vertical tab, line breaks, and the Unicode line and paragraph separators. So for Perl and PCRE we have to use a different syntax to match the vertical tab alone.

JavaScript does not support ‹\a› or ‹\e›, so for JavaScript too we need a separate solution.

These control characters, as well as the alternative syntax shown in the following section, can be used equally inside and outside character classes in your regular expression.
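
As a rough illustration of Table 2-1 in a single flavor, the following Python sketch matches the seven control characters with the standard `re` module. Python has no ‹\e› token, so the escape character is written as ‹\x1B› here. The variable names are our own and are not part of the recipe. The same tokens are used once as a sequence and once inside a character class.

```python
import re

# The seven control characters from Table 2-1, in table order:
# bell, escape, form feed, line feed, carriage return, tab, vertical tab.
subject = "\x07\x1B\x0C\x0A\x0D\x09\x0B"

# Outside a character class: match the seven characters in sequence.
# Python's re module has no \e token, so escape is written as \x1B.
sequence = re.compile(r"\a\x1B\f\n\r\t\v")

# Inside a character class: match any one of the seven characters.
any_one = re.compile(r"[\a\x1B\f\n\r\t\v]")

sequence_matches = sequence.fullmatch(subject) is not None
class_hits = any_one.findall(subject)

print(sequence_matches)  # whether the whole subject matched the sequence
print(len(class_hits))   # how many single characters the class matched
```

Raw string literals (`r"..."`) are used so that the backslashes reach the regex engine unchanged instead of being interpreted by the Python string parser first.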

Variations on Representations of Nonprinting Characters

The 26 control characters

Here’s another way to match the same seven ASCII control characters matched by the regexes earlier in this recipe:

\cG\x1B\cL\cJ\cM\cI\cK
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Ruby 1.9

Using ‹\cA› through ‹\cZ›, you can match one of the 26 control characters that occupy positions 1 through 26 in the ASCII table. The c must be lowercase. The letter that follows the c is case insensitive in most flavors. We recommend that you always use an uppercase letter. Java requires this.

This syntax can be handy if you’re used to entering control characters on console systems by pressing the Control key along with a letter. On a terminal, Ctrl-H sends a backspace. In a regex, ‹\cH› matches a backspace.

Python and the classic Ruby engine in Ruby 1.8 do not support this syntax. The Oniguruma engine in Ruby 1.9 does.

The escape control character, at position 27 in the ASCII table, is beyond the reach of the English alphabet, so we leave it as ‹\x1B› in our regular expression.
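
Python is one of the flavors without ‹\cA› through ‹\cZ›, but the arithmetic behind the syntax is easy to reproduce. In the sketch below, `control_char` and `control_pattern` are hypothetical helper names of our own, not part of any library. The first maps a letter to its position in the alphabet, and the second builds an equivalent pattern from two-digit hexadecimal escapes.

```python
import re
import string

def control_char(letter: str) -> str:
    # The letter's position in the alphabet is the code point:
    # A is 1, B is 2, and so on up to Z, which is 26.
    if len(letter) != 1 or letter not in string.ascii_letters:
        raise ValueError("expected a single letter A-Z")
    return chr(ord(letter.upper()) - ord("A") + 1)

def control_pattern(letters: str) -> str:
    # Emit one \xHH token per letter, using uppercase hex digits.
    return "".join(f"\\x{ord(control_char(c)):02X}" for c in letters)

# Stand-in for the control-letter tokens G, L, J, M, I, K from the regex above.
pattern = control_pattern("GLJMIK")
regex = re.compile(pattern)

print(pattern)
print(regex.fullmatch("\a\f\n\r\t\v") is not None)
```

The escape character is left out of the helper's input for the same reason it is written as ‹\x1B› in the regex: position 27 has no corresponding letter.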

The 7-bit character set

Following is yet another way to match our list of seven commonly used control characters:

\x07\x1B\x0C\x0A\x0D\x09\x0B
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

A lowercase ‹\x› followed by two uppercase hexadecimal digits matches a single character in the ASCII set. Figure 2-1 shows which hexadecimal combinations from ‹\x00› through ‹\x7F› match each character in the entire ASCII character set. The table is arranged with the first hexadecimal digit going down the left side and the second digit going across the top.

Figure 2-1. ASCII table

The characters that ‹\x80› through ‹\xFF› match depend on how your regex engine interprets them, and on which code page your subject text is encoded in. We recommend that you not use ‹\x80› through ‹\xFF›. Instead, use the Unicode code point token described in Recipe 2.7.

Caution

If you’re using Ruby 1.8 or you compiled PCRE without UTF-8 support, you cannot use Unicode code points. Ruby 1.8 and PCRE without UTF-8 are 8-bit regex engines. They are completely ignorant about text encodings and multibyte characters. ‹\xAA› in these engines simply matches the byte 0xAA, regardless of which character 0xAA happens to represent or whether 0xAA is part of a multibyte character.
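
The difference between byte-oriented and character-oriented matching can be observed in Python, whose `re` module accepts both `bytes` and `str` patterns. This is only an analogy for the engines named in the caution, not a description of how Ruby 1.8 or PCRE work internally. In this sketch, a `bytes` pattern treats ‹\xAA› as the byte 0xAA, while a `str` pattern treats it as the code point U+00AA.

```python
import re

text = "\u00EA"              # the letter e with circumflex, U+00EA
utf8 = text.encode("utf-8")  # encoded as the two bytes C3 AA

# Byte-oriented matching: \xAA finds the second byte of the two-byte
# sequence, even though that byte is only part of a character.
byte_hit = re.search(rb"\xAA", utf8)

# Character-oriented matching: \xAA means U+00AA, which the text lacks.
char_hit = re.search(r"\xAA", text)

print(byte_hit is not None)  # did the bytes pattern match?
print(char_hit is not None)  # did the str pattern match?
```

The byte-level match lands in the middle of a multibyte character, which is the kind of surprise the recommendation against ‹\x80› through ‹\xFF› is meant to avoid.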
