Chapter 3. Regular Expressions

Regular expressions (regex) are a powerful method for describing a text pattern to be matched by various tools. There is only one place in bash where regular expressions are valid, using the =~ comparison in the [[ compound command, as in an if statement. However, regular expressions are a crucial part of the larger toolkit for commands like grep, awk, and sed in particular. They are very powerful and thus worth knowing. Once mastered, you’ll wonder how you ever got along without them.
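
As a minimal sketch of the =~ comparison mentioned above, the following bash fragment tests a string against a pattern and pulls out the parts that matched. The variable names and the sample string are ours, chosen only for illustration; BASH_REMATCH is the array bash fills in after a successful match.

```shell
# Requires bash: [[ ... =~ ... ]] is not available in plain sh.
# Sample input, chosen only for illustration.
line='3    And be one traveler, long I stood'

# Keep the pattern in a variable and leave it unquoted to the right of =~
# so that bash treats it as an extended regular expression.
pattern='^([0-9]+) +(.*)$'

if [[ $line =~ $pattern ]]; then
    number=${BASH_REMATCH[1]}   # text matched by the first set of parentheses
    text=${BASH_REMATCH[2]}     # text matched by the second set
    echo "line $number: $text"  # line 3: And be one traveler, long I stood
fi
```

Quoting the right-hand side of =~ would make bash compare it as a literal string, which is why the pattern is stored in a variable first.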

For many of the examples in this chapter we will be using the file frost.txt with its seven, yes seven, lines of text.

Example 3-1. frost.txt
1    Two roads diverged in a yellow wood,
2    And sorry I could not travel both
3    And be one traveler, long I stood
4    And looked down one as far as I could
5    To where it bent in the undergrowth;
6
7 Excerpt from The Road Not Taken by Robert Frost

The content of frost.txt will be used to demonstrate the power of regular expressions to process text data. This text was chosen because it requires no prior knowledge to understand.

Commands in Use

We introduce the grep family of commands to demonstrate the basic regex patterns.

grep

The grep command searches the content of the files for a given pattern and prints any line where the pattern is matched. To use grep, you need to provide it with a pattern and one or more filenames (or piped data).

Common Options

-c

Count the number of lines that match the pattern.

-E

Enable extended regular expressions.

-f

Read the search pattern from a provided file. A file can contain more than one pattern, with each line containing a single pattern.

-i

Ignore character case.

-l

Only print the file name and path where the pattern was found.

-n

Print the line number of the file where the pattern was found.

-P

Enables the Perl regular expression engine.

-R, -r

Recursively search sub-directories.

Command Example

The general form of a grep command is: grep options pattern filenames

To search the /home directory and all sub-directories for files containing the word password irrespective of uppercase/lowercase distinctions:

grep -R -i 'password' /home
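
To see several of the options listed above side by side, here is a small sketch that builds its own scratch directory rather than searching /home. The file names and their contents are made up for illustration.

```shell
# Scratch directory with two small files; the names and contents are
# made up for illustration and are not from the text.
workdir=$(mktemp -d)
printf '%s\n' 'user=alice' 'Password=hunter2' 'note: reset password monthly' > "$workdir/app.conf"
printf '%s\n' 'nothing to see here' > "$workdir/notes.txt"

# -c counts matching lines; with -i, lines 2 and 3 of app.conf both match.
count=$(grep -c -i 'password' "$workdir/app.conf")

# -n prefixes each matching line with its line number.
# Without -i, only the lowercase spelling on line 3 matches.
numbered=$(grep -n 'password' "$workdir/app.conf")

# -R searches the directory tree; -l prints only the names of matching files.
files=$(grep -R -l -i 'password' "$workdir")

echo "$count"      # 2
echo "$numbered"   # 3:note: reset password monthly
echo "$files"      # the path to app.conf only

rm -r "$workdir"
```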

grep and egrep

The grep command supports some variations, notably an extended syntax for the regex patterns (we’ll discuss the regex patterns next). There are three different ways to tell grep that you want special meaning on certain characters: 1) precede those characters with a backslash; 2) use the -E option when you invoke grep, which enables the special syntax without the need for backslashes; or 3) use the command named egrep, which is just a script that invokes grep as grep -E so you don’t have to.

The only characters that are affected by the extended syntax are: ? + { | ( and ). In the examples that follow we will use grep and egrep interchangeably - they are the same binary underneath. We will choose the one to use that seems most appropriate based on what special characters we need. The special, or meta-, characters are what make grep so powerful. Here is what you need to know about the most powerful and frequently used metacharacters.

Regular Expression Metacharacters

Regular expressions are patterns that are created using a series of characters and metacharacters. Metacharacters such as "?" and "*" have special meaning beyond their literal meaning in regex.

The “.” Metacharacter

In regex, the “.” represents a single wildcard character. It will match on any single character except for a newline. As can be seen in the example below, if we try to match on the pattern T.o the first line of the frost.txt file is returned because it contains the word Two.

$ grep 'T.o' frost.txt

1    Two roads diverged in a yellow wood,

Note that line 5 is not returned even though it contains the word To. This pattern allows any character to appear between the T and o, but as written there must be a character in between. Regex patterns are also case sensitive, which is why line 3 of the file was not returned even though it contains the string too. If you want to treat "." as a period character rather than a wildcard, precede it with a backslash "\." to escape its special meaning.

The “?” Metacharacter

In regex, the “?” character makes any item that precedes it optional; it matches it zero or one time. By adding this metacharacter to the previous example we can see that the output is different.

$ egrep 'T.?o' frost.txt

1    Two roads diverged in a yellow wood,
5    To where it bent in the undergrowth;

This time we see that both lines 1 and 5 are returned. This is because the metacharacter "." is optional due to the "?" metacharacter that follows it. This pattern will match on any three-character sequence that begins with T and ends with o as well as the two-character sequence To.

Notice that we are using egrep here. We could have used grep -E or we could have used “plain” grep with a slightly different pattern: T.\?o putting the backslash on the question mark to give it the extended meaning.
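
The equivalence just described can be checked with a short sketch. It runs the same search three ways against two lines taken from frost.txt; the variable names are ours, and the backslash form assumes GNU grep, which accepts \? in its basic syntax.

```shell
# Two sample lines from frost.txt, held in a variable for illustration.
lines=$(printf '%s\n' '1    Two roads diverged in a yellow wood,' '5    To where it bent in the undergrowth;')

# Extended syntax: ? is special without a backslash.
with_E=$(printf '%s\n' "$lines" | grep -E -c 'T.?o')

# Basic syntax: the backslash gives ? its special meaning (GNU grep).
with_backslash=$(printf '%s\n' "$lines" | grep -c 'T.\?o')

# Basic syntax without the backslash: ? is an ordinary character, so the
# pattern looks for a literal question mark and finds none.
# grep exits nonzero when nothing matches, hence the "|| true".
literal=$(printf '%s\n' "$lines" | grep -c 'T.?o' || true)

echo "$with_E $with_backslash $literal"   # 2 2 0
```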

The “*” Metacharacter

In regex, the "*" is a special character that matches the preceding item zero or more times. It is similar to the "?", the main difference being that the previous item may appear more than once.

$ grep 'T.*o' frost.txt

1    Two roads diverged in a yellow wood,
5    To where it bent in the undergrowth;
7 Excerpt from The Road Not Taken by Robert Frost

The ".*" in the pattern above allows any number of any character to appear in between the T and o. Thus the last line also matches because it contains the pattern The Ro.

The “+” Metacharacter

The "+" metacharacter is the same as the "*" except it requires the preceding item to appear at least once. In other words it matches the preceding item one or more times.

$ egrep 'T.+o' frost.txt

1    Two roads diverged in a yellow wood,
5    To where it bent in the undergrowth;
7 Excerpt from The Road Not Taken by Robert Frost

The pattern above specifies one or more of any character to appear in between the T and o. The first line of text matches because of Two: the w is one character between the T and the o. The second line of output (line 5 of the file) does not match on To alone, as it did in the previous example; rather, the pattern matches a much larger string, all the way to the o in undergrowth. The last line also matches because it contains the pattern The Ro.
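
To see exactly which part of a line a pattern matched, grep's -o option (not covered in the option list earlier) prints only the matching text. The sketch below uses it on line 5 of frost.txt and then contrasts "*" with "+" on the bare string To; the variable names are ours.

```shell
line='5    To where it bent in the undergrowth;'

# -o prints only the part of the line that matched.
# .+ needs at least one character after the T, and it matches as much
# as it can, so the match runs to the last o on the line.
plus_match=$(printf '%s\n' "$line" | grep -E -o 'T.+o')

# A lone "To" satisfies T.*o (zero characters in between) but not T.+o.
# grep exits nonzero when nothing matches, hence the "|| true".
star_count=$(printf 'To\n' | grep -c 'T.*o')
plus_count=$(printf 'To\n' | grep -E -c 'T.+o' || true)

echo "$plus_match"               # To where it bent in the undergro
echo "$star_count $plus_count"   # 1 0
```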

Grouping

We can use parentheses to group together characters. Among other things, this allows us to treat the characters appearing inside the parentheses as a single item which we can later reference.

$ egrep 'And be one (stranger|traveler), long I stood' frost.txt

3    And be one traveler, long I stood

In the example above we use parentheses and the Boolean OR operator "|" to create a pattern that will match on line 3. Line 3 as written has the word traveler in it, but this pattern would match even if traveler were replaced by the word stranger.
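
The following sketch tries the grouped pattern against three variations of line 3 (the variations are invented for illustration) and then shows why the parentheses matter: without them, the "|" splits the entire pattern into two alternatives.

```shell
# Three variations of line 3; only the first is from frost.txt.
grouped=$(printf '%s\n' \
    'And be one traveler, long I stood' \
    'And be one stranger, long I stood' \
    'And be one tourist, long I stood' |
    grep -E -c 'And be one (stranger|traveler), long I stood')

# A fragment that lacks the leading "And be one".
fragment='traveler, long I stood'

# With the group, the whole sentence is required, so the fragment fails.
# grep exits nonzero when nothing matches, hence the "|| true".
with_group=$(printf '%s\n' "$fragment" |
    grep -E -c 'And be one (stranger|traveler), long I stood' || true)

# Without the group, the pattern means "And be one stranger" OR
# "traveler, long I stood", and the fragment satisfies the second half.
without_group=$(printf '%s\n' "$fragment" |
    grep -E -c 'And be one stranger|traveler, long I stood')

echo "$grouped $with_group $without_group"   # 2 0 1
```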

Brackets and Character Classes

In regex the square brackets, [ ], are used to define character classes and lists of acceptable characters. Using this construct you can list exactly which characters are matched at this position in the pattern. This is particularly useful when trying to perform user input validation. As a shorthand you can specify ranges with a dash such as [a-j]. These ranges are in your locale’s collating sequence and alphabet. For the C locale, the pattern [a-j] will match one of the letters a through j. Table 3-1 provides a list of common examples when using character classes and ranges.

Table 3-1. Regex character ranges
Example Meaning

[abc]

Match only the character a or b or c

[1-5]

Match on digits in the range 1 to 5

[a-zA-Z]

Match any lowercase or uppercase a to z

[0-9+*/-]

Match on digits or these 4 mathematical symbols (the dash goes last so that it is not read as a range)

[0-9a-fA-F]

Match a hexadecimal digit

Warning

Be careful when defining a range for digits; the range can at most go from 0 to 9. For example, the pattern [1-475] does not match on numbers between 1 and 475, it matches on any one of the digits (characters) in the range 1-4 or the character 7 or the character 5.
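
As a rough illustration of using bracket lists for input validation, the sketch below counts how many sample inputs pass two checks. The input values are made up, the ^ and $ anchors (covered later in this chapter) tie each pattern to the whole line, and LC_ALL=C is set on the assumption that you want plain ASCII ordering for the ranges.

```shell
# LC_ALL=C pins the locale so that ranges follow plain ASCII order.
# All input values are made up for illustration.

# [1-475] is one character: 1 through 4, or 7, or 5.
single=$(printf '%s\n' 3 5 6 7 475 | LC_ALL=C grep -c '^[1-475]$')

# One or more hexadecimal digits and nothing else on the line.
hex=$(printf '%s\n' 'deadBEEF' '12xyz' | LC_ALL=C grep -c '^[0-9a-fA-F][0-9a-fA-F]*$')

echo "$single $hex"   # 3 1
```

Of the five inputs to the first check, only 3, 5, and 7 pass; 6 is not in the list, and 475 is three characters rather than one.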

There are also predefined character classes known as shortcuts. These can be used to indicate common character classes such as numbers or letters. See Table 3-2 for a list of shortcuts.

Table 3-2. Regex shortcuts
Shortcut Meaning

\s

Whitespace

\S

Not Whitespace

\d

Digit

\D

Not Digit

\w

Word

\W

Not Word

\xHH

The character with hexadecimal code HH (e.g. \x5F matches an underscore)

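Note that egrep does not support all of the shortcuts above (GNU grep accepts \s and \w in its basic and extended syntax, for example, but not \d). To use them all, you must use grep with the -P option, which enables the Perl regular expression engine. For example, to find the lines in frost.txt that contain a digit (every line does, because each one begins with its line number):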

$ grep -P '\d' frost.txt

1    Two roads diverged in a yellow wood,
2    And sorry I could not travel both
3    And be one traveler, long I stood
4    And looked down one as far as I could
5    To where it bent in the undergrowth;
6
7 Excerpt from The Road Not Taken by Robert Frost

There are other character classes (with a more verbose syntax) that are valid only within the bracket syntax, as seen in Table 3-3. They match a single character, so if you need to match many in a row, use the star or plus to get the repetition you need.

Table 3-3. Regex character classes in brackets
Character Class Meaning

[:alnum:]

any alphanumeric character

[:alpha:]

any alphabetic character

[:cntrl:]

any control character

[:digit:]

any digit

[:graph:]

any graphical character

[:lower:]

any lowercase character

[:print:]

any printable character

[:punct:]

any punctuation

[:space:]

any whitespace

[:upper:]

any uppercase character

[:xdigit:]

any hex digit

To use one of these classes it has to be inside the brackets, so you end up with two sets of brackets. For example: grep '[[:cntrl:]]' large.data will look for lines containing control characters (ASCII 0-31 and 127). Here is another example:

grep 'X[[:upper:][:digit:]]' idlist.txt

will match any line with an X followed by any uppercase letter or digit. It would match these lines:

User: XTjohnson
an XWing model 7
an X7wing model

They each have an uppercase X followed immediately by either another uppercase letter or by a digit.
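
The sketch below runs that pattern against the three lines above plus two invented lines that should not match. It also uses -v, which is not in the option list earlier; it inverts the search and prints the lines that do not match.

```shell
# The first three lines are from the text; the last two are invented
# examples that should not match.
ids=$(printf '%s\n' 'User: XTjohnson' 'an XWing model 7' 'an X7wing model' \
    'an Xwing model' 'the x7 prototype')

# Count lines with an X followed by an uppercase letter or a digit.
matched=$(printf '%s\n' "$ids" | grep -c 'X[[:upper:][:digit:]]')

# -v inverts the search and prints the lines that do NOT match.
missed=$(printf '%s\n' "$ids" | grep -v 'X[[:upper:][:digit:]]')

echo "$matched"   # 3
echo "$missed"    # the two invented lines
```

The line an Xwing model fails because the X is followed by a lowercase letter, and the x7 prototype fails because the x itself is lowercase.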

Back References

Regex back references are one of the most powerful and often confusing regex operations. Consider the following file, tags.txt:

1    Command
2    <i>line</i>
3    is
4    <div>great</div>
5    <u>!</u>

Suppose you want to write a regular expression that will extract any line that contains a matching pair of complete HTML tags. The start tag has an HTML tag name; the ending tag has the same tag name but with a leading slash. <div> and </div> are a matching pair. You could search for these by writing a lengthy regex that contains all possible HTML tag values, or you can focus on the format of an HTML tag and use a regex back reference.

$ egrep '<([A-Za-z]*)>.*</\1>' tags.txt

2    <i>line</i>
4    <div>great</div>
5    <u>!</u>

In this example, the back reference is the \1 appearing in the latter part of the regular expression. It is referring back to the expression enclosed in the first set of parentheses, [A-Za-z]*, which has two parts. The letter range in brackets denotes a choice of any letter, uppercase or lowercase. The asterisk (or star) that follows it means to repeat that zero or more times. Therefore the \1 refers to whatever was matched by that pattern in parentheses. If [A-Za-z]* matches div then the \1 also refers to the pattern div.

The overall regular expression, then, can be described as matching a < sign (that literal character is the first one in the regex) followed by zero or more letters, then a > sign, and then zero or more of any character (“.” for any character, “*” for zero or more of the previous item) followed by another < and a slash, then the sequence matched by the expression within the parentheses, and finally a > character. If this sequence matches any part of a line from our text file then egrep will print that line out.

You can have more than one back reference in an expression and refer to each with a \1 or \2 or \3 depending on its order in the regular expression. A \1 refers to the first set of parentheses, \2 to the second, and so on. Note that in the extended syntax the parentheses are metacharacters: they have a special meaning. If you just want to match a literal parenthesis you need to escape its special meaning by preceding it with a backslash, as in: sin\([0-9.]*\) to match expressions like: sin(6.2) or sin(3.14159).
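
Here is a short sketch of these three ideas: a single back reference, two back references in one pattern, and escaped literal parentheses. The mismatched tag line and the ab-ba strings are invented for illustration, and the patterns assume GNU grep, which supports back references in the extended syntax.

```shell
# Two lines from tags.txt plus one invented line whose tags do not pair up.
tags=$(printf '%s\n' '<i>line</i>' '<div>great</div>' '<b>oops</i>')

# \1 must repeat whatever the first group matched, so <b>...</i> fails.
paired=$(printf '%s\n' "$tags" | grep -E -c '<([A-Za-z]*)>.*</\1>')

# Two groups, two back references: \1 is the first group, \2 the second.
swapped=$(printf '%s\n' 'ab-ba' 'ab-ab' | grep -E '^(a)(b)-\2\1$')

# Escaped parentheses match literally in the extended syntax.
sines=$(printf '%s\n' 'sin(6.2)' 'sin 6.2' | grep -E -c 'sin\([0-9.]*\)')

echo "$paired"    # 2
echo "$swapped"   # ab-ba
echo "$sines"     # 1
```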

Note

Valid HTML doesn’t have to be all on one line; the end tag can be several lines away from the start tag. Moreover, some tags can both start and end in a single tag, such as <br/> for a break, or <p/> for an empty paragraph. We would need a more sophisticated approach to include such things in our search.

Quantifiers

Quantifiers specify the number of times an item must appear in a string. Quantifiers are defined by the curly brackets { }. For example, the pattern T{5} means that the letter T must appear consecutively exactly 5 times. The pattern T{3,6} means that the letter T must appear consecutively 3 to 6 times. The pattern T{5,} means that the letter T must appear consecutively 5 or more times. As noted earlier, the curly bracket has this special meaning only in the extended syntax, so use egrep (or grep -E) with these patterns.
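
One detail is easy to miss: unless the pattern is tied to the whole line, a quantifier only requires that a run of the right length appear somewhere, so a longer run still matches. The sketch below shows the difference using the ^ and $ anchors described in the next section; the sample strings are made up.

```shell
# Runs of 2, 3, 6, and 7 letter Ts, one per line (made-up sample data).
runs=$(printf '%s\n' 'TT' 'TTT' 'TTTTTT' 'TTTTTTT')

# Anchored: the whole line must be 3 to 6 Ts, so only TTT and TTTTTT match.
exact=$(printf '%s\n' "$runs" | grep -E -c '^T{3,6}$')

# Unanchored: seven Ts also match, because they contain a run of 3 to 6.
loose=$(printf '%s\n' "$runs" | grep -E -c 'T{3,6}')

# Five or more in a row: the runs of 6 and 7 match.
atleast=$(printf '%s\n' "$runs" | grep -E -c 'T{5,}')

echo "$exact $loose $atleast"   # 2 3 2
```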

Anchors and Word Boundaries

You can use anchors to specify that a pattern must exist at the beginning or the end of a string. The ^ character is used to anchor a pattern to the beginning of a string. For example ^[1-5] means that a matching string must start with one of the digits 1 through 5 as the first character on the line. The $ character is used to anchor a pattern to the end of a string or line. For example [1-5]$ means that a string must end with one of the digits 1 through 5.

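In addition, you can use \b to identify a word boundary, that is, the position between a word character (a letter, digit, or underscore) and a non-word character such as a space or punctuation mark, or the start or end of the line. The pattern \b[1-5]\b will match on any of the digits 1 through 5 where the digit appears as its own word.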
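
The three patterns from this section can be compared on a handful of made-up lines. Note that \b is assumed here to be available in your grep; GNU grep supports it in both the basic and extended syntax.

```shell
# Made-up sample lines, one per line.
sample=$(printf '%s\n' '3 apples' 'apples 3' 'room 42' 'take 5 now' 'x5y')

# Lines that start with a digit from 1 to 5: only "3 apples".
starts=$(printf '%s\n' "$sample" | grep -c '^[1-5]')

# Lines that end with a digit from 1 to 5: "apples 3" and "room 42".
ends=$(printf '%s\n' "$sample" | grep -c '[1-5]$')

# Lines where such a digit stands alone as a word: "3 apples",
# "apples 3", and "take 5 now". The digits in 42 and x5y touch other
# word characters, so there is no boundary on both sides.
words=$(printf '%s\n' "$sample" | grep -c '\b[1-5]\b')

echo "$starts $ends $words"   # 1 2 3
```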

Summary

Regular expressions are extremely powerful for describing patterns and can be used in coordination with other tools to search and process data.

The uses and full syntax of regex far exceed the scope of this book. You can visit the resources below for additional information and utilities related to regex.

In the next chapter we will discuss common data types relevant to security operations and how they can be gathered.

Exercises

  1. Write a regular expression that matches a floating point number (a number with a decimal point) such as 3.14. There can be digits on either side of the decimal point but there need not be any on one side or the other. Allow it to match just a decimal point by itself, too.

  2. Use a back reference in a regular expression to match a number that appears on both sides of an equal sign. For example, it should match “314 is = to 314” but not “6 = 7”

  3. Write a regular expression that looks for a line that begins with a digit and ends with a digit, with anything occurring in between.

  4. Write a regular expression that uses grouping to match on the following 2 IP addresses: 10.0.0.25 and 10.0.0.134.

  5. Write a regular expression that will match if the hexadecimal string 0x90 occurs 3 or more times in a row (e.g. 0x90 0x90 0x90).