Until now, we've seen grep used with various options that alter its behavior. There is one final important option we'd like to share with you: --extended-regexp (-E). As the man grep page states, this means interpret PATTERN as an extended regular expression.
In contrast to the default (basic) regular expressions that grep uses, extended regular expressions have search patterns that are a lot closer to regular expressions in other scripting/programming languages (should you already have experience with those).
Specifically, the following constructs are available when using extended regular expressions (GNU grep accepts them in default patterns too, but only when the special characters are escaped with a backslash):
| ? | Matches a repeat of the previous character zero or one time |
| + | Matches a repeat of the previous character one or more times |
| {n} | Matches a repeat of the previous character exactly n times |
| {n,m} | Matches a repeat of the previous character between n and m times |
| {,n} | Matches a repeat of the previous character n or fewer times (a GNU extension) |
| {n,} | Matches a repeat of the previous character n or more times |
| (xx|yy) | Alternation character, allows us to find xx OR yy in the search pattern (great for patterns with more than one character, otherwise, [xy] notation would suffice) |
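To make the quantifier rows concrete, here is a minimal sketch that runs three of them against a handful of made-up words. The word list is an assumption for illustration and is not part of grep-file.txt. It uses grep -E with -x, so a word is printed only when the whole line matches the pattern.

```shell
#!/bin/sh
# Hypothetical word list, one word per line (not taken from grep-file.txt).
words='color
colour
colouur
clr'

# ? : the u may appear zero times or once
printf '%s\n' "$words" | grep -E -x 'colou?r'
# prints: color, colour

# + : the u must appear at least once
printf '%s\n' "$words" | grep -E -x 'colou+r'
# prints: colour, colouur

# {2} : the u must appear exactly twice
printf '%s\n' "$words" | grep -E -x 'colou{2}r'
# prints: colouur
```

Note that the quantifier applies only to the character directly in front of it (the u), not to the whole word.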
Now, before we start using the new ERE search patterns, we'll look at a new command: egrep. To find out what it is, you might run which egrep, which prints /bin/egrep. This might lead you to think it is a separate binary from grep, which you've used so much by now.
However, in the end, egrep is nothing more than a small wrapper script:
reader@ubuntu:~/scripts/chapter_10$ cat /bin/egrep
#!/bin/sh
exec grep -E "$@"
As you can see, it's just a shell script, but without the customary .sh extension. It uses the exec command to replace the current process image (the shell running the script) with a new process image (grep).
You might recall that a command normally runs in a fork of the current environment. Because this script exists only to wrap grep -E as egrep (hence the name wrapper script), it makes sense to replace the shell process with grep instead of forking yet again.
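A small sketch can make the effect of exec visible. It is not part of the egrep script itself: it starts a throwaway shell with sh -c, prints that shell's process ID via $$, and then lets a second shell report its own ID. The variable names are made up for this illustration.

```shell
#!/bin/sh
# With exec: the inner shell replaces the outer shell's process,
# so both echo commands report the same PID.
with_exec=$(sh -c 'echo $$; exec sh -c "echo \$\$"')
printf 'with exec:\n%s\n' "$with_exec"

# Without exec: the inner shell runs as a child process of the outer one,
# so the two PIDs normally differ. The trailing ":" keeps the inner shell
# from being the last command, which some shells would otherwise exec anyway.
without_exec=$(sh -c 'echo $$; sh -c "echo \$\$"; :')
printf 'without exec:\n%s\n' "$without_exec"
```

The first pair of numbers should be identical, which is what replacing the process image means in practice.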
The "$@" construct is new as well: it expands to all the arguments the script received, each kept as a separate word (think of an array, or ordered list, of arguments). In this case, it passes every argument given to egrep on to grep -E unchanged.
So, if the full command was egrep -w [[:digit:]] grep-file.txt, it would be wrapped and finally executed in place as grep -E -w [[:digit:]] grep-file.txt.
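The quotes around "$@" matter. The following sketch, which uses hypothetical helper functions (count_args, wrap_quoted, wrap_unquoted and my_egrep are names invented here), shows how the quoted form keeps an argument containing a space intact, while the unquoted form splits it.

```shell
#!/bin/sh
# count_args is a hypothetical helper that prints how many arguments it received.
count_args() { echo "$#"; }

# Pass the arguments on with and without quotes around $@.
wrap_quoted()   { count_args "$@"; }
wrap_unquoted() { count_args $@; }

wrap_quoted   'two words' file.txt   # prints 2
wrap_unquoted 'two words' file.txt   # prints 3

# A function-based stand-in for the wrapper script (illustration only).
my_egrep() { grep -E "$@"; }
printf 'a1\nbb\n' | my_egrep -c '[[:digit:]]'   # prints 1
```

Without the quotes, a search pattern containing spaces would reach grep as several separate arguments, which is presumably why the wrapper quotes it.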
In practice, it does not matter whether you use egrep or grep -E. We prefer using egrep so we know for sure that we're dealing with extended regular expressions (since the extended functionality is often used in practice, in our experience). For simple search patterns, however, there is no need for ERE.
We advise you to find your own system for when to use each one.
Now for some examples of the extended regular expression search pattern capabilities:
reader@ubuntu:~/scripts/chapter_10$ egrep -w '[[:lower:]]{5}' grep-file.txt
but in the USA they use color (and realize)!
reader@ubuntu:~/scripts/chapter_10$ egrep -w '[[:lower:]]{7}' grep-file.txt
We can use this regular file for testing grep.
Did you ever realise that in the UK they say colour,
but in the USA they use color (and realize)!
reader@ubuntu:~/scripts/chapter_10$ egrep -w '[[:alpha:]]{7}' grep-file.txt
We can use this regular file for testing grep.
Regular expressions are pretty cool
Did you ever realise that in the UK they say colour,
but in the USA they use color (and realize)!
Also, New Zealand is pretty far away.
The first command, egrep -w '[[:lower:]]{5}' grep-file.txt, shows us all lines containing a word of exactly five lowercase letters. Don't forget we need the -w option here, because otherwise, any five lowercase letters in a row match as well, ignoring word boundaries (in this case, the prett in pretty matches as well). The result is only one five-letter word: color.
Next, we do the same for seven-letter words. We now get more results. However, because we are only using lowercase letters, we're missing two words that are also seven letters long: Regular and Zealand. We fix this by using [[:alpha:]] instead of [[:lower:]]. We could also have used the -i option to make everything case-insensitive: egrep -iw '[[:lower:]]{7}' grep-file.txt.
While this is functionally acceptable, think about it for a second. In that case, you would be searching case-insensitively for words made up of exactly seven lowercase letters. That doesn't really make any sense. In situations such as these, we always choose logic over functionality, which in this case means changing [[:lower:]] to [[:alpha:]], instead of using the -i option.
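Because grep prints whole lines, it can be hard to see which word caused a match. As a rough sketch, the -o option (supported by GNU and BSD grep, though not required by POSIX) prints only the matched text. The sample text below is an assumption: it is the content of grep-file.txt as reconstructed from the output shown above.

```shell
#!/bin/sh
# The five lines of grep-file.txt, reconstructed from the command output (assumption).
sample='We can use this regular file for testing grep.
Regular expressions are pretty cool
Did you ever realise that in the UK they say colour,
but in the USA they use color (and realize)!
Also, New Zealand is pretty far away.'

# -o prints each matched word on its own line instead of the whole line.
lower7=$(printf '%s\n' "$sample" | grep -E -o -w '[[:lower:]]{7}')
alpha7=$(printf '%s\n' "$sample" | grep -E -o -w '[[:alpha:]]{7}')

# Unquoted expansion joins the words with spaces for compact display.
echo "lower:" $lower7   # lower: regular testing realise realize
echo "alpha:" $alpha7   # alpha: regular testing Regular realise realize Zealand
```

The second list contains exactly the two extra words, Regular and Zealand, that the lowercase-only pattern misses.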
So we know how to search for words of a specific length (or, if we omit the -w option, for any run of that many matching characters on a line). How about words that are at least, or at most, a certain length?
Here's an example:
reader@ubuntu:~/scripts/chapter_10$ egrep -w '[[:lower:]]{5,}' grep-file.txt
We can use this regular file for testing grep.
Regular expressions are pretty cool
Did you ever realise that in the UK they say colour,
but in the USA they use color (and realize)!
Also, New Zealand is pretty far away.
reader@ubuntu:~/scripts/chapter_10$ egrep -w '[[:alpha:]]{,3}' grep-file.txt
We can use this regular file for testing grep.
Regular expressions are pretty cool
Did you ever realise that in the UK they say colour,
but in the USA they use color (and realize)!
Also, New Zealand is pretty far away.
reader@ubuntu:~/scripts/chapter_10$ egrep '.{40,}' grep-file.txt
We can use this regular file for testing grep.
Did you ever realise that in the UK they say colour,
but in the USA they use color (and realize)!
This example demonstrates boundary syntax. The first command, egrep -w '[[:lower:]]{5,}' grep-file.txt, looks for lowercase words of five letters or more. If you compare these results to the previous examples, where we were looking for words exactly five letters long, you now see that longer words are also matched.
Next, we reverse the boundary condition: we only want to match words of three letters or fewer. We see that all two- and three-letter words are matched (and, because we switched from [[:lower:]] to [[:alpha:]], UK and capitalized words at the start of a line, such as We and Did, are matched as well).
In the final example, egrep '.{40,}' grep-file.txt, we remove the -w so the pattern is no longer limited to single words. We match on any character (as denoted by the dot), and we want at least 40 of them in a row on a line (as denoted by the {40,}). In this case, only three of the five lines are matched (as the other two are shorter).
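One subtlety is worth a short sketch: without anchoring, a bound such as {40} only requires that many characters somewhere on the line, so longer lines match too. The -x option forces the pattern to cover the entire line. The sample text is again an assumption, reconstructed from the output of the commands above.

```shell
#!/bin/sh
# The five lines of grep-file.txt, reconstructed from the command output (assumption).
sample='We can use this regular file for testing grep.
Regular expressions are pretty cool
Did you ever realise that in the UK they say colour,
but in the USA they use color (and realize)!
Also, New Zealand is pretty far away.'

# Unanchored: "exactly 40" and "40 or more" select the same lines,
# because any line with more than 40 characters also contains 40 in a row.
printf '%s\n' "$sample" | grep -E -c '.{40}'    # prints 3
printf '%s\n' "$sample" | grep -E -c '.{40,}'   # prints 3

# With -x the whole line must match, so the bound applies to the line length.
printf '%s\n' "$sample" | grep -E -x '.{35}'
# prints: Regular expressions are pretty cool
printf '%s\n' "$sample" | grep -E -c -x '.{35,40}'   # prints 2
```

The same effect could be achieved with the ^ and $ anchors instead of -x.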
The final concept of extended regular expressions we want to show is alternation. This uses pipe syntax (not to be confused with the pipes used to connect commands, which will be further discussed in Chapter 12, Using Pipes and Redirection in Scripts) to convey the meaning of match on xx OR yy.
An example should make this clear:
reader@ubuntu:~/scripts/chapter_10$ egrep 'f(a|o)r' grep-file.txt
We can use this regular file for testing grep.
Also, New Zealand is pretty far away.
reader@ubuntu:~/scripts/chapter_10$ egrep 'f[ao]r' grep-file.txt
We can use this regular file for testing grep.
Also, New Zealand is pretty far away.
reader@ubuntu:~/scripts/chapter_10$ egrep '(USA|UK)' grep-file.txt
Did you ever realise that in the UK they say colour,
but in the USA they use color (and realize)!
In the case of a single letter difference, we can choose whether we want to use extended alternation syntax, or the earlier-discussed bracket syntax. We would advise using the simplest syntax that accomplishes the goal, which, in this case, is bracket syntax.
However, once the alternatives differ by more than one character, bracket syntax can no longer express them on its own. In this case, extended alternation syntax is clear and concise, especially since | or || represents an OR construct in most scripting/programming logic. For this example, this would be like saying: I want to find lines that contain either the word USA or the word UK.
Because this syntax corresponds nicely with a semantic view, it feels intuitive and is understandable, something we should always strive for in our scripts!
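Alternation also combines with the quantifiers introduced earlier. The following sketch, under the assumption that grep-file.txt contains the five lines seen in the output above, collects both spellings of two words in a single pattern. The -o option (available in GNU and BSD grep) prints only the matched text.

```shell
#!/bin/sh
# The five lines of grep-file.txt, reconstructed from the command output (assumption).
sample='We can use this regular file for testing grep.
Regular expressions are pretty cool
Did you ever realise that in the UK they say colour,
but in the USA they use color (and realize)!
Also, New Zealand is pretty far away.'

# Top-level alternation between two sub-patterns; the parentheses in
# reali(s|z)e limit the inner alternation to a single letter.
spellings=$(printf '%s\n' "$sample" | grep -E -o 'colou?r|reali(s|z)e')
echo $spellings   # realise colour color realize

# Counting lines that mention either country abbreviation.
printf '%s\n' "$sample" | grep -E -c '(USA|UK)'   # prints 2
```

Here reali(s|z)e could equally be written as reali[sz]e, in line with the advice to prefer bracket syntax for single-character differences.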