This chapter introduces some more useful utilities that are not part of the shell but are used in so many shell scripts that you really should know about them.
Sorting is such a common task, and so useful for readability reasons, that it’s good to know about the sort command. In a similar vein, the tr command will translate or map from one character to another, or even just delete characters.
One common thread here is that these utilities are written not just as standalone commands, but as filters that can be included in a pipeline of commands. Such commands typically take one or more filenames as parameters (or arguments), but in the absence of any filenames they will read from standard input. They also write to standard output. That combination makes it easy to connect commands with pipes, as in somecommands | sort | morecommands.
This makes them especially useful, and avoids the clutter and confusion of a myriad of temporary files.
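For instance, a hypothetical pipeline like the following (the filenames are made up) counts how often each distinct error message appears across some logfiles, with no intermediate files anywhere:

grep -h 'ERROR' *.log | sort | uniq -c | sort -rn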
Use the sort utility. You can sort one or more files by putting the filenames on the command line:
sort file1.txt file2.txt myotherfile.xyz
With no filenames on the command line, sort will read from standard input, so you can pipe the output from a previous command into sort:
somecommands | sort
It can be handy to have your output in sorted order, and handier still not to have to add sorting code to every program you write. The shell’s piping allows you to hook up sort to any program’s standard output.
There are many options to sort, but two of the three most worth remembering are:
sort -r
to reverse the order of the sort (where, to borrow a phrase, the last shall be first and the first, last) and:
sort -f
to “fold” lower- and uppercase characters together; i.e., to ignore the case differences. This can be done either with the -f option or with a GNU long-format option:
sort --ignore-case
We decided to keep you in suspense, so see Recipe 8.2 for the third-coolest sort option.
man sort
You need to tell sort that the data should be sorted as numbers. Specify a numeric sort with the -n option:
$ sort -n somedata
2
21
200
250
$
There is nothing wrong with the original (if odd) sort order if you realize that it is an alphabetic sort on the data (i.e., 21 comes after 200 because 1 comes after 0 in an alphabetic sort). Of course, what you probably want is numeric ordering, so you need to use the -n option.
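For comparison, here is what the same (presumed) somedata file looks like with the default alphabetic sort, without -n:

$ sort somedata
2
200
21
250
$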
sort -rn can be very handy in giving you a descending frequency list of something when combined with uniq -c. For example, let’s display the most popular shells on this system:
$ cut -d':' -f7 /etc/passwd | sort | uniq -c | sort -rn
20 /bin/sh
10 /bin/false
2 /bin/bash
1 /bin/sync
$
cut -d':' -f7 /etc/passwd isolates the shell from the /etc/passwd file. Then we have to do an initial sort so that uniq will work. uniq -c counts consecutive, duplicate lines, which is why we need the presort. Then sort -rn gives us a reverse numerical sort, with the most popular shell at the top.
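That sort | uniq -c | sort -rn idiom is worth memorizing; it produces a descending frequency count of just about anything. As a rough sketch (the filename is our own invention), here is a quick list of the most common words in a text file, using tr to split the text into one word per line:

tr -cs '[:alpha:]' '\n' < somefile.txt | sort | uniq -c | sort -rn | head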
If you don’t need to count the occurrences and just want a unique list of values—i.e., if you want sort to remove duplicates—then you can use the -u option on the sort command (and omit the uniq command). So, to find just the list of different shells on this system:
cut -d':' -f7 /etc/passwd | sort -u
man sort
man uniq
man cut
To sort by the last octet only (old syntax):
$ sort -t. -n +3.0 ipaddr.list
10.0.0.2
192.168.0.2
192.168.0.4
10.0.0.5
192.168.0.12
10.0.0.20
$
To sort the entire address as you would expect (POSIX syntax):
$ sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n ipaddr.list
10.0.0.2
10.0.0.5
10.0.0.20
192.168.0.2
192.168.0.4
192.168.0.12
$
We know this is numeric data, so we use the -n option. The -t option indicates the character to use as a separator between fields (in our case, a period) so that we can also specify which fields to sort first. In the first example, we start sorting with the third field (zero-based) from the left, and the very first character (again, zero-based) of that field, so +3.0.
In the second example, we used the new POSIX specification instead of the traditional (but obsolete) +pos1 -pos2 method. Unlike the older method, it is not zero-based, so fields start at 1:
sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n ipaddr.list
Wow, that’s ugly. Here it is in the old format, which is just as bad:
sort -t. +0n -1 +1n -2 +2n -3 +3n -4
Using -t. to define the field delimiter is the same, but the sort-key fields are given quite differently. In this case, -k 1,1n means “start sorting at the beginning of field one (1) and (,) stop sorting at the end of field one (1), and do a numerical sort (n).” Once you get that, the rest is easy. When using more than one field, it’s very important to tell sort where to stop. The default is to go to the end of the line, which is often not what you want and which will really confuse you if you don’t understand what it’s doing.
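To see that in action, here is a contrived two-line file (our invention, not part of the recipe). With -k2 the key runs to the end of the line, so the third field sneaks into the comparison; with -k2,2 the keys tie and (on GNU sort, without -s) the whole-line last-resort comparison decides:

$ cat test.txt
x.b.c
y.b.a
$ sort -t. -k2 test.txt
y.b.a
x.b.c
$ sort -t. -k2,2 test.txt
x.b.c
y.b.a
$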
The order that sort uses is affected by your locale setting. If your results are not as expected, that’s one thing to check.
Your sort order will vary from system to system depending on whether your sort command defaults to using a stable sort. A stable sort preserves the original order in the sorted data when the sort fields are equal. Linux and Solaris do not default to a stable sort, but NetBSD does. And while -S turns off the stable sort on NetBSD, it sets the buffer size in other versions of sort.
Say we have a trivial file like:
10.0.0.5     # mainframe
192.168.0.12 # speedy
10.0.0.20    # lanyard
192.168.0.4  # office
10.0.0.2     # sluggish
192.168.0.2  # laptop
If we run this sort command on a Linux or Solaris system:
sort -t. -k4n ipaddr.list
or this command on a NetBSD system:
sort -t. -S -k4n ipaddr.list
we will get the data sorted as shown in the first column of Table 8-1. Remove the -S on a NetBSD system, and sort will produce the ordering as shown in the second column.
| Linux and Solaris (default) and NetBSD (with -S) | NetBSD stable (default) sort ordering |
|---|---|
If our input file, ipaddr.list, had all the 192.168 addresses first, followed by all the 10. addresses, then the stable sort would leave the 192.168 address first when there is a tie—that is, when two elements in our sort have the same value. We can see in Table 8-1 that this situation exists for laptop and sluggish, since each has a 2 as its fourth field, and also for mainframe and office, which tie with 4. In the default Linux sort (and NetBSD with the -S option specified), the order is not guaranteed.
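If you need the stable behavior on a GNU system, note that GNU sort has a flag for exactly that: -s (lowercase). Something like the following should preserve the input order of tied lines regardless of the default (a sketch; check your manpage):

sort -t. -s -k4n ipaddr.list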
To get back to something easy, and just for practice, let’s sort by the text in our IP address list. This time we want our separator to be the # character and we want an alphabetic sort on the second field, so we get:
$ sort -t'#' -k2 ipaddr.list
10.0.0.20    # lanyard
192.168.0.2  # laptop
10.0.0.5     # mainframe
192.168.0.4  # office
10.0.0.2     # sluggish
192.168.0.12 # speedy
$
The sorting will start with the second field and, in this case, go through the end of the line. With just the one separator (#) per line, we didn’t need to specify the ending, though we could have written -k2,2.
man sort
./functions/inetaddr, as provided in the bash tarball (Appendix B)
Use the cut command with the -c option to take particular columns:1
$ ps -l | cut -c12-15
  PID
 5391
 7285
 7286
$
or:
$ ps -elf | cut -c58-
(output not shown)
$
With the cut command we specify what portion of the lines we want to keep. In the first example, we are keeping columns 12 (starting at column 1) through 15, inclusive. In the second case, we specify starting at column 58 but don’t specify the end of the range so that cut will take from column 58 on through the end of the line.
Most of the data manipulation we’ve looked at has been based on fields, relative positions separated by characters called delimiters. The cut command can do that too, but it is one of the few utilities that you’ll use with bash that can also easily deal with fixed-width, columnar data (via the -c option).
Using cut to print out fields rather than columns is possible, though it’s more limited than other choices such as awk. The default delimiter between fields is the tab character, but you can specify a different delimiter with the -d option. Here is an example of a cut command using fields:
cut -d'#' -f2 < ipaddr.list
and an equivalent awk command:
awk -F'#' '{print $2}' < ipaddr.list
You can even use cut to handle nonmatching delimiters by using more than one cut. You may be better off using a regular expression with awk for this, but sometimes a couple of quick and dirty cut commands are faster to figure out and type.
Here is how you can get the field out from between square brackets. Note that the first cut uses a delimiter of open square bracket (-d'[') and field 2 (-f2, starting at 1). Because the first cut has already removed part of the line, the second cut uses a delimiter of closed square bracket (-d']') and field 1 (-f1):
$ cat delimited_data
Line [l1].
Line [l2].
Line [l3].
$ cut -d'[' -f2 delimited_data | cut -d']' -f1
l1
l2
l3
$
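For comparison, the same extraction can be done in a single awk command by using a bracket expression that treats either square bracket as a field separator (our sketch, not part of the recipe):

awk -F'[][]' '{print $2}' delimited_data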
man cut
man awk
After selecting and/or sorting some data, you notice that there are many duplicate lines in your results. You’d like to get rid of the duplicates, so that you can see just the unique values.
You have two choices available to you. If you’ve just been sorting your output, add the -u option to the sort command:
somesequence | sort -u
If you aren’t running sort, just pipe the output into uniq—provided, that is, that the output is sorted, so that identical lines are adjacent:
somesequence | uniq > myfile
Since uniq requires the data to be sorted already, we’re more likely to just add the -u option to sort unless we also need to count the number of duplicates (-c, see Recipe 8.2) or see only the duplicates (-d), which uniq can do.
Be careful not to overwrite a valuable file by accident; the uniq command is a bit odd in its parameters. Whereas most Unix/Linux commands take multiple input files on the command line, uniq does not. In fact, the first (nonoption) argument is taken to be the (one and only) input file and any second argument, if supplied, is taken as the output file. So if you supply two filenames on the command line, the second one will get clobbered without warning.
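In other words (a hypothetical sketch):

uniq file1           # safe: reads file1, writes to the screen
uniq file1 > file2   # safe: explicit redirection
# uniq file1 file2   # DANGER: file2 is treated as the output and clobbered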
man sort
man uniq
First, you need to understand that in traditional Unix, archiving (or combining) and compressing files are two different operations using two different tools, while in the DOS and Windows world it’s typically one operation with one tool. A “tarball” is created by combining several files and/or directories using the tar (tape archive) command, then compressed using the compress, gzip, or bzip2 tools. This results in files like tarball.tar.Z, tarball.tar.gz, tarball.tgz, or tarball.tar.bz2, respectively. Having said that, many other tools, including zip, are supported.
In order to use the correct format, you need to understand where your data will be used. If you are simply compressing some files for yourself, use whatever you find easiest. If other people will need to use your data, consider what platform they will be using and what they are comfortable with.
The traditional Unix tarball was tarball.tar.Z, but gzip is now much more common, and xz and bzip2 (which offer better compression than gzip) are gaining ground. There is also a question of tools: some versions of tar allow you to apply the compression of your choice automatically while creating the archive, while others don’t.
The universally accepted Unix or Linux format would be a tarball.tar.gz created like this:
$ tar cf tarball_name.tar directory_of_files
$ gzip tarball_name.tar
$
If you have GNU tar, you could use -Z for compress (don’t, this is obsolete), -z for gzip (safest), or -j for bzip2 (highest compression). Don’t forget to use an appropriate filename, as this is not automatic. For example:
tar czf tarball_name.tgz directory_of_files
While tar and gzip are available for many platforms, if you need to share with Windows you are better off using zip, which is nearly universal:
zip -r zipfile_name directory_of_files
zip and unzip are supplied by the InfoZip packages on Unix and almost any other platform you can possibly think of. Unfortunately, they are not always installed by default. Run the command by itself for some helpful usage information, since these tools are not like most other Unix tools. And note the -l option to convert Unix line endings to DOS line endings, or -ll for the reverse.
There are far too many compression algorithms and tools to talk about here; others include ar, arc, arj, bin, bz2, cab, jar, cpio, deb, hqx, lha, lzh, rar, rpm, uue, and zoo.
When using tar, we strongly recommend using a relative directory to store all the files. If you use an absolute directory, you might overwrite something on another system that you shouldn’t. If you don’t use any directory, you’ll clutter up whatever directory the user is in when they extract the files (see Recipe 8.8). The recommended use is the name and possibly version of the data you are processing. Table 8-2 shows some examples.
| Good | Bad |
|---|---|
| ./myapp_1.0.1 | myapp.c, myapp.h, myapp.man |
| ./bintools | /usr/local/bin |
It is worth noting that Red Hat Package Manager (RPM) files are actually CPIO files with a header. You can get a shell or Perl script called rpm2cpio to strip that header and then extract the files like this:
rpm2cpio some.rpm | cpio -i
Debian’s .deb files are actually ar archives containing gzipped or bzipped tar archives. They may be extracted with the standard ar, gunzip, or bunzip2 tools.
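For example, a sketch of unpacking a .deb by hand (the exact member names, such as data.tar.gz versus data.tar.xz, vary by package):

ar x some.deb          # produces debian-binary, control.tar.*, and data.tar.*
tar xzf data.tar.gz    # then unpack the data member (here assumed to be gzipped)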
Many of the Windows-based tools such as WinZip, PKZIP, FilZip, and 7-Zip can handle many or all of the formats mentioned here, and more (including tarballs and RPMs).
man tar
man gzip
man bzip2
man compress
man zip
man rpm
man ar
man dpkg
Figure out what you are dealing with and use the right tool. Table 8-3 maps common extensions to programs capable of handling them. The file command is helpful here since it can usually tell you the type of a file even if the name is incorrect.
| File extension | Command |
|---|---|
| .tar | tar xf file.tar |
| .tar.gz, .tgz | GNU tar: tar xzf file.tar.gz. Else: gunzip file.tar.gz && tar xf file.tar |
| .tar.bz2 | GNU tar: tar xjf file.tar.bz2. Else: bunzip2 file.tar.bz2 && tar xf file.tar |
| .tar.Z | GNU tar: tar xZf file.tar.Z. Else: uncompress file.tar.Z && tar xf file.tar |
| .zip | unzip file.zip |
You should also try the file command:
$ file what_is_this.*
what_is_this.1: GNU tar archive
what_is_this.2: gzip compressed data, from Unix
$ gunzip what_is_this.2
gunzip: what_is_this.2: unknown suffix -- ignored
$ mv what_is_this.2 what_is_this.2.gz
$ gunzip what_is_this.2.gz
$ file what_is_this.2
what_is_this.2: GNU tar archive
If the file extension matches none of those listed in Table 8-3 and the file command doesn’t help, but you are sure it’s an archive of some kind, then you should do a web search for it.
Use an awk script to parse off the directory names from the tar archive’s table of contents, then use sort -u to leave you with just the unique directory names:
tar tf some.tar | awk -F/ '{print $1}' | sort -u
The t option will produce the table of contents for the file specified with the f option whose filename follows. The awk command specifies a nondefault field separator by using -F/ to specify a slash as the separator between fields. Thus, the print $1 will print the first directory name in the pathname.
Finally, all the directory names will be sorted and only unique ones will be printed.
If a line of the output contains a single period then some files will be extracted into the current directory when you unpack this tar file, so be sure to be in the directory you desire.
Similarly, if the filenames in the archive are all local and without a leading ./, then you will get a list of filenames that will be created in the current directory.
If the output contains a blank line, that means that some of the files are specified with absolute pathnames (i.e., beginning with /); again be careful, as extracting such an archive might clobber something that you don’t want replaced.
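A quick sanity check for that case (our sketch) is to grep the table of contents directly:

tar tf some.tar | grep '^/' && echo 'Warning: absolute pathnames!'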
Some tar programs strip the leading / by default (e.g., GNU tar) or optionally. That’s a much safer way to create a tarball, but you can’t count on that when you are looking at extracting one.
In its simplest form, a tr command replaces occurrences of the first (and only) character of the first argument with the first (and only) character of the second argument.
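For example, to translate every semicolon into a comma, reading from a file named be.fore and writing to a file named af.ter:

tr ';' ',' <be.fore >af.ter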
In the example solution, we redirected input from the file named be.fore and sent the output into the file named af.ter, and we translated all occurrences of a semicolon into a comma.
Why do we use the single quotes around the semicolon and the comma? Well, a semicolon has special meaning to bash, so if we didn’t quote it bash would break our command into two commands, resulting in an error. The comma has no special meaning, but we quote it out of habit to avoid any special meaning we may have forgotten about—it’s safer always to use the quotes, as then we never forget to use them when we need them.
The tr command can do more than one translation at a time if we put the several characters to be translated in the first argument and their corresponding resultant characters in the second argument. Just remember, it’s a one-for-one substitution. For example:
tr ';:.!?' ',' <other.punct >commas.all
will translate all occurrences of the punctuation symbols of semicolon, colon, period, exclamation point, and question mark to commas. Since the second argument is shorter than the first, its last (and here, its only) character is repeated to match the length of the first argument, so that each character has a corresponding character for the translation.
This kind of translation could be done with the sed command, though sed syntax is a bit trickier. The tr command is not as powerful, since it doesn’t use regular expressions, but it does have some special syntax for ranges of characters—and that can be quite useful, as we’ll see in Recipe 8.10.
You can translate all uppercase characters (A–Z) to lowercase (a–z) using the tr command and specifying a range of characters, as in:
tr 'A-Z' 'a-z' <be.fore >af.ter
There is also special syntax in tr for specifying this sort of range for upper- and lower-case conversions:
tr '[:upper:]' '[:lower:]' <be.fore >af.ter
There are some versions of tr that honor the current locale’s collating sequence, and A-Z may not always be the set of uppercase letters in the current locale. It’s better to avoid that problem and use [:lower:] and [:upper:] if possible, but that does make it impossible to use subranges like N-Z and a-m.
Although tr doesn’t support regular expressions, it does support a range of characters. Just make sure that both arguments end up with the same number of characters. If the second argument is shorter, its last character will be repeated to match the length of the first argument. If the first argument is shorter, the second argument will be truncated to match the length of the first.
Here’s a very simplistic encoding of a text message using a simple substitution cypher that offsets each character by 13 places (i.e., ROT13). An interesting characteristic of ROT13 is that the same process is used to both encipher and decipher the text:
$ cat /tmp/joke
Q: Why did the chicken cross the road?
A: To get to the other side.
$ tr 'A-Za-z' 'N-ZA-Mn-za-m' < /tmp/joke
D: Jul qvq gur puvpxra pebff gur ebnq?
N: Gb trg gb gur bgure fvqr.
$ tr 'A-Za-z' 'N-ZA-Mn-za-m' < /tmp/joke | tr 'A-Za-z' 'N-ZA-Mn-za-m'
Q: Why did the chicken cross the road?
A: To get to the other side.
Use the -d option on tr to delete the character(s) in the supplied list. For example, to delete all DOS carriage returns (\r), use the command:
tr -d '\r' <file.dos >file.txt
This will delete all \r characters in the file, not just those at the end of a line. Typical text files rarely have characters like that inline, but it is possible. You may wish to look into the dos2unix and unix2dos programs if you are worried about this.
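If you do want to remove only the carriage returns that fall at the ends of lines, a GNU sed one-liner is an alternative (a sketch; \r in the pattern is a GNU extension):

sed 's/\r$//' file.dos > file.txt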
The tr utility has a few special escape sequences that it recognizes, among them \r for carriage return and \n for newline. The other special backslash sequences are listed in Table 8-4.
| Sequence | Meaning |
|---|---|
| \ooo | Character with octal value ooo (one to three octal digits) |
| \\ | Backslash character (i.e., escapes the backslash itself) |
| \a | “Audible” bell, the ASCII BEL character (since “b” was taken for backspace) |
| \b | Backspace |
| \f | Form feed |
| \n | Newline |
| \r | Return |
| \t | Tab (sometimes called a “horizontal” tab) |
| \v | Vertical tab |
man tr
Translate the odd characters back to simple ASCII like this:
tr '\221\222\223\224\226\227' '\047\047""--' <odd.txt >plain.txt
Such “smart quotes” come from the Windows-1252 character set, and may also show up in email messages that you save as text.
To clean up such text, we can use the tr command. The 221 and 222 (octal) curved single quotes will be translated to simple single quotes. We specify them in octal (047) to make it easier on us, since the shell uses single quotes as a delimiter. The 223 and 224 (octal) are opening and closing curved double quotes, and will be translated to simple double quotes. The double quotes can be typed within the second argument since the single quotes protect them from shell interpretation. The 226 and 227 (octal) are dash characters and will be translated to hyphens (and no, that second hyphen in the second argument is not technically needed since tr will repeat the last character to match the length of the first argument, but it’s better to be specific).
man tr
https://en.wikipedia.org/wiki/Quotation_mark#Curved_quotes_and_Unicode for way more than you might ever have wanted to know about quotation marks and related character set issues
Use the wc (word count) command in a command substitution.
The normal output of wc is something like this:
$ wc data_file
5 15 60 data_file
# Lines only
$ wc -l data_file
5 data_file
# Words only
$ wc -w data_file
15 data_file
# Characters (often the same as bytes) only
$ wc -c data_file
60 data_file
# Note 60B
$ ls -l data_file
-rw-r--r-- 1 jp users 60B Dec 6 03:18 data_file
You may be tempted to just do something like this:
data_file_lines=$(wc -l "$data_file")
That won’t do what you expect, since you’ll get something like "5 data_file" as the value. You may also see:
data_file_lines=$(cat "$data_file" | wc -l)
Instead, use this to avoid the filename problem without a useless use of cat:
data_file_lines=$(wc -l < "$data_file")
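One more nuance: some versions of wc pad the count with leading whitespace. If that would bother you (say, in a string comparison), wrapping the command substitution in arithmetic expansion strips the padding:

data_file_lines=$(( $(wc -l < "$data_file") ))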
Use the fmt command:
fmt mangled_text
optionally with a goal and maximum line length:
fmt 55 60 mangled_text
One tricky thing about fmt is that it expects blank lines to separate headers and paragraphs. If your input file doesn’t have those blanks, it has no way to tell the difference between different paragraphs and extra newlines inside the same paragraph—so you will end up with one giant paragraph, with the correct line lengths.
The pr command might also be of some interest for formatting text.
man fmt
man pr
Read the less manpage and use the $LESS variable with ~/.lessfilter and ~/.lesspipe files.
less takes options from the $LESS variable, so rather than creating an alias with your favorite options, put them in that variable. It takes both long and short options, and any command-line options will override options in the variable. We recommend using the long options in the $LESS variable since they are easy to read. For example:
export LESS="--LONG-PROMPT --LINE-NUMBERS --ignore-case --QUIET"
But that is just the beginning. less is expandable via input preprocessors, which are simply programs or scripts that preprocess the file that less is about to display. This is handled by setting the $LESSOPEN and $LESSCLOSE environment variables appropriately.
You could build your own, but save yourself some time (after reading the following discussion) and look into Wolfgang Friebel’s lesspipe.sh. The script works by setting and exporting the $LESSOPEN environment variable when run by itself:
$ ./lesspipe.sh
LESSOPEN="|./lesspipe.sh %s"
export LESSOPEN
$
So you simply run it in an eval statement, like eval $(/path/to/lesspipe.sh) or eval `/path/to/lesspipe.sh`, and then use less as usual. A partial list of supported formats for version 1.82 is:
gzip, compress, bzip2, zip, rar, tar, nroff, ar archive, pdf, ps, dvi, shared library, executable, directory, RPM, Microsoft Word, OASIS (OpenDocument, Openoffice, Libreoffice) formats, Debian, MP3 files, image formats (png, gif, jpeg, tiff, …), utf-16 text, iso images and filesystems on removable media via /dev/xxx.
But there is a catch. These formats require various external tools, so not all features in the example lesspipe.sh will work if you don’t have them. The package also contains ./configure (or make) scripts to generate a version of the filter that will work on your system, given the tools that you have available.
less is unique in that it is a GNU tool that was already installed by default on every single test system we tried—every one. Not even bash can say this. And version differences aside, it works the same on all of them. Quite a claim to fame.
However, the same cannot be said for lesspipe* and the $LESSOPEN filters. We found other versions, with wildly variable capabilities, besides the ones listed in the Solution section:
Red Hat has a /usr/bin/lesspipe.sh that can’t be used like eval `/path/to/lesspipe.sh`.
Debian has a /usr/bin/lesspipe that can be eval’ed and also supports additional filters via a ~/.lessfilter file.
SUSE Linux has a /usr/bin/lessopen.sh that can’t be eval’ed.
FreeBSD has a trivial /usr/bin/lesspipe.sh (no eval, .Z, .gz, or .bz2).
Solaris, HP-UX, the other BSDs, and macOS have nothing by default.
To see if you already have one of these, try this on your system. This Debian system has the Debian lesspipe installed but not in use (since $LESSOPEN is not defined):
$ type lesspipe.sh; type lesspipe; set | grep LESS
-bash3: type: lesspipe.sh: not found
lesspipe is /usr/bin/lesspipe
$
This Ubuntu system has the Debian lesspipe installed and in use:
$ type lesspipe.sh; type lesspipe; set | grep LESS
-bash: type: lesspipe.sh: not found
lesspipe is hashed (/usr/bin/lesspipe)
LESSCLOSE='/usr/bin/lesspipe %s %s'
LESSOPEN='| /usr/bin/lesspipe %s'
$
We recommend that you download, configure, and use Wolfgang Friebel’s lesspipe.sh because it’s the most capable. We also recommend that you read the less manpage because it’s very interesting.
man less
man lesspipe
man lesspipe.sh
1 Note that our example ps command only works with certain systems; e.g., CentOS-4, Fedora Core 5, and Ubuntu work, but Red Hat 8, NetBSD, Solaris, and macOS all garble the output due to using different columns.