Chapter 7. Intermediate Shell Tools I

It is time to expand our repertoire. This chapter’s recipes use some utilities that are not part of the shell, but which are so useful that it is hard to imagine using the shell without them.

One of the overarching philosophies of Unix (and thus Linux) is that of small (i.e., limited in scope) program pieces that can be fit together to provide powerful results. Rather than have one program that does everything, we have many different programs that each do one thing well.

That applies to bash as well. While it’s getting big and feature-rich, it still doesn’t try to do everything, and there are times when it is easier to use other commands to accomplish a task even if bash can be stretched to do it.

A simple example of this is the ls command. You needn’t use ls to see the contents of your current directory. You could just type echo * to have filenames displayed. Or you could even get fancier, using the bash printf command and some formatting, etc. But that’s not really the purpose of the shell, and someone has already provided a listing program (ls) to deal with all sorts of variations in filesystem information.

Perhaps more importantly, by not expecting bash to provide more filesystem listing features, we avoid additional feature creep pressures and instead give it some measure of independence; ls can be released with new features without requiring that we all upgrade our bash versions.

But enough philosophy—back to the practical.

What we have here are three of the most useful text-related utilities: grep, sed, and awk.

The grep program searches for strings, the sed program provides a way to edit text as it passes through a pipeline, and awk…well, awk is its own interesting beast, a precursor to Perl and a bit of a chameleon—it can look quite different depending on how it is used.

These utilities, and a few more that we will discuss in the next chapter, are very much a part of most shell scripts and most sessions spent typing commands to bash. If your shell script requires a list of files on which to operate, it is likely that either find or grep will be used to supply that list of files, and it’s likely that sed and/or awk will be used to parse the input or format the output at some stage of the shell script.

To say it another way, if our scripting examples are going to tackle real-world problems, they need to use the wider range of tools that are actually used by real-world bash users and programmers.

7.1 Sifting Through Files for a String

Problem

You need to find all occurrences of a string in one or more files.

Solution

The grep command searches through files looking for the expression you supply:

$ grep printf *.c
both.c:    printf("Std Out message.\n", argv[0], argc-1);
both.c:    fprintf(stderr, "Std Error message.\n", argv[0], argc-1);
good.c:    printf("%s: %d args.\n", argv[0], argc-1);
somio.c:        // we'll use printf to tell us what we
somio.c:        printf("open: fd=%d\n", iod[i]);
$

The files we searched through in this example were all in the current directory. We just used the simple shell pattern *.c to match all the files ending in .c with no preceding pathname.

Not all the files through which you want to search may be that conveniently located. Of course, the shell doesn’t care how much pathname you type, so we could have done something like this:

grep printf ../lib/*.c ../server/*.c ../cmd/*.c */*.c

Discussion

When more than one file is searched, grep begins its output with the filename, followed by a colon. The text after the colon is what actually appears in the files that grep searched.

The search matches any occurrence of the specified characters, so a line that contained the string “fprintf” was returned, since “printf” is contained within “fprintf”.
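
If that substring behavior isn't what you want, most versions of grep (GNU and BSD among them) support a -w option to match whole words only. With the files from our example, that drops the fprintf line:

$ grep -w printf *.c
both.c:    printf("Std Out message.\n", argv[0], argc-1);
good.c:    printf("%s: %d args.\n", argv[0], argc-1);
somio.c:        // we'll use printf to tell us what we
somio.c:        printf("open: fd=%d\n", iod[i]);
$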

The first (nonoption) argument to grep can be just a simple string, as in this example, or it can be a more complex regular expression (regexp). These regexps are not the same as the shell’s pattern matching, though they can look similar at times. This kind of pattern matching is so powerful that you may find yourself relying on it to the point where you’ll start using “grep” as a verb, and wishing you could make use of it everywhere, as in “I wish I could grep my desk for that paper you wanted.”

You can vary the output from grep using command-line options. If you don’t want to see the specific filenames, you may turn this off using the -h option to grep:

$ grep -h printf *.c
  printf("Std Out message.\n", argv[0], argc-1);
  fprintf(stderr, "Std Error message.\n", argv[0], argc-1);
  printf("%s: %d args.\n", argv[0], argc-1);
     // we'll use printf to tell us what we
     printf("open: fd=%d\n", iod[i]);
$

If you don’t want to see the actual lines from the file, but only a count of the number of times the expression is found, then use the -c option:

$ grep -c printf *.c
both.c:2
good.c:1
somio.c:2
$

Warning

A common mistake is to forget to provide grep with a source of input—for example, grep myvar. In this case grep assumes you will provide input from STDIN, but you think it will get it from a file. So it just sits there forever, seemingly doing nothing. (In fact, it is waiting for input from your keyboard.) This is particularly hard to catch when you are grepping a large amount of data and expect it to take a while.

See Also

  • man grep

  • man regex (Linux, Solaris, HP-UX) or man re_format (BSD, Mac) for the details of your regular expression library

  • Mastering Regular Expressions, 3rd Edition, by Jeffrey E. F. Friedl (O’Reilly)

  • Classic Shell Scripting by Nelson H. F. Beebe and Arnold Robbins (O’Reilly), Sections 3.1 and 3.2

  • Chapter 9 and the find utility, for more far-reaching searches

  • Recipe 9.5, “Finding Files Irrespective of Case”

7.4 Searching for Text While Ignoring Case

Problem

You need to search for a string (e.g., “error”) in a logfile, and you want to do it case-insensitively to catch all occurrences.

Solution

Use the -i option on grep to ignore case:

grep -i error logfile.msgs

Discussion

A case-insensitive search finds messages written “ERROR,” “error,” and “Error,” as well as ones like “ErrOR” and “eRrOr.” This option is particularly useful for finding words anywhere that you might have mixed-case text, including words that might be capitalized at the beginning of a sentence or in email addresses.

7.5 Doing a Search in a Pipeline

Problem

You need to search for some text, but the text you’re searching for isn’t in a file; instead, it’s in the output of a command or perhaps even the output of a pipeline of commands.

Solution

Just pipe your results into grep:

some pipeline | of commands | grep 'expression'

Discussion

When no filename is supplied to grep, it reads from standard input. Most well-designed utilities meant for shell scripting will do this. It is one of the things that makes them so useful as building blocks for shell scripts.

If you also want to have grep search through error messages that come from the previous command, be sure to redirect its error output into standard output before the pipe:

gcc bigbadcode.c 2>&1 | grep -i error

This command attempts to compile some hypothetical hairy piece of code. We redirect standard error into standard output (2>&1) before we proceed to pipe (|) the output into grep, where it will search case-insensitively (-i) looking for the string error.
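
Note that the order matters: if you instead wrote gcc bigbadcode.c | grep -i error 2>&1, the 2>&1 would apply to grep's own error output, and gcc's error messages would bypass the pipe entirely and go straight to your screen.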

Don’t overlook the possibility of grepping the output of grep. Why would you want to do that? To further narrow down the results of a search. Let’s say you wanted to find out Bob Johnson’s email address:

$ grep -i johnson mail/*
... too much output to think about; there are lots of Johnsons in the world ...
$ !! | grep -i robert
grep -i johnson mail/* | grep -i robert
... more manageable output ...
$ !! | grep -i "the bluesman"
grep -i johnson mail/* | grep -i robert | grep -i "the bluesman"
Robert M. Johnson, The Bluesman <rmj@noplace.org>

You could have retyped the first grep, but this example also shows the power of the !! history operator (see Recipe 18.2). The !! lets you repeat the previous command without retyping it. You can then continue adding to the command line after the !! as we show here. The shell will display the command that it runs, so that you can see what you got as a result of the !! substitution.

You can build up a long grep pipeline very quickly and simply this way, seeing the results of the intermediate steps as you go and deciding how to refine your search with additional grep expressions. You could also accomplish the same task with a single grep and a clever regular expression, but we find that building up a pipeline incrementally is easier.

7.6 Paring Down What the Search Finds

Problem

Your search is returning way more than you expected, including many results you don’t want.

Solution

Pipe the results into grep -v with an expression that describes what you don’t want to see.

Let’s say you were searching for messages in a logfile, and you wanted all the messages from the month of December. You know that your logfile uses the three-letter abbreviation Dec for December, but you’re not sure if it’s always abbreviated, so to be sure to catch all the messages you type:

grep -i dec logfile

But then you get output like this:

...
error on Jan 01: not a decimal number
error on Feb 13: base converted to Decimal
warning on Mar 22: using only decimal numbers
error on Dec 16 : the actual message you wanted
error on Jan 01: not a decimal number
...

A quick and dirty solution in this case is to pipe the first result into a second grep and tell the second grep to ignore any instances of “decimal”:

grep -i dec logfile | grep -vi decimal

It’s not uncommon to string a few of these together (as new, unexpected matches are also discovered) to filter down the search results to what you’re really looking for:

grep -i dec logfile | grep -vi decimal | grep -vi decimate

Discussion

The “dirty” part of this “quick and dirty” solution is that the solution here might also get rid of some of the December log messages, ones that you wanted to keep—if they have the word “decimal” in them, they’ll be filtered out by the grep -v.

The -v option can be handy if used carefully; you just have to keep in mind what it might exclude.

For this particular example, a better solution would be to use a more powerful regular expression to match the December date, one that looked for “Dec” followed by a space and two digits:

grep 'Dec [0-9][0-9]' logfile

But that often won’t work either, because syslog pads single-digit dates with a space. To account for this, we can add a space inside the first set of brackets:

grep 'Dec [0-9 ][0-9]' logfile

We used single quotes around the expression because of the embedded spaces, and to avoid any possible shell interpretation of the bracket characters (the brackets are also shell pathname wildcards; they would only be expanded if a matching filename happened to exist, but quoting removes the risk entirely). It’s good to get into the habit of using single quotes around anything that might possibly be confusing to the shell. We could have written:

grep Dec\ [0-9\ ][0-9] logfile

escaping the spaces with a backslash, but in that form it’s harder to see where the search string ends and the filename begins.

7.7 Searching with More Complex Patterns

The regular expression mechanism of grep provides for some very powerful patterns that can fit most of your needs.

A regular expression describes patterns for matching against strings. Any alphabetic character (or any other character with no special meaning in regular expressions) just matches itself in the string. “A” matches A, “B” matches B; no surprise there. The next important rule is concatenation: characters placed side by side match in sequence, so AB matches an “A” followed immediately by a “B”. This, too, seems obvious. But regular expressions define other special characters that can be used by themselves or in combination with other characters to make more complex patterns.

The first special character is the period (.), which matches any single character. Therefore, .... matches any four characters; A. matches an “A” followed by any character; and .A. matches any character, then an “A”, then any character (not necessarily the same character as the first).

An asterisk (*) matches zero or more occurrences of the previous character, so A* matches zero or more “A” characters, and .* matches zero or more characters of any sort (such as “abcdefg”, “aaaabc”, “sdfgf ;lkjhj”, or even an empty line).

So what does ..* mean? It matches any single character followed by zero or more of any character (i.e., one or more characters, but not an empty line).

Speaking of lines, the caret ^ matches the beginning of a line of text and the dollar sign $ matches the end of a line; hence, ^$ matches an empty line (the beginning followed by the end, with nothing in between).

What if you want to match an actual period, caret, dollar sign, or any other special character? Precede it with a backslash (\). ion. matches the letters “ion” followed by any single character, but ion\. matches “ion” followed by a literal period (e.g., at the end of a sentence or wherever else it appears with a trailing dot).
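
If you want to see these in action as you read, printf makes a handy generator of test input. Here are a few illustrative runs (the sample strings are invented for the demonstration):

$ printf 'A\nAB\nAXC\n' | grep 'A.C'
AXC
$ printf 'ion\nionic\nend of ion.\n' | grep 'ion\.'
end of ion.
$ printf 'abc\n\nxyz\n' | grep -c '^$'
1
$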

A set of characters enclosed in square brackets (e.g., [abc]) matches any one of those characters (e.g., “a” or “b” or “c”). If the first character inside the square brackets is a caret, then it matches any character that is not in that set.

For example, [AaEeIiOoUu] matches any of the vowels, and [^AaEeIiOoUu] matches any character that is not a vowel. This last case is not the same as saying that it matches consonants, because [^AaEeIiOoUu] also matches punctuation and other special characters that are neither vowels nor consonants.

Another mechanism we want to introduce is a repetition mechanism called an “interval expression,” written as \{n,m\}, where n is the minimum number of repetitions and m is the maximum. If it is written as \{n\} it means “exactly n times,” and when written as \{n,\} it means “at least n times.”

For example, the regular expression A\{5\} matches exactly five “A” characters in a row, whereas A\{5,\} matches five or more “A” characters.
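
A couple of quick experiments make these last two mechanisms concrete (again, the test strings are invented); note how anchoring the interval expression with ^ and $ insists on exactly five characters for the whole line:

$ printf 'hat\nhit\nhxt\n' | grep 'h[aeiou]t'
hat
hit
$ printf 'AAA\nAAAAA\nAAAAAAA\n' | grep 'A\{5\}'
AAAAA
AAAAAAA
$ printf 'AAA\nAAAAA\nAAAAAAA\n' | grep '^A\{5\}$'
AAAAA
$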

7.8 Searching for an SSN

Problem

You need a regular expression to match a Social Security number.

Solution

In the US these numbers are nine digits long, typically grouped as three digits, then two digits, then a final four digits (e.g., 123-45-6789). Sometimes they are written without hyphens, so you need to make hyphens optional in the regular expression:

grep '[0-9]\{3\}-\{0,1\}[0-9]\{2\}-\{0,1\}[0-9]\{4\}' datafile

You should be able to adapt this to other countries as needed, or consult one of the books we reference at the end of this recipe.

Discussion

These kinds of regular expressions are often jokingly referred to as write-only expressions, meaning that they can be difficult or impossible to read. We’ll take this one apart to help you understand it. In general, though, in any bash script that you write using regular expressions, be sure to put comments nearby explaining what you intend the regular expression to match.

Adding some spaces to the regular expression would improve its readability, making visual comprehension easier, but it would also change the meaning—it would say that we’d need to match space characters at those points in the expression. Ignoring that for the moment, let’s insert some spaces into the previous regular expression so that we can read it more easily:

[0-9]\{3\} -\{0,1\} [0-9]\{2\} -\{0,1\} [0-9]\{4\}

The first grouping says “any digit” then “exactly 3 times.” The next grouping says “a dash” then “0 or 1 time.” The third grouping says “any digit” then “exactly 2 times.” The next grouping says “a dash” then “0 or 1 time.” The last grouping says “any digit” then “exactly 4 times.”
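
To convince yourself that the optional hyphens behave as intended, a quick test with echo works well (the number here is, of course, made up):

$ echo "123-45-6789" | grep '[0-9]\{3\}-\{0,1\}[0-9]\{2\}-\{0,1\}[0-9]\{4\}'
123-45-6789
$ echo "123456789" | grep '[0-9]\{3\}-\{0,1\}[0-9]\{2\}-\{0,1\}[0-9]\{4\}'
123456789
$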

See Also

  • man regex (Linux, Solaris, HP-UX) or man re_format (BSD, Mac) for the details of your regular expression library

  • Classic Shell Scripting by Nelson H. F. Beebe and Arnold Robbins (O’Reilly), Section 3.2, for more about regular expressions and the tools that use them

  • Mastering Regular Expressions, 3rd Edition, by Jeffrey E. F. Friedl (O’Reilly)

  • Regular Expressions Cookbook, 2nd Edition, by Jan Goyvaerts and Steven Levithan (O’Reilly)

  • Recipe 9.5, “Finding Files Irrespective of Case”

7.9 Grepping Compressed Files

Problem

You need to grep some compressed files. Do you have to uncompress them first?

Solution

Not if you have zgrep, zcat, or gzcat on your system.

zgrep is simply a grep that understands various compressed and uncompressed file types (which types are understood varies from system to system). You will commonly run into this when searching syslog messages on Linux, since the log rotation facilities leave the current logfile uncompressed (so it can be in use), but compress the older, rotated logs with gzip:

zgrep 'search term' /var/log/messages*

zcat is simply a cat that understands various compressed and uncompressed files (which types are understood varies from system to system). It might understand more formats than zgrep, and it might be installed on more systems by default. It is also used in recovering damaged compressed files, since it will simply output everything it possibly can, instead of erroring out as gunzip or other tools might:

zcat /var/log/messages.1.gz

gzcat is similar to zcat, the differences having to do with commercial versus free Unix variants, and backward compatibility.

Discussion

The less utility may also be configured to transparently display various compressed files, which is very handy. See Recipe 8.15.

7.10 Keeping Some Output, Discarding the Rest

Problem

You need a way to keep some of your output and discard the rest.

Solution

The following code prints the first word of every line of input:

awk '{print $1}' myinput.file

Words are delineated by whitespace. The awk utility reads data from the filename supplied on the command line, or from standard input if no filename is given. Therefore, you can redirect the input from a file, like this:

awk '{print $1}' < myinput.file

or even from a pipe, like this:

cat myinput.file | awk '{print $1}'

Discussion

The awk program can be used in several different ways. Its easiest, simplest use is just to print one or more selected fields from its input.

Fields are delineated by whitespace (or specified with the -F option) and are numbered starting at 1. The field $0 represents the entire line of input.
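
For example, the fields in /etc/passwd are separated by colons rather than whitespace, so to print just the usernames you could use:

awk -F: '{print $1}' /etc/passwd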

awk is a complete programming language; awk scripts can become extremely complex. This is only the beginning.

7.11 Keeping Only a Portion of a Line of Output

Problem

You want to keep only a portion of a line of output, such as just the first and last words. For example, you would like ls to list just filenames and permissions, without all of the other information provided by ls -l. However, you can’t find any options to ls that would limit the output in that way.

Solution

Pipe ls into awk, and just pull out the fields that you need:

$ ls -l | awk '{print $1, $NF}'
total 151130
-rw-r--r-- add.1
drwxr-xr-x art
drwxr-xr-x bin
-rw-r--r-- BuddyIcon.png
drwxr-xr-x CDs
drwxr-xr-x downloads
drwxr-sr-x eclipse
...
$

Discussion

Consider the output from the ls -l command. One line of it looks like this:

drwxr-xr-x 2 username group      176 2006-10-28 20:09 bin

so it is convenient for awk to parse (by default, whitespace delineates fields in awk). The output from ls -l has the permissions as the first field and the filename as the last field.

We use a bit of a trick to print the filename. Since the various fields are referenced in awk using a dollar sign followed by the field number (e.g., $1, $2, $3), and since awk has a built-in variable called NF that holds the number of fields found on the current line, $NF always refers to the last field. (For example, the ls output line has eight fields, so the variable NF contains 8, so $NF refers to the eighth field of the input line, which in our example is the filename.)

Just remember that you don’t use a $ to read the value of an awk variable (unlike bash variables). NF is a valid variable reference by itself. Adding a $ before it changes its meaning from “the number of fields on the current line” to “the last field on the current line.”
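
Here’s a one-line demonstration of the difference:

$ echo "one two three" | awk '{print NF, $NF}'
3 three
$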

7.12 Reversing the Words on Each Line

Problem

You want to print the input lines with words in the reverse order.

Solution

$ awk '{
>     for (i=NF; i>0; i--) {
>         printf "%s ", $i;
>     }
>     printf "\n"
> }' <filename>

You don’t type the > characters; the shell will print those as a prompt to say that you haven’t ended your command yet (it is looking for the matching single-quote mark). Because the awk program is enclosed in single quotes, the bash shell lets us type multiple lines, prompting us with the secondary prompt > until we supply the matching end quote. We spaced out the program for readability, even though we could have stuffed it all onto one line like this:

$ awk '{for (i=NF; i>0; i--) {printf "%s ", $i;} printf "\n" }' <filename>

Discussion

The awk language has syntax for a for loop, very much like C. It even supports a printf mechanism for formatted output, again modeled after the C version (the bash version, too). We use the for loop to count down from the last to the first field, and print each field as we go. We deliberately don’t put a \n on that first printf because we want to keep the several fields on the same line of output. When the loop is done, we add a newline to terminate the line of output.

The reference to $i is very different in awk compared to bash. In bash, when we write $i we are getting at the value stored in the variable named i. But in awk, as with most programming languages, we simply reference the value in i by naming it—that is, by just writing i. So what is meant by $i in awk? The value of the variable i is resolved to a number, and then the dollar-number expression is understood as a reference to a field (or word) of input—that is, the ith field. So as i counts down from the last field to the first, this loop will print the fields in that reversed order.
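
For example:

$ echo "one two three" | awk '{for (i=NF; i>0; i--) {printf "%s ", $i;} printf "\n"}'
three two one
$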

7.13 Summing a List of Numbers

Problem

You need to sum a list of numbers, including numbers that don’t appear on lines by themselves.

Solution

Use awk both to isolate the field to be summed and to do the summing. Here we’ll sum up the numbers that are the file sizes from the output of an ls -l command:

ls -l | awk '{sum += $5}; END {print sum}'

Discussion

We are summing up the fifth field of the ls -l output. The output of ls -l looks like this:

-rw-r--r-- 1 albing users 267 2005-09-26 21:26 lilmax

The fields are: permissions, links, owner, group, size (in bytes), last modification date, time of modification, and filename. We’re only interested in the size, so we use $5 in our awk program to reference that field.

We enclose the two bodies of our awk program in braces ({}); note that there can be more than one body (or block) of code in an awk program. A block of code preceded by the literal keyword END is only run once, when the rest of the program has finished. Similarly, you can prefix a block of code with BEGIN and supply some code that will be run before any input is read. The BEGIN block is useful for initializing variables, and we could have used one here to initialize sum, but awk guarantees that variables will start out empty.
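
Had we wanted to be explicit, a BEGIN block could initialize the sum; this is equivalent to the earlier solution, just more verbose:

ls -l | awk 'BEGIN {sum = 0} {sum += $5}; END {print sum}'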

If you look at the output of an ls -l command, you will notice that the first line is a total, and doesn’t fit our expected format for the other lines.

We have two choices for dealing with that. First, we can pretend it’s not there, which is the approach taken in the preceding solution. Since that undesired line doesn’t have a fifth field, our reference to $5 will be empty, and our sum won’t change.

The more conscientious approach would be to eliminate that line. We could do so before we give the output to awk by using grep:

ls -l | grep -v '^total' | awk '{sum += $5}; END {print sum}'

or we could do a similar thing within awk:

ls -l | awk '/^total/{next} {sum += $5}; END {print sum}'

The ^total is a regular expression (regex); it means “the letters t-o-t-a-l occurring at the beginning of a line” (the leading ^ anchors the search to the beginning of a line). For any line of input matching that regex, the associated block of code will be executed. The second block of code (the sum) has no leading text, the absence of which tells awk to execute it for every line of input (meaning this will happen regardless of whether the line matches the regex).

Now, the whole point of adding the special case for “total” was to exclude such a line from our summing. Therefore, in the ^total block we add a next command, which ends processing on this line of input and starts over with the next line of input. Since that next line of input will not begin with “total”, awk will execute the second block of code with this new line of input. We could also have used a getline in place of the next command. getline does not rematch all the patterns from the top, only the ones from there on down. Note that in awk programming, the order of the blocks of code matters.

7.14 Counting String Values with awk

Problem

You need to count all the occurrences of several different strings, including some strings whose values you don’t know beforehand. That is, you’re not trying to count the occurrences of a predetermined set of strings. Rather, you are going to encounter some strings in your data and you want to count these as-yet-unknown strings.

Solution

Use awk’s associative arrays (also known as hashes or dictionaries in other languages) for your counting.

For our example, we’ll count how many files are owned by various users on our system. The username shows up as the third field in ls -l output, so we’ll use that field ($3) as the index of the array and increment that member of the array (see Example 7-1).

Example 7-1. ch07/asar.awk
#!/usr/bin/awk -f
# cookbook filename: asar.awk
# Associative arrays in Awk
# Usage: ls -lR /usr/local | asar.awk

NF > 7 {
    user[$3]++
}
END {
    for (i in user) {
        printf "%s owns %d files\n", i, user[i]
    }
}

We invoke awk a bit differently here. Because this awk script is a bit more complex, we’ve put it in a separate file, and we use the -f option to tell awk where to find it. Since the script begins with a #!/usr/bin/awk -f shebang line, we could also have made the file executable and invoked it directly:

$ ls -lR /usr/local | awk -f asar.awk
bin owns 68 files
albing owns 1801 files
root owns 13755 files
man owns 11491 files
$

Discussion

We use the condition NF > 7 as a qualifier on that part of the awk script to weed out the lines in the ls -lR output that do not contain filenames. Such lines are there for readability: blank lines separating the different directories, and total counts for each subdirectory. They don’t have as many fields (or words) as the lines we care about. The expression NF > 7 that precedes the opening brace is not enclosed in slashes, which is to say that it is not a regular expression. It’s a logical expression, much like you would use in an if statement, and it evaluates to true or false. The NF variable is a special built-in variable that refers to the number of fields in the current line of input. So, only a line of input with more than seven fields (words of text) will be processed by the statements within the braces.

The key line, however, is this one:

 user[$3]++

Here, the username (e.g., bin) is used as the index to the array. It’s called an associative array because a hash table (or similar mechanism) is being used to associate each unique string with a numerical value. awk is doing all that work for you behind the scenes; you don’t have to write any string comparisons or lookups and such.

Once you’ve built such an array, it might seem difficult to get the values back out. For this, awk has a special form of the for loop. Instead of the numeric for(i=0; i<max; i++) that awk also supports, there is a particular syntax for associative arrays:

for (i in user)

In this expression, the variable i will take on successive values (in no particular order) from the various values used as indexes to the array user. In our example, this means that i will take on the values (bin, albing, root, and man), one in each iteration of the loop. If you haven’t seen associative arrays before, then we hope that you’re surprised and impressed. This is a very powerful feature of awk (and Perl).

7.15 Counting String Values with bash

Problem

You need to count all the occurrences of several different strings, including some strings whose values you don’t know beforehand. That is, you’re not trying to count the occurrences of a predetermined set of strings. Rather, you are going to encounter some strings in your data and you want to count these as-yet-unknown strings.

Solution

If you are using version 4.0 or newer, use bash’s associative arrays (also known as hashes or dictionaries in other languages) for your counting.

For our example, we’ll count how many files are owned by various users on our system. The username shows up as the third word in ls -l output, so we’ll use that word as the index of the array, and increment that member of the array:

Example 7-2. ch07/cnt_owner.sh
# cookbook filename: cnt_owner
# count owners of a file using bash
# pipe "ls -l" into this script

declare -A AACOUNT
while read -a LSL
do
    # only consider lines with more than 7 words
    if (( ${#LSL[*]} > 7 ))     # the size of the array
    then
        NDX=${LSL[2]}                 # string assign; owner is word 3, index 2
        (( AACOUNT[${NDX}] += 1 ))    # math increment
    fi
done

for VALS in "${!AACOUNT[@]}"      # index of each element
do
    echo $VALS "owns" ${AACOUNT[$VALS]} "files"
done

We can invoke the program as follows with the results as shown:

$ ls -lR /usr/local | bash cnt_owner.sh
bin owns 68 files
root owns 13755 files
man owns 11491 files
albing owns 1801 files
$

Discussion

The read -a LSL reads a line at a time, and each word (delineated by whitespace) is assigned to an entry in the array LSL. We check how many words were read by looking at the size of the array, to weed out the lines that do not contain filenames. Such lines are part of the ls -lR output and are useful for readability, since they include blank lines to separate different directories as well as total counts for each subdirectory. They don’t have useful information for our script, but fortunately such lines don’t have as many fields (or words) as the lines we want.

Only for lines with more than seven words do we take the third word, which should be the owner of the file, and use that as an index to our associative array (the third word is LSL[2], since bash array indexes start at zero). With standard arrays, such as LSL, each element is referred to by its index and that index is an integer. With an associative array, however, the index can be a string.

To print out the results we need to loop over the list of index values that were used with this array. The construct "${AACOUNT[@]}" would generate a list of all the values in the array, but add the “bang”—"${!AACOUNT[@]}"—and you get a list of all the index values used with this array.

Note that the output is in no particular order (it’s related to the internals of the hashing algorithm). If you want it sorted by name or by number of files, then pipe this result into the sort command.
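
For example, to list the owners with the most files first (the count is the third field of this script’s output):

$ ls -lR /usr/local | bash cnt_owner.sh | sort -rn -k3
root owns 13755 files
man owns 11491 files
albing owns 1801 files
bin owns 68 files
$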

7.16 Showing Data as a Quick and Easy Histogram

Problem

You need a quick screen-based histogram of some data.

Solution

Use the associative arrays of awk, as discussed in Recipe 7.14 (see Example 7-3).

Example 7-3. ch07/hist.awk
#!/usr/bin/awk -f
# cookbook filename: hist.awk
# Histograms in Awk
# Usage: ls -lR /usr/local | hist.awk

function max(arr, big)
{
    # big serves as a local variable here, thanks to
    # awk's extra-function-parameter idiom
    big = 0;
    for (i in arr)
    {
        if (arr[i] > big) { big=arr[i];}
    }
    return big
}

NF > 7 {
    user[$3]++
}
END {
    # for scaling
    maxm = max(user);
    for (i in user) {
        #printf "%s owns %d files\n", i, user[i]
        scaled = 60 * user[i] / maxm ;
        printf "%-10.10s [%8d]:", i, user[i]
        for (j=0; j<scaled; j++) {
            printf "#";
        }
        printf "\n";
    }
}

When we run it with the same input as Recipe 7.14, we get:

$ ls -lR /usr/local | awk -f hist.awk
bin       [      68]:#
albing    [    1801]:#######
root      [   13755]:##################################################
man       [   11491]:##########################################
$

Discussion

We could have put the code for max as the first code inside the END block, but we wanted to show you that you can define functions in awk. We are using a fancier printf statement here. The string format %-10.10s will left-justify and pad to 10 characters but also truncate at 10 characters. The integer format %8d will assure that the integer is printed in an 8-character field. This gives each histogram the same starting point, by using the same amount of space regardless of the username or the size of the integer.

Like all arithmetic in awk, the scaling calculation is done with floating-point numbers unless we explicitly truncate the result with a call to the built-in int() function. We don’t do so, which means that the for loop will execute at least once, so that even the smallest amount of data will still display a single hash mark.

The data returned from the for (i in user) loop is in no particular order, probably based on some convenient ordering of the underlying hash table. If you wanted the histogram displayed in a sorted order, either numeric by count or alphabetical by username, you would have to add some sorting. One way to do this is to break this program apart into two pieces, sending the output from the first part into the sort command and then piping that output into the second piece to print the histogram.

7.17 An Easy Histogram with bash

Problem

You’d like to use bash rather than an external program to compute and draw your histogram. Is that possible?

Solution

Yes, thanks to associative arrays, which are available in bash versions 4.0 and newer. This solution is based on the code for counting strings (Recipe 7.15); the difference is only in the output section (and here we’ve named the array UCOUNT rather than AACOUNT). First we make a pass over the values to find the largest one, so that we can scale our output to fit on the page:

    BIG=0
    for VALS in "${!UCOUNT[@]}"
    do
        if (( UCOUNT[$VALS] > BIG )) ; then BIG=${UCOUNT[$VALS]} ; fi
    done

With a maximum value (in BIG), we output a line for each entry in the array:

#
# print the histogram
#
for VALS in "${!UCOUNT[@]}"
do
    printf "%-9.9s [%7d]:" $VALS ${UCOUNT[$VALS]}
    # scale to the max value (BIG); N.B. integer /
    SCALED=$(( ( (59 * UCOUNT[$VALS]) / BIG) +1 ))
    for ((i=0; i<SCALED; i++)) {
        printf "#"
    }
    printf "\n"
done

Discussion

As in Recipe 7.15, the construct "${!UCOUNT[@]}" is crucial. It evaluates to a list of index values used on the array (in this case, the array UCOUNT). The for loop takes each value one at a time and uses it as the index into the array to get the count for that user.

We scale it to 59 spaces and then add 1 so that any nonzero value will have at least one mark on the histogram. This isn’t a problem in the awk version (Recipe 7.16) because awk uses floating-point math, but the bash version uses integer math so anything too small may end up as 0 after the division.

7.18 Showing a Paragraph of Text After a Found Phrase

Problem

You are searching for a phrase in a document, and want to show the paragraph after the found phrase.

Solution

We’re assuming a simple text file, where paragraph means all the text between blank lines, so the occurrence of a blank line implies a paragraph break. Given that, it’s a pretty short awk program:

$ cat para.awk
/keyphrase/ { flag=1 }
flag == 1 { print }
/^$/ { flag=0 }

$ awk -f para.awk < searchthis.txt

Discussion

There are just three simple code blocks. The first is invoked when a line of input matches the regular expression (here just the word “keyphrase”). If “keyphrase” occurs anywhere within a line of input, that is a match and this block of code will be executed. All that happens in this block is that the flag is set.

The second code block is invoked for every line of input, since there is no regular expression preceding its open brace. Even the input that matches “keyphrase” will also be applied to this code block (if we didn’t want that effect, we could use a next statement in the first block). All this second block does is print the entire input line, but only if the flag is set.

The third block has a regular expression that, if satisfied, will simply reset (turn off) the flag. That regular expression uses two characters with special meaning: the caret (^), when used as the first character of a regular expression, matches the beginning of the line; the dollar sign ($), when used as the last character, matches the end of the line. So, the regular expression ^$ matches an empty line, with no characters between the beginning and end of the line.

We could have used a slightly more complicated regular expression for an empty line to let it handle any line with just whitespace rather than a completely blank line. That would make the third line look like this:

/^[[:blank:]]*$/ { flag=0 }

Perl programmers love the sort of problem and solution discussed in this recipe, but we’ve implemented it with awk because Perl is (mostly) beyond the scope of this book. If you know Perl, by all means use it. If not, awk might be all you need.
