Linux Shell Scripting Cookbook

The egrep command converts the text file into a stream of words, one word per line. The \b[[:alpha:]]+\b pattern matches each run of alphabetic characters that sits between word boundaries, so whitespace and punctuation are dropped. The -o option prints only the matched text, one match per line.
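
As a small illustration of the splitting step, the sketch below runs the same pattern over a made-up sentence. The sample text and the variable names are hypothetical, grep -E is used as the equivalent of egrep, and the \b word-boundary escape is assumed to be available (it is a GNU grep extension).

```shell
# Hypothetical sample text; a file name could be passed to grep instead.
sample='Hello, world! Hello again.'

# -E selects extended regular expressions (what egrep does);
# -o prints each match on its own line instead of the whole input line.
words=$(printf '%s\n' "$sample" | grep -E -o '\b[[:alpha:]]+\b')

# Show the resulting stream: one word per line, punctuation gone.
printf '%s\n' "$words"
```

Each word appears on its own line, and repeated words are kept, which is what the counting stage relies on.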

The awk command counts each word. It executes the statements in the { } block once for every input line, so no explicit loop is needed. The count[$0]++ statement increments the count, where $0 is the current line and count is an associative array indexed by the word. After all the lines are processed, the END{} block prints the words and their counts.
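
The counting step can be tried on its own with a few words fed in directly. This is a minimal sketch: count_words is a hypothetical helper name, and the sort at the end is added only because awk's for-in loop visits array keys in no guaranteed order.

```shell
# count_words (hypothetical helper): reads one word per line on stdin
# and prints "word count" pairs.
count_words() {
  awk '{ count[$0]++ }    # runs once per input line; $0 is the word
       END { for (w in count) printf("%s %d\n", w, count[w]) }' | sort
}

# Feed six words, one per line, through the counter.
counts=$(printf '%s\n' apple pear apple fig apple pear | count_words)
printf '%s\n' "$counts"
```

The array grows as new words are seen; a word that has not been seen before starts at zero, so the first ++ sets it to 1.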

The body of this procedure can be modified using other tools we've looked at. We can merge capitalized and non-capitalized words into a single count with the tr command, and sort the output using the sort command, like this:

# Lowercase every word first so that "The" and "the" share one count
egrep -o "\b[[:alpha:]]+\b" "$filename" | tr 'A-Z' 'a-z' | \
  awk '{ count[$0]++ }
    END{ printf("%-14s%s\n","Word","Count") ;
      for(ind in count)
        {  printf("%-14s%d\n",ind,count[ind]);
        }
      }' | sort
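
One possible variation, sketched here under a few assumptions, orders the words by frequency instead of alphabetically. The function name word_freq is hypothetical, the header is printed outside the pipeline so that sort cannot move it, and a space is added after the 14-character column so that long words still leave the count as a separate field.

```shell
# word_freq (hypothetical helper): $1 is the text file to analyze.
word_freq() {
  # Print the header first, outside the sorted stream.
  printf '%-14s %s\n' "Word" "Count"
  grep -E -o '\b[[:alpha:]]+\b' "$1" | tr 'A-Z' 'a-z' |
    awk '{ count[$0]++ }
         END { for (w in count) printf("%-14s %d\n", w, count[w]) }' |
    # Field 2 numerically, highest first; ties broken by the word itself.
    sort -k2,2nr -k1,1
}

# Example call (the file name is a placeholder):
# word_freq notes.txt
```

Ending a line with | lets the pipeline continue on the next line without a trailing backslash, which avoids the problem of a stray space after the backslash breaking the continuation.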

Table of Contents for
Linux Shell Scripting Cookbook - Third Edition
