sed & awk, 2nd Edition, by Arnold Robbins. Published by O'Reilly Media, Inc., 1997

Substitution

We have already demonstrated many uses of the substitute command. Let’s look carefully at its syntax:

[address]s/pattern/replacement/flags

where the flags that modify the substitution are:

n

A number (1 to 512) indicating that a replacement should be made for only the nth occurrence of the pattern.

g

Make changes globally on all occurrences in the pattern space. Normally only the first occurrence is replaced.

p

Print the contents of the pattern space.

w file

Write the contents of the pattern space to file.

The substitute command is applied to the lines matching the address. If no address is specified, it is applied to all lines that match the pattern, which is a regular expression. If a regular expression is supplied as an address and the pattern is left empty (as in s//replacement/), the substitute command matches what the address matched. This can be useful when the substitute command is one of multiple commands applied at the same address. For an example, see Section 5.11.1 later in this chapter.
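
As a small illustration of the empty-pattern case, here is a sketch run from the shell. The sample text and the variable name out are made up for this example; the address /colour/ selects the line, and the empty pattern tells the substitute command to reuse that regular expression.

```shell
# The address selects lines containing "colour"; the empty pattern (//)
# makes s reuse the regular expression from the address.
out=$(printf '%s\n' 'favourite colour' 'no match here' |
      sed '/colour/s//color/')
printf '%s\n' "$out"
```

Only the first line is changed; the second does not match the address and passes through untouched.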

Unlike addresses, which normally use a slash (/) as a delimiter, the regular expression can be delimited by any character except a newline or a backslash. Thus, if the pattern contains slashes, you can choose another character, such as an exclamation mark, as the delimiter:

s!/usr/mail!/usr2/mail!

Note that the delimiter appears three times and is required after the replacement. Regardless of which delimiter you use, if it does appear in the regular expression, or in the replacement text, use a backslash (\) to escape it.
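
To compare the two approaches, the following sketch applies the same edit twice to a made-up path (the input string and the variable names out1 and out2 are assumptions for illustration): once with an alternate delimiter, and once with the default slash and backslash escapes.

```shell
# Using "!" as the delimiter leaves the slashes in the path unescaped.
out1=$(printf '%s\n' '/usr/mail/dale' | sed 's!/usr/mail!/usr2/mail!')
# With the default "/" delimiter, every literal slash needs a backslash.
out2=$(printf '%s\n' '/usr/mail/dale' | sed 's/\/usr\/mail/\/usr2\/mail/')
printf '%s\n' "$out1" "$out2"
```

Both commands produce the same result; the first is simply easier to read.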

Once upon a time, computers stored text in fixed-length records. A line ended after so many characters (typically 80), and then the next line started. There was no explicit character in the data to mark the end of one line and the beginning of the next; every line had the same (fixed) number of characters. Modern systems are more flexible; they use a special character (referred to as newline) to mark the end of the line. This allows lines to be of arbitrary[3] length.

Since newline is just another character when stored internally, a regular expression can use “\n” to match an embedded newline. This occurs, as you will see in the next chapter, in the special case when another line is appended to the current line in the pattern space. (See Chapter 2 for a discussion of line addressing and Chapter 3 for a discussion of regular expression syntax.)
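
A brief preview of that special case, offered as a sketch only: the N command (covered in the next chapter) is used here just to get a second line, and with it an embedded newline, into the pattern space. The input lines and the variable name out are invented for the example.

```shell
# N appends the next input line to the pattern space, separated by a
# newline; the substitution then matches that embedded newline with \n.
out=$(printf '%s\n' 'first half' 'second half' | sed 'N
s/\n/ + /')
printf '%s\n' "$out"
```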

The replacement is a string of characters that will replace what is matched by the regular expression. (See Section 3.2.12.1 in Chapter 3.) In the replacement section, only the following characters have special meaning:

&

Replaced by the string matched by the regular expression.

\n

Replaced by the nth substring (n is a single digit) previously specified in the pattern using “\(” and “\)”.

\

Used to escape the ampersand (&), the backslash (\), and the substitution command’s delimiter when they are used literally in the replacement section. In addition, it can be used to escape the newline and create a multiline replacement string.

Thus, besides metacharacters in regular expressions, sed also has metacharacters in the replacement. See the next section, “Replacement Metacharacters,” for examples of using them.

Flags can be used in combination where it makes sense. For instance, gp makes the substitution globally on the line and prints the line. The global flag is by far the most commonly used. Without it, the replacement is made only for the first occurrence on the line. The print flag and the write flag both provide the same functionality as the print and write commands (which are discussed later in this chapter) with one important difference. These actions are contingent upon a successful substitution occurring. In other words, if the replacement is made, the line is printed or written to file. Because the default action is to pass through all lines, regardless of whether any action is taken, the print and write flags are typically used when the default output is suppressed (the -n option). In addition, if a script contains multiple substitute commands that match the same line, multiple copies of that line will be printed or written to file.
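
The behavior of the print and write flags can be sketched from the shell as follows. The sample lines and variable names are made up, and the example assumes the mktemp utility is available for creating a scratch file.

```shell
input=$(printf '%s\n' 'one cat' 'two dogs' 'cat and cat')

# -n suppresses the default output, so only lines on which a
# substitution succeeded are printed by the p flag.
out=$(printf '%s\n' "$input" | sed -n 's/cat/CAT/gp')

# Two substitute commands with p that both succeed print the line twice.
twice=$(printf '%s\n' 'cat dog' | sed -n -e 's/cat/CAT/p' -e 's/dog/DOG/p')

# The w flag writes only the changed lines to the named file.
tmp=$(mktemp)                      # assumption: mktemp is installed
printf '%s\n' "$input" | sed -n "s/cat/CAT/w $tmp"
written=$(cat "$tmp")
rm -f "$tmp"

printf '%s\n' "$out" "$twice" "$written"
```

Note that the line “two dogs” appears in none of the results, because no substitution succeeded on it.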

The numeric flag can be used in the rare instances where the regular expression repeats itself on a line and the replacement must be made for only one of those occurrences by position. For instance, a line, perhaps containing tbl input, might contain multiple tabs. Let’s say that there are three tabs per line, and you’d like to replace the second tab with “>”. The following substitute command would do it:

s/•/>/2

“•” represents an actual tab character, which is otherwise invisible on the screen. If the input is a one-line file such as the following:

Column1•Column2•Column3•Column4

the output produced by running the script on this file will be:

Column1•Column2>Column3•Column4

Note that without the numeric flag, the substitute command would replace only the first tab. (Therefore “1” can be considered the default numeric flag.)
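
Because a tab is awkward to type on the command line, one way to try this example is to let printf produce the tab and store it in a shell variable. This is only a sketch; the variable names tab, line, and out are our own.

```shell
tab=$(printf '\t')     # a literal tab character
line="Column1${tab}Column2${tab}Column3${tab}Column4"
# Replace only the second tab on the line.
out=$(printf '%s\n' "$line" | sed "s/${tab}/>/2")
printf '%s\n' "$out"
```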

Replacement Metacharacters

The replacement metacharacters are backslash (\), ampersand (&), and \n. The backslash is generally used to escape the other metacharacters but it is also used to include a newline in a replacement string.

We can do a variation on the previous example to replace the second tab on each line with a newline.

s/•/\
/2

Note that no spaces are permitted after the backslash. This script produces the following result:

Column1•Column2
Column3•Column4
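
The same script can be tried from the shell, again as a sketch with the tab supplied by printf (the variable names are our own). The script is kept in single quotes so that the shell passes the backslash and the newline after it to sed unchanged.

```shell
tab=$(printf '\t')     # a literal tab character
# The backslash must be the last character on its line.
out=$(printf '%s\n' "Column1${tab}Column2${tab}Column3${tab}Column4" |
      sed 's/'"$tab"'/\
/2')
printf '%s\n' "$out"
```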

Another example comes from the conversion of a file for troff to an ASCII input format for Ventura Publisher. It converts the following line for troff:

.Ah "Major Heading"

to a similar line for Ventura Publisher:

@A HEAD = Major Heading

The twist in this problem is that the line needs to be preceded and followed by blank lines. It is an example of writing a multiline replacement string.

/^\.Ah/{
s/\.Ah */\
\
@A HEAD = /
s/"//g
s/$/\
/
}

The first substitute command replaces “.Ah” with two newlines and “@A HEAD =”. A backslash at the end of the line is necessary to escape the newline. The second substitution removes the quotation marks. The last command matches the end of line in the pattern space (not the embedded newlines) and adds a newline after it.
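
To see the effect in context, the sketch below runs the script on a heading placed between two lines of made-up body text. The surrounding text and the variable name out are assumptions for illustration.

```shell
out=$(printf '%s\n' 'Some text.' '.Ah "Major Heading"' 'More text.' |
sed '/^\.Ah/{
s/\.Ah */\
\
@A HEAD = /
s/"//g
s/$/\
/
}')
printf '%s\n' "$out"
```

The heading line comes out with empty lines before and after it, while the body text is passed through unchanged.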

In the next example, the backslash is used to escape the ampersand, which appears literally in the replacement section.

s/ORA/O'Reilly \& Associates, Inc./g

It’s easy to forget about the ampersand appearing literally in the replacement string. If we had not escaped it in this example, the output would have been “O’Reilly ORA Associates, Inc.”
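
The difference is easy to check from the shell. In this sketch (the input sentence and variable names are invented), the script is placed in double quotes because the replacement contains an apostrophe, which cannot appear inside a single-quoted shell string.

```shell
# Escaped ampersand: a literal "&" appears in the output.
escaped=$(printf '%s\n' 'ORA publishes books.' |
          sed "s/ORA/O'Reilly \& Associates, Inc./")
# Unescaped ampersand: it is replaced by the matched text, "ORA".
unescaped=$(printf '%s\n' 'ORA publishes books.' |
            sed "s/ORA/O'Reilly & Associates, Inc./")
printf '%s\n' "$escaped" "$unescaped"
```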

As a metacharacter, the ampersand (&) represents the extent of the pattern match, not the line that was matched. You might use the ampersand to recall a matched word and surround it with troff requests. The following example surrounds a word with point-size requests:

s/UNIX/\\s-2&\\s0/g

Because backslashes are also replacement metacharacters, two backslashes are necessary to output a single backslash. The “&” in the replacement string refers to “UNIX.” If the input line is:

on the UNIX Operating System.

then the substitute command produces:

on the \s-2UNIX\s0 Operating System.

The ampersand is particularly useful when the regular expression matches variations of a word. It allows you to specify a variable replacement string that corresponds to what was actually matched. For instance, let’s say that you wanted to surround with parentheses any cross reference to a numbered section in a document. In other words, any reference such as “See Section 1.4” or “See Section 12.9” should appear in parentheses, as “(See Section 12.9).” A regular expression can match the different combinations of numbers, so we use “&” in the replacement string and surround whatever was matched.

s/See Section [1-9][0-9]*\.[1-9][0-9]*/(&)/

The ampersand makes it possible to reference the entire match in the replacement string.
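
Run on two made-up sentences (the input and the variable name out are assumptions for this sketch), the command wraps each reference in parentheses, whatever section number it contains:

```shell
out=$(printf '%s\n' 'See Section 1.4 for details.' 'See Section 12.9 as well.' |
      sed 's/See Section [1-9][0-9]*\.[1-9][0-9]*/(&)/')
printf '%s\n' "$out"
```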

Now let’s look at the metacharacters that allow us to select any individual portion of a string that is matched and recall it in the replacement string. A pair of escaped parentheses is used in sed to enclose any part of a regular expression and save it for recall. Up to nine “saves” are permitted for a single line. “\n” is used to recall the portion of the match that was saved, where n is a number from 1 to 9 referencing a particular “saved” string in order of use.

For example, to put the section numbers in boldface when they appear in a cross reference, we could write the following substitution:

s/\(See Section \)\([1-9][0-9]*\.[1-9][0-9]*\)/\1\\fB\2\\fP/

Two pairs of escaped parentheses are specified. The first captures “See Section ” including the trailing space (because this is a fixed string, it could have been simply retyped in the replacement string). The second captures the section number. The replacement string recalls the first saved substring as “\1” and the second as “\2,” which is surrounded by bold-font requests.
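
Here is the substitution applied to a made-up sentence, as a quick sketch (the input and the variable name out are our own). Each doubled backslash in the replacement produces a single backslash in the output.

```shell
out=$(printf '%s\n' 'See Section 12.9 for details.' |
      sed 's/\(See Section \)\([1-9][0-9]*\.[1-9][0-9]*\)/\1\\fB\2\\fP/')
printf '%s\n' "$out"
```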

We can use a similar technique to match parts of a line and swap them. For instance, let’s say there are two parts of a line separated by a colon. We can match each part, putting them within escaped parentheses and swapping them in the replacement.

$ cat test1
first:second
one:two
$ sed 's/\(.*\):\(.*\)/\2:\1/' test1
second:first
two:one

The larger point is that you can recall a saved substring in any order, and multiple times, as you’ll see in the next example.
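
As a minimal sketch of recalling a saved substring more than once, the following variation on the swap (our own, using the same test input) repeats the second part on both sides of the first:

```shell
# \2 is recalled twice and \1 once, in a different order from the input.
out=$(printf '%s\n' 'first:second' 'one:two' |
      sed 's/\(.*\):\(.*\)/\2:\1:\2/')
printf '%s\n' "$out"
```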

Correcting index entries

Later, in the awk section of this book, we will present a program for formatting an index, such as the one for this book. The first step in creating an index is to place index codes in the document files. We use an index macro named .XX, which takes a single argument, the index entry. A sample index entry might be:

.XX "sed, substitution command"

Each index entry appears on a line by itself. When you run an index, you get a collection of index entries with page numbers that are then sorted and merged in a list. An editor poring over that list will typically find errors and inconsistencies that need to be corrected. It is, in short, a pain to have to track down the file where an index entry resides and then make the correction, particularly when there are dozens of entries to be corrected.

Sed can be a great help in making these edits across a group of files. One can simply create a list of edits in a sed script and then run it on all the files. A key point is that the substitute command needs an address that limits it to lines beginning “.XX”. Your script should not make changes in the text itself.

Let’s say that we wanted to change the index entry above to “sed, substitute command.” The following command would do it:

/^\.XX /s/sed, substitution command/sed, substitute command/

The address matches all lines that begin with “.XX ” and only on those lines does it attempt to make the replacement. You might wonder, why not specify a shorter regular expression? For example:

/^\.XX /s/substitution/substitute/

The answer is simply that there could be other entries which use the word “substitution” correctly and which we would not want to change.
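
The role of the address can be seen in a short sketch. The second input line is made-up body text that happens to contain the same phrase as the index entry; the variable name out is also our own.

```shell
# Only the line beginning with ".XX " is edited; the body text is not.
out=$(printf '%s\n' '.XX "sed, substitution command"' \
                    'See sed, substitution command in the text.' |
      sed '/^\.XX /s/sed, substitution command/sed, substitute command/')
printf '%s\n' "$out"
```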

We can go a step further and provide a shell script that creates a list of index entries prepared for editing as a series of sed substitute commands.

#! /bin/sh
# index.edit -- compile list of index entries for editing.
# -h keeps grep from prefixing file names when several files are given.
grep -h "^\.XX" "$@" | sort -u |
sed '
s/^\.XX \(.*\)$/\/^\\.XX \/s\/\1\/\1\//'

The index.edit shell script uses grep to extract all lines containing index entries from any number of files specified on the command line. It passes this list through sort which, with the -u option, sorts and removes duplicate entries. The list is then piped to sed, and the one-line sed script builds a substitution command.

Let’s look at it more closely. Here’s just the regular expression:

^\.XX \(.*\)$

It matches the entire line, saving the index entry for recall. Here’s just the replacement string:

\/^\\.XX \/s\/\1\/\1\/

It generates a substitute command beginning with an address: a slash and a caret, followed by two backslashes (which output the single backslash that protects the dot in the “.XX” that follows), then a space, then another slash to complete the address. Next we output an “s” followed by a slash, and then recall the saved portion to be used as a regular expression. That is followed by another slash and again we recall the saved substring as the replacement string. A slash finally ends the command.

When the index.edit script is run on a file, it creates a listing similar to this:

$ index.edit ch05
/^\.XX /s/"append command(a)"/"append command(a)"/
/^\.XX /s/"change command"/"change command"/
/^\.XX /s/"change command(c)"/"change command(c)"/
/^\.XX /s/"commands:sed, summary of"/"commands:sed, summary of"/
/^\.XX /s/"delete command(d)"/"delete command(d)"/
/^\.XX /s/"insert command(i)"/"insert command(i)"/
/^\.XX /s/"line numbers:printing"/"line numbers:printing"/
/^\.XX /s/"list command(l)"/"list command(l)"/

This output could be captured in a file. Then you can delete the entries that don’t need to change and you can make changes by editing the replacement string. At that point, you can use this file as a sed script to correct the index entries in all document files.
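
The whole cycle might look like the following sketch. The sample file contents, the file name fixes.sed, and the hand-edited correction are all invented for illustration, and the example assumes mktemp is available for creating a scratch directory.

```shell
dir=$(mktemp -d)                   # assumption: mktemp is installed
printf '%s\n' '.XX "change command"' 'Body text.' \
              '.XX "list command(l)"' '.XX "change command"' > "$dir/ch05"

# Step 1: generate the list of edits (same pipeline as index.edit).
edits=$(grep '^\.XX' "$dir/ch05" | sort -u |
        sed 's/^\.XX \(.*\)$/\/^\\.XX \/s\/\1\/\1\//')

# Step 2: keep only the entry to be corrected and edit its replacement
# (done by hand in practice; written out directly here).
printf '%s\n' '/^\.XX /s/"change command"/"change command(c)"/' > "$dir/fixes.sed"

# Step 3: apply the edited list as a sed script.
fixed=$(sed -f "$dir/fixes.sed" "$dir/ch05")
rm -rf "$dir"

printf '%s\n' "$edits" "$fixed"
```

Both occurrences of the entry are corrected, and neither the body text nor the other index entry is touched.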

When doing a large book with lots of entries, you might use grep again to extract particular entries from the output of index.edit and direct them into their own file for editing. This saves you from having to wade through numerous entries.

There is one small failing in this program. It should look for metacharacters that might appear literally in index entries and protect them in regular expressions. For instance, if an index entry contains an asterisk, sed will treat it as a metacharacter rather than as a literal asterisk. To make that change effectively requires the use of several advanced commands, so we’ll put off improving this script until the next chapter.
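
The symptom can be shown with a short sketch. The index entry below is hypothetical, and the second command is protected by hand; it is not the improved script promised for the next chapter.

```shell
line='.XX "a*b pattern"'

# As generated: the * means "zero or more a", so the pattern does not
# match the literal text and the entry is silently left unchanged.
unprotected=$(printf '%s\n' "$line" |
              sed '/^\.XX /s/"a*b pattern"/"a*b regex"/')

# With the asterisk escaped in the pattern, the entry is corrected.
protected=$(printf '%s\n' "$line" |
            sed '/^\.XX /s/"a\*b pattern"/"a*b regex"/')

printf '%s\n' "$unprotected" "$protected"
```

The asterisk needs protecting only in the pattern; in the replacement it has no special meaning.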



[3] Well, more or less. Many UNIX programs have internal limits on the length of the lines that they will process. Most GNU programs, though, do not have such limits.