sed & awk, 2nd Edition, by Arnold Robbins. Published by O'Reilly Media, Inc., 1997.

Reading and Writing Files

The read (r) and write (w) commands allow you to work directly with files. Both take a single argument, the name of a file. The syntax follows:

[line-address]r file
[address]w file

The read command reads the contents of file and queues them for output after the addressed line; they are not placed in the pattern space. It cannot operate on a range of lines. The write command writes the contents of the pattern space to the file.

You must have a single space between the command and the filename. (Everything after that space and up to the newline is taken to be the filename. Thus, leading or embedded spaces will become part of the filename.) The read command will not complain if the file does not exist. The write command will create a file if it does not exist; if the file already exists, the write command will overwrite it each time the script is invoked. If there are multiple instructions writing to the same file in one script, then each write command appends to the file. Also, be aware that you can only open up to 10 files per script.
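
The overwrite-then-append behavior described above can be checked with a small experiment. This is only a sketch: the scratch directory and the names input.txt, out.txt, and split.sed are made up for illustration.

```shell
# Scratch directory and file names are hypothetical.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'alpha\nbeta\n' > input.txt
printf 'stale contents\n' > out.txt

# Two w commands name the same file. The file is emptied once when
# sed starts; after that, each write appends to it.
cat > split.sed <<'EOF'
/alpha/w out.txt
/beta/w out.txt
EOF
sed -n -f split.sed input.txt

result=$(cat out.txt)
echo "$result"
```

After the run, out.txt should hold both matched lines and none of its earlier contents.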

The read command can be useful for inserting the contents of one file at a particular place in another file. For instance, let’s say that there is a set of files and each file should close with the same one- or two-paragraph statement. A sed script would allow you to maintain the closing separately while inserting it as needed, for instance, when sending the file to the printer.

sed '$r closing' "$@" | pr | lp

The $ is an addressing symbol specifying the last line of the file. The contents of the file named closing are placed after the contents of the pattern space and output with it. This example does not specify a pathname, so closing is assumed to be in the directory from which the command is run. A more general-purpose command should use the full pathname.
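
To try the last-line read without a printer, the pipeline to pr and lp can be left off. The following is a minimal sketch; the file named letter and the text in both files are invented for the demonstration.

```shell
# Scratch directory and sample text are hypothetical.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'Dear reader,\nHere is the report.\n' > letter
printf 'Sincerely,\nThe Staff\n' > closing

# $ addresses the last input line; r outputs the file named closing
# after that line.
result=$(sed '$r closing' letter)
echo "$result"
```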

You may want to test out a few quirks of the read command. Let’s look at the following command:

/^<Company-list>/r company.list

That is, when sed matches a line beginning with the string “<Company-list>”, it is going to output the contents of the file company.list after the matched line. No subsequent command will affect the lines read from the file. For instance, you can’t make any changes to the list of companies that you’ve read in. However, commands that address the original line will work. The previous command could be followed by a second command:

/^<Company-list>/d

to delete the original line. If the input file were as follows:

For service, contact any of the following companies:
<Company-list>
Thank you.

running the two-line script would produce:

For service, contact any of the following companies:
	Allied
	Mayflower
	United
Thank you.

Suppressing the automatic output, using the -n option or #n script syntax, prevents the original line in the pattern space from being output, but the result of a read command still goes to standard output.
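
Both behaviors can be seen side by side in a short experiment. This is a sketch under a few assumptions: the input file name letter and the script name list.sed are invented, and company.list is assumed to hold three tab-indented names.

```shell
# Scratch directory and file names other than company.list are hypothetical.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'For service, contact any of the following companies:\n<Company-list>\nThank you.\n' > letter
printf '\tAllied\n\tMayflower\n\tUnited\n' > company.list

cat > list.sed <<'EOF'
/^<Company-list>/r company.list
/^<Company-list>/d
EOF

# Normal run: the placeholder line is deleted, yet the list still appears.
full=$(sed -f list.sed letter)

# With -n and only the r command, no input line is printed, but the
# contents of company.list still go to standard output.
quiet=$(sed -n '/^<Company-list>/r company.list' letter)

printf '%s\n---\n%s\n' "$full" "$quiet"
```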

Now let’s look at examples of the write command. One use is to extract information from one file and place it in its own file. For instance, imagine that we had a file listing the names of salespeople alphabetically. For each person, the listing designates which of four regions the person is assigned to. Here’s a sample:

Adams, Henrietta        Northeast
Banks, Freda            South
Dennis, Jim             Midwest
Garvey, Bill            Northeast
Jeffries, Jane          West
Madison, Sylvia         Midwest
Sommes, Tom             South

Writing a script for a seven-line file, of course, is ridiculous. Yet such a script can potentially handle as many names as you can put together, and is reusable.

If all we wanted was to extract the names for a particular region, we could easily use grep to do it. The advantage with sed is that we can break up the file into four separate files in a single step. The following four-line script does it:

/Northeast$/w region.northeast
/South$/w region.south
/Midwest$/w region.midwest
/West$/w region.west

All of the names of salespeople that are assigned to the Northeast region will be placed in a file named region.northeast.
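
The four-way split can be tried on the sample data as follows. This is a sketch: the names salespeople and regions.sed are assumptions, and the -n option is added here only to keep the unmodified lines off the screen.

```shell
# Scratch directory, input file name, and script name are hypothetical.
dir=$(mktemp -d)
cd "$dir" || exit 1
cat > salespeople <<'EOF'
Adams, Henrietta        Northeast
Banks, Freda            South
Dennis, Jim             Midwest
Garvey, Bill            Northeast
Jeffries, Jane          West
Madison, Sylvia         Midwest
Sommes, Tom             South
EOF

cat > regions.sed <<'EOF'
/Northeast$/w region.northeast
/South$/w region.south
/Midwest$/w region.midwest
/West$/w region.west
EOF

# One pass over the input fills all four region files.
sed -n -f regions.sed salespeople

northeast=$(cat region.northeast)
west=$(cat region.west)
echo "$northeast"
```

Because the match is case-sensitive, /West$/ does not pick up the lines ending in "Midwest".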

The write command writes out the contents of the pattern space when the command is invoked, not when the end of the script is reached. In the previous example, we might want to remove the name of the region before writing it to file. For each case, we could handle it as we show for the Northeast region:

/Northeast$/{
	s///
	w region.northeast
	}

The substitute command matches the same pattern as the address and removes it. There are many different uses for the write command; for example, you could use it in a script to generate several customized versions of the same source file.
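
The point that a write happens at the moment the command is invoked can be illustrated by writing twice, once before and once after the substitution. This is a sketch; the output names with-region and without-region are invented for the demonstration.

```shell
# Scratch directory and file names are hypothetical.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'Adams, Henrietta        Northeast\nBanks, Freda            South\n' > salespeople

cat > timing.sed <<'EOF'
/Northeast$/{
w with-region
s///
w without-region
}
EOF
sed -n -f timing.sed salespeople

# The first file holds the line as it was before the substitution,
# the second holds it afterward (trailing spaces remain).
before=$(cat with-region)
after=$(cat without-region)
printf '[%s]\n[%s]\n' "$before" "$after"
```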

Checking Out Reference Pages

Like many programs, a sed script often starts out small, and is simple to write and simple to read. In testing the script, you may discover specific cases for which the general rules do not apply. To account for these, you add lines to your script, making it longer, more complex, and more complete. While the amount of time you spend refining your script may cancel out the time saved by not doing the editing manually, at least during that time your mind has been engaged by your own seeming sleight-of-hand: “See! The computer did it.”

We encountered one such problem in preparing a formatted copy of command pages that the writer had typed as a text file without any formatting information. Although the files had no formatting codes, headings were used consistently to identify the format of the command pages. A sample file is shown below.

******************************************************************

NAME:	DBclose - closes a database

SYNTAX:
	void	DBclose(fdesc)
		DBFILE *fdesc;

USAGE:
	fdesc	- pointer to database file descriptor

DESC: 
DBclose( ) closes a file when given its database file descriptor.  
Your pending writes to that file will be completed before the
file is closed.  All of your update locks are removed. 
*fdesc becomes invalid.

Other users are not affected when you call DBclose( ).  Their update
locks and pending writes are not changed.

Note that there is no default file as there is in BASIC.  
*fdesc must specify an open file.

DBclose( ) is analogous to the CLOSE statement in BASIC.

RETURNS:
	There is no return value

******************************************************************

The task was to format this document for the laser printer, using the reference header macros we had developed. Because there were perhaps forty of these command pages, it would have been utter drudgery to go through and add codes by hand. However, because there were that many, and even though the writer was generally consistent in entering them, there were enough differences from command to command to require several passes.

We’ll examine the process of building this sed script. In a sense, this is a process of looking carefully at each line of a sample input file and determining whether or not an edit must be made on that line. Then we look at the rest of the file for similar occurrences. We try to find specific patterns that mark the lines or range of lines that need editing.

For instance, by looking at the first line, we know we need to eliminate the row of asterisks separating each command. We specify an address for any line beginning and ending with an asterisk and look for zero or more asterisks in between. The regular expression uses an asterisk as a literal and as a metacharacter:

/^\*\**\*$/d

This command deletes entire lines of asterisks anywhere they occur in the file. We saw that blank lines were used to separate paragraphs, but replacing every blank line with a paragraph macro would cause other problems. In many cases, the blank lines can be removed because spacing has been provided in the macro. This is a case where we put off deleting or replacing blank lines on a global basis until we have dealt with specific cases. For instance, some blank lines separate labeled sections, and we can use them to define the end of a range of lines. The script, then, is designed to delete unwanted blank lines as the last operation.
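
The asterisk-deleting command can be checked against a few invented lines to see what it does and does not remove. In this sketch, the expression needs at least two asterisks and nothing else on the line, so a line that merely starts with an asterisk survives.

```shell
# Sample lines are made up; only the sed command comes from the text.
result=$(printf '%s\n' '**********' '*fdesc becomes invalid.' '**' '*' 'NAME:' |
  sed '/^\*\**\*$/d')
echo "$result"
```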

Tabs were a similar problem. Tabs were used to indent syntax lines and in some cases after the colon following a label, such as “NAME”. Our first thought was to remove all tabs by replacing them with eight spaces, but there were tabs we wanted to keep, such as those inside the syntax line. So we removed only specific cases: tabs at the beginning of lines and tabs following a colon. (In the commands that follow, • stands for a tab character.)

/^•/s///
/:•/s//:/

The next line we come across has the name of the command and a description.

NAME:	DBclose - closes a database

We need to replace it with the macro .Rh 0. Its syntax is:

.Rh 0 "command" "description"

We insert the macro at the beginning of the line, remove the hyphen, and surround the arguments with quotation marks.

/NAME:/ {
	s//.Rh 0 "/
	s/ - /" "/
	s/$/"/
	}

We can jump ahead of ourselves a bit here and look at what this portion of our script does to the sample line:

.Rh 0 "DBclose" "closes a database"
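
This portion can be tried on the sample line by itself. The sketch below assumes a POSIX shell; the variable tab is a helper introduced here so that a real tab character can be placed in the colon-tab command.

```shell
# tab is a hypothetical helper variable holding one tab character.
tab=$(printf '\t')

# The first command strips the tab after the colon; the block then
# turns the NAME line into an .Rh 0 macro call.
result=$(printf 'NAME:\tDBclose - closes a database\n' |
  sed -e "/:${tab}/s//:/" \
      -e '/NAME:/ {' \
      -e 's//.Rh 0 "/' \
      -e 's/ - /" "/' \
      -e 's/$/"/' \
      -e '}')
echo "$result"
```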

The next part that we examine begins with “SYNTAX.” What we need to do here is put in the .Rh macro, plus some additional troff requests for indentation, a font change, and no-fill and no-adjust. (The indentation is required because we stripped the tabs at the beginning of the line.) These requests must go in before and after the syntax lines, turning the capabilities on and off. To do this, we define an address that specifies the range of lines between two patterns, the label and a blank line. Then, using the change command, we replace the label and the blank line with a series of formatting requests.

/SYNTAX:/,/^$/ {
	/SYNTAX:/c\
.Rh Syntax\
.in +5n\
.ft B\
.nf\
.na
	/^$/c\
.in -5n\
.ft R\
.fi\
.ad b
	}

Following the change command, each line of the replacement text ends with a backslash except the last line. As a side effect of the change command, the current line is deleted from the pattern space.
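
The SYNTAX range can be exercised on its own. This is a sketch: the script name syntax.sed is invented, and the input is given with its leading tabs already stripped, as the earlier tab commands would leave it.

```shell
# Scratch directory and script name are hypothetical.
dir=$(mktemp -d)
cd "$dir" || exit 1

cat > syntax.sed <<'EOF'
/SYNTAX:/,/^$/ {
/SYNTAX:/c\
.Rh Syntax\
.in +5n\
.ft B\
.nf\
.na
/^$/c\
.in -5n\
.ft R\
.fi\
.ad b
}
EOF

# The label and the closing blank line are each replaced by requests;
# the syntax lines in between pass through unchanged.
result=$(printf 'SYNTAX:\nvoid DBclose(fdesc)\nDBFILE *fdesc;\n\nUSAGE:\n' |
  sed -f syntax.sed)
echo "$result"
```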

The USAGE portion is next, consisting of one or more descriptions of variable items. Here we want to format each item as an indented paragraph with a hanging italicized label. First, we output the .Rh macro; then we search for lines having two parts separated by a tab and a hyphen. Each part is saved, using backslash-parentheses, and recalled during the substitution.

/USAGE:/,/^$/ {
	/USAGE:/c\
.Rh Usage
	/\(.*\)•- \(.*\)/s//.IP "\\fI\1\\fR" 15n\
\2./
	}

This is a good example of the power of regular expressions. Let’s look ahead, once again, and preview the output for the sample.

.Rh Usage
.IP "\fIfdesc\fR" 15n
pointer to database file descriptor.
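
The USAGE portion can be run in isolation to confirm this preview. The sketch below relies on a device that is not part of sed: the word TAB in the template is a placeholder that a preliminary sed command swaps for a real tab character, so the script stays readable. The names usage.tmpl and usage.sed are likewise invented.

```shell
# Scratch directory, file names, and the TAB placeholder are hypothetical.
dir=$(mktemp -d)
cd "$dir" || exit 1
tab=$(printf '\t')

# TAB is a placeholder of this sketch, not sed syntax; it is replaced
# with a real tab character when usage.sed is generated below.
cat > usage.tmpl <<'EOF'
/USAGE:/,/^$/ {
/USAGE:/c\
.Rh Usage
/\(.*\)TAB- \(.*\)/s//.IP "\\fI\1\\fR" 15n\
\2./
}
EOF
sed "s/TAB/${tab}/" usage.tmpl > usage.sed

# Input has the leading tab already removed, as earlier commands would do.
result=$(printf 'USAGE:\nfdesc\t- pointer to database file descriptor\n\n' |
  sed -f usage.sed)
echo "$result"
```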

The next part we come across is the description. We notice that blank lines are used in this portion to separate paragraphs. In specifying the address for this portion, we use the next label, “RETURNS.”

/DESC:/,/RETURNS/ {
	/DESC:/i\
.LP
	s/DESC: *$/.Rh Description/
	s/^$/.LP/
}

The first thing we do is insert a paragraph macro because the preceding USAGE section consisted of indented paragraphs. (We could have used the variable-list macros from the -mm package in the USAGE section; if so, we would insert the .LE at this point.) This is done only once, which is why it is keyed to the “DESC” label. Then we substitute the label “DESC” with the .Rh macro and replace all blank lines in this section with a paragraph macro.

When we tested this portion of the sed script on our sample file, it didn’t work because there was a single space following the DESC label. We changed the regular expression to look for zero or more spaces following the label. Although this worked for the sample file, there were other problems when we used a larger sample. The writer was inconsistent in his use of the “DESC” label. Mostly, it occurred on a line by itself; sometimes, though, it was included at the start of the second paragraph. So we had to add another pattern to deal with this case. It searches for the label followed by a space and one or more characters.

s/DESC: *$/.Rh Description/
s/DESC: \(.*\)/.Rh Description\
\1/

In the second case, the reference header macro is output followed by a newline.
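
The two cases can be compared with a pair of invented input lines, one with the label alone (followed by a stray space) and one with the label at the start of a paragraph. This is a sketch; the script name desc.sed is an assumption.

```shell
# Scratch directory and script name are hypothetical.
dir=$(mktemp -d)
cd "$dir" || exit 1

cat > desc.sed <<'EOF'
s/DESC: *$/.Rh Description/
s/DESC: \(.*\)/.Rh Description\
\1/
EOF

# A single backslash before 1 recalls the saved text; the
# backslash-newline in the replacement starts a new output line.
result=$(printf 'DESC: \nDESC: DBclose( ) closes a file.\n' | sed -f desc.sed)
echo "$result"
```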

The next section, labeled “RETURNS,” is handled in the same way as the SYNTAX section.

We do make minor content changes, replacing the label “RETURNS” with “Return Value” and consequently adding this substitution:

s/There is no return value\.*/None./

The very last thing we do is delete remaining blank lines.

/^$/d

Our script is put in a file named refsed. Here it is in full:

# refsed -- add formatting codes to reference pages
/^\*\**\*$/d
/^•/s///
/:•/s//:/
/NAME:/ {
	s//.Rh 0 "/
	s/ - /" "/
	s/$/"/
}
/SYNTAX:/,/^$/ {
	/SYNTAX:/c\
.Rh Syntax\
.in +5n\
.ft B\
.nf\
.na
	/^$/c\
.in -5n\
.ft R\
.fi\
.ad b
}
/USAGE:/,/^$/ {
	/USAGE:/c\
.Rh Usage
	/\(.*\)•- \(.*\)/s//.IP "\\fI\1\\fR" 15n\
\2./
}
/DESC:/,/RETURNS/ {
	/DESC:/i\
.LP
	s/DESC: *$/.Rh Description/
	s/DESC: \(.*\)/.Rh Description\
\1/
	s/^$/.LP/
}
/RETURNS:/,/^$/ {
	/RETURNS:/c\
.Rh "Return Value"
	s/There is no return value\.*/None./
}
/^$/d

As we have remarked, you should not have sed overwrite the original. It is best to redirect the output of sed to another file or let it go to the screen. If the sed script does not work properly, you will find that it is generally easier to change the script and re-run it on the original file than to write a new script to correct the problems caused by a previous run.

$ sed -f refsed refpage  
.Rh 0 "DBclose" "closes a database"
.Rh Syntax
.in +5n
.ft B
.nf
.na
void	DBclose(fdesc)
	DBFILE *fdesc;
.in -5n
.ft R
.fi
.ad b
.Rh Usage
.IP "\fIfdesc\fR" 15n
pointer to database file descriptor.
.LP
.Rh Description
DBclose( ) closes a file when given its database file descriptor.  
Your pending writes to that file will be completed before the
file is closed.  All of your update locks are removed. 
*fdesc becomes invalid.
.LP
Other users are not affected when you call DBclose( ).  Their update
locks and pending writes are not changed.
.LP
Note that there is no default file as there is in BASIC.  
*fdesc must specify an open file.
.LP
DBclose( ) is analogous to the CLOSE statement in BASIC.
.LP
.Rh "Return Value"
None.
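
The advice about leaving the original untouched can be sketched as a small workflow: send the output to a separate file and inspect it before deciding what to do next. The file name refpage.out and the cut-down, one-command refsed used here are assumptions made for the demonstration.

```shell
# Scratch directory, sample input, and refpage.out are hypothetical.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf '******\nRETURNS:\n' > refpage
printf '%s\n' '/^\*\**\*$/d' > refsed

before=$(cat refpage)

# Redirect the result instead of overwriting the input.
sed -f refsed refpage > refpage.out

after=$(cat refpage)
out=$(cat refpage.out)
echo "$out"
```

If the result is wrong, the script can be corrected and run again on the same unmodified refpage.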