sed & awk, 2nd Edition

Reading and Writing Files

The read (r) and write (w) commands allow you to work directly with files. Both take a single argument, the name of a file. The syntax follows:

[line-address]r file
[address]w file

The read command reads the contents of file and places them in the output after the addressed line; the contents are not added to the pattern space itself. It cannot operate on a range of lines. The write command writes the contents of the pattern space to the file.

You must have a single space between the command and the filename. Everything after that space and up to the newline is taken to be the filename, so any additional leading, embedded, or trailing spaces can become part of the filename. The read command will not complain if the file does not exist. The write command will create a file if it does not exist; if the file already exists, the write command will overwrite it each time the script is invoked. If there are multiple instructions writing to the same file in one script, then each write command appends to the file. Also, be aware that some versions of sed let you open only up to 10 files per script.
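The rules above can be checked with a small experiment. This is only a sketch: the file names input.txt, out.txt, and no-such-file are made up for the illustration, and the temporary directory is created with mktemp, which is assumed to be available.

```shell
#!/bin/sh
# Sketch of the r/w file rules; all file names here are hypothetical.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'alpha\nbeta\ngamma\n' > input.txt

# Two w commands name the same file: within one run, each appends to it.
sed -n -e '/alpha/w out.txt' -e '/gamma/w out.txt' input.txt
first_run=$(cat out.txt)

# A new invocation overwrites out.txt instead of adding to it.
sed -n '/beta/w out.txt' input.txt
second_run=$(cat out.txt)

# Reading a file that does not exist is silently ignored.
unchanged=$(sed '1r no-such-file' input.txt)

printf 'first run:\n%s\n' "$first_run"
printf 'second run:\n%s\n' "$second_run"
printf 'after reading a missing file:\n%s\n' "$unchanged"
```

After the first run, out.txt should hold "alpha" and "gamma"; after the second, only "beta".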

The read command can be useful for inserting the contents of one file at a particular place in another file. For instance, let’s say that there is a set of files and each file should close with the same one- or two-paragraph statement. A sed script would allow you to maintain the closing separately while inserting it as needed, for instance, when sending the file to the printer.

sed '$r closing' "$@" | pr | lp

The $ is an addressing symbol specifying the last line of the file. The contents of the file named closing are placed after the contents of pattern space and output with it. This example does not specify a pathname, assuming the file to be in the same directory as the command. A more general-purpose command should use the full pathname.
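To try the idea without a printer, the pipeline can be cut down to the sed step alone. In this sketch the files letter and closing are invented sample data, and the second command shows one way of passing a full pathname, as the paragraph above recommends.

```shell
#!/bin/sh
# Sketch: append a shared closing to a document; file names are hypothetical.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'Dear customer,\nYour order has shipped.\n' > letter
printf 'Sincerely,\nThe Management\n' > closing

# $ addresses the last line, so the closing follows the whole letter.
with_closing=$(sed '$r closing' letter)

# The same command with a full pathname, so it works from any directory.
closing_path=$workdir/closing
with_full_path=$(sed "\$r $closing_path" letter)

printf '%s\n' "$with_closing"
```

Inside double quotes the address must be written as \$ so that the shell does not treat it as the start of a variable name.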

You may want to test out a few quirks of the read command. Let’s look at the following command:

/^<Company-list>/r company.list

That is, when sed matches a line beginning with the string “<Company-list>”, it appends the contents of the file company.list after the matched line. No subsequent command will affect the lines read from the file. For instance, you can’t make any changes to the list of companies that you’ve read from the file. However, commands that address the original line will work. The previous command could be followed by a second command:

/^<Company-list>/d

to delete the original line. Suppose the input file was as follows:

For service, contact any of the following companies:
<Company-list>
Thank you.

running the two-line script would produce:

For service, contact any of the following companies:
	Allied
	Mayflower
	United
Thank you.

Suppressing the automatic output, using the -n option or #n script syntax, prevents the original line in the pattern space from being output, but the result of a read command still goes to standard output.
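Both behaviors can be seen side by side in a short test. The sketch below recreates the sample input in a file named letter (a name chosen only for this illustration) and runs the two-line script, then the read command alone under -n.

```shell
#!/bin/sh
# Sketch: the read command with and without automatic output.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'For service, contact any of the following companies:\n<Company-list>\nThank you.\n' > letter
printf '\tAllied\n\tMayflower\n\tUnited\n' > company.list

# Two-line script: read the list, then delete the placeholder line.
two_line=$(sed -e '/^<Company-list>/r company.list' \
               -e '/^<Company-list>/d' letter)

# With -n, no input line is printed, but the list that was read still is.
only_list=$(sed -n '/^<Company-list>/r company.list' letter)

printf '%s\n' "$two_line"
printf '%s\n' '---'
printf '%s\n' "$only_list"
```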

Now let’s look at examples of the write command. One use is to extract information from one file and place it in its own file. For instance, imagine that we had a file listing the names of salespeople alphabetically. For each person, the listing designates which of four regions the person is assigned to. Here’s a sample:

Adams, Henrietta        Northeast
Banks, Freda            South
Dennis, Jim             Midwest
Garvey, Bill            Northeast
Jeffries, Jane          West
Madison, Sylvia         Midwest
Sommes, Tom             South

Writing a script for a seven-line file, of course, is ridiculous. Yet such a script can potentially handle as many names as you can put together, and is reusable.

If all we wanted was to extract the names for a particular region, we could easily use grep to do it. The advantage with sed is that we can break up the file into four separate files in a single step. The following four-line script does it:

/Northeast$/w region.northeast
/South$/w region.south
/Midwest$/w region.midwest
/West$/w region.west

All of the names of salespeople that are assigned to the Northeast region will be placed in a file named region.northeast.
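Here is one way to run the four-line script on the sample list. The names sales.list and split.sed are assumptions made for this sketch, and -n is added so that the list is not also echoed to the screen.

```shell
#!/bin/sh
# Sketch: split the sales list into one file per region.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > sales.list <<'EOF'
Adams, Henrietta        Northeast
Banks, Freda            South
Dennis, Jim             Midwest
Garvey, Bill            Northeast
Jeffries, Jane          West
Madison, Sylvia         Midwest
Sommes, Tom             South
EOF

cat > split.sed <<'EOF'
/Northeast$/w region.northeast
/South$/w region.south
/Midwest$/w region.midwest
/West$/w region.west
EOF

# One pass over the input fills all four files.
sed -n -f split.sed sales.list

for region in northeast south midwest west; do
    printf '%s: %s name(s)\n' "$region" "$(grep -c . "region.$region")"
done
```

Because matching is case-sensitive, /West$/ does not pick up the lines ending in "Midwest".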

The write command writes out the contents of the pattern space when the command is invoked, not when the end of the script is reached. In the previous example, we might want to remove the name of the region from each line before writing the line to the file. For each case, we could handle it as we show for the Northeast region:

/Northeast$/{
	s///
	w region.northeast
	}

The substitute command matches the same pattern as the address and removes it. There are many different uses for the write command; for example, you could use it in a script to generate several customized versions of the same source file.
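Note that the empty-pattern substitution removes only the region name, so the spaces that padded the column remain at the end of each line. The following variant is a sketch, not the script above: it spells out the pattern so that the padding is removed as well. The file names are again made up for the test.

```shell
#!/bin/sh
# Sketch: strip the region and the padding before it, then write the name.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'Adams, Henrietta        Northeast\nBanks, Freda            South\n' > sales.list

cat > strip.sed <<'EOF'
/Northeast$/{
    s/ *Northeast$//
    w region.northeast
}
EOF

sed -n -f strip.sed sales.list
names=$(cat region.northeast)
printf '[%s]\n' "$names"
```

The brackets in the final printf make any leftover trailing spaces visible.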

Checking Out Reference Pages

Like many programs, a sed script often starts out small, and is simple to write and simple to read. In testing the script, you may discover specific cases for which the general rules do not apply. To account for these, you add lines to your script, making it longer, more complex, and more complete. While the amount of time you spend refining your script may cancel out the time saved by not doing the editing manually, at least during that time your mind has been engaged by your own seeming sleight-of-hand: “See! The computer did it.”

We encountered one such problem in preparing a formatted copy of command pages that the writer had typed as a text file without any formatting information. Although the files had no formatting codes, headings were used consistently to identify the format of the command pages. A sample file is shown below.

******************************************************************

NAME:	DBclose - closes a database

SYNTAX:
	void	DBclose(fdesc)
		DBFILE *fdesc;

USAGE:
	fdesc	- pointer to database file descriptor

DESC: 
DBclose( ) closes a file when given its database file descriptor.  
Your pending writes to that file will be completed before the
file is closed.  All of your update locks are removed. 
*fdesc becomes invalid.

Other users are not affected when you call DBclose( ).  Their update
locks and pending writes are not changed.

Note that there is no default file as there is in BASIC.  
*fdesc must specify an open file.

DBclose( ) is analogous to the CLOSE statement in BASIC.

RETURNS:
	There is no return value

******************************************************************

The task was to format this document for the laser printer, using the reference header macros we had developed. Because there were perhaps forty of these command pages, it would have been utter drudgery to go through and add codes by hand. However, with that many pages, and even though the writer was generally consistent in entering them, there were enough differences from command to command that getting the script right would require several passes.

We’ll examine the process of building this sed script. In a sense, this is a process of looking carefully at each line of a sample input file and determining whether or not an edit must be made on that line. Then we look at the rest of the file for similar occurrences. We try to find specific patterns that mark the lines or range of lines that need editing.

For instance, by looking at the first line, we know we need to eliminate the row of asterisks separating each command. We specify an address for any line beginning and ending with an asterisk and look for zero or more asterisks in between. The regular expression uses an asterisk as a literal and as a metacharacter:

/^\*\**\*$/d

This command deletes entire lines of asterisks anywhere they occur in the file. We saw that blank lines were used to separate paragraphs, but replacing every blank line with a paragraph macro would cause other problems. In many cases, the blank lines can be removed because spacing has been provided in the macro. This is a case where we put off deleting or replacing blank lines on a global basis until we have dealt with specific cases. For instance, some blank lines separate labeled sections, and we can use them to define the end of a range of lines. The script, then, is designed to delete unwanted blank lines as the last operation.
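The address for rows of asterisks is easy to test in isolation. In this sketch the input lines are invented to probe the edges of the pattern: the expression needs at least two asterisks and nothing else on the line.

```shell
#!/bin/sh
# Sketch: which lines does /^\*\**\*$/ delete?
# \* is a literal asterisk; the * after the second \* is the metacharacter.
kept=$(printf '*****\n**\n*\n* note *\ntext\n' | sed '/^\*\**\*$/d')
printf '%s\n' "$kept"
```

The rows "*****" and "**" are deleted; a lone "*", the line "* note *", and ordinary text pass through.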

Tabs were a similar problem. Tabs were used to indent syntax lines and in some cases after the colon following a label, such as “NAME”. Our first thought was to remove all tabs by replacing them with eight spaces, but there were tabs we wanted to keep, such as those inside the syntax line. So we removed only specific cases: tabs at the beginning of lines and tabs following a colon. (In the listings that follow, • stands for a literal tab character.)

/^•/s///
/:•/s//:/
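Typing a literal tab on the command line can be awkward, so the sketch below holds one in a shell variable (tab is a name chosen for this example) and substitutes it where the listing shows the tab symbol.

```shell
#!/bin/sh
# Sketch: remove a leading tab and a tab after a colon, keep the others.
tab=$(printf '\t')

result=$(printf 'NAME:\tDBclose - closes a database\n\tvoid\tDBclose(fdesc)\n' |
    sed -e "/^$tab/s///" -e "/:$tab/s//:/")

printf '%s\n' "$result"
```

The tab between "void" and "DBclose(fdesc)" survives, because neither address matches it.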

The next line we come across has the name of the command and a description.

NAME:	DBclose - closes a database

We need to replace it with the macro .Rh 0. Its syntax is:

.Rh 0 "command" "description"

We insert the macro at the beginning of the line, remove the hyphen, and surround the arguments with quotation marks.

/NAME:/ {
	s//.Rh 0 "/
	s/ - /" "/
	s/$/"/
	}

We can jump ahead of ourselves a bit here and look at what this portion of our script does to the sample line:

.Rh 0 "DBclose" "closes a database"
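This portion can be tried on its own. The sketch assumes the tab after the colon has already been removed by the earlier commands, and it stores the commands in a file named name.sed, a name used only here.

```shell
#!/bin/sh
# Sketch: turn the NAME line into an .Rh 0 macro call.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > name.sed <<'EOF'
/NAME:/ {
    s//.Rh 0 "/
    s/ - /" "/
    s/$/"/
}
EOF

name_line=$(printf 'NAME:DBclose - closes a database\n' | sed -f name.sed)
printf '%s\n' "$name_line"
```

The empty pattern in the first substitution reuses the regular expression from the address, so "NAME:" is written only once.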

The next part that we examine begins with “SYNTAX.” What we need to do here is put in the .Rh macro, plus some additional troff requests for indentation, a font change, and no-fill and no-adjust. (The indentation is required because we stripped the tabs at the beginning of the line.) These requests must go in before and after the syntax lines, turning the capabilities on and off. To do this, we define an address that specifies the range of lines between two patterns, the label and a blank line. Then, using the change command, we replace the label and the blank line with a series of formatting requests.

/SYNTAX:/,/^$/ {
	/SYNTAX:/c\
.Rh Syntax\
.in +5n\
.ft B\
.nf\
.na
	/^$/c\
.in -5n\
.ft R\
.fi\
.ad b
	}

Following the change command, each line of input ends with a backslash except the last line. As a side effect of the change command, the current line is deleted from the pattern space.
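A quick way to see the two change commands at work is to feed them a few lines shaped like the SYNTAX section. This is a sketch: syntax.sed is a hypothetical file name, and the input already has its leading tab stripped, as it would after the tab commands have run.

```shell
#!/bin/sh
# Sketch: replace the SYNTAX label and the closing blank line with requests.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > syntax.sed <<'EOF'
/SYNTAX:/,/^$/ {
    /SYNTAX:/c\
.Rh Syntax\
.in +5n\
.ft B\
.nf\
.na
    /^$/c\
.in -5n\
.ft R\
.fi\
.ad b
}
EOF

formatted=$(printf 'SYNTAX:\nvoid\tDBclose(fdesc)\n\tDBFILE *fdesc;\n\nUSAGE:\n' |
    sed -f syntax.sed)
printf '%s\n' "$formatted"
```

The two syntax lines pass through untouched, bracketed by the requests that turn the formatting on and off.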

The USAGE portion is next, consisting of one or more descriptions of variable items. Here we want to format each item as an indented paragraph with a hanging italicized label. First, we output the .Rh macro; then we search for lines having two parts separated by a tab and a hyphen. Each part is saved, using backslash-parentheses, and recalled during the substitution.

/USAGE:/,/^$/ {
	/USAGE:/c\
.Rh Usage
	/\(.*\)•- \(.*\)/s//.IP "\\fI\1\\fR" 15n\
\2./
	}

This is a good example of the power of regular expressions. Let’s look ahead, once again, and preview the output for the sample.

.Rh Usage
.IP "\fIfdesc\fR" 15n
pointer to database file descriptor.
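The preview can be reproduced with a short test. In this sketch the script is first written with the placeholder <TAB> (a convention invented for this example) and then converted so that the pattern contains a real tab; usage.sed.in and usage.sed are hypothetical file names.

```shell
#!/bin/sh
# Sketch: format a USAGE item as an indented paragraph with an italic label.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
tab=$(printf '\t')

cat > usage.sed.in <<'EOF'
/USAGE:/,/^$/ {
    /USAGE:/c\
.Rh Usage
    /\(.*\)<TAB>- \(.*\)/s//.IP "\\fI\1\\fR" 15n\
\2./
}
EOF

# Swap the placeholder for a literal tab character.
sed "s/<TAB>/$tab/g" usage.sed.in > usage.sed

usage_out=$(printf 'USAGE:\nfdesc\t- pointer to database file descriptor\n\n' |
    sed -f usage.sed)
printf '%s\n' "$usage_out"
```

The doubled backslashes in the replacement produce the single backslashes that troff needs for its font escapes.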

The next part we come across is the description. We notice that blank lines are used in this portion to separate paragraphs. In specifying the address for this portion, we use the next label, “RETURNS.”

/DESC:/,/RETURNS/ {
	/DESC:/i\
.LP
	s/DESC: *$/.Rh Description/
	s/^$/.LP/
}

The first thing we do is insert a paragraph macro because the preceding USAGE section consisted of indented paragraphs. (We could have used the variable-list macros from the -mm package in the USAGE section; if so, we would insert the .LE at this point.) This is done only once, which is why it is keyed to the “DESC” label. Then we substitute the label “DESC” with the .Rh macro and replace all blank lines in this section with a paragraph macro.

When we tested this portion of the sed script on our sample file, it didn’t work because there was a single space following the DESC label. We changed the regular expression to look for zero or more spaces following the label. Although this worked for the sample file, there were other problems when we used a larger sample. The writer was inconsistent in his use of the “DESC” label. Mostly, it occurred on a line by itself; sometimes, though, it was included at the start of the second paragraph. So we had to add another pattern to deal with this case. It searches for the label followed by a space and one or more characters.

s/DESC: *$/.Rh Description/
s/DESC: \(.*\)/.Rh Description\
\1/

In the second case, the reference header macro is output followed by a newline.
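Both forms of the label can be checked with two one-line inputs. This is a sketch; the file name desc.sed and the sample text are assumptions made for the test.

```shell
#!/bin/sh
# Sketch: handle "DESC:" on a line by itself and at the start of a paragraph.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > desc.sed <<'EOF'
s/DESC: *$/.Rh Description/
s/DESC: \(.*\)/.Rh Description\
\1/
EOF

# Label alone, followed by a stray space.
alone=$(printf 'DESC: \n' | sed -f desc.sed)

# Label followed by the first words of a paragraph.
inline=$(printf 'DESC: DBclose( ) closes a file.\n' | sed -f desc.sed)

printf '%s\n---\n%s\n' "$alone" "$inline"
```

The order matters: once the first substitution has replaced a bare label, the second finds nothing left to match on that line.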

The next section, labeled “RETURNS,” is handled in the same way as the SYNTAX section.

We do make minor content changes, replacing the label “RETURNS” with “Return Value” and consequently adding this substitution:

s/There is no return value\.*/None./

The very last thing we do is delete remaining blank lines.

/^$/d

Our script is put in a file named refsed. Here it is in full:

# refsed -- add formatting codes to reference pages
/^\*\**\*$/d
/^•/s///
/:•/s//:/
/NAME:/ {
	s//.Rh 0 "/
	s/ - /" "/
	s/$/"/
}
/SYNTAX:/,/^$/ {
	/SYNTAX:/c\
.Rh Syntax\
.in +5n\
.ft B\
.nf\
.na
	/^$/c\
.in -5n\
.ft R\
.fi\
.ad b
}
/USAGE:/,/^$/ {
	/USAGE:/c\
.Rh Usage
	/\(.*\)•- \(.*\)/s//.IP "\\fI\1\\fR" 15n\
\2./
}
/DESC:/,/RETURNS/ {
	/DESC:/i\
.LP
	s/DESC: *$/.Rh Description/
	s/DESC: \(.*\)/.Rh Description\
\1/
	s/^$/.LP/
}
/RETURNS:/,/^$/ {
	/RETURNS:/c\
.Rh "Return Value"
	s/There is no return value\.*/None./
}
/^$/d

As we have remarked, you should not have sed overwrite the original. It is best to redirect the output of sed to another file or let it go to the screen. If the sed script does not work properly, you will find that it is generally easier to change the script and re-run it on the original file than to write a new script to correct the problems caused by a previous run.
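A cautious run might look like the following sketch. The refpage and refsed created here are tiny stand-ins for the real files, and refpage.orig and refpage.out are names made up for the example.

```shell
#!/bin/sh
# Sketch: keep the original intact and send the result to a separate file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '*****\nRETURNS:\n\tThere is no return value\n' > refpage
cp refpage refpage.orig

cat > refsed <<'EOF'
/^\*\**\*$/d
s/There is no return value\.*/None./
EOF

# Output goes to refpage.out; refpage itself is only read.
sed -f refsed refpage > refpage.out

if cmp -s refpage refpage.orig; then
    source_state=unchanged
else
    source_state=modified
fi

printf 'original is %s\n' "$source_state"
cat refpage.out
```

If the result is wrong, the script can be corrected and run again on the same untouched input.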

$ sed -f refsed refpage  
.Rh 0 "DBclose" "closes a database"
.Rh Syntax
.in +5n
.ft B
.nf
.na
void	DBclose(fdesc)
	DBFILE *fdesc;
.in -5n
.ft R
.fi
.ad b
.Rh Usage
.IP "\fIfdesc\fR" 15n
pointer to database file descriptor.
.LP
.Rh Description
DBclose( ) closes a file when given its database file descriptor.  
Your pending writes to that file will be completed before the
file is closed.  All of your update locks are removed. 
*fdesc becomes invalid.
.LP
Other users are not affected when you call DBclose( ).  Their update
locks and pending writes are not changed.
.LP
Note that there is no default file as there is in BASIC.  
*fdesc must specify an open file.
.LP
DBclose( ) is analogous to the CLOSE statement in BASIC.
.LP
.Rh "Return Value"
None.
