sed & awk, 2nd Edition by Arnold Robbins. Published by O'Reilly Media, Inc., 1997

Four Types of sed Scripts

In this section, we are going to look at four types of scripts, each one illustrating a typical sed application.

Multiple Edits to the Same File

The first type of sed script demonstrates making a series of edits in a file. The example we use is a script that converts a file created by a word processing program into a file coded for troff.

One of the authors once did a writing project for a computer company, here referred to as BigOne Computer. The document had to include a product bulletin for “Horsefeathers Software.” The company promised that the product bulletin was online and that they would send it. Unfortunately, when the file arrived, it contained the formatted output for a line printer, the only way they could provide it. A portion of that file (saved for testing in a file named horsefeathers) follows.

                  HORSEFEATHERS SOFTWARE PRODUCT BULLETIN

  DESCRIPTION
+   ___________

  BigOne Computer  offers three  software packages from the  suite  
  of Horsefeathers  software products  --  Horsefeathers  Business
  BASIC, BASIC  Librarian,  and LIDO.  These software products can
  fill  your    requirements    for    powerful,    sophisticated,
  general-purpose business  software providing you with a base for
  software customization or development.

  Horsefeathers  BASIC is  BASIC optimized for use on  the  BigOne  
  machine with UNIX  or MS-DOS operating systems.  BASIC Librarian
  is a full screen program editor, which also provides the ability

Note that the text has been justified with spaces added between words. There are also spaces added to create a left margin.

We find that when we begin to tackle a problem using sed, we do best if we make a mental list of all the things we want to do. When we begin coding, we write a script containing a single command that does one thing. We test that it works, then we add another command, repeating this cycle until we’ve done all that’s obvious to do. (“All that’s obvious” because the list is not always complete, and the cycle of implement-and-test often adds other items to the list.)

It may seem to be a rather tedious process to work this way and indeed there are a number of scripts where it’s fine to take a crack at writing the whole script in one pass and then begin testing it. However, the one-step-at-a-time technique is highly recommended for beginners because you isolate each command and get to easily see what is working and what is not. When you try to do several commands at once, you might find that when problems arise you end up recreating the recommended process in reverse; that is, removing commands one by one until you locate the problem.

Here is a list of the obvious edits that need to be made to the Horsefeathers Software bulletin:

  1. Replace all blank lines with a paragraph macro (.LP).

  2. Remove all leading spaces from each line.

  3. Remove the printer underscore line, the one that begins with a “+”.

  4. Remove multiple blank spaces that were added between words.

The first edit requires that we match blank lines. However, in looking at the input file, it wasn’t obvious whether the blank lines had leading spaces or not. As it turns out, they do not, so blank lines can be matched using the pattern “^$”. (If there were spaces on the line, the pattern could be written “^□*$”, where “□” stands for a single space.) Thus, the first edit is fairly straightforward to accomplish:

s/^$/.LP/

It replaces each blank line with “.LP”. Note that you do not escape the literal period in the replacement section of the substitute command. We can put this command in a file named sedscr and test the command as follows:

$ sed -f sedscr horsefeathers
                  HORSEFEATHERS SOFTWARE PRODUCT BULLETIN
.LP
  DESCRIPTION
+   ___________
.LP
  BigOne Computer  offers three  software packages from the  suite
  of Horsefeathers  software products  --  Horsefeathers  Business
  BASIC, BASIC  Librarian,  and LIDO.  These software products can
  fill  your    requirements    for    powerful,    sophisticated,
  general-purpose business  software providing you with a base for
  software customization or development.
.LP
  Horsefeathers  BASIC is  BASIC optimized for use on  the  BigOne
  machine with UNIX  or MS-DOS operating systems.  BASIC Librarian
  is a full screen program editor, which also provides the ability

It is pretty obvious which lines have changed. (It is frequently helpful to cut out a portion of a file to use for testing. It works best if the portion is small enough to fit on the screen yet is large enough to include different examples of what you want to change. After all edits have been applied successfully to the test file, a second level of testing occurs when you apply them to the complete, original file.)

The next edit that we make is to remove the line that begins with a “+” and contains a line-printer underscore. We can simply delete this line using the delete command, d. In writing a pattern to match this line, we have a number of choices. Each of the following would match that line:

/^+/
/^+□/
/^+□□*/
/^+□□*__*/

As you can see, each successive regular expression matches a greater number of characters. Only through testing can you determine how complex the expression needs to be to match a specific line and not others. The longer the pattern that you define in a regular expression, the more comfort you have in knowing that it won’t produce unwanted matches. For this script, we’ll choose the third expression:

/^+□□*/d

This command will delete any line that begins with a plus sign and is followed by at least one space. The pattern specifies two spaces, but the second is modified by “*”, which means that the second space might or might not be there.
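To see how a looser address differs from the one we chose, the following sketch counts the lines each pattern selects in a small made-up sample. The sample text and the count_matches helper are assumptions for illustration only, and real spaces are typed where the listings above print “□”.

```shell
#!/bin/sh
# Hypothetical sample: one printer underscore line, one line that merely
# starts with "+", and one ordinary line.
sample='+   ___________
+1 is the increment
  DESCRIPTION'

# count_matches PATTERN: print how many sample lines the address selects.
count_matches() {
    printf '%s\n' "$sample" | sed -n "/$1/p" | wc -l | tr -d ' '
}

loose=$(count_matches '^+')       # any line starting with "+"
strict=$(count_matches '^+  *')   # "+" followed by at least one space
printf '%s\n' "loose: $loose  strict: $strict"
```

The shorter pattern also selects the “+1” line, which the delete command would then remove along with the underscore line; the longer pattern leaves it alone.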

This command was added to the sed script and tested but since it only affects one line, we’ll omit showing the results and move on. The next edit needs to remove the spaces that pad the beginning of a line. The pattern for matching that sequence is very similar to the address for the previous command.

s/^□□*//

This command removes any sequence of spaces found at the beginning of a line. The replacement portion of the substitute command is empty, meaning that the matched string is removed.

We can add this command to the script and test it.

$ sed -f sedscr horsefeathers
HORSEFEATHERS SOFTWARE PRODUCT BULLETIN
.LP
DESCRIPTION
.LP
BigOne Computer  offers three  software packages from the  suite
of Horsefeathers  software products  --  Horsefeathers  Business
BASIC, BASIC  Librarian,  and LIDO.  These software products can
fill  your    requirements    for    powerful,    sophisticated,
general-purpose business  software providing you with a base for
software customization or development.
.LP
Horsefeathers  BASIC is  BASIC optimized for use on  the  BigOne
machine with UNIX  or MS-DOS operating systems.  BASIC Librarian
is a full screen program editor, which also provides the ability

The next edit attempts to deal with the extra spaces added to justify each line. We can write a substitute command to match any string of consecutive spaces and replace it with a single space.

s/□□*/□/g

We add the global flag at the end of the command so that all occurrences, not just the first, are replaced. Note that, like previous regular expressions, we are not specifying how many spaces are there, just that one or more be found. There might be two, three, or four consecutive spaces. No matter how many, we want to reduce them to one.[3]

Let’s test the new script:

$ sed -f sedscr horsefeathers
HORSEFEATHERS SOFTWARE PRODUCT BULLETIN
.LP
DESCRIPTION
.LP
BigOne Computer offers three software packages from the suite
of Horsefeathers software products -- Horsefeathers Business
BASIC, BASIC Librarian, and LIDO. These software products can
fill your requirements for powerful, sophisticated,
general-purpose business software providing you with a base for
software customization or development.
.LP
Horsefeathers BASIC is BASIC optimized for use on the BigOne
machine with UNIX or MS-DOS operating systems. BASIC Librarian
is a full screen program editor, which also provides the ability

It works as advertised, reducing two or more spaces to one. On closer inspection, though, you might notice that the script removes a sequence of two spaces following a period, a place where they might belong.

We could perfect our substitute command such that it does not make the replacement for spaces following a period. The problem is that there are cases when three spaces follow a period and we’d like to reduce that to two. The best way seems to be to write a separate command that deals with the special case of a period followed by spaces.

s/\.□□*/.□□/g

This command replaces a period followed by any number of spaces with a period followed by two spaces. It should be noted that the previous command reduces multiple spaces to one, so that only one space will be found following a period.[4] Nonetheless, this pattern works regardless of how many spaces follow the period, as long as there is at least one. (It would not, for instance, affect a filename of the form test.ext if it appeared in the document.) This command is placed at the end of the script and tested:

$ sed -f sedscr horsefeathers
HORSEFEATHERS SOFTWARE PRODUCT BULLETIN
.LP
DESCRIPTION
.LP
BigOne Computer offers three software packages from the suite 
of Horsefeathers software products -- Horsefeathers Business 
BASIC, BASIC Librarian, and LIDO.  These software products can
fill your requirements for powerful, sophisticated,
general-purpose business software providing you with a base for
software customization or development.
.LP
Horsefeathers BASIC is BASIC optimized for use on the BigOne 
machine with UNIX or MS-DOS operating systems.  BASIC Librarian
is a full screen program editor, which also provides the ability

It works. Here’s the completed script:

s/^$/.LP/
/^+□□*/d
s/^□□*//
s/□□*/□/g
s/\.□□*/.□□/g
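If you want to try the script without the original bulletin, the sketch below recreates a shortened stand-in for the test file and the five commands, then runs them. The sample lines (including the mention of test.ext) are invented for the demonstration, the scratch directory comes from mktemp, which we assume is available, and the spaces are typed literally where the listing prints “□”.

```shell
#!/bin/sh
# Work in a scratch directory so nothing outside it is touched.
dir=$(mktemp -d)
cd "$dir" || exit 1

# A shortened, made-up stand-in for the horsefeathers test file.
cat > horsefeathers <<'EOF'
  DESCRIPTION
+   ___________

  BASIC,  BASIC  Librarian,  and LIDO.   These software  products
  are described in  the file  test.ext  on the tape.
EOF

# The five commands from the text, with real spaces for the box symbol.
cat > sedscr <<'EOF'
s/^$/.LP/
/^+  */d
s/^  *//
s/  */ /g
s/\.  */.  /g
EOF

result=$(sed -f sedscr horsefeathers)
printf '%s\n' "$result"

# Clean up the scratch directory.
cd / && rm -rf "$dir"
```

In this sample the period inside test.ext is left alone because no space follows it, while the period after LIDO ends up followed by exactly two spaces.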

As we said earlier, the next stage would be to test the script on the complete file (hf.product.bulletin), using testsed, and examine the results thoroughly. When we are satisfied with the results, we can use runsed to make the changes permanent:

$ runsed hf.product.bulletin
done

By executing runsed, we have overwritten the original file.

Before leaving this script, it is instructive to point out that although the script was written to process a specific file, each of the commands in the script is one that you might expect to use again, even if you don’t use the entire script again. In other words, you may well write other scripts that delete blank lines or check for two spaces following a period. Recognizing how commands can be reused in other situations reduces the time it takes to develop and test new scripts. It’s like a singer learning a song and adding it to his or her repertoire.

Making Changes Across a Set of Files

The most common use of sed is in making a set of search-and-replacement edits across a set of files. Many times these scripts aren’t very unusual or interesting, just a list of substitute commands that change one word or phrase to another. Of course, such scripts don’t need to be interesting as long as they are useful and save doing the work manually.

The example we look at in this section is a conversion script, designed to modify various “machine-specific” terms in a UNIX documentation set. One person went through the documentation set and made a list of things that needed to be changed. Another person worked from the list to create the following list of substitutions.

s/ON switch/START switch/g
s/ON button/START switch/g
s/STANDBY switch/STOP switch/g
s/STANDBY button/STOP switch/g
s/STANDBY/STOP/g
s/[cC]abinet [Ll]ight/control panel light/g
s/core system diskettes/core system tape/g
s/TERM=542[05] /TERM=PT200 /g
s/Teletype 542[05]/BigOne PT200/g
s/542[05] terminal/PT200 terminal/g
s/Documentation Road Map/Documentation Directory/g
s/Owner\/Operator Guide/Installation and Operation Guide/g
s/AT&T 3B20 [cC]omputer/BigOne XL Computer/g
s/AT&T 3B2 [cC]omputer/BigOne XL Computer/g
s/3B2 [cC]omputer/BigOne XL Computer/g
s/3B2/BigOne XL Computer/g

The script is straightforward. The beauty is not in the script itself but in sed’s ability to apply this script to the hundreds of files comprising the documentation set. Once this script is tested, it can be executed using runsed to process as many files as there are at once.

Such a script can be a tremendous time-saver, but it can also be an opportunity to make big-time mistakes. What sometimes happens is that a person writes the script, tests it on one or two out of the hundreds of files and concludes from that test that the script works fine. While it may not be practical to test each file, it is important that the test files you do choose be both representative and exceptional. Remember that text is extremely variable and you cannot typically trust that what is true for a particular occurrence is true for all occurrences.

Using grep to examine large amounts of input can be very helpful. For instance, if you wanted to determine how “core system diskettes” appears in the documents, you could grep for it everywhere and pore over the listing. To be thorough, you should also grep for “core,” “core system,” “system diskettes,” and “diskettes” to look for occurrences split over multiple lines. (You could also use the phrase script in Chapter 6 to look for occurrences of multiple words over consecutive lines.) Examining the input is the best way to know what your script must do.
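A small sketch shows why searching for the partial phrases matters. The two files here are invented for the demonstration: one holds the phrase on a single line, and the other splits it across a line break.

```shell
#!/bin/sh
dir=$(mktemp -d)

# Two made-up files: an unbroken occurrence and a split one.
printf '%s\n' 'Insert the core system diskettes now.' > "$dir/ch01"
printf '%s\n' 'Insert the core system' 'diskettes now.' > "$dir/ch02"

# Searching for the whole phrase lists only the file with the unbroken occurrence...
whole=$(grep -l 'core system diskettes' "$dir"/ch0* | wc -l | tr -d ' ')
# ...while searching for the last word alone lists both files.
partial=$(grep -l 'diskettes' "$dir"/ch0* | wc -l | tr -d ' ')

printf '%s\n' "files matching whole phrase: $whole, last word only: $partial"
rm -rf "$dir"
```

A substitute command written for the whole phrase would miss the second file in the same way, since sed also works one line at a time.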

In some ways, writing a script is like devising a hypothesis, given a certain set of facts. You try to prove the validity of the hypothesis by increasing the amount of data that you test it against. If you are going to be running a script on multiple files, use testsed to run the script on several dozen files after you’ve tested it on a smaller sample. Then compare the temporary files to the originals to see if your assumptions were correct. The script might be off slightly and you can revise it. The more time you spend testing, which is actually rather interesting work, the less likely you are to spend your time unraveling problems caused by a botched script.
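The listing of testsed is not part of this section, so here is a minimal sketch of the same idea under our own assumptions: a hypothetical dry_run function that writes each result to a temporary file next to the original and reports whether anything changed. The function name, the .tmp suffix, and the sample files are all invented for illustration.

```shell
#!/bin/sh
# dry_run SCRIPT FILE...: apply SCRIPT to each FILE, writing the result to
# FILE.tmp and reporting whether it differs. The originals are never modified.
dry_run() {
    script=$1
    shift
    for f in "$@"
    do
        sed -f "$script" "$f" > "$f.tmp"
        if cmp -s "$f" "$f.tmp"
        then echo "$f: unchanged"
        else echo "$f: changed (compare with: diff $f $f.tmp)"
        fi
    done
}

# Demonstration on two throwaway files in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' 'Press the ON switch.' > "$dir/a"
printf '%s\n' 'Nothing to change here.' > "$dir/b"
printf '%s\n' 's/ON switch/START switch/g' > "$dir/sedscr"

report=$(dry_run "$dir/sedscr" "$dir/a" "$dir/b")
printf '%s\n' "$report"
# Remove "$dir" when you have finished inspecting the .tmp files.
```

Keeping the temporary files around is deliberate: the point of this stage is to read the differences, not just to count them.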

Extracting Contents of a File

One type of sed application is used for extracting relevant material from a file. In this way, sed functions like grep, with the additional advantage that the input can be modified prior to output. This type of script is a good candidate for a shell script.

Here are two examples: extracting a macro definition from a macro package and displaying the outline of a document.

Extracting a macro definition

troff macros are defined in a macro package, often a single file that’s located in a directory such as /usr/lib/macros. A troff macro definition always begins with the string “.de”, followed by an optional space and the one- or two-letter name of the macro. The definition ends with a line beginning with two dots (..). The script we show in this section extracts a particular macro definition from a macro package. (It saves you from having to locate and open the file with an editor and search for the lines that you want to examine.)

The first step in designing this script is to write one that extracts a specific macro, in this case, the BL (Bulleted List) macro in the -mm package.[5]

$ sed -n  "/^\.deBL/,/^\.\.$/p" /usr/lib/macros/mmt
.deBL
.if\\n(.$<1 .)L \\n(Pin 0 1n 0 \\*(BU
.if\\n(.$=1 .LB 0\\$1 0 1 0 \\*(BU
.if\\n(.$>1 \{.ie !\w^G\\$1^G .)L \\n(Pin 0 1n 0 \\*(BU 0 1
.el.LB 0\\$1 0 1 0 \\*(BU 0 1 \}
..

Sed is invoked with the -n option to keep it from printing out the entire file. With this option, sed will print only the lines it is explicitly told to print via the print command. The sed script contains two addresses: the first matches the start of the macro definition “.deBL” and the second matches its termination, “..” on a line by itself. Note that dots appear literally in the two patterns and are escaped using the backslash.

The two addresses specify a range of lines for the print command, p. It is this capability that distinguishes this kind of search script from grep, which cannot match a range of lines.

We can take this command line and make it more general by placing it in a shell script. One obvious advantage of creating a shell script is that it saves typing. Another advantage is that a shell script can be designed for more general usage. For instance, we can allow the user to supply information from the command line. In this case, rather than hard-code the name of the macro in the sed script, we can use a command-line argument to supply it. You can refer to each argument on the command line in a shell script by positional notation: the first argument is $1, the second is $2, and so on. Here’s the getmac script:

#! /bin/sh
# getmac -- print mm macro definition for $1 
sed -n "/^\.de$1/,/^\.\.$/p" /usr/lib/macros/mmt

The first line of the shell script forces interpretation of the script by the Bourne shell, using the “#!” executable interpreter mechanism available on all modern UNIX systems. The second line is a comment that describes the name and purpose of the script. The sed command, on the third line, is identical to the previous example, except that “BL” is replaced by “$1”, a variable representing the first command-line argument. Note that the double quotes surrounding the sed script are necessary. Single quotes would not allow interpretation of “$1” by the shell.

This script, getmac, can be executed as follows:

$ getmac BL

where “BL” is the first command-line argument. It produces the same output as the previous example.
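The difference between the two kinds of quotes is easy to check without running sed at all. This sketch stores the same sed script in a double-quoted and a single-quoted string and prints what each one would hand to sed; the set command simply simulates invoking a script with “BL” as its first argument.

```shell
#!/bin/sh
# Simulate a script invoked with one argument, as in "getmac BL".
set -- BL

# Double quotes: the shell substitutes $1 before sed ever sees the script.
double="/^\.de$1/,/^\.\.$/p"
# Single quotes: the characters $1 are passed through untouched.
single='/^\.de$1/,/^\.\.$/p'

printf '%s\n' "double-quoted: $double"
printf '%s\n' "single-quoted: $single"
```

Only the double-quoted form contains “BL”. The single-quoted form would ask sed to look for a macro literally named “$1”, which is why it finds nothing.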

This script can be adapted to work with any of several macro packages. The following version of getmac allows the user to specify the name of a macro package as the second command-line argument.

#! /bin/sh
# getmac - read macro definition for $1 from package $2
file=/usr/lib/macros/mmt
mac="$1"
case $2 in
 -ms) file="/work/macros/current/tmac.s";;
 -mm) file="/usr/lib/macros/mmt";;
 -man) file="/usr/lib/macros/an";;
esac
sed -n "/^\.de *$mac/,/^\.\.$/p" $file

What is new here is a case statement that tests the value of $2 and then assigns a value to the variable file. Notice that we assign a default value to file so if the user does not designate a macro package, the -mm macro package is searched. Also, for clarity and readability, the value of $1 is assigned to the variable mac.

In creating this script, we discovered a difference among macro packages in the first line of the macro definition. The -ms macros include a space between “.de” and the name of the macro, while -mm and -man do not. Fortunately, we are able to modify the pattern to accommodate both cases.

/^\.de *$mac/

Following “.de”, we specify a space followed by an asterisk, which means the space is optional.
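Since the macro files named above may not exist on your system, the following sketch tries the pattern on a made-up file that mixes both styles. The macro names and bodies are placeholders rather than real definitions, and the extract function is a hypothetical stand-in for getmac.

```shell
#!/bin/sh
# A made-up macro file mixing both styles: ".de BL" (space, as in -ms) and
# ".deXX" (no space, as in -mm). The bodies are placeholder comments.
macros=$(mktemp)
cat > "$macros" <<'EOF'
.de BL
.\" body of BL
..
.deXX
.\" body of XX
..
EOF

# extract NAME: print the definition of macro NAME from the sample file.
extract() {
    sed -n "/^\.de *$1/,/^\.\.$/p" "$macros"
}

with_space=$(extract BL)
without_space=$(extract XX)
printf '%s\n' "$with_space" "$without_space"
rm -f "$macros"
```

Both definitions are printed in full, three lines each, so the one pattern serves either convention.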

The script prints the result on standard output, but it can easily be redirected into a file, where it can become the basis for the redefinition of a macro.

Generating an outline

Our next example not only extracts information; it modifies it to make it easier to read. We create a shell script named do.outline that uses sed to give an outline view of a document. It processes lines containing coded section headings, such as the following:

.Ah "Shell Programming"

The macro package we use has a chapter heading macro named “Se” and hierarchical headings named “Ah”, “Bh”, and “Ch”. In the -mm macro package, these macros might be “H”, “H1”, “H2”, “H3”, etc. You can adapt the script to whatever macros or tags identify the structure of a document. The purpose of the do.outline script is to make the structure more apparent by printing the headings in an indented outline format.

The result of do.outline is shown below:

$ do.outline ch13/sect1
CHAPTER  13 Let the Computer Do the Dirty Work
     A.  Shell Programming 
          B.  Stored Commands
          B.  Passing Arguments to Shell Scripts
          B.  Conditional Execution
          B.  Discarding Used Arguments
          B.  Repetitive Execution
          B.  Setting Default Values
          B.  What We've Accomplished

It prints the result to standard output (without, of course, making any changes within the files themselves).

Let’s look at how to put together this script. The script needs to match lines that begin with the macros for:

  • Chapter title (.Se)

  • Section heading (.Ah)

  • Subsection heading (.Bh)

We need to make substitutions on those lines, replacing macros with a text marker (A, B, for instance) and adding the appropriate amount of spacing (using tabs) to indent each heading. (Remember, the “•” denotes a tab character.)

Here’s the basic script:

sed -n '
s/^\.Se /CHAPTER /p
s/^\.Ah /•A. /p
s/^\.Bh /••B.  /p' $*

do.outline operates on all files specified on the command line (“$*”). The -n option suppresses the default output of the program. The sed script contains three substitute commands that replace the codes with the letters and indent each line. Each substitute command is modified by the p flag that indicates the line should be printed.

When we test this script, the following results are produced:

CHAPTER  "13" "Let the Computer Do the Dirty Work"
     A.  "Shell Programming" 
          B.  "Stored Commands"
          B.  "Passing Arguments to Shell Scripts"

The quotation marks that surround the arguments to a macro are passed through. We can write a substitute command to remove the quotation marks.

s/"//g

It is necessary to specify the global flag, g, to catch all occurrences on a single line. However, the key decision is where to put this command in the script. If we put it at the end of the script, it will remove the quotation marks after the line has already been output. We have to put it at the top of the script and perform this edit for all lines, regardless of whether or not they are output later in the script.

sed -n '
s/"//g
s/^\.Se /CHAPTER /p
s/^\.Ah /•A. /p
s/^\.Bh /••B.  /p' $*

This script now produces the results that were shown earlier.
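The effect of the command's position can be seen on a single heading line. In this sketch the tab indentation is left out to keep the output easy to read, and the two variable names are ours.

```shell
#!/bin/sh
line='.Ah "Shell Programming"'

# Quote removal placed last: the p flag has already printed the line.
late=$(printf '%s\n' "$line" | sed -n '
s/^\.Ah /A. /p
s/"//g')

# Quote removal placed first: the line is clean by the time it is printed.
early=$(printf '%s\n' "$line" | sed -n '
s/"//g
s/^\.Ah /A. /p')

printf '%s\n' "late:  $late" "early: $early"
```

In the first version the quotation marks are still removed from the pattern space, but only after the line has been printed, and with -n in effect that later state is never output.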

You can modify this script to search for almost any kind of coded format. For instance, here’s a rough version for a LaTeX file:

sed -n '
s/[{}]//g
s/\\section/•A. /p
s/\\subsection/••B.  /p' $*

Edits To Go

Let’s consider an application that shows sed in its role as a true stream editor, making edits in a pipeline—edits that are never written back into a file.

On a typewriter-like device (including a CRT), an em-dash is typed as a pair of hyphens (--). In typesetting, it is printed as a single, long dash (—). troff provides a special character name for the em-dash, but it is inconvenient to type “\(em”.

The following command changes two consecutive dashes into an em-dash.

s/--/\\(em/g

We double the backslashes in the replacement string for \(em, since the backslash has a special meaning to sed.

Perhaps there are cases in which we don’t want this substitute command to be applied. What if someone is using hyphens to draw a horizontal line? We can refine this command to exclude lines containing three or more consecutive hyphens. To do this, we use the ! address modifier:

/---/!s/--/\\(em/g

It may take a moment to penetrate this syntax. What’s different is that we use a pattern address to restrict the lines that are affected by the substitute command, and we use ! to reverse the sense of the pattern match. It says, simply, “If you find a line containing three consecutive hyphens, don’t apply the edit.” On all other lines, the substitute command will be applied.
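A quick way to convince yourself is to feed the command a few lines by hand. The three input lines below are made up: a sentence with a typed dash, a rule drawn with hyphens, and a line with no hyphens at all.

```shell
#!/bin/sh
# Only the first line should change; the ruled line is protected by the
# negated address, and the last line has nothing to substitute.
result=$(printf '%s\n' 'sed--the stream editor' '----------' 'plain text' |
    sed '/---/!s/--/\\(em/g')
printf '%s\n' "$result"
```

Without the address, the ruled line would be turned into a string of five “\(em” sequences, which is exactly the case we wanted to avoid.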

We can use this command in a script that automatically inserts em-dashes for us. To do that, we will use sed as a preprocessor for a troff file. The file will be processed by sed and then piped to troff.

sed '/---/!s/--/\\(em/g' file | troff

In other words, sed changes the input file and passes the output directly to troff, without creating an intermediate file. The edits are made on-the-go, and do not affect the input file. You might wonder why not just make the changes permanently in the original file? One reason is simply that it’s not necessary—the input remains consistent with what the user typed but troff still produces what looks best for typeset-quality output. Furthermore, because it is embedded in a larger shell script, the transformation of hyphens to em-dashes is invisible to the user, and not an additional step in the formatting process.

We use a shell script named format that uses sed for this purpose. Here’s what the shell script looks like:

#! /bin/sh
eqn=  pic=  col=
files=  options=  roff="ditroff -Tps"
# Four backslashes: the double quotes halve them, so sed receives \\(em.
sed="| sed '/---/!s/--/\\\\(em/g'"
while [ $# -gt 0 ]
do
   case $1 in
     -E) eqn="| eqn";;
     -P) pic="| pic";;
     -N) roff="nroff"  col="| col"  sed= ;;
     -*) options="$options $1";;

      *) if [ -f $1 ]
         then files="$files $1"
         else echo "format: $1: file not found"; exit 1
         fi;;
   esac
   shift
done
eval "cat $files $sed | tbl $eqn $pic | $roff $options $col | lp"

This script assigns and evaluates a number of variables (prefixed by a dollar sign) that construct the command line that is submitted to format and print a document. (Notice that we’ve set up the -N option for nroff so that it sets the sed variable to the empty string, since we only want to make this change if we are using troff. Even though nroff understands the \(em special character, making this change would have no actual effect on the output.)
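The technique of holding optional pipeline stages in variables and submitting the assembled string with eval can be tried in isolation. This is only a sketch: the run_pipeline function and its -N and -S options are invented, and tr stands in for the sed stage and for tools such as eqn and pic, which may not be installed.

```shell
#!/bin/sh
# run_pipeline [FLAG]: build a command line from optional stages kept in
# variables and submit it with eval.
run_pipeline() {
    upper="| tr a-z A-Z"     # on by default, like the sed stage in format
    squeeze=                 # off by default, like eqn and pic
    case $1 in
        -N) upper= ;;                  # switch a default stage off
        -S) squeeze="| tr -s ' '" ;;   # switch an optional stage on
    esac
    # An empty variable simply vanishes from the command line.
    eval "printf '%s\n' 'two  spaces' $upper $squeeze"
}

default=$(run_pipeline)
plain=$(run_pipeline -N)
squeezed=$(run_pipeline -S)
printf '%s\n' "$default" "$plain" "$squeezed"
```

The eval is what makes this work: without it, the shell would treat the “|” inside the variable as an ordinary argument rather than as a pipe.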

Changing hyphens to em-dashes is not the only “prettying up” edit we might want to make when typesetting a document. For example, most keyboards do not allow you to type open and close quotation marks (“ and ” as opposed to the straight "). In troff, you can indicate an open quotation mark by typing two consecutive grave accents, or “backquotes” (``), and a close quotation mark by typing two consecutive single quotes (''). We can use sed to change each doublequote character to a pair of single open-quotes or close-quotes (depending on context), which, when typeset, will produce the appearance of a proper “double quote.”

This is a considerably more difficult edit to make, since there are many separate cases involving punctuation marks, space, and tabs. Our script might look like this:

s/^"/``/
s/"$/''/
s/"?□/''?□/g
s/"?$/''?/g
s/□"/□``/g
s/"□/''□/g
s/•"/•``/g
s/"•/''•/g
s/")/'')/g
s/"]/'']/g
s/("/(``/g
s/\["/\[``/g
s/";/'';/g
s/":/'':/g
s/,"/,''/g
s/",/'',/g
s/\."/.\\\&''/g
s/"\./''.\\\&/g
s/\\(em\\^"/\\(em``/g
s/"\\(em/''\\(em/g
s/\\(em"/\\(em``/g
s/@DQ@/"/g

The first substitute command looks for a quotation mark at the beginning of a line and changes it to an open-quote. The second command looks for a quotation mark at the end of a line and changes it to a close-quote. The remaining commands look for the quotation mark in different contexts, before or after a punctuation mark, a space, a tab, or an em-dash. The last command allows us to get a real doublequote (") into the troff input if we need it. We put these commands in a “cleanup” script, along with the command changing hyphens to dashes, and invoke it in the pipeline that formats and prints documents using troff.
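To get a feel for how the cases combine, this sketch applies just four of the commands (the two line-edge cases and the two space-adjacent cases) to one made-up line. The commands are saved in a temporary script file, with real spaces typed where the listing prints “□”.

```shell
#!/bin/sh
# A subset of the quote-conversion commands, kept in a script file so that
# the quote characters need no extra shell escaping.
script=$(mktemp)
cat > "$script" <<'EOF'
s/^"/``/
s/"$/''/
s/ "/ ``/g
s/" /'' /g
EOF

result=$(printf '%s\n' '"Yes" is not "no"' | sed -f "$script")
printf '%s\n' "$result"
rm -f "$script"
```

Each of the four quotation marks on the sample line is handled by a different command, which hints at why the full list has to enumerate so many contexts.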



[3] This command will also match just a single space. But since the replacement is also a single space, such a case is effectively a “no-op.”

[4] The command could therefore be simplified to:

s/\.□/.□□/g

[5] We happen to know that the -mm macros don’t have a space after the “.de” command.