sed & awk, 2nd Edition, by Arnold Robbins. Published by O'Reilly Media, Inc., 1997.

Language Summary for awk

This section summarizes how awk processes input records and describes the various syntactic elements that make up an awk program.

Records and Fields

Each line of input is split into fields. By default, the field delimiter is one or more spaces and/or tabs. You can change the field separator by using the -F command-line option. Doing so also sets the value of FS. The following command line changes the field separator to a colon:

awk -F: -f awkscr /etc/passwd

You can also assign the delimiter to the system variable FS. This is typically done in the BEGIN procedure, but can also be passed as a parameter on the command line.

awk -f awkscr FS=: /etc/passwd
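As a minimal sketch of the two forms side by side, the following feeds one made-up record in the style of /etc/passwd to awk on standard input. The sample data and the shell variable names are assumptions for illustration; the "-" operand simply names standard input so that the FS=: assignment is processed before any input is read.

```shell
# Sample colon-delimited record in the style of /etc/passwd (made-up data).
line='alice:x:1001:1001:Alice:/home/alice:/bin/sh'

# Set the field separator with the -F option.
by_option=$(printf '%s\n' "$line" | awk -F: '{ print $1, $6 }')

# Set it with an FS=: assignment placed before the input operand ("-" is stdin).
by_assignment=$(printf '%s\n' "$line" | awk '{ print $1, $6 }' FS=: -)

echo "$by_option"
echo "$by_assignment"
```

Both commands should print the login name and home directory, because each one splits the record on colons.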

Each input line forms a record containing any number of fields. Each field can be referenced by its position in the record. “$1” refers to the value of the first field; “$2” to the second field, and so on. “$0” refers to the entire record. The following action prints the first field of each input line:

{ print $1 }

The default record separator is a newline. The following procedure sets FS and RS so that awk interprets an input record as any number of lines up to a blank line, with each line being a separate field.

BEGIN { FS = "\n"; RS = "" }

It is important to know that when RS is set to the empty string, newline always separates fields, in addition to whatever value FS may have. This is discussed in more detail in both The AWK Programming Language and Effective AWK Programming.
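The multiline-record setup can be tried with a short sketch like the one below. The address data is invented for illustration; each blank-line-separated block becomes one record, and each line within it becomes one field.

```shell
# Two blank-line-separated records (made-up data); each line is one field.
records=$(printf 'Ann Lee\n12 Oak St\nBoston\n\nBob Ray\n9 Elm Rd\n' |
  awk 'BEGIN { FS = "\n"; RS = "" }
       { print NR ": " $1 " (" NF " lines)" }')

echo "$records"
```

Here NR counts records (blocks) rather than lines, and NF reports how many lines each block contains.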

Format of a Script

An awk script is a set of pattern-matching rules and actions:

pattern { action }

An action is one or more statements that will be performed on those input lines that match the pattern. If no pattern is specified, the action is performed for every input line. The following example uses the print statement to print each line in the input file:

{ print }

If only a pattern is specified, then the default action consists of the print statement, as shown above.

Function definitions can also appear:

function name (parameter list) { statements }

This syntax defines the function name, making available the list of parameters for processing in the body of the function. Variables specified in the parameter-list are treated as local variables within the function. All other variables are global and can be accessed outside the function. When calling a user-defined function, no space is permitted between the name of the function and the opening parenthesis. Spaces are allowed in the function’s definition. User-defined functions are described in Chapter 9.
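A small sketch of a user-defined function follows. The function name sum_fields and the sample input are hypothetical. Listing i and total as extra parameters is a common awk idiom for getting local variables, which is why total starts fresh on every call.

```shell
# sum_fields is a hypothetical helper that adds up the fields of the current record.
sums=$(printf '1 2 3\n4 5\n' |
  awk 'function sum_fields(    i, total) {   # i and total are local
           for (i = 1; i <= NF; i++)
               total += $i
           return total
       }
       { print sum_fields() }')

echo "$sums"
```

If total were a global variable instead, the second line of output would carry over the first record's sum.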

Line termination

A line in an awk script is terminated by a newline or a semicolon. Using semicolons to put multiple statements on a line, while permitted, reduces the readability of most programs. Blank lines are permitted between statements.

Program control statements (do, if, for, or while) continue on the next line, where a dependent statement is listed. If multiple dependent statements are specified, they must be enclosed within braces.

if (NF > 1) {
        name = $1
        total += $2
}

You cannot use a semicolon to avoid using braces for multiple statements.

You can type a single statement over multiple lines by escaping the newline with a backslash (\). You can also break a line after any of the following tokens:

, { && ||

Gawk also allows you to continue a line after either a “?” or a “:”. Strings cannot be broken across a line (except in gawk, using “\” followed by a newline).
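The continuation rules can be illustrated with a short sketch, using invented input. The pattern is broken after &&, and the assignment is continued with a backslash-newline.

```shell
# The pattern continues after &&; the assignment continues after a backslash.
total=$(printf '3 4\n-1 5\n' |
  awk '$1 > 0 &&
       $2 > 0 {
           sum = $1 + \
                 $2
           print sum
       }')

echo "$total"
```

Only the first input line satisfies both conditions, so a single sum is printed.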

Comments

A comment begins with a “#” and ends with a newline. It can appear on a line by itself or at the end of a line. Comments are descriptive remarks that explain the operation of the script. Comments cannot be continued across lines by ending them with a backslash.

Patterns

A pattern can be any of the following:

/regular expression/
relational expression
BEGIN
END
pattern, pattern
  1. Regular expressions use the extended set of metacharacters and must be enclosed in slashes. For a full discussion of regular expressions, see Chapter 3.

  2. Relational expressions use the relational operators listed under “Expressions” later in this appendix.

  3. The BEGIN pattern is applied before the first line of input is read and the END pattern is applied after the last line of input is read.

  4. Use ! to negate the match; i.e., to handle lines not matching the pattern.

  5. You can address a range of lines, just as in sed:

    pattern, pattern

    Patterns, except BEGIN and END, can be expressed in compound forms using the following operators:

    &&    Logical And
    ||    Logical Or

    Sun’s version of nawk (SunOS 4.1.x) does not support treating regular expressions as parts of a larger Boolean expression. For example, neither “/cute/ && /sweet/” nor “/fast/ || /quick/” works.

    In addition, the C conditional operator ?: (pattern ? pattern : pattern) may be used in a pattern.

  6. Patterns can be placed in parentheses to ensure proper evaluation.

  7. BEGIN and END patterns must be associated with actions. If multiple BEGIN and END rules are written, they are merged into a single rule before being applied.
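The pattern forms listed above can be sketched as follows. The sample data and the shell variable names are assumptions for illustration. Each command uses only a pattern, or a pattern with a short action, so the effect of the pattern itself is easy to see.

```shell
# Made-up sample input: a marked region surrounded by other lines.
data='apple 5
start
banana 12
cherry 7
stop
date 30'

# Range pattern: every line from "start" through "stop", inclusive.
in_range=$(printf '%s\n' "$data" | awk '/^start$/,/^stop$/')

# Compound relational pattern: two-field lines whose second field exceeds 6.
big=$(printf '%s\n' "$data" | awk 'NF == 2 && $2 > 6 { print $1 }')

# Negated regular expression: lines that do not contain the letter a.
no_a=$(printf '%s\n' "$data" | awk '!/a/')

printf '%s\n---\n%s\n---\n%s\n' "$in_range" "$big" "$no_a"
```

The first and third commands rely on the default action, which prints the matching line.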

Regular Expressions

Table B.1 summarizes the regular expressions as described in Chapter 3. The metacharacters are listed in order of precedence.

Table B.1. Regular Expression Metacharacters
Special Characters   Usage
c                    Matches any literal character c that is not a metacharacter.
\                    Escapes any metacharacter that follows, including itself.
^                    Anchors following regular expression to the beginning of string.
$                    Anchors preceding regular expression to the end of string.
.                    Matches any single character, including newline.
[...]                Matches any one of the class of characters enclosed between the brackets. A circumflex (^) as the first character inside brackets reverses the match to all characters except those listed in the class. A hyphen (-) is used to indicate a range of characters. The close bracket (]) as the first character in a class is a member of the class. All other metacharacters lose their meaning when specified as members of a class, except \, which can be used to escape ], even if it is not first.
r1|r2                Between two regular expressions, r1 and r2, it allows either of the regular expressions to be matched.
(r1)(r2)             Used for concatenating regular expressions.
r*                   Matches any number (including zero) of the regular expression that immediately precedes it.
r+                   Matches one or more occurrences of the preceding regular expression.
r?                   Matches 0 or 1 occurrences of the preceding regular expression.
(r)                  Used for grouping regular expressions.

Regular expressions can also make use of the escape sequences for accessing special characters, as defined in the section “Escape sequences” later in this appendix.

Note that ^ and $ work on strings; they do not match against newlines embedded in a record or string.
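As a brief sketch of several metacharacters from Table B.1 working together (the word list is invented for illustration), the first command combines anchors, grouping, alternation, and the ? operator, and the second uses a negated bracket expression with the ~ match operator.

```shell
# Anchors, grouping, alternation, and an optional trailing "s".
matches=$(printf 'cats\ndog\ncatalog\nbird\n' | awk '/^(cat|dog)s?$/')

# Negated bracket expression: first field does not begin with "c".
not_c=$(printf 'cats\ndog\ncatalog\nbird\n' | awk '$1 ~ /^[^c]/')

printf '%s\n---\n%s\n' "$matches" "$not_c"
```

Because of the $ anchor, "catalog" does not match the first expression even though it begins with "cat".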

Within a pair of brackets, POSIX allows special notations for matching non-English characters. They are described in Table B.2.

Table B.2. POSIX Character List Facilities
Notation      Facility
[.symbol.]    Collating symbols. A collating symbol is a multi-character sequence that should be treated as a unit.
[=equiv=]     Equivalence classes. An equivalence class lists a set of characters that should be considered equivalent, such as “e” and “è”.
[:class:]     Character classes. Character class keywords describe different classes of characters such as alphabetic characters, control characters, and so on.

              [:alnum:]     Alphanumeric characters
              [:alpha:]     Alphabetic characters
              [:blank:]     Space and tab characters
              [:cntrl:]     Control characters
              [:digit:]     Numeric characters
              [:graph:]     Printable and visible (non-space) characters
              [:lower:]     Lowercase characters
              [:print:]     Printable characters
              [:punct:]     Punctuation characters
              [:space:]     Whitespace characters
              [:upper:]     Uppercase characters
              [:xdigit:]    Hexadecimal digits

Note that these facilities (as of this writing) are still not widely implemented.
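In an awk that does implement the POSIX character classes, they are written inside a bracket expression, so the brackets appear doubled. The following is a sketch only, with made-up input, and it assumes such an awk is installed; an implementation without this support will not give the same result.

```shell
# Assumes an awk with POSIX character class support.
# Select lines made of letters followed by digits.
ids=$(printf 'ab12\n7up\nxyz\n' | awk '/^[[:alpha:]]+[[:digit:]]+$/')

echo "$ids"
```

Only "ab12" consists of one or more letters followed by one or more digits.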

Expressions

An expression can be made up of constants, variables, operators and functions. A constant is a string (any sequence of characters) or a numeric value. A variable is a symbol that references a value. You can think of it as a piece of information that retrieves a particular numeric or string value.

Constants

There are two types of constants, string and numeric. A string constant must be quoted while a numeric constant is not.

Escape sequences

The escape sequences described in Table B.3 can be used in strings and regular expressions.

Table B.3. Escape Sequences
Sequence   Description
\a         Alert character, usually ASCII BEL character
\b         Backspace
\f         Formfeed
\n         Newline
\r         Carriage return
\t         Horizontal tab
\v         Vertical tab
\ddd       Character represented as 1 to 3 digit octal value
\xhex      Character represented as hexadecimal value[1]
\c         Any literal character c (e.g., \" for ")[2]

[1] POSIX does not provide “\x”, but it is commonly available.

[2] Like ANSI C, POSIX leaves it purposely undefined what you get when you put a backslash before any character not listed in the table. In most awks, you just get that character.
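A short sketch of three of these sequences inside string constants: an octal escape for each of the letters A, B, and C (assuming ASCII), an escaped double quote, and a horizontal tab. The strings themselves are arbitrary examples.

```shell
# Octal escapes, an escaped double quote, and a tab inside awk string constants.
text=$(awk 'BEGIN { printf "%s|%s|%s\n", "\101\102\103", "say \"hi\"", "x\ty" }')

echo "$text"
```

The third piece of output contains a literal tab character between x and y.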

Variables

There are three kinds of variables: user-defined, built-in, and fields. By convention, the names of built-in or system variables consist of all capital letters.

The name of a variable cannot start with a digit. Otherwise, it consists of letters, digits, and underscores. Case is significant in variable names.

A variable does not need to be declared or initialized. A variable can contain either a string or numeric value. An uninitialized variable has the empty string (“”) as its string value and 0 as its numeric value. Awk attempts to decide whether a value should be processed as a string or a number depending upon the operation.
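The dual nature of an uninitialized variable can be checked with a sketch like this one. The variable x is never assigned; the names is_zero and is_empty are chosen only for illustration.

```shell
# x is never assigned, so it acts as 0 in numeric context and "" in string context.
flags=$(awk 'BEGIN {
    is_zero  = (x == 0)     # numeric comparison: x acts as 0
    is_empty = (x == "")    # string comparison: x acts as ""
    print is_zero, is_empty, length(x), x + 1
}')

echo "$flags"
```

Both comparisons succeed, the string value has length 0, and adding 1 to x yields 1.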

The assignment of a variable has the form:

var = expr

It assigns the value of the expression to var. The following expression assigns a value of 1 to the variable x.

x = 1

The name of the variable is used to reference the value:

{ print x }

prints the value of the variable x. In this case, it would be 1.

See the section System variables for information on built-in variables. A field variable is referenced using $n, where n is any number 0 to NF, that references the field by position. It can be supplied by a variable, such as $NF meaning the last field, or constant, such as $1 meaning the first field.
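For instance, a field can be selected by a constant, by a user-defined variable, by NF, or by an expression in parentheses. The input line and the variable n below are made up for illustration.

```shell
# Field references by constant, by variable, by NF, and by a parenthesized expression.
fields=$(printf 'one two three four\n' |
  awk '{ n = 2; print $1, $n, $NF, $(NF - 1) }')

echo "$fields"
```

The last two references pick the final field and the one before it, whatever the number of fields happens to be.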

Arrays

An array is a variable that can be used to store a set of values. The following statement assigns a value to an element of an array:

array[index] = value

In awk, all arrays are associative arrays. What makes an associative array unique is that its index can be a string or a number.

An associative array makes an “association” between the indices and the elements of an array. For each element of the array, a pair of values is maintained: the index of the element and the value of the element. The elements are not stored in any particular order as in a conventional array.

You can use the special for loop to read all the elements of an associative array.

for (item in array)

The index of the array is available as item, while the value of an element of the array can be referenced as array[item].

You can use the operator in to test that an element exists by testing to see if its index exists.

if (index in array)

tests that array[index] exists, but you cannot use it to test the value of the element referenced by array[index].

You can also delete individual elements of the array using the delete statement.
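The array operations described above can be sketched together in one small program. The color data is invented. Because the for loop visits elements in no particular order, the output is piped through sort to make it predictable.

```shell
# Count occurrences of each word, delete one element, test membership with "in",
# and walk the remaining elements. sort fixes the otherwise unspecified order.
report=$(printf 'red\nblue\nred\ngreen\nred\nblue\n' |
  awk '{ count[$1]++ }
       END {
           delete count["green"]
           if (!("green" in count))
               print "green removed"
           for (color in count)
               print color, count[color]
       }' | LC_ALL=C sort)

echo "$report"
```

Note that the in test checks only whether the index exists; it says nothing about the value stored there.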

System variables

Awk defines a number of special variables that can be referenced or reset inside a program, as shown in Table B.4 (defaults are listed in parentheses).

Table B.4. Awk System Variables
Variable   Description
ARGC       Number of arguments on command line
ARGV       An array containing the command-line arguments
CONVFMT    String conversion format for numbers (%.6g). (POSIX)
ENVIRON    An associative array of environment variables
FILENAME   Current filename
FNR        Like NR, but relative to the current file
FS         Field separator (a blank)
NF         Number of fields in current record
NR         Number of the current record
OFMT       Output format for numbers (%.6g)
OFS        Output field separator (a blank)
ORS        Output record separator (a newline)
RLENGTH    Length of the string matched by match( ) function
RS         Record separator (a newline)
RSTART     First position in the string matched by match( ) function
SUBSEP     Separator character for array subscripts (\034)
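As a brief sketch of referencing and resetting system variables, the following sets OFS in a BEGIN rule and prints NR, NF, and the last field for each record. The input is made up for illustration.

```shell
# OFS is reset in BEGIN; NR and NF are read for each record and again in END.
summary=$(printf 'a b\nc d e\n' |
  awk 'BEGIN { OFS = "-" }
       { print NR, NF, $NF }
       END { print "records", NR }')

echo "$summary"
```

The commas in each print statement are replaced by the new OFS value, and in the END rule NR still holds the number of the last record read.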

Operators

Table B.5 lists the operators in the order of precedence (low to high) that are available in awk.

Table B.5. Operators
Operators                 Description
= += -= *= /= %= ^= **=   Assignment
?:                        C conditional expression
||                        Logical OR
&&                        Logical AND
~ !~                      Match regular expression and negation
< <= > >= != ==           Relational operators
(blank)                   Concatenation
+ -                       Addition, subtraction
* / %                     Multiplication, division, and modulus
+ - !                     Unary plus and minus, and logical negation
^ **                      Exponentiation
++ --                     Increment and decrement, either prefix or postfix
$                         Field reference

Note

While “**” and “**=” are common extensions, they are not part of POSIX awk.
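A few consequences of the ordering in Table B.5 can be seen in this sketch. The input values and the variable names a through d are arbitrary.

```shell
# Each assignment shows one precedence relationship from Table B.5.
results=$(printf '10 20 30\n' | awk '{
    a = 1 " " 2 + 3     # addition binds tighter than concatenation
    b = -2 ^ 2          # exponentiation binds tighter than unary minus
    c = $NF - 1         # $ binds tightest: ($NF) - 1
    d = $(NF - 1)       # parentheses select the next-to-last field
    print a
    print b
    print c
    print d
}')

echo "$results"
```

The difference between the last two lines is a frequent source of errors: without parentheses, the subtraction applies to the value of the last field rather than to the field number.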

Statements and Functions

An action is enclosed in braces and consists of one or more statements and/or expressions. The difference between a statement and a function is that a function returns a value, and its argument list is specified within parentheses. (The formal syntactical difference does not always hold true: printf is considered a statement, but its argument list can be put in parentheses; getline is a function that does not use parentheses.)

Awk has a number of predefined arithmetic and string functions. A function is typically called as follows:

result = function(arg1, arg2)

where result is a variable created to hold what the function returns (the word return itself is reserved in awk and cannot be used as a variable name). In fact, the return value of a function can be used anywhere in an expression, not just on the right-hand side of an assignment. Arguments to a function are specified as a comma-separated list. The left parenthesis follows the name of the function. (With built-in functions, a space is permitted between the function name and the parentheses.)
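To illustrate that a function's return value can appear anywhere in an expression, this sketch calls the built-in length( ) in a pattern and nests substr( ) inside toupper( ) in the action. The input words are made up.

```shell
# length() is used in the pattern; substr() and toupper() are combined in the action.
capitalized=$(printf 'hello\nawk\n' |
  awk 'length($1) > 3 { print toupper(substr($1, 1, 1)) substr($1, 2), length($1) }')

echo "$capitalized"
```

Only the first word is longer than three characters, so only it is capitalized and printed with its length.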