Copyright © 2000 O'Reilly & Associates, Inc. All rights reserved.
Printed in the United States of America.
Published by O'Reilly & Associates, Inc., 101 Morris Street, Sebastopol, CA 95472.
The O'Reilly logo is a registered trademark of O'Reilly & Associates, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly & Associates, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps. The use of the slender loris image in association with sed & awk is a trademark of O'Reilly & Associates, Inc.
While every precaution has been taken in the preparation of this book, the publisher assumes no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
The pocket reference follows certain typographic conventions, outlined here:
Constant Width
Is used for code examples, commands, directory names, filenames, and options.
Constant Width Italic
Is used in syntax and command summaries to show replaceable text; this text should be replaced with user-supplied values.
Constant Width Bold
Is used in code examples to show commands or other text that should be typed literally by the user.
Italic
Is used to show generic arguments and options; these should be replaced with user-supplied values. Italic is also used to highlight comments in examples.
$
Is used in some examples as the Bourne shell or Korn shell prompt.
[ ]
Surround optional elements in a description of syntax. (The brackets themselves should never be typed.)
A number of Unix text-processing utilities let you search for, and in some cases change, text patterns rather than fixed strings. These utilities include the editing programs ed, ex, vi, and sed, the awk programming language, and the commands grep and egrep. Text patterns (formally called regular expressions) contain normal characters mixed with special characters (called metacharacters).
This section presents the following topics:
Filenames versus patterns
List of metacharacters available to each program
Description of metacharacters
Examples
Metacharacters used in pattern matching are different from metacharacters used for filename expansion. When you issue a command on the command line, special characters are seen first by the shell, then by the program; therefore, unquoted metacharacters are interpreted by the shell for filename expansion. The command:
$ grep [A-Z]* chap[12]
could, for example, be transformed by the shell into:
$ grep Array.c Bug.c Comp.c chap1 chap2
and would then try to find the pattern Array.c in files Bug.c, Comp.c, chap1, and chap2. To bypass the shell and pass the special characters to grep, use quotes:
$ grep "[A-Z]*" chap[12]
Double quotes suffice in most cases, but single quotes are the safest bet.
Note also that in pattern matching, ? matches zero or one instance of a regular expression; in filename expansion, ? matches a single character.
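The effect of quoting can be checked without running grep at all. The following is a minimal sketch, assuming a scratch directory made with mktemp and the illustrative file names from the example above; echo stands in for grep so that the arguments the shell would pass are visible. The variable names are hypothetical.

```shell
#!/bin/sh
# Use the C locale so that [A-Z] matches only uppercase ASCII letters.
LC_ALL=C
export LC_ALL

# Scratch directory with illustrative file names (an assumption).
tmp=$(mktemp -d)
touch "$tmp/Array.c" "$tmp/Bug.c" "$tmp/Comp.c" "$tmp/chap1" "$tmp/chap2"

# Unquoted: the shell expands both patterns before grep would ever run.
unquoted=$(cd "$tmp" && echo grep [A-Z]* chap[12])

# Single-quoted: the regular expression arrives intact; only the
# filename pattern chap[12] is expanded by the shell.
quoted=$(cd "$tmp" && echo grep '[A-Z]*' chap[12])

echo "$unquoted"
echo "$quoted"

rm -rf "$tmp"
```

The first echo shows the file names substituted for the pattern; the second shows the pattern passed through unchanged.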
The characters in the following table have special meaning only in search patterns:
Many Unix systems allow the use of POSIX "character classes" within the square brackets that enclose a group of characters. These are typed enclosed in [: and :]. For example, [[:alnum:]] matches a single alphanumeric character.
| Class | Characters Matched |
|---|---|
| alnum | Alphanumeric characters |
| alpha | Alphabetic characters |
| blank | Space or tab |
| cntrl | Control characters |
| digit | Decimal digits |
| graph | Non-space characters |
| lower | Lowercase characters |
| print | Printable characters |
| space | White-space characters |
| upper | Uppercase characters |
| xdigit | Hexadecimal digits |
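As a rough illustration of the character classes, the sketch below runs grep over a few made-up lines. The sample input and variable names are assumptions, and the results depend on a grep that supports POSIX classes.

```shell
#!/bin/sh
# Illustrative sample input (not from the text).
input='abc
ABC
a1 b2
___'

# Lines containing at least one uppercase letter.
upper=$(printf '%s\n' "$input" | grep '[[:upper:]]')

# Lines made up entirely of non-alphanumeric characters.
nonalnum=$(printf '%s\n' "$input" | grep '^[^[:alnum:]]*$')

# Number of lines containing a decimal digit.
digits=$(printf '%s\n' "$input" | grep -c '[[:digit:]]')

echo "$upper"
echo "$nonalnum"
echo "$digits"
```

Note that the class name, with its [: and :] delimiters, sits inside an ordinary bracket expression, which is why the brackets are doubled.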
The characters in the following table have special meaning only in replacement patterns.
| Character | Pattern |
|---|---|
| \ | Turn off the special meaning of the following character. |
| \n | Restore the text matched by the nth pattern previously saved by \( and \). n is a number from 1 to 9, with 1 starting on the left. |
| & | Reuse the text matched by the search pattern as part of the replacement pattern. |
| ~ | Reuse the previous replacement pattern in the current replacement pattern. Must be the only character in the replacement pattern. (ex and vi). |
| % | Reuse the previous replacement pattern in the current replacement pattern. Must be the only character in the replacement pattern. (ed). |
| \u | Convert first character of replacement pattern to uppercase. |
| \U | Convert entire replacement pattern to uppercase. |
| \l | Convert first character of replacement pattern to lowercase. |
| \L | Convert entire replacement pattern to lowercase. |
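A short sed sketch of the first three replacement metacharacters follows. The input strings and variable names are illustrative assumptions; the case-conversion escapes are left out because the text lists them for ex and vi.

```shell
#!/bin/sh
# & reuses the whole text matched by the search pattern.
wrapped=$(echo 'error 42' | sed 's/[0-9][0-9]*/<&>/')

# \1 and \2 restore the text saved by the first and second \( \) pair.
swapped=$(echo 'key=value' | sed 's/\(.*\)=\(.*\)/\2=\1/')

# A backslash turns off the special meaning of & in the replacement.
literal=$(echo 'AT&T' | sed 's/&/ \& /')

echo "$wrapped"
echo "$swapped"
echo "$literal"
```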
Some metacharacters are valid for one program but not for another. Those that are available to a Unix program are marked by a bullet (•) in the following table. (This table is correct for SVR4 and Solaris and most commercial Unix systems, but it's always a good idea to verify your system's behavior.) Items marked with a "P" are specified by POSIX; double-check your system's version. Full descriptions were provided in the previous section.
| Symbol | ed | ex/vi | sed/grep | awk/egrep | Action |
|---|---|---|---|---|---|
| . | • | • | • | • | Match any character. |
| * | • | • | • | • | Match zero or more preceding. |
| ^ | • | • | • | • | Match beginning of line/string. |
| $ | • | • | • | • | Match end of line/string. |
| \ | • | • | • | • | Escape following character. |
| [ ] | • | • | • | • | Match one from a set. |
| \( \) | • | • | • | | Store pattern for later replay.[1] |
| \n | • | • | • | | Replay sub-pattern in match. |
| { } | | | | P | Match a range of instances. |
| \{ \} | • | | • | | Match a range of instances. |
| \< \> | • | • | | | Match word's beginning or end. |
| + | | | | • | Match one or more preceding. |
| ? | | | | • | Match zero or one preceding. |
| \| | | | | • | Separate choices to match. |
| ( ) | | | | • | Group expressions to match. |
[1] Stored sub-patterns can be "replayed" during matching. See the examples, below.
Note that in ed, ex, vi, and sed, you specify both a search pattern (on the left) and a replacement pattern (on the right). The metacharacters above are meaningful only in a search pattern.
In ed, ex, vi, and sed, the following metacharacters are valid only in a replacement pattern:
| Symbol | ex | vi | sed | ed | Action |
|---|---|---|---|---|---|
| \ | • | • | • | • | Escape following character. |
| \n | • | • | • | • | Text matching pattern stored in \( \). |
| & | • | • | • | • | Text matching search pattern. |
| ~ | • | • | | | Reuse previous replacement pattern. |
| % | | | | • | Reuse previous replacement pattern. |
| \u \U | • | • | | | Change character(s) to uppercase. |
| \l \L | • | • | | | Change character(s) to lowercase. |
| \E | • | • | | | Turn off previous \U or \L. |
| \e | • | • | | | Turn off previous \u or \l. |
When used with grep or egrep, regular expressions should be surrounded by quotes. (If the pattern contains a $, you must use single quotes; e.g., 'pattern'.) When used with ed, ex, sed, and awk, regular expressions are usually surrounded by / although (except for awk), any delimiter works. Here are some example patterns.
| Pattern | What Does It Match? |
|---|---|
| bag | The string bag. |
| ^bag | bag at the beginning of the line. |
| bag$ | bag at the end of the line. |
| ^bag$ | bag as the only word on the line. |
| [Bb]ag | Bag or bag. |
| b[aeiou]g | Second letter is a vowel. |
| b[^aeiou]g | Second letter is a consonant (or uppercase or symbol). |
| b.g | Second letter is any character. |
| ^...$ | Any line containing exactly three characters. |
| ^\. | Any line that begins with a dot. |
| ^\.[a-z][a-z] | Same, followed by two lowercase letters (e.g., troff requests). |
| ^\.[a-z]\{2\} | Same as previous, ed, grep and sed only. |
| ^[^.] | Any line that doesn't begin with a dot. |
| bugs* | bug, bugs, bugss, etc. |
| "word" | A word in quotes. |
| "*word"* | A word, with or without quotes. |
| [A-Z][A-Z]* | One or more uppercase letters. |
| [A-Z]+ | Same as previous, egrep or awk only. |
| [[:upper:]]+ | Same as previous, POSIX egrep or awk. |
| [A-Z].* | An uppercase letter, followed by zero or more characters. |
| [A-Z]* | Zero or more uppercase letters. |
| [a-zA-Z] | Any letter, either lower- or uppercase. |
| [^0-9A-Za-z] | Any symbol or space (not a letter or a number). |
| [^[:alnum:]] | Same, using POSIX character class. |
| egrep or awk pattern | What Does It Match? |
|---|---|
| [567] | One of the numbers 5, 6, or 7. |
| five|six|seven | One of the words five, six, or seven. |
| 80[2-4]?86 | 8086, 80286, 80386, or 80486. |
| 80[2-4]?86|(Pentium(-II)?) | 8086, 80286, 80386, 80486, Pentium, or Pentium-II. |
| compan(y|ies) | company or companies. |
| ex or vi pattern | What Does It Match? |
|---|---|
| \<the | Words like theater, there or the. |
| the\> | Words like breathe, seethe or the. |
| \<the\> | The word the. |
| ed, sed, or grep pattern | What Does It Match? |
|---|---|
| 0\{5,\} | Five or more zeros in a row. |
| [0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\} | U.S. Social Security number (nnn-nn-nnnn). |
| \(why\).*\1 | A line with two occurrences of why. |
| \([[:alpha:]_][[:alnum:]_.]*\) = \1; | C/C++ simple assignment statements. |
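A few of the patterns above can be tried directly. This is a minimal sketch with made-up input; it uses grep -E, the POSIX spelling of egrep, for the extended pattern, and the variable names are hypothetical.

```shell
#!/bin/sh
# Illustrative sample input (not from the text).
input='123-45-6789
id 987-65-4321 on file
why oh why
why not
company
companies'

# Interval expressions: count lines holding an nnn-nn-nnnn number.
ssn=$(printf '%s\n' "$input" | grep -c '[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}')

# A saved sub-pattern replayed inside the search pattern itself.
twice=$(printf '%s\n' "$input" | grep '\(why\).*\1')

# Extended syntax: grouping and alternation.
plural=$(printf '%s\n' "$input" | grep -E -c 'compan(y|ies)')

echo "$ssn"
echo "$twice"
echo "$plural"
```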
The following examples show the metacharacters available to sed or ex. Note that ex commands begin with a colon.
Finally, some sed examples for transposing words. A simple transposition of two words might look like this:
s/die or do/do or die/          Transpose words
The real trick is to use hold buffers to transpose variable patterns. For example:
s/\([Dd]ie\) or \([Dd]o\)/\2 or \1/          Transpose, using hold buffers
This section presents the following topics:
Conceptual overview of sed
Command-line syntax
Syntax of sed commands
Group summary of sed commands
Alphabetical summary of sed commands
sed is a non-interactive, or stream-oriented, editor. It interprets a script and performs the actions in the script. sed is stream-oriented because, like many Unix programs, input flows through the program and is directed to standard output. For example, sort is stream-oriented; vi is not. sed's input typically comes from a file or pipe, but it can also be directed from the keyboard. Output goes to the screen by default but can be captured in a file or sent through a pipe instead.
The Free Software Foundation has a version of sed, available from ftp://gnudist.gnu.org/gnu/sed/sed-3.02.tar.gz. The somewhat older version, 2.05, is also available.
Typical uses of sed include:
Editing one or more files automatically
Simplifying repetitive edits to multiple files
Writing conversion programs
sed operates as follows:
Each line of input is copied into a "pattern space," an internal buffer where editing operations are performed.
All editing commands in a sed script are applied, in order, to each line of input.
Editing commands are applied to all lines (globally) unless line addressing restricts the lines affected.
If a command changes the input, subsequent commands and address tests will be applied to the current line in the pattern space, not the original input line.
The original input file is unchanged because the editing commands modify a copy of each original input line. The copy is sent to standard output (but can be redirected to a file).
sed also maintains the "hold space," a separate buffer that can be used to save data for later retrieval.
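The points above can be seen in a small experiment. The following sketch assumes a temporary file made with mktemp and made-up input; the line-reversal script is a well-known hold-space idiom rather than anything taken from this text.

```shell
#!/bin/sh
# Commands run in order on the pattern space: the second s command
# sees the result of the first, not the original input line.
chained=$(echo 'cat' | sed -e 's/cat/dog/' -e 's/dog/bird/')

# The input file itself is never modified; only the copy is edited.
tmp=$(mktemp)
printf 'cat\n' > "$tmp"
edited=$(sed 's/cat/dog/' "$tmp")
original=$(cat "$tmp")
rm -f "$tmp"

# Hold space: on every line but the first, append the held text (G),
# then save the pattern space (h); print once, at the last line ($p).
reversed=$(printf '1\n2\n3\n' | sed -n '1!G;h;$p')

echo "$chained"
echo "$edited / $original"
echo "$reversed"
```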
The syntax for invoking sed has two forms:
sed [-n] [-e] 'command' file(s)
sed [-n] -f scriptfile file(s)
The first form allows you to specify an editing command on the command line, surrounded by single quotes. The second form allows you to specify a scriptfile, a file containing sed commands. Both forms may be used together, and they may be used multiple times. If no file(s) is specified, sed reads from standard input.
The following options are recognized:
-n
Suppress the default output; sed displays only those lines specified with the p command or with the p flag of the s command.
-e cmd
Next argument is an editing command. Useful if multiple scripts or commands are specified.
-f file
Next argument is a file containing editing commands.
If the first line of the script is #n, sed behaves as if -n had been specified.
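A sketch of the options in use follows, assuming a scratch directory from mktemp. The file names in.txt, edit.sed, and quiet.sed are hypothetical, as is the sample input.

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$tmp/in.txt"

# -n suppresses default output; only lines printed by p appear.
only=$(sed -n '/beta/p' "$tmp/in.txt")

# Several -e options supply several editing commands.
both=$(sed -e 's/alpha/A/' -e 's/gamma/G/' "$tmp/in.txt")

# -f reads the editing commands from a script file.
printf '%s\n' 's/beta/B/' '$d' > "$tmp/edit.sed"
from_file=$(sed -f "$tmp/edit.sed" "$tmp/in.txt")

# A script whose first line is #n behaves as if -n had been given.
printf '%s\n' '#n' '/gamma/p' > "$tmp/quiet.sed"
quiet=$(sed -f "$tmp/quiet.sed" "$tmp/in.txt")

rm -rf "$tmp"

echo "$only"
echo "$both"
echo "$from_file"
echo "$quiet"
```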
sed commands have the general form:
[address[, address]][!]command [arguments]
sed copies each line of input into the pattern space. sed instructions consist of addresses and editing commands. If the address of the command matches the line in the pattern space, then the command is applied to that line. If a command has no address, then it is applied to each input line. If a command changes the contents of the pattern space, subsequent commands and addresses will be applied to the current line in the pattern space, not the original input line.
commands consist of a single letter or symbol; they are described later, alphabetically and by group. arguments include the label supplied to b or t, the filename supplied to r or w, and the substitution flags for s. addresses are described in the next section.
A sed command can specify zero, one, or two addresses. An address can be a line number, the symbol $ (for last line), or a regular expression enclosed in slashes (/pattern/). Regular expressions are described in Section 1.3. Additionally, \n can be used to match any newline in the pattern space (resulting from the N command), but not the newline at the end of the pattern space.
| If the Command Specifies: | Then the Command Is Applied To: |
|---|---|
| No address | Each input line. |
| One address | Any line matching the address. Some commands accept only one address: a, i, r, q, and =. |
| Two comma-separated addresses | First matching line and all succeeding lines up to and including a line matching the second address. |
| An address followed by ! | All lines that do not match the address. |
| s/xx/yy/g | Substitute on all lines (all occurrences). |
| /BSD/d | Delete lines containing BSD. |
| /^BEGIN/,/^END/p | Print between BEGIN and END, inclusive. |
| /SAVE/!d | Delete any line that doesn't contain SAVE. |
| /BEGIN/,/END/!s/xx/yy/g | Substitute on all lines, except between BEGIN and END. |
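The address forms can be compared on one small input. This is a minimal sketch; the sample lines and variable names are assumptions made for illustration.

```shell
#!/bin/sh
# Illustrative sample input (not from the text).
input='header
BEGIN
one
END
footer'

# One address: a line number, or $ for the last line.
first=$(printf '%s\n' "$input" | sed -n '1p')
last=$(printf '%s\n' "$input" | sed -n '$p')

# Two addresses: from the first match through the second, inclusive.
range=$(printf '%s\n' "$input" | sed -n '/^BEGIN/,/^END/p')

# ! inverts the range: substitute everywhere except BEGIN through END.
outside=$(printf '%s\n' "$input" | sed '/^BEGIN/,/^END/!s/e/E/g')

echo "$first $last"
echo "$range"
echo "$outside"
```

In the last command the line "one" keeps its lowercase e because it falls inside the excluded range.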
Braces ({ }) are used in sed to nest one address inside another or to apply multiple commands at the same address.
[/pattern/[,/pattern/]]{
command1
command2
}
The opening curly brace must end its line, and the closing curly brace must be on a line by itself. Be sure there are no spaces after the braces.
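As a sketch of grouping, the script below applies three commands to one address range and nests single addresses inside it. The marker lines and the two-space indent are illustrative choices, not part of the original text.

```shell
#!/bin/sh
# Within the BEGIN..END range: drop the two marker lines and
# indent whatever remains. The opening brace ends its line and
# the closing brace stands alone.
script='/^BEGIN/,/^END/{
/^BEGIN/d
/^END/d
s/^/  /
}'

out=$(printf 'top\nBEGIN\nbody\nEND\n' | sed "$script")

echo "$out"
```

The line "top" lies outside the range and passes through untouched; "body" is the only line that reaches the s command.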
In the lists that follow, the sed commands are grouped by function and are described tersely. Full descriptions, including syntax and examples, can be found afterward in Section 1.4.5.
| a\ | Append text after a line. |
| c\ | Replace text (usually a text block). |
| i\ | Insert text before a line. |
| d | Delete lines. |
| s | Make substitutions. |
| y | Translate characters (like Unix tr). |
| = | Display line number of a line. |
| l | Display control characters in ASCII. |
| p | Display the line. |
| n | Skip current line and go to line below. |
| r | Read another file's contents into the output stream. |
| w | Write input lines to another file. |
| q | Quit the sed script (no further output). |
| h | Copy into hold space; wipe out what's there. |
| H | Copy into hold space; append to what's there. |
| g | Get the hold space back; wipe out the destination line. |
| G | Get the hold space back; append to the pattern space. |
| x | Exchange contents of the hold and pattern spaces. |
| b | Branch to label or to end of script. |
| t | Same as b, but branch only after substitution. |
| :label | Label branched to by t or b. |
| N | Read another line of input (creates embedded newline). |
| D | Delete up to the embedded newline. |
| P | Print up to the embedded newline. |
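A few commands from different groups are sketched here on made-up input. The variable names are hypothetical, and the N example uses an even number of input lines so that every line has a partner to join with.

```shell
#!/bin/sh
# y translates characters one for one, like tr.
upper=$(echo 'hello' | sed 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/')

# N appends the next input line, leaving an embedded newline
# that s can then replace.
joined=$(printf 'a\nb\nc\nd\n' | sed 'N;s/\n/ /')

# q quits after the addressed line has been output.
first_two=$(printf 'a\nb\nc\nd\n' | sed '2q')

# = displays the line number of each addressed line.
numbered=$(printf 'x\ny\n' | sed -n '/y/=')

echo "$upper"
echo "$joined"
echo "$first_two"
echo "$numbered"
```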
This section presents the following topics:
Conceptual overview
Command-line syntax
Patterns and procedures
Built-in variables
Operators
Variables and array assignment
User-defined functions
Group listing of functions and commands
Implementation limits
Alphabetical summary of functions and commands
awk is a pattern-matching program for processing files, especially when they are databases. The new version of awk, called nawk, provides additional capabilities. (It really isn't so new. The additional features were added in 1984, and it was first shipped with System V Release 3.1 in 1987. Nevertheless, the name was never changed on most systems.) Every modern Unix system comes with a version of new awk, and its use is recommended over old awk.
Different systems vary in what the two versions are called. Some have oawk and awk, for the old and new versions, respectively. Others have awk and nawk. Still others only have awk, which is the new version. This example shows what happens if your awk is the old one:
$ awk 1 /dev/null
awk: syntax error near line 1
awk: bailing out near line 1
awk will exit silently if it is the new version.
Source code for the latest version of awk, from Bell Labs, can be downloaded starting at Brian Kernighan's home page: http://cm.bell-labs.com/~bwk. Michael Brennan's mawk is available via anonymous FTP from ftp://ftp.whidbey.net/pub/brennan/mawk1.3.3.tar.gz. Finally, the Free Software Foundation has a version of awk called gawk, available from ftp://gnudist.gnu.org/gnu/gawk/gawk-3.0.4.tar.gz. All three programs implement "new" awk. Thus, references in the following text such as "nawk only," apply to all three. gawk has additional features.
With original awk, you can:
Think of a text file as made up of records and fields in a textual database.
Perform arithmetic and string operations.
Use programming constructs such as loops and conditionals.
Produce formatted reports.
With nawk, you can also:
Define your own functions.
Execute Unix commands from a script.
Process the results of Unix commands.
Process command-line arguments more gracefully.
Work more easily with multiple input streams.
In addition, with GNU awk (gawk), you can:
Use regular expressions to separate records, as well as fields.
Skip to the start of the next file, not just the next record.
Perform more powerful string substitutions.
Retrieve and format system time values.
The syntax for invoking awk has two forms:
awk [options] 'script' var=value file(s)
awk [options] -f scriptfile var=value file(s)
You can specify a script directly on the command line, or you can store a script in a scriptfile and specify it with -f. nawk allows multiple -f scripts. Variables can be assigned a value on the command line. The value can be a literal, a shell variable ($name), or a command substitution (`cmd`), but the value is available only after the BEGIN statement is executed.
awk operates on one or more files. If none are specified (or if - is specified), awk reads from the standard input.
The recognized options are:
-Ffs
Set the field separator to fs. This is the same as setting the built-in variable FS. Original awk only allows the field separator to be a single character. nawk allows fs to be a regular expression. Each input line, or record, is divided into fields by white space (spaces or tabs) or by some other user-definable field separator. Fields are referred to by the variables $1, $2,…, $n. $0 refers to the entire record.
-v var=value
Available in nawk only. Assign a value to variable var. This allows assignment before the script begins execution.
For example, to print the first three (colon-separated) fields of each record on separate lines:
awk -F: '{ print $1; print $2; print $3 }' /etc/passwd
Numerous examples are shown later in Section 1.5.3.3.
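The difference between a var=value argument and the -v option shows up in BEGIN. The following is a minimal sketch for a "new" awk; the variable n, the temporary file, and the sample record are assumptions made for illustration.

```shell
#!/bin/sh
tmp=$(mktemp)
printf 'x\n' > "$tmp"

# var=value on the command line takes effect only after BEGIN has run.
late=$(awk 'BEGIN { print "BEGIN sees [" n "]" }
                  { print "main sees [" n "]" }' n=5 "$tmp")

# -v assigns before the script begins execution, so BEGIN sees the value.
early=$(awk -v n=5 'BEGIN { print "BEGIN sees [" n "]" }')

# -F sets the field separator, here a colon.
fields=$(echo 'root:x:0:0' | awk -F: '{ print $1; print $3 }')

rm -f "$tmp"

echo "$late"
echo "$early"
echo "$fields"
```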
awk scripts consist of patterns and procedures:
pattern { procedure }
Both are optional. If pattern is missing, { procedure } is applied to all lines. If { procedure } is missing, the matched line is printed.
A pattern can be any of the following:
/regular expression/
relational expression
pattern-matching expression
BEGIN
END
Expressions can be composed of quoted strings, numbers, operators, functions, defined variables, or any of the predefined variables described later under Section 1.5.4.
Regular expressions use the extended set of metacharacters and are described earlier in Section 1.3.
^ and $ refer to the beginning and end of a string (such as the fields), respectively, rather than the beginning and end of a line. In particular, these metacharacters will not match at a newline embedded in the middle of a string.
Relational expressions use the relational operators listed under "Operators" later in this book. For example, $2 > $1 selects lines for which the second field is greater than the first. Comparisons can be either string or numeric. Thus, depending on the types of data in $1 and $2, awk will do either a numeric or a string comparison. This can change from one record to the next.
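The switch between numeric and string comparison can be observed with two made-up records, as in this sketch for a "new" awk. The input is an assumption chosen so that the two kinds of comparison give different answers.

```shell
#!/bin/sh
# First record: both fields look like numbers, so 10 > 9 is tested
# numerically and the line is selected.
# Second record: the fields are strings, and "10x" sorts before "9x",
# so the line is not selected.
selected=$(printf '9 10\n9x 10x\n' | awk '$2 > $1')

echo "$selected"
```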
Pattern-matching expressions use the operators ~ (match) and !~ (don't match). See "Operators" later in this book.
The BEGIN pattern lets you specify procedures that will take place before the first input line is processed. (Generally, you set global variables here.)
The END pattern lets you specify procedures that will take place after the last input record is read.
In nawk, BEGIN and END patterns may appear multiple times. The procedures are merged as if there had been one large procedure.
Except for BEGIN and END, patterns can be combined with the Boolean operators || (or), && (and), and ! (not). A range of lines can also be specified using comma-separated patterns:
pattern,pattern
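A short sketch of a range pattern, a Boolean combination, and the BEGIN and END patterns follows. The sample input and variable names are illustrative assumptions.

```shell
#!/bin/sh
# Illustrative sample input (not from the text).
input='intro
START
a 1
b 22
STOP
outro'

# Range: from a line matching START through a line matching STOP.
range=$(printf '%s\n' "$input" | awk '/START/,/STOP/')

# Boolean combination of two relational expressions.
combo=$(printf '%s\n' "$input" | awk 'NF == 2 && $2 > 10')

# BEGIN runs before the first record, END after the last.
summary=$(printf '%s\n' "$input" |
    awk 'BEGIN { print "begin" } END { print NR " records" }')

echo "$range"
echo "$combo"
echo "$summary"
```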
Procedures consist of one or more commands, functions, or variable assignments, separated by newlines or semicolons, and are contained within curly braces. Commands fall into five groups:
Variable or array assignments
Printing commands
Built-in functions
Control-flow commands
User-defined functions (nawk only)
Print first field of each line:
{ print $1 }
Print all lines that contain pattern:
/pattern/
Print first field of lines that contain pattern:
/pattern/ { print $1 }
Select records containing more than two fields:
NF > 2
Interpret input records as a group of lines up to a blank line. Each line is a single field:
BEGIN { FS = "\n"; RS = "" }
Print fields 2 and 3 in switched order, but only on lines whose first field matches the string URGENT:
$1 ~ /URGENT/ { print $3, $2 }
Count and print the number of lines that match pattern:
/pattern/ { ++x }
END { print x }
Add numbers in second column and print total:
{ total += $2 }
END { print "column total is", total}
Print lines that contain fewer than 20 characters:
length($0) < 20
Print each line that begins with Name: and that contains exactly seven fields:
NF == 7 && /^Name:/
Print the fields of each record in reverse order, one per line:
{
for (i = NF; i >= 1; i--)
print $i
}
All awk variables are included in nawk. All nawk variables are included in gawk.
The following table lists the operators, in order of increasing precedence, that are available in awk.
| Symbol | Meaning |
|---|---|
| = += -= *= /= %= ^= **= | Assignment. |
| ?: | C conditional expression (nawk only). |
| || | Logical OR (short-circuit). |
| && | Logical AND (short-circuit). |
| in | Array membership (nawk only). |
| ~ !~ | Match regular expression and negation. |
| < <= > >= != == | Relational operators. |
| (blank) | Concatenation. |
| + - | Addition, subtraction. |
| * / % | Multiplication, division, and modulus (remainder). |
| + - ! | Unary plus and minus, and logical negation. |
| ^ ** | Exponentiation. |
| ++ -- | Increment and decrement, either prefix or postfix. |
| $ | Field reference. |
Note: While ** and **= are common extensions, they are not part of POSIX awk.
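Precedence matters most where operators are easily misread. The sketch below checks three cases from the table in a "new" awk; the expressions and input are illustrative, and parentheses remain the safer choice in real scripts.

```shell
#!/bin/sh
# Exponentiation binds tighter than unary minus: -(2^2).
p1=$(awk 'BEGIN { print -2 ^ 2 }')

# Addition binds tighter than concatenation: 1 " " (2 + 3).
p2=$(awk 'BEGIN { print 1 " " 2 + 3 }')

# $ binds tightest: $NF-1 is ($NF) - 1, not the next-to-last field.
p3=$(echo 'a b c' | awk '{ print $NF-1 }')
p4=$(echo 'a b c' | awk '{ print $(NF-1) }')

echo "$p1"
echo "$p2"
echo "$p3"
echo "$p4"
```

In the third case the last field, "c", has the numeric value 0, so the result is -1.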
Variables can be assigned a value with an = sign. For example:
FS = ","
Expressions using the operators +, -, *, /, and % (modulo) can be assigned to variables.
Arrays can be created with the split( ) function (described later), or they can simply be named in an assignment statement. Array elements can be subscripted with numbers (array[1], …, array[n]) or with strings. Arrays subscripted by strings are called "associative arrays." (In fact, all arrays in awk are associative; numeric subscripts are converted to strings before using them as array subscripts. Associative arrays are one of awk's most powerful features.)
For example, to count the number of widgets you have, you could use the following script:
/widget/ { count["widget"]++ }      # Count widgets
END      { print count["widget"] }  # Print the count
You can use the special for loop to read all the elements of an associative array:
for (item in array) process array[item]
The index of the array is available as item, while the value of an element of the array can be referenced as array[item].
You can use the operator in to test that an element exists by testing to see if its index exists (nawk only). For example:
if (index in array) …
tests that array[index] exists, but you cannot use it to test the value of the element referenced by array[index].
You can also delete individual elements of the array using the delete statement (nawk only).
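The array features described above fit together in a few lines. This is a minimal sketch for a "new" awk, with made-up input; the output is piped through sort because the order in which for (item in array) visits elements is not defined.

```shell
#!/bin/sh
counts=$(printf '%s\n' apple pear apple fig | awk '
    { count[$1]++ }                  # subscript by string
    END {
        delete count["fig"]          # remove a single element
        if (!("fig" in count))       # test the index, not the value
            print "fig deleted"
        for (item in count)          # visit every remaining index
            print item, count[item]
    }' | sort)

echo "$counts"
```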
Within string and regular expression constants, the following escape sequences may be used.
| Sequence | Meaning | Sequence | Meaning |
|---|---|---|---|
| \a | Alert (bell) | \v | Vertical tab |
| \b | Backspace | \\ | Literal backslash |
| \f | Form feed | \nnn | Octal value nnn |
| \n | Newline | \xnn | Hexadecimal value nn |
| \r | Carriage return | \" | Literal double quote (in strings) |
| \t | Tab | \/ | Literal slash (in regular expressions) |
Note: The \x escape sequence is a common extension; it is not part of POSIX awk.
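A brief sketch of some of these escape sequences follows, assuming a "new" awk. The strings are illustrative, and \x is avoided because it is an extension.

```shell
#!/bin/sh
# \t and \n inside a string constant.
tabbed=$(awk 'BEGIN { printf "a\tb\n" }')

# Octal escapes: 101 and 102 are the ASCII codes for A and B.
octal=$(awk 'BEGIN { print "\101\102" }')

# \/ gives a literal slash inside a regular expression constant.
slash=$(echo 'usr/bin' | awk '/usr\/bin/ { print "matched" }')

# \" gives a literal double quote inside a string.
quoted=$(awk 'BEGIN { print "say \"hi\"" }')

echo "$tabbed"
echo "$octal"
echo "$slash"
echo "$quoted"
```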
nawk allows you to define your own functions. This makes it easy to encapsulate sequences of steps that need to be repeated into a single place, and re-use the code from anywhere in your program.
The following function capitalizes each word in a string. It has one parameter, named input, and five local variables, which are written as extra parameters:
# capitalize each word in a string
function capitalize(input,    result, words, n, i, w)
{
    result = ""
    n = split(input, words, " ")
    for (i = 1; i <= n; i++) {
        w = words[i]
        w = toupper(substr(w, 1, 1)) substr(w, 2)
        if (i > 1)
            result = result " "
        result = result w
    }
    return result
}
# main program, for testing
{ print capitalize($0) }
With this input data:
A test line with words and numbers like 12 on it.
This program produces:
A Test Line With Words And Numbers Like 12 On It.
Note: For user-defined functions, no space is allowed between the function name and the left parenthesis when the function is called.
awk functions and commands may be classified as follows:
| Group | Functions and Commands | | |
|---|---|---|---|
| Arithmetic Functions | atan2[2] | int | sin[2] |
| | cos[2] | log | sqrt |
| | exp | rand[2] | srand[2] |
| String Functions | index | match[2] | tolower[2] |
| | gensub[9] | split | toupper[2] |
| | gsub[2] | sprintf | |
| | length | sub[2] | |
| Control Flow Statements | break | exit | return[2] |
| | continue | for | while |
| | do/while[2] | if | |
| Input/Output Processing | close[2] | next | printf |
| | fflush[16] | nextfile[16] | |
| | getline[2] | | |
| Time Functions | strftime[9] | systime[9] | |
| Programming | delete[2] | function[2] | system[2] |
[2] Available in nawk.
[9] Available in gawk.
[16] Available in Bell Labs awk and gawk.
Many versions of awk have various implementation limits, on things such as:
Number of fields per record
Number of characters per input record
Number of characters per output record
Number of characters per field
Number of characters per printf string
Number of characters in literal string
Number of characters in character class
Number of files open
Number of pipes open
The ability to handle 8-bit characters and characters that are all zero (ASCII NUL)
gawk does not have limits on any of the above items, other than those imposed by the machine architecture and/or the operating system.
The following alphabetical list of keywords and functions includes all that are available in awk, nawk, and gawk. nawk includes all old awk functions and keywords, plus some additional ones (marked as {N}). gawk includes all nawk functions and keywords, plus some additional ones (marked as {G}). Items marked with {B} are available in the Bell Labs awk. Items that aren't marked with a symbol are available in all versions.
| Command | Description |
|---|---|
| atan2 | atan2(y, x) Return the arctangent of y/x in radians. {N} |
| break | break Exit from a while, for, or do loop. |
| close | close(expr) In most implementations of awk, you can only have up to ten files open simultaneously and one pipe. Therefore, nawk provides a close function that allows you to close a file or a pipe. It takes the same expression that opened the pipe or file as an argument. This expression must be identical, character by character, to the one that opened the file or pipe—even whitespace is significant. {N} |
| continue | continue Begin next iteration of while, for, or do loop. |
| cos | cos(x) Return the cosine of x, an angle in radians. {N} |
| delete | delete array[element] delete array Delete element from array. The brackets are typed literally. {N} The second form is a common extension, which deletes all elements of the array at one shot. {B} {G} |
| do | do statement while (expr) Looping statement. Execute statement, then evaluate expr and if true, execute statement again. A series of statements must be put within braces. {N} |
| exit | exit [expr] Exit from script, reading no new input. The END procedure, if it exists, will be executed. An optional expr becomes awk's return value. |
| exp | exp(x) Return exponential of x (e^x). |
| fflush | fflush([output-expr]) Flush any buffers associated with open output file or pipe output-expr. {B} gawk extends this function. If no output-expr is supplied, it flushes standard output. If output-expr is the null string (""), it flushes all open files and pipes. {G} |
| for | for (init-expr; test-expr; incr-expr) statement C-style looping construct. init-expr assigns the initial value of a counter variable. test-expr is a relational expression that is evaluated each time before executing the statement. When test-expr is false, the loop is exited. incr-expr is used to increment the counter variable after each pass. All of the expressions are optional. A missing test-expr is considered to be true. A series of statements must be put within braces. |
| for | for (item in array) statement Special loop designed for reading associative arrays. For each element of the array, the statement is executed; the element can be referenced by array[item]. A series of statements must be put within braces. |
| function | function name(parameter-list) { statements } Create name as a user-defined function consisting of awk statements that apply to the specified list of parameters. No space is allowed between name and the left parenthesis when the function is called. {N} |
| getline | getline [var] [< file] command \| getline [var] Read next line of input. Original awk does not support the syntax to open multiple input streams. The first form reads input from file and the second form reads the output of command. Both forms read one record at a time, and each time the statement is executed it gets the next record of input. The record is assigned to $0 and is parsed into fields, setting NF, NR and FNR. If var is specified, the result is assigned to var and $0 and NF are not changed. Thus, if the result is assigned to a variable, the current record does not change. getline is actually a function and it returns 1 if it reads a record successfully, 0 if end-of-file is encountered, and -1 if for some reason it is otherwise unsuccessful. {N} |
| gensub | gensub(r, s, h [, t]) General substitution function. Substitute s for matches of the regular expression r in the string t. If h is a number, replace the hth match. If it is "g" or "G", substitute globally. If t is not supplied, $0 is used. Return the new string value. The original t is not modified. (Compare gsub and sub.) {G} |
| gsub | gsub(r, s [, t]) Globally substitute s for each match of the regular expression r in the string t. If t is not supplied, defaults to $0. Return the number of substitutions. {N} |
| if | if (condition) statement [else statement] If condition is true, do statement(s); otherwise do statement in optional else clause. Condition can be an expression using any of the relational operators <, <=, ==, !=, >=, or >, as well as the array membership operator in, and the pattern-matching operators ~ and !~ (e.g., if ($1 ~ /[Aa].*/)). A series of statements must be put within braces. Another if can directly follow an else in order to produce a chain of tests or decisions. |
| index | index(str, substr) Return the position (starting at 1) of substr in str, or zero if substr is not present in str. |
| int | int(x) Return integer value of x by truncating any fractional part. |
| length | length([arg]) Return length of arg, or the length of $0 if no argument. |
| log | log(x) Return the natural logarithm (base e) of x. |
| match | match(s, r) Function that matches the pattern, specified by the regular expression r, in the string s and returns either the position in s where the match begins, or 0 if no occurrences are found. Sets the values of RSTART and RLENGTH to the start and length of the match, respectively. {N} |
| next | next Read next input line and start new cycle through pattern/procedures statements. |
| nextfile | nextfile Stop processing the current input file and start new cycle through pattern/procedures statements, beginning with the first record of the next file. {B} {G} |
| print | print [ output-expr[ , …]] [ dest-expr ] Evaluate the output-expr and direct it to standard output, followed by the value of ORS. Each comma-separated output-expr is separated in the output by the value of OFS. With no output-expr, print $0. The output may be redirected to a file or pipe via the dest-expr, which is described in the section "Output Redirections" following this table. |
| printf | printf(format [, expr-list ]) [ dest-expr ] An alternative output statement borrowed from the C language. It has the ability to produce formatted output. It can also be used to output data without automatically producing a newline. format is a string of format specifications and constants. expr-list is a list of arguments corresponding to format specifiers. As for print, output may be redirected to a file or pipe. See the section "printf formats" following this table for a description of allowed format specifiers. |
| rand | rand() Generate a random number between 0 and 1. This function returns the same series of numbers each time the script is executed, unless the random number generator is seeded using srand( ). {N} |
| return | return [expr] Used within a user-defined function to exit the function, returning value of expression. The return value of a function is undefined if expr is not provided. {N} |
| sin | sin(x) Return the sine of x, an angle in radians. {N} |
| split | split(string, array [, sep]) Split string into elements of array array[1],…,array[n]. The string is split at each occurrence of separator sep. If sep is not specified, FS is used. Returns the number of array elements created. |
| sprintf | sprintf(format [, expressions]) Return the formatted value of one or more expressions, using the specified format. Data is formatted but not printed. See the section "printf formats" following this table for a description of allowed format specifiers. |
| sqrt | sqrt(arg) Return square root of arg. |
| srand | srand([expr]) Use optional expr to set a new seed for the random number generator. Default is the time of day. Return value is the old seed. {N} |
| strftime | strftime([format [,timestamp]]) Format timestamp according to format. Return the formatted string. The timestamp is a time-of-day value in seconds since Midnight, January 1, 1970, UTC. The format string is similar to that of sprintf. If timestamp is omitted, it defaults to the current time. If format is omitted, it defaults to a value that produces output similar to that of the Unix date command. {G} |
| sub | sub(r, s [, t]) Substitute s for first match of the regular expression r in the string t. If t is not supplied, defaults to $0. Return 1 if successful; 0 otherwise. {N} |
| substr | substr(string, beg [, len]) Return substring of string at beginning position beg, and the characters that follow to maximum specified length len. If no length is given, use the rest of the string. |
| system | system(command) Function that executes the specified command and returns its status. The status of the executed command typically indicates success or failure. A value of 0 means that the command executed successfully. A non-zero value indicates a failure of some sort. The documentation for the command you're running will give you the details. The output of the command is not available for processing within the awk script. Use command \| getline to read the output of a command into the script. {N} |
| systime | systime( ) Return a time-of-day value in seconds since Midnight, January 1, 1970, UTC. {G} |
| tolower | tolower(str) Translate all uppercase characters in str to lowercase and return the new string.[24] {N} |
| toupper | toupper(str) Translate all lowercase characters in str to uppercase and return the new string. {N} |
| while | while (condition) statement Do statement while condition is true (see if for a description of allowable conditions). A series of statements must be put within braces. |
[24] Very early versions of nawk don't support tolower() and toupper(). However, they are now part of the POSIX specification for awk.
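Several of the string functions in the table above can be tried together. The following is only a sketch: the shell function name demo_strings and the sample strings are assumptions made for illustration, and the functions marked {N} require a new awk.

```shell
# demo_strings: hypothetical helper exercising index, substr, match,
# gsub, sub, and split on fixed sample strings.
demo_strings() {
    awk 'BEGIN {
        s = "sed and awk"
        print index(s, "awk")            # position of substring, starting at 1
        print substr(s, 1, 3)            # first three characters
        if (match(s, /a[a-z]+/))         # sets RSTART and RLENGTH
            print RSTART, RLENGTH
        t = s
        n = gsub(/a/, "A", t)            # returns the number of substitutions
        print n, t
        sub(/sed/, "ed", t)              # replaces the first match only
        print t
        n = split("a:b:c", parts, ":")   # returns the number of elements
        print n, parts[3]
    }'
}
```

Copying s into t before calling gsub keeps the original string intact, since sub and gsub modify their third argument in place.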
For print and printf, dest-expr is an optional expression that directs the output to a file or pipe.
> file
Directs the output to a file, overwriting its previous contents.
>> file
Appends the output to a file, preserving its previous contents. In both of these cases, the file will be created if it does not already exist.
| command
Directs the output as the input to a system command.
Be careful not to mix > and >> for the same file. Once a file has been opened with >, subsequent output statements continue to append to the file until it is closed.
Remember to call close() when you have finished with a file or pipe. If you don't, eventually you will hit the system limit on the number of simultaneously open files.
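As an illustration of redirection combined with close(), the sketch below writes each input record to a file named after its first field. The function name split_by_field, the .txt suffix, and the directory argument are assumptions for this example only.

```shell
# split_by_field: hypothetical helper. Appends each input line to
# DIR/<first field>.txt, where DIR is the first argument and must exist.
split_by_field() {
    awk -v dir="$1" '{
        out = dir "/" $1 ".txt"   # build the filename in a variable first
        print $0 >> out           # >> appends, so reopening never truncates
        close(out)                # release the file before the next record
    }'
}
```

Because the file is closed after every record, >> is used rather than >; reopening a closed file with > would overwrite what was written earlier.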
Format specifiers for printf and sprintf have the following form:
%[flag][width][.precision]letter
The control letter is required. The format conversion control letters are given in the following table.
| Character | Description |
|---|---|
| c | ASCII character. |
| d | Decimal integer. |
| i | Decimal integer. (Added in POSIX) |
| e | Floating-point format ([-]d.precisione[+-]dd). |
| E | Floating-point format ([-]d.precisionE[+-]dd). |
| f | Floating-point format ([-]ddd.precision). |
| g | e or f conversion, whichever is shortest, with trailing zeros removed. |
| G | E or f conversion, whichever is shortest, with trailing zeros removed. |
| o | Unsigned octal value. |
| s | String. |
| x | Unsigned hexadecimal number. Uses a-f for 10 to 15. |
| X | Unsigned hexadecimal number. Uses A-F for 10 to 15. |
| % | Literal %. |
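The optional flag is one of the following:
| Character | Description |
|---|---|
| - | Left-justify the formatted value within the field. |
| space | Prefix positive values with a space and negative values with a minus. |
| + | Always prefix numeric values with a sign, even if the value is positive. |
| # | Use an alternate form: %o has a preceding 0; %x and %X are prefixed with 0x and 0X, respectively; %e, %E, and %f always have a decimal point in the result; and %g and %G do not have trailing zeros removed. |
| 0 | Pad output with zeros, not spaces. This happens only when the field width is wider than the converted result. |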
The optional width is the minimum number of characters to output. The result will be padded to this size if it is smaller. The 0 flag causes padding with zeros; otherwise, padding is with spaces.
The precision is optional. Its meaning varies by control letter, as shown in this table:
| Conversion | Precision Means |
|---|---|
| %d, %i, %o, %u, %x, %X | The minimum number of digits to print. |
| %e, %E, %f | The number of digits to the right of the decimal point. |
| %g, %G | The maximum number of significant digits. |
| %s | The maximum number of characters to print. |
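A few format specifiers can be compared side by side. This is a minimal sketch; the function name show_formats and the sample values are assumptions chosen for illustration.

```shell
# show_formats: hypothetical helper printing fixed values with
# different flag, width, and precision settings.
show_formats() {
    awk 'BEGIN {
        printf "%5d|\n", 42            # width 5, padded with spaces on the left
        printf "%-5d|\n", 42           # - flag: left-justified
        printf "%05d|\n", 42           # 0 flag: padded with zeros
        printf "%.2f\n", 3.14159       # two digits after the decimal point
        printf "%.3s\n", "awkward"     # at most three characters
        printf "%x %X %o\n", 255, 255, 8
        printf "%c%%\n", "A"           # a character, then a literal %
    }'
}
```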
A number of Unix text-processing utilities let you search for, and in some cases change, text patterns rather than fixed strings. These utilities include the editing programs ed, ex, vi, and sed, the awk programming language, and the commands grep and egrep. Text patterns (formally called regular expressions) contain normal characters mixed with special characters (called metacharacters).
This section presents the following topics:
Filenames versus patterns
List of metacharacters available to each program
Description of metacharacters
Examples
Metacharacters used in pattern matching are different from metacharacters used for filename expansion. When you issue a command on the command line, special characters are seen first by the shell, then by the program; therefore, unquoted metacharacters are interpreted by the shell for filename expansion. The command:
$ grep [A-Z]* chap[12]
could, for example, be transformed by the shell into:
$ grep Array.c Bug.c Comp.c chap1 chap2
and would then try to find the pattern Array.c in files Bug.c, Comp.c, chap1, and chap2. To bypass the shell and pass the special characters to grep, use quotes:
$ grep "[A-Z]*" chap[12]
Double quotes suffice in most cases, but single quotes are the safest bet.
Note also that in pattern matching, ? matches zero or one instance of a regular expression; in filename expansion, ? matches a single character.
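The effect of quoting can be observed without grep by printing the arguments a command actually receives. The sketch below is an illustration under stated assumptions: show_args and demo_quoting are hypothetical names, the files are created in a temporary directory, and the C locale is forced so that [A-Z] matches only uppercase letters.

```shell
# show_args: hypothetical helper that prints each argument on its own line.
show_args() {
    for arg in "$@"; do
        printf '<%s>\n' "$arg"
    done
}

# demo_quoting: creates sample files, then shows the same pattern
# unquoted (expanded by the shell) and quoted (passed through intact).
demo_quoting() {
    dir=$(mktemp -d) || return 1
    (
        LC_ALL=C; export LC_ALL
        cd "$dir" || exit 1
        touch Array.c Bug.c chap1 chap2
        show_args [A-Z]* chap[12]      # shell expands both arguments
        show_args '[A-Z]*' chap[12]    # single quotes protect the pattern
    )
    rm -rf "$dir"
}
```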
The characters in the following table have special meaning only in search patterns:
Many Unix systems allow the use of POSIX "character classes" within the square brackets that enclose a group of characters. These are typed enclosed in [: and :]. For example, [[:alnum:]] matches a single alphanumeric character.
| Class | Characters Matched |
|---|---|
| alnum | Alphanumeric characters |
| alpha | Alphabetic characters |
| blank | Space or tab |
| cntrl | Control characters |
| digit | Decimal digits |
| graph | Non-space characters |
| lower | Lowercase characters |
| print | Printable characters |
| space | White-space characters |
| upper | Uppercase characters |
| xdigit | Hexadecimal digits |
The characters in the following table have special meaning only in replacement patterns.
| Character | Pattern |
|---|---|
| \ | Turn off the special meaning of the following character. |
| \n | Restore the text matched by the nth pattern previously saved by \( and \). n is a number from 1 to 9, with 1 starting on the left. |
| & | Reuse the text matched by the search pattern as part of the replacement pattern. |
| ~ | Reuse the previous replacement pattern in the current replacement pattern. Must be the only character in the replacement pattern. (ex and vi). |
| % | Reuse the previous replacement pattern in the current replacement pattern. Must be the only character in the replacement pattern. (ed). |
| \u | Convert first character of replacement pattern to uppercase. |
| \U | Convert entire replacement pattern to uppercase. |
| \l | Convert first character of replacement pattern to lowercase. |
| \L | Convert entire replacement pattern to lowercase. |
Some metacharacters are valid for one program but not for another. Those that are available to a Unix program are marked by a bullet (•) in the following table. (This table is correct for SVR4 and Solaris and most commercial Unix systems, but it's always a good idea to verify your system's behavior.) Items marked with a "P" are specified by POSIX; double check your system's version. Full descriptions were provided in the previous section.
| Symbol | ed | ex/vi | sed/grep | awk/egrep | Action |
|---|---|---|---|---|---|
| . | • | • | • | • | Match any character. |
| * | • | • | • | • | Match zero or more preceding. |
| ^ | • | • | • | • | Match beginning of line/string. |
| $ | • | • | • | • | Match end of line/string. |
| \ | • | • | • | • | Escape following character. |
| [ ] | • | • | • | • | Match one from a set. |
| \( \) | • | • | • | | Store pattern for later replay.[1] |
| \n | • | • | • | | Replay sub-pattern in match. |
| { } | | | | P | Match a range of instances. |
| \{ \} | • | | • | | Match a range of instances. |
| \< \> | • | • | | | Match word's beginning or end. |
| + | | | | • | Match one or more preceding. |
| ? | | | | • | Match zero or one preceding. |
| \| | | | | • | Separate choices to match. |
| ( ) | | | | • | Group expressions to match. |
[1] Stored sub-patterns can be "replayed" during matching. See the examples, below.
Note that in ed, ex, vi, and sed, you specify both a search pattern (on the left) and a replacement pattern (on the right). The metacharacters above are meaningful only in a search pattern.
In ed, ex, vi, and sed, the following metacharacters are valid only in a replacement pattern:
| Symbol | ex | vi | sed | ed | Action |
|---|---|---|---|---|---|
| \ | • | • | • | • | Escape following character. |
| \n | • | • | • | • | Text matching pattern stored in \( \). |
| & | • | • | • | • | Text matching search pattern. |
| ~ | • | • | | | Reuse previous replacement pattern. |
| % | | | | • | Reuse previous replacement pattern. |
| \u \U | • | • | | | Change character(s) to uppercase. |
| \l \L | • | • | | | Change character(s) to lowercase. |
| \E | • | • | | | Turn off previous \U or \L. |
| \e | • | • | | | Turn off previous \u or \l. |
When used with grep or egrep, regular expressions should be surrounded by quotes. (If the pattern contains a $, you must use single quotes; e.g., 'pattern'.) When used with ed, ex, sed, and awk, regular expressions are usually surrounded by /, although (except for awk) any delimiter works. Here are some example patterns.
| Pattern | What Does It Match? |
|---|---|
| bag | The string bag. |
| ^bag | bag at the beginning of the line. |
| bag$ | bag at the end of the line. |
| ^bag$ | bag as the only word on the line. |
| [Bb]ag | Bag or bag. |
| b[aeiou]g | Second letter is a vowel. |
| b[^aeiou]g | Second letter is a consonant (or uppercase or symbol). |
| b.g | Second letter is any character. |
| ^...$ | Any line containing exactly three characters. |
| ^\. | Any line that begins with a dot. |
| ^\.[a-z][a-z] | Same, followed by two lowercase letters (e.g., troff requests). |
| ^\.[a-z]\{2\} | Same as previous, ed, grep and sed only. |
| ^[^.] | Any line that doesn't begin with a dot. |
| bugs* | bug, bugs, bugss, etc. |
| "word" | A word in quotes. |
| "*word"* | A word, with or without quotes. |
| [A-Z][A-Z]* | One or more uppercase letters. |
| [A-Z]+ | Same as previous, egrep or awk only. |
| [[:upper:]]+ | Same as previous, POSIX egrep or awk. |
| [A-Z].* | An uppercase letter, followed by zero or more characters. |
| [A-Z]* | Zero or more uppercase letters. |
| [a-zA-Z] | Any letter, either lower- or uppercase. |
| [^0-9A-Za-z] | Any symbol or space (not a letter or a number). |
| [^[:alnum:]] | Same, using POSIX character class. |
| egrep or awk pattern | What Does It Match? |
|---|---|
| [567] | One of the numbers 5, 6, or 7. |
| five\|six\|seven | One of the words five, six, or seven. |
| 80[2-4]?86 | 8086, 80286, 80386, or 80486. |
| 80[2-4]?86\|(Pentium(-II)?) | 8086, 80286, 80386, 80486, Pentium, or Pentium-II. |
| compan(y\|ies) | company or companies. |
| ex or vi pattern | What Does It Match? |
|---|---|
| \<the | Words like theater, there or the. |
| the\> | Words like breathe, seethe or the. |
| \<the\> | The word the. |
| ed, sed, or grep pattern | What Does It Match? |
|---|---|
| 0\{5,\} | Five or more zeros in a row. |
| [0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\} | U.S. Social Security number (nnn-nn-nnnn). |
| \(why\).*\1 | A line with two occurrences of why. |
| \([[:alpha:]_][[:alnum:]_.]*\) = \1; | C/C++ simple assignment statements. |
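Two of the ed, sed, or grep patterns above can be tried directly with grep. The function names find_ssn and find_repeat are hypothetical; the patterns themselves are taken from the table. Note that the patterns are unanchored, so they also match inside longer strings.

```shell
# find_ssn: hypothetical helper; prints input lines containing nnn-nn-nnnn.
find_ssn() {
    grep '[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}'
}

# find_repeat: hypothetical helper; prints lines where "why" occurs twice.
# \1 replays whatever \(why\) matched earlier on the same line.
find_repeat() {
    grep '\(why\).*\1'
}
```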
The following examples show the metacharacters available to sed or ex. Note that ex commands begin with a colon.
Finally, some sed examples for transposing words. A simple transposition of two words might look like this:
s/die or do/do or die/     Transpose words
The real trick is to use hold buffers to transpose variable patterns. For example:
s/\([Dd]ie\) or \([Dd]o\)/\2 or \1/     Transpose, using hold buffers
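The variable-pattern transposition can be checked from the shell. In this sketch the function name transpose is an assumption; the sed command is the one shown above.

```shell
# transpose: hypothetical wrapper around the substitution shown above.
# \1 and \2 replay the text saved by the first and second \( \) pairs,
# so the original capitalization of each word is preserved.
transpose() {
    sed 's/\([Dd]ie\) or \([Dd]o\)/\2 or \1/'
}
```

For example, the input line "Die or do" comes out as "do or Die".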
This section presents the following topics:
Conceptual overview of sed
Command-line syntax
Syntax of sed commands
Group summary of sed commands
Alphabetical summary of sed commands
sed is a non-interactive, or stream-oriented, editor. It interprets a script and performs the actions in the script. sed is stream-oriented because, like many Unix programs, input flows through the program and is directed to standard output. For example, sort is stream-oriented; vi is not. sed's input typically comes from a file or pipe, but it can also be directed from the keyboard. Output goes to the screen by default but can be captured in a file or sent through a pipe instead.
The Free Software Foundation has a version of sed, available from ftp://gnudist.gnu.org/gnu/sed/sed-3.02.tar.gz. The somewhat older version, 2.05, is also available.
Typical uses of sed include:
Editing one or more files automatically
Simplifying repetitive edits to multiple files
Writing conversion programs
sed operates as follows:
Each line of input is copied into a "pattern space," an internal buffer where editing operations are performed.
All editing commands in a sed script are applied, in order, to each line of input.
Editing commands are applied to all lines (globally) unless line addressing restricts the lines affected.
If a command changes the input, subsequent commands and address tests will be applied to the current line in the pattern space, not the original input line.
The original input file is unchanged because the editing commands modify a copy of each original input line. The copy is sent to standard output (but can be redirected to a file).
sed also maintains the "hold space," a separate buffer that can be used to save data for later retrieval.
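The point that later commands see the pattern space as already modified, not the original input line, can be demonstrated with two substitutions. The function name chain_demo and the sample words are assumptions for illustration.

```shell
# chain_demo: hypothetical helper. The second command operates on the
# result of the first, so an input line "cat" becomes "dog" and then "bird".
chain_demo() {
    sed -e 's/cat/dog/' -e 's/dog/bird/'
}
```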
The syntax for invoking sed has two forms:
sed [-n] [-e] 'command' file(s)
sed [-n] -f scriptfile file(s)
The first form allows you to specify an editing command on the command line, surrounded by single quotes. The second form allows you to specify a scriptfile, a file containing sed commands. Both forms may be used together, and they may be used multiple times. If no file(s) is specified, sed reads from standard input.
The following options are recognized:
-n
Suppress the default output; sed displays only those lines specified with the p command or with the p flag of the s command.
-e cmd
Next argument is an editing command. Useful if multiple scripts or commands are specified.
-f file
Next argument is a file containing editing commands.
If the first line of the script is #n, sed behaves as if -n had been specified.
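The effect of -n is easiest to see next to the same command without it. The function names below and the word "error" are assumptions chosen for this sketch.

```shell
# only_errors: with -n, only lines explicitly printed by p appear.
only_errors() {
    sed -n '/error/p'
}

# twice_errors: without -n, every line is printed once by default,
# and p prints matching lines a second time.
twice_errors() {
    sed '/error/p'
}
```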
sed commands have the general form:
[address[, address]][!]command [arguments]
sed copies each line of input into the pattern space. sed instructions consist of addresses and editing commands. If the address of the command matches the line in the pattern space, then the command is applied to that line. If a command has no address, then it is applied to each input line. If a command changes the contents of the pattern space, subsequent commands and addresses will be applied to the current line in the pattern space, not the original input line.
commands consist of a single letter or symbol; they are described later, alphabetically and by group. arguments include the label supplied to b or t, the filename supplied to r or w, and the substitution flags for s. addresses are described in the next section.
A sed command can specify zero, one, or two addresses. An address can be a line number, the symbol $ (for last line), or a regular expression enclosed in slashes (/pattern/). Regular expressions are described in Section 1.3. Additionally, \n can be used to match any newline in the pattern space (resulting from the N command), but not the newline at the end of the pattern space.
| If the Command Specifies: | Then the Command Is Applied To: |
|---|---|
| No address | Each input line. |
| One address | Any line matching the address. Some commands accept only one address: a, i, r, q, and =. |
| Two comma-separated addresses | First matching line and all succeeding lines up to and including a line matching the second address. |
| An address followed by ! | All lines that do not match the address. |

| Address Example | Action |
|---|---|
| s/xx/yy/g | Substitute on all lines (all occurrences). |
| /BSD/d | Delete lines containing BSD. |
| /^BEGIN/,/^END/p | Print between BEGIN and END, inclusive. |
| /SAVE/!d | Delete any line that doesn't contain SAVE. |
| /BEGIN/,/END/!s/xx/yy/g | Substitute on all lines, except between BEGIN and END. |
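Three of the address forms above, wrapped as shell functions so they can be tried on sample input. The function names are hypothetical, and between_markers adds -n so that only the addressed range is printed.

```shell
# between_markers: print only the lines from ^BEGIN through ^END, inclusive.
between_markers() {
    sed -n '/^BEGIN/,/^END/p'
}

# keep_saved: delete every line that does not contain SAVE.
keep_saved() {
    sed '/SAVE/!d'
}

# outside_markers: substitute everywhere except between BEGIN and END.
outside_markers() {
    sed '/BEGIN/,/END/!s/xx/yy/g'
}
```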
Braces ({ }) are used in sed to nest one address inside another or to apply multiple commands at the same address.
[/pattern/[,/pattern/]]{
command1
command2
}
The opening curly brace must end its line, and the closing curly brace must be on a line by itself. Be sure there are no spaces after the braces.
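A sketch of grouping two commands under one address range follows. The function name edit_block and the sample commands are assumptions; the brace layout follows the rule stated above (opening brace ends its line, closing brace on a line by itself).

```shell
# edit_block: hypothetical helper. Between ^BEGIN and ^END, replace xx
# with yy and delete comment lines; lines outside the range are untouched.
edit_block() {
    sed '/^BEGIN/,/^END/{
s/xx/yy/g
/^#/d
}'
}
```

The second command, /^#/d, is an address nested inside the outer range: it applies only to lines that are both within the block and start with #.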
In the lists that follow, the sed commands are grouped by function and are described tersely. Full descriptions, including syntax and examples, can be found afterward in Section 1.4.5.
| Command | Action |
|---|---|
| a\ | Append text after a line. |
| c\ | Replace text (usually a text block). |
| i\ | Insert text before a line. |
| d | Delete lines. |
| s | Make substitutions. |
| y | Translate characters (like Unix tr). |
| = | Display line number of a line. |
| l | Display control characters in ASCII. |
| p | Display the line. |
| n | Skip current line and go to line below. |
| r | Read another file's contents into the output stream. |
| w | Write input lines to another file. |
| q | Quit the sed script (no further output). |
| h | Copy into hold space; wipe out what's there. |
| H | Copy into hold space; append to what's there. |
| g | Get the hold space back; wipe out the destination line. |
| G | Get the hold space back; append to the pattern space. |
| x | Exchange contents of the hold and pattern spaces. |
| b | Branch to label or to end of script. |
| t | Same as b, but branch only after substitution. |
| :label | Label branched to by t or b. |
| N | Read another line of input (creates embedded newline). |
| D | Delete up to the embedded newline. |
| P | Print up to the embedded newline. |
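The hold-space commands are easier to follow in a small worked case. The sketch below reverses the order of its input lines using G and h; the function name reverse_lines is an assumption, and the commands are given with separate -e options for portability.

```shell
# reverse_lines: hypothetical helper that prints its input in reverse order.
#   1!G  on every line but the first, append the hold space to the pattern space
#   h    copy the accumulated pattern space into the hold space
#   $p   on the last line, print the result (-n suppresses everything else)
reverse_lines() {
    sed -n -e '1!G' -e 'h' -e '$p'
}
```

Each new line is placed in front of the lines collected so far, so the hold space always contains the input read up to that point in reverse order.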
This section presents the following topics:
Conceptual overview
Command-line syntax
Patterns and procedures
Built-in variables
Operators
Variables and array assignment
User-defined functions
Group listing of functions and commands
Implementation limits
Alphabetical summary of functions and commands
awk is a pattern-matching program for processing files, especially when they are databases. The new version of awk, called nawk, provides additional capabilities. (It really isn't so new. The additional features were added in 1984, and it was first shipped with System V Release 3.1 in 1987. Nevertheless, the name was never changed on most systems.) Every modern Unix system comes with a version of new awk, and its use is recommended over old awk.
Different systems vary in what the two versions are called. Some have oawk and awk, for the old and new versions, respectively. Others have awk and nawk. Still others only have awk, which is the new version. This example shows what happens if your awk is the old one:
$ awk 1 /dev/null
awk: syntax error near line 1
awk: bailing out near line 1
awk will exit silently if it is the new version.
Source code for the latest version of awk, from Bell Labs, can be downloaded starting at Brian Kernighan's home page: http://cm.bell-labs.com/~bwk. Michael Brennan's mawk is available via anonymous FTP from ftp://ftp.whidbey.net/pub/brennan/mawk1.3.3.tar.gz. Finally, the Free Software Foundation has a version of awk called gawk, available from ftp://gnudist.gnu.org/gnu/gawk/gawk-3.0.4.tar.gz. All three programs implement "new" awk. Thus, references in the following text such as "nawk only," apply to all three. gawk has additional features.
With original awk, you can:
Think of a text file as made up of records and fields in a textual database.
Perform arithmetic and string operations.
Use programming constructs such as loops and conditionals.
Produce formatted reports.
With nawk, you can also:
Define your own functions.
Execute Unix commands from a script.
Process the results of Unix commands.
Process command-line arguments more gracefully.
Work more easily with multiple input streams.
In addition, with GNU awk (gawk), you can:
Use regular expressions to separate records, as well as fields.
Skip to the start of the next file, not just the next record.
Perform more powerful string substitutions.
Retrieve and format system time values.
The syntax for invoking awk has two forms:
awk [options] 'script' var=value file(s)
awk [options] -f scriptfile var=value file(s)
You can specify a script directly on the command line, or you can store a script in a scriptfile and specify it with -f. nawk allows multiple -f scripts. Variables can be assigned a value on the command line. The value can be a literal, a shell variable ($name), or a command substitution (`cmd`), but the value is available only after the BEGIN statement is executed.
awk operates on one or more files. If none are specified (or if - is specified), awk reads from the standard input.
The recognized options are:
-Ffs
Set the field separator to fs. This is the same as setting the built-in variable FS. Original awk only allows the field separator to be a single character. nawk allows fs to be a regular expression. Each input line, or record, is divided into fields by white space (spaces or tabs) or by some other user-definable field separator. Fields are referred to by the variables $1, $2,…, $n. $0 refers to the entire record.
-v var=value
Available in nawk only. Assign a value to variable var. This allows assignment before the script begins execution.
For example, to print the first three (colon-separated) fields of each record on separate lines:
awk -F: '{ print $1; print $2; print $3 }' /etc/passwd
Numerous examples are shown later in Section 1.5.3.3.
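The difference between -v and a var=value argument shows up in the BEGIN procedure. In this sketch the function names and the variable greeting are assumptions for illustration; -v requires a new awk.

```shell
# early_value: -v assigns the variable before BEGIN runs.
early_value() {
    awk -v greeting=hello 'BEGIN { print "[" greeting "]" }'
}

# late_value: a var=value argument is processed only when awk reaches it
# in the argument list, so it is empty in BEGIN but set by the time END runs.
late_value() {
    awk 'BEGIN { print "[" greeting "]" }
         END   { print "[" greeting "]" }' greeting=hello /dev/null
}
```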
awk scripts consist of patterns and procedures:
pattern { procedure }
Both are optional. If pattern is missing, { procedure } is applied to all lines. If { procedure } is missing, the matched line is printed.
A pattern can be any of the following:
/regular expression/
relational expression
pattern-matching expression
BEGIN
END
Expressions can be composed of quoted strings, numbers, operators, functions, defined variables, or any of the predefined variables described later under Section 1.5.4.
Regular expressions use the extended set of metacharacters and are described earlier in Section 1.3.
^ and $ refer to the beginning and end of a string (such as the fields), respectively, rather than the beginning and end of a line. In particular, these metacharacters will not match at a newline embedded in the middle of a string.
Relational expressions use the relational operators listed under "Operators" later in this book. For example, $2 > $1 selects lines for which the second field is greater than the first. Comparisons can be either string or numeric. Thus, depending on the types of data in $1 and $2, awk will do either a numeric or a string comparison. This can change from one record to the next.
Pattern-matching expressions use the operators ~ (match) and !~ (don't match). See "Operators" later in this book.
The BEGIN pattern lets you specify procedures that will take place before the first input line is processed. (Generally, you set global variables here.)
The END pattern lets you specify procedures that will take place after the last input record is read.
In nawk, BEGIN and END patterns may appear multiple times. The procedures are merged as if there had been one large procedure.
Except for BEGIN and END, patterns can be combined with the Boolean operators || (or), && (and), and ! (not). A range of lines can also be specified using comma-separated patterns:
pattern,pattern
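A range pattern and a Boolean combination might be used as in the following sketch. The function names and the START, STOP, and # markers are assumptions made for this example.

```shell
# section_lines: print, with record numbers, every line from one matching
# ^START through the next one matching ^STOP, inclusive.
section_lines() {
    awk '/^START/,/^STOP/ { print NR ": " $0 }'
}

# wide_noncomment: print records with more than two fields that do not
# begin with #. With no procedure, the matched line itself is printed.
wide_noncomment() {
    awk 'NF > 2 && !/^#/'
}
```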
Procedures consist of one or more commands, functions, or variable assignments, separated by newlines or semicolons, and are contained within curly braces. Commands fall into five groups:
Variable or array assignments
Printing commands
Built-in functions
Control-flow commands
User-defined functions (nawk only)
Print first field of each line:
{ print $1 }
Print all lines that contain pattern:
/pattern/
Print first field of lines that contain pattern:
/pattern/ { print $1 }
Select records containing more than two fields:
NF > 2
Interpret input records as a group of lines up to a blank line. Each line is a single field:
BEGIN { FS = "\n"; RS = "" }
Print fields 2 and 3 in switched order, but only on lines whose first field matches the string URGENT:
$1 ~ /URGENT/ { print $3, $2 }
Count and print the number of lines that contain pattern:
/pattern/ { ++x }
END { print x }
Add numbers in second column and print total:
{ total += $2 }
END { print "column total is", total }
Print lines that contain fewer than 20 characters:
length($0) < 20
Print each line that begins with Name: and that contains exactly seven fields:
NF == 7 && /^Name:/
Print the fields of each record in reverse order, one per line:
{
for (i = NF; i >= 1; i--)
print $i
}
All awk variables are included in nawk. All nawk variables are included in gawk.
The following table lists the operators, in order of increasing precedence, that are available in awk.
| Symbol | Meaning |
|---|---|
| = += -= *= /= %= ^= **= | Assignment. |
| ?: | C conditional expression (nawk only). |
| \|\| | Logical OR (short-circuit). |
| && | Logical AND (short-circuit). |
| in | Array membership (nawk only). |
| ~ !~ | Match regular expression and negation. |
| < <= > >= != == | Relational operators. |
| (blank) | Concatenation. |
| + - | Addition, subtraction. |
| * / % | Multiplication, division, and modulus (remainder). |
| + - ! | Unary plus and minus, and logical negation. |
| ^ ** | Exponentiation. |
| ++ -- | Increment and decrement, either prefix or postfix. |
| $ | Field reference. |
Note: While ** and **= are common extensions, they are not part of POSIX awk.
Variables can be assigned a value with an = sign. For example:
FS = ","
Expressions using the arithmetic operators +, -, *, /, and % (modulo) can be assigned to variables.
Arrays can be created with the split( ) function (described later), or they can simply be named in an assignment statement. Array elements can be subscripted with numbers (array[1], …, array[n]) or with strings. Arrays subscripted by strings are called "associative arrays." (In fact, all arrays in awk are associative; numeric subscripts are converted to strings before using them as array subscripts. Associative arrays are one of awk's most powerful features.)
For example, to count the number of widgets you have, you could use the following script:
/widget/ { count["widget"]++ }      # Count widgets
END      { print count["widget"] }  # Print the count
You can use the special for loop to read all the elements of an associative array:
for (item in array) process array[item]
The index of the array is available as item, while the value of an element of the array can be referenced as array[item].
You can use the operator in to test that an element exists by testing to see if its index exists (nawk only). For example:
if (index in array) …
tests that array[index] exists, but you cannot use it to test the value of the element referenced by array[index].
You can also delete individual elements of the array using the delete statement (nawk only).
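The array features described above can be sketched together in one small script. The function name count_words and the sample words are illustrative assumptions, and the script assumes a nawk-compatible awk, since it relies on in and delete.

```shell
# Hypothetical helper: count each word, then test and delete one element.
count_words() {
    awk '
    { for (i = 1; i <= NF; i++) count[$i]++ }    # subscripts are strings
    END {
        if ("widget" in count)                   # tests the index, not the value
            print "widget", count["widget"]
        delete count["widget"]                   # remove a single element
        if (!("widget" in count))
            print "widget deleted"
        n = 0
        for (word in count) n++                  # traversal order is unspecified
        print n, "other words"
    }'
}
printf 'widget gadget widget\ngizmo widget\n' | count_words
```

Because the order in which for (item in array) visits elements is not specified, the sketch only counts the remaining elements instead of printing them.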
Within string and regular expression constants, the following escape sequences may be used.
| Sequence | Meaning | Sequence | Meaning |
|---|---|---|---|
| \a | Alert (bell) | \v | Vertical tab |
| \b | Backspace | \\ | Literal backslash |
| \f | Form feed | \nnn | Octal value nnn |
| \n | Newline | \xnn | Hexadecimal value nn |
| \r | Carriage return | \" | Literal double quote (in strings) |
| \t | Tab | \/ | Literal slash (in regular expressions) |
Note: The \x escape sequence is a common extension; it is not part of POSIX awk.
nawk allows you to define your own functions. This makes it easy to encapsulate sequences of steps that need to be repeated into a single place, and re-use the code from anywhere in your program.
The following function capitalizes each word in a string. It has one parameter, named input, and five local variables, which are written as extra parameters:
# capitalize each word in a string
function capitalize(input,    result, words, n, i, w)
{
    result = ""
    n = split(input, words, " ")
    for (i = 1; i <= n; i++) {
        w = words[i]
        w = toupper(substr(w, 1, 1)) substr(w, 2)
        if (i > 1)
            result = result " "
        result = result w
    }
    return result
}

# main program, for testing
{ print capitalize($0) }
With this input data:
A test line with words and numbers like 12 on it.
This program produces:
A Test Line With Words And Numbers Like 12 On It.
Note: For user-defined functions, no space is allowed between the function name and the left parenthesis when the function is called.
awk functions and commands may be classified as follows:
| Category | Functions and Commands | | |
|---|---|---|---|
| Arithmetic Functions | atan2[2] | int | sin[2] |
| | cos[2] | log | sqrt |
| | exp | rand[2] | srand[2] |
| String Functions | index | match[2] | tolower[2] |
| | gensub[9] | split | toupper[2] |
| | gsub[2] | sprintf | |
| | length | sub[2] | |
| Control Flow Statements | break | exit | return[2] |
| | continue | for | while |
| | do/while[2] | if | |
| Input/Output Processing | close[2] | next | printf |
| | fflush[16] | nextfile[16] | |
| | getline[2] | | |
| Time Functions | strftime[9] | systime[9] | |
| Programming | delete[2] | function[2] | system[2] |
[2] Available in nawk.
[9] Available in gawk.
[16] Available in Bell Labs awk and gawk.
Many versions of awk have various implementation limits, on things such as:
Number of fields per record
Number of characters per input record
Number of characters per output record
Number of characters per field
Number of characters per printf string
Number of characters in literal string
Number of characters in character class
Number of files open
Number of pipes open
The ability to handle 8-bit characters and characters that are all zero (ASCII NUL)
gawk does not have limits on any of the above items, other than those imposed by the machine architecture and/or the operating system.
The following alphabetical list of keywords and functions includes all that are available in awk, nawk, and gawk. nawk includes all old awk functions and keywords, plus some additional ones (marked as {N}). gawk includes all nawk functions and keywords, plus some additional ones (marked as {G}). Items marked with {B} are available in the Bell Labs awk. Items that aren't marked with a symbol are available in all versions.
| Command | Description |
|---|---|
| atan2 | atan2(y, x) Return the arctangent of y/x in radians. {N} |
| break | break Exit from a while, for, or do loop. |
| close | close(expr) In most implementations of awk, you can only have up to ten files open simultaneously and one pipe. Therefore, nawk provides a close function that allows you to close a file or a pipe. It takes the same expression that opened the pipe or file as an argument. This expression must be identical, character by character, to the one that opened the file or pipe—even whitespace is significant. {N} |
| continue | continue Begin next iteration of while, for, or do loop. |
| cos | cos(x) Return the cosine of x, an angle in radians. {N} |
| delete | delete array[element] or delete array Delete element from array. The brackets are typed literally. {N} The second form is a common extension, which deletes all elements of the array at once. {B} {G} |
| do | do statement while (expr) Looping statement. Execute statement, then evaluate expr and, if true, execute statement again. A series of statements must be put within braces. {N} |
| exit | exit [expr] Exit from script, reading no new input. The END procedure, if it exists, will be executed. An optional expr becomes awk's return value. |
| exp | exp(x) Return exponential of x (e^x). |
| fflush | fflush([output-expr]) Flush any buffers associated with open output file or pipe output-expr. {B} gawk extends this function. If no output-expr is supplied, it flushes standard output. If output-expr is the null string (""), it flushes all open files and pipes. {G} |
| for | for (init-expr; test-expr; incr-expr) statement C-style looping construct. init-expr assigns the initial value of a counter variable. test-expr is a relational expression that is evaluated each time before executing the statement. When test-expr is false, the loop is exited. incr-expr is used to increment the counter variable after each pass. All of the expressions are optional. A missing test-expr is considered to be true. A series of statements must be put within braces. |
| for | for (item in array) statement Special loop designed for reading associative arrays. For each element of the array, the statement is executed; the element can be referenced by array[item]. A series of statements must be put within braces. |
| function | function name(parameter-list) { statements } Create name as a user-defined function consisting of awk statements that apply to the specified list of parameters. No space is allowed between name and the left parenthesis when the function is called. {N} |
| getline | getline [var] [< file] or command \| getline [var] Read next line of input. Original awk does not support the syntax to open multiple input streams. The first form reads input from file and the second form reads the output of command. Both forms read one record at a time, and each time the statement is executed it gets the next record of input. The record is assigned to $0 and is parsed into fields, setting NF, NR and FNR. If var is specified, the result is assigned to var and $0 and NF are not changed. Thus, if the result is assigned to a variable, the current record does not change. getline is actually a function and it returns 1 if it reads a record successfully, 0 if end-of-file is encountered, and -1 if for some reason it is otherwise unsuccessful. {N} |
| gensub | gensub(r, s, h [, t]) General substitution function. Substitute s for matches of the regular expression r in the string t. If h is a number, replace the hth match. If it is "g" or "G", substitute globally. If t is not supplied, $0 is used. Return the new string value. The original t is not modified. (Compare gsub and sub.) {G} |
| gsub | gsub(r, s [, t]) Globally substitute s for each match of the regular expression r in the string t. If t is not supplied, defaults to $0. Return the number of substitutions. {N} |
| if | if (condition) statement [else statement] If condition is true, do statement(s), otherwise do statement in optional else clause. Condition can be an expression using any of the relational operators <, <=, ==, !=, >=, or >, as well as the array membership operator in, and the pattern-matching operators ~ and !~ (e.g., if ($1 ~ /[Aa].*/)). A series of statements must be put within braces. Another if can directly follow an else in order to produce a chain of tests or decisions. |
| index | index(str, substr) Return the position (starting at 1) of substr in str, or zero if substr is not present in str. |
| int | int(x) Return integer value of x by truncating any fractional part. |
| length | length([arg]) Return length of arg, or the length of $0 if no argument. |
| log | log(x) Return the natural logarithm (base e) of x. |
| match | match(s, r) Function that matches the pattern, specified by the regular expression r, in the string s and returns either the position in s where the match begins, or 0 if no occurrences are found. Sets the values of RSTART and RLENGTH to the start and length of the match, respectively. {N} |
| next | next Read next input line and start new cycle through pattern/procedures statements. |
| nextfile | nextfile Stop processing the current input file and start new cycle through pattern/procedures statements, beginning with the first record of the next file. {B} {G} |
| print | print [ output-expr[ , …]] [ dest-expr ] Evaluate the output-expr and direct it to standard output followed by the value of ORS. Each comma-separated output-expr is separated in the output by the value of OFS. With no output-expr, print $0. The output may be redirected to a file or pipe via the dest-expr, which is described in the section "Output Redirections" following this table. |
| printf | printf(format [, expr-list ]) [ dest-expr ] An alternative output statement borrowed from the C language. It has the ability to produce formatted output. It can also be used to output data without automatically producing a newline. format is a string of format specifications and constants. expr-list is a list of arguments corresponding to format specifiers. As for print, output may be redirected to a file or pipe. See the section "printf formats" following this table for a description of allowed format specifiers. |
| rand | rand() Generate a random number between 0 and 1. This function returns the same series of numbers each time the script is executed, unless the random number generator is seeded using srand( ). {N} |
| return | return [expr] Used within a user-defined function to exit the function, returning value of expression. The return value of a function is undefined if expr is not provided. {N} |
| sin | sin(x) Return the sine of x, an angle in radians. {N} |
| split | split(string, array [, sep]) Split string into elements of array array[1],…,array[n]. The string is split at each occurrence of separator sep. If sep is not specified, FS is used. Returns the number of array elements created. |
| sprintf | sprintf(format [, expressions]) Return the formatted value of one or more expressions, using the specified format. Data is formatted but not printed. See the section "printf formats" following this table for a description of allowed format specifiers. |
| sqrt | sqrt(arg) Return square root of arg. |
| srand | srand([expr]) Use optional expr to set a new seed for the random number generator. Default is the time of day. Return value is the old seed. {N} |
| strftime | strftime([format [,timestamp]]) Format timestamp according to format. Return the formatted string. The timestamp is a time-of-day value in seconds since Midnight, January 1, 1970, UTC. The format string is similar to that of sprintf. If timestamp is omitted, it defaults to the current time. If format is omitted, it defaults to a value that produces output similar to that of the Unix date command. {G} |
| sub | sub(r, s [, t]) Substitute s for first match of the regular expression r in the string t. If t is not supplied, defaults to $0. Return 1 if successful; 0 otherwise. {N} |
| substr | substr(string, beg [, len]) Return substring of string at beginning position beg, and the characters that follow to maximum specified length len. If no length is given, use the rest of the string. |
| system | system(command) Function that executes the specified command and returns its status. The status of the executed command typically indicates success or failure. A value of 0 means that the command executed successfully. A non-zero value indicates a failure of some sort. The documentation for the command you're running will give you the details. The output of the command is not available for processing within the awk script. Use command \| getline to read the output of a command into the script. {N} |
| systime | systime( ) Return a time-of-day value in seconds since Midnight, January 1, 1970, UTC. {G} |
| tolower | tolower(str) Translate all uppercase characters in str to lowercase and return the new string.[24] {N} |
| toupper | toupper(str) Translate all lowercase characters in str to uppercase and return the new string. {N} |
| while | while (condition) statement Do statement while condition is true (see if for a description of allowable conditions). A series of statements must be put within braces. |
[24] Very early versions of nawk don't support tolower() and toupper(). However, they are now part of the POSIX specification for awk.
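Several of the string functions summarized in the table can be tried on a single sample string. The following is a minimal sketch, assuming a nawk-compatible awk; the function name string_demo and the sample text are invented for the illustration.

```shell
# Hypothetical helper: exercise gsub, sub, match, split, index, length, and substr.
string_demo() {
    awk 'BEGIN {
        s = "the cat sat on the mat"
        t = s
        n = gsub(/at/, "AT", t)              # gsub returns the number of substitutions
        print n, t
        t = s
        sub(/the/, "a", t)                   # sub replaces only the first match
        print t
        if (match(s, /s[a-z]+/))             # match sets RSTART and RLENGTH
            print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
        n = split("a:b:c", parts, ":")       # split returns the number of elements
        print n, parts[1], parts[3]
        print index(s, "sat"), length(s), toupper(substr(s, 1, 3))
    }'
}
string_demo
```

Both sub and gsub change their third argument in place, which is why the sketch works on the copy t and leaves s untouched.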
For print and printf, dest-expr is an optional expression that directs the output to a file or pipe.
> file
Directs the output to a file, overwriting its previous contents.
>> file
Appends the output to a file, preserving its previous contents. In both of these cases, the file will be created if it does not already exist.
| command
Directs the output as the input to a system command.
Be careful not to mix > and >> for the same file. Once a file has been opened with >, subsequent output statements continue to append to the file until it is closed.
Remember to call close() when you have finished with a file or pipe. If you don't, eventually you will hit the system limit on the number of simultaneously open files.
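The redirection forms and close() can be combined as in the sketch below. It assumes a nawk-compatible awk and the POSIX -v option for passing the directory name; the function name split_by_key, the directory, and the file names are all illustrative.

```shell
# Hypothetical helper: write the second field of each record to a file named
# after the first field, then pipe a per-key count through sort.
split_by_key() {
    dir=$1
    awk -v dir="$dir" '
    {
        file = dir "/" $1 ".txt"
        print $2 >> file             # append; the file is created if needed
        close(file)                  # keep the number of open files small
        seen[$1]++
    }
    END {
        cmd = "sort"
        for (k in seen)
            print k, seen[k] | cmd   # send output to a command
        close(cmd)                   # wait for sort to finish
    }'
}
tmp=$(mktemp -d)
printf 'b 1\na 2\nb 3\n' | split_by_key "$tmp"
cat "$tmp/b.txt"
rm -rf "$tmp"
```

The sketch uses >> rather than > because each file is closed after every record; reopening a closed file with > would truncate it and discard the earlier output.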
Format specifiers for printf and sprintf have the following form:
%[flag][width][.precision]letter
The control letter is required. The format conversion control letters are given in the following table.
| Character | Description |
|---|---|
| c | ASCII character. |
| d | Decimal integer. |
| i | Decimal integer. (Added in POSIX) |
| e | Floating-point format ([-]d.precisione[+-]dd). |
| E | Floating-point format ([-]d.precisionE[+-]dd). |
| f | Floating-point format ([-]ddd.precision). |
| g | e or f conversion, whichever is shortest, with trailing zeros removed. |
| G | E or f conversion, whichever is shortest, with trailing zeros removed. |
| o | Unsigned octal value. |
| s | String. |
| x | Unsigned hexadecimal number. Uses a-f for 10 to 15. |
| X | Unsigned hexadecimal number. Uses A-F for 10 to 15. |
| % | Literal %. |
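The optional flag is one of the following: - (left-justify the value within the field), space (prefix positive values with a space), + (always print a sign), # (use an alternate form), or 0 (pad with zeros instead of spaces).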
The optional width is the minimum number of characters to output. The result will be padded to this size if it is smaller. The 0 flag causes padding with zeros; otherwise, padding is with spaces.
The precision is optional. Its meaning varies by control letter, as shown in this table:
| Conversion | Precision Means |
|---|---|
| %d, %i, %o, %u, %x, %X | The minimum number of digits to print. |
| %e, %E, %f | The number of digits to the right of the decimal point. |
| %g, %G | The maximum number of significant digits. |
| %s | The maximum number of characters to print. |
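The effect of flags, width, and precision may be easier to see side by side. This is a small sketch, assuming a POSIX-compatible awk; the function name format_demo and the sample values are made up. The square brackets are only there to make the padding visible.

```shell
# Hypothetical helper: print the same kinds of values with different format specifiers.
format_demo() {
    awk 'BEGIN {
        printf("[%5d] [%-5d] [%05d]\n", 42, 42, 42)           # width, left-justify, zero pad
        printf("[%.2f] [%8.3f] [%e]\n", 3.14159, 3.14159, 1234.5)
        printf("[%.3s] [%-6s] [%6s]\n", "abcdef", "ab", "ab") # precision truncates strings
        printf("[%x] [%X] [%o] [%c] [%d%%]\n", 255, 255, 8, "A", 50)
    }'
}
format_demo
```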
A number of Unix text-processing utilities let you search for, and in some cases change, text patterns rather than fixed strings. These utilities include the editing programs ed, ex, vi, and sed, the awk programming language, and the commands grep and egrep. Text patterns (formally called regular expressions) contain normal characters mixed with special characters (called metacharacters).
This section presents the following topics:
Filenames versus patterns
List of metacharacters available to each program
Description of metacharacters
Examples
Metacharacters used in pattern matching are different from metacharacters used for filename expansion. When you issue a command on the command line, special characters are seen first by the shell, then by the program; therefore, unquoted metacharacters are interpreted by the shell for filename expansion. The command:
$ grep [A-Z]* chap[12]
could, for example, be transformed by the shell into:
$ grep Array.c Bug.c Comp.c chap1 chap2
and would then try to find the pattern Array.c in files Bug.c, Comp.c, chap1, and chap2. To bypass the shell and pass the special characters to grep, use quotes:
$ grep "[A-Z]*" chap[12]
Double quotes suffice in most cases, but single quotes are the safest bet.
Note also that in pattern matching, ? matches zero or one instance of a regular expression; in filename expansion, ? matches a single character.
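As a small illustration of the quoting advice, the sketch below passes a pattern containing brackets, an asterisk, and a dollar sign to grep inside single quotes. The function name count_capitalized and the sample input are assumptions made for the example.

```shell
# Hypothetical helper: count lines that consist of one capitalized word.
count_capitalized() {
    # Single quotes keep the shell from expanding [A-Z] and * as a filename
    # pattern and from treating $ specially; grep sees the pattern as typed.
    grep -c '^[A-Z][a-z]*$'
}
printf 'Chapter\nchap1\nBug\n' | count_capitalized
```

Without the quotes, the result would depend on which filenames happen to exist in the current directory.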
The characters in the following table have special meaning only in search patterns:
Many Unix systems allow the use of POSIX "character classes" within the square brackets that enclose a group of characters. These are typed enclosed in [: and :]. For example, [[:alnum:]] matches a single alphanumeric character.
| Class | Characters Matched |
|---|---|
| alnum | Alphanumeric characters |
| alpha | Alphabetic characters |
| blank | Space or tab |
| cntrl | Control characters |
| digit | Decimal digits |
| graph | Non-space characters |
| lower | Lowercase characters |
| print | Printable characters |
| space | White-space characters |
| upper | Uppercase characters |
| xdigit | Hexadecimal digits |
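The character classes can be used inside a bracket expression like any other set member. The following sketch assumes a sed and grep that support POSIX character classes; the function names and sample text are illustrative.

```shell
# Hypothetical helper: delete everything that is not a letter, digit, or white space.
strip_punctuation() {
    sed 's/[^[:alnum:][:space:]]//g'
}

# Hypothetical helper: count the lines that contain at least one digit.
count_digit_lines() {
    grep -c '[[:digit:]]'
}

printf 'Hello, World! (v2.0)\n' | strip_punctuation
printf 'abc\na1\n22\n' | count_digit_lines
```

Note the doubled brackets: the outer pair forms the bracket expression and the inner [: :] pair names the class.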
The characters in the following table have special meaning only in replacement patterns.
| Character | Pattern |
|---|---|
| \ | Turn off the special meaning of the following character. |
| \n | Restore the text matched by the nth pattern previously saved by \( and \). n is a number from 1 to 9, with 1 starting on the left. |
| & | Reuse the text matched by the search pattern as part of the replacement pattern. |
| ~ | Reuse the previous replacement pattern in the current replacement pattern. Must be the only character in the replacement pattern. (ex and vi). |
| % | Reuse the previous replacement pattern in the current replacement pattern. Must be the only character in the replacement pattern. (ed). |
| \u | Convert first character of replacement pattern to uppercase. |
| \U | Convert entire replacement pattern to uppercase. |
| \l | Convert first character of replacement pattern to lowercase. |
| \L | Convert entire replacement pattern to lowercase. |
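The first three replacement metacharacters in the table can be sketched with sed, which supports \, \n, and &. The function name wrap_and_swap and the sample input are assumptions for the illustration.

```shell
# Hypothetical helper: three substitutions, each using one replacement metacharacter.
wrap_and_swap() {
    # 1. & reuses the text matched by the search pattern.
    # 2. \1 and \2 restore the text saved by the first and second \( \) pairs.
    # 3. \& turns off the special meaning of &, producing a literal ampersand.
    sed -e 's/[0-9][0-9]*/<&>/g' \
        -e 's/^\([a-z]*\)=\(.*\)$/\2=\1/' \
        -e 's/ and / \& /'
}
printf 'width=80\n' | wrap_and_swap
printf 'a and b\n' | wrap_and_swap
```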
Some metacharacters are valid for one program but not for another. Those that are available to a Unix program are marked by a bullet (•) in the following table. (This table is correct for SVR4 and Solaris and most commercial Unix systems, but it's always a good idea to verify your system's behavior.) Items marked with a "P" are specified by POSIX; double check your system's version. Full descriptions were provided in the previous section.
| Symbol | ed | ex/vi | sed/grep | awk/egrep | Action |
|---|---|---|---|---|---|
| . | • | • | • | • | Match any character. |
| * | • | • | • | • | Match zero or more preceding. |
| ^ | • | • | • | • | Match beginning of line/string. |
| $ | • | • | • | • | Match end of line/string. |
| \ | • | • | • | • | Escape following character. |
| [ ] | • | • | • | • | Match one from a set. |
| \( \) | • | • | • | | Store pattern for later replay.[1] |
| \n | • | • | • | | Replay sub-pattern in match. |
| { } | | | | P | Match a range of instances. |
| \{ \} | • | | • | | Match a range of instances. |
| \< \> | • | • | | | Match word's beginning or end. |
| + | | | | • | Match one or more preceding. |
| ? | | | | • | Match zero or one preceding. |
| \| | | | | • | Separate choices to match. |
| ( ) | | | | • | Group expressions to match. |
[1] Stored sub-patterns can be "replayed" during matching. See the examples, below.
Note that in ed, ex, vi, and sed, you specify both a search pattern (on the left) and a replacement pattern (on the right). The metacharacters above are meaningful only in a search pattern.
In ed, ex, vi, and sed, the following metacharacters are valid only in a replacement pattern:
| Symbol | ex | vi | sed | ed | Action |
|---|---|---|---|---|---|
| \ | • | • | • | • | Escape following character. |
| \n | • | • | • | • | Text matching pattern stored in \( \). |
| & | • | • | • | • | Text matching search pattern. |
| ~ | • | • | | | Reuse previous replacement pattern. |
| % | | | | • | Reuse previous replacement pattern. |
| \u \U | • | • | | | Change character(s) to uppercase. |
| \l \L | • | • | | | Change character(s) to lowercase. |
| \E | • | • | | | Turn off previous \U or \L. |
| \e | • | • | | | Turn off previous \u or \l. |
When used with grep or egrep, regular expressions should be surrounded by quotes. (If the pattern contains a $, you must use single quotes; e.g., 'pattern'.) When used with ed, ex, sed, and awk, regular expressions are usually surrounded by / although (except for awk), any delimiter works. Here are some example patterns.
| Pattern | What Does It Match? |
|---|---|
| bag | The string bag. |
| ^bag | bag at the beginning of the line. |
| bag$ | bag at the end of the line. |
| ^bag$ | bag as the only word on the line. |
| [Bb]ag | Bag or bag. |
| b[aeiou]g | Second letter is a vowel. |
| b[^aeiou]g | Second letter is a consonant (or uppercase or symbol). |
| b.g | Second letter is any character. |
| ^...$ | Any line containing exactly three characters. |
| ^\. | Any line that begins with a dot. |
| ^\.[a-z][a-z] | Same, followed by two lowercase letters (e.g., troff requests). |
| ^\.[a-z]\{2\} | Same as previous, ed, grep and sed only. |
| ^[^.] | Any line that doesn't begin with a dot. |
| bugs* | bug, bugs, bugss, etc. |
| "word" | A word in quotes. |
| "*word"* | A word, with or without quotes. |
| [A-Z][A-Z]* | One or more uppercase letters. |
| [A-Z]+ | Same as previous, egrep or awk only. |
| [[:upper:]]+ | Same as previous, POSIX egrep or awk. |
| [A-Z].* | An uppercase letter, followed by zero or more characters. |
| [A-Z]* | Zero or more uppercase letters. |
| [a-zA-Z] | Any letter, either lower- or uppercase. |
| [^0-9A-Za-z] | Any symbol or space (not a letter or a number). |
| [^[:alnum:]] | Same, using POSIX character class. |
| egrep or awk pattern | What Does It Match? |
|---|---|
| [567] | One of the numbers 5, 6, or 7. |
| five\|six\|seven | One of the words five, six, or seven. |
| 80[2-4]?86 | 8086, 80286, 80386, or 80486. |
| 80[2-4]?86\|(Pentium(-II)?) | 8086, 80286, 80386, 80486, Pentium, or Pentium-II. |
| compan(y\|ies) | company or companies. |
| ex or vi pattern | What Does It Match? |
|---|---|
| \<the | Words like theater, there or the. |
| the\> | Words like breathe, seethe or the. |
| \<the\> | The word the. |
| ed, sed, or grep pattern | What Does It Match? |
|---|---|
| 0\{5,\} | Five or more zeros in a row. |
| [0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\} | U.S. Social Security number (nnn-nn-nnnn). |
| \(why\).*\1 | A line with two occurrences of why. |
| \([[:alpha:]_][[:alnum:]_.]*\) = \1; | C/C++ simple assignment statements. |
The following examples show the metacharacters available to sed or ex. Note that ex commands begin with a colon.
Finally, some sed examples for transposing words. A simple transposition of two words might look like this:
s/die or do/do or die/                       Transpose words
The real trick is to use hold buffers (the text saved by \( and \)) to transpose variable patterns. For example:
s/\([Dd]ie\) or \([Dd]o\)/\2 or \1/          Transpose, using hold buffers
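The variable-pattern transposition can be run as a complete command. This is a short sketch; the function name transpose_words and the sample input are invented for the illustration.

```shell
# Hypothetical helper: swap the two saved words, keeping each word's capitalization.
transpose_words() {
    # \1 and \2 replay the text saved by the first and second \( \) pairs.
    sed 's/\([Dd]ie\) or \([Dd]o\)/\2 or \1/'
}
printf 'Die or do\ndie or Do\n' | transpose_words
```

Because the saved text is replayed exactly as it was matched, the capital letter travels with its word.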
This section presents the following topics:
Conceptual overview of sed
Command-line syntax
Syntax of sed commands
Group summary of sed commands
Alphabetical summary of sed commands
sed is a non-interactive, or stream-oriented, editor. It interprets a script and performs the actions in the script. sed is stream-oriented because, like many Unix programs, input flows through the program and is directed to standard output. For example, sort is stream-oriented; vi is not. sed's input typically comes from a file or pipe, but it can also be directed from the keyboard. Output goes to the screen by default but can be captured in a file or sent through a pipe instead.
The Free Software Foundation has a version of sed, available from ftp://gnudist.gnu.org/gnu/sed/sed-3.02.tar.gz. The somewhat older version, 2.05, is also available.
Typical uses of sed include:
Editing one or more files automatically
Simplifying repetitive edits to multiple files
Writing conversion programs
sed operates as follows:
Each line of input is copied into a "pattern space," an internal buffer where editing operations are performed.
All editing commands in a sed script are applied, in order, to each line of input.
Editing commands are applied to all lines (globally) unless line addressing restricts the lines affected.
If a command changes the input, subsequent commands and address tests will be applied to the current line in the pattern space, not the original input line.
The original input file is unchanged because the editing commands modify a copy of each original input line. The copy is sent to standard output (but can be redirected to a file).
sed also maintains the "hold space," a separate buffer that can be used to save data for later retrieval.
The syntax for invoking sed has two forms:
sed [-n] [-e] 'command' file(s)
sed [-n] -f scriptfile file(s)
The first form allows you to specify an editing command on the command line, surrounded by single quotes. The second form allows you to specify a scriptfile, a file containing sed commands. Both forms may be used together, and they may be used multiple times. If no file(s) is specified, sed reads from standard input.
The following options are recognized:
-n
Suppress the default output; sed displays only those lines specified with the p command or with the p flag of the s command.
-e cmd
Next argument is an editing command. Useful if multiple scripts or commands are specified.
-f file
Next argument is a file containing editing commands.
If the first line of the script is #n, sed behaves as if -n had been specified.
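The three options can be sketched as follows. The function names, the sample commands, and the use of mktemp for a temporary script file are all assumptions made for the illustration.

```shell
# Hypothetical helper: -n suppresses the default output, so only lines
# printed explicitly by the p command appear.
print_matches() {
    sed -n '/BSD/p'
}

# Hypothetical helper: each -e supplies one editing command; they run in order.
two_edits() {
    sed -e 's/xx/yy/g' -e '/^#/d'
}

# Hypothetical helper: -f reads the same commands from a script file
# (a temporary one here).
edits_from_file() {
    script=$(mktemp)
    printf 's/xx/yy/g\n/^#/d\n' > "$script"
    sed -f "$script"
    rm -f "$script"
}

printf 'BSD Unix\nSystem V\n' | print_matches
printf '# note\nxx and xx\n' | two_edits
```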
sed commands have the general form:
[address[, address]][!]command [arguments]
sed copies each line of input into the pattern space. sed instructions consist of addresses and editing commands. If the address of the command matches the line in the pattern space, then the command is applied to that line. If a command has no address, then it is applied to each input line. If a command changes the contents of the pattern space, subsequent commands and addresses will be applied to the current line in the pattern space, not the original input line.
commands consist of a single letter or symbol; they are described later, alphabetically and by group. arguments include the label supplied to b or t, the filename supplied to r or w, and the substitution flags for s. addresses are described in the next section.
A sed command can specify zero, one, or two addresses. An address can be a line number, the symbol $ (for last line), or a regular expression enclosed in slashes (/pattern/). Regular expressions are described in Section 1.3. Additionally, \n can be used to match any newline in the pattern space (resulting from the N command), but not the newline at the end of the pattern space.
| If the Command Specifies: | Then the Command Is Applied To: |
|---|---|
| No address | Each input line. |
| One address | Any line matching the address. Some commands accept only one address: a, i, r, q, and =. |
| Two comma-separated addresses | First matching line and all succeeding lines up to and including a line matching the second address. |
| An address followed by ! | All lines that do not match the address. |
Address examples:
| Example | Action |
|---|---|
| s/xx/yy/g | Substitute on all lines (all occurrences). |
| /BSD/d | Delete lines containing BSD. |
| /^BEGIN/,/^END/p | Print between BEGIN and END, inclusive. |
| /SAVE/!d | Delete any line that doesn't contain SAVE. |
| /BEGIN/,/END/!s/xx/yy/g | Substitute on all lines, except between BEGIN and END. |
Braces ({ }) are used in sed to nest one address inside another or to apply multiple commands at the same address.
[/pattern/[,/pattern/]]{
command1
command2
}
The opening curly brace must end its line, and the closing curly brace must be on a line by itself. Be sure there are no spaces after the braces.
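A brace group restricted to an address range might look like the sketch below. The function name edit_block and the BEGIN and END markers in the sample input are illustrative only.

```shell
# Hypothetical helper: apply two commands only to lines between BEGIN and END.
edit_block() {
    sed '/^BEGIN/,/^END/{
s/xx/yy/g
/^#/d
}'
}
printf 'xx\nBEGIN\nxx\n# note\nEND\nxx\n' | edit_block
```

Lines outside the range pass through unchanged, so the first and last xx survive while the one inside the block becomes yy and the comment line is deleted.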
In the lists that follow, the sed commands are grouped by function and are described tersely. Full descriptions, including syntax and examples, can be found afterward in Section 1.4.5.
| Command | Action |
|---|---|
| a\ | Append text after a line. |
| c\ | Replace text (usually a text block). |
| i\ | Insert text before a line. |
| d | Delete lines. |
| s | Make substitutions. |
| y | Translate characters (like Unix tr). |
| = | Display line number of a line. |
| l | Display control characters in ASCII. |
| p | Display the line. |
| n | Skip current line and go to line below. |
| r | Read another file's contents into the output stream. |
| w | Write input lines to another file. |
| q | Quit the sed script (no further output). |
| h | Copy into hold space; wipe out what's there. |
| H | Copy into hold space; append to what's there. |
| g | Get the hold space back; wipe out the destination line. |
| G | Get the hold space back; append to the pattern space. |
| x | Exchange contents of the hold and pattern spaces. |
| b | Branch to label or to end of script. |
| t | Same as b, but branch only after substitution. |
| :label | Label branched to by t or b. |
| N | Read another line of input (creates embedded newline). |
| D | Delete up to the embedded newline. |
| P | Print up to the embedded newline. |
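Two short scripts may help show how the hold-space and multiline commands fit together. They are sketches only, assuming a sed that accepts commands separated by semicolons; the function names are made up.

```shell
# Hypothetical helper: print the input lines in reverse order.
reverse_lines() {
    # 1!G  on every line but the first, append the hold space to the pattern space
    # h    copy the pattern space into the hold space
    # $p   on the last line, print the accumulated text
    sed -n '1!G;h;$p'
}

# Hypothetical helper: join each pair of lines into one.
join_pairs() {
    # $!N appends the next input line (except on the last line), creating an
    # embedded newline, which the substitution then replaces with a space.
    sed '$!N;s/\n/ /'
}

printf 'a\nb\nc\n' | reverse_lines
printf 'a\nb\nc\nd\n' | join_pairs
```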
This section presents the following topics:
Conceptual overview
Command-line syntax
Patterns and procedures
Built-in variables
Operators
Variables and array assignment
User-defined functions
Group listing of functions and commands
Implementation limits
Alphabetical summary of functions and commands
awk is a pattern-matching program for processing files, especially when they are databases. The new version of awk, called nawk, provides additional capabilities. (It really isn't so new. The additional features were added in 1984, and it was first shipped with System V Release 3.1 in 1987. Nevertheless, the name was never changed on most systems.) Every modern Unix system comes with a version of new awk, and its use is recommended over old awk.
Different systems vary in what the two versions are called. Some have oawk and awk, for the old and new versions, respectively. Others have awk and nawk. Still others only have awk, which is the new version. This example shows what happens if your awk is the old one:
$ awk 1 /dev/null
awk: syntax error near line 1
awk: bailing out near line 1
awk will exit silently if it is the new version.
Source code for the latest version of awk, from Bell Labs, can be downloaded starting at Brian Kernighan's home page: http://cm.bell-labs.com/~bwk. Michael Brennan's mawk is available via anonymous FTP from ftp://ftp.whidbey.net/pub/brennan/mawk1.3.3.tar.gz. Finally, the Free Software Foundation has a version of awk called gawk, available from ftp://gnudist.gnu.org/gnu/gawk/gawk-3.0.4.tar.gz. All three programs implement "new" awk. Thus, references in the following text such as "nawk only," apply to all three. gawk has additional features.
With original awk, you can:
Think of a text file as made up of records and fields in a textual database.
Perform arithmetic and string operations.
Use programming constructs such as loops and conditionals.
Produce formatted reports.
With nawk, you can also:
Define your own functions.
Execute Unix commands from a script.
Process the results of Unix commands.
Process command-line arguments more gracefully.
Work more easily with multiple input streams.
In addition, with GNU awk (gawk), you can:
Use regular expressions to separate records, as well as fields.
Skip to the start of the next file, not just the next record.
Perform more powerful string substitutions.
Retrieve and format system time values.
The syntax for invoking awk has two forms:
awk [options] 'script' var=value file(s)
awk [options] -f scriptfile var=value file(s)
You can specify a script directly on the command line, or you can store a script in a scriptfile and specify it with -f. nawk allows multiple -f scripts. Variables can be assigned a value on the command line. The value can be a literal, a shell variable ($name), or a command substitution (`cmd`), but the value is available only after the BEGIN statement is executed.
awk operates on one or more files. If none are specified (or if - is specified), awk reads from the standard input.
The recognized options are:
-Ffs
Set the field separator to fs. This is the same as setting the built-in variable FS. Original awk only allows the field separator to be a single character. nawk allows fs to be a regular expression. Each input line, or record, is divided into fields by white space (spaces or tabs) or by some other user-definable field separator. Fields are referred to by the variables $1, $2,…, $n. $0 refers to the entire record.
-v var=value
Available in nawk only. Assign a value to variable var. This allows assignment before the script begins execution.
For example, to print the first three (colon-separated) fields of each record on separate lines:
awk -F: '{ print $1; print $2; print $3 }' /etc/passwd
Numerous examples are shown later in Section 1.5.3.3.
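The difference between a var=value operand and the -v option is easiest to see by printing the variable from both BEGIN and the main procedure. This is a small sketch; the variable n, the sample input, and the shell variable names are assumptions made for the example.

```shell
# n=5 given as an operand is assigned only when awk reaches it in the
# argument list, that is, after BEGIN has already run.
late=$(printf 'x\n' | awk 'BEGIN { print "BEGIN:[" n "]" } { print "main:[" n "]" }' n=5 -)

# -v assigns the value before BEGIN runs (new awk only).
early=$(printf 'x\n' | awk -v n=5 'BEGIN { print "BEGIN:[" n "]" } { print "main:[" n "]" }')

printf '%s\n---\n%s\n' "$late" "$early"
```

In the first command, n is still empty inside BEGIN; in the second, it is already 5.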
awk scripts consist of patterns and procedures:
pattern { procedure }
Both are optional. If pattern is missing, { procedure } is applied to all lines. If { procedure } is missing, the matched line is printed.
A pattern can be any of the following:
/regular expression/
relational expression
pattern-matching expression
BEGIN
END
Expressions can be composed of quoted strings, numbers, operators, functions, defined variables, or any of the predefined variables described later under Section 1.5.4.
Regular expressions use the extended set of metacharacters and are described earlier in Section 1.3.
^ and $ refer to the beginning and end of a string (such as the fields), respectively, rather than the beginning and end of a line. In particular, these metacharacters will not match at a newline embedded in the middle of a string.
Relational expressions use the relational operators listed under "Operators" later in this book. For example, $2 > $1 selects lines for which the second field is greater than the first. Comparisons can be either string or numeric. Thus, depending on the types of data in $1 and $2, awk will do either a numeric or a string comparison. This can change from one record to the next.
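The following sketch shows how the same relational pattern can behave as a numeric or a string comparison. It assumes a new awk; the sample record and the shell variable names are invented for illustration.

```shell
# Both fields look like numbers, so $1 > $2 is a numeric comparison: 10 > 9.
numeric=$(echo '10 9' | awk '$1 > $2 { print "greater" }')

# Concatenating an empty string forces a string comparison: "10" sorts before "9".
string=$(echo '10 9' | awk '($1 "") > ($2 "") { print "greater" }')

printf 'numeric:[%s] string:[%s]\n' "$numeric" "$string"
```

Only the numeric comparison selects the record, so the second result is empty.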
Pattern-matching expressions use the operators ~ (match) and !~ (don't match). See "Operators" later in this book.
The BEGIN pattern lets you specify procedures that will take place before the first input line is processed. (Generally, you set global variables here.)
The END pattern lets you specify procedures that will take place after the last input record is read.
In nawk, BEGIN and END patterns may appear multiple times. The procedures are merged as if there had been one large procedure.
Except for BEGIN and END, patterns can be combined with the Boolean operators || (or), && (and), and ! (not). A range of lines can also be specified using comma-separated patterns:
pattern,pattern
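As an illustration, here is a range pattern and a Boolean combination applied to made-up input. The marker strings BEGIN_LIST and END_LIST are arbitrary text chosen for the example.

```shell
# Range: print every line from the first /BEGIN_LIST/ through the next /END_LIST/.
range=$(printf 'a\nBEGIN_LIST\nb\nc\nEND_LIST\nd\n' | awk '/BEGIN_LIST/,/END_LIST/')

# Combination: lines with more than one field that do not start with #.
combo=$(printf '# x y\none\ntwo three\n' | awk 'NF > 1 && !/^#/')

printf '%s\n---\n%s\n' "$range" "$combo"
```

Both scripts omit the procedure, so matching lines are simply printed.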
Procedures consist of one or more commands, functions, or variable assignments, separated by newlines or semicolons, and are contained within curly braces. Commands fall into five groups:
Variable or array assignments
Printing commands
Built-in functions
Control-flow commands
User-defined functions (nawk only)
Print first field of each line:
{ print $1 }
Print all lines that contain pattern:
/pattern/
Print first field of lines that contain pattern:
/pattern/ { print $1 }
Select records containing more than two fields:
NF > 2
Interpret input records as a group of lines up to a blank line. Each line is a single field:
BEGIN { FS = "\n"; RS = "" }
Print fields 2 and 3 in switched order, but only on lines whose first field matches the string URGENT:
$1 ~ /URGENT/ { print $3, $2 }
Count and print the number of lines that contain pattern:
/pattern/ { ++x }
END { print x }
Add numbers in second column and print total:
{ total += $2 }
END { print "column total is", total}
Print lines that contain fewer than 20 characters:
length($0) < 20
Print each line that begins with Name: and that contains exactly seven fields:
NF == 7 && /^Name:/
Print the fields of each record in reverse order, one per line:
{
for (i = NF; i >= 1; i--)
print $i
}
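The blank-line-separated record example above can be tried on a small sample. This sketch assumes two-line records; the names and cities are invented data.

```shell
# RS = "" makes each block of lines up to a blank line one record;
# FS = "\n" makes each line of the block one field.
records=$(printf 'Ann\nParis\n\nBob\nOslo\n' |
  awk 'BEGIN { FS = "\n"; RS = "" } { print NR ": " $1 " lives in " $2 }')
printf '%s\n' "$records"
```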
All awk variables are included in nawk. All nawk variables are included in gawk.
The following table lists the operators, in order of increasing precedence, that are available in awk.
| Symbol | Meaning |
|---|---|
| = += -= *= /= %= ^= **= | Assignment. |
| ?: | C conditional expression (nawk only). |
| || | Logical OR (short-circuit). |
| && | Logical AND (short-circuit). |
| in | Array membership (nawk only). |
| ~ !~ | Match regular expression and negation. |
| < <= > >= != == | Relational operators. |
| (blank) | Concatenation. |
| + - | Addition, subtraction. |
| * / % | Multiplication, division, and modulus (remainder). |
| + - ! | Unary plus and minus, and logical negation. |
| ^ ** | Exponentiation. |
| ++ -- | Increment and decrement, either prefix or postfix. |
| $ | Field reference. |
Note: While ** and **= are common extensions, they are not part of POSIX awk.
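A few one-liners can make the precedence table concrete. This is only a sketch, assuming a new awk; the expressions were chosen for illustration.

```shell
# + binds tighter than concatenation, so this prints "1 5", not "1 23".
concat=$(awk 'BEGIN { print 1 " " 2 + 3 }')

# ^ binds tighter than unary minus, so this is -(2^2).
power=$(awk 'BEGIN { print -2 ^ 2 }')

# < binds tighter than &&, which binds tighter than ||.
logic=$(awk 'BEGIN { print (1 < 2 || 0 && 0) }')

printf '%s\n%s\n%s\n' "$concat" "$power" "$logic"
```

When in doubt, parentheses make the intended grouping explicit and cost nothing.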
Variables can be assigned a value with an = sign. For example:
FS = ","
Expressions using the operators +, -, *, /, and % (modulo) can be assigned to variables.
Arrays can be created with the split( ) function (described later), or they can simply be named in an assignment statement. Array elements can be subscripted with numbers (array[1], …, array[n]) or with strings. Arrays subscripted by strings are called "associative arrays." (In fact, all arrays in awk are associative; numeric subscripts are converted to strings before using them as array subscripts. Associative arrays are one of awk's most powerful features.)
For example, to count the number of widgets you have, you could use the following script:
/widget/ { count["widget"]++ }    # Count widgets
END { print count["widget"] }     # Print the count
You can use the special for loop to read all the elements of an associative array:
for (item in array) process array[item]
The index of the array is available as item, while the value of an element of the array can be referenced as array[item].
You can use the operator in to test that an element exists by testing to see if its index exists (nawk only). For example:
if (index in array) …
tests that array[index] exists, but you cannot use it to test the value of the element referenced by array[index].
You can also delete individual elements of the array using the delete statement (nawk only).
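The array features described above (string subscripts, the for-in loop, the in operator, and delete) can be combined in one short script. This is a sketch assuming a new awk; the fruit names are sample data, and the output is piped through sort because the order in which for-in visits elements is not specified.

```shell
counts=$(printf 'apple\npear\napple\nfig\n' | awk '
  { count[$1]++ }                        # string subscripts
  END {
    delete count["fig"]                  # remove one element
    if (!("fig" in count))               # test the index, not the value
      print "fig removed"
    for (item in count)                  # item is the index
      print item, count[item]
  }' | LC_ALL=C sort)
printf '%s\n' "$counts"
```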
Within string and regular expression constants, the following escape sequences may be used.
| Sequence | Meaning | Sequence | Meaning |
|---|---|---|---|
| \a | Alert (bell) | \v | Vertical tab |
| \b | Backspace | \\ | Literal backslash |
| \f | Form feed | \nnn | Octal value nnn |
| \n | Newline | \xnn | Hexadecimal value nn |
| \r | Carriage return | \" | Literal double quote (in strings) |
| \t | Tab | \/ | Literal slash (in regular expressions) |
Note: The \x escape sequence is a common extension; it is not part of POSIX awk.
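For instance, the following sketch uses several of these escape sequences in a string constant and one in a regular expression constant. The text being printed is arbitrary.

```shell
# \t is a tab, \" a literal double quote, and \101 the octal code for "A".
escaped=$(awk 'BEGIN { printf "name\tvalue\n\"quoted\" \101\n" }')

# \/ is a literal slash inside a regular expression constant.
slash=$(echo 'usr/bin' | awk '/usr\/bin/ { print "match" }')

printf '%s\n%s\n' "$escaped" "$slash"
```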
nawk allows you to define your own functions. This makes it easy to encapsulate sequences of steps that need to be repeated into a single place, and re-use the code from anywhere in your program.
The following function capitalizes each word in a string. It has one parameter, named input, and five local variables, which are written as extra parameters:
# capitalize each word in a string
function capitalize(input,    result, words, n, i, w)
{
    result = ""
    n = split(input, words, " ")
    for (i = 1; i <= n; i++) {
        w = words[i]
        w = toupper(substr(w, 1, 1)) substr(w, 2)
        if (i > 1)
            result = result " "
        result = result w
    }
    return result
}
# main program, for testing
{ print capitalize($0) }
With this input data:
A test line with words and numbers like 12 on it.
This program produces:
A Test Line With Words And Numbers Like 12 On It.
Note: For user-defined functions, no space is allowed between the function name and the left parenthesis when the function is called.
awk functions and commands may be classified as follows:
| Group | Functions and commands |
|---|---|
| Arithmetic Functions | atan2[2], cos[2], exp, int, log, rand[2], sin[2], sqrt, srand[2] |
| String Functions | gensub[9], gsub[2], index, length, match[2], split, sprintf, sub[2], tolower[2], toupper[2] |
| Control Flow Statements | break, continue, do/while[2], exit, for, if, return[2], while |
| Input/Output Processing | close[2], fflush[16], getline[2], next, nextfile[16], printf |
| Time Functions | strftime[9], systime[9] |
| Programming | delete[2], function[2], system[2] |
[2] Available in nawk.
[9] Available in gawk.
[16] Available in Bell Labs awk and gawk.
Many versions of awk have various implementation limits, on things such as:
Number of fields per record
Number of characters per input record
Number of characters per output record
Number of characters per field
Number of characters per printf string
Number of characters in literal string
Number of characters in character class
Number of files open
Number of pipes open
The ability to handle 8-bit characters and characters that are all zero (ASCII NUL)
gawk does not have limits on any of the above items, other than those imposed by the machine architecture and/or the operating system.
The following alphabetical list of keywords and functions includes all that are available in awk, nawk, and gawk. nawk includes all old awk functions and keywords, plus some additional ones (marked as {N}). gawk includes all nawk functions and keywords, plus some additional ones (marked as {G}). Items marked with {B} are available in the Bell Labs awk. Items that aren't marked with a symbol are available in all versions.
| Command | Description |
|---|---|
| atan2 | atan2(y, x) Return the arctangent of y/x in radians. {N} |
| break | break Exit from a while, for, or do loop. |
| close | close(expr) In most implementations of awk, you can only have up to ten files open simultaneously and one pipe. Therefore, nawk provides a close function that allows you to close a file or a pipe. It takes the same expression that opened the pipe or file as an argument. This expression must be identical, character by character, to the one that opened the file or pipe—even whitespace is significant. {N} |
| continue | continue Begin next iteration of while, for, or do loop. |
| cos | cos(x) Return the cosine of x, an angle in radians. {N} |
| delete | delete array[element]; delete array Delete element from array. The brackets are typed literally. {N} The second form is a common extension, which deletes all elements of the array in one shot. {B} {G} |
| do | do statement while (expr) Looping statement. Execute statement, then evaluate expr and if true, execute statement again. A series of statements must be put within braces. {N} |
| exit | exit [expr] Exit from script, reading no new input. The END procedure, if it exists, will be executed. An optional expr becomes awk's return value. |
| exp | exp(x) Return exponential of x (e^x). |
| fflush | fflush([output-expr]) Flush any buffers associated with open output file or pipe output-expr. {B} gawk extends this function. If no output-expr is supplied, it flushes standard output. If output-expr is the null string (""), it flushes all open files and pipes. {G} |
| for | for (init-expr; test-expr; incr-expr) statement C-style looping construct. init-expr assigns the initial value of a counter variable. test-expr is a relational expression that is evaluated each time before executing the statement. When test-expr is false, the loop is exited. incr-expr is used to increment the counter variable after each pass. All of the expressions are optional. A missing test-expr is considered to be true. A series of statements must be put within braces. |
| for | for (item in array) statement Special loop designed for reading associative arrays. For each element of the array, the statement is executed; the element can be referenced by array[item]. A series of statements must be put within braces. |
| function | function name(parameter-list) { statements } Create name as a user-defined function consisting of awk statements that apply to the specified list of parameters. No space is allowed between name and the left parenthesis when the function is called. {N} |
| getline | getline [var] [< file]; command \| getline [var] Read next line of input. Original awk does not support the syntax to open multiple input streams. The first form reads input from file and the second form reads the output of command. Both forms read one record at a time, and each time the statement is executed it gets the next record of input. The record is assigned to $0 and is parsed into fields, setting NF, NR and FNR. If var is specified, the result is assigned to var and $0 and NF are not changed. Thus, if the result is assigned to a variable, the current record does not change. getline is actually a function and it returns 1 if it reads a record successfully, 0 if end-of-file is encountered, and -1 if for some reason it is otherwise unsuccessful. {N} |
| gensub | gensub(r, s, h [, t]) General substitution function. Substitute s for matches of the regular expression r in the string t. If h is a number, replace the hth match. If it is "g" or "G", substitute globally. If t is not supplied, $0 is used. Return the new string value. The original t is not modified. (Compare gsub and sub.) {G} |
| gsub | gsub(r, s [, t]) Globally substitute s for each match of the regular expression r in the string t. If t is not supplied, defaults to $0. Return the number of substitutions. {N} |
| if | if (condition) statement [else statement] If condition is true, do statement(s), otherwise do statement in optional else clause. Condition can be an expression using any of the relational operators <, <=, ==, !=, >=, or >, as well as the array membership operator in, and the pattern-matching operators ~ and !~ (e.g., if ($1 ~ /[Aa].*/)). A series of statements must be put within braces. Another if can directly follow an else in order to produce a chain of tests or decisions. |
| index | index(str, substr) Return the position (starting at 1) of substr in str, or zero if substr is not present in str. |
| int | int(x) Return integer value of x by truncating any fractional part. |
| length | length([arg]) Return length of arg, or the length of $0 if no argument. |
| log | log(x) Return the natural logarithm (base e) of x. |
| match | match(s, r) Function that matches the pattern, specified by the regular expression r, in the string s and returns either the position in s where the match begins, or 0 if no occurrences are found. Sets the values of RSTART and RLENGTH to the start and length of the match, respectively. {N} |
| next | next Read next input line and start new cycle through pattern/procedures statements. |
| nextfile | nextfile Stop processing the current input file and start new cycle through pattern/procedures statements, beginning with the first record of the next file. {B} {G} |
| print | print [output-expr[, …]] [dest-expr] Evaluate the output-expr and direct it to standard output followed by the value of ORS. Each comma-separated output-expr is separated in the output by the value of OFS. With no output-expr, print $0. The output may be redirected to a file or pipe via the dest-expr, which is described in the section "Output Redirections" following this table. |
| printf | printf(format [, expr-list ]) [ dest-expr ] An alternative output statement borrowed from the C language. It has the ability to produce formatted output. It can also be used to output data without automatically producing a newline. format is a string of format specifications and constants. expr-list is a list of arguments corresponding to format specifiers. As for print, output may be redirected to a file or pipe. See the section "printf formats" following this table for a description of allowed format specifiers. |
| rand | rand() Generate a random number between 0 and 1. This function returns the same series of numbers each time the script is executed, unless the random number generator is seeded using srand( ). {N} |
| return | return [expr] Used within a user-defined function to exit the function, returning value of expression. The return value of a function is undefined if expr is not provided. {N} |
| sin | sin(x) Return the sine of x, an angle in radians. {N} |
| split | split(string, array [, sep]) Split string into elements of array array[1],…,array[n]. The string is split at each occurrence of separator sep. If sep is not specified, FS is used. Returns the number of array elements created. |
| sprintf | sprintf(format [, expressions]) Return the formatted value of one or more expressions, using the specified format. Data is formatted but not printed. See the section "printf formats" following this table for a description of allowed format specifiers. |
| sqrt | sqrt(arg) Return square root of arg. |
| srand | srand([expr]) Use optional expr to set a new seed for the random number generator. Default is the time of day. Return value is the old seed. {N} |
| strftime | strftime([format [,timestamp]]) Format timestamp according to format. Return the formatted string. The timestamp is a time-of-day value in seconds since Midnight, January 1, 1970, UTC. The format string is similar to that of sprintf. If timestamp is omitted, it defaults to the current time. If format is omitted, it defaults to a value that produces output similar to that of the Unix date command. {G} |
| sub | sub(r, s [, t]) Substitute s for first match of the regular expression r in the string t. If t is not supplied, defaults to $0. Return 1 if successful; 0 otherwise. {N} |
| substr | substr(string, beg [, len]) Return substring of string at beginning position beg, and the characters that follow to maximum specified length len. If no length is given, use the rest of the string. |
| system | system(command) Function that executes the specified command and returns its status. The status of the executed command typically indicates success or failure. A value of 0 means that the command executed successfully. A non-zero value indicates a failure of some sort. The documentation for the command you're running will give you the details. The output of the command is not available for processing within the awk script. Use command \| getline to read the output of a command into the script. {N} |
| systime | systime( ) Return a time-of-day value in seconds since Midnight, January 1, 1970, UTC. {G} |
| tolower | tolower(str) Translate all uppercase characters in str to lowercase and return the new string.[24] {N} |
| toupper | toupper(str) Translate all lowercase characters in str to uppercase and return the new string. {N} |
| while | while (condition) statement Do statement while condition is true (see if for a description of allowable conditions). A series of statements must be put within braces. |
[24] Very early versions of nawk don't support tolower() and toupper(). However, they are now part of the POSIX specification for awk.
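To give a feel for the string functions in the table, here is a small sketch that applies several of them to one sample sentence. It assumes a new awk; the sentence and the variable names w, s, n, and c are made up for the example.

```shell
strings=$(echo 'The quick brown fox' | awk '{
  n = split($0, w, " ")          # n is the number of elements created
  print n, w[2]
  print index($0, "quick")       # position of the substring, counted from 1
  print substr($0, 5, 5)         # five characters starting at position 5
  if (match($0, /b[a-z]+/))      # sets RSTART and RLENGTH
    print RSTART, RLENGTH
  s = $0
  c = gsub(/o/, "0", s)          # returns the number of substitutions made
  print c, s
  print toupper(w[4])
}')
printf '%s\n' "$strings"
```

Note that gsub modifies its third argument in place, which is why the record is first copied into s.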
For print and printf, dest-expr is an optional expression that directs the output to a file or pipe.
> file
Directs the output to a file, overwriting its previous contents.
>> file
Appends the output to a file, preserving its previous contents. In both of these cases, the file will be created if it does not already exist.
| command
Directs the output as the input to a system command.
Be careful not to mix > and >> for the same file. Once a file has been opened with >, subsequent output statements continue to append to the file until it is closed.
Remember to call close() when you have finished with a file or pipe. If you don't, eventually you will hit the system limit on the number of simultaneously open files.
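The following sketch ties output redirection, close(), and getline together: it writes two lines to a file, closes it, reads it back, and then reads one line from a command. It assumes a new awk and a system with mktemp; the awk variable names out, line, n, and greeting are invented for the example.

```shell
tmp=$(mktemp)
roundtrip=$(awk -v out="$tmp" 'BEGIN {
  print "first"  > out                # opens the file, discarding old contents
  print "second" > out                # file is already open, so this follows "first"
  close(out)                          # close it before reading it back
  while ((getline line < out) > 0)    # 1 = got a record, 0 = end of file, -1 = error
    n++
  close(out)
  "echo hello" | getline greeting     # read one record from a command
  close("echo hello")                 # same string that opened the pipe
  print n, greeting
}')
rm -f "$tmp"
printf '%s\n' "$roundtrip"
```

The parentheses around the getline expression keep the < redirection from being read as part of the comparison.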
Format specifiers for printf and sprintf have the following form:
%[flag][width][.precision]letter
The control letter is required. The format conversion control letters are given in the following table.
| Character | Description |
|---|---|
| c | ASCII character. |
| d | Decimal integer. |
| i | Decimal integer. (Added in POSIX) |
| e | Floating-point format ([-]d.precisione[+-]dd). |
| E | Floating-point format ([-]d.precisionE[+-]dd). |
| f | Floating-point format ([-]ddd.precision). |
| g | e or f conversion, whichever is shortest, with trailing zeros removed. |
| G | E or f conversion, whichever is shortest, with trailing zeros removed. |
| o | Unsigned octal value. |
| s | String. |
| x | Unsigned hexadecimal number. Uses a-f for 10 to 15. |
| X | Unsigned hexadecimal number. Uses A-F for 10 to 15. |
| % | Literal %. |
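The optional flag is one of the following:
| Character | Description |
|---|---|
| - | Left-justify the formatted value within the field. |
| space | Prefix positive values with a space and negative values with a minus. |
| + | Always prefix numeric values with a sign, even if the value is positive. |
| # | Use an alternate form: %o has a preceding 0; %x and %X are prefixed with 0x and 0X, respectively; %e, %E, and %f always have a decimal point in the result; and %g and %G do not have trailing zeros removed. |
| 0 | Pad output with zeros, not spaces. This happens only when the field width is wider than the converted result. |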
The optional width is the minimum number of characters to output. The result will be padded to this size if it is smaller. The 0 flag causes padding with zeros; otherwise, padding is with spaces.
The precision is optional. Its meaning varies by control letter, as shown in this table:
| Conversion | Precision Means |
|---|---|
| %d, %i, %o, %u, %x, %X | The minimum number of digits to print. |
| %e, %E, %f | The number of digits to the right of the decimal point. |
| %g, %G | The maximum number of significant digits. |
| %s | The maximum number of characters to print. |
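A few printf calls show how width, precision, and flags interact. This is a sketch only; the values being formatted are arbitrary, and the brackets are there just to make the padding visible.

```shell
formatted=$(awk 'BEGIN {
  printf "[%-6s][%6s]\n", "ab", "cd"        # left- and right-justified in a width of 6
  printf "[%6.2f][%e]\n", 3.14159, 1234.5   # precision: digits after the decimal point
  printf "[%05d][%x][%o]\n", 42, 255, 8     # zero padding, hexadecimal, octal
  printf "[%.3s][%c][%d%%]\n", "abcdef", "xyz", 50
}')
printf '%s\n' "$formatted"
```

With %s the precision truncates the string, and with %c a string argument contributes only its first character.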
A number of Unix text-processing utilities let you search for, and in some cases change, text patterns rather than fixed strings. These utilities include the editing programs ed, ex, vi, and sed, the awk programming language, and the commands grep and egrep. Text patterns (formally called regular expressions) contain normal characters mixed with special characters (called metacharacters).
This section presents the following topics:
Filenames versus patterns
List of metacharacters available to each program
Description of metacharacters
Examples
Metacharacters used in pattern matching are different from metacharacters used for filename expansion. When you issue a command on the command line, special characters are seen first by the shell, then by the program; therefore, unquoted metacharacters are interpreted by the shell for filename expansion. The command:
$ grep [A-Z]* chap[12]
could, for example, be transformed by the shell into:
$ grep Array.c Bug.c Comp.c chap1 chap2
and would then try to find the pattern Array.c in files Bug.c, Comp.c, chap1, and chap2. To bypass the shell and pass the special characters to grep, use quotes:
$ grep "[A-Z]*" chap[12]
Double quotes suffice in most cases, but single quotes are the safest bet.
Note also that in pattern matching, ? matches zero or one instance of a regular expression; in filename expansion, ? matches a single character.
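The effect of quoting can be observed with echo, which simply prints whatever arguments the shell hands it. This sketch assumes mktemp is available; the scratch directory and the filenames in it are created only for the demonstration.

```shell
dir=$(mktemp -d)
touch "$dir/Array.c" "$dir/Bug.c" "$dir/1notes"

# Unquoted: the shell replaces the pattern with matching filenames
# before the command ever sees it.
unquoted=$(cd "$dir" && echo [A-Z]*)

# Quoted: the characters reach the command unchanged.
quoted=$(cd "$dir" && echo '[A-Z]*')

rm -rf "$dir"
printf '%s\n%s\n' "$unquoted" "$quoted"
```

A command such as grep receives its arguments the same way, which is why an unquoted pattern can silently turn into a list of filenames.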
The characters in the following table have special meaning only in search patterns:
Many Unix systems allow the use of POSIX "character classes" within the square brackets that enclose a group of characters. These are typed enclosed in [: and :]. For example, [[:alnum:]] matches a single alphanumeric character.
| Class | Characters Matched |
|---|---|
| alnum | Alphanumeric characters |
| alpha | Alphabetic characters |
| blank | Space or tab |
| cntrl | Control characters |
| digit | Decimal digits |
| graph | Non-space characters |
| lower | Lowercase characters |
| print | Printable characters |
| space | White-space characters |
| upper | Uppercase characters |
| xdigit | Hexadecimal digits |
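As a quick check of the character classes, the following sketch runs grep over three made-up lines. It assumes a grep that supports POSIX character classes.

```shell
# Lines containing a decimal digit.
digits=$(printf 'abc\nA1\nx y\n' | grep '[[:digit:]]')

# Count of lines containing white space.
spaces=$(printf 'abc\nA1\nx y\n' | grep -c '[[:space:]]')

# Lines containing anything other than a letter or digit.
notalnum=$(printf 'abc\nA1\nx y\n' | grep '[^[:alnum:]]')

printf '%s\n%s\n%s\n' "$digits" "$spaces" "$notalnum"
```

The class name goes inside its own [: :] pair, which in turn sits inside the usual square brackets.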
The characters in the following table have special meaning only in replacement patterns.
| Character | Pattern |
|---|---|
| \ | Turn off the special meaning of the following character. |
| \n | Restore the text matched by the nth pattern previously saved by \( and \). n is a number from 1 to 9, with 1 starting on the left. |
| & | Reuse the text matched by the search pattern as part of the replacement pattern. |
| ~ | Reuse the previous replacement pattern in the current replacement pattern. Must be the only character in the replacement pattern. (ex and vi). |
| % | Reuse the previous replacement pattern in the current replacement pattern. Must be the only character in the replacement pattern. (ed). |
| \u | Convert first character of replacement pattern to uppercase. |
| \U | Convert entire replacement pattern to uppercase. |
| \l | Convert first character of replacement pattern to lowercase. |
| \L | Convert entire replacement pattern to lowercase. |
Some metacharacters are valid for one program but not for another. Those that are available to a Unix program are marked by a bullet (•) in the following table. (This table is correct for SVR4 and Solaris and most commercial Unix systems, but it's always a good idea to verify your system's behavior.) Items marked with a "P" are specified by POSIX; double check your system's version. Full descriptions were provided in the previous section.
| Symbol | ed | ex, vi | sed, grep | awk, egrep | Action |
|---|---|---|---|---|---|
| . | • | • | • | • | Match any character. |
| * | • | • | • | • | Match zero or more preceding. |
| ^ | • | • | • | • | Match beginning of line/string. |
| $ | • | • | • | • | Match end of line/string. |
| \ | • | • | • | • | Escape following character. |
| [ ] | • | • | • | • | Match one from a set. |
| \( \) | • | • | • | | Store pattern for later replay.[1] |
| \n | • | • | • | | Replay sub-pattern in match. |
| { } | | | | P | Match a range of instances. |
| \{ \} | • | | • | | Match a range of instances. |
| \< \> | • | • | | | Match word's beginning or end. |
| + | | | | • | Match one or more preceding. |
| ? | | | | • | Match zero or one preceding. |
| \| | | | | • | Separate choices to match. |
| ( ) | | | | • | Group expressions to match. |
[1] Stored sub-patterns can be "replayed" during matching. See the examples, below.
Note that in ed, ex, vi, and sed, you specify both a search pattern (on the left) and a replacement pattern (on the right). The metacharacters above are meaningful only in a search pattern.
In ed, ex, vi, and sed, the following metacharacters are valid only in a replacement pattern:
| Symbol | ex | vi | sed | ed | Action |
|---|---|---|---|---|---|
| \ | • | • | • | • | Escape following character. |
| \n | • | • | • | • | Text matching pattern stored in \( \). |
| & | • | • | • | • | Text matching search pattern. |
| ~ | • | • | | | Reuse previous replacement pattern. |
| % | | | | • | Reuse previous replacement pattern. |
| \u \U | • | • | | | Change character(s) to uppercase. |
| \l \L | • | • | | | Change character(s) to lowercase. |
| \E | • | • | | | Turn off previous \U or \L. |
| \e | • | • | | | Turn off previous \u or \l. |
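The replacement metacharacters that sed supports can be tried directly from the shell. This is a brief sketch; the input strings are arbitrary.

```shell
# & stands for whatever the search pattern matched.
amp=$(echo 'call 555 now' | sed 's/[0-9][0-9]*/(&)/')

# \& is a literal ampersand.
lit=$(echo 'salt and pepper' | sed 's/and/\&/')

# \1 replays the text saved by the first \( \) pair.
dup=$(echo 'abc' | sed 's/\(b\)/\1\1/')

printf '%s\n%s\n%s\n' "$amp" "$lit" "$dup"
```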
When used with grep or egrep, regular expressions should be surrounded by quotes. (If the pattern contains a $, you must use single quotes; e.g., 'pattern'.) When used with ed, ex, sed, and awk, regular expressions are usually surrounded by / although (except for awk), any delimiter works. Here are some example patterns.
| Pattern | What Does It Match? |
|---|---|
| bag | The string bag. |
| ^bag | bag at the beginning of the line. |
| bag$ | bag at the end of the line. |
| ^bag$ | bag as the only word on the line. |
| [Bb]ag | Bag or bag. |
| b[aeiou]g | Second letter is a vowel. |
| b[^aeiou]g | Second letter is a consonant (or uppercase or symbol). |
| b.g | Second letter is any character. |
| ^...$ | Any line containing exactly three characters. |
| ^\. | Any line that begins with a dot. |
| ^\.[a-z][a-z] | Same, followed by two lowercase letters (e.g., troff requests). |
| ^\.[a-z]\{2\} | Same as previous, ed, grep and sed only. |
| ^[^.] | Any line that doesn't begin with a dot. |
| bugs* | bug, bugs, bugss, etc. |
| "word" | A word in quotes. |
| "*word"* | A word, with or without quotes. |
| [A-Z][A-Z]* | One or more uppercase letters. |
| [A-Z]+ | Same as previous, egrep or awk only. |
| [[:upper:]]+ | Same as previous, POSIX egrep or awk. |
| [A-Z].* | An uppercase letter, followed by zero or more characters. |
| [A-Z]* | Zero or more uppercase letters. |
| [a-zA-Z] | Any letter, either lower- or uppercase. |
| [^0-9A-Za-z] | Any symbol or space (not a letter or a number). |
| [^[:alnum:]] | Same, using POSIX character class. |
| egrep or awk pattern | What Does It Match? |
|---|---|
| [567] | One of the numbers 5, 6, or 7. |
| five|six|seven | One of the words five, six, or seven. |
| 80[2-4]?86 | 8086, 80286, 80386, or 80486. |
| 80[2-4]?86|(Pentium(-II)?) | 8086, 80286, 80386, 80486, Pentium, or Pentium-II. |
| compan(y|ies) | company or companies. |
| ex or vi pattern | What Does It Match? |
|---|---|
| \<the | Words like theater, there or the. |
| the\> | Words like breathe, seethe or the. |
| \<the\> | The word the. |
| ed, sed, or grep pattern | What Does It Match? |
|---|---|
| 0\{5,\} | Five or more zeros in a row. |
| [0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\} | U.S. Social Security number (nnn-nn-nnnn). |
| \(why\).*\1 | A line with two occurrences of why. |
| \([[:alpha:]_][[:alnum:]_.]*\) = \1; | C/C++ simple assignment statements. |
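A few of the patterns above can be tried directly with grep. This is a minimal sketch, assuming a POSIX grep; the sample text and the variable names are made up for illustration.

```shell
# Hypothetical sample input: one candidate string per line.
sample='bag
Bag
big
b9g
00000123
why, oh why'

# Count lines where a lowercase vowel sits between b and g (bag, big).
vowel=$(printf '%s\n' "$sample" | grep -c 'b[aeiou]g')

# Count lines where the middle character is anything but a lowercase vowel (b9g).
other=$(printf '%s\n' "$sample" | grep -c 'b[^aeiou]g')

# Interval expression: five or more zeros in a row.
zeros=$(printf '%s\n' "$sample" | grep '0\{5,\}')

# Back-reference: a line with two occurrences of "why".
twice=$(printf '%s\n' "$sample" | grep '\(why\).*\1')

printf '%s\n' "$vowel" "$other" "$zeros" "$twice"
```

Note that Bag is not counted by either bracket expression, because the patterns require a lowercase b.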
The following examples show the metacharacters available to sed or ex. Note that ex commands begin with a colon.
Finally, some sed examples for transposing words. A simple transposition of two words might look like this:
s/die or do/do or die/      Transpose words
The real trick is to use hold buffers (the \( \) groups, recalled in the replacement as \1 and \2) to transpose variable patterns. For example:
s/\([Dd]ie\) or \([Dd]o\)/\2 or \1/      Transpose, using hold buffers
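Both transpositions can be run from the shell. The sketch below assumes a POSIX sed; the input strings and variable names are chosen only for illustration.

```shell
# Fixed-string transposition: only the exact phrase is swapped.
fixed=$(echo 'die or do' | sed 's/die or do/do or die/')

# Saved subpatterns: \1 and \2 replay whatever the two groups matched,
# so the capitalization travels with each word.
saved=$(echo 'Die or do' | sed 's/\([Dd]ie\) or \([Dd]o\)/\2 or \1/')

printf '%s\n' "$fixed" "$saved"
```

The second command turns "Die or do" into "do or Die", which the fixed-string version would leave untouched.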
This section presents the following topics:
Conceptual overview of sed
Command-line syntax
Syntax of sed commands
Group summary of sed commands
Alphabetical summary of sed commands
sed is a non-interactive, or stream-oriented, editor. It interprets a script and performs the actions in the script. sed is stream-oriented because, like many Unix programs, input flows through the program and is directed to standard output. For example, sort is stream-oriented; vi is not. sed's input typically comes from a file or pipe, but it can also be directed from the keyboard. Output goes to the screen by default but can be captured in a file or sent through a pipe instead.
The Free Software Foundation has a version of sed, available from ftp://gnudist.gnu.org/gnu/sed/sed-3.02.tar.gz. The somewhat older version, 2.05, is also available.
Typical uses of sed include:
Editing one or more files automatically
Simplifying repetitive edits to multiple files
Writing conversion programs
sed operates as follows:
Each line of input is copied into a "pattern space," an internal buffer where editing operations are performed.
All editing commands in a sed script are applied, in order, to each line of input.
Editing commands are applied to all lines (globally) unless line addressing restricts the lines affected.
If a command changes the input, subsequent commands and address tests will be applied to the current line in the pattern space, not the original input line.
The original input file is unchanged because the editing commands modify a copy of each original input line. The copy is sent to standard output (but can be redirected to a file).
sed also maintains the "hold space," a separate buffer that can be used to save data for later retrieval.
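Two of the points above are easy to observe from the shell: later commands see the edited pattern space, and the input file is left alone. This is a small sketch, assuming a POSIX sed and the widely available (but not POSIX) mktemp utility; the file and variable names are hypothetical.

```shell
# Commands run in order against the pattern space, so the second
# substitution sees "dog" (the result of the first), not the original "cat".
chained=$(echo 'cat' | sed -e 's/cat/dog/' -e 's/dog/bird/')

# The input file itself is never modified; only the copy is edited.
tmpfile=$(mktemp)
echo 'cat' > "$tmpfile"
edited=$(sed 's/cat/dog/' "$tmpfile")
original=$(cat "$tmpfile")
rm -f "$tmpfile"

printf '%s\n' "$chained" "$edited" "$original"
```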
The syntax for invoking sed has two forms:
sed [-n] [-e] 'command' file(s)
sed [-n] -f scriptfile file(s)
The first form allows you to specify an editing command on the command line, surrounded by single quotes. The second form allows you to specify a scriptfile, a file containing sed commands. Both forms may be used together, and they may be used multiple times. If no file(s) is specified, sed reads from standard input.
The following options are recognized:
-n
Suppress the default output; sed displays only those lines specified with the p command or with the p flag of the s command.
-e cmd
Next argument is an editing command. Useful if multiple scripts or commands are specified.
-f file
Next argument is a file containing editing commands.
If the first line of the script is #n, sed behaves as if -n had been specified.
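The three options can be exercised as follows. This is a sketch only, assuming a POSIX sed and mktemp for the temporary script file; the sample lines and variable names are invented for the example.

```shell
input='alpha
beta
gamma'

# -n suppresses default output; only lines printed by p appear.
only_beta=$(printf '%s\n' "$input" | sed -n '/beta/p')

# Several -e commands form one script, applied in order to every line.
multi=$(printf '%s\n' "$input" | sed -e 's/alpha/ALPHA/' -e 's/gamma/GAMMA/')

# -f reads the commands from a script file (a temporary file here).
script=$(mktemp)
printf '%s\n' '/beta/d' 's/a/A/g' > "$script"
from_file=$(printf '%s\n' "$input" | sed -f "$script")
rm -f "$script"

printf '%s\n' "$only_beta" "$multi" "$from_file"
```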
sed commands have the general form:
[address[, address]][!]command [arguments]
sed copies each line of input into the pattern space. sed instructions consist of addresses and editing commands. If the address of the command matches the line in the pattern space, then the command is applied to that line. If a command has no address, then it is applied to each input line. If a command changes the contents of the pattern space, subsequent commands and addresses will be applied to the current line in the pattern space, not the original input line.
commands consist of a single letter or symbol; they are described later, alphabetically and by group. arguments include the label supplied to b or t, the filename supplied to r or w, and the substitution flags for s. addresses are described in the next section.
A sed command can specify zero, one, or two addresses. An address can be a line number, the symbol $ (for last line), or a regular expression enclosed in slashes (/pattern/). Regular expressions are described in Section 1.3. Additionally, \n can be used to match any newline in the pattern space (resulting from the N command), but not the newline at the end of the pattern space.
| If the Command Specifies: | Then the Command Is Applied To: |
|---|---|
| No address | Each input line. |
| One address | Any line matching the address. Some commands accept only one address: a, i, r, q, and =. |
| Two comma-separated addresses | First matching line and all succeeding lines up to and including a line matching the second address. |
| An address followed by ! | All lines that do not match the address. |
Examples:
| Command | Result |
|---|---|
| s/xx/yy/g | Substitute on all lines (all occurrences). |
| /BSD/d | Delete lines containing BSD. |
| /^BEGIN/,/^END/p | Print between BEGIN and END, inclusive. |
| /SAVE/!d | Delete any line that doesn't contain SAVE. |
| /BEGIN/,/END/!s/xx/yy/g | Substitute on all lines, except between BEGIN and END. |
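The address forms can be compared side by side on a small input. The following is a sketch, assuming a POSIX sed; the five sample lines and the variable names are made up.

```shell
input='one
BEGIN
two
END
three'

# Line-number address: delete only line 1.
no_first=$(printf '%s\n' "$input" | sed '1d')

# Two pattern addresses: print the range, inclusive (-n suppresses the rest).
block=$(printf '%s\n' "$input" | sed -n '/^BEGIN/,/^END/p')

# Negated range: delete every line that is not inside the range.
kept=$(printf '%s\n' "$input" | sed '/^BEGIN/,/^END/!d')

# $ addresses the last line of input.
last=$(printf '%s\n' "$input" | sed -n '$p')

printf '%s\n' "$no_first" "$block" "$kept" "$last"
```

Here the range printed with p and the lines kept by !d are the same three lines, which shows that ! simply inverts the address test.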
Braces ({ }) are used in sed to nest one address inside another or to apply multiple commands at the same address.
[/pattern/[,/pattern/]]{
command1
command2
}
The opening curly brace must end its line, and the closing curly brace must be on a line by itself. Be sure there are no spaces after the braces.
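As an illustration of grouping, the sketch below applies two commands to one address range and nests a second address inside it. It assumes a POSIX sed; the input text and variable names are hypothetical.

```shell
input='header
BEGIN
x = 1
y = 2
END
x = 3'

# Inside the BEGIN..END range, rename x to z and drop the END marker.
# The opening brace ends its line and the closing brace stands alone.
result=$(printf '%s\n' "$input" | sed '/^BEGIN/,/^END/{
s/x/z/
/^END/d
}')

printf '%s\n' "$result"
```

The final "x = 3" line is outside the range, so neither grouped command touches it.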
In the lists that follow, the sed commands are grouped by function and are described tersely. Full descriptions, including syntax and examples, can be found afterward in Section 1.4.5.
| Command | Action |
|---|---|
| a\ | Append text after a line. |
| c\ | Replace text (usually a text block). |
| i\ | Insert text before a line. |
| d | Delete lines. |
| s | Make substitutions. |
| y | Translate characters (like Unix tr). |
| = | Display line number of a line. |
| l | Display control characters in ASCII. |
| p | Display the line. |
| n | Skip current line and go to line below. |
| r | Read another file's contents into the output stream. |
| w | Write input lines to another file. |
| q | Quit the sed script (no further output). |
| h | Copy into hold space; wipe out what's there. |
| H | Copy into hold space; append to what's there. |
| g | Get the hold space back; wipe out the destination line. |
| G | Get the hold space back; append to the pattern space. |
| x | Exchange contents of the hold and pattern spaces. |
| b | Branch to label or to end of script. |
| t | Same as b, but branch only after substitution. |
| :label | Label branched to by t or b. |
| N | Read another line of input (creates embedded newline). |
| D | Delete up to the embedded newline. |
| P | Print up to the embedded newline. |
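Two short sketches show the hold-space and multiline commands at work. They assume a sed that accepts semicolon-separated commands on one line (GNU and BSD sed both do); the inputs and variable names are invented for the example.

```shell
# Hold space: reverse the order of the input lines.
#   1!G  on every line but the first, append the hold space to the pattern space
#   h    copy the result back into the hold space
#   $p   on the last line, print the accumulated text
reversed=$(printf '%s\n' one two three | sed -n '1!G;h;$p')

# N appends the next input line with an embedded newline,
# which \n can then match in a substitution.
joined=$(printf '%s\n' 'first half,' 'second half' | sed 'N;s/,\n/, /')

printf '%s\n' "$reversed" "$joined"
```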
This section presents the following topics:
Conceptual overview
Command-line syntax
Patterns and procedures
Built-in variables
Operators
Variables and array assignment
User-defined functions
Group listing of functions and commands
Implementation limits
Alphabetical summary of functions and commands
awk is a pattern-matching program for processing files, especially when they are databases. The new version of awk, called nawk, provides additional capabilities. (It really isn't so new. The additional features were added in 1984, and it was first shipped with System V Release 3.1 in 1987. Nevertheless, the name was never changed on most systems.) Every modern Unix system comes with a version of new awk, and its use is recommended over old awk.
Different systems vary in what the two versions are called. Some have oawk and awk, for the old and new versions, respectively. Others have awk and nawk. Still others only have awk, which is the new version. This example shows what happens if your awk is the old one:
$ awk 1 /dev/null
awk: syntax error near line 1
awk: bailing out near line 1
awk will exit silently if it is the new version.
Source code for the latest version of awk, from Bell Labs, can be downloaded starting at Brian Kernighan's home page: http://cm.bell-labs.com/~bwk. Michael Brennan's mawk is available via anonymous FTP from ftp://ftp.whidbey.net/pub/brennan/mawk1.3.3.tar.gz. Finally, the Free Software Foundation has a version of awk called gawk, available from ftp://gnudist.gnu.org/gnu/gawk/gawk-3.0.4.tar.gz. All three programs implement "new" awk. Thus, references in the following text such as "nawk only," apply to all three. gawk has additional features.
With original awk, you can:
Think of a text file as made up of records and fields in a textual database.
Perform arithmetic and string operations.
Use programming constructs such as loops and conditionals.
Produce formatted reports.
With nawk, you can also:
Define your own functions.
Execute Unix commands from a script.
Process the results of Unix commands.
Process command-line arguments more gracefully.
Work more easily with multiple input streams.
In addition, with GNU awk (gawk), you can:
Use regular expressions to separate records, as well as fields.
Skip to the start of the next file, not just the next record.
Perform more powerful string substitutions.
Retrieve and format system time values.
The syntax for invoking awk has two forms:
awk [options] 'script' var=value file(s)
awk [options] -f scriptfile var=value file(s)
You can specify a script directly on the command line, or you can store a script in a scriptfile and specify it with -f. nawk allows multiple -f scripts. Variables can be assigned a value on the command line. The value can be a literal, a shell variable ($name), or a command substitution (`cmd`), but the value is available only after the BEGIN statement is executed.
awk operates on one or more files. If none are specified (or if - is specified), awk reads from the standard input.
The recognized options are:
-Ffs
Set the field separator to fs. This is the same as setting the built-in variable FS. Original awk only allows the field separator to be a single character. nawk allows fs to be a regular expression. Each input line, or record, is divided into fields by white space (spaces or tabs) or by some other user-definable field separator. Fields are referred to by the variables $1, $2,…, $n. $0 refers to the entire record.
-v var=value
Available in nawk only. Assign a value to variable var. This allows assignment before the script begins execution.
For example, to print the first three (colon-separated) fields of each record on separate lines:
awk -F: '{ print $1; print $2; print $3 }' /etc/passwd
Numerous examples are shown later in Section 1.5.3.3.
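The difference between a var=value operand and the -v option is easiest to see by printing the variable from BEGIN. The following is a sketch, assuming a new awk (nawk, mawk, or gawk); the record and the variable names are hypothetical.

```shell
# Hypothetical passwd-style record, colon-separated.
record='root:x:0:0:root:/root:/bin/sh'

# -F sets the field separator; $1 and $7 are the first and seventh fields.
fields=$(echo "$record" | awk -F: '{ print $1, $7 }')

# A var=value operand is assigned only when awk reaches it in the argument
# list, so BEGIN sees an empty value while the main rule sees "late".
# The trailing - tells awk to read standard input.
operand=$(echo x | awk 'BEGIN { print "BEGIN:[" v "]" } { print "main:[" v "]" }' v=late -)

# -v assigns before the script starts, so BEGIN already sees the value.
option=$(awk -v v=early 'BEGIN { print "BEGIN:[" v "]" }')

printf '%s\n' "$fields" "$operand" "$option"
```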
awk scripts consist of patterns and procedures:
pattern { procedure }
Both are optional. If pattern is missing, { procedure } is applied to all lines. If { procedure } is missing, the matched line is printed.
A pattern can be any of the following:
/regular expression/
relational expression
pattern-matching expression
BEGIN
END
Expressions can be composed of quoted strings, numbers, operators, functions, defined variables, or any of the predefined variables described later under Section 1.5.4.
Regular expressions use the extended set of metacharacters and are described earlier in Section 1.3.
^ and $ refer to the beginning and end of a string (such as the fields), respectively, rather than the beginning and end of a line. In particular, these metacharacters will not match at a newline embedded in the middle of a string.
Relational expressions use the relational operators listed under "Operators" later in this book. For example, $2 > $1 selects lines for which the second field is greater than the first. Comparisons can be either string or numeric. Thus, depending on the types of data in $1 and $2, awk will do either a numeric or a string comparison. This can change from one record to the next.
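The switch between numeric and string comparison can be seen with two one-line inputs. This is a sketch, assuming a POSIX-conforming awk; the inputs and variable names are chosen for illustration.

```shell
# Both fields look numeric, so 10 > 9 is a numeric comparison and the line prints.
numeric=$(echo '10 9' | awk '$1 > $2')

# "9x" is not a number, so the comparison is done on strings; "10" sorts
# before "9x" and nothing is printed.
string=$(echo '10 9x' | awk '$1 > $2')

printf 'numeric: [%s]\nstring: [%s]\n' "$numeric" "$string"
```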
Pattern-matching expressions use the operators ~ (match) and !~ (don't match). See "Operators" later in this book.
The BEGIN pattern lets you specify procedures that will take place before the first input line is processed. (Generally, you set global variables here.)
The END pattern lets you specify procedures that will take place after the last input record is read.
In nawk, BEGIN and END patterns may appear multiple times. The procedures are merged as if there had been one large procedure.
Except for BEGIN and END, patterns can be combined with the Boolean operators || (or), && (and), and ! (not). A range of lines can also be specified using comma-separated patterns:
pattern,pattern
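Range patterns and Boolean combinations might be used as in the following sketch. It assumes any awk; the numbered sample lines and variable names are made up.

```shell
input='1 start
2 keep
3 stop
4 skip
5 keep'

# Range pattern: from the first line matching /start/ through the next /stop/.
range=$(printf '%s\n' "$input" | awk '/start/,/stop/')

# Boolean AND: lines containing "keep" whose first field exceeds 2.
both=$(printf '%s\n' "$input" | awk '/keep/ && $1 > 2')

# Negation: first field of every line that does not contain "keep".
neg=$(printf '%s\n' "$input" | awk '!/keep/ { print $1 }')

printf '%s\n' "$range" "$both" "$neg"
```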
Procedures consist of one or more commands, functions, or variable assignments, separated by newlines or semicolons, and are contained within curly braces. Commands fall into five groups:
Variable or array assignments
Printing commands
Built-in functions
Control-flow commands
User-defined functions (nawk only)
Print first field of each line:
{ print $1 }
Print all lines that contain pattern:
/pattern/
Print first field of lines that contain pattern:
/pattern/ { print $1 }
Select records containing more than two fields:
NF > 2
Interpret input records as a group of lines up to a blank line. Each line is a single field:
BEGIN { FS = "\n"; RS = "" }
Print fields 2 and 3 in switched order, but only on lines whose first field matches the string URGENT:
$1 ~ /URGENT/ { print $3, $2 }
Count and print the number of lines that contain pattern:
/pattern/ { ++x }
END { print x }
Add numbers in second column and print total:
{ total += $2 }
END { print "column total is", total}
Print lines that contain fewer than 20 characters:
length($0) < 20
Print each line that begins with Name: and that contains exactly seven fields:
NF == 7 && /^Name:/
Print the fields of each record in reverse order, one per line:
{
    for (i = NF; i >= 1; i--)
        print $i
}
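Two of the examples above, run against concrete input. This is a sketch, assuming an awk that supports the empty RS convention (nawk, mawk, and gawk do); the address-book data and variable names are hypothetical.

```shell
# Hypothetical address-book input: records separated by a blank line.
input='Ann
12 Elm St

Bob
9 Oak Ave
Springfield'

# With RS = "" a blank line ends a record, and FS = "\n" makes each line a field.
summary=$(printf '%s\n' "$input" |
    awk 'BEGIN { FS = "\n"; RS = "" } { print $1 " has " NF " lines" }')

# Accumulate the second column and report it in END.
total=$(printf '%s\n' 'a 5' 'b 7' |
    awk '{ total += $2 } END { print "column total is", total }')

printf '%s\n' "$summary" "$total"
```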
All awk variables are included in nawk. All nawk variables are included in gawk.
The following table lists the operators, in order of increasing precedence, that are available in awk.
| Symbol | Meaning |
|---|---|
| = += -= *= /= %= ^= **= | Assignment. |
| ?: | C conditional expression (nawk only). |
| || | Logical OR (short-circuit). |
| && | Logical AND (short-circuit). |
| in | Array membership (nawk only). |
| ~ !~ | Match regular expression and negation. |
| < <= > >= != == | Relational operators. |
| (blank) | Concatenation. |
| + - | Addition, subtraction. |
| * / % | Multiplication, division, and modulus (remainder). |
| + - ! | Unary plus and minus, and logical negation. |
| ^ ** | Exponentiation. |
| ++ -- | Increment and decrement, either prefix or postfix. |
| $ | Field reference. |
Note: While ** and **= are common extensions, they are not part of POSIX awk.
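A few expressions whose results depend on the precedence order in the table. This is a minimal sketch, assuming a POSIX awk; the right-to-left grouping of ^ shown in the third line is standard awk behavior rather than something stated in the table.

```shell
result=$(awk 'BEGIN {
    print 2 + 3 * 4        # * binds tighter than +
    print -2 ^ 2           # ^ binds tighter than unary minus
    print 2 ^ 3 ^ 2        # ^ groups right to left: 2 ^ (3 ^ 2)
    print 1 " " 2 + 3      # + binds tighter than concatenation
}')

printf '%s\n' "$result"
```

The last line prints "1 5" rather than "1 23", because the addition is done before the concatenation.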
Variables can be assigned a value with an = sign. For example:
FS = ","
Expressions using the operators +, -, *, /, and % (modulo) can be assigned to variables.
Arrays can be created with the split( ) function (described later), or they can simply be named in an assignment statement. Array elements can be subscripted with numbers (array[1], …, array[n]) or with strings. Arrays subscripted by strings are called "associative arrays." (In fact, all arrays in awk are associative; numeric subscripts are converted to strings before using them as array subscripts. Associative arrays are one of awk's most powerful features.)
For example, to count the number of widgets you have, you could use the following script:
/widget/ { count["widget"]++ }      # Count widgets
END { print count["widget"] }       # Print the count
You can use the special for loop to read all the elements of an associative array:
for (item in array) process array[item]
The index of the array is available as item, while the value of an element of the array can be referenced as array[item].
You can use the operator in to test that an element exists by testing to see if its index exists (nawk only). For example:
if (index in array) …
tests that array[index] exists, but you cannot use it to test the value of the element referenced by array[index].
You can also delete individual elements of the array using the delete statement (nawk only).
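The array operations described here (string subscripts, the for-in loop, the in operator, and delete) can be combined as in the sketch below. It assumes a new awk; the input words and variable names are invented, and the output is piped through sort because the traversal order of for-in is unspecified.

```shell
input='widget
gadget
widget
gizmo
widget'

report=$(printf '%s\n' "$input" | awk '
    { count[$1]++ }                      # string subscripts create elements on demand
    END {
        delete count["gizmo"]            # remove one element
        if (!("gizmo" in count))         # membership test does not create the element
            print "gizmo removed"
        for (item in count)              # visit every remaining subscript
            print item, count[item]
    }' | sort)

printf '%s\n' "$report"
```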
Within string and regular expression constants, the following escape sequences may be used.
| Sequence | Meaning | Sequence | Meaning |
|---|---|---|---|
| \a | Alert (bell) | \v | Vertical tab |
| \b | Backspace | \\ | Literal backslash |
| \f | Form feed | \nnn | Octal value nnn |
| \n | Newline | \xnn | Hexadecimal value nn |
| \r | Carriage return | \" | Literal double quote (in strings) |
| \t | Tab | \/ | Literal slash (in regular expressions) |
Note: The \x escape sequence is a common extension; it is not part of POSIX awk.
nawk allows you to define your own functions. This makes it easy to encapsulate sequences of steps that need to be repeated into a single place, and re-use the code from anywhere in your program.
The following function capitalizes each word in a string. It has one parameter, named input, and five local variables, which are written as extra parameters:
# capitalize each word in a string
function capitalize(input,    result, words, n, i, w)
{
    result = ""
    n = split(input, words, " ")
    for (i = 1; i <= n; i++) {
        w = words[i]
        w = toupper(substr(w, 1, 1)) substr(w, 2)
        if (i > 1)
            result = result " "
        result = result w
    }
    return result
}
# main program, for testing
{ print capitalize($0) }
With this input data:
A test line with words and numbers like 12 on it.
This program produces:
A Test Line With Words And Numbers Like 12 On It.
Note: For user-defined functions, no space is allowed between the function name and the left parenthesis when the function is called.
awk functions and commands may be classified as follows:
| Category | Functions and Commands |
|---|---|
| Arithmetic Functions | atan2[2], cos[2], exp, int, log, rand[2], sin[2], sqrt, srand[2] |
| String Functions | gensub[9], gsub[2], index, length, match[2], split, sprintf, sub[2], substr, tolower[2], toupper[2] |
| Control Flow Statements | break, continue, do/while[2], exit, for, if, return[2], while |
| Input/Output Processing | close[2], fflush[16], getline[2], next, nextfile[16], print, printf |
| Time Functions | strftime[9], systime[9] |
| Programming | delete[2], function[2], system[2] |
[2] Available in nawk.
[9] Available in gawk.
[16] Available in Bell Labs awk and gawk.
Many versions of awk have various implementation limits, on things such as:
Number of fields per record
Number of characters per input record
Number of characters per output record
Number of characters per field
Number of characters per printf string
Number of characters in literal string
Number of characters in character class
Number of files open
Number of pipes open
The ability to handle 8-bit characters and characters that are all zero (ASCII NUL)
gawk does not have limits on any of the above items, other than those imposed by the machine architecture and/or the operating system.
The following alphabetical list of keywords and functions includes all that are available in awk, nawk, and gawk. nawk includes all old awk functions and keywords, plus some additional ones (marked as {N}). gawk includes all nawk functions and keywords, plus some additional ones (marked as {G}). Items marked with {B} are available in the Bell Labs awk. Items that aren't marked with a symbol are available in all versions.
| Command | Description |
|---|---|
| atan2 | atan2(y, x) Return the arctangent of y/x in radians. {N} |
| break | break Exit from a while, for, or do loop. |
| close | close(expr) In most implementations of awk, you can only have up to ten files open simultaneously and one pipe. Therefore, nawk provides a close function that allows you to close a file or a pipe. It takes the same expression that opened the pipe or file as an argument. This expression must be identical, character by character, to the one that opened the file or pipe—even whitespace is significant. {N} |
| continue | continue Begin next iteration of while, for, or do loop. |
| cos | cos(x) Return the cosine of x, an angle in radians. {N} |
| delete | delete array[element] delete array Delete element from array. The brackets are typed literally. {N} The second form is a common extension, which deletes all elements of the array at one shot. {B} {G} |
| do | do statement while (expr) Looping statement. Execute statement, then evaluate expr and, if true, execute statement again. A series of statements must be put within braces. {N} |
| exit | exit [expr] Exit from script, reading no new input. The END procedure, if it exists, will be executed. An optional expr becomes awk's return value. |
| exp | exp(x) Return exponential of x (e^x). |
| fflush | fflush([output-expr]) Flush any buffers associated with open output file or pipe output-expr. {B} gawk extends this function. If no output-expr is supplied, it flushes standard output. If output-expr is the null string (""), it flushes all open files and pipes. {G} |
| for | for (init-expr; test-expr; incr-expr) statement C-style looping construct. init-expr assigns the initial value of a counter variable. test-expr is a relational expression that is evaluated each time before executing the statement. When test-expr is false, the loop is exited. incr-expr is used to increment the counter variable after each pass. All of the expressions are optional. A missing test-expr is considered to be true. A series of statements must be put within braces. |
| for | for (item in array) statement Special loop designed for reading associative arrays. For each element of the array, the statement is executed; the element can be referenced by array[item]. A series of statements must be put within braces. |
| function | function name(parameter-list) { statements } Create name as a user-defined function consisting of awk statements that apply to the specified list of parameters. No space is allowed between name and the left parenthesis when the function is called. {N} |
| getline | getline [var] [< file] command | getline [var] Read next line of input. Original awk does not support the syntax to open multiple input streams. The first form reads input from file and the second form reads the output of command. Both forms read one record at a time, and each time the statement is executed it gets the next record of input. The record is assigned to $0 and is parsed into fields, setting NF, NR and FNR. If var is specified, the result is assigned to var and $0 and NF are not changed. Thus, if the result is assigned to a variable, the current record does not change. getline is actually a function and it returns 1 if it reads a record successfully, 0 if end-of-file is encountered, and -1 if for some reason it is otherwise unsuccessful. {N} |
| gensub | gensub(r, s, h [, t]) General substitution function. Substitute s for matches of the regular expression r in the string t. If h is a number, replace the hth match. If it is "g" or "G", substitute globally. If t is not supplied, $0 is used. Return the new string value. The original t is not modified. (Compare gsub and sub.) {G} |
| gsub | gsub(r, s [, t]) Globally substitute s for each match of the regular expression r in the string t. If t is not supplied, defaults to $0. Return the number of substitutions. {N} |
| if | if (condition) statement [else statement] If condition is true, do statement(s); otherwise do statement in optional else clause. Condition can be an expression using any of the relational operators <, <=, ==, !=, >=, or >, as well as the array membership operator in, and the pattern-matching operators ~ and !~ (e.g., if ($1 ~ /[Aa].*/)). A series of statements must be put within braces. Another if can directly follow an else in order to produce a chain of tests or decisions. |
| index | index(str, substr) Return the position (starting at 1) of substr in str, or zero if substr is not present in str. |
| int | int(x) Return integer value of x by truncating any fractional part. |
| length | length([arg]) Return length of arg, or the length of $0 if no argument. |
| log | log(x) Return the natural logarithm (base e) of x. |
| match | match(s, r) Function that matches the pattern, specified by the regular expression r, in the string s and returns either the position in s where the match begins, or 0 if no occurrences are found. Sets the values of RSTART and RLENGTH to the start and length of the match, respectively. {N} |
| next | next Read next input line and start new cycle through pattern/procedures statements. |
| nextfile | nextfile Stop processing the current input file and start new cycle through pattern/procedures statements, beginning with the first record of the next file. {B} {G} |
| print | print [output-expr[, …]] [dest-expr] Evaluate the output-expr and direct it to standard output followed by the value of ORS. Each comma-separated output-expr is separated in the output by the value of OFS. With no output-expr, print $0. The output may be redirected to a file or pipe via the dest-expr, which is described in the section "Output Redirections" following this table. |
| printf | printf(format [, expr-list ]) [ dest-expr ] An alternative output statement borrowed from the C language. It has the ability to produce formatted output. It can also be used to output data without automatically producing a newline. format is a string of format specifications and constants. expr-list is a list of arguments corresponding to format specifiers. As for print, output may be redirected to a file or pipe. See the section "printf formats" following this table for a description of allowed format specifiers. |
| rand | rand() Generate a random number between 0 and 1. This function returns the same series of numbers each time the script is executed, unless the random number generator is seeded using srand( ). {N} |
| return | return [expr] Used within a user-defined function to exit the function, returning value of expression. The return value of a function is undefined if expr is not provided. {N} |
| sin | sin(x) Return the sine of x, an angle in radians. {N} |
| split | split(string, array [, sep]) Split string into elements of array array[1],…,array[n]. The string is split at each occurrence of separator sep. If sep is not specified, FS is used. Returns the number of array elements created. |
| sprintf | sprintf(format [, expressions]) Return the formatted value of one or more expressions, using the specified format. Data is formatted but not printed. See the section "printf formats" following this table for a description of allowed format specifiers. |
| sqrt | sqrt(arg) Return square root of arg. |
| srand | srand([expr]) Use optional expr to set a new seed for the random number generator. Default is the time of day. Return value is the old seed. {N} |
| strftime | strftime([format [,timestamp]]) Format timestamp according to format. Return the formatted string. The timestamp is a time-of-day value in seconds since Midnight, January 1, 1970, UTC. The format string is similar to that of sprintf. If timestamp is omitted, it defaults to the current time. If format is omitted, it defaults to a value that produces output similar to that of the Unix date command. {G} |
| sub | sub(r, s [, t]) Substitute s for first match of the regular expression r in the string t. If t is not supplied, defaults to $0. Return 1 if successful; 0 otherwise. {N} |
| substr | substr(string, beg [, len]) Return substring of string at beginning position beg, and the characters that follow to maximum specified length len. If no length is given, use the rest of the string. |
| system | system(command) Function that executes the specified command and returns its status. The status of the executed command typically indicates success or failure. A value of 0 means that the command executed successfully. A non-zero value indicates a failure of some sort. The documentation for the command you're running will give you the details. The output of the command is not available for processing within the awk script. Use command | getline to read the output of a command into the script. {N} |
| systime | systime( ) Return a time-of-day value in seconds since Midnight, January 1, 1970, UTC. {G} |
| tolower | tolower(str) Translate all uppercase characters in str to lowercase and return the new string.[24] {N} |
| toupper | toupper(str) Translate all lowercase characters in str to uppercase and return the new string. {N} |
| while | while (condition) statement Do statement while condition is true (see if for a description of allowable conditions). A series of statements must be put within braces. |
[24] Very early versions of nawk don't support tolower() and toupper(). However, they are now part of the POSIX specification for awk.
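Several of the string functions from the table can be tried together. The following is a sketch, assuming a new awk (nawk, mawk, or gawk); the sample string and variable names are made up.

```shell
result=$(awk 'BEGIN {
    s = "sed and awk and more"

    t = s
    n = gsub(/and/, "AND", t)         # replaces every match, returns the count
    print n, t

    u = s
    sub(/and/, "or", u)               # only the first match is replaced
    print u

    if (match(s, /awk/))              # sets RSTART and RLENGTH
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)

    k = split("a:b:c", parts, ":")    # returns the number of elements
    print k, parts[1], parts[3]

    print index(s, "awk"), length(s), toupper(substr(s, 1, 3))
}')

printf '%s\n' "$result"
```

Copying s into t and u first keeps the original string intact, since sub and gsub modify their third argument in place.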
For print and printf, dest-expr is an optional expression that directs the output to a file or pipe.
> file
Directs the output to a file, overwriting its previous contents.
>> file
Appends the output to a file, preserving its previous contents. In both of these cases, the file will be created if it does not already exist.
| command
Directs the output as the input to a system command.
Be careful not to mix > and >> for the same file. Once a file has been opened with >, subsequent output statements continue to append to the file until it is closed.
Remember to call close() when you have finished with a file or pipe. If you don't, eventually you will hit the system limit on the number of simultaneously open files.
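The behavior of > and the role of close() can be checked with a temporary file and a pipe. This is a sketch, assuming a new awk and the mktemp utility; the file, the awk variable f, and the shell variable names are hypothetical.

```shell
tmpfile=$(mktemp)

# Two prints with > to the same file: the file is truncated once, when it is
# first opened, and the second print appends. close() lets getline reopen it.
lines=$(awk -v f="$tmpfile" 'BEGIN {
    print "one" > f
    print "two" > f
    close(f)
    while ((getline line < f) > 0)
        n++
    print n
}')

# Output piped to a command; close() waits for sort to finish, so its
# output appears before the final "done".
sorted=$(printf '%s\n' 3 1 2 |
    awk '{ print | "sort -n" } END { close("sort -n"); print "done" }')

rm -f "$tmpfile"
printf '%s\n' "$lines" "$sorted"
```

The string passed to close() matches the one that opened the pipe character for character, as the close entry in the table requires.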
Format specifiers for printf and sprintf have the following form:
%[flag][width][.precision]letter
The control letter is required. The format conversion control letters are given in the following table.
| Character | Description |
|---|---|
| c | ASCII character. |
| d | Decimal integer. |
| i | Decimal integer. (Added in POSIX) |
| e | Floating-point format ([-]d.precisione[+-]dd). |
| E | Floating-point format ([-]d.precisionE[+-]dd). |
| f | Floating-point format ([-]ddd.precision). |
| g | e or f conversion, whichever is shortest, with trailing zeros removed. |
| G | E or f conversion, whichever is shortest, with trailing zeros removed. |
| o | Unsigned octal value. |
| s | String. |
| x | Unsigned hexadecimal number. Uses a-f for 10 to 15. |
| X | Unsigned hexadecimal number. Uses A-F for 10 to 15. |
| % | Literal %. |
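The optional flag is one of the following:
| Character | Description |
|---|---|
| - | Left-justify the formatted value within the field. |
| space | Prefix positive values with a space and negative values with a minus. |
| + | Always prefix numeric values with a sign, even if the value is positive. |
| # | Use an alternate form: %o has a preceding 0; %x and %X are prefixed with 0x and 0X, respectively; %e, %E, and %f always have a decimal point in the result; and %g and %G do not have trailing zeros removed. |
| 0 | Pad output with zeros, not spaces. This happens only when the field width is wider than the converted result. |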
The optional width is the minimum number of characters to output. The result will be padded to this size if it is smaller. The 0 flag causes padding with zeros; otherwise, padding is with spaces.
The precision is optional. Its meaning varies by control letter, as shown in this table:
| Conversion | Precision Means |
|---|---|
| %d, %i, %o, %u, %x, %X | The minimum number of digits to print. |
| %e, %E, %f | The number of digits to the right of the decimal point. |
| %g, %G | The maximum number of significant digits. |
| %s | The maximum number of characters to print. |
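The effect of width, precision, and a few control letters can be seen in the following sketch. It assumes a POSIX awk; the values are arbitrary and the brackets are only there to make the padding visible.

```shell
formatted=$(awk 'BEGIN {
    printf "[%5d] [%-5d] [%05d]\n", 42, 42, 42    # width, left-justify, zero padding
    printf "[%.2f] [%e]\n", 3.14159, 1234.5       # precision for f; default precision for e
    printf "[%.3s] [%x] [%o] [%c]\n", "abcdef", 255, 8, "A"
    printf "[%d%%]\n", 50                         # %% prints a literal percent sign
}')

printf '%s\n' "$formatted"
```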