From: jerk (徐子陵), Board: Unix
Subject: Unix Unleased -15
Posted from: 饮水思源站 (Fri Nov 20 20:15:42 1998), internal mail
15
Awk, Awk
By Ann Marshall
Overview
Uses
Features
Brief History
Fundamentals
Entering Awk from the Command Line
Files for Input
The Program File
Specifying Output on the Command Line
Patterns and Actions
Input
Fields
Program Format
A Note on awk Error Messages
Print Selected Fields
Program Components
The Input File and Program
Patterns
BEGIN and END
Expressions
String Matching
Range Patterns
Compound Patterns
Actions
Variables
Naming
Awk in a Shell Script
Built-in Variables
Conditions (No IFs, &&s or buts)
The if Statement
The Conditional Statement
Patterns as Conditions
Loops
Increment and Decrement
The While Statement
The Do Statement
The For Statement
Loop Control
Strings
Built-In String Functions
String Constants
Arrays
Array Specialties
Arithmetic
Operators
Numeric Functions
Input and Output
Input
The Getline Statement
Output
The printf Statement
Closing Files and Pipes
Command Line Arguments
Passing Command Line Arguments
Setting Variables on the Command Line
Functions
Function Definition
Parameters
Variables
Function Calls
The Return Statement
Writing Reports
BEGIN and END Revisited
The Built-in System Function
Advanced Concepts
Multi-Line Records
Multidimensional Arrays
Summary
Further Reading
Obtaining Source Code
Overview
The UNIX utility awk is a pattern matching and processing language with
considerably more power than you may realize. It searches one or more specified
files, checking for records that match a specified pattern. If awk finds a
match, the corresponding action is performed. A simple concept, but it results
in a powerful tool. Often an awk program is only a few lines long, and because
of this, an awk program is often written, used, and discarded. A traditional
programming language, such as Pascal or C, would take more thought, more lines
of code, and hence, more time. Short awk programs arise from two of its built-in
features: the amount of predefined flexibility and the number of details that
are handled by the language automatically. Together, these features allow the
manipulation of large data files in short (often single-line) programs, and make
awk stand apart from other programming languages. Certainly any time you spend
learning awk will pay dividends in improved productivity and efficiency.
Uses
The uses for awk vary from the simple to the complex. Originally awk was
intended for various kinds of data manipulation. Intentionally omitting parts of
a file, counting occurrences in a file, and writing reports are naturals for
awk.
Awk uses the syntax of the C programming language, so if you know C, you have an
idea of awk syntax. If you are new to programming or don't know C, learning awk
will familiarize you with many of the C constructs.
Examples of where awk can be helpful abound. Computer-aided manufacturing, for
example, is plagued with nonstandardization, so the output of a computer that's
running a particular tool is quite likely to be incompatible with the input
required for a different tool. Rather than write any complex C program, this
type of simple data transformation is a perfect awk task.
One real problem of computer-aided manufacturing today is that no standard
format yet exists for the program running the machine. Therefore, the output
from Computer A running Machine A probably is not the input needed for Computer
B running Machine B. Although Machine A is finished with the material, Machine B
is not ready to accept it. Production halts while someone edits the file so it
meets Computer B's needed format. This is a perfect and simple awk task.
Due to the amount of built-in automation within awk, it is also useful for rapid
prototyping or trying out an idea that could later be implemented in another
language.
Features
Reflecting the UNIX environment, awk features resemble the structures of both C
and shell scripts. Highlights include its flexibility, predefined variables,
automation, standard program constructs, conventional variable types, powerful
output formatting borrowed from C, and ease of use.
The flexibility means that most tasks may be done more than one way in awk. With
the application in mind, the programmer chooses which method to use. The
built-in variables already provide many of the tools to do what is needed. Awk
is highly automated. For instance, awk automatically retrieves each record,
separates it into fields, and does type conversion when needed without
programmer request. Furthermore, there are no variable declarations. Awk
includes the "usual" programming constructs for the control of program flow: an
if statement for two way decisions and do, for and while statements for looping.
Awk also includes its own notational shorthand to ease typing. (This is UNIX
after all!) Awk borrows the printf() statement from C to allow "pretty" and
versatile formats for output. These features combine to make awk user friendly.
Brief History
Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan created awk in 1977.
(The name is from the creators' last initials.) In 1985, more features were
added, creating nawk (new awk). For quite a while, nawk remained exclusively the
property of AT&T Bell Labs. Although it became part of System V with Release
3.1, some versions of UNIX, like SunOS, keep both awk and nawk due to a syntax
incompatibility. Others, like System V, run nawk under the name awk (although
System V has nawk too). The Free Software Foundation's GNU project introduced
its own version of awk, gawk, based on the IEEE POSIX awk standard (Institute of
Electrical and Electronics Engineers, Inc., IEEE Standard for Information
Technology, Portable Operating System Interface, Part 2: Shell and Utilities,
Volume 2, ANSI approved 4/5/93), which differs from both awk and nawk. Linux, PC
shareware UNIX, uses gawk rather than awk or nawk. Throughout this chapter I
have used the word awk when any of the three versions fits the concept. The
versions are mostly upwardly compatible. Awk is the oldest, then nawk, then
POSIX awk, then gawk, as shown below. I have used the notation version++ to
denote a concept that began in that version and continues through any later
versions.
NOTE: Due to the different syntax, old awk code is not guaranteed to run
unchanged under nawk. However,
except as noted, all the concepts of awk are implemented in nawk (and gawk).
Where it matters, I have specified the version.
Figure 15.1. The evolution of awk.
Refer to the end of the chapter for more information and further resources on
awk and its derivatives.
Fundamentals
This section introduces the basics of the awk programming language. Although my
discussion first skims the surface of each topic to familiarize you with how awk
functions, later sections of the chapter go into greater detail. One feature of
awk that almost continually holds true is this: you can do most tasks more than
one way. The command line exemplifies this. First, I explain the variety of ways
awk may be called from the command line—using files for input, the program file,
and possibly an output file. Next, I introduce the main construct of awk, which
is the pattern action statement. Then, I explain the fundamental ways awk can
read and transform input. I conclude the section with a look at the format of an
awk program.
Entering Awk from the Command Line
In its simplest form, awk takes the material you want to process from standard
input and displays the results to standard output (the monitor). You write the
awk program on the command line. The following table shows the various ways you
can enter awk and input material for processing.
You can either specify explicit awk statements on the command line, or, with the
-f flag, specify an awk program file that contains a series of awk commands. In
addition to the standard UNIX design allowing for standard input and output, you
can, of course, use file redirection in your shell, too, so awk 'program' <
inputfile is functionally identical to awk 'program' inputfile. To save the
output in a file, again use file redirection: awk 'program' inputfile >
outputfile does the trick. Helpfully, awk can work with
multiple input files at once if they are specified on the command line.
The most common way to see people use awk is as part of a command pipe, where
it's filtering the output of a command. An example is ls -l | awk '{print $3}'
which would print just the third column of each line of the ls command. Awk
scripts can become quite complex, so if you have a standard set of filter rules
that you'd like to apply to a file, with the output sent directly to the
printer, you could use something like awk -f myawkscript inputfile | lp.
TIP: If you opt to specify your awk script on the command line, you'll find it
best to use single quotes to let you use spaces and to ensure that the command
shell doesn't falsely interpret any portion of the command.
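As a rough illustration of the invocation styles described above, the following sketch feeds the same two-line sample to one awk program in four ways. The sample data, the temporary directory, and the file names input.txt and second.awk are made up for this example.

```shell
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
printf 'alpha 1\nbeta 2\n' > "$tmp/input.txt"

# 1. Input file named after the program.
from_file=$(awk '{print $2}' "$tmp/input.txt")
# 2. Same input supplied through shell redirection.
from_redirect=$(awk '{print $2}' < "$tmp/input.txt")
# 3. Same input arriving through a pipe.
from_pipe=$(cat "$tmp/input.txt" | awk '{print $2}')
# 4. Program kept in a file and loaded with -f.
echo '{print $2}' > "$tmp/second.awk"
from_progfile=$(awk -f "$tmp/second.awk" "$tmp/input.txt")

echo "$from_file"
rm -r "$tmp"
```

All four variables should hold the same two lines, the second field of each record.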
Files for Input
These input and output places can be changed if desired. You can specify an
input file by typing the name of the file after the program with a blank space
between the two. If you name no input file, the input enters the awk environment
from your workstation keyboard (standard input). To signal the end of the input,
type Ctrl+d. The program on the command line executes on the input you just
entered and the results are displayed on the monitor (the standard output).
Here's a simple little awk command that echoes all lines I type, prefacing each
with the number of words (or fields, in awk parlance, hence the NF variable for
number of fields) in the line. (Note that Ctrl+d means that while holding down
the Control key you should press the d key).
$ awk '{print NF ": " $0}'
I am testing my typing.
A quick brown fox jumps when vexed by lazy ducks.
Ctrl+d
5: I am testing my typing.
10: A quick brown fox jumps when vexed by lazy ducks.
$ _
You can also name more than one input file on the command line, causing the
combined files to act as one input. This is one way of having multiple runs
through one input file.
TIP: Keep in mind that the correct ordering on the command line is crucial for
your program to work correctly: files are read from left to right, so if you
want to have file1 and file2 read in that order, you'll need to specify them as
such on the command line.
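A small sketch of the left-to-right ordering, assuming two throwaway files named file1 and file2 created just for this example. It numbers each record with NR, a built-in record counter described later in the chapter, to show that the files are read as one continuous input.

```shell
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/file1"
printf 'c\n'    > "$tmp/file2"

# file1 is read first, then file2; NR keeps counting across both.
forward=$(awk '{print NR ": " $0}' "$tmp/file1" "$tmp/file2")
# Swapping the names swaps the order in which records arrive.
reversed=$(awk '{print NR ": " $0}' "$tmp/file2" "$tmp/file1")

echo "$forward"
echo "$reversed"
rm -r "$tmp"
```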
The Program File
With awk's automatic type conversion, a file of names and a file of numbers
entered in the reverse order at the command line generate strange-looking output
rather than an error message. That is why for longer programs, it is simpler to
put the program in a file and specify the name of the file on the command line.
The -f option does this. Notice that, as with most UNIX commands, the option and
the program file name that follows it come before the input files, so an input
file is still the last parameter.
NOTE: Versions of awk that meet the POSIX awk specifications are allowed to have
multiple -f options. You can use this for running multiple programs using the
same input.
Specifying Output on the Command Line
Output from awk may be redirected to a file or piped to another program (see
Chapter 4). The command awk '/^5/ {print $0}' | grep 3, for example, will result
in just those lines that start with the digit five (that's what the awk part
does) and also contain the digit three (the grep command). If you wanted to save
that output to a file, by contrast, you could use awk '/^5/ {print $0}' > results
and the file results would contain all lines prefaced by the digit 5. If you opt
for neither of these courses, the output of awk will be displayed on your screen
directly, which can be quite useful in many instances, particularly when you're
developing—or fine tuning—your awk script.
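The pipe described above can be tried on a few made-up input lines. This is only a sketch; the four numbers are sample data invented for the example.

```shell
# awk keeps the lines that start with 5; grep then keeps those containing a 3.
out=$(printf '53\n35\n51\n503\n' | awk '/^5/ {print $0}' | grep 3)
echo "$out"
```

Of the sample lines, 35 is dropped by awk (it does not start with 5) and 51 is dropped by grep (it contains no 3).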
Patterns and Actions
Awk programs are divided into three main blocks: the BEGIN block, the
per-record processing block, and the END block. Unless explicitly stated, all
statements to awk appear in the per-record block (you'll see later where the
other blocks can come in particularly handy for programming, though).
Statements within awk are divided into two parts: a pattern, telling awk what to
match, and a corresponding action, telling awk what to do when a line matching
the pattern is found. The action part of a pattern action statement is enclosed
in curly braces ({}) and may be multiple statements. Either part of a pattern
action statement may be omitted. An action with no specified pattern matches
every record of the input file you want to search (that's how the earlier
example of {print $0} worked). A pattern without an action indicates that you
want input records to be copied to the output file as they are (i.e., printed).
The example of /^5/ {print $0} is an example of a two-part statement: the
pattern here is all lines that begin with the digit five (the ^ indicates that
it should appear at the beginning of the line: without it the pattern would say
any line that includes the digit five) and the action is print the entire line
verbatim. ($0 is shorthand for the entire line.)
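To make the three forms of a pattern action statement concrete, here is a minimal sketch run against three invented input lines: a pattern alone, an action alone, and both together.

```shell
data='5 apples
7 pears
52 plums'

# Pattern only: matching records are printed as they are.
pattern_only=$(printf '%s\n' "$data" | awk '/^5/')
# Action only: the action runs for every record.
action_only=$(printf '%s\n' "$data" | awk '{print $2}')
# Pattern and action: the action runs only for matching records.
both=$(printf '%s\n' "$data" | awk '/^5/ {print $2}')

echo "$pattern_only"; echo "$action_only"; echo "$both"
```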
Input
Awk automatically scans, in order, each record of the input file, testing it
against each pattern action statement in the awk program. Unless otherwise set,
awk assumes each record is a single line. (See "Multi-Line Records" in the
"Advanced Concepts" section for how to change this.) If the input file has
blank lines in it, the blank lines count as a record too. Awk automatically
retrieves each record for analysis; there is no read statement in awk.
A programmer may also disrupt the automatic input order in one of two ways: the next
and exit statements. The next statement tells awk to retrieve the next record
from the input file and continue without running the current input record
through the remaining portion of pattern action statements in the program. For
example, if you are doing a crossword puzzle and all the letters of a word are
formed by previous words, most likely you wouldn't even bother to read that clue
but simply skip to the clue below; this is how the next statement would work, if
your list of clues were the input. The other method of disrupting the usual flow
of input is through the exit statement. The exit statement transfers control to
the END block—if one is specified—or quits the program, as if all the input has
been read; suppose the arrival of a friend ends your interest in the crossword
puzzle, but you still put the paper away. Within the END block, an exit
statement causes the program to quit.
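The effect of next and exit can be sketched with a few invented input lines. The keywords skip and stop in the data are arbitrary markers chosen for this example, not anything awk itself recognizes.

```shell
out=$(printf 'skip me\nkeep 1\nkeep 2\nstop\nkeep 3\n' | awk '
/^skip/ { next }          # fetch the next record; later rules never see this one
/^stop/ { exit }          # stop reading input and go straight to END
        { count++; print }
END     { print "printed " count " records" }')
echo "$out"
```

The record after stop is never read, yet the END block still runs and reports the count.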
An input record refers to the entire line of a file including any characters,
spaces, or Tabs. The spaces and tabs are called whitespace.
TIP: If you think that your input file may include both spaces and tabs, you can
save yourself a lot of confusion by ensuring that all tabs become spaces with
the expand program. It works like this: expand filename | awk '{ stuff }'.
The whitespace in the input file and the whitespace in the output file are not
related and any whitespace you want in the output file, you must explicitly put
there.
Fields
A group of characters in the input record or output file is called a field.
Fields are predefined in awk: $1 is the first field, $2 is the second, $3 is the
third, and so on. $0 indicates the entire line. Fields are separated by a field
separator (any single character including Tab), held in the variable FS. Unless
you change it, FS has a space as its value; with that default, any run of spaces
or tabs separates fields. FS may be changed by either starting
the programfile with the following statement:
BEGIN {FS = "char" }
or by setting the -Fchar command line option where char is the selected field
separator character you want to use.
One file that you might have viewed which demonstrates where changing the field
separator could be helpful is the /etc/passwd file that defines all user
accounts. Rather than having the different fields separated by spaces or tabs,
the password file is structured with lines:
news:?:6:11:USENET News:/usr/spool/news:/bin/ksh
Each field is separated by a colon! You could change each colon to a space (with
sed, for example), but that wouldn't work too well: notice that the fifth field,
USENET News, contains a space already. Better to change the field separator. If
you wanted to just have a list of the fifth fields in each line, therefore, you
could use the simple awk command awk -F: '{print $5}' /etc/passwd.
Likewise, the built-in variable OFS holds the value of the output field
separator. OFS also has a default value of a space. It, too, may be changed by
placing the following line at the start of a program.
BEGIN {OFS = "char" }
If you want to automatically translate the passwd file so that it listed only
the first and fifth fields, separated by a tab, you can therefore use the awk
script:
BEGIN { FS=":" ; OFS="\t" }
{ print $1, $5 }
Notice here that the script contains two blocks: the BEGIN block and the main
per-input line block. Also notice that most of the work is done automatically.
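Here is a small runnable sketch of the same idea, using the sample passwd line shown earlier as inline input so that no real /etc/passwd is needed. It sets the separators once in a BEGIN block and once with -F, and both forms should give the same result.

```shell
line='news:?:6:11:USENET News:/usr/spool/news:/bin/ksh'

# Separators set inside the program.
in_begin=$(echo "$line" | awk 'BEGIN { FS = ":"; OFS = "\t" } { print $1, $5 }')
# Input separator set with -F; output separator still set in BEGIN.
with_flag=$(echo "$line" | awk -F: 'BEGIN { OFS = "\t" } { print $1, $5 }')

echo "$in_begin"
```

The space inside the fifth field survives because only the colon splits fields.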
Program Format
With a few noted exceptions, awk programs are free format. The interpreter
ignores any blank lines in a programfile. Add them to improve the readability of
your program whenever you wish. The same is true for Tabs and spaces between
operators and the parts of a program. Therefore, these two lines are treated
identically by the awk interpreter.
$4 == 2 {print "Two"}
$4 == 2 { print "Two" }
If more than one statement appears on a line, you'll need to separate
them with a semicolon, as shown above in the BEGIN block for the passwd file
translator. If you stick with one-command-per-line then you won't need to worry
too much about the semicolons. There are a couple of spots, however, where the
semicolon must always be used: before an else statement or when included in the
syntax of a statement. (See the "Loops" or "The Conditional Statement"
sections.) However, you may always put a semicolon at the end of a statement.
The other format restriction for awk programs is that at least the opening curly
bracket of the action half of a pattern action statement must be on the same
line as the accompanying pattern, if both pattern and action exist. Thus, the
following examples all do the same thing.
The first shows all statements on one line:
$2==0 {print ""; print ""; print "";}
The second with the first statement on the same line as the pattern to match:
$2==0 { print ""
print ""
print ""}
and finally as spread out as possible:
$2==0 {
print ""
print ""
print ""
}
When the second field of the input file is equal to 0, awk prints three blank
lines to the output file.
NOTE: Notice that print "" prints a blank line to the output file, whereas the
statement print alone prints the current input line.
When you look at an awk program file, you may also find commentary within.
Anything typed from a # to the end of the line is considered a comment and is
ignored by awk. They are notes to anyone reading the program to explain what is
going on in words, not computerese.
A Note on awk Error Messages
Awk error messages (when they appear) tend to be cryptic. Often, due to the
brevity of the program, a typo is easily found. Not all errors are as obvious; I
have scattered some examples of errors throughout this chapter.
Print Selected Fields
Awk includes three ways to specify printing. The first is implied. A pattern
without an action assumes that the action is to print. The two ways of actively
commanding awk to print are print and printf(). For now, I am going to stick to
using only implied printing and the print statement. printf is discussed in a
later section ("Input and Output") and is used mainly for precise output. This
section demonstrates the first two types of printing through some step-by-step
examples.
Program Components
If I want to be sure the System Administrator spelled my name correctly in the
/etc/passwd file, I enter an awk command to find a match but omit an action.
The following command line puts a list on-screen.
$ awk '/Ann/' /etc/passwd
amarshal:oPWwC9qVWI/ps:2005:12:Ann Marshall:/usr/grad/amarshal:/bin/csh
andhs26:0TFnZSVwcua3Y:2488:23:DeAnn O'Neal:/usr/lstudent/andhs26:/bin/csh
alewis:VYfz4EatT4OoA:2623:22:Annie Lewis:/usr/lteach/alewis:/bin/csh
cmcintyr:0FciKEDDMkauU:2630:22:Carol Ann McIntyre:/usr/lteach/cmcintyr:/bin/csh
jflanaga:ShrMnyDwLI/mM:2654:22:JoAnn Flanagan:/usr/lteach/jflanaga:/bin/csh
lschultz:mic35ZiFj9zWk:3060:22:Lee Ann Schultz, :/usr/lteach/lschultz:/bin/csh
akestle:job57Lb5/ofoE:3063:22:Ann Kestle.:/usr/lteach/akestle:/bin/csh
bakehs59:yRYV6BtcW7wFg:3075:23:DeAnna Adlington, Baker :/usr/bakehs59:/bin/csh
ahernan:AZZPQNCkw6ffs:3144:23:Ann Hernandez:/usr/lstudent/ahernan:/bin/csh
$ _
I look on the monitor and see the correct spelling.
ERROR NOTE: For the sake of making a point, suppose I had chosen the pattern
/Anne/. A quick glance above shows that there would be no matches. Entering awk
'/Anne/' /etc/passwd will therefore produce nothing but another system prompt to
the monitor. This can be confusing if you expect output. The same goes the other
way; above, I wanted the name Ann, but the names DeAnn, Annie and DeAnna
matched, too. Sometimes choosing a pattern too long or too short can cause an
unneeded headache.
TIP: If a pattern match is not found, look for a typo in the pattern you are
trying to match.
Printing specified fields of an ASCII (plain text) file is a straightforward awk
task. Because this program example is so short, only the input is in a file. The
first input file, "sales", is a file of car sales by month. The file consists of
each salesperson's name, followed by three monthly sales figures. The last field
is that person's total sales for the quarter.
The Input File and Program
$ cat sales
John Anderson,12,23,7,42
Joe Turner,10,25,15,50
Susan Greco,15,13,18,46
Bob Burmeister,8,21,17,46
The following command line prints the salesperson's name and the total sales for
the first quarter.
$ awk -F, '{print $1,$5}' sales
John Anderson 42
Joe Turner 50
Susan Greco 46
Bob Burmeister 46
A comma (,) between field variables indicates that I want OFS applied between
output fields as shown in a previous example. Remember without the comma, no
field separator will be used, and the displayed output fields (or output file)
will all run together.
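The difference the comma makes can be seen side by side in this short sketch, which uses the first line of the sales file as inline input.

```shell
line='John Anderson,12,23,7,42'

# With a comma, OFS (a space by default) is placed between the fields.
with_comma=$(echo "$line" | awk -F, '{print $1, $5}')
# Without a comma, the two fields are concatenated and run together.
without_comma=$(echo "$line" | awk -F, '{print $1 $5}')

echo "$with_comma"
echo "$without_comma"
```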
TIP: Putting two field separators in a row inside a print statement creates a
syntax error with the print statement; however, using the same field twice in a
single print statement is valid syntax. For example:
awk '{print($1,$1)}'
Patterns
A pattern is the first half of an awk program statement. In awk there are six
accepted pattern types. This section discusses each of the six in detail. You
have already seen a couple of them, including BEGIN, and a specified,
slash-delimited pattern, in use. Awk has many string matching capabilities
arising from patterns, and the use of regular expressions in patterns. A range
pattern locates a sequence. All patterns except range patterns may be combined
in a compound pattern.
I began the chapter by saying awk was a pattern-match and process language. This
section explores exactly what is meant by a pattern match. As you'll see, what
kind of pattern you can match depends on exactly how you're using the awk pattern
specification notation.
BEGIN and END
The two special patterns BEGIN and END may be used to indicate a match, either
before the first input record is read, or after the last input record is read,
respectively. Some versions of awk require that, if used, BEGIN must be the
first pattern of the program and, if used, END must be the last pattern of the
program. While not necessarily a requirement, it is nonetheless an excellent
habit to get into, so I encourage you to do so, as I do throughout this chapter.
Using the BEGIN pattern for initializing variables is common (although variables
can be passed from the command line to the program too; see "Command Line
Arguments"). The END pattern is used for things which are input-dependent such as
totals.
If I want to know how many lines are in a given program, I type the following
line:
$ awk 'END {print "Total lines: " NR}' myprogram
I see Total lines: 256 on the monitor and therefore know that the file myprogram
has 256 lines. At any point while awk is processing the file, the variable NR
counts the number of records read so far. NR at the end of a file has a value
equal to the number of lines in the file.
How might you see a BEGIN block in use? Your first thought might be to
initialize variables, but if it's a numeric value, it's automatically
initialized to zero before its first use. Instead, perhaps you're building a
table of data and want to have some columnar headings. With this in mind, here's
a simple awk script that shows you all the accounts that people named Dave have
on your computer:
BEGIN {
FS=":" # remember that the passwd file uses colons
OFS="\t" # we're setting the output to a TAB
print "Account","Username"
}
/Dav/ {print $1, $5}
Here's what it looks like in action (we've called this file "daves.awk", though
the program matches Dave and David, of course):
$ awk -f daves.awk /etc/passwd
Account Username
andrews Dave Andrews
d3 David Douglas Dunlap
daves Dave Smith
taylor Dave Taylor
Note that you could also easily have a summary of the total number of matched
accounts by adding a variable that's incremented for each match, then in the END
block output in some manner. Here's one way to do it:
BEGIN { FS=":" ; OFS="\t" # input colon separated, output tab separated
print "Account","Username"
}
/Dav/ {print $1, $5 ; matches++ }
END { print "A total of " matches " matches."}
Here you can see how awk allows you to shorten the length of programs by having
multiple items on a single line, particularly useful for initialization. Also
notice the C increment notation: "matches++" is functionally identical to
"matches = matches + 1". Finally, also notice that we didn't have to initialize
the variable "matches" to zero since it was done for us automatically by the awk
system.
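For readers without a suitable passwd file at hand, the following sketch runs a BEGIN, main, and END block together on three invented account lines. The account names, IDs, and home directories are hypothetical sample data.

```shell
out=$(printf '%s\n' \
  'andrews:x:101:10:Dave Andrews:/home/andrews:/bin/sh' \
  'ann:x:102:10:Ann Marshall:/home/ann:/bin/sh' \
  'd3:x:103:10:David Douglas Dunlap:/home/d3:/bin/sh' |
awk 'BEGIN { FS = ":"; OFS = "\t"; print "Account", "Username" }
/Dav/  { print $1, $5; matches++ }    # matches starts out as zero
END    { print "A total of " matches " matches." }')
echo "$out"
```

The heading comes from BEGIN before any input is read, and the total comes from END after the last record.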
Expressions
Any expression may be used with any operator in awk. An expression consists of
an awk operator and its corresponding operands. Type conversion (variables being
interpreted as numbers at one point, but strings at another) is automatic; you
never declare a type. The type of operand needed is decided by the operator
type. If a numeric operator is given a string operand, it is converted and vice
versa.
TIP: To force a conversion, if the desired change is string to number, add (+)
0. If you wish to explicitly convert a number to a string, concatenate "" (the
null string) to the variable. Two quick examples: num=3; num=num "" creates a
new numeric variable and sets it to the number three, then by appending a null
string to it, translates it to a string (e.g., the string with the character 3
within). Adding zero to that string (num=num + 0) forces it back to a numeric
value.
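A minimal sketch of these conversions, run entirely inside a BEGIN block so that no input is needed. The variable names str, joined, and sum are chosen only for this illustration.

```shell
out=$(awk 'BEGIN {
    num = 3              # a number
    str = num ""         # concatenating the null string gives the string "3"
    joined = str "0"     # string concatenation: "30"
    sum = joined + 1     # a numeric operator converts "30" back to a number
    print joined, sum
}')
echo "$out"
```

The operator decides the interpretation: concatenation treats the value as text, while + treats it as a number.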
Any expression can be a pattern. If the pattern, in this case the expression,
evaluates to a nonzero or nonnull value, then the pattern matches that input
record. Patterns often involve comparison. The following are the valid awk
comparison operators:
Table 15.1. Comparison Operators in awk.
Operator    Meaning
==          is equal to
<           less than
>           greater than
<=          less than or equal to
>=          greater than or equal to
!=          not equal to
~           matched by
!~          not matched by
In awk, as in C, the logical equality operator is == rather than =. The single =
assigns a value to a variable, whereas == compares values. When the pattern is a
comparison, the pattern matches if the comparison is true (non-null or
non-zero). Here's an example: what if you wanted to only print lines where the
first field had a numeric value of less than twenty? No problem in awk:
$1 < 20 {print $0}
If the expression is arithmetic, it is matched when it evaluates to a nonzero
number. For example, here's a small program that will print the first ten lines
that have exactly seven words:
BEGIN {i=0}
NF==7 { print $0 ; i++ }
i==10 {exit}
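The same counting technique can be tried at a smaller scale. This sketch, on invented input, prints the first two lines that have exactly three fields and then stops; note that the counter test is a plain expression, not text between slashes.

```shell
out=$(printf 'a b c\nd e\nf g h\ni j k\n' |
  awk 'NF == 3 { print; i++ }
       i == 2  { exit }')
echo "$out"
```

The last line also has three fields, but awk has already exited by the time it would be read.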
There's another way that you could use these comparisons too, since awk
understands collation orders (that is, whether words are greater or lesser than
other words in a standard dictionary ordering). Consider the situation where you
have a phone directory—a sorted list of names—in a file and want to print all
the names that would appear in the corporate phonebook before a certain person,
say D. Hughes. You could do this quite succinctly:
$1 >= "Hughes,D" { exit }
{ print $0 }
When the pattern is a string, a match occurs if the expression is non-null. In
the earlier example with the pattern /Ann/, it was assumed to be a string since
it was enclosed in slashes. In a comparison expression, if both operands have a
numeric value, the comparison is based on the numeric value. Otherwise, the
comparison is made using string ordering, which is why this simple example
works.
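A runnable sketch of the phone-directory idea, using four invented names in sorted order. It pairs the exit rule with a rule that prints every record reached before the cutoff.

```shell
out=$(printf 'Adams,J\nGreen,T\nHughes,D\nJones,K\n' |
  awk '$1 >= "Hughes,D" { exit }    # string comparison: stop at the cutoff name
                        { print }') # every earlier name is printed
echo "$out"
```

Because the comparison operand is a quoted string, awk compares the names as text rather than as numbers.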
TIP: You can write more than two comparisons to a line in awk.
The pattern $2 <= $1 could involve either a numeric comparison or a string
comparison. Whichever it is, it will vary from file to file or even from record
to record within the same file.
TIP: Know your input file well when using such patterns, particularly since awk
will often silently assume a type for the variable and work with it, without
error messages or other warnings.
String Matching
There are three forms of string matching. The simplest is to surround a string
by slashes (/). No quotation marks are used. Hence /"Ann"/ looks for the string
"Ann" with the quotation marks included, not the string Ann, and returns no
input from the earlier passwd file. The entire
input record is returned if the expression within the slashes is anywhere in the
record. The other two matching operators have a more specific scope. The
operator ~ means "is matched by," and the pattern matches when the input field
being tested for a match contains the substring on the right hand side.
$2 ~ /mm/
This example matches every input record containing mm somewhere in the second
field. It could also be written as $2 ~ "mm".
The other operator !~ means "is not matched by."
$2 !~ /mm/
This example matches every input record not containing mm anywhere in the second
field.
Armed with that explanation, you can now see that /Ann/ is really just shorthand
for the more complex statement $0 ~ /Ann/.
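The two field-level operators can be compared on a few invented records. In this sketch the second field is a measurement, and the program prints the first field of the records that do or do not contain mm there.

```shell
data='ann 12mm
bob 3in
cy 40mm'

# Records whose second field contains mm.
hit=$(printf '%s\n' "$data" | awk '$2 ~ /mm/ {print $1}')
# Records whose second field does not contain mm.
miss=$(printf '%s\n' "$data" | awk '$2 !~ /mm/ {print $1}')

echo "$hit"
echo "$miss"
```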
Regular expressions are common to UNIX, and they come in two main flavors. You
have probably used them unconsciously on the command line as wildcards, where *
matches zero or more characters and ? matches any single character. For instance
entering the first line below results in the command interpreter matching all
files with the suffix abc and the rm command deleting them.
rm *abc
Awk works with regular expressions that are similar to those used with grep,
sed, and other editors but subtly different than the wildcards used with the
command shell. In particular, . matches any single character and * matches zero
or more of the previous character in the pattern (so a pattern of x*y will match
anything that has any number of the letter x, including none, followed by a y.
To force at least one x to appear too, you'd need to use the regular expression
xx*y instead). By default, patterns can appear anywhere on the line, so to have
them tied to an edge, you need to use ^ to indicate the beginning of the field
or line being tested, and $ for
the end. If you wanted to match all lines where the first word ends in abc, for
example, you could use $1 ~ /abc$/. The following line matches all records where
the fourth field begins with the letter a:
$4 ~ /^a.*/
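The difference between x*y and xx*y is easy to check on a handful of invented lines. In this sketch both patterns are anchored with ^ and $ so that each must account for the whole line.

```shell
lines='xy
y
xxxy
axb'

# x*y: zero or more x characters, then y.
zero_or_more=$(printf '%s\n' "$lines" | awk '/^x*y$/')
# xx*y: at least one x, then y.
one_or_more=$(printf '%s\n' "$lines" | awk '/^xx*y$/')

echo "$zero_or_more"
echo "$one_or_more"
```

The bare y matches only the first pattern, since xx*y insists on at least one x.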
Range Patterns
The pattern portion of a pattern/action pair may also consist of two patterns
separated by a comma (,); the action is performed for all lines between the
first occurrence of the first pattern and the next occurrence of the second.
At most companies, employees receive different benefits according to their
respective hire dates. It so happens that I have a file listing all employees in
my company, including hire date. If I wanted to write an awk program that just
lists the employees hired between 1980 and 1987 I could use the following
script, if the first field is the employee's name and the third field is the
year hired. Here's how that data file might look (notice that I use : to
separate fields so that we don't have to worry about the spaces in the employee
names):
$ cat emp.data
John Anderson:sales:1980
Joe Turner:marketing:1982
Susan Greco:sales:1985
Ike Turner:pr:1988
Bob Burmeister:accounting:1991
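The program could then be invoked as follows. The range opens at the first
record whose hire year is 1980 and closes at the first record whose hire year is
1985, the last year in this file that falls inside the window:
$ awk -F: '$3 == 1980, $3 == 1985 {print $1, $3}' emp.data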
With the output:
John Anderson 1980
Joe Turner 1982
Susan Greco 1985
TIP: The above example works because the input is already in order according to
hire year. Range patterns often work best with pre-sorted input. If this data
file were not in order, you could sort it numerically on the third
colon-separated field with the command sort -t: -k3,3n emp.data > new.emp.data.
(See Chapter 6 for more details on using the powerful sort command.)
Notice that range patterns are inclusive: they include both the record that
matches the first pattern and the record that matches the second. The range
pattern matches all records from an occurrence of the first pattern to the next
occurrence of the second. This is a subtle point, but it has a major effect on
how range patterns work. First, if the second pattern is never found, all
remaining records match.
So given the input file below:
$ cat sample.data
1
3
5
7
9
11
The following output appears on the monitor, totally disregarding that 9 and 11
are out of range.
$ awk '$1==3, $1==8' sample.data
3
5
7
9
11
The end pattern of a range is not equivalent to a <= comparison. To select
records by value rather than by position, use an ordinary comparison pattern
such as $1 >= 3 && $1 <= 8 instead.
Secondly, a range can occur more than once. After the end pattern matches, awk
starts looking for the first pattern again, and any later record that matches it
opens a new range. That's why you have to make sure that the data is sorted as
you expect.
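Both properties of a range are easy to check for yourself. The sketch below
uses made-up data in which the start value 3 appears twice and the end value 5
only once, and compares a range pattern with a plain comparison pattern.

```shell
# Hypothetical data: the start value 3 appears twice, the end value 5 once.
data='1
3
5
7
3
9'

# The first range runs from 3 through 5. The second 3 opens a new range
# that never finds a 5, so it runs to the end of the input.
ranged=$(printf '%s\n' "$data" | awk '$1==3, $1==5')

# A comparison pattern selects by value, one record at a time.
compared=$(printf '%s\n' "$data" | awk '$1>=3 && $1<=5')

printf '%s\n' "range:" "$ranged" "comparison:" "$compared"
```

The range prints 3, 5, 3, and 9, while the comparison prints only 3, 5, and 3.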
CAUTION: Range patterns cannot be parts of a larger pattern.
A more useful example of the range pattern comes from awk's ability to handle
multiple input files. I have a function finder program that finds code segments
I know exist and tells me where they are. The code segments for a particular
function X, for example, are bracketed by the phrase "function X" at the
beginning and } /* end of X at the end. It can be expressed as the awk pattern
range (note that both the / and the * of the comment must be escaped):
'/function functionname/,/} \/\* end of functionname/'
Compound Patterns
Patterns can be combined using the following logical operators and parentheses
as needed.
Table 15.2. The Logical Operators in awk.
Operator  Meaning
!         not
||        or (you can also use | in regular expressions)
&&        and
The pattern may be simple or quite complicated: (NF<4) || (NF>4). This matches
all input records not having exactly four fields. As is usual in awk, there are
a wide variety of ways to do the same thing (specify a pattern). Regular
expressions are allowed in string matching, but their use is not forced. To form
a pattern that matches strings beginning with a or b or c or d, there are
several pattern options:
/^[a-d].*/
/^a.*/ || /^b.*/ || /^c.*/ || /^d.*/
NOTE: When using range patterns: $1==2, $1==4 and $1>= 2 && $1 <=4 are not the
same ranges at all. First, the range pattern depends on the occurrence of the
second pattern as a stop marker, not on the value indicated in the range.
Secondly, a range opens only on a record that matches the first pattern, so
in-range values that appear later without a new start record are not selected.
For instance, consider the following simple input file:
$ cat mydata
1 0
3 1
4 1
5 1
7 0
4 2
5 2
1 0
4 3
The first range I try, '$1==3,$1==5', produces:
$ awk '$1==3,$1==5' mydata
3 1
4 1
5 1
Compare this to the following pattern and output.
$ awk '$1>=3 && $1<=5' mydata
3 1
4 1
5 1
4 2
5 2
4 3
Range patterns cannot be parts of a combined pattern.
Actions
The remainder of this chapter explores the action part of a pattern action
statement. As the name suggests, the action part tells awk what to do when a
pattern is found. Patterns are optional. An awk program built solely of actions
looks like a program in any procedural language. But looks are deceptive: even
without a pattern, awk runs every action once per input record, applying each
statement in order to the current record before it reads the next one.
Actions must be enclosed in curly braces ({}) whether accompanied by a pattern
or alone. An action part may consist of multiple statements. When several simple
statements (no compound loops or conditions) have no pattern, you can either
group them inside one pair of curly braces or give each statement its own pair.
Consider the following two action pieces:
{name = $1
print name}
and
{name = $1}
{print name}
These two produce identical output.
Variables
Variables are an integral part of any programming language: the virtual boxes
within which you can store values, count things, and more. In this section, I
talk about variables in awk. Awk has three types of variables: user-defined
variables, field variables, and predefined variables that are provided by the
language automatically. The next section is devoted to a discussion of built-in
variables. Awk doesn't have variable declarations. A variable comes to life the
first time it is mentioned; in a twist on René Descartes's philosophical
conundrum, you use it, therefore it is. The section concludes with an example of
turning an awk program into a shell script.
CAUTION: Since there are no declarations, be doubly careful to initialize all
the variables you use. A variable you never assign starts out as zero when used
as a number and as the empty string when used as a string.
Naming
The rule for naming user-defined variables is that they can be any combination
of letters, digits, and underscores, as long as the name does not start with a
digit. It is helpful to give a variable a name indicative of its purpose in the
program. Variables already defined by awk are written in all uppercase. Since
awk is case-sensitive, ofs is not the same variable as OFS, and capitalization
(or lack thereof) is a common source of errors. You have already seen field
variables: variables beginning with $, followed by a number, and indicating a
specific input field.
A variable is a number or a string or both. There is no type declaration, and
type conversion is automatic if needed. Recall the employee file used earlier.
For illustration, suppose I enter the program awk -F: '{ print $1 * 10 }'
emp.data, and awk obligingly provides the following:
0
0
0
0
0
Of course, this makes no sense! The point is that awk did exactly what it was
asked without complaint: it multiplied the name of the employee times ten, and
when it tried to translate the name into a number for the mathematical operation
it failed, resulting in a zero. Ten times zero, needless to say, is zero...
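Here is a small sketch of the same conversion rule on three made-up inputs: a
name, a plain number, and a string that merely starts with a number. The input
values are assumptions chosen for illustration only.

```shell
# Hypothetical inputs: a name, a number, and a string with a leading number.
converted=$(printf '%s\n' 'John' '12' '3apples' | awk '{ print $1 * 10 }')

printf '%s\n' "$converted"
```

A string with no leading digits converts to 0, while a string that begins with
digits converts to that leading number, so the three results are 0, 120, and 30.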
Awk in a Shell Script
Before examining the next example, review what you know about shell programming
(Chapters 10-14). Remember, every file containing shell commands needs to be
changed to an executable file before you can run it as a shell script. To do
this you should enter chmod +x filename from the command line.
Sometimes awk's automatic type conversion benefits you. Imagine that I'm still
trying to build an office system with awk scripts and this time I want to be
able to maintain a running monthly sales total based on a data file that
contains individual monthly sales. It looks like this:
$ cat monthly.sales
John Anderson,12,23,7
Joe Turner,10,25,15
Susan Greco,15,13,18
Bob Burmeister,8,21,17
These need to be added together to calculate the running totals for each
person's sales. Let a program do it!
$ cat total.awk
BEGIN {FS=","; OFS=","} # split input on commas and keep the same output format
{print $1, " monthly sales summary: " $2+$3+$4 }
That's the awk script, so let's see how it works:
$ awk -f total.awk monthly.sales
John Anderson, monthly sales summary: 42
Joe Turner, monthly sales summary: 50
Susan Greco, monthly sales summary: 46
Bob Burmeister, monthly sales summary: 46
CAUTION: Always run your program once to be sure it works before you make it
part of a complicated shell script!
Your task has been reduced to entering the monthly sales figures in the
monthly.sales file and editing the program file total.awk to include the correct
number of fields (if you use a for loop such as for(i=2;i<=NF;i++) the number of
fields is handled automatically, but printing is a hassle and needs an if
statement with 12 else if clauses).
In this case, not having to wonder if a digit is part of a string or a number is
helpful. Just keep an eye on the input data, since awk performs whatever actions
you specify, regardless of the actual data type with which you're working.
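The for loop mentioned above can be sketched as follows. This is only one way
to do it, and the second input record, with an extra month, is my own addition
to show that the loop adapts to the number of fields.

```shell
# Same layout as monthly.sales: a name followed by any number of figures.
sales='John Anderson,12,23,7
Joe Turner,10,25,15,4'

totals=$(printf '%s\n' "$sales" | awk -F, '
{
    total = 0                      # reset the sum for each record
    for (i = 2; i <= NF; i++)      # fields 2 through NF hold the figures
        total += $i
    print $1 ", monthly sales summary: " total
}')

printf '%s\n' "$totals"
```

No editing is needed when a month is added, because NF grows with the record.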
Built-in Variables
This section discusses the built-in variables found in awk. Because there are
many versions of awk, I included notes for those variables found in nawk, POSIX
awk, and gawk since they all differ. As before, unless otherwise noted, the
variables of earlier releases may be found in the later implementations. Awk was
released first and contains the core set of built-in variables used by all
updates. Nawk expands the set. The POSIX awk specification encompasses all
variables defined in nawk plus one additional variable. Gawk applies the POSIX
awk standards and then adds some built-in variables which are found in gawk
alone; the built-in variables noted when discussing gawk are unique to gawk.
This list is a guideline, not a hard and fast rule. For instance, the built-in
variable ENVIRON is formally introduced in the POSIX awk specifications; it
exists in gawk; it is also in the System V implementation of nawk, but SunOS
nawk doesn't have the variable ENVIRON. (See the section "Oh man! I need help."
in Chapter 5 for more information on how to use man pages.)
As I stated earlier, awk is case sensitive. In all implementations of awk,
built-in variables are written entirely in upper case.
Built-in Variables for Awk
When awk first became a part of UNIX, the built-in variables were the bare
essentials. As the name indicates, the variable FILENAME holds the name of the
current input file. Recall the function finder code; type the new line below:
/function functionname/,/} \/\* end of functionname/ {print $0}
END {print ""; print "Found in the file " FILENAME}
This adds the finishing touch.
The value of the variable FS determines the input field separator. FS has a
space as its default value. The built-in variable NF contains the number of
fields in the current record (remember, fields are akin to words, and records
are input lines). This value may change for each input record.
What happens if within an awk script I have the following statement?
$3 = "Third field"
It replaces the third field and rebuilds $0, joining the fields with the value
of OFS. If the record had fewer than three fields, awk creates the missing ones
as empty fields and sets NF to 3. The number of records read so far may be found
in the variable NR. The variable OFS holds the value for the output field
separator. The default value of OFS is a space. The value for the output format
for numbers resides in the variable OFMT, which has a default value of %.6g.
This is the format specifier for the print statement, though its syntax comes
from the C printf format string. ORS is the output record separator. Unless
changed, the value of ORS is a newline (\n).
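The effect of assigning to a field is easiest to see with a field beyond the
end of the record. In this sketch the two-word input and the OFS value of - are
assumptions picked so that the rebuilt record is easy to read.

```shell
# Assign to field 4 of a two-field record and look at NF and $0 afterwards.
rebuilt=$(printf 'a b\n' | awk 'BEGIN { OFS = "-" } { $4 = "d"; print NF; print $0 }')

printf '%s\n' "$rebuilt"
```

NF becomes 4, and $0 is rebuilt as a-b--d: the empty third field is created
automatically and every field is joined with OFS.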
Built-in Variables for Nawk
NOTE: When awk was expanded in 1985, part of the expansion included adding more
built-in variables.
CAUTION: Some implementations of UNIX simply put the new code in the spot for
the old code and didn't bother keeping both awk and nawk. System V and SunOS
have both available. Linux has neither awk nor nawk but uses gawk. System V has
both, but the awk uses nawk expansions. The book The AWK Programming Language
by the awk authors speaks of awk throughout the book, but the programming
language it describes is called nawk on most systems.
The built-in variable ARGC holds the value for the number of command line
arguments. The variable ARGV is an array containing the command line arguments.
Subscripts for ARGV begin with 0 and continue through ARGC-1. ARGV[0] is always
awk. The available UNIX options do not occupy ARGV. The variable FNR represents
the number of the current record within that input file. Like NR, this value
changes with each new record. FNR is always <= NR. The built-in variable RLENGTH
holds the value of the length of string matched by the match function. The
variable RS holds the value of the input record separator. The default value of
RS is a newline. The start of the string matched by the match function resides
in RSTART. Between RSTART and RLENGTH, it is possible to determine what was
matched. The variable SUBSEP contains the value of the subscript separator. It
has a default value of "\034".
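A short sketch may help tie these variables together. The file names one and
two and the sample sentence are hypothetical; the script creates the files in a
temporary directory and removes them afterwards.

```shell
# Two small hypothetical input files in a scratch directory.
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/one"
printf 'c\n' > "$dir/two"

# FNR restarts with each file; NR keeps counting across files.
counts=$(cd "$dir" && awk '{ print FILENAME ":" FNR ":" NR }' one two)
rm -rf "$dir"

# match() sets RSTART and RLENGTH, which substr() can use to pull out the text.
found=$(printf 'the quick brown fox\n' | awk '{
    if (match($0, /qu[a-z]*/))
        print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
}')

printf '%s\n' "$counts" "$found"
```

The last line printed is 5 5 quick: the match starts at character 5 and is 5
characters long.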
Built-in Variables for POSIX Awk
The POSIX awk specification introduces one new built-in variable beyond those in
nawk. The built-in variable ENVIRON is an array that holds the values of the
current environment variables. (Environment variables are discussed more
thoroughly later in this chapter.) The subscript values for ENVIRON are the
names of the environment variables themselves, and each ENVIRON element is the
value of that variable. For instance, ENVIRON["HOME"] on my PC under Linux is
"/home". Notice that using ENVIRON can save much system dependence within awk
source code in some cases but not others. ENVIRON["HOME"] at work is "/usr/anne"
while my SunOS account doesn't have an ENVIRON variable because it's not POSIX
compliant.
Here's an example of how you could work with the environment variables:
ENVIRON["EDITOR"] == "vi" {print NR,$0}
This program prints my program listings with line numbers if I am using vi as my
default editor. More on this example later in the chapter.
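To try this without changing your own environment, you can set EDITOR for a
single command. This sketch assumes an awk that supports ENVIRON (POSIX awk or
gawk); the two input lines are made up.

```shell
# Set EDITOR only for this one awk invocation and read it back.
editor=$(EDITOR=vi awk 'BEGIN { print ENVIRON["EDITOR"] }')

# Number the input lines only when the editor is vi.
numbered=$(printf 'first\nsecond\n' |
    EDITOR=vi awk 'ENVIRON["EDITOR"] == "vi" { print NR, $0 }')

printf '%s\n' "$editor" "$numbered"
```

Note the quotes around EDITOR inside the brackets. Without them awk would treat
EDITOR as an (empty) awk variable rather than as the subscript string.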
Built-in Variables in Gawk
The GNU group further enhanced awk by adding four new variables to gawk, its
public re-implementation of awk. Gawk does not differ between UNIX versions as
much as awk and nawk do, fortunately. These built-in variables are in addition
to those mentioned in the POSIX specification as described above. The variable
CONVFMT contains the conversion format for numbers. The default value of CONVFMT
is "%.6g" and is for internal use only. The variable FIELDWIDTHS allows a
programmer the option of having fixed field widths rather than a single
character field separator. The values of FIELDWIDTHS are numbers separated by a
space or Tab (\t), so fields need not all be the same width. When the
FIELDWIDTHS variable is set, each field is expected to have a fixed width. Gawk
separates the input record using the FIELDWIDTHS values for field widths. If
FIELDWIDTHS is set, the value of FS is disregarded. Assigning a new value to FS
overrides the use of FIELDWIDTHS; it restores the default behavior.
To see where this could be useful, let's imagine that you've just received a
datafile from accounting that indicates the different employees in your group
and their ages. It might look like:
$ cat gawk.datasample
1Swensen, Tim  24
1Trinkle, Dan  22
0Mitchel, Carl 27
The very first character, you find out, indicates if they're hourly or salaried:
a value of 1 means that they're salaried, and a value of 0 is hourly. How to
split that character out from the rest of the data field? With the FIELDWIDTHS
statement. Here's a simple gawk script that could attractively list the data:
BEGIN {FIELDWIDTHS = "1 8 1 4 1 2"}
{ if ($1 == 1) print "Salaried employee "$2,$4" is "$6" years old.";
else print "Hourly employee "$2,$4" is "$6" years old."
}
The output would look like:
Salaried employee Swensen, Tim is 24 years old.
Salaried employee Trinkle, Dan is 22 years old.
Hourly employee Mitchel, Carl is 27 years old.
TIP: When calculating the different FIELDWIDTHS values, don't forget any field
separators: the spaces between words do count in this case.
The variable IGNORECASE controls the case sensitivity of gawk regular
expressions. If IGNORECASE has a nonzero value, pattern matching ignores case
for regular expression operations. The default value of IGNORECASE is zero; all
regular expression operations are normally case sensitive.
Conditions (No IFs, &&s or buts)
Awk program statements are, by their very nature, conditional; if a pattern
matches, then a specified action or actions occurs. Actions, too, have a
conditional form. This section discusses conditional flow. It focuses on the
syntax of the if statement, but, as usual in awk, there are multiple ways to do
something.
A conditional statement does a test before it performs the action. One test, the
pattern match, has already happened; this test is an action. The last two
sections introduced variables; now you can begin putting them to practical uses.
The if Statement
An if statement takes the form of a typical procedural programming language
control structure, where E1 is an expression, as mentioned in the "Patterns"
section earlier in this chapter. The parentheses around E1 are required:
if (E1) S2; else S3
While E1 is always a single expression, S2 and S3 may be either single- or
multiple-action statements (that means conditions in conditions are legal
syntax, but I am getting ahead of myself). Returns and indentation are, as usual
in awk, entirely up to you. However, if S2 and the else statement are on the
same line, and S2 is a single statement, a semicolon must separate S2 from the
else statement. When awk encounters an if statement, evaluation occurs as
follows: first E1 is evaluated, and if E1 is nonzero or nonnull (true), S2 is
executed; if E1 is zero or null (false) and there's an else clause, S3 is
executed. For instance, if you want to print a blank line when the third field
has the value 25 and the entire line in all other cases, you could use a program
snippet like this:
{ if ($3 == 25)
print ""
else
print $0 }
The else portion of the if statement, S3, is completely optional, since
sometimes your choice is limited to whether or not to have awk execute S2:
{ if ($3 == 25)
print "" }
Although the if statement is an action, E1 can test for a pattern match using
the pattern-match operator ~. As you have already seen, you can use it to look
for my name in the password file another way. The first way is shorter, but they
do the same thing.
$ awk '/Ann/' /etc/passwd
$ awk '{if ($0 ~ /Ann/) print $0}' /etc/passwd
One use of the if statement combined with a pattern match is to further filter
the input. For example, here I'm going to print only the lines in the
password file that contain both Ann and a capital M character:
$ awk '/Ann/ { if ($0 ~ /M/) print}' /etc/passwd
amarshal:oPWwC9qVWI/ps:2005:12:Ann Marshall:/usr/grad/amarshal:/bin/csh
cmcintyr:0FciKEDDMkauU:2630:22:Carol Ann McIntyre:/usr/lteach/cmcintyr:/bin/csh
jflanaga:ShrMnyDwLI/mM:2654:22:JoAnn Flanagan:/usr/lteach/jflanaga:/bin/csh
Either S2 or S3 or both may consist of multiple-action statements. If any of
them do, the group of statements is enclosed in curly braces. Curly braces may
be put wherever you wish as long as they enclose the action. The rule of thumb:
if it's one statement, the braces are optional. More than one and they're
required.
You can also use multiple else clauses. The car sales example gets one field
longer each month. The first two fields are always the salesperson's name and
the last field is the accumulated annual total, so it is possible to calculate
the month by the value of NF. Note the == comparison: a single = would assign to
NF instead of testing it.
if(NF==4) month="Jan."
else if(NF==5) month="Feb."
else if(NF==6) month="March"
else if(NF==7) month="April"
else if(NF==8) month="May" # and so on
NOTE: Whatever the value of NF, only one branch of the block executes. Once a
test succeeds, awk skips the remaining else clauses.
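The chain of else clauses can be tried out on a couple of made-up records. In
this sketch, month_for is a hypothetical shell helper of my own, and the final
else branch labeled unknown is an addition so that unexpected field counts are
still reported.

```shell
# month_for is a hypothetical helper: it feeds one record to awk
# and prints the field count together with the month awk derives from it.
month_for() {
    printf '%s\n' "$1" | awk '{
        if (NF == 4)      month = "Jan."
        else if (NF == 5) month = "Feb."
        else if (NF == 6) month = "March"
        else              month = "unknown"
        print NF, month
    }'
}

jan=$(month_for 'John Anderson 12 12')
feb=$(month_for 'John Anderson 12 23 35')

printf '%s\n' "$jan" "$feb"
```

Each record takes exactly one branch: the four-field record reports Jan. and
the five-field record reports Feb.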
The Conditional Statement
Nawk and later versions also have a conditional statement, really just
shorthand for an if statement. It takes the format shown and uses the same
conditional operator found in C:
E1 ? S2 : S3
Here, E1 is an expression, and S2 and S3 are single expressions rather than
groups of statements. When it encounters a conditional statement, awk evaluates
it in the same order as an if statement: first E1 is evaluated; if E1 is nonzero
or nonnull (true), S2 is evaluated and becomes the result; if E1 is zero or null
(false), S3 is. Only one of S2 or S3 is chosen, never both.
The conditional statement is a good place for the programmer to provide error
messages. Return to the employee age example. When we wanted to differentiate
between hourly and salaried employees, we had a big if-else statement:
{ if ($1 == 1) print "Salaried employee "$2,$4" is "$6" years old.";
else print "Hourly employee "$2,$4" is "$6" years old."
}
In fact, there's an easier way to do this with conditional statements:
{ print ($1==1 ? "Salaried" : "Hourly") " employee "$2,$4" is "$6" years old." }
CAUTION: Remember the conditional statement is not part of original awk!
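A stripped-down sketch of the same idea, using two made-up input lines rather
than the fixed-width employee file, stores the chosen word in a variable (kind
is my own name) before printing it. It needs nawk, POSIX awk, or gawk.

```shell
# Field 1 is the pay flag (1 = salaried, 0 = hourly); field 2 is a name.
labels=$(printf '1 Tim\n0 Carl\n' | awk '{
    kind = ($1 == 1) ? "Salaried" : "Hourly"   # pick one of two strings
    print kind " employee " $2
}')

printf '%s\n' "$labels"
```

Assigning the result first keeps the print statement short and avoids any
doubt about how the parentheses group.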
At first glance, and for short statements, the if statement appears identical to
the conditional statement. On closer inspection, the statement you should use in
a specific case differs. Either is fine for use when choosing between either of
two single statements, but the if statement is required for more complicated
situations, such as when S2 and S3 are multiple statements. Use if for multiple
else statements (the first example), or for a condition inside a condition like
the second example below:
{ if (NR == 100)
{ print ""
print "This is the 100th record"
print $0
print
}
}
{ if ($1 == 0)
if (name ~ /Fred/)
print "Fred is broke" }
Patterns as Conditions
As if that does not provide ample choice, notice that the program relying on
pattern-matching (had I chosen that method) produces the same output. Look at
the program and its output.
$ cat lowsales.awk
BEGIN {OFS="\t"}
$(NF-1) <= 7 {print $1, $(NF-1), "Check Sales"}
$(NF-1) > 7 {print $1, $(NF-1) } # Next to last field
$ awk -f lowsales.awk emp.data
John Anderson 7 Check Sales
Joe Turner 15
Susan Greco 18
Bob Burmeister 17
Since the two patterns above are nonoverlapping and one immediately follows the
other, the two programs accomplish the same thing. Which to use is a matter of
programming style. I find the conditional statement or the if statement more
readable than two patterns in a row. When you are choosing whether to use the
nawk conditional statement or the if statement because you're concerned about
printing two long messages, using the if statement is cleaner. Above all, if you
choose to use the conditional statement, keep in mind you can't use awk; you
must use nawk or gawk.
Loops
People often write programs to perform a repetitive task or several repeated
tasks. These repetitions are called loops. Loops are the subject of this
section. The loop structures of awk very much resemble those found in C. First,
let's look at a shortcut for counting, the ++ notation. Then I'll show you the
ways to program loops in awk. The looping constructs of awk are the do (nawk),
for, and while statements. As with multiple-action groups in an if statement,
curly braces ({}) surround a group of action statements associated in a loop.
Without curly braces, only the statement immediately following the keyword is
considered part of the loop.
TIP: Forgetting curly braces is a common looping error.
The section concludes with a discussion of how (and some examples of why) to
interrupt a loop.
Increment and Decrement
As stated earlier, assignment statements take the form x = y, where the value y
is being assigned to x. Awk has some shorthand methods of writing this. For
example, to add a monthly sales total to the car sales file, you'll need to add
a variable to keep a running total of the sales figures. Call it total. You
need to start total at zero and add each $(NF-1) as read. In standard
programming practice, that would be written total = total + $(NF -1). This is
okay in awk, too. However, a shortened format of total += $(NF-1) is also
acceptable.
Two operations, line += 1 and line -= 1 (shorthand for line = line + 1 and
line = line - 1), are so common that they have their own names. They are called
increment and decrement, respectively, and can be further shortened to the
simpler line++ and line--. At
any reference to a variable, you can not only use this notation but even vary
whether the action is performed immediately before or after the value is used in
that statement. This is called prefix and postfix notation, and is represented
by ++line and line++.
For clarity's sake, focus on increment for a moment. Decrement functions the
same way using subtraction. Using the ++line notation tells awk to do the
addition before doing the operation indicated in the line. Using the postfix
form says to do the operation in the line, then do the addition. Sometimes the
choice does not matter; keeping a counter of the number of sales people (to
later calculate a sales average at the end of the month) requires a counter of
names. The statements totalpeople++ and ++totalpeople do the same thing and are
interchangeable when they occupy a line by themselves. But suppose I decide to
print the person's number along with his or her name and sales. Adding either of
the second two lines below to the previous example produces different results
based on starting both at totalpeople=1.
$ cat awkscript.v1
BEGIN { totalpeople = 1 }
{print ++totalpeople, $1, $(NF-1) }
$ cat awkscript.v2
BEGIN { totalpeople = 1 }
{print totalpeople++, $1, $(NF-1) }
The first example will actually have the first employee listed as #2, since the
totalpeople variable is incremented before it's used in the print statement. By
contrast, the second version will do what we want because it'll use the variable
value, then afterwards increment it to the next value.
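The difference is easy to verify in isolation. This sketch uses a throwaway
variable n of my own in a BEGIN block, so no input file is needed.

```shell
# Prefix: n is incremented first, so the new value is printed.
pre=$(awk 'BEGIN { n = 1; print ++n; print n }')

# Postfix: the old value is printed, then n is incremented.
post=$(awk 'BEGIN { n = 1; print n++; print n }')

printf '%s\n' "prefix:" "$pre" "postfix:" "$post"
```

Both versions leave n at 2; only the value seen by the first print differs.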
TIP: Be consistent. Either is fine, but stick with one numbering system or the
other, and there is less likelihood that you will accidentally enter a loop an
unexpected number of times.
The While Statement
Awk provides the while statement for general looping. It has the following form:
while(E1)
S1
Here, E1 is an expression (a condition), and S1 is either one action statement
or a group of action statements enclosed in curly braces. When awk meets a while
statement, E1 is evaluated. If E1 is true, S1 executes from start to finish,
then E1 is again evaluated. If E1 is true, S1 again executes. The process
continues until E1 is evaluated to false. When it does, execution continues with
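the next action statement after the loop. Consider the program below. As
written, it loops forever on any record that contains M, because nothing in the
body changes $0: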
{ while ($0~/M/)
print
}
Typically the condition (E1) tests a variable, and the variable is changed in
the while loop.
{ i=1
while (i<20)
{ print i
i++
}
}
This second code snippet will print the numbers from 1 to 19, then once the
while loop tests with i=20, the condition of i<20 will become false and the loop
will be done.
The Do Statement
Nawk and later versions provide the do statement for looping in addition to the
while statement. The do statement takes the following form:
do
S
while (E)
Here, S is either a single statement or a group of action statements enclosed in
curly braces, and E is the test condition. When awk comes to a do statement, S
is executed once, and then condition E is tested. If E evaluates to nonzero or
nonnull, S executes again, and so on until the condition E becomes false. The
difference between the do and the while statement rests in their order of
evaluation. The while statement checks the condition first and executes the body
of the loop if the condition is true. Use the while statement to check
conditions that may be initially false. For instance, while (not
end-of-file(input)) is a common example. The do statement executes the loop
first and then checks the condition. Use the do statement when testing a
condition which depends on the first execution to meet the condition.
The do statement can be imitated using the while statement. Put the code that
is in the loop before the condition as well as in the body of the loop.
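Here is a minimal sketch of that equivalence, assuming an awk that has the do
statement. The starting value 5 is chosen so that the condition i < 5 is false
from the outset, which is exactly the case where do and while differ.

```shell
# do: the body runs once before the condition is tested.
do_out=$(awk 'BEGIN {
    i = 5
    do {
        print "body"
        i++
    } while (i < 5)
    print "done, i=" i
}')

# The same behavior with while: the body is written once before the loop
# and again inside it.
imitated=$(awk 'BEGIN {
    i = 5
    print "body"
    i++
    while (i < 5) {
        print "body"
        i++
    }
    print "done, i=" i
}')

printf '%s\n' "$do_out" "$imitated"
```

Both programs print body once and finish with i equal to 6.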
The For Statement
The for statement is a compacted while loop designed for counting. Use it when
you know ahead of time that S is a repetitive task and the number of times it
executes can be expressed as a single variable. The for loop has the following
form:
for(pre-loop-statements;TEST;post-loop-statements)
S
Here, pre-loop-statements usually initialize the counting variable; TEST is the
test condition; and post-loop-statements indicate any loop variable increments.
For example,
{ for(i=1; i<=30; i++) print i }
This is a succinct way of saying initialize i to 1, then continue looping while
i<=30, and incrementing i by one each time through. The statement executed each
time simply prints the value of i. The result of this statement is a list of the
numbers 1 through 30.
TIP: The condition test should either be < 21 or <= 20 to execute the loop 20
times. An equality test such as != is not a good test condition. Changing the
loop to the line below illustrates why.
{ for (i=1;i!=20;i+=2) print i }
Each iteration of the loop adds 2 to the value of i. i goes from 1 to 3 to 5 to
7 and so on to 19 and then 21, never having a value of 20. Consequently, you
have an infinite loop; it never stops.
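With a relational test, the same stepping loop ends cleanly. In this sketch the
counter n is my own addition; it records how many times the body ran.

```shell
# Step by 2 with a <= test: i takes the values 1, 3, ..., 19 and the
# loop stops as soon as i passes 20.
count=$(awk 'BEGIN { for (i = 1; i <= 20; i += 2) n++; print n, i }')

printf '%s\n' "$count"
```

The body runs 10 times and i is 21 when the loop exits, even though i never
equals 20.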
The for loop can also be used involving loops of unknown size:
for (i=1; i<=NF; i++)
print $i
This prints each field on a separate line. True, you don't know what the number
of
fields will be, but you do know NF will contain that number.
The for loop does not have to be incremented; it could be decremented instead:
$ awk -F: '{ for (i = NF; i > 0; --i) print $i }' sales.data
This prints the fields in reverse order, one per line.
Loop Control
The loop control value is usually an integer, though awk does not require it.
Because of the desire to create easily readable code, most programmers try to
avoid branching out of loops midway. Awk offers two ways to do this, however, if
you need them: break and continue. Sometimes unexpected or invalid input leaves
little choice but to exit the loop or have the program crash, something a
programmer strives to avoid. Input errors are one accepted time to use the break
statement. For instance, when reading the car sales data into the array name, I
wrote the program expecting five fields on every line. If something happens and
a line has the wrong number of fields, the program is in trouble. A way to
protect your program from this is to have code like:
{ for(i=1; i<=NF; i++)
if (NF != 5) {
print "Error on line " NR ": invalid input...leaving loop."
break }
else
continue with program code...
The break statement terminates only the loop. It is not equivalent to the exit
statement which transfers control to the END statement of the program. I handle
the problem as shown on the CD-ROM in file LIST15_1.
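A self-contained sketch of the break statement follows. It is not the CD-ROM
listing; the two input records and the counter named seen are assumptions made
for illustration.

```shell
# The first record has the expected five fields; the second does not.
checked=$(printf 'a b c d e\na b c\n' | awk '{
    seen = 0
    for (i = 1; i <= NF; i++) {
        if (NF != 5) {
            print "Error on line " NR ": invalid input...leaving loop."
            break                  # leaves the for loop only
        }
        seen++
    }
    # Execution continues here after break; the program itself goes on.
    print "line " NR ": processed " seen " fields"
}')

printf '%s\n' "$checked"
```

The statement after the loop still runs for the bad record, which shows that
break ends the loop and not the program.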
TIP: The ideal error message depends, of course, on your application, the
knowledge of the end users, and the likelihood they will be able to correct the
error.
As another use for the break statement consider do S while (1). It is an
infinite loop depending on another way out. Suppose your program begins by
displaying a menu on screen. (See the LIST 15_2 file on the CD-ROM.)
The above example shows an infinite loop controlled with the break statement
giving the end user a way out.
NOTE: The built-in nawk function getline does what it seems. For the point of
the example take it on faith that it returns a character.
The continue statement causes execution to skip the remainder of the current
iteration in both the do and the while statements. Control transfers to the
evaluation of the test condition. In the for loop, control goes to the
post-loop-statements. When
is this of use? Consider computing a true sales ratio by calculating the amount
sold and dividing that number by hours worked.
Since this is all kept in separate files, the simplest way to handle the task is
to read the first list into an array, calculate the figure for the report, and
do whatever else is needed.
FILENAME=="total" read each $(NF-1) into monthlytotal[i]
FILENAME=="per" with each i
monthlytotal[i]/$2
whatever else
But what if $2 is 0? The program will stop with an error because dividing by 0
is an illegal operation. While it is unlikely that an employee will miss an
entire month of work, it is possible. So, it is a good idea to allow for the
possibility. This is
one use for the continue statement. The above program segment expands to Listing
15.1.
Listing 15.1. Using the continue statement.
BEGIN { star = 0
# other stuff...
}
FILENAME=="total" { monthlyttl[FNR]=$(NF-1) } # one total per employee
FILENAME=="per" { hours[FNR]=$2 ; n=FNR } # hours worked per employee
END { for(i=1;i<=n;i++) {
if(hours[i] == 0) {
print "*"
star++
continue } # skip the division for this employee
print monthlyttl[i]/hours[i]
# whatever else
}
if(star>=1)
print "* indicates employee did not work all month."
}
The above program makes some assumptions about the data in addition to assuming
valid input data. What are these assumptions and more importantly, how do you
fix them? The data in both files is assumed to be the same length, and the names
are assumed to be in the same order.
Recall that in awk, array subscripts are stored as strings. Since each list
contains a name and its associated figure, you can match names. Before running
this program, run the UNIX sort utility to ensure the files have the names in
alphabetical order (see "Sorting Text Files" in Chapter 6). After making
changes, use file LIST15_4 on the CD-ROM.
Strings
There are two primary types of data that awk can work with: numeric values and
sequences of characters and digits that comprise words, phrases or sentences.
The latter are called strings within awk and most other programming languages.
For instance, "now is the time for all good men" is a string. A string constant
is always enclosed in double quotes (""). It can be almost any length (the exact
number varies from UNIX version to version).
One of the important string operations is called concatenation. The word means
putting together. When you concatenate two strings you are creating a third
string that is the combination of string1, followed immediately by string2. To
perform concatenation in awk simply leave a space between two strings.
print "My name is" "Ann."
This prints the line:
My name isAnn.
(To ensure that a space is included you can either use a comma in the print
statement or simply add a space to one of the strings: print "My name is "
"Ann").
Built-In String Functions
As a rule, awk returns the leftmost, longest match in its regular expression
functions. This means that it finds the match occurring first (farthest to the
left). Then, from that starting point, it collects the longest string possible.
For instance, if the pattern you are looking for is /y+/ in the string "any of
the guyys knew it" then the match is the single "y" in "any", because it appears
earliest, even though "yy" is longer. In the string "guyys" alone, the match is
"yy" rather than "y".
Let's consider the different string functions available, organized by awk
version.
Awk
The original awk contained few built-in functions for handling strings. The
length function returns the length of the string. It has an optional argument.
If you use the argument, it must follow the keyword and be enclosed in
parentheses: length(string). If there is no argument, the length of $0 is the
value. For example, it is difficult to determine from some screen editors if a
line of text stops at 80 characters or wraps around. The following invocation of
awk aids by listing just those lines that are longer than 80 characters in the
specified file.
$ awk '{ if (length($0) > 80) print NR ": " $0 }' file-with-long-lines
The other string function available in the original awk is substring, which
takes the form substr(string,position,len) and returns the len length substring
of the string starting at position.
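As a quick check of these two functions, here is a minimal sketch on a made-up
string. The string and the variable name s are only for illustration; the
two-argument form of substr (len omitted) is standard in POSIX awk, so consult
your system if you are using an older version.

```shell
# Sketch: length() and substr() on a made-up string; positions count from 1.
out=$(awk 'BEGIN {
    s = "Unix Unleashed"
    print length(s)          # number of characters in s
    print substr(s, 6, 4)    # 4 characters starting at position 6
    print substr(s, 6)       # len omitted: everything from position 6 on
}')
printf '%s\n' "$out"
```

This prints 14, Unle, and Unleashed, one per line.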
NOTE: A disagreement exists over which functions originated in awk and which
originated in nawk. Consult your system for the final word on awk string
functions. The functions in nawk are fairly standard.
Nawk
When awk was expanded to nawk, many built-in functions were added for string
manipulation while keeping the two from awk. The function gsub(r, s, t)
substitutes string s into target string t every time the regular expression r
occurs and returns the number of substitutions. If t is not given gsub() uses
$0. The target must be a variable, field, or array element, because it is
changed in place. For instance, if name holds "Randall", gsub(/l/, "y", name)
turns name into "Randayy" and returns 2. The g in gsub means global because all
occurrences in the target string change.
The function sub(r, s, t) works like gsub(), except the substitution occurs only
once. Thus sub(/l/, "y", name) turns "Randall" into "Randayl" and returns 1. The
place the substring t occurs in string s is returned with the function
index(s, t): index("Chris", "i") returns 4. As you'd expect the return value is
zero if substring t is not found. The function match(s, r) returns the position
in s where the regular expression r occurs. It returns the index where the
substring begins or 0 if there is no match. It sets the values of RSTART and
RLENGTH.
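The following sketch exercises sub, gsub, index, and match together. The
variable names (name, n, p, s) are mine, chosen for illustration. Note that the
target of sub and gsub is a variable, since both functions change it in place
and return a count.

```shell
# Sketch: the nawk string functions on small made-up strings.
out=$(awk 'BEGIN {
    name = "Randall"
    n = gsub(/l/, "y", name)     # replace every l; n is the count
    print n, name
    name = "Randall"
    n = sub(/l/, "y", name)      # replace only the first l
    print n, name
    print index("Chris", "i")    # position of "i" in "Chris"
    s = "any of the guyys knew it"
    p = match(s, /y+/)           # leftmost match: the y in "any"
    print p, RSTART, RLENGTH
    p = match("guyys", /y+/)     # longest match at that position: "yy"
    print p, RLENGTH
}')
printf '%s\n' "$out"
```

The five lines printed are 2 Randayy, 1 Randayl, 4, 3 3 1, and 3 2.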
The split function separates a string into parts. For example, if your program
reads in a date as 5-10-94, and later you want it written May 10, 1994 the first
step is to divide the date appropriately. The built-in function split does this:
split("5-10-94", store, "-") divides the date, and sets store["1"] = "5",
store["2"] = "10" and store["3"] = "94". Notice that here the subscripts start
with "1" not "0".
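To finish the date example, here is one possible sketch of the second step. The
month array and the assumption that every year belongs to the 1900s are mine,
added only to make the example complete.

```shell
# Sketch: turn 5-10-94 into "May 10, 1994" (assumes all years are 19xx).
out=$(awk 'BEGIN {
    # split fills month[1] through month[12] with the month names
    split("January February March April May June July August September October November December", month, " ")
    n = split("5-10-94", store, "-")    # n is the number of pieces, here 3
    # adding 0 turns "5" (or "05") into the number 5 for the lookup
    print month[store[1] + 0], store[2] ", 19" store[3]
}')
printf '%s\n' "$out"
```

This prints May 10, 1994.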
POSIX Awk
The POSIX awk specification added two built-in functions for use with strings.
They are tolower(str) and toupper(str). Both functions return a copy of the
string str with the alphabetic characters converted to the appropriate case.
Non-alphabetic characters are left alone.
Gawk
Gawk provides two functions returning time-related information. The systime()
function returns the current time of day in seconds since midnight UTC
(Coordinated Universal Time, the new name for Greenwich Mean Time), January 1,
1970 on POSIX systems. The function strftime(f, t), where f is a format and t is
a timestamp of the same form as returned by systime(), returns a formatted
timestamp similar to the ANSI C function strftime().
String Constants
String constants are the way awk identifies a non-keyboard, but essential,
character. Since they are strings, when you use one, you must enclose it in
double quotes (""). These constants may appear in printing or in patterns
involving regular expressions. For instance, the following command prints all
lines less than 80 characters long that don't begin with a tab. See Table 15.3.
awk 'length($0) < 80 && !/^\t/' another-file-with-long-lines
Table 15.3. Awk string constants.
Expression   Meaning
\\           The way of indicating to print a backslash.
\a           The "alert" character; usually the ASCII BEL.
\b           A backspace character.
\f           A formfeed character.
\n           A newline character.
\r           Carriage return character.
\t           Horizontal tab character.
\v           Vertical tab character.
\x           Indicates the following value is a hexadecimal number.
\0           Indicates the following value is an octal number.
Arrays
An array is a method of storing pieces of similar data in the computer for later
use. Suppose your boss asks for a program that reads in the name, social
security number, and a bunch of personnel data to print check stubs and the
detachable check. For three or four employees keeping name1, name2, etc. might
be feasible, but at 20, it is tedious and at 200, impossible. This is a use for
arrays! See file LIST15_5 on the CD-ROM.
NOTE: Since the first input record is the checkdate, the total lines (NR) is not
the number of checks to issue. I could have used NR-1, but I chose clarity over
brevity.
Much easier, cleaner, and quicker! It also works for any number of employees
without code changes. Awk only supports single-dimension arrays. (See the
section "Advanced Concepts" for how to simulate multiple-dimensional arrays.)
That and a few other things set awk arrays apart from the arrays of other
programming languages. This section focuses on arrays; I will explain their use,
then discuss their special property. I conclude by listing three features of awk
(a built-in function, a built-in variable, and an operator) designed to help you
work with arrays.
Arrays in awk, like variables, don't need to be declared. Further, no indication
of size must be given ahead of time; in programming terms, you'd say arrays in
awk are dynamic. To create an array, give it a name and put its subscript after
the name in square brackets ([]), name[2] from above, for instance. Array
subscripts are also called the indices of the array; in name[2], 2 is the index
to the array name, and it accesses the one name stored at location 2.
NOTE: One peculiarity in awk is that elements are not stored in the order they
are entered. A for (i in array) loop visits them in an order that depends on
the version of awk you are using.
Awk arrays are different from those of other programming languages because in
awk, array subscripts are stored as strings, not numbers. Technically, the term
is associative arrays and it's unusual in programming languages. Be aware that
the use of strings as subscripts can confuse you if you think purely in numeric
terms. Since "3" > "15" when compared as strings, an array element with a
subscript of "15" sorts before one with a subscript of "3", even though
numerically 3 < 15.
Since subscripts are strings, a subscript can be a field value. grade[$1]=$2 is
a valid statement, as is salary["John"].
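A small sketch of field values used as subscripts follows. The input lines and
the count array are invented for illustration; the point is only that a name
taken from $1 works as an array index.

```shell
# Sketch: names from field 1 serve as subscripts (made-up grade data).
out=$(printf 'John 88\nMaria 95\nJohn 91\n' | awk '
    { grade[$1] = $2; count[$1]++ }    # latest grade and number of entries
    END {
        print "John", grade["John"], count["John"]
        print "Maria", grade["Maria"], count["Maria"]
    }')
printf '%s\n' "$out"
```

The later John record overwrites the earlier one, so the output is John 91 2
followed by Maria 95 1.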
Array Specialties
Nawk++ has additions specifically intended for use with arrays. The first is a
test for membership. Suppose Mark Turner enrolled late in a class I teach, and I
don't remember if I added his name to the list I keep on my computer. The
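following program checks the list for me. Because in tests the subscripts of an
array rather than the values stored in it, the names themselves are used as the
subscripts (the list is assumed to hold a first and last name on each line).
{ name[$1 " " $2] = 1 }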
END { if ("Mark Turner" in name)
print "He's enrolled in the course!"
}
The delete statement removes array elements from computer memory. To remove an
element, for example, you could use the command delete name[1].
CAUTION: Once you remove an element from memory, it's gone, and it ain't coming
back! When in doubt, keep it.
Although technology is advancing and memory is not the precious commodity it
once was considered to be, it is still a good idea to clean up after yourself
when you write a program. Think of the check printing program above. Two hundred
names won't fill the memory. But if your program controls personnel activity, it
writes checks and checkstubs; adds and deletes employees; and charts sales. It's
better to update each file to disk and remove the arrays not in use. For one
thing, there is less chance of reading obsolete data. It also consumes less
memory and minimizes the chance of using an array of old data for a new task.
The clean-up can be most easily done:
END {i= totalemps
while(i>0) {
delete name[i]
delete data[i--] }
}
Nawk++ creates another built-in variable for use when simulating
multidimensional arrays. More on its use appears later, in the section "Advanced
Concepts." It is called SUBSEP and has a default value of "\034". To add this
variable to awk, just create it in your program:
BEGIN { SUBSEP = "\034" }
Recall that in awk, array subscripts are stored as strings. Since each list
contains a name and its associated figure, you can match names and hence match
files. Here are the answers to the question about using two files and assuring
they have the same order (from the car sales example earlier). Before running
this program, run the UNIX sort utility to ensure the files have the names in
alphabetical order. (See "Sorting Text Files" in Chapter 6.) After making
changes, use the program in file LIST15_6 on the CD-ROM.
Arithmetic
Although awk is primarily a language for pattern matching, and hence, text and
strings pop into mind more readily than math and numbers, awk also has a good
set of math tools. In this section, first I show the basics, then we look at the
math functions built into awk.
Operators
Awk supports the usual math operations. The expression x^y is x superscript y,
that is, x to the y power. The % operator calculates remainders in awk: x%y is
the remainder of x divided by y, and the result is machine-dependent. All math
uses floating point, and numbers are equivalent no matter which format they are
expressed in, so 100 = 1.00e+02.
The math operators in awk consist of the four basic functions: + (addition), -
(subtraction), / (division), and * (multiplication), plus ^ and % for
exponentiation and remainder.
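The operators can be tried out directly in a BEGIN action. This is only a
sketch with numbers picked for illustration; the last line checks that 100 and
1.00e+02 compare as equal.

```shell
# Sketch: the arithmetic operators on small made-up numbers.
out=$(awk 'BEGIN {
    print 7 + 2, 7 - 2, 7 * 2, 7 / 2    # the four basic operations
    print 2 ^ 10                        # exponentiation
    print 17 % 5                        # remainder
    print (100 == 1.00e+02)             # 1 means true
}')
printf '%s\n' "$out"
```

The output is 9 5 14 3.5, then 1024, then 2, then 1.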
As you saw earlier in the most recent sales example, fields can be used in
arithmetic too. If, in the middle of the month, my boss asks for a list of the
names and latest monthly sales totals, I don't need to panic over the discarded
figures; I can just print a new list. My first shot seems simple enough (Listing
15.2).
Listing 15.2. Print sales totals for May.
BEGIN {OFS="\t"}
{ print $1, $2, $6 } # field #6 = May
Then a thought hits. What if my boss asks for the same thing next month? Sure,
changing a field number each month is not a big deal but is it really
necessary?
I look at the data. No matter what month it is, the current month's totals are
always the next to last field. I start over with the program in Listing 15.3.
Listing 15.3. Printing the previous month's sales totals.
BEGIN {OFS="\t"}
{ print $1,$2, $(NF-1) }
TIP: Again, watch yourself because awk lets you get away with murder. If I
forgot the parentheses on the last statement above, rather than get a monthly
total, I would print a list of the running total minus 1! Also, rather than
generate an error, if I mistype $(NF-1) and get $(NF+1) (not hard to do using
the number pad), awk treats nonexistent fields (here field number NF + 1) as
the null string. In this case, it prints an empty value where the total should
be.
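The difference the parentheses make is easy to see on a single line of input.
The record below is made up (a name, three monthly figures, and a running
total), so treat it as a sketch rather than the sales file itself.

```shell
# Sketch: $(NF-1) versus $NF-1 on one made-up record.
out=$(printf 'Ann 10 20 30 60\n' | awk '{
    print $(NF-1)            # next-to-last field
    print $NF-1              # last field minus one
    print "[" $(NF+1) "]"    # nonexistent field: the null string
}')
printf '%s\n' "$out"
```

This prints 30, then 59, then an empty pair of brackets.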
Another use for arithmetic concerns assignment. Field variables may be changed
by assignment. Given the following file, the statement $3 = 7 is a valid
statement and produces the results below:
$ cat inputfile
1 2
3 4
5 6
7 8
9 10
$ awk '{$3 = 7; print}' inputfile
1 2 7
3 4 7
5 6 7
7 8 7
9 10 7
NOTE: The above statement forces $0 and NF values to change. Awk recalculates
them as it runs.
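If I run the following program, five lines appear on the monitor, showing the
new values.
{ if(NR==1)
print $0, NF }
{ if (NR >= 2 && NR <= 4) { $3=7; print $0, NF } }
END {print $0, NF }
Now when we run the data file through awk here's what we see:
$ awk -f newsample.awk inputfile
1 2 2
3 4 7 3
5 6 7 3
7 8 7 3
9 10 2
The last line comes from the END action. Most current versions of awk keep the
last input record in $0 and NF there, although some older versions do not.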
Numeric Functions
Awk has a well-rounded selection of built-in numeric functions. As before in the
sections on "Built-in Variables" and "Strings," the functions build on each
other beginning with those found in awk.
Awk
To start, awk has built-in functions exp(exp), log(exp), sqrt(exp), and int(exp)
where int() truncates its argument to an integer.
Nawk
Nawk added further arithmetic functions to awk. It added atan2(y,x) which
returns the arctangent of y/x. It also added two random number generator
functions: rand() and srand(x). There is also some disagreement over which
functions originated in awk and which in nawk. Most versions have all the
trigonometric functions in nawk, regardless of where they first appeared.
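A short sketch of the numeric functions follows. The arguments are chosen so
the results are easy to check by hand; the seed 42 passed to srand is
arbitrary, and since rand() values differ between versions of awk the example
only tests that the result lies between 0 and 1.

```shell
# Sketch: built-in numeric functions with easy-to-check arguments.
out=$(awk 'BEGIN {
    print int(3.9), int(-3.9)         # truncation toward zero
    print sqrt(16), exp(0), log(1)
    printf("%.4f\n", atan2(0, -1))    # the arctangent of 0/-1 is pi
    srand(42); r = rand()
    print (r >= 0 && r < 1)           # rand() returns a value in [0, 1)
}')
printf '%s\n' "$out"
```

The output is 3 -3, then 4 1 0, then 3.1416, then 1.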
Input and Output
This section takes a closer look at the way input and output function in awk. I
examine input first and look briefly at the getline function of nawk++. Next, I
show how awk output works, and the two different print statements in awk: print
and printf.
Input
Awk handles the majority of input automatically—there is no explicit read
statement, unlike most programming languages. Each line of the program is
applied to each input record in the order the records appear in the input file.
If the input file has 20 records then the first pattern action statement in the
program looks for a match 20 times. The next statement causes awk to read the
next input record immediately and start over with the first pattern action
statement, without trying the current record against the rest of the program.
The exit statement acts as if all input has been processed. When awk encounters
an exit statement, if there is one, the control goes to the END pattern action
statement.
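Here is a minimal sketch of next and exit at work. The five input lines and the
counter n are invented for the example: lines starting with # are skipped, and
the word stop ends the input early.

```shell
# Sketch: next skips a record, exit jumps to the END action.
out=$(printf 'a\n#skip\nb\nstop\nc\n' | awk '
    /^#/         { next }    # go straight to the next input record
    $0 == "stop" { exit }    # act as if the input had ended
                 { n++; print NR ": " $0 }
    END          { print n " records printed" }')
printf '%s\n' "$out"
```

Only records 1 and 3 are printed, followed by the line 2 records printed; the
record c is never read.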
The Getline Statement
One addition, when awk was expanded to nawk, was the built-in function getline.
It is also supported by the POSIX awk specification. The function may take
several forms. At its simplest, it's written getline. When written alone,
getline retrieves the next input record and splits it into fields as usual,
setting FNR, NF and NR. The function returns 1 if the operation is successful, 0
if it is at the end of the file (EOF), and -1 if the function encounters an
error. Thus,
while (getline == 1)
simulates awk's automatic input.
Writing getline variable reads the next record into variable (getline char from
the earlier menu example, for instance). Field splitting does not take place,
and NF is left unchanged; but FNR and NR are incremented. Either of the above
two may be written using input from a file besides the one containing the input
records by appending < "filename" on the end of the command. Furthermore, in
many versions of awk, getline char < "-" takes the input from standard input,
normally the keyboard. As you'd expect neither FNR nor NR is affected when the
input is read from another file. You can also write either of the two above
forms, taking the input from a command: "command" | getline or
"command" | getline variable.
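The file and command forms of getline can be sketched as follows. The temporary
file, the variable names (extra, line, greeting), and the echo command are all
made up for the example, and the -v option used to hand the file name to awk is
the POSIX way of presetting a variable, so it may be missing from the original
awk.

```shell
# Sketch: getline from a second file and from a command.
tmp=$(mktemp)
printf 'first\nsecond\n' > "$tmp"
out=$(awk -v extra="$tmp" 'BEGIN {
    while ((getline line < extra) == 1)    # 1 means a record was read
        n++
    close(extra)
    "echo hello" | getline greeting        # one record from a command
    close("echo hello")
    print n, line, greeting
}')
rm -f "$tmp"
printf '%s\n' "$out"
```

This prints 2 second hello: two records were read, line still holds the last
one, and greeting holds the output of the command.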
Output
There are two forms of printing in awk: the print statement and the printf
statement. Until now, I have used the print statement. It is the fallback. There
are two forms of the print statement. One has parentheses; one doesn't. So,
print $0 is the same as print($0). In awk shorthand, the statement print by
itself is equivalent to print $0. As shown in an earlier example, a blank line
is printed with the statement print "". Use the format you prefer.
NOTE: print() is not accepted shorthand; it generates a syntax error.
Nawk requires parentheses, if the print statement involves a relational
operator.
For a simple example consider file1:
$ cat file1
1 10
3 8
5 6
7 4
9 2
10 0
The command line
$ nawk 'BEGIN {FS="\t"}; {print($1>$2)}' file1
shows
0
0
0
1
1
1
on the monitor.
Knowing that 0 indicates false and 1 indicates true, the above is what you'd
expect, but most programming languages won't print the result of a relation
directly. Nawk will.
NOTE: This requires nawk or later. Trying the above in awk results in a syntax
error.
Nawk prints the results of relations with both print and printf. Both print and
printf require the use of parentheses when a relation is involved, however, to
distinguish between > meaning greater than and > meaning the redirection
operator.
The printf Statement
printf is used when the use of formatted output is required. It closely
resembles C's printf. Like the print statement, it comes in two forms: with and
without parentheses. Either may be used, except the parentheses are required
when using a relational operator. (See below.)
printf format-specifier, variable1,variable2, variable3,..variablen
printf(format-specifier, variable1,variable2, variable3,..variablen)
The format specifier is always required with printf. It contains both any
literal text, and the specific format for displaying any variables you want to
print. Each display format begins with a %. Any combination of three modifiers
may follow: a - indicates the variable should be left justified within its
field; a number indicates the minimum width of the field, and if the number
begins with a 0, the field is padded with 0s as needed (%05d prints an integer
5 wide, padded with 0s); the last modifier is .number, whose meaning depends on
the type of variable: the number indicates either the maximum string width, or
the number of digits to follow to the right of the decimal point. After zero or
more modifiers, the display format ends with a single character indicating the
type of variable to display.
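The three modifiers can be seen side by side in a short sketch. The values are
arbitrary, and the square brackets are there only to make the field widths
visible.

```shell
# Sketch: width, left justification, zero padding, and precision.
out=$(awk 'BEGIN {
    printf("[%5d] [%-5d] [%05d]\n", 42, 42, 42)       # three ways to fill 5 columns
    printf("[%8.2f] [%.3s]\n", 3.14159, "Example")    # .number on a number and a string
}')
printf '%s\n' "$out"
```

The first line shows 42 right justified, left justified, and padded with 0s;
the second shows 3.14 in a field 8 wide and the string cut to Exa.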
TIP: And yes, numbers can be displayed as characters and nondigit strings can be
displayed as a number. With printf anything goes!
Remember the format specifier has a string value and since it does, it must
always be enclosed in double quotes("), whether it is a literal string such as
printf("This is an example of a string in the display format.")
or a combination,
printf("This is the %d example", occurrence)
or just a variable
printf("%d", occurrence).
NOTE: The POSIX awk specification (and hence gawk) supports the dynamic field
width and precision modifiers like ANSI C printf() routines do. To use this
feature, place an * in place of either of the actual display modifiers and the
value will be substituted from the argument list following the format string.
Neither awk nor nawk has this feature.
Before I go into detail about display format modifiers, I will show the
characters used for display types. The following list shows the format specifier
types without any modifiers.
Table 15.8. The format specifiers in awk.
Format   Meaning
%c       An ASCII character
%d       A decimal number (an integer, no decimal point involved)
%i       Just like %d (Remember i for integer)
%e       A floating point number in scientific notation (1.00000E+01)
%f       A floating point number (10001010.434)
%g       awk chooses between %e or %f display format, the one producing a
         shorter string is selected. Nonsignificant zeros are not printed.
%o       An unsigned octal (base 8) number
%s       A string
%x       An unsigned hexadecimal (base 16) number
%X       Same as %x but letters are uppercase rather than lowercase.
NOTE: If the argument used for %c is numeric, it is treated as a character and
printed. Otherwise, the argument is assumed to be a string and only the first
character of that string is printed.
Look at some examples without display modifiers. When the file file1 looks like
this:
$ cat file1
34
99
-17
2.5
-.3
the command line
awk '{printf("%c %d %e %f\n", $1, $1, $1, $1)}' file1
produces the following output:
" 34 3.400000e+01 34.000000
c 99 9.900000e+01 99.000000
_ -17 -1.700000e+01 -17.000000
_ 2 2.500000e+00 2.500000
0 -3.000000e-01 -0.300000
By contrast, a slightly different format string produces dramatically different
results with the same input:
$ awk '{printf("%g %o %x\n", $1, $1, $1)}' file1
34 42 22
99 143 63
-17 37777777757 ffffffef
2.5 2 2
-0.3 0 0
Now let's change file1 to contain just a single word:
$ cat file1
Example
The string above has seven characters. For clarity, I have used * instead of a
blank space so the total field width is visible on paper.
printf("%s\n", $1)
Example
printf("%9s\n", $1)
**Example
printf("%-9s\n", $1)
Example**
printf("%.4s\n", $1)
Exam
printf("%9.4s\n", $1)
*****Exam
printf("%-9.4s\n", $1)
Exam*****
One topic pertaining to printf remains. The function printf was written so that
it writes exactly what you tell it to write—and how you want it written, no more
and no less. That is acceptable until you realize that you can't enter every
character you may want to use from the keyboard. Awk uses the same escape
sequences found in C for nonprinting characters. The two most important to
remember are \n for a newline and \t for a tab character.
TIP: There are two ways to print a double quote, neither of which is that
obvious. One way around this problem is to pass the ASCII value of the double
quote to printf with the %c format:
doublequote = 34
printf("%c", doublequote)
The other strategy is to use a backslash to escape the default interpretation of
the double quote as the end of the string:
printf("Joe said \"undoubtedly\" and hurried along.\n")
This second approach doesn't always work, unfortunately.
Closing Files and Pipes
Unlike most programming languages there is no way to open a file in awk; opening
files is implicit. However, you must close a file if you intend to read from it
after writing to it. Suppose your awk program sends output through a pipe to the
command cat file1 > file2. Before you can read file2 you must close the pipe. To
do this, use the statement close("cat file1 > file2"), giving the same string
that opened the pipe. You may also do the same for a file: close("file2").
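A minimal sketch of writing a file, closing it, and reading it back in the same
program follows. The temporary file and the variable names (f, rec, n) are
assumptions for the example, and -v is the POSIX option for presetting a
variable.

```shell
# Sketch: close a file after writing so it can be read back.
tmp=$(mktemp)
out=$(awk -v f="$tmp" 'BEGIN {
    print "line one" > f
    print "line two" > f    # the file is already open, so this adds a line
    close(f)                # finish writing before reading
    while ((getline rec < f) == 1)
        n++
    close(f)
    print n, rec
}')
rm -f "$tmp"
printf '%s\n' "$out"
```

This prints 2 line two. Without the first close, the read may see an empty or
incomplete file, since the output is usually buffered.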
Command Line Arguments
As you have probably noticed, awk presents a programmer with a variety of ways
to accomplish the same thing. This section focuses on the command line. You will
see how to pass command line arguments to your program from the command line and
how to set the value of built-in variables on the command line. A summary of
command line options concludes the section.
Passing Command Line Arguments
Command line arguments are available in awk through a built-in array called, as
in C, ARGV. The value of the built-in ARGC is the number of elements in ARGV:
the command name plus the arguments that are not options. Given the command
line awk -f programfile infile1, ARGC has a value of 2. ARGV[0] = awk and
ARGV[1] = infile1.
NOTE: The subscripts for ARGV start with 0 not 1.
programfile is not considered an argument; no option argument is. Had -F, been
on the command line, ARGV would not contain the comma either. Note that this
behavior differs from C, where argc and argv include the options.
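To see what awk puts in ARGV, you can print it from a BEGIN action. In this
sketch the names infile1 and infile2 are placeholders; because the program has
only a BEGIN action, awk never tries to open them. ARGV[0] is left out of the
loop because its exact value depends on how awk was invoked.

```shell
# Sketch: list the command line arguments awk sees.
out=$(awk 'BEGIN {
    print ARGC                    # the command name plus two arguments
    for (i = 1; i < ARGC; i++)
        print i, ARGV[i]
}' infile1 infile2)
printf '%s\n' "$out"
```

The output is 3, then 1 infile1, then 2 infile2.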
Setting Variables on the Command Line
It is possible to pass variable values from the command line to your awk program
just by stating the variable and its value. For example, for the command line,
awk -f programfile infile x=1 FS=,. Normally, command line arguments are
filenames, but the equal sign indicates an assignment. This lets variables
change value before and after a file is read. For instance, when the input is
from multiple files, the order they are listed on the command line becomes very
important since the first named input file is the first input read. Consider the
command line awk -f program file2 file1 and this program segment.
NR == 1 { if ( FILENAME != "file1") { # FILENAME is set once input is read
print "Unexpected input...Abandon ship!"
exit
}
}
The programmer has written this program to accept one file as first input and
anything else causes the program to do nothing except print the error message.
awk -f program x=1 file1 x=2 file2
The change in variable values above can also be used to check the order of
files. Since you (the programmer) know their correct order, you can check for
the appropriate value of x.
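The effect of placing assignments between file names can be sketched with two
one-line files. The directory, the files, and their contents are made up for
the example.

```shell
# Sketch: x changes value between the two input files.
dir=$(mktemp -d)
printf 'a\n' > "$dir/file1"
printf 'b\n' > "$dir/file2"
out=$(cd "$dir" && awk '{ print x, FILENAME, $0 }' x=1 file1 x=2 file2)
rm -rf "$dir"
printf '%s\n' "$out"
```

This prints 1 file1 a followed by 2 file2 b, since each assignment takes effect
when awk reaches it on the command line.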
TIP: Awk only allows two command line options. The -f option indicates the file
containing the awk program. When no -f option is used, the program is expected
to be a part of the command line. The POSIX awk specification adds the option of
using more than one -f option. This is useful when running more than one awk
program on the same input. The other option is the -Fchar option where char is
the single character chosen as the input field separator. Without a specified -F
option, the input field separator is a space, until the variable FS is otherwise
set.
Functions
This section discusses user-defined functions, also known in some programming
languages as subroutines. For a discussion of functions built into awk see
either "Strings" or "Arithmetic" as appropriate.
The ability to add, define, and use functions was not originally part of awk. It
was added in 1985 when awk was expanded. Technically, this means you must use
either nawk or gawk, if you intend to write awk functions; but again, since some
systems use the nawk implementation and call it awk, check your man pages before
writing any code.
Function Definition
An awk function definition statement appears like the following:
function functionname(list of parameters) {
the function body
}
A function can exist anywhere a pattern action statement can be. As most of awk
is, functions are free format but must be separated with either a semicolon or a
newline. Like the action part of a pattern action statement, newlines are
optional anywhere after the opening curly brace. The list of parameters is a
list of variables separated by commas that are used within the function. The
function body consists of one or more statements, just like the action part of
a pattern action statement.
A function is invoked with a function call from inside the action part of a
regular pattern action statement. The left parenthesis of the function call must
immediately follow the function name, without any space between them to avoid a
syntactic ambiguity with the concatenation operator. This restriction does not
apply to the built-in functions.
Parameters
Most function variables in awk are given to the function call by value (arrays
are the exception; they are passed by reference). Actual
parameters listed in the function call of the program are copied and passed to
the formal parameters declared in the function. For instance, let's define a new
function called isdigit, as shown:
function isdigit(x) {
x=8
}
{ x=5
print x
isdigit(x)
print x
}
Now let's use this simple program:
$ awk -f isdigit.awk
5
5
The call isdigit(x) copies the value of x into the local variable x within the
function itself. The initial value of x here is five, as is shown in the first
print statement, and is not reset to a higher value after the isdigit function
is finished. Note that if there was a print statement at the end of the isdigit
function itself, however, the value would be eight, as expected. Call by value
ensures you don't accidentally clobber an important value.
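One thing worth knowing alongside call by value is that arrays behave
differently: a function receives the caller's array itself, not a copy. The
sketch below uses a made-up function named change to show both cases.

```shell
# Sketch: a plain variable is copied, an array is shared with the caller.
out=$(awk '
    function change(x, arr) {
        x = 8           # changes only the copy inside the function
        arr["k"] = 8    # changes the array that was passed in
    }
    BEGIN {
        x = 5; a["k"] = 5
        change(x, a)
        print x, a["k"]
    }')
printf '%s\n' "$out"
```

This prints 5 8: x keeps its value, while the array element is updated.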
Variables
Local variables in a function are possible. However, as functions were not a
part of awk until awk was expanded, handling local variables in functions was
not a concern. It shows: local variables must be listed in the parameter list
and can't just be created as used within a routine. By convention, extra spaces
separate local variables from program parameters; commas still separate the
names. For example, function isdigit(x,   a, b) indicates that x is a program
parameter, while a and b are local variables; they have life and meaning only
as long as isdigit is active.
Global variables are any variables used throughout the program, including inside
functions. Any changes to global variables at any point in the program affect
the variable for the entire program. In awk, to make a variable global, just
exclude it from the parameter list entirely.
Let's see how this works with an example script:
function isdigit(x) {
x=8
a=3
}
{ x=5 ; a = 2
print "x = " x " and a = " a
isdigit(x)
print "now x = " x " and a = " a
}
The output is:
x = 5 and a = 2
now x = 5 and a = 3
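Here is a small sketch of local variables declared in the parameter list. The
function total and its variables are invented for the example; the extra spaces
before i and sum are only a visual convention marking them as locals.

```shell
# Sketch: i and sum are local to total, so the globals are untouched.
out=$(awk '
    function total(n,    i, sum) {
        for (i = 1; i <= n; i++)
            sum += i    # sum starts out empty, which counts as 0
        return sum
    }
    BEGIN {
        i = 99; sum = "untouched"
        print total(4), i, sum
    }')
printf '%s\n' "$out"
```

This prints 10 99 untouched, showing that the global i and sum survive the
call.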
Function Calls
Functions may call each other. A function may also be recursive (that is, a
function may call itself multiple times). The best example of recursion is
factorial numbers: factorial(n) is computed as n * factorial(n-1) down to n=1,
which has a value of one. The value factorial(5) is 5 * 4 * 3 * 2 * 1 = 120 and
could be written as an awk program:
function factorial(n) {
if (n == 1) return 1;
else return ( n * factorial(n-1) )
}
For a more in-depth look at the fascinating world of recursion I recommend you
see either a programming or data structures book.
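To try the factorial function, it can be called from a BEGIN action. This
sketch tests n <= 1 rather than n == 1, a small change of mine so that
factorial(0) also stops instead of recursing forever.

```shell
# Sketch: calling a recursive function from BEGIN.
out=$(awk '
    function factorial(n) {
        if (n <= 1) return 1    # stops the recursion for 0 and 1
        return n * factorial(n - 1)
    }
    BEGIN { print factorial(5), factorial(1) }')
printf '%s\n' "$out"
```

This prints 120 1.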
Gawk follows the POSIX awk specification in almost every aspect. There is a
difference, though, in function declarations. In gawk, the word func may be used
instead of the word function. The POSIX2 spec mentions that the original awk
authors asked that this shorthand be omitted, and it is.
The Return Statement
A function body may (but doesn't have to) end with a return statement. A return
statement has two forms. The statement may consist of the direction alone:
return. The other form is return E, where E is some expression. In either case,
the return statement gives control back to the calling function. The return E
statement gives control back, and also gives a value to the function.
TIP: Be careful: if the function is supposed to return a value and doesn't
explicitly use the return statement, the results returned to the calling program
are undefined.
Let's revisit the isdigit() function to see how to make it finally ascertain
whether the given character is a digit or not:
function isdigit(x) {
if (x >= "0" && x <= "9")
return 1;
else
return 0
}
As with C programming, I use a value of zero to indicate false, and a value of 1
indicates true. A return statement often is used when a function cannot continue
due to some error. Note also that with inline conditionals—as explained
earlier—this routine can be shrunk down to a single line: function isdigit(x) {
return (x >= "0" && x <= "9") }
Writing Reports
This section discusses writing reports. Before continuing with this section, it
would be a good idea to be sure you are familiar with both the UNIX sort command
(see section "Sorting Text Files" in Chapter 6) and the use of pipes in UNIX
(see section "Pipes" in Chapter 4). Generating a report in awk is a sequence of
steps, with each step producing the input for the next step. Report writing is
usually a three step process: pick the data, sort the data, make the output
pretty.
BEGIN and END Revisited
The section on "Patterns" discussed the BEGIN and END patterns as pre- and
post-input processing sections of a program. Along with initializing variables,
the BEGIN pattern serves another purpose: BEGIN is awk's provided place to print
headers for reports. Indeed, it is the only chance, because awk runs an ordinary
action once for every input record. The lines:
{ print " Total Sales"
print " Salesperson for the Month"
print " ---------------" }
would print a header for each input record rather than a single header at the
top of the report! The same is true for the END pattern, only it follows the
last input record. So,
{ print "---------------"
print " Total sales", ttl }
should only be in the END pattern.
Much better would be:
BEGIN { print " Total Sales"
print " Salesperson for the Month"
print " ---------------" }
{ per person processing statements }
END { print "---------------"
print " Total sales", ttl }
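A runnable version of this layout might look like the sketch below. The monthly_report wrapper, the column widths, and the sample figures are assumptions for illustration; only the BEGIN, per-record, and END structure comes from the text:

```shell
monthly_report() {
    awk '
    BEGIN { print "Salesperson      Sales"      # header: printed once
            print "----------------------" }
          { printf "%-12s %9d\n", $1, $2        # runs once per input record
            ttl += $2 }
    END   { print "----------------------"      # footer: printed once
            printf "%-12s %9d\n", "Total sales", ttl }
    ' "$@"
}

# Invented input: one salesperson and a monthly figure per line.
printf 'ann 320\nbob 75\nlee 350\n' | monthly_report
```

The header and the total each appear exactly once, no matter how many records are read.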
The Built-in System Function
While awk allows you to accomplish quite a few tasks with a few lines of code,
it's still helpful sometimes to be able to tie in the many other features of
UNIX. Fortunately almost all versions of nawk++ have the built-in function
system(value) where value is a string that you would enter from the UNIX command
line.
NOTE: The original awk does NOT have the system function.
The command text is enclosed in double quotes, and variables are joined to it by
concatenation, which awk writes as a space between the two values. For example,
if I am making a packet of files to e-mail to someone, I put a list of the files
I wish to send in a file called sendrick:
$ cat sendrick
/usr/anne/ch1.doc
/usr/informix/program.4gl
/usr/anne/pics.txt
Then awk can build the concatenated file with:
$ nawk '{system("cat " $1)}' sendrick > forrick
Note the space after cat inside the quotes; without it, the command name and the
file name would run together. This creates a file called forrick containing a
full copy of each file. Yes, a shell script could be written to do the same
thing, but shell scripts don't do the pattern matching that awk does, and they
are not great at writing reports either.
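Since system() returns the exit status of the command it runs, the same idea can report files that could not be read. This is a sketch under assumptions: the file names part1.txt, part2.txt, missing.txt, filelist, and packet.txt are all invented, and the command string needs a space between the command name and the file name:

```shell
# Invented sample files; missing.txt is deliberately never created.
printf 'first file\n'  > part1.txt
printf 'second file\n' > part2.txt
printf 'part1.txt\npart2.txt\nmissing.txt\n' > filelist

awk '{
    # Build the command by concatenation; note the space after cat.
    status = system("cat " $1 " 2>/dev/null")
    if (status != 0)
        print "could not read " $1 | "cat 1>&2"   # complain on standard error
}' filelist > packet.txt

cat packet.txt
```

File names containing spaces or shell metacharacters would need extra quoting inside the command string, which this sketch does not attempt.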
UNIX users are split roughly in half over which text editor they use: vi or
emacs. I began using UNIX with the vi editor, so I prefer vi. The vi editor has
no way to set off a block of text and perform some operation on it, such as a
move or delete, and so falls back on the common measure, the line; a specified
number of lines are deleted or copied.
When dealing with long programs, I don't like to guess about the line numbers in
a block, or take the time to count them either! So I have a short script that
adds line numbers to my printouts for me. It is centered around the following
awk program. See file LST15_10 on the CD-ROM.
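The CD-ROM listing is not reproduced here. As a stand-in, a line-numbering program can be sketched in one line; this is an assumption about what such a program might look like, not the author's LST15_10, and the number_lines name and the five-column number width are arbitrary choices:

```shell
# number_lines is a hypothetical wrapper; it reads the named files or standard input.
number_lines() {
    awk '{ printf "%5d  %s\n", NR, $0 }' "$@"
}

# Number the lines of a tiny sample program.
printf 'BEGIN { n = 0 }\n{ n++ }\nEND { print n }\n' | number_lines
```

NR counts records across all input files, so several files listed together are numbered continuously; FNR could be used instead to restart the count for each file.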
Advanced Concepts
As you spend more time with awk, you might yearn to explore some of the more
complex facets of the programming language. I highlight some of the key ones
below.
Multi-Line Records
By default, the input record separator RS is a newline, so each line is one
record. As is the norm in awk, this can be changed to allow for multi-line
records. When RS is set to the null string, records are separated by blank
lines, and the newline character always acts as a field separator, in addition
to whatever value FS may have.
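A short sketch shows the effect. The file addresses.txt and the names in it are invented; setting FS to a newline as well makes each line of a record one field:

```shell
# Invented sample data: two three-line records separated by a blank line.
cat > addresses.txt <<'EOF'
Pat Lee
12 Elm Street
Springfield

Bob Jones
9 Oak Avenue
Riverton
EOF

# RS = "" selects blank-line-separated records; FS = "\n" makes each line a field.
awk 'BEGIN { RS = ""; FS = "\n" }
     { print NR ": " $1 " lives in " $NF " (" NF " lines)" }' addresses.txt
```

Here NR counts whole address blocks rather than lines, and NF is the number of lines in each block.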
Multidimensional Arrays
While awk does not directly support multidimensional arrays, it can simulate
them using the single dimension array type awk does support. Why do this? An
array may be compared to a bunch of books. Different people access them in
different ways. Someone who doesn't have many may keep them on a shelf in the
room; consider this a single dimension array with each book at location[i]. Time
passes and you buy a bookcase. Now each book is in location[shelf,i]. The
comparison goes as far as you wish; consider the intercounty library with each
book at location[branchnum, floor, room, bookcasenum, shelf, i]. The appropriate
dimensions for the array depend very much on the type of problem you are
solving. If the intercounty library keeps track of all its books by catalog
number rather than by location, a single dimension of book[catalog_num] = title
makes more sense than location[branchnum, floor, room, bookcasenum, shelf, i] =
title. Awk allows either choice.
Awk stores array subscripts as strings rather than as numbers, so adding another
dimension is actually only a matter of concatenating another subscript value to
the existing subscript. Suppose you design a program to inventory jeans at
Levi's. You could set up the inventory so that item[inventorynum]=itemnum or
item[style, size, color] = itemnum. The built-in variable SUBSEP is put between
subscripts when a comma appears between subscripts. SUBSEP defaults to the value
\034, a value with little chance of being in a subscript. Since SUBSEP marks the
end of each subscript, subscript names do not have to be the same length. For
example,
item["501", "12w", "stone washed blue"]
item["dockers", "32m", "black"]
item["relaxed fit", "9j", "indigo"]
are all valid examples of the inventory. Determining the existence of an element
is done just as it is for a single dimension array, with the addition of
parentheses around the subscript list. Every subscript must be supplied; awk has
no wildcard for a missing dimension. Suppose your program should reorder when a
certain size gets low:
if (("501", "12w", "stone washed blue") in item) print a tag.
NOTE: The in keyword is nawk++ syntax.
The price increases on 501s, and your program is responsible for printing new
price tags for the items which need a new tag. Loop over every subscript in the
array and use the string function split, with SUBSEP as the separator, to
recover the individual subscripts:
for (key in item) {
split(key, part, SUBSEP)
if (part[1] == "501")
print a new tag.
}
This visits every element in the array with "501" as its first subscript.
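The pieces of this section can be tried together in a small sketch. The quantities, the inventory_demo wrapper, and the array name part are invented for illustration; the subscripts are the ones used in the text:

```shell
inventory_demo() {
    awk 'BEGIN {
        # Hypothetical inventory: item[style, size, color] = quantity on hand.
        item["501", "12w", "stone washed blue"] = 4
        item["501", "32m", "black"]             = 0
        item["dockers", "32m", "black"]         = 7

        # Membership test: the full subscript list goes inside parentheses.
        if (("501", "12w", "stone washed blue") in item)
            print "have 501 12w stone washed blue"
        if (!(("501", "9j", "indigo") in item))
            print "no 501 9j indigo"

        # Find the 501s by splitting each combined subscript on SUBSEP.
        for (key in item) {
            split(key, part, SUBSEP)
            if (part[1] == "501")
                n501++
        }
        print "501 entries: " n501
    }'
}

inventory_demo
```

The loop counts matches instead of printing them because the order in which for (key in item) visits the subscripts is not defined.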
Summary
In this chapter I have covered the fundamentals of awk as a programming language
and as a tool. At the beginning of the chapter I gave an introduction to the key
concepts, an overview of what you need to know to get started writing and
using awk. I spoke about patterns, a feature that sets awk apart from other
programming languages. Two sections were devoted to variables, one on
user-defined variables and one on built-in variables.
The later part of the chapter discussed awk as a programming language. I
covered conditional statements, looping, arrays, input and output, and
user-defined functions. I closed with a brief section on writing reports.
The next chapter is about Perl, a language closely related to awk.
Table 15.4. Built-in Variables in Awk
V is the first implementation using the variable: A = awk, G = gawk, P = POSIX
awk, N = nawk.
V  Variable     Meaning                                     Default (if any)
N  ARGC         The number of command line arguments
N  ARGV         An array of command line arguments
A  FS           The input field separator                   space
A  NF           The number of fields in the current record
G  CONVFMT      The conversion format for numbers           %.6g
G  FIELDWIDTHS  A white-space separated
G  IGNORECASE   Controls the case sensitivity               zero (case sensitive)
P  FNR          The current record number
A  FILENAME     The name of the current input file
A  NR           The number of records already read
A  OFS          The output field separator                  space
A  ORS          The output record separator                 newline
A  OFMT         The output format for numbers               %.6g
N  RLENGTH      Length of string matched by match function
A  RS           Input record separator                      newline
N  RSTART       Start of string matched by match function
N  SUBSEP       Subscript separator                         "\034"
Further Reading
For further reading:
Aho, Alfred V., Brian W. Kernighan and Peter J. Weinberger, The awk Programming Language. Reading, Mass.: Addison-Wesley, 1988 (copyright AT&T Bell Lab.)
IEEE Standard for Information Technology, Portable Operating System Interface (POSIX), Part 2: Shell and Utilities, Volume 2. Std. 1003.2-1992. New York: IEEE, 1993.
See also the man pages for awk, nawk, or gawk on your system.
Obtaining Source Code
Awk comes in many varieties. I recommend either gawk or nawk. Nawk is the more
standard whereas gawk has some non-POSIX extensions not found in nawk. Either
version is a good choice.
To obtain nawk from AT&T: nawk is in the UNIX Toolkit. The dialup number in the
United States is 908-522-6900, login as guest.
To obtain gawk: contact the Free Software Foundation, Inc. The phone number is
617-876-3296.