Awk as tool and scripting language

warangkana bunyarittongchai, 123RF.com


Text Master

Linux relies heavily on text files: System configuration, system management, and data exchange all use them. With Awk, you have a powerful tool at your fingertips for editing text and modifying it selectively.

Awk is a scripting language specially conceived for editing and evaluating text files, and it provides a very useful tool for system administrators.

With its help, for example, a modified data path can be integrated system-wide in a single step. Many more possible applications exist, however, including evaluating text files with data recorded in tabular format.

An Awk script consists of single commands or command blocks that are executed under certain conditions or for all input rows. You can call the Awk commands directly from the command line or save them in a file as a script. You can find all sample data for this article online [1]. To try the examples, unpack the archive into an empty directory. The archive will include all the files for your exercises. Some examples include Linux commands along with Awk commands.

The first example uses the $USER system variable and takes you on a little tour of potential Awk applications:

$ echo $USER | awk '{print "Hello " $1 "!"}'

Awk receives the content of $USER on its standard input, not as a command parameter. While processing the input, Awk splits each row into data fields and stores them in the variables $1 through $NF. The NF variable contains the number of data fields. Here, Awk takes the first field of the input row and uses print to write it, together with the word "Hello", to its own standard output. Multiple commands executed in succession are separated by a semicolon. To output only the third and fourth field of an input row, use:

$ echo "A B C D" | awk '{print $3 " " $4}'
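Because NF holds the number of fields, $NF always refers to the last field, however long the row is. The following minimal sketch shows this with the same invented input row; the variable name last is only for illustration.

```shell
# NF is the field count, so $NF is the last field of the row.
last=$(echo "A B C D" | awk '{print NF ": " $NF}')
echo "$last"   # prints "4: D"
```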

Of course, a tool like Awk can also handle multiple-row input. As an example, I'll use the Unix cal program, which prints a calendar month to the terminal. Take a look at Figure 1. Awk prints the row number (NR), the number of fields (NF), and then the whole row ($0) as a control. The fields in this first example are separated by spaces or tabs. Later, I'll show how to use other separators.

Figure 1: Awk can handle multiple-row input and separate data into fields.
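The exact command behind Figure 1 is not reproduced in the text, but a command along the following lines produces this kind of output. As a sketch, two invented rows stand in for the output of cal so that the result is repeatable; the variable name numbered is an assumption for illustration.

```shell
# Two invented rows stand in for the output of cal.
numbered=$(printf 'Mo Tu We\n1 2\n' | awk '{print NR ": " NF " fields: " $0}')
echo "$numbered"
# 1: 3 fields: Mo Tu We
# 2: 2 fields: 1 2
```

To try it on a real calendar, replace the printf command with cal.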

Awk can use many other variables apart from NF and NR when interpreting input and controlling the output. The Awk manual [2] includes an entire summary, as does the program's man page. With GNU Awk (gawk), the --dump-variables option writes the global variables and their final values to a file (awkvars.out by default) for later inspection.

Analyzing Files

Most practical applications read data from files and process it row by row, often from several files one after another. The awk '{print}' /etc/group command returns all rows in the /etc/group file, which contains rows with the data fields Group_name, Group_password, Group_ID, and User_list (all the users in the group).

In this case, the data fields are separated by colons instead of spaces or tabs. To read group names or other data fields, you must set the field separator variable (FS), for example with the -F option:

$ awk -F':' '{print $1}' /etc/group

You can also use POSIX character classes or regular expressions as separators. If you have a file with fields separated by commas, tabs, or spaces, for example, use the following command:

$ awk -F'[,[:blank:]]' '{print $1 $2 $3}' data2.txt
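A separator given as a regular expression can also match runs of characters. The sketch below uses an invented input row and a plain bracket expression; the trailing + treats any run of commas and semicolons as one separator. Writing the fields with commas in print inserts the output separator (a space by default) between them instead of gluing them together.

```shell
# Invented input row: fields separated by runs of commas and semicolons.
names=$(printf 'alice,;bob;carol\n' | awk -F'[,;]+' '{print $1, $2, $3}')
echo "$names"   # prints "alice bob carol"
```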

With Awk, you can change the order of the fields on output. If you want to output the ID first, then the group name and the users, you can do so as follows:

$ awk -F':' '{print $3 "  " $1 " " $4}' /etc/group

The print command is suitable only for simple, unformatted output. For finer control over the layout, use the printf() command, which works as in C. Formatting options for numbers and character strings are described in detail in the Awk manual [2]. The following command formats the ID and user list:

$ awk -F':' '{printf("%5s %s\n",$3,$4)}' /etc/group
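To see what the width specifications do, it can help to run printf() on a known row. This sketch assumes one invented /etc/group-style row and adds visible | characters between the columns: %-8s left-aligns a string in eight characters, and %5d right-aligns a number in five.

```shell
# Invented group row; the | characters only make the column widths visible.
formatted=$(printf 'audio:x:29:pulse,alice\n' |
  awk -F':' '{printf("%-8s|%5d|%s\n", $1, $3, $4)}')
echo "$formatted"   # prints "audio   |   29|pulse,alice"
```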

You will often want to blank out certain fields in the output. You can do this by assigning a null string to the field, as follows:

$ awk -F':' '{$2=""; print}' /etc/group

The output of this example no longer has the fields separated by colons: Assigning a value to a field makes Awk rebuild the row with the output field separator, which defaults to a single space. The emptied field keeps its position, so two separators appear in succession. You can set the output separator with the OFS variable.
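As a minimal sketch of the OFS variable, the following command sets both separators to a colon in a BEGIN block, so the rebuilt row keeps the original layout. The input row is invented for illustration.

```shell
# FS splits the input at colons; OFS joins the rebuilt row with colons again.
kept=$(printf 'audio:x:29:pulse\n' |
  awk 'BEGIN {FS = ":"; OFS = ":"} {$2 = ""; print}')
echo "$kept"   # prints "audio::29:pulse"
```

The second field is still present but empty, which is why two colons appear next to each other.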

Patterns and Pattern Ranges

So far, you have applied all the Awk commands to all input rows. In most cases, only selected rows need to be processed. The following example looks for group names in the /etc/group file that have no members. The condition is that the last field of the data row must be empty, and only those rows are printed.

$ awk -F':' '$NF=="" {print $0}' /etc/group

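An Awk script consists of three main blocks: BEGIN, main, and END, with at least one block present. The BEGIN block runs once before the first row is read, and the END block runs once after the last row. The main block includes pairs of conditions and actions; Awk tests each condition against every input row and runs the action block whenever the condition is true.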

condition1 { actionblock1 }
condition2 { actionblock2 }
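The three blocks can be sketched together in one small script. The example below assumes three invented group rows and counts those without members; the counter name empty is chosen only for illustration.

```shell
# Three invented rows in /etc/group format; two of them have no members.
summary=$(printf 'root:x:0:\naudio:x:29:pulse\nlp:x:7:\n' | awk '
BEGIN { FS = ":" }        # runs once, before any input is read
$NF == "" { empty++ }     # main block: runs for rows with an empty last field
END { print empty " of " NR " groups have no members" }   # runs once, at the end
')
echo "$summary"   # prints "2 of 3 groups have no members"
```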

The conditions are also called patterns in Awk. Patterns can consist of strings, regular expressions, logical expressions, and ranges. For example, you can output all the groups that contain your user name with the following command:

$ awk -F':' '/'"$USER"'/ {print $1}' /etc/group

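Here the shell expands the $USER system variable before Awk runs, and Awk uses the result as a search pattern. Unlike in the previous examples, Awk processes only those rows that match the pattern. Note that the pattern matches anywhere in the row, so it also finds rows in which the name is only part of a longer group or user name.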
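If you need exact matches on the user list only, one possible alternative is to pass the name to Awk with the -v option and compare it against each entry in the fourth field. This is a sketch, not the method used in the command above; the user name alice and the sample rows are invented, and in practice you could pass "$USER" instead.

```shell
# "alice" is a hypothetical user; the group "malice" must not be reported.
groups_for=$(printf 'audio:x:29:alice,bob\nmalice:x:30:bob\nvideo:x:44:alice\n' |
  awk -F':' -v user="alice" '{
    n = split($4, members, ",")        # split the user list at commas
    for (i = 1; i <= n; i++)
      if (members[i] == user) { print $1; break }
  }')
echo "$groups_for"
# audio
# video
```

Passing the value with -v also keeps characters that are special in regular expressions from being interpreted as part of a pattern.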

Regular expressions are text patterns used in Perl, Awk, and many other programs and programming languages. You can find an introduction to regular expressions in the fifth edition of our Linux Shell Handbook [3], and a standard reference is Jeffrey Friedl's book [4].

Awk uses regular expressions as search patterns in the format /<regex>/. To search for all groups beginning with the word post in /etc/group, use the following command:

$ awk -F':' '/^post/ {print $1}' /etc/group

You can combine multiple regular expressions as search patterns by using the logical operators && (logical AND) and || (logical OR). The following command finds all rows beginning with root or post:

$ awk -F':' '/^post/ || /^root/ {print $1}' /etc/group
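To compare the two operators on the same data, the following sketch runs one OR pattern and one AND pattern over four invented group rows. The variable names rows, either, and both are assumptions for illustration.

```shell
# Four invented rows in /etc/group format.
rows='postfix:x:114:
root:x:0:
postdrop:x:115:alice
mail:x:8:'
# OR: rows that begin with post or with root
either=$(printf '%s\n' "$rows" | awk -F':' '/^post/ || /^root/ {print $1}')
# AND: rows that begin with post and also end with alice
both=$(printf '%s\n' "$rows" | awk -F':' '/^post/ && /alice$/ {print $1}')
echo "$either"   # postfix, root, postdrop (one per line)
echo "$both"     # postdrop
```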

Using the combination /^ABC/ && /DEF$/, you can find all rows in a file that begin with ABC and end with DEF. You can define pattern ranges with beginning and ending conditions separated by a comma. The following two commands print rows 5 through 7 and then all groups from the first row matching root through the next row matching lp in the /etc/group input file:

$ awk -F':' 'NR==5,NR==7 {print $0}' /etc/group
$ awk -F':' '/root/,/lp/ {print $1}' /etc/group
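Because the content of /etc/group differs from system to system, a fixed sample makes the behavior of ranges easier to check. The sketch below uses five invented rows. It anchors the patterns with ^ so that they match only group names; that is a design choice of this example, not part of the commands above.

```shell
# Five invented rows in /etc/group format.
rows='root:x:0:
daemon:x:1:
lp:x:7:
mail:x:8:
news:x:9:'
# Range by row number: rows 2 through 4
by_number=$(printf '%s\n' "$rows" | awk -F':' 'NR==2,NR==4 {print $1}')
# Range by pattern: from the row starting with root through the row starting with lp
by_pattern=$(printf '%s\n' "$rows" | awk -F':' '/^root/,/^lp/ {print $1}')
echo "$by_number"    # daemon, lp, mail (one per line)
echo "$by_pattern"   # root, daemon, lp (one per line)
```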

Pattern ranges are useful for processing data that begins with a certain expression and ends with another. Awk processes only the data rows determined by the pattern range and ignores all others. Another use of pattern ranges is when data is in blocks of rows separated by blank rows. The following pattern range handles this condition:

$ awk '/main_criteria/,/^$/ {actions}' file

The range begins with the main criterion and ends with a blank row. The end condition uses the usual regular expression of the caret (^), which translates to "the beginning of the row," followed by the dollar sign ($), which translates to "the end of a row." The beginning of a row followed immediately by the end of a row, with no characters in between, means "look for an empty row."
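As a concrete sketch of this technique, the following command extracts one block from invented data in which blocks are separated by empty rows. The field labels and addresses are made up for illustration. Because the range includes the closing empty row, the action skips it explicitly.

```shell
# Two invented blocks, each terminated by an empty row.
block=$(printf 'name: eth0\naddr: 10.0.0.1\n\nname: lo\naddr: 127.0.0.1\n\n' |
  awk '/^name: lo/,/^$/ { if ($0 != "") print }')
echo "$block"
# name: lo
# addr: 127.0.0.1
```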
