Awk as tool and scripting language

warangkana bunyarittongchai, 123RF.com

Text Master

Linux relies heavily on text processing: System configuration, system management, and data exchange all use text-based files. With Awk, you have a powerful tool at your fingertips for evaluating and selectively modifying those files.

Awk is a scripting language specially conceived for editing and evaluating text files, and it provides a very useful tool for system administrators.

With its help, for example, a modified data path can be integrated system-wide in a single step. Many more possible applications exist, however, including evaluating text files with data recorded in tabular format.

An Awk script consists of single commands or command blocks that are executed either under certain conditions or for all input rows. You can call the Awk commands directly from the command line or save them in a file as a script. You can find all the sample data for this article online [1]; to try the examples, unpack the archive into an empty directory. Some examples combine Awk with other Linux commands.

The first example uses the $USER system variable and takes you on a little tour of potential Awk applications:

$ echo $USER | awk '{print "Hello " $1 "!"}'

Awk takes the content of $USER from its standard input. In processing the input, Awk stores the data fields of each row in the variables $1 through $NF; the NF variable contains the number of data fields. Here, Awk extracts the first field from the input and uses print to combine it with the word "Hello" and write the message to standard output. Multiple commands executed in succession are separated by semicolons. To output only the third and fourth fields of an input row, use:

$ echo "A B C D" | awk '{print $3 " " $4}'

Of course, a tool like Awk can also handle multiple-row input. Consider the cal Unix program, which prints a calendar month to the terminal (see Figure 1). Awk prints the row number (NR), the number of fields (NF), and then the whole row ($0) as a check. In this first example, the fields are separated by spaces or tabs; later, I'll show how to use other separators.

Figure 1: Awk can handle multiple-row input and separate data into fields.

Awk can use many other variables apart from NF and NR when interpreting input and controlling the output. The Awk manual [2] includes an entire summary, as does the program's man page. Using the --dump-variables option, you can save the important variables from an Awk command in a file for later use.
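With GNU Awk (gawk), for example, the following call writes the global variables to the file awkvars.out (the default name) after the program finishes:

$ awk --dump-variables 'BEGIN {print "test"}'
$ cat awkvars.out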

Analyzing Files

Most practical applications read data from files and process it row by row, often handling multiple files in a single call. The awk '{print}' /etc/group command returns all rows in the /etc/group file, which contains rows with the data fields Group_name, Group_password, Group_ID, and User_list (all the users in the group).

In this case, the data fields are separated by colons instead of spaces or tabs. To read group names or other data fields, you must set the field separator (stored in the FS variable), for example, with the -F option:

$ awk -F':' '{print $1}' /etc/group

You can also use POSIX character classes or regular expressions as separators. If you have a file with fields separated by commas, tabs, or spaces, for example, use the following command:

$ awk -F'[,[:blank:]]' '{print $1 $2 $3}' data2.txt

With Awk, you can change the order of the fields on output. If you want to output the ID first, then the group name and the users, you can do so as follows:

$ awk -F':' '{print $3 "  " $1 " " $4}' /etc/group

The print command is suitable only for simple output. You can get more comprehensive control over the display of data with the printf() command, which works as in C. Formatting options for numbers and character strings are described in detail in the Awk manual [2]. The following command formats the ID and user list:

$ awk -F':' '{printf("%5s %s\n",$3,$4)}' /etc/group

You will often want to exclude certain fields from the output. You can do this by assigning a null string to the field, as follows:

$ awk -F':' '{$2=""; print}' /etc/group

The output of this example no longer has the fields separated by colons, because Awk distinguishes the output separator from the input separator. You can set the output separator with the OFS variable.
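For example, the following command (a small sketch) replaces the password field and sets a semicolon as the output separator; because $2 is reassigned, Awk rebuilds the row using OFS:

$ awk -F':' 'BEGIN {OFS=";"} {$2="*"; print}' /etc/group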

Patterns and Pattern Ranges

So far, you have applied the Awk commands to all input rows. In most cases, though, only selected rows need to be processed. The following example looks for group names in the /etc/group file that have no members. The condition is that the last field of the data row must be empty; only matching rows are processed.

$ awk -F':' '$NF=="" {print $0}' /etc/group

An Awk script consists of three main blocks: BEGIN, main, and END, with at least one block present. The main block consists of pairs of conditions and actions; Awk tests each condition against every input row and executes the associated action block only when the condition matches:

condition1 { actionblock1 }
condition2 { actionblock2 }

The conditions are also called patterns in Awk. Patterns can consist of strings, regular expressions, logical expressions, and ranges. For example, you can output all the groups that contain your user name with the following command:

$ awk -F':' '/'"$USER"'/ {print $1}' /etc/group

Here the shell expands the content of the $USER system variable, and Awk uses it as a search pattern. Unlike in the previous examples, Awk processes only the rows that match the pattern.

Regular expressions are text patterns used in Perl, Awk, and many other programs and programming languages. You can find an introduction to regular expressions in the fifth edition of our Linux Shell Handbook [3], and a standard reference is Jeffrey Friedl's book [4].

Awk uses regular expressions as search patterns in the format /<regex>/. To search for all groups beginning with the word post in /etc/group, use the following command:

awk -F':' '/^post/ {print $1}' /etc/group

You can combine multiple regular expressions as search patterns by using the logical operators && (logical AND) and || (logical OR). The following command finds all rows beginning with root or post:

awk -F':' '/^post/ || /^root/ {print $1}' /etc/group

Using the combination /^ABC/ && /DEF$/, you can find all rows in a file that begin with ABC and end with DEF. You can define pattern ranges with beginning and ending conditions separated by a comma. The following two commands search rows 5 through 7 and then all groups from root through lp in the /etc/group input file:

$ awk -F':' 'NR==5,NR==7 {print $0}' /etc/group
$ awk -F':' '/root/,/lp/ {print $1}' /etc/group

Pattern ranges are useful for processing data that begins with a certain expression and ends with another. Awk processes only the data rows determined by the pattern range and ignores all others. Pattern ranges are also handy when data comes in blocks of rows separated by blank rows. The following pattern range handles this case:

awk '/main_criteria/,/^$/ {actions}'

The range begins with a row matching main_criteria and ends with a blank row. The end condition uses the usual regular expression metacharacters: the caret (^), which stands for the beginning of the row, followed by the dollar sign ($), which stands for the end of the row. The beginning of a row followed immediately by the end of a row, with no characters in between, means "match an empty row."

Built-In Functions

Awk has many useful built-in functions that expand upon the commands already mentioned. The most important of them process text strings: You can concatenate, split, search through, and replace strings as you wish. Figure 2 shows a few of these functions, which are also included in the stringfunctions.sh file. You can copy the examples into the shell and experiment with the functions before using them in scripts.

Figure 2: Awk provides a series of useful functions for processing text strings.
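For a quick taste in the shell, the following sketch uses the standard built-ins gsub(), toupper(), and length() and outputs UBUNTU 12:

$ echo "Ubuntu User" | awk '{gsub(/User/, "Admin"); print toupper($1), length($0)}'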

Another useful function is getline, with which the output of a called Unix command can be piped into an Awk variable (see the systemcalls.awk example in Listing 1). Using getline, you can also read data from files instead of from the current input. In this way, Awk can pull in additional data for formatting or comparison, for example.

Listing 1

Using getline with Awk

# Integrating a Unix command
BEGIN {
    "date +%x" | getline day;
    "date +%T" | getline time;
    printf("\nToday is %s and it is %s o'clock.\n", day, time);
}
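Reading from a file works much the same way. A minimal sketch, assuming a hypothetical header.txt file, reads it row by row in the BEGIN block:

BEGIN {
    while ((getline row < "header.txt") > 0) {
        print "Header: " row;
    }
    close("header.txt");
}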

Begin and End

Along with the conditions previously mentioned, you can use the special patterns BEGIN and END. These instruction blocks serve an important purpose in evaluating files. Awk processes the commands in the BEGIN block before it reads any data; they serve to initialize the script and set variables. For example, if you want to change the delimiters for input and output, you can set the FS and OFS variables within this block. The BEGIN block can also be used to output table headers.
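A short sketch shows both uses, setting the separators and printing a table header before any data is read:

$ awk 'BEGIN {FS=":"; OFS="\t"; print "Group\tID"} {print $1, $3}' /etc/group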

The commands in the END block are executed after the last row of data has been read. They are often used to evaluate the input data; for example, you can calculate sums and averages or add footers to output tables.
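For example, the following one-liner (a minimal sketch) sums all the group IDs while reading and computes their average in the END block:

$ awk -F':' '{sum+=$3} END {printf("Average GID: %.1f\n", sum/NR)}' /etc/group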

Once a script includes these blocks, it's usually worth saving it in a separate file. You can create these files with a text editor such as Emacs or Vi. Most editors provide syntax highlighting that helps with pairing Awk's curly braces.
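You can even make such a script directly executable with a shebang line. A minimal sketch, assuming Awk lives at /usr/bin/awk (the path may differ on your system) and using a hypothetical hello.awk file:

#!/usr/bin/awk -f
# hello.awk - a minimal executable Awk script
BEGIN { print "Hello from Awk!" }

Make it executable with chmod +x hello.awk and run it as ./hello.awk.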

Interpreting Log Files

Many larger network printers and print servers log print jobs in a text file. Logs usually include the print job originators, page sizes and counts, and free-format data fields for information such as project cost centers. Each row is one print job record. In Listing 2, you can see the table header and some sample records from a printlog.txt file.

Listing 2

Sample printlog.txt File

Document   User  Device  Format  Medium  col  b/w  costctr
C2.sxw     LAGO  pr04    DIN_A4  Normal  1    10   P01
prop.pdf   LEHM  pr03    DIN_A4  Normal  0    10   P01
offer.doc  LOHN  pr01    DIN_A4  Normal  3    0    P02
...

The Awk scripts for processing these kinds of files take up about six to eight lines, depending on the format. They are included in the examples you can download from the Ubuntu User site. You can also copy Listing 3 into a text editor and save it as eval1.awk. Start by evaluating the total number of black-and-white and color printed pages; each sum is kept in its own variable.

Listing 3

Print log evaluation 1

# Evaluating the number of printed pages
NR==1 {
    next;
}
{
    sum_color+=$6;
    sum_bw+=$7;
}
END {
    print sum_color " Printed in color";
    print sum_bw " Printed in B&W";
}

Invoke the script with awk -f eval1.awk printlog.txt, which first loads the script file and then executes its commands on printlog.txt. The script begins by skipping the row with NR==1, which ignores the table header. It would also be possible to store this header in a variable and output it at the end.
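That variation might look like the following sketch (not one of the downloadable scripts):

NR==1 { header=$0; next; }
{ sum_color+=$6; sum_bw+=$7; }
END {
    print header;
    print sum_color " Printed in color";
    print sum_bw " Printed in B&W";
}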

Awk adds the number of color ($6) and B&W ($7) pages in each row to the sum variables and prints both totals at the end. Evaluating the print jobs per cost center is a bit more complicated, but Awk can handle the task quite effectively (see Listing 4 and eval2.awk).

Listing 4

Print log evaluation 2

# Evaluating the number of B&W printed pages
# per cost center
NR==1 {
    next;
}
{printer[$8]+=$7}
END {
    print "costctr. totals";
    for (F in printer) {print F " " printer[F]}
}

Here again, the next command skips the first row. The remaining rows are evaluated to count the number of B&W pages per cost center: The printer[$8]+=$7 command adds the page count of each row to the printer[] array entry for its cost center. Once all the records are read, the for loop in the END block iterates over the printer[] array and outputs the total for each cost center.

The indices of the printer[] array are the names of the cost centers. Among this article's examples, you will also find the eval3.awk script, which sums all printed pages per user by combining two data fields.
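A per-user evaluation along those lines could look like this sketch (not the downloadable eval3.awk itself), adding the color and B&W counts per user:

NR==1 { next; }
{ pages[$2] += $6 + $7 }
END {
    print "User totals";
    for (U in pages) { print U " " pages[U] }
}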

Evaluating Number Values

The following example evaluates simple numeric data with a time column and five columns of floating-point values (Listing 5). Because the data uses decimal points, be sure to set LC_ALL=C in the shell before you try this example, so that Awk interprets the numbers correctly regardless of your locale.

Listing 5

Sample Number Values

t               Val1            Val2            Val3            Val4            Val5
0.100000        0.194000        0.166000        0.162000        0.155000        0.194200
0.200000        0.440000        0.388000        0.359000        0.392000        0.400000
...

Listing 5 shows the table header and the first few rows of sample data from the measureddata1.txt file. Calculating the average values is quite simple in Awk with the average.awk script in Listing 6.

Listing 6

average.awk

# Averaging the values
NR==1 {next;}
{
    sum = 0;
    for (i=2; i<=NF; i++) {
        sum+=$i;
    }
    average = sum/(NF-1);
    printf("%6s %8.2f\n",$1,average);
}
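To run the script, combine the locale setting and the call in one line:

$ LC_ALL=C awk -f average.awk measureddata1.txt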

You also often have to find the minimum and maximum values in a list. One way to do this is to sort the values within each row.

Listing 7 shows the numbervalues1.awk file, which determines the minimum and maximum values. The numbers in each row ($2 through $NF) are first stored in an array; after sorting, the first element holds the minimum and the last the maximum.

Listing 7

numbervalues1.awk

# Evaluating number values
BEGIN { print("     t         MIN        MAX"); }
NR==1 { next; }
{
    t=0; n=NF;
    # Store the number values in an array:
    for (i=2; i<=n; i++) {
        y[i-2] = $i;
    }
    # Sort the values:
    for (i = 0; i <= n-2; i++) {
        for (j = i; j > 0 && y[j-1] > y[j]; j--) {
            t = y[j]; y[j] = y[j-1]; y[j-1] = t;
        }
    }
    # Output: t, MIN, MAX
    printf("%6s %8.6f %8.6f ",$1, y[0], y[n-2]);
    printf("\n");
}

Sorting the values works through a simple insertion sort, which is appropriate for a limited number of values. You can find more efficient algorithms in Jon Bentley's excellent book [5], for example. Implementing more complex sorting algorithms requires user-defined Awk functions.
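The syntax of a user-defined function looks like the following minimal sketch; the hypothetical maximum() function returns the larger of two values for each input row:

function maximum(a, b) {
    return (a > b) ? a : b;
}
{ print maximum($1, $2); }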

Conclusion

Awk can address many everyday user and administrator tasks quickly and effortlessly, and it is easier to learn than many other programming languages. This article covers the basics and provides a few more complex examples. You can find support and countless other examples in newsgroups and on many websites.

The free documentation [2] makes learning Awk easy and includes many more examples. The book by the three Awk inventors – Alfred Aho, Peter Weinberger, and Brian Kernighan [6] – is recommended for more advanced users.

Infos

  1. Code examples from this article: ftp://ftp.linux-magazine.com/pub/listings/ubuntu-user.com/25/AWK/
  2. Robbins, Arnold. The GNU Awk User's Guide, 2003: http://www.gnu.org/software/gawk/manual/gawk.html
  3. Streicher, Martin. "Regular Expressions," Linux Shell Handbook 5: http://www.sparkhaus-shop.com/eu/magazines/special-editions/eh32040.html
  4. Friedl, Jeffrey. Mastering Regular Expressions, O'Reilly, 2006.
  5. Bentley, Jon. Programming Pearls, Addison-Wesley, 2000.
  6. Aho, Alfred, Brian Kernighan, and Peter Weinberger. The AWK Programming Language, Addison-Wesley, 1988.