Accelerated command processing with GNU Parallel

Lead Image © Ilka Burckhardt, Fotolia.com

Multiple Personalities

With the snazzy little program GNU Parallel, you can make use of the full power of your multicore CPUs through scripts.

When you get back from vacation, you probably have tons of snapshots stored on your camera. If you want to reduce the resolution of photos so you can upload your pics to a web gallery, the following one-liner for Mogrify from the ImageMagick package is usually sufficient:

$ for i in *.tif; do mogrify -resize 50% "$i"; done

The command combs through all files with the .tif ending in the current directory (for i in *.tif ) and has Mogrify scale each one to 50 percent (mogrify -resize 50% ). Because the command processes the files sequentially, most cores of a modern multicore processor are basically twiddling their thumbs. It would be much more effective and faster to process multiple photos simultaneously. This is where the somewhat unjustly overlooked tool called GNU Parallel comes in.

Wrong Twin

Even though GNU Parallel has officially been part of the GNU [1] tool collection since 2010, it is seldom preinstalled and you still have to install it via the package manager. Many large distributions, such as openSUSE 12.2, are missing it altogether. To make matters worse, a program called Parallel is also part of the moreutils [2] package, but it has nothing in common with the GNU Parallel package presented here.

Thus, you need to pay special attention when grabbing the GNU Parallel package to make sure you get the correct one. In Ubuntu, the package is simply called parallel , and you can install it by entering

sudo apt-get update
sudo apt-get install parallel

from the command line.

If you are using something different from Ubuntu and can't find it in the package manager, you will have to build it yourself. For that you'll need Make, a C compiler, and the current source code archive [3]. After unpacking the archive, change to the source directory and install GNU Parallel using

./configure && make && make install

As soon as GNU Parallel is installed, test whether gnu.org, ubuntu-user.com, and a computer on the local network (192.168.1.102 in this example) are accessible:

$ parallel ping -c2 ::: gnu.org 192.168.1.102 ubuntu-user.com

This executes the ping -c2 command three times, once for each address after the colons, and all three instances run simultaneously. Whichever host responds most quickly appears first in the output (Figure 1).

Figure 1: The computer 192.168.1.102 in the local net responds to the ping more quickly than its colleague gnu.org on the Internet.

The three colons in the command belong to GNU Parallel and separate the command from its parameters. By the way, in GNU Parallel parlance, the commands or programs it runs are called "executable jobs."

Retention Bucket

Normally, GNU Parallel collects all output from a job and shows it only when that job has finished. This approach has the advantage that the aforementioned ping instances won't "talk over" each other, but you also won't see any intermediate results. However, you can use the -u option to show what's going on while the jobs are running (Figure 3).

Figure 3: The -u (--ungroup) option displays a mixture of output from jobs, which can be confusing, as you can see at the beginning of the example.

The -k (--keep-order ) option shows the results in the order in which the parameters were passed, not in the order in which the jobs finish. In the previous example, you would always see the ping -c2 gnu.org output first even if another ping completes faster. For a better overview, use the -v (verbose) option to compel GNU Parallel to write the name of the executable job just before each output.

Job Center

With GNU Parallel working as expected, you can turn to the sluggish image processing from the beginning of this article. This command can help:

$ parallel mogrify -resize 50% ::: *.tif

During testing, this command cut the processing time in half. So that you don't have to list every file after the triple colons, the command relies on the shell, which expands *.tif to the names of all TIFF files in the current directory before GNU Parallel starts. GNU Parallel then calls mogrify -resize 50% once for each file.

With this, GNU Parallel tries to run as many jobs simultaneously as possible – one per core by default. With a quad-core processor, the command resizes four photos at the same time. The additional -j (--jobs ) option specifies the number of simultaneous jobs (known as jobslots) that you want done. The following command processes two photos at a time:

$ parallel -j 2 mogrify -resize 50% ::: *.tif

Instead of a number, you can also give -j the name of a file that contains the number, with no line break after it:

$ parallel -j /tmp/jobs.txt...

GNU Parallel reads this file again with each new job. This approach has the advantage that you can change the number of simultaneous jobs in the middle of processing, which is handy if you have a large collection of images. As soon as you send it the SIGUSR1 signal, GNU Parallel reports on standard error (STDERR ) how many and which jobs are currently running (Figure 4).

Figure 4: By sending the SIGUSR1 signal, GNU Parallel shows the currently running jobs, which in this example, are two photos being resized.

For larger image collections, GNU Parallel shows how far the processing has advanced when you use the --progress option. To cancel processing, simply press Ctrl+C or send a TERM signal (e.g., with killall -TERM parallel ). In the latter case, GNU Parallel waits for the jobs that are already running to finish before it exits.

Transfer of Ownership

The examples so far have used three colons to separate the commands from the images. As an alternative, you can pass the parameters by using a pipe (line 1 of Listing 1) or by collecting all parameters – the file names in this case – in a text file (line 2).

Listing 1

Passing Parameters to Parallel via Pipe

01 $ find . -name '*.tif' | parallel mogrify -resize 50%
02 $ parallel -a dateinamen.txt mogrify -resize 50%
03 $ find . -name '*.tif' -print0 | parallel -0 mogrify -resize 50%

In the first two cases, GNU Parallel assumes that the parameters are separated by newlines. GNU Parallel doesn't break a file name containing spaces into separate values but passes it on to Mogrify as one, as it should. However, if a file name itself contains a line break, you can use the -0 (--null ) option (line 3), which tells GNU Parallel to expect parameters separated by the null character (\0 ) instead of a newline (\n ). The -print0 action makes find produce its output in exactly this form.

Building Blocks

In the previous commands Mogrify replaces the original image with a resized one. Better yet, you can save the edited version in a separate preview/ directory:

$ mogrify -write preview/photo.tif -resize 50% photo.tif

However, here you have the problem that the file name appears twice in the command. You need to tell GNU Parallel not only to append the file name to the mogrify command, but also to put it in a specified location. To do this, you can use the {} placeholder, as used on line 1 in Listing 2.

Listing 2

Using One Parameter in Several Places

01 $ parallel mogrify -write preview/{} -resize 50% {} ::: *.tif
02 $ parallel convert {} {.}.png ::: *.tif

GNU Parallel replaces the curly braces ({} ) with the current parameter (in this case the file name). This placeholder also has other practical variants. The {.} placeholder drops the file extension, which can be used, for example, with convert to change the images into PNG format (line 2). Similarly, the {//} placeholder is replaced by the directory part of the input line.

If no command is passed to GNU Parallel, it assumes that the input line contains the command. You can exploit this feature to feed GNU Parallel a number of different commands:

$ (echo ls; echo pwd) | parallel

By using the echo command, the shell provides the two lines ls and pwd to GNU Parallel to run in parallel (Figure 5).

Figure 5: GNU Parallel interprets input as commands.

Big Ones

Not only converting photo collections but also processing large files can be a time sink. Here again, GNU Parallel helps. The following command, for example, feeds the large ubuntu.img file to the tool:

cat ubuntu.img | parallel --pipe --recend '' -k gzip >ubuntu.img.gz

By using the --pipe option, GNU Parallel splits the file into 1MB blocks. It then compresses each block – again in parallel – with Gzip. The tool collects all the zipped data blocks, puts them in the correct order (with -k ), and saves them in the ubuntu.img.gz file. It is then easy enough to unzip it again with gzip -d .

The --recend option stands for "record end," and you can use it to specify the end of a data block. Without this option, GNU Parallel looks for newlines to split the data into records, which makes little sense for binary data. The empty string ('' ) passed to --recend overrides this default behavior, so the example above splits the data purely by size into 1MB blocks, which are then passed to Gzip.

Other large files can be processed in parallel in the same way. The documentation also shows how to sort a big file in parallel [4].

Compressing with Gzip incidentally has a minor flaw. Because GNU Parallel works on blocks, the archiver might not see the whole file and, therefore, might not be able to compress it efficiently. The difference depends on the data to be compressed.

Caught in the Net

If you are still not convinced about how awesome parallel processing is, you can also use GNU Parallel over networked computers. With this feature, you can get another number cruncher to resize your vacation photos. To make this work, you need to be able to access the remote computer via SSH without entering a password (e.g., with ssh-agent [5]).

Moreover, Rsync needs to be installed along with GNU Parallel on the remote computer. Rsync is used for data transfer, whereas GNU Parallel determines the number of processors. For example,

$ parallel --sshlogin 192.168.1.11,192.168.1.12 'hostname; echo' ::: 1 2

checks to see whether GNU Parallel can access the computers with IP addresses 192.168.1.11 and 192.168.1.12. GNU Parallel logs in to both computers via SSH, starts the hostname program, and returns the results (Figure 6). The echo command is there because hostname itself takes no parameters: GNU Parallel appends each value after the three colons to echo instead.

Figure 6: The echo and the ending numbers help contact both of the specified servers.

The harmless echo simply outputs a number. The two numbers at the end are necessary so that GNU Parallel contacts both computers. With only a 1 at the end, the tool would run hostname; echo on a single computer.

You can use hostnames instead of IP addresses and separate them with commas. To log in under a particular user name, put the name and an @ in front of the computer name (e.g., hank@192.168.2.11).

If your connections work, you can get a colleague's computer to participate in your photo processing with the somewhat long command in Listing 3. GNU Parallel grabs the next image file and copies it via the --transfer option to a computer, where it runs the mogrify -resize 50% job. It then returns the processed file (--return {} ) and deletes the copy on the remote computer with the --cleanup option.

Listing 3

Using Networked Computers

parallel --sshlogin hank@192.168.2.11,peter@192.168.2.12 --transfer --return {} --cleanup mogrify -resize 50% {} ::: *.tif

To keep the command simple, Mogrify and GNU Parallel simply overwrite the local file. Because the --transfer , --return , and --cleanup options are so often used together, you can abbreviate them as --trc followed by the name of the file to return (here, --trc {} ).

This kind of distributed computing can expose some limiting bottlenecks. Remote transfers, for example, can take longer than the processing itself. Distributed processing is, therefore, best used with extensive or time-consuming processing.

Conclusion

GNU Parallel comes into its own, particularly with computationally intensive processing requiring multiple, mutually independent subtasks. In Bash scripts, you can often speed up for and while loops with GNU Parallel, such as those used with the photos at the beginning of this article. A pleasant side effect is that, thanks to GNU Parallel, the commands are much easier to read.

This tool recognizes many other parameters and functions whose descriptions could fill a book. If you're working with GNU Parallel for the first time, a look at the examples in the manual [6] will help.

Incidentally, I cheated a bit in the first example. Mogrify doesn't need to be stuck in a for loop because it can process multiple files by itself. It even uses multiple processor cores. GNU Parallel is, therefore, not a magic bullet. You should always check in advance whether your command-line commands are already running on multiple cores.

Cheap Imitation

In Ubuntu and Ubuntu-based distros, you're likely to get a cryptic error message when you run the parallel command (Figure 2). To avoid a conflict with the Parallel from the moreutils package, GNU Parallel can behave like its same-named competitor, and the Debian package builders switched on this compatibility mode by default. To use GNU Parallel as described here, you either have to use the --gnu option with parallel or remove the --tollef entry from the /etc/parallel/config file. For this article, to keep parameter clutter to a minimum, I will assume you have done the latter.

Figure 2: If this error message appears, GNU Parallel is running in the wrong mode.