Create searchable PDFs and optimize their use


Tool Chest

The PDF toolkit PDFtk provides an extensive set of tools for editing PDF files. It lets you read metadata and merge multiple PDF files into one. You can also extract individual pages from a PDF, rotate pages, and encrypt PDF files with password protection.

If you scan a book and save each chapter in its own PDF file, you can then merge the individual files into one document using the commands from Listing 8. Make sure that the sort order of the file names matches the intended page sequence: the shell expands chap*.pdf in alphabetical order, so chap10.pdf would land before chap2.pdf.

Listing 8

Concatenating PDFs

$ pdftk chap1.pdf chap2.pdf chap3.pdf cat output book.pdf
$ pdftk chap*.pdf cat output book.pdf
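
The ordering pitfall with the wildcard can be sketched as follows. This is a minimal illustration, assuming GNU sort (for the -V "version sort" option) and file names without newlines; the helper name natural_order is hypothetical and not part of PDFtk.

```shell
# natural_order: print the given file names in "natural" numeric order,
# so that chap2.pdf comes before chap10.pdf.
# Hypothetical helper; relies on the -V option of GNU sort.
natural_order() {
    printf '%s\n' "$@" | sort -V
}

# Lexical order, as the shell wildcard would produce it:
printf '%s\n' chap1.pdf chap10.pdf chap2.pdf

# Natural order, suitable for handing to pdftk:
natural_order chap1.pdf chap10.pdf chap2.pdf

# Usage sketch (needs bash 4+ for mapfile, and pdftk installed):
#   mapfile -t chapters < <(natural_order chap*.pdf)
#   pdftk "${chapters[@]}" cat output book.pdf
```

Zero-padding the chapter numbers (chap01.pdf, chap02.pdf, and so on) avoids the problem without any helper.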

Conversely, you can use PDFtk to extract specific pages from a larger PDF file. For example, assume that you want a slim PDF that contains the title page of an ebook (page 1) followed by pages 42 to 73. To do so, pass cat 1 42-73 to PDFtk (Listing 9, line 1).

Listing 9

Splitting, concatenating, and rotating PDFs

$ pdftk book.pdf cat 1 42-73 output chapitel.pdf
$ pdftk scan_quer.pdf cat 1-endeast output scan_hoch.pdf

Alternatively, you can take a document scanned in landscape format and rotate it to the correct orientation (Listing 9, line 2). The 1-endeast option specifies which pages are rotated and in which direction. The first part, 1-end, refers to page 1 through the last page, and therefore to all pages. The second part, east, rotates these pages 90 degrees clockwise. The other directions are south for 180 degrees and west for 90 degrees counterclockwise.
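
The way a page range and a direction keyword combine into one argument can be sketched like this. The helper rotation_keyword is hypothetical; it only maps degrees to the keywords named above, plus north for 0 degrees, which the PDFtk manual also lists. The script prints the resulting command instead of running it.

```shell
# rotation_keyword: map a clockwise rotation in degrees to the keyword
# that is appended to a pdftk page range. Hypothetical helper.
rotation_keyword() {
    case "$1" in
        0)   echo north ;;
        90)  echo east ;;
        180) echo south ;;
        270) echo west ;;
        *)   echo "unsupported rotation: $1" >&2; return 1 ;;
    esac
}

# Page range "all pages" plus the keyword for 90 degrees clockwise.
range="1-end$(rotation_keyword 90)"

# Print the command line rather than executing it.
echo "pdftk scan_quer.pdf cat $range output scan_hoch.pdf"
```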

In addition, PDFtk can protect a document from unauthorized viewing. The encrypt_128bit option encrypts the PDF file with a 128-bit key. You set the password required to open the file with user_pw <password1>. You can add a second password with owner_pw <password2>. This prevents anyone other than the owner from printing or editing the document unless the right to do so is explicitly granted, for example with allow printing (Listing 10). Note that only Adobe Acrobat Reader enforces these digital rights management (DRM) restrictions (see the "Pseudo-DRM in PDFs" box).

Listing 10

Protecting a PDF document

$ pdftk file.pdf output file_encrypted.pdf user_pw <password1>
$ pdftk file.pdf output encrypted.pdf user_pw <password1> owner_pw <password2>
$ pdftk file.pdf output encrypted.pdf user_pw <password1> owner_pw <password2> encrypt_128bit allow printing
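
Passwords typed on the command line end up in the shell history. According to the PDFtk manual, the keyword PROMPT can stand in for a password, in which case PDFtk asks for it interactively. The following sketch only assembles and prints such a command line; the helper name encrypt_cmd is hypothetical, and it assumes file names without spaces.

```shell
# encrypt_cmd: print a pdftk call that encrypts a file with a 128-bit key,
# allows printing, and asks for both passwords interactively (PROMPT).
# Hypothetical helper; arguments: input file, output file.
encrypt_cmd() {
    printf 'pdftk %s output %s user_pw PROMPT owner_pw PROMPT encrypt_128bit allow printing\n' \
        "$1" "$2"
}

# Show the command line for the file names used in Listing 10.
encrypt_cmd file.pdf encrypted.pdf
```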

Pseudo-DRM in PDFs

Only Adobe Acrobat Reader [10] strictly enforces the permission settings of a PDF document, such as allow printing, allow assembly, and allow copycontents. KDE's document viewer Okular only honors them when you activate the menu option Settings | Configure Okular… | General | Obey DRM limitations. Gnome's Evince document viewer ignores all DRM restrictions. Practically speaking, the owner password and its permission settings therefore add no real protection. The only genuine protection from undesired viewing comes from encryption with user_pw <password> encrypt_128bit.

If you would like to avoid typing PDFtk commands, you can use PDF Chain [7], a graphical front end for the toolkit (Figure 4). The program offers all of the toolkit's functions in easy-to-understand dialogs, so you do not have to remember any commands. You can find it as pdfchain in Ubuntu's package repositories.

Figure 4: PDF Chain offers a simple but practical graphical interface for the high-performance PDF toolkit PDFtk.

Pruning

If you scan books or magazines with a high-performance scanner, a double page frequently lands on a single page of the PDF document. Most of the time, it would be preferable to have each page of the original end up on its own page in the PDF file. This is where the program Krop can help. You can install Krop by downloading the latest deb file from [8]. Then use the following commands to install the package and resolve its dependencies:

sudo dpkg -i krop_0.4.11-1_all.deb
sudo apt-get install -f

Krop offers a large number of possibilities for cropping PDF files, including splitting double pages in two (Figure 5).

Figure 5: After scanning, you can quickly and conveniently divide double pages into two individual pages in the PDF document.

It is easy for scanned documents to get so large that they significantly increase load times and present problems for further editing activities. Also, the recipient often places a limit on the maximum permissible file size. Therefore, it makes sense to start paying attention to the resolution when you scan a document. A black-and-white scan with 72dpi usually suffices for reading but not for text recognition. Experience shows that a black-and-white scan with 300dpi results in a good compromise between high quality and ease of handling.
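
The effect of the resolution on the amount of data can be estimated with a little arithmetic. The sketch below assumes an A4 page (210 x 297 mm) and uses awk for the floating-point math; the helper name scan_pixels is hypothetical.

```shell
# scan_pixels: pixel dimensions of a page scanned at a given resolution.
# Arguments: width in mm, height in mm, resolution in dpi.
# One inch is 25.4 mm; adding 0.5 before truncation rounds to the nearest pixel.
scan_pixels() {
    awk -v w="$1" -v h="$2" -v dpi="$3" 'BEGIN {
        printf "%d x %d\n", w / 25.4 * dpi + 0.5, h / 25.4 * dpi + 0.5
    }'
}

scan_pixels 210 297 72     # A4 at 72dpi:  prints 595 x 842
scan_pixels 210 297 300    # A4 at 300dpi: prints 2480 x 3508
```

A 300dpi scan therefore contains roughly 17 times as many pixels as a 72dpi scan of the same page, which is why the file size grows so quickly with the resolution.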

If a PDF file becomes too large, you can use Ghostscript [9] to scale down the resolution while optimizing the document for various uses (Listing 11). Many distributions, including Ubuntu, come with this program preinstalled. If not, you can install the ghostscript package using Apt; the command itself is called gs.

Listing 11

Changing resolutions with gs

$ gs -sDEVICE=pdfwrite -sPAPERSIZE=a4 -r72 -dNOPAUSE -dBATCH -sOutputFile=output.pdf input.pdf
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf

The command in the first line of Listing 11 causes Ghostscript to create the smaller PDF file output.pdf in A4 format with 72dpi resolution from input.pdf. The -dNOPAUSE option prevents Ghostscript from stopping after each page to wait for confirmation. The -dBATCH option makes Ghostscript exit automatically when processing is complete. The call in the second line does not specify a paper size or resolution. Instead, the -dPDFSETTINGS=/<setting> option sets a series of additional switches that optimize the document for a particular use (see Table 1). In the same call, -dCompatibilityLevel=1.4 requests output compatible with PDF version 1.4, and -dQUIET suppresses routine status messages.

Table 1

Ghostscript PDF Settings

Option      Resolution      Comments
/screen     72dpi           Ideal for display on a PC
/ebook      150dpi          Good quality and small size; ideal for scanned credentials in a job application
/printer    300dpi          Optimized for printouts, but results in fairly large documents
/prepress   300dpi          Intended for handoff to a print shop; no reduction of color information
/default    not specified   Intended to work well across different output devices
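
To see which preset from Table 1 suits a particular scan, it can help to try several and compare the resulting file sizes. The following is a minimal sketch built on the second command from Listing 11; the helper name shrink_pdf and the output_<preset>.pdf file names are hypothetical, and the loop only runs if gs and input.pdf are actually present.

```shell
# shrink_pdf: rewrite a PDF with one of the presets from Table 1.
# Arguments: preset name without the slash, input file, output file.
# Returns 2 for an unknown preset and 1 for a missing input file.
shrink_pdf() {
    case "$1" in
        screen|ebook|printer|prepress|default) ;;
        *) echo "unknown preset: $1" >&2; return 2 ;;
    esac
    [ -f "$2" ] || { echo "no such file: $2" >&2; return 1; }
    gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 "-dPDFSETTINGS=/$1" \
       -dNOPAUSE -dQUIET -dBATCH "-sOutputFile=$3" "$2"
}

# Try each preset on input.pdf and list the sizes for comparison.
if command -v gs >/dev/null 2>&1 && [ -f input.pdf ]; then
    for preset in screen ebook printer prepress; do
        shrink_pdf "$preset" input.pdf "output_$preset.pdf"
    done
    ls -l input.pdf output_*.pdf
fi
```

Checking the preset name before calling gs keeps a typo from producing a confusing Ghostscript error.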
