Paperwork in the battle against paper stacks

ginasanders, 123RF


Paperless

Paperwork is a new attempt to create the paperless office using free software components. This article describes just how far it's come.

The idea of Paperwork [1] goes back to the old concept of the paperless office. You scan in letters, bills, and loose pages or generate PNG or JPEG files of the documents. You then send these to optical character recognition (OCR) software that digitizes the content. Next, an application combines the images and text and saves them into a PDF.

This process is not all that straightforward. For good enough text recognition, you need scans or photos of the highest possible quality; the scanner should deliver a resolution of at least 600 dpi. Additionally, the OCR software has to do its job well. Paperwork searches for the Tesseract [2] OCR engine at startup. If it can't find Tesseract, it falls back to Cuneiform. In most cases, Tesseract gives the best results, so install it from the Ubuntu repositories (see the "Installation" box for details on installing Paperwork).
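The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the fallback idea, not Paperwork's actual code: the executable names tesseract and cuneiform are assumptions based on the usual binary names on Ubuntu, and find_ocr_engine is a hypothetical helper.

```python
import shutil
from typing import Callable, Optional, Sequence

# Assumption: these are the usual executable names on Ubuntu.
# Paperwork's own lookup may work differently.
PREFERRED_ENGINES = ("tesseract", "cuneiform")


def find_ocr_engine(
    candidates: Sequence[str] = PREFERRED_ENGINES,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first OCR executable found on PATH, or None.

    Candidates are tried in order, so Tesseract wins whenever
    both engines are installed.
    """
    for name in candidates:
        if which(name) is not None:
            return name
    return None
```

Calling find_ocr_engine() without arguments checks the real PATH; the which parameter exists only so that the lookup can be replaced in tests.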

Installation

The current version of Paperwork is not yet in the Ubuntu repositories, and at the time of writing, no PPA is available either. Instructions for installing Paperwork from the source code are available online [3]. Alternatively, you can find installation instructions on the developer's GitHub page [4].

Paperwork builds for the most part on a handful of existing components. To scan documents, Paperwork uses SANE. Tesseract or Cuneiform takes over the text recognition. Whoosh [5] indexes the OCR-converted text so that it can be searched easily, and the tool generates suggestions for keywords. Paperwork then pulls everything together in a graphical interface developed in GTK/Glade.
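To give a rough idea of what a full-text index does with the recognized text, here is a minimal inverted index in plain Python. It is a stand-in for illustration only and does not use the Whoosh API; the function names and the sample documents are made up.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"\w+", text.lower())


def build_index(documents: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every word to the set of document IDs that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Return the sorted IDs of documents that contain every query word."""
    words = tokenize(query)
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


# Hypothetical OCR output, keyed by document directory name.
documents = {
    "20140605_1350_31": "Invoice for electricity, June",
    "20140606_0900_00": "Electricity contract renewal",
}
index = build_index(documents)
matches = search(index, "electricity invoice")
```

A real indexing library adds ranking, stemming, and on-disk storage on top of this basic word-to-document mapping.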

The Tesseract OCR software was originally developed by Hewlett-Packard, and Google uses it for digitizing books [6]. Tesseract has a high accuracy rate and automates most processes. A drawback, however, is that Tesseract accepts only uncompressed TIFF files as input, and this restriction applies to complete documents as well.

Paperless Office

After starting up, Paperwork shows a clearly designed interface split into three sections. On the left is the current document; next to it, you'll see the existing scanned and reworked pages, and on the right is the current page in detail. Like the gscan2pdf [7] PDF scanner, Paperwork can take the document directly from an attached scanner or load existing images from the hard drive. The software combines scanned images into projects and exports them as PDF files.

By default, Paperwork saves each project in the papers folder, in a subdirectory named after the current date (e.g., 20140605_1350_31/). It stores several files in each of these directories. In paper.<number>.jpg, you'll find the JPEG images of the scanned pages; the text extracted by the OCR engine is in paper.<number>.words. These are not simple text files; they use a special XML-based format called hOCR [8], which records the position of the text in the original document along with the plain text. These files are hard to read in a text editor, but the position data makes it possible to overlay the extracted text directly on the image files. The specially developed DjVu [9] format is based on this construct. Paperwork also saves thumbnails of the scanned pages in this directory; you can identify them by the word thumb in their names. The file named labels holds the labels you define manually for the document, and a file named extra.txt contains your assigned keywords.
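The following sketch shows how words and their positions could be pulled out of an hOCR file with Python's standard library. It assumes the file is well-formed XML that marks words with the ocrx_word class and a bbox entry in the title attribute, as the hOCR format describes; the sample markup is made up, and real .words files from Paperwork may need a more forgiving parser.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple, Optional, Tuple

BBox = Tuple[int, int, int, int]  # left, top, right, bottom in pixels


class Word(NamedTuple):
    text: str
    bbox: BBox


def parse_bbox(title: str) -> Optional[BBox]:
    """Find the 'bbox x0 y0 x1 y1' property in an hOCR title attribute."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            left, top, right, bottom = (int(p) for p in parts[1:])
            return (left, top, right, bottom)
    return None


def extract_words(hocr: str) -> List[Word]:
    """Return every recognized word together with its bounding box."""
    root = ET.fromstring(hocr)
    words = []
    for elem in root.iter():
        if "ocrx_word" not in elem.get("class", "").split():
            continue
        bbox = parse_bbox(elem.get("title", ""))
        text = "".join(elem.itertext()).strip()
        if bbox is not None and text:
            words.append(Word(text, bbox))
    return words


# Made-up sample in the style of an hOCR file.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<span class="ocr_line" title="bbox 10 20 300 50">'
    '<span class="ocrx_word" title="bbox 10 20 110 50; x_wconf 93">Paperless</span> '
    '<span class="ocrx_word" title="bbox 120 20 300 50">office</span>'
    "</span></body></html>"
)
words = extract_words(SAMPLE)
```

Each Word carries the pixel rectangle needed to draw a frame around it or to place invisible text over the page image.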

Paperwork supports multiple sources for reading documents. You can drive a scanner right from the application. The program tries to locate the scanner through its SANE back end. Alternatively, Paperwork supports USB-connected webcams, although these are not the best solution because of their low resolution and image quality. Paperwork can also use images as sources, such as screenshots from PDFs. Unsurprisingly, the OCR quality of the results will depend on the quality of the source material.

Paperwork also allows direct editing of PDF files. You load these into the program with Document | Import file(s). Paperwork can import multiple PDFs at once, but not recursively from subdirectories, which means you will have to move all your files to the same level of a single directory to take advantage of this feature.
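If your PDFs are spread across nested folders, a small script can collect them in one flat directory before the import. This is a minimal sketch, not part of Paperwork: flatten_pdfs is a hypothetical helper, it copies rather than moves the files, and the target directory should lie outside the source tree.

```python
import shutil
from pathlib import Path
from typing import List


def flatten_pdfs(source: Path, target: Path) -> List[Path]:
    """Copy every *.pdf below source (recursively) into the flat target.

    The relative path is folded into the file name, so two files called
    a.pdf in different subdirectories do not overwrite each other.
    The pattern is case-sensitive on Linux, so *.PDF files are skipped.
    """
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    # sorted() collects all matches before the first copy is made.
    for pdf in sorted(source.rglob("*.pdf")):
        if not pdf.is_file():
            continue
        flat_name = "_".join(pdf.relative_to(source).parts)
        destination = target / flat_name
        shutil.copy2(pdf, destination)
        copied.append(destination)
    return copied
```

For example, flatten_pdfs(Path("scans"), Path("flat")) would turn scans/2013/bills/a.pdf into flat/2013_bills_a.pdf; both directory names are placeholders.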

Setting Up OCR

Before you begin scanning your documents, you need to set up the program (Figure 1). The settings button is the fourth from the left in the toolbar. Apart from the working directory, you configure the scanner and choose the language for text recognition. Paperwork saves the settings in the ~/.config/paperwork.conf file; the index of all the scanned documents goes into the ~/.local/share/paperwork/index/ directory.

Figure 1: Configuring Paperwork is limited to just a few settings.
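To check what the settings dialog has written to disk, you could read the configuration file with a short script. The sketch below assumes that ~/.config/paperwork.conf uses INI-style sections and key = value lines; the article does not describe the file format, so treat this as an assumption. The read_settings function is a hypothetical helper.

```python
import configparser
from pathlib import Path
from typing import Dict

# Path taken from the article; the INI-style layout is an assumption.
CONFIG_PATH = Path.home() / ".config" / "paperwork.conf"


def read_settings(path: Path = CONFIG_PATH) -> Dict[str, Dict[str, str]]:
    """Return all sections and their key/value pairs as plain dicts.

    A missing file yields an empty dict instead of an error.
    """
    # interpolation=None keeps literal % characters in values intact.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)
    return {section: dict(parser[section]) for section in parser.sections()}
```

Calling read_settings() with no arguments reads the default location and returns an empty dictionary if Paperwork has not saved any settings yet.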

You calibrate the scanner in the settings dialog by clicking the icon on the right. Paperwork runs a scan that it uses as the basis for all further input from the device. How well that works depends, not least, on the fonts used in the document.

Figure 2 shows an example of a document that Paperwork recognized almost completely, even though it was scanned crookedly. The Highlight all words function in the Document menu under Advanced draws blue frames around the words that were deciphered. You'll have to check for yourself whether the recognized text really is accurate. In Figure 3, Paperwork tries an OpenOffice-generated PDF. It actually shows better results than a document read in by scanner. There are no unrecognized words, and Paperwork frames every recognized word (except stop words) in a blue box. Unrecognized words have no box around them, as shown in Figure 4 with Spanish text.

Figure 2: Paperwork's text recognition is very good even with badly scanned documents.
Figure 3: Paperwork recognized everything in this PDF document.
Figure 4: Things didn't go so well with this Spanish document.
