Paperwork in the battle against paper stacks

The idea of Paperwork [1] goes back to the old concept of the paperless office. You scan in letters, bills, and loose pages or generate PNG or JPEG files of the documents. You then send these to optical character recognition (OCR) software that digitizes the content. Next, an application combines the images and text and saves them into a PDF.

This process is not all that straightforward. Good text recognition requires scans or photos of the highest possible quality; a scanner should offer a resolution of 600 dpi. Additionally, the OCR software has to do its job well. At startup, Paperwork searches for the Tesseract [2] OCR engine. If it can't find Tesseract, it tries to use Cuneiform instead. In most cases, Tesseract gives the best results, so install it from the Ubuntu repositories (see the "Installation" box for details on installing Paperwork).
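The lookup order described above can be sketched in a few lines of Python. This is only an illustration, not Paperwork's own code: It assumes the engines are installed under the command names tesseract and cuneiform, and the function name pick_ocr_engine is made up for this example.

```python
import shutil

# Preference order described in the text: Tesseract first, Cuneiform as fallback.
# The command names are assumptions about how the packages install their binaries.
OCR_CANDIDATES = ("tesseract", "cuneiform")


def pick_ocr_engine(candidates=OCR_CANDIDATES, which=shutil.which):
    """Return the first candidate found on PATH, or None if none is installed."""
    for name in candidates:
        if which(name) is not None:
            return name
    return None


if __name__ == "__main__":
    engine = pick_ocr_engine()
    print(engine or "no OCR engine found")
```

Passing the lookup function as a parameter makes it easy to check the fallback logic without having either engine installed.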

Installation

The current version of Paperwork is not yet in the Ubuntu repositories. At the time of writing, there is not even a PPA available. Instructions for installing Paperwork from the source code are available online [3]. Alternatively, you can find installation instructions on the developer's GitHub page [4].

Paperwork is based for the most part on existing components. To scan documents, Paperwork uses SANE. Tesseract or Cuneiform takes over the text recognition. Whoosh [5] indexes the OCR-converted text so that it can be easily searched, and the tool generates suggestions for keywords. Paperwork then pulls everything together in a graphical interface developed with GTK/Glade.

The Tesseract OCR software was originally developed by Hewlett-Packard, and Google uses the library for digitizing books [6]. Tesseract has a high accuracy rate and automates most processes. A drawback, however, is that Tesseract accepts only uncompressed TIFF files as input, and this restriction applies to documents as well.

Paperless Office

After starting up, Paperwork shows a clearly designed interface split into three sections. On the left is the current document; next to it, you'll see the existing scanned and reworked pages, and on the right is the current page in detail. Like the GScan2PDF [7] PDF scanner, Paperwork can take the document directly from an attached scanner or load existing images from the hard drive. The software combines scanned images into projects and exports them as PDF files.

By default, Paperwork saves each project in the papers folder in a subdirectory with the current date as its name (e.g., 20140605_1350_31/ ). It stores a number of files in each of these directories. In paper.<number>.jpg , you'll find the JPEG images of the scanned pages; the text extracted by the OCR engine is in paper.<number>.words . These are not simple text files; they use a special XML-based format called hOCR [8], which records the position of the text in the original document along with the plain text. The files are hard to read in a text editor, but the position data makes it possible to overlay the extracted text directly on the image files. The DjVu format [9] is based on the same idea. Paperwork also saves thumbnails of the scanned pages in this directory; you can identify them by the word thumb in their names. A file named labels holds the labels you define manually for the document, and a file named extra.txt contains your assigned keywords.
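To get a feel for what such a file contains, the following sketch pulls the words and their bounding boxes out of hOCR markup using only the Python standard library. It assumes that each word sits in an element with the class ocrx_word and carries a "bbox x0 y0 x1 y1" entry in its title attribute, as the hOCR format describes; whether Paperwork's .words files follow this layout in every detail is an assumption, and the sample markup is invented for the example.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWords(HTMLParser):
    """Collect (text, bbox) for every element marked as ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bounding box of the word currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None


def extract_words(hocr_text):
    """Return a list of (word, (x0, y0, x1, y1)) tuples."""
    parser = HocrWords()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Invented sample in the style of hOCR output
SAMPLE = """<div class='ocr_page' title='bbox 0 0 800 600'>
<span class='ocr_line' title='bbox 10 20 300 50'>
<span class='ocrx_word' title='bbox 10 20 110 50; x_wconf 93'>Invoice</span>
<span class='ocrx_word' title='bbox 120 20 300 50'>2014</span>
</span></div>"""

if __name__ == "__main__":
    for word, box in extract_words(SAMPLE):
        print(word, box)
```

The bounding boxes are what a viewer needs to draw a frame around each recognized word or to place invisible text over the page image.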

Paperwork supports multiple sources for reading documents. You can drive a scanner right from the application. The program tries to locate the scanner through its SANE back end. Alternatively, Paperwork supports USB-connected webcams, which, unfortunately, are not the best solution because of their low resolution and quality. Paperwork can also use images as sources, such as screenshots from PDFs. Unsurprisingly, the OCR quality of the results will depend on the quality of the source material.

Paperwork also allows direct editing of PDF files. You load these into the program with Document | Import file(s) . Paperwork can import multiple PDFs at once, but not recursively from subdirectories, which means you will have to move all your files to the same level of one directory to take advantage of this feature.
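If your PDFs are spread over several subdirectories, a small helper script can collect them on one level before the import. The following is a minimal sketch; the function name flatten_pdfs and the numbering scheme for duplicate file names are choices made for this example and have nothing to do with Paperwork itself.

```python
import shutil
from pathlib import Path


def flatten_pdfs(source_root, target_dir):
    """Copy every PDF below source_root into target_dir (one flat level).

    Name clashes are resolved by appending _1, _2, ... to the file stem.
    Returns the list of files created, in sorted source order.
    """
    source_root = Path(source_root)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    # sorted() collects all paths up front, before any file is copied
    for pdf in sorted(source_root.rglob("*")):
        if not pdf.is_file() or pdf.suffix.lower() != ".pdf":
            continue
        dest = target_dir / pdf.name
        counter = 1
        while dest.exists():
            dest = target_dir / f"{pdf.stem}_{counter}{pdf.suffix}"
            counter += 1
        shutil.copy2(pdf, dest)
        copied.append(dest)
    return copied
```

The script copies rather than moves, so the original folder structure stays intact. Choose a target directory outside the source tree; otherwise, a second run would pick up the copies as well.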

Setting Up OCR

Before you begin scanning your documents, you need to set up the program (Figure 1). The settings button is the fourth from the left in the toolbar. Apart from the working directory, you configure the scanner and select the language for text recognition. Paperwork saves the settings in the ~/.config/paperwork.conf file; the index of all the scanned documents goes into the ~/.local/share/paperwork/index/ directory.
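If you want to look at the saved settings from a script, something like the following sketch might work. It rests on two assumptions that the article does not confirm: that paperwork.conf is an INI-style file that Python's configparser module can read, and that a section and option such as OCR and Lang exist. Both names are hypothetical placeholders, so check your own file for the real ones.

```python
import configparser
from pathlib import Path

# Location named in the text
CONFIG_PATH = Path.home() / ".config" / "paperwork.conf"


def read_setting(path, section, option, default=None):
    """Return one value from an INI-style file, or default if it is missing."""
    parser = configparser.ConfigParser()
    parser.read(path)  # silently skips files that do not exist
    return parser.get(section, option, fallback=default)


if __name__ == "__main__":
    # "OCR" and "Lang" are hypothetical names used only for illustration
    print(read_setting(CONFIG_PATH, "OCR", "Lang", default="not set"))
```

Because the function falls back to a default value, it also behaves sensibly on a machine where Paperwork has never been started.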

Figure 1: Configuring Paperwork is limited to just a few settings.

You calibrate the scanner through a settings dialog, which you reach by clicking the icon on the right. Paperwork starts a scan that it uses as a basis for all other input from the device. How well the text recognition then works depends not least on the fonts used in the document.

Figure 2 shows an example of a document that Paperwork recognized almost completely despite its being scanned crookedly. The Highlight all words function under Advanced in the Document menu shows the deciphered words in blue frames. You'll have to check for yourself whether the recognized text really is accurate. In Figure 3, Paperwork processes a PDF generated by OpenOffice and actually delivers better results than for a document read in by scanner. There are no unrecognized words, and Paperwork frames every recognized word (except stop words) in a blue box. Unrecognized words have no box around them, as shown in Figure 4 with Spanish text.

Figure 2: Paperwork's text recognition is very good even with badly scanned documents.
Figure 3: Paperwork recognized everything in this PDF document.
Figure 4: Things didn't go so well with this Spanish document.

Browsing Scanned Documents

Paperwork doesn't only process documents. The paperless office also needs a search function, which Paperwork provides. The program saves the recognized text in an index that is not accessible from outside the program. You can search this index in Paperwork using keywords. The input box is at the upper left below the toolbar. Paperwork displays the matching documents and highlights the hits in the document on the right (Figure 5). A tool tip explains how to limit the search to a specific date or use Boolean operators.
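The principle behind such a keyword index is easy to demonstrate. The following sketch builds a simple inverted index in plain Python and answers queries in which all words must match. It is a deliberately reduced stand-in to show the idea; it does not use Whoosh or its API, and the document IDs and texts are invented.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search_all(index, query):
    """Return IDs of documents containing every word of the query (AND)."""
    words = tokenize(query)
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits


if __name__ == "__main__":
    # Invented example documents, keyed by directory-style IDs
    docs = {
        "20140605_1350_31": "Invoice for electricity, June 2014",
        "20140610_0900_02": "Insurance invoice",
    }
    index = build_index(docs)
    print(search_all(index, "invoice june"))
```

A real search library adds ranking, stemming, and persistent storage on top of this basic structure.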

Figure 5: With the search function, you can quickly find matching passages inside indexed documents.

Besides the automatically generated keywords, you can assign additional keywords ("labels") to a document, even if they don't appear in its text. In the search, you use the syntax label:<term> , which can also be combined with a regular search. Paperwork keeps these labels in a file named labels in the document directory. You can mark additional keywords in the current document with the pencil button in the upper left of the toolbar. Paperwork saves this data in the extra.txt file.
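How a query that mixes label terms and regular search words might be taken apart is shown in the following sketch. The article does not describe how Paperwork parses its queries, so the function split_query is purely illustrative; among other things, it does not handle labels that contain spaces.

```python
def split_query(query):
    """Separate label:<term> tokens from plain search words."""
    labels, words = [], []
    for token in query.split():
        prefix, sep, term = token.partition(":")
        if sep and prefix.lower() == "label" and term:
            labels.append(term)
        else:
            # Anything that is not a complete label:<term> token stays a plain word
            words.append(token)
    return labels, words


if __name__ == "__main__":
    print(split_query("label:tax invoice 2014"))
```

A search front end could then look up the labels in one place and the remaining words in the full-text index and return only documents that match both.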

On request, Paperwork exports the finished documents as PDFs. Output in DjVu format should also be possible, but that didn't work in this test. Other options are pdf2hocr [10] and pdfsandwich [11]. Paperwork also provides a printing function for archived documents, which, of course, defeats the purpose of a truly paperless office.

Conclusion

Despite some interesting features, Paperwork is still a bit too immature to handle the flood of paper in the office. The program should be of particular interest to Python programmers, who can take advantage of the modules it implements for larger projects.

If you're looking for a good scan program with an integrated OCR function, GScan2PDF [7] might be a better choice, because it is more stable and implements more features. You will also find it has significantly more ways of preparing the scanned documents for OCR processing.

The unique selling point of Paperwork, the index function for the scanned documents that accumulate over time, can just as easily be implemented with Recoll [12]. This desktop search engine indexes not only PDF documents but office document formats as well.