Paperwork in the battle against paper stacks

The idea behind Paperwork [1] goes back to the old concept of the paperless office. You scan letters, bills, and loose pages, or generate PNG or JPEG files of the documents. You then pass these images to optical character recognition (OCR) software, which extracts the text. Next, an application combines the images and the recognized text and saves them as a PDF.
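The image-to-PDF step can be sketched in a few lines of Python. This is only an illustration of the general workflow, not Paperwork's own code. It assumes that the Tesseract command-line tool is installed and that its `pdf` output configuration is available; the function names and the file name `scan.png` are made up for the example.

```python
import subprocess
from pathlib import Path
from typing import List


def searchable_pdf_command(image: Path, lang: str = "eng") -> List[str]:
    """Build a Tesseract call that writes a searchable PDF next to the image.

    Tesseract expects an output base name without extension and appends
    ".pdf" itself when the "pdf" configuration is requested.
    """
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]


def make_searchable_pdf(image: Path, lang: str = "eng") -> Path:
    """Run the OCR step; raises CalledProcessError if Tesseract fails."""
    subprocess.run(searchable_pdf_command(image, lang), check=True)
    return image.with_suffix(".pdf")


if __name__ == "__main__":
    # Hypothetical input file; only the command is printed, nothing is run.
    print(" ".join(searchable_pdf_command(Path("scan.png"))))
```

Separating the command builder from the function that runs it makes it possible to inspect the call without having an OCR engine installed.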

This process is not entirely straightforward. For usable text recognition, you need scans or photos of the text pages in the highest possible quality; for a scanner, a resolution of 600 dpi is a must. Additionally, the OCR software has to do its job well. Paperwork searches for the comprehensive Tesseract [2] OCR engine at startup. If it can't find Tesseract, it tries to use Cuneiform instead. In most cases, Tesseract gives the best results, so install it from the Ubuntu repositories (see the "Installation" box for details on installing Paperwork).
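The lookup order described above, Tesseract first and Cuneiform as a fallback, can be mimicked with a short Python sketch. It is not taken from Paperwork's source code. It simply assumes that the engines are installed as executables named `tesseract` and `cuneiform` on the search path; `find_ocr_engine` is a hypothetical helper.

```python
import shutil
from typing import Callable, Optional

# Preference order from the text: Tesseract first, Cuneiform as a fallback.
OCR_CANDIDATES = ("tesseract", "cuneiform")


def find_ocr_engine(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the name of the first OCR executable found, or None.

    The lookup function is a parameter so the logic can be tried out
    without either engine being installed.
    """
    for name in OCR_CANDIDATES:
        if which(name) is not None:
            return name
    return None


if __name__ == "__main__":
    print(find_ocr_engine() or "no OCR engine found")
```

On a system with both engines installed, the function reports `tesseract`, which matches the preference described in the text.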

Paperwork builds for the most part on existing components. To scan documents, Paperwork uses SANE. Tesseract or Cuneiform takes over the text recognition. Whoosh [5] indexes the OCR-converted text so that it can be searched easily, and the tool generates suggestions for keywords. Paperwork then pulls everything together in a graphical interface developed with GTK/Glade.
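To illustrate what indexing OCR text and suggesting keywords means in principle, here is a deliberately tiny stand-in written in plain Python. It does not use Whoosh and does not reflect how Paperwork derives its suggestions. The class `TinyIndex`, the stop word list, and the frequency-based keyword heuristic are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# A very small, hypothetical stop word list; a real index uses a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "for", "in", "is", "on"}


def tokenize(text):
    """Lowercase the text and split it into words, dropping stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


class TinyIndex:
    """Toy inverted index: maps each word to the documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of document ids
        self.counts = {}                  # document id -> word frequencies

    def add(self, doc_id, text):
        words = tokenize(text)
        self.counts[doc_id] = Counter(words)
        for word in words:
            self.postings[word].add(doc_id)

    def search(self, query):
        """Return the ids of documents that contain every query word."""
        words = tokenize(query)
        if not words:
            return set()
        result = set(self.postings.get(words[0], set()))
        for word in words[1:]:
            result &= self.postings.get(word, set())
        return result

    def suggest_keywords(self, doc_id, n=3):
        """Suggest the n most frequent words of a document as keywords."""
        return [word for word, _ in self.counts[doc_id].most_common(n)]


if __name__ == "__main__":
    index = TinyIndex()
    index.add("invoice-1", "Invoice for electricity. Electricity usage in March.")
    index.add("letter-1", "Letter from the insurance company about the invoice.")
    print(sorted(index.search("invoice")))
    print(index.suggest_keywords("invoice-1", 1))
```

A full-text library such as Whoosh adds persistent storage, ranking, and query parsing on top of this basic idea.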

[...]
