Composing Office documents without Office


Formatting Word Files

You can download a Word file for practice:

$ wget ftp://ftp.linux-magazine.com/pub/listings/ubuntu-user.com/31/Pandoc/demo.docx

A file called demo.docx will appear in your directory. Pandoc cannot download the file by itself, because DOCX is a compressed format rather than a text file. You can then convert the document to Markdown with:

$ pandoc -f docx -t markdown --atx-headers --toc --extract-media="." demo.docx -o demo2.md

The -f and -t switches determine the input and output formats, although in this case they are not strictly necessary, since Pandoc recognizes the file extensions.

In certain cases, you can also do without the --atx-headers option (marks headings with preceding hash marks instead of underlining) and the --toc option (creates a table of contents). The --extract-media option tells the program to extract all of the images. If you follow the example and specify the current directory, the program places the images in a subdirectory named media.
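The conversion described above can be wrapped in a small shell function. This is a minimal sketch, not part of the original workflow: the name docx_to_markdown is hypothetical, and the function assumes that pandoc is on the PATH. It leaves out -f and -t, because Pandoc infers both formats from the file extensions.

```shell
# Convert a DOCX file to Markdown and extract its images.
# docx_to_markdown is a hypothetical helper name; pandoc must be installed.
docx_to_markdown() {
    input=$1
    output=$2
    if ! command -v pandoc >/dev/null 2>&1; then
        echo "pandoc is not installed" >&2
        return 1
    fi
    # The formats follow from the .docx and .md extensions,
    # so -f and -t are omitted here.
    pandoc "$input" -o "$output" --extract-media="."
}
```

Calling docx_to_markdown demo.docx demo2.md should then write demo2.md and place any images in the media subdirectory.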

Now look at the MS Word file you downloaded earlier. I want to focus exclusively on the design. Comparing the original (Figure 1) with the Markdown version (Figure 2) demonstrates how Pandoc works. Markdown can only manage formatting that structures the text; coloring is lost.

Figure 1: An excerpt from the Word file shows the formatting used in the original.
Figure 2: Implementing the original format in Markdown works only up to a certain point.

A more detailed examination of the original file reveals that the author specified only a few headings. Most of the markup consists of direct formatting such as bold face, font size, or color. As a result, the text loses important design elements. For example, it is not going to be very useful to automatically create a table of contents (--toc).

The bullet lists were constructed automatically, and Pandoc picks them up okay. If they were entered manually, Pandoc would not recognize them as bulleted lists, although it would try to imitate the formatting as closely as possible. Implementing the table in Markdown also doesn't work well. For one thing, the line separators stretch across multiple lines, making it difficult to read the text. In addition, the software interprets paragraphs in the tables as new lines. Furthermore, Markdown ignores headers and footers. However, it is possible to give the original file as a reference when converting to DOCX.

Pandoc knows some predefined variables that you can set in a block at the beginning of the text, enclosed between two lines of three dashes. You will find the structure of the text in Markdown in Listing 1. The transformation into Word was accomplished using the following instruction:

Listing 1

Pre-text Data

---
author: Roland Pleger
title: Using Pandoc to convert documents
date: Date
---
# Simple Complexity
Pandoc lets you convert Markdown documents to DOCX and work conveniently
with the editor of your choice. It is even possible to keep track of
changes made by various authors.
By Roland Pleger
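Whether a Markdown file really opens with a well-formed metadata block can be checked from the shell. The following sketch is an illustration only: the function name has_metadata_block is hypothetical, and the check relies on Pandoc's rule that the block opens with a line of three dashes and closes with a line of three dashes or three dots. The heredoc writes a shortened version of Listing 1 to demor.md.

```shell
# Write a Markdown file that starts with a metadata block.
cat > demor.md <<'EOF'
---
author: Roland Pleger
title: Using Pandoc to convert documents
date: Date
---

# Simple Complexity

Pandoc lets you convert Markdown documents to DOCX.
EOF

# Succeed if the first line is '---' and a later line closes the block
# with '---' or '...'. has_metadata_block is a hypothetical helper name.
has_metadata_block() {
    awk 'NR == 1 { if ($0 != "---") exit 1; next }
         $0 == "---" || $0 == "..." { found = 1; exit }
         END { exit !found }' "$1"
}

if has_metadata_block demor.md; then
    echo "demor.md starts with a metadata block"
fi
```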

Listing 2

Mathematical Expression

$$
\bar{x}_{\mathrm{arithm}} = \frac{1}{n} \sum_{i=1}^n{x_i}
$$ {#eq:item}

$ pandoc --toc --reference-doc=demo.docx demor.md -o demor.docx

With --reference-doc, the converter takes the headers and footers from the reference file demo.docx. Moreover, it adopts that file's formatting for the title, author, and headings. If the reference file contains no styles for headings, you will have to define them yourself. The simplest solution is to first execute the transformation without a reference file. Then copy the result to the reference file and modify the formats for text, headings, and title in Word as desired. Finally, repeat the transformation with the expanded reference file.
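The two-pass procedure can be sketched as a pair of shell functions. The function names and the file name reference.docx are hypothetical. Only the --toc and --reference-doc options come from the command shown earlier.

```shell
# Two-pass workflow for a styled DOCX (function names are hypothetical).

# Pass 1: convert without a reference file. The result carries Pandoc's
# default styles and serves as the starting point for the reference file.
make_reference_seed() {
    pandoc "$1" -o "$2"
}

# Pass 2: after the styles in the reference file have been edited in Word,
# convert again and let Pandoc take the formatting from that file.
convert_with_reference() {
    pandoc --toc --reference-doc="$2" "$1" -o "$3"
}
```

A possible sequence would be make_reference_seed demor.md reference.docx, then editing the styles of reference.docx in Word, and finally convert_with_reference demor.md reference.docx demor.docx.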

Footnotes and Cross References

The contents are the most important part of a scientific work. The formatting comes after the contents have been written. Simple text files are popular. They don't make any demands on the selection of the operating system. They also make it possible to exchange information with colleagues. The framework can be quickly assembled and filled with keywords. Then comes the composition. You can include images and tables as needed.

Equations are written in LaTeX notation. Mathematical expressions within running text are enclosed in single dollar signs. More complex mathematical expressions (Figure 3) stand between two lines that each contain a pair of dollar signs (Listing 2).
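As a worked illustration, the following Markdown source uses inline math for the individual symbols and a display block for the calculation. The sample values 2, 4, and 6 are made up for this example.

```latex
The mean $\bar{x}$ of the values $x_1 = 2$, $x_2 = 4$, and $x_3 = 6$ is

$$
\bar{x}_{\mathrm{arithm}} = \frac{1}{3} \sum_{i=1}^{3} x_i = \frac{2 + 4 + 6}{3} = 4
$$
```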

Listing 3

Integrating Extensions

$ pip install pandoc-fignos
The program 'pip' is currently not installed.
$ sudo apt install python-pip
$ sudo pip install pandocfilters --upgrade
$ sudo pip install pandoc-fignos
$ sudo pip install pandoc-eqnos
$ sudo pip install pandoc-tablenos

Figure 3: Pandoc even creates complex mathematical formulas that you can enter into the source code in a format borrowed from LaTeX.

If necessary, you can use the expression @eq:item to refer to a formula that you have labeled with {#eq:item}. Making such a reference requires an extension [3], which you incorporate when calling Pandoc. The same goes for images [4] and tables [5]. Listing 3 shows how to integrate these add-ons into Pandoc. The error message in the second line prompts you to install the Python package manager pip if it is still missing from the system.

Pandoc includes tools that can accurately format lists of sources from scholarly publications. An explanation of the Citation Style Language (CSL) format goes beyond the scope of this article, but it is worth mentioning in case you need it. The same goes for managing sources in formats like BibTeX or BibLaTeX.

Pandoc supports references to headings and footnotes out of the box. Table 2 shows the syntax for the various types of references. To use the other kinds of cross references, you need to call Pandoc with the corresponding filters:

$ pandoc --filter pandoc-fignos --filter pandoc-eqnos --filter pandoc-tablenos --filter pandoc-citeproc myfile.md -o myfile.docx
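A missing filter would normally make this call fail. The following sketch builds the --filter options only for filters that are actually found on the PATH. The function name filter_args is hypothetical, and the filter names and their order are taken from the command above.

```shell
# Print a --filter option for every filter that is installed.
# filter_args is a hypothetical helper name.
filter_args() {
    for f in pandoc-fignos pandoc-eqnos pandoc-tablenos pandoc-citeproc; do
        if command -v "$f" >/dev/null 2>&1; then
            printf -- '--filter %s ' "$f"
        else
            # Report the missing filter on stderr and carry on.
            echo "warning: $f not found, skipping" >&2
        fi
    done
}
```

The call would then read pandoc $(filter_args) myfile.md -o myfile.docx. The command substitution is left unquoted on purpose so that the shell splits it into separate arguments.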

Currently, the developers are working on combining the filters and creating the references automatically [6].

Table 2

References

Type of Reference   Reference in Text   Label at Target
Footnote            [^item]             [^item]:
Heading             (#item)             {#item}
Bibliography        @item               @Article{item}
Image               @fig:item           {#fig:item}
Table               @tbl:item           {#tbl:item}
Equation            @eq:item            {#eq:item}

Like typical word processors, Pandoc can handle tracked changes. The command in line 1 of Listing 4 accepts all tracked changes from the Office document office.docx and saves the result in the Markdown document myfile.md. The instruction in line 2 discards all of the changes, and the instruction in line 3 keeps them as markup in the output.

Listing 4

Tracking Changes

01 $ pandoc office.docx -o myfile.md --track-changes=accept
02 $ pandoc office.docx -o myfile.md --track-changes=reject
03 $ pandoc office.docx -o myfile.md --track-changes=all
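To compare the three modes side by side, each result can be written to its own file. This is a sketch only: the function name export_change_variants and the output file names are hypothetical. The --track-changes values are those from Listing 4.

```shell
# Write one Markdown file per --track-changes mode for comparison.
# export_change_variants is a hypothetical helper name.
export_change_variants() {
    docx=$1
    base=${docx%.docx}    # strip the .docx extension
    for mode in accept reject all; do
        pandoc "$docx" -o "$base-$mode.md" --track-changes="$mode"
    done
}
```

After export_change_variants office.docx, a call such as diff office-accept.md office-reject.md would show what the tracked changes amount to.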

Pandoc's Markdown falls back on HTML syntax as soon as it comes across something that standard Markdown cannot express. This is how the call from line 3 in Listing 4 produced the Markdown snippet in Listing 5 from the office.docx file shown in Figure 4.

Listing 5

office.docx

~~~ {.html}
The example for deletion,
And an example for insertion in Pandoc Markdown->Office.
~~~

Figure 4: Markings indicate author changes to a document in progress. Pandoc relies on HTML syntax for output in Markdown.

To adopt the changes, delete the span tags from the Markdown file. After the conversion from Markdown to DOCX, the editing marks cannot be distinguished from those originally made in Office. If information about the author and date is missing from the changes, the program can add these items.
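Removing the span tags by hand is tedious in longer documents. A sed call along the following lines could do it. The sketch rests on assumptions: the function name strip_change_spans is hypothetical, the changes are assumed to appear as span elements with the class insertion or deletion, and the file is assumed to contain no other spans, because every closing span tag is removed. The enclosed text is kept in both cases.

```shell
# Remove the span tags that mark tracked changes and keep the enclosed text.
# Assumption: changes appear as <span class="insertion" ...> or
# <span class="deletion" ...>, and the file contains no other spans,
# since every closing </span> tag is removed as well.
strip_change_spans() {
    sed -E -e 's/<span class="(insertion|deletion)"[^>]*>//g' -e 's#</span>##g'
}
```

The function reads from standard input, for example strip_change_spans < myfile.md > myfile-clean.md, where myfile-clean.md is a hypothetical output name.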
