Composing Office documents without Office

Many people appreciate graphical word processors such as LibreOffice Writer or Microsoft Word because they simplify the process of creating attractive documents. However, even when used efficiently, these programs can make entering formatted text complicated, since they tie styles closely and invisibly to the text. As a result, you often cannot see where one style ends and the next begins.

The free tool Pandoc offers a different approach by clearly separating form and content. The tools presented here also work with file types such as HTML, ODT, TeX, and PDF, as well as DOCX (the format used by Microsoft Word since Word 2007).

Excellent

Tools like Pandoc and AsciiDoc differ from classical markup in that they keep the use of element tags to a minimum. In a markup language, highlights and headings are typically labeled with certain strings (tags) in the text.

As the name "markup language" (ML) suggests, HTML belongs to this language family, as do LaTeX, RTF, and others. Simplified markup languages use fewer marks and symbols; thus they adapt more smoothly to the flow of the text.

A good place to start is with Markdown. It is supported by programs such as IPython Notebook/Jupyter, R/knitr, and others. Table 1 summarizes important markup; a comprehensive list is available online [1].

Table 1

Markdown

Markup                   Function
*<Word>*                 Italic
**<Word>**               Bold
# <Heading>              First-level heading
## <Heading>             Second-level heading
*                        Bullet for a bulleted list
1.                       Entry in a numbered list
`<Code>`                 Monospace (for code)
---                      Horizontal line
>                        Indented block of text
[<Link-Text>](<URL>)     Hyperlink
![<imagetext>](<file>)   Include an image
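
To see the markup from Table 1 in context, the following sketch writes a small sample file and converts it if Pandoc is available. The file name sample.md and its contents are made up for illustration; the image entry is left out because it would need an actual image file.

```shell
# Write a small Markdown file that uses the markup from Table 1.
# The file name sample.md and the text are illustrative assumptions.
cat > sample.md <<'EOF'
# Shopping Guide

## Fruit

Buy *ripe* bananas and **two** lemons.

* Bananas
* Lemons

1. Wash the fruit
2. Peel the fruit

Run `ls` to list the files.

---

> A quoted, indented block of text.

[Pandoc home page](https://pandoc.org)
EOF

# Convert to DOCX only if pandoc is installed.
if command -v pandoc >/dev/null 2>&1; then
  pandoc sample.md -o sample.docx || echo "conversion failed" >&2
else
  echo "pandoc not found; sample.md was written but not converted"
fi
```

Opening the resulting sample.docx in a word processor shows how each mark maps to a heading, list, or character style.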

Pandoc

Pandoc has undergone significant development over the years. Note that users with 32-bit systems are limited to version 1.2, which will not cope well with all of the examples presented in this article.

The repositories for most distributions offer the program in a stable 1.16.0.2 version for 64-bit systems. In addition, the web page [2] links to the newest packages for Ubuntu and Debian. Even if you don't need their functions right away, it is a good idea to install the packages pandoc-citeproc and python-pandocfilters at the same time:

sudo apt update
sudo apt install pandoc python-pandocfilters pandoc-citeproc

Pandoc expects text files in UTF-8 encoding, which is the default on Linux anyway. The following call converts a Markdown text into the MS Word format DOCX:

$ pandoc test.md -o test.docx

Originally, Markdown was only meant to simplify writing HTML code, and this intent is still recognizable in the language today. As with HTML, the software treats spaces and line breaks as simple word separators. There are some exceptions, however: If a line begins with four spaces, the software interprets it as a block of code, which it sets in a monospace font and does not rewrap. If a line ends with at least two spaces, the software adds a line break in the output.

You introduce a heading with a blank line followed by one or more pound signs at the beginning of the next line, for example, a double pound sign for a second-level heading. You should surround bulleted lists, indented text, and text blocks with blank lines. At least one space must follow the marker of a list entry.
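
The whitespace rules described above are easy to overlook in an editor, so the following sketch writes them out with printf, which makes leading and trailing spaces explicit. The file name spaces.md and the sample text are assumptions for illustration.

```shell
# Write a file that exercises Markdown's whitespace rules.
# printf is used so that leading and trailing spaces are visible in the script.
{
  printf 'These two lines\n'
  printf 'form one paragraph.\n'
  printf '\n'
  printf 'Two trailing spaces  \n'   # the two spaces force a line break
  printf 'start a new line.\n'
  printf '\n'
  printf '    four leading spaces mark a code block\n'
  printf '\n'
  printf '## Heading after a blank line\n'
  printf '\n'
  printf '* list item, one space after the marker\n'
} > spaces.md

# Show the HTML that pandoc generates, if pandoc is installed.
if command -v pandoc >/dev/null 2>&1; then
  pandoc spaces.md -t html || echo "conversion failed" >&2
else
  echo "pandoc not found; spaces.md was written but not converted"
fi
```

Comparing the HTML output with the source shows which line breaks survive the conversion and which are treated as ordinary word separators.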

Formatting Word Files

You can download a Word file for practice:

$ wget ftp://ftp.linux-magazine.com/pub/listings/ubuntu-user.com/31/Pandoc/demo.docx

A file called demo.docx will appear in the current directory. Pandoc cannot download the file by itself, since DOCX is a compressed file format rather than a text file. You can then convert the document to Markdown with:

$ pandoc -f docx -t markdown --atx-headers --toc --extract-media="." demo.docx -o demo2.md

The -f and -t switches determine the input and output formats, although in this case they are not strictly necessary, since Pandoc recognizes the formats from the file extensions.

In certain cases, you can also do without the --atx-headers option (which marks headings with preceding hash signs instead of underlining) and the --toc option (which creates a table of contents). The --extract-media option tells the program to extract all of the images. If you follow the example and specify the current directory, the program stores the images in a subdirectory named media.

Now look at the MS Word file you downloaded earlier. I want to focus exclusively on the design. Comparing the original (Figure 1) with the Markdown version (Figure 2) demonstrates how Pandoc works. Markdown can only manage formatting that structures the text; coloring is lost.

Figure 1: An excerpt from the Word file shows the formatting used in the original.
Figure 2: Implementing the original format in Markdown works only up to a certain point.

A closer examination of the original file reveals that the author marked only a few headings as such. Most of the formatting was applied directly, through things like boldface, font size, or color. As a result, the text loses important design elements in the conversion. For example, automatically creating a table of contents (--toc) is not going to be very useful.

The bulleted lists were constructed automatically, and Pandoc picks them up correctly. Had they been entered manually, Pandoc would not recognize them as bulleted lists, although it would try to imitate the formatting as closely as possible. The table does not translate well into Markdown either. For one thing, the line separators stretch across multiple lines, making the text difficult to read. In addition, the software interprets paragraphs in the tables as new lines. Furthermore, Pandoc ignores headers and footers. However, you can specify the original file as a reference when converting to DOCX.

Pandoc recognizes some predefined variables, which you place at the beginning of the text in a block delimited by lines of three dashes. You will find the structure of the text in Markdown in Listing 1. The transformation into Word was accomplished with the pandoc call that uses --reference-doc, shown directly after Listing 2.

Listing 1

Pre-text Data

---
author: Roland Pleger
title: Using Pandoc to convert documents
date: Date
---
# Simple Complexity
Pandoc lets you convert Markdown documents to DOCX and work conveniently
with the editor of your choice. It is even possible to keep track of
changes made by various authors.
By Roland Pleger

Listing 2

Mathematical Expression

$$
\bar{x}_{\mathrm{arithm}} = \frac{1}{n} \sum_{i=1}^n{x_i}
$$ {#eq:item}

$ pandoc --toc --reference-doc=demo.docx demor.md -o demor.docx

With --reference-doc, the converter takes the headers and footers from the reference file demo.docx. It also adopts that file's formatting for the title, author, and headings. If the reference file contains no styles for headings, you will have to add them yourself. The simplest solution is to first run the conversion without a reference file and then use a copy of the result as the reference file. Modify the formats, headings, and title in Word as desired, and then repeat the conversion with the expanded reference file.
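
The two-pass procedure described above can be sketched as follows. The file names draft.md and reference.docx are hypothetical. The sketch uses the --reference-doc option from this article; Pandoc 1.x releases spell this option --reference-docx, so adjust the call if your version rejects it.

```shell
# Two-pass workflow for building a reference file (sketch).
# draft.md and reference.docx are hypothetical file names.
printf '# Heading\n\nSome text.\n' > draft.md

if command -v pandoc >/dev/null 2>&1; then
  # Pass 1: convert without a reference file to get Pandoc's default styles.
  pandoc draft.md -o draft.docx

  # Keep a copy of the result to serve as the reference file.
  cp draft.docx reference.docx

  # Manual step: open reference.docx in the word processor, adjust the
  # styles for title, author, and headings, and save the file.

  # Pass 2: convert again, taking the styles from the reference file.
  pandoc --reference-doc=reference.docx draft.md -o draft.docx ||
    echo "conversion with reference file failed" >&2
else
  echo "pandoc not found; skipping conversion"
fi
```

Only the styles of the reference file are used; its text content does not end up in the converted document.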

Footnotes and Cross References

The contents are the most important part of a scientific work; the formatting comes after the contents have been written. Simple text files are popular for this purpose. They work on any operating system and make it easy to exchange information with colleagues. The framework can be quickly assembled and filled with keywords. Then comes the composition. You can include images and tables as needed.

Equations are written in LaTeX notation. Mathematical expressions within running text are enclosed by single dollar signs. More complex mathematical expressions (Figure 3) stand between two pairs of dollar signs, each pair on its own line (Listing 2).

Listing 3

Integrating Extensions

$ pip install pandoc-fignos
The program 'pip' is currently not installed.
$ sudo apt install python-pip
$ sudo pip install pandocfilters --upgrade
$ sudo pip install pandoc-fignos
$ sudo pip install pandoc-eqnos
$ sudo pip install pandoc-tablenos

Figure 3: Pandoc even creates complex mathematical formulas that you can enter into the source code in a format borrowed from LaTeX.

If necessary, you can use the expression @eq:item to refer to the formula that you have labeled with {#eq:item}. Making a reference requires an extension [3], which you specify when calling Pandoc. The same goes for images [4] and tables [5]. Listing 3 shows how to install these add-ons. The error message in the second line prompts you to install the Python package manager pip if it is still missing from the system.
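
As a minimal sketch of such an equation reference, the following script writes a short document with one labeled formula and converts it with the pandoc-eqnos filter when both programs are installed. The file name equation.md and the label eq:mean are made up for this example.

```shell
# Write a document with one labeled equation and a reference to it.
# equation.md and the label eq:mean are illustrative assumptions.
cat > equation.md <<'EOF'
For $n$ measured values, the arithmetic mean is given by @eq:mean.

$$
\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i
$$ {#eq:mean}
EOF

# Convert only if both pandoc and the pandoc-eqnos filter are available.
if command -v pandoc >/dev/null 2>&1 &&
   command -v pandoc-eqnos >/dev/null 2>&1; then
  pandoc --filter pandoc-eqnos equation.md -o equation.docx ||
    echo "conversion failed" >&2
else
  echo "pandoc or pandoc-eqnos not found; skipping conversion"
fi
```

Without the filter, Pandoc leaves the label and the reference unresolved, which is a quick way to check whether the extension is actually active.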

Pandoc includes tools that can accurately format lists of sources for scholarly publications. Explaining the Citation Style Language (CSL) format goes beyond the scope of this article, but it is worth mentioning in case you need it. The same goes for managing sources in formats like BibTeX or BibLaTeX.

Pandoc has built-in support for references to headings and footnotes. Table 2 shows the notation for the various references. To use the cross references that rely on the extensions, call Pandoc with the corresponding filters:

$ pandoc --filter pandoc-fignos --filter pandoc-eqnos --filter pandoc-tablenos --filter pandoc-citeproc myfile.md -o myfile.docx

Currently, the developers are working on combining the filters and creating the references automatically [6].

Table 2

References

Type of Reference   Reference in Text   Target Definition
Footnote            [^item]             [^item]:
Heading             (#item)             {#item}
Bibliography        @item               @Article{item}
Image               @fig:item           {#fig:item}
Table               @tbl:item           {#tbl:item}
Equation            @eq:item            {#eq:item}
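
To show the entries of Table 2 in context, the following sketch writes a sample document that uses a footnote, a heading link, and numbered references to an image and a table. The file name refs.md, the labels, and the image file logo.png are hypothetical; the image and table labels assume the pandoc-fignos and pandoc-tablenos filters are in use.

```shell
# Write a sample document that uses the reference types from Table 2.
# refs.md, the labels, and logo.png are illustrative assumptions.
cat > refs.md <<'EOF'
# Introduction {#intro}

Pandoc converts documents.[^note] Figure @fig:logo shows a logo,
and table @tbl:sizes lists some sizes.

![A logo](logo.png){#fig:logo}

Name   Size
-----  ----
small  1
large  2

Table: Example sizes. {#tbl:sizes}

See the [introduction](#intro) for details.

[^note]: This is the footnote text.
EOF

echo "refs.md written"
```

You can convert refs.md with the filter call shown before Table 2; the footnote and the heading link also work without any filter.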

Like typical word processors, Pandoc can handle tracked document changes. The command in line 1 of Listing 4 accepts all changes from the Office document office.docx when writing the Markdown document myfile.md. The command in line 2 discards all of the changes, and the command in line 3 records them in the output as insertions and deletions.

Listing 4

Tracking Changes

01 $ pandoc office.docx -o myfile.md --track-changes=accept
02 $ pandoc office.docx -o myfile.md --track-changes=reject
03 $ pandoc office.docx -o myfile.md --track-changes=all

Pandoc falls back on HTML syntax in Markdown as soon as it comes across elements that Markdown cannot express. This is how the call from line 3 in Listing 4 created the Markdown snippet in Listing 5 from the office.docx file shown in Figure 4.

Listing 5

office.docx

~~~ {.html}
The example for deletion,
And an example for insertion in Pandoc Markdown->Office.
~~~

Figure 4: Markings indicate author changes to a document in progress. Pandoc relies on HTML syntax for output in Markdown.

To adopt the changes, delete the span tags from the Markdown file. After converting from Markdown back to DOCX, it is impossible to distinguish the editing marks from those originally made in Office. If information about the author and date is missing from the changes, the program can add these items.
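
If a file contains many tracked changes, removing the tags by hand becomes tedious. The following sketch uses sed to accept all changes. It assumes that the changes appear as HTML span elements with the classes deletion and insertion and that no span contains nested tags; the sample line, author, and date are invented, so check the pattern against the output of your Pandoc version first.

```shell
# Sample input with one deletion and one insertion (invented content).
# Assumption: changes are marked as <span class="deletion" ...> and
# <span class="insertion" ...> without nested tags inside the spans.
cat > tracked.md <<'EOF'
The example <span class="deletion" author="A. Author" date="2016-01-01T10:00:00Z">for deletion, </span>and an example <span class="insertion" author="A. Author" date="2016-01-01T10:00:00Z">for insertion </span>in Pandoc.
EOF

# Accept all changes: drop deleted text completely and keep inserted
# text without its surrounding span tags.
sed -E \
  -e 's#<span class="deletion"[^>]*>[^<]*</span>##g' \
  -e 's#<span class="insertion"[^>]*>([^<]*)</span>#\1#g' \
  tracked.md > accepted.md

cat accepted.md
```

Swapping the two class names in the sed expressions rejects the changes instead of accepting them.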

Conclusion

Pandoc almost achieves the impossible. It creates simple text files from Office documents, with formatting borrowed from markup languages. Pandoc can also reverse the process and convert text files into properly formatted documents for Office word processors. This simpler way of working and the cleanly structured documents make up for the fact that formatting sometimes gets lost during the conversion.

The creation of Office documents in ODT and DOCX format represents only a portion of Pandoc's capabilities. The actual strength of the program lies in generating HTML, TeX, LaTeX, and PDF files. And, although the corresponding markup languages' complexity requires intensive study of the language elements, at least the editing follows the rules presented here.