Pandoc filters for scientific writing
Introduction
Since 2005, I have written all my scientific articles and presentations with the help of the document converter Pandoc. In this post, I present some tools I created to adapt Pandoc to the needs of scientific writing.
Pandoc is a program to convert documents between different formats. For example, Pandoc can convert Word files to HTML, or HTML to PDF. One of the formats supported by Pandoc is Markdown, a text format for writing documents with elements such as headings, lists, links, images, tables, and code listings. It distinguishes itself from other text formats like LaTeX and HTML by its intuitive syntax that emphasises readability of the source. For example, consider a bulleted list in Markdown:
* Apples
* Oranges
* Bananas
The equivalent in HTML, generated by Pandoc, is:
<ul>
<li>Apples</li>
<li>Oranges</li>
<li>Bananas</li>
</ul>
And in LaTeX:
\begin{itemize}
\item
Apples
\item
Oranges
\item
Bananas
\end{itemize}
For me, the Markdown version wins in terms of compactness and readability. Markdown allows me to spend more time writing actual content and less time remembering to close tags.
My workflow is to write an article in Markdown, then to convert it to LaTeX via Pandoc, and then to create a PDF via pdflatex.
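The workflow above can be sketched as a small Python script. This is only an illustration: the file name article.md and the helper names build_commands and build_pdf are hypothetical, and the sketch assumes that pandoc and pdflatex are installed and on the PATH.

```python
import subprocess
from pathlib import Path


def build_commands(source="article.md"):
    """Return the two command lines of the Markdown -> LaTeX -> PDF workflow."""
    tex = str(Path(source).with_suffix(".tex"))
    return [
        # --standalone makes Pandoc emit a complete LaTeX document
        # (with preamble) instead of a fragment.
        ["pandoc", "--standalone", source, "-o", tex],
        ["pdflatex", tex],
    ]


def build_pdf(source="article.md"):
    """Run both steps in order, stopping at the first failure."""
    for command in build_commands(source):
        subprocess.run(command, check=True)
```

Calling build_pdf("article.md") would then leave article.tex and article.pdf next to the Markdown source.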
Some LaTeX functionality I frequently use is not provided by Markdown. For example, in my articles, I often need to reference parts of the article as follows:
As mentioned in \autoref{introduction}, ...
This yields text like “As mentioned in Section 1, …” in the resulting PDF document.
Pandoc has no similar built-in functionality to create a link to another part of the document with an automatically generated name. One way to achieve this is to write \autoref{introduction} in the Markdown document, because Pandoc detects LaTeX commands in Markdown documents and passes them through when converting to LaTeX. This works when generating PDF via LaTeX, but not when generating HTML: where \autoref{introduction} is written, the HTML output contains just blank space.
Furthermore, I find it cumbersome to always write \autoref{...}. Can’t we make this shorter somehow?
Filters
It is possible to extend Pandoc with so-called filters. A filter is a program that transforms a document read by Pandoc before Pandoc outputs it. This allows us to modify the output of Pandoc, and in some cases also to modify the meaning of the syntax, thus making certain frequently used functionalities easily accessible. In this section, I show several such filters I created. They can be obtained from GitHub.
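To make the idea concrete, here is a minimal sketch of a filter in Python that uses only the standard library. It relies on Pandoc's JSON filter interface: Pandoc writes the document as JSON to the filter's standard input, reads the modified JSON from its standard output, and passes the target format as the first command-line argument. The action shout is a hypothetical toy example, not one of my filters.

```python
import json
import sys


def walk(node, action, fmt):
    """Rebuild the AST, letting `action` replace any element it recognises."""
    if isinstance(node, list):
        return [walk(item, action, fmt) for item in node]
    if isinstance(node, dict):
        if "t" in node:  # AST elements are objects tagged with a type "t"
            replacement = action(node, fmt)
            if replacement is not None:
                return replacement
        return {key: walk(value, action, fmt) for key, value in node.items()}
    return node  # strings and numbers are left untouched


def shout(elem, fmt):
    """Hypothetical example action: upper-case all plain text."""
    if elem["t"] == "Str":
        return {"t": "Str", "c": elem["c"].upper()}
    return None  # None means: keep this element and descend into it


def main():
    fmt = sys.argv[1]  # target format, e.g. "latex" or "html"
    doc = json.load(sys.stdin)
    json.dump(walk(doc, shout, fmt), sys.stdout)


# Pandoc always passes the target format, so only run when it is present.
if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```

Saved as an executable file with the hypothetical name shout.py, this could be run with pandoc --filter ./shout.py input.md -o output.html.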
Intra-document links
As mentioned in the introduction, I often need to write something like \autoref{introduction} in LaTeX to reference parts of my document. I created a filter that allows writing [](#introduction) in Markdown documents instead.
This syntax is intuitive if you know that Markdown provides a way to create intra-document links, such as [Introduction](#introduction), which links to the introduction of this post as follows: Introduction.
If you create a link without link text, e.g. [](#introduction), Pandoc just yields an empty link that displays nothing. That means that hardly anybody using vanilla Pandoc has a reason to write something like [](#introduction) in their documents. Therefore, my filter can safely redefine its output to yield \autoref{introduction} when Pandoc outputs LaTeX, and [introduction](#introduction) for any other format.
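The rewriting rule just described can be sketched as a function on a single element of Pandoc's JSON document tree. This is an illustrative sketch, not the code from the GitHub repository; the function name autoref is hypothetical, and treating only the format name "latex" as LaTeX output is a simplifying assumption.

```python
def autoref(elem, fmt):
    """Rewrite links that have no text and point to an in-document anchor.

    Returns a replacement element, or None to leave `elem` unchanged.
    """
    if elem.get("t") != "Link":
        return None
    attr, text, (url, title) = elem["c"]
    if text or not url.startswith("#"):
        return None  # ordinary links keep their meaning
    target = url[1:]  # "#introduction" -> "introduction"
    if fmt == "latex":
        # Let LaTeX generate the name ("Section 1") via \autoref.
        return {"t": "RawInline", "c": ["latex", "\\autoref{%s}" % target]}
    # Any other format: fall back to the anchor name as link text.
    return {"t": "Link", "c": [attr, [{"t": "Str", "c": target}], [url, title]]}
```

A complete filter would apply this function to every inline element of the document and write the result back to Pandoc.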
LaTeX environments
I often need to write definitions and theorems in my articles. In LaTeX, I write something along the lines of:
\begin{definition}[Natural number]\label{def:nat}
A natural number is either zero, or the successor of a natural number.
\end{definition}
I created a filter that lets you write an equivalent of the above in Markdown:
Definition def:nat (Natural number)
: A natural number is either zero, or the successor of a natural number.
This syntax is actually used in several Markdown dialects to create definition lists. For example:
Love (noun)
: A deep and tender feeling of affection.
: A score of zero in tennis.
In HTML, this is transformed to:
<dl>
<dt>Love (noun)</dt>
<dd>A deep and tender feeling of affection.</dd>
<dd>A score of zero in tennis.</dd>
</dl>
However, I have never used a definition list in any of my articles, and I cannot recall ever having seen a definition list in any other article. Therefore, I deem it safe for the filter to transform every definition list to a LaTeX environment with an optional title and an optional label, as above. Furthermore, even without any filter, Pandoc transforms the Markdown definition of natural numbers above to readable HTML output:
<dl>
<dt>Definition def:nat (Natural number)</dt>
<dd>A natural number is either zero, or the successor of a natural number.</dd>
</dl>
Note that this allows creating LaTeX environments with multiple paragraphs. It is also possible to reference the resulting LaTeX environments using the intra-document links presented earlier. That means you can write [](#def:nat) to yield \autoref{def:nat} in LaTeX.
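The transformation from a definition list to a LaTeX environment can be sketched as follows. This is an illustrative sketch rather than the code from the GitHub repository. It assumes that the term has the layout "Name label (Title)" with label and title optional, that the environment name is the lower-cased first word, and that the term contains only plain text; all function names are hypothetical.

```python
import re

# Assumed term layout: Name [label] [(Title)],
# e.g. "Definition def:nat (Natural number)".
TERM = re.compile(r"^(\w+)(?:\s+([^\s()]+))?(?:\s+\((.*)\))?$")


def plain_text(inlines):
    """Flatten Str/Space inlines to text (other inline types are ignored)."""
    parts = []
    for inline in inlines:
        if inline["t"] == "Str":
            parts.append(inline["c"])
        elif inline["t"] in ("Space", "SoftBreak"):
            parts.append(" ")
    return "".join(parts)


def raw_latex(text):
    """Wrap LaTeX source in a raw block that Pandoc passes through."""
    return {"t": "RawBlock", "c": ["latex", text]}


def definition_list_to_latex(elem):
    """Turn a DefinitionList element into a flat list of blocks."""
    blocks = []
    for term, definitions in elem["c"]:
        match = TERM.match(plain_text(term).strip())
        if match is None:
            raise ValueError("unrecognised term: %r" % plain_text(term))
        name, label, title = match.groups()
        begin = "\\begin{%s}" % name.lower()
        if title:
            begin += "[%s]" % title
        if label:
            begin += "\\label{%s}" % label
        blocks.append(raw_latex(begin))
        for definition in definitions:
            blocks.extend(definition)  # keep the body blocks as they are
        blocks.append(raw_latex("\\end{%s}" % name.lower()))
    return blocks
```

A complete filter would call this only when the target format is LaTeX and splice the returned blocks into the document in place of the definition list.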
Floating tables and code blocks
By default, Pandoc puts tables and code blocks at exactly the position where they are placed in the Markdown document. However, in scientific articles, it is common practice to let tables and code blocks float. I created a filter that floats every table and code block that has a caption. See README.md on GitHub for more information.
Conclusion
I have found article writing much more pleasant since I discovered Pandoc, and my scientific writing filters have increased that pleasure further.
I invite you to give Pandoc a try, and I hope that my filters may contribute to your positive experience!