Pandoc is a versatile and extensible document converter,
able to parse from and serialise to many Markdown dialects
as well as equivalent subsets of other text markup languages
including HTML and (via LaTeX or other intermediaries) PDF.
Pandoc offers several APIs for adapting its content processing
as well as a templating structure for styling of output.
This chapter provides an analysis of Pandoc,
with a focus on ways to extend it to support a new Markdown extension.
Separation of concerns
The data processing workflow of Pandoc
is shaped in the spirit of Markdown
(see @sec-spirit).
The Pandoc AST evolved from the needs of tracking Markdown.
It mainly keeps track of the structure of content,
and only to a limited extent handles layout:
import from layout-rich document formats reduces such information,
and export to layout-rich document formats
uses templates to effectively impose a normalized style.
APIs to supplant or extend workflow components {#sec-pandoc-apis}
Pandoc is designed with extension in mind,
not only in its programming design
where most code is externally exposed as public libraries
reusable by third-party projects,
but also at runtime through several APIs.
Styling (templates), import (parsing) and export (serialising)
are overridable.
The styling of the target document is isolated as one or more templates.
Parsing of source content and rendering of target document
are both isolated as sets of Reader and Writer routines.
All three --
template, import and export --
can be superseded at runtime,
either by passing command-line options
or by declaring options in document-wide metadata in the source file
(for the source text formats supporting that).
The AST itself is also exposed through a filter API,
allowing for intervening in the data processing workflow
after source import and before target export,
by directly mangling the AST.
The AST segments content into smallest type {#sec-pandoc-ast}
The Pandoc AST contains objects,
each representing an element of the content --
some object types containing lists of other element objects,
and some themselves holding a string of content.
A Reader function breaks content into element objects.
For example, this short Markdown text:
# Greeting
Hello, world!
...gets broken down into these seven elements:
- Pandoc: block, containing two block objects
  - Header: block, containing one inline object
    - Str: inline, representing the string "Greeting"
  - Para: block, containing three inline objects
    - Str: inline, representing the string "Hello,"
    - Space: inline, representing the string " "
    - Str: inline, representing the string "world!"
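
The same breakdown can be written out as data.
The following is a minimal sketch, assuming the JSON shape in which Pandoc exchanges the AST with filters (each element is an object with a type tag `t` and, where applicable, content `c`).
The top-level version field is omitted for brevity, the identifier `greeting` is assumed to be auto-generated by the Reader, and `count_elements` is a hypothetical helper written for this illustration.

```python
# Sketch of the AST for the two-line Markdown example above.
doc = {
    "meta": {},
    "blocks": [
        # Header content is [level, [identifier, classes, key-values], inlines].
        {"t": "Header", "c": [1, ["greeting", [], []],
                              [{"t": "Str", "c": "Greeting"}]]},
        # Para content is a list of inlines.
        {"t": "Para", "c": [{"t": "Str", "c": "Hello,"},
                            {"t": "Space"},
                            {"t": "Str", "c": "world!"}]},
    ],
}


def count_elements(node):
    """Count typed elements at or below `node` (hypothetical helper)."""
    if isinstance(node, dict):
        own = 1 if "t" in node else 0
        return own + sum(count_elements(value) for value in node.values())
    if isinstance(node, list):
        return sum(count_elements(item) for item in node)
    return 0  # plain strings and numbers are content, not elements


# The document root ("Pandoc") has no type tag, so it is counted separately.
total = 1 + count_elements(doc["blocks"])
```

Counting the root plus every tagged object gives the seven elements listed above.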
This structure shares traits
with the Document Object Model (DOM) for web pages
(which is not surprising,
given that Markdown is a superset of HTML).
Breaking inline strings at each space
is aligned with HTML behaviour for spaces,
where a line break requires explicit markup
and any amount of simple space is treated as one soft-break,
which means that it behaves as either a single word delimiting space
or as a line break if needed by the rendering medium.
For example, this Markdown text:
This is
just one
line.
...renders (regarding spacing) like this in an HTML or PDF document:
This is just one line.
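
The spacing behaviour can be illustrated with a small sketch.
It assumes that the Markdown Reader represents the two line endings inside the paragraph as `SoftBreak` inlines, and it uses a hypothetical helper, `render_plain`, standing in for a Writer that has no need to wrap lines.

```python
# Inlines assumed for the three-line paragraph above.
inlines = [
    {"t": "Str", "c": "This"}, {"t": "Space"}, {"t": "Str", "c": "is"},
    {"t": "SoftBreak"},
    {"t": "Str", "c": "just"}, {"t": "Space"}, {"t": "Str", "c": "one"},
    {"t": "SoftBreak"},
    {"t": "Str", "c": "line."},
]


def render_plain(inlines):
    """Render inlines as one line of text (hypothetical helper)."""
    parts = []
    for element in inlines:
        if element["t"] == "Str":
            parts.append(element["c"])
        elif element["t"] in ("Space", "SoftBreak"):
            # Both kinds of space collapse to a single word delimiter here;
            # a wrapping Writer could emit a line break for either instead.
            parts.append(" ")
    return "".join(parts)


rendered = render_plain(inlines)
```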
Grouping content into sets of inline elements within block elements
is aligned with the HTML distinction between blocks and inlines,
where content is either part of a relatively loose string
flowing within the constraints of a block
(inline elements)
or itself a block (block elements).
The Pandoc filter API,
which exposes the full AST,
makes it efficient and intuitive, for example, to traverse all blocks
(omitting inlines) and capitalize each word within each Header,
or to traverse only inlines and replace "http://" with "https://"
in the link target of all link elements.
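
Both example tasks can be sketched against the JSON form of the AST.
The code below is a minimal illustration, not a complete filter: `walk`, `capitalize_header`, `upgrade_link` and `run_filter` are hypothetical names, the sample document and its URL are made up for the test, and for simplicity the walker visits the whole tree rather than blocks or inlines only.
A real JSON filter would additionally read the document from standard input and write the result to standard output.

```python
def walk(node, action):
    """Apply `action` to every typed element, depth first."""
    if isinstance(node, list):
        for item in node:
            walk(item, action)
    elif isinstance(node, dict):
        if "t" in node:
            action(node)
        for value in node.values():
            walk(value, action)


def capitalize_str(element):
    if element["t"] == "Str":
        element["c"] = element["c"][:1].upper() + element["c"][1:]


def capitalize_header(element):
    # Header content is [level, attributes, inlines].
    if element["t"] == "Header":
        walk(element["c"][2], capitalize_str)


def upgrade_link(element):
    # Link content is [attributes, inlines, [target, title]].
    if element["t"] == "Link":
        target = element["c"][2]
        if target[0].startswith("http://"):
            target[0] = "https://" + target[0][len("http://"):]


def run_filter(doc):
    walk(doc["blocks"], capitalize_header)
    walk(doc["blocks"], upgrade_link)
    return doc


sample = {"meta": {}, "blocks": [
    {"t": "Header", "c": [1, ["", [], []], [
        {"t": "Str", "c": "a"}, {"t": "Space"},
        {"t": "Str", "c": "short"}, {"t": "Space"},
        {"t": "Str", "c": "greeting"}]]},
    {"t": "Para", "c": [
        {"t": "Link", "c": [["", [], []],
                            [{"t": "Str", "c": "example"}],
                            ["http://example.org/", ""]]}]},
]}
result = run_filter(sample)
```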
Since the full AST is available,
the filter API can also be used to reorganise content,
for example, to redefine plain text as belonging to a link,
or vice versa,
but if content is "put into the wrong boxes"
(for example, if parsed wrongly as is done in this project),
then the organisation of the AST
can become more of an obstacle than an optimization.
Complex features {#sec-pandoc-complex}
Some functionalities of Pandoc hook into multiple components
of its workflow.
An example of a complex functionality is citation handling,
which involves two sources, a style guide, and a template.
The source text contains citation annotations,
referencing document-wide or externally stored bibliographic metadata,
such as this:
"you can create code span by wrapping text in backtick quotes"
[@Gruber2004]
Pandoc uses a Markdown extension for citations
to capture the string [@Gruber2004],
looks up its embedded citation key Gruber2004
in a bibliographic database
(which might be locally maintained or hosted by a suitable cloud service),
and renders two or three texts:
- the replacement text where the citation was written
- a footnote on the same or a following page
- an entry in a bibliography chapter at the end of the text
The style of these renderings, and whether the second one is omitted,
depends on which citation style is chosen,
usually selected from the Citation Style Language (CSL) style repository,
a collection of more than 10,000 styles.
One common style is APA,
which says to include surname and year in the first text,
omit the second text,
and sort the entry in the third text by surname --
and it contains hundreds of other details,
for example, on how to abbreviate a citation with 16 authors,
or an entry without an author listed.
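
The interplay of annotation, metadata and style can be sketched in a few lines.
This is a toy illustration only: it does not reflect how Pandoc actually processes citation styles, the one-entry database is hypothetical (surname and year are taken from the citation key itself), and the author-date rendering is a rough approximation of the APA rules described above.

```python
import re

# Hypothetical bibliographic database, keyed by citation key.
BIBLIOGRAPHY = {
    "Gruber2004": {"surname": "Gruber", "year": 2004},
}

# Matches a single-key citation annotation such as [@Gruber2004].
CITATION = re.compile(r"\[@([A-Za-z0-9_]+)\]")


def render_citations(text, bibliography):
    """Replace citation annotations and collect the cited entries."""
    cited = []

    def replace(match):
        entry = bibliography[match.group(1)]
        if entry not in cited:
            cited.append(entry)
        # First text: surname and year where the citation was written.
        return f"({entry['surname']}, {entry['year']})"

    body = CITATION.sub(replace, text)
    # Third text: bibliography entries sorted by surname.
    # The second text (a footnote) is omitted, as in author-date styles.
    references = sorted(cited, key=lambda entry: entry["surname"])
    return body, references


body, references = render_citations(
    "wrap text in backtick quotes [@Gruber2004]", BIBLIOGRAPHY)
```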
Ideally, citations can be separated
into prose (annotation), structure (bibliographic metadata)
and layout concerns (composition),
with each concern anchored at one place in the workflow.
In reality, however, automated composition sometimes falls short --
for example, when the style chosen from Citation Style Language
is missing translations of some terms in your (natural) language.
The author may then need to violate the separation of concerns
(until the underlying issue is properly solved),
often done by adding compositional information
to the bibliographic metadata.
The filter API is either efficient or versatile {#sec-pandoc-filter-versatile}
Pandoc strives to maintain a separation of concerns
between content, structure and layout,
offering separate APIs for import, filtering and export,
but since some functionality crosses those boundaries,
this separation is not guaranteed,
and complex features are more likely to work correctly
with the most commonly used Reader and Writer components.
This is exacerbated when using wrapper tools
since their extensions are likely tailored to those same
most commonly used Reader and Writer components.
The Quarto framework, like Pandoc,
allows using any Reader.
However, Quarto-specific macro expansions
are only available when using the default Reader --
a custom derivative of the Pandoc default Markdown Reader.
Despite the Reader API being specifically tailored
for implementing a custom source format parser,
when integration with complex functions is relevant,
either within Pandoc itself or part of wrapper tools,
that API is effectively discouraged.
Since a core principle of Markdown is
to treat unknown markup as content,
a viable alternative for an extended Markdown format
is to use the regular Markdown Reader
and then "finish up" the parsing in a filter.
As an example, this Semantic Markdown text:
[This]{:depiction} is not [a pipe]{:depiction #the_object}.
...should ideally become this as regular Markdown:
This is not a pipe.
...but when misparsed as regular Markdown, it results in this Pandoc AST:
- Pandoc
  - Para
    - Str "[This]{:depiction}"
    - Space
    - Str "is"
    - Space
    - Str "not"
    - Space
    - Str "[a"
    - Space
    - Str "pipe]{:depiction"
    - Space
    - Str "#the_object}."
...where a filter would then look for {...}
and reorganise the AST to become this:
- Pandoc
  - Para
    - Str "This"
    - Space
    - Str "is"
    - Space
    - Str "not"
    - Space
    - Str "a"
    - Space
    - Str "pipe."
The first Str simply needs its content reduced to "This".
The trailing part is more involved, however,
since it spans three Str and two Space elements.
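
One way to handle annotations that span several elements is to glue each run of Str and Space elements back into text, strip the markup there, and split the result again.
The sketch below illustrates that idea under simplifying assumptions: it is not the implementation used in this project, it discards the annotation instead of interpreting it, it ignores nested brackets, and `ANNOTATION`, `to_inlines` and `strip_annotations` are hypothetical names.

```python
import re

# Matches a bracketed span followed by a brace annotation: [text]{...}
ANNOTATION = re.compile(r"\[([^\]]*)\]\{[^}]*\}")


def to_inlines(text):
    """Split plain text back into Str elements separated by Space."""
    inlines = []
    for index, word in enumerate(text.split(" ")):
        if index:
            inlines.append({"t": "Space"})
        if word:
            inlines.append({"t": "Str", "c": word})
    return inlines


def strip_annotations(inlines):
    """Remove [text]{...} markup from runs of Str and Space elements."""
    result, run = [], []

    def flush():
        if run:
            text = "".join(" " if element["t"] == "Space" else element["c"]
                           for element in run)
            result.extend(to_inlines(ANNOTATION.sub(r"\1", text)))
            run.clear()

    for element in inlines:
        if element["t"] in ("Str", "Space"):
            run.append(element)
        else:
            flush()  # any other inline ends the current run and is kept as-is
            result.append(element)
    flush()
    return result


misparsed = [
    {"t": "Str", "c": "[This]{:depiction}"}, {"t": "Space"},
    {"t": "Str", "c": "is"}, {"t": "Space"},
    {"t": "Str", "c": "not"}, {"t": "Space"},
    {"t": "Str", "c": "[a"}, {"t": "Space"},
    {"t": "Str", "c": "pipe]{:depiction"}, {"t": "Space"},
    {"t": "Str", "c": "#the_object}."},
]
cleaned = strip_annotations(misparsed)
```

Working on the joined text avoids tracking which Str holds which half of an annotation, at the cost of rebuilding every element in the run.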
Since this project prioritises improving existing tools
(see @sec-improve),
and an implementation based on the import API would limit use
with wrapper tools such as Quarto
(see @sec-pandoc-complex),
the approach chosen for this project is
an implementation using the filter API,
as covered next in @sec-filter.