Pandoc is a versatile and extensible document converter, able to parse from and serialise to many Markdown dialects as well as equivalent subsets of other text markup languages including HTML and (via LaTeX or other intermediaries) PDF. Pandoc offers several APIs for adapting its content processing as well as a templating structure for styling of output.

This chapter provides an analysis of Pandoc, with a focus on ways to extend it to support a new Markdown extension.

## Separation of concerns

The data processing workflow of Pandoc is shaped in the spirit of Markdown (see @sec-spirit). The Pandoc AST evolved from the needs of tracking Markdown. It mainly keeps track of the structure of content, and handles layout only to a limited extent: import from layout-rich document formats reduces such information, and export to layout-rich document formats uses templates to effectively impose a normalized style.

## APIs to supplant or extend workflow components {#sec-pandoc-apis}

Pandoc is designed with extension in mind, not only in its programming design, where most code is externally exposed as public libraries reusable by third-party projects, but also at runtime through several APIs.

Styling (templates), import (parsing) and export (serialising) are overridable. The styling of the target document is isolated as one or more templates. Parsing of source content and rendering of the target document are isolated as sets of Reader and Writer routines, respectively. All three -- template, import and export -- can be superseded at runtime, either by passing command-line options or by declaring options in document-wide metadata in the source file (for the source text formats supporting that).

The AST itself is also exposed through a filter API, which allows intervening in the data processing workflow after source import and before target export by directly manipulating the AST.
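
To illustrate where a filter sits in the workflow, the following is a minimal sketch of a JSON filter, assuming the JSON serialisation of the Pandoc AST in which each element is an object with a type tag `t` and optional content `c`. The names `walk`, `run_filter` and `shout` are hypothetical helpers written for this illustration, not part of Pandoc, and the API version number in the sample document is only an illustrative value.

```python
import io
import json
import sys


def walk(node, action):
    """Recursively visit every element of a Pandoc JSON AST.

    `action` receives each element (any dict with a "t" key) and returns
    a replacement element, or None to keep the element unchanged.
    """
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        if "t" in node:
            replacement = action(node)
            if replacement is not None:
                node = replacement
        return {key: walk(value, action) for key, value in node.items()}
    return node  # strings and numbers are leaves


def run_filter(action, source=sys.stdin, target=sys.stdout):
    """Read a JSON AST from `source`, transform it, write it to `target`."""
    document = json.load(source)
    json.dump(walk(document, action), target)


def shout(element):
    """Example action (hypothetical): upper-case the text of every Str."""
    if element["t"] == "Str":
        return {"t": "Str", "c": element["c"].upper()}
    return None


# Demonstration on an in-memory document instead of stdin and stdout.
sample = {
    "pandoc-api-version": [1, 23, 1],  # illustrative value
    "meta": {},
    "blocks": [
        {"t": "Para", "c": [
            {"t": "Str", "c": "Hello,"},
            {"t": "Space"},
            {"t": "Str", "c": "world!"},
        ]}
    ],
}
output = io.StringIO()
run_filter(shout, io.StringIO(json.dumps(sample)), output)
result = json.loads(output.getvalue())
```

Used as an actual filter, the script would instead end with a plain `run_filter(shout)` call, so that Pandoc can pipe the AST through it between Reader and Writer.
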
## The AST segments content into smallest type {#sec-pandoc-ast}

The Pandoc AST contains objects, each representing an element of the content -- some object types contain lists of other element objects, and some themselves hold a string of content. A Reader function breaks content into element objects.

For example, this short Markdown text:

```markdown
# Greeting

Hello, world!
```

...gets broken down into these seven elements:

* `Pandoc` block, containing two block objects
    * `Header` block, containing one inline object
        * `Str` inline, representing the string "Greeting"
    * `Para` block, containing three inline objects
        * `Str` inline, representing the string "Hello,"
        * `Space` inline, representing the string " "
        * `Str` inline, representing the string "world!"

This structure has similar traits to the Document Object Model (DOM) for web pages (which is not surprising, given that Markdown is a superset of HTML).

Breaking inline strings at each space is aligned with HTML behaviour for spaces, where a line break requires explicit markup and any amount of simple space is treated as one soft-break, which means that it behaves either as a single word-delimiting space or as a line break if needed by the rendering medium.

For example, this Markdown text:

```markdown
This is
just    one
line.
```

Renders (regarding spacing) like this in an HTML or PDF document:

> This is just one line.

Grouping content into sets of inline elements within block elements is aligned with the HTML distinction between blocks and inlines, where content is either part of a relatively loose string flowing within the constraints of a block (inline elements) or itself a block (block elements).

The Pandoc filter API, which exposes the full AST, is efficient and intuitive for tasks such as traversing all blocks (omitting inlines) to capitalize each word within each `Header`, or traversing only inlines to replace "http://" with "https://" in the link target of all link elements.
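
The two traversals mentioned above can be sketched as follows, again assuming the JSON serialisation of the AST, where a `Header` holds `[level, attr, inlines]` and a `Link` holds `[attr, inlines, [url, title]]`. All function names are hypothetical, and for simplicity this sketch visits every element rather than limiting the traversal to blocks or to inlines.

```python
def walk(node, action):
    """Visit every element; `action` returns a replacement or None."""
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        if "t" in node:
            replacement = action(node)
            if replacement is not None:
                node = replacement
        return {key: walk(value, action) for key, value in node.items()}
    return node


def capitalize_str(element):
    """Upper-case the first character of a Str element."""
    if element["t"] == "Str":
        text = element["c"]
        return {"t": "Str", "c": text[:1].upper() + text[1:]}
    return None


def capitalize_headers(element):
    """Capitalize each word within each Header, leaving other blocks alone."""
    if element["t"] == "Header":
        level, attr, inlines = element["c"]
        return {"t": "Header", "c": [level, attr, walk(inlines, capitalize_str)]}
    return None


def secure_links(element):
    """Replace "http://" with "https://" in the target of each Link."""
    if element["t"] == "Link":
        attr, inlines, (url, title) = element["c"]
        if url.startswith("http://"):
            url = "https://" + url[len("http://"):]
        return {"t": "Link", "c": [attr, inlines, [url, title]]}
    return None


# A small hand-written AST fragment: one header and one paragraph with a link.
blocks = [
    {"t": "Header", "c": [1, ["a-greeting", [], []], [
        {"t": "Str", "c": "a"},
        {"t": "Space"},
        {"t": "Str", "c": "greeting"},
    ]]},
    {"t": "Para", "c": [
        {"t": "Str", "c": "see"},
        {"t": "Space"},
        {"t": "Link", "c": [["", [], []],
                            [{"t": "Str", "c": "site"}],
                            ["http://example.org", ""]]},
    ]},
]
blocks = walk(walk(blocks, capitalize_headers), secure_links)
```

Each action inspects only the element type it cares about, which is what makes this kind of filter compact: the segmentation done by the Reader has already placed the content in the right boxes.
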
Since the full AST is available, the filter API can also be used to reorganise content, for example, to redefine plain text as belonging to a link, or vice versa. However, if content is "put into the wrong boxes" (for example, if parsed wrongly, as is done in this project), then the organisation of the AST can become more of an obstacle than an optimization.

## Complex features {#sec-pandoc-complex}

Some functionalities of Pandoc hook into multiple components of its workflow.

An example of a complex functionality is citation handling, which involves two sources, a style guide, and a template. The source text contains citation annotations, referencing document-wide or externally stored bibliographic metadata, such as this:

```markdown
"you can create code span by wrapping text in backtick quotes" [@Gruber2004]
```

Pandoc uses a Markdown extension for citations to capture the string `[@Gruber2004]`, looks up its embedded citation key `Gruber2004` in a bibliographic database (which might be locally maintained or provided by a suitable cloud service), and renders two or three texts:

1. the replacement text where the citation was written
2. a footnote on the same or a following page
3. an entry in a bibliography chapter at the end of the text

The style of these renderings, and whether the second one is omitted, depends on which citation style is chosen, usually selected from Citation Style Language, a collection of more than 10,000 styles. One common style is APA, which says to include surname and year in the first text, omit the second text, and sort the entry in the third text by surname -- and contains hundreds of other details, for example, on how to abbreviate a citation with 16 authors, or an entry without an author listed.

Ideally, citations can be separated into prose (annotation), structure (bibliographic metadata) and layout concerns (composition), with each concern anchored at one place in the workflow.
In reality, however, automated composition sometimes falls short -- for example, the style chosen from Citation Style Language may lack translations of some terms in the author's (natural) language. The author may then need to violate the separation of concerns (until the underlying issue is properly solved), often by adding compositional information to the bibliographic metadata.

## The filter API is either efficient or versatile {#sec-pandoc-filter-versatile}

Pandoc strives to maintain a separation of concerns between content, structure and layout, offering separate APIs for import, filtering and export. Since some functionality crosses those boundaries, however, this separation is not guaranteed, and complex features are more likely to work correctly with the most commonly used Reader and Writer components. This is exacerbated when using wrapper tools, since their extensions are likely tailored to those same most commonly used Reader and Writer components.

The Quarto framework, like Pandoc, allows using any Reader. However, Quarto-specific macro expansions are only available when using the default Reader -- a custom derivative of the Pandoc default Markdown Reader.

Although the Reader API is specifically tailored for implementing a custom source format parser, that API is effectively discouraged when integration with complex functions is relevant, whether those functions are part of Pandoc itself or of wrapper tools.

Since a core principle of Markdown is to treat unknown markup as content, a viable alternative for an extended Markdown format is to use the regular Markdown Reader and then "finish up" the parsing in a filter.

As an example, this Semantic Markdown text:

```markdown
[This]{:depiction} is not [a pipe]{:depiction #the_object}.
```

...should ideally become this as regular Markdown:

```markdown
This is not a pipe.
```

...but misparsed as regular Markdown would result in this Pandoc AST:

* `Pandoc`
    * `Para`
        * `Str` "[This]{:depiction}"
        * `Space`
        * `Str` "is"
        * `Space`
        * `Str` "not"
        * `Space`
        * `Str` "[a"
        * `Space`
        * `Str` "pipe]{:depiction"
        * `Space`
        * `Str` "#the_object}."

...where a filter would then look for `{...}` and reorganise the AST to become this:

* `Pandoc`
    * `Para`
        * `Str` "This"
        * `Space`
        * `Str` "is"
        * `Space`
        * `Str` "not"
        * `Space`
        * `Str` "a"
        * `Space`
        * `Str` "pipe."

The first `Str` simply needs its content reduced to "This". The trailing part is more involved, however, since it spans three `Str` and two `Space` elements.

Since this project prioritises improving existing tools (see @sec-improve), and an implementation using the import API limits use with wrapper tools such as Quarto (see @sec-pandoc-complex), the approach chosen for this project is an implementation using the filter API, as covered next in @sec-filter.
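
The reorganisation step described above can be sketched in a few lines, assuming the JSON serialisation of the AST. This is not the implementation developed in this project: the sketch simply discards each `{...}` annotation instead of interpreting it, only handles runs of `Str` and `Space` elements, and uses the hypothetical helper names `to_inlines` and `strip_annotations`.

```python
import re

# "[text]{annotation}" where neither part contains its closing delimiter.
ANNOTATION = re.compile(r"\[([^\]]*)\]\{[^}]*\}")


def to_inlines(text):
    """Split plain text back into Str and Space elements."""
    inlines = []
    for index, word in enumerate(text.split(" ")):
        if index > 0:
            inlines.append({"t": "Space"})
        if word:
            inlines.append({"t": "Str", "c": word})
    return inlines


def strip_annotations(inlines):
    """Rejoin runs of Str/Space, drop annotations, and re-segment the text."""
    result, run = [], []

    def flush():
        if run:
            text = ANNOTATION.sub(r"\1", "".join(run))
            result.extend(to_inlines(text))
            run.clear()

    for inline in inlines:
        if inline["t"] == "Str":
            run.append(inline["c"])
        elif inline["t"] == "Space":
            run.append(" ")
        else:
            flush()  # any other inline ends the current run
            result.append(inline)
    flush()
    return result


# The misparsed paragraph content from the example above.
words = ["[This]{:depiction}", "is", "not", "[a",
         "pipe]{:depiction", "#the_object}."]
misparsed = to_inlines(" ".join(words))
cleaned = strip_annotations(misparsed)
```

Rejoining the run before matching is what handles the trailing case: the annotation that was split across three `Str` and two `Space` elements becomes a single string again, so one pattern covers both annotations.
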