
Misparsing and then cleaning up {#sec-misparsing}

Pandoc offers two ways to implement a syntax extension to Markdown: either a custom reader for the extended source format or a filter operating on its abstract syntax tree (AST). This project chose the latter approach.

As described in @sec-improve, a priority of this project is to improve existing tools rather than implement parallel, competing ones, even though the latter might be easier to do or lead to a simpler product. Pandoc offers an API specifically for implementing a custom source format (described in @sec-pandoc-apis), but as pointed out in @sec-pandoc-filter-versatile, that interface limits how the implementation can be combined with other extensions, notably those provided with the Quarto framework.

Markdown leniently tolerates broken markup (see @sec-spirit) -- what the parser cannot recognize as markup is simply treated as content. This project abuses that feature: Semantic Markdown is first deliberately misparsed as CommonMark, and the misparsed content is then parsed again through the filter API, this time according to the extended syntax.

The choice of Lua

This project is implemented in the scripting language Lua.

Pandoc filters can be written in any general-purpose language. Pandoc provides a JSON serialisation and parsing API for its AST, which some filters make use of. JSON-based Pandoc filters in active use have been found written in Haskell (the language of Pandoc itself), Python, Perl and Rust.

In recent years, Pandoc has embedded a Lua interpreter and offers the alternative of running Lua filters directly on the AST, without full serialisation and parsing and without system calls. In one simple test, a Lua implementation added an overhead of 2%, compared to 35% for a Haskell-based implementation using the JSON interface [@MacFarlane2025, section "Introduction"].

While efficiency is desirable, user convenience is prioritised in this project. General experience with the Pandoc-based Quarto framework indicates that non-Lua filters often require additional workarounds, such as placing the filter in a specific directory, adding a symbolic link or passing its full path, whereas Lua filters usually work when given only a relative path. That issue might be specific to the Quarto framework, but even setting it aside, Lua-based filters are quite common, and the documentation for writing them is more detailed than that for the legacy JSON-based interface.

Parsing tasks

The filter traverses the AST several times. First it processes PrefixDefinition blocks, and then it sifts through all inline content, cleaning up misparsed KeyWords.

For this Minimum Viable Product (see @sec-phase1), dropping unneeded block-level elements before processing inline ones is slightly simpler and slightly more efficient. More importantly, however, planned future work (see @sec-rdf) needs the information gathered from PrefixDefinition blocks in order to process KeyWords.
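
The ordering of passes can be illustrated with a small sketch. It relies on two documented behaviours of Pandoc Lua filters: a filter file may return a list of filter tables, which are applied one after another, and a filter function that returns an empty list deletes the element. Everything else is an assumption made for illustration: the Turtle-like `@prefix` syntax, the helper names `collect_prefix` and `expand_keyword`, and the modelling of blocks as plain tables with a `text` field (a real Para element holds a list of inlines in `content`). This is not the project's actual filter.

```lua
-- Shared state: filled by the first pass, read by the second.
local prefixes = {}

-- Pass 1 (block level): record a hypothetical prefix definition and
-- drop the block. Blocks are modelled here as { t = "Para", text = "..." }.
local function collect_prefix(block)
  local name, iri = block.text:match("^@prefix%s+(%w+):%s+<(.-)>$")
  if name then
    prefixes[name] = iri
    return {}   -- in a Pandoc Lua filter, an empty list deletes the element
  end
  return nil    -- nil leaves the element unchanged
end

-- Pass 2 (inline level): expand a prefixed name using the table
-- gathered in pass 1 (hypothetical stand-in for KeyWord processing).
local function expand_keyword(str)
  local name, rest = str.text:match("^(%w+):(%w+)$")
  if name and prefixes[name] then
    return { t = "Str", text = prefixes[name] .. rest }
  end
  return nil
end

-- Filters listed first run to completion before later ones start.
local passes = {
  { Para = collect_prefix },
  { Str = expand_keyword },
}
-- A filter file would end with: return passes
```

Because the first table is applied to the whole document before the second one starts, the `prefixes` table is complete by the time any inline element is visited.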

Keeping track of enclosure states

The main part of the filter is the callback function Statements(), which Pandoc calls for each block of the AST. The function iterates through the content elements of the block, keeping track of whether the current position is free of annotation-related enclosures, inside the content part of an enclosure, or inside its annotation part.
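
The three states can be sketched as a small state machine. The sketch below is not the Statements() function itself: it works on a pre-split list of plain strings rather than Pandoc inline elements, and the delimiters `[`, `]{` and `}` as well as the function name `scan` are hypothetical placeholders for whatever the extended syntax actually uses.

```lua
-- The three enclosure states described in the text.
local FREE, CONTENT, ANNOTATION = "free", "content", "annotation"

-- Walk a list of text tokens and group them into plain text and
-- (content, annotation) pairs. Delimiters are hypothetical:
-- "[" opens the content part, "]{" switches to the annotation part,
-- "}" closes the enclosure.
local function scan(tokens)
  local state = FREE
  local out, content, annotation = {}, {}, {}
  for _, tok in ipairs(tokens) do
    if state == FREE and tok == "[" then
      state, content = CONTENT, {}
    elseif state == CONTENT and tok == "]{" then
      state, annotation = ANNOTATION, {}
    elseif state == ANNOTATION and tok == "}" then
      out[#out + 1] = {
        kind = "annotated",
        content = table.concat(content, " "),
        annotation = table.concat(annotation, " "),
      }
      state = FREE
    elseif state == CONTENT then
      content[#content + 1] = tok
    elseif state == ANNOTATION then
      annotation[#annotation + 1] = tok
    else
      out[#out + 1] = { kind = "plain", text = tok }
    end
  end
  -- The final state tells the caller whether an enclosure was left open.
  return out, state
end
```

Returning the final state alongside the result is one possible design choice: a value other than `"free"` signals an unterminated enclosure, which the caller can then treat as ordinary content in keeping with Markdown's tolerance for broken markup.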