## Observations from implementation work

Work on implementing `sem-md` uncovered a few observations that are not specific to the problem of annotations and Markdown, but are nevertheless considered relevant for others arriving at a similar combination of design choices.

### Zero is true in Lua

Programming in Lua has been a new experience with a fairly low learning curve, once it was discovered (after a day of debugging) that zero is a true value in a boolean context, an oddity compared to many other dynamically typed languages.

### Backwards compatibility

Some Pandoc features are unavailable in older Pandoc releases still in widespread use, which affects the coding style of `sem-md`.

As a concrete example, the function `pandoc.List:at()` provides a convenient way to inspect the contents of a collection of Pandoc elements, but the function has only been available since Pandoc 3.5, released in November 2024. Pandoc is freely available for download from the author's website, but is commonly installed via general-purpose software distributions. The largest software distribution, Debian (supplying both directly to users and to the popular distribution Ubuntu, among others), currently provides Pandoc 3.1.11.1 and is not likely to update in the coming 2-3 years, partly due to the conservative release cycle of Debian, and partly due to complex dependency chains of Haskell libraries [@Debian2025; @Smedegaard2025].

As a balance between keeping code simple and covering a larger user base, the filter implementation isolates calls to `pandoc.List:at()` in a local function `TableEmpty()`, with more tedious fallback code for older releases of Pandoc, and resolves once at startup a static boolean `PANDOC_IS_OLD` to signal whether the legacy code is required. When newer Pandoc releases become more widespread, uses of `PANDOC_IS_OLD` can be dropped.

Currently only one small function uses this mechanism.
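
As an illustration of the version gate described above, a minimal sketch in plain Lua could look as follows. It is not the actual `sem-md` code: the helper names `version_lt` and `item_at` are hypothetical, the running version is hard-coded where a real filter would read the global `PANDOC_VERSION`, and the fallback only mimics the documented behaviour of `pandoc.List:at()` (the element at an index, counting from the end for negative indices, or a default).

```lua
-- Hypothetical sketch, runnable in plain Lua; not the actual sem-md code.

-- Return true if version `a` sorts before version `b`.
-- Both are arrays of integers, e.g. {3, 1, 11, 1} and {3, 5}.
local function version_lt(a, b)
  for i = 1, math.max(#a, #b) do
    local x, y = a[i] or 0, b[i] or 0
    if x ~= y then
      return x < y
    end
  end
  return false
end

-- Resolved once at startup. Inside Pandoc the version would come from
-- the global PANDOC_VERSION; the literal below is a stand-in (assumption).
local running_version = {3, 1, 11, 1}
local PANDOC_IS_OLD = version_lt(running_version, {3, 5})

-- Hypothetical helper: element at index `i`, or `default` if absent.
local function item_at(list, i, default)
  if not PANDOC_IS_OLD and list.at then
    -- Convenient path on newer Pandoc releases.
    return list:at(i, default)
  end
  -- Legacy path: negative indices count back from the last item.
  if i < 0 then
    i = #list + i + 1
  end
  local value = list[i]
  if value == nil then
    return default
  end
  return value
end
```
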
The mechanism has been established mainly for the next phases, which are expected to rely more heavily on newer libraries where possible (see @sec-rdf).

### Improved testing

The code as currently implemented does not allow for unit testing. It would be helpful to reorganize the codebase into smaller functions that could each be covered as units.

The commonmark-spec project includes a large testsuite. It would certainly be useful to check the filter against that testsuite, to ensure no breakage of conventional Markdown markup. Useful here might be a draft work on rewriting the Semantic Markdown specification as an extension to the CommonMark specification [@Smedegaard2022cmark].

### Vague limiting of implementation scope {#sec-experimental-test}

One test failure (see @sec-test-annotation-block) revealed an issue that is either a bug in the code or a bug in the testsuite. This project attempted to set up limits for the intended use of the draft Semantic Markdown specification, but those constraints did not take into account that the specification describes features beyond what was intended for this project.

## Future works {#sec-future}

The central design choice of this project, implementation as a filter, would be interesting to investigate further.

This work is part of a series of works to integrate semantic text annotations with creative textual interactions. Concretely building upon this work are projects extending the same codebase to support extracting or converting annotations.

Also, beyond this concrete project and its planned extensions, the ability to author semantic text annotations and process them allows for improvements in related workflows, including interactive collaborative authoring and streamlining of automated document layout.

### Implementation as import extension

This project has been implemented as a cleanup filter for a misparsing of Semantic Markdown as regular Markdown (see @sec-misparsing).
This approach was chosen based on an assumption that it fits better with existing uses of Pandoc, notably integrated with the Quarto framework. An interesting task would be to challenge that assumption by implementing Semantic Markdown support as a Pandoc import extension and comparing usability and efficiency. It might also be interesting to compare code reliability: although the code size might be larger, an import extension would potentially contain far fewer quirks, since it fundamentally iterates through well-balanced enclosures rather than fixing breakage as the misparsing cleanup filter does.

### This work as basis for others

Authors interested in authoring with semantic annotations have been discouraged by the lack of tools supporting it (beyond specific narrow scopes like hypermedia and citations). Tool developers have been discouraged from implementing general-purpose support for semantic annotations, probably by a perceived lack of demand for it. This work can be seen as an initial solution to that chicken-and-egg problem; further work is needed to evaluate the usability of authoring as Markdown with this semantic extension, and to assess whether there is an interest in this approach to authoring.

Parallel to that, follow-up works are planned to extend this Pandoc-based setup to not only tolerate semantic annotations in source but also take them into account when parsing and rendering output documents.

### Parsing and rendering annotations {#sec-rdf}

This work has been phase 1 of a larger set of projects, all building on the same codebase, illustrated in @fig-phases.

![A filter to strip, extract or translate annotations.](workflow/phases.svg){#fig-phases}

Phase 2 will extend the Pandoc filter to support extraction of annotations, storing them as an initial document-wide metadata section. Phase 3 will extend the Pandoc filter further to support translating annotations when rendering as HTML or PDF.
The annotation syntax for the current phase 1 of these projects was chosen in anticipation of phases 2 and 3, as it aligns with established annotation formats supported by the PDF and HTML document formats: PDF documents support metadata and text-specific annotations stored as XMP [@PDFAssociation2020, chapter 14.3; @Adobe2012, p. 9]; HTML documents support text-specific annotations stored as RDFa [@Herman2015]. Both XMP and RDFa are concrete formulations (serialisations) of Resource Description Framework (RDF), an abstract language for expressing semantics.

Both phases 2 and 3 will involve a more intimate parsing of annotations. Each annotation statement needs to be parsed into abstract RDF and then serialised into a concrete form suitable for Markdown (document-wide), HTML (text-specific) or PDF (document-wide and/or text-specific), respectively.

These extensions to the Pandoc-based workflow have uses in themselves, and also enable further explorations into more complex workflows.

### Generalizing Quarto metadata {#sec-quarto}

Quarto, a document authoring framework using Pandoc to render academic papers, includes a sometimes quite elaborate restructuring and layout of author and publisher metadata. Currently this processing is done inconsistently across target formats, and even for formats like HTML and PDF that support RDF-based metadata (as described in @sec-rdf), the information is only laid out visually, with the elaborately prepared structure not preserved.

A Pandoc filter could be written, or this filter extended, to embed structured data as RDF for target formats supporting it.

Also, Pandoc could extend its AST with block- and inline-specific metadata (in addition to the existing document-wide metadata).
Such a change, and consequential refinements of default Pandoc templates encouraging more normalized structures (for example, about authors and publishers), might reduce the amount of custom restructuring needed downstream, for example in Quarto.

### Alternative implementations

It might be interesting to try implementing Semantic Markdown as a Pandoc Reader, both to see if that can be done notably simpler, and also to explore possibilities of use with wrapper tools such as Quarto. In other words, to try to prove the pessimistic usability analysis in @sec-pandoc-filter-versatile wrong.

It might be helpful, both for authors in more widely enabling semantic text annotations and for testing the language design of the Semantic Markdown specification, to try implementing Semantic Markdown for other extensible Markdown parsers.

It might also be fun to try implementing Semantic Markdown as a patch for djot, a language derived from Markdown with the specific aim of being less ambiguous [@MacFarlane2024djot].

### Nuanced citations in scholarly papers

Pandoc and Quarto (see @sec-quarto) support annotating scholarly citations. Recent work on annotating contextualisations of citations, as presented in @Daquino2023, however requires more hinting than is currently easily achieved with Pandoc and Quarto tooling. Such tooling can likely leverage this work as well as the planned next phases.

## Conclusion

This project demonstrates that it is possible to extend the Markdown processor Pandoc to cover an extension of the Markdown language, using the filter API of Pandoc.

Further work is needed to stabilize the implemented filter, as shown by several severe bugs revealed by testing the code.

Future works are planned to extend this work, and suggestions are made for other works that may build upon it.