Observations from implementation work
Implementing sem-md
led to a few observations
that are not specific to the problem of annotations and Markdown
but may nonetheless be relevant
for others arriving at a similar combination of design choices.
Zero is true in Lua
Programming in Lua was a new experience
but with a fairly gentle learning curve,
once the following oddity had been discovered (after a day of debugging):
unlike in many other dynamically typed languages,
zero is a true value in a boolean context.
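
The pitfall can be illustrated with a minimal sketch in plain Lua.
The helper name `is_truthy` and the variable `matches` are hypothetical and not taken from sem-md;
the sketch only shows that a test on a count must compare explicitly against zero.

```lua
-- In Lua, only nil and false are falsy; 0 and "" are both truthy.
local function is_truthy(value)
  if value then
    return true
  end
  return false
end

-- Hypothetical example: an empty list of matches.
local matches = {}

-- Pitfall: #matches is 0, which is truthy, so this branch is always taken.
local naive_has_matches = is_truthy(#matches)

-- Explicit comparison gives the intended result.
local has_matches = #matches > 0

print(naive_has_matches, has_matches)
```

Comparing explicitly against zero (or against nil, where absence is meant) avoids the problem.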
Backwards compatibility
Some Pandoc features
are unavailable in older Pandoc releases still in widespread use,
which affects the coding style of sem-md.
As a concrete example,
the function pandoc.List:at()
provides a convenient way
to inspect the contents of a collection of Pandoc elements,
but the function has only been available since Pandoc 3.5,
released in November 2024.
Pandoc is freely available for download from the author's website,
but is commonly installed via general-purpose software distributions.
The largest software distribution, Debian
(supplying packages both directly to users
and to the popular distribution Ubuntu, among others),
currently provides Pandoc 3.1.11.1
and is not likely to update in the coming 2-3 years,
partly due to the conservative release cycle of Debian,
and partly due to complex dependency chains of Haskell libraries
[@Debian2025; @Smedegaard2025].
As a balance between keeping code simple and covering a larger user base,
the filter implementation isolates calls to pandoc.List:at()
in a local function TableEmpty()
with more tedious fallback code for older releases of Pandoc,
and resolves a static boolean PANDOC_IS_OLD once at startup
to signal whether the legacy code is required.
When newer Pandoc releases become more widespread,
PANDOC_IS_OLD and the legacy code it guards
can be dropped.
Currently only one small function uses this mechanism.
It has been established mainly for the next phases,
which are expected to rely more heavily on newer libraries where possible
(see @sec-rdf).
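
The mechanism described above can be sketched roughly as follows.
This is an illustration under stated assumptions, not the code of sem-md:
the helpers `is_older_than` and `element_at` are hypothetical names,
the actual TableEmpty() function is not reproduced here,
and the sketch assumes only that the global PANDOC_VERSION can be indexed numerically
and that pandoc.List:at() takes an index and an optional default.
When run outside Pandoc, where PANDOC_VERSION is undefined, the sketch treats the environment as old.

```lua
-- Compare a numerically indexed version (like PANDOC_VERSION)
-- against a minimum such as {3, 5}.
local function is_older_than(version, minimum)
  for i = 1, #minimum do
    local have = version[i] or 0
    if have ~= minimum[i] then
      return have < minimum[i]
    end
  end
  return false
end

-- Resolved once at startup; falls back to "old" when not run by Pandoc.
local PANDOC_IS_OLD = is_older_than(PANDOC_VERSION or {0}, {3, 5})

-- Hypothetical wrapper isolating the one call that needs a newer Pandoc.
local function element_at(list, index, default)
  if not PANDOC_IS_OLD then
    return list:at(index, default)
  end
  -- Legacy fallback: negative indices count back from the last item.
  if index < 0 then
    index = #list + index + 1
  end
  local item = list[index]
  if item == nil then
    return default
  end
  return item
end

print(element_at({"a", "b", "c"}, -1))
```

Keeping the version check in one boolean and the version-dependent call in one wrapper
means that dropping legacy support later only touches these two places.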
Improved testing
The code as currently implemented
does not allow for unit testing.
It would be helpful to reorganize the codebase
into smaller functions that each could be covered as units.
The commonmark-spec project includes a large test suite.
It would certainly be useful
to check the filter against that test suite,
to ensure no breakage of conventional Markdown markup.
A draft rewrite of the Semantic Markdown specification
as an extension to the CommonMark specification
might be useful here
[@Smedegaard2022cmark].
Vague limiting of implementation scope {#sec-experimental-test}
One test failure (see @sec-test-annotation-block)
revealed an issue that is a bug either in the code or in the test suite.
This project attempted to set up limits
for the intended use of the draft Semantic Markdown specification,
but those constraints did not take into account that the specification
describes features beyond what was intended for this project.
Future work {#sec-future}
The central design choice of this project,
implementation as a filter,
would be interesting to investigate further.
This work is part of a series of works
to integrate semantic text annotations with creative textual interactions.
Concretely building upon this work are projects extending the same codebase
to support extracting or converting annotations.
Also,
beyond this concrete project and its planned extensions,
the ability to author semantic text annotations and process them
allows for improvements in related workflows,
including interactive collaborative authoring
and streamlining of automated document layout.
Implementation as import extension
This project has been implemented as a cleanup filter
for a misparsing of Semantic Markdown as regular Markdown
(see @sec-misparsing).
This approach was chosen based on the assumption
that it fits better with existing uses of Pandoc,
notably integrated with the Quarto framework.
An interesting task would be to challenge that assumption,
by implementing Semantic Markdown as a Pandoc import extension
and comparing usability and efficiency.
It might also be interesting to compare code reliability:
Although the code size might be larger,
an import extension would potentially contain far fewer quirks
because it fundamentally iterates through well-balanced enclosures
rather than fixing breakage as the misparser cleanup filter does.
This work as basis for others
Authors interested in authoring with semantic annotations
have been discouraged by the lack of tools supporting it
(beyond specific narrow scopes like hypermedia and citations).
Tool developers have been discouraged from implementing general-purpose support
for semantic annotations, probably by a perceived lack of demand for it.
This work can be seen as an initial solution to that chicken-and-egg problem;
further work is needed to evaluate
the usability of authoring as Markdown with this semantic extension,
and to assess whether there is an interest in this approach to authoring.
Parallel to that,
follow-up work is planned to extend this Pandoc-based setup
to not only tolerate semantic annotations in the source
but also take them into account when parsing and rendering output documents.
Parsing and rendering annotations {#sec-rdf}
This work has been phase 1 of a larger set of projects,
all building on the same codebase,
illustrated in @fig-phases.
{#fig-phases}
Phase 2 will extend the Pandoc filter
to support extraction of annotations,
storing them as an initial document-wide metadata section.
Phase 3 will extend the Pandoc filter further
to support translating annotations
when rendering as HTML or PDF.
The annotation syntax for the current phase 1 of these projects
was chosen in anticipation of phases 2 and 3,
as it aligns with established annotation formats
supported by the PDF and HTML document formats:
PDF documents support metadata and text-specific annotations stored as XMP
[@PDFAssociation2020, chapter 14.3; @Adobe2012, p. 9];
HTML documents support text-specific annotations stored as RDFa
[@Herman2015].
Both XMP and RDFa are concrete formulations (serialisations)
of the Resource Description Framework (RDF),
an abstract language for expressing semantics.
Both phases 2 and 3 will involve a more detailed parsing of annotations.
Each annotation statement needs to be parsed into abstract RDF
and then serialised into a concrete form suitable
for each of
Markdown (document-wide),
HTML (text-specific)
and PDF (document-wide and/or text-specific),
respectively.
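
As a rough illustration of the serialisation step for HTML,
the following sketch turns one RDF statement with a literal object into an RDFa-annotated span.
It is not part of sem-md:
the table layout of `triple`, the example IRIs and the helper names `escape_html` and `to_rdfa_span`
are all assumptions made for this example,
and the parsing of Semantic Markdown annotations into such a table is left out entirely.

```lua
-- Hypothetical in-memory form of one RDF statement (subject, predicate, literal object).
local triple = {
  subject = "https://example.org/doc#intro",
  predicate = "http://purl.org/dc/terms/creator",
  object = "Jane Doe",
}

-- Escape the characters that are unsafe in HTML text and attribute values.
local function escape_html(text)
  local replacements = {
    ["&"] = "&amp;",
    ["<"] = "&lt;",
    [">"] = "&gt;",
    ['"'] = "&quot;",
  }
  -- Parentheses drop the substitution count returned by gsub.
  return (text:gsub('[&<>"]', replacements))
end

-- Serialise the statement as an HTML span carrying RDFa attributes:
-- "about" names the subject, "property" the predicate,
-- and the text content of the span is the literal object.
local function to_rdfa_span(t)
  return string.format(
    '<span about="%s" property="%s">%s</span>',
    escape_html(t.subject),
    escape_html(t.predicate),
    escape_html(t.object)
  )
end

print(to_rdfa_span(triple))
```

A document-wide serialisation for Markdown or PDF would start from the same abstract statements
but emit a metadata block instead of inline markup.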
These extensions to the Pandoc-based workflow are useful in themselves,
and also enable further explorations into more complex workflows.
Generalizing Quarto metadata {#sec-quarto}
Quarto,
a document authoring framework using Pandoc to render academic papers,
includes a sometimes quite elaborate restructuring and layout
of author and publisher metadata.
Currently this processing is done inconsistently across target formats,
and even for formats like HTML and PDF that support RDF-based metadata
(as described in @sec-rdf),
the information is only laid out visually,
with the elaborately prepared structure not preserved.
A Pandoc filter could be written,
or this filter extended,
to embed structured data as RDF
for target formats supporting it.
Also,
Pandoc could extend its AST with block- and inline-specific metadata
(in addition to the existing document-wide metadata).
Such a change, and consequent refinements of the default Pandoc templates
encouraging more normalized structures
(for example, for authors and publishers),
might reduce the amount of custom restructuring
needed downstream, for example, in Quarto.
Alternative implementations
It might be interesting to implement Semantic Markdown as a Pandoc reader,
both to see if that can be done notably simpler,
and also to explore possibilities of use with wrapper tools such as Quarto.
In other words,
try to prove the pessimistic usability analysis in @sec-pandoc-filter-versatile wrong.
It might be helpful,
both for authors in more widely enabling semantic text annotations
and for testing the language design of the Semantic Markdown specification,
to implement Semantic Markdown
for other extensible Markdown parsers.
It might also be fun to implement Semantic Markdown as a patch for djot,
a language derived from Markdown
with the specific aim of being less ambiguous
[@MacFarlane2024djot].
Nuanced citations in scholarly papers
Pandoc and Quarto (see @sec-quarto) support annotating scholarly citations.
Recent work on annotating contextualisations of citations,
as presented in @Daquino2023,
however requires further hinting than is currently easily achieved
with Pandoc and Quarto tooling.
Such tooling can likely build on this work as well as the planned next phases.
Conclusion
This project demonstrates
that it is possible to extend the Markdown processor Pandoc
to cover an extension of the Markdown language,
using the filter API of Pandoc.
Further work is needed to stabilize the implemented filter,
as shown by several severe bugs revealed by testing the code.
Follow-up projects are planned to extend this work,
and suggestions are made for other work that may build upon it.