A Markdown-formatted document should be publishable as-is,
as plain text,
without looking like it’s been marked up
with tags or formatting instructions.
[@Gruber2004syntax, section "Philosophy"]
The markup language Markdown was introduced in 2004
with the specific aim of helping authors focus on content,
separate from layout concerns
[@Gruber2004].
It has since been widely adopted as an authoring source format
for generating reading-optimized documents in HTML and PDF
that in the same time period have evolved
to support semantic text annotations.
But Markdown can only express structure and hypermedia annotations,
not semantic annotations.
Annotating text differs from annotating the document as a whole
in that the information is tied to specific text strings.
A document annotation can say
e.g. "this document contains strong language somewhere"
or "this document contains a link to this URL somewhere",
whereas a text annotation can say
e.g. "this text string is strongly emphasized"
or "this text string links to this URL".
Semantic text annotation
is the process of applying information about meaning
to specific text strings.
Some Markdown dialects support prepending a metadata section
to the whole content markup,
i.e. semantic document annotation,
but tying semantics to specific strings is currently unsupported.
The author cannot express "this currency amount is in 1980 dollars"
or "this uses the derogatory meaning of the term".
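
The distinction between document annotation and text annotation
can be sketched as a small data model.
This is only an illustration:
the class names, the sample sentence, and the key and value strings
are hypothetical and not taken from any specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentAnnotation:
    """Information about the document as a whole."""
    key: str
    value: str


@dataclass(frozen=True)
class TextAnnotation:
    """Information tied to the characters text[start:end]."""
    start: int
    end: int
    key: str
    value: str


def annotated_string(text: str, annotation: TextAnnotation) -> str:
    """Return the exact substring that a text annotation refers to."""
    return text[annotation.start:annotation.end]


text = "The house cost $40,000 when it was built."

# Document level: a statement about the text that does not locate anything.
doc_note = DocumentAnnotation(key="mentions", value="a currency amount")

# Text level: a similar statement, anchored to one specific string.
start = text.index("$40,000")
text_note = TextAnnotation(
    start=start,
    end=start + len("$40,000"),
    key="currency",           # hypothetical key name
    value="1980 US dollars",  # hypothetical value
)

print(annotated_string(text, text_note))
```

Only the text annotation can answer the question
of which amount is meant;
the document annotation merely states that one exists somewhere.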
A few years ago, a call was made on a blog
for extending Markdown to cover semantic text annotations
[@Francart2020].
This led to a draft specification called "Semantic Markdown"
[@Smedegaard2022];
no actual implementation was made at the time, however.
Pandoc is a common and versatile Markdown processor,
supporting two-way translation between Markdown and many other document formats.
Pandoc supports adding import and export plugins
and filters to manipulate its Abstract Syntax Tree (AST).
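
To give an impression of what a filter operates on,
the following sketch walks a small hand-written document
in the JSON form of the Pandoc AST,
which Pandoc can hand to an external program through its `--filter` option.
The sample document, the API version number in it,
and the helper names `walk` and `element_types` are assumptions made for illustration.

```python
import json

# A hand-written sample in the JSON form of the Pandoc AST.
# Each element is an object with a type tag "t" and optional content "c".
SAMPLE = json.loads("""
{
  "pandoc-api-version": [1, 23],
  "meta": {},
  "blocks": [
    {"t": "Para", "c": [
      {"t": "Str", "c": "Plain"},
      {"t": "Space"},
      {"t": "Strong", "c": [{"t": "Str", "c": "bold"}]}
    ]}
  ]
}
""")


def walk(node, visit):
    """Call visit(element) for every tagged element, depth first."""
    if isinstance(node, dict):
        if "t" in node:
            visit(node)
        for value in node.values():
            walk(value, visit)
    elif isinstance(node, list):
        for item in node:
            walk(item, visit)


def element_types(doc):
    """List the type tags of all block and inline elements in order."""
    seen = []
    walk(doc["blocks"], lambda element: seen.append(element["t"]))
    return seen


print(element_types(SAMPLE))
# prints ['Para', 'Str', 'Space', 'Strong', 'Str']
```

A filter follows the same traversal
but returns modified elements instead of merely listing them.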
This project aims to implement the Semantic Markdown draft specification,
by way of a Pandoc filter to handle semantic text annotations.
Problem formulation
So,
How can Pandoc be extended to support semantic text annotations?
- What are the core qualities of Markdown,
and how can a Markdown dialect express semantic text annotations
while maintaining those qualities?
- How does Pandoc convert Markdown to HTML or PDF,
and how can this workflow be extended
to handle semantic text annotations?
- Which approach to extending Pandoc
is more likely to be sustainable in the long term?
- How can the reliability of a Pandoc extension be evaluated?
Project constraints
Driven by an interest in sustained development of this research project
well beyond the delivery covered in this paper,
the project has been voluntarily constrained
by the following early design decisions:
- The current delivery is intentionally scoped
as a minimum viable product,
with additional features planned as separate future work.
- Programming language and integration design
are largely dictated by existing actively used systems,
rather than by convenience of personal familiarity or efficiency.
- The project solely involves freely licensed tools and resources,
and is itself released under free licences that incentivise collaboration.
Scope limited to authoring
The scope of this project is to enable authoring,
with a further aim of later extending to more complex processing.
The primary aim is to enable authors to annotate semantics
as an integral part of their creative writing process.
Future work may expand on this,
enhancing renderings of the authored content,
as discussed in @sec-rdf.
The idea is to introduce semantic text annotation to Markdown authoring
without disrupting existing Markdown-based workflows.
A filter is added to the document generation workflow
that strips semantic text annotations from the Markdown content,
so that further processing need not know about the new markup,
as illustrated in @fig-phase1.
{#fig-phase1}
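
The stripping step can be sketched as follows.
The sketch assumes, purely for illustration,
that annotations have reached the AST as `Span` elements
carrying a hypothetical `property` attribute;
the actual Semantic Markdown syntax and its AST representation
are not reproduced here.
It is not the implementation of this project,
only a minimal instance of the idea
that annotated text is kept while the annotation itself is dropped.

```python
import json

ANNOTATION_KEY = "property"  # hypothetical attribute name


def is_annotation(node):
    """True for a Span whose attributes include the annotation key."""
    if not (isinstance(node, dict) and node.get("t") == "Span"):
        return False
    attr, _content = node["c"]  # Span content is [attr, inlines]
    keyvals = attr[2]           # attr is [identifier, classes, key-value pairs]
    return any(key == ANNOTATION_KEY for key, _value in keyvals)


def strip(node):
    """Return a copy of node with every annotation Span unwrapped."""
    if isinstance(node, list):
        out = []
        for item in node:
            item = strip(item)
            if is_annotation(item):
                out.extend(item["c"][1])  # keep the text, drop the wrapper
            else:
                out.append(item)
        return out
    if isinstance(node, dict):
        return {key: strip(value) for key, value in node.items()}
    return node


doc = {
    "pandoc-api-version": [1, 23],
    "meta": {},
    "blocks": [{"t": "Para", "c": [
        {"t": "Str", "c": "It"},
        {"t": "Space"},
        {"t": "Str", "c": "cost"},
        {"t": "Space"},
        {"t": "Span", "c": [
            ["", [], [["property", "1980-dollars"]]],
            [{"t": "Str", "c": "$40,000"}],
        ]},
    ]}],
}

stripped = strip(doc)
print(json.dumps(stripped["blocks"][0]["c"][-1]))
```

Run as a Pandoc JSON filter,
such a function would be applied to the document read from standard input
and the result written to standard output,
so that later processing stages see only ordinary Markdown content.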
Evolve accepted formats and tools
Markdown is a widely adopted authoring format,
and Pandoc a widely adopted Markdown processor.
It is easier to invent a new format in a new tool
than to convince widely adopted ones to evolve,
but the latter is, if successful, likely far more reliable.
This project aims at introducing new syntax
while staying close to existing Markdown,
unlike e.g. SAM, which deviates notably from Markdown
[@SAM2018].
Many Markdown processors exist,
but few are in widespread use.
Pandoc is widely adopted and has more recently been integrated
with R Markdown and the Quarto document publishing system
to streamline the production of academic papers.
The easiest way to implement a derivation of Markdown
is likely by writing a Pandoc import filter,
but that would limit its usefulness with larger frameworks like Quarto:
Although they do support alternative source languages,
important features like citation handling are far better streamlined
when using the main and best supported source format.
This project therefore implements its derivation of Markdown
by reading the derived source as if it were ordinary Markdown,
and then applying a filter to adjust the AST after the deliberate misparsing.
Collaborative licensing
To encourage collaboration and stimulate a circular gift economy
as introduced by @Mikkelsen2000,
this project solely uses freely licensed tools and resources,
and the project itself is copyleft licensed:
Code parts are licensed
under the GNU General Public License, version 3 or later,
and non-code parts are licensed
under the Creative Commons Attribution-ShareAlike 4.0 licence.
Implementation plan
FIXME: summarize next chapters