A Markdown-formatted document should be publishable as-is,
as plain text,
without looking like it’s been marked up
with tags or formatting instructions.
[@Gruber2004syntax, section "Philosophy"]
The markup language Markdown was introduced in 2004
with the specific aim of helping authors focus on content,
separate from layout concerns
[@Gruber2004].
It has since been widely adopted as an authoring source format
for generating reading-optimised documents in HTML and PDF,
which over the same time period have evolved
to support semantic text annotations.
But Markdown can only express structure and hypermedia annotations,
not semantic annotations.
Annotating text differs from annotating the document as a whole
in that the information is tied to specific text strings.
A text annotation can state, for example,
"this text string is strongly emphasized"
or "this text string links to this URL",
whereas a document annotation is limited to stating, for example,
"this document contains strong language somewhere"
or "this document contains a link to this URL somewhere".
Semantic text annotation
is the process of applying information about meaning
to specific text strings.
Some Markdown dialects support prepending a metadata section
to the whole content markup
(i.e., semantic document annotation),
but tying semantics to specific strings is currently unsupported.
The author cannot express "this currency amount is in 1980 dollars"
or "this uses the derogatory meaning of the term".
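
The difference between the two kinds of annotation can be sketched as data structures.
The following is a minimal illustration only:
the class names, the sample sentence and the use of character offsets
are assumptions made for this sketch,
not part of Markdown or of the Semantic Markdown draft,
which ties annotations to text through inline markup instead.

```python
from dataclasses import dataclass


@dataclass
class DocumentAnnotation:
    """A statement about the document as a whole (hypothetical type)."""
    statement: str


@dataclass
class TextAnnotation:
    """A statement tied to one specific text string (hypothetical type).

    The string is located by character offsets, an assumption of this sketch.
    """
    start: int
    end: int
    statement: str

    def target(self, text):
        """Return the exact string that the statement is about."""
        return text[self.start:self.end]


text = "Send $100 to the office."

# Document level: true of the text somewhere, but not tied to any string.
doc_note = DocumentAnnotation("this document contains a currency amount somewhere")

# Text level: tied to the characters at offsets 5..8.
text_note = TextAnnotation(5, 9, "this currency amount is in 1980 dollars")

print(text_note.target(text))  # $100
```

Only the text annotation can answer which string a statement is about;
the document annotation cannot.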
A blog post called for extending Markdown
to cover semantic text annotations
[@Francart2020].
This led to the draft specification Semantic Markdown
[@Smedegaard2022];
no implementation was made at the time, however.
Pandoc is a common and versatile Markdown processor,
supporting two-way translation between Markdown and many other document formats.
Pandoc supports adding import and export plugins
and filters to manipulate its Abstract Syntax Tree (AST).
This project aims to implement the Semantic Markdown draft specification,
by way of a Pandoc filter to handle semantic text annotations.
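
To illustrate the filter mechanism,
the following minimal sketch walks a Pandoc AST in its JSON serialisation
using only the Python standard library.
It assumes the JSON layout in which each element is an object
with a type key `t` and, usually, a content key `c`.
The names `walk`, `shout` and `run_filter` are hypothetical,
and the upper-casing action is merely a placeholder for a real transformation,
not the filter developed in this project.

```python
import json


def walk(node, action):
    """Recursively visit a JSON-encoded AST.

    `action` is called on every element (a dict with a "t" key).
    It returns a replacement element, or None to keep the element
    and continue into its children.
    """
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        if "t" in node:
            replacement = action(node)
            if replacement is not None:
                return replacement
        return {key: walk(value, action) for key, value in node.items()}
    return node


def shout(element):
    """Placeholder action (hypothetical): upper-case every Str element."""
    if element["t"] == "Str":
        return {"t": "Str", "c": element["c"].upper()}
    return None


def run_filter(json_text, action):
    """Parse a JSON-encoded AST, transform it, and serialise it again."""
    return json.dumps(walk(json.loads(json_text), action))


# Hand-written AST for the paragraph "This is not a pipe."
doc = {
    "pandoc-api-version": [1, 23, 1],  # illustrative value, not checked here
    "meta": {},
    "blocks": [
        {"t": "Para", "c": [
            {"t": "Str", "c": "This"}, {"t": "Space"},
            {"t": "Str", "c": "is"}, {"t": "Space"},
            {"t": "Str", "c": "not"}, {"t": "Space"},
            {"t": "Str", "c": "a"}, {"t": "Space"},
            {"t": "Str", "c": "pipe."},
        ]},
    ],
}

result = json.loads(run_filter(json.dumps(doc), shout))
print(result["blocks"][0]["c"][0])  # {'t': 'Str', 'c': 'THIS'}
```

A script that reads such JSON from standard input
and writes the transformed JSON to standard output
can be passed to Pandoc with its `--filter` option.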
Problem formulation
Thus,
how can Pandoc be extended to support semantic text annotations?
- What are the core qualities of Markdown,
and how can a Markdown dialect express semantic text annotations
while maintaining those qualities?
- How does Pandoc parse Markdown into a generalised content structure,
and how can that process be extended
to handle semantic text annotations?
- Which approach to extending Pandoc
is more likely to be sustainable in the long term?
- How can the reliability of a Pandoc extension be evaluated?
Project constraints
Driven by an interest in sustained development of this research project
beyond the scope of this paper,
the project has been voluntarily constrained
by the following early design decisions:
- The current delivery is intentionally scoped
as a minimum viable product,
with additional features planned as separate future work.
- Newly introduced syntax is kept in line
with the core principles of Markdown.
- The programming language and integration design
are largely dictated by existing, actively used systems,
rather than by personal familiarity or efficiency.
- The project solely involves freely licensed tools and resources,
and is itself licensed under Free licences that encourage collaboration.
Scope limited to use for authoring {#sec-phase1}
The scope of this project is to enable annotating while authoring.
Rendering that makes use of annotations is outside this scope,
but is planned as future work
as discussed in @sec-rdf.
The idea is to introduce semantic text annotation to Markdown authoring
without disruptions to the Markdown-based workflows.
A filter is added to the document generation workflow
that strips semantic text annotations from the Markdown content,
so that further processing need not know about the new markup,
as illustrated in @fig-phase1.
{#fig-phase1}
For example, a text authored like this:
[This]{:depiction} is not a pipe.
...that with the filter applied becomes this:
This is not a pipe.
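
The effect of the stripping step on this example
can be approximated at the plain-text level.
The sketch below is an illustration under stated assumptions,
not the filter itself, which operates on the Pandoc AST:
the regular expression covers only the single annotation form shown above
(bracketed text directly followed by a brace group starting with a colon),
and the names `ANNOTATION` and `strip_annotations` are hypothetical.

```python
import re

# Assumed pattern: "[text]{:keyword}" with no nested brackets or braces.
# The draft specification defines the full grammar; this covers one form only.
ANNOTATION = re.compile(r"\[([^\[\]]+)\]\{:[^{}]*\}")


def strip_annotations(text):
    """Replace each annotated span with its bare text."""
    return ANNOTATION.sub(r"\1", text)


print(strip_annotations("[This]{:depiction} is not a pipe."))
# This is not a pipe.

# Text that does not match the assumed pattern passes through unchanged.
print(strip_annotations("[This](https://example.org) is a link."))
# [This](https://example.org) is a link.
```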
Added syntax in the spirit of Markdown {#sec-spirit}
Markdown is a lenient markup language.
Partly this is derived from Markdown being a superset of HTML,
which is (or was, when Markdown was introduced) based on SGML
that by design permits parsers to omit markup they cannot handle
[@Connolly1994, section "SGML as a Layered Communications Medium"].
Partly it is due to a deliberate design choice,
as reflected in the quote introducing this paper:
The format should be easy to both write and read in source form.
The annotation syntax in a Markdown text...
- should be relatively comprehensible to humans
(in the same spirit as Markdown -- for example, **strong emphasis**)
- should not conflict with other Markdown extensions
- must not conflict with core Markdown
(i.e. must not cause ambiguities)
A rendering of syntactically correct annotations...
- must be functionally equal to the text authored without annotations
(similar to how a text with a hyperlink is still rendered as text)
- may include the annotation in some multimodal form
(again similar to how some renderings of a hyperlink offer interactive navigation)
A rendering of syntactically incorrect or structurally unsupported annotations...
- should include the annotation as regular text
Improve accepted formats and tools {#sec-improve}
Markdown is a widely adopted authoring format,
and Pandoc a widely adopted Markdown processor.
It is likely easier to invent a new format with a new tool
than to adapt widely adopted ones,
because their established userbase is familiar with the existing formats and functionality
and unlikely to accept changes to them.
An adaptation, if successful, is however likely to become relevant,
again because a userbase is already established.
This project aims at staying close to existing Markdown,
striving to only minimally introduce new syntax,
unlike, for example, Semantic Authoring Markdown (SAM)
that changes core markup such as for links
[@SAM2018],
or Knowledge Representation Markup Language (KRML)
that requires a list for each annotation
[@KRML2025].
Many Markdown processors exist,
but few are in widespread use.
One popular Markdown processor is Pandoc,
widely adopted and integrated with multiple wrapper tools
such as R Markdown and the Quarto document publishing system
to streamline the production of, for example, academic papers.
Yet another Markdown parser would probably be easiest to implement
as a standalone project,
either by forking and customizing an existing one,
or by use of a library for common components
and then subclassing as needed.
This project, however, strives to reuse existing formats and tools,
and has chosen to extend the existing Markdown processor Pandoc.
Collaborative licensing
To encourage collaboration and stimulate a circular gift economy
as introduced by @Mikkelsen2000,
this project exclusively uses freely licensed tools and resources,
and the project itself is copyleft licensed:
Code parts are licensed
under the GNU General Public License, version 3 or later,
and non-code parts are licensed
under the Creative Commons Attribution-ShareAlike 4.0 licence.
Implementation plan
The plan for this project is to...
- implement an extension to an existing widespread authoring language,
not an alternative one
- keep changes to Markdown syntax to a minimum,
both to ease adoption by existing Markdown users
and to keep the original principle that it "should be publishable as-is"
- emphasize ease of user adoption rather than ease of implementation
Achieving these goals requires an understanding of Markdown and annotations,
and it requires working code.
Existing Markdown syntax is analysed in @sec-commonmark
and the proposed new coverage of annotations in @sec-semantic-markdown.
(Eventual rendering of annotations is outside the scope of this paper
and will only be briefly discussed later in @sec-rdf.)
Then existing Markdown parsing with the tool Pandoc is analysed in @sec-pandoc,
and the implementation of a filter to parse annotations
is described in @sec-filter,
and evaluated and discussed in the later chapters.
Use of generative AI
The abstract for this paper was automatically generated
(and then manually corrected)
by use of an instance of the generative AI GPT-4.1 mini
trained on the text of the draft report.
Other than that,
generative AI has not been used directly in the creation of this work
(only indirectly for aid in information search and grammar validation).