The markup language Markdown was introduced in 2004
with the specific aim of helping authors focus on content,
separate from layout concerns
[@Gruber2004].
Markdown has since been widely adopted,
commonly as a manually authored source format
for generating reading-optimized documents in HTML, PDF and other formats.
HTML and PDF have evolved in the same time period
to support semantic text annotations,
but Markdown can only express structure and hypermedia markup,
not semantic markup.
Text annotation is the process of applying contextual information to text.
Annotating text differs from annotating the document as a whole
in that the information is tied to specific text strings.
A document annotation essentially says
"this document contains hyperlinks to these URLs"
or "this document contains strong language"
whereas a text annotation says
"this particular string is emphasized"
or "this particular string refers to this specific URL".
Semantic text annotation (also called semantic markup)
is the process of applying information about meaning
to specific text strings.
Semantic document annotation is supported by some Markdown dialects
by prepending a metadata section to the content markup,
but tying semantics to specific strings is currently unsupported.
The author cannot annotate "this currency amount is in 1980 dollars"
or "this uses the derogatory meaning of the term".
Problem formulation
This project aims to enable authors to include semantic annotations
as part of their writing,
as unobtrusively as the widely adopted structural and hypermedia markup,
by extending a common Markdown processor
to handle semantic markup.
This aim has been framed with the following problem statement:
How can Pandoc
be extended to support context annotations?
To aid in achieving that goal,
the problem statement has been divided into the following subquestions:
- What are the core qualities of Markdown,
and how could a Markdown dialect express context annotations
while maintaining those qualities?
- How does Quarto convert Markdown source to HTML or PDF output,
and how can this workflow be extended
to handle context annotations?
- Which approach to altering Quarto is more likely to be sustainable in the long term?
- How could the reliability of a Pandoc plugin for Quarto be evaluated?
Levels of implementation
The primary aim is to support authoring;
enhancing renderings of the authored content is secondary.
Annotations are metadata, not directly part of the content.
Consequently, the simplest way
to correctly process Markdown with semantic text annotations
is to omit the annotations altogether,
rendering as if they had not been applied at all.
More advanced processing may optionally embed semantic text annotations
as semantic document annotations, i.e. embed as document-wide metadata.
Even more advanced processing might enhance the content rendering,
similar to how some PDF renderers provide interactive navigation
for embedded hypermedia markup like links and anchors.
This project aims at the simplest processing level as described above,
due to the limited time of the project.
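
The simplest processing level described above can be sketched
as a transformation of Pandoc's JSON representation of a document.
The sketch below is illustrative only:
it assumes that annotations have already been parsed
into `Span` elements carrying a hypothetical class named `annotation`,
which is not taken from any specification.
It unwraps those spans so that the output renders
as if no annotation had been applied.

```python
ANNOTATION_CLASS = "annotation"  # hypothetical marker class, not from any spec


def is_annotation(node):
    """True for a Span inline whose class list contains the marker class."""
    if not (isinstance(node, dict) and node.get("t") == "Span"):
        return False
    (_identifier, classes, _keyvals), _inlines = node["c"]
    return ANNOTATION_CLASS in classes


def strip_annotations(node):
    """Return a copy of the AST with annotation Spans replaced by their content."""
    if isinstance(node, list):
        result = []
        for child in node:
            if is_annotation(child):
                # Splice the annotated inlines in place of the wrapper.
                result.extend(strip_annotations(child["c"][1]))
            else:
                result.append(strip_annotations(child))
        return result
    if isinstance(node, dict):
        return {key: strip_annotations(value) for key, value in node.items()}
    return node


# Hand-written example document; the attribute name is made up.
doc = {
    "pandoc-api-version": [1, 23, 1],
    "meta": {},
    "blocks": [{"t": "Para", "c": [
        {"t": "Str", "c": "Costs"},
        {"t": "Space"},
        {"t": "Span", "c": [
            ["", ["annotation"], [["currency-year", "1980"]]],
            [{"t": "Str", "c": "$100"}],
        ]},
    ]}],
}

stripped = strip_annotations(doc)
```

A JSON filter for Pandoc receives such a document on standard input
and writes the transformed document to standard output,
so the same function could be wrapped with `json.load` and `json.dump`
for use with `pandoc --filter`.
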
Maintaining usability and interoperability
A notable challenge is aligning with existing practice and systems.
Markdown is known for its unobtrusive plaintext editing format,
and an extension to its vocabulary will need to fit that principle.
Also,
FIXME
Motivation
FIXME: rewrite to be project motivations (not author motivations)
The author of this project prefers
a console-based WYSIWYM authoring environment,
and for political reasons an environment
consisting solely of freely licensed code.
More specifically,
the authoring system of choice is Quarto,
which allows either a WYSIWYM environment (RStudio among others)
or a more technical console-based environment (any modern text editor),
as it uses the plaintext markup format Markdown as source format,
which is automatically processed by the tool Pandoc
into any of a range of output formats
including LaTeX with further postprocessing into PDF.
Implementation idea and brief plan
Implement plugins for the Pandoc document converter
to enable authoring of ontological annotations in the text content,
inspired by the conceptual idea in @Francart2020,
and publish the plugins
for easy use with the Quarto document publishing system.
Pandoc reads a text document,
parses its structural components into an internal data structure
called an Abstract Syntax Tree (AST),
then serialises that structure and writes it back out as a text document.
The AST deliberately prioritises structural information
and is relaxed about visual information,
to preserve literal content
while reducing format-specific stylistic details,
relevant especially when processing between different formats.
The most common use is to read plaintext Markdown files
and write LaTeX code that is further compiled into a PDF file.
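
To make the role of the AST more concrete,
the following sketch shows a hand-written approximation
of the JSON form of the AST for the Markdown source `Hello *world*`,
together with a small function that recovers the literal text.
The API version number and the helper name `plain_text` are illustrative assumptions,
and the helper only handles the few node types used here.

```python
# Approximation of what `pandoc -t json` emits for "Hello *world*".
ast = {
    "pandoc-api-version": [1, 23, 1],  # illustrative version number
    "meta": {},
    "blocks": [
        {"t": "Para", "c": [
            {"t": "Str", "c": "Hello"},
            {"t": "Space"},
            {"t": "Emph", "c": [{"t": "Str", "c": "world"}]},
        ]},
    ],
}


def plain_text(node):
    """Collect the literal text below an AST node, ignoring all markup."""
    if isinstance(node, list):
        return "".join(plain_text(child) for child in node)
    if isinstance(node, dict):
        tag = node.get("t")
        if tag == "Str":
            return node["c"]
        if tag in ("Space", "SoftBreak"):
            return " "
        # Any other node: descend into its content, if it has any.
        return plain_text(node.get("c", []))
    return ""


text = plain_text(ast["blocks"])
print(text)  # prints "Hello world"
```

The emphasis is recorded as a structural node (`Emph`)
rather than as the asterisks of the source,
which is what allows a writer for another format to express it in its own way.
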
Pandoc allows supplying custom reader and writer functions
as well as plugging into and manipulating the AST,
which this project will exploit:
This project will write an extension
to adjust the AST
when abusing the default Markdown reader
to read Markdown with added markup for ontological annotations,
as proposed in @Francart2020
and further sketched as a draft markup format in @Smedegaard2022.
The implemented Pandoc extensions will be designed
for use both standalone and as part of the document authoring framework Quarto,
which uses Pandoc as its central tool with a large set of extensions and templates.
The first milestone is reached
when the filter can simply suppress the added markup.
A further milestone is to embed the expressed annotations
in supported output formats,
e.g. as XMP metadata in PDF output
[@PDFAssociation2020 chapter 14.3]
and as RDFa in HTML output
[@Herman2015].
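
One possible intermediate step towards such embedding
is to gather the annotations into document-wide metadata,
from which a template or writer could later emit XMP or RDFa.
The sketch below shows only that gathering step, not the emission itself.
It assumes annotations are represented as `Span` elements
with a hypothetical class `annotation`,
and it stores them under a hypothetical metadata field `annotations`;
neither name comes from the draft specification.

```python
ANNOTATION_CLASS = "annotation"  # hypothetical marker class
META_KEY = "annotations"         # hypothetical metadata field


def collect_annotations(node, found):
    """Append the attributes of every annotation Span below node to found."""
    if isinstance(node, list):
        for child in node:
            collect_annotations(child, found)
    elif isinstance(node, dict):
        if node.get("t") == "Span":
            (_identifier, classes, keyvals), _inlines = node["c"]
            if ANNOTATION_CLASS in classes:
                found.append(dict(keyvals))
        # Descend into the node content; strings and None are ignored.
        collect_annotations(node.get("c"), found)


def embed_as_metadata(doc):
    """Return a copy of doc with its annotations listed in the metadata."""
    found = []
    collect_annotations(doc["blocks"], found)
    entries = [
        {"t": "MetaMap",
         "c": {key: {"t": "MetaString", "c": value}
               for key, value in annotation.items()}}
        for annotation in found
    ]
    meta = dict(doc["meta"])
    meta[META_KEY] = {"t": "MetaList", "c": entries}
    return {**doc, "meta": meta}


# Hand-written example document; the attribute name is made up.
doc = {
    "pandoc-api-version": [1, 23, 1],
    "meta": {},
    "blocks": [{"t": "Para", "c": [
        {"t": "Span", "c": [
            ["", ["annotation"], [["currency-year", "1980"]]],
            [{"t": "Str", "c": "$100"}],
        ]},
    ]}],
}

enriched = embed_as_metadata(doc)
```

The content blocks are left untouched,
so this step can be combined with suppressing the markup in the rendered text.
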
Yet another milestone is to make use of the added markup,
e.g. to annotate the purpose of scholarly citations
as presented in @Daquino2023.
As mentioned above,
a specification has already been drafted
in @Smedegaard2022
for the syntax of embedding ontological annotations in Markdown.
The main challenge of this project is to implement that specification
as extensions for the existing Pandoc tool and Quarto framework,
and as part of that potentially also refine the draft specification.