Markdown is a markup language
that encourages treating structure as an integral part of content
while postponing styling till later.
This separation of visual concerns from content and structure
is harnessed by the document converter Pandoc
and the Pandoc-based document authoring framework Quarto,
and is suitable for scholarly writing
where styling may be dictated by a publisher.
Pandoc and Quarto lack support for annotations
beyond the specific contexts of hypertext navigation and scholarly citation.
You can annotate a string as a hyperlink with a title,
or as a citation with a source reference.
You can add document-wide metadata
as a YAML structure at the top of the text --
both using well-defined keys for author, supervisor and publication date
and arbitrarily added keys.
And you can produce a web page or a PDF document from your text.
You cannot, however, write an annotation for a string
contextually related to an arbitrary content domain,
and then in the output document
have that annotation suppressed yet accessible as metadata.
Example annotations might include
that one set of numbers is in metres and another in nautical miles,
that one citation is supportive and another a rebuttal,
or that one quote uses "she" as a personal pronoun
and another uses it derogatorily.
Such meta information tied not to the document as a whole
but to specific strings in the text
cannot be written as such --
i.e. structurally part of the writing
but communicatively meta to the prose content of the text.
Problem formulation
Quarto supports hypertext and citation annotations,
and document-wide metadata.
The aim of this project is to extend
the Pandoc processing and Quarto authoring workflows
to support writing arbitrary domain-specific annotations
which are converted to metadata,
if possible preserving a reference to the location of the string.
This aim has been framed with the following problem statement:
How can Pandoc
be extended to support domain-specific annotations?
To aid in achieving that goal,
the problem statement has been divided into the following subquestions:
- What are the core qualities of Markdown,
and how could a Markdown flavor express domain-specific annotations
while maintaining those qualities?
- How does Quarto convert Markdown source to HTML or PDF output,
and how can this workflow be extended
to cover Markdown with domain-specific annotations?
- Which approach to altering the workflow of Pandoc and Quarto
is more likely to be efficient and sustainable in the long term?
Motivation
FIXME: rewrite to be project motivations (not author motivations)
The author of this project prefers
a console-based WYSIWYM authoring environment,
and for political reasons an environment
consisting solely of freely licensed code.
More specifically,
the authoring system of choice is Quarto,
which allows either a WYSIWYM environment (RStudio among others)
or a more technical console-based environment (any modern text editor),
as it uses the plaintext markup format Markdown as source format,
which is automatically processed by the tool Pandoc
into any of a range of output formats
including LaTeX with further postprocessing into PDF.
Implementation idea and brief plan
Implement plugins for the Pandoc document converter
to enable authoring of ontological annotations in the text content,
inspired by the conceptual idea in @Francart2020,
and publish the plugins
for easy use with the Quarto document publishing system.
Pandoc reads a text document,
parses its structural components into an internal data structure
called an Abstract Syntax Tree (AST),
and serialises it back into a text document.
The AST deliberately prioritises structural information
and is relaxed about visual information,
to preserve literal content
while reducing format-specific stylistic details,
which is especially relevant when converting between different formats.
The most common use is to read plaintext Markdown files
and write LaTeX code further compiled into a PDF file.
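
To make the AST concrete,
the following minimal sketch walks Pandoc's JSON serialisation of the AST
for the Markdown text `Hello *world*`
and lists the type of each node.
The helper name `node_types` is made up for this illustration,
and the API version number is only indicative,
as it varies between Pandoc releases.

```python
import json

# JSON form of the AST for the Markdown text "Hello *world*".
# Every node is an object with a type tag "t" and optional contents "c".
# The API version shown here is illustrative only.
DOC = json.loads("""
{"pandoc-api-version": [1, 23, 1],
 "meta": {},
 "blocks": [{"t": "Para", "c": [
     {"t": "Str", "c": "Hello"},
     {"t": "Space"},
     {"t": "Emph", "c": [{"t": "Str", "c": "world"}]}]}]}
""")


def node_types(node):
    """Yield the type tag of every AST node, depth first (hypothetical helper)."""
    if isinstance(node, list):
        for child in node:
            yield from node_types(child)
    elif isinstance(node, dict) and "t" in node:
        yield node["t"]
        # Leaf contents such as the string inside "Str" are skipped,
        # since they are neither lists nor tagged objects.
        yield from node_types(node.get("c"))


TYPES = list(node_types(DOC["blocks"]))
```

The structure records that "world" is emphasised,
not how emphasis should look,
which is the separation of structure from styling described above.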
Pandoc allows supplying custom reader and writer functions
as well as plugging into and manipulating the AST,
which this project will exploit
by writing an extension
that adjusts the AST
when the default Markdown reader is repurposed
to read Markdown with added markup for ontological annotations,
as proposed in @Francart2020
and further sketched as a draft markup format in @Smedegaard2022.
The implemented Pandoc extensions will be designed
both for use standalone and as part of the document authoring framework Quarto,
which uses Pandoc as its central tool with a large set of extensions and templates.
The first milestone is reached
when the filter can simply suppress the added markup.
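
A minimal sketch of such a suppressing filter is given here,
operating on Pandoc's JSON form of the AST.
It does not use the draft syntax from @Smedegaard2022;
as a stand-in it assumes that an annotation arrives as a `Span` node
produced by Pandoc's bracketed span syntax,
carrying an attribute whose name, `annotation`, is invented for this example.
The attribute value `unit:metre` is likewise made up.

```python
ANNOTATION_KEY = "annotation"  # hypothetical attribute name, not from the draft specification


def is_annotated_span(node):
    """True for a Span node that carries the hypothetical annotation attribute."""
    if not (isinstance(node, dict) and node.get("t") == "Span"):
        return False
    (_identifier, _classes, keyvals), _inlines = node["c"]
    return any(key == ANNOTATION_KEY for key, _value in keyvals)


def strip_annotations(node):
    """Return a copy of the AST where annotated spans are replaced by their content."""
    if isinstance(node, list):
        result = []
        for child in node:
            if is_annotated_span(child):
                # Unwrap: keep the inline content, drop the span and all its attributes.
                result.extend(strip_annotations(child["c"][1]))
            else:
                result.append(strip_annotations(child))
        return result
    if isinstance(node, dict):
        return {key: strip_annotations(value) for key, value in node.items()}
    return node


# A paragraph as the Markdown reader would deliver it for
# `Depth [12]{annotation="unit:metre"} m`.
PARAGRAPH = {"t": "Para", "c": [
    {"t": "Str", "c": "Depth"},
    {"t": "Space"},
    {"t": "Span", "c": [["", [], [["annotation", "unit:metre"]]],
                        [{"t": "Str", "c": "12"}]]},
    {"t": "Space"},
    {"t": "Str", "c": "m"},
]}

CLEANED = strip_annotations(PARAGRAPH)
```

To run as an actual filter,
the same function could be applied to a document read as JSON from standard input
and written as JSON to standard output,
which is how Pandoc communicates with programs passed to its `--filter` option.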
A further milestone is to embed the expressed annotations
in supported output formats,
e.g. as XMP metadata in PDF output
[@PDFAssociation2020 chapter 14.3]
and as RDFa in HTML output
[@Herman2015].
Another milestone is to make use of the added markup,
e.g. to annotate the purpose of scholarly citations
as presented in @Daquino2023.
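
Before annotations can be embedded as XMP or RDFa,
they need to be gathered from the text together with some reference to their location.
The sketch below illustrates only that gathering step, under loud assumptions:
annotations are taken to be `Span` nodes with an invented `annotation` attribute,
the metadata key `annotations` is hypothetical,
and the location is recorded crudely as the index of the enclosing top-level block.
Writing the collected entries as XMP or RDFa is not covered.

```python
ANNOTATION_KEY = "annotation"  # hypothetical attribute name
META_KEY = "annotations"       # hypothetical metadata key


def plain_text(inlines):
    """Flatten Str, Space, Emph and Strong nodes to a string; other nodes are ignored."""
    parts = []
    for node in inlines:
        if node["t"] == "Str":
            parts.append(node["c"])
        elif node["t"] == "Space":
            parts.append(" ")
        elif node["t"] in ("Emph", "Strong"):
            parts.append(plain_text(node["c"]))
    return "".join(parts)


def find_spans(node):
    """Yield every Span node found below the given AST node."""
    if isinstance(node, list):
        for child in node:
            yield from find_spans(child)
    elif isinstance(node, dict):
        if node.get("t") == "Span":
            yield node
        yield from find_spans(node.get("c"))


def collect_annotations(doc):
    """Return a copy of the document with annotations listed in its metadata."""
    entries = []
    for index, block in enumerate(doc["blocks"]):
        for span in find_spans(block):
            (_identifier, _classes, keyvals), inlines = span["c"]
            for key, value in keyvals:
                if key == ANNOTATION_KEY:
                    entries.append({"t": "MetaMap", "c": {
                        "text": {"t": "MetaString", "c": plain_text(inlines)},
                        "value": {"t": "MetaString", "c": value},
                        # Location as top-level block index: a crude assumption.
                        "block": {"t": "MetaString", "c": str(index)},
                    }})
    result = dict(doc)
    result["meta"] = {**doc["meta"], META_KEY: {"t": "MetaList", "c": entries}}
    return result


DOC = {
    "pandoc-api-version": [1, 23, 1],  # illustrative version number
    "meta": {},
    "blocks": [{"t": "Para", "c": [
        {"t": "Str", "c": "Depth"},
        {"t": "Space"},
        {"t": "Span", "c": [["", [], [["annotation", "unit:metre"]]],
                            [{"t": "Str", "c": "12"}]]},
    ]}],
}

ANNOTATED = collect_annotations(DOC)
```

Keeping the entries in document metadata
leaves the choice of output embedding to a later stage,
such as a template or a writer-specific filter.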
As mentioned above,
a specification has already been drafted
in @Smedegaard2022
for the syntax of embedding ontological annotations in Markdown.
The main challenge of this project is to implement that specification
as extensions for the existing Pandoc tool and Quarto framework,
and as part of that potentially also refine the draft specification.