Markdown is a markup language for unobtrusive annotation of text content. Processors exist for many dialects of Markdown, but none of them supports context annotation.

The markup language Markdown, originally intended for authoring HTML, has been repurposed for more general document authoring. The Markdown-based tool Quarto can render Markdown text, devoid of visual styling, into scholarly papers conforming to prescribed style guides and document formats. The author can emphasize text, annotate a string as a hyperlink or a citation, and declare document-wide metadata such as authorship, ownership and release date.

Context annotation is unsupported in Markdown

The author cannot, however, annotate a string with an arbitrary context -- e.g. that one citation uses the metric system while another predates it, or that a certain personal pronoun is used supportively or derogatorily. Such information can be expressed as part of the prose, e.g. in parentheses, but none of the Markdown dialects generally available to Quarto provides a way for context annotations to be omitted from the output or diverted to document metadata.

Problem formulation

The aim of this project is to extend Quarto to detect context annotations contained in the source Markdown content, suppress them from inclusion in the content of output documents and optionally add them to the document metadata of output documents.

This aim has been framed with the following problem statement:
How can Pandoc be extended to support context annotations?

To aid in achieving that goal, the problem statement has been divided into the following subquestions:

  • What are the core qualities of Markdown, and how could a Markdown dialect express context annotations while maintaining those qualities?
  • How does Quarto convert Markdown source to HTML or PDF output, and how can this workflow be extended to handle context annotations?
  • Which approach to altering Quarto is more likely to be sustainable in the long term?
  • How could the reliability of a Pandoc plugin for Quarto be evaluated?

Maintaining usability and interoperability

A notable challenge is aligning with existing practice and systems. Markdown is known for its unobtrusive plaintext editing format, and an extension to its vocabulary will need to fit that principle. Also,
FIXME

Motivation

FIXME: rewrite to be project motivations (not author motivations)

The author of this project prefers a console-based WYSIWYM authoring environment, and for political reasons an environment consisting solely of freely licensed code. More specifically, the authoring system of choice is Quarto, which allows either a WYSIWYM environment (RStudio among others) or a more technical console-based environment (any modern text editor). Quarto uses the plaintext markup format Markdown as its source format, which the tool Pandoc automatically processes into any of a range of output formats, including LaTeX with further postprocessing into PDF.

Implementation idea and brief plan

Implement plugins for the Pandoc document converter to enable authoring of ontological annotations in the text content, inspired by the conceptual idea in @Francart2020, and publish the plugins for easy use with the Quarto document publishing system.

Pandoc reads a text document, parses its structural components into an internal data structure called an Abstract Syntax Tree (AST), and serialises the AST back into a text document. The AST deliberately prioritises structural information and is relaxed about visual information, so as to preserve literal content while reducing format-specific stylistic details. This is especially relevant when converting between different formats. The most common use is to read plaintext Markdown files and write LaTeX code that is further compiled into a PDF file.
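
To make the AST concrete, the following sketch hand-builds the JSON form of the Pandoc AST for the Markdown text `Hello *world*` and extracts its literal content. The node shapes (`Para`, `Str`, `Space`, `Emph`) follow Pandoc's JSON serialisation, as emitted by `pandoc -t json`. The helper name `plain_text` is hypothetical and introduced only for illustration, and the API version number is an assumption that depends on the installed Pandoc release.

```python
# Hand-written JSON form of the Pandoc AST for the Markdown "Hello *world*".
# The "pandoc-api-version" value is an assumption; it varies between releases.
doc = {
    "pandoc-api-version": [1, 23, 1],
    "meta": {},
    "blocks": [
        {"t": "Para", "c": [
            {"t": "Str", "c": "Hello"},
            {"t": "Space"},
            {"t": "Emph", "c": [{"t": "Str", "c": "world"}]},
        ]},
    ],
}


def plain_text(node):
    """Recursively collect literal text, ignoring how it will be styled."""
    if isinstance(node, list):
        return "".join(plain_text(child) for child in node)
    if isinstance(node, dict):
        tag = node.get("t")
        if tag == "Str":
            return node["c"]
        if tag == "Space":
            return " "
        # Any other node (Para, Emph, ...): descend into its content, if any.
        return plain_text(node.get("c", []))
    return ""  # other values carry no literal text in this small example


print(plain_text(doc["blocks"]))  # prints "Hello world"
```

The emphasis survives only as a structural `Emph` node; nothing in the tree says whether it will be rendered as italics, underlining or something else, which is left to the writer for each output format.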

Pandoc allows supplying custom reader and writer functions as well as plugging into and manipulating the AST, which this project will exploit: it will implement an extension that adjusts the AST when the default Markdown reader is abused to read Markdown with added markup for ontological annotations, as proposed in @Francart2020 and further sketched as a draft markup format in @Smedegaard2022. The implemented Pandoc extensions will be designed for use both standalone and as part of the document authoring framework Quarto, which uses Pandoc as its central tool together with a large set of extensions and templates.

The first milestone is reached when the filter can simply suppress the added markup. A further milestone is to embed the expressed annotations in supported output formats, e.g. as XMP metadata in PDF output [@PDFAssociation2020 chapter 14.3] and as RDFa in HTML output [@Herman2015]. Another further milestone is to make use of the added markup, e.g. to annotate the purpose of scholarly citations as presented in @Daquino2023.
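
The first milestone, and a much simplified stand-in for the second, can be sketched as a transformation of Pandoc's JSON AST. This is a minimal sketch under stated assumptions, not the syntax drafted in @Smedegaard2022: it assumes that annotations arrive as Pandoc `Span` nodes carrying a hypothetical class `annotation` (as Pandoc's bracketed-span syntax `[5 km]{.annotation unit=metric}` would produce), and it stores the collected attributes under a hypothetical metadata key `annotations`. The function names are likewise illustrative.

```python
import json

ANNOTATION_CLASS = "annotation"  # hypothetical marker class, an assumption


def strip_annotations(node, found):
    """Return node with annotation Spans unwrapped; record their attributes."""
    if isinstance(node, list):
        result = []
        for child in node:
            if (isinstance(child, dict) and child.get("t") == "Span"
                    and ANNOTATION_CLASS in child["c"][0][1]):
                # A Span holds [[identifier, classes, key-value pairs], inlines].
                (_ident, _classes, keyvals), inlines = child["c"]
                found.extend(keyvals)
                # Keep the annotated text, drop the wrapper (milestone one).
                result.extend(strip_annotations(inlines, found))
            else:
                result.append(strip_annotations(child, found))
        return result
    if isinstance(node, dict):
        return {key: strip_annotations(value, found)
                for key, value in node.items()}
    return node


def run_filter(doc):
    """Suppress annotations in the body and list them in the metadata."""
    found = []
    doc["blocks"] = strip_annotations(doc["blocks"], found)
    if found:
        # Simplified stand-in for milestone two: expose as document metadata.
        doc["meta"]["annotations"] = {
            "t": "MetaList",
            "c": [{"t": "MetaString", "c": f"{k}={v}"} for k, v in found],
        }
    return doc


# Hand-built AST for: About [5 km]{.annotation unit=metric}
doc = {
    "pandoc-api-version": [1, 23, 1],  # assumption; varies between releases
    "meta": {},
    "blocks": [{"t": "Para", "c": [
        {"t": "Str", "c": "About"},
        {"t": "Space"},
        {"t": "Span", "c": [
            ["", ["annotation"], [["unit", "metric"]]],
            [{"t": "Str", "c": "5"}, {"t": "Space"}, {"t": "Str", "c": "km"}],
        ]},
    ]}],
}
result = run_filter(doc)
print(json.dumps(result["meta"]))
```

To run such a transformation as an actual filter, the script would read the AST with `json.load(sys.stdin)`, write the result with `json.dump(..., sys.stdout)`, and be passed to Pandoc through its `--filter` option. Embedding the annotations as XMP or RDFa would need additional, format-specific work that this sketch does not attempt.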

As mentioned above, a draft specification for the syntax of embedding ontological annotations in Markdown has already been written in @Smedegaard2022. The main challenge of this project is to implement that specification as extensions for the existing Pandoc tool and Quarto framework, and as part of that work potentially also to refine the draft specification.