> A Markdown-formatted document should be publishable as-is,
> as plain text,
> without looking like it’s been marked up
> with tags or formatting instructions.
> [@Gruber2004syntax, section "Philosophy"]

The markup language Markdown was introduced in 2004 with the specific aim of helping authors focus on content, separate from layout concerns [@Gruber2004]. It has since been widely adopted as an authoring source format for generating reading-optimised documents in HTML and PDF, formats which over the same period have evolved to support semantic text annotations. Markdown itself, however, can only express structure and hypermedia annotations, not semantic annotations.

Annotating text differs from annotating the document as a whole in that the information is tied to specific text strings. A document annotation can state, for example, "this document contains strong language somewhere" or "this document contains a link to this URL somewhere", whereas a text annotation can state, for example, "this text string is strongly emphasised" or "this text string links to this URL."

Semantic text annotation is the process of applying information about meaning to specific text strings. Some Markdown dialects support prepending a metadata section to the whole content markup (i.e., semantic *document* annotation), but tying semantics to specific strings is currently unsupported. The author cannot express "this currency amount is in 1980 dollars" or "this uses the derogatory meaning of the term".

A few years ago, a blog post called for extending Markdown to cover semantic text annotations [@Francart2020]. This led to the draft specification Semantic Markdown [@Smedegaard2022], but no implementation was made at the time.

Pandoc is a common and versatile Markdown processor, supporting two-way translation between Markdown and many other document formats. Pandoc supports adding import and export plugins, as well as filters that manipulate its Abstract Syntax Tree (AST).
This project aims to implement the Semantic Markdown draft specification by way of a Pandoc filter that handles semantic text annotations.

## Problem formulation

Thus, **how can Pandoc be extended to support semantic text annotations?**

* What are the core qualities of Markdown, and how can a Markdown dialect express semantic text annotations while maintaining those qualities?
* *TODO: Analysis of spec, phrased as a question*
* How does Pandoc parse Markdown into a generalised content structure, and how can that process be extended to handle semantic text annotations?
* Which approach to extending Pandoc is more likely to be sustainable in the long term?
* How can the reliability of a Pandoc extension be evaluated?

*TODO: Maybe reframe as an investigation of the spec, implementing it to identify its strengths and weaknesses*

*FIXME: align above questions with actual findings*

## Project constraints

Driven by an interest in sustaining development of this research project beyond the scope of this paper, the project has been voluntarily constrained by the following early design decisions:

* The current delivery is intentionally scoped as a minimum viable product, with additional features planned as separate future work.
* Programming language and integration design are largely dictated by existing, actively used systems, rather than by the convenience of personal familiarity or efficiency.
* The project solely involves freely licensed tools and resources, and is itself licensed under Free licences that encourage collaboration.

### Scope limited to authoring

The scope of this project is to enable authoring, with a further aim of extending to more complex processing in the future. The primary aim is to enable authors to annotate semantics as an integral part of their creative writing process. Future work may expand on this, enhancing renderings of the authored content, as discussed in @sec-rdf.
The idea is to introduce semantic text annotation to Markdown authoring without disrupting existing Markdown-based workflows. A filter is added to the document generation workflow that strips semantic text annotations from the Markdown content, so that further processing need not know about the new markup, as illustrated in @fig-phase1.

![A filter to strip annotations from content.](workflow/phase1.svg){#fig-phase1}

### Improve accepted formats and tools {#sec-improve}

Markdown is a widely adopted authoring format, and Pandoc a widely adopted Markdown processor. It is easier to invent a new format in a new tool than to convince widely adopted ones to evolve; the latter, however, is likely to be far more reliable if it succeeds. This project aims to stay close to existing Markdown, introducing new syntax only minimally, unlike, for example, SAM, which deviates notably from Markdown [@SAM2018].

Many Markdown processors exist, but few are in widespread use. One of these is Pandoc, which is integrated with multiple wrapper tools, such as R Markdown and the Quarto document publishing system, to streamline the production of, for example, academic papers. Yet another Markdown parser would probably be easiest to implement as a standalone project, either by forking and customising an existing one, or by using a library for common components and subclassing as needed. This project, however, strives to reuse existing formats and tools, and has chosen to extend the existing Markdown processor Pandoc.

### Collaborative licensing

To encourage collaboration and stimulate a circular gift economy as introduced by @Mikkelsen2000, this project exclusively uses freely licensed tools and resources, and the project itself is copyleft licensed: code parts are licensed under the GNU General Public License version 3 or later, and non-code parts under the Creative Commons Attribution-ShareAlike 4.0 licence.

## Implementation plan

*FIXME: sharpen this end summary, to better conclude the previous parts*

* an extension to an existing, widespread authoring language, not an alternative one
* a subtle extension, both to ease adoption by existing Markdown users and to keep the original principle that it "should be publishable as-is"
* ease of user adoption rather than ease of implementation

Achieving these goals requires an understanding of Markdown and annotations, and it requires working code. Existing Markdown syntax is analysed in @sec-commonmark, and the proposed new coverage of annotations in @sec-semantic-markdown. (Eventual rendering of annotations is outside the scope of this paper and is only briefly discussed in @sec-rdf.) Existing Markdown parsing is then described in @sec-pandoc, and the implementation that extends it to parse annotations is described in @sec-filter, and evaluated and discussed in the later chapters.

*TODO: Maybe rephrase Pandoc as being not "described" but "analysed"?*
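
As a minimal sketch of the stripping step illustrated in @fig-phase1, the following Python function removes annotation wrappers from a document in Pandoc's JSON AST form while keeping the annotated text. It is an illustration only, not the project's filter. It assumes, hypothetically, that annotations have already been parsed into `Span` elements carrying a class named `semantic`; neither that class name, nor the `unit` attribute in the sample, nor this representation is taken from the Semantic Markdown draft. Python is used here purely for illustration.

```python
import json

# Hypothetical marker class; an assumption made for this sketch only.
ANNOTATION_CLASS = "semantic"


def is_annotation(node):
    """Return True for a Span element that carries the marker class."""
    if not (isinstance(node, dict) and node.get("t") == "Span"):
        return False
    # A Pandoc Attr is a triple: identifier, classes, key-value pairs.
    _identifier, classes, _attributes = node["c"][0]
    return ANNOTATION_CLASS in classes


def strip_annotations(node):
    """Return a copy of the AST with annotation Spans unwrapped."""
    if isinstance(node, list):
        stripped = []
        for child in node:
            if is_annotation(child):
                # Splice the Span's inline content into the parent list.
                stripped.extend(strip_annotations(child["c"][1]))
            else:
                stripped.append(strip_annotations(child))
        return stripped
    if isinstance(node, dict):
        return {key: strip_annotations(value) for key, value in node.items()}
    return node


# A one-paragraph document: "in" followed by an annotated "dollars".
sample = {
    "pandoc-api-version": [1, 23, 1],
    "meta": {},
    "blocks": [
        {
            "t": "Para",
            "c": [
                {"t": "Str", "c": "in"},
                {"t": "Space"},
                {
                    "t": "Span",
                    "c": [
                        ["", ["semantic"], [["unit", "USD-1980"]]],
                        [{"t": "Str", "c": "dollars"}],
                    ],
                },
            ],
        }
    ],
}

print(json.dumps(strip_annotations(sample)["blocks"]))
```

Wired up as a Pandoc JSON filter, such a function would read the AST from standard input and write the result to standard output, with the script passed to Pandoc via `--filter`. Spans without the marker class are left untouched, so existing uses of bracketed spans would not be affected under these assumptions.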