A Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions.
[@Gruber2004syntax, section "Philosophy"]

The markup language Markdown was introduced in 2004 with the specific aim of helping authors focus on content, separate from layout concerns [@Gruber2004]. It has since been widely adopted as an authoring source format for generating reading-optimized documents in HTML and PDF, formats that over the same period have evolved to support semantic text annotations. Markdown itself, however, can only express structural and hypermedia annotations, not semantic annotations.
Annotating text differs from annotating the document as a whole in that the information is tied to specific text strings. A document annotation can say, e.g., "this document contains strong language somewhere" or "this document contains a link to this URL somewhere", whereas a text annotation can say, e.g., "this text string is strongly emphasized" or "this text string links to this URL".

Semantic text annotation is the process of applying information about meaning to specific text strings. Some Markdown dialects support prepending a metadata section to the whole content markup, i.e. semantic document annotation, but tying semantics to specific strings is currently unsupported. The author cannot express "this currency amount is in 1980 dollars" or "this uses the derogatory meaning of the term".

A draft specification exists for a Markdown syntax extension to cover semantic text annotations, called "Semantic Markdown" [@Smedegaard2022].

FIXME: Elaborate on above draft spec and its relevancy.

TODO: Maybe move above paragraph down to implementation plan.

This project aims to enable authors to include semantic annotations as part of their writing, as unobtrusively as structural and hypermedia markup, by extending the Markdown processor Pandoc to handle semantic text annotations, inspired by the syntax extension Semantic Markdown.

FIXME: Rewrite and expand above to properly introduce Pandoc.

Problem formulation

How can Pandoc be extended to support semantic text annotations?

  • What are the core qualities of Markdown, and how can a Markdown dialect express semantic text annotations while maintaining those qualities?
  • How does Pandoc convert Markdown to HTML or PDF, and how can this workflow be extended to handle semantic text annotations?
  • Which approach to extending Pandoc is most likely to be sustainable in the long term?
  • How can the reliability of a Pandoc extension be evaluated?

Levels of implementation

FIXME: maybe move this subsection to later subsection on expectations for processors

The primary aim is to support authoring; enhancing renderings of the authored content is secondary.

Annotations are metadata, not directly part of the content. Consequently, the simplest way to correctly process Markdown with semantic text annotations is to omit the annotations altogether, rendering the content as if they had not been applied at all, as illustrated in @fig-phase1.

Simplest implementation level.{#fig-phase1}

More advanced processing may optionally embed semantic text annotations as semantic document annotations, i.e. embed as document-wide metadata.

Even more advanced processing might enhance the content rendering, similar to how some PDF renderers provide interactive navigation for embedded hypermedia markup like links and anchors.

Due to the limited time available, this project aims at the simplest processing level described above (further processing is discussed in @sec-rdf).

Maintaining usability and interoperability

A notable challenge is aligning with existing practice and systems. Markdown is known for its unobtrusive plaintext editing format, and an extension to its syntax will need to fit that principle. Also,
FIXME: drop or rewrite...

Implementation plan

FIXME: drop or rewrite below section

Pandoc reads a text document, parses its structural components into an Abstract Syntax Tree (AST), and serialises and writes back into a text document. The Pandoc AST deliberately prioritises structural information and is relaxed about visual information, to preserve literal content while reducing format-specific stylistic details, relevant especially when processing between different formats. Most common is to read plaintext Markdown files and write LaTeX code further compiled into a PDF file.
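
To make the role of the AST concrete, the following minimal sketch walks a hand-written document in Pandoc's JSON serialisation of the AST and collects its literal text while ignoring styling. The sample document, the API version number, and the helper name `to_plain` are assumptions made for illustration; the sketch handles only a few node types and is not how Pandoc's own writers are implemented.

```python
# Hand-written AST in Pandoc's JSON form, roughly what the Markdown
# "Hello *world*" parses to. The version number is an assumption.
doc = {
    "pandoc-api-version": [1, 23, 1],
    "meta": {},
    "blocks": [
        {"t": "Para", "c": [
            {"t": "Str", "c": "Hello"},
            {"t": "Space"},
            {"t": "Emph", "c": [{"t": "Str", "c": "world"}]},
        ]},
    ],
}


def to_plain(node):
    """Recursively collect literal text from an AST node, dropping styling.

    Toy helper (hypothetical): only Str, Space and SoftBreak yield text;
    every other node type contributes whatever text its children hold.
    """
    if isinstance(node, list):
        return "".join(to_plain(child) for child in node)
    if isinstance(node, dict) and "t" in node:
        if node["t"] == "Str":
            return node["c"]
        if node["t"] in ("Space", "SoftBreak"):
            return " "
        return to_plain(node.get("c", []))
    return ""  # bare strings and numbers inside attributes carry no content


plain = to_plain(doc["blocks"])
```

The emphasis node survives in the tree as structure, yet the literal content can be recovered without it, which is the separation of content from format-specific detail described above.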

Pandoc allows supplying custom reader and writer functions as well as plugging into and manipulating the AST. This project will exploit the latter: it will write an extension, in the form of a filter, that adjusts the AST produced when abusing the default Markdown reader to read Markdown with added markup for ontological annotations.
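
As a rough sketch of such an AST adjustment, the code below removes annotation markup from the JSON form of the AST while keeping the annotated text. It assumes, purely for illustration, that annotations arrive as Pandoc `Span` elements carrying a key named `annotation`, as Pandoc's bracketed span syntax `[text]{annotation="..."}` would produce; this is not the Semantic Markdown syntax, and the names `ANNOTATION_KEY`, `is_annotation_span`, `strip_annotations` and `main` are hypothetical.

```python
import json
import sys

ANNOTATION_KEY = "annotation"  # hypothetical attribute name, not from the draft spec


def is_annotation_span(node):
    """True for a Span whose key-value attributes include ANNOTATION_KEY."""
    if not (isinstance(node, dict) and node.get("t") == "Span"):
        return False
    (_ident, _classes, keyvals), _inlines = node["c"]
    return any(key == ANNOTATION_KEY for key, _value in keyvals)


def strip_annotations(node):
    """Return a copy of the tree with annotation spans replaced by their content."""
    if isinstance(node, list):
        result = []
        for child in node:
            if is_annotation_span(child):
                # Splice the span's inlines into the parent, dropping the wrapper.
                result.extend(strip_annotations(child["c"][1]))
            else:
                result.append(strip_annotations(child))
        return result
    if isinstance(node, dict):
        return {key: strip_annotations(value) for key, value in node.items()}
    return node


def main():
    """Act as a JSON filter: read an AST from stdin, write the result to stdout."""
    json.dump(strip_annotations(json.load(sys.stdin)), sys.stdout)


# Small in-memory example: "It cost [$100]{annotation="1980 dollars"}".
sample = {"t": "Para", "c": [
    {"t": "Str", "c": "It"}, {"t": "Space"},
    {"t": "Str", "c": "cost"}, {"t": "Space"},
    {"t": "Span", "c": [["", [], [["annotation", "1980 dollars"]]],
                        [{"t": "Str", "c": "$100"}]]},
]}
stripped = strip_annotations(sample)
```

Calling `main()` at the end of the script would let it be passed to Pandoc with the `--filter` option. Spans without the assumed key are left untouched, so other uses of spans in a document are not affected.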

The first milestone is reached when the filter can simply suppress the added markup. A further milestone is to embed the expressed annotations in supported output formats. Another is to make use of the added markup, e.g. to annotate the purpose of scholarly citations as presented in @Daquino2023.
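
One possible starting point for the embedding milestone is to gather the annotation values and store them as document-wide metadata in the AST, from where writers and templates can pick them up. The sketch below is an assumption-laden illustration, not the project's design: it again assumes annotations are `Span` elements with a hypothetical `annotation` key, and the metadata field name `annotations` and the function names are invented for the example. It records only the annotation values, not the text strings they were attached to.

```python
ANNOTATION_KEY = "annotation"  # hypothetical attribute name on Span elements
META_FIELD = "annotations"     # hypothetical metadata field name


def collect_annotations(node, found=None):
    """Gather the values of all annotation attributes in document order."""
    found = [] if found is None else found
    if isinstance(node, list):
        for child in node:
            collect_annotations(child, found)
    elif isinstance(node, dict):
        if node.get("t") == "Span":
            (_ident, _classes, keyvals), _inlines = node["c"]
            found.extend(value for key, value in keyvals if key == ANNOTATION_KEY)
        for value in node.values():
            collect_annotations(value, found)
    return found


def embed_in_meta(doc):
    """Return a copy of the document with annotation values listed in its metadata."""
    values = collect_annotations(doc["blocks"])
    meta = dict(doc["meta"])
    meta[META_FIELD] = {
        "t": "MetaList",
        "c": [{"t": "MetaString", "c": value} for value in values],
    }
    return {**doc, "meta": meta}


# In-memory example document with a single annotated span.
example = {
    "pandoc-api-version": [1, 23, 1],  # version number is an assumption
    "meta": {},
    "blocks": [{"t": "Para", "c": [
        {"t": "Span", "c": [["", [], [["annotation", "1980 dollars"]]],
                            [{"t": "Str", "c": "$100"}]]},
    ]}],
}
embedded = embed_in_meta(example)
```

Because the values lose their link to specific strings, this corresponds to turning text annotations into document annotations; preserving the link would need a richer metadata structure.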

  • Outline work
  • define what semantic text annotation is
  • analyse existing tools
  • expected output
  • implementation as filter...