A Markdown-formatted document should be publishable as-is,
as plain text,
without looking like it’s been marked up
with tags or formatting instructions.
[@Gruber2004syntax, section "Philosophy"]
The markup language Markdown was introduced in 2004
with the specific aim of helping authors focus on content,
separate from layout concerns
[@Gruber2004].
It has since been widely adopted as an authoring source format
for generating reading-optimized documents in HTML and PDF,
formats that over the same period have evolved
to support semantic text annotations.
But Markdown can only express structure and hypermedia annotations,
not semantic annotations.
Annotating text differs from annotating the document as a whole
in that the information is tied to specific text strings.
A document annotation can say
e.g. "this document contains strong language somewhere"
or "this document contains a link to this URL somewhere",
whereas a text annotation can say
e.g. "this text string is strongly emphasized"
or "this text string links to this URL".
Semantic text annotation
is the process of applying information about meaning
to specific text strings.
Some Markdown dialects support prepending a metadata section
to the whole content markup,
i.e. semantic document annotation,
but tying semantics to specific strings is currently unsupported.
The author cannot express "this currency amount is in 1980 dollars"
or "this uses the derogatory meaning of the term".
A draft specification exists
for a Markdown syntax extension to cover semantic text annotations,
called "Semantic Markdown"
[@Smedegaard2022].
FIXME: Elaborate on above draft spec and its relevancy.
TODO: Maybe move above paragraph down to implementation plan.
This project aims to enable authors to include semantic annotations
as part of their writing,
as unobtrusively as structural and hypermedia markup,
by extending the Markdown processor Pandoc
to handle semantic text annotations,
inspired by the syntax extension Semantic Markdown.
FIXME: Rewrite and expand above to properly introduce Pandoc.
Problem formulation
So,
how can Pandoc be extended to support semantic text annotations?
- What are the core qualities of Markdown,
and how can a Markdown dialect express semantic text annotations
while maintaining those qualities?
- How does Pandoc convert Markdown to HTML or PDF,
and how can this workflow be extended
to handle semantic text annotations?
- Which approach to extending Pandoc
is more likely to be sustainable in the long term?
- How can the reliability of a Pandoc extension be evaluated?
Levels of implementation
FIXME: maybe move this subsection to later subsection on expectations for processors
The primary aim is to support authoring;
enhancing renderings of the authored content is secondary.
Annotations are metadata, not directly part of the content.
Consequently, the simplest way
to correctly process Markdown with semantic text annotations
is to omit the annotations altogether,
rendering as if they had not been applied at all,
as illustrated in @fig-phase1.
{#fig-phase1}
More advanced processing may optionally embed semantic text annotations
as semantic document annotations, i.e. embed as document-wide metadata.
Even more advanced processing might enhance the content rendering,
similar to how some PDF renderers provide interactive navigation
for embedded hypermedia markup like links and anchors.
This project aims at the simplest processing level described above,
due to the limited time available
(further processing is discussed in @sec-rdf).
Maintaining usability and interoperability
A notable challenge is aligning with existing practice and systems.
Markdown is known for its unobtrusive plaintext editing format,
and an extension to its syntax will need to fit that principle.
Also,
FIXME: drop or rewrite...
Implementation plan
FIXME: drop or rewrite below section
Pandoc reads a text document,
parses its structural components into an Abstract Syntax Tree (AST),
and serialises that AST back into a text document.
The Pandoc AST deliberately prioritises structural information
and is relaxed about visual information,
to preserve literal content
while reducing format-specific stylistic details,
which is especially relevant when converting between different formats.
A common workflow is to read plaintext Markdown files
and write LaTeX code that is then compiled into a PDF file.
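
To make the read, AST, write pipeline concrete,
the following minimal sketch hand-writes the JSON form of the AST
for the Markdown paragraph `Hello *world*`
and serialises it with a toy plain-text writer.
The node shapes follow the JSON that `pandoc -t json` emits;
the API version number, `DOC`, `inlines_to_text` and `write_plain`
are illustrative names, not part of Pandoc.

```python
# Hand-written JSON AST for the Markdown paragraph "Hello *world*".
# The version number is illustrative; Pandoc fills in its own.
DOC = {
    "pandoc-api-version": [1, 23, 1],
    "meta": {},
    "blocks": [
        {"t": "Para", "c": [
            {"t": "Str", "c": "Hello"},
            {"t": "Space"},
            {"t": "Emph", "c": [{"t": "Str", "c": "world"}]},
        ]},
    ],
}


def inlines_to_text(inlines):
    """Toy writer: serialise a few inline node types to plain text."""
    parts = []
    for node in inlines:
        kind = node["t"]
        if kind == "Str":
            parts.append(node["c"])
        elif kind == "Space":
            parts.append(" ")
        elif kind == "Emph":
            # The AST records that the text is emphasised, not how it looks;
            # this writer keeps the literal text and drops the emphasis.
            parts.append(inlines_to_text(node["c"]))
    return "".join(parts)


def write_plain(doc):
    """Toy writer for a document consisting only of Para blocks."""
    return "\n\n".join(
        inlines_to_text(block["c"])
        for block in doc["blocks"]
        if block["t"] == "Para"
    )
```

A real writer covers every node type;
the point here is only that the literal content survives
while format-specific styling is left to the writer.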
Pandoc allows supplying custom reader and writer functions
as well as plugging into and manipulating the AST,
which this project will exploit:
This project will write an extension, in the form of a filter,
that adjusts the AST
after the default Markdown reader has been repurposed
to read Markdown with added markup for ontological annotations.
The first milestone is reached
when the filter can simply suppress the added markup.
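
A minimal sketch of such a suppressing filter is shown here,
working on Pandoc's JSON AST with only the Python standard library.
It assumes, purely for illustration,
that annotations reach the AST as `Span` elements
(as produced by Pandoc's `bracketed_spans` extension for `[text]{key="value"}`)
carrying a hypothetical attribute key `property`;
the syntax of the Semantic Markdown draft may well differ.

```python
import json
import sys

# Hypothetical attribute key marking a span as a semantic annotation.
ANNOTATION_KEY = "property"


def is_annotation(node):
    """True for a Span whose key-value attributes include ANNOTATION_KEY."""
    if not (isinstance(node, dict) and node.get("t") == "Span"):
        return False
    _ident, _classes, keyvals = node["c"][0]
    return any(key == ANNOTATION_KEY for key, _value in keyvals)


def strip_annotations(node):
    """Return a copy of the AST with annotation spans replaced by their content."""
    if isinstance(node, list):
        result = []
        for child in node:
            if is_annotation(child):
                # Splice the span's inlines into the parent list,
                # so the text renders as if it had never been annotated.
                result.extend(strip_annotations(child["c"][1]))
            else:
                result.append(strip_annotations(child))
        return result
    if isinstance(node, dict):
        return {key: strip_annotations(value) for key, value in node.items()}
    return node


def run_as_filter(source=sys.stdin, sink=sys.stdout):
    """Read a JSON AST from source and write the stripped AST to sink."""
    json.dump(strip_annotations(json.load(source)), sink)
```

Calling `run_as_filter()` at the end of a script,
say `strip-annotations.py` (a hypothetical name),
would let Pandoc run it via `pandoc --filter ./strip-annotations.py`.
Spans without the assumed key are left untouched.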
A further milestone is to embed the expressed annotations
in supported output formats.
Yet another milestone is to make use of the added markup,
e.g. to annotate the purpose of scholarly citations
as presented in @Daquino2023.
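
The milestone of embedding annotations could start
by collecting them into document-wide metadata,
which output templates can then pick up.
The sketch below is one possible shape, under loud assumptions:
annotations are `Span` elements with a hypothetical attribute key `property`,
and the metadata field name `annotations` is likewise made up for illustration.

```python
# Hypothetical names: the attribute key and the metadata field.
ANNOTATION_KEY = "property"
META_FIELD = "annotations"


def plain_text(inlines):
    """Flatten a few common inline node types to plain text."""
    parts = []
    for node in inlines:
        kind = node["t"]
        if kind == "Str":
            parts.append(node["c"])
        elif kind in ("Space", "SoftBreak"):
            parts.append(" ")
        elif kind in ("Emph", "Strong"):
            parts.append(plain_text(node["c"]))
        elif kind == "Span":
            parts.append(plain_text(node["c"][1]))
    return "".join(parts)


def find_annotations(node):
    """Yield (property, text) for every annotated Span, in document order."""
    if isinstance(node, list):
        for child in node:
            yield from find_annotations(child)
    elif isinstance(node, dict):
        if node.get("t") == "Span":
            _ident, _classes, keyvals = node["c"][0]
            for key, prop in keyvals:
                if key == ANNOTATION_KEY:
                    yield prop, plain_text(node["c"][1])
        # Keep descending, so nested annotations are found as well.
        for child in node.values():
            yield from find_annotations(child)


def embed_annotations(doc):
    """Return a copy of doc listing its annotations under meta[META_FIELD]."""
    entries = [
        {"t": "MetaMap", "c": {
            "property": {"t": "MetaString", "c": prop},
            "text": {"t": "MetaString", "c": text},
        }}
        for prop, text in find_annotations(doc["blocks"])
    ]
    meta = dict(doc["meta"])
    meta[META_FIELD] = {"t": "MetaList", "c": entries}
    return {**doc, "meta": meta}
```

The body of the document is left unchanged here,
so this step can be combined with suppressing the inline markup
or used on its own.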
- Outline work
- define what semantic text annotation is
- analyse existing tools
- expected output
- implementation as a filter...