---
title: |
  Processing Semantic Text Annotations
  in Markdown
subtitle: Semantic Markdown Implemented as a Pandoc Filter
date: 2025-05-27
toc-depth: 2
format:
  stylish-report-pdf:
    pdfversion: "2.0"
    pdfstandard: [A-4f, UA-2]
    # generate Well-Tagged PDF using the tagging key
    # requires documentmetadatasupport 1.0n (2025-03-25)
    tagging: on
    # else generate Well-Tagged PDF using latest testphase components
    # requires documentmetadatasupport 1.0k (2024-12-21)
    pdftestphase: latest
    # fail on error generating Well-Tagged PDF
    pdftestphasestrict: true
    # debugging of Well-Tagged PDF
    pdfdebug: ["para"]
metadata-files:
  - _actors.yml
keywords:
  - Markdown
  - CommonMark
  - Lua
  - pandoc
  - semantic authoring
  - semantic annotation
link-citations: true
bibliography: ref.bib
csl: apa
resource-path:
  - /usr/share/citation-style-language/styles
filters:
  - en2em
  - nobreaks
  - include-code-files
include-in-header:
  # add alt text to embedded image, to aid non-visual rendering
  - text: |
      \makeatletter
      \renewcommand{\doclicenseImage}[1][]{%
        \setkeys{doclicense}{#1}
        \href{\doclicenseURL}{%
          \includegraphics[
            alt=\doclicenseText,%
            width=\doclicense@imagewidth%
          ]{\doclicenseImageFileName}%
        }
      }
      \makeatother
  # use British spelling "licence", and append link to published source
  - text: |
      \makeatletter
      \@namedef{doclicense@lang@word@license}{ licence.\
        Source is available at
        \href{https://source.jones.dk/sem-md}{https://source.jones.dk/sem-md}
        and decentrally with
        \href{https://app.radicle.xyz/nodes/seed.radicle.garden/rad:z3ckfAxH326pgPFKe7rDYJ4mxBoWo}{%
          rad:z3ckfAxH326pgPFKe7rDYJ4mxBoWo}%
      }
      \makeatother
#keep-tex: true
---
Abstract
This report explores the extension of the Markdown markup language
to support semantic text annotations,
focusing on implementation via a Pandoc filter.
Markdown, originally designed for simplicity and readability,
lacks native support for tying semantic information to specific text strings.
The project implements the Semantic Markdown draft specification
by misparsing semantic annotations as regular Markdown
and subsequently cleaning them up through a Lua-based Pandoc filter.
This approach preserves Markdown's core qualities
while enabling enriched semantic authoring
integrated with existing workflows.
The report analyses Markdown syntax,
the Pandoc Abstract Syntax Tree, and the filter API
to achieve this integration.
Future work aims to enhance annotation extraction and rendering,
facilitating improved workflows
for collaborative and automated document processing.
::: {lang=da}
Denne rapport undersøger udvidelsen af markup-sproget Markdown
til at understøtte semantiske tekstannotationer
med fokus på implementering via et Pandoc-filter.
Markdown, oprindeligt designet til enkelhed og læsbarhed,
mangler indbygget støtte
til at knytte semantisk information til specifikke tekststrenge.
Projektet implementerer udkastet til Semantic Markdown-specifikationen
ved først at fejltolke semantiske annotationer som almindelig Markdown
og derefter rense dem via et Lua-baseret Pandoc-filter.
Denne tilgang bevarer Markdowns kerneegenskaber
samtidig med, at den muliggør beriget semantisk forfatterskab
integreret i eksisterende arbejdsgange.
Rapporten analyserer Markdown-syntaks,
Pandocs abstrakte syntaks-træ og filter-API'en
for at opnå denne integration.
Fremtidige arbejder sigter mod at forbedre
udtrækning og gengivelse af annotationer,
hvilket understøtter bedre arbejdsgange
for samarbejde og automatiseret dokumentbehandling.
:::
Introduction
{{< include _intro.qmd >}}
Markdown syntax for annotations
{{< include _markdown.qmd >}}
How Pandoc processes Markdown {#sec-pandoc}
{{< include _pandoc.qmd >}}
Pandoc filter sem-md {#sec-filter}
{{< include _filter.qmd >}}
Evaluation of sem-md
{{< include _evaluation.qmd >}}
Discussion and Conclusion
{{< include _conclusion.qmd >}}
Bibliography {.appendix}
\begingroup
\raggedright
::: {#refs}
:::
\endgroup
\appendix
Appendix: Pandoc filter sem-md {.appendix}
Appendix: Markdown syntax as PEG {.appendix #sec-def-peg}
Appendix: Markdown syntax as syntax diagrams {.appendix #sec-def-dia}
{{< include _syntax.qmd >}}