This chapter provides an analysis
of the data format Markdown
and the Markdown-based publishing system Quarto.
This project mainly involves navigating in and altering data structures.
The main data structures are the document formats Markdown, HTML and PDF,
and the abstract data language RDF,
serialised as RDFa (embedded in HTML) and XMP (embedded in PDF).
Markdown
Structural and layout annotation, and metadata
Original Markdown provides unobtrusive markup
for content and hypermedia structure,
to ease the authoring of style-agnostic hypermedia content.
Later dialects extend the language
to cover more content and hypermedia structure,
style annotation
and text-wide metadata.
The separation of visual concerns from content and structure
is harnessed by the document converter Pandoc
and the Pandoc-based document authoring framework Quarto:
Pandoc with Quarto plugins and templates
allows annotating a string as a hyperlink or a citation,
declaring authorship, ownership and release date,
and rendering as a scholarly paper
conforming to a prescribed style guide and document format.
Semantic annotation is missing
None of the existing Markdown dialects,
however,
covers annotation of content semantics.
You cannot -- using existing Markdown dialects --
annotate a string as contextually related to some content domain,
in a way that Markdown processors will treat it as such:
When rendering an output document
the annotation is omitted from the text
and optionally accessible as part of document metadata.
Example annotations might include
some numbers in metres and others in nautical miles,
or one citation being supportive and another a rebuttal,
or one quote using "she" as a personal pronoun
and another using it derogatorily.
Such meta information tied not to the document as a whole
but to specific strings in the text
cannot be written as such --
i.e. structurally part of the writing
but communicatively meta to the prose content of the text.
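The gap can be illustrated with a minimal Python sketch. It assumes a hypothetical convention in which a string is annotated with Pandoc's bracketed-span syntax, `[text]{key="value"}`; the attribute name `unit` and its values are invented for illustration and are not part of any existing dialect. The function removes the annotation from the prose and keeps it as a record tied to the specific string.

```python
import re

# Hypothetical convention: [text]{key="value"} attaches one attribute to `text`.
SPAN = re.compile(r'\[([^\]]+)\]\{(\w+)="([^"]*)"\}')


def extract_annotations(source):
    """Return (prose, annotations): prose without markup, one record per annotated string."""
    annotations = []

    def strip(match):
        text, key, value = match.groups()
        annotations.append({"text": text, key: value})
        return text  # only the annotated string itself stays in the prose

    prose = SPAN.sub(strip, source)
    return prose, annotations


source = 'The wreck lies [12]{unit="nautical-mile"} offshore at a depth of [30]{unit="metre"}.'
prose, annotations = extract_annotations(source)
print(prose)
print(annotations)
```

The printed prose no longer carries the units, while the records keep them next to the strings they qualify. A real implementation would operate on the parsed document structure rather than on raw text.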
Markdown is "probably the most popular markup language today"
[@Rapp2023, p. 42].
It was originally defined by @Gruber2004
as a superset of HTML,
improving readability and ease of writing
by adding email-style markup
for common content structure like headers, emphasis, lists and hyperlinks.
A core principle of Markdown is readability:
A Markdown-formatted document should be publishable as-is,
as plain text,
without looking like it’s been marked up
with tags or formatting instructions.
[@Gruber2004, section "Philosophy"].
Many dialects of Markdown have evolved,
some tightening the language for parsing efficiency and disambiguation,
some extending to cover additional structures
and some including support for a YAML or TOML metadata header section.
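As a rough illustration of the metadata header, the following Python sketch separates a leading YAML block from the body of a document. It assumes the Pandoc convention of an opening `---` line and a closing `---` or `...` line; the function name and the sample document are invented for this example, and the header is returned as raw text rather than parsed as YAML.

```python
def split_front_matter(text):
    """Split a document into (header, body); header is '' if no YAML block leads the text."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i, line in enumerate(lines[1:], start=1):
        # Pandoc accepts either '---' or '...' as the closing delimiter.
        if line.strip() in ("---", "..."):
            header = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:])
            return header, body
    return "", text  # unterminated header: treat everything as body


doc = """---
title: Example
author: Jane Doe
---
# Heading

Body text.
"""
header, body = split_front_matter(doc)
print(header)
print(body)
```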
Markdown as originally designed is a source format to produce HTML.
If using only Markdown-defined markup, avoiding HTML tags,
the text is, however, reliably translatable to other formats as well.
Pandoc is a tool that can convert texts in Markdown dialects
into many document formats including HTML and (via LaTeX) PDF,
applying visual style and positioning through templates.
Such document workflows,
including minimal structural markup as part of the creative writing,
applying visual layout as an automated templating process
tied to a target document format,
have been further streamlined for academic texts
in the Quarto document publishing system.
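Such a workflow can be sketched as a thin Python wrapper around the `pandoc` command-line tool. The file names `paper.md`, `paper.html`, `paper.pdf` and `thesis.latex` are placeholders, and the helper functions are hypothetical; only the options `--from`, `--standalone`, `--template` and `-o` are taken from Pandoc itself.

```python
import subprocess


def pandoc_command(source, target, template=None):
    """Build the argument list for converting a Markdown file with pandoc."""
    command = ["pandoc", source, "--from=markdown", "--standalone", "-o", target]
    if template is not None:
        # The template carries the visual layout for the target format.
        command.append(f"--template={template}")
    return command


def convert(source, target, template=None):
    """Run pandoc; needs pandoc on PATH, and a LaTeX engine for PDF targets."""
    subprocess.run(pandoc_command(source, target, template), check=True)


print(pandoc_command("paper.md", "paper.html"))
print(pandoc_command("paper.md", "paper.pdf", template="thesis.latex"))
```

The same source file serves both targets; only the output name and the template differ, which is the separation of content from layout described above.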
Markdown is a text markup language
with an emphasis on being easy for humans to read
[@Gruber2004].
Compared to word processors like Microsoft Word and LibreOffice Writer,
Markdown authoring stores both content and markup together
in a human-readable text file.
::: {#fig-formality}
informal /---------formatted text----------\ formal
<------v-------------v-------------v-----------------------v---->
plain text informal markup formal markup binary format
(Markdown) (HTML, XML, etc.)
Markdown is informal, ASCII-based markup
[@Leonard2016, p. 4]
:::
HTML is itself a plaintext format,
but is less human-readable.
Similarly the format LaTeX is also plaintext,
but its markup arguably distracts from the reading process
[@Mailund2019chap2, p. 9].
Alternatives
Other human-readable document source formats exist.
TODO: briefly cover reStructuredText, Org-mode and AsciiDoc.
Integration
Markdown is in widespread use.
Major source forges use Markdown by default for README files
[@Github2025; @GitLab2025; @Codeberg2024].
Some major programming languages
natively support Markdown in embedded docstrings
in core tools
[@Microsoft2023; @Oracle2025; @RustTeam2024];
others offer optional support e.g. through plugins
[@Heesch2025; @Sphinx2025; @JSDoc2023].
Pandoc and Quarto
The Markdown processor Pandoc can transform Markdown not only to HTML
but also to other output formats like PDF.
Pandoc offers an API for adapting its content processing
as well as a templating structure for customizing layout,
which is streamlined in the document authoring framework Quarto:
Pandoc with a set of plugins and templates
enables rendering of scholarly papers
conforming to prescribed style guides and document formats.
Pandoc is a document converter built around the Markdown markup language,
able to parse from and serialise to many Markdown dialects
as well as equivalent subsets of other text markup languages
including HTML, LaTeX (and by extension PDF),
Office Open XML (as used by recent releases of Microsoft Word)
and OpenDocument (as used by OpenOffice and LibreOffice).
Pandoc is extensible,
supporting custom code loaded at runtime
either for custom parsing or serialising,
or for manipulating the intermediate internal content structure
called the Abstract Syntax Tree (AST).
Input and output formats can be changed or completely redefined,
and the internal document structure manipulated,
within the automated parts of the framework.
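Manipulation of the internal document structure can be sketched in Python against Pandoc's JSON representation of the AST, in which each element is an object with a type tag `t` and optional content `c`. The sample document below is hand-written for illustration (its `pandoc-api-version` value is only an example), and the `walk` and `shout` helpers are hypothetical names, not part of Pandoc.

```python
import json


def walk(node, action):
    """Apply `action` to every AST element (a dict with a 't' key), depth first."""
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        node = {key: walk(value, action) for key, value in node.items()}
        return action(node) if "t" in node else node
    return node  # strings and numbers are left untouched


def shout(element):
    """Example action: upper-case the text of every Str element."""
    if element["t"] == "Str":
        return {"t": "Str", "c": element["c"].upper()}
    return element


# Hand-written JSON AST for the Markdown source "Hello *world*".
document = json.loads("""
{"pandoc-api-version": [1, 23, 1], "meta": {}, "blocks": [
  {"t": "Para", "c": [
    {"t": "Str", "c": "Hello"}, {"t": "Space"},
    {"t": "Emph", "c": [{"t": "Str", "c": "world"}]}]}]}
""")
document["blocks"] = walk(document["blocks"], shout)
print(json.dumps(document["blocks"]))
```

A JSON filter used with `pandoc --filter` would follow the same pattern, reading the document from standard input and writing the modified document to standard output.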
This project is a collection of interrelated POSIX scripts and Pandoc extensions
that enable semantic annotations in Markdown-based authoring workflows:
- filter extension to capture annotations
  - identify semantic metadata in stylistic metadata part of Pandoc YAML header
  - identify semantic metadata in content part of Pandoc document structure
  - append semantic metadata to Pandoc YAML document header
  - strip identified metadata from stylistic metadata and content
- output format extension to generate PDF
  - read semantic metadata from Pandoc YAML document header
  - structure semantic metadata as RDF triples
  - append RDF triples serialized as part of XMP metadata in PDF
- output format extension to generate web page
  - read semantic metadata from Pandoc YAML document header
  - structure semantic metadata as RDF triples
  - append RDF triples serialized as RDFa
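The last two steps of each output extension, structuring metadata as RDF triples and serialising them, can be sketched in Python as follows. The annotation record, its field names and the `example.org` URIs are placeholders invented for this example. N-Triples stands in for the PDF side only because it is the simplest RDF serialisation to show; XMP itself is written as RDF/XML.

```python
from html import escape

# Hypothetical annotation record, as it might be read back from the document header.
# The URIs are placeholders, not part of any published vocabulary.
annotation = {
    "subject": "https://example.org/paper#quote-1",
    "predicate": "https://example.org/vocab#stance",
    "object": "rebuttal",
    "text": "as claimed earlier",
}


def to_triple(record):
    """Structure one record as an RDF triple (subject, predicate, literal object)."""
    return (record["subject"], record["predicate"], record["object"])


def to_ntriples(triple):
    """Serialise a triple with a plain literal object as one N-Triples line."""
    subject, predicate, literal = triple
    escaped = literal.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject}> <{predicate}> "{escaped}" .'


def to_rdfa(triple, text):
    """Serialise the same triple as an RDFa-annotated HTML span around `text`."""
    subject, predicate, literal = triple
    return (
        f'<span about="{escape(subject)}" property="{escape(predicate)}" '
        f'content="{escape(literal)}">{escape(text)}</span>'
    )


triple = to_triple(annotation)
print(to_ntriples(triple))
print(to_rdfa(triple, annotation["text"]))
```

Both serialisations carry the same triple: in the web page it travels with the annotated string, whereas in a metadata packet it is stored apart from the visible text.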
Markdown provides intuitive and unobtrusive markup syntax
for structure like headers, emphasis, lists and hyperlinks.
Pandoc extends Markdown with syntax
for citation annotation
and an optional YAML metadata header.
Quarto extends Markdown further with syntax
for some styling and some convenience macros,
and applies templates for a uniform visual styling
across target document formats.
Interfaces
- Pandoc document object model (DOM)
- Resource Description Framework (RDF)
- Markdown
- CommonMark