


This chapter will introduce a grammar
represented as visual diagrams,
and then provide two analyses
of the markup language Markdown by use of that grammar.
The first is an analysis of a widely used subset of the language
that is relevant for annotation,
and the second is an analysis of a not yet used subset
proposed specifically for the purpose of marking up annotations.
The purpose of these analyses is to identify
what is needed for a Markdown parser
to cover semantic text annotations.
Dialects as sets of extensions
As a prerequisite for these analyses,
it is helpful to clarify a bit of terminology,
specifically the terms "dialect" and "extension".
Markdown is not a single, strictly defined language
but rather a range of slightly varying languages,
all derived from one somewhat ambiguous original specification.
One way to approach this variability,
adopted from the documentation of the Markdown processor Pandoc,
is to refer to each language variation as a "dialect",
and to each set of markup patterns not generally agreed on
as an "extension".
The first analysis covers the Markdown dialect CommonMark,
chosen both because it is widely used --
among its users are (with some additional extensions)
the renderers for README.md files at GitHub and GitLab --
and because its syntax is thoroughly documented,
separately from implementations of parsers of that dialect.
The second analysis covers the Markdown extension Semantic Markdown,
chosen because it covers semantic text annotation
and is the only Markdown extension description that covers it,
as far as the author of this paper is aware.
Additionally,
the embedded language for the annotations themselves
used in this specification
will likely ease future work on enhanced rendering of Markdown,
since it is abstractly equivalent
to the metadata embedding formats of both PDF and HTML
(as discussed in more detail in @sec-rdf).
Formal and visual syntax description
Markdown consists of blocks of content,
optionally preceded by a set of metadata blocks.
An example is shown in @fig-hello:
::: {#fig-hello}
---
author: Jonas Smedegaard
---
# Greeting

Hello, world!

Simple CommonMark-flavored Markdown text
:::
The example contains two different types of blocks
(as well as a set of metadata blocks which will not be covered here):
first a header (see @fig-hello-header)
and then a regular paragraph (see @fig-hello-paragraph).
::: {#fig-hello-header}
# Greeting

Example header block in CommonMark-flavored Markdown
:::
::: {#fig-hello-paragraph}
Hello, world!

Example paragraph block in CommonMark-flavored Markdown
:::
Describing syntax using text snippets quickly becomes tedious, however,
and it is cumbersome to distinguish which parts of an example are syntax
and which are dummy text.
A formal parsing expression grammar
[PEG, described in @Ford2004]
was formulated for this project,
based on the CommonMark specification version 0.31.2
[@MacFarlane2024]
and on the Semantic Markdown draft specification
[@Francart2020].
The grammar,
included as [Appendix @sec-def-peg],
was then used as the basis for the visual syntax diagrams
used in the remainder of this paper.
For example, the structure of @fig-hello is described visually
in @fig-def-Markdown.
{#fig-def-Markdown}
The diagram describes the possible order of elements,
laid out like trains on rails.
An oval element (called "terminal" in PEG)
contains either a regular expression or a string,
and a rectangle element (called "nonterminal" in PEG)
contains a label referencing another element.
A syntax label written in all capital letters indicates
that the referenced syntax is in turn terminal.
An element containing a sub-element
prefixed with an exclamation mark
excludes that sub-element.
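
The three kinds of element can be mimicked in a few lines of Python.
This is only an illustrative sketch, not the grammar in the appendix:
the helper names `terminal`, `not_ahead` and `sequence`
and the toy rule `WORD` are assumptions made up for this example.

```python
import re

def terminal(pattern):
    """Oval element: match a regular expression at the current position."""
    regex = re.compile(pattern)
    def parse(text, pos):
        match = regex.match(text, pos)
        return match.end() if match else None  # new position, or None on failure
    return parse

def not_ahead(parser):
    """'!' prefix: succeed, consuming nothing, only if `parser` fails here."""
    def parse(text, pos):
        return pos if parser(text, pos) is None else None
    return parse

def sequence(*parsers):
    """Elements on the same track: each must match, one after the other."""
    def parse(text, pos):
        for parser in parsers:
            pos = parser(text, pos)
            if pos is None:
                return None
        return pos
    return parse

# Hypothetical rule: a word is a run of non-space characters
# that does not start with a hash sign.
HASH = terminal(r"#")
WORD = sequence(not_ahead(HASH), terminal(r"\S+"))

print(WORD("Hello, world!", 0))  # 6 (position after "Hello,")
print(WORD("# Greeting", 0))     # None (excluded by the '!' element)
```

A rule built this way corresponds to a rectangle element:
it carries a label (`WORD`) that other rules can reference.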
Reading order matters
Syntax diagrams like those used here commonly represent context-free EBNF grammars,
but here they represent a parsing expression grammar,
chosen because context-free grammars are unlikely
to be able to cover Markdown
[@MacFarlane2014].
The significance is clear even in the quite general diagram above.
If you first try to follow the lower "track" instead of the upper one,
then the grammar (wrongly) describes @fig-hello
as a text with no metadata and three content blocks.
The existence of a metadata block is optional,
but checking for its existence is not.
These syntax diagrams are read from left to right and top to bottom,
including at points with choices.
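
The effect of reading order can be reproduced with a small Python sketch
of ordered choice applied to the text of @fig-hello.
The block rules below are deliberately crude assumptions
(a header is any line starting with a hash sign,
a paragraph is any other run of non-blank lines),
and all function names are hypothetical;
the sketch is not the grammar in the appendix.

```python
def content_blocks(lines):
    """Group lines into Header and Paragraph blocks (toy rules)."""
    blocks, paragraph = [], []
    for line in lines + [""]:  # the sentinel flushes the last paragraph
        if line.startswith("#") or not line.strip():
            if paragraph:
                blocks.append(("Paragraph", paragraph))
                paragraph = []
            if line.startswith("#"):
                blocks.append(("Header", [line]))
        else:
            paragraph.append(line)
    return blocks

def metadata_then_content(lines):
    """Upper track: a '---' delimited metadata block, then content."""
    if not lines or lines[0] != "---" or "---" not in lines[1:]:
        return None  # this alternative fails
    end = lines.index("---", 1)
    return {"metadata": lines[1:end], "blocks": content_blocks(lines[end + 1:])}

def content_only(lines):
    """Lower track: content blocks without metadata."""
    return {"metadata": [], "blocks": content_blocks(lines)}

def ordered_choice(lines, alternatives):
    """PEG choice: the first alternative that succeeds wins."""
    for alternative in alternatives:
        result = alternative(lines)
        if result is not None:
            return result
    return None

SOURCE = "---\nauthor: Jonas Smedegaard\n---\n# Greeting\n\nHello, world!\n"
LINES = SOURCE.splitlines()

right = ordered_choice(LINES, [metadata_then_content, content_only])
wrong = ordered_choice(LINES, [content_only, metadata_then_content])

print(len(right["blocks"]), right["metadata"])  # 2 ['author: Jonas Smedegaard']
print(len(wrong["blocks"]), wrong["metadata"])  # 3 []
```

Both alternatives accept the input,
so only the order in which they are tried decides the outcome.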
Syntax of dialect CommonMark {#sec-commonmark}
More specifically,
the example in @fig-hello contains two different types of blocks
(as well as a set of metadata blocks which will not be covered here),
first a header and then a regular paragraph.
These are defined in the visual syntax as block types Header and Paragraph:
Header consists of space-delimited words followed by a line break,
and Paragraph consists of lines of space-delimited words
followed by two or more line breaks.
See @fig-def-blocks for syntax diagrams for those block types.
::: {#fig-def-blocks}


Syntax of Block, Header and Paragraph.
:::
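
As a rough illustration of these two block types,
the following Python sketch expresses them as regular expressions
and applies them to the content part of @fig-hello.
Several details are assumptions made for the example
rather than taken from the grammar:
the header marker is taken to be one to six hash signs and a space,
blank lines after a header are skipped,
and a paragraph may also end at the end of the input.

```python
import re

WORDS = r"\S+(?: \S+)*"  # space-delimited words on one line

# Assumed header marker: one to six '#' and a space; trailing blank lines skipped.
HEADER = re.compile(rf"#{{1,6}} ({WORDS})\n+")
# Lines of words, closed by two or more line breaks or by the end of input.
PARAGRAPH = re.compile(rf"({WORDS}(?:\n{WORDS})*)(?:\n{{2,}}|\n?$)")

def parse_blocks(text):
    """Return (block type, words) pairs, trying Header before Paragraph."""
    blocks, pos = [], 0
    while pos < len(text):
        for name, rule in (("Header", HEADER), ("Paragraph", PARAGRAPH)):
            match = rule.match(text, pos)
            if match:
                blocks.append((name, match.group(1)))
                pos = match.end()
                break
        else:
            raise ValueError(f"no block matches at offset {pos}")
    return blocks

print(parse_blocks("# Greeting\n\nHello, world!\n"))
# [('Header', 'Greeting'), ('Paragraph', 'Hello, world!')]
```
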
In this formalisation,
"Words" are sets of printable characters, including punctuation.
They can be styled,
have a hyperlink annotated
and have CSS structure and styling annotated.
Each of these sets can contain any of the others,
or a set of plain words.
See @fig-def-words for their syntax diagrams.
::: {#fig-def-words}


Syntax of StyledWords, LinkedWords, AnnotatedWords and PlainWords.
:::

Other content blocks and inline types exist
but are omitted in this description,
which is limited to the components affected
by extending the Markdown language with additional types of annotation.
Syntax diagrams for some additional Markdown components are included
as [Appendix @sec-def-dia].
Syntax of extension Semantic Markdown {#sec-semantic-markdown}
Semantic Markdown mainly extends the syntax for AnnotatedWords,
and introduces a new syntax similar to LinkDefinition.
The syntax drawings in this subsection have the extended syntax marked
with a dotted frame.
FIXME: add dot-frame for all drawings used here
AnnotatedWords can in principle contain any word,
but in practice expect CSS id or class definitions,
that is, alphanumeric-only words prefixed by either a dot or a hash sign.
New syntaxes with higher priority are added for URI and CURIE words,
designed not to clash with these,
as shown in @fig-def-extensions.
FIXME: mention and draw extended LinkedWordsX as well.
::: {#fig-def-extensions}

Syntax of AnnotatedWords and LinkedWords, extended with SemWords.
:::
The new SemWords are components of the RDF language:
either an angle-bracketed Uri or a CURIE.
Each component has an optional prefix
to denote whether it is an RDF subject, predicate or object
(these RDF terms are described further in @sec-rdf).
See @fig-def-additions for their syntax diagrams.
FIXME: mention and draw Curie and NAME
::: {#fig-def-additions}


Syntax of SemWords, Curie, SEMPREFIX and NAME.
:::
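
The prioritization of the new word types over CSS definitions
can be sketched in Python as an ordered list of patterns.
The patterns are simplified assumptions, not the appendix grammar:
the optional SEMPREFIX is left out entirely,
the CURIE pattern is far looser than a real CURIE definition,
and the sample words (including `foaf:name`) are made up for illustration.

```python
import re

# Ordered by priority: the new semantic word types come first,
# the pre-existing CSS id and class definitions come last.
RULES = [
    ("Uri", re.compile(r"<[^<>\s]+>")),                    # angle-bracketed
    ("Curie", re.compile(r"[A-Za-z_][\w.-]*:[^\s<>{}]+")), # prefix:reference
    ("CssId", re.compile(r"#[A-Za-z0-9]+")),
    ("CssClass", re.compile(r"\.[A-Za-z0-9]+")),
]

def classify(word):
    """Return the name of the first rule matching the whole word."""
    for name, rule in RULES:
        if rule.fullmatch(word):
            return name
    return "Other"

annotation = "{#intro .note <http://example.org/me> foaf:name}"
words = annotation.strip("{}").split()
print([(word, classify(word)) for word in words])
# [('#intro', 'CssId'), ('.note', 'CssClass'),
#  ('<http://example.org/me>', 'Uri'), ('foaf:name', 'Curie')]
```

Because none of these patterns can match a word accepted by another,
the order mainly documents intent here;
it would start to matter once looser patterns are admitted.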
Expectations of processors
Parsing should be human-friendly,
in the spirit of Markdown
(see @sec-spirit).
This may translate to the annotations being dropped from the rendered output,
unlike Markdown markup in general but like link definition blocks.
For a Markdown parser to cover the Markdown extension Semantic Markdown,
it needs to cover the existing extension AnnotatedWords,
extended to contain URIs and CURIEs,
and it needs to cover AnnotatedWords not only immediately after Words,
but also as leading or trailing Words of a block.
Additionally, a new block type needs to be covered,
similar to LinkDefinition but with a simpler structure
and a CURIE as initial element.
These new Word and Block syntaxes should be prioritized,
as the restricted patterns tied to CURIEs are unlikely to collide
with existing Markdown or non-markup plain text.