


This chapter will first introduce a grammar
translated into visual diagrams,
and then provide two analyses
of the markup language Markdown by use of that grammar.
The first analysis covers a widely used subset of the language
that is relevant for annotation,
and the second covers a subset not yet in use,
proposed specifically for marking up annotations.
The purpose of these analyses is to identify
what is needed for a Markdown parser
to cover semantic text annotations.
Dialects as sets of extensions
As a prerequisite for these analyses,
it is helpful to clarify a bit of terminology,
specifically the terms "dialect" and "extension":
Markdown is not a single, strictly defined language
but rather a range of slightly varying languages,
all derived from one somewhat ambiguous original specification.
One way to approach this variability,
adopted from the documentation of the Markdown processor Pandoc,
is to refer to each language variation as a "dialect",
and to each set of markup patterns not generally agreed on
as an "extension".
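
The view of a dialect as a base language plus a set of extensions can be illustrated with the format strings that Pandoc accepts, in which extensions are enabled with `+NAME` and disabled with `-NAME` after the format name.
The following is a minimal sketch of that idea only, not Pandoc's own implementation;
the function name `parse_format_spec` is hypothetical,
and the extension names in the usage line serve merely as examples.

```python
import re


def parse_format_spec(spec: str) -> tuple[str, set[str], set[str]]:
    """Split 'dialect+ext-ext' into the dialect and its extension toggles."""
    match = re.fullmatch(r"([a-z0-9_]+)((?:[+-][a-z0-9_]+)*)", spec)
    if match is None:
        raise ValueError(f"not a format specification: {spec!r}")
    dialect, toggles = match.groups()
    # Extensions switched on relative to the dialect's defaults.
    enabled = set(re.findall(r"\+([a-z0-9_]+)", toggles))
    # Extensions switched off relative to the dialect's defaults.
    disabled = set(re.findall(r"-([a-z0-9_]+)", toggles))
    return dialect, enabled, disabled


dialect, enabled, disabled = parse_format_spec("commonmark+attributes-smart")
print(dialect, sorted(enabled), sorted(disabled))
# prints: commonmark ['attributes'] ['smart']
```
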
The first analysis covers the Markdown dialect CommonMark,
chosen both because it is widely used --
among its users are (with some additional extensions)
the renderers for README.md files at GitHub and GitLab --
and because its syntax is thoroughly documented,
separately from implementations of parsers of that dialect.
The second analysis covers the Markdown extension Semantic Markdown,
chosen because it covers semantic text annotation
and is the only Markdown extension description that covers it,
as far as the author of this paper is aware.
Additionally,
the embedded language for the annotations themselves
used in this specification
will likely ease future work on enhanced rendering of Markdown,
since it is abstractly equivalent
to the metadata embedding formats of both PDF and HTML
(as discussed in more detail in @sec-rdf).
Formal and visual syntax description
Markdown consists of blocks of content,
optionally preceded by a set of metadata blocks.
An example is shown in @fig-hello:
::: {#fig-hello}
---
author: Jonas Smedegaard
---
# Greeting

Hello, world!

Simple CommonMark-flavored Markdown text
:::
The example contains two different types of blocks
(as well as a set of metadata blocks, which will not be covered here):
first a headline (see @fig-hello-header)
and then a regular paragraph (see @fig-hello-paragraph).
::: {#fig-hello-header}
# Greeting

Example header block in CommonMark-flavored Markdown
:::
::: {#fig-hello-paragraph}
Hello, world!

Example paragraph block in CommonMark-flavored Markdown
:::
Describing syntax using text snippets quickly becomes tedious, however,
and it is cumbersome to distinguish which parts of an example are syntax
and which are dummy text.
A formal parsing expression grammar
[PEG, described in @Ford2004]
was formulated for this project,
based on the CommonMark specification version 0.31.2
[@MacFarlane2024]
and on the Semantic Markdown draft specification
[@Francart2020].
This grammar,
included as [Appendix @sec-def-peg],
was then used as the basis for the visual syntax diagrams
used throughout the rest of this paper.
For instance, the example above is described visually
in @fig-def-Markdown.
{#fig-def-Markdown}
The diagram describes the possible order of elements,
laid out like trains on rails.
An oval element (called "terminal" in PEG)
contains either a regular expression or a string,
and a rectangle element (called "nonterminal" in PEG)
contains a label referencing another element.
A syntax label written in all capital letters indicates
that the referenced syntax is in turn terminal.
An element containing a sub-element
prefixed with an exclamation mark
matches only where that sub-element does not match
(the not-predicate of PEG).
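
The diagram elements described above can be mimicked with a few small parser functions.
This is a minimal sketch under stated assumptions, not the grammar of the appendix:
the helper names `terminal`, `sequence` and `not_followed_by`
and the rule `plain_word` are hypothetical and chosen for illustration only.

```python
import re
from typing import Callable, Optional

# A parser takes (text, position) and returns the new position, or None on failure.
Parser = Callable[[str, int], Optional[int]]


def terminal(pattern: str) -> Parser:
    """Oval element: match a regular expression at the current position."""
    regex = re.compile(pattern)

    def parse(text: str, pos: int) -> Optional[int]:
        found = regex.match(text, pos)
        return found.end() if found else None

    return parse


def sequence(*parts: Parser) -> Parser:
    """Elements on the same track: each must match, one after another."""

    def parse(text: str, pos: int) -> Optional[int]:
        for part in parts:
            pos = part(text, pos)
            if pos is None:
                return None
        return pos

    return parse


def not_followed_by(part: Parser) -> Parser:
    """Exclamation mark: succeed, consuming nothing, only if `part` fails here."""

    def parse(text: str, pos: int) -> Optional[int]:
        return pos if part(text, pos) is None else None

    return parse


# Hypothetical rule: a word that does not begin with a hash sign.
HASH = terminal(r"#")
WORD = terminal(r"\S+")
plain_word = sequence(not_followed_by(HASH), WORD)

print(plain_word("hello world", 0))  # position after "hello"
print(plain_word("#tag", 0))         # None, since the excluded element matches
```
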
Reading order matters
Syntax diagrams like those used here commonly represent context-free grammars written in EBNF,
but here they represent a parsing expression grammar,
chosen because context-free grammars are unlikely
to be able to cover Markdown
[@MacFarlane2014].
The significance shows even in the quite general diagram above:
if you first try to follow the lower "track" instead of the upper one,
then the grammar (wrongly) describes @fig-hello
as a text with no metadata and three content blocks.
The existence of a metadata block is optional,
but checking for its existence is not.
These syntax diagrams are therefore read from left to right and top to bottom,
also at points with choices:
the upper alternative is tried first,
and a lower one only if the upper one fails.
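
The effect of the reading order can be reproduced with a deliberately crude stand-in for the grammar.
The sketch below is not the PEG of the appendix:
the names `METADATA`, `parse_blocks` and `parse_document` are hypothetical,
and block parsing is reduced to the assumption that a line starting with a hash sign is a header
while other runs of non-blank lines are paragraphs.

```python
import re
from typing import Optional

# Assumed shape of a metadata block: lines enclosed by two "---" lines.
METADATA = re.compile(r"---\n(.*?\n)---\n", re.DOTALL)


def parse_blocks(text: str) -> list[str]:
    """Split text into header lines and paragraphs (runs of non-blank lines)."""
    blocks: list[str] = []
    paragraph: list[str] = []
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            if paragraph:
                blocks.append("\n".join(paragraph))
                paragraph = []
            if line.startswith("#"):
                blocks.append(line)
        else:
            paragraph.append(line)
    if paragraph:
        blocks.append("\n".join(paragraph))
    return blocks


def parse_document(text: str, try_metadata_first: bool = True) -> tuple[Optional[str], list[str]]:
    """Upper track first: check for metadata, then fall through to content blocks."""
    metadata = None
    if try_metadata_first:
        found = METADATA.match(text)
        if found:
            metadata, text = found.group(1), text[found.end():]
    return metadata, parse_blocks(text)


EXAMPLE = "---\nauthor: Jonas Smedegaard\n---\n# Greeting\n\nHello, world!\n"

print(parse_document(EXAMPLE))         # metadata found, two content blocks
print(parse_document(EXAMPLE, False))  # no metadata, three content blocks
```

Skipping the check does not make the parse fail; it silently yields a different, wrong structure, which is why the order of alternatives is part of the grammar.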
Syntax of dialect CommonMark {#sec-commonmark}
The example in @fig-hello contains two different types of blocks
(as well as a set of metadata blocks, which will not be covered here):
first a headline and then a regular paragraph.
These are defined in the visual syntax as the block types Header and Paragraph.
Header consists of space-delimited words followed by a line break,
and Paragraph consists of lines of space-delimited words
followed by two or more line breaks.
See @fig-def-blocks for syntax diagrams for those block types.
::: {#fig-def-blocks}


Syntax of Block, Header and Paragraph.
:::
In this formalisation,
"Words" are sets of printable characters, including punctuation.
They can be styled,
carry a hyperlink,
or carry annotations for CSS structure and styling.
Each of these sets can contain any of the others,
or a set of plain words.
See @fig-def-words for their syntax diagrams.
::: {#fig-def-words}


Syntax of StyledWords, LinkedWords, AnnotatedWords and PlainWords.
:::

Other content blocks and inline types exist
but are omitted in this description,
which is limited to the components affected
by extending the Markdown language with additional types of annotation.
Syntax diagrams for some additional Markdown components are included
as [Appendix @sec-def-dia].
Syntax of extension Semantic Markdown {#sec-semantic-markdown}
Semantic Markdown mainly extends the syntax for AnnotatedWords,
and introduces a new syntax similar to LinkDefinition.
AnnotatedWords can in principle contain any word,
but in practice expect CSS id or class definitions,
that is, alphanumeric-only words prefixed by either a dot or a hash sign.
The new syntaxes are added with higher priority than the existing ones,
since that is the simplest approach
and should not clash with existing elements,
as shown in @fig-def-extensions.
Syntax newly introduced as part of Semantic Markdown
is marked with a dotted frame.
::: {#fig-def-extensions}


Syntax of derived KeyWord and new block PrefixDefinition.
:::
The above changes mainly introduce the new SemWord,
which is either an angle-bracketed Uri or the new Curie,
optionally prefixed with the new SEMPREFIX.
(These new terms relate to the metalanguage RDF,
discussed briefly in @sec-rdf.)
See @fig-def-additions for syntax diagrams of the new syntax.
::: {#fig-def-additions}


Syntax of SemWord, Curie and SEMPREFIX.
:::
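
As a rough textual counterpart to these diagrams,
a SemWord can be approximated with a single regular expression.
This is a sketch under assumptions, not a transcription of the grammar:
the set of SEMPREFIX characters (`=` and `.`) is inferred only from the example in this section,
the character classes for Uri and Curie are simplified,
and the names `SEMWORD` and `classify` are hypothetical.

```python
import re
from typing import Optional

SEMWORD = re.compile(
    r"(?P<semprefix>[=.])?"                                 # assumed SEMPREFIX characters
    r"(?:<(?P<uri>[^<>\s]+)>"                               # angle-bracketed Uri
    r"|(?P<prefix>[A-Za-z][\w-]*)?:(?P<reference>[\w-]+))"  # Curie, prefix may be empty
)


def classify(word: str) -> Optional[dict]:
    """Return the parts of a SemWord, or None if the word is not one."""
    found = SEMWORD.fullmatch(word)
    if found is None:
        return None
    kind = "Uri" if found["uri"] is not None else "Curie"
    return {"kind": kind, **found.groupdict()}


for word in ["=<#artwork>", ".:Image", "bibo:shortDescription", "lang=fr", ".class"]:
    print(word, "->", classify(word))
```

Ordinary attributes such as `lang=fr` and CSS classes such as `.class` are rejected by the pattern,
which reflects the expectation that the new syntax does not collide with existing AnnotatedWords.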
Here is an elaborate example written in Semantic Markdown:
# {=<#artwork> .:Image} Semantics

Simple ontological annotation:
[This][painting] is not a [pipe].

Nested, mixed-use and custom-namespaced annotations:
[[Ceci][painting] n'est pas une [pipe].]{lang=fr bibo:shortDescription}

[painting]: {wd:Q1061035}
  "A painting of a smoking pipe {:depiction}"

[pipe]: {wd:Q104526}
  "A smoking pipe {:depicts}"

{@default}: foaf

{bibo}: http://purl.org/ontology/bibo/

{wd}: http://www.wikidata.org/entity/

...which with filter applied should become this:
# Semantics

Simple ontological annotation:
[This][painting] is not a [pipe].

Nested, mixed-use and custom-namespaced annotations:
[[Ceci][painting] n'est pas une [pipe].]{lang=fr}

[painting]: http://www.wikidata.org/entity/Q1061035
  "A painting of a smoking pipe"

[pipe]: http://www.wikidata.org/entity/Q104526
  "A smoking pipe"
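
How the prefix definitions relate to the compact link targets in the example can be sketched as follows.
This is a minimal illustration and not the filter referred to above:
the function names and regular expressions are hypothetical,
only link definitions with a single braced Curie as target are handled,
and removal of titles' annotations and of the prefix definitions themselves is left out.

```python
import re

# Assumed shape of a PrefixDefinition line: "{prefix}: target"
PREFIX_DEFINITION = re.compile(r"^\{([\w@-]+)\}:[ \t]*(\S+)[ \t]*$", re.MULTILINE)
# Assumed shape of a link definition with a Curie target: "[label]: {prefix:reference}"
LINK_DEFINITION = re.compile(
    r"^(\[[^\]\n]+\]:[ \t]*)\{([A-Za-z][\w-]*):([\w-]+)\}", re.MULTILINE
)


def collect_prefixes(text: str) -> dict[str, str]:
    """Map each defined prefix to its namespace."""
    return {m.group(1): m.group(2) for m in PREFIX_DEFINITION.finditer(text)}


def expand_link_targets(text: str, prefixes: dict[str, str]) -> str:
    """Replace a Curie link target by namespace + reference when the prefix is known."""

    def expand(m: re.Match) -> str:
        namespace = prefixes.get(m.group(2))
        if namespace is None:
            return m.group(0)  # unknown prefix: leave the text untouched
        return m.group(1) + namespace + m.group(3)

    return LINK_DEFINITION.sub(expand, text)


source = (
    "[painting]: {wd:Q1061035}\n"
    '  "A painting of a smoking pipe {:depiction}"\n'
    "\n"
    "{wd}: http://www.wikidata.org/entity/\n"
)
prefixes = collect_prefixes(source)
print(expand_link_targets(source, prefixes).splitlines()[0])
```
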

Suggestions for processors
The purpose of these analyses is parsing,
as covered in the following chapters.
To conclude, here are some general suggestions
for parsing the Markdown extension Semantic Markdown.
For a parser of existing Markdown to cover Semantic Markdown,
it needs to cover the existing extension AnnotatedWords,
extended to contain Uri and Curie,
and it should permit AnnotatedWords not only immediately after Words
but also as the initial or final Words in a block.

Additionally, the new block type PrefixDefinition needs to be covered,
which is similar to LinkDefinition but has a simpler structure,
with Curie as the initial element.
Parsing of Semantic Markdown should be human-friendly,
in the spirit of Markdown
(see @sec-spirit).
For properly matched annotations,
this translates to the markup being removed from the content,
similar to the removal of LinkTarget inlines and LinkDefinition blocks
in core Markdown,
and of non-semantic AnnotatedWords in some dialects.
Non-matched data should be preserved and treated as content.
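
The rule of removing matched annotations while preserving everything else can be sketched on single lines of text.
The following is an illustration under assumptions, not a complete processor:
the names `strip_annotations` and `SEM_TOKEN` are hypothetical,
the SEMPREFIX characters are inferred from the example in @sec-semantic-markdown,
an annotation counts as matched when its prefix is empty or among the known prefixes,
and link targets, which need expansion rather than removal, are not handled.

```python
import re

# Braced attribute list, with an optional single space in front of it.
BRACES = re.compile(r"( ?)\{([^{}\n]*)\}")
# Assumed SemWord shape; group 1 is the Curie prefix, if any.
SEM_TOKEN = re.compile(r"[=.]?(?:<[^<>\s]+>|([A-Za-z][\w-]*)?:[\w-]+)")


def strip_annotations(line: str, known_prefixes: set[str]) -> str:
    """Remove matched semantic tokens from braces; keep all other text as is."""

    def is_matched(token: str) -> bool:
        sem = SEM_TOKEN.fullmatch(token)
        if sem is None:
            return False
        prefix = sem.group(1)
        return prefix is None or prefix in known_prefixes

    def clean(match: re.Match) -> str:
        tokens = match.group(2).split()
        kept = [t for t in tokens if not is_matched(t)]
        if kept == tokens:
            return match.group(0)  # nothing matched: preserve verbatim
        if not kept:
            return ""              # only semantic tokens: drop braces and leading space
        return match.group(1) + "{" + " ".join(kept) + "}"

    return BRACES.sub(clean, line)


known = {"bibo", "wd"}
print(strip_annotations("# {=<#artwork> .:Image} Semantics", known))
print(strip_annotations("[Ceci n'est pas une pipe.]{lang=fr bibo:shortDescription}", known))
print(strip_annotations("[text]{unknown:term}", known))
```

The last call leaves the line unchanged, since the prefix is not defined and the braces are therefore treated as ordinary content.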
These extended or new Word and Block syntaxes --
AnnotatedWords and PrefixDefinition --
should be safe to parse with priority,
because the Curie pattern at their core should not collide
with any existing Markdown syntax
and is unlikely to appear in natural language text.