


This chapter provides two analyses
of the markup language Markdown:
first an analysis of a widely used subset of the language
that is relevant for annotation,
and then an analysis of a not yet used subset
proposed specifically for the purpose of marking up annotations.
Ahead of these analyses,
it helps to define the terms "dialect" and "extension":
Markdown is not a single strictly defined language,
but a range of slightly varying languages
all derived from the same slightly ambiguous original specification.
One way to approach this variability,
used among other places in the documentation of the Markdown processor Pandoc
and reused here,
is to call each language variation a "dialect",
and each set of markup patterns not generally agreed on an "extension".
The first analysis covers the Markdown dialect CommonMark,
chosen both because it is widely used --
among its users are (with some additional extensions)
the renderers for README.md files at GitHub and GitLab --
and because its syntax is thoroughly documented,
separately from implementations of parsers of that dialect.
The second analysis covers the Markdown extension semantic-markdown,
chosen because it covers semantic text annotation
and is the only Markdown extension description that covers it,
as far as the author of this paper is aware.
Additionally,
the language used in this specification
for embedding the annotations themselves
will likely ease future work on enhanced renderings of Markdown,
since it is abstractly equivalent
to the metadata embedding formats of both PDF and HTML
(as discussed in more detail in @sec-rdf).
Syntax of Markdown dialect CommonMark
Markdown consists of blocks of content,
optionally preceded by a set of YAML-formatted metadata blocks.
Visually, this can be described using a syntax diagram
where the possible orders of elements are laid out
like trains on rails,
as seen in @fig-def-Markdown.
{#fig-def-Markdown}
Here is an example:
---
author: Jonas Smedegaard
---
# Greeting

Hello, world!

This example involves
the syntax for the block types Header and Paragraph
(and for MetaBlock which will not be covered here).
Header consists of space-delimited words followed by a line break,
and Paragraph consists of lines of space-delimited words
followed by two or more line breaks.
See @fig-def-blocks for syntax diagrams for those block types.
::: {#fig-def-blocks}


Syntax of Block, Header and Paragraph.
:::
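As a rough illustration of the top-level structure described above, the following Python sketch separates a leading metadata block from the content blocks.
It is a simplification under stated assumptions: it handles only a single metadata block at the very start of the source, delimited by lines containing exactly `---`, and it does not parse the YAML itself.
The function name `split_front_matter` is hypothetical and not part of any Markdown processor.

```python
from typing import Optional, Tuple


def split_front_matter(source: str) -> Tuple[Optional[str], str]:
    """Split one leading metadata block from the remaining content blocks.

    Assumption: the metadata block, if present, starts on the first line
    and is opened and closed by lines consisting of '---'.
    """
    lines = source.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return None, source
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            metadata = "".join(lines[1:index])
            body = "".join(lines[index + 1:])
            return metadata, body
    # No closing delimiter found: treat the whole source as content.
    return None, source


EXAMPLE = "---\nauthor: Jonas Smedegaard\n---\n# Greeting\n\nHello, world!\n"
metadata, body = split_front_matter(EXAMPLE)
```

For the example document used in this section, `metadata` holds the `author` line and `body` holds the Header and Paragraph blocks.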
Reading order matters:
The syntax diagrams should be read left-to-right and top-to-bottom,
also at places with choice --
e.g. the block type Header should be tried before Paragraph.
If Paragraph syntax were tried first,
it would match both blocks,
because that block type begins with any words,
including the characters that define the Header block type.
In other words,
these syntax diagrams do not represent the more common EBNF grammars
but instead a parsing expression grammar [@Ford2004],
chosen because context-free grammars are unlikely
to be able to cover Markdown
[@MacFarlane2014].
The PEG grammar covering all syntax diagrams shown here
is included as [Appendix @sec-def-peg].
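The effect of ordered choice can be sketched in a few lines of Python.
The two regular expressions below are simplified stand-ins for the Header and Paragraph rules, not the grammar from the appendix: the Header pattern assumes one or more hash characters followed by a space, and both patterns also consume trailing blank lines.
The helper `parse_blocks` is hypothetical and only tries the rules in the order given, as a PEG would.

```python
import re

# Simplified stand-ins for the Header and Paragraph rules (assumptions).
HEADER = re.compile(r"#+ [^\n]+\n+")
PARAGRAPH = re.compile(r"(?:[^\n]+\n)+\n*")


def parse_blocks(text, rules):
    """Parse text into blocks, trying rules in the given order (ordered choice)."""
    blocks = []
    position = 0
    while position < len(text):
        for name, pattern in rules:
            match = pattern.match(text, position)
            if match:
                # The first rule that matches wins; later rules are not tried.
                blocks.append((name, match.group(0).strip()))
                position = match.end()
                break
        else:
            raise ValueError(f"no rule matches at offset {position}")
    return blocks


SOURCE = "# Greeting\n\nHello, world!\n"
header_first = parse_blocks(SOURCE, [("Header", HEADER), ("Paragraph", PARAGRAPH)])
paragraph_first = parse_blocks(SOURCE, [("Paragraph", PARAGRAPH), ("Header", HEADER)])
```

With Header tried first, the two blocks are recognised as a Header and a Paragraph; with the order reversed, both blocks are consumed as Paragraph, which is the failure described above.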
Words are sequences of printable characters
(including punctuation).
They can be styled,
be annotated with a hyperlink,
and be annotated with CSS structure and styling.
Each of these word types can contain the others,
or a sequence of plain words.
See @fig-def-words for their syntax diagrams.
::: {#fig-def-words}


Syntax of StyledWords, LinkedWords, AnnotatedWords and PlainWords.
:::

Other content blocks and inline types exist
but are omitted in this description,
which is limited to the components affected
by extending the Markdown language with additional types of annotation.
Syntax diagrams for some additional Markdown components are included
as [Appendix @sec-def-dia].
Syntax of extension Semantic Markdown
Semantic Markdown mainly extends the syntax for AnnotatedWords,
and introduces a new syntax similar to LinkDefinition.
The syntax diagrams in this subsection have the extended syntax marked
with a dotted frame.
FIXME: add dot-frame for all drawings used here
AnnotatedWords can in principle contain any word,
but in practice expect CSS id or class definitions,
that is, alphanumeric-only words prefixed by either a dot or a hash.
New, higher-priority syntaxes for URI and CURIE words,
which should not clash with these,
are added as in @fig-def-extensions.
FIXME: mention and draw extended LinkedWordsX as well.
::: {#fig-def-extensions}

Syntax of AnnotatedWords and LinkedWords, extended with SemWords.
:::
The new SemWords are components of the RDF language
(described further in @sec-rdf):
either an angle-bracketed URI or a CURIE.
Each component has an optional prefix
to denote whether it is an RDF subject, predicate or object
(again, these RDF terms are described further in @sec-rdf).
See @fig-def-additions for their syntax diagrams.
FIXME: mention and draw Curie and NAME
::: {#fig-def-additions}


Syntax of SemWords, Curie, SEMPREFIX and NAME.
:::
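The priority between the new and the existing word syntaxes can be illustrated with a small Python sketch.
The patterns are approximations chosen for illustration only: the CURIE pattern assumes a non-empty prefix, and the optional subject/predicate/object prefix is left out, since its concrete characters are defined by the syntax diagrams rather than by the prose.
The rule names `CssWord` and `OtherWord` and the function `classify` are hypothetical.

```python
import re

# Approximate patterns (assumptions), listed in priority order.
RULES = [
    ("Uri", re.compile(r"<[^<>\s]+>")),                        # angle-bracketed URI
    ("Curie", re.compile(r"[A-Za-z_][A-Za-z0-9_-]*:[^\s<>]+")),  # prefix:reference
    ("CssWord", re.compile(r"[.#][A-Za-z0-9]+")),              # CSS class or id
]


def classify(word):
    """Return the name of the first rule that matches the whole word."""
    for name, pattern in RULES:
        if pattern.fullmatch(word):
            return name
    # AnnotatedWords can in principle contain any word.
    return "OtherWord"


kinds = [classify(w) for w in ["<http://example.org/a>", "foaf:name", ".note", "#intro", "hello"]]
```

Because CSS words are alphanumeric-only after their dot or hash, they contain neither a colon nor angle brackets, so in this sketch the higher-priority rules never capture them.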
Expectations of processors
FIXME: write this!

what should be included
what should not be included
if PDF..., then what?

This section constitutes the "requirements"
Readability
In the source format of Markdown with annotations...

* the syntax for annotation should be relatively human comprehensible
  (in the same spirit as Markdown -- e.g. **strong emphasis**)
* annotation syntax must not conflict with core Markdown
  (i.e. must not introduce ambiguities)
* annotation syntax should not conflict with other Markdown extensions

For syntactically correct and structurally supported annotations...

* visual output must be identical to source without the annotation
* metadata of output must contain the annotation

For syntactically incorrect or structurally unsupported annotations...

* the annotation must not disappear from visual output
* visual output should include the annotation in source form
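
The output requirements above can be made concrete with a toy renderer in Python.
Everything in it is an assumption made for illustration: the span syntax `[words]{annotation}` is a stand-in modelled on Pandoc-style bracketed spans, the validity rule accepts only a single angle-bracketed URI or CURIE-like word, and `render` is a hypothetical function, not an existing processor.

```python
import re

# Hypothetical toy span syntax: [words]{annotation}
SPAN = re.compile(r"\[([^\]]*)\]\{([^}]*)\}")
# Hypothetical validity rule: one angle-bracketed URI or one CURIE-like word.
VALID = re.compile(r"<[^<>\s]+>|[A-Za-z_][\w-]*:[^\s<>]+")


def render(source):
    """Return (visual_output, metadata) for a source string."""
    metadata = []

    def replace(match):
        words, annotation = match.group(1), match.group(2)
        if VALID.fullmatch(annotation):
            # Supported annotation: keep only the words in the visual output
            # and record the annotation as metadata.
            metadata.append((words, annotation))
            return words
        # Unsupported annotation: keep it visible, in source form.
        return match.group(0)

    visual = SPAN.sub(replace, source)
    return visual, metadata


good = render("Hello, [world]{foaf:Person}!")
bad = render("Hello, [world]{not valid}!")
```

In this sketch a supported annotation leaves the visual output equal to the unannotated text while the annotation ends up in the metadata, and an unsupported annotation stays visible exactly as written.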