This chapter provides two analyses of the markup language Markdown: first an analysis of a widely used subset of the language that is relevant for annotation, and then an analysis of a not yet used subset proposed specifically for marking up annotations.

Ahead of these analyses, it helps to define the terms "dialect" and "extension": Markdown is not a single strictly defined language, but a range of slightly varying languages all derived from the same slightly ambiguous original specification. One way to approach this variability, used among other places in the documentation of the Markdown processor Pandoc and reused here, is to call each language variation a "dialect", and each set of markup patterns not generally agreed on an "extension".

The first analysis covers the Markdown dialect CommonMark, chosen both because it is widely used -- its users include (with some additional extensions) the renderers for `README.md` files at GitHub and GitLab -- and because its syntax is thoroughly documented, separately from implementations of parsers of that dialect.

The second analysis covers the Markdown extension semantic-markdown, chosen because it covers semantic text annotation and is, as far as the author of this paper is aware, the only Markdown extension description that does so. Additionally, the embedded language used in this specification for the annotations themselves will likely ease future work on enhanced rendering of Markdown, since it is abstractly equivalent to the metadata embedding formats of both PDF and HTML (as discussed in more detail in @sec-rdf).

## Syntax of Markdown dialect CommonMark

Markdown consists of blocks of content, optionally preceded by a set of YAML-formatted metadata blocks. Visually, this can be described using a syntax diagram where the possible order of elements is laid out like trains on rails, as seen in @fig-def-Markdown.

![Syntax of `Markdown`.](syntax/def_Markdown.svg){#fig-def-Markdown}

Here is an example:

```markdown
---
author: Jonas Smedegaard
---

# Greeting

Hello, world!
```

This example involves the syntax for the block types `Header` and `Paragraph` (and for `MetaBlock`, which will not be covered here). `Header` consists of space-delimited words followed by a line break, and `Paragraph` consists of lines of space-delimited words followed by two or more line breaks. See @fig-def-blocks for syntax diagrams for those block types.

::: {#fig-def-blocks}

![](syntax/def_Block.svg)
![](syntax/def_Header.svg)
![](syntax/def_Paragraph.svg)

Syntax of `Block`, `Header` and `Paragraph`.

:::

Reading order matters: The syntax diagrams should be read left-to-right and top-to-bottom, also at places with choice -- e.g. the block type `Header` should be tried before `Paragraph`. Otherwise, if the `Paragraph` syntax were tried first, it would match both blocks, because that block type begins with any words, including the characters definitive for the `Header` block type. In other words, these syntax diagrams do *not* represent the more common EBNF grammars but instead a parsing expression grammar (PEG) [@Ford2004], chosen because context-free grammars are unlikely to be able to cover Markdown [@MacFarlane2014]. The PEG covering all syntax diagrams shown here is included as [Appendix @sec-def-peg].

Words are sequences of printable characters, including punctuation. They can be styled, have a hyperlink annotated, and have CSS structure and styling annotated. Each of these word types can contain the others, or plain words. See @fig-def-words for their syntax diagrams.

::: {#fig-def-words}

![](syntax/def_StyledWords.svg)
![](syntax/def_LinkedWords.svg)
![](syntax/def_AnnotatedWords.svg)
![](syntax/def_PlainWords.svg)

Syntax of `StyledWords`, `LinkedWords`, `AnnotatedWords` and `PlainWords`.

:::

Other content blocks and inline types exist but are omitted in this description, which is limited to the components affected by extending the Markdown language with additional types of annotation. Syntax diagrams for some additional Markdown components are included as [Appendix @sec-def-dia].

## Syntax of extension Semantic Markdown

Semantic Markdown mainly extends the syntax for `AnnotatedWords`, and introduces a new syntax similar to `LinkDefinition`. The syntax diagrams in this subsection have the extended syntax marked with a dotted frame.

*FIXME: add dot-frame for all drawings used here*

`AnnotatedWords` can in principle contain any word, but in practice expects CSS id or class definitions, which means alphanumeric-only words prefixed by either a dot or a hash. New syntaxes with higher priority, which should not clash with these, are added for URI and CURIE words, as in @fig-def-extensions.

*FIXME: mention and draw extended LinkedWordsX as well.*

::: {#fig-def-extensions}

![](syntax/def_AnnotatedWordsX.svg)

Syntax of `AnnotatedWords` and `LinkedWords`, extended with `SemWords`.

:::

The new `SemWords` are components of the RDF language (described further in @sec-rdf): each is either an angle-bracketed `Uri` or a `CURIE`. Each component has an optional prefix to denote whether it is an RDF subject, predicate or object. (These RDF terms are also described further in @sec-rdf.) See @fig-def-additions for their syntax diagrams.

*FIXME: mention and draw `Curie` and `NAME`*

::: {#fig-def-additions}

![](syntax/def_SemWords.svg)
![](syntax/def_SEMPREFIX.svg)

Syntax of `SemWords`, `Curie`, `SEMPREFIX` and `NAME`.

:::

## Expectations of processors

*FIXME: write this!*

* what should be included
* what should not be included
* if PDF..., then what?

This section constitutes the "requirements".

### Readability

In the source format of Markdown with annotations...

* the syntax for annotation **should** be relatively human comprehensible (in the same spirit as Markdown -- e.g.
  \*\*strong emphasis\*\*)
* annotation syntax **must not** conflict with core Markdown (i.e. must not cause ambiguities)
* annotation syntax **should not** conflict with other Markdown extensions

For syntactically correct and structurally supported annotations...

* visual output **must** be identical to source without the annotation
* metadata of output **must** contain the annotation

For syntactically incorrect or structurally unsupported annotations...

* the annotation **must not** disappear from visual output
* visual output **should** include the annotation in source form
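
The expectations listed above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the specification: that annotations are written as Pandoc-style bracketed spans of the form `[words]{attributes}`, that the simplified patterns below are an acceptable stand-in for CSS, URI and CURIE words, and that the optional subject/predicate/object prefix can be ignored (its characters are not given in this chapter). The function name `render` and its return shape are hypothetical and do not describe any existing processor.

```python
import re

# Assumption: annotations use Pandoc-style bracketed spans, [words]{attributes}.
SPAN = re.compile(r"\[([^\]]+)\]\{([^}]*)\}")

# Assumption: simplified word shapes, not the full grammar of the appendix.
CSS_WORD = re.compile(r"[.#][A-Za-z0-9]+\Z")           # .class or #id
URI_WORD = re.compile(r"<[^<>\s]+>\Z")                 # <http://example.org/x>
CURIE_WORD = re.compile(r"[A-Za-z][A-Za-z0-9]*:[A-Za-z0-9_]+\Z")  # foaf:name


def is_supported(word):
    """Return True if the word looks like a CSS, URI or CURIE word."""
    return any(p.match(word) for p in (CSS_WORD, URI_WORD, CURIE_WORD))


def render(source):
    """Return (visual_text, metadata) for a line of annotated Markdown.

    Supported annotations are removed from the visual text and recorded
    in the metadata. Unsupported annotations stay in the visual text in
    their source form, so that they never silently disappear.
    """
    metadata = []

    def replace(match):
        words = match.group(1)
        attributes = match.group(2).split()
        if attributes and all(is_supported(a) for a in attributes):
            metadata.append((words, attributes))
            return words
        return match.group(0)  # keep the annotation visible as written

    visual = SPAN.sub(replace, source)
    return visual, metadata


if __name__ == "__main__":
    print(render("Hello, [world]{foaf:name}!"))
    print(render("Hello, [world]{??}!"))
```

In this sketch a span whose attributes all match one of the assumed patterns is reduced to its plain words and reported as metadata, while any other span is passed through untouched. A real processor would work on a parsed document rather than on raw text with regular expressions.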