This chapter first introduces a grammar translated into visual diagrams, and then provides two analyses of the markup language Markdown using that grammar: first an analysis of a widely used subset of the language that is relevant for annotation, and then an analysis of a not yet used subset proposed specifically for marking up annotations. The purpose of these analyses is to identify what a Markdown parser needs in order to cover semantic text annotations.

## Dialects as sets of extensions

As a prerequisite for these analyses, it is helpful to clarify two terms, "dialect" and "extension". Markdown is not a single, strictly defined language but rather a range of slightly varying languages, all derived from one somewhat ambiguous original specification. One way to approach this variability, adopted from the documentation of the Markdown processor Pandoc, is to refer to each language variation as a "dialect", and to each set of markup patterns not generally agreed on as an "extension".

The first analysis covers the Markdown dialect CommonMark, chosen both because it is widely used -- among its users are (with some additional extensions) the renderers for `README.md` files at GitHub and GitLab -- and because its syntax is thoroughly documented, separately from implementations of parsers of that dialect.

The second analysis covers the Markdown extension Semantic Markdown, chosen because it covers semantic text annotation and is, as far as the author of this paper is aware, the only Markdown extension description that does so. Additionally, the embedded language used in this specification for the annotations themselves will likely ease future work on enhanced renderings of Markdown, since it is abstractly equivalent to the metadata embedding formats of both PDF and HTML (as discussed in more detail in @sec-rdf).


## Formal and visual syntax description

Markdown consists of blocks of content, optionally preceded by a set of metadata blocks. An example is shown in @fig-hello.

::: {#fig-hello}

```markdown
---
author: Jonas Smedegaard
---

# Greeting

Hello, world!
```

Simple CommonMark-flavored Markdown text

:::

The example contains two different types of blocks (as well as a set of metadata blocks, which will not be covered here): first a headline (see @fig-hello-header) and then a regular paragraph (see @fig-hello-paragraph).

::: {#fig-hello-header}

```markdown
# Greeting
```

Example header block in CommonMark-flavored Markdown

:::

::: {#fig-hello-paragraph}

```markdown
Hello, world!
```

Example paragraph block in CommonMark-flavored Markdown

:::

Describing syntax using text snippets quickly becomes tedious, however, and it is cumbersome to distinguish which parts of an example are syntax and which are dummy text.

A formal parsing expression grammar [PEG, described in @Ford2004] was formulated for this project, based on the CommonMark specification version 0.31.2 [@MacFarlane2024] and on the Semantic Markdown draft specification [@Francart2020]. The grammar, included as [Appendix @sec-def-peg], was then used as the basis for the visual syntax diagrams used in the rest of this paper. For example, the overall structure of the example above is described visually in @fig-def-Markdown.

![Syntax of `Markdown`.](syntax/Markdown.svg){#fig-def-Markdown}

The diagram describes the possible order of elements, laid out like railway tracks. An oval element (called "terminal" in PEG) contains either a regular expression or a string, and a rectangular element (called "nonterminal" in PEG) contains a label referencing another element. A syntax label written in all capital letters indicates that the referenced syntax is in turn a terminal. An element containing a sub-element prefixed with an exclamation mark excludes that sub-element: it matches only where the sub-element does not.

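The diagram elements described above correspond to a small number of parsing primitives. The following is a minimal Python sketch of that correspondence, not the grammar from [Appendix @sec-def-peg]: the helper names (`terminal`, `ref`, `seq`, `not_`) and the toy rules `HASH`, `LINE`, `Header` and `PlainLine` are assumptions made purely for illustration.

```python
import re

# A parser takes (text, pos) and returns the new position, or None on failure.

def terminal(pattern):
    """Oval element: match a regular expression (or literal string) at pos."""
    rx = re.compile(pattern)
    def parse(text, pos):
        m = rx.match(text, pos)
        return m.end() if m else None
    return parse

def ref(rules, name):
    """Rectangular element: look up another rule by its label when used."""
    def parse(text, pos):
        return rules[name](text, pos)
    return parse

def seq(*parsers):
    """Elements following each other on the same track."""
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def not_(parser):
    """Sub-element prefixed with '!': succeed without consuming input,
    but only if the sub-element fails to match here."""
    def parse(text, pos):
        return pos if parser(text, pos) is None else None
    return parse

# Hypothetical toy rules; all-capital labels are terminals.
rules = {}
rules["HASH"] = terminal(r"# ")
rules["LINE"] = terminal(r"[^\n]+\n")
rules["Header"] = seq(ref(rules, "HASH"), ref(rules, "LINE"))
rules["PlainLine"] = seq(not_(ref(rules, "HASH")), ref(rules, "LINE"))

print(rules["Header"]("# Greeting\n", 0))        # 11
print(rules["PlainLine"]("# Greeting\n", 0))     # None
print(rules["PlainLine"]("Hello, world!\n", 0))  # 14
```

Each rule returns the position reached after a successful match, so `None` (rather than `0`) signals failure; this keeps the exclusion element from consuming any input.
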

### Reading order matters

Syntax diagrams like the ones used here commonly represent context-free EBNF grammars, but here they represent a parsing expression grammar, and that difference matters. A PEG was chosen because context-free grammars are unlikely to be able to cover Markdown [@MacFarlane2014].

The significance is clear even in the quite general diagram above: if you first try to follow the lower "track" instead of the upper one, then the grammar (wrongly) describes @fig-hello as a text with no metadata and three content blocks. The existence of a metadata block is optional, but checking for its existence is not. These syntax diagrams are therefore read from left to right and top to bottom, including at points with choices.

## Syntax of dialect CommonMark {#sec-commonmark}

The example in @fig-hello contains two different types of blocks (as well as a set of metadata blocks, which will not be covered here): first a headline and then a regular paragraph. These are defined in the visual syntax as the block types `Header` and `Paragraph`. `Header` consists of space-delimited words followed by a line break, and `Paragraph` consists of lines of space-delimited words followed by two or more line breaks. See @fig-def-blocks for the syntax diagrams of those block types.

::: {#fig-def-blocks}

![](syntax/Block.svg)
![](syntax/Header.svg)
![](syntax/Paragraph.svg)

Syntax of `Block`, `Header` and `Paragraph`.

:::

In this formalisation, "Words" are sets of printable characters, including punctuation. They can be styled, annotated with a hyperlink, or annotated with CSS structure and styling. Each of these sets can contain any of the others, or a set of plain words. See @fig-def-words for their syntax diagrams.

::: {#fig-def-words}

![](syntax/StyledWords.svg)
![](syntax/LinkedWords.svg)
![](syntax/AnnotatedWords.svg)
![](syntax/PlainWords.svg)

Syntax of `StyledWords`, `LinkedWords`, `AnnotatedWords` and `PlainWords`.

:::

Other content blocks and inline types exist but are omitted in this description, which is limited to the components affected by extending the Markdown language with additional types of annotation. Syntax diagrams for some additional Markdown components are included as [Appendix @sec-def-dia].

## Syntax of extension Semantic Markdown {#sec-semantic-markdown}

Semantic Markdown mainly extends the syntax for `AnnotatedWords`, and introduces a new syntax similar to `LinkDefinition`.

`AnnotatedWords` can in principle contain any word, but in practice expects CSS id or class definitions, which means alphanumeric-only words prefixed by either a dot or a hash. The new syntaxes are added with higher priority than the existing ones, since that is simplest and should not clash with existing elements, as shown in @fig-def-extensions. Syntax newly introduced as part of Semantic Markdown is marked with a dotted frame.

::: {#fig-def-extensions}

![](syntax/KeyWordX.svg)
![](syntax/PrefixDefinition.svg)

Syntax of derived `KeyWord` and new block `PrefixDefinition`.

:::

The above changes mainly introduce the new `SemWord`, which is either an angle-bracketed `Uri` or the new `Curie`, optionally prefixed with the new `SEMPREFIX`. (These new terms relate to the metalanguage RDF, discussed briefly in @sec-rdf.) See @fig-def-additions for syntax diagrams of the new syntax.

::: {#fig-def-additions}

![](syntax/SemWord.svg)
![](syntax/Curie.svg)
![](syntax/SEMPREFIX.svg)

Syntax of `SemWord`, `Curie` and `SEMPREFIX`.

:::

Here is an elaborate example written in Semantic Markdown:

```markdown
# {=<#artwork> .:Image} Semantics

Simple ontological annotation:
[This][painting] is not a [pipe].

Nested, mixed-use and custom-namespaced annotations:
[[Ceci][painting] n'est pas une [pipe].]{lang=fr bibo:shortDescription}

[painting]: {wd:Q1061035} "A painting of a smoking pipe {:depiction}"
[pipe]: {wd:Q104526} "A smoking pipe {:depicts}"

{@default}: foaf
{bibo}: http://purl.org/ontology/bibo/
{wd}: http://www.wikidata.org/entity/
```

...which, with the filter applied, should become this:

```markdown
# Semantics

Simple ontological annotation:
[This][painting] is not a [pipe].

Nested, mixed-use and custom-namespaced annotations:
[[Ceci][painting] n'est pas une [pipe].]{lang=fr}

[painting]: https://www.wikidata.org/entity/Q1061035 "A painting of a smoking pipe"
[pipe]: https://www.wikidata.org/entity/Q104526 "A smoking pipe"
```

## Suggestions for processors

The purpose of these analyses is parsing, as covered in the following chapters. To conclude, here are some general suggestions for parsing the Markdown extension Semantic Markdown.

To extend a parser of existing Markdown to cover Semantic Markdown, the parser needs to cover the existing extension `AnnotatedWords`, extended to contain `Uri` and `Curie`. `AnnotatedWords` should be permitted not only immediately after `Words`, but also as the initial or final `Words` in a block. Additionally, the new block type `PrefixDefinition` needs to be covered; it is similar to `LinkDefinition` but has a simpler structure, with `Curie` as the initial element.

Parsing of Semantic Markdown should be human-friendly, in the spirit of Markdown (see @sec-spirit). For properly matched annotations, this means that the markup is removed from the content, similar to the removal of `LinkTarget` inlines and `LinkDefinition` blocks in core Markdown, and of non-semantic `AnnotatedWords` in some dialects. Non-matched data should be preserved and treated as content.

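The rule that matched annotations are removed while non-matched data is preserved can be sketched as follows. This is a minimal illustration under stated assumptions, not the filter used in this project: the regular expression `SEMWORD` is a simplified stand-in for the `SemWord` syntax, the prefix characters `=` and `.` are inferred from the example above and may not be the full `SEMPREFIX` set, and "matched" is assumed here to mean that a CURIE uses a defined prefix (the empty string stands for the default prefix). The function names are hypothetical, and only inline annotation braces are handled, not link definitions or headings.

```python
import re

# Assumption: simplified SemWord = optional prefix character, then either an
# angle-bracketed URI or a CURIE of the form "prefix:local" or ":local".
SEMWORD = re.compile(
    r"^[=.]?(?:<[^<>\s]+>|(?P<prefix>[A-Za-z][\w-]*)?:(?P<local>[\w-]+))$"
)

def filter_annotation(body, known_prefixes):
    """Return the words of an annotation body that are not matched semantics."""
    kept = []
    for word in body.split():
        m = SEMWORD.match(word)
        if m is None:
            kept.append(word)  # not semantic (e.g. lang=fr, .class): preserve
        elif m.group("local") is not None and (m.group("prefix") or "") not in known_prefixes:
            kept.append(word)  # CURIE with undefined prefix: treat as content
        # otherwise: matched semantic word, removed from the output
    return " ".join(kept)

def strip_semantics(text, known_prefixes):
    """Rewrite every {...} annotation, dropping braces that become empty."""
    def repl(m):
        body = filter_annotation(m.group(1), known_prefixes)
        return "{" + body + "}" if body else ""
    return re.sub(r"\{([^{}\n]*)\}", repl, text)

print(strip_semantics("[Ceci n'est pas une pipe.]{lang=fr bibo:shortDescription}", {"bibo"}))
# [Ceci n'est pas une pipe.]{lang=fr}
print(strip_semantics("[pipe]{foo:bar}", {"bibo"}))
# [pipe]{foo:bar}
```

In the second call the prefix `foo` is not defined, so the word is left untouched rather than silently dropped.
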

These extended or new `Word` and `Block` syntaxes -- `AnnotatedWords` and `PrefixDefinition` -- should be safe to parse with priority, because the `Curie` pattern at their core should not collide with any existing Markdown syntax and is unlikely to appear in natural language text.
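
As an illustration of how `PrefixDefinition` blocks and the `Curie` pattern could work together, here is a minimal Python sketch. The regular expressions are simplified assumptions rather than the grammar from the appendix, the function names are hypothetical, and resolution of the default prefix (`{@default}: foaf`, which names a vocabulary rather than a URI) is deliberately left out.

```python
import re

# Assumption: a PrefixDefinition is a single line such as
# "{wd}: http://www.wikidata.org/entity/".
PREFIX_DEF = re.compile(
    r"^\{(?P<name>@default|[A-Za-z][\w-]*)\}:[ \t]+(?P<target>\S+)[ \t]*$"
)
# Assumption: simplified CURIE of the form "prefix:local" or ":local".
CURIE = re.compile(r"^(?P<prefix>[A-Za-z][\w-]*)?:(?P<local>[\w-]+)$")

def split_prefix_definitions(lines):
    """Separate PrefixDefinition lines from ordinary content lines."""
    prefixes, content = {}, []
    for line in lines:
        m = PREFIX_DEF.match(line)
        if m:
            prefixes[m.group("name")] = m.group("target")
        else:
            content.append(line)  # non-matched lines stay content
    return prefixes, content

def expand(curie, prefixes):
    """Expand a prefixed CURIE to a URI; None if not a CURIE or undefined."""
    m = CURIE.match(curie)
    if m is None or m.group("prefix") is None:
        return None  # default-prefix CURIEs are not handled in this sketch
    base = prefixes.get(m.group("prefix"))
    return None if base is None else base + m.group("local")

source = [
    "[Ceci n'est pas une pipe.]{lang=fr bibo:shortDescription}",
    "{bibo}: http://purl.org/ontology/bibo/",
    "{wd}: http://www.wikidata.org/entity/",
]
prefixes, content = split_prefix_definitions(source)
print(expand("bibo:shortDescription", prefixes))
# http://purl.org/ontology/bibo/shortDescription
print(expand("dc:title", prefixes))  # None
print(len(content))                  # 1
```

Because both patterns are anchored and narrow, ordinary text such as a time of day (`12:30`) or a plain URL is not mistaken for a CURIE in this sketch.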