From bb62c80bc253bf977b3718cd27da2fd074fab18c Mon Sep 17 00:00:00 2001
From: Jonas Smedegaard
Date: Sun, 25 May 2025 21:49:48 +0200
Subject: misc content updates

---
 _markdown.qmd | 200 ++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 130 insertions(+), 70 deletions(-)

(limited to '_markdown.qmd')

diff --git a/_markdown.qmd b/_markdown.qmd
index 2e57b18..5d9a15f 100644
--- a/_markdown.qmd
+++ b/_markdown.qmd
@@ -1,20 +1,29 @@
-This chapter will provide two analyses
-of the markup language Markdown;
-first an analysis of a widely used subset of the language
+This chapter will introduce a grammar
+represented as visual diagrams,
+and then provide two analyses
+of the markup language Markdown by use of that grammar.
+First an analysis of a widely used subset of the language
 that is relevant for annotation,
 and then an analysis of a not yet used subset
 proposed specifically for the intent of marking up annotations.
 
-Before these analyses,
-it is helpful to define terminology of "dialect" and "extension":
+The purpose of these analyses is to identify
+what is needed for a Markdown parser
+to cover semantic text annotations.
+
+## Dialects as sets of extensions
+
+As prerequisite for these analyses,
+it is helpful to clarify a bit of terminology,
+specifically the terms "dialect" and "extension":
 Markdown is not a single, strictly defined language
 but rather a range of slightly varying languages,
-all derived from the same somewhat ambiguous original specification.
+all derived from one somewhat ambiguous original specification.
 One way to approach this variability,
-used among other places in the documentation of the Markdown processor Pandoc
-and reused here,
-is to call each language variation a "dialect",
-and each set of markup patterns not generally agreed on an "extension".
+adopted from the documentation of the Markdown processor Pandoc,
+is to refer to each language variation as a "dialect",
+and to each set of markup patterns not generally agreed on
+as an "extension".
 
 The first analysis covers the Markdown dialect CommonMark,
 chosen both because it is widely used --
@@ -35,19 +44,13 @@ since it is abstractly equivalent
 to metadata embedding formats of both PDF and HTML
 (as discussed in more detail at @sec-rdf).
 
-## Syntax of dialect CommonMark {#sec-commonmark}
+## Formal and visual syntax description
 
 Markdown consists of blocks of content,
 optionally prepended a set of Metadata blocks.
+Here is an example as @fig-hello:
 
-Visually, this can be described using a syntax diagram
-where the possible order of elements are laid out
-like trains on rails,
-as seen in @fig-def-Markdown.
-
-![Syntax of `Markdown`.](syntax/def_Markdown.svg){#fig-def-Markdown}
-
-Here is an example:
+::: {#fig-hello}
 
 ```markdown
 ---
@@ -58,9 +61,92 @@ author: Jonas Smedegaard
 Hello, world!
 ```
 
-This example involves
-the syntax for the block types `Header` and `Paragraph`
-(and for `MetaBlock` which will not be covered here).
+Simple CommonMark-flavored Markdown text
+
+:::
+
+The example contains two different types of blocks
+(as well as a set of metadata blocks which will not be covered here).
+First a headline (see @fig-hello-header)
+and then a regular paragraph (see @fig-hello-paragraph).
+
+::: {#fig-hello-header}
+
+```markdown
+# Greeting
+```
+
+Example header block in CommonMark-flavored Markdown
+
+:::
+
+::: {#fig-hello-paragraph}
+
+```markdown
+Hello, world!
+```
+
+Example paragraph block in CommonMark-flavored Markdown
+
+:::
+
+Describing syntax using text snippets quickly becomes tedious, however,
+and cumbersome distinguishing which parts of an example are the syntax
+and which is dummy text.
+A formal parsing expression grammar
+[PEG, described in @Ford2004]
+was formulated for this project,
+based on the CommonMark specification version 0.31.2
+[@MacFarlane2024]
+and on the Semantic Markdown draft specification
+[@Francart2020].
+The PEG grammar,
+included as [Appendix @sec-def-peg],
+was then used as basis for the visual syntax diagrams
+used onwards in this paper.
+For example, the above example is visually described
+as seen in @fig-def-Markdown.
+
+![Syntax of `Markdown`.](syntax/def_Markdown.svg){#fig-def-Markdown}
+
+The diagram describes the possible order of elements,
+laid out like trains on rails.
+An oval element (called "terminal" in PEG)
+contains either a regular expression or a string,
+and a rectangle element (called "nonterminal" in PEG)
+contains a label referencing another element.
+A syntax label written in all capital letters indicates
+that the referenced syntax is in turn terminal.
+An element containing a sub-element
+prefixed with an exclamation mark
+excludes that sub-element.
+
+### Reading order matters
+
+Syntax diagrams like used here commonly represent context-free EBNF grammars,
+but here they importantly represent a parsing expression grammar,
+chosen because context-free grammars are unlikely
+to be able to cover Markdown
+[@MacFarlane2014].
+
+The significance is clear even in the above quite general diagram:
+If you try first to follow the lower "track" instead of the upper one,
+then the grammar (wrongly) describes @fig-hello
+as a text with no metadata and three content blocks:
+Existence of a metadata block is optional,
+but checking for its existence is not.
+
+These syntax diagrams are read from left to right and top to bottom,
+including at points with choices.
+
+## Syntax of dialect CommonMark {#sec-commonmark}
+
+
+More specifically,
+the example @fig-hello contains two different types of blocks
+(as well as a set of metadata blocks which will not be covered here),
+first a headline and then a regular paragraph.
+These are defined in the visual syntax as block types `Header` and `Paragraph`
 `Header` consists of space-delimited words followed a line break,
 and `Paragraph` consists of lines of space-delimited words
 followed by two or more line breaks.
@@ -78,24 +164,8 @@ Syntax of `Block`, `Header` and `Paragraph`.
 
 :::
 
-Reading order matters:
-The syntax diagrams should be read from left to right and top to bottom,
-including at points with choices --
-for example, the block type `Header` should be tried before `Paragraph`.
-Otherwise if `Paragraph` syntax was parsed first,
-then it would match both blocks
-because that block type begins with any words,
-including the characters definitive for the `Header` block type.
-In other words,
-these syntax diagrams do *not* represent the more common EBNF grammars
-but instead, a parsing expression grammar [@Ford2004],
-chosen because context-free grammars are unlikely
-to be able to cover Markdown
-[@MacFarlane2014].
-The PEG grammar covering all syntax diagrams shown here
-is included as [Appendix @sec-def-peg].
-
-Words are sets of printable characters, including punctuation.
+In this formalisation,
+"Words" are sets of printable characters, including punctuation.
 They can be styled,
 have a hyperlink annotated
 and have CSS structure and styling annotated.
@@ -130,7 +200,7 @@ each containing a block
 
 Other content blocks and inline types exist
 but are omitted in this description,
-which is limited to the comonents affected
+which is limited to the components affected
 by extending the Markdown language with additional types of annotation.
 Syntax diagrams for some additional Markdown components
 are included as [Appendix @sec-def-dia].
@@ -183,32 +253,22 @@ Syntax of `SemWords`, `Curie`, `SEMPREFIX` and `NAME`.
 
 ## Expectations of processors
 
-*FIXME: write this!*
-
-
-
-### Readability
-
-In the source format of Markdown with annotations...
-
-* the syntax for annotation **should** be relatively comprehensible to humans
-  (in the same spirit as Markdown -- for example, \*\*strong emphasis\*\*)
-* annotation syntax **must not** conflict with core Markdown
-  (i.e. must not cause ambiguities)
-* annotation syntax **should not** conflict with other Markdown extensions
-
-For syntactically correct and structurally supported annotations...
-
-* visual output **must** be identical to the source without the annotation
-* metadata of output **must** contain the annotation
-
-For syntactically incorrect or structurally unsupported annotations...
-
-* the annotation **must not** disappear from visual output
-* visual output **should** include the annotation in its source form
+Parsing should be human-friendly,
+in the spirit of Markdown
+(see @sec-spirit).
+This may translate to the annotations being dropped,
+unlike Markdown in general but like link definition blocks.
+
+For a Markdown parser to cover the Markdown extension Semantic Markdown,
+it needs to cover the existing extension AnnotatedWords,
+extended to contain URIs and CURIEs,
+and it needs to cover AnnotatedWords not only immediately after Words,
+but also as leading or trailing Words for a block.
+
+Additionally, a new block type needs to be covered,
+similar to LinkDefinition but a simpler structure
+with a CURIE as initial element.
+
+These new Word and Block syntaxes should be prioritized,
+as the restricted patterns tied to CURIEs is unlikely to collide
+with existing Markdown or non-markup plain text.
--
cgit v1.2.3
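
The ordering rule that this patch discusses under "Reading order matters" (a parsing expression grammar tries alternatives in sequence, so `Header` has to be tried before `Paragraph`) can be illustrated with a minimal Python sketch. The two regular expressions and the `parse_blocks` helper are hypothetical simplifications made up for this illustration; they are not the PEG grammar referenced in the patch as an appendix, and they ignore metadata blocks entirely.

```python
import re

# Hypothetical, heavily simplified block rules (assumptions for illustration only).
HEADER = re.compile(r"#{1,6} +[^\n]+\n+")    # "# words" followed by line break(s)
PARAGRAPH = re.compile(r"(?:[^\n]+\n)+\n*")  # one or more non-empty lines


def parse_blocks(text, rules):
    """Ordered choice: at each position the first rule that matches wins."""
    blocks, pos = [], 0
    while pos < len(text):
        for name, pattern in rules:
            match = pattern.match(text, pos)
            if match:
                blocks.append((name, match.group().strip()))
                pos = match.end()
                break
        else:
            raise ValueError(f"no rule matches at offset {pos}")
    return blocks


SOURCE = "# Greeting\n\nHello, world!\n"

# Header tried first: the headline is recognised as a header.
header_first = parse_blocks(SOURCE, [("Header", HEADER), ("Paragraph", PARAGRAPH)])

# Paragraph tried first: it also accepts the "#" line, so the header is never seen.
paragraph_first = parse_blocks(SOURCE, [("Paragraph", PARAGRAPH), ("Header", HEADER)])

print(header_first)
print(paragraph_first)
```

With `Header` listed first the sample yields one header and one paragraph; with the order reversed both blocks are labelled `Paragraph`, which is the failure mode the removed "Reading order matters" text describes.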