From bb62c80bc253bf977b3718cd27da2fd074fab18c Mon Sep 17 00:00:00 2001
From: Jonas Smedegaard <dr@jones.dk>
Date: Sun, 25 May 2025 21:49:48 +0200
Subject: misc content updates

---
 _markdown.qmd | 200 ++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 130 insertions(+), 70 deletions(-)

(limited to '_markdown.qmd')

diff --git a/_markdown.qmd b/_markdown.qmd
index 2e57b18..5d9a15f 100644
--- a/_markdown.qmd
+++ b/_markdown.qmd
@@ -1,20 +1,29 @@
-This chapter will provide two analyses
-of the markup language Markdown;
-first an analysis of a widely used subset of the language
+This chapter will introduce a grammar
+represented as visual diagrams,
+and then provide two analyses
+of the markup language Markdown by use of that grammar.
+First comes an analysis of a widely used subset of the language
 that is relevant for annotation,
 and then an analysis of a not yet used subset
 proposed specifically for the intent of marking up annotations.
 
-Before these analyses,
-it is helpful to define terminology of "dialect" and "extension":
+The purpose of these analyses is to identify
+what is needed for a Markdown parser
+to cover semantic text annotations.
+
+## Dialects as sets of extensions
+
+As a prerequisite for these analyses,
+it is helpful to clarify a bit of terminology,
+specifically the terms "dialect" and "extension":
 Markdown is not a single, strictly defined language
 but rather a range of slightly varying languages,
-all derived from the same somewhat ambiguous original specification.
+all derived from one somewhat ambiguous original specification.
 One way to approach this variability,
-used among other places in the documentation of the Markdown processor Pandoc
-and reused here,
-is to call each language variation a "dialect",
-and each set of markup patterns not generally agreed on an "extension".
+adopted from the documentation of the Markdown processor Pandoc,
+is to refer to each language variation as a "dialect",
+and to each set of markup patterns not generally agreed on
+as an "extension".
 
 The first analysis covers the Markdown dialect CommonMark,
 chosen both because it is widely used --
@@ -35,19 +44,13 @@ since it is abstractly equivalent
 to metadata embedding formats of both PDF and HTML
 (as discussed in more detail at @sec-rdf).
 
-## Syntax of dialect CommonMark {#sec-commonmark}
+## Formal and visual syntax description
 
 Markdown consists of blocks of content,
 optionally prepended a set of Metadata blocks.
+An example is shown in @fig-hello:
 
-Visually, this can be described using a syntax diagram
-where the possible order of elements are laid out
-like trains on rails,
-as seen in @fig-def-Markdown.
-
-![Syntax of `Markdown`.](syntax/def_Markdown.svg){#fig-def-Markdown}
-
-Here is an example:
+::: {#fig-hello}
 
 ```markdown
 ---
@@ -58,9 +61,92 @@ author: Jonas Smedegaard
 Hello, world!
 ```
 
-This example involves
-the syntax for the block types `Header` and `Paragraph`
-(and for `MetaBlock` which will not be covered here).
+Simple CommonMark-flavored Markdown text
+
+:::
+
+The example contains two different types of blocks
+(as well as a set of metadata blocks, which will not be covered here):
+first a headline (see @fig-hello-header)
+and then a regular paragraph (see @fig-hello-paragraph).
+
+::: {#fig-hello-header}
+
+```markdown
+# Greeting
+```
+
+Example header block in CommonMark-flavored Markdown
+
+:::
+
+::: {#fig-hello-paragraph}
+
+```markdown
+Hello, world!
+```
+
+Example paragraph block in CommonMark-flavored Markdown
+
+:::
+
+Describing syntax using text snippets quickly becomes tedious, however,
+and it is cumbersome to distinguish which parts of an example are syntax
+and which are dummy text.
+A formal parsing expression grammar
+[PEG, described in @Ford2004]
+was formulated for this project,
+based on the CommonMark specification version 0.31.2
+[@MacFarlane2024]
+and on the Semantic Markdown draft specification
+[@Francart2020].
+The grammar,
+included as [Appendix @sec-def-peg],
+was then used as the basis for the visual syntax diagrams
+used in the remainder of this paper.
+For instance, the example above is described visually
+as seen in @fig-def-Markdown.
+
+![Syntax of `Markdown`.](syntax/def_Markdown.svg){#fig-def-Markdown}
+
+The diagram describes the possible order of elements,
+laid out like trains on rails.
+An oval element (called "terminal" in PEG)
+contains either a regular expression or a string,
+and a rectangle element (called "nonterminal" in PEG)
+contains a label referencing another element.
+A syntax label written in all capital letters indicates
+that the referenced syntax is in turn terminal.
+An element containing a sub-element
+prefixed with an exclamation mark
+excludes that sub-element.
+
+### Reading order matters
+
+Syntax diagrams like those used here commonly represent context-free EBNF grammars,
+but here they importantly represent a parsing expression grammar,
+chosen because context-free grammars are unlikely
+to be able to cover Markdown
+[@MacFarlane2014].
+
+The significance is clear even in the quite general diagram above:
+If you first try to follow the lower "track" instead of the upper one,
+then the grammar (wrongly) describes @fig-hello
+as a text with no metadata and three content blocks:
+Existence of a metadata block is optional,
+but checking for its existence is not.
+
+These syntax diagrams are read from left to right and top to bottom,
+including at points with choices.
+
+## Syntax of dialect CommonMark {#sec-commonmark}
+
+
+More specifically,
+the example in @fig-hello contains two different types of blocks
+(as well as a set of metadata blocks which will not be covered here),
+first a headline and then a regular paragraph.
+These are defined in the visual syntax as block types `Header` and `Paragraph`.
 `Header` consists of space-delimited words followed a line break,
 and `Paragraph` consists of lines of space-delimited words
 followed by two or more line breaks.
@@ -78,24 +164,8 @@ Syntax of `Block`, `Header` and `Paragraph`.
 
 :::
 
-Reading order matters:
-The syntax diagrams should be read from left to right and top to bottom,
-including at points with choices --
-for example, the block type `Header` should be tried before `Paragraph`.
-Otherwise if `Paragraph` syntax was parsed first,
-then it would match both blocks
-because that block type begins with any words,
-including the characters definitive for the `Header` block type.
-In other words,
-these syntax diagrams do *not* represent the more common EBNF grammars
-but instead, a parsing expression grammar [@Ford2004],
-chosen because context-free grammars are unlikely
-to be able to cover Markdown
-[@MacFarlane2014].
-The PEG grammar covering all syntax diagrams shown here
-is included as [Appendix @sec-def-peg].
-
-Words are sets of printable characters, including punctuation.
+In this formalisation,
+"Words" are sets of printable characters, including punctuation.
 They can be styled,
 have a hyperlink annotated
 and have CSS structure and styling annotated.
@@ -130,7 +200,7 @@ each containing a block
 
 Other content blocks and inline types exist
 but are omitted in this description,
-which is limited to the comonents affected
+which is limited to the components affected
 by extending the Markdown language with additional types of annotation.
 Syntax diagrams for some additional Markdown components are included
 as [Appendix @sec-def-dia].
@@ -183,32 +253,22 @@ Syntax of `SemWords`, `Curie`, `SEMPREFIX` and `NAME`.
 
 ## Expectations of processors
 
-*FIXME: write this!*
-
-<!--
-* hvad skal med
-* hvad skal ikke med
-* hvis PDF..., hvad så?
-
-Dette afsnit udgør "requirements"
--->
-
-### Readability
-
-In the source format of Markdown with annotations...
-
-* the syntax for annotation **should** be relatively comprehensible to humans
-  (in the same spirit as Markdown -- for example, \*\*strong emphasis\*\*)
-* annotation syntax **must not** conflict with core Markdown
-  (i.e. must not cause ambiguities)
-* annotation syntax **should not** conflict with other Markdown extensions
-
-For syntactically correct and structurally supported annotations...
-
-* visual output **must** be identical to the source without the annotation
-* metadata of output **must** contain the annotation
-
-For syntactically incorrect or structurally unsupported annotations...
-
-* the annotation **must not** disappear from visual output
-* visual output **should** include the annotation in its source form
+Parsing should be human-friendly,
+in the spirit of Markdown
+(see @sec-spirit).
+This may mean that annotations are dropped from the rendered output,
+unlike Markdown markup in general but like link definition blocks.
+
+For a Markdown parser to cover the Markdown extension Semantic Markdown,
+it needs to cover the existing extension AnnotatedWords,
+extended to contain URIs and CURIEs,
+and it needs to cover AnnotatedWords not only immediately after Words,
+but also as leading or trailing Words of a block.
+
+Additionally, a new block type needs to be covered,
+similar to LinkDefinition but with a simpler structure
+and a CURIE as its initial element.
+
+These new Word and Block syntaxes should be prioritized,
+as the restricted patterns tied to CURIEs are unlikely to collide
+with existing Markdown or non-markup plain text.
-- 
cgit v1.2.3