author: Jonas Smedegaard <dr@jones.dk> 2025-05-25 21:49:48 +0200
committer: Jonas Smedegaard <dr@jones.dk> 2025-05-25 21:49:48 +0200
commit: bb62c80bc253bf977b3718cd27da2fd074fab18c
tree: 92096f7838166b725492eda359104d267bf5d202
parent: d8eafc0a82b63c4b61d32b0fe2a62e2bb7dc1184
misc content updates
-rw-r--r-- _conclusion.qmd | 14
-rw-r--r-- _filter.qmd | 2
-rw-r--r-- _intro.qmd | 46
-rw-r--r-- _markdown.qmd | 194
-rw-r--r-- _pandoc.qmd | 110
-rw-r--r-- ref.bib | 19
6 files changed, 228 insertions, 157 deletions
diff --git a/_conclusion.qmd b/_conclusion.qmd
index 8749612..cdcc591 100644
--- a/_conclusion.qmd
+++ b/_conclusion.qmd
@@ -179,6 +179,20 @@ for example, about authors and publishers,
might reduce the amount of custom restructuring
needed downstream, for example, in Quarto.
+### Alternative implementations
+
+It might be interesting to try implementing this as a Pandoc Reader,
+both to see if that can be done notably simpler,
+and also to explore possibilities of use with wrapper tools such as Quarto.
+In other words,
+try to prove the pessimistic usability analysis in @sec-pandoc-filter-versatile wrong.
+
+It might be helpful to try implementing Semantic Markdown
+for other extensible Markdown parsers.
+
+It might also be fun to try implementing Semantic djot
+[@MacFarlane2024djot].
+
### Nuanced citations in scholarly papers
Pandoc and Quarto (see @sec-quarto) support annotating scholarly citations.
diff --git a/_filter.qmd b/_filter.qmd
index 5851f5a..051ef18 100644
--- a/_filter.qmd
+++ b/_filter.qmd
@@ -14,7 +14,7 @@ As described in @sec-improve,
of priority for this project is to improve existing tools
rather than implement parallel competing ones,
despite the latter potentially being easier to do
-or lead to a simpler product.
+or leading to a simpler product.
Pandoc offers an API specifically for custom-implementing a source format
(see @sec-pandoc-apis).
In @sec-pandoc-complex
diff --git a/_intro.qmd b/_intro.qmd
index 7697ddb..ac0b38e 100644
--- a/_intro.qmd
+++ b/_intro.qmd
@@ -78,21 +78,19 @@ by the following early design decisions:
* The current delivery is intentionally scoped
as a minimum viable product,
with additional features planned as separate future works.
+* Newly introduced syntax is kept in line
+ with the core principles of Markdown.
* Programming language and integration design
are largely dictated by existing actively used systems,
rather than convenience of personal familiarity or efficiency.
* The project solely involves freely licensed tools and resources,
and is itself licensed under Free licences that encourage collaboration.
-### Scope limited to authoring
+### Scope limited to use for authoring
-The scope of this project is to enable authoring,
-with a further aim of extending to more complex processing in the future.
-
-The primary aim is to enable authors to annotate semantics
-as integral part of their creative writing process.
-Future works may expand on this,
-enhancing renderings of the authored content,
+The scope of this project is to enable annotating while authoring.
+Rendering that makes use of annotations is outside this scope,
+but planned for future works
as discussed in @sec-rdf.
The idea is to introduce semantic text annotation to Markdown authoring
@@ -104,6 +102,28 @@ as illustrated in @fig-phase1.
![A filter to strip annotations from content.](workflow/phase1.svg){#fig-phase1}
+### Added syntax in the spirit of Markdown {#sec-spirit}
+
+*FIXME: rewrite as more fluent prose*
+
+In the source format of Markdown with annotations...
+
+* the syntax for annotation **should** be relatively comprehensible to humans
+ (in the same spirit as Markdown -- for example, \*\*strong emphasis\*\*)
+* annotation syntax **must not** conflict with core Markdown
+ (i.e. must not cause ambiguities)
+* annotation syntax **should not** conflict with other Markdown extensions
+
+For syntactically correct and structurally supported annotations...
+
+* visual output **must** be identical to the source without the annotation
+* metadata of output **must** contain the annotation
+
+For syntactically incorrect or structurally unsupported annotations...
+
+* the annotation **must not** disappear from visual output
+* visual output **should** include the annotation in its source form
+
### Improve accepted formats and tools {#sec-improve}
Markdown is a widely adopted authoring format,
@@ -143,8 +163,8 @@ under the Creative Commons crediting share-alike 4.0.
## Implementation plan
-*FIXME: sharpen this end summary,
-to better conclude the previous parts*
+*FIXME: sharpen this chapter summary
+to explicitly conclude the previous parts*
* extension to existing widespread authoring language,
not an alternative one
@@ -158,9 +178,7 @@ Existing Markdown syntax is analysed in @sec-commonmark
and proposed new coverage of annotation in @sec-semantic-markdown.
(Eventual rendering of annotations is outside the scope of this paper
and will only be briefly discussed later in @sec-rdf.)
-Then existing Markdown parsing is described in @sec-pandoc,
-and implementation of extending that to parse annotations
+Then existing Markdown parsing with the tool Pandoc is analysed in @sec-pandoc,
+and the implementation of a filter to parse annotations
is described in @sec-filter,
and evaluated and discussed in the later chapters.
-
-*TODO: Maybe rephrase Pandoc as being not "described" but "analysed"?*
diff --git a/_markdown.qmd b/_markdown.qmd
index 2e57b18..5d9a15f 100644
--- a/_markdown.qmd
+++ b/_markdown.qmd
@@ -1,20 +1,29 @@
-This chapter will provide two analyses
-of the markup language Markdown;
-first an analysis of a widely used subset of the language
+This chapter will introduce a grammar
+represented as visual diagrams,
+and then provide two analyses
+of the markup language Markdown by use of that grammar:
+first an analysis of a widely used subset of the language
that is relevant for annotation,
and then an analysis of a not yet used subset
proposed specifically for the intent of marking up annotations.
-Before these analyses,
-it is helpful to define terminology of "dialect" and "extension":
+The purpose of these analyses is to identify
+what is needed for a Markdown parser
+to cover semantic text annotations.
+
+## Dialects as sets of extensions
+
+As a prerequisite for these analyses,
+it is helpful to clarify a bit of terminology,
+specifically the terms "dialect" and "extension":
Markdown is not a single, strictly defined language
but rather a range of slightly varying languages,
-all derived from the same somewhat ambiguous original specification.
+all derived from one somewhat ambiguous original specification.
One way to approach this variability,
-used among other places in the documentation of the Markdown processor Pandoc
-and reused here,
-is to call each language variation a "dialect",
-and each set of markup patterns not generally agreed on an "extension".
+adopted from the documentation of the Markdown processor Pandoc,
+is to refer to each language variation as a "dialect",
+and to each set of markup patterns not generally agreed on
+as an "extension".
The first analysis covers the Markdown dialect CommonMark,
chosen both because it is widely used --
@@ -35,19 +44,13 @@ since it is abstractly equivalent
to metadata embedding formats of both PDF and HTML
(as discussed in more detail at @sec-rdf).
-## Syntax of dialect CommonMark {#sec-commonmark}
+## Formal and visual syntax description
Markdown consists of blocks of content,
optionally preceded by a set of Metadata blocks.
+Here is an example as @fig-hello:
-Visually, this can be described using a syntax diagram
-where the possible order of elements are laid out
-like trains on rails,
-as seen in @fig-def-Markdown.
-
-![Syntax of `Markdown`.](syntax/def_Markdown.svg){#fig-def-Markdown}
-
-Here is an example:
+::: {#fig-hello}
```markdown
---
@@ -58,9 +61,92 @@ author: Jonas Smedegaard
Hello, world!
```
-This example involves
-the syntax for the block types `Header` and `Paragraph`
-(and for `MetaBlock` which will not be covered here).
+Simple CommonMark-flavored Markdown text
+
+:::
+
+The example contains two different types of blocks
+(as well as a set of metadata blocks which will not be covered here):
+first a headline (see @fig-hello-header)
+and then a regular paragraph (see @fig-hello-paragraph).
+
+::: {#fig-hello-header}
+
+```markdown
+# Greeting
+```
+
+Example header block in CommonMark-flavored Markdown
+
+:::
+
+::: {#fig-hello-paragraph}
+
+```markdown
+Hello, world!
+```
+
+Example paragraph block in CommonMark-flavored Markdown
+
+:::
+
+Describing syntax using text snippets quickly becomes tedious, however,
+and it is cumbersome to distinguish which parts of an example are syntax
+and which are dummy text.
+A formal parsing expression grammar
+[PEG, described in @Ford2004]
+was formulated for this project,
+based on the CommonMark specification version 0.31.2
+[@MacFarlane2024]
+and on the Semantic Markdown draft specification
+[@Francart2020].
+The PEG grammar,
+included as [Appendix @sec-def-peg],
+was then used as the basis for the visual syntax diagrams
+used from here on in this paper.
+For instance, the syntax of the above example is visually described
+as seen in @fig-def-Markdown.
+
+![Syntax of `Markdown`.](syntax/def_Markdown.svg){#fig-def-Markdown}
+
+The diagram describes the possible order of elements,
+laid out like trains on rails.
+An oval element (called "terminal" in PEG)
+contains either a regular expression or a string,
+and a rectangle element (called "nonterminal" in PEG)
+contains a label referencing another element.
+A syntax label written in all capital letters indicates
+that the referenced syntax is in turn terminal.
+An element containing a sub-element
+prefixed with an exclamation mark
+excludes that sub-element.
+
+### Reading order matters
+
+Syntax diagrams like those used here commonly represent context-free EBNF grammars,
+but here they importantly represent a parsing expression grammar,
+chosen because context-free grammars are unlikely
+to be able to cover Markdown
+[@MacFarlane2014].
+
+The significance is clear even in the above quite general diagram:
+If you try first to follow the lower "track" instead of the upper one,
+then the grammar (wrongly) describes @fig-hello
+as a text with no metadata and three content blocks:
+Existence of a metadata block is optional,
+but checking for its existence is not.
+
+These syntax diagrams are read from left to right and top to bottom,
+including at points with choices.
+
+## Syntax of dialect CommonMark {#sec-commonmark}
+
+
+More specifically,
+the example @fig-hello contains two different types of blocks
+(as well as a set of metadata blocks which will not be covered here),
+first a headline and then a regular paragraph.
+These are defined in the visual syntax as block types `Header` and `Paragraph`:
`Header` consists of space-delimited words followed by a line break,
and `Paragraph` consists of lines of space-delimited words
followed by two or more line breaks.
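
The ordered choice discussed under "Reading order matters" can be illustrated with a minimal Python sketch. The two regular expressions are hypothetical, heavily simplified stand-ins for the `Header` and `Paragraph` rules, not the project's actual PEG grammar, and `parse_block` is a name invented for this illustration. The point is only that the first matching alternative wins, so the order of alternatives changes the result.

```python
import re

# Hypothetical, heavily simplified rules (assumptions, not the real grammar):
# a header is one to six "#" followed by spaces and text;
# a paragraph is any non-empty line.
HEADER = re.compile(r"#{1,6} +(?P<text>.+)")
PARAGRAPH = re.compile(r"(?P<text>.+)")


def parse_block(line, rules):
    """Return the label of the first rule that matches (PEG ordered choice)."""
    for label, pattern in rules:
        if pattern.fullmatch(line):
            return label
    return None


# Same rules, two different orders of alternatives.
header_first = [("Header", HEADER), ("Paragraph", PARAGRAPH)]
paragraph_first = [("Paragraph", PARAGRAPH), ("Header", HEADER)]

lines = ["# Greeting", "Hello, world!"]
in_order = [parse_block(line, header_first) for line in lines]
reversed_order = [parse_block(line, paragraph_first) for line in lines]
```

With `Header` tried first, the two lines are labelled as a header and a paragraph. With `Paragraph` tried first, both lines are labelled as paragraphs, because the paragraph rule also accepts the characters that define a header.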
@@ -78,24 +164,8 @@ Syntax of `Block`, `Header` and `Paragraph`.
:::
-Reading order matters:
-The syntax diagrams should be read from left to right and top to bottom,
-including at points with choices --
-for example, the block type `Header` should be tried before `Paragraph`.
-Otherwise if `Paragraph` syntax was parsed first,
-then it would match both blocks
-because that block type begins with any words,
-including the characters definitive for the `Header` block type.
-In other words,
-these syntax diagrams do *not* represent the more common EBNF grammars
-but instead, a parsing expression grammar [@Ford2004],
-chosen because context-free grammars are unlikely
-to be able to cover Markdown
-[@MacFarlane2014].
-The PEG grammar covering all syntax diagrams shown here
-is included as [Appendix @sec-def-peg].
-
-Words are sets of printable characters, including punctuation.
+In this formalisation,
+"Words" are sets of printable characters, including punctuation.
They can be styled,
have a hyperlink annotated
and have CSS structure and styling annotated.
@@ -130,7 +200,7 @@ each containing a block
Other content blocks and inline types exist
but are omitted in this description,
-which is limited to the comonents affected
+which is limited to the components affected
by extending the Markdown language with additional types of annotation.
Syntax diagrams for some additional Markdown components are included
as [Appendix @sec-def-dia].
@@ -183,32 +253,22 @@ Syntax of `SemWords`, `Curie`, `SEMPREFIX` and `NAME`.
## Expectations of processors
-*FIXME: write this!*
-
-<!--
-* what should be included
-* what should not be included
-* if PDF..., then what?
-
-This section constitutes the "requirements"
--->
-
-### Readability
-
-In the source format of Markdown with annotations...
-
-* the syntax for annotation **should** be relatively comprehensible to humans
- (in the same spirit as Markdown -- for example, \*\*strong emphasis\*\*)
-* annotation syntax **must not** conflict with core Markdown
- (i.e. must not cause ambiguities)
-* annotation syntax **should not** conflict with other Markdown extensions
-
-For syntactically correct and structurally supported annotations...
+Parsing should be human-friendly,
+in the spirit of Markdown
+(see @sec-spirit).
+This may translate to the annotations being dropped,
+unlike Markdown in general but like link definition blocks.
-* visual output **must** be identical to the source without the annotation
-* metadata of output **must** contain the annotation
+For a Markdown parser to cover the Markdown extension Semantic Markdown,
+it needs to cover the existing extension AnnotatedWords,
+extended to contain URIs and CURIEs,
+and it needs to cover AnnotatedWords not only immediately after Words,
+but also as leading or trailing Words for a block.
-For syntactically incorrect or structurally unsupported annotations...
+Additionally, a new block type needs to be covered,
+similar to LinkDefinition but with a simpler structure
+with a CURIE as initial element.
-* the annotation **must not** disappear from visual output
-* visual output **should** include the annotation in its source form
+These new Word and Block syntaxes should be prioritized,
+as the restricted patterns tied to CURIEs are unlikely to collide
+with existing Markdown or non-markup plain text.
diff --git a/_pandoc.qmd b/_pandoc.qmd
index 95ba8e3..5909818 100644
--- a/_pandoc.qmd
+++ b/_pandoc.qmd
@@ -1,69 +1,12 @@
-*FIXME: This chapter is unfinished --
-currently contains 3-4 large chunks (separated by horisontal lines)
-that need to be merged or maybe some parts dropped altogether...*
-This chapter will provide an analysis
-of the Markdown processor Pandoc.
-
-Many dialects of Markdown have evolved,
-some tightening the language for parsing efficiency and disambiguation,
-some extending to cover additional structures,
-and some including support for a metadata header section.
-
-Pandoc is a tool that can convert texts in Markdown dialects
-into many document formats including HTML and (via LaTeX) PDF,
-applying visual style and positioning throught templates.
-Such document workflows,
-including minimal structural markup as part of the creative writing,
-applying visual layout as an automated templating process
-tied to a target document format,
-have been further streamlined for academic texts
-in the Quarto document publishing system.
-
-----
-
-reads a text document,
-parses its structural components into an AST,
-and serialises and writes back into a text document.
-The Pandoc AST deliberately prioritises structural information
-and is relaxed about visual information,
-to preserve literal content
-while reducing format-specific stylistic details,
-relevant especially when processing between different formats.
-The most common use
-is to read plaintext Markdown files and write LaTeX code,
-which is then compiled into a PDF file.
-
-----
-
-The Markdown processor Pandoc can transform Markdown not only to HTML
-but also to other output formats like PDF.
-Pandoc offers an API for adapting its content processing
-as well as a templating structure for customizing layout,
-which is streamlined in the document authoring framework Quarto:
-Pandoc with a set of plugins and templates
-enables rendering of scholarly papers
-conforming to prescribed style guides and document formats.
-
----
-
-Pandoc is a document converter built around the Markdown markup language,
+Pandoc is a versatile and extensible document converter,
able to parse from and serialise to many Markdown dialects
as well as equivalent subsets of other text markup languages
-including HTML, LaTeX (and by extension PDF),
-Office Open XML (as used by recent releases of Microsoft Word),
-and OpenDocument (as used by OpenOffice and LibreOffice).
-
-Pandoc supports redefining input and output formats
-and manipulating the internal document structure
-as part of the automated parts of the framework.
+including HTML and (via LaTeX or other intermediaries) PDF.
+Pandoc offers several APIs for adapting its content processing
+as well as a templating structure for styling of output.
-Pandoc is extensible.
-Source and output format can be changed or completely redefined
-and the internal document structure manipulated,
-in the automated parts of the framework.
-
-Collection of interrelated POSIX scripts and Pandoc extensions
-for enabling semantic annotations in Markdown-based authoring workflows.
+This chapter provides an analysis of Pandoc,
+with a focus on ways to extend it to support a new Markdown extension.
## An AST in the spirit of Markdown
@@ -146,17 +89,34 @@ The author may then need to violate the separation of concerns
often done by adding compositional information
to the bibliographic metadata.
-Pandoc generally maintain a separation of concern
+## The AST filter API is effectively more versatile {#sec-pandoc-filter-versatile}
+
+Pandoc strives to maintain a separation of concerns
between content, structure and layout,
-in a workflow of isolated functions and APIs
-for import, filtering and export,
-but cannot guarantee this separation.
-While Pandoc offers an import extension API
-for adding support for new source formats,
-the formats implemented in Pandoc itself are maintained in sync
-with other parts of the workflow,
-ensuring a tighter degree of coordination than the API can offer.
-Such potential for the Pandoc import API
-being inferior to builtin source format parsers
-has influenced the choice of implementation for this project,
+offering separate APIs for import, filtering and export,
+but since some functionality crosses those boundaries,
+this separation is not guaranteed,
+and complex features are more likely to work correctly
+with the most commonly used Reader and Writer components.
+
+This is exacerbated when using wrapper tools
+since their extensions are likely tailored to those same
+most commonly used Reader and Writer components.
+The Quarto framework, specifically,
+allows using any Reader, the same as Pandoc itself,
+but only when using the default Reader --
+a custom derivative of the Pandoc default Markdown Reader --
+do you benefit from Quarto-specific macro expansions.
+
+Despite the Reader API being specifically tailored
+for implementing a custom source format parser,
+when integration with complex functions is relevant,
+either within Pandoc itself or part of wrapper tools,
+use of that API is effectively discouraged.
+
+Since a core principle of Markdown is to treat unknown markup as content,
+a viable alternative for an extended Markdown format
+is to use a simpler Markdown Reader and then "finish up" the parsing
+in a filter.
+This is the approach chosen for this project,
as covered next in @sec-filter.
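
The idea of "finishing up" the parsing after a simpler Reader has passed unknown markup through as content can be sketched in plain Python. This is an illustration under stated assumptions, not the project's filter and not the Pandoc filter API: it works on a plain string rather than on the Pandoc AST, and the annotation syntax (a CURIE in braces directly after a word) and the name `strip_annotations` are hypothetical. Well-formed annotations are moved out of the visible text into a separate list, while malformed ones stay visible in their source form.

```python
import re

# Hypothetical annotation syntax, for illustration only:
# a CURIE in braces directly after a word, e.g. "world{ex:Planet}".
ANNOTATION = re.compile(
    r"(?P<word>[^\s{}]+)\{(?P<curie>[A-Za-z_][\w.-]*:[\w.-]+)\}"
)


def strip_annotations(text):
    """Split text into visible content and a list of (word, CURIE) pairs."""
    found = []

    def collect(match):
        # Record the annotation, keep only the annotated word visible.
        found.append((match.group("word"), match.group("curie")))
        return match.group("word")

    return ANNOTATION.sub(collect, text), found


visible, metadata = strip_annotations("Hello, world{ex:Planet}!")
broken, nothing = strip_annotations("Hello, world{not a curie}!")
```

In the first call the visible text equals the source without the annotation, and the annotation ends up in `metadata`. In the second call the braces do not contain a CURIE, so the text is left untouched and nothing is collected.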
diff --git a/ref.bib b/ref.bib
index 8bac1a7..8567e06 100644
--- a/ref.bib
+++ b/ref.bib
@@ -218,6 +218,16 @@
editor = {Ivan Herman and Ben Adida and Manu Sporny and Mark Birbeck},
}
+@TechReport{MacFarlane2024,
+ author = {John {MacFarlane}},
+ date = {2024-01-28},
+ title = {{CommonMark} Spec},
+ language = {English},
+ url = {https://spec.commonmark.org/0.31.2/},
+ urldate = {2025-05-25},
+ version = {0.31.2},
+}
+
@Online{Francart2020,
author = {Thomas Francart},
date = {2020-02-20},
@@ -313,6 +323,7 @@
pages = {111--122},
subtitle = {a recognition-based syntactic foundation},
volume = {39},
+ file = {:Ford2004 - Parsing Expression Grammars.pdf:PDF},
publisher = {Association for Computing Machinery (ACM)},
}
@@ -342,6 +353,14 @@
urldate = {2025-05-23},
}
+@Online{MacFarlane2024djot,
+ author = {John {MacFarlane}},
+ date = {2024-02-04},
+ title = {djot},
+ url = {https://djot.org/},
+ urldate = {2025-05-25},
+}
+
@Online{Smedegaard2025,
author = {Jonas Smedegaard},
date = {2025-05-14},