| author | Jonas Smedegaard <dr@jones.dk> | 2025-05-25 21:49:48 +0200 |
|---|---|---|
| committer | Jonas Smedegaard <dr@jones.dk> | 2025-05-25 21:49:48 +0200 |
| commit | bb62c80bc253bf977b3718cd27da2fd074fab18c (patch) | |
| tree | 92096f7838166b725492eda359104d267bf5d202 | |
| parent | d8eafc0a82b63c4b61d32b0fe2a62e2bb7dc1184 (diff) | |
misc content updates
| -rw-r--r-- | _conclusion.qmd | 14 | ||||
| -rw-r--r-- | _filter.qmd | 2 | ||||
| -rw-r--r-- | _intro.qmd | 46 | ||||
| -rw-r--r-- | _markdown.qmd | 194 | ||||
| -rw-r--r-- | _pandoc.qmd | 110 | ||||
| -rw-r--r-- | ref.bib | 19 |
6 files changed, 228 insertions, 157 deletions
diff --git a/_conclusion.qmd b/_conclusion.qmd index 8749612..cdcc591 100644 --- a/_conclusion.qmd +++ b/_conclusion.qmd @@ -179,6 +179,20 @@ for example, about authors and publishers, might reduce the amount of custom restructuring needed downstream, for example, in Quarto. +### Alternative implementations + +It might be interesting to try implement as a Pandoc Reader, +both to see if that can be done notably simpler, +and also to explore possibilities of use with wrapper tools such as Quarto. +In other words, +try to prove the pessimistic usability analysis in @sec-pandoc-filter-versatile wrong. + +It might be helpful to try implement Semantic Markdown +for other extensible Markdown parsers. + +Might also be fun to try implement Semantic djot +[@MacFarlane2024djot]. + ### Nuanced citations in scholarly papers Pandoc and Quarto (see @sec-quarto) support annotating scholarly citations. diff --git a/_filter.qmd b/_filter.qmd index 5851f5a..051ef18 100644 --- a/_filter.qmd +++ b/_filter.qmd @@ -14,7 +14,7 @@ As described in @sec-improve, of priority for this project is to improve existing tools rather than implement parallel competing ones, despite the latter potentially being easier to do -or lead to a simpler product. +or leading to a simpler product. Pandoc offers an API specifically for custom-implementing a source format (see @sec-pandoc-apis). In @sec-pandoc-complex @@ -78,21 +78,19 @@ by the following early design decisions: * The current delivery is intentionally scoped as a minimum viable product, with additional features planned as separate future works. +* Newly introduced syntax is kept in line + with the core principles of Markdown. * Programming language and integration design are largely dictated by existing actively used systems, rather than convenience of personal familiarity or efficiency. * The project solely involves freely licensed tools and resources, and is itself licensed under Free licences that encourage collaboration. 
-### Scope limited to authoring +### Scope limited to use for authoring -The scope of this project is to enable authoring, -with a further aim of extending to more complex processing in the future. - -The primary aim is to enable authors to annotate semantics -as integral part of their creative writing process. -Future works may expand on this, -enhancing renderings of the authored content, +The scope of this project is to enable annotating while authoring. +Rendering that makes use of annotations is outside this scope, +but planned for future works as discussed in @sec-rdf. The idea is to introduce semantic text annotation to Markdown authoring @@ -104,6 +102,28 @@ as illustrated in @fig-phase1. {#fig-phase1} +### Added syntax in the spirit of Markdown {#sec-spirit} + +*FIXME: rewrite as more fluent prose* + +In the source format of Markdown with annotations... + +* the syntax for annotation **should** be relatively comprehensible to humans + (in the same spirit as Markdown -- for example, \*\*strong emphasis\*\*) +* annotation syntax **must not** conflict with core Markdown + (i.e. must not cause ambiguities) +* annotation syntax **should not** conflict with other Markdown extensions + +For syntactically correct and structurally supported annotations... + +* visual output **must** be identical to the source without the annotation +* metadata of output **must** contain the annotation + +For syntactically incorrect or structurally unsupported annotations... + +* the annotation **must not** disappear from visual output +* visual output **should** include the annotation in its source form + ### Improve accepted formats and tools {#sec-improve} Markdown is a widely adopted authoring format, @@ -143,8 +163,8 @@ under the Creative Commons crediting share-alike 4.0. 
## Implementation plan -*FIXME: sharpen this end summary, -to better conclude the previous parts* +*FIXME: sharpen this chapter summary +to explicitly conclude the previous parts* * extention to existing widespread authoring language, not an alternative one @@ -158,9 +178,7 @@ Existing Markdown syntax is analysed in @sec-commonmark and proposed new coverage of annotation in @sec-semantic-markdown. (Eventual rendering of annotations is outside the scope of this paper and will only be briefly discussed later in @sec-rdf.) -Then existing Markdown parsing is described in @sec-pandoc, -and implementation of extending that to parse annotations +Then existing Markdown parsing with the tool Pandoc is analysed in @sec-pandoc, +and the implementation of a filter to parse annotations is described in @sec-filter, and evaluated and discussed in the later chapters. - -*TODO: Maybe rephrase Pandoc as being not "described" but "analysed"?* diff --git a/_markdown.qmd b/_markdown.qmd index 2e57b18..5d9a15f 100644 --- a/_markdown.qmd +++ b/_markdown.qmd @@ -1,20 +1,29 @@ -This chapter will provide two analyses -of the markup language Markdown; -first an analysis of a widely used subset of the language +This chapter will introduce a grammar +represented as visual diagrams, +and then provide two analyses +of the markup language Markdown by use of that grammar. +First an analysis of a widely used subset of the language that is relevant for annotation, and then an analysis of a not yet used subset proposed specifically for the intent of marking up annotations. -Before these analyses, -it is helpful to define terminology of "dialect" and "extension": +The purpose of these analyses is to identify +what is needed for a Markdown parser +to cover semantic text annotations. 
+ +## Dialects as sets of extensions + +As prerequisite for these analyses, +it is helpful to clarify a bit of terminology, +specifically the terms "dialect" and "extension": Markdown is not a single, strictly defined language but rather a range of slightly varying languages, -all derived from the same somewhat ambiguous original specification. +all derived from one somewhat ambiguous original specification. One way to approach this variability, -used among other places in the documentation of the Markdown processor Pandoc -and reused here, -is to call each language variation a "dialect", -and each set of markup patterns not generally agreed on an "extension". +adopted from the documentation of the Markdown processor Pandoc, +is to refer to each language variation as a "dialect", +and to each set of markup patterns not generally agreed on +as an "extension". The first analysis covers the Markdown dialect CommonMark, chosen both because it is widely used -- @@ -35,19 +44,13 @@ since it is abstractly equivalent to metadata embedding formats of both PDF and HTML (as discussed in more detail at @sec-rdf). -## Syntax of dialect CommonMark {#sec-commonmark} +## Formal and visual syntax description Markdown consists of blocks of content, optionally prepended a set of Metadata blocks. +Here is an example as @fig-hello: -Visually, this can be described using a syntax diagram -where the possible order of elements are laid out -like trains on rails, -as seen in @fig-def-Markdown. - -{#fig-def-Markdown} - -Here is an example: +::: {#fig-hello} ```markdown --- @@ -58,9 +61,92 @@ author: Jonas Smedegaard Hello, world! ``` -This example involves -the syntax for the block types `Header` and `Paragraph` -(and for `MetaBlock` which will not be covered here). +Simple CommonMark-flavored Markdown text + +::: + +The example contains two different types of blocks +(as well as a set of metadata blocks which will not be covered here). 
+First a headline (see @fig-hello-header) +and then a regular paragraph (see @fig-hello-paragraph). + +::: {#fig-hello-header} + +```markdown +# Greeting +``` + +Example header block in CommonMark-flavored Markdown + +::: + +::: {#fig-hello-paragraph} + +```markdown +Hello, world! +``` + +Example paragraph block in CommonMark-flavored Markdown + +::: + +Describing syntax using text snippets quickly becomes tedious, however, +and cumbersome distinguishing which parts of an example are the syntax +and which is dummy text. +A formal parsing expression grammar +[PEG, described in @Ford2004] +was formulated for this project, +based on the CommonMark specification version 0.31.2 +[@MacFarlane2024] +and on the Semantic Markdown draft specification +[@Francart2020]. +The PEG grammar, +included as [Appendix @sec-def-peg], +was then used as basis for the visual syntax diagrams +used onwards in this paper. +For example, the above example is visually described +as seen in @fig-def-Markdown. + +{#fig-def-Markdown} + +The diagram describes the possible order of elements, +laid out like trains on rails. +An oval element (called "terminal" in PEG) +contains either a regular expression or a string, +and a rectangle element (called "nonterminal" in PEG) +contains a label referencing another element. +A syntax label written in all capital letters indicates +that the referenced syntax is in turn terminal. +An element containing a sub-element +prefixed with an exclamation mark +excludes that sub-element. + +### Reading order matters + +Syntax diagrams like used here commonly represent context-free EBNF grammars, +but here they importantly represent a parsing expression grammar, +chosen because context-free grammars are unlikely +to be able to cover Markdown +[@MacFarlane2014]. 
+ +The significance is clear even in the above quite general diagram: +If you try first to follow the lower "track" instead of the upper one, +then the grammar (wrongly) describes @fig-hello +as a text with no metadata and three content blocks: +Existence of a metadata block is optional, +but checking for its existence is not. + +These syntax diagrams are read from left to right and top to bottom, +including at points with choices. + +## Syntax of dialect CommonMark {#sec-commonmark} + + +More specifically, +the example @fig-hello contains two different types of blocks +(as well as a set of metadata blocks which will not be covered here), +first a headline and then a regular paragraph. +These are defined in the visual syntax as block types `Header` and `Paragraph` `Header` consists of space-delimited words followed a line break, and `Paragraph` consists of lines of space-delimited words followed by two or more line breaks. @@ -78,24 +164,8 @@ Syntax of `Block`, `Header` and `Paragraph`. ::: -Reading order matters: -The syntax diagrams should be read from left to right and top to bottom, -including at points with choices -- -for example, the block type `Header` should be tried before `Paragraph`. -Otherwise if `Paragraph` syntax was parsed first, -then it would match both blocks -because that block type begins with any words, -including the characters definitive for the `Header` block type. -In other words, -these syntax diagrams do *not* represent the more common EBNF grammars -but instead, a parsing expression grammar [@Ford2004], -chosen because context-free grammars are unlikely -to be able to cover Markdown -[@MacFarlane2014]. -The PEG grammar covering all syntax diagrams shown here -is included as [Appendix @sec-def-peg]. - -Words are sets of printable characters, including punctuation. +In this formalisation, +"Words" are sets of printable characters, including punctuation. 
They can be styled, have a hyperlink annotated and have CSS structure and styling annotated. @@ -130,7 +200,7 @@ each containing a block Other content blocks and inline types exist but are omitted in this description, -which is limited to the comonents affected +which is limited to the components affected by extending the Markdown language with additional types of annotation. Syntax diagrams for some additional Markdown components are included as [Appendix @sec-def-dia]. @@ -183,32 +253,22 @@ Syntax of `SemWords`, `Curie`, `SEMPREFIX` and `NAME`. ## Expectations of processors -*FIXME: write this!* - -<!-- -* hvad skal med -* hvad skal ikke med -* hvis PDF..., hvad så? - -Dette afsnit udgør "requirements" ---> - -### Readability - -In the source format of Markdown with annotations... - -* the syntax for annotation **should** be relatively comprehensible to humans - (in the same spirit as Markdown -- for example, \*\*strong emphasis\*\*) -* annotation syntax **must not** conflict with core Markdown - (i.e. must not cause ambiguities) -* annotation syntax **should not** conflict with other Markdown extensions - -For syntactically correct and structurally supported annotations... +Parsing should be human-friendly, +in the spirit of Markdown +(see @sec-spirit). +This may translate to the annotations being dropped, +unlike Markdown in general but like link definition blocks. -* visual output **must** be identical to the source without the annotation -* metadata of output **must** contain the annotation +For a Markdown parser to cover the Markdown extension Semantic Markdown, +it needs to cover the existing extension AnnotatedWords, +extended to contain URIs and CURIEs, +and it needs to cover AnnotatedWords not only immediately after Words, +but also as leading or trailing Words for a block. -For syntactically incorrect or structurally unsupported annotations... 
+Additionally, a new block type needs to be covered, +similar to LinkDefinition but a simpler structure +with a CURIE as initial element. -* the annotation **must not** disappear from visual output -* visual output **should** include the annotation in its source form +These new Word and Block syntaxes should be prioritized, +as the restricted patterns tied to CURIEs is unlikely to collide +with existing Markdown or non-markup plain text. diff --git a/_pandoc.qmd b/_pandoc.qmd index 95ba8e3..5909818 100644 --- a/_pandoc.qmd +++ b/_pandoc.qmd @@ -1,69 +1,12 @@ -*FIXME: This chapter is unfinished -- -currently contains 3-4 large chunks (separated by horisontal lines) -that need to be merged or maybe some parts dropped altogether...* -This chapter will provide an analysis -of the Markdown processor Pandoc. - -Many dialects of Markdown have evolved, -some tightening the language for parsing efficiency and disambiguation, -some extending to cover additional structures, -and some including support for a metadata header section. - -Pandoc is a tool that can convert texts in Markdown dialects -into many document formats including HTML and (via LaTeX) PDF, -applying visual style and positioning throught templates. -Such document workflows, -including minimal structural markup as part of the creative writing, -applying visual layout as an automated templating process -tied to a target document format, -have been further streamlined for academic texts -in the Quarto document publishing system. - ----- - -reads a text document, -parses its structural components into an AST, -and serialises and writes back into a text document. -The Pandoc AST deliberately prioritises structural information -and is relaxed about visual information, -to preserve literal content -while reducing format-specific stylistic details, -relevant especially when processing between different formats. 
-The most common use -is to read plaintext Markdown files and write LaTeX code, -which is then compiled into a PDF file. - ----- - -The Markdown processor Pandoc can transform Markdown not only to HTML -but also to other output formats like PDF. -Pandoc offers an API for adapting its content processing -as well as a templating structure for customizing layout, -which is streamlined in the document authoring framework Quarto: -Pandoc with a set of plugins and templates -enables rendering of scholarly papers -conforming to prescribed style guides and document formats. - ---- - -Pandoc is a document converter built around the Markdown markup language, +Pandoc is a versatile and extensible document converter, able to parse from and serialise to many Markdown dialects as well as equivalent subsets of other text markup languages -including HTML, LaTeX (and by extension PDF), -Office Open XML (as used by recent releases of Microsoft Word), -and OpenDocument (as used by OpenOffice and LibreOffice). - -Pandoc supports redefining input and output formats -and manipulating the internal document structure -as part of the automated parts of the framework. +including HTML and (via LaTeX or other intermediaries) PDF. +Pandoc offers several APIs for adapting its content processing +as well as a templating structure for styling of output. -Pandoc is extensible. -Source and output format can be changed or completely redefined -and the internal document structure manipulated, -in the automated parts of the framework. - -Collection of interrelated POSIX scripts and Pandoc extensions -for enabling semantic annotations in Markdown-based authoring workflows. +This chapter provides an analysis of Pandoc, +with a focus on ways to extend it to support a new Markdown extension. ## An AST in the spirit of Markdown @@ -146,17 +89,34 @@ The author may then need to violate the separation of concerns often done by adding compositional information to the bibliographic metadata. 
-Pandoc generally maintain a separation of concern +## The AST filter API is effectively more versatile {#sec-pandoc-filter-versatile} + +Pandoc strives to maintain a separation of concern between content, structure and layout, -in a workflow of isolated functions and APIs -for import, filtering and export, -but cannot guarantee this separation. -While Pandoc offers an import extension API -for adding support for new source formats, -the formats implemented in Pandoc itself are maintained in sync -with other parts of the workflow, -ensuring a tighter degree of coordination than the API can offer. -Such potential for the Pandoc import API -being inferior to builtin source format parsers -has influenced the choice of implementation for this project, +offering separate APIs for import, filtering and export, +but since some functionality cross those boundaries, +this separation is not guaranteed, +and complex features are more likely to work correctly +with the most commonly used Reader and Writer components. + +This is exacerbated when using wrapper tools +since their extensions are likely tailored to those same +most commonly used Reader and Writer components. +The Quarto framework, specifically, +allows using any Reader same as Pandoc itself, +but only when using default Reader -- +a custom derivative of the Pandoc default Markdown Reader -- +do you benefit from Quarto-specific macro expansions. + +Despite the Reader API being specifically tailored +for implementing a custom source format parser, +when integration with complex functions is relevant, +either within Pandoc itself or part of wrapper tools, +that API effectively is discouraged. + +Since a core principle of Markdown is treat unknown markup as content, +a viable alternative for an extended Markdown format +is to use a simpler Markdown Reader and then "finish up" the parsing +in a filter. +This is the approach chosen for this project, as covered next in @sec-filter. 
@@ -218,6 +218,16 @@ editor = {Ivan Herman and Ben Adida and Manu Sporny and Mark Birbeck}, } +@TechReport{MacFarlane2024, + author = {John {MacFarlane}}, + date = {2024-01-28}, + title = {{CommonMark} Spec}, + language = {English}, + url = {https://spec.commonmark.org/0.31.2/}, + urldate = {2025-05-25}, + version = {0.31.2}, +} + @Online{Francart2020, author = {Thomas Francart}, date = {2020-02-20}, @@ -313,6 +323,7 @@ pages = {111--122}, subtitle = {a recognition-based syntactic foundation}, volume = {39}, + file = {:Ford2004 - Parsing Expression Grammars.pdf:PDF}, publisher = {Association for Computing Machinery (ACM)}, } @@ -342,6 +353,14 @@ urldate = {2025-05-23}, } +@Online{MacFarlane2024djot, + author = {John {MacFarlane}}, + date = {2024-02-04}, + title = {djot}, + url = {https://djot.org/}, + urldate = {2025-05-25}, +} + @Online{Smedegaard2025, author = {Jonas Smedegaard}, date = {2025-05-14}, |
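
The "Reading order matters" text added to `_markdown.qmd` in this commit says that the syntax diagrams represent a parsing expression grammar, in which alternatives are tried in a fixed order. A minimal sketch of that ordered-choice behaviour follows, assuming two heavily simplified, hypothetical block rules written as regular expressions. The names `HEADER`, `PARAGRAPH` and `parse_blocks` are illustrative only and are not taken from the project's PEG grammar or from Pandoc.

```python
import re

# Hypothetical, simplified block rules (assumptions for illustration only,
# not the PEG grammar from the paper's appendix).
HEADER = re.compile(r"#{1,6} +(?P<text>[^\n]+)\n+")
PARAGRAPH = re.compile(r"(?P<text>[^\n]+(?:\n[^\n]+)*)\n{2,}")


def parse_blocks(source, rules):
    """Split source into (label, text) pairs using PEG-style ordered choice.

    At each position the rules are tried in the given order,
    and the first rule that matches wins.
    """
    blocks = []
    pos = 0
    while pos < len(source):
        for label, pattern in rules:
            match = pattern.match(source, pos)
            if match:
                blocks.append((label, match.group("text")))
                pos = match.end()
                break
        else:
            raise ValueError(f"no rule matches at offset {pos}")
    return blocks


SOURCE = "# Greeting\n\nHello, world!\n\n"

# Header tried first: the headline is recognised as a Header.
header_first = parse_blocks(
    SOURCE, [("Header", HEADER), ("Paragraph", PARAGRAPH)]
)

# Paragraph tried first: it also accepts the "#" line,
# so the headline is (wrongly) parsed as a Paragraph.
paragraph_first = parse_blocks(
    SOURCE, [("Paragraph", PARAGRAPH), ("Header", HEADER)]
)

print(header_first)
print(paragraph_first)
```

With the same two rules, only the order in which they are tried decides whether `# Greeting` is parsed as a header or as an ordinary paragraph. This is the reason the diagrams are to be read top to bottom at points with choices.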
