add section on complex Pandoc features

author: Jonas Smedegaard <dr@jones.dk> 2025-05-25 11:16:25 +0200
committer: Jonas Smedegaard <dr@jones.dk> 2025-05-25 11:24:34 +0200
commit: 747ac65d4708487c0adad7b5c0ed5c85d814b09c (patch)
tree: 67f98ee0e1f55585e89423b6aa2a5ce8fc93ce15
parent: c948ba705c55f408074ded769ce8fdd4c125435e (diff)
5 files changed, 117 insertions, 23 deletions
diff --git a/_filter.qmd b/_filter.qmd
index 7cf343e..1384b32 100644
--- a/_filter.qmd
+++ b/_filter.qmd
@@ -10,6 +10,28 @@ which may initially seem unusual.
 *TODO: About the approach of parsing as Markdown and adjust the fallout,
 instead of writing an import extension.*
 
+As described in @sec-improve,
+a priority for this project is to improve existing tools
+rather than implement parallel competing ones,
+despite the latter potentially being easier to do
+or leading to a simpler product.
+In @sec-pandoc-complex
+
+*FIXME: tie pieces together, and continue from there
+with the consequence of how it was actually tackled.*
+
+<!--
+The easiest way to implement a derivation of Markdown
+is likely by writing a Pandoc import filter,
+but that would limit its usefulness with larger frameworks like Quarto.
+Although they do support alternative source languages,
+important features such as citation handling are far better streamlined
+when using the main and best-supported source format.
+This project therefore implements its deviation of Markdown
+by reading the deviant source as if it were Markdown,
+and then applying a filter to adjust the AST after the deliberate misparsing.
+-->
+
 ## The choice of Lua
 
 This project is implemented in the scripting language Lua.
diff --git a/_intro.qmd b/_intro.qmd
index 5b4a5fb..7697ddb 100644
--- a/_intro.qmd
+++ b/_intro.qmd
@@ -104,33 +104,31 @@ as illustrated in @fig-phase1.
 
 ![A filter to strip annotations from content.](workflow/phase1.svg){#fig-phase1}
 
-### Evolving accepted formats and tools
+### Improve accepted formats and tools {#sec-improve}
 
 Markdown is a widely adopted authoring format,
 and Pandoc a widely adopted Markdown processor.
 It is easier to invent a new format in a new tool
 than to convince widely adopted ones to evolve;
-however,the latter, if successful, is likely to be far more reliable.
-
-This project aims at introducing new syntax
-while staying close to existing Markdown,
+however, the latter, if successful, is likely to be far more reliable.
+This project aims at staying close to existing Markdown,
+striving to only minimally introduce new syntax,
 unlike, for example, SAM that deviates notably from Markdown
 [@SAM2018].
 
 Many Markdown processors exist,
 but few are in widespread use.
-Pandoc is widely adopted and has in recent times been integrated
-with R Markdown and the Quarto document publishing system
-to streamline the production of academic papers.
-The easiest way to implement a derivation of Markdown
-is likely by writing a Pandoc import filter,
-but that would limit its usefulness with larger frameworks like Quarto.
-Although they do support alternative source languages,
-important features such as citation handling are far better streamlined
-when using the main and best-supported source format.
-This project therefore implements its deviation of Markdown
-by reading the deviant source as if it were Markdown,
-and then applying a filter to adjust the AST after the deliberate misparsing.
+One popular Markdown processor is Pandoc,
+widely adopted and integrated with multiple wrapper tools
+such as R Markdown and the Quarto document publishing system
+to streamline the production of, for example, academic papers.
+Yet another Markdown parser would probably be easiest to implement
+as a standalone project,
+either by forking an existing one and customizing it,
+or by use of a library for common components
+and then subclassing as needed.
+This project, however, strives to reuse existing formats and tools,
+and has chosen to extend the existing Markdown processor Pandoc.
 
 ### Collaborative licensing
 
@@ -145,4 +143,24 @@ under the Creative Commons crediting share-alike 4.0.
 
 ## Implementation plan
 
-*FIXME: summarize next chapters*
+*FIXME: sharpen this end summary,
+to better conclude the previous parts*
+
+* extension to an existing widespread authoring language,
+  not an alternative one
+* subtle extension, both to ease adoption by existing Markdown users,
+  and to keep the original principle that it "should be publishable as-is".
+* ease of user adoption rather than ease of implementation
+
+Achieving these goals requires an understanding of Markdown and annotations,
+and it requires working code.
+Existing Markdown syntax is analysed in @sec-commonmark
+and proposed new coverage of annotation in @sec-semantic-markdown.
+(Eventual rendering of annotations is outside the scope of this paper
+and will only be briefly discussed later in @sec-rdf.)  
+Then existing Markdown parsing is described in @sec-pandoc,
+and implementation of extending that to parse annotations
+is described in @sec-filter,
+and evaluated and discussed in the later chapters.
+
+*TODO: Maybe rephrase Pandoc as being not "described" but "analysed"?*
diff --git a/_markdown.qmd b/_markdown.qmd
index fff1112..2e57b18 100644
--- a/_markdown.qmd
+++ b/_markdown.qmd
@@ -35,7 +35,7 @@ since it is abstractly equivalent
 to metadata embedding formats of both PDF and HTML
 (as discussed in more detail at @sec-rdf).
 
-## Syntax of dialect CommonMark
+## Syntax of dialect CommonMark {#sec-commonmark}
 
 Markdown consists of blocks of content,
 optionally prepended a set of Metadata blocks.
@@ -135,7 +135,7 @@ by extending the Markdown language with additional types of annotation.
 Syntax diagrams for some additional Markdown components are included
 as [Appendix @sec-def-dia].
 
-## Syntax of extension Semantic Markdown
+## Syntax of extension Semantic Markdown {#sec-semantic-markdown}
 
 Semantic Markdown mainly extends the syntax for `AnnotatedWords`,
 and introduces a new syntax similar to `LinkDefinition`.
diff --git a/_pandoc.qmd b/_pandoc.qmd
index f408595..3703707 100644
--- a/_pandoc.qmd
+++ b/_pandoc.qmd
@@ -1,7 +1,6 @@
 *FIXME: This chapter is unfinished --
 currently contains 3-4 large chunks (separated by horisontal lines)
 that need to be merged or maybe some parts dropped altogether...*
-
 This chapter will provide an analysis
 of the Markdown processor Pandoc.
 
@@ -65,3 +64,58 @@ in the automated parts of the framework.
 
 Collection of interrelated POSIX scripts and Pandoc extensions
 for enabling semantic annotations in Markdown-based authoring workflows.
+
+## An AST in the spirit of Markdown
+
+The Pandoc AST evolved from the needs of tracking Markdown.
+It mainly keeps track of the structure of content,
+and only to a limited extent handles layout:
+import from layout-rich document formats reduces such information,
+and export to layout-rich document formats
+uses templates to effectively impose a normalized style.
+
+## Complex features {#sec-pandoc-complex}
+
+Some functionalities of Pandoc hook into multiple places of its workflow.
+
+An example of a complex functionality is citation handling.
+The source text contains citation annotations,
+referencing document-wide or externally stored bibliographic metadata.
+Rendering consists of up to three compositions from the two sources:
+an inline text annotation,
+a footnote on the same or a following page,
+and an entry in a bibliography chapter often appended to the main text.
+The style of these up to three renderings varies,
+and Pandoc offers several ways to manage the composition,
+where the recommended approach is to pick one option
+from Citation Style Language,
+a collection of more than 10,000 styles,
+each containing rules, for example on how to sort and abbreviate
+a citation containing multiple authors.
+
+Ideally, citations can be separated
+into prose (annotation), structure (bibliographic metadata)
+and layout concerns (composition),
+with each concern anchored at one place in the workflow.
+In reality, however, automated composition sometimes falls short --
+e.g. the style chosen from Citation Style Language
+is missing translation of some terms in your (natural) language.
+The author may then need to violate the separation of concerns
+(until the underlying issue is properly solved),
+often done by adding compositional information
+to the bibliographic metadata.
+
+Pandoc generally maintains a separation of concerns
+between content, structure and layout,
+in a workflow of isolated functions and APIs
+for import, filtering and export,
+but cannot guarantee this separation.
+While Pandoc offers an import extension API
+for adding support for new source formats,
+the formats implemented in Pandoc itself are maintained in sync
+with other parts of the workflow,
+ensuring a tighter degree of coordination than the API can offer.
+Such potential for the Pandoc import API
+being inferior to built-in source format parsers
+has influenced the choice of implementation for this project,
+as covered next in @sec-filter.
diff --git a/report.qmd b/report.qmd
index 46d539b..3f87481 100644
--- a/report.qmd
+++ b/report.qmd
@@ -88,11 +88,11 @@ are editorial comments not intended for inclusion in the final version.*
 
 {{< include _markdown.qmd >}}
 
-# How Pandoc processes Markdown
+# How Pandoc processes Markdown {#sec-pandoc}
 
 {{< include _pandoc.qmd >}}
 
-# Pandoc filter `sem-md`
+# Pandoc filter `sem-md` {#sec-filter}
 
 {{< include _filter.qmd >}}