| author | Jonas Smedegaard <dr@jones.dk> | 2025-05-25 11:16:25 +0200 |
|---|---|---|
| committer | Jonas Smedegaard <dr@jones.dk> | 2025-05-25 11:24:34 +0200 |
| commit | 747ac65d4708487c0adad7b5c0ed5c85d814b09c (patch) | |
| tree | 67f98ee0e1f55585e89423b6aa2a5ce8fc93ce15 | |
| parent | c948ba705c55f408074ded769ce8fdd4c125435e (diff) | |
add section on complex Pandoc features

| mode | file | lines changed |
|---|---|---|
| -rw-r--r-- | _filter.qmd | 22 |
| -rw-r--r-- | _intro.qmd | 54 |
| -rw-r--r-- | _markdown.qmd | 4 |
| -rw-r--r-- | _pandoc.qmd | 56 |
| -rw-r--r-- | report.qmd | 4 |

5 files changed, 117 insertions, 23 deletions
diff --git a/_filter.qmd b/_filter.qmd
index 7cf343e..1384b32 100644
--- a/_filter.qmd
+++ b/_filter.qmd
@@ -10,6 +10,28 @@ which may initially seem unusual.
 *TODO: About the approach of parsing as Markdown and adjust the fallout,
 instead of writing an import extension.*

+As described in @sec-improve,
+of priority for this project is to improve existing tools
+rather than implement parallel competing ones,
+despite the latter potentially being easier to do
+or lead to a simpler product.
+In @sec-pandoc-complex
+
+*FIXME: tie pieces together, and continue from there
+with the consequence of it was actually tackled.
+
+<!--
+The easiest way to implement a derivation of Markdown
+is likely by writing a Pandoc import filter,
+but that would limit its usefulness with larger frameworks like Quarto.
+Although they do support alternative source languages,
+important features such as citation handling are far better streamlined
+when using the main and best-supported source format.
+This project therefore implements its deviation of Markdown
+by reading the deviant source as if it were Markdown,
+and then applying a filter to adjust the AST after the deliberate misparsing.
+-->
+
 ## The choice of Lua

 This project is implemented in the scripting language Lua.
@@ -104,33 +104,31 @@ as illustrated in @fig-phase1.

 {#fig-phase1}

-### Evolving accepted formats and tools
+### Improve accepted formats and tools {#sec-improve}

 Markdown is a widely adopted authoring format,
 and Pandoc a widely adopted Markdown processor.
 It is easier to invent a new format in a new tool
 than to convince widely adopted ones to evolve;
-however,the latter, if successful, is likely to be far more reliable.
-
-This project aims at introducing new syntax
-while staying close to existing Markdown,
+however, the latter, if successful, is likely to be far more reliable.
+This project aims at staying close to existing Markdown,
+striving to only minimally introduce new syntax,
 unlike, for example, SAM
 that deviates notably from Markdown [@SAM2018].

 Many Markdown processors exist,
 but few are in widespread use.
-Pandoc is widely adopted and has in recent times been integrated
-with R Markdown and the Quarto document publishing system
-to streamline the production of academic papers.
-The easiest way to implement a derivation of Markdown
-is likely by writing a Pandoc import filter,
-but that would limit its usefulness with larger frameworks like Quarto.
-Although they do support alternative source languages,
-important features such as citation handling are far better streamlined
-when using the main and best-supported source format.
-This project therefore implements its deviation of Markdown
-by reading the deviant source as if it were Markdown,
-and then applying a filter to adjust the AST after the deliberate misparsing.
+One popular Markdown processor is Pandoc,
+widely adopted and integrated with multiple wrapper tools
+such as R Markdown and the Quarto document publishing system
+to streamline the production of, for example, academic papers.
+Yet another Markdown parser would probably be easiest to implement
+as a standalone project,
+either by forking an existing one and customize,
+or by use of a library for common components
+and then subclassing as needed.
+This project, however, strives to reuse existing formats and tools,
+and has chosen to extend the existing Markdown processor Pandoc.

 ### Collaborative licensing

@@ -145,4 +143,24 @@ under the Creative Commons crediting share-alike 4.0.

 ## Implementation plan

-*FIXME: summarize next chapters*
+*FIXME: sharpen this end summary,
+to better conclude the previous parts*
+
+* extention to existing widespread authoring language,
+  not an alternative one
+* subtle extension, both to ease adoption of existing Markdown users,
+  and to keep the original principle that it "should be publishable as-is".
+* ease of user adoption rather than ease of implementation
+
+Achieving these goals requires an understanding of Markdown and annotations,
+and it requires working code.
+Existing Markdown syntax is analysed in @sec-commonmark
+and proposed new coverage of annotation in @sec-semantic-markdown.
+(Eventual rendering of annotations is outside the scope of this paper
+and will only be briefly discussed later in @sec-rdf.)
+Then existing Markdown parsing is described in @sec-pandoc,
+and implementation of extending that to parse annotations
+is described in @sec-filter,
+and evaluated and discussed in the later chapters.
+
+*TODO: Maybe rephrase Pandoc as being not "described" but "analysed"?*
diff --git a/_markdown.qmd b/_markdown.qmd
index fff1112..2e57b18 100644
--- a/_markdown.qmd
+++ b/_markdown.qmd
@@ -35,7 +35,7 @@ since it is abstractly equivalent
 to metadata embedding formats of both PDF and HTML
 (as discussed in more detail at @sec-rdf).

-## Syntax of dialect CommonMark
+## Syntax of dialect CommonMark {#sec-commonmark}

 Markdown consists of blocks of content,
 optionally prepended a set of Metadata blocks.
@@ -135,7 +135,7 @@ by extending the Markdown language with additional types of annotation.
 Syntax diagrams for some additional Markdown components
 are included as [Appendix @sec-def-dia].

-## Syntax of extension Semantic Markdown
+## Syntax of extension Semantic Markdown {#sec-semantic-markdown}

 Semantic Markdown mainly extends the syntax for `AnnotatedWords`,
 and introduces a new syntax similar to `LinkDefinition`.
diff --git a/_pandoc.qmd b/_pandoc.qmd
index f408595..3703707 100644
--- a/_pandoc.qmd
+++ b/_pandoc.qmd
@@ -1,7 +1,6 @@
 *FIXME: This chapter is unfinished -- currently contains 3-4 large chunks
 (separated by horisontal lines) that need to be merged
 or maybe some parts dropped altogether...*
-

 This chapter will provide an analysis
 of the Markdown processor Pandoc.
@@ -65,3 +64,58 @@ in the automated parts of the framework.

 Collection of interrelated POSIX scripts and Pandoc extensions
 for enabling semantic annotations in Markdown-based authoring workflows.
+
+## An AST in the spirit of Markdown
+
+The Pandoc AST evolved from the needs of tracking Markdown.
+It mainly keeps track of structure of content,
+and only to a limited extend handles layout:
+import from layout-rich document formats reduces such information,
+and export to layout-rich document formats
+use templates to effectively impose a normalized style.
+
+## Complex features {#sec-pandoc-complex}
+
+Some functionalities of Pandoc hook into multiple places of its workflow.
+
+An example of a complex functionality is citation handling.
+The source text contains citation annotations,
+referencing document-wide or externally stored bibliographic metadata.
+Rendering consists of up to three compositions from the two sources:
+an inline text annotation,
+a footnote on the same or a following page,
+and an entry in a bibliography chapter often appended the main text.
+The style of these up to three renderings vary,
+and Pandoc offers several ways to manage the composition,
+where the recommended approach is to pick one option
+from Citation Style Language,
+a collection of more than 10.000 styles,
+each containing rules, for example of how to sort and abbreviate
+a citation containing multiple authors.
+
+Ideally, citations can be separated
+into prose (annotation), structure (bibliographic metadata)
+and layout concerns (composition),
+with each concern anchored at one place in the workflow.
+In reality, however, automated composition sometimes fail short --
+e.g. the style chosen from Citation Style Language
+is missing translation of some terms in your (natural) language.
+The author may then need to violate the separation of concerns
+(until the underlying issue is properly solved),
+often done by adding compositional information
+to the bibliographic metadata.
+
+Pandoc generally maintain a separation of concern
+between content, structure and layout,
+in a workflow of isolated functions and APIs
+for import, filtering and export,
+but cannot guarantee this separation.
+While Pandoc offers an import extension API
+for adding support for new source formats,
+the formats implemented in Pandoc itself are maintained in sync
+with other parts of the workflow,
+ensuring a tighter degree of coordination than the API can offer.
+Such potential for the Pandoc import API
+being inferior to builtin source format parsers
+has influenced the choice of implementation for this project,
+as covered next in @sec-filter.
@@ -88,11 +88,11 @@ are editorial comments not intended for inclusion in the final version.*

 {{< include _markdown.qmd >}}

-# How Pandoc processes Markdown
+# How Pandoc processes Markdown {#sec-pandoc}

 {{< include _pandoc.qmd >}}

-# Pandoc filter `sem-md`
+# Pandoc filter `sem-md` {#sec-filter}

 {{< include _filter.qmd >}}

