From a75d2ccb355e3fb489586ccaf82d780ddaeab0a7 Mon Sep 17 00:00:00 2001
From: Jonas Smedegaard
Date: Mon, 19 May 2025 15:00:20 +0200
Subject: rename chapter background -> pandoc; expand title for section on usage

---
 _background.qmd | 214 --------------------------------------------------------
 _pandoc.qmd     | 214 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 report.qmd      |   6 +-
 3 files changed, 217 insertions(+), 217 deletions(-)
 delete mode 100644 _background.qmd
 create mode 100644 _pandoc.qmd

diff --git a/_background.qmd b/_background.qmd
deleted file mode 100644
index 0116286..0000000
--- a/_background.qmd
+++ /dev/null
@@ -1,214 +0,0 @@
-*FIXME: This chapter is unfinished --
-currently contains 3-4 large chunks (separated by horisontal lines)
-that need to be merged or maybe some parts dropped altogether...*
-
-This chapter will provide and analysis
-of the data format Markdown
-and the Markdown-based publishing system Quarto.
-
-This project mainly involves navigating in and altering data structures.
-Main data structures are the document formats Markdown, HTML and PDF,
-and the abstract data language RDF,
-serialised as RDFa (embedded in HTML) and PDF (embedded in PDF).
-
-## Markdown
-
-### Structural and layout annotation, and metadata
-
-Original Markdown provides unobtrusive markup
-for content and hypermedia structure,
-to ease the authoring of style-agnostic hypermedia content.
-Later dialects extends the language
-to cover more content and hypermedia structure,
-style annotation
-and text-wide metadata.
-
-The separation of visual concerns from content and structure
-is harnessed by the document converter Pandoc
-and the Pandoc-based document authoring framework Quarto:
-Pandoc with Quarto plugins and templates
-allows annotating a string as a hyperlink or a citation,
-declaring authorship, ownership and release date,
-and rendering as a scholarly paper
-conforming to a prescribed style guide and document format.
-
-### Semantic annotation is missing
-
-None of the existing Markdown dialects,
-however,
-covers annotation of content semantics.
-You cannot -- using existing Markdown dialects --
-annotate a string as contextually related to some content domain,
-in a way that Markdown processors will treat it as such:
-When rendering an output document
-the annotation is omitted from the text
-and optionally accessible as part of document metadata.
-
-Example annotations might include
-some numbers in meter and others in nautical miles,
-or one citation being supportive and another a rebuttal,
-or one quote using "she" as personal pronoun
-and another using it derogatory.
-
-Such meta information tied not to the document as a whole
-but to specific strings in the text
-cannot be written as such --
-i.e. structurally part of the writing
-but communicatively meta to the prose content of the text.
-
----
-
-Markdown is "probably the most popular markup language today"
-[@Rapp2023, p. 42].
-It was originally defined by @Gruber2004
-as a superset of HTML,
-improving readability and ease of writing
-by adding email-style markup
-for common content structure like headers, emphasis, lists and hyperlinks.
-
-A core principle of Markdown is readability:
-
-> A Markdown-formatted document should be publishable as-is,
-> as plain text,
-> without looking like it’s been marked up
-> with tags or formatting instructions..
-> [@Gruber2004, section "Philosophy"].
-
-Many dialects of Markdown have evolved,
-some tightening the language for parsing efficiency and disambiguation,
-some extending to cover additional structures
-and some including support for a YAML or TOML metadata header section.
-
-Markdown as originally designed is a source format to produce HTML.
-If using only Markdown-defined markup, avoiding HTML tags,
-the text is however reliably translatable also to other formats.
-Pandoc is a tool that can convert texts in Markdown dialects
-into many document formats including HTML and (via LaTeX) PDF,
-applying visual style and positioning throught templates.
-Such document workflows,
-including minimal structural markup as part of the creative writing,
-applying visual layout as an automated templating process
-tied to a target document format,
-has been further streamlined for academic texts
-in the Quarto document publishing system.
-
-----
-
-Markdown is a text markup language
-with an emphasis on being easy for humans to read
-[@Gruber2004].
-
-Compared to word processors like Microsoft Word and LibreOffice Writer,
-Markdown authoring stores both content and markup together
-in a human-readable tekst file.
-
-::: {#fig-formality}
-
-```
-informal /---------formatted text----------\ formal
-<------v-------------v-------------v-----------------------v---->
- plain text informal markup formal markup binary format
- (Markdown) (HTML, XML, etc.)
-```
-
-Markdown is informal, ASCII-based markup
-[@Leonard2016, p. 4]
-
-:::
-
-HTML is itself a plaintext format,
-but is less human-readable.
-Similarly the format LaTeX is also plaintext,
-but its markdown arguably distracts the reading process
-[@Mailund2019chap2, p. 9].
-
-### Alternatives
-
-Other human-readable document source formats exists.
-
-*TODO: briefly cover reStructuredText, Org-mode and AsciiDoc.*
-
-### Integration
-
-Markdown is in widespread use.
-
-Major source forges use Markdown by default for `README` files
-[@Github2025; @GitLab2025; @Codeberg2024].
-Some major programming languages
-natively support Markdown in embedded docstrings
-in core tools
-[@Microsoft2023; @Oracle2025; @RustTeam2024];
-others offer optional support e.g. through plugins
-[@Heesch2025; @Sphinx2025; @JSDoc2023].
-
-## Pandoc and Quarto
-
-The Markdown processor Pandoc can transform Markdown not only to HTML
-but also to other output formats like PDF.
-Pandoc offers an API for adapting its content processing
-as well as a templating structure for customizing layout,
-which is streamlined in the document authoring framework Quarto:
-Pandoc with a set of plugins and templates
-enables rendering of scholarly papers
-conforming to prescribed style guides and document formats.
-
----
-
-Pandoc is a document converter built around the markdown markup language,
-able to parse from and serialise to many Markdown dialects
-as well as equivalent subsets of other text markup languages
-including HTML, LaTeX (and by extension PDF),
-Office Open XML (as used by recent releases of Microsoft Word)
-and OpenDocument (as used by OpenOffice and LibreOffice).
-Pandoc is extensible,
-supporting custom code loaded at runtime
-either for custom parsing or serialising,
-or for manipulating the intermediate internal content structure
-called Abstract Syntax Tree (AST).
-
-Pandoc supports redefining input and output formats
-and manipulating the internal document structure
-as part of the automated parts of the framework.
-
-Pandoc is extendable.
-Source and output format can be changed or completely redefined
-and the internal document structure manipulated,
-in the automated parts of the framework.
-
-Collection of interrelated POSIX scripts and Pandoc extensions
-for enabling semantic annotations in Markdown-based authoring workflows.
-
-* filter extension to capture annotations
- * identify semantic metadata in stylistic metadata part of Pandoc YAML header
- * identify semantic metadata in content part of Pandoc document structure
- * append semantic metadata to Pandoc YAML document header
- * strip identified metadata from stylistic metadata and content
-* output format extension to generate PDF
- * read semantic metadata from Pandoc YAML document header
- * structure semantic metadata as RDF triples
- * append RDF triples serialized as part of XMP metadata in PDF
-* output format extension to generate web page
- * read semantic metadata from Pandoc YAML document header
- * structure semantic metadata as RDF triples
- * append RDF triples serialized as RDFa
-
-Markdown provides intuitive and unobtrusive markup syntax
-for structure like headers, emphasis, lists and hyperlinks.
-Pandoc extends Markdown with syntax
-for citation annotation
-and an optional YAML metadata header.
-Quarto extends Markdown further with syntax
-for some styling and some convenience macros,
-and applies templates for a uniform visual styling
-across target document formats.
-
-### Interfaces
-
-* Pandoc document object model (DOM)
-* Resource Description Framework (RDF)
- * XMP
- * RDFa
-* Markdown
- * Semantic Markdown
-* CommonMark
- * Semantic CommonMark
diff --git a/_pandoc.qmd b/_pandoc.qmd
new file mode 100644
index 0000000..0116286
--- /dev/null
+++ b/_pandoc.qmd
@@ -0,0 +1,214 @@
+*FIXME: This chapter is unfinished --
+currently contains 3-4 large chunks (separated by horisontal lines)
+that need to be merged or maybe some parts dropped altogether...*
+
+This chapter will provide and analysis
+of the data format Markdown
+and the Markdown-based publishing system Quarto.
+
+This project mainly involves navigating in and altering data structures.
+Main data structures are the document formats Markdown, HTML and PDF,
+and the abstract data language RDF,
+serialised as RDFa (embedded in HTML) and PDF (embedded in PDF).
+
+## Markdown
+
+### Structural and layout annotation, and metadata
+
+Original Markdown provides unobtrusive markup
+for content and hypermedia structure,
+to ease the authoring of style-agnostic hypermedia content.
+Later dialects extends the language
+to cover more content and hypermedia structure,
+style annotation
+and text-wide metadata.
+
+The separation of visual concerns from content and structure
+is harnessed by the document converter Pandoc
+and the Pandoc-based document authoring framework Quarto:
+Pandoc with Quarto plugins and templates
+allows annotating a string as a hyperlink or a citation,
+declaring authorship, ownership and release date,
+and rendering as a scholarly paper
+conforming to a prescribed style guide and document format.
+
+### Semantic annotation is missing
+
+None of the existing Markdown dialects,
+however,
+covers annotation of content semantics.
+You cannot -- using existing Markdown dialects --
+annotate a string as contextually related to some content domain,
+in a way that Markdown processors will treat it as such:
+When rendering an output document
+the annotation is omitted from the text
+and optionally accessible as part of document metadata.
+
+Example annotations might include
+some numbers in meter and others in nautical miles,
+or one citation being supportive and another a rebuttal,
+or one quote using "she" as personal pronoun
+and another using it derogatory.
+
+Such meta information tied not to the document as a whole
+but to specific strings in the text
+cannot be written as such --
+i.e. structurally part of the writing
+but communicatively meta to the prose content of the text.
+
+---
+
+Markdown is "probably the most popular markup language today"
+[@Rapp2023, p. 42].
+It was originally defined by @Gruber2004
+as a superset of HTML,
+improving readability and ease of writing
+by adding email-style markup
+for common content structure like headers, emphasis, lists and hyperlinks.
+
+A core principle of Markdown is readability:
+
+> A Markdown-formatted document should be publishable as-is,
+> as plain text,
+> without looking like it’s been marked up
+> with tags or formatting instructions..
+> [@Gruber2004, section "Philosophy"].
+
+Many dialects of Markdown have evolved,
+some tightening the language for parsing efficiency and disambiguation,
+some extending to cover additional structures
+and some including support for a YAML or TOML metadata header section.
+
+Markdown as originally designed is a source format to produce HTML.
+If using only Markdown-defined markup, avoiding HTML tags,
+the text is however reliably translatable also to other formats.
+Pandoc is a tool that can convert texts in Markdown dialects
+into many document formats including HTML and (via LaTeX) PDF,
+applying visual style and positioning throught templates.
+Such document workflows,
+including minimal structural markup as part of the creative writing,
+applying visual layout as an automated templating process
+tied to a target document format,
+has been further streamlined for academic texts
+in the Quarto document publishing system.
+
+----
+
+Markdown is a text markup language
+with an emphasis on being easy for humans to read
+[@Gruber2004].
+
+Compared to word processors like Microsoft Word and LibreOffice Writer,
+Markdown authoring stores both content and markup together
+in a human-readable tekst file.
+
+::: {#fig-formality}
+
+```
+informal /---------formatted text----------\ formal
+<------v-------------v-------------v-----------------------v---->
+ plain text informal markup formal markup binary format
+ (Markdown) (HTML, XML, etc.)
+```
+
+Markdown is informal, ASCII-based markup
+[@Leonard2016, p. 4]
+
+:::
+
+HTML is itself a plaintext format,
+but is less human-readable.
+Similarly the format LaTeX is also plaintext,
+but its markdown arguably distracts the reading process
+[@Mailund2019chap2, p. 9].
+
+### Alternatives
+
+Other human-readable document source formats exists.
+
+*TODO: briefly cover reStructuredText, Org-mode and AsciiDoc.*
+
+### Integration
+
+Markdown is in widespread use.
+
+Major source forges use Markdown by default for `README` files
+[@Github2025; @GitLab2025; @Codeberg2024].
+Some major programming languages
+natively support Markdown in embedded docstrings
+in core tools
+[@Microsoft2023; @Oracle2025; @RustTeam2024];
+others offer optional support e.g. through plugins
+[@Heesch2025; @Sphinx2025; @JSDoc2023].
+
+## Pandoc and Quarto
+
+The Markdown processor Pandoc can transform Markdown not only to HTML
+but also to other output formats like PDF.
+Pandoc offers an API for adapting its content processing
+as well as a templating structure for customizing layout,
+which is streamlined in the document authoring framework Quarto:
+Pandoc with a set of plugins and templates
+enables rendering of scholarly papers
+conforming to prescribed style guides and document formats.
+
+---
+
+Pandoc is a document converter built around the markdown markup language,
+able to parse from and serialise to many Markdown dialects
+as well as equivalent subsets of other text markup languages
+including HTML, LaTeX (and by extension PDF),
+Office Open XML (as used by recent releases of Microsoft Word)
+and OpenDocument (as used by OpenOffice and LibreOffice).
+Pandoc is extensible,
+supporting custom code loaded at runtime
+either for custom parsing or serialising,
+or for manipulating the intermediate internal content structure
+called Abstract Syntax Tree (AST).
+
+Pandoc supports redefining input and output formats
+and manipulating the internal document structure
+as part of the automated parts of the framework.
+
+Pandoc is extendable.
+Source and output format can be changed or completely redefined
+and the internal document structure manipulated,
+in the automated parts of the framework.
+
+Collection of interrelated POSIX scripts and Pandoc extensions
+for enabling semantic annotations in Markdown-based authoring workflows.
+
+* filter extension to capture annotations
+ * identify semantic metadata in stylistic metadata part of Pandoc YAML header
+ * identify semantic metadata in content part of Pandoc document structure
+ * append semantic metadata to Pandoc YAML document header
+ * strip identified metadata from stylistic metadata and content
+* output format extension to generate PDF
+ * read semantic metadata from Pandoc YAML document header
+ * structure semantic metadata as RDF triples
+ * append RDF triples serialized as part of XMP metadata in PDF
+* output format extension to generate web page
+ * read semantic metadata from Pandoc YAML document header
+ * structure semantic metadata as RDF triples
+ * append RDF triples serialized as RDFa
+
+Markdown provides intuitive and unobtrusive markup syntax
+for structure like headers, emphasis, lists and hyperlinks.
+Pandoc extends Markdown with syntax
+for citation annotation
+and an optional YAML metadata header.
+Quarto extends Markdown further with syntax
+for some styling and some convenience macros,
+and applies templates for a uniform visual styling
+across target document formats.
+
+### Interfaces
+
+* Pandoc document object model (DOM)
+* Resource Description Framework (RDF)
+ * XMP
+ * RDFa
+* Markdown
+ * Semantic Markdown
+* CommonMark
+ * Semantic CommonMark
diff --git a/report.qmd b/report.qmd
index 6fe8837..cc53ff6 100644
--- a/report.qmd
+++ b/report.qmd
@@ -71,11 +71,11 @@ are editorial notes not intented for inclusion in the final delivery.*
 
 {{< include _markdown.qmd >}}
 
-# Analysis of existing framework
+# How Pandoc processes Markdown
 
-{{< include _background.qmd >}}
+{{< include _pandoc.qmd >}}
 
-# Usage
+# Using the Pandoc filter `semantic-markdown`
 
 {{< include _usage.qmd >}}
 
-- 
cgit v1.2.3