author Jonas Smedegaard <dr@jones.dk> 2025-05-19 15:00:20 +0200
committer Jonas Smedegaard <dr@jones.dk> 2025-05-19 15:00:20 +0200
commit a75d2ccb355e3fb489586ccaf82d780ddaeab0a7
tree e8bddd54403a397fcd04a0abcdb510ac158806de /_pandoc.qmd
parent 651fcfe553442965efbaec17891c4d205e0607a5
rename chapter background -> pandoc; expand title for section on usage
Diffstat (limited to '_pandoc.qmd')
-rw-r--r-- _pandoc.qmd 214
1 files changed, 214 insertions, 0 deletions
diff --git a/_pandoc.qmd b/_pandoc.qmd
new file mode 100644
index 0000000..0116286
--- /dev/null
+++ b/_pandoc.qmd
@@ -0,0 +1,214 @@
+*FIXME: This chapter is unfinished --
+currently contains 3-4 large chunks (separated by horizontal lines)
+that need to be merged or maybe some parts dropped altogether...*
+
+This chapter will provide an analysis
+of the data format Markdown
+and the Markdown-based publishing system Quarto.
+
+This project mainly involves navigating in and altering data structures.
+Main data structures are the document formats Markdown, HTML and PDF,
+and the abstract data language RDF,
+serialised as RDFa (embedded in HTML) and XMP (embedded in PDF).
+
+## Markdown
+
+### Structural and layout annotation, and metadata
+
+Original Markdown provides unobtrusive markup
+for content and hypermedia structure,
+to ease the authoring of style-agnostic hypermedia content.
+Later dialects extend the language
+to cover more content and hypermedia structure,
+style annotation
+and text-wide metadata.
+
+The separation of visual concerns from content and structure
+is harnessed by the document converter Pandoc
+and the Pandoc-based document authoring framework Quarto:
+Pandoc with Quarto plugins and templates
+allows annotating a string as a hyperlink or a citation,
+declaring authorship, ownership and release date,
+and rendering as a scholarly paper
+conforming to a prescribed style guide and document format.
+
+### Semantic annotation is missing
+
+None of the existing Markdown dialects,
+however,
+covers annotation of content semantics.
+You cannot -- using existing Markdown dialects --
+annotate a string as contextually related to some content domain,
+in a way that Markdown processors will treat it as such:
+When rendering an output document
+the annotation is omitted from the text
+and optionally accessible as part of document metadata.
+
+Example annotations might include
+some numbers being in meters and others in nautical miles,
+or one citation being supportive and another a rebuttal,
+or one quote using "she" as a personal pronoun
+and another using it derogatorily.
+
+Such meta information tied not to the document as a whole
+but to specific strings in the text
+cannot be written as such --
+i.e. structurally part of the writing
+but communicatively meta to the prose content of the text.
+
+---
+
+Markdown is "probably the most popular markup language today"
+[@Rapp2023, p. 42].
+It was originally defined by @Gruber2004
+as a superset of HTML,
+improving readability and ease of writing
+by adding email-style markup
+for common content structure like headers, emphasis, lists and hyperlinks.
+
+A core principle of Markdown is readability:
+
+> A Markdown-formatted document should be publishable as-is,
+> as plain text,
+> without looking like it’s been marked up
+> with tags or formatting instructions.
+> [@Gruber2004, section "Philosophy"].
+
+Many dialects of Markdown have evolved,
+some tightening the language for parsing efficiency and disambiguation,
+some extending to cover additional structures
+and some including support for a YAML or TOML metadata header section.
+
+Markdown as originally designed is a source format to produce HTML.
+If using only Markdown-defined markup, avoiding HTML tags,
+the text is, however, also reliably translatable to other formats.
+Pandoc is a tool that can convert texts in Markdown dialects
+into many document formats including HTML and (via LaTeX) PDF,
+applying visual style and positioning through templates.
+Such document workflows,
+including minimal structural markup as part of the creative writing,
+applying visual layout as an automated templating process
+tied to a target document format,
+have been further streamlined for academic texts
+in the Quarto document publishing system.
+
+----
+
+Markdown is a text markup language
+with an emphasis on being easy for humans to read
+[@Gruber2004].
+
+Compared to word processors like Microsoft Word and LibreOffice Writer,
+Markdown authoring stores both content and markup together
+in a human-readable text file.
+
+::: {#fig-formality}
+
+```
+informal   /---------formatted text----------\             formal
+<------v-------------v-------------v-----------------------v---->
+  plain text informal markup  formal markup          binary format
+                (Markdown)  (HTML, XML, etc.)
+```
+
+Markdown is informal, ASCII-based markup
+[@Leonard2016, p. 4]
+
+:::
+
+HTML is itself a plaintext format,
+but is less human-readable.
+Similarly the format LaTeX is also plaintext,
+but its markup arguably distracts from the reading process
+[@Mailund2019chap2, p. 9].
+
+### Alternatives
+
+Other human-readable document source formats exist.
+
+*TODO: briefly cover reStructuredText, Org-mode and AsciiDoc.*
+
+### Integration
+
+Markdown is in widespread use.
+
+Major source forges use Markdown by default for `README` files
+[@Github2025; @GitLab2025; @Codeberg2024].
+Some major programming languages
+natively support Markdown in embedded docstrings
+in core tools
+[@Microsoft2023; @Oracle2025; @RustTeam2024];
+others offer optional support e.g. through plugins
+[@Heesch2025; @Sphinx2025; @JSDoc2023].
+
+## Pandoc and Quarto
+
+The Markdown processor Pandoc can transform Markdown not only to HTML
+but also to other output formats like PDF.
+Pandoc offers an API for adapting its content processing
+as well as a templating structure for customizing layout,
+which is streamlined in the document authoring framework Quarto:
+Pandoc with a set of plugins and templates
+enables rendering of scholarly papers
+conforming to prescribed style guides and document formats.
+
+---
+
+Pandoc is a document converter built around the Markdown markup language,
+able to parse from and serialise to many Markdown dialects
+as well as equivalent subsets of other text markup languages
+including HTML, LaTeX (and by extension PDF),
+Office Open XML (as used by recent releases of Microsoft Word)
+and OpenDocument (as used by OpenOffice and LibreOffice).
+Pandoc is extensible,
+supporting custom code loaded at runtime
+either for custom parsing or serialising,
+or for manipulating the intermediate internal content structure
+called Abstract Syntax Tree (AST).
+
+Pandoc supports redefining input and output formats
+and manipulating the internal document structure
+as part of the automated parts of the framework.
+
+Pandoc is extendable.
+Source and output format can be changed or completely redefined
+and the internal document structure manipulated,
+in the automated parts of the framework.
+
+Collection of interrelated POSIX scripts and Pandoc extensions
+for enabling semantic annotations in Markdown-based authoring workflows.
+
+* filter extension to capture annotations
+ * identify semantic metadata in stylistic metadata part of Pandoc YAML header
+ * identify semantic metadata in content part of Pandoc document structure
+ * append semantic metadata to Pandoc YAML document header
+ * strip identified metadata from stylistic metadata and content
+* output format extension to generate PDF
+ * read semantic metadata from Pandoc YAML document header
+ * structure semantic metadata as RDF triples
+ * append RDF triples serialized as part of XMP metadata in PDF
+* output format extension to generate web page
+ * read semantic metadata from Pandoc YAML document header
+ * structure semantic metadata as RDF triples
+ * append RDF triples serialized as RDFa
+
+Markdown provides intuitive and unobtrusive markup syntax
+for structure like headers, emphasis, lists and hyperlinks.
+Pandoc extends Markdown with syntax
+for citation annotation
+and an optional YAML metadata header.
+Quarto extends Markdown further with syntax
+for some styling and some convenience macros,
+and applies templates for a uniform visual styling
+across target document formats.
+
+### Interfaces
+
+* Pandoc document object model (DOM)
+* Resource Description Framework (RDF)
+ * XMP
+ * RDFa
+* Markdown
+ * Semantic Markdown
+* CommonMark
+ * Semantic CommonMark