| author | Jonas Smedegaard <dr@jones.dk> | 2025-05-19 15:00:20 +0200 |
|---|---|---|
| committer | Jonas Smedegaard <dr@jones.dk> | 2025-05-19 15:00:20 +0200 |
| commit | a75d2ccb355e3fb489586ccaf82d780ddaeab0a7 (patch) | |
| tree | e8bddd54403a397fcd04a0abcdb510ac158806de /_pandoc.qmd | |
| parent | 651fcfe553442965efbaec17891c4d205e0607a5 (diff) | |
rename chapter background -> pandoc; expand title for section on usage
Diffstat (limited to '_pandoc.qmd')
| -rw-r--r-- | _pandoc.qmd | 214 |
1 files changed, 214 insertions, 0 deletions
diff --git a/_pandoc.qmd b/_pandoc.qmd
new file mode 100644
index 0000000..0116286
--- /dev/null
+++ b/_pandoc.qmd
@@ -0,0 +1,214 @@
+*FIXME: This chapter is unfinished --
+currently contains 3-4 large chunks (separated by horizontal lines)
+that need to be merged or maybe some parts dropped altogether...*
+
+This chapter will provide an analysis
+of the data format Markdown
+and the Markdown-based publishing system Quarto.
+
+This project mainly involves navigating in and altering data structures.
+Main data structures are the document formats Markdown, HTML and PDF,
+and the abstract data language RDF,
+serialised as RDFa (embedded in HTML) and XMP (embedded in PDF).
+
+## Markdown
+
+### Structural and layout annotation, and metadata
+
+Original Markdown provides unobtrusive markup
+for content and hypermedia structure,
+to ease the authoring of style-agnostic hypermedia content.
+Later dialects extend the language
+to cover more content and hypermedia structure,
+style annotation
+and text-wide metadata.
+
+The separation of visual concerns from content and structure
+is harnessed by the document converter Pandoc
+and the Pandoc-based document authoring framework Quarto:
+Pandoc with Quarto plugins and templates
+allows annotating a string as a hyperlink or a citation,
+declaring authorship, ownership and release date,
+and rendering as a scholarly paper
+conforming to a prescribed style guide and document format.
+
+### Semantic annotation is missing
+
+None of the existing Markdown dialects,
+however,
+covers annotation of content semantics.
+You cannot -- using existing Markdown dialects --
+annotate a string as contextually related to some content domain
+in a way that Markdown processors will treat as such,
+so that when rendering an output document
+the annotation is omitted from the text
+and optionally made accessible as part of document metadata.
+
+Example annotations might include
+some numbers in meters and others in nautical miles,
+or one citation being supportive and another a rebuttal,
+or one quote using "she" as personal pronoun
+and another using it derogatorily.
+
+Such meta information tied not to the document as a whole
+but to specific strings in the text
+cannot be written as such --
+i.e. structurally part of the writing
+but communicatively meta to the prose content of the text.
+
+---
+
+Markdown is "probably the most popular markup language today"
+[@Rapp2023, p. 42].
+It was originally defined by @Gruber2004
+as a superset of HTML,
+improving readability and ease of writing
+by adding email-style markup
+for common content structure like headers, emphasis, lists and hyperlinks.
+
+A core principle of Markdown is readability:
+
+> A Markdown-formatted document should be publishable as-is,
+> as plain text,
+> without looking like it’s been marked up
+> with tags or formatting instructions.
+> [@Gruber2004, section "Philosophy"].
+
+Many dialects of Markdown have evolved,
+some tightening the language for parsing efficiency and disambiguation,
+some extending to cover additional structures
+and some including support for a YAML or TOML metadata header section.
+
+Markdown as originally designed is a source format to produce HTML.
+If using only Markdown-defined markup, avoiding HTML tags,
+the text is however reliably translatable also to other formats.
+Pandoc is a tool that can convert texts in Markdown dialects
+into many document formats including HTML and (via LaTeX) PDF,
+applying visual style and positioning through templates.
+Such document workflows,
+including minimal structural markup as part of the creative writing
+and applying visual layout as an automated templating process
+tied to a target document format,
+have been further streamlined for academic texts
+in the Quarto document publishing system.
+
+----
+
+Markdown is a text markup language
+with an emphasis on being easy for humans to read
+[@Gruber2004].
+
+Compared to word processors like Microsoft Word and LibreOffice Writer,
+Markdown authoring stores both content and markup together
+in a human-readable text file.
+
+::: {#fig-formality}
+
+```
+informal        /---------formatted text----------\        formal
+<------v-------------v-------------v-----------------------v---->
+  plain text  informal markup formal markup          binary format
+                (Markdown)  (HTML, XML, etc.)
+```
+
+Markdown is informal, ASCII-based markup
+[@Leonard2016, p. 4]
+
+:::
+
+HTML is itself a plaintext format,
+but is less human-readable.
+Similarly the format LaTeX is also plaintext,
+but its markup arguably distracts from the reading process
+[@Mailund2019chap2, p. 9].
+
+### Alternatives
+
+Other human-readable document source formats exist.
+
+*TODO: briefly cover reStructuredText, Org-mode and AsciiDoc.*
+
+### Integration
+
+Markdown is in widespread use.
+
+Major source forges use Markdown by default for `README` files
+[@Github2025; @GitLab2025; @Codeberg2024].
+Some major programming languages
+natively support Markdown in embedded docstrings
+in core tools
+[@Microsoft2023; @Oracle2025; @RustTeam2024];
+others offer optional support e.g. through plugins
+[@Heesch2025; @Sphinx2025; @JSDoc2023].
+
+## Pandoc and Quarto
+
+The Markdown processor Pandoc can transform Markdown not only to HTML
+but also to other output formats like PDF.
+Pandoc offers an API for adapting its content processing
+as well as a templating structure for customizing layout,
+which is streamlined in the document authoring framework Quarto:
+Pandoc with a set of plugins and templates
+enables rendering of scholarly papers
+conforming to prescribed style guides and document formats.
+
+---
+
+Pandoc is a document converter built around the Markdown markup language,
+able to parse from and serialise to many Markdown dialects
+as well as equivalent subsets of other text markup languages
+including HTML, LaTeX (and by extension PDF),
+Office Open XML (as used by recent releases of Microsoft Word)
+and OpenDocument (as used by OpenOffice and LibreOffice).
+Pandoc is extensible,
+supporting custom code loaded at runtime
+either for custom parsing or serialising,
+or for manipulating the intermediate internal content structure
+called the Abstract Syntax Tree (AST).
+
+Pandoc supports redefining input and output formats
+and manipulating the internal document structure
+within the automated parts of the framework.
+
+Pandoc is extendable.
+Source and output formats can be changed or completely redefined
+and the internal document structure manipulated,
+in the automated parts of the framework.
+
+Collection of interrelated POSIX scripts and Pandoc extensions
+for enabling semantic annotations in Markdown-based authoring workflows.
+
+* filter extension to capture annotations
+  * identify semantic metadata in stylistic metadata part of Pandoc YAML header
+  * identify semantic metadata in content part of Pandoc document structure
+  * append semantic metadata to Pandoc YAML document header
+  * strip identified metadata from stylistic metadata and content
+* output format extension to generate PDF
+  * read semantic metadata from Pandoc YAML document header
+  * structure semantic metadata as RDF triples
+  * append RDF triples serialized as part of XMP metadata in PDF
+* output format extension to generate web page
+  * read semantic metadata from Pandoc YAML document header
+  * structure semantic metadata as RDF triples
+  * append RDF triples serialized as RDFa
+
+Markdown provides intuitive and unobtrusive markup syntax
+for structure like headers, emphasis, lists and hyperlinks.
+Pandoc extends Markdown with syntax
+for citation annotation
+and an optional YAML metadata header.
+Quarto extends Markdown further with syntax
+for some styling and some convenience macros,
+and applies templates for uniform visual styling
+across target document formats.
+
+### Interfaces
+
+* Pandoc document object model (DOM)
+* Resource Description Framework (RDF)
+  * XMP
+  * RDFa
+* Markdown
+  * Semantic Markdown
+* CommonMark
+  * Semantic CommonMark
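
The "filter extension to capture annotations" outlined in the chapter (identify semantic metadata in the content, append it to the document header, strip it from the content) could look roughly like the following. This is only a minimal sketch in Python operating on Pandoc's JSON representation of the AST, not the project's implementation. The attribute name `semantic`, the metadata field `semantic-annotations` and the function names are hypothetical, chosen purely for illustration; the sketch assumes an annotation is written as an attribute on a Pandoc span, e.g. `[12 miles]{semantic="unit:nauticalMile"}`.

```python
# Hypothetical names, not defined by Pandoc or by the chapter.
ANNOTATION_KEY = "semantic"
META_KEY = "semantic-annotations"


def plain_text(inlines):
    """Flatten Str and Space inline nodes to a plain string."""
    parts = []
    for node in inlines:
        if node["t"] == "Str":
            parts.append(node["c"])
        elif node["t"] == "Space":
            parts.append(" ")
    return "".join(parts)


def is_annotated_span(node):
    """True for a Span node carrying the (hypothetical) annotation attribute."""
    if not (isinstance(node, dict) and node.get("t") == "Span"):
        return False
    key_values = node["c"][0][2]  # attr = [identifier, classes, key-value pairs]
    return any(key == ANNOTATION_KEY for key, _ in key_values)


def capture(node, found):
    """Walk the AST, recording annotated spans and replacing them by their content."""
    if isinstance(node, list):
        result = []
        for child in node:
            if is_annotated_span(child):
                attr, inlines = child["c"]
                found.append((plain_text(inlines), dict(attr[2])[ANNOTATION_KEY]))
                # Strip the annotation: splice the span's inlines in its place.
                result.extend(capture(inlines, found))
            else:
                result.append(capture(child, found))
        return result
    if isinstance(node, dict):
        return {key: capture(value, found) for key, value in node.items()}
    return node


def annotate(doc):
    """Return a copy of a Pandoc JSON document with annotations moved to metadata."""
    found = []
    blocks = capture(doc["blocks"], found)
    meta = dict(doc.get("meta", {}))
    if found:
        meta[META_KEY] = {
            "t": "MetaList",
            "c": [
                {"t": "MetaMap", "c": {
                    "text": {"t": "MetaString", "c": text},
                    "value": {"t": "MetaString", "c": value},
                }}
                for text, value in found
            ],
        }
    return {**doc, "meta": meta, "blocks": blocks}
```

To use this as an actual Pandoc JSON filter, `annotate` would need to be wired to read the document as JSON from standard input and write the result to standard output, and the script passed to `pandoc --filter`. The output format extensions listed in the chapter could then read the collected entries from the metadata and structure them as RDF triples.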
