


The process of authoring conventional text-based content,
i.e. texts consisting mainly of a contiguous set of paragraphs,
can in some sense be described
as a task of materialising thoughts into a linear form expressed in words.
The story may consist of moves in time (e.g. a flashback)
or may not move in time at all (e.g. a dictionary entry),
but the telling of the story - the text itself - is linear.
TODO: reference some supportive Writing Process Research
During such reductive transformation of complex thoughts into linear text,
the authoring may be aided by the ability
to annotate some of the contexts omitted.
A train of thought may contain multiple interconnected trajectories,
where some are left out when reshaping into the written storyline,
but the contexts they represent may still be helpful
during the ongoing writing process,
even if not intended as part of the storyline specifically.
A concrete example common in academic writing is that of citation.
The source of an included theory or argument or counterpoint
is not part of the storyline,
but a reference to it needs to be maintained
for accurately compiling a reference list later appended to the text.
For that specific type of author annotation,
a range of helper tools exist,
integrated to varying degrees with various authoring environments.
Support for author annotations more generically is less common, however.
TODO: reference later chapter covering known existing tools
The choice of authoring environment limits the choice of functionality.
Some authors prefer
a what-you-see-is-what-you-get (WYSIWYG) authoring environment
where the words when written
are visually presented as they would appear in the final document,
e.g. the word processors Microsoft Word or LibreOffice Writer.
Other authors favor
a what-you-see-is-what-you-mean (WYSIWYM) authoring environment
where the words when written
are visually presented to emphasize their structural function in the text,
e.g. the development environment RStudio or the document processor LyX.
Yet others appreciate
an environment with technical oversight of both structure and layout
where prose is intermixed with structural and positioning control commands,
e.g. direct editing of code for the LaTeX typesetting system.
Each class of authoring system enables a different set of options
for author annotations.
Whereas the document formats
for the commonly used tools Microsoft Word and LibreOffice
are binary,
the document formats used with RStudio, LyX and LaTeX
are plaintext,
which means the data format avoids the use of control characters,
allowing for editing with general-purpose text editors.
A fundamental benefit of plaintext formatted texts
is freedom of choice regarding authoring tools
[@White2022, p. 3].
Problem formulation
The author of this project prefers
a console-based WYSIWYM authoring environment,
and for political reasons an environment
consisting solely of freely licensed code.
More specifically,
the authoring system of choice is Quarto,
which allows either a WYSIWYM environment (RStudio among others)
or a more technical console-based environment (any modern text editor),
as it uses the plaintext markup format Markdown as source format,
which is automatically processed by the tool Pandoc
into any of a range of output formats
including LaTeX with further postprocessing into PDF.
The aim of this project is to extend Pandoc and Quarto
to support inclusion of generic annotations
as an integral part of the authoring process.
This aim has been framed with the following problem statement:
How can Unix-style tools for writing technical prose
be extended to support ontological annotations?
Implementation idea and brief plan
Implement plugins for the Pandoc document converter
to enable authoring of ontological annotations in the text content,
inspired by the conceptual idea in @Francart2020,
and publish the plugins
for easy use with the Quarto document publishing system.
Pandoc reads a text document,
parses its structural components into an internal data structure
called an Abstract Syntax Tree (AST),
and serialises it back into a text document.
The AST deliberately prioritises structural information
and is relaxed about visual information,
to preserve literal content
while reducing format-specific stylistic details,
relevant especially when processing between different formats.
The most common use is to read plaintext Markdown files
and write LaTeX code further compiled into a PDF file.
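
The structure-over-style nature of the AST can be illustrated
with a minimal sketch in Python.
It assumes the JSON serialisation of the AST
(as emitted by `pandoc -t json`);
the sample document is hand-written for illustration,
the API version number in it is only a placeholder,
and the helper name `plain_text` is hypothetical.

```python
import json

# Hand-written stand-in for the JSON form of a Pandoc AST:
# one paragraph containing a plain word, a space and an emphasised word.
# The API version number is illustrative only.
AST_JSON = """
{"pandoc-api-version": [1, 23, 1],
 "meta": {},
 "blocks": [{"t": "Para", "c": [
     {"t": "Str", "c": "Plain"},
     {"t": "Space"},
     {"t": "Emph", "c": [{"t": "Str", "c": "text"}]}]}]}
"""

doc = json.loads(AST_JSON)


def plain_text(node):
    """Collect the literal content of an AST fragment, ignoring styling."""
    if isinstance(node, list):
        return "".join(plain_text(child) for child in node)
    if isinstance(node, dict) and "t" in node:
        if node["t"] == "Str":
            return node["c"]
        if node["t"] in ("Space", "SoftBreak"):
            return " "
        # Any other element: descend into its content, if it has any.
        return plain_text(node.get("c", []))
    # Attributes, numbers and other non-element values carry no prose.
    return ""


print(plain_text(doc["blocks"]))
```

The emphasis is recorded as a structural `Emph` node
rather than as format-specific markup,
so the literal content survives independently of the output format.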
Pandoc allows supplying custom reader and writer functions
as well as plugging into and manipulating the AST,
which this project will exploit:
This project will write an extension
to adjust the AST
when abusing the default Markdown reader
to read Markdown with added markup for ontological annotations,
as proposed in @Francart2020
and further sketched as a draft markup format in @Smedegaard2022.
The implemented Pandoc extensions will be designed
both for use standalone and as part of the document authoring framework Quarto,
which uses Pandoc as its central tool with a large set of extensions and templates.
The first milestone is reached
when the filter can simply suppress the added markup.
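
What suppressing the added markup could amount to
is sketched below in Python, operating on the JSON form of the AST.
The sketch rests on a loud assumption:
that annotations arrive in the AST as `Span` elements
carrying a class named `annotation`.
That class name is hypothetical
and is not taken from the draft markup in @Smedegaard2022;
the function names are likewise invented for illustration.

```python
# Hypothetical marker class; NOT the syntax of the draft specification.
ANNOTATION_CLASS = "annotation"


def is_annotation_span(node):
    """True for a Span element whose classes include the marker class."""
    if not (isinstance(node, dict) and node.get("t") == "Span"):
        return False
    # A Span holds [attr, inlines], where attr is [id, classes, key-values].
    (_identifier, classes, _attributes), _inlines = node["c"]
    return ANNOTATION_CLASS in classes


def strip_annotations(node):
    """Return a copy of an AST fragment with annotation spans unwrapped."""
    if isinstance(node, list):
        result = []
        for child in node:
            if is_annotation_span(child):
                # Keep the wrapped inlines, drop the wrapper itself.
                result.extend(strip_annotations(child["c"][1]))
            else:
                result.append(strip_annotations(child))
        return result
    if isinstance(node, dict):
        return {key: strip_annotations(value) for key, value in node.items()}
    return node


paragraph = {"t": "Para", "c": [
    {"t": "Str", "c": "A"},
    {"t": "Space"},
    {"t": "Span", "c": [["", ["annotation"], []],
                        [{"t": "Str", "c": "claim"}]]},
]}

print(strip_annotations(paragraph))
```

Used as a Pandoc JSON filter,
such a function would be applied to the document read from standard input,
with the result written to standard output.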
A further milestone is to embed the expressed annotations
in supported output formats,
e.g. as XMP metadata in PDF output
[@PDFAssociation2020, chapter 14.3]
and as RDFa in HTML output
[@Herman2015].
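
For the HTML case, a rough sketch in Python
shows how one annotated phrase might be rendered
using the RDFa `property` attribute.
The function name `rdfa_span` and the choice of a `span` wrapper
are assumptions made for illustration,
not a description of how the planned extension will emit RDFa.

```python
from html import escape


def rdfa_span(text, prop):
    """Wrap text in a span whose RDFa property attribute names the predicate."""
    # Escape both the attribute value and the text to keep the HTML well-formed.
    return f'<span property="{escape(prop, quote=True)}">{escape(text)}</span>'


print(rdfa_span("Jane Doe", "http://purl.org/dc/terms/creator"))
```

In RDFa, the text content of such an element
is read as the literal value of the named property,
so the annotation stays invisible to readers
while remaining available to RDFa-aware tools.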
Yet another milestone is to make use of the added markup,
e.g. to annotate the purpose of scholarly citations
as presented in @Daquino2023.
As mentioned above,
a specification has already been drafted
in @Smedegaard2022
for the syntax of embedding ontological annotations in Markdown.
The main challenge of this project is to implement that specification
as extensions for the existing Pandoc tool and Quarto framework,
and as part of that potentially also refine the draft specification.