author John MacFarlane <jgm@berkeley.edu> 2014-12-23 17:24:14 -0700
committer John MacFarlane <jgm@berkeley.edu> 2014-12-23 17:27:39 -0700
commit 8b44dab7b3465445ac4137dc7893665f2336024b
tree fba43806eeed0e3f2f13af5bbe1b001d2fcba5c6
parent c25ca523790f1e7ed222bd7de2245be1c60bf441
Added definitions of whitespace and other character classes.
Closes #108.
-rw-r--r-- spec.txt 162
1 files changed, 100 insertions, 62 deletions
diff --git a/spec.txt b/spec.txt
index 3217e6c..bb7e620 100644
--- a/spec.txt
+++ b/spec.txt
@@ -189,17 +189,61 @@ Markdown, which can then be converted into other formats.
In the examples, the `→` character is used to represent tabs.
-# Preprocessing
+# Preliminaries
+
+## Characters and lines
+
+The input is a sequence of zero or more [lines](#line).
A [line](@line)
is a sequence of zero or more [characters](#character) followed by a
-line ending (CR, LF, or CRLF) or by the end of file.
+[line ending](#line-ending) or by the end of file.
A [character](@character) is a unicode code point.
This spec does not specify an encoding; it thinks of lines as composed
of characters rather than bytes. A conforming parser may be limited
to a certain encoding.
+A [line ending](@line-ending) is, depending on the platform, a
+newline (`U+000A`), carriage return (`U+000D`), or
+carriage return + newline.
+
+For security reasons, a conforming parser must strip or replace the
+Unicode character `U+0000`.
+
+A line containing no characters, or a line containing only spaces
+(`U+0020`) or tabs (`U+0009`), is called a [blank line](@blank-line).
+
+The following definitions of character classes will be used in this spec:
+
+A [whitespace character](@whitespace-character) is a space
+(`U+0020`), tab (`U+0009`), carriage return (`U+000D`), or
+newline (`U+000A`).
+
+[Whitespace](@whitespace) is a sequence of one or more [whitespace
+characters](#whitespace-character).
+
+A [unicode whitespace character](@unicode-whitespace-character) is
+any code point in the unicode `Zs` class, or a tab (`U+0009`),
+carriage return (`U+000D`), newline (`U+000A`), or form feed
+(`U+000C`).
+
+[Unicode whitespace](@unicode-whitespace) is a sequence of one
+or more [unicode whitespace characters](#unicode-whitespace-character).
+
+A [non-space character](@non-space-character) is anything but `U+0020`.
+
+A [punctuation character](@punctuation-character) is anything in
+the unicode classes `Pc`, `Pd`, `Pe`, `Pf`, `Pi`, `Po`, or `Ps`.
+
+An [ASCII punctuation character](@ascii-punctuation-character)
+is a [punctuation character](#punctuation-character) in the
+ASCII class: that is, `!`, `"`, `#`, `$`, `%`, `&`, `'`, `(`, `)`,
+`*`, `+`, `,`, `-`, `.`, `/`, `:`, `;`, `<`, `=`, `>`, `?`, `@`,
+`[`, `\`, `]`, `^`, `_`, `` ` ``, `{`, `|`, `}`, or `~`.
+
+## Tab expansion
+
Tabs in lines are expanded to spaces, with a tab stop of 4 characters:
.
@@ -218,14 +262,6 @@ Tabs in lines are expanded to spaces, with a tab stop of 4 characters:
</code></pre>
.
-Line endings are replaced by newline characters (LF).
-
-A line containing no characters, or a line containing only spaces (after
-tab expansion), is called a [blank line](@blank-line).
-
-For security reasons, a conforming parser must strip or replace the
-Unicode character `U+0000`.
-
# Blocks and inlines
We can think of a document as a sequence of
@@ -394,7 +430,8 @@ a------
<p>---a---</p>
.
-It is required that all of the non-space characters be the same.
+It is required that all of the
+[non-space characters](#non-space-character) be the same.
So, this is not a horizontal rule:
.
@@ -952,9 +989,9 @@ An [indented code block](@indented-code-block) is composed of one or more
[indented chunks](#indented-chunk) separated by blank lines.
An [indented chunk](@indented-chunk) is a sequence of non-blank lines,
each indented four or more spaces. The contents of the code block are
-the literal contents of the lines, including trailing newlines,
-minus four spaces of indentation. An indented code block has no
-attributes.
+the literal contents of the lines, including trailing
+[line endings](#line-ending), minus four spaces of indentation.
+An indented code block has no attributes.
An indented code block cannot interrupt a paragraph, so there must be
a blank line between a paragraph and a following indented code block.
@@ -1750,14 +1787,14 @@ So there is no important loss of expressive power with the new rule.
## Link reference definitions
A [link reference definition](@link-reference-definition)
-consists of a [link
-label](#link-label), indented up to three spaces, followed
-by a colon (`:`), optional blank space (including up to one
-newline), a [link destination](#link-destination), optional
-blank space (including up to one newline), and an optional [link
+consists of a [link label](#link-label), indented up to three spaces, followed
+by a colon (`:`), optional [whitespace](#whitespace) (including up to one
+[line ending](#line-ending)), a [link destination](#link-destination),
+optional [whitespace](#whitespace) (including up to one
+[line ending](#line-ending)), and an optional [link
title](#link-title), which if it is present must be separated
-from the [link destination](#link-destination) by whitespace.
-No further non-space characters may occur on the line.
+from the [link destination](#link-destination) by [whitespace](#whitespace).
+No further [non-space characters](#non-space-character) may occur on the line.
A [link reference definition](#link-reference-definition)
does not correspond to a structural element of a document. Instead, it
@@ -1874,7 +1911,7 @@ It contributes nothing to the document.
.
This is not a link reference definition, because there are
-non-space characters after the title:
+[non-space characters](#non-space-character) after the title:
.
[foo]: /url "title" ok
@@ -2133,7 +2170,8 @@ The following rules define [block quotes](@block-quote):
2. **Laziness.** If a string of lines *Ls* constitute a [block
quote](#block-quote) with contents *Bs*, then the result of deleting
the initial [block quote marker](#block-quote-marker) from one or
- more lines in which the next non-space character after the [block
+ more lines in which the next
+ [non-space character](#non-space-character) after the [block
quote marker](#block-quote-marker) is [paragraph continuation
text](#paragraph-continuation-text) is a block quote with *Bs* as
its content.
@@ -2494,7 +2532,8 @@ is a sequence of one or more digits (`0-9`), followed by either a
The following rules define [list items](@list-item):
1. **Basic case.** If a sequence of lines *Ls* constitute a sequence of
- blocks *Bs* starting with a non-space character and not separated
+ blocks *Bs* starting with a [non-space character](#non-space-character)
+ and not separated
from each other by more than one blank line, and *M* is a list
marker *M* of width *W* followed by 0 < *N* < 5 spaces, then the result
of prepending *M* and the following spaces to the first line of
@@ -2972,7 +3011,7 @@ Four spaces indent gives a code block:
4. **Laziness.** If a string of lines *Ls* constitute a [list
item](#list-item) with contents *Bs*, then the result of deleting
some or all of the indentation from one or more lines in which the
- next non-space character after the indentation is
+ next [non-space character](#non-space-character) after the indentation is
[paragraph continuation text](#paragraph-continuation-text) is a
list item with the same contents and attributes. The unindented
lines are called
@@ -4174,11 +4213,11 @@ A [backtick string](@backtick-string)
is a string of one or more backtick characters (`` ` ``) that is neither
preceded nor followed by a backtick.
-A [code span](@code-span) begins with a backtick string and ends with a backtick
-string of equal length. The contents of the code span are the
-characters between the two backtick strings, with leading and trailing
-spaces and newlines removed, and consecutive spaces and newlines
-collapsed to single spaces.
+A [code span](@code-span) begins with a backtick string and ends with
+a backtick string of equal length. The contents of the code span are
+the characters between the two backtick strings, with leading and
+trailing spaces and [line endings](#line-ending) removed, and
+[whitespace](#whitespace) collapsed to single spaces.
This is a simple code span:
@@ -4206,7 +4245,7 @@ spaces:
<p><code>``</code></p>
.
-Newlines are treated like spaces:
+[Line endings](#line-ending) are treated like spaces:
.
``
@@ -4216,8 +4255,8 @@ foo
<p><code>foo</code></p>
.
-Interior spaces and newlines are collapsed into single spaces, just
-as they would be by a browser:
+Interior spaces and [line endings](#line-ending) are collapsed into
+single spaces, just as they would be by a browser:
.
`foo bar
@@ -4231,13 +4270,13 @@ anyway? A: Because we might be targeting a non-HTML format, and we
shouldn't rely on HTML-specific rendering assumptions.
(Existing implementations differ in their treatment of internal
-spaces and newlines. Some, including `Markdown.pl` and
-`showdown`, convert an internal newline into a `<br />` tag.
-But this makes things difficult for those who like to hard-wrap
-their paragraphs, since a line break in the midst of a code
-span will cause an unintended line break in the output. Others
-just leave internal spaces as they are, which is fine if only
-HTML is being targeted.)
+spaces and [line endings](#line-ending). Some, including `Markdown.pl` and
+`showdown`, convert an internal [line ending](#line-ending) into a
+`<br />` tag. But this makes things difficult for those who like to
+hard-wrap their paragraphs, since a line break in the midst of a code
+span will cause an unintended line break in the output. Others just
+leave internal spaces as they are, which is fine if only HTML is being
+targeted.)
.
`foo `` bar`
@@ -4355,34 +4394,32 @@ The following rules capture all of these patterns, while allowing
for efficient parsing strategies that do not backtrack:
1. A single `*` character [can open emphasis](@can-open-emphasis)
- iff it is not followed by whitespace. (For these purposes,
- any unicode space character counts as whitespace.)
+ iff it is not followed by [unicode whitespace](#unicode-whitespace).
2. A single `_` character [can open emphasis](#can-open-emphasis) iff
- it is not followed by whitespace and it is not preceded by an
- ASCII alphanumeric character.
+ it is not followed by [unicode whitespace](#unicode-whitespace)
+ and it is not preceded by an ASCII alphanumeric character.
3. A single `*` character [can close emphasis](@can-close-emphasis)
- iff it is not preceded by whitespace.
+ iff it is not preceded by [unicode whitespace](#unicode-whitespace).
4. A single `_` character [can close emphasis](#can-close-emphasis) iff
- it is not preceded by whitespace and it is not followed by an
- ASCII alphanumeric character.
+ it is not preceded by [unicode whitespace](#unicode-whitespace)
+ and it is not followed by an ASCII alphanumeric character.
5. A double `**` [can open strong emphasis](@can-open-strong-emphasis)
- iff it is not followed by
- whitespace.
+ iff it is not followed by [unicode whitespace](#unicode-whitespace).
6. A double `__` [can open strong emphasis](#can-open-strong-emphasis)
- iff it is not followed by whitespace and it is not preceded by an
- ASCII alphanumeric character.
+ iff it is not followed by [unicode whitespace](#unicode-whitespace)
+ and it is not preceded by an ASCII alphanumeric character.
7. A double `**` [can close strong emphasis](@can-close-strong-emphasis)
- iff it is not preceded by whitespace.
+ iff it is not preceded by [unicode whitespace](#unicode-whitespace).
8. A double `__` [can close strong emphasis](#can-close-strong-emphasis)
- iff it is not preceded by whitespace and it is not followed by an
- ASCII alphanumeric character.
+ iff it is not preceded by [unicode whitespace](#unicode-whitespace)
+ and it is not followed by an ASCII alphanumeric character.
9. Emphasis begins with a delimiter that [can open
emphasis](#can-open-emphasis) and ends with a delimiter that [can close
@@ -6610,8 +6647,8 @@ baz
baz</p>
.
-For a more visible alternative, a backslash before the newline may be
-used instead of two spaces:
+For a more visible alternative, a backslash before the
+[line ending](#line-ending) may be used instead of two spaces:
.
foo\
@@ -6734,9 +6771,10 @@ foo
A regular line break (not in a code span or HTML tag) that is not
preceded by two or more spaces is parsed as a softbreak. (A
-softbreak may be rendered in HTML either as a newline or as a space.
-The result will be the same in browsers. In the examples here, a
-newline will be used.)
+softbreak may be rendered in HTML either as a
+[line ending](#line-ending) or as a space. The result will be the same
+in browsers. In the examples here, a [line ending](#line-ending) will
+be used.)
.
foo
@@ -6971,9 +7009,9 @@ document
str "aliquando id"
```
-Notice how the newline in the first paragraph has been parsed as
-a `softbreak`, and the asterisks in the first list item have become
-an `emph`.
+Notice how the [line ending](#line-ending) in the first paragraph has
+been parsed as a `softbreak`, and the asterisks in the first list item
+have become an `emph`.
The document can be rendered as HTML, or in any other format, given
an appropriate renderer.
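
As an illustration of the character classes this commit adds under
"Characters and lines", the following is a minimal Python sketch, not
part of the spec or of any reference implementation. The function
names are hypothetical, and the sketch assumes that Python's
`unicodedata.category` is an acceptable source for the unicode
general categories (`Zs`, `Pc`, `Pd`, and so on) that the definitions
refer to.

```python
import unicodedata

# Unicode general categories named in the "punctuation character" definition.
PUNCTUATION_CLASSES = {"Pc", "Pd", "Pe", "Pf", "Pi", "Po", "Ps"}

# The ASCII punctuation characters, copied literally from the list in the spec text.
ASCII_PUNCTUATION = set("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")


def is_whitespace_char(ch: str) -> bool:
    """Space, tab, carriage return, or newline."""
    return ch in " \t\r\n"


def is_unicode_whitespace_char(ch: str) -> bool:
    """Any code point in class Zs, or tab, CR, newline, or form feed."""
    return unicodedata.category(ch) == "Zs" or ch in "\t\r\n\f"


def is_non_space_char(ch: str) -> bool:
    """Anything but U+0020."""
    return ch != "\u0020"


def is_punctuation_char(ch: str) -> bool:
    """Anything in the unicode classes Pc, Pd, Pe, Pf, Pi, Po, or Ps."""
    return unicodedata.category(ch) in PUNCTUATION_CLASSES


def is_ascii_punctuation_char(ch: str) -> bool:
    """Membership in the explicit ASCII list given in the spec."""
    return ch in ASCII_PUNCTUATION


def is_blank_line(line: str) -> bool:
    """A line with no characters, or only spaces and tabs.

    Assumption: `line` may still carry its trailing line ending,
    which is ignored here.
    """
    return all(ch in " \t" for ch in line.rstrip("\r\n"))
```

The sketch takes the ASCII punctuation list literally rather than
deriving it from `is_punctuation_char`, because several listed
characters (for example `$` and `+`) belong to unicode symbol classes
rather than to any of the `P*` classes.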