aboutsummaryrefslogtreecommitdiff
path: root/spec.txt
diff options
context:
space:
mode:
Diffstat (limited to 'spec.txt')
-rw-r--r--spec.txt194
1 files changed, 135 insertions, 59 deletions
diff --git a/spec.txt b/spec.txt
index 82ae0b6..7b447f1 100644
--- a/spec.txt
+++ b/spec.txt
@@ -2,8 +2,8 @@
title: CommonMark Spec
author:
- John MacFarlane
-version: 1
-date: 2014-09-06
+version: 2
+date: 2014-09-19
...
# Introduction
@@ -1058,7 +1058,7 @@ a blank line either before or after.
The content of a code fence is treated as literal text, not parsed
as inlines. The first word of the info string is typically used to
specify the language of the code sample, and rendered in the `class`
-attribute of the `pre` tag. However, this spec does not mandate any
+attribute of the `code` tag. However, this spec does not mandate any
particular treatment of the info string.
Here is a simple example with backticks:
@@ -1682,7 +1682,7 @@ them.
[Foo bar]
.
-<p><a href="my url" title="title">Foo bar</a></p>
+<p><a href="my%20url" title="title">Foo bar</a></p>
.
The title may be omitted:
@@ -1745,7 +1745,7 @@ case-insensitive (see [matches](#matches)).
[αγω]
.
-<p><a href="/φου">αγω</a></p>
+<p><a href="/%CF%86%CE%BF%CF%85">αγω</a></p>
.
Here is a link reference definition with no corresponding link.
@@ -1994,11 +1994,11 @@ form of the definition is:
> transforming X in such-and-such a way is a container of type Y
> with these blocks as its content.
-So, we explain what counts as a block quote or list item by
-explaining how these can be *generated* from their contents.
-This should suffice to define the syntax, although it does not
-give a recipe for *parsing* these constructions. (A recipe is
-provided below in the section entitled [A parsing strategy].)
+So, we explain what counts as a block quote or list item by explaining
+how these can be *generated* from their contents. This should suffice
+to define the syntax, although it does not give a recipe for *parsing*
+these constructions. (A recipe is provided below in the section entitled
+[A parsing strategy](#appendix-a-a-parsing-strategy).)
## Block quotes
@@ -2010,9 +2010,9 @@ The following rules define [block quotes](#block-quote):
<a id="block-quote"></a>
1. **Basic case.** If a string of lines *Ls* constitute a sequence
- of blocks *Bs*, then the result of appending a [block quote marker]
- to the beginning of each line in *Ls* is a [block quote](#block-quote)
- containing *Bs*.
+ of blocks *Bs*, then the result of prepending a [block quote
+ marker](#block-quote-marker) to the beginning of each line in *Ls*
+ is a [block quote](#block-quote) containing *Bs*.
2. **Laziness.** If a string of lines *Ls* constitute a [block
quote](#block-quote) with contents *Bs*, then the result of deleting
@@ -3686,9 +3686,9 @@ raw HTML:
.
.
-<http://google.com?find=\*>
+<http://example.com?find=\*>
.
-<p><a href="http://google.com?find=\*">http://google.com?find=\*</a></p>
+<p><a href="http://example.com?find=%5C*">http://example.com?find=\*</a></p>
.
.
@@ -3727,47 +3727,65 @@ foo
## Entities
-Entities are parsed as entities, not as literal text, in all contexts
-except code spans and code blocks. Three kinds of entities are recognized.
+With the goal of making this standard as HTML-agnostic as possible, all
+valid HTML entities in any context are recognized as such and
+converted into unicode characters before they are stored in the AST.
+
+This allows implementations that target HTML output to trivially escape
+the entities when generating HTML, and simplifies the job of
+implementations targetting other languages, as these will only need to
+handle the unicode chars and need not be HTML-entity aware.
[Named entities](#name-entities) <a id="named-entities"></a> consist of `&`
-+ a string of 2-32 alphanumerics beginning with a letter + `;`.
++ any of the valid HTML5 entity names + `;`. The
+[following document](http://www.whatwg.org/specs/web-apps/current-work/multipage/entities.json)
+is used as an authoritative source of the valid entity names and their
+corresponding codepoints.
+
+Conforming implementations that target HTML don't need to generate
+entities for all the valid named entities that exist, with the exception
+of `"` (`&quot;`), `&` (`&amp;`), `<` (`&lt;`) and `>` (`&gt;`), which
+always need to be written as entities for security reasons.
.
&nbsp; &amp; &copy; &AElig; &Dcaron; &frac34; &HilbertSpace; &DifferentialD; &ClockwiseContourIntegral;
.
-<p>&nbsp; &amp; &copy; &AElig; &Dcaron; &frac34; &HilbertSpace; &DifferentialD; &ClockwiseContourIntegral;</p>
+<p>  &amp; © Æ Ď ¾ ℋ ⅆ ∲</p>
.
[Decimal entities](#decimal-entities) <a id="decimal-entities"></a>
-consist of `&#` + a string of 1--8 arabic digits + `;`.
+consist of `&#` + a string of 1--8 arabic digits + `;`. Again, these
+entities need to be recognised and tranformed into their corresponding
+UTF8 codepoints. Invalid Unicode codepoints will be written as the
+"unknown codepoint" character (`0xFFFD`)
.
-&#1; &#35; &#1234; &#992; &#98765432;
+&#35; &#1234; &#992; &#98765432;
.
-<p>&#1; &#35; &#1234; &#992; &#98765432;</p>
+<p># Ӓ Ϡ �</p>
.
[Hexadecimal entities](#hexadecimal-entities) <a id="hexadecimal-entities"></a>
consist of `&#` + either `X` or `x` + a string of 1-8 hexadecimal digits
-+ `;`.
++ `;`. They will also be parsed and turned into their corresponding UTF8 values in the AST.
.
-&#x1; &#X22; &#XD06; &#xcab;
+&#X22; &#XD06; &#xcab;
.
-<p>&#x1; &#X22; &#XD06; &#xcab;</p>
+<p>&quot; ആ ಫ</p>
.
Here are some nonentities:
.
-&nbsp &x; &#; &#x; &#123456789; &ThisIsWayTooLongToBeAnEntityIsntIt; &hi?;
+&nbsp &x; &#; &#x; &ThisIsWayTooLongToBeAnEntityIsntIt; &hi?;
.
-<p>&amp;nbsp &amp;x; &amp;#; &amp;#x; &amp;#123456789; &amp;ThisIsWayTooLongToBeAnEntityIsntIt; &amp;hi?;</p>
+<p>&amp;nbsp &amp;x; &amp;#; &amp;#x; &amp;ThisIsWayTooLongToBeAnEntityIsntIt; &amp;hi?;</p>
.
Although HTML5 does accept some entities without a trailing semicolon
-(such as `&copy`), these are not recognized as entities here:
+(such as `&copy`), these are not recognized as entities here, because it
+makes the grammar too ambiguous:
.
&copy
@@ -3775,13 +3793,13 @@ Although HTML5 does accept some entities without a trailing semicolon
<p>&amp;copy</p>
.
-On the other hand, many strings that are not on the list of HTML5
-named entities are recognized as entities here:
+Strings that are not on the list of HTML5 named entities are not
+recognized as entities either:
.
&MadeUpEntity;
.
-<p>&MadeUpEntity;</p>
+<p>&amp;MadeUpEntity;</p>
.
Entities are recognized in any context besides code spans or
@@ -3797,7 +3815,7 @@ code blocks, including raw HTML, URLs, [link titles](#link-title), and
.
[foo](/f&ouml;&ouml; "f&ouml;&ouml;")
.
-<p><a href="/f&ouml;&ouml;" title="f&ouml;&ouml;">foo</a></p>
+<p><a href="/f%C3%B6%C3%B6" title="föö">foo</a></p>
.
.
@@ -3805,7 +3823,7 @@ code blocks, including raw HTML, URLs, [link titles](#link-title), and
[foo]: /f&ouml;&ouml; "f&ouml;&ouml;"
.
-<p><a href="/f&ouml;&ouml;" title="f&ouml;&ouml;">foo</a></p>
+<p><a href="/f%C3%B6%C3%B6" title="föö">foo</a></p>
.
.
@@ -3813,7 +3831,7 @@ code blocks, including raw HTML, URLs, [link titles](#link-title), and
foo
```
.
-<pre><code class="language-f&ouml;&ouml;">foo
+<pre><code class="language-föö">foo
</code></pre>
.
@@ -3946,7 +3964,7 @@ But this is a link:
.
<http://foo.bar.`baz>`
.
-<p><a href="http://foo.bar.`baz">http://foo.bar.`baz</a>`</p>
+<p><a href="http://foo.bar.%60baz">http://foo.bar.`baz</a>`</p>
.
And this is an HTML tag:
@@ -4024,15 +4042,15 @@ for efficient parsing strategies that do not backtrack:
(a) it is not part of a sequence of four or more unescaped `*`s,
(b) it is not followed by whitespace, and
(c) either it is not followed by a `*` character or it is
- followed immediately by strong emphasis.
+ followed immediately by emphasis or strong emphasis.
2. A single `_` character [can open emphasis](#can-open-emphasis) iff
(a) it is not part of a sequence of four or more unescaped `_`s,
(b) it is not followed by whitespace,
- (c) is is not preceded by an ASCII alphanumeric character, and
+ (c) it is not preceded by an ASCII alphanumeric character, and
(d) either it is not followed by a `_` character or it is
- followed immediately by strong emphasis.
+ followed immediately by emphasis or strong emphasis.
3. A single `*` character [can close emphasis](#can-close-emphasis)
<a id="can-close-emphasis"></a> iff
@@ -4088,6 +4106,11 @@ for efficient parsing strategies that do not backtrack:
emphasis](#can-close-strong-emphasis), and that uses the
same character (`_` or `*`) as the opening delimiter, is reached.
+11. In case of ambiguity, strong emphasis takes precedence. Thus,
+ `**foo**` is `<strong>foo</strong>`, not `<em><em>foo</em></em>`,
+ and `***foo***` is `<strong><em>foo</em></strong>`, not
+ `<em><strong>foo</strong></em>` or `<em><em><em>foo</em></em></em>`.
+
These rules can be illustrated through a series of examples.
Simple emphasis:
@@ -4334,6 +4357,32 @@ __this is a double underscore (`__`)__
<p><strong>this is a double underscore (<code>__</code>)</strong></p>
.
+Or use the other emphasis character:
+
+.
+*_*
+.
+<p><em>_</em></p>
+.
+
+.
+_*_
+.
+<p><em>*</em></p>
+.
+
+.
+*__*
+.
+<p><em>__</em></p>
+.
+
+.
+_**_
+.
+<p><em>**</em></p>
+.
+
`*` delimiters allow intra-word emphasis; `_` delimiters do not:
.
@@ -4509,6 +4558,36 @@ __foo _bar_ baz__
<p><strong>foo <em>bar</em> baz</strong></p>
.
+.
+**foo, *bar*, baz**
+.
+<p><strong>foo, <em>bar</em>, baz</strong></p>
+.
+
+.
+__foo, _bar_, baz__
+.
+<p><strong>foo, <em>bar</em>, baz</strong></p>
+.
+
+But note:
+
+.
+*foo**bar**baz*
+.
+<p><em>foo</em><em>bar</em><em>baz</em></p>
+.
+
+.
+**foo*bar*baz**
+.
+<p><em><em>foo</em>bar</em>baz**</p>
+.
+
+The difference is that in the two preceding cases,
+the internal delimiters [can close emphasis](#can-close-emphasis),
+while in the cases with spaces, they cannot.
+
Note that you cannot nest emphasis directly inside emphasis
using the same delimeter, or strong emphasis directly inside
strong emphasis:
@@ -4590,7 +4669,7 @@ However, a string of four or more `****` can never close emphasis:
<p>*foo****</p>
.
-Note that there are some asymmetries here:
+We retain symmetry in these cases:
.
*foo**
@@ -4598,7 +4677,7 @@ Note that there are some asymmetries here:
**foo*
.
<p><em>foo</em>*</p>
-<p>**foo*</p>
+<p>*<em>foo</em></p>
.
.
@@ -4607,18 +4686,12 @@ Note that there are some asymmetries here:
**foo* bar*
.
<p><em>foo <em>bar</em></em></p>
-<p>**foo* bar*</p>
+<p><em><em>foo</em> bar</em></p>
.
More cases with mismatched delimiters:
.
-**foo* bar*
-.
-<p>**foo* bar*</p>
-.
-
-.
*bar***
.
<p><em>bar</em>**</p>
@@ -4627,7 +4700,7 @@ More cases with mismatched delimiters:
.
***foo*
.
-<p>***foo*</p>
+<p>**<em>foo</em></p>
.
.
@@ -4639,7 +4712,7 @@ More cases with mismatched delimiters:
.
***foo**
.
-<p>***foo**</p>
+<p>*<strong>foo</strong></p>
.
.
@@ -4755,7 +4828,7 @@ braces:
.
[link](</my uri>)
.
-<p><a href="/my uri">link</a></p>
+<p><a href="/my%20uri">link</a></p>
.
The destination cannot contain line breaks, even with pointy braces:
@@ -4806,12 +4879,15 @@ in Markdown:
<p><a href="foo):">link</a></p>
.
-URL-escaping and entities should be left alone inside the destination:
+URL-escaping should be left alone inside the destination, as all
+URL-escaped characters are also valid URL characters. HTML entities in
+the destination will be parsed into their UTF-8 codepoints, as usual, and
+optionally URL-escaped when written as HTML.
.
[link](foo%20b&auml;)
.
-<p><a href="foo%20b&auml;">link</a></p>
+<p><a href="foo%20b%C3%A4">link</a></p>
.
Note that, because titles can often be parsed as destinations,
@@ -4821,7 +4897,7 @@ get unexpected results:
.
[link]("title")
.
-<p><a href="&quot;title&quot;">link</a></p>
+<p><a href="%22title%22">link</a></p>
.
Titles may be in single quotes, double quotes, or parentheses:
@@ -5491,9 +5567,9 @@ spec](http://www.whatwg.org/specs/web-apps/current-work/multipage/forms.html#e-m
Examples of email autolinks:
.
-<foo@bar.baz.com>
+<foo@bar.example.com>
.
-<p><a href="mailto:foo@bar.baz.com">foo@bar.baz.com</a></p>
+<p><a href="mailto:foo@bar.example.com">foo@bar.example.com</a></p>
.
.
@@ -5535,15 +5611,15 @@ These are not autolinks:
.
.
-http://google.com
+http://example.com
.
-<p>http://google.com</p>
+<p>http://example.com</p>
.
.
-foo@bar.baz.com
+foo@bar.example.com
.
-<p>foo@bar.baz.com</p>
+<p>foo@bar.example.com</p>
.
## Raw HTML