Diffstat (limited to 'spec.txt')
-rw-r--r-- | spec.txt | 434 |
1 files changed, 364 insertions, 70 deletions
@@ -2,8 +2,8 @@ title: CommonMark Spec author: - John MacFarlane -version: 2 -date: 2014-09-19 +version: 0.5 +date: 2014-10-25 ... # Introduction @@ -192,10 +192,10 @@ In the examples, the `→` character is used to represent tabs. # Preprocessing A [line](#line) <a id="line"></a> -is a sequence of zero or more characters followed by a line -ending (CR, LF, or CRLF) or by the end of -file. +is a sequence of zero or more [characters](#character) followed by a +line ending (CR, LF, or CRLF) or by the end of file. +A [character](#character)<a id="character"></a> is a unicode code point. This spec does not specify an encoding; it thinks of lines as composed of characters rather than bytes. A conforming parser may be limited to a certain encoding. @@ -377,16 +377,18 @@ Spaces are allowed at the end: <hr /> . -However, no other characters may occur at the end or the -beginning: +However, no other characters may occur in the line: . _ _ _ _ a a------ + +---a--- . <p>_ _ _ _ a</p> <p>a------</p> +<p>---a---</p> . It is required that all of the non-space characters be the same. @@ -426,8 +428,11 @@ bar <p>bar</p> . -Note, however, that this is a setext header, not a paragraph followed -by a horizontal rule: +If a line of dashes that meets the above conditions for being a +horizontal rule could also be interpreted as the underline of a [setext +header](#setext-header), the interpretation as a +[setext-header](#setext-header) takes precedence. Thus, for example, +this is a setext header, not a paragraph followed by a horizontal rule: . Foo @@ -662,7 +667,10 @@ ATX headers can be empty: A [setext header](#setext-header) <a id="setext-header"></a> consists of a line of text, containing at least one nonspace character, with no more than 3 spaces indentation, followed by a [setext header -underline](#setext-header-underline). A [setext header +underline](#setext-header-underline). 
The line of text must be +one that, were it not followed by the setext header underline, +would be interpreted as part of a paragraph: it cannot be a code +block, header, blockquote, horizontal rule, or list. A [setext header underline](#setext-header-underline) <a id="setext-header-underline"></a> is a sequence of `=` characters or a sequence of `-` characters, with no more than 3 spaces indentation and any number of trailing @@ -807,7 +815,8 @@ of dashes"/> <p>of dashes"/></p> . -The setext header underline cannot be a lazy line: +The setext header underline cannot be a [lazy continuation +line](#lazy-continuation-line) in a list item or block quote: . > Foo @@ -819,6 +828,16 @@ The setext header underline cannot be a lazy line: <hr /> . +. +- Foo +--- +. +<ul> +<li>Foo</li> +</ul> +<hr /> +. + A setext header cannot interrupt a paragraph: . @@ -863,6 +882,56 @@ Setext headers cannot be empty: <p>====</p> . +Setext header text lines must not be interpretable as block +constructs other than paragraphs. So, the line of dashes +in these examples gets interpreted as a horizontal rule: + +. +--- +--- +. +<hr /> +<hr /> +. + +. +- foo +----- +. +<ul> +<li>foo</li> +</ul> +<hr /> +. + +. + foo +--- +. +<pre><code>foo +</code></pre> +<hr /> +. + +. +> foo +----- +. +<blockquote> +<p>foo</p> +</blockquote> +<hr /> +. + +If you want a header with `> foo` as its literal text, you can +use backslash escapes: + +. +\> foo +------ +. +<h2>> foo</h2> +. ## Indented code blocks @@ -1355,8 +1424,8 @@ name is one of the following (case-insensitive): `output`, `col`, `p`, `colgroup`, `pre`, `dd`, `progress`, `div`, `section`, `dl`, `table`, `td`, `dt`, `tbody`, `embed`, `textarea`, `fieldset`, `tfoot`, `figcaption`, `th`, `figure`, `thead`, `footer`, -`footer`, `tr`, `form`, `ul`, `h1`, `h2`, `h3`, `h4`, `h5`, `h6`, -`video`, `script`, `style`. +`tr`, `form`, `ul`, `h1`, `h2`, `h3`, `h4`, `h5`, `h6`, `video`, +`script`, `style`. 
An [HTML block](#html-block) <a id="html-block"></a> begins with an [HTML block tag](#html-block-tag), [HTML comment](#html-comment), @@ -1401,7 +1470,7 @@ okay. <foo><a> . -Here we have two code blocks with a Markdown paragraph between them: +Here we have two HTML blocks with a Markdown paragraph between them: . <DIV CLASS="foo"> @@ -1447,11 +1516,11 @@ A processing instruction: . <?php - echo 'foo' + echo '>'; ?> . <?php - echo 'foo' + echo '>'; ?> . @@ -1946,8 +2015,8 @@ bbb . Final spaces are stripped before inline parsing, so a paragraph -that ends with two or more spaces will not end with a hard line -break: +that ends with two or more spaces will not end with a [hard line +break](#hard-line-break): . aaa @@ -2375,7 +2444,8 @@ An [ordered list marker](#ordered-list-marker) <a id="ordered-list-marker"></a> is a sequence of one of more digits (`0-9`), followed by either a `.` character or a `)` character. -The following rules define [list items](#list-item): +The following rules define [list items](#list-item):<a +id="list-item"></a> 1. **Basic case.** If a sequence of lines *Ls* constitute a sequence of blocks *Bs* starting with a non-space character and not separated @@ -2826,9 +2896,11 @@ Four spaces indent gives a code block: some or all of the indentation from one or more lines in which the next non-space character after the indentation is [paragraph continuation text](#paragraph-continuation-text) is a - list item with the same contents and attributes. + list item with the same contents and attributes.<a + id="lazy-continuation-line"></a> -Here is an example with lazy continuation lines: +Here is an example with [lazy continuation +lines](#lazy-continuation-line): . 1. A paragraph @@ -3005,6 +3077,21 @@ A list item may be empty: </ul> . +A list item can contain a header: + +. +- # Foo +- Bar + --- + baz +. +<ul> +<li><h1>Foo</h1></li> +<li><h2>Bar</h2> +<p>baz</p></li> +</ul> +. 
+ ### Motivation John Gruber's Markdown spec says the following about list items: @@ -3210,12 +3297,12 @@ of an [ordered list](#ordered-list) is determined by the list number of its initial list item. The numbers of subsequent list items are disregarded. -A list is [loose](#loose) if it any of its constituent list items are -separated by blank lines, or if any of its constituent list items -directly contain two block-level elements with a blank line between -them. Otherwise a list is [tight](#tight). (The difference in HTML output -is that paragraphs in a loose with are wrapped in `<p>` tags, while -paragraphs in a tight list are not.) +A list is [loose](#loose)<a id="loose"></a> if it any of its constituent +list items are separated by blank lines, or if any of its constituent +list items directly contain two block-level elements with a blank line +between them. Otherwise a list is [tight](#tight).<a id="tight"></a> +(The difference in HTML output is that paragraphs in a loose list are +wrapped in `<p>` tags, while paragraphs in a tight list are not.) Changing the bullet or ordered list delimiter starts a new list: @@ -3247,6 +3334,87 @@ Changing the bullet or ordered list delimiter starts a new list: </ol> . +In CommonMark, a list can interrupt a paragraph. That is, +no blank line is needed to separate a paragraph from a following +list: + +. +Foo +- bar +- baz +. +<p>Foo</p> +<ul> +<li>bar</li> +<li>baz</li> +</ul> +. + +`Markdown.pl` does not allow this, through fear of triggering a list +via a numeral in a hard-wrapped line: + +. +The number of windows in my house is +14. The number of doors is 6. +. +<p>The number of windows in my house is</p> +<ol start="14"> +<li>The number of doors is 6.</li> +</ol> +. + +Oddly, `Markdown.pl` *does* allow a blockquote to interrupt a paragraph, +even though the same considerations might apply. We think that the two +cases should be treated the same. 
Here are two reasons for allowing +lists to interrupt paragraphs: + +First, it is natural and not uncommon for people to start lists without +blank lines: + + I need to buy + - new shoes + - a coat + - a plane ticket + +Second, we are attracted to a + +> [principle of uniformity](#principle-of-uniformity):<a +> id="principle-of-uniformity"></a> if a span of text has a certain +> meaning, it will continue to have the same meaning when put into a list +> item. + +(Indeed, the spec for [list items](#list-item) presupposes this.) +This principle implies that if + + * I need to buy + - new shoes + - a coat + - a plane ticket + +is a list item containing a paragraph followed by a nested sublist, +as all Markdown implementations agree it is (though the paragraph +may be rendered without `<p>` tags, since the list is "tight"), +then + + I need to buy + - new shoes + - a coat + - a plane ticket + +by itself should be a paragraph followed by a nested sublist. + +Our adherence to the [principle of uniformity](#principle-of-uniformity) +thus inclines us to think that there are two coherent packages: + +1. Require blank lines before *all* lists and blockquotes, + including lists that occur as sublists inside other list items. + +2. Require blank lines in none of these places. + +[reStructuredText](http://docutils.sourceforge.net/rst.html) takes +the first approach, for which there is much to be said. But the second +seems more consistent with established practice with Markdown. + There can be blank lines between items, but two blank lines end a list: @@ -3463,8 +3631,8 @@ This is a tight list, because the blank lines are in a code block: . This is a tight list, because the blank line is between two -paragraphs of a sublist. So the inner list is loose while -the other list is tight: +paragraphs of a sublist. So the sublist is loose while +the outer list is tight: . 
- a @@ -3650,7 +3818,8 @@ If a backslash is itself escaped, the following character is not: <p>\<em>emphasis</em></p> . -A backslash at the end of the line is a hard line break: +A backslash at the end of the line is a [hard line +break](#hard-line-break): . foo\ @@ -3727,21 +3896,25 @@ foo ## Entities -With the goal of making this standard as HTML-agnostic as possible, all HTML valid HTML Entities in any -context are recognized as such and converted into their actual values (i.e. the UTF8 characters representing -the entity itself) before they are stored in the AST. +With the goal of making this standard as HTML-agnostic as possible, all +valid HTML entities in any context are recognized as such and +converted into unicode characters before they are stored in the AST. -This allows implementations that target HTML output to trivially escape the entities when generating HTML, -and simplifies the job of implementations targetting other languages, as these will only need to handle the -UTF8 chars and need not be HTML-entity aware. +This allows implementations that target HTML output to trivially escape +the entities when generating HTML, and simplifies the job of +implementations targetting other languages, as these will only need to +handle the unicode chars and need not be HTML-entity aware. [Named entities](#name-entities) <a id="named-entities"></a> consist of `&` -+ any of the valid HTML5 entity names + `;`. The [following document](http://www.whatwg.org/specs/web-apps/current-work/multipage/entities.json) -is used as an authoritative source of the valid entity names and their corresponding codepoints. ++ any of the valid HTML5 entity names + `;`. The +[following document](http://www.whatwg.org/specs/web-apps/current-work/multipage/entities.json) +is used as an authoritative source of the valid entity names and their +corresponding codepoints. 
-Conforming implementations that target Markdown don't need to generate entities for all the valid -named entities that exist, with the exception of `"` (`"`), `&` (`&`), `<` (`<`) and `>` (`>`), -which always need to be written as entities for security reasons. +Conforming implementations that target HTML don't need to generate +entities for all the valid named entities that exist, with the exception +of `"` (`"`), `&` (`&`), `<` (`<`) and `>` (`>`), which +always need to be written as entities for security reasons. . & © Æ Ď ¾ ℋ ⅆ ∲ @@ -3750,9 +3923,10 @@ which always need to be written as entities for security reasons. . [Decimal entities](#decimal-entities) <a id="decimal-entities"></a> -consist of `&#` + a string of 1--8 arabic digits + `;`. Again, these entities need to be recognised -and tranformed into their corresponding UTF8 codepoints. Invalid Unicode codepoints will be written -as the "unknown codepoint" character (`0xFFFD`) +consist of `&#` + a string of 1--8 arabic digits + `;`. Again, these +entities need to be recognised and tranformed into their corresponding +UTF8 codepoints. Invalid Unicode codepoints will be written as the +"unknown codepoint" character (`0xFFFD`) . # Ӓ Ϡ � @@ -3779,7 +3953,8 @@ Here are some nonentities: . Although HTML5 does accept some entities without a trailing semicolon -(such as `©`), these are not recognized as entities here, because it makes the grammar too ambiguous: +(such as `©`), these are not recognized as entities here, because it +makes the grammar too ambiguous: . © @@ -3787,7 +3962,8 @@ Although HTML5 does accept some entities without a trailing semicolon <p>&copy</p> . -Strings that are not on the list of HTML5 named entities are not recognized as entities either: +Strings that are not on the list of HTML5 named entities are not +recognized as entities either: . 
&MadeUpEntity; @@ -4035,7 +4211,7 @@ for efficient parsing strategies that do not backtrack: (a) it is not part of a sequence of four or more unescaped `*`s, (b) it is not followed by whitespace, and (c) either it is not followed by a `*` character or it is - followed immediately by strong emphasis. + followed immediately by emphasis or strong emphasis. 2. A single `_` character [can open emphasis](#can-open-emphasis) iff @@ -4043,7 +4219,7 @@ for efficient parsing strategies that do not backtrack: (b) it is not followed by whitespace, (c) it is not preceded by an ASCII alphanumeric character, and (d) either it is not followed by a `_` character or it is - followed immediately by strong emphasis. + followed immediately by emphasis or strong emphasis. 3. A single `*` character [can close emphasis](#can-close-emphasis) <a id="can-close-emphasis"></a> iff @@ -4088,16 +4264,42 @@ for efficient parsing strategies that do not backtrack: (c) it is not followed by an ASCII alphanumeric character. 9. Emphasis begins with a delimiter that [can open - emphasis](#can-open-emphasis) and includes inlines parsed - sequentially until a delimiter that [can close + emphasis](#can-open-emphasis) and ends with a delimiter that [can close emphasis](#can-close-emphasis), and that uses the same - character (`_` or `*`) as the opening delimiter, is reached. + character (`_` or `*`) as the opening delimiter. The inlines + between the open delimiter and the closing delimiter are the + contents of the emphasis inline. 10. Strong emphasis begins with a delimiter that [can open strong - emphasis](#can-open-strong-emphasis) and includes inlines parsed - sequentially until a delimiter that [can close strong - emphasis](#can-close-strong-emphasis), and that uses the - same character (`_` or `*`) as the opening delimiter, is reached. 
+ emphasis](#can-open-strong-emphasis) and ends with a delimiter that + [can close strong emphasis](#can-close-strong-emphasis), and that uses the + same character (`_` or `*`) as the opening delimiter. The inlines + between the open delimiter and the closing delimiter are the + contents of the strong emphasis inline. + +Where rules 1--10 above are compatible with multiple parsings, +the following principles resolve ambiguity: + +11. An interpretation `<strong>...</strong>` is always preferred to + `<em><em>...</em></em>`. + +12. An interpretation `<strong><em>...</em></strong>` is always + preferred to `<em><strong>..</strong></em>`. + +13. Earlier closings are preferred to later closings. Thus, + when two potential emphasis or strong emphasis spans overlap, + the first takes precedence: for example, `*foo _bar* baz_` + is parsed as `<em>foo _bar</em> baz_` rather than + `*foo <em>bar* baz</em>`. For the same reason, + `**foo*bar**` is parsed as `<em><em>foo</em>bar</em>*` + rather than `<strong>foo*bar</strong>`. + +14. Inline code spans, links, images, and HTML tags group more tightly + than emphasis. So, when there is a choice between an interpretation + that contains one of these elements and one that does not, the + former always wins. Thus, for example, `*[foo*](bar)` is + parsed as `*<a href="bar">foo*</a>` rather than as + `<em>[foo</em>](bar)`. These rules can be illustrated through a series of examples. @@ -4345,6 +4547,32 @@ __this is a double underscore (`__`)__ <p><strong>this is a double underscore (<code>__</code>)</strong></p> . +Or use the other emphasis character: + +. +*_* +. +<p><em>_</em></p> +. + +. +_*_ +. +<p><em>*</em></p> +. + +. +*__* +. +<p><em>__</em></p> +. + +. +_**_ +. +<p><em>**</em></p> +. + `*` delimiters allow intra-word emphasis; `_` delimiters do not: . @@ -4520,6 +4748,36 @@ __foo _bar_ baz__ <p><strong>foo <em>bar</em> baz</strong></p> . +. +**foo, *bar*, baz** +. +<p><strong>foo, <em>bar</em>, baz</strong></p> +. + +. 
+__foo, _bar_, baz__ +. +<p><strong>foo, <em>bar</em>, baz</strong></p> +. + +But note: + +. +*foo**bar**baz* +. +<p><em>foo</em><em>bar</em><em>baz</em></p> +. + +. +**foo*bar*baz** +. +<p><em><em>foo</em>bar</em>baz**</p> +. + +The difference is that in the two preceding cases, +the internal delimiters [can close emphasis](#can-close-emphasis), +while in the cases with spaces, they cannot. + Note that you cannot nest emphasis directly inside emphasis using the same delimeter, or strong emphasis directly inside strong emphasis: @@ -4601,7 +4859,7 @@ However, a string of four or more `****` can never close emphasis: <p>*foo****</p> . -Note that there are some asymmetries here: +We retain symmetry in these cases: . *foo** @@ -4609,7 +4867,7 @@ Note that there are some asymmetries here: **foo* . <p><em>foo</em>*</p> -<p>**foo*</p> +<p>*<em>foo</em></p> . . @@ -4618,18 +4876,12 @@ Note that there are some asymmetries here: **foo* bar* . <p><em>foo <em>bar</em></em></p> -<p>**foo* bar*</p> +<p><em><em>foo</em> bar</em></p> . More cases with mismatched delimiters: . -**foo* bar* -. -<p>**foo* bar*</p> -. - -. *bar*** . <p><em>bar</em>**</p> @@ -4638,7 +4890,7 @@ More cases with mismatched delimiters: . ***foo* . -<p>***foo*</p> +<p>**<em>foo</em></p> . . @@ -4650,7 +4902,7 @@ More cases with mismatched delimiters: . ***foo** . -<p>***foo**</p> +<p>*<strong>foo</strong></p> . . @@ -4659,6 +4911,46 @@ More cases with mismatched delimiters: <p>***foo <em>bar</em></p> . +The following cases illustrate rule 13: + +. +*foo _bar* baz_ +. +<p><em>foo _bar</em> baz_</p> +. + +. +**foo bar* baz** +. +<p><em><em>foo bar</em> baz</em>*</p> +. + +The following cases illustrate rule 14: + +. +*[foo*](bar) +. +<p>*<a href="bar">foo*</a></p> +. + +. +* +. +<p>*<img src="bar" alt="foo*" /></p> +. + +. +*<img src="foo" title="*"/> +. +<p>*<img src="foo" title="*"/></p> +. + +. +*a`a*` +. +<p>*a<code>a*</code></p> +. 
+ ## Links A link contains a [link label](#link-label) (the visible text), @@ -4817,9 +5109,10 @@ in Markdown: <p><a href="foo):">link</a></p> . -URL-escaping and should be left alone inside the destination, as all URL-escaped characters -are also valid URL characters. HTML entities in the destination will be parsed into their UTF8 -codepoints, as usual, and optionally URL-escaped when written as HTML. +URL-escaping should be left alone inside the destination, as all +URL-escaped characters are also valid URL characters. HTML entities in +the destination will be parsed into their UTF-8 codepoints, as usual, and +optionally URL-escaped when written as HTML. . [link](foo%20bä) @@ -5796,7 +6089,8 @@ Backslash escapes do not work in HTML attributes: ## Hard line breaks A line break (not in a code span or HTML tag) that is preceded -by two or more spaces is parsed as a linebreak (rendered +by two or more spaces is parsed as a [hard line +break](#hard-line-break)<a id="hard-line-break"></a> (rendered in HTML as a `<br />` tag): . |