From 646a9823671b0e4011d8e40d03d99194def9ccc3 Mon Sep 17 00:00:00 2001 From: John MacFarlane Date: Fri, 10 Jul 2015 14:26:34 -0700 Subject: New spec for HTML blocks. --- spec.txt | 511 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 441 insertions(+), 70 deletions(-) diff --git a/spec.txt b/spec.txt index 9989d21..ed9b8e2 100644 --- a/spec.txt +++ b/spec.txt @@ -210,8 +210,7 @@ of characters rather than bytes. A conforming parser may be limited to a certain encoding. A [line](@line) is a sequence of zero or more [character]s -following a [line ending] (or the beginning of the file) and -followed by a [line ending] (or the end of the file). +followed by a [line ending] or by the end of file. A [line ending](@line-ending) is a newline (`U+000A`), carriage return (`U+000D`), or carriage return + newline. @@ -229,7 +228,9 @@ form feed (`U+000C`), or carriage return (`U+000D`). character]s. A [unicode whitespace character](@unicode-whitespace-character) is -any [whitespace character] or code point in the unicode `Zs` class. +any code point in the unicode `Zs` class, or a tab (`U+0009`), +carriage return (`U+000D`), newline (`U+000A`), or form feed +(`U+000C`). [Unicode whitespace](@unicode-whitespace) is a sequence of one or more [unicode whitespace character]s. @@ -1592,27 +1593,65 @@ Closing code fences cannot have [info string]s: ## HTML blocks -An [HTML block tag](@html-block-tag) is -an [open tag] or [closing tag] whose tag -name is one of the following (case-insensitive): -`article`, `header`, `aside`, `hgroup`, `blockquote`, `hr`, `iframe`, -`body`, `li`, `map`, `button`, `object`, `canvas`, `ol`, `caption`, -`output`, `col`, `p`, `colgroup`, `pre`, `dd`, `progress`, `div`, -`section`, `dl`, `table`, `td`, `dt`, `tbody`, `embed`, `textarea`, -`fieldset`, `tfoot`, `figcaption`, `th`, `figure`, `thead`, `footer`, -`tr`, `form`, `ul`, `h1`, `h2`, `h3`, `h4`, `h5`, `h6`, `video`, -`script`, `style`. 
- -An [HTML block](@html-block) begins with an -[HTML block tag], [HTML comment], [processing instruction], -[declaration], or [CDATA section]. -It ends when a [blank line] or the end of the -input is encountered. The initial line may be indented up to three -spaces, and subsequent lines may have any indentation. The contents -of the HTML block are interpreted as raw HTML, and will not be escaped -in HTML output. - -Some simple examples: +An [HTML block](@html-block) is a group of lines that is treated +as raw HTML (and will not be escaped in HTML output). + +There are seven kinds of [HTML block], which can be defined +by their start and end conditions. The block begins with a line that +meets a [start condition](@start-condition) (after up to three spaces +of optional indentation). It ends with the first subsequent line that +meets a matching [end condition](@end-condition), or the last line of +the document, if no line is encountered that meets the +[end condition]. If the first line meets both the [start condition] +and the [end condition], the block will contain just that line. + +1. **Start condition:** line begins with the string `<script`, +`<pre`, or `<style` (case-insensitive), followed by whitespace, +the string `>`, or the end of the line.\ +**End condition:** line contains an end tag +`</script>`, `</pre>`, or `</style>` (case-insensitive; it +need not match the start tag). + +2. **Start condition:** line begins with the string `<!--`.\ +**End condition:** line contains the string `-->`. + +3. **Start condition:** line begins with the string `<?`.\ +**End condition:** line contains the string `?>`. + +4. **Start condition:** line begins with the string `<!` +followed by an uppercase ASCII letter.\ +**End condition:** line contains the character `>`. + +5. **Start condition:** line begins with the string +`<![CDATA[`.\ +**End condition:** line contains the string `]]>`. + +6. **Start condition:** line begins with the string `<` or `</` +followed by one of the strings (case-insensitive) …, +followed by [whitespace], the end of the line, the string `>`, or +the string `/>`.\ +**End condition:** line is followed by a [blank line]. + +7. **Start condition:** line begins with an [open tag] +(with any [tag name]) followed only by [whitespace] or the end +of the line.\ +**End condition:** line is followed by a [blank line]. + +All types of [HTML blocks] except type 7 may interrupt +a paragraph. Blocks of type 7 may not interrupt a paragraph. 
+(This restriction is intended to prevent unwanted interpretation +of long tags inside a wrapped paragraph as starting HTML blocks.) + +Some simple examples follow. Here are some basic HTML blocks +of type 6: . @@ -1645,6 +1684,16 @@ okay. . +A block can also start with a closing tag: + +. + +*foo* +. + +*foo* +. + Here we have two HTML blocks with a Markdown paragraph between them: . @@ -1659,7 +1708,94 @@ Here we have two HTML blocks with a Markdown paragraph between them: . -In the following example, what looks like a Markdown code block +The tag on the first line can be partial, as long +as it is split where there would be whitespace: + +. +
+
+. +
+
+. + +. +
+
+. +
+
+. + +An open tag need not be closed: +. +
+*foo* + +*bar* +. +
+*foo* +

bar

+. + + +A partial tag need not even be completed (garbage +in, garbage out): + +. +
+. + +. + +. +
+foo +
+. +
+foo +
+. + +Everything until the next blank line or end of document +gets included in the HTML block. So, in the following +example, what looks like a Markdown code block is actually part of the HTML block, which continues until a blank line or the end of the document is reached: @@ -1675,43 +1811,241 @@ int x = 33; ``` . -A comment: +To start an [HTML block] with a tag that is *not* in the +list of block-level tags in (6), you must put the tag by +itself on the first line (and it must be complete): + +. + +*bar* + +. + +*bar* + +. + +In type 7 blocks, the [tag name] can be anything: + +. + +*bar* + +. + +*bar* + +. + +. + +*bar* + +. + +*bar* + +. + +These rules are designed to allow us to work with tags that +can function as either block-level or inline-level tags. +The `<del>` tag is a nice example. We can surround content with +`<del>` tags in three different ways. In this case, we get a raw +HTML block, because the `<del>` tag is on a line by itself: + +. + +*foo* + +. + +*foo* + +. + +In this case, we get a raw HTML block that just includes +the `<del>` tag (because it ends with the following blank +line). So the contents get interpreted as CommonMark: + +. + + +*foo* + + +. + +

foo

+
+. + +Finally, in this case, the `<del>` tags are interpreted +as [raw HTML] *inside* the CommonMark paragraph. (Because +the tag is not on a line by itself, we get inline HTML +rather than an [HTML block].) + +. +*foo* +. +

foo

+. + +HTML tags designed to contain literal content +(`script`, `style`, `pre`), comments, processing instructions, +and declarations are treated somewhat differently. +Instead of ending at the first blank line, these blocks +end at the first line containing a corresponding end tag. +As a result, these blocks can contain blank lines: + +A pre tag (type 1): + +. +

+import Text.HTML.TagSoup
+
+main :: IO ()
+main = print $ parseTags "bar"
+
+. +

+import Text.HTML.TagSoup
+
+main :: IO ()
+main = print $ parseTags "bar"
+
+. + +A script tag (type 1): + +. + +. + +. + +A style tag (type 1): + +. + +. + +. + +If there is no matching end tag, the block will end at the +end of the document: + +. + +*foo* +. + +

foo

+. + +. +*bar* +*baz* +. +*bar* +

baz

+. + +Note that anything on the last line after the +end tag will be included in the [HTML block]: + +. +1. *bar* +. +1. *bar* +. + +A comment (type 2): . . . -A processing instruction: + +A processing instruction (type 3): . '; + ?> . '; + ?> . -CDATA: +A declaration (type 4): + +. + +. + +. + +CDATA (type 5): . @@ -1719,13 +2053,12 @@ else @@ -1743,8 +2076,18 @@ The opening tag can be indented 1-3 spaces, but not 4: . -An HTML block can interrupt a paragraph, and need not be preceded -by a blank line. +. +
+ +
+. +
+
<div>
+
+. + +An HTML block of types 1--6 can interrupt a paragraph, and need not be +preceded by a blank line. . Foo @@ -1758,8 +2101,8 @@ bar
. -However, a following blank line is always needed, except at the end of -a document: +However, a following blank line is needed, except at the end of +a document, and except for blocks of types 1--5, above: .
@@ -1773,14 +2116,16 @@ bar *foo* . -An incomplete HTML block tag may also start an HTML block: +HTML blocks of type 7 cannot interrupt a paragraph: . -
+baz . -
Foo + +baz

. This rule differs from John Gruber's original Markdown syntax @@ -1799,8 +2144,8 @@ here: - It requires a matching end tag, which it also does not allow to be indented. -Indeed, most Markdown implementations, including some of Gruber's -own perl implementations, do not impose these restrictions. +Most Markdown implementations (including some of Gruber's own) do not +respect all of these restrictions. There is one respect, however, in which Gruber's rule is more liberal than the one given here, since it allows blank lines to occur inside @@ -1811,6 +2156,8 @@ if no matching end tag is found. Second, it provides a very simple and flexible way of including Markdown content inside HTML tags: simply separate the Markdown from the HTML using blank lines: +Compare: + .
@@ -1823,8 +2170,6 @@ simply separate the Markdown from the HTML using blank lines:
. -Compare: - .
*Emphasized* text. @@ -1868,11 +2213,37 @@ Hi . -Moreover, blank lines are usually not necessary and can be -deleted. The exception is inside `<pre>` tags; here, one can
-replace the blank lines with `&#10;` entities.
+There are problems, however, if the inner tags are indented
+*and* separated by spaces, as then they will be interpreted as
+an indented code block:
+
+.
+
+
+  
+
+    
+
+  
+
+
+ Hi +
+. + + +
<td>
+  Hi
+</td>
+
+ +
+. -So there is no important loss of expressive power with the new rule. +Fortunately, blank lines are usually not necessary and can be +deleted. The exception is inside `<pre>` tags, but as described
+above, raw HTML blocks starting with `<pre>` *can* contain blank
+lines.
 
 ## Link reference definitions
 
@@ -4359,7 +4730,7 @@ raw HTML:
 .
 
 .
-

+ . But they work in all other contexts, including URLs and link titles, @@ -4399,8 +4770,8 @@ than HTML need not be HTML-entity aware. HTML renderers may either escape unicode characters as entities or leave them as they are. (However, `"`, `&`, `<`, and `>` must always be rendered as entities.) -[Named entities](@name-entities) consist of `&` + any -of the valid HTML5 entity names + `;`. The +[Named entities](@name-entities) consist of `&` ++ any of the valid HTML5 entity names + `;`. The [following document](https://html.spec.whatwg.org/multipage/entities.json) is used as an authoritative source of the valid entity names and their corresponding codepoints. @@ -4429,8 +4800,8 @@ the codepoint `U+0000` will also be replaced by `U+FFFD`. . [Hexadecimal entities](@hexadecimal-entities) -consist of `&#` + either `X` or `x` + a string of 1-8 hexadecimal -digits + `;`. They will also be parsed and turned into the corresponding +consist of `&#` + either `X` or `x` + a string of 1-8 hexadecimal digits ++ `;`. They will also be parsed and turned into the corresponding unicode codepoints in the AST. . @@ -4473,7 +4844,7 @@ code blocks, including raw HTML, URLs, [link title]s, and . . -

+ . . @@ -6041,7 +6412,7 @@ A link can contain fragment identifiers and queries: .

link

link

-

link

+

link

. Note that a backslash before a non-escapable character is @@ -7233,8 +7604,8 @@ Closing tags: . -

-

+ + . Illegal attributes in closing tag: @@ -7301,7 +7672,7 @@ Entities are preserved in HTML attributes: . . -

+ . Backslash escapes do not work in HTML attributes: @@ -7309,7 +7680,7 @@ Backslash escapes do not work in HTML attributes: . . -

+ . . -- cgit v1.2.3
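
The start and end conditions for HTML block types 1 to 5 in the patch above can be illustrated with a small sketch. This is not part of the patch and is not how any reference implementation works. The names `CONDITIONS`, `start_type`, and `block_extent` are hypothetical, and the delimiter strings are this sketch's reading of the conditions, inferred from the examples labelled "type 1" through "type 5". Types 6 and 7 are left out because they depend on the tag-name list and the [open tag] grammar.

```python
import re

# (start pattern, end pattern) for HTML block types 1-5.
# Assumption: these delimiters are inferred from the patch's examples.
CONDITIONS = {
    1: (re.compile(r"<(?:script|pre|style)(?:\s|>|$)", re.IGNORECASE),
        re.compile(r"</(?:script|pre|style)>", re.IGNORECASE)),
    2: (re.compile(r"<!--"), re.compile(r"-->")),
    3: (re.compile(r"<\?"), re.compile(r"\?>")),
    4: (re.compile(r"<![A-Z]"), re.compile(r">")),
    5: (re.compile(r"<!\[CDATA\["), re.compile(r"\]\]>")),
}


def start_type(line):
    """Return the block type (1-5) whose start condition the line meets, or None."""
    stripped = line.lstrip(" ")
    if len(line) - len(stripped) > 3:  # four or more spaces of indentation
        return None
    for kind, (start, _end) in CONDITIONS.items():
        if start.match(stripped):
            return kind
    return None


def block_extent(lines, first):
    """Return (type, first, last) for a block starting at lines[first], or None."""
    kind = start_type(lines[first])
    if kind is None:
        return None
    end = CONDITIONS[kind][1]
    # The search starts at the first line itself, so a line that meets both
    # conditions yields a one-line block.
    for i in range(first, len(lines)):
        if end.search(lines[i]):
            return kind, first, i
    # No line meets the end condition: the block runs to the last line.
    return kind, first, len(lines) - 1
```

For instance, `block_extent(["<pre>", "a", "", "b", "</pre> tail", "para"], 0)` returns `(1, 0, 4)`: the blank line does not end a type 1 block, and the text after the end tag stays on the block's last line.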