PlainMark - Simple Humane Markup.txt

# PlainMark:
# a Simple Humane Markup

The following specifies a text markup inspired by Markdown, but simplified to ease parsing, learning and usage.
It aims more at writing short comments than writing big, complex documents. For the latter, something like [CommonMark](http://commonmark.org/) is better suited.

Newline characters do a line break in the rendering, equivalent to the `br` HTML tag.
Two consecutive newlines (an empty line) before a non-block line create a new paragraph, like the `p` tag in HTML.

The markup uses Ascii characters in a given context to apply a special style (eg. HTML markup + CSS) to whole lines of text (blocks, div-like in HTML) or to fragments of text (span-like in HTML).
These special characters can loose their meaning in some context, and can always be escaped with the tilde `~~` sign preceding them.
If tilde precedes a non-markup character, it is kept literal. It can also be doubled to figure a literal tilde.

Block markup characters are defined at the start of a line, regardless of initial spaces. These initial spaces are just skipped. They are always followed by a space or a tab. Additional whitespace before the start of the text is ignored.

Fragment markup characters loose their meaning if surrounded by spaces or by letters or digits.
They go by pairs, correctly nested. A single markup sign (without a closing one before the end of the line) looses its meaning (is kept literal).
Currently, there is an exception for an unterminated code fragment: it runs to the end of the line (limitation of the parser).

### Limitations

There is no support for blockquotes (I prefer to use double quotes surrounding an italicized citation), tables or images. No HTML markup can be used, `&` ,`<` and `>` signs are escaped (kept literal) in an HTML rendering. HTML entities are not supported (might be, later).


## Styles

Fragments of text can be bold, italic, stroked through or with fixed font.
In HTML, they are rendered respectively with the `strong`, `em` (emphasis), `del` (deleted) and `code` tags.

The markup uses respectively star `*`, underscore `_`, dash `-` and backtick `~`` surrounding the fragment.
There can be no space after the initial sign, and no space before the ending sign.
There can be no letter or digit before the initial sign, or after the ending one.
Empty fragments like ** are kept literally.

So `*bold*`, `this *is bold* too` and `*20 %*` are valid markup, seen as *bold*, this *is bold* too and *20 %*. But x*y and x * y remain literal.
`"_This is a citation_"` is also valid markup (shown as "_This is a citation_"), but CONST_NAME is kept as is.
`-striked through-` is shown as -striked through-, but in-line or a - b are kept as is.
Code fragments can be shown with a fixed font (`code` tag in HTML) by surrounding them with backticks: `~`int x = 0;~`` will show as `int x = 0;`.
Inside a code fragment, markup characters (except tilde) loose their meaning.

Fragment styling can be nested:
This sentence has `_italic parts *and bold* too_`.
becomes:
This sentence has _italic parts *and bold* too_.

The ending signs must be in reverse order of the starting ones:
This is `*_strong emphasized*_` text.
will be displayed as:
This is *_strong emphasized*_ text.


## Links

A link can be made explicitly by wrapping the link text in brackets `[]`, followed by the link itself in parentheses `()`.
Example: `[A _well known_ destination](http://www.google.com)` or `[*Popular* programming site](https://github.com)` or `[Relative -to *this* site- link](../foo/bar.html)` become:
[A _well known_ destination](http://www.google.com) and [*Popular* programming site](https://github.com)  or [Relative -to *this* site- link](../foo/bar.html).
The link text can have markup signs in it, and balanced (or escaped) square brackets `[]` are allowed too.
The URL can have balanced parentheses `()` in it.
If an explicit link is inside the link text of an explicit link, the external one is not rendered as link.

Characters accepted raw in a link are:
		`A-Z` `a-z` `0-9` `-` `.` `_` `~~` (unreserved)
		`:` `/` `?` `#` `[` `]` `@` (reserved, gen-delims)
		`!` `$` `&` `\`` `(` `)` `*` `+` `,` `;` `=` (reserved, sub-delims)
		`%` (escape)


### URL autolinking

URLs starting with a common schema (http://, https://, ftp://, ftps://, sftp://) are automatically turned into a link to that URL. For other schemas, use the explicit link form.
The URL conversion stops on some characters, that should be escaped if they are part of the URL. Unlike some autolinking libraries, PlainMark doesn't attempt to guess an URL if it has no schema (ie. we don't do autolinking of google.com or www.example.com/whatever).
Parentheses in the URL are accepted if they are balanced, otherwise escape them: `%28` for `(`, `%29` for `)`.

Markup signs are ignored while parsing an URL.
The link text will be the URL without the schema (to be shorter). If the URL is longer than a predefined (ajustable) length, it will be shortened with ellipsis.

Example: `http://daringfireball.net/projects/markdown/dingus` becomes http://daringfireball.net/projects/markdown/dingus


## Paragraph

A line break is rendered by a simple line break, ie. in HTML a `br` tag.
An empty line (or several consecutive ones) separates paragraphs, rendered in HTML with a `p` tag.


## Titles

PlainMark has only three levels of title.
In HTML rendering, they are not necessarily mapped to `h1` to `h3`. They might be mapped to `h3` to `h5`, for example, or even be just `div` s with their own classes.
These levels are denoted as a series of one to three sharp signs `#` at the start of the line, followed by a space. One `#` denotes the highest level, three is for the lowest one.
If two consecutive lines are titles with the same level, it makes a multi-line title (ie. with a line break inside).
Titles should be rendered with a bolder font, with size bigger than main text, and some vertical space before and after the line.

Example:
`## Second level title`
`### Third level title`
`### on two lines`


## Lists

Unordered lists are made with lines starting with a dash `-` or a plus `+` or a star `*`, followed by a space
Ordered lists are made with lines starting with numbers (one or several digits) followed by a dot and a space. The numbers are actually ignored, numbering is done automatically from 1.
No nesting is handled. A list stops with the first non-list line, so there can be two distinct consecutive lists if separated by an empty line.
The list sign of unordered lists isn't relevant, they render to the same kind of item.

~* Item
~* Other item
~* Last item
becomes:
* Item
* Other item
* Last item

~- Item
~- Other item
~- Last item
becomes:
- Item
- Other item
- Last item

~+ Item
~+ Other item
~+ Last item
becomes:
+ Item
+ Other item
+ Last item

~* Item
~+ Other item
~- Last item
becomes:
* Item
+ Other item
- Last item

~1. Item 1
~1. Item 2
~10. Item n
becomes:
1. Item 1
2. Item 2
3. Item n


## Code blocks

Like in GitHub, a series of three backticks `~`~`~`` on their own line renders all the following lines as code (in a `pre` block in HTML, with `code` style), until another line with `~`~`~`` is met.
Between these marks, all lines are rendered literally, no markup interpretation is attempted (not even tilde escape), empty lines are kept as is, no whitespace is skipped (indentation is preserved).
The code block signs *must* start on the first column.
An unterminated code block goes down to the end of the text.

Example:
```
		BlockType.Visitor<VisitorContext> blockStartVisitor = new HTMLBlockStartVisitor()
		{
			@Override
			public void visitDocument(VisitorContext context)
			{
				context.append("<div class='mark'>");
			}
		};
```


# Parsing rules / implementation details

Markup signs are:
- For fragments: ~ * _ - ` [ ] ( )
- For blocks (at start of line): # * - + digit (followed by dot) ` (followed by two others)

Outside URLs, tilde characters allow to remove a special meaning to markup signs, anywhere they are found (including code fragments, excluding code blocks). If not followed by such markup sign, tildes are literal.

The fragment (in-line) parser is autonomous, this allows to have an even simpler parser, eg. for writing short comments a la Stackoverflow.

A ParsingParameters class allows to customize a bit the parser, like which URL schemas are supported in auto-linking, what is the escape character, what is the length of shortened URLs when auto-linking, etc.

The HTML renderer can get a parameter to define the tab length (tabs are converted to the given number of spaces, 4 by default).


# How to use (User Manual)

Marked text is just plain text where line breaks are preserved, and some characters, in a given context, change the style of the rendered text.
There are two categories of markup signs: fragment markup and block markup.
Fragments are within a line, they change the style of a portion of the line.
Blocks extends over one or more consecutive lines. They change the style of the whole line(s).

Markup signs can be escaped with the tilde `~~` sign. A tilde preceding a non-markup sign is kept literally.

### Fragment markup

| This is ~*strong~* text | This is *strong* text |
| This is ~_emphasized~_ text | This is _emphasized_ text |
| This is ~-deleted~- text | This is -deleted- text |
| This is ~`code~` text | This is `code` text |
| No ~`~_markup~_ inside code~` | No `_markup_ inside code` |
| This is ~*strong and ~_emphasized~_~* text | This is *strong and _emphasized_* text |

Opening signs must have no letter or digit before them and no space after them
Closing signs must have no space before them and no letter or digit before them.
Empty fragments like ** are kept literally.

Examples of inactive markup signs:
No-nonsense rules
A formula: x*x-y
Same formula: x * x - y
A constant: ESCAPE_SIGN

URLs starting with http://, https://, ftp://, ftps:// or sftp:// are automatically converted to links. Long URLs are shortened with ellipsis.
Explicit URLs can be made like: ~[link text](valid_url). This allows to use other URL schemas and relative links.

| `http://stackoverflow.com` | http://stackoverflow.com |
| `http://stackoverflow.com/questions/659227/compare-and-contrast-the-lightweight-markup-languages` | http://stackoverflow.com/questions/659227/compare-and-contrast-the-lightweight-markup-languages |
| `[Link text](http://url.com/example/)` | [Link text](http://url.com/example/) |
| `[Link [4] text _with *markup*_](relative/url/example/?foo=(5)#anchor)` | [Link [4] text _with *markup*_](relative/url/example/?foo=(5)#anchor) |

### Block markup

These signs are always at the start of a line, regardless of indentation. They must always be followed by at least one space.
Code blocks are different: they must start on first column and must be alone on their line.

|||
~# Title level 1
|
# Title level 1
||

||
~## Title level 2
|
## Title level 2
||

||
~### Title level 3
|
### Title level 3
||

||
~- Unordered list item 1
~- Unordered list item 2
|
- Unordered list item 1
- Unordered list item 2
||

||
~+ Unordered list item 1
~+ Unordered list item 2
|
+ Unordered list item 1
+ Unordered list item 2
||

||
~* Unordered list item 1
~* Unordered list item 2
|
* Unordered list item 1
* Unordered list item 2
||

||
~1. Ordered list item 1
~1. Ordered list item 2
|
1. Ordered list item 1
2. Ordered list item 2
||

||
~```
Code block
Where ~*markup~* is ignored
    But _indentation_ is preserved
~```
|
```
Code block
Where *markup* is ignored
    But _indentation_ is preserved
```
|||

## The nitty-gritty details

As said, line breaks are preserved (in HTML, they render as `<br>` tags).
Two consecutive line breaks before a non-block line result in a paragraph (`<p>` tag in HTML).

Unterminated fragment markup are ignored, rendered literally (exception for code fragment, for technical reasons).
Nested fragments must terminate in reverse order of their opening, otherwise one sign will be deactivated:

| This is ~*~_strong emphasized*_ text. | This is *_strong emphasized*_ text. |

Square brackets in the link text of explicit links are accepted if they are balanced or escaped.
Parentheses in the URLs are accepted if they are balanced, otherwise escape them: `%28` for `(`, `%29` for `)`.