Update README

author: Ryo Nihei <nihei.dev@gmail.com> 2022-03-21 15:27:54 +0900
committer: Ryo Nihei <nihei.dev@gmail.com> 2022-03-21 16:50:16 +0900
commit: 1adcb17f8196a873f2a0bccb45640440ecb2f964 (patch)
tree: 1c905de8a7e97f2641b92b157fb9fd4f0c35aafc /README.md
parent: Use golangci-lint (diff)
download: tre-1adcb17f8196a873f2a0bccb45640440ecb2f964.tar.gz
tre-1adcb17f8196a873f2a0bccb45640440ecb2f964.tar.xz
1 files changed, 97 insertions, 71 deletions
diff --git a/README.md b/README.md
index 7289fa1..6388d91 100644
--- a/README.md
+++ b/README.md
@@ -44,7 +44,9 @@ First, define your lexical specification in JSON format. As an example, let's wr
 }
 ```
 
-Save the above specification to a file in UTF-8. In this explanation, the file name is `statement.json`.
+Save the above specification to a file. In this explanation, the file name is `statement.json`.
+
+⚠️ The input file must be encoded in UTF-8.
 
 ### 2. Compile the lexical specification
 
@@ -77,18 +79,18 @@ $ echo -n 'The truth is out there.' | maleeni lex statementc.json | jq -r '[.kin
 
 The JSON format of tokens that `maleeni lex` command prints is as follows:
 
-| Field        | Type              | Description                                                                                                                                           |
-|--------------|-------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|
-| mode_id      | integer           | An ID of a lex mode.                                                                                                                                  |
-| mode_name    | string            | A name of a lex mode.                                                                                                                                 |
-| kind_id      | integer           | An ID of a kind. This is unique among all modes.                                                                                                      |
-| mode_kind_id | integer           | An ID of a lexical kind. This is unique only within a mode. Note that you need to use `KindID` field if you want to identify a kind across all modes. |
-| kind_name    | string            | A name of a lexical kind.                                                                                                                             |
-| row          | integer           | A row number where a lexeme appears.                                                                                                                  |
-| col          | integer           | A column number where a lexeme appears. Note that `col` is counted in code points, not bytes.                                                         |
-| lexeme       | array of integers | A byte sequense of a lexeme.                                                                                                                          |
-| eof          | bool              | When this field is `true`, it means the token is the EOF token.                                                                                       |
-| invalid      | bool              | When this field is `true`, it means the token is an error token.                                                                                      |
+| Field        | Type              | Description                                                                                                                                            |
+|--------------|-------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|
+| mode_id      | integer           | An ID of a lex mode.                                                                                                                                   |
+| mode_name    | string            | A name of a lex mode.                                                                                                                                  |
+| kind_id      | integer           | An ID of a kind. This is unique among all modes.                                                                                                       |
+| mode_kind_id | integer           | An ID of a lexical kind. This is unique only within a mode. Note that you need to use `kind_id` field if you want to identify a kind across all modes. |
+| kind_name    | string            | A name of a lexical kind.                                                                                                                              |
+| row          | integer           | A row number where a lexeme appears.                                                                                                                   |
+| col          | integer           | A column number where a lexeme appears. Note that `col` is counted in code points, not bytes.                                                          |
+| lexeme       | array of integers | A byte sequence of a lexeme.                                                                                                                           |
+| eof          | bool              | When this field is `true`, it means the token is the EOF token.                                                                                        |
+| invalid      | bool              | When this field is `true`, it means the token is an error token.                                                                                       |
 
 ### 4. Generate the lexer
 
@@ -189,6 +191,7 @@ See [Identifier](#identifier) and [Regular Expression](#regular-expression) for
 
 * `id` must be a lower snake case. It can contain only `a` to `z`, `0` to `9`, and `_`.
 * The first and last characters must be one of `a` to `z`.
+* `_` cannot appear consecutively.
 
 ## Regular Expression
 
@@ -196,14 +199,16 @@ See [Identifier](#identifier) and [Regular Expression](#regular-expression) for
 
 ⚠️ In JSON, you need to write `\` as `\\`.
 
+⚠️ maleeni doesn't allow you to use some code points. See [Unavailable Code Points](#unavailable-code-points).
+
 ### Composites
 
 Concatenation and alternation allow you to combine multiple characters or multiple patterns into one pattern.
 
-| Example  | Description           |
-|----------|-----------------------|
-| abc      | matches just 'abc'    |
-| abc\|def | one of 'abc' or 'def' |
+| Pattern    | Matches        |
+|------------|----------------|
+| `abc`      | `abc`          |
+| `abc\|def` | `abc` or `def` |
 
 ### Single Characters
 
@@ -215,89 +220,98 @@ In addition to using ordinary characters, there are other ways to represent a si
 * character property expressions
 * escape sequences
 
+#### Dot Expression
+
 The dot expression matches any one character.
 
-| Example | Description       |
+| Pattern | Matches           |
 |---------|-------------------|
-| .       | any one character |
+| `.`     | any one character |
+
+#### Bracket Expressions
 
-The bracket expressions are represented by enclosing characters in `[ ]` or `[^ ]`. `[^ ]` is negation of `[ ]`. For instance, `[ab]` matches one of 'a' or 'b', and `[^ab]` matches any one character except 'a' and 'b'.
+The bracket expressions are represented by enclosing characters in `[ ]` or `[^ ]`. `[^ ]` is the negation of `[ ]`. For instance, `[ab]` matches one of `a` or `b`, and `[^ab]` matches any one character except `a` and `b`.
 
-| Example | Description                                      |
-|---------|--------------------------------------------------|
-| [abc]   | one of 'a', 'b', or 'c'                          |
-| [^abc]  | any one character except 'a', 'b', or 'c'        |
-| [a-z]   | one in the range of 'a' to 'z'                   |
-| [a-]    | 'a' or '-'                                       |
-| [-z]    | '-' or 'z'                                       |
-| [-]     | '-'                                              |
-| [^a-z]  | any one character except the range of 'a' to 'z' |
-| [a^]    | 'a' or '^'                                       |
+| Pattern  | Matches                                          |
+|----------|--------------------------------------------------|
+| `[abc]`  | `a`, `b`, or `c`                                 |
+| `[^abc]` | any one character except `a`, `b`, and `c`       |
+| `[a-z]`  | one in the range of `a` to `z`                   |
+| `[a-]`   | `a` or `-`                                       |
+| `[-z]`   | `-` or `z`                                       |
+| `[-]`    | `-`                                              |
+| `[^a-z]` | any one character except the range of `a` to `z` |
+| `[a^]`   | `a` or `^`                                       |
+
+#### Code Point Expressions
 
 The code point expressions match a character that has a specified code point. A code point consists of a four- or six-digit hex string.
 
-| Example    | Description               |
-|------------|---------------------------|
-| \u{000A}   | U+0A (LF)                 |
-| \u{3042}   | U+3042 (hiragana あ)      |
-| \u{01F63A} | U+1F63A (grinning cat 😺) |
+| Pattern      | Matches                     |
+|--------------|-----------------------------|
+| `\u{000A}`   | U+000A (LF)                 |
+| `\u{3042}`   | U+3042 (hiragana `あ`)      |
+| `\u{01F63A}` | U+1F63A (grinning cat `😺`) |
+
+#### Character Property Expressions
 
 The character property expressions match a character that has a specified Unicode character property. Currently, maleeni supports `General_Category`, `Script`, `Alphabetic`, `Lowercase`, `Uppercase`, and `White_Space`. When you omit the equals sign and the right-side value, maleeni interprets a symbol in `\p{...}` as the `General_Category` value.
 
-| Example                     | Description                                        |
-|-----------------------------|----------------------------------------------------|
-| \p{General_Category=Letter} | any one character whose General_Category is Letter |
-| \p{gc=Letter}               | the same as \p{General_Category=Letter}            |
-| \p{Letter}                  | the same as \p{General_Category=Letter}            |
-| \p{l}                       | the same as \p{General_Category=Letter}            |
-| \p{Script=Latin}            | any one character whose Script is Latin            |
-| \p{Alphabetic}              | any one character whose Alphabetic is yes          |
-| \p{Lowercase=yes}           | any one character whose Lowercase is yes           |
-| \p{Uppercase=yes}           | any one character whose Uppercase is yes           |
-| \p{White_Space=yes}         | any one character whose White_Space is yes         |
-| \p{wspace=yes}              | the same as \p{White_Space=yes}                    |
+| Pattern                       | Matches                                                |
+|-------------------------------|--------------------------------------------------------|
+| `\p{General_Category=Letter}` | any one character whose `General_Category` is `Letter` |
+| `\p{gc=Letter}`               | the same as `\p{General_Category=Letter}`              |
+| `\p{Letter}`                  | the same as `\p{General_Category=Letter}`              |
+| `\p{l}`                       | the same as `\p{General_Category=Letter}`              |
+| `\p{Script=Latin}`            | any one character whose `Script` is `Latin`            |
+| `\p{Alphabetic=yes}`          | any one character whose `Alphabetic` is `yes`          |
+| `\p{Lowercase=yes}`           | any one character whose `Lowercase` is `yes`           |
+| `\p{Uppercase=yes}`           | any one character whose `Uppercase` is `yes`           |
+| `\p{White_Space=yes}`         | any one character whose `White_Space` is `yes`         |
+
+#### Escape Sequences
 
 As you escape the special character with `\`, you can write a rule that matches the special character itself.
 The following escape sequences are available outside of bracket expressions.
 
-| Example | Description |
-|---------|-------------|
-| \\.     | '.'         |
-| \\?     | '?'         |
-| \\*     | '*'         |
-| \\+     | '+'         |
-| \\(     | '('         |
-| \\)     | ')'         |
-| \\[     | '['         |
-| \\\|    | '\|'        |
-| \\\\    | '\\'        |
+| Pattern | Matches |
+|---------|---------|
+| `\\.`   | `.`     |
+| `\\?`   | `?`     |
+| `\\*`   | `*`     |
+| `\\+`   | `+`     |
+| `\\(`   | `(`     |
+| `\\)`   | `)`     |
+| `\\[`   | `[`     |
+| `\\\|`  | `\|`    |
+| `\\\\`  | `\\`    |
 
 The following escape sequences are available inside bracket expressions.
 
-| Example | Description |
-|---------|-------------|
-| \\^     | '^'         |
-| \\-     | '-'         |
-| \\]     | ']'         |
+| Pattern | Matches |
+|---------|---------|
+| `\\^`   | `^`     |
+| `\\-`   | `-`     |
+| `\\]`   | `]`     |
 
 ### Repetitions
 
 The repetitions match a string that repeats the previous single character or group.
 
-| Example | Description      |
+| Pattern | Matches          |
 |---------|------------------|
-| a*      | zero or more 'a' |
-| a+      | one or more 'a'  |
-| a?      | zero or one 'a'  |
+| `a*`    | zero or more `a` |
+| `a+`    | one or more `a`  |
+| `a?`    | zero or one `a`  |
 
 ### Grouping
 
 `(` and `)` group any patterns.
 
-| Example   | Description                                            |
-|-----------|--------------------------------------------------------|
-| a(bc)*d   | matches 'ad', 'abcd', 'abcbcd', and so on              |
-| (ab\|cd)+ | matches 'ab', 'cd', 'abcd', 'cdab', abcdab', and so on |
+| Pattern     | Matches                                         |
+|-------------|-------------------------------------------------|
+| `a(bc)*d`   | `ad`, `abcd`, `abcbcd`, and so on               |
+| `(ab\|cd)+` | `ab`, `cd`, `abcd`, `cdab`, `abcdab`, and so on |
 
 ### Fragment
 
@@ -334,6 +348,14 @@ For instance, you can define [an identifier of golang](https://golang.org/ref/sp
 }
 ```
 
+### Unavailable Code Points
+
+Lexical specifications and source files to be analyzed cannot contain the following code points.
+
+When you write a pattern that implicitly contains the unavailable code points, maleeni automatically generates a pattern that doesn't contain them and uses it in place of the original pattern. However, when you explicitly use an unavailable code point (like `\u{D800}` or `\p{General_Category=Cs}`), maleeni reports an error.
+
+* surrogate code points: U+D800..U+DFFF
+
 ## Lex Mode
 
 Lex Mode is a feature that allows you to separate a DFA transition table for each mode.
@@ -399,3 +421,7 @@ $ echo -n '"foo\nbar"foo' | maleeni lex stringc.json | jq -r '[.mode_name, .kind
 ```
 
 The input string enclosed in the `"` marks (`foo\nbar`) is interpreted as the `char_seq` and the `escaped_char`, while the outer string (`foo`) is interpreted as the `identifier`. The same string `foo` is interpreted as different types because of the different modes in which it is interpreted.
+
+## Unicode Version
+
+maleeni references [Unicode 13.0.0](https://unicode.org/versions/Unicode13.0.0/).
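
The identifier rules touched by this commit (lower snake case, first and last characters in `a` to `z`, no consecutive `_`) can be restated as a small check. The following is only a sketch in Go: the function name `isValidID` is hypothetical and is not taken from maleeni's source, so treat it as an illustration of the documented rules rather than the tool's actual validation.

```go
package main

import "fmt"

// isValidID reports whether id follows the identifier rules described in the
// README: only `a`-`z`, `0`-`9`, and `_`; first and last characters in
// `a`-`z`; no consecutive `_`.
// NOTE: hypothetical helper for illustration, not maleeni's implementation.
func isValidID(id string) bool {
	if id == "" {
		return false
	}
	isLower := func(c byte) bool { return c >= 'a' && c <= 'z' }
	isDigit := func(c byte) bool { return c >= '0' && c <= '9' }

	// The first and last characters must be lowercase letters.
	if !isLower(id[0]) || !isLower(id[len(id)-1]) {
		return false
	}
	for i := 0; i < len(id); i++ {
		c := id[i]
		switch {
		case isLower(c) || isDigit(c):
			// Allowed anywhere between the first and last characters.
		case c == '_':
			// Reject two underscores in a row.
			if i > 0 && id[i-1] == '_' {
				return false
			}
		default:
			return false
		}
	}
	return true
}

func main() {
	for _, id := range []string{"word", "char_seq", "a__b", "_a", "a1"} {
		fmt.Printf("%q: %v\n", id, isValidID(id))
	}
}
```

Under these rules `char_seq` is accepted, while `a__b` (consecutive `_`), `_a` (leading `_`), and `a1` (trailing digit) are rejected.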