Update README and godoc

author: Ryo Nihei <nihei.dev@gmail.com> 2021-05-09 17:21:35 +0900
committer: Ryo Nihei <nihei.dev@gmail.com> 2021-05-10 20:46:28 +0900
commit: b5c574778533c50459c48cbd81478874c3d64dfb (patch)
tree: 1119e065e1f3b0d131245f4fbee65cea41cc1011 /README.md
parent: Change package structure (diff)
download: tre-b5c574778533c50459c48cbd81478874c3d64dfb.tar.gz
tre-b5c574778533c50459c48cbd81478874c3d64dfb.tar.xz
1 files changed, 204 insertions, 1 deletions
diff --git a/README.md b/README.md
index 62bab50..c142bab 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,205 @@
 # maleeni
-A lexer generator
+
+maleeni provides a compiler that generates a portable DFA for lexical analysis and a driver for Go.
+
+## Installation
+
+```sh
+$ go install ./cmd/maleeni
+```
+
+## Usage
+
+First, define your lexical specification in JSON format. As an example, let's write the definitions of whitespace, words, and punctuation.
+
+```json
+{
+    "entries": [
+        {
+            "kind": "whitespace",
+            "pattern": "[\\u{0009}\\u{000A}\\u{000D}\\u{0020}]+"
+        },
+        {
+            "kind": "word",
+            "pattern": "[0-9A-Za-z]+"
+        },
+        {
+            "kind": "punctuation",
+            "pattern": "[.,:;]"
+        }
+    ]
+}
+```
+
+Save the above specification to a file. This guide uses the file name `lexspec.json`.
+
+Next, generate a DFA from the lexical specification using the `maleeni compile` command.
+
+```sh
+$ maleeni compile -l lexspec.json -o clexspec.json
+```
+
+If you want to make sure that the lexical specification behaves as expected, you can use the `maleeni lex` command to try lexical analysis without having to implement a driver.
+The `maleeni lex` command outputs tokens in JSON format. For simplicity, the following example prints only the significant fields of each token in CSV format using the `jq` command.
+
+```sh
+$ echo -n 'The truth is out there.' | maleeni lex clexspec.json | jq -r '[.kind, .text, .eof] | @csv'
+"word","The",false
+"whitespace"," ",false
+"word","truth",false
+"whitespace"," ",false
+"word","is",false
+"whitespace"," ",false
+"word","out",false
+"whitespace"," ",false
+"word","there",false
+"punctuation",".",false
+"","",true
+```
+
+To use the driver, import the `github.com/nihei9/maleeni/driver` and `github.com/nihei9/maleeni/spec` packages.
+You can use the driver as follows:
+
+```go
+// Read your lexical specification file.
+f, err := os.Open(path)
+if err != nil {
+    // error handling
+}
+data, err := ioutil.ReadAll(f)
+if err != nil {
+    // error handling
+}
+clexspec := &spec.CompiledLexSpec{}
+err = json.Unmarshal(data, clexspec)
+if err != nil {
+    // error handling
+}
+
+// Generate a lexer.
+lex, err := driver.NewLexer(clexspec, src)
+if err != nil {
+    // error handling
+}
+
+// Perform lexical analysis.
+for {
+    tok, err := lex.Next()
+    if err != nil {
+        // error handling
+    }
+    if tok.Invalid {
+        // An error token appeared.
+        // error handling
+    }
+    if tok.EOF {
+        // The EOF token appeared.
+        break
+    }
+
+    // Do something using `tok`.
+}
+```
+
+## Lexical Specification Format
+
+The lexical specification format accepted by the `maleeni compile` command is as follows:
+
+top level object:
+
+| Field   | Type                   | Nullable | Description                                                                                                           |
+|---------|------------------------|----------|-----------------------------------------------------------------------------------------------------------------------|
+| entries | array of entry objects | false    | An array of entries sorted by priority. The first element has the highest priority, and the last has the lowest priority. |
+
+entry object:
+
+| Field   | Type             | Nullable | Description                                                                                   |
+|---------|------------------|----------|-----------------------------------------------------------------------------------------------|
+| kind    | string           | false    | A name of a token kind                                                                        |
+| pattern | string           | false    | A pattern in a regular expression                                                             |
+| modes   | array of strings | true     | Mode names that an entry is enabled in (default: "default")                                   |
+| push    | string           | true     | The mode that the lexer pushes onto its mode stack when a token matching the pattern appears  |
+| pop     | bool             | true     | When `pop` is true, the lexer pops a mode from its mode stack.                                |
+
+See [Regular Expression Syntax](#regular-expression-syntax) for more details on the regular expression syntax.
+
+## Regular Expression Syntax
+
+### Composites
+
+Concatenation and alternation allow you to combine multiple characters or multiple patterns into one pattern.
+
+| Example  | Description           |
+|----------|-----------------------|
+| abc      | matches just 'abc'    |
+| abc\|def | one of 'abc' or 'def' |
+
+### Single Characters
+
+In addition to using ordinary characters, there are other ways to represent a single character:
+
+* dot expression
+* bracket expressions
+* code point expressions
+* character property expressions
+
+The dot expression matches any one character.
+
+| Example | Description       |
+|---------|-------------------|
+| .       | any one character |
+
+Bracket expressions are written by enclosing characters in `[ ]` or `[^ ]`. `[^ ]` is the negation of `[ ]`. For instance, `[ab]` matches one of 'a' or 'b', and `[^ab]` matches any one character except 'a' and 'b'.
+
+| Example | Description                                      |
+|---------|--------------------------------------------------|
+| [abc]   | one of 'a', 'b', or 'c'                          |
+| [^abc]  | any one character except 'a', 'b', or 'c'        |
+| [a-z]   | one in the range of 'a' to 'z'                   |
+| [a-]    | 'a' or '-'                                       |
+| [-z]    | '-' or 'z'                                       |
+| [-]     | '-'                                              |
+| [^a-z]  | any one character except the range of 'a' to 'z' |
+| [a^]    | 'a' or '^'                                       |
+
+Code point expressions match a character that has the specified code point. A code point is written as a four- or six-digit hexadecimal string.
+
+| Example    | Description               |
+|------------|---------------------------|
+| \u{000A}   | U+000A (LF)               |
+| \u{3042}   | U+3042 (hiragana あ)      |
+| \u{01F63A} | U+1F63A (grinning cat 😺) |
+
+Character property expressions match a character that has the specified Unicode character property. Currently, maleeni supports only General_Category.
+
+| Example                     | Description                                        |
+|-----------------------------|----------------------------------------------------|
+| \p{General_Category=Letter} | any one character whose General_Category is Letter |
+| \p{gc=Letter}               | the same as \p{General_Category=Letter}            |
+| \p{Letter}                  | the same as \p{General_Category=Letter}            |
+| \p{l}                       | the same as \p{General_Category=Letter}            |
+
+### Repetitions
+
+Repetitions match a string that repeats the preceding single character or group.
+
+| Example | Description      |
+|---------|------------------|
+| a*      | zero or more 'a' |
+| a+      | one or more 'a'  |
+| a?      | zero or one 'a'  |
+
+### Grouping
+
+`(` and `)` group any pattern.
+
+| Example   | Description                                            |
+|-----------|--------------------------------------------------------|
+| a(bc)*d   | matches 'ad', 'abcd', 'abcbcd', and so on              |
+| (ab\|cd)+ | matches 'ab', 'cd', 'abcd', 'cdab', 'abcdab', and so on |
+
+## Lex Mode
+
+Lex Mode is a feature that lets you use a separate DFA transition table for each mode.
+
+The `modes` field of an entry in a lexical specification indicates the modes in which the entry is enabled. If the `modes` field is empty, the entry is enabled only in the default mode. The compiler groups the entries by mode and generates a DFA for each mode, so the driver can switch the transition table by switching modes. Mode switching follows the `push` and `pop` fields of each entry.
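
The mode switching described in the Lex Mode section can be pictured as a small stack of mode names. The Go sketch below is a hypothetical stand-in, not maleeni's driver code: the `entry` and `modeStack` types, the choice to apply `pop` before `push`, and the refusal to pop the last remaining mode are all assumptions made for illustration. Only the `push` and `pop` fields and the "default" mode name come from the README.

```go
package main

import "fmt"

// entry mirrors the mode-related fields of an entry object in the
// lexical specification. The struct itself is hypothetical.
type entry struct {
	kind string
	push string // mode name to push when a token of this kind appears ("" means none)
	pop  bool   // whether to pop the current mode when a token of this kind appears
}

// modeStack is an illustrative stand-in for the driver's mode stack.
type modeStack struct {
	modes []string
}

// newModeStack starts in the "default" mode.
func newModeStack() *modeStack {
	return &modeStack{modes: []string{"default"}}
}

// current returns the mode whose transition table would be in use.
func (s *modeStack) current() string {
	return s.modes[len(s.modes)-1]
}

// apply updates the stack for a matched entry.
// Assumption: pop is handled before push, and the bottom mode is never popped.
func (s *modeStack) apply(e entry) {
	if e.pop && len(s.modes) > 1 {
		s.modes = s.modes[:len(s.modes)-1]
	}
	if e.push != "" {
		s.modes = append(s.modes, e.push)
	}
}

func main() {
	s := newModeStack()
	// Hypothetical token kinds for a string literal that has its own mode.
	tokens := []entry{
		{kind: "string_open", push: "string"},
		{kind: "char_seq"},
		{kind: "string_close", pop: true},
	}
	for _, t := range tokens {
		s.apply(t)
		fmt.Printf("%s -> %s\n", t.kind, s.current())
	}
}
```

In this sketch, the lexer would consult the "string" transition table after `string_open` and return to the "default" table after `string_close`.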