author Ryo Nihei <nihei.dev@gmail.com> 2021-05-09 17:21:35 +0900
committer Ryo Nihei <nihei.dev@gmail.com> 2021-05-10 20:46:28 +0900
commit b5c574778533c50459c48cbd81478874c3d64dfb
tree 1119e065e1f3b0d131245f4fbee65cea41cc1011 /README.md
parent Change package structure
Update README and godoc
Diffstat (limited to 'README.md')
-rw-r--r-- README.md 205
1 files changed, 204 insertions, 1 deletions
diff --git a/README.md b/README.md
index 62bab50..c142bab 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,205 @@
# maleeni
-A lexer generator
+
+maleeni provides a compiler that generates a portable DFA for lexical analysis and a driver for Go.
+
+## Installation
+
+```sh
+$ go install ./cmd/maleeni
+```
+
+## Usage
+
+First, define your lexical specification in JSON format. As an example, let's write the definitions of whitespace, words, and punctuation.
+
+```json
+{
+ "entries": [
+ {
+ "kind": "whitespace",
+ "pattern": "[\\u{0009}\\u{000A}\\u{000D}\\u{0020}]+"
+ },
+ {
+ "kind": "word",
+ "pattern": "[0-9A-Za-z]+"
+ },
+ {
+ "kind": "punctuation",
+ "pattern": "[.,:;]"
+ }
+ ]
+}
+```
+
+Save the above specification to a file. The rest of this section assumes the file is named `lexspec.json`.
+
+Next, generate a DFA from the lexical specification using the `maleeni compile` command.
+
+```sh
+$ maleeni compile -l lexspec.json -o clexspec.json
+```
+
+To check that the lexical specification behaves as expected, you can use the `maleeni lex` command to try lexical analysis without implementing a driver.
+The `maleeni lex` command outputs tokens in JSON format. For simplicity, the following example prints only the significant fields of each token in CSV format using the `jq` command.
+
+```sh
+$ echo -n 'The truth is out there.' | maleeni lex clexspec.json | jq -r '[.kind, .text, .eof] | @csv'
+"word","The",false
+"whitespace"," ",false
+"word","truth",false
+"whitespace"," ",false
+"word","is",false
+"whitespace"," ",false
+"word","out",false
+"whitespace"," ",false
+"word","there",false
+"punctuation",".",false
+"","",true
+```
+
+To use the driver, import the `github.com/nihei9/maleeni/driver` and `github.com/nihei9/maleeni/spec` packages.
+You can then use the driver as follows:
+
+```go
+// Read the compiled lexical specification (the output of `maleeni compile`).
+f, err := os.Open(path)
+if err != nil {
+ // error handling
+}
+defer f.Close()
+data, err := ioutil.ReadAll(f)
+if err != nil {
+ // error handling
+}
+clexspec := &spec.CompiledLexSpec{}
+err = json.Unmarshal(data, clexspec)
+if err != nil {
+ // error handling
+}
+
+// Generate a lexer.
+lex, err := driver.NewLexer(clexspec, src)
+if err != nil {
+ // error handling
+}
+
+// Perform lexical analysis.
+for {
+ tok, err := lex.Next()
+ if err != nil {
+ // error handling
+ }
+ if tok.Invalid {
+ // An error token appeared.
+ // error handling
+ }
+ if tok.EOF {
+ // The EOF token appeared.
+ break
+ }
+
+ // Do something using `tok`.
+}
+```
+
+## Lexical Specification Format
+
+The lexical specification passed to the `maleeni compile` command has the following format:
+
+top level object:
+
+| Field | Type | Nullable | Description |
+|---------|------------------------|----------|-----------------------------------------------------------------------------------------------------------------------|
+| entries | array of entry objects | false | An array of entries sorted by priority. The first element has the highest priority, and the last has the lowest priority. |
+
+entry object:
+
+| Field | Type | Nullable | Description |
+|---------|------------------|----------|-----------------------------------------------------------------------------------------------|
+| kind | string | false | The name of a token kind |
+| pattern | string | false | A pattern in a regular expression |
+| modes | array of strings | true | Names of the modes in which the entry is enabled (default: "default") |
+| push | string | true | The name of a mode that the lexer pushes onto its mode stack when a token matching the pattern appears |
+| pop | bool | true | When `pop` is true, the lexer pops a mode from its mode stack when a token matching the pattern appears. |
+
+See [Regular Expression Syntax](#regular-expression-syntax) for more details on the regular expression syntax.
+
+## Regular Expression Syntax
+
+### Composites
+
+Concatenation and alternation allow you to combine multiple characters or multiple patterns into one pattern.
+
+| Example | Description |
+|----------|-----------------------|
+| abc | matches just 'abc' |
+| abc\|def | one of 'abc' or 'def' |
+
+### Single Characters
+
+In addition to using ordinary characters, there are other ways to represent a single character:
+
+* dot expression
+* bracket expressions
+* code point expressions
+* character property expressions
+
+The dot expression matches any one character.
+
+| Example | Description |
+|---------|-------------------|
+| . | any one character |
+
+Bracket expressions are written by enclosing characters in `[ ]` or `[^ ]`. `[^ ]` is the negation of `[ ]`. For instance, `[ab]` matches one of 'a' or 'b', and `[^ab]` matches any one character except 'a' and 'b'.
+
+| Example | Description |
+|---------|--------------------------------------------------|
+| [abc] | one of 'a', 'b', or 'c' |
+| [^abc] | any one character except 'a', 'b', or 'c' |
+| [a-z] | one in the range of 'a' to 'z' |
+| [a-] | 'a' or '-' |
+| [-z] | '-' or 'z' |
+| [-] | '-' |
+| [^a-z] | any one character except the range of 'a' to 'z' |
+| [a^] | 'a' or '^' |
+
+Code point expressions match the character that has the specified code point. A code point is written as a four- or six-digit hex string.
+
+| Example | Description |
+|------------|---------------------------|
+| \u{000A} | U+000A (LF) |
+| \u{3042} | U+3042 (hiragana あ) |
+| \u{01F63A} | U+1F63A (grinning cat 😺) |
+
+Character property expressions match a character that has the specified Unicode character property. Currently, maleeni supports only General_Category.
+
+| Example | Description |
+|-----------------------------|----------------------------------------------------|
+| \p{General_Category=Letter} | any one character whose General_Category is Letter |
+| \p{gc=Letter} | the same as \p{General_Category=Letter} |
+| \p{Letter} | the same as \p{General_Category=Letter} |
+| \p{l} | the same as \p{General_Category=Letter} |
+
+### Repetitions
+
+Repetition operators match a string in which the preceding single character or group is repeated.
+
+| Example | Description |
+|---------|------------------|
+| a* | zero or more 'a' |
+| a+ | one or more 'a' |
+| a? | zero or one 'a' |
+
+### Grouping
+
+`(` and `)` group a pattern so that operators such as repetitions apply to it as a whole.
+
+| Example | Description |
+|-----------|--------------------------------------------------------|
+| a(bc)*d | matches 'ad', 'abcd', 'abcbcd', and so on |
+| (ab\|cd)+ | matches 'ab', 'cd', 'abcd', 'cdab', 'abcdab', and so on |
+
+## Lex Mode
+
+Lex Mode is a feature that lets you use a separate DFA transition table for each mode.
+
+The `modes` field of an entry in a lexical specification indicates the modes in which the entry is enabled. If the `modes` field is empty, the entry is enabled only in the default mode. The compiler groups the entries by mode and generates a DFA for each mode, so the driver can switch transition tables by switching modes. Mode switching follows the `push` and `pop` fields of each entry.
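
The mode switching described in the Lex Mode section can be pictured as a small stack of mode names. The following is a minimal sketch in Go, not maleeni's actual driver code: the `entry` and `modeStack` types, the `string_open`, `char_seq`, and `string_close` kinds, and the `string` mode are all hypothetical. It also assumes that a pop is applied before a push and that the bottom `default` mode is never popped; the README does not specify either behavior.

```go
package main

import "fmt"

// entry mirrors only the mode-related fields of a lexical specification
// entry. Hypothetical type for illustration; it is not maleeni's own.
type entry struct {
	Kind string
	Push string // mode to push when the entry matches ("" means none)
	Pop  bool   // whether to pop the current mode when the entry matches
}

// modeStack is a hypothetical stand-in for the lexer's mode stack.
type modeStack struct {
	modes []string
}

// newModeStack returns a stack that starts in the "default" mode.
func newModeStack() *modeStack {
	return &modeStack{modes: []string{"default"}}
}

// Current returns the mode on top of the stack, i.e. the mode whose
// transition table would be used for the next token.
func (s *modeStack) Current() string {
	return s.modes[len(s.modes)-1]
}

// Apply updates the stack after a token matching e has been recognized.
// Assumptions: pop runs before push, and the bottom mode is never popped.
func (s *modeStack) Apply(e entry) {
	if e.Pop && len(s.modes) > 1 {
		s.modes = s.modes[:len(s.modes)-1]
	}
	if e.Push != "" {
		s.modes = append(s.modes, e.Push)
	}
}

func main() {
	s := newModeStack()
	// A hypothetical sequence of matched entries for a string literal.
	matched := []entry{
		{Kind: "string_open", Push: "string"},
		{Kind: "char_seq"},
		{Kind: "string_close", Pop: true},
	}
	for _, e := range matched {
		s.Apply(e)
		fmt.Printf("%s -> %s\n", e.Kind, s.Current())
	}
}
```

In this sketch the lexer enters the `string` mode when the opening entry matches, stays there while ordinary entries match, and returns to `default` when the closing entry pops the mode.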