author | Ryo Nihei <nihei.dev@gmail.com> | 2021-05-09 17:21:35 +0900 |
---|---|---|
committer | Ryo Nihei <nihei.dev@gmail.com> | 2021-05-10 20:46:28 +0900 |
commit | b5c574778533c50459c48cbd81478874c3d64dfb (patch) | |
tree | 1119e065e1f3b0d131245f4fbee65cea41cc1011 /README.md | |
parent | Change package structure (diff) | |
Update README and godoc
Diffstat (limited to 'README.md')
-rw-r--r-- | README.md | 205 |
1 files changed, 204 insertions, 1 deletions
@@ -1,2 +1,205 @@
 # maleeni
-A lexer generator
+
+maleeni provides a compiler that generates a portable DFA for lexical analysis and a driver for Go.
+
+## Installation
+
+```sh
+$ go install ./cmd/maleeni
+```
+
+## Usage
+
+First, define your lexical specification in JSON format. As an example, let's write the definitions of whitespace, words, and punctuation.
+
+```json
+{
+    "entries": [
+        {
+            "kind": "whitespace",
+            "pattern": "[\\u{0009}\\u{000A}\\u{000D}\\u{0020}]+"
+        },
+        {
+            "kind": "word",
+            "pattern": "[0-9A-Za-z]+"
+        },
+        {
+            "kind": "punctuation",
+            "pattern": "[.,:;]"
+        }
+    ]
+}
+```
+
+Save the above specification to a file. In this explanation, the file name is lexspec.json.
+
+Next, generate a DFA from the lexical specification using the `maleeni compile` command.
+
+```sh
+$ maleeni compile -l lexspec.json -o clexspec.json
+```
+
+If you want to make sure that the lexical specification behaves as expected, you can use the `maleeni lex` command to try lexical analysis without having to implement a driver.
+The `maleeni lex` command outputs tokens in JSON format. For simplicity, print the significant fields of the tokens in CSV format using the jq command.
+
+```sh
+$ echo -n 'The truth is out there.' | maleeni lex clexspec.json | jq -r '[.kind, .text, .eof] | @csv'
+"word","The",false
+"whitespace"," ",false
+"word","truth",false
+"whitespace"," ",false
+"word","is",false
+"whitespace"," ",false
+"word","out",false
+"whitespace"," ",false
+"word","there",false
+"punctuation",".",false
+"","",true
+```
+
+When using the driver, import the `github.com/nihei9/maleeni/driver` and `github.com/nihei9/maleeni/spec` packages.
+You can use the driver in the following way:
+
+```go
+// Read your compiled lexical specification file.
+f, err := os.Open(path)
+if err != nil {
+	// error handling
+}
+data, err := ioutil.ReadAll(f)
+if err != nil {
+	// error handling
+}
+clexspec := &spec.CompiledLexSpec{}
+err = json.Unmarshal(data, clexspec)
+if err != nil {
+	// error handling
+}
+
+// Generate a lexer.
+lex, err := driver.NewLexer(clexspec, src)
+if err != nil {
+	// error handling
+}
+
+// Perform lexical analysis.
+for {
+	tok, err := lex.Next()
+	if err != nil {
+		// error handling
+	}
+	if tok.Invalid {
+		// An error token appeared.
+		// error handling
+	}
+	if tok.EOF {
+		// The EOF token appeared.
+		break
+	}
+
+	// Do something using `tok`.
+}
+```
+
+## Lexical Specification Format
+
+The lexical specification format to be passed to the `maleeni compile` command is as follows:
+
+top level object:
+
+| Field   | Type                   | Nullable | Description                                                                                                               |
+|---------|------------------------|----------|---------------------------------------------------------------------------------------------------------------------------|
+| entries | array of entry objects | false    | An array of entries sorted by priority. The first element has the highest priority, and the last has the lowest priority. |
+
+entry object:
+
+| Field   | Type             | Nullable | Description                                                                                       |
+|---------|------------------|----------|---------------------------------------------------------------------------------------------------|
+| kind    | string           | false    | A name of a token kind                                                                            |
+| pattern | string           | false    | A pattern in a regular expression                                                                 |
+| modes   | array of strings | true     | Mode names that an entry is enabled in (default: "default")                                       |
+| push    | string           | true     | A mode name that the lexer pushes to its own mode stack when a token matching the pattern appears |
+| pop     | bool             | true     | When `pop` is true, the lexer pops a mode from its own mode stack.                                |
+
+See [Regular Expression Syntax](#regular-expression-syntax) for more details on the regular expression syntax.
+
+## Regular Expression Syntax
+
+### Composites
+
+Concatenation and alternation allow you to combine multiple characters or multiple patterns into one pattern.
+
+| Example  | Description           |
+|----------|-----------------------|
+| abc      | matches just 'abc'    |
+| abc\|def | one of 'abc' or 'def' |
+
+### Single Characters
+
+In addition to using ordinary characters, there are other ways to represent a single character:
+
+* dot expression
+* bracket expressions
+* code point expressions
+* character property expressions
+
+The dot expression matches any one character.
+
+| Example | Description       |
+|---------|-------------------|
+| .       | any one character |
+
+The bracket expressions are represented by enclosing characters in `[ ]` or `[^ ]`. `[^ ]` is the negation of `[ ]`. For instance, `[ab]` matches one of 'a' or 'b', and `[^ab]` matches any one character except 'a' and 'b'.
+
+| Example | Description                                      |
+|---------|--------------------------------------------------|
+| [abc]   | one of 'a', 'b', or 'c'                          |
+| [^abc]  | any one character except 'a', 'b', or 'c'        |
+| [a-z]   | one in the range of 'a' to 'z'                   |
+| [a-]    | 'a' or '-'                                       |
+| [-z]    | '-' or 'z'                                       |
+| [-]     | '-'                                              |
+| [^a-z]  | any one character except the range of 'a' to 'z' |
+| [a^]    | 'a' or '^'                                       |
+
+The code point expressions match a character that has a specified code point. A code point is written as a hex string of four or six digits.
+
+| Example    | Description               |
+|------------|---------------------------|
+| \u{000A}   | U+000A (LF)               |
+| \u{3042}   | U+3042 (hiragana あ)      |
+| \u{01F63A} | U+1F63A (grinning cat 😺) |
+
+The character property expressions match a character that has a specified Unicode character property. Currently, maleeni supports only General_Category.
+
+| Example                     | Description                                        |
+|-----------------------------|----------------------------------------------------|
+| \p{General_Category=Letter} | any one character whose General_Category is Letter |
+| \p{gc=Letter}               | the same as \p{General_Category=Letter}            |
+| \p{Letter}                  | the same as \p{General_Category=Letter}            |
+| \p{l}                       | the same as \p{General_Category=Letter}            |
+
+### Repetitions
+
+The repetitions match a string that repeats the previous single character or group.
+
+| Example | Description      |
+|---------|------------------|
+| a*      | zero or more 'a' |
+| a+      | one or more 'a'  |
+| a?      | zero or one 'a'  |
+
+### Grouping
+
+`(` and `)` group any patterns.
+
+| Example   | Description                                             |
+|-----------|---------------------------------------------------------|
+| a(bc)*d   | matches 'ad', 'abcd', 'abcbcd', and so on               |
+| (ab\|cd)+ | matches 'ab', 'cd', 'abcd', 'cdab', 'abcdab', and so on |
+
+## Lex Mode
+
+Lex Mode is a feature that allows you to separate a DFA transition table for each mode.
+
+The `modes` field of an entry in a lexical specification indicates in which modes the entry is enabled. If the `modes` field is empty, the entry is enabled only in the default mode. The compiler groups the entries and generates a DFA for each mode. Thus the driver can switch the transition table by switching modes. The mode switching follows the `push` or `pop` field of each entry.