# maleeni

maleeni is a lexer generator for golang. maleeni also provides a command to perform lexical analysis to allow easy debugging of your lexical specification.

[![ci](https://github.com/nihei9/maleeni/actions/workflows/ci.yaml/badge.svg)](https://github.com/nihei9/maleeni/actions/workflows/ci.yaml)

## Installation

Compiler:

```sh
$ go install github.com/nihei9/maleeni/cmd/maleeni@latest
```

Code Generator:

```sh
$ go install github.com/nihei9/maleeni/cmd/maleeni-go@latest
```

## Usage

### 1. Define your lexical specification

First, define your lexical specification in JSON format. As an example, let's write the definitions of whitespace, words, and punctuation.

```json
{
    "name": "statement",
    "entries": [
        {
            "kind": "whitespace",
            "pattern": "[\\u{0009}\\u{000A}\\u{000D}\\u{0020}]+"
        },
        {
            "kind": "word",
            "pattern": "[0-9A-Za-z]+"
        },
        {
            "kind": "punctuation",
            "pattern": "[.,:;]"
        }
    ]
}
```

Save the above specification to a file. In this explanation, the file name is `statement.json`.

⚠️ The input file must be encoded in UTF-8.

### 2. Compile the lexical specification

Next, generate a DFA from the lexical specification using the `maleeni compile` command.

```sh
$ maleeni compile statement.json -o statementc.json
```

### 3. Debug (Optional)

If you want to make sure that the lexical specification behaves as expected, you can use the `maleeni lex` command to try lexical analysis without having to generate a lexer. The `maleeni lex` command outputs tokens in JSON format. For simplicity, the example below prints only the significant fields of the tokens in CSV format using the `jq` command.

⚠️ `maleeni lex` and the driver can handle only UTF-8.

```sh
$ echo -n 'The truth is out there.' | maleeni lex statementc.json | jq -r '[.kind_name, .lexeme, .eof] | @csv'
"word","The",false
"whitespace"," ",false
"word","truth",false
"whitespace"," ",false
"word","is",false
"whitespace"," ",false
"word","out",false
"whitespace"," ",false
"word","there",false
"punctuation",".",false
"","",true
```

The JSON format of the tokens that the `maleeni lex` command prints is as follows:

| Field        | Type              | Description |
|--------------|-------------------|-------------|
| mode_id      | integer           | An ID of a lex mode. |
| mode_name    | string            | A name of a lex mode. |
| kind_id      | integer           | An ID of a kind. This is unique among all modes. |
| mode_kind_id | integer           | An ID of a lexical kind. This is unique only within a mode. Note that you need to use the `kind_id` field if you want to identify a kind across all modes. |
| kind_name    | string            | A name of a lexical kind. |
| row          | integer           | A row number where a lexeme appears. |
| col          | integer           | A column number where a lexeme appears. Note that `col` is counted in code points, not bytes. |
| lexeme       | array of integers | A byte sequence of a lexeme. |
| eof          | bool              | When this field is `true`, it means the token is the EOF token. |
| invalid      | bool              | When this field is `true`, it means the token is an error token. |

### 4. Generate the lexer

Using the `maleeni-go` command, you can generate the source code of a lexer that recognizes your lexical specification.

```sh
$ maleeni-go statementc.json
```

The above command generates the lexer and saves it to the `statement_lexer.go` file. By default, the file name will be `{spec name}_lexer.go`. To use the lexer, you need to call the `NewLexer` function defined in `statement_lexer.go`. The following code is a simple example. In this example, the lexer reads source code from stdin and writes the resulting tokens to stdout.

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// NewLexer, NewLexSpec, and KindIDToName come from the generated statement_lexer.go.
	lex, err := NewLexer(NewLexSpec(), os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Read tokens until the EOF token appears.
	for {
		tok, err := lex.Next()
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		if tok.EOF {
			break
		}
		if tok.Invalid {
			fmt.Printf("invalid: %#v\n", string(tok.Lexeme))
		} else {
			fmt.Printf("valid: %v: %#v\n", KindIDToName(tok.KindID), string(tok.Lexeme))
		}
	}
}
```

Please save the above source code to `main.go` and create a directory structure like the one below.

```
/project_root
├── statement_lexer.go ... Lexer generated from the compiled lexical specification (the result of `maleeni-go`).
└── main.go .............. Caller of the lexer.
```

Now, you can perform the lexical analysis.

```sh
$ echo -n 'I want to believe.' | go run main.go statement_lexer.go
valid: word: "I"
valid: whitespace: " "
valid: word: "want"
valid: whitespace: " "
valid: word: "to"
valid: whitespace: " "
valid: word: "believe"
valid: punctuation: "."
```

## More Practical Usage

See also [this example](example/README.md).

## Lexical Specification Format

The lexical specification format to be passed to the `maleeni compile` command is as follows:

top level object:

| Field   | Type                   | Domain | Nullable | Description |
|---------|------------------------|--------|----------|-------------|
| name    | string                 | id     | false    | A specification name. |
| entries | array of entry objects | N/A    | false    | An array of entries sorted by priority. The first element has the highest priority, and the last has the lowest priority. |

entry object:

| Field    | Type             | Domain | Nullable | Description |
|----------|------------------|--------|----------|-------------|
| kind     | string           | id     | false    | A name of a token kind. The name must be unique, but duplicate names between fragments and non-fragments are allowed. |
| pattern  | string           | regexp | false    | A pattern in a regular expression. |
| modes    | array of strings | N/A    | true     | Mode names that an entry is enabled in (default: "default"). |
| push     | string           | id     | true     | A mode name that the lexer pushes onto its own mode stack when a token matching the pattern appears. |
| pop      | bool             | N/A    | true     | When `pop` is `true`, the lexer pops a mode from its own mode stack. |
| fragment | bool             | N/A    | true     | When `fragment` is `true`, the entry is a fragment. |

See [Identifier](#identifier) and [Regular Expression](#regular-expression) for more details on the `id` domain and the `regexp` domain.

## Identifier

`id` represents an identifier and must follow the rules below:

* `id` must be in lower snake case. It can contain only `a` to `z`, `0` to `9`, and `_`.
* The first and last characters must be one of `a` to `z`.
* `_` cannot appear consecutively.

## Regular Expression

`regexp` represents a regular expression. Its syntax is described below.

⚠️ In JSON, you need to write `\` as `\\`.

⚠️ maleeni doesn't allow you to use some code points. See [Unavailable Code Points](#unavailable-code-points).

### Composites

Concatenation and alternation allow you to combine multiple characters or multiple patterns into one pattern.

| Pattern    | Matches        |
|------------|----------------|
| `abc`      | `abc`          |
| `abc\|def` | `abc` or `def` |

### Single Characters

In addition to using ordinary characters, there are other ways to represent a single character:

* dot expression
* bracket expressions
* code point expressions
* character property expressions
* escape sequences

#### Dot Expression

The dot expression matches any one character.

| Pattern | Matches           |
|---------|-------------------|
| `.`     | any one character |

#### Bracket Expressions

The bracket expressions are represented by enclosing characters in `[ ]` or `[^ ]`. `[^ ]` is the negation of `[ ]`.
For instance, `[ab]` matches one of `a` or `b`, and `[^ab]` matches any one character except `a` and `b`.

| Pattern  | Matches                                          |
|----------|--------------------------------------------------|
| `[abc]`  | `a`, `b`, or `c`                                 |
| `[^abc]` | any one character except `a`, `b`, and `c`       |
| `[a-z]`  | one in the range of `a` to `z`                   |
| `[a-]`   | `a` or `-`                                       |
| `[-z]`   | `-` or `z`                                       |
| `[-]`    | `-`                                              |
| `[^a-z]` | any one character except the range of `a` to `z` |
| `[a^]`   | `a` or `^`                                       |

#### Code Point Expressions

The code point expressions match a character that has a specified code point. A code point is written as a four- or six-digit hexadecimal string.

| Pattern      | Matches                     |
|--------------|-----------------------------|
| `\u{000A}`   | U+000A (LF)                 |
| `\u{3042}`   | U+3042 (hiragana `あ`)      |
| `\u{01F63A}` | U+1F63A (grinning cat `😺`) |

#### Character Property Expressions

The character property expressions match a character that has a specified Unicode character property. Currently, maleeni supports `General_Category`, `Script`, `Alphabetic`, `Lowercase`, `Uppercase`, and `White_Space`. When you omit the equals sign and the right-hand value, maleeni interprets the symbol in `\p{...}` as a `General_Category` value.

| Pattern                       | Matches                                                |
|-------------------------------|--------------------------------------------------------|
| `\p{General_Category=Letter}` | any one character whose `General_Category` is `Letter` |
| `\p{gc=Letter}`               | the same as `\p{General_Category=Letter}`              |
| `\p{Letter}`                  | the same as `\p{General_Category=Letter}`              |
| `\p{l}`                       | the same as `\p{General_Category=Letter}`              |
| `\p{Script=Latin}`            | any one character whose `Script` is `Latin`            |
| `\p{Alphabetic=yes}`          | any one character whose `Alphabetic` is `yes`          |
| `\p{Lowercase=yes}`           | any one character whose `Lowercase` is `yes`           |
| `\p{Uppercase=yes}`           | any one character whose `Uppercase` is `yes`           |
| `\p{White_Space=yes}`         | any one character whose `White_Space` is `yes`         |

#### Escape Sequences

By escaping a special character with `\`, you can write a pattern that matches the special character itself.

The following escape sequences are available outside of bracket expressions.

| Pattern | Matches |
|---------|---------|
| `\\.`   | `.`     |
| `\\?`   | `?`     |
| `\\*`   | `*`     |
| `\\+`   | `+`     |
| `\\(`   | `(`     |
| `\\)`   | `)`     |
| `\\[`   | `[`     |
| `\\\|`  | `\|`    |
| `\\\\`  | `\\`    |

The following escape sequences are available inside bracket expressions.

| Pattern | Matches |
|---------|---------|
| `\\^`   | `^`     |
| `\\-`   | `-`     |
| `\\]`   | `]`     |

### Repetitions

The repetitions match a string that repeats the previous single character or group.

| Pattern | Matches          |
|---------|------------------|
| `a*`    | zero or more `a` |
| `a+`    | one or more `a`  |
| `a?`    | zero or one `a`  |

### Grouping

`(` and `)` group any patterns.

| Pattern     | Matches                                         |
|-------------|-------------------------------------------------|
| `a(bc)*d`   | `ad`, `abcd`, `abcbcd`, and so on               |
| `(ab\|cd)+` | `ab`, `cd`, `abcd`, `cdab`, `abcdab`, and so on |

### Fragment

The fragment is a feature that allows you to define a part of a pattern.
This feature is useful for decomposing complex patterns into simple patterns and for defining common parts between patterns. A fragment entry is defined by an entry whose `fragment` field is `true`, and is referenced by a fragment expression (`\f{...}`). Fragment patterns can be nested, but they are not allowed to contain circular references.

For instance, you can define [an identifier of golang](https://golang.org/ref/spec#Identifiers) as follows:

```json
{
    "name": "id",
    "entries": [
        {
            "fragment": true,
            "kind": "unicode_letter",
            "pattern": "\\p{Letter}"
        },
        {
            "fragment": true,
            "kind": "unicode_digit",
            "pattern": "\\p{Number}"
        },
        {
            "fragment": true,
            "kind": "letter",
            "pattern": "\\f{unicode_letter}|_"
        },
        {
            "kind": "identifier",
            "pattern": "\\f{letter}(\\f{letter}|\\f{unicode_digit})*"
        }
    ]
}
```

### Unavailable Code Points

Lexical specifications and source files to be analyzed cannot contain the following code points.

When a pattern implicitly contains unavailable code points, maleeni automatically replaces it with a pattern that excludes them. However, when you use unavailable code points explicitly (like `\u{D800}` or `\p{General_Category=Cs}`), maleeni reports an error.

* surrogate code points: U+D800..U+DFFF

## Lex Mode

Lex Mode is a feature that allows you to separate a DFA transition table for each mode.

The `modes` field of an entry in a lexical specification indicates in which modes the entry is enabled. If the `modes` field is empty, the entry is enabled only in the default mode. The compiler groups the entries and generates a DFA for each mode. Thus the driver can switch the transition table by switching modes. The mode switching follows the `push` or `pop` field of each entry.

For instance, you can define a subset of [the string literal of golang](https://golang.org/ref/spec#String_literals) as follows:

```json
{
    "name": "string",
    "entries": [
        {
            "kind": "string_open",
            "pattern": "\"",
            "push": "string"
        },
        {
            "modes": ["string"],
            "kind": "char_seq",
            "pattern": "[^\\u{000A}\"\\\\]+"
        },
        {
            "modes": ["string"],
            "kind": "escaped_char",
            "pattern": "\\\\[abfnrtv\\\\'\"]"
        },
        {
            "modes": ["string"],
            "kind": "escape_symbol",
            "pattern": "\\\\"
        },
        {
            "modes": ["string"],
            "kind": "newline",
            "pattern": "\\u{000A}"
        },
        {
            "modes": ["string"],
            "kind": "string_close",
            "pattern": "\"",
            "pop": true
        },
        {
            "kind": "identifier",
            "pattern": "[A-Za-z_][0-9A-Za-z_]*"
        }
    ]
}
```

In the above specification, when the `"` mark appears in the `default` mode (the initial mode), the driver transitions to the `string` mode and interprets character sequences (`char_seq`) and escape sequences (`escaped_char`). When the `"` mark appears the next time, the driver returns to the `default` mode.

```sh
$ echo -n '"foo\nbar"foo' | maleeni lex stringc.json | jq -r '[.mode_name, .kind_name, .lexeme, .eof] | @csv'
"default","string_open","""",false
"string","char_seq","foo",false
"string","escaped_char","\n",false
"string","char_seq","bar",false
"string","string_close","""",false
"default","identifier","foo",false
"default","","",true
```

The input string enclosed in the `"` marks (`foo\nbar`) is interpreted as `char_seq` and `escaped_char` tokens, while the outer string (`foo`) is interpreted as an `identifier`. The same string `foo` is interpreted as different kinds because it appears in different modes.

## Unicode Version

maleeni references [Unicode 13.0.0](https://unicode.org/versions/Unicode13.0.0/).