# maleeni

maleeni provides a command that generates a portable DFA for lexical analysis and a driver for golang. maleeni also provides a command to perform lexical analysis to allow easy debugging of your lexical specification.

[![Test](https://github.com/nihei9/maleeni/actions/workflows/test.yml/badge.svg?branch=main)](https://github.com/nihei9/maleeni/actions/workflows/test.yml)

## Installation

```sh
$ go install github.com/nihei9/maleeni/cmd/maleeni@latest
```

## Usage

First, define your lexical specification in JSON format. As an example, let's write the definitions of whitespace, words, and punctuation.

```json
{
    "entries": [
        {
            "kind": "whitespace",
            "pattern": "[\\u{0009}\\u{000A}\\u{000D}\\u{0020}]+"
        },
        {
            "kind": "word",
            "pattern": "[0-9A-Za-z]+"
        },
        {
            "kind": "punctuation",
            "pattern": "[.,:;]"
        }
    ]
}
```

Save the above specification to a file in UTF-8. In this explanation, the file name is lexspec.json.

Next, generate a DFA from the lexical specification using `maleeni compile` command.

```sh
$ maleeni compile -l lexspec.json -o clexspec.json
```

If you want to make sure that the lexical specification behaves as expected, you can use `maleeni lex` command to try lexical analysis without having to implement a driver.
`maleeni lex` command outputs tokens in JSON format. For simplicity, print significant fields of the tokens in CSV format using jq command.

⚠️ An encoding that `maleeni lex` and the driver can handle is only UTF-8.

```sh
$ echo -n 'The truth is out there.' | maleeni lex clexspec.json | jq -r '[.kind_id, .kind_name, .text, .eof] | @csv'
2,"word","The",false
1,"whitespace"," ",false
2,"word","truth",false
1,"whitespace"," ",false
2,"word","is",false
1,"whitespace"," ",false
2,"word","out",false
1,"whitespace"," ",false
2,"word","there",false
3,"punctuation",".",false
0,"","",true
```

The JSON format of tokens that `maleeni lex` command prints is as follows:

| Field        | Type              | Description                                                                                                                                           |
|--------------|-------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|
| mode_id      | integer           | An ID of a lex mode.                                                                                                                                  |
| mode_name    | string            | A name of a lex mode.                                                                                                                                 |
| kind_id      | integer           | An ID of a kind. This is unique among all modes.                                                                                                      |
| mode_kind_id | integer           | An ID of a lexical kind. This is unique only within a mode. Note that you need to use `KindID` field if you want to identify a kind across all modes. |
| kind_name    | string            | A name of a lexical kind.                                                                                                                             |
| row          | integer           | A row number where a lexeme appears.                                                                                                                  |
| col          | integer           | A column number where a lexeme appears. Note that `col` is counted in code points, not bytes.                                                         |
| match        | array of integers | A byte sequense of a lexeme.                                                                                                                          |
| text         | string            | A string representation of a lexeme.                                                                                                                  |
| eof          | bool              | When this field is `true`, it means the token is the EOF token.                                                                                       |
| invalid      | bool              | When this field is `true`, it means the token is an error token.                                                                                      |

When using the driver, please import `github.com/nihei9/maleeni/driver` and `github.com/nihei9/maleeni/spec` package.
You can use the driver easily in the following way:

```go
// Read your lexical specification file.
f, err := os.Open(path)
if err != nil {
    // error handling
}
data, err := ioutil.ReadAll(f)
if err != nil {
    // error handling
}
clexspec := &spec.CompiledLexSpec{}
err = json.Unmarshal(data, clexspec)
if err != nil {
    // error handling
}

// Generate a lexer.
lex, err := driver.NewLexer(clexspec, src)
if err != nil {
    // error handling
}

// Perform lexical analysis.
for {
    tok, err := lex.Next()
    if err != nil {
        // error handling
    }
    if tok.Invalid {
        // An error token appeared.
        // error handling
    }
    if tok.EOF {
        // The EOF token appeared.
        break
    }

    // Do something using `tok`.
}
```

## More Practical Usage

See also [this example](example/README.md).

## Lexical Specification Format

The lexical specification format to be passed to `maleeni compile` command is as follows:

top level object:

| Field   | Type                   | Nullable | Description                                                                                                           |
|---------|------------------------|----------|-----------------------------------------------------------------------------------------------------------------------|
| entries | array of entry objects | false    | An array of entries sorted by priority. The first element has the highest priority, and the last has the lowest priority. |

entry object:

| Field    | Type             | Nullable | Description                                                                                                           |
|----------|------------------|----------|-----------------------------------------------------------------------------------------------------------------------|
| kind     | string           | false    | A name of a token kind. The name must be unique, but duplicate names between fragments and non-fragments are allowed. |
| pattern  | string           | false    | A pattern in a regular expression                                                                                     |
| modes    | array of strings | true     | Mode names that an entry is enabled in (default: "default")                                                           |
| push     | string           | true     | A mode name that the lexer pushes to own mode stack when a token matching the pattern appears                         |
| pop      | bool             | true     | When `pop` is `true`, the lexer pops a mode from own mode stack.                                                      |
| fragment | bool             | true     | When `fragment` is `true`, its entry is a fragment.                                                                   |

See [Regular Expression Syntax](#regular-expression-syntax) for more details on the regular expression syntax.

## Regular Expression Syntax

⚠️ In JSON, you need to write `\` as `\\`.

### Composites

Concatenation and alternation allow you to combine multiple characters or multiple patterns into one pattern.

| Example  | Description           |
|----------|-----------------------|
| abc      | matches just 'abc'    |
| abc\|def | one of 'abc' or 'def' |

### Single Characters

In addition to using ordinary characters, there are other ways to represent a single character:

* dot expression
* bracket expressions
* code point expressions
* character property expressions
* escape sequences

The dot expression matches any one chracter.

| Example | Description       |
|---------|-------------------|
| .       | any one character |

The bracket expressions are represented by enclosing characters in `[ ]` or `[^ ]`. `[^ ]` is negation of `[ ]`. For instance, `[ab]` matches one of 'a' or 'b', and `[^ab]` matches any one character except 'a' and 'b'.

| Example | Description                                      |
|---------|--------------------------------------------------|
| [abc]   | one of 'a', 'b', or 'c'                          |
| [^abc]  | any one character except 'a', 'b', or 'c'        |
| [a-z]   | one in the range of 'a' to 'z'                   |
| [a-]    | 'a' or '-'                                       |
| [-z]    | '-' or 'z'                                       |
| [-]     | '-'                                              |
| [^a-z]  | any one character except the range of 'a' to 'z' |
| [a^]    | 'a' or '^'                                       |

The code point expressions match a character that has a specified code point. The code points consists of a four or six digits hex string.

| Example    | Description               |
|------------|---------------------------|
| \u{000A}   | U+0A (LF)                 |
| \u{3042}   | U+3042 (hiragana あ)      |
| \u{01F63A} | U+1F63A (grinning cat 😺) |

The character property expressions match a character that has a specified character property of the Unicode. Currently, maleeni supports only General_Category.

| Example                     | Description                                        |
|-----------------------------|----------------------------------------------------|
| \p{General_Category=Letter} | any one character whose General_Category is Letter |
| \p{gc=Letter}               | the same as \p{General_Category=Letter}            |
| \p{Letter}                  | the same as \p{General_Category=Letter}            |
| \p{l}                       | the same as \p{General_Category=Letter}            |

As you escape the special character with `\`, you can write a rule that matches the special character itself.
The following escape sequences are available outside of bracket expressions.

| Example | Description |
|---------|-------------|
| \\.     | '.'         |
| \\?     | '?'         |
| \\*     | '*'         |
| \\+     | '+'         |
| \\(     | '('         |
| \\)     | ')'         |
| \\[     | '['         |
| \\\|    | '\|'        |
| \\\\    | '\\'        |

The following escape sequences are available inside bracket expressions.

| Example | Description |
|---------|-------------|
| \\^     | '^'         |
| \\-     | '-'         |
| \\]     | ']'         |

### Repetitions

The repetitions match a string that repeats the previous single character or group.

| Example | Description      |
|---------|------------------|
| a*      | zero or more 'a' |
| a+      | one or more 'a'  |
| a?      | zero or one 'a'  |

### Grouping

`(` and `)` groups any patterns.

| Example   | Description                                            |
|-----------|--------------------------------------------------------|
| a(bc)*d   | matches 'ad', 'abcd', 'abcbcd', and so on              |
| (ab\|cd)+ | matches 'ab', 'cd', 'abcd', 'cdab', abcdab', and so on |

### Fragment

The fragment is a feature that allows you to define a part of a pattern. This feature is useful for decomposing complex patterns into simple patterns and for defining common parts between patterns.
A fragment entry is defined by an entry whose `fragment` field is `true`, and is referenced by a fragment expression (`\f{...}`).
Fragment patterns can be nested, but they are not allowed to contain circular references.

For instance, you can define [an identifier of golang](https://golang.org/ref/spec#Identifiers) as follows:

```json
{
    "entries": [
        {
            "fragment": true,
            "kind": "unicode_letter",
            "pattern": "\\p{Letter}"
        },
        {
            "fragment": true,
            "kind": "unicode_digit",
            "pattern": "\\p{Number}"
        },
        {
            "fragment": true,
            "kind": "letter",
            "pattern": "\\f{unicode_letter}|_"
        },
        {
            "kind": "identifier",
            "pattern": "\\f{letter}(\\f{letter}|\\f{unicode_digit})*"
        }
    ]
}
```

## Lex Mode

Lex Mode is a feature that allows you to separate a DFA transition table for each mode.

`modes` field of an entry in a lexical specification indicates in which mode the entry is enabled. If `modes` field is empty, the entry is enabled only in the default mode. The compiler groups the entries and generates a DFA for each mode. Thus the driver can switch the transition table by switching modes. The mode switching follows `push` or `pop` field of each entry.

For instance, you can define a subset of [the string literal of golang](https://golang.org/ref/spec#String_literals) as follows:

```json
{
    "entries": [
        {
            "kind": "string_open",
            "pattern": "\"",
            "push": "string"
        },
        {
            "modes": ["string"],
            "kind": "char_seq",
            "pattern": "[^\\u{000A}\"\\\\]+"
        },
        {
            "modes": ["string"],
            "kind": "escaped_char",
            "pattern": "\\\\[abfnrtv\\\\'\"]"
        },
        {
            "modes": ["string"],
            "kind": "escape_symbol",
            "pattern": "\\\\"
        },
        {
            "modes": ["string"],
            "kind": "newline",
            "pattern": "\\u{000A}"
        },
        {
            "modes": ["string"],
            "kind": "string_close",
            "pattern": "\"",
            "pop": true
        },
        {
            "kind": "identifier",
            "pattern": "[A-Za-z_][0-9A-Za-z_]*"
        }
    ]
}
```

In the above specification, when the `"` mark appears in default mode (it's the initial mode), the driver transitions to the `string` mode and interprets character sequences (`char_seq`) and escape sequences (`escaped_char`). When the `"` mark appears the next time, the driver returns to the `default` mode.

```sh
$ echo -n '"foo\nbar"foo' | maleeni lex go-string-cspec.json | jq -r '[.mode_name, .kind_name, .text, .eof] | @csv'
"default","string_open","""",false
"string","char_seq","foo",false
"string","escaped_char","\n",false
"string","char_seq","bar",false
"string","string_close","""",false
"default","identifier","foo",false
"default","","",true
```

The input string enclosed in the `"` mark (`foo\nbar`) are interpreted as the `char_seq` and the `escaped_char`, while the outer string (`foo`) is interpreted as the `identifier`. The same string `foo` is interpreted as different types because of the different modes in which they are interpreted.