# maleeni

maleeni provides a compiler that generates a portable DFA for lexical analysis, along with a driver for Go.

## Installation

```sh
$ go install ./cmd/maleeni
```

## Usage

First, define your lexical specification in JSON format. As an example, let's write the definitions of whitespace, words, and punctuation.

```json
{
    "entries": [
        {
            "kind": "whitespace",
            "pattern": "[\\u{0009}\\u{000A}\\u{000D}\\u{0020}]+"
        },
        {
            "kind": "word",
            "pattern": "[0-9A-Za-z]+"
        },
        {
            "kind": "punctuation",
            "pattern": "[.,:;]"
        }
    ]
}
```

Save the above specification to a file. The rest of this explanation assumes the file is named `lexspec.json`.

Next, generate a DFA from the lexical specification using the `maleeni compile` command.

```sh
$ maleeni compile -l lexspec.json -o clexspec.json
```

To check that the lexical specification behaves as expected, you can use the `maleeni lex` command to try lexical analysis without implementing a driver.
The `maleeni lex` command outputs tokens in JSON format. For simplicity, the following example prints only the significant fields of each token in CSV format using the `jq` command.

```sh
$ echo -n 'The truth is out there.' | maleeni lex clexspec.json | jq -r '[.kind_name, .text, .eof] | @csv'
"word","The",false
"whitespace"," ",false
"word","truth",false
"whitespace"," ",false
"word","is",false
"whitespace"," ",false
"word","out",false
"whitespace"," ",false
"word","there",false
"punctuation",".",false
"","",true
```

To use the driver, import the `github.com/nihei9/maleeni/driver` and `github.com/nihei9/maleeni/spec` packages.
The driver can be used as follows:

```go
// Read your lexical specification file.
f, err := os.Open(path)
if err != nil {
    // error handling
}
defer f.Close()
data, err := ioutil.ReadAll(f)
if err != nil {
    // error handling
}
clexspec := &spec.CompiledLexSpec{}
err = json.Unmarshal(data, clexspec)
if err != nil {
    // error handling
}

// Generate a lexer.
lex, err := driver.NewLexer(clexspec, src)
if err != nil {
    // error handling
}

// Perform lexical analysis.
for {
    tok, err := lex.Next()
    if err != nil {
        // error handling
    }
    if tok.Invalid {
        // An error token appeared.
        // error handling
    }
    if tok.EOF {
        // The EOF token appeared.
        break
    }

    // Do something using `tok`.
}
```

## Lexical Specification Format

The format of the lexical specification passed to the `maleeni compile` command is as follows:

top level object:

| Field   | Type                   | Nullable | Description                                                                                                           |
|---------|------------------------|----------|-----------------------------------------------------------------------------------------------------------------------|
| entries | array of entry objects | false    | An array of entries sorted by priority. The first element has the highest priority, and the last has the lowest priority. |

entry object:

| Field   | Type             | Nullable | Description                                                                                   |
|---------|------------------|----------|-----------------------------------------------------------------------------------------------|
| kind    | string           | false    | A name of a token kind                                                                        |
| pattern | string           | false    | A pattern in a regular expression                                                             |
| modes   | array of strings | true     | Mode names that an entry is enabled in (default: "default")                                   |
| push    | string           | true     | A mode name the lexer pushes onto its mode stack when a token matching the pattern appears    |
| pop     | bool             | true     | When `pop` is true, the lexer pops a mode from its own mode stack.                            |

See [Regular Expression Syntax](#regular-expression-syntax) for more details on the regular expression syntax.

## Regular Expression Syntax

### Composites

Concatenation and alternation allow you to combine multiple characters or multiple patterns into one pattern.

| Example  | Description           |
|----------|-----------------------|
| abc      | matches just 'abc'    |
| abc\|def | one of 'abc' or 'def' |

### Single Characters

In addition to using ordinary characters, there are other ways to represent a single character:

* dot expression
* bracket expressions
* code point expressions
* character property expressions

The dot expression matches any one character.

| Example | Description       |
|---------|-------------------|
| .       | any one character |

Bracket expressions are written by enclosing characters in `[ ]` or `[^ ]`. `[^ ]` is the negation of `[ ]`. For instance, `[ab]` matches one of 'a' or 'b', and `[^ab]` matches any one character except 'a' and 'b'.

| Example | Description                                      |
|---------|--------------------------------------------------|
| [abc]   | one of 'a', 'b', or 'c'                          |
| [^abc]  | any one character except 'a', 'b', or 'c'        |
| [a-z]   | one in the range of 'a' to 'z'                   |
| [a-]    | 'a' or '-'                                       |
| [-z]    | '-' or 'z'                                       |
| [-]     | '-'                                              |
| [^a-z]  | any one character except the range of 'a' to 'z' |
| [a^]    | 'a' or '^'                                       |

Code point expressions match the character that has the specified code point. A code point is written as a four- or six-digit hexadecimal string.

| Example    | Description               |
|------------|---------------------------|
| \u{000A}   | U+000A (LF)               |
| \u{3042}   | U+3042 (hiragana あ)      |
| \u{01F63A} | U+1F63A (grinning cat 😺) |
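
To check which character a code point expression denotes, the hex digits inside `\u{...}` can be parsed into a Go rune. This is an illustrative sketch only; `codePointToRune` is a hypothetical helper and not part of maleeni.

```go
package main

import (
	"fmt"
	"strconv"
)

// codePointToRune parses the hex digits found inside \u{...} into a rune.
// It accepts only four or six digits, as described above.
func codePointToRune(hex string) (rune, error) {
	if len(hex) != 4 && len(hex) != 6 {
		return 0, fmt.Errorf("expected 4 or 6 hex digits, got %d", len(hex))
	}
	v, err := strconv.ParseUint(hex, 16, 32)
	if err != nil {
		return 0, err
	}
	return rune(v), nil
}

func main() {
	for _, hex := range []string{"000A", "3042", "01F63A"} {
		r, err := codePointToRune(hex)
		if err != nil {
			panic(err)
		}
		// %U prints the code point in U+XXXX notation.
		fmt.Printf("%s -> %U\n", hex, r)
	}
}
```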

Character property expressions match a character that has the specified Unicode character property. Currently, maleeni supports only General_Category.

| Example                     | Description                                        |
|-----------------------------|----------------------------------------------------|
| \p{General_Category=Letter} | any one character whose General_Category is Letter |
| \p{gc=Letter}               | the same as \p{General_Category=Letter}            |
| \p{Letter}                  | the same as \p{General_Category=Letter}            |
| \p{l}                       | the same as \p{General_Category=Letter}            |
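
For intuition about what a General_Category test selects, Go's standard `unicode` package exposes the same categories as range tables. The sketch below uses `unicode.Letter` as a stand-in for `\p{Letter}`. It does not call maleeni, the `isLetter` helper is hypothetical, and Go's tables may correspond to a different Unicode version than the one maleeni uses.

```go
package main

import (
	"fmt"
	"unicode"
)

// isLetter reports whether r belongs to General_Category Letter (L),
// according to the Unicode tables bundled with the Go standard library.
func isLetter(r rune) bool {
	return unicode.Is(unicode.Letter, r)
}

func main() {
	for _, r := range []rune{'a', 'あ', '1', '😺'} {
		fmt.Printf("%U letter=%v\n", r, isLetter(r))
	}
}
```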

### Repetitions

A repetition matches a string in which the preceding single character or group repeats.

| Example | Description      |
|---------|------------------|
| a*      | zero or more 'a' |
| a+      | one or more 'a'  |
| a?      | zero or one 'a'  |

### Grouping

`(` and `)` group patterns.

| Example   | Description                                             |
|-----------|---------------------------------------------------------|
| a(bc)*d   | matches 'ad', 'abcd', 'abcbcd', and so on               |
| (ab\|cd)+ | matches 'ab', 'cd', 'abcd', 'cdab', 'abcdab', and so on |
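
Concatenation, alternation, repetition, and grouping have the same meaning in Go's `regexp` package, so the examples above can be tried out quickly there. This is only an analogy: maleeni has its own pattern syntax and engine, other constructs (such as code point expressions) are written differently in Go, and `fullMatch` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// fullMatch reports whether the whole of s matches pattern.
// The pattern is wrapped in a non-capturing group and anchored at both ends.
func fullMatch(pattern, s string) bool {
	re := regexp.MustCompile(`^(?:` + pattern + `)$`)
	return re.MatchString(s)
}

func main() {
	fmt.Println(fullMatch(`a(bc)*d`, "ad"))      // true: zero repetitions of 'bc'
	fmt.Println(fullMatch(`a(bc)*d`, "abcbcd"))  // true: two repetitions of 'bc'
	fmt.Println(fullMatch(`a(bc)*d`, "abd"))     // false: 'b' alone is not 'bc'
	fmt.Println(fullMatch(`(ab|cd)+`, "abcdab")) // true: 'ab', 'cd', 'ab'
	fmt.Println(fullMatch(`(ab|cd)+`, ""))       // false: '+' needs at least one
}
```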

## Lex Mode

Lex Mode is a feature that lets you use a separate DFA transition table for each mode.

The `modes` field of an entry in a lexical specification indicates the modes in which the entry is enabled. If the `modes` field is empty, the entry is enabled only in the default mode. The compiler groups the entries by mode and generates a DFA for each mode, so the driver can switch transition tables by switching modes. Mode switching follows the `push` and `pop` fields of each entry.
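
The push and pop behavior can be pictured as a small stack whose top element selects the active transition table. The following is a conceptual sketch, not maleeni's driver code: the `modeStack` type and the `string` mode name are hypothetical, and two details are assumptions, namely that a pop is applied before a push when an entry sets both, and that the initial `default` mode is never popped.

```go
package main

import "fmt"

// modeStack is a hypothetical model of the lexer's mode stack.
type modeStack struct {
	modes []string
}

// newModeStack starts in the default mode.
func newModeStack() *modeStack {
	return &modeStack{modes: []string{"default"}}
}

// current returns the mode on top of the stack, i.e. the active mode.
func (s *modeStack) current() string {
	return s.modes[len(s.modes)-1]
}

// apply performs what a matched entry's `push` and `pop` fields request.
// Assumptions: pop happens before push, and the bottom mode is never popped.
func (s *modeStack) apply(push string, pop bool) {
	if pop && len(s.modes) > 1 {
		s.modes = s.modes[:len(s.modes)-1]
	}
	if push != "" {
		s.modes = append(s.modes, push)
	}
}

func main() {
	s := newModeStack()
	fmt.Println(s.current()) // default

	// An entry with "push": "string" matched, e.g. an opening quote.
	s.apply("string", false)
	fmt.Println(s.current()) // string

	// An entry with "pop": true matched, e.g. a closing quote.
	s.apply("", true)
	fmt.Println(s.current()) // default
}
```

In a specification, this would correspond to an entry enabled in the default mode that pushes `string`, and an entry whose `modes` field contains `string` and whose `pop` field is true.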