[^\p{...}] available)

kind names

close #1

Change the fields of tokens (the results of lexical analysis) as follows:
- Rename: mode -> mode_id
- Rename: kind_id -> mode_kind_id
- Add: kind_id
The kind ID is unique across all modes, but the mode kind ID is unique only within a mode.
Change the fields of a transition table as follows:
- Rename: initial_mode -> initial_mode_id
- Rename: modes -> mode_names
- Rename: kinds -> kind_names
- Rename: specs[].kinds -> specs[].kind_names
- Rename: specs[].dfa.initial_state -> specs[].dfa.initial_state_id
Change the public types defined in the spec package as follows:
- Rename: LexModeNum -> LexModeID
- Rename: LexKind -> LexKindName
- Add: LexKindID
- Add: StateID
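
For illustration, after this change a token for an entry defined in the `default` mode might carry IDs like this (a sketch only; the `text` field and the concrete ID values are assumptions, not taken from this change):

    {
        "mode_id": 1,
        "mode_kind_id": 1,
        "kind_id": 3,
        "text": "foo"
    }

Two entries in different modes may share the same mode_kind_id, but their kind_id values are always distinct.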
A fragment entry is an entry whose `fragment` field is `true`, and it is referenced by a fragment expression (`\f{...}`).
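
As a sketch of how a fragment might be declared and referenced (the `entries`, `kind`, and `pattern` field names are assumed here; only the `fragment` field and the `\f{...}` expression come from this change):

    {
        "entries": [
            {
                "fragment": true,
                "kind": "digit",
                "pattern": "[0-9]"
            },
            {
                "kind": "integer",
                "pattern": "\\f{digit}+"
            }
        ]
    }

Note that the backslash has to be doubled inside a JSON string, so `\f{digit}` is written as `\\f{digit}`.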
--compression-level specifies a compression level. The default value is 2.
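
A possible invocation, hedged: the `compile` subcommand and its `-l`/`-o` flags are assumptions about the CLI's shape; only `--compression-level` comes from this change.

    maleeni compile -l lexspec.json -o clexspec.json --compression-level 1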
This commit fixes a bug that caused the second and subsequent characters of the text representation of an error token to be missing.

Lex mode is a feature that separates transition tables per mode.
The lexer starts from the initial state indicated by the `initial_state` field and
transitions between modes according to the `push` and `pop` fields.
The initial state will always be `default`.
Currently, maleeni doesn't provide the ability to change the initial state.
You can specify the modes of each lex entry using the `modes` field.
When a mode isn't indicated explicitly, the entry belongs to the `default` mode.
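
A sketch of a spec that lexes double-quoted strings with a separate mode (the `entries`, `kind`, and `pattern` field names, the boolean form of `pop`, and the concrete patterns are assumptions; `modes`, `push`, `pop`, and the `default` mode come from this change):

    {
        "entries": [
            {
                "kind": "open_quote",
                "pattern": "\"",
                "push": "string"
            },
            {
                "modes": ["string"],
                "kind": "char_sequence",
                "pattern": "[^\"]+"
            },
            {
                "modes": ["string"],
                "kind": "close_quote",
                "pattern": "\"",
                "pop": true
            }
        ]
    }

Entries without a `modes` field belong to the `default` mode. Matching `open_quote` pushes the `string` mode, and matching `close_quote` pops back to the previous mode.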
\u{hex string} matches the character that has the code point represented by the hex string.
For instance, \u{3042} matches hiragana あ (U+3042). The hex string must have 4 or 6 digits.
This feature meets RL1.1 of UTS #18.
RL1.1 Hex Notation: https://unicode.org/reports/tr18/#RL1.1
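
A sketch of an entry using this notation (the `kind` and `pattern` field names are assumed):

    {
        "kind": "hiragana_a",
        "pattern": "\\u{3042}"
    }

In a JSON string the backslash must be doubled; a bare `"\u{3042}"` would not even be valid JSON.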
* Make the lexer treat ']' as an ordinary character in default mode
* Define concrete values of the syntax error type to represent error information

* Print the result of the lex command in JSON format.
* Print the EOF token.

[^a-z] matches any character that is not in the range a-z.

[a-z] matches any one character from a to z. The order of the characters depends on Unicode code points.

* a+ matches 'a' one or more times. This is equivalent to aa*.
* a? matches 'a' zero or one time.
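
A sketch of an entry combining both operators (the `kind` and `pattern` field names are assumed):

    {
        "kind": "ab_seq",
        "pattern": "ab+c?"
    }

This pattern matches 'ab', 'abb', 'abc', 'abbbc', and so on: one 'a', one or more 'b's, and an optional trailing 'c'.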
The APIs of the compiler and driver packages use these types. Because the CompiledLexSpec struct that a lexer takes contains the kind names of the lexical specification entries, the lexer sets them on tokens.

A bracket expression matches any single character specified in it. Inside a bracket expression, special characters like `.` and `*` are handled as ordinary characters.
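
A sketch of an entry relying on that behavior (the `kind` and `pattern` field names are assumed):

    {
        "kind": "dot_or_star",
        "pattern": "[.*]"
    }

Here `[.*]` matches a single literal '.' or '*', not "any character followed by a repetition".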
The dot symbol matches any single character. When the dot symbol appears, the parser generates an AST matching all of the well-formed UTF-8 byte sequences.
References:
* https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G7404
* Table 3-6. UTF-8 Bit Distribution
* Table 3-7. Well-Formed UTF-8 Byte Sequences
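
For reference, the well-formed byte sequence ranges from Table 3-7 that the generated AST has to cover are:

    Code points          1st byte   2nd byte   3rd byte   4th byte
    U+0000..U+007F       00..7F
    U+0080..U+07FF       C2..DF     80..BF
    U+0800..U+0FFF       E0         A0..BF     80..BF
    U+1000..U+CFFF       E1..EC     80..BF     80..BF
    U+D000..U+D7FF       ED         80..9F     80..BF
    U+E000..U+FFFF       EE..EF     80..BF     80..BF
    U+10000..U+3FFFF     F0         90..BF     80..BF     80..BF
    U+40000..U+FFFFF     F1..F3     80..BF     80..BF     80..BF
    U+100000..U+10FFFF   F4         80..8F     80..BF     80..BF

So the dot symbol expands to an alternation over these nine byte-sequence forms rather than to a single wildcard byte.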
The driver takes a DFA and an input text and generates a lexer. The lexer tokenizes the input text according to the lexical specification that the DFA expresses.