| Commit message (Collapse) | Author | Age | Files | Lines |
| |
This change follows [UAX #44 5.13 Property APIs].
> The following subtypes of Unicode character properties should generally not be exposed in APIs,
> except in limited circumstances. They may not be useful, particularly in public API collections,
> and may instead prove misleading to the users of such API collections.
>
> * Contributory properties are not recommended for public APIs.
> ...
https://unicode.org/reports/tr44/#Property_APIs
| |
Change the fields of tokens, which are the results of lexical analysis, as follows:
- Rename: mode -> mode_id
- Rename: kind_id -> mode_kind_id
- Add: kind_id
The kind ID is unique across all modes, but the mode kind ID is unique only within a mode.
Change fields of a transition table as follows:
- Rename: initial_mode -> initial_mode_id
- Rename: modes -> mode_names
- Rename: kinds -> kind_names
- Rename: specs[].kinds -> specs[].kind_names
- Rename: specs[].dfa.initial_state -> specs[].dfa.initial_state_id
Change public types defined in the spec package as follows:
- Rename: LexModeNum -> LexModeID
- Rename: LexKind -> LexKindName
- Add: LexKindID
- Add: StateID
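The relationship between the two kind IDs can be sketched as follows. This is only an illustration: `LexModeID`, `LexKindID`, and `LexKindName` are named in this commit, but their underlying types, the `LexModeKindID` type, the Go field names, and the lookup tables are assumptions made for the example.

```go
package main

import "fmt"

// LexModeID, LexKindID and LexKindName are named in the commit message.
// Their underlying types and LexModeKindID are assumptions for this sketch.
type (
	LexModeID     int
	LexKindID     int
	LexModeKindID int
	LexKindName   string
)

// Token carries the three IDs described above (Go field names are assumed).
type Token struct {
	ModeID     LexModeID     // mode_id
	KindID     LexKindID     // kind_id: unique across all modes
	ModeKindID LexModeKindID // mode_kind_id: unique only within ModeID
}

// kindIDs[mode][modeKind] is a hypothetical table that maps a mode-local
// kind ID to the global kind ID. Index 0 is unused in both dimensions.
var kindIDs = [][]LexKindID{
	nil,
	{0, 1, 2}, // mode 1: global kind IDs 1 and 2
	{0, 3, 4}, // mode 2: global kind IDs 3 and 4
}

// kindNames is indexed by the global kind ID (hypothetical names).
var kindNames = []LexKindName{"", "string_open", "white_space", "char_seq", "string_close"}

// newToken fills in the global kind ID from the mode and mode-local kind ID.
func newToken(mode LexModeID, modeKind LexModeKindID) Token {
	return Token{ModeID: mode, KindID: kindIDs[mode][modeKind], ModeKindID: modeKind}
}

func main() {
	a := newToken(1, 1)
	b := newToken(2, 1)
	// The mode kind IDs collide across modes, but the kind IDs do not.
	fmt.Println(a.ModeKindID == b.ModeKindID, a.KindID == b.KindID)
	fmt.Println(kindNames[a.KindID], kindNames[b.KindID])
}
```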
| |
A fragment entry is an entry whose `fragment` field is `true`. It is referenced by a fragment expression (`\f{...}`).
| |
A pattern like \p{Letter} generates an AST with many symbols joined by alt operators,
which results in a large number of symbol positions in a single state of the DFA.
Such a pattern increases the compilation time. This commit reduces the compilation time slightly.
- Memoize the results of astNode#first and astNode#last to avoid computing them recursively on every call.
- Use the byte sequence into which symbol positions are encoded as the hash value, instead of building it with fmt.Fprintf.
- Implement a dedicated sort function for symbol positions instead of using sort.Slice.
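The second and third items can be sketched roughly as below. This is not the code from this commit: the `symbolPosition` type, its 16-bit width, the function names, and the choice of insertion sort are all assumptions made for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// symbolPosition is a hypothetical 16-bit position ID.
type symbolPosition uint16

// sortPositions sorts ps in place with a plain insertion sort.
// A hand-written sort on a concrete type avoids the reflection-based
// swapping that sort.Slice performs.
func sortPositions(ps []symbolPosition) {
	for i := 1; i < len(ps); i++ {
		v := ps[i]
		j := i - 1
		for j >= 0 && ps[j] > v {
			ps[j+1] = ps[j]
			j--
		}
		ps[j+1] = v
	}
}

// hashKey encodes a set of positions as big-endian bytes and returns the
// bytes as a string usable as a map key. No text formatting is involved.
func hashKey(ps []symbolPosition) string {
	sorted := append([]symbolPosition(nil), ps...) // keep the input intact
	sortPositions(sorted)
	buf := make([]byte, 2*len(sorted))
	for i, p := range sorted {
		binary.BigEndian.PutUint16(buf[2*i:], uint16(p))
	}
	return string(buf)
}

func main() {
	states := map[string]int{}
	states[hashKey([]symbolPosition{3, 1, 2})] = 0
	// The same set in a different order finds the same DFA state.
	_, found := states[hashKey([]symbolPosition{1, 2, 3})]
	fmt.Println(found)
}
```

Sorting before encoding makes the key independent of the order in which positions were collected, so two equal sets always map to the same state.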
| |
\p{property name=property value} matches a character that has the property.
When the property name is General_Category, it can be omitted.
That is, \p{Letter} is equivalent to \p{General_Category=Letter}.
Currently, only General_Category is supported.
This feature partially meets RL1.2 of UTS #18.
RL1.2 Properties: https://unicode.org/reports/tr18/#RL1.2
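The default property name can be illustrated with a small sketch. The function names are hypothetical, and Go's standard `unicode` tables stand in for the tables this commit actually uses; only two General_Category values are listed here.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// normalizeProperty splits the contents of \p{...} into a property name and
// value. When no name is given, General_Category is assumed.
func normalizeProperty(expr string) (name, value string) {
	if i := strings.IndexByte(expr, '='); i >= 0 {
		return expr[:i], expr[i+1:]
	}
	return "General_Category", expr
}

// generalCategories is a stand-in built on Go's standard tables.
var generalCategories = map[string]*unicode.RangeTable{
	"Letter": unicode.Letter,
	"Number": unicode.Number,
}

// matchesProperty reports whether r has the property written in expr.
func matchesProperty(expr string, r rune) (bool, error) {
	name, value := normalizeProperty(expr)
	if name != "General_Category" {
		return false, fmt.Errorf("unsupported property: %s", name)
	}
	tab, ok := generalCategories[value]
	if !ok {
		return false, fmt.Errorf("unsupported General_Category value: %s", value)
	}
	return unicode.Is(tab, r), nil
}

func main() {
	short, _ := matchesProperty("Letter", 'あ')
	long, _ := matchesProperty("General_Category=Letter", 'あ')
	fmt.Println(short, long)
}
```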
| |
\u{hex string} matches the character whose code point is represented by the hex string.
For instance, \u{3042} matches hiragana あ (U+3042). The hex string must have 4 or 6 digits.
This feature meets RL1.1 of UTS #18.
RL1.1 Hex Notation: https://unicode.org/reports/tr18/#RL1.1
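A minimal sketch of the digit-count rule, assuming a hypothetical helper `parseCodePoint` that receives the text between the braces. The upper-bound check against U+10FFFF is an assumption of this sketch, not something stated above.

```go
package main

import (
	"fmt"
	"strconv"
	"unicode"
)

// parseCodePoint converts the hex string of \u{...} into a code point.
// The hex string must have exactly 4 or 6 digits.
func parseCodePoint(hex string) (rune, error) {
	if len(hex) != 4 && len(hex) != 6 {
		return 0, fmt.Errorf("hex string must have 4 or 6 digits: %q", hex)
	}
	n, err := strconv.ParseUint(hex, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid hex string %q: %w", hex, err)
	}
	// Assumption: values beyond the Unicode code space are rejected.
	if n > unicode.MaxRune {
		return 0, fmt.Errorf("code point out of range: %q", hex)
	}
	return rune(n), nil
}

func main() {
	r, err := parseCodePoint("3042")
	fmt.Println(string(r), err)

	_, err = parseCodePoint("304")
	fmt.Println(err != nil)
}
```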
| |
* Make the lexer treat ']' as an ordinary character in default mode
* Define concrete values of the syntax error type to represent error information
| |
This commit increases the maximum number of symbol positions per pattern to 2^15 (= 32,768).
When the limit is exceeded, the parse method returns an error.
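The limit check might look like the following sketch. The counter type, the error value, and the constant name are hypothetical; only the limit of 2^15 comes from this commit.

```go
package main

import (
	"errors"
	"fmt"
)

// symbolPositionMax is the maximum number of symbol positions per pattern.
const symbolPositionMax = 1 << 15 // 32,768

// errTooManyPositions is a hypothetical error value for this sketch.
var errTooManyPositions = errors.New("too many symbol positions in a pattern")

// positionCounter hands out symbol positions 1, 2, 3, ... for one pattern.
type positionCounter struct{ n int }

// next returns the next position, or an error once the limit is exceeded.
func (c *positionCounter) next() (int, error) {
	if c.n >= symbolPositionMax {
		return 0, errTooManyPositions
	}
	c.n++
	return c.n, nil
}

func main() {
	var c positionCounter
	var err error
	for err == nil {
		_, err = c.next()
	}
	fmt.Println(c.n, err)
}
```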
| |
* Add cases that test the parse method.
* Fix the parser to pass the cases.
| |
The compile command writes logs to the maleeni-compile.log file.
When you use compiler.Compile(), you can choose whether the lexer writes logs or not.
| |
* Remove token field from symbolNode
* Simplify notation of nested nodes
* Simplify arguments of newSymbolNode()
| |
The dot symbol matches any single character. When the dot symbol appears, the parser generates an AST matching all of the well-formed UTF-8 byte sequences.
References:
* https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G7404
* Table 3-6. UTF-8 Bit Distribution
* Table 3-7. Well-Formed UTF-8 Byte Sequences
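The byte ranges of Table 3-7 can be written down directly, which shows the shape of the alternation the dot expands to. This is a sketch, not the parser's code: the `byteRange` type and the function names are hypothetical, and the table is checked against Go's `unicode/utf8` package rather than against this project.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteRange is an inclusive range of byte values.
type byteRange struct{ lo, hi byte }

// wellFormed lists the rows of Table 3-7 (Well-Formed UTF-8 Byte Sequences).
// Each row is a concatenation of byte ranges; the dot is the alternation
// of all rows.
var wellFormed = [][]byteRange{
	{{0x00, 0x7F}},                                           // U+0000..U+007F
	{{0xC2, 0xDF}, {0x80, 0xBF}},                             // U+0080..U+07FF
	{{0xE0, 0xE0}, {0xA0, 0xBF}, {0x80, 0xBF}},               // U+0800..U+0FFF
	{{0xE1, 0xEC}, {0x80, 0xBF}, {0x80, 0xBF}},               // U+1000..U+CFFF
	{{0xED, 0xED}, {0x80, 0x9F}, {0x80, 0xBF}},               // U+D000..U+D7FF
	{{0xEE, 0xEF}, {0x80, 0xBF}, {0x80, 0xBF}},               // U+E000..U+FFFF
	{{0xF0, 0xF0}, {0x90, 0xBF}, {0x80, 0xBF}, {0x80, 0xBF}}, // U+10000..U+3FFFF
	{{0xF1, 0xF3}, {0x80, 0xBF}, {0x80, 0xBF}, {0x80, 0xBF}}, // U+40000..U+FFFFF
	{{0xF4, 0xF4}, {0x80, 0x8F}, {0x80, 0xBF}, {0x80, 0xBF}}, // U+100000..U+10FFFF
}

// matchesAnyChar reports whether b is exactly one well-formed UTF-8 sequence.
func matchesAnyChar(b []byte) bool {
	for _, seq := range wellFormed {
		if len(seq) != len(b) {
			continue
		}
		ok := true
		for i, r := range seq {
			if b[i] < r.lo || b[i] > r.hi {
				ok = false
				break
			}
		}
		if ok {
			return true
		}
	}
	return false
}

// agreesWithStdlib checks that every valid rune, as encoded by unicode/utf8,
// is accepted by the table.
func agreesWithStdlib() bool {
	buf := make([]byte, utf8.UTFMax)
	for r := rune(0); r <= utf8.MaxRune; r++ {
		if !utf8.ValidRune(r) {
			continue // surrogates have no well-formed encoding
		}
		n := utf8.EncodeRune(buf, r)
		if !matchesAnyChar(buf[:n]) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(matchesAnyChar([]byte("あ")))
	fmt.Println(matchesAnyChar([]byte{0xC0, 0x80})) // overlong encoding
	fmt.Println(agreesWithStdlib())
}
```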
|
The compiler takes a lexical specification expressed by regular expressions and generates a DFA accepting the tokens.
The operators available in the regular expressions are concatenation, alternation, repetition, and grouping.
|