path: root/compiler/parser_test.go
Commit message | Author | Age | Files | Lines
* Move UTF8-related processes to utf8 package | Ryo Nihei | 2021-12-01 | 1 | -111/+111
* Make contributory properties unavailable except internal use | Ryo Nihei | 2021-11-28 | 1 | -0/+32
    This change follows [UAX #44 5.13 Property APIs].

    > The following subtypes of Unicode character properties should generally not be exposed in APIs,
    > except in limited circumstances. They may not be useful, particularly in public API collections,
    > and may instead prove misleading to the users of such API collections.
    >
    > * Contributory properties are not recommended for public APIs.
    > ...

    https://unicode.org/reports/tr44/#Property_APIs
* Keep the order of AST nodes constant | Ryo Nihei | 2021-09-22 | 1 | -4/+10
* Change APIs | Ryo Nihei | 2021-08-01 | 1 | -3/+5
    Change fields of tokens, results of lexical analysis, as follows:
    - Rename: mode -> mode_id
    - Rename: kind_id -> mode_kind_id
    - Add: kind_id

    The kind ID is unique across all modes, but the mode kind ID is unique only within a mode.

    Change fields of a transition table as follows:
    - Rename: initial_mode -> initial_mode_id
    - Rename: modes -> mode_names
    - Rename: kinds -> kind_names
    - Rename: specs[].kinds -> specs[].kind_names
    - Rename: specs[].dfa.initial_state -> specs[].dfa.initial_state_id

    Change public types defined in the spec package as follows:
    - Rename: LexModeNum -> LexModeID
    - Rename: LexKind -> LexKindName
    - Add: LexKindID
    - Add: StateID
* Add fragment expression | Ryo Nihei | 2021-05-25 | 1 | -8/+126
    A fragment entry is defined by an entry whose `fragment` field is `true`, and is referenced by a fragment expression (`\f{...}`).
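The fragment mechanism described above can be illustrated with a minimal sketch. It is not maleeni's implementation: the `entry` type, its `Kind` and `Pattern` fields, and `expandFragments` are hypothetical names, and the sketch substitutes fragments textually (one level deep, ignoring escaped backslashes), whereas a real parser would more likely resolve them on the AST.

```go
package main

import (
	"fmt"
	"regexp"
)

// entry is a hypothetical stand-in for a lexical specification entry.
// Only the `fragment` field is named in the commit message.
type entry struct {
	Kind     string
	Pattern  string
	Fragment bool
}

// fragmentRef matches a fragment expression such as \f{alpha}.
var fragmentRef = regexp.MustCompile(`\\f\{([^}]*)\}`)

// expandFragments replaces each \f{name} in non-fragment entries with the
// pattern of the fragment entry called name, wrapped in a group.
// Assumption: fragments do not reference other fragments.
func expandFragments(entries []entry) (map[string]string, error) {
	fragments := map[string]string{}
	for _, e := range entries {
		if e.Fragment {
			fragments[e.Kind] = e.Pattern
		}
	}
	expanded := map[string]string{}
	for _, e := range entries {
		if e.Fragment {
			continue // fragment entries produce no token kind of their own
		}
		var missing []string
		p := fragmentRef.ReplaceAllStringFunc(e.Pattern, func(m string) string {
			name := fragmentRef.FindStringSubmatch(m)[1]
			body, ok := fragments[name]
			if !ok {
				missing = append(missing, name)
				return m
			}
			return "(" + body + ")"
		})
		if len(missing) > 0 {
			return nil, fmt.Errorf("entry %q references undefined fragments %q", e.Kind, missing)
		}
		expanded[e.Kind] = p
	}
	return expanded, nil
}

func main() {
	out, err := expandFragments([]entry{
		{Kind: "alpha", Pattern: "[a-z]", Fragment: true},
		{Kind: "word", Pattern: `\f{alpha}+`},
	})
	fmt.Println(out["word"], err) // prints "([a-z])+ <nil>"
}
```

Wrapping the substituted pattern in a group keeps operators that follow the fragment expression, such as `+`, applied to the whole fragment.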
* Fix parser to recognize property expressions in bracket expressions | Ryo Nihei | 2021-05-02 | 1 | -0/+11
* Improve compilation time a little | Ryo Nihei | 2021-05-02 | 1 | -2/+1
    A pattern like \p{Letter} generates an AST with many symbols concatenated by alt operators, which results in a large number of symbol positions in one state of the DFA. Such a pattern increases the compilation time. This commit improves the compilation time a little.

    - To avoid calling astNode#first and astNode#last recursively, memoize their results.
    - Use the byte sequence that symbol positions are encoded into as a hash value, to avoid using the fmt.Fprintf function.
    - Implement a sort function for symbol positions instead of using the sort.Slice function.
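The second and third points above can be sketched as follows. This is an illustration under assumptions, not the code from the commit: the `symbolPosition` type, its 16-bit width, the insertion sort, and the helper names are all hypothetical. The idea shown is that a sorted set of positions can be encoded directly into bytes and used as a map key, with no formatting calls involved.

```go
package main

import "fmt"

// symbolPosition is a hypothetical 16-bit position identifier.
type symbolPosition uint16

// sortPositions sorts in place with a plain insertion sort, avoiding the
// reflection-based swapping that sort.Slice relies on. The algorithm used
// by the actual commit is not stated; insertion sort is an assumption.
func sortPositions(ps []symbolPosition) {
	for i := 1; i < len(ps); i++ {
		for j := i; j > 0 && ps[j-1] > ps[j]; j-- {
			ps[j-1], ps[j] = ps[j], ps[j-1]
		}
	}
}

// positionsKey encodes the positions as big-endian byte pairs and returns
// the bytes as a string, which is usable as a map key.
func positionsKey(ps []symbolPosition) string {
	b := make([]byte, 0, len(ps)*2)
	for _, p := range ps {
		b = append(b, byte(p>>8), byte(p))
	}
	return string(b)
}

func main() {
	ps := []symbolPosition{3, 1, 2}
	sortPositions(ps)
	key := positionsKey(ps)
	states := map[string]int{key: 0} // DFA state number per position set
	fmt.Printf("%x %d\n", key, states[key]) // prints "000100020003 0"
}
```

Sorting before encoding ensures that two sets with the same members always produce the same key, regardless of insertion order.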
* Add character property expression (Meet RL1.2 of UTS #18 partially) | Ryo Nihei | 2021-04-30 | 1 | -1/+72
    \p{property name=property value} matches a character that has the property. When the property name is General_Category, it can be omitted. That is, \p{Letter} equals \p{General_Category=Letter}. Currently, only General_Category is supported.

    This feature meets RL1.2 of UTS #18 partially.

    RL1.2 Properties: https://unicode.org/reports/tr18/#RL1.2
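The defaulting rule for the property name can be sketched in a few lines. The `propertyExpr` type and `parsePropertyExpr` function are hypothetical and operate on a whole string for brevity; maleeni's parser works on a token stream and may differ in every detail.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// propertyExpr is a hypothetical parsed form of \p{name=value}.
type propertyExpr struct {
	Name  string
	Value string
}

// parsePropertyExpr splits the body of \p{...} at the first "=".
// When no "=" is present, the name defaults to General_Category.
func parsePropertyExpr(expr string) (propertyExpr, error) {
	if !strings.HasPrefix(expr, `\p{`) || !strings.HasSuffix(expr, "}") {
		return propertyExpr{}, errors.New(`expected \p{...}`)
	}
	body := expr[len(`\p{`) : len(expr)-1]
	name, value, found := strings.Cut(body, "=")
	if !found {
		name, value = "General_Category", body
	}
	if name == "" || value == "" {
		return propertyExpr{}, errors.New("property name and value must not be empty")
	}
	return propertyExpr{Name: name, Value: value}, nil
}

func main() {
	short, _ := parsePropertyExpr(`\p{Letter}`)
	long, _ := parsePropertyExpr(`\p{General_Category=Letter}`)
	fmt.Println(short == long) // prints "true"
}
```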
* Add code point expression (Meet RL1.1 of UTS #18) | Ryo Nihei | 2021-04-24 | 1 | -2/+125
    \u{hex string} matches a character that has the code point represented by the hex string. For instance, \u{3042} matches hiragana あ (U+3042). The hex string must have 4 or 6 digits.

    This feature meets RL1.1 of UTS #18.

    RL1.1 Hex Notation: https://unicode.org/reports/tr18/#RL1.1
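A minimal sketch of the digit-count rule follows. `parseCodePointExpr` is a hypothetical helper rather than maleeni's parser, and the upper-bound check against U+10FFFF is an assumption based on the Unicode code space, not something the commit message states.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// parseCodePointExpr converts an expression such as \u{3042} to a rune.
// It accepts only hex strings with exactly 4 or 6 digits.
func parseCodePointExpr(expr string) (rune, error) {
	if !strings.HasPrefix(expr, `\u{`) || !strings.HasSuffix(expr, "}") {
		return 0, errors.New(`expected \u{...}`)
	}
	hex := expr[len(`\u{`) : len(expr)-1]
	if len(hex) != 4 && len(hex) != 6 {
		return 0, errors.New("hex string must have 4 or 6 digits")
	}
	n, err := strconv.ParseUint(hex, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid hex string %q: %w", hex, err)
	}
	if n > 0x10FFFF { // assumption: reject values outside the Unicode code space
		return 0, errors.New("code point out of range")
	}
	return rune(n), nil
}

func main() {
	r, err := parseCodePointExpr(`\u{3042}`)
	fmt.Printf("%c U+%04X %v\n", r, r, err) // prints "あ U+3042 <nil>"
}
```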
* Change the lexical specs of regexp and define concrete syntax error values | Ryo Nihei | 2021-04-17 | 1 | -318/+308
    * Make the lexer treat ']' as an ordinary character in default mode
    * Define values of the syntax error type that represents error information concretely
* Increase the maximum number of symbol positions per pattern | Ryo Nihei | 2021-04-12 | 1 | -4/+12
    This commit increases the maximum number of symbol positions per pattern to 2^15 (= 32,768).
    When the limit is exceeded, the parse method returns an error.
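The limit check can be pictured with a small sketch. The `positionCounter` type and `errTooManyPositions` value are hypothetical names; only the limit of 2^15 and the fact that exceeding it yields an error come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// maxSymbolPositions is the per-pattern limit stated in the commit: 2^15.
const maxSymbolPositions = 1 << 15

// errTooManyPositions is a hypothetical error value for this sketch.
var errTooManyPositions = errors.New("too many symbol positions in one pattern")

// positionCounter hands out 1-based position numbers for one pattern.
type positionCounter struct {
	n int
}

// next returns the next position, or an error once the limit is reached.
func (c *positionCounter) next() (int, error) {
	if c.n >= maxSymbolPositions {
		return 0, errTooManyPositions
	}
	c.n++
	return c.n, nil
}

func main() {
	var c positionCounter
	var last int
	var err error
	for err == nil {
		var p int
		if p, err = c.next(); err == nil {
			last = p
		}
	}
	fmt.Println(last, err) // prints "32768 too many symbol positions in one pattern"
}
```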
* Fix grammar the parser accepts | Ryo Nihei | 2021-04-11 | 1 | -57/+955
    * Add cases that test the parse method.
    * Fix the parser to pass the cases.
* Add logging to compile command | Ryo Nihei | 2021-04-08 | 1 | -38/+0
    The compile command writes logs to the maleeni-compile.log file.
    When you use compiler.Compile(), you can choose whether the lexer writes logs or not.
* Refactoring | Ryo Nihei | 2021-02-25 | 1 | -22/+9
    * Remove token field from symbolNode
    * Simplify notation of nested nodes
    * Simplify arguments of newSymbolNode()
* Add dot symbol matching any single character | Ryo Nihei | 2021-02-14 | 1 | -7/+14
    The dot symbol matches any single character. When the dot symbol appears, the parser generates an AST matching all of the well-formed UTF-8 byte sequences.

    References:
    * https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G7404
    * Table 3-6. UTF-8 Bit Distribution
    * Table 3-7. Well-Formed UTF-8 Byte Sequences
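To show what "all of the well-formed UTF-8 byte sequences" amounts to, the sketch below writes out the nine rows of Table 3-7 as byte ranges and matches input against them. The `byteRange` type and `matchOne` function are hypothetical; the parser in this repository builds an AST from such ranges instead of matching bytes directly.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteRange is an inclusive range of byte values.
type byteRange struct {
	lo, hi byte
}

// wellFormedUTF8 transcribes Table 3-7 (Well-Formed UTF-8 Byte Sequences).
// Each row is one alternative; each element constrains one byte.
var wellFormedUTF8 = [][]byteRange{
	{{0x00, 0x7F}},                                           // U+0000..U+007F
	{{0xC2, 0xDF}, {0x80, 0xBF}},                             // U+0080..U+07FF
	{{0xE0, 0xE0}, {0xA0, 0xBF}, {0x80, 0xBF}},               // U+0800..U+0FFF
	{{0xE1, 0xEC}, {0x80, 0xBF}, {0x80, 0xBF}},               // U+1000..U+CFFF
	{{0xED, 0xED}, {0x80, 0x9F}, {0x80, 0xBF}},               // U+D000..U+D7FF
	{{0xEE, 0xEF}, {0x80, 0xBF}, {0x80, 0xBF}},               // U+E000..U+FFFF
	{{0xF0, 0xF0}, {0x90, 0xBF}, {0x80, 0xBF}, {0x80, 0xBF}}, // U+10000..U+3FFFF
	{{0xF1, 0xF3}, {0x80, 0xBF}, {0x80, 0xBF}, {0x80, 0xBF}}, // U+40000..U+FFFFF
	{{0xF4, 0xF4}, {0x80, 0x8F}, {0x80, 0xBF}, {0x80, 0xBF}}, // U+100000..U+10FFFF
}

// matchOne returns the length of the well-formed sequence at the start of b,
// or 0 when no row of the table matches.
func matchOne(b []byte) int {
	for _, seq := range wellFormedUTF8 {
		if len(b) < len(seq) {
			continue
		}
		ok := true
		for i, r := range seq {
			if b[i] < r.lo || b[i] > r.hi {
				ok = false
				break
			}
		}
		if ok {
			return len(seq)
		}
	}
	return 0
}

func main() {
	// Compare the table against the standard library's validity check.
	for _, s := range []string{"a", "あ", "\xC0\x80", "\xED\xA0\x80"} {
		fmt.Println(matchOne([]byte(s)), utf8.ValidString(s))
	}
}
```

The table excludes overlong encodings (for example C0 80) and surrogate code points (ED A0..BF), which is why several rows restrict the second byte.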
* Add compiler | Ryo Nihei | 2021-02-14 | 1 | -0/+208
    The compiler takes a lexical specification expressed by regular expressions and generates a DFA accepting the tokens. Operators that you can use in the regular expressions are concatenation, alternation, repeat, and grouping.