Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dedication #11

Open
aMOPel opened this issue Oct 2, 2022 · 24 comments
Open

Dedication #11

aMOPel opened this issue Oct 2, 2022 · 24 comments

Comments

@aMOPel
Copy link
Contributor

aMOPel commented Oct 2, 2022

Hi there,

I know it's kinda odd to ask about your plans, it's just that I happen to also have worked a bunch on a tree-sitter-nim repo and was wondering how dedicated you are to finishing yours. I haven't had the time to really continue since mid august and frankly the last bits are gonna be difficult (I made a bunch of issues where you could see what's missing in mine).

You seem to have made some good progress and also you seem to have been able to keep the grammar.js a lot smaller and possibly cleaner than me so far. Are you dedicated to finishing it? Because then I would not try to push mine to the end.

@alaviss
Copy link
Owner

alaviss commented Oct 4, 2022

I'll try my best to get it done, but so far school is really getting in the way.

@aMOPel
Copy link
Contributor Author

aMOPel commented Feb 9, 2023

Apparently you started again. I kinda lost motivation on mine, however mine is very close to completion I believe.
The last problems proved to be hard to overcome though.

If I can give you a tip: You might want to solve those first and create super tight test cases so that you can't regress.

  1. the complex expressions, ie if expressions, try expressions etc. are quite a problem, since they can be inline or over multiple lines. And to match those right in all contexts that expressions can appear in, seems like a really involved task.
  2. Also the par element in the grammar is difficult and has a lot of conflicts with all the other uses of parenthesis, like tuple construction, function calls that are not right after the function identifier and of course the parentheses that you can just wrap around any expression. In fact those might be almost impossible to distinguish, just by syntax.
  3. cmdCall and postExprBlocks syntax is also really hard to nail down with all the corner cases and conflicts.

If you also lose motivation, maybe we can team up. Either way feel free to use my code and test cases.

Edit: I just saw that you made the https://github.com/alaviss/nim.nvim plugin. I've been using it for a while and it does a beautiful job at highlighting. Thank you :)

@alaviss
Copy link
Owner

alaviss commented Feb 9, 2023

Thanks for the kind words.

If I can give you a tip: You might want to solve those first and create super tight test cases so that you can't regress.

-----:scissors: snipped------

Indeed, these are the reason I'm restructuring the external scanner and the grammar as a whole.

One thing I've been doing is to not use the official grammar more than I have to. The grammar was, well, hand derived from hand written code, so it's not precise in the first place. What I've been doing is to test snippets of code and cross reference with the macros.dumpTree. Currently I think I have somewhat of an idea on how to finish this 🤞

@aMOPel
Copy link
Contributor Author

aMOPel commented Feb 10, 2023

That's sensible. There are a lot of things missing in the grammar, some rules are described in an odd or inconsistent way, and some are even partially wrong. I've made a collection of the issues I noticed (in the bottom are also some things I asked):

https://github.com/aMOPel/tree-sitter-nim/blob/main/grammar_bugs_to_report.md

Also I don't know how you're constructing your test cases right now, but I put some mappings for vim users in my readme to easily write and iterate on test cases:

https://github.com/aMOPel/tree-sitter-nim/tree/main#for-vim-users

And I can really recommend you to harvest my test corpus. Of course you can't use the AST's but you could use the examples and I've been trying to write down a lot of cases. For example in literals and operators I've tried to really cover all the cases, if possible.

Also for many other things I've put a lot of corner cases, like the classic "is it a tuple or a function call?":

echo()            # function call
echo ()            # tuple
echo(1, 2)            # function call
echo (1, 2)            # function call
echo (a: 1,b: 2)      # tuple
echo(1)               # function call
echo (1)              # function call
echo (1,)              # tuple
echo (a: 1)           # tuple, no trailing comma, still tuple
echo((1,2))           # tuple
echo ((1,2))           # tuple
echo(((1,2),),)           # double tuple
echo (((1,2),),)           # triple tuple

Anyway. Thank you for giving the nim tree-sitter another go. I'd love to use it somewhen in the future :)

@alaviss
Copy link
Owner

alaviss commented Feb 10, 2023

Also I don't know how you're constructing your test cases right now, but I put some mappings for vim users in my readme to easily write and iterate on test cases:

I have nothing like that :P I really just write the test, add --- separator at the end, then tree-sitter test -f "test name" -u to fill in the blanks.

And I can really recommend you to harvest my test corpus. Of course you can't use the AST's but you could use the examples and I've been trying to write down a lot of cases. For example in literals and operators I've tried to really cover all the cases, if possible.

Thanks, I'll consult them for sure.

@aMOPel
Copy link
Contributor Author

aMOPel commented Mar 19, 2023

Little update: I've started a rework of my grammar as well. Somebody wrote some highlight queries for it and I was surprised to see that it already did like 95% of everything. Gave me some new motivation.

Anyway I wanted to tell you I have a great workflow for updating test cases. Since it can be super tedious when done manually.

https://github.com/aMOPel/tree-sitter-nim/blob/main/.lvimrc

With those mappings you can literally go <space>dd to open a diff between the old ast and the new ast and then <space>yy or <space>nn to accept or disregard the new ast. And then you can go <space>dd again for the next test case. That means you can really update test cases in seconds. I hope that helps :)

@aMOPel
Copy link
Contributor Author

aMOPel commented Mar 20, 2023

While fighting with command call syntax and expression precedences I stumbled upon this:

Whether an operator is used as a prefix operator is also affected by preceding whitespace (this parsing change was introduced with version 0.13.0):

echo $foo
# is parsed as
echo($foo)

from here:
https://nim-lang.github.io/Nim/manual.html#syntax-precedence

I didn't see a test case in your tests yet, maybe I missed it.

echo $foo

should be parsed as

(command_call (identifiert) (unary_expression (operator) (identifier)))

and

echo $ foo

should be parsed as

(binary_expression (identifier) (operator) (identifier))

I figure this is rather impossible, since you need to use token() or token.immediate() to control the whitespace matching and both disallow references to rules and require terminal symbols (as you know).

And I don't see how to resolve this with scanner.cc and without a huge effort either.

@alaviss
Copy link
Owner

alaviss commented Mar 20, 2023

And I don't see how to resolve this with scanner.cc and without a huge effort either.

Feel free to use mine:

namespace binary_op {
/// The maximum value of `char`. Useful for unicode testing.
constexpr char MaxAsciiChar = numeric_limits<char>::max();
/// The set of all BinaryOps tokens.
constexpr ValidSymbols BinaryOps = make_valid_symbols(
{TokenType::BinaryOp10Left, TokenType::BinaryOp10Right,
TokenType::BinaryOp9, TokenType::BinaryOp8, TokenType::BinaryOp7,
TokenType::BinaryOp6, TokenType::BinaryOp5, TokenType::BinaryOp2,
TokenType::BinaryOp1, TokenType::BinaryOp0});
// TODO: Invent something to encapsulate this structure.
/// All ASCII operator characters.
///
/// The ordering is done so that it reflects the precedence structure.
constexpr array<char, 19> Chars{// OP10
/* 0 */ '$', '^',
// OP9
/* 2 */ '*', '%', '\\', '/',
// OP8
/* 6 */ '+', '-', '~', '|',
// OP7
/* 10 */ '&',
// OP6
/* 11 */ '.',
// OP5
/* 12 */ '=', '<', '>', '!',
// OP2
/* 16 */ '@', ':', '?'};
/// Mapping of {@link Chars} starting index and offset from BinaryOpsStart.
constexpr array<int8_t, 8> CharRanges{
/* BinaryOpStart | BinaryOp10Left */ 0,
/* BinaryOp10Right */ 1,
/* BinaryOp9 */ 2,
/* BinaryOp8 */ 6,
/* BinaryOp7 */ 10,
/* BinaryOp6 */ 11,
/* BinaryOp5 */ 12,
/* BinaryOp2 */ 16};
/// All Unicode operator characters.
///
/// The ordering is done so that it reflects the precedence structure.
constexpr array<char16_t, 21> UnicodeChars{
// OP9
/* 0 */ u'', u'', u'×', u'', u'', u'', u'', u'', u'', u'', u'',
u'', u'',
// OP8
/* 13 */ u'±', u'', u'', u'', u'', u'', u'', u''};
/// Mapping of {@link UnicodeChars} starting index and offset from
/// BinaryOpsStart.
constexpr array<int8_t, 2> UnicodeRanges{
/* BinaryOp9 */ 0,
/* BinaryOp8 */ 13};
/// Given a segmented array, returns the segment the given value is in.
///
/// @param array - The segmented array.
/// @param mapping - The start of each segment, sorted from low to high.
/// @param value - The value to search for.
/// @param fallback - The value returned if `value` is not in `array`, that the
/// segment found is not in `mapping`.
///
/// @returns The index of the segment in `mapping`.
template<class InputT, class MapT, class T>
int mapped_find(
const InputT& array, const MapT& mapping, T value, int fallback = -1)
{
const auto iter = find(array.begin(), array.end(), value);
if (iter == array.end()) {
return fallback;
}
const auto idx = iter - array.begin();
const auto mapped_iter = find_if(
mapping.rbegin(), mapping.rend(),
[=](auto rangeStart) { return rangeStart <= idx; });
if (mapped_iter == mapping.rend()) {
return fallback;
}
return mapping.rend() - mapped_iter - 1;
}
/// Returns the BinaryOp token type corresponding to the given `character`.
///
/// Returns `TokenType::TokenTypeLen` if the character is not a valid operator.
TokenType classify(char32_t character)
{
if (character <= MaxAsciiChar) {
const auto idx = mapped_find(Chars, CharRanges, character);
if (idx != -1) {
return (TokenType)(idx + (uint8_t)TokenType::BinaryOpStart);
}
return TokenType::TokenTypeLen;
}
const auto idx = mapped_find(UnicodeChars, UnicodeRanges, character);
if (idx != -1) {
return (TokenType)(idx + (uint8_t)TokenType::BinaryOp9);
}
return TokenType::TokenTypeLen;
}
/// Returns whether `character` is a valid operator character.
bool is_op_char(char32_t character)
{
if (character <= MaxAsciiChar) {
return find(Chars.begin(), Chars.end(), character) != Chars.end();
}
return find(UnicodeChars.begin(), UnicodeChars.end(), character) !=
UnicodeChars.end();
}
/// Lexer for binary operators.
///
/// @param ctx - The context to scan from.
/// @param immediate - Whether the lexer was called before any spaces were
/// scanned.
bool lex(Context& ctx, bool immediate)
{
if (!ctx.all_valid(BinaryOps)) {
return false;
}
enum class State {
Arrow,
Assignment,
Colon,
Equal,
MaybeArrow,
MaybeDotOp,
Regular,
};
auto state{State::Regular};
const auto first_character = ctx.lookahead();
auto result = classify(first_character);
if (result == TokenType::TokenTypeLen) {
return false;
}
switch (first_character) {
case '.':
ctx.consume();
state = State::MaybeDotOp;
break;
case '=':
ctx.consume();
state = State::Equal;
break;
case ':':
ctx.consume();
state = State::Colon;
break;
default:
state = State::Regular;
break;
}
while (is_op_char(ctx.lookahead())) {
switch (state) {
case State::MaybeDotOp:
switch (ctx.lookahead()) {
case '.':
state = State::Regular;
ctx.consume();
break;
default:
return false;
}
break;
case State::Assignment:
case State::Equal:
case State::MaybeArrow:
switch (ctx.lookahead()) {
case '>':
state = State::Arrow;
ctx.consume();
break;
default:
state = State::Regular;
}
break;
case State::Arrow:
case State::Colon:
case State::Regular:
switch (ctx.lookahead()) {
case '~':
case '-':
state = State::MaybeArrow;
break;
case '=':
state = State::Assignment;
break;
default:
state = State::Regular;
break;
}
ctx.consume();
break;
}
}
switch (state) {
case State::Arrow:
result = TokenType::BinaryOp0;
break;
case State::Assignment:
switch (first_character) {
case '<':
case '>':
case '!':
case '=':
case '~':
case '?':
break;
default:
result = TokenType::BinaryOp1;
}
break;
case State::Equal:
if (ctx.valid(TokenType::Equal)) {
result = TokenType::Equal;
}
break;
case State::Colon:
if (!ctx.valid(TokenType::Colon)) {
return false;
}
result = TokenType::Colon;
break;
case State::MaybeDotOp:
return false;
default:
break;
}
if (immediate) {
return ctx.finish(result);
}
switch (ctx.lookahead()) {
case ' ':
case '\n':
case '\r':
break;
default:
return false;
}
return ctx.finish(result);
}
} // namespace binary_op

Command call is a pain to parse, it's only recently that I managed to resolve the conflicts with unary & binary operators.

So far my grammar managed this suite (up until block statements): https://github.com/alaviss/tree-sitter-nim/blob/646fe39ba9850c26b7cbc64515af2116ff366f6f/corpus/expressions.txt

@aMOPel
Copy link
Contributor Author

aMOPel commented Mar 20, 2023

That's the huge effort I was talking about. Props.

@aMOPel
Copy link
Contributor Author

aMOPel commented Mar 21, 2023

Reading your operator lexer right now. If I understand correctly you also decided to not use the dotlikeop yet, since it needs a compiler flag and that's impossible to check for from tree-sitter. So the dotlikeop code in there is more like a placeholder for when it becomes the default, right?

Edit: Oh, but you are parsing dotlikeops differently in grammar.js.

Consider:

proc `.?`(a: int, b: int): int = a+b
let 
  a = 5
  b = 6
echo a.?b.float

with -d:nimPreviewDotLikeOps this prints 11.0, without it throws an error: Error: type mismatch: got <int, float>

So I believe it's more sensible for now to not treat dotlikeops differently.

@aMOPel
Copy link
Contributor Author

aMOPel commented Mar 21, 2023

I think your lex_long_string_quote is quite clever. Much smarter than my current solution, which tried to match the end of the multi line string, instead of quotes within.

@alaviss
Copy link
Owner

alaviss commented Mar 21, 2023

I parse dot-like ops by default since there's no reason not to. It's a pretty harmless part of the parser that I can toggle off if it becomes an issue.

As far as the grammar goes I try to parse a superset of Nim to simplify the process (hence the support for unicode ops without any experimental flags).

@aMOPel
Copy link
Contributor Author

aMOPel commented Mar 21, 2023

there's no reason not to

Thing is with the dotlikeops, it's no superset it's different behaviour, depending on the compiler flag.

@alaviss
Copy link
Owner

alaviss commented Mar 22, 2023

I looked further into this and looks like dot-like ops is slated for removal (from 2.0 changelog).

I misremembered the feature to be support for dot ops altogether, sorry.

Thanks for bringing my attention to this, I'll remove it from the grammar.

alaviss added a commit that referenced this issue Mar 22, 2023
This feature is gonna be removed, and it was not the correct way
to handle dot ops in the first place.

See #11 (comment)
@aMOPel
Copy link
Contributor Author

aMOPel commented Mar 22, 2023

Another thing I noticed:
your generalized_string_literal doesn't support accent_quoted identifiers.

@alaviss
Copy link
Owner

alaviss commented Mar 22, 2023

It doesn't seem to support accent_quoted from this snippet I tested on playground:

import macros

dumpTree:
  `foo`"str\"
/usercode/in.nim(4, 14) Error: closing " expected

@aMOPel
Copy link
Contributor Author

aMOPel commented Mar 23, 2023

Hm, you're right.

proc `>>`(a: string) = echo a

`>>`"hi"

compiles and prints hi, but

proc `>>`(a: string) = echo a

`>>`"h\ni"

prints

h
i

while

proc x(a: string) = echo a

x"h\ni"

prints h\ni as it should.

So the case with ticks should actually be parsed as a command call.

@firasuke
Copy link

firasuke commented Apr 8, 2023

I stumbled upon this issue after trying to get proper Nim support with nvim, and failing to do so which lead me to read the current situation over at nim-lang/RFCs#300 (comment).

Keep up the great work!

@alaviss
Copy link
Owner

alaviss commented May 1, 2023

Just a small update, the rewrite branch now have a pretty complete Nim grammar, but now I'm looking at yet another rewrite to reduce the parser size (which is 65MiB and takes 1mins to build, not to mention that some changes can explode the number of states to beyond what tree-sitter can support) and fix command call induced precedence quirks.

The better news is that the 2nd rewrite (not uploaded yet) is maintaining a more bareable 1.5MiB in size and I'm making a lot of headway on handling command call expressions and (hopefully) avoid invoking the GLR parser.

@firasuke
Copy link

firasuke commented May 1, 2023

Just a small update, the rewrite branch now have a pretty complete Nim grammar, but now I'm looking at yet another rewrite to reduce the parser size (which is 65MiB and takes 1mins to build, not to mention that some changes can explode the number of states to beyond what tree-sitter can support) and fix command call induced precedence quirks.

The better news is that the 2nd rewrite (not uploaded yet) is maintaining a more bareable 1.5MiB in size and I'm making a lot of headway on handling command call expressions and (hopefully) avoid invoking the GLR parser.

That's good to hear!

Any ideas why the size difference between the two implementations is this big?

@alaviss
Copy link
Owner

alaviss commented May 1, 2023

Any ideas why the size difference between the two implementations is this big?

The gap is closing, and closing too fast. Right now the next rewrite is already at 38MiB!

The good news is that I cracked command expression, so at least we will have something correct, even if it's insanely big...

From what I can tell, the size delta boils down to conflicts between expressions with : in them. Every time this sort of conflict happens, say if if expr:, the parser size explodes.

Here's an unfinished parser state count:

binary_expression                       3881
proc_expression                         1428
iterator_expression                     1428
func_expression                         1428
block                                   490
if_repeat1                              415
when                                    400
if                                      400
_line_statement_list_repeat1            377
_basic_command                          369
_basic_call                             368
_expression_no_unary_word               335
case_repeat1                            297
bracket_expression                      293
curly_expression                        293
except_branch                           292
dot_expression                          272
try_repeat1                             268
_unary_symbol_expression                266
elif_branch                             263
try                                     256
cast                                    252
case                                    240
do_block                                236
_line_statement_list                    232
else_branch                             181
curly_construction                      168
_call_block_arguments_repeat1           167
of_branch                               161
_expression_with_block_call             159
_unary_sigil_expression                 154
array_construction                      147
parenthesized                           147
tuple_construction                      147
finally_branch                          146
_call_block_expression                  144
_command_expression                     144
_call_expression                        144
_command_expression_arguments           144
_call_expression_arguments              143
ref_type                                133
distinct_type                           133
pointer_type                            133
out_type                                133
_unary_word_expression                  133
accent_quoted                           129
pragma_block                            126
colon_expression                        115
_basic_expression                       105
_block_statement_list                   100
equal_expression                        93
_call_block_arguments                   90
_interpreted_string_literal             84
tuple_type                              84
_parenthesized_argument_list            84
_call_with_block                        78
_pragma                                 66
_expression                             63
generalized_string                      63
_generalized_string_literal             63
array_construction_repeat1              53
accent_quoted_repeat1                   44
_long_string_body                       42
_long_string_literal                    42
_block_statement_list_repeat1           29
_statement                              26
statement_list                          25
tuple_construction_repeat1              25
_long_string_body_repeat1               22
_interpreted_string_literal_repeat1     22
_command_first_argument                 21
boolean_literal                         21
nil_literal                             21
string_literal                          21
enum_type                               21
object_type                             21
_argument_list                          9
_identifier_declaration                 7
tuple_deconstruct_declaration           5
symbol_declaration                      4
tuple_deconstruct_declaration_repeat1   4
symbol_declaration_list_repeat1         4
field_declaration_list_repeat1          4
expression_list_repeat1                 4
parameter_declaration_list_repeat1      4
symbol_declaration_list                 2
expression_list                         2
parameter_declaration_list              2
field_declaration_list                  2
source_file                             1
_literal                                0
_maybe_colon_equal_expression           0
_maybe_colon_expression                 0
_basic_sigil_expression                 0
var_type                                0
_symbol                                 0
_maybe_equal_expression                 0

Since the explosion is exponential, we will quickly approach tree-sitter state limit of 2^16, which will render a broken parser.

@alaviss
Copy link
Owner

alaviss commented May 2, 2023

I think I finally pinned the problem down: command expressions and conditionals are exploding the parser size.

So tree-sitter uses an LR(1) parser, which means it uses a flat state hierarchy for all possible sequences of tokens (ie. doesn't have the stack to "remember" the parent node).

Nim's grammar have a lot of right-associative nodes: if <expr>, case <expr>, etc. And for every possible combinations of these nodes, an LR state has to be made to capture the progress.

And to make the problem worse, in my "superset" grammar, some stuff like command expressions can nest if/try/case expressions which then can be placed in, say x: <type> which then exponentially increase the number of states required.

So the bad news is my approach is a dead end. Trying to make a "nice" and "flat" grammar won't cut it.

The good news is that one source of combinatorial explosion (command expression) have some pretty restrictive rules in the Nim grammar that I should be able to leverage, so that's progress.

@firasuke
Copy link

firasuke commented May 2, 2023

The good news is that one source of combinatorial explosion (command expression) have some pretty restrictive rules in the Nim grammar that I should be able to leverage, so that's progress.

Interesting, can you elaborate more on this?

@alaviss
Copy link
Owner

alaviss commented May 4, 2023

I'm proud to say that the rewrite branch is now a 41MiB parser that can completely parse koch.nim without errors and can actually be generated on a desktop computer.

The grammar is written to cover pretty much all of Nim so please test it if you're interested. When you find bugs, feel free to open issues for them, just remember to say that you're using the rewrite branch.

A lot of testing and cleanup still has to be done before this can replace main branch, and I'm sure that there are many cases where it doesn't parse correctly (some of it are in the tests but commented).

Right now there's a small blocker on generalized strings due to:
tree-sitter/tree-sitter#2236

I have learnt a whole lot about how tree-sitter splits/merges its states and managed to optimize the grammar by a fair bit to cut down on those, but there are more to be done in the future.

I'll save the write up about this part for a later date. For now I'll take a small break from this project.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants