
parser rewrite #47

Merged: 25 commits into main, Dec 5, 2023
Conversation

@psteinroe (Collaborator) commented Nov 14, 2023

What kind of change does this PR introduce?

A complete rewrite of the parser for increased

  • resilience
  • performance
  • maintainability
  • extensibility

Highlights

  • The lexer is now a separate module and emits all tokens, including whitespace, by merging the output of libpg_query's scan() method with a very simple custom lexer that extracts whitespace tokens (a sketch of this merge follows the list).
  • Statement parsing was rewritten to be resilient. Instead of regular expressions, we now use a simple LL parser: we check whether a new statement is starting by comparing the first few tokens, and once a statement has started, we walk all tokens until either a new statement starts, EOF is reached, or a ";" is hit. Tokens within sub-statements (enclosed by "(...)") are not tested against these conditions.
  • While valid statements are parsed with libpg_query, we can easily implement a custom resilient parser statement by statement for invalid ones.
  • Invalid statements are parsed "flat", meaning that we just open the node, apply all tokens, and close the node.
  • The parser for valid statements is now very performant. We turn the AST into an untyped tree structure where each node holds its list of properties. The parser then walks the tokens once, efficiently finds the next valid node, and opens and closes nodes accordingly.
  • The parser for valid statements is also "stable", meaning that it will only ever produce a valid CST or panic; no manual comparison is required anymore.
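To make the lexer merge concrete, here is a minimal sketch of the idea, assuming simplified token shapes; the names (whitespace_tokens, merge_tokens) and the field layout are illustrative, not the crate's actual API:

```rust
// Minimal sketch of the lexer merge, assuming simplified token shapes.
// The real lexer keeps libpg_query's token kinds; this stand-in only
// tracks text and byte offset.
#[derive(Debug, Clone)]
struct Token {
    text: String,
    start: usize, // byte offset into the source SQL
}

/// Extract the whitespace runs that libpg_query's scan() does not report.
/// Consecutive whitespace characters are grouped into a single token.
fn whitespace_tokens(sql: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut run_start: Option<usize> = None;
    for (i, c) in sql.char_indices() {
        match (c.is_whitespace(), run_start) {
            (true, None) => run_start = Some(i),
            (false, Some(s)) => {
                tokens.push(Token { text: sql[s..i].to_string(), start: s });
                run_start = None;
            }
            _ => {}
        }
    }
    if let Some(s) = run_start {
        tokens.push(Token { text: sql[s..].to_string(), start: s });
    }
    tokens
}

/// Merge the scanner's tokens with the whitespace tokens into one stream,
/// ordered by source position.
fn merge_tokens(mut scanned: Vec<Token>, mut whitespace: Vec<Token>) -> Vec<Token> {
    scanned.append(&mut whitespace);
    scanned.sort_by_key(|t| t.start);
    scanned
}
```

Because the custom lexer only needs to recover what scan() drops, it can stay trivial: a single pass that groups consecutive whitespace characters into one token, which also matches the later commit making the whitespace lexer consume consecutive characters as one.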

@psteinroe psteinroe marked this pull request as ready for review November 21, 2023 13:33
while !parser.eof() {
    match is_at_stmt_start(parser) {
        Some(stmt) => {
            statement(parser, stmt);
psteinroe (Collaborator, Author) commented:
Custom parsers can be added here later, and statement would just be the fallback.
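For readers following along, a hypothetical sketch of what a prefix-based is_at_stmt_start check can look like; the Parser stub, the nth helper, and the prefix table below are invented for illustration and do not mirror the actual implementation:

```rust
// Hypothetical sketch: decide whether a new statement starts at the
// current position by comparing the next few tokens against known
// statement prefixes. Names and the prefix table are illustrative.
#[derive(Debug, Clone, Copy, PartialEq)]
enum StmtKind {
    Select,
    Insert,
    CreateTable,
}

struct Parser {
    tokens: Vec<String>, // upcoming non-whitespace token texts
    pos: usize,
}

impl Parser {
    /// Peek at the n-th upcoming token, or "" past the end.
    fn nth(&self, n: usize) -> &str {
        self.tokens.get(self.pos + n).map(String::as_str).unwrap_or("")
    }
}

fn is_at_stmt_start(parser: &Parser) -> Option<StmtKind> {
    // Longer prefixes must come before shorter ones sharing a first token.
    const PREFIXES: &[(&[&str], StmtKind)] = &[
        (&["CREATE", "TABLE"], StmtKind::CreateTable),
        (&["INSERT", "INTO"], StmtKind::Insert),
        (&["SELECT"], StmtKind::Select),
    ];

    PREFIXES.iter().find_map(|(prefix, kind)| {
        prefix
            .iter()
            .enumerate()
            .all(|(i, kw)| parser.nth(i).eq_ignore_ascii_case(kw))
            .then_some(*kind)
    })
}
```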

        .collect()
}

fn custom_handlers(node: &Node) -> TokenStream {
psteinroe (Collaborator, Author) commented:

This is the only manual node-by-node implementation required for the parser.
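As a rough illustration of such a codegen function (assuming the proc-macro2 and quote crates; the node names, Token variants, and TokenProperty identifier in the quote! bodies are placeholders, not necessarily the repo's real identifiers):

```rust
// Illustrative only (not the repo's real identifiers): a codegen helper
// that emits manual token-handling code for the node kinds that need it,
// and nothing for everything else.
use proc_macro2::TokenStream;
use quote::quote;

struct Node {
    name: String, // e.g. "SelectStmt"
}

fn custom_handlers(node: &Node) -> TokenStream {
    match node.name.as_str() {
        // Hypothetical: SELECT statements always start with the SELECT token.
        "SelectStmt" => quote! {
            tokens.push(TokenProperty::from(Token::Select));
        },
        // Hypothetical: CREATE TABLE contributes two leading keyword tokens.
        "CreateStmt" => quote! {
            tokens.push(TokenProperty::from(Token::Create));
            tokens.push(TokenProperty::from(Token::Table));
        },
        // All other nodes fall through to the generated default handling.
        _ => quote! {},
    }
}
```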

- make whitespace lexer consume consecutive tokens as one
- merge String, Integer etc. nodes into their parent
- due to the aforementioned change, we won't have never-visited leaf
  nodes anymore and can skip the child check when closing leaf nodes
- instead of searching for a token in the entire token range, we now
  only search for it in the next n+m non-whitespace tokens, where n is
  the number of properties and m the number of tokens with just a single
  character (e.g. "(")
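A small sketch of that bounded lookahead, paraphrased from the commit message above; the function name and token representation are illustrative:

```rust
// Sketch of the bounded lookahead: search for `target` only within the
// next n + m non-whitespace tokens, where n is the node's property count
// and m the number of single-character tokens (e.g. "(").
fn find_token_within_window(
    tokens: &[String],     // upcoming tokens, in source order
    target: &str,          // token text we are looking for
    property_count: usize, // n: number of properties on the current node
) -> Option<usize> {
    // m: single-character tokens that may sit between here and the target.
    let single_char = tokens.iter().filter(|t| t.chars().count() == 1).count();
    let window = property_count + single_char;

    tokens
        .iter()
        .enumerate()
        .filter(|(_, t)| !t.chars().all(char::is_whitespace))
        .take(window)
        .find(|(_, t)| t.as_str() == target)
        .map(|(i, _)| i) // index into `tokens`, counting whitespace
}
```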
@psteinroe psteinroe merged commit c8f1708 into main Dec 5, 2023
1 check passed