Skip to content

Latest commit

 

History

History
548 lines (419 loc) · 9.47 KB

che-parser.md

File metadata and controls

548 lines (419 loc) · 9.47 KB

title: Learn peg.js for fun and profit author: name: sudodoki email: [email protected] github: sudodoki twitter: sudodoki theme: sudodoki/reveal-cleaver-theme output: index.html

--

--

Peg.js and custom grammars

<style> .reveal code { // font-family: "Joystix"; font-size: 42px; line-height: 46px; } </style>

--

@sudodoki

Sprint-42

Formerly RailsReactor

kottans.org

--

Task @ RailsReactor

Let user type SQL-like queries. <COLUMN> <KEYWORD> <ARGS>

--

Hm, and how do they actually do it?

** JS-devs & CS **

--

Cool old books

--

Talk overview

  • theory: common ground & vocabulary
  • PEG.js: not a silver bullet, but it works

--

Part I: theory

--

Working with text

  • structured – Parsing
  • unstructured – NLP

--

ALGOL 60

  • Spec comes out in 1960
  • Ned Irons publishes ALGOL parser, based on formal notation

--

String -> Tokens -> AST / Parse Tree

--

"ID = 3"

[<COLUMN, ID> <KEYWORD, EQL> <ARG, 3>]

--

Parsing application

  • languages / dsl / data format / custom protocol
  • data mining / parsing
  • reading language specs --

Formal Grammar

  • T – terminals
  • N – non-terminals
  • S – root
  • R – production rules

--

BNF

Backus–Naur form

Letter = 'A' / 'B' / … / 'Z'
Digit = [0-9]
Integer = Digit Integer / Digit

--

Same thing, different production rules

  • <Name> = [A-Z] [a-z]+
  • <Name> = [A-Z] <Letters>
  • <Name> = [A-Za-z] <Name>

--

Tokens – product of lexical analysis

<name, attribute>

  • <COLUMN, "Id">
  • <KEYWORD, "=">
  • <ARG, 1>

--

Terminals <=> Tokens

--

Let's talk only context-free grammars

--

Building parse trees

  • Bottom-up
  • Top-down
  • [Universal]

--

Sample grammar

Start = Term
Digit = [0-9]
Sign = '-' / '+'
Term = Digit Sign Term / Digit

--

Sample tree

2 + 3 - 4

--

Bottom-up

--

Top-down

--

Vanilla JS Recursive descent~ish

--

let result = [];
let curIndex = 0;
const descend = (input) => {
    term(input);
    return result;
}

--

const term = (input) => {
 result.push(['term']);
 digit(input);
 if (input[curIndex + 1]) {
   sign(input);
   term(input);
 }
}

--

const sign = (input) => {
 const curChar = input[curIndex];
 if (isSign(curChar)) {
  result.push(['sign', curChar])
  curIndex += 1;
 } else {
  throw new 
   SyntaxError(`${curChar||'∅'}\
instead of sign`); } }

--

const isSign = (c) =>
  ['+', '-'].includes(c)
const isDigit = (c) =>
  '0123456789'
    .split('')
    .includes(c);

--

const parsed = JSON.stringify(descend("2+3-4"))
console.log('parsed: ', parsed)
// parsed:  [["term"],["digit","2"]
// ["sign","+"],["term"],["digit","3"]
// ["sign","-"],["term"],["digit","4"]]

--

Backtracking

vs

Lookahead

--

Parser approaches

--

Other aspects

--

Part II: practice

--

PEG.js as representative of PEG

--

When to build a parser

  • when simple regexp is not good enough
  • keeping some context / etc
  • loose syntax / casing

--

npm i pegjs chokidar-cli
// in package.json scripts
chokidar 'parens.pegjs' 
  --initial 
  --silent 
  -c 'pegjs -o
    parens-parser.js
    parens.pegjs'

--

Aka 'don't bother with installation'

Basic parsing expressions

  • Terms / Non-terms
  • "literals" / 'literals'
  • . - any char
  • [a-z] [0-9] [\u00C0-\u10FFFF] char ranges
  • * / + / ? / ()

--

Simple parens grammar

start = group
lpar = '('
rpar = ')'
nothing = ''
group = 
    lpar group rpar
    / group group
    / ''

--

--

--

--

Going from this…

group = 
    lpar group rpar
    / group group
    / ''

--

to this

group = lpar rest
  / lpar group rpar
  / ''
rest = ')' group
  / ')'

--

Run action

start = FLOAT
SIGN = "-" / "+"
DIGIT = [0-9]
INTEGER = 
  d:DIGIT int:INTEGER
  { /* other side effect */
    return d + int }
  / d:DIGIT 
  { return d }

--

Helpers

JS code at top of grammar file

{
const notNull = (thing) =>
  thing != null
}

--

Options / context

parser.parse(string, options);

--

Predicate

FLOAT = sign:SIGN?
  whole:INTEGER? "."?
  fraction:INTEGER? &{
    return notNull(whole)
    || notNull(fraction) 
  }
  { return { sign,
             whole,
             fraction } }

--

Thinking about your grammar

  • case sensitivity
  • loose syntax
  • distinguishing lexemes
  • AST

--

Avoiding ambiguity

  • in terms of building parse tree
  • in terms of deciding token's type

--

Two cases

"ID" >= (20)
ID less than 20

--

How to tackle Precedence

// level2 * /
// level1 + -
start = level1
level1 =
    level2 ([+-] level2)+
    / level2
level2 =
    level0 ([*/] level0)+
    / level0
level0 = number / "(" level1 ")"
number = [0-9]+

--

Coming up with AST

  • Sequential
  • Non-sequential

--

Sample AST

{
column: { name, location, type }
keyword: { value, location }
argument: [ { value, location }]
}

--

Non-sequential

{
 complete: true,
 sequence: [
  { column: { name, location, type }},
  { whitespace: true },
  { keyword: { value, location, type }},
  { whitespace: true },
  { argument: [{ value, location }]}
 ]
}

--

Location

(start / end)

  • offset
  • line
  • column

--

Validation

  • Syntax error
  • Semantic error

--

Parsing-on-the-go

ID =

is a valid expression

--

--

Possible solutions?

  • ditching peg.js
  • using predicates and complete / non-complete attrs
  • swallowing parse errors and trying to figure out how lethal it is
  • incorporating 'incomplete' grammar

--

progressiveExpression =
  column:column whitespace keyword:keyword whitespace* argument:inprogArgumentList
  { return { complete: false, sequence: [makeColumn(column), makeWhitespace(),
             makeKeyword(keyword), makeWhitespace(), makeArgument(argument) ] }}
  / column:column whitespace keyword:keyword whitespace+
  { return { complete: false, sequence: [makeColumn(column), makeWhitespace(), 
             makeKeyword(keyword), makeWhitespace()] }}
  … 3 redacted
  / column:column { return { complete: false, sequence: [makeColumn(column)] }}
  / inprogColumn  

--

Syntax errors and messages

inprogEntity '' = …

Prompting is basically

  • manipulating AST
  • stringifying

--

Ignoring case

equalOp = ['equal']i 's'i?

--

What would I redo?

--

Alternatives

--

Further reading

--

Questions?

Book thingy