Skip to content

Latest commit

 

History

History
1962 lines (1483 loc) · 54.6 KB

tutorial.org

File metadata and controls

1962 lines (1483 loc) · 54.6 KB

SMUG Tutorial : Literate

This tutorial started as a translation of Monadic Parser Combinators [1], or at least the first half, into common lisp.

The example code in the following document is completely self-contained, and does not require an installation of the SMUG library.

No prior experience with functional programming, monads or recursive descent parsing is assumed. The only requirements are a common lisp environment, and a burning desire to find a better way to parse.

Introduction

This tutorial, like this library, is based on an approach to building parsers using higher-order functions (combinators) that is popular in the functional programming community. Incidentally, these parsers form an instance of something called a monad, which is itself a useful construct with implications beyond parsing.

With great debt to Monadic Parser Combinators [1], the paper from which this library is derived, this tutorial presents a step by step introduction to the topics of parser combinators and monads and their use in common lisp.

Common Lisp

In some cases, the natural name for a parser conflicts with a name in the COMMON-LISP package. Rather then shadow the symbols, I’ve chosen to prefix all parser names with a #. character. It is thought that this aids usability, as one can simply (:use :smug/tutorial). It also helps to distinguish parser returning functions from other functions.

How To Combine Parsers

A Parser for Things is a function from Strings to Lists of Pairs of Things and Strings!

– Fritz Ruehr, Willamette University [2]

A parser is something that is familiar to all programmers… a function that, given a series of tokens as input, produces a data structure that relates to the grammatical structure of the input in some way. Or, to put it simply, a function from strings to things.

;; our fictional parser matches the string "string" 
;; and returns a SYMBOL thing    
≻ (parse-thing "string") ≕≻ THING

In order to combine simple parsers into larger more complex ones, they need a way to communicate between them. First, because any given parser might consume only a part of the input, we’ll have our parser return a CONS with the result in the CAR and the remaining input in the CDR.

≻ (parse-thing "string string") ≕≻ (THING . " string")

Because a parser may return multiple results when the grammar is ambiguous, or may return no results all, we’ll put our conses in a list, and have the empty list, NIL, denote a failed parse.

≻ (parse-thing "string string")  ≕≻ ((THING . " string"))
≻ (parse-thing "strong string")  ≕≻ NIL

So, for our purposes, a parser is just a FUNCTION that takes a single value as the input and returns a LIST of CONS’s of results and unconsumed input.

It is this trivial protocol that allows us to combine small simple parsers into larger more useful ones.

Reading Input

Smug parsers allow infinite look-ahead and backtracking. To support parsing many different things, it’s useful to define an input protocol. Smug parsers only require three operations on input : INPUT-FIRST, INPUT-REST and INPUT-EMPTY-P.

We’ll define them in terms of strings. It serves our purposes and makes for a nice visual presentation.

(defgeneric input-empty-p (input)
  (:method ((input string)) (zerop (length input))))

(defgeneric input-first (input)
  (:method ((input string)) (aref input 0)))

(defgeneric input-rest (input)
  (:method ((input string))
    (multiple-value-bind (string displacement) 
        (array-displacement input)      
      (make-array (1- (length input))
                  :displaced-to (or string input)
                  :displaced-index-offset (1+ displacement)
                  :element-type (array-element-type input)))))
≻ (input-empty-p "") ≕≻ t
≻ (input-empty-p "foo") ≕≻
≻ (input-first "foo") ≕≻ #\f
≻ (input-rest "foo") ≕≻ "oo"

The Three Primitive Parsers

There are 3 simple primitive parsers. It it only necessary to understand them, and one sequencing combinator, .BIND, to understand all of SMUG/TUTORIAL.

.IDENTITY

The first parser is .IDENTITY, which always succeeds by returning the value passed to it, and does not consume any input. Because we’ve earlier defined parsers as functions that take a single argument we’ll make a curry[5] with the input parameter.

(defun .identity (value)
  (lambda (input)
    (list (cons value input))))
(test> (funcall (.identity :foo) "bar baz")
       => ((:foo . "bar baz")))

.FAIL

The second parser, .FAIL, is the inverse behaviour of .IDENTITY. It simply fails regardless of the input. we could define .FAIL as a function that takes a single argument, but then we’d have to access it using FUNCTION (#’), and aesthetically that inconsistency is undesirable, so we’ll again make curry with the input parameter.

(defun .fail ()
  (lambda (input) (declare (ignore input)) nil))
(test> (funcall (.fail) "foo") => NIL)

.ITEM

The last true primitive is .~.ITEM~, which is a parser that consumes the first token in the input, or fails in the input is empty.

(defun .item ()
  (lambda (input)
    (unless (input-empty-p input)
      (list (cons (input-first input)
		  (input-rest input))))))
(test> (funcall (.item) "foo") 
       => ((#\f . "oo")))

(test> (funcall (.item) "") 
       => NIL)

RUN, PARSE: FUNCALL and CAAR in disguise

All the primitives return a FUNCTION that must be FUNCALL‘ed with INPUT in order to run the parser.

There are many reasons to define a RUN function. We can CL:TRACE it, or change the input parameter TYPE, or change what the actual primitives return.

(defun run (parser input)
  (funcall parser input))

The RUN function returns the entire parse tree. Most of the time we simply want the CAR of the FIRST result. The CDR is the leftover input, and the REST of the result alternative outcomes. We might want these as well, so we return that as VALUES.

(defun parse (parser input)
  (let ((result (run parser input)))
    (when result 
      (destructuring-bind ((result . input) &rest rest)
          result      
        (apply #'values result input rest)))))

.BIND: Our first primitive combinator

Now that we have our primitive parsers, we need a way to combine them. We’d like to be able to apply parsers in sequence, and it would also come in handy if we could give names to the intermediate results of parsers. Both these requirements are fulfilled by using the monadic sequencing operator, .BIND.

.BIND is a function that takes as arguments a parser P, and a function F which take a value and returns a parser P2. .BIND returns a parser that first applies P to the input, returning a list of (VALUE . INPUT) pairs. The the function F is applied to each VALUE, and the result P2 then applied to the INPUT. The collected lists of pairs returned from the P2’s are then concatenated and the result returned.

(defun .bind (parser function)
  (lambda (input)
    (loop :for (value . input) :in (run parser input)
          :append (run (funcall function value) input))))
(let ((char-token
       (.bind (.item) 
             (lambda (char) 
               (.identity (list :char char))))))           
  (run char-token "foo"))
;; ~> (((:CHAR #\f) . "oo"))

Because .BIND itself returns a parser, the result of a .BIND can be returned as P2. This allows parsers to be chained, and allows us to use LAMBDA to provide names for the values of parser results. For example, the following parser uses .BIND to return the first two characters as a cons.

(let ((two-chars 
       (.bind (.item) 
	     (lambda (char) 
	       (.bind (.item) 
		     (lambda (char2) 
		       (.identity (cons char char2))))))))
  (run two-chars "asd"))
;;=> (((#\a . #\s) . "d"))

The next section gets into some details about why our parser is a monad. You don’t really need to know this, so feel free to skip it if you’re in a hurry.

A quick word on monads

By virtue of having the functions .BIND and .IDENTITY defined as they are, our parser interface forms a monad. A monad is, essentially, a category of things that provide the functions .BIND and .IDENTITY.

Of course, just having functions called .BIND and .IDENTITY does not a monad make. There are other contracts that .BIND (also known as pipe, >>~, *, or let) or .IDENTITY (aka result, lift, unit, return) must fulfill.

The monad laws

In order to be properly categorized as a monad, the thing providing a definition for .BIND and .IDENTITY must obey three laws (a static functional programmer would say ‘must have a certain type’, but the word type means something different to a dynamic functional programmer, so we’ll avoid it here)

In order to describe those laws we need to define a few terms

Monadic Value (MV)
a function that, given a value, returns a

value in the form expected by the internals of .BIND. In our examples above, a parser (taking an input and returning a list of results) is the Monadic Value.

Monadic Function (MF)
A function that, given a value returns

a monadic value encapsulating that value. .IDENTITY is the canonical Monadic Function

In Object-Oriented terms, the MF is a constructor, and the MV an object.

The laws which all things must obey in order to be called a monad are simple :

“Left identity”
(bind (result x) MF) = (funcall MF x)
“Right identity”
(bind MV result) = MV
“Associativity”
(bind (bind MV MF) MF2)

= (bind MV (lambda (x) (bind (MF x) MF2)))

With static type systems, the compiler will enforce this contract for you. In a dynamic system, we just need to be a little more careful. Proving the monad laws for our .BIND and .IDENTITY is left as an exercise.

That’s really all there is to monads except for syntax, which we’ll get to later. There are extended laws that other monads obey, and monads have other uses beyond parsing, but we’re reaching the end of our scope already.

.SATISFIES : the parser predicate

Often, we only want to consume input if a certain condition is true. This where .SATISFIES comes in.

(defun .satisfies (predicate &rest args)
  (.bind (.item) 
	(lambda (x) 
	  (if (apply predicate x args)
	      (.identity x)
	      (.fail)))))
(run (.satisfies #'digit-char-p) "1 and") 
;;=> ((#\1 . " and"))

If .ITEM fails, so will the .SATISFIES parser. This is because (bind (fail) MF) will always fail. .FAIL, also known as .ZERO, is a function belonging to a category of monads knows as “monads with a zero”. That’s not terribly important for parsing, but interesting if you’re into that sort of thing.

.IS and .IS-NOT

Imagine we need to parse all characters that come before a #\;. The simple way is to have a function that uses CL:NOT.

(.satisfies 
 (lambda (item)
   (not (char= #\; item))))

It turns out that (.satisfies (lambda (i) (not ...))) is quite common, so we define a parser that has a shorter and more relevant name.

(defun .is-not (predicate &rest args)
  (.satisfies (lambda (i) 
                 (cl:not (apply predicate i args)))))

This makes things a lot shorter and easier to read.

(test> 
 (run (.is-not #'char= #\;) "foobar;%^*&")       
 => ((#\f . "oobar;%^*&")))
(test> 
 (run (.is-not #'char= #\;) ";%^*&")
 => NIL)

For that matter, now that we have .IS-NOT, .SATISFIES is a bit long, and does not prefix -NOT . So we type a few keys in order to save a bundle in the future.

(defun .is (predicate &rest args)
  (apply #'.satisfies predicate args))

Example Parsers for letters and numbers using .SATISFIES

.SATISFIES allows us to DEFUN some simple parsers

(defun .char= (x)
  (.is #'cl:char= x))

(defun .digit-char-p ()
  (.is #'cl:digit-char-p))

(defun .lower-case-p ()
  (.is #'cl:lower-case-p))

(defun .upper-case-p ()
  (.is #'cl:upper-case-p))  
(run (.char= #\x) "xyzzy") ;=> ((#\x . "yzzy"))
(run (.digit-char-p) "1234") ;=> ((#\1 . "234"))
(run (.lower-case-p) "abcd") ;=> ((#\a . "bcd"))
(run (.upper-case-p) "Abcd") ;=> ((#\A . "bcd"))  
(run (.upper-case-p) "doh!") ;=> NIL

.PLUS, the non-deterministic choice combinator

If we want to combine our earlier parsers, say to create an ALPHANUMERIC-CHAR from UPPER-CASE-P and LOWER-CASE-P we need a combinator capable of making the choice between them.

In some cases, it may not be an exclusive choice. There might be multiple ways to parse a string, or a later pass might resolve the ambiguity.

For example, in one of our earlier examples of .BIND, we saw a parser that returned the first two characters in a stream. This parser will fail if there is only one character left in the input.

(let ((two-chars 
       (.bind (.item) 
	     (lambda (char) 
	       (.bind (.item) 
		     (lambda (char2) 
		       (.identity (cons char char2))))))))
  (funcall two-chars "a"))
;;=> NIL

If we want to parse one or two characters, or an arbitrarily long series of characters, we need some a way to express that.

Enter the .PLUS combinator.

(defun .plus (first-parser second-parser)
  (lambda (input)
    (append (funcall first-parser input) (funcall second-parser input))))
(let ((two-chars 
       (.bind (.item) 
             (lambda (char) 
               (.bind (.item) 
                     (lambda (char2) 
                       (.identity (cons char char2))))))))
  (funcall (.plus two-chars (.item)) "a") 
  ;;=> ((#\a . "")) 
  (funcall (.plus two-chars (.item)) "asd")
  ;;=> (((#\a . #\s) . "d") (#\a . "sd"))
  )

Note that the second parse returned two pairs, as both parsers were successful. The string parsed as both two chars and a single item.

Example parsers using PLUS

The examples used in the original paper[1] are for letters and alphanumeric characters. There’s no good reason to use them over /(.is #’alpha-char-p)/and the like, but they do serve as simple example.

(defun letter () (plus (lower-case-char) (upper-case-char)))

(funcall (letter) "foo") => ((#\f . "oo"))
(funcall (letter) "1foo") => NIL

(defun alphanumeric () (plus (letter) (.digit-char)))

(funcall (alphanumeric) "1foo") => ((#\1 . "foo"))
(funcall (alphanumeric) "!1foo") => NIL

The other example is more illustrative, a parser that returns a series of letters or the empty string.

(defun word ()
  (let ((non-empty-letters 
	 (bind (letter) 
	       (lambda (first-letter) 
		 (bind (word)
		       (lambda (rest-of-letters)
			 (result (format nil "~A~A" 
					 first-letter
					 rest-of-letters))))))))
    (plus non-empty-letters (result ""))))

(funcall (word) "asd")
=>
(("asd" . "") ("as" . "d") ("a" . "sd") ("" . "asd"))

This is our first recursive parser, but it’s a common idiom. Notice that it returns all the possible strings of letters.

This is obviously inefficient when one only requires the first value. required, a deterministic combinator .OR, will be introduced later in the tutorial.

Efficiency

.FIRST is the real choice when it comes down to it, as .PLUS really does matter.

(defun .first (parser)
  (lambda (input)
    (let ((results (run parser input)))
       (when results (list (cl:first results))))))

Explain more about .FIRST

Syntax : LET* and the identity monad

If you read the earlier section on monads, you’d know that .BIND and .IDENTITY are the interface to many different types of monads, of which our parser is but one example. If you didn’t, you know now. Again, if you’re not at all interested and really just want to keep on parsing, skip down to the macro.

The most basic monad is the identity monad. A definition of its .BIND and .IDENTITY might look like the following.

(defun i-bind (mv mf) (funcall mf mv))
(defun i-result (value) value)

In Lisp, the identity monad is so trivial as to be useless. In a functional programming language, or any language where the order of operations is not guaranteed, the identity monad serves to sequence operations.

Imagine a silly lisp where the order of evaluation isn’t defined as strict left to right[3]. The following form could have disastrous consequences.

(progn (remove-gun-from-pants)
       (point-gun-at-bad-guy)
       (pull-trigger))

The identity monad makes the sequencing explicit. In a purely functional lisp, one might sequence the operations as follows.

(i-bind (remove-gun-from-pants) 
      (lambda (gun)
	(i-bind (point-gun-at-bad-guy gun)
	      (lambda (pointed-gun)
	      (i-bind (pull-trigger pointed-gun)
		      (lambda (fired-gun)
			(i-result fired-gun)))))))

In functional programming languages this pattern is so common that there is special syntax for it. The usual choices are ‘do notation’ or ‘list comprehension syntax’.

First, the previous example rendered in list comprehension notation :

[fgun | gun <- removeGun 
      , pgun <- pointGunAtBadGuy gun
      , fgun <- pullTrigger pgun] 
        

And in do notation :

do 
  gun <- removeGun 
  pgun <- pointGunAtBadGuy
  fgun <- pullTrigger pgun
  return fgun

The astute lisper might notice that do notation looks a lot like LET*. In fact, that’s really all it is. LET* is lisp syntax for the identity monad, and our i-bind using forms above are directly translatable.

(let* ((gun (remove-gun-from-pants))
       (pointed-gun (point-gun-at-bad-guy gun))
       (fired-gun (pull-trigger pointed-gun)))
  (identity fired-gun))

One could legitimately say that the common lisp package is an instance of the identity monad, if one cared for such insights.

.LET*, our version of LET* like do notation

A LET* like construct is the obvious notation for a lisper to take advantage of the monadic nature of parsers. It’s often useful to ignore a value. In haskell, the underscore character is used to denote an ignorable variable, so we’ll use the same convention.

(defmacro .let* (bindings &body body)
  (if bindings
      (let ((symbol (first (first bindings))))
        `(.bind ,@(cdr (first bindings))
               (lambda (,symbol)
                 ,@(when (or (string-equal (symbol-name symbol) "_")
                             (null (symbol-package symbol)))
                         `((declare (ignorable ,symbol))))
                 (.let* ,(cdr bindings)
                   ,@body))))
      `(progn ,@body)))
(funcall (.let* ((a (.identity 1)))
           (.identity a)) "")

If we replace .BIND with our I-BIND function above, we get a macro that is equivalent to LET*. .LET* binds the results of parsers, and is a much nicer way to work over nesting .BIND’s.

Examples using .LET*

.PROGN, .PROG1, .PROG2

Using .LET*, we can implement the macros .PROGN (which is similar .AND because it will fail when the parser does), .PROG1 (which comes in handy for matching things and the end of the line, or when there is no more input) and .PROG2, which as we will see is also quite useful.

(defmacro .progn (&rest parsers)
    (if (rest parsers)
        (let ((name (gensym)))
          `(.let* ((,name ,(first parsers)))
             (.progn ,@(rest parsers))))
        (first parsers)))

(defmacro .prog1 (parser &rest parsers)
  (let ((name (gensym))
        (ignore (gensym)))
    `(.let* ((,name ,parser)
             (,ignore (.progn ,@parsers)))
       (.identity ,name))))

(defmacro .prog2 (parser1 parser2 &rest parsers)
  (let ((name (gensym))
        (ignore (gensym)))
    `(.let* ((,ignore ,parser1)
             (,name ,parser2)
             (,ignore (.progn ,@parsers)))
       (.identity ,name))))

.CONS

(defun .string= (string)
  (if (string= string "")
      (.identity nil)
      (.let* 
          ((_ (.is 'char= (aref string 0)))
           (_ (.string= (subseq string 1))))
        (.identity string))))
(run (.string= "asdf")  "asdfjkl") => (("asdf" . "jkl"))
(run (.string= "asdf")  "asd") => NIL

Once can see how much nicer .LET* notation is, and also how the ignorable _ comes in handy.

.STRING=

Using recursion like we did in our WORD parser, we’ll create a parser that matches a specific string.

(defun .string= (string)
  (if (string= string "")
      (.identity nil)
      (.let* 
          ((_ (.is 'char= (aref string 0)))
           (_ (.string= (subseq string 1))))
        (.identity string))))
(run (.string= "asdf")  "asdfjkl") => (("asdf" . "jkl"))
(run (.string= "asdf")  "asd") => NIL

Once can see how much nicer .LET* notation is, and also how the ignorable _ comes in handy.

.MAP : The repetition combinator

Earlier, we defined a parser, .WORD, using .BIND and a recursive call. Lets define a similar parser using .LET* that returns a list of letters.

(defun .letters ()
  (.plus (.let* ((x (.letter))
               (xs (.letters)))
         (.identity (cons x xs)))
       (.identity nil)))

This pattern can easily be abstracted into a more general combinator, .ZERO-OR-MORE

.ZERO-OR-MORE

(defun .zero-or-more (parser)
  (.plus (.let* ((x parser)
                 (xs (.zero-or-more parser)))
           (.identity (cons x xs)))
         (.identity ())))
(test> 
 (run (.zero-or-more (.char= #\a)) "aaaab"
 =>  (((#\a #\a #\a #\a) . "b") ((#\a #\a #\a) . "ab") ((#\a #\a) . "aab")
      ((#\a) . "aaab") (NIL . "aaaab")))

(test> 
 (run (.zero-or-more (.char= #\a)) "bbbba")
 =>
 ((NIL . "bbbba")))

Note that zero or more always succeeds. If one needs a parser that matches one or more items and fails otherwise, we can define one in terms of ZERO-OR-MORE, can call it, appropriately enough, ONE-OR-MORE.

.ONE-OR-MORE

(defun .one-or-more (parser)
  (.let* ((x parser)
	  (y (.zero-or-more parser)))
    (.identity (cons x y))))

(test> (funcall (.one-or-more (.char= #\a)) "aaaab")
  =>
 (((#\a #\a #\a #\a) .many "b")))

(test> (funcall (.one-or-more (.char= #\a)) "bbbba")
   => NIL)

We could now define a TWO-OR-MORE and THREE-OR-MORE etc., but it is likely better to define a function to rule them all. It needs a base to rule from.

Function .MAPC, .MAPCAR

Syntax:

.mapc parser => parser

.mapcar parser => result-list

Arguments and Values:

parser
The parser that is attempted
result-list
a list

Description:

The mapping operation involves attempting parser many times. Except for .mapc and .mapl, the result contains the results returned by the parser.

.MAPCAR operates on successive results of parser. The iteration terminates when the parser fails. The value returned by mapcar is a list of the results of parser

(defun .mapcar (parser)
    (.plus (.let* ((x parser)
                   (xs (.mapcar parser)))
             (.identity (cons x xs)))
           (.identity ())))

.MAPC is like .MAPCAR except that the results of applying function are not accumulated. The parser argument is returned as a result.

(defun .mapc (parser)
    (.plus (.let* ((_ parser)
                   (_ (.mapc parser)))
             (.identity parser))
           (.identity parser)))

Examples:

(test> (parse (.prog1 (.mapcar (.item))
                       (.char= #\!))
               "Yay!")
  => (#\Y #\a #\y))
 
(let ((/parser/ (.item)))
  (test> (parse (.let* ((parser (.prog1 (.mapc /parser/)
                                        (.char= #\!)))
                        (char parser))
                  (.identity (cons (eq parser /parser/)
                                 char)))
                "Holy Guacamole!?")
         =>(T . #\?)))                       

Function .MAKE-LIST, .MAKE-SEQUENCE

Often, we know what we want, and how many we want, all at once. So, just like CL:MAKE-LIST, we take a number.

Syntax:

.make-list size &key initial-element => list

.make-sequence result-type size &key initial-element => list

Arguments and Values:

size
a non-negative integer.
initial-element
a parser. The default is (.item).
list
a list.

Description:

Returns a list of length given by size, each of the elements of which is a successful run of the initial-element parser.

(defun .make-list (size &key (initial-element (.item)))
  (if (zerop size) 
      (.identity nil)
      (.let* ((first initial-element)
              (rest (.make-list (1- size) 
                                :initial-element initial-element)))
        (.identity (list* first rest)))))               
(defun .make-sequence (type length &key (initial-element (.item)))
  (.let* ((list (.make-list length :initial-element initial-element)))
    (.identity (coerce list type))))               

Function .CONCATENATE

There are often mulitple parsers that are run one after another, and the results joined together at the end. .CONCATENATE takes care of that.

(defun .concatenate (output-type-spec &rest parsers)
  (if (not parsers)
      (.fail)
      (.let* ((first (first parsers))
              (rest (if (rest parsers)
                        (apply 
                         #'.concatenate output-type-spec (rest parsers))
                        (.identity nil))))
        (.identity (cl:concatenate output-type-spec first rest)))))

.MAP : ONE FUNCTION TO RULE THEM ALL!

In the end

The :AT-LEAST keyword solves the “how many do we want to start with?”. We have decided on 1 as the default, as most of the time we do want the parser to succeed, and besides that, .OPTIONAL is a better way of saying :at-least 0.

So, we can specify the .MAP parser as follows.

(defun .map (result-type parser
             &key 
               (at-least 1))
  "=> a ~result-type~ of /parser/ results."
  (.let* ((list-1 (.make-list at-least :initial-element parser))
          (list-2 (funcall (if result-type #'.mapcar #'.mapc) parser)))
    (.identity (when result-type (concatenate result-type list-1 list-2)))))

Examples using .MAP

Let’s make a parser for standard quoted strings. We’ll use the #' character as the quotes, and the #\| character as the escape character, simply to make it easier to embed in our example text in common lisp strings.

(defun .quoted-string (&key (quote #\')
                         (escape #\|))
  (.let* ((_ (.char= quote))
          (string 
           (.map 'string 
                 (.plus (.let* ((_ (.char= escape)))
                          (.item))
                        (.is-not 'char= quote))))
          (_ (.char= quote)))
    (.identity string)))

(The quote char is ’ and the escape char is. . ”)

.OR, .NOT, and .AND : deterministic logic combinators

.OR

.OR is a deterministic .PLUS. It take any number of parsers. The first parser is run, and if it succeeds, evaluation short circuits and the result of the parser is returned. Otherwise, the next parser is run, and so on, until one succeeds or there are no more parsers.

We can’t use .BIND or .LET* for .OR because it would fail if one of its parsers fails. As such, .OR must be a primitive.

(defun .or (parser &rest parsers)
  (lambda (input)
    (or (funcall parser input) 
	(when parsers 
	  (funcall (apply #'.or parsers) input)))))

.NOT

Similarly, .NOT, which continues parsing only when the parser fails, is primitive as well.

(defun .not (parser)
  (lambda (input)
    (let ((result (funcall parser input)))
      (if result
	  nil
	  (list (cons t input))))))

.AND

On the other hand, .AND can be defined in terms of IF, and doesn’t even need to test for failure, as .BIND handles failure automatically.

.AND (known as ‘>>’ in haskell) sequentially composes parsers, discarding the results of all but the last one, and returning that result.

(defun .and (p1 &rest ps)
  (.let* ((result p1))
    (if ps
	(apply #'.and ps)
	(.identity result))))

Examples using .OR, .NOT, and .AND

.NO-MORE-INPUT

Now that we have .NOT, we can specifically test for failure rather than abort the parse entirely. since the primitive parser .ITEM only fails when the input is empty, we can define .NO-MORE-INPUT by negating it.

(defun .no-more-input ()
 (.not (.item)))

.OPTIONAL

The OPTIONAL combinator, which allows a parser to fail and still continue, is a natural use of .OR.

(defun .optional (parser)
  (.or parser (.identity nil)))

Finally, using .OR, .AND and .NOT, we can make parser versions of the lisp conditionals we all know and love.

(defun .if (test-parser then-parser 
            &optional (else-parser (.fail)))
  (let ((no (gensym)))
    (.let* ((no? (.or test-parser (.identity no))))
      (if (not (eq no? no))
          then-parser
          else-parser))))

(defun .when (test-parser then-parser)
   "we define .when in terms of .IF, but it's really just .AND again"
  (.if test-parser then-parser))

(defun .unless (test-parser then-parser)
   "defined in term of .when, even though it's just (.AND (.NOT ...))"
  (.when (.not test-parser) then-parser))

.READ-LINE: #\Newline is a decent break up

Lines are quite important for lexing/parsing most text files. For everything before the last line it is very simple, but though #\Newline is a decent break up, it does not always end that way.

(defun .read-line-newline ()
  (.let* ((line (.optional (.map 'list (.is-not #'char= #\Newline))))
          (newline (.is #'char= #\Newline)))
  (.identity (concatenate 'string line (string newline)))))

The issue is that the last “line” has text but does does not end with a #\Newline, and the first line does not start with one. Beyond that, for testing we often want to try with only one “line” and most likely without the newline at all.

In other words, if we want the entire thing as a list of lines, this works, as the string ends with a #\Newline

(smug/tutorial:≻ 
 (parse (.map 'list (.read-line-newline))
        "foo
bar
")
 :=> ;; Which gives us :
 ("foo
"
  "bar
"))
  

If there is no newline at the end, it does not.

(smug/tutorial:≻ (parse (.map 'list (.read-line-newline))
                  "foo
  bar")
   ;; Which gives us :
         => ("foo
  "))   

In the latter case, there is no “bar” at the end, which makes it almost not worth the journey there.

As luck would have it, there is a .NOT which will help us tie up this particular line.

(defun .last-line ()
 (.prog1 (.map 'string (.is-not #'char= #\Newline))
         (.not (.item))))
(test> (parse (.last-line) "bar")
     => "bar")

(test> (parse (.last-line) "bar
")
     => "bar")

For every line save for the last will fail for .LAST-LINE, but every line save for an empty one will succeed until the end of that parser, so using .PLUS would be a waste of CPU time.

(defun .line ()
  (.plus (.read-line-newline)
         (.last-line)))  

Using .OR to run the last only when the first fails works great, but that still means but in the end there is a better option. We combine them both into one function.

Function .READ-LINE
(defun .read-line (&optional 
                     (eof-error-p t)
                     eof-value)
  (.let* ((text (.optional 
                 (.first (.map 'list (.is-not #'char= #\Newline)))))
          (newline (.or (.char= #\Newline)
                        (.and (.not (.item)) 
                              (.identity '())))))
    (if (or text newline)
        (.identity (concatenate 'string text (when newline (string newline))))
        (if eof-error-p 
            (.fail)
            (.identity eof-value)))))
(t> (parse (.map 'list (.document-line)) "
<<weave-out-line-test-text>>")
       ' ("
"
 "* Very important
"
 "** Less important
"
 "*** A detail
"
 "And the rest is text
"
 "between the headers."))
(parse (.map 'list (.line)) "
 <<weave-out-line-test-text>>")
=>
("
"
 " * Very important
"
 " ** Less important
"
 " *** A detail
"
 " And the rest is text
"
 " between the headers.")

There is really one important detail, and that is the use of .OR ,which could easily be .PLUS without much waste, rather than having the TEXT be .OPTIONAL or :AT-LEAST 0.

What we do not want is the parser to always succeed when there is no input, because then the parser succeeds when there is no input, ad infinitum.

This kind of event happens quite often when using .NOT, so often it is best to take another approach, yet (.not (.item)) does have its uses. Thus, caution is in the air.

.STRING-EQUAL: For case insensitivity

The #+NAME, # are case insensitive. We have ~.CHAR~ and .STRING= already, so .CHAR-EQUAL and .STRING-EQUAL are in orderf_chareq.

(defun .char-equal (char)
  (.is #'cl:char-equal char))

For our .STRING=, we simply return the string we passed in. Because CL:STRING-EQUAL “ignore[s] differences in case”f_chareq, we actually need to return that matched string from what we are parsing.

(defun .string-equal (string)
  (labels ((%string-equal (string)
             (.let* ((first (.char-equal (aref string 0)))
                     (rest (if (> (length string) 1)
                                  (%string-equal (subseq string 1))
                                  (.identity nil))))
               (.identity (cons first rest)))))
    (.let* ((list (%string-equal string)))
      (.identity (coerce list 'string)))))
(test> 
 (run (.string-equal "asd") "AsD") 
  => (("AsD" . "")))   

The Conditions Dictionary

(define-condition smug-condition (simple-condition)
  ((input :reader smug-condition-input
          :initarg :input))
  (:report (lambda (condition stream)
             (apply #'format stream 
                    (simple-condition-format-control condition) 
                    (simple-condition-format-arguments condition))
             (terpri stream)
             (write-string "Input:" stream)
             (print (smug-condition-input condition) stream))))

(define-condition smug-error (error smug-condition) ())

(defun .error (datum &rest arguments)
  (lambda (input)
    (apply #'error 'smug-error
           :input input 
           (if (or (stringp datum)
                   (functionp datum))
               (list :format-control datum
                     :format-arguments arguments)
               arguments))))

Parsing S-Expressions

We should already by quite familiar with sexps because that is the syntax SMUG uses.

In the usual parenthesized syntax of Lisp, an s-expression is classically defined inductively as

  1. an atom, or
  2. an expression of the form (x . y) where x and y are s-expressions. […]

Most modern sexpr notations in addition use an abbreviated notation to represent lists in s-expressions, so that

(x y z) stands for

(x . (y . (z . NIL)))

where NIL is the special end-of-list symbol (written ‘() in Scheme).

https://en.wikipedia.org/wiki/S-expression

.whitespace from +whitespace+

+WHITESPACE+ is a list of characters that do not matter to the .READ‘er.

(defvar *whitespace* '(#\space #\newline #\tab))

The Parser itself is very simple, but we do use three SMUG parsers so it is a good example of use.

(defun .whitespace (&optional result-type)
  (.first (.map result-type (.is 'member *whitespace*))))  

In particular, the RESULT-TYPE defaults to NIL. This means that we can throw the result away which saves on time and memory.

Also, using .FIRST prevents a large parse tree which also saves on such things.

.READ

Anything that is not whitespace must be parsed. This is a perfect use of .PROG2 and .OPTIONAL.

(defun .read (&optional (parser (.sexp)))
  (.prog2 (.optional (.whitespace))
      parser
    (.optional (.whitespace))))                  

Using that we can now parse anything that is not in .( ). Strangely enough, that is very close to what we want for an ATOM.

(smug/tutorial:≻ (parse (.read (.map 'string (.is-not 'find ".( )")))
                        (format nil "~%~T asd()1234"))
 	 :=> (values "asd" "()1234a"))

.ATOM

An atom is anything that is not a list save for the empty list (), which is also know as NIL.

(defun .nil ()
  (.or (.string-equal "NIL")
       (.string= "()")))

So to make an atom it is ether that or we simply need to know what is a member of the constituent characters and make that into a token. We use .FIRST again.

(defun .atom ()
  (.or (.nil) (.first (.token)))

.TOKEN

At this point, anything that is a not a parenthesis or whitespace makes up a token. They are called constituent characters.

(defun .token ()
  (.first (.map 'string (.constituent))))  

.CONSTITUENT

constituent n., adj. 1. a. n. the syntax type of a character that is part of a token – http://www.lispworks.com/documentation/HyperSpec/Body/26_glo_c.htm#constituent

Well, we know what does not constitute a character that is part of a token, so anything that is not a not a part of token is part of a token. The double negative works.

(defvar *non-constituents* 
  (list* #\( #\) *whitespace*))

(defun .constituent (&optional (non-constituents
                                *non-constituents*))
    (.or (.and (.char= #\\) (.item))
         (.is-not 'member non-constituents)))           
(defun .read (&optional (parser (.sexp)))
  (.prog2 (.optional (.whitespace))
      parser
    (.optional (.whitespace))))

(defun .sexp ()
  (.read (.or (.list) (.atom))))
 
(defun .constituent ()
  "http://www.lispworks.com/documentation/HyperSpec/Body/02_ad.htm"
  (let ((non-constituent
         (list* #\( #\) (whitespace))))
  (.or (.is-not 'member non-constituent)
       (.and (.char= #\\) (.item)))))


  
(defun .list-first ()
  (.or (.progn (.read (.read (.char= #\.)))
               (.error "List has nothing before after ."))
       (.sexp)))

(defun .list-rest ()
  (.or (.and (.read (.char= #\.))
             (.or (.sexp) (.error "List has nothing after .")))
       (.list :start nil :end nil)))
       
(defun .list (&key 
                (start (.char= #\())
                (endchar #\))
                (end (.char= endchar)))

  (let ((false (gensym "false")))
    (.let* ((_ (or start
                       (.identity nil)))
            (first (.or (.list-first) (.identity false)))
            (rest (if (not (eq first false))
                      (.or (.list-rest)
                           (.identity false))
                      (.identity false)))
            (_ (if (or (listp rest) (eq false rest))
                   (.progn (.optional (.whitespace))
                           (or end (.identity nil)))
                   (.or (.char= endchar) 
                        (.error "More than one object follows . in list.")))))
      (.identity (if (not (eq false first))
                   (list* first (if (eq false rest) nil rest)))))))

(defun .list (&key 
                (start (.char= #\())
                (end (.char=  #\))))
  (.or 
   (.let* ((start (or start
                  (.identity nil)))
           (first (.optional (.sexp)))
           (rest (if first
                     (.optional (.list :start nil :end nil))
                     (.identity nil)))
           (end (.progn (.optional (.whitespace))
                      (or end (.identity nil)))))
     (.identity (if first (list* first rest))))))

(defvar *nil* (gensym))

(defun .nil ()
  (.and (.or (.string-equal "NIL")
             (.string= "()"))
        (.identity *nil*)))





        


(defun .list (&key delimited &aux (not (gensym)))
  (.or (.nil)
       (.let* ((_ (if (not delimited) 
                      (prog1 (.char= #\() 
                        (setf delimited #\)))
                      (.identity nil)))
               (first (.sexp))
               (dot (.optional (.dot)))
               (rest (if dot 
                         (.sexp) 
                         (.or (.list :delimited #\))
                              (.and (.char= delimited)
                                    (.identity :nil))))))
         (.identity (list* first rest)))))



          


(defun .cons ()
  (.let* ((_ (.char= #\())
          (car (.sexp))
          (_ (.dot))
          (cdr (.sexp))
          (_ (.char= #\))))
    (.identity (cons car cdr))))
                


(defun .read (&optional (parser (.sexp)))
  (.prog2 (.optional (.whitespace))
      parser
    (.optional (.whitespace))))



(defun .sexp ()
  (.read (.or (.list) (.atom))))

(defun .atom ()
  (.first (.map 'string (.constituent))))

(defun .constituent ()
  "http://www.lispworks.com/documentation/HyperSpec/Body/02_ad.htm"
  (let ((non-constituent
         (list* #\( #\) (whitespace))))
  (.or (.is-not 'member non-constituent)
       (.and (.char= #\\) (.item)))))

(defun .error (message &rest args)
  (lambda (input)
    (declare (ignore input))
    (apply #'error message args)))
  
(defun .list-first ()
  (.or (.progn (.read (.read (.char= #\.)))
               (.error "List has nothing before after ."))
       (.sexp)))

(defun .list-rest ()
  (.or (.and (.read (.char= #\.))
             (.or (.sexp) (.error "List has nothing after .")))
       (.list :start nil :end nil)))
       
(defun .list (&key 
                (start (.char= #\())
                (endchar #\))
                (end (.char= endchar)))

  (let ((false (gensym "false")))
    (.let* ((_ (or start
                       (.identity nil)))
            (first (.or (.list-first) (.identity false)))
            (rest (if (not (eq first false))
                      (.or (.list-rest)
                           (.identity false))
                      (.identity false)))
            (_ (if (or (listp rest) (eq false rest))
                   (.progn (.optional (.whitespace))
                           (or end (.identity nil)))
                   (.or (.char= endchar) 
                        (.error "More than one object follows . in list.")))))
      (.identity (if (not (eq false first))
                   (list* first (if (eq false rest) nil rest)))))))

(defun .list (&key 
                (start (.char= #\())
                (end (.char=  #\))))
  (.or 
   (.let* ((start (or start
                  (.identity nil)))
           (first (.optional (.sexp)))
           (rest (if first
                     (.optional (.list :start nil :end nil))
                     (.identity nil)))
           (end (.progn (.optional (.whitespace))
                      (or end (.identity nil)))))
     (.identity (if first (list* first rest))))))

(defvar *nil* (gensym))

(defun .nil ()
  (.and (.or (.string-equal "NIL")
             (.string= "()"))
        (.identity *nil*)))





        


(defun .list (&key delimited &aux (not (gensym)))
  (.or (.nil)
       (.let* ((_ (if (not delimited) 
                      (prog1 (.char= #\() 
                        (setf delimited #\)))
                      (.identity nil)))
               (first (.sexp))
               (dot (.optional (.dot)))
               (rest (if dot 
                         (.sexp) 
                         (.or (.list :delimited #\))
                              (.and (.char= delimited)
                                    (.identity :nil))))))
         (.identity (list* first rest)))))



          


(defun .cons ()
  (.let* ((_ (.char= #\())
          (car (.sexp))
          (_ (.dot))
          (cdr (.sexp))
          (_ (.char= #\))))
    (.identity (cons car cdr))))
                


In the usual parenthesized syntax of Lisp, an s-expression is classically defined inductively as

  1. an atom, or
  2. an expression of the form (x . y) where x and y are s-expressions.

[…]

Most modern sexpr notations in addition use an abbreviated notation to represent lists in s-expressions, so that

(x y z) stands for

(x . (y . (z . NIL)))

where NIL is the special end-of-list symbol (written ‘() in Scheme).

https://en.wikipedia.org/wiki/S-expression

Source Code

Testing

  ; emacs : ≻ C-x 8 RET 227b RET
            ≕ C-x 8 RET 2255 RET
  ; gnome/X : Ctrl-Shift-u 227b
(let ((defpackage *package*))
  (defmacro ≻ (form &key ((≕≻ provided-result) nil result-provided?)
                    (test ''equal))
      (let* ((form-result (gensym))
             (form-string (with-output-to-string (s) (print form s)))
             (new-form (let ((*package* (find-package (package-name defpackage))))
                         (read-from-string form-string)))
             (result (gensym))
             (values (when (and result-provided? (listp provided-result))
                       (eq (first provided-result) 'cl:values))))
                      
        `(progn 
           (let* ((,form-result (multiple-value-list ,new-form))
                  (,result (if ,result-provided?
                               (funcall ,test ,(if values 
                                                   form-result
                                                   `(first ,form-result))
                                        ,(if values 
                                             `(multiple-value-list ,provided-result)
                                             provided-result)) 
                               t)))
              (assert ,result
                      () "~A~% => ~S ~%  ...should be :~%    ~S"
                      ',form ,form-result ',(if result-provided? 
                                                provided-result
                                                "Something that evaluates to a non-NIL value"))
              (apply #'values ,form-result))))))

SMUG/TUTORIAL

SMUG/TUTORIAL primitives

The PARSER itself

#:.item

The Testing Reader Macro

(defmacro smug> (form &key ((:=> expected-result) nil result-given?))
  (let ((results (gensym))
        (result (gensym)))
    `(let* ((,results (multiple-value-list ,form))
            (,result (first ,results)))
       (multiple-value-prog1 (apply #'values ,results)
         , (when result-given? 
             `(assert (equalp ,result ',expected-result)))))))

(set-macro-character 
 #\≻ 
 (lambda (s c &aux (*readtable* (copy-readtable *readtable*))) 
   (declare (ignore c))
   (set-macro-character #\≻ nil) 
   (let* ((form (read s))
          (char (peek-char t s nil #\null))
          (x (gensym))
          (expected-result (if (char= char #\≕)
                               (progn (read s) (read s) (read s))
                               x)))
                                        
     `(smug> ,form ,@(when (not (eq x expected-result))
                           `(:=> ,expected-result))))))

The tutorial.lisp file

(defpackage :smug/tutorial
  (:use :cl)
  (:export 
   #:≻
   #:.identity
   #:.fail
   #:.item
   #:.bind
   
   #:input-empty-p
   #:input-first
   #:input-rest
   #:run
   #:parse

   #:.plus
   #:.or
   #:.not
   #:.let*
   #:.map
   #:.concatenate
   #:.is
   #:.is-not
   #:.char=
   #:.char-equal
   #:.string-equal
   #:.string=
   #:.progn
   #:.prog1
   #:.prog2
   #:.and
   #:.or
   #:.not
   #:.first
   #:.optional
   #:.read-line
   ))
(in-package :smug/tutorial) 

<<tutorial_.letstar>>

<<tutorial_run>>

<<tutorial_.fail>>

<<tutorial_.plus>>

<<tutorial_.identity>>

<<tutorial_.bind>>

<<tutorial_.or>>

<<tutorial_.not>>

<<tutorial_reading-input>>

<<tutorial_.item>>

<<tutorial-source>>

<<tutorial_.list-of>>

<<tutorial_.satisfies>>

<<tutorial_.optional>>

<<tutorial_.and>>

<<tutorial_.progn>>

<<tutorial_.is-not>>

<<tutorial_.is>>

<<tutorial-.mapcar>>

<<tutorial-.mapc>>

<<tutorial-.make-list>>

<<tutorial-.concatenate>>

<<tutorial_.map>>

<<tutorial_char=digit-char|lower-case-p>>

<<tutorial-line>>
 
<<tutorial_.coerce>>

<<tutorial_.string=>>

<<tutorial_.char-equal>>

<<tutorial_.string-equal>>

<<tutorial_.first>>

<<tutorial-test-macro>>
 

[1] Monadic parser combinators (pdf, ps, bibtex) Graham Hutton and Erik Meijer. Technical Report NOTTCS-TR-96-4, Department of Computer Science, University of Nottingham, 1996.

http://www.cs.nott.ac.uk/~gmh/bib.html#monparsing

[2] http://www.willamette.edu/~fruehr/haskell/seuss.html

3 like, say, scheme