This tutorial started as a translation of Monadic Parser Combinators [1], or at least the first half, into common lisp.
The example code in the following document is completely
self-contained, and does not require an installation of the SMUG
library.
No prior experience with functional programming, monads or recursive descent parsing is assumed. The only requirements are a common lisp environment, and a burning desire to find a better way to parse.
This tutorial, like this library, is based on an approach to building parsers using higher-order functions (combinators) that is popular in the functional programming community. Incidentally, these parsers form an instance of something called a monad, which is itself a useful construct with implications beyond parsing.
With great debt to Monadic Parser Combinators [1], the paper from which this library is derived, this tutorial presents a step by step introduction to the topics of parser combinators and monads and their use in common lisp.
In some cases, the natural name for a parser conflicts with a name in
the COMMON-LISP
package. Rather then shadow the symbols, I’ve
chosen to prefix all parser names with a #. character. It is thought that
this aids usability, as one can simply (:use :smug/tutorial)
. It also
helps to distinguish parser returning functions from other
functions.
A Parser for Things is a function from Strings to Lists of Pairs of Things and Strings!
– Fritz Ruehr, Willamette University [2]
A parser is something that is familiar to all programmers… a function that, given a series of tokens as input, produces a data structure that relates to the grammatical structure of the input in some way. Or, to put it simply, a function from strings to things.
;; our fictional parser matches the string "string"
;; and returns a SYMBOL thing
≻ (parse-thing "string") ≕≻ THING
In order to combine simple parsers into larger more complex ones,
they need a way to communicate between them. First, because any
given parser might consume only a part of the input, we’ll have our
parser return a CONS
with the result in the CAR
and the remaining
input in the CDR
.
≻ (parse-thing "string string") ≕≻ (THING . " string")
Because a parser may return multiple results when the
grammar is ambiguous, or may return no results all, we’ll put our
conses in a list, and have the empty list, NIL
, denote a failed
parse.
≻ (parse-thing "string string") ≕≻ ((THING . " string"))
≻ (parse-thing "strong string") ≕≻ NIL
So, for our purposes, a parser is just a FUNCTION
that takes a
single value as the input and returns a LIST
of CONS
’s of results
and unconsumed input.
It is this trivial protocol that allows us to combine small simple parsers into larger more useful ones.
Smug parsers allow infinite look-ahead and backtracking. To support
parsing many different things, it’s useful to define an input
protocol. Smug parsers only require three operations on input :
INPUT-FIRST
, INPUT-REST
and INPUT-EMPTY-P
.
We’ll define them in terms of strings. It serves our purposes and makes for a nice visual presentation.
(defgeneric input-empty-p (input)
(:method ((input string)) (zerop (length input))))
(defgeneric input-first (input)
(:method ((input string)) (aref input 0)))
(defgeneric input-rest (input)
(:method ((input string))
(multiple-value-bind (string displacement)
(array-displacement input)
(make-array (1- (length input))
:displaced-to (or string input)
:displaced-index-offset (1+ displacement)
:element-type (array-element-type input)))))
≻ (input-empty-p "") ≕≻ t
≻ (input-empty-p "foo") ≕≻
≻ (input-first "foo") ≕≻ #\f
≻ (input-rest "foo") ≕≻ "oo"
There are 3 simple primitive parsers. It it only necessary to
understand them, and one sequencing combinator, .BIND
, to understand
all of SMUG/TUTORIAL
.
The first parser is .IDENTITY
, which always succeeds by returning the
value passed to it, and does not consume any input. Because we’ve
earlier defined parsers as functions that take a single argument
we’ll make a curry[5] with the input parameter.
(defun .identity (value)
(lambda (input)
(list (cons value input))))
(test> (funcall (.identity :foo) "bar baz")
=> ((:foo . "bar baz")))
The second parser, .FAIL
, is the inverse behaviour of
.IDENTITY
. It simply fails regardless of the input. we could
define .FAIL
as a function that takes a single argument, but then
we’d have to access it using FUNCTION
(#’), and aesthetically that
inconsistency is undesirable, so we’ll again make curry with the
input parameter.
(defun .fail ()
(lambda (input) (declare (ignore input)) nil))
(test> (funcall (.fail) "foo") => NIL)
The last true primitive is .~.ITEM~, which is a parser that consumes the first token in the input, or fails in the input is empty.
(defun .item ()
(lambda (input)
(unless (input-empty-p input)
(list (cons (input-first input)
(input-rest input))))))
(test> (funcall (.item) "foo")
=> ((#\f . "oo")))
(test> (funcall (.item) "")
=> NIL)
All the primitives return a FUNCTION
that must be FUNCALL
‘ed with INPUT
in
order to run the parser.
There are many reasons to define a RUN
function. We can CL:TRACE
it, or change the input parameter TYPE
, or change what the actual
primitives return.
(defun run (parser input)
(funcall parser input))
The RUN
function returns the entire parse tree. Most of the time we
simply want the CAR
of the FIRST
result. The CDR
is the leftover
input, and the REST
of the result alternative outcomes. We might
want these as well, so we return that as VALUES
.
(defun parse (parser input)
(let ((result (run parser input)))
(when result
(destructuring-bind ((result . input) &rest rest)
result
(apply #'values result input rest)))))
Now that we have our primitive parsers, we need a way to combine
them. We’d like to be able to apply parsers in sequence, and it
would also come in handy if we could give names to the intermediate
results of parsers. Both these requirements are fulfilled by using
the monadic sequencing operator, .BIND
.
.BIND
is a function that takes as arguments a parser P, and a
function F which take a value and returns a parser P2. .BIND
returns
a parser that first applies P to the input, returning a list of
(VALUE . INPUT)
pairs. The the function F
is applied to each VALUE
,
and the result P2
then applied to the INPUT
. The collected lists of
pairs returned from the P2
’s are then concatenated and the result
returned.
(defun .bind (parser function)
(lambda (input)
(loop :for (value . input) :in (run parser input)
:append (run (funcall function value) input))))
(let ((char-token
(.bind (.item)
(lambda (char)
(.identity (list :char char))))))
(run char-token "foo"))
;; ~> (((:CHAR #\f) . "oo"))
Because .BIND
itself returns a parser, the result of a .BIND
can be
returned as P2. This allows parsers to be chained, and allows us to
use LAMBDA
to provide names for the values of parser results. For
example, the following parser uses .BIND
to return the first two
characters as a cons.
(let ((two-chars
(.bind (.item)
(lambda (char)
(.bind (.item)
(lambda (char2)
(.identity (cons char char2))))))))
(run two-chars "asd"))
;;=> (((#\a . #\s) . "d"))
The next section gets into some details about why our parser is a monad. You don’t really need to know this, so feel free to skip it if you’re in a hurry.
By virtue of having the functions .BIND
and .IDENTITY
defined as they
are, our parser interface forms a monad. A monad is, essentially,
a category of things that provide the functions .BIND
and .IDENTITY
.
Of course, just having functions called .BIND
and .IDENTITY
does not a
monad make. There are other contracts that .BIND
(also known as
pipe, >>~, *, or let) or .IDENTITY
(aka result, lift, unit, return) must
fulfill.
In order to be properly categorized as a monad, the thing
providing a definition for .BIND
and .IDENTITY
must obey three laws
(a static functional programmer would say ‘must have a certain
type’, but the word type means something different to a dynamic
functional programmer, so we’ll avoid it here)
In order to describe those laws we need to define a few terms
- Monadic Value (MV)
- a function that, given a value, returns a
value in the form expected by the internals of .BIND
. In our
examples above, a parser (taking an input and returning a
list of results) is the Monadic Value.
- Monadic Function (MF)
- A function that, given a value returns
a monadic value encapsulating that value. .IDENTITY
is the
canonical Monadic Function
In Object-Oriented terms, the MF is a constructor, and the MV an object.
The laws which all things must obey in order to be called a monad are simple :
- “Left identity”
- (bind (result x) MF) = (funcall MF x)
- “Right identity”
- (bind MV result) = MV
- “Associativity”
- (bind (bind MV MF) MF2)
= (bind MV (lambda (x) (bind (MF x) MF2)))
With static type systems, the compiler will enforce this contract
for you. In a dynamic system, we just need to be a little more
careful. Proving the monad laws for our .BIND
and .IDENTITY
is
left as an exercise.
That’s really all there is to monads except for syntax, which we’ll get to later. There are extended laws that other monads obey, and monads have other uses beyond parsing, but we’re reaching the end of our scope already.
Often, we only want to consume input if a certain
condition is true. This where .SATISFIES
comes in.
(defun .satisfies (predicate &rest args)
(.bind (.item)
(lambda (x)
(if (apply predicate x args)
(.identity x)
(.fail)))))
(run (.satisfies #'digit-char-p) "1 and")
;;=> ((#\1 . " and"))
If .ITEM
fails, so will the .SATISFIES
parser. This is because (bind
(fail) MF) will always fail. .FAIL
, also known as .ZERO
, is a function
belonging to a category of monads knows as “monads with a
zero”. That’s not terribly important for parsing, but interesting if
you’re into that sort of thing.
Imagine we need to parse all characters that come before a #\;
. The
simple way is to have a function that uses CL:NOT
.
(.satisfies
(lambda (item)
(not (char= #\; item))))
It turns out that (.satisfies (lambda (i) (not ...)))
is quite
common, so we define a parser that has a shorter and more relevant
name.
(defun .is-not (predicate &rest args)
(.satisfies (lambda (i)
(cl:not (apply predicate i args)))))
This makes things a lot shorter and easier to read.
(test>
(run (.is-not #'char= #\;) "foobar;%^*&")
=> ((#\f . "oobar;%^*&")))
(test>
(run (.is-not #'char= #\;) ";%^*&")
=> NIL)
For that matter, now that we have .IS-NOT
, .SATISFIES
is a bit
long, and does not prefix -NOT
. So we type a few keys in
order to save a bundle in the future.
(defun .is (predicate &rest args)
(apply #'.satisfies predicate args))
.SATISFIES
allows us to DEFUN
some simple parsers
(defun .char= (x)
(.is #'cl:char= x))
(defun .digit-char-p ()
(.is #'cl:digit-char-p))
(defun .lower-case-p ()
(.is #'cl:lower-case-p))
(defun .upper-case-p ()
(.is #'cl:upper-case-p))
(run (.char= #\x) "xyzzy") ;=> ((#\x . "yzzy"))
(run (.digit-char-p) "1234") ;=> ((#\1 . "234"))
(run (.lower-case-p) "abcd") ;=> ((#\a . "bcd"))
(run (.upper-case-p) "Abcd") ;=> ((#\A . "bcd"))
(run (.upper-case-p) "doh!") ;=> NIL
If we want to combine our earlier parsers, say to create an
ALPHANUMERIC-CHAR
from UPPER-CASE-P
and LOWER-CASE-P
we need
a combinator capable of making the choice between them.
In some cases, it may not be an exclusive choice. There might be multiple ways to parse a string, or a later pass might resolve the ambiguity.
For example, in one of our earlier examples of .BIND
, we saw a
parser that returned the first two characters in a stream. This
parser will fail if there is only one character left in the input.
(let ((two-chars
(.bind (.item)
(lambda (char)
(.bind (.item)
(lambda (char2)
(.identity (cons char char2))))))))
(funcall two-chars "a"))
;;=> NIL
If we want to parse one or two characters, or an arbitrarily long series of characters, we need some a way to express that.
Enter the .PLUS
combinator.
(defun .plus (first-parser second-parser)
(lambda (input)
(append (funcall first-parser input) (funcall second-parser input))))
(let ((two-chars
(.bind (.item)
(lambda (char)
(.bind (.item)
(lambda (char2)
(.identity (cons char char2))))))))
(funcall (.plus two-chars (.item)) "a")
;;=> ((#\a . ""))
(funcall (.plus two-chars (.item)) "asd")
;;=> (((#\a . #\s) . "d") (#\a . "sd"))
)
Note that the second parse returned two pairs, as both parsers were successful. The string parsed as both two chars and a single item.
The examples used in the original paper[1] are for letters and alphanumeric characters. There’s no good reason to use them over /(.is #’alpha-char-p)/and the like, but they do serve as simple example.
(defun letter () (plus (lower-case-char) (upper-case-char)))
(funcall (letter) "foo") => ((#\f . "oo"))
(funcall (letter) "1foo") => NIL
(defun alphanumeric () (plus (letter) (.digit-char)))
(funcall (alphanumeric) "1foo") => ((#\1 . "foo"))
(funcall (alphanumeric) "!1foo") => NIL
The other example is more illustrative, a parser that returns a series of letters or the empty string.
(defun word ()
(let ((non-empty-letters
(bind (letter)
(lambda (first-letter)
(bind (word)
(lambda (rest-of-letters)
(result (format nil "~A~A"
first-letter
rest-of-letters))))))))
(plus non-empty-letters (result ""))))
(funcall (word) "asd")
=>
(("asd" . "") ("as" . "d") ("a" . "sd") ("" . "asd"))
This is our first recursive parser, but it’s a common idiom. Notice that it returns all the possible strings of letters.
This is obviously inefficient when one only requires the first value.
required, a deterministic combinator .OR
, will be introduced later
in the tutorial.
.FIRST
is the real choice when it comes down to it, as .PLUS
really does matter.
(defun .first (parser)
(lambda (input)
(let ((results (run parser input)))
(when results (list (cl:first results))))))
If you read the earlier section on monads, you’d know that .BIND
and .IDENTITY
are the interface to many different types of monads,
of which our parser is but one example. If you didn’t, you know
now. Again, if you’re not at all interested and really just want to
keep on parsing, skip down to the macro.
The most basic monad is the identity monad. A definition of its
.BIND
and .IDENTITY
might look like the following.
(defun i-bind (mv mf) (funcall mf mv))
(defun i-result (value) value)
In Lisp, the identity monad is so trivial as to be useless. In a functional programming language, or any language where the order of operations is not guaranteed, the identity monad serves to sequence operations.
Imagine a silly lisp where the order of evaluation isn’t defined as strict left to right[3]. The following form could have disastrous consequences.
(progn (remove-gun-from-pants)
(point-gun-at-bad-guy)
(pull-trigger))
The identity monad makes the sequencing explicit. In a purely functional lisp, one might sequence the operations as follows.
(i-bind (remove-gun-from-pants)
(lambda (gun)
(i-bind (point-gun-at-bad-guy gun)
(lambda (pointed-gun)
(i-bind (pull-trigger pointed-gun)
(lambda (fired-gun)
(i-result fired-gun)))))))
In functional programming languages this pattern is so common that there is special syntax for it. The usual choices are ‘do notation’ or ‘list comprehension syntax’.
First, the previous example rendered in list comprehension notation :
[fgun | gun <- removeGun
, pgun <- pointGunAtBadGuy gun
, fgun <- pullTrigger pgun]
And in do notation :
do
gun <- removeGun
pgun <- pointGunAtBadGuy
fgun <- pullTrigger pgun
return fgun
The astute lisper might notice that do notation looks a lot like LET*. In fact, that’s really all it is. LET* is lisp syntax for the identity monad, and our i-bind using forms above are directly translatable.
(let* ((gun (remove-gun-from-pants))
(pointed-gun (point-gun-at-bad-guy gun))
(fired-gun (pull-trigger pointed-gun)))
(identity fired-gun))
One could legitimately say that the common lisp package is an instance of the identity monad, if one cared for such insights.
A LET*
like construct is the obvious notation for a lisper to take
advantage of the monadic nature of parsers. It’s often useful to
ignore a value. In haskell, the underscore character is used to
denote an ignorable variable, so we’ll use the same convention.
(defmacro .let* (bindings &body body)
(if bindings
(let ((symbol (first (first bindings))))
`(.bind ,@(cdr (first bindings))
(lambda (,symbol)
,@(when (or (string-equal (symbol-name symbol) "_")
(null (symbol-package symbol)))
`((declare (ignorable ,symbol))))
(.let* ,(cdr bindings)
,@body))))
`(progn ,@body)))
(funcall (.let* ((a (.identity 1)))
(.identity a)) "")
If we replace .BIND
with our I-BIND
function above, we get a macro
that is equivalent to LET*
. .LET*
binds the results of parsers,
and is a much nicer way to work over nesting .BIND
’s.
Using .LET*
, we can implement the macros .PROGN
(which is similar
.AND
because it will fail when the parser does), .PROG1
(which
comes in handy for matching things and the end of the line, or
when there is no more input) and .PROG2
, which as we will see is
also quite useful.
(defmacro .progn (&rest parsers)
(if (rest parsers)
(let ((name (gensym)))
`(.let* ((,name ,(first parsers)))
(.progn ,@(rest parsers))))
(first parsers)))
(defmacro .prog1 (parser &rest parsers)
(let ((name (gensym))
(ignore (gensym)))
`(.let* ((,name ,parser)
(,ignore (.progn ,@parsers)))
(.identity ,name))))
(defmacro .prog2 (parser1 parser2 &rest parsers)
(let ((name (gensym))
(ignore (gensym)))
`(.let* ((,ignore ,parser1)
(,name ,parser2)
(,ignore (.progn ,@parsers)))
(.identity ,name))))
(defun .string= (string)
(if (string= string "")
(.identity nil)
(.let*
((_ (.is 'char= (aref string 0)))
(_ (.string= (subseq string 1))))
(.identity string))))
(run (.string= "asdf") "asdfjkl") => (("asdf" . "jkl"))
(run (.string= "asdf") "asd") => NIL
Once can see how much nicer .LET*
notation is, and also how the
ignorable _
comes in handy.
Using recursion like we did in our WORD parser, we’ll create a parser that matches a specific string.
(defun .string= (string)
(if (string= string "")
(.identity nil)
(.let*
((_ (.is 'char= (aref string 0)))
(_ (.string= (subseq string 1))))
(.identity string))))
(run (.string= "asdf") "asdfjkl") => (("asdf" . "jkl"))
(run (.string= "asdf") "asd") => NIL
Once can see how much nicer .LET*
notation is, and also how the
ignorable _
comes in handy.
Earlier, we defined a parser, .WORD
, using .BIND
and a recursive
call. Lets define a similar parser using .LET*
that returns a list
of letters.
(defun .letters ()
(.plus (.let* ((x (.letter))
(xs (.letters)))
(.identity (cons x xs)))
(.identity nil)))
This pattern can easily be abstracted into a more general
combinator, .ZERO-OR-MORE
(defun .zero-or-more (parser)
(.plus (.let* ((x parser)
(xs (.zero-or-more parser)))
(.identity (cons x xs)))
(.identity ())))
(test>
(run (.zero-or-more (.char= #\a)) "aaaab"
=> (((#\a #\a #\a #\a) . "b") ((#\a #\a #\a) . "ab") ((#\a #\a) . "aab")
((#\a) . "aaab") (NIL . "aaaab")))
(test>
(run (.zero-or-more (.char= #\a)) "bbbba")
=>
((NIL . "bbbba")))
Note that zero or more always succeeds. If one needs a parser that matches one or more items and fails otherwise, we can define one in terms of ZERO-OR-MORE, can call it, appropriately enough, ONE-OR-MORE.
(defun .one-or-more (parser)
(.let* ((x parser)
(y (.zero-or-more parser)))
(.identity (cons x y))))
(test> (funcall (.one-or-more (.char= #\a)) "aaaab")
=>
(((#\a #\a #\a #\a) .many "b")))
(test> (funcall (.one-or-more (.char= #\a)) "bbbba")
=> NIL)
We could now define a TWO-OR-MORE
and THREE-OR-MORE
etc., but it
is likely better to define a function to rule them all. It needs a
base to rule from.
Syntax:
.mapc
parser => parser
.mapcar
parser => result-list
Arguments and Values:
- parser
- The parser that is attempted
- result-list
- a list
Description:
The mapping operation involves attempting parser many times. Except
for .mapc
and .mapl
, the result contains the results returned by
the parser.
.MAPCAR
operates on successive results of parser. The iteration
terminates when the parser fails. The value returned by mapcar is a
list of the results of parser
(defun .mapcar (parser)
(.plus (.let* ((x parser)
(xs (.mapcar parser)))
(.identity (cons x xs)))
(.identity ())))
.MAPC
is like .MAPCAR
except that the results of applying function are
not accumulated. The parser argument is returned as a result.
(defun .mapc (parser)
(.plus (.let* ((_ parser)
(_ (.mapc parser)))
(.identity parser))
(.identity parser)))
Examples:
(test> (parse (.prog1 (.mapcar (.item))
(.char= #\!))
"Yay!")
=> (#\Y #\a #\y))
(let ((/parser/ (.item)))
(test> (parse (.let* ((parser (.prog1 (.mapc /parser/)
(.char= #\!)))
(char parser))
(.identity (cons (eq parser /parser/)
char)))
"Holy Guacamole!?")
=>(T . #\?)))
Often, we know what we want, and how many we want, all at once. So,
just like CL:MAKE-LIST
, we take a number.
Syntax:
.make-list
size &key initial-element => list
.make-sequence
result-type size &key initial-element => list
Arguments and Values:
- size
- a non-negative integer.
- initial-element
- a parser. The default is
(.item)
. - list
- a list.
Description:
Returns a list of length given by size, each of the elements of which is a successful run of the initial-element parser.
(defun .make-list (size &key (initial-element (.item)))
(if (zerop size)
(.identity nil)
(.let* ((first initial-element)
(rest (.make-list (1- size)
:initial-element initial-element)))
(.identity (list* first rest)))))
(defun .make-sequence (type length &key (initial-element (.item)))
(.let* ((list (.make-list length :initial-element initial-element)))
(.identity (coerce list type))))
There are often mulitple parsers that are run one after another, and
the results joined together at the end. .CONCATENATE
takes care of
that.
(defun .concatenate (output-type-spec &rest parsers)
(if (not parsers)
(.fail)
(.let* ((first (first parsers))
(rest (if (rest parsers)
(apply
#'.concatenate output-type-spec (rest parsers))
(.identity nil))))
(.identity (cl:concatenate output-type-spec first rest)))))
In the end
The :AT-LEAST
keyword solves the “how many do we want to start
with?”. We have decided on 1
as the default, as most of the time we
do want the parser to succeed, and besides that, .OPTIONAL
is a
better way of saying :at-least 0
.
So, we can specify the .MAP
parser as follows.
(defun .map (result-type parser
&key
(at-least 1))
"=> a ~result-type~ of /parser/ results."
(.let* ((list-1 (.make-list at-least :initial-element parser))
(list-2 (funcall (if result-type #'.mapcar #'.mapc) parser)))
(.identity (when result-type (concatenate result-type list-1 list-2)))))
Let’s make a parser for standard quoted strings. We’ll use the #' character as the quotes, and the #\| character as the escape character, simply to make it easier to embed in our example text in common lisp strings.
(defun .quoted-string (&key (quote #\')
(escape #\|))
(.let* ((_ (.char= quote))
(string
(.map 'string
(.plus (.let* ((_ (.char= escape)))
(.item))
(.is-not 'char= quote))))
(_ (.char= quote)))
(.identity string)))
(The quote char is ’ and the escape char is | . . ”) |
.OR
is a deterministic .PLUS
. It take any number of parsers. The
first parser is run, and if it succeeds, evaluation short circuits
and the result of the parser is returned. Otherwise, the next
parser is run, and so on, until one succeeds or there are no more
parsers.
We can’t use .BIND
or .LET*
for .OR
because it would fail if one of
its parsers fails. As such, .OR
must be a primitive.
(defun .or (parser &rest parsers)
(lambda (input)
(or (funcall parser input)
(when parsers
(funcall (apply #'.or parsers) input)))))
Similarly, .NOT
, which continues parsing only when the parser
fails, is primitive as well.
(defun .not (parser)
(lambda (input)
(let ((result (funcall parser input)))
(if result
nil
(list (cons t input))))))
On the other hand, .AND
can be defined in terms of IF
, and
doesn’t even need to test for failure, as .BIND
handles failure
automatically.
.AND
(known as ‘>>’ in haskell) sequentially composes parsers,
discarding the results of all but the last one, and returning that
result.
(defun .and (p1 &rest ps)
(.let* ((result p1))
(if ps
(apply #'.and ps)
(.identity result))))
Now that we have .NOT
, we can specifically test for failure rather
than abort the parse entirely. since the primitive parser .ITEM
only fails when the input is empty, we can define .NO-MORE-INPUT
by
negating it.
(defun .no-more-input ()
(.not (.item)))
The OPTIONAL combinator, which allows a parser to fail and still
continue, is a natural use of .OR
.
(defun .optional (parser)
(.or parser (.identity nil)))
Finally, using .OR
, .AND
and .NOT
, we can make parser versions of
the lisp conditionals we all know and love.
(defun .if (test-parser then-parser
&optional (else-parser (.fail)))
(let ((no (gensym)))
(.let* ((no? (.or test-parser (.identity no))))
(if (not (eq no? no))
then-parser
else-parser))))
(defun .when (test-parser then-parser)
"we define .when in terms of .IF, but it's really just .AND again"
(.if test-parser then-parser))
(defun .unless (test-parser then-parser)
"defined in term of .when, even though it's just (.AND (.NOT ...))"
(.when (.not test-parser) then-parser))
Lines are quite important for lexing/parsing most text files. For
everything before the last line it is very simple, but though
#\Newline
is a decent break up, it does not always end that way.
(defun .read-line-newline ()
(.let* ((line (.optional (.map 'list (.is-not #'char= #\Newline))))
(newline (.is #'char= #\Newline)))
(.identity (concatenate 'string line (string newline)))))
The issue is that the last “line” has text but does does not end with
a #\Newline
, and the first line does not start with one. Beyond
that, for testing we often want to try with only one “line” and most
likely without the newline at all.
In other words, if we want the entire thing as a list of lines, this
works, as the string ends with a #\Newline
(smug/tutorial:≻
(parse (.map 'list (.read-line-newline))
"foo
bar
")
:=> ;; Which gives us :
("foo
"
"bar
"))
If there is no newline at the end, it does not.
(smug/tutorial:≻ (parse (.map 'list (.read-line-newline))
"foo
bar")
;; Which gives us :
=> ("foo
"))
In the latter case, there is no “bar” at the end, which makes it almost not worth the journey there.
As luck would have it, there is a .NOT
which will help us tie up
this particular line.
(defun .last-line ()
(.prog1 (.map 'string (.is-not #'char= #\Newline))
(.not (.item))))
(test> (parse (.last-line) "bar")
=> "bar")
(test> (parse (.last-line) "bar
")
=> "bar")
For every line save for the last will fail for .LAST-LINE
, but every
line save for an empty one will succeed until the end of that parser,
so using .PLUS
would be a waste of CPU time.
(defun .line ()
(.plus (.read-line-newline)
(.last-line)))
Using .OR
to run the last only when the first fails works great, but that still means but
in the end there is a better option. We combine them both into one
function.
(defun .read-line (&optional
(eof-error-p t)
eof-value)
(.let* ((text (.optional
(.first (.map 'list (.is-not #'char= #\Newline)))))
(newline (.or (.char= #\Newline)
(.and (.not (.item))
(.identity '())))))
(if (or text newline)
(.identity (concatenate 'string text (when newline (string newline))))
(if eof-error-p
(.fail)
(.identity eof-value)))))
(t> (parse (.map 'list (.document-line)) "
<<weave-out-line-test-text>>")
' ("
"
"* Very important
"
"** Less important
"
"*** A detail
"
"And the rest is text
"
"between the headers."))
(parse (.map 'list (.line)) "
<<weave-out-line-test-text>>")
=>
("
"
" * Very important
"
" ** Less important
"
" *** A detail
"
" And the rest is text
"
" between the headers.")
There is really one important detail, and that is the use of .OR
,which could easily be .PLUS
without much waste, rather than having
the TEXT
be .OPTIONAL
or :AT-LEAST 0
.
What we do not want is the parser to always succeed when there is no input, because then the parser succeeds when there is no input, ad infinitum.
This kind of event happens quite often when using .NOT
, so often it
is best to take another approach, yet (.not (.item))
does have its
uses. Thus, caution is in the air.
The #+NAME
, # are case insensitive. We have
~.CHAR~
and .STRING=
already, so .CHAR-EQUAL
and .STRING-EQUAL
are in orderf_chareq.
(defun .char-equal (char)
(.is #'cl:char-equal char))
For our .STRING=
, we simply return the string we passed in. Because
CL:STRING-EQUAL
“ignore[s] differences in case”f_chareq, we
actually need to return that matched string from what we are parsing.
(defun .string-equal (string)
(labels ((%string-equal (string)
(.let* ((first (.char-equal (aref string 0)))
(rest (if (> (length string) 1)
(%string-equal (subseq string 1))
(.identity nil))))
(.identity (cons first rest)))))
(.let* ((list (%string-equal string)))
(.identity (coerce list 'string)))))
(test>
(run (.string-equal "asd") "AsD")
=> (("AsD" . "")))
(define-condition smug-condition (simple-condition)
((input :reader smug-condition-input
:initarg :input))
(:report (lambda (condition stream)
(apply #'format stream
(simple-condition-format-control condition)
(simple-condition-format-arguments condition))
(terpri stream)
(write-string "Input:" stream)
(print (smug-condition-input condition) stream))))
(define-condition smug-error (error smug-condition) ())
(defun .error (datum &rest arguments)
(lambda (input)
(apply #'error 'smug-error
:input input
(if (or (stringp datum)
(functionp datum))
(list :format-control datum
:format-arguments arguments)
arguments))))
We should already by quite familiar with sexps because that is the
syntax SMUG
uses.
In the usual parenthesized syntax of Lisp, an s-expression is classically defined inductively as
- an atom, or
- an expression of the form (x . y) where x and y are s-expressions. […]
Most modern sexpr notations in addition use an abbreviated notation to represent lists in s-expressions, so that
(x y z) stands for
(x . (y . (z . NIL)))
where NIL is the special end-of-list symbol (written ‘() in Scheme).
+WHITESPACE+
is a list of characters that do not matter to the
.READ
‘er.
(defvar *whitespace* '(#\space #\newline #\tab))
The Parser itself is very simple, but we do use three SMUG
parsers
so it is a good example of use.
(defun .whitespace (&optional result-type)
(.first (.map result-type (.is 'member *whitespace*))))
In particular, the RESULT-TYPE
defaults to NIL
. This means that we
can throw the result away which saves on time and memory.
Also, using .FIRST
prevents a large parse tree which also saves on
such things.
Anything that is not whitespace must be parsed. This is a perfect use
of .PROG2
and .OPTIONAL
.
(defun .read (&optional (parser (.sexp)))
(.prog2 (.optional (.whitespace))
parser
(.optional (.whitespace))))
Using that we can now parse anything that is not in .( )
. Strangely
enough, that is very close to what we want for an ATOM
.
(smug/tutorial:≻ (parse (.read (.map 'string (.is-not 'find ".( )")))
(format nil "~%~T asd()1234"))
:=> (values "asd" "()1234a"))
An atom is anything that is not a list save for the empty list ()
,
which is also know as NIL
.
(defun .nil ()
(.or (.string-equal "NIL")
(.string= "()")))
So to make an atom it is ether that or we simply need to know what is
a member of the constituent characters and make that into a token. We
use .FIRST
again.
(defun .atom ()
(.or (.nil) (.first (.token)))
At this point, anything that is a not a parenthesis or whitespace makes up a token. They are called constituent characters.
(defun .token ()
(.first (.map 'string (.constituent))))
constituent n., adj. 1. a. n. the syntax type of a character that is part of a token – http://www.lispworks.com/documentation/HyperSpec/Body/26_glo_c.htm#constituent
Well, we know what does not constitute a character that is part of a token, so anything that is not a not a part of token is part of a token. The double negative works.
(defvar *non-constituents*
(list* #\( #\) *whitespace*))
(defun .constituent (&optional (non-constituents
*non-constituents*))
(.or (.and (.char= #\\) (.item))
(.is-not 'member non-constituents)))
(defun .read (&optional (parser (.sexp)))
(.prog2 (.optional (.whitespace))
parser
(.optional (.whitespace))))
(defun .sexp ()
(.read (.or (.list) (.atom))))
(defun .constituent ()
"http://www.lispworks.com/documentation/HyperSpec/Body/02_ad.htm"
(let ((non-constituent
(list* #\( #\) (whitespace))))
(.or (.is-not 'member non-constituent)
(.and (.char= #\\) (.item)))))
(defun .list-first ()
(.or (.progn (.read (.read (.char= #\.)))
(.error "List has nothing before after ."))
(.sexp)))
(defun .list-rest ()
(.or (.and (.read (.char= #\.))
(.or (.sexp) (.error "List has nothing after .")))
(.list :start nil :end nil)))
(defun .list (&key
(start (.char= #\())
(endchar #\))
(end (.char= endchar)))
(let ((false (gensym "false")))
(.let* ((_ (or start
(.identity nil)))
(first (.or (.list-first) (.identity false)))
(rest (if (not (eq first false))
(.or (.list-rest)
(.identity false))
(.identity false)))
(_ (if (or (listp rest) (eq false rest))
(.progn (.optional (.whitespace))
(or end (.identity nil)))
(.or (.char= endchar)
(.error "More than one object follows . in list.")))))
(.identity (if (not (eq false first))
(list* first (if (eq false rest) nil rest)))))))
(defun .list (&key
(start (.char= #\())
(end (.char= #\))))
(.or
(.let* ((start (or start
(.identity nil)))
(first (.optional (.sexp)))
(rest (if first
(.optional (.list :start nil :end nil))
(.identity nil)))
(end (.progn (.optional (.whitespace))
(or end (.identity nil)))))
(.identity (if first (list* first rest))))))
(defvar *nil* (gensym))
(defun .nil ()
(.and (.or (.string-equal "NIL")
(.string= "()"))
(.identity *nil*)))
(defun .list (&key delimited &aux (not (gensym)))
(.or (.nil)
(.let* ((_ (if (not delimited)
(prog1 (.char= #\()
(setf delimited #\)))
(.identity nil)))
(first (.sexp))
(dot (.optional (.dot)))
(rest (if dot
(.sexp)
(.or (.list :delimited #\))
(.and (.char= delimited)
(.identity :nil))))))
(.identity (list* first rest)))))
(defun .cons ()
(.let* ((_ (.char= #\())
(car (.sexp))
(_ (.dot))
(cdr (.sexp))
(_ (.char= #\))))
(.identity (cons car cdr))))
(defun .read (&optional (parser (.sexp)))
(.prog2 (.optional (.whitespace))
parser
(.optional (.whitespace))))
(defun .sexp ()
(.read (.or (.list) (.atom))))
(defun .atom ()
(.first (.map 'string (.constituent))))
(defun .constituent ()
"http://www.lispworks.com/documentation/HyperSpec/Body/02_ad.htm"
(let ((non-constituent
(list* #\( #\) (whitespace))))
(.or (.is-not 'member non-constituent)
(.and (.char= #\\) (.item)))))
(defun .error (message &rest args)
(lambda (input)
(declare (ignore input))
(apply #'error message args)))
(defun .list-first ()
(.or (.progn (.read (.read (.char= #\.)))
(.error "List has nothing before after ."))
(.sexp)))
(defun .list-rest ()
(.or (.and (.read (.char= #\.))
(.or (.sexp) (.error "List has nothing after .")))
(.list :start nil :end nil)))
(defun .list (&key
(start (.char= #\())
(endchar #\))
(end (.char= endchar)))
(let ((false (gensym "false")))
(.let* ((_ (or start
(.identity nil)))
(first (.or (.list-first) (.identity false)))
(rest (if (not (eq first false))
(.or (.list-rest)
(.identity false))
(.identity false)))
(_ (if (or (listp rest) (eq false rest))
(.progn (.optional (.whitespace))
(or end (.identity nil)))
(.or (.char= endchar)
(.error "More than one object follows . in list.")))))
(.identity (if (not (eq false first))
(list* first (if (eq false rest) nil rest)))))))
(defun .list (&key
(start (.char= #\())
(end (.char= #\))))
(.or
(.let* ((start (or start
(.identity nil)))
(first (.optional (.sexp)))
(rest (if first
(.optional (.list :start nil :end nil))
(.identity nil)))
(end (.progn (.optional (.whitespace))
(or end (.identity nil)))))
(.identity (if first (list* first rest))))))
(defvar *nil* (gensym))
(defun .nil ()
(.and (.or (.string-equal "NIL")
(.string= "()"))
(.identity *nil*)))
(defun .list (&key delimited &aux (not (gensym)))
(.or (.nil)
(.let* ((_ (if (not delimited)
(prog1 (.char= #\()
(setf delimited #\)))
(.identity nil)))
(first (.sexp))
(dot (.optional (.dot)))
(rest (if dot
(.sexp)
(.or (.list :delimited #\))
(.and (.char= delimited)
(.identity :nil))))))
(.identity (list* first rest)))))
(defun .cons ()
(.let* ((_ (.char= #\())
(car (.sexp))
(_ (.dot))
(cdr (.sexp))
(_ (.char= #\))))
(.identity (cons car cdr))))
In the usual parenthesized syntax of Lisp, an s-expression is classically defined inductively as
- an atom, or
- an expression of the form (x . y) where x and y are s-expressions.
[…]
Most modern sexpr notations in addition use an abbreviated notation to represent lists in s-expressions, so that
(x y z) stands for
(x . (y . (z . NIL)))
where NIL is the special end-of-list symbol (written ‘() in Scheme).
; emacs : ≻ C-x 8 RET 227b RET
≕ C-x 8 RET 2255 RET
; gnome/X : Ctrl-Shift-u 227b
(let ((defpackage *package*))
(defmacro ≻ (form &key ((≕≻ provided-result) nil result-provided?)
(test ''equal))
(let* ((form-result (gensym))
(form-string (with-output-to-string (s) (print form s)))
(new-form (let ((*package* (find-package (package-name defpackage))))
(read-from-string form-string)))
(result (gensym))
(values (when (and result-provided? (listp provided-result))
(eq (first provided-result) 'cl:values))))
`(progn
(let* ((,form-result (multiple-value-list ,new-form))
(,result (if ,result-provided?
(funcall ,test ,(if values
form-result
`(first ,form-result))
,(if values
`(multiple-value-list ,provided-result)
provided-result))
t)))
(assert ,result
() "~A~% => ~S ~% ...should be :~% ~S"
',form ,form-result ',(if result-provided?
provided-result
"Something that evaluates to a non-NIL value"))
(apply #'values ,form-result))))))
#:.item
(defmacro smug> (form &key ((:=> expected-result) nil result-given?))
(let ((results (gensym))
(result (gensym)))
`(let* ((,results (multiple-value-list ,form))
(,result (first ,results)))
(multiple-value-prog1 (apply #'values ,results)
, (when result-given?
`(assert (equalp ,result ',expected-result)))))))
(set-macro-character
#\≻
(lambda (s c &aux (*readtable* (copy-readtable *readtable*)))
(declare (ignore c))
(set-macro-character #\≻ nil)
(let* ((form (read s))
(char (peek-char t s nil #\null))
(x (gensym))
(expected-result (if (char= char #\≕)
(progn (read s) (read s) (read s))
x)))
`(smug> ,form ,@(when (not (eq x expected-result))
`(:=> ,expected-result))))))
(defpackage :smug/tutorial
(:use :cl)
(:export
#:≻
#:.identity
#:.fail
#:.item
#:.bind
#:input-empty-p
#:input-first
#:input-rest
#:run
#:parse
#:.plus
#:.or
#:.not
#:.let*
#:.map
#:.concatenate
#:.is
#:.is-not
#:.char=
#:.char-equal
#:.string-equal
#:.string=
#:.progn
#:.prog1
#:.prog2
#:.and
#:.or
#:.not
#:.first
#:.optional
#:.read-line
))
(in-package :smug/tutorial)
<<tutorial_.letstar>>
<<tutorial_run>>
<<tutorial_.fail>>
<<tutorial_.plus>>
<<tutorial_.identity>>
<<tutorial_.bind>>
<<tutorial_.or>>
<<tutorial_.not>>
<<tutorial_reading-input>>
<<tutorial_.item>>
<<tutorial-source>>
<<tutorial_.list-of>>
<<tutorial_.satisfies>>
<<tutorial_.optional>>
<<tutorial_.and>>
<<tutorial_.progn>>
<<tutorial_.is-not>>
<<tutorial_.is>>
<<tutorial-.mapcar>>
<<tutorial-.mapc>>
<<tutorial-.make-list>>
<<tutorial-.concatenate>>
<<tutorial_.map>>
<<tutorial_char=digit-char|lower-case-p>>
<<tutorial-line>>
<<tutorial_.coerce>>
<<tutorial_.string=>>
<<tutorial_.char-equal>>
<<tutorial_.string-equal>>
<<tutorial_.first>>
<<tutorial-test-macro>>
[1] Monadic parser combinators (pdf, ps, bibtex) Graham Hutton and Erik Meijer. Technical Report NOTTCS-TR-96-4, Department of Computer Science, University of Nottingham, 1996.
– http://www.cs.nott.ac.uk/~gmh/bib.html#monparsing
[2] http://www.willamette.edu/~fruehr/haskell/seuss.html
3 like, say, scheme