Version 0 of Parser Combinators

Updated 2005-06-05 01:06:02 by NEM

interp alias {} comment {} if 0 comment { Discussions recently on this wiki about parsing techniques [L1 ] reminded me that I've been meaning to have a play with defining a simple parser combinator library in Tcl, based on the techniques used in Haskell [L2 ]. After my previous experiments with Monadic TOOT it should be fairly straight-forward to translate the monadic parser combinator paper for use in Tcl. It also demonstrates a little more concretely why monads are a cool (if slightly bewildering at first) technique. Here, a monad is used to sequence parsers to build larger parsers. A parser in this case is a function from a string to one or more possible parses, represented as a list of parsed info and remaining unparsed string pairs. The result is an elegant way of building a parser using recursive descent. The code isn't going to win any prizes for efficiency, and is heavily recursive in places, so use with care.

At present this code is 8.5 only, but only for dict as far as I can tell. You might be able to get it working using the standalone dict package for 8.4. }

 # Monadic parser combinator library. Written using [TOOT] techniques, and
 # based on the paper "Monadic Parsing in Haskell" Graham Hutton and Erik
 # Meijer, Journal of Functional Programming, 8(4):437--444, July 1998.
 # See http://www.cs.nott.ac.uk/~gmh/bib.html#pearl
 #
 # Copyright (c) 2005 Neil Madden ([email protected])
 package require Tcl 8.5

 # newtype Parser a = Parser (String -> [(a,String)])
 # A Parser is a function from strings to a list of token,string pairs which
 # represent a parse sequence. Each pair consists of a typed item which is
 # the parsed representation, and the remaining unparsed string suffix.
 namespace eval Parser {
    namespace export {[a-z]*}
    # Ensemble-ify
    if {[catch {namespace ensemble create}]} {
         # Pre 8.5
         proc ::Parser {method args} {
             uplevel 1 [linsert $args 0 ::Parser::$method]
         }
    }
    # Simple constructor
    proc create {args} {
         list Parser: [uplevel 1 [linsert $args 0 lambda]]
    }

    # Implement the monad interface, which allows us to sequence parsers
    # together in interesting ways.

    # ret        :: a -> Parser a
    # Injects a value into the Parser monad. Returns a parser which when
    # given a string, simply returns the given token and leaves the string
    # untouched. This is called simply "return" in Haskell, but that name
    # is already taken in Tcl, so we use "ret" instead.
    proc ret a {
         create cs { list $a $cs }
    }

    # >>=        :: Parser a -> (a -> Parser b) -> Parser b
    # Creates a parser which is a combination of two other parsers. The
    # resulting parser feeds the input string into the first parser and then
    # tries each possible parse by feeding the resulting suffix strings into
    # the second parser. This is the fundamental operation of monadic
    # programming (the bind/sequencing op).
    proc >>= {p f} {
         create cs {
             set ret [list]
             foreach {a ccs} [$p parse $cs] {
                 lappend ret [[$f $a] parse $ccs]
             }
             # Flatten the resulting list
             join $ret
         }
    }

    # MonadZero instance
    # No-op parser, which simply fails to parse anything.
    proc zero {} { create cs { list } }

    # MonadPlus instance. This is used to combine the results of two parsers
    # (effectively creating a choice between them). This is done by simply
    # concatenating the result lists of the parsers. For instance, if you
    # had a grammar with a production:
    #  Foo ::= Bar | Jim
    # Then you could code that up as:
    #  set Foo [$Bar ++ $Jim]
    # We use the "++" notation as that is what is used in the paper.
    proc ++ {p q} {
         create cs {
             concat [$p parse $cs] [$q parse $cs]
         }
    }

    # Deterministic version of ++ -- returns only first result
    proc +++ {p q} {
         create cs {
             set ret [[$p ++ $q] parse $cs]
             if {[llength $ret]} {
                 # First result only. In our scheme the result list is flat,
                 # of form {token string token string ...} rather than
                 # Haskell's [(token, string), (token,string), ...]
                 lrange $ret 0 1
             } else {
                 list
             }
         }
    }

    # Just unpack the parser function and apply it to the given input
    # string.
    proc parse {p cs} { 
         [lindex $p 1] $cs 
    }

    # A little syntactic sugar. Does a simple version of Haskell's do
    # notation. Converts a script separated by semi-colons into monadic
    # sequenced form, e.g.:
    #  do { a <- p1; p2; b <- p3; Parser ret [list $a $b] }
    # becomes:
    #  p1 >>= [lambda a { p2 >>= [lambda _ { p3 >>= [lambda b {
    #                Parser ret [list $a $b]
    #  }]}]}]
    # This version is a bit more robust than the version on [Monadic TOOT],
    # but still cannot handle nested do-scripts. Also, the use of
    # semi-colons as a separator char may be a bit subtle given that they
    # are usually optional in Tcl.
    proc do {script} {
         set eval ""
         set num 0
         foreach line [lrange [split $script \;] 0 end-1] {
             set line [string trim $line]
             if {[string length $line]} {
                 if {[regexp {(.*)<-(.*)} $line -> var comp]} {
                     append eval "\n[string repeat { } $num]\[$comp\] >>= \[lambda $var \{"
                 } else {
                     append eval "\n[string repeat { } $num]\[$line\] >>= \[lambda _ \{"
                 }
                 incr num
             }
         }
         append eval \n[lindex [split $script \;] end]
         append eval [string repeat "\n\}\]" $num]
         uplevel 1 $eval
    }

    # Type dispatch function -- part of [TOOT]s magic.
    proc ::Parser: {p args} {
         if {[llength $args]} {
             set method [lindex $args 0]
             uplevel 1 [lreplace $args 0 0 ::Parser::$method [list Parser: $p]]
         } else {
             return [list Parser: $p]
         }
    }
 }

 # A helper method. This is a lexically-scoped lambda construct. Variables
 # to be captured from the lexically-enclosing scope can either be specified
 # explicitly by using the form [lambda params statics body], or if the
 # statics argument is omitted then all local vars from the current scope
 # are captured (actually, snap-shotted) to become part of the (immutable)
 # lexical closure of the lambda. If this is gibberish to you, don't panic!
 # All it means is that code such as:
 #  set a 12
 #  set foo [lambda {} { puts "a = $a" }]; $foo
 # will do the right thing (i.e. print "a = 12"), instead of complaining
 # that a is not a variable.
 proc lambda {params args} {
    if {[llength $args] == 1} {
         set body [lindex $args 0]
         # Get 'em all!
         set statics [uplevel 1 info locals]
    } else {
         foreach {statics body} $args { break }
    }
    set scope {}
    foreach vname $statics {
         upvar 1 $vname var
         dict set scope $vname $var
    }
    list lambda: [list $params $body $scope]
 }
 # Evaluates a lambda term with the supplied arguments. Simple, but
 # should work ok.
 proc lambda: {lambda args} {
    foreach {params body scope} $lambda break
    foreach name $params value $args {
         dict set scope $name $value
    }
    evalInScope $scope $body
 }
 # Safer version of [dict with] which doesn't clobber its parent environment
 proc evalInScope {env body} {
    dict with env $body
 }
 # TOOT's auto-expand magic:
 if {[llength [info commands ::__toot_unknown]] == 0} {
    rename ::unknown ::__toot_unknown
    proc ::unknown {cmd args} {
         if {[llength $cmd] > 1} {
             uplevel 1 $cmd $args
         } else {
             uplevel 1 [linsert $args 0 ::__toot_unknown $cmd]
         }
    }
 }
 # Another little helper -- creates a unified var/command thing.
 proc def {name value} {
    upvar #0 $name var
    set var $value
    interp alias {} $name {} ::DefReadVar $name
 }
 proc DefReadVar {name args} {
    upvar #0 $name var
    if {[llength $args]} {
         uplevel 1 $var $args
    } else {
         return $var
    }
 }

comment { Now we have some basic infrastructure in place, let's start writing some actual parsers. }

 # Simple parser -- consumes first character, if there is one, or fails
 # otherwise.
 # item :: Parser Char
 # item = Parser (\cs -> case cs of
 #                                ""        -> []
 #                                (c:ccs) -> [(c,ccs)])
 def item [Parser create cs {
    if {[string length $cs]} {
         list [string index $cs 0] [string range $cs 1 end]
    } else {
         list
    }
 }]

 # p :: Parser (Char,Char)
 # Takes the 1st and 3rd characters from a string
 def p [item >>= [lambda c {
    item >>= [lambda _ {
    item >>= [lambda d {
         Parser ret [list $c $d]
    }]}]}]]
 # Same, but using do notation. We will use do notation pretty much
 # exclusively from here on, for obvious reasons!
 def p2 [Parser do {
    c <- item;
    item;
    d <- item;
    Parser ret [list $c $d]
 }]

 proc const {a} { lambda b { return $a } }

 # sat         :: (Char -> Bool) -> Parser Char
 # A combinator which takes a predicate and yields a parser that consumes
 # characters only if they satisfy the predicate.
 proc sat p {
    Parser do {
         c <- item;
         if {[$p $c]} {
             Parser ret $c
         } else {
             Parser zero
         }
    }
 }
 # char         :: Char -> Parser Char
 # Returns a parser which matches a single character
 proc char c { sat [lambda x { string equal $x $c }] }

 # String :: String -> Parser String
 # Match a specified string
 proc String s {
    if {[string length $s]} {
         set c  [string index $s 0]
         set cs [string range $s 1 end]
         Parser do {
             char $c;
             String $cs;
             Parser ret $s
         }
    } else {
         Parser ret ""
    }
 }

 # many        :: Parser a -> Parser [a]
 # Kleene-star operator. Applies the given parser 0 or more times.
 # Equivalent to * regexp modifier.
 proc many p {
    [many1 $p] +++ [Parser ret [list]]
 }
 # 1 or more version of above (equivalent to + regexp modifier).
 proc many1 p {
    Parser do {
         a <- $p;
         as <- many $p;
         Parser ret [linsert $as 0 $a]
    }
 }

 # Repeated applications of parser p, separated by applications of parser sep
 # whose result values are thrown away. e.g. sepby [char a] [char ,] will
 # match a sequence of 0 or more "a"s separated by commas, such as "a,a,a".
 # sepby :: Parser a -> Parser b -> Parser [a]
 proc sepby {p sep} {
    [sepby1 $p $sep] +++ [Parser ret [list]]
 }
 proc sepby1 {p sep} {
    # Simple do notation doesn't handle nesting, so we resort to explicit
    # sequencing for the inner "many" loop in here: 
    Parser do {
         a <- $p;
         as <- many [$sep >>= [lambda _ { return $p }]];
         Parser ret [linsert $as 0 $a]
    }
 }

 # chainl :: Parser a -> Parser (a -> a -> a) -> a -> Parser a
 # Parses a sequences of values separated by applications of an operator
 # parser which yields an operation which is used to combine values being
 # parsed. Like a cross between sepby and foldl.
 proc chainl {p op a} {
    [chainl1 $p $op] +++ [Parser ret $a]
 }

 proc chainl1 {p op} {
    Parser do {
         a <- $p;
         rest $a $p $op
    }
 }
 # Helper for chainl1
 proc rest {a p op} {
    [Parser do {
         f <- $op;
         b <- $p;  
         rest [$f $a $b] $p $op
    }] +++ [Parser ret $a]
 }

comment { All seems to be working ok so far. We'll leave out the chainr/chainr1 parser combinators as done in the paper. Shouldn't be too difficult to work out. Now we move on to the section on Lexical combinators which shows how we can avoid the lexing/parsing distinction by defining combinators to do the lexing. }

 # Whitespace
 proc isSpace {char} { regexp {\s} $char }
 def space [many [sat isSpace]]
 # Parse a token and discard trailing space
 proc token p {
     Parser do {
         a <- $p;
         space;
         Parser ret $a
     }
 }
 # Parse a symbolic (string) token:
 proc symb cs { token [String $cs] }
 # Apply a parser, p, discarding any leading space:
 # apply :: Parser a -> String -> [(a,String)]
 proc apply {p cs} {
     [Parser do { space; $p }] parse $cs
 }

comment { The final example of the paper is to implement a simple expression evaluator, which uses the following grammar:

    expr        ::= expr addop term | term
    term        ::= term mulop factor | factor
    factor      ::= digit | ( Expr )
    digit       ::= 0 | 1 | ... | 9
    addop       ::= + | -
    mulop       ::= * | /

We have to define in reverse order to in the paper, so that the correct definitions are set up in the correct order. }

 # We can be a bit more concise than Haskell here, as we don't have to
 # distinguish between "+" the string and "+" the operator, as Everything Is
 # A String!
 def addop [[symb +] ++ [symb -]] ;# addop ::= + | -
 def mulop [[symb *] ++ [symb /]] ;# mulop ::= * | /
 # Likewise, we don't have to convert between char and int here...
 # digit ::= 0 | 1 | ... | 9
 def digit [Parser do {
     x <- token [sat isDigit];
     Parser ret $x
 }]
 # Extend the parser to deal with multiple digit numbers...
 # number ::= digit +
 def number [[many1 digit] >>= [lambda ds { Parser ret [join $ds ""] }]]
 # factor ::= number | ( Expr )
 def factor [number +++ [Parser do {
     symb "(";
     n <- Expr;
     symb ")";
     Parser ret $n
 }]]
 # term ::= term mulop factor | factor
 def term [chainl1 factor mulop]
 # Expr ::= Expr addop term | term
 def Expr [chainl1 term addop]

 # Some helpers:
 foreach op {+ - * /} { proc $op {a b} [format {expr {$a %s $b}} $op] }
 proc isDigit d { regexp {\d} $d }

 # And now a little test:
 puts " 1 - 2 * 3 + 4 = [apply Expr { 1 - 2 * 3 + 4 }]"
 puts "12 * 52 / 64 = [apply Expr {12 * 52 / 64 }]"
 puts "time = [time { apply Expr {12 * 52 / 64 } } 20]"

comment { To me, this is what parsing should be like: elegant and straight-forward (once the infrastructure is in place). It'll take quite a bit of work to get it up to "industrial-strength" (like the Parsec library [L3 ] for Haskell). For instance, it takes 450592 microseconds per iteration for that last test on my iBook 800MHz G4! Most of that is due to the overhead of TOOT which involves lots of extra function calls and unknown-command trickery. It'd be an interesting project to see how far this could be taken from fun demo to a useful level of efficiency. }