Version 23 of grammar_peg

Updated 2011-01-23 11:11:29 by tomas

Actually several packages:

  • grammar::peg - Construction and manipulation of parsing expression grammars
  • grammar::peg::interp - Interpreter for parsing expression grammars.

Documentation:


3-9-8: I hope the following helps others who are trying to use the "grammar::peg" package. The commands were tested under "etcl" (Evolane Tcl/Tk Engine Version 1.0-rc26; see http://wiki.tcl.tk/15260 ).

1% # For now, just load the 'grammar::peg' package
2% package require grammar::peg
0.1

3% # Initialize a new (albeit empty) grammar.  The start symbol will be, by default, 'epsilon'.
4% grammar::peg myGrammar
::myGrammar

5% # add a nonterminal to it
6% ::myGrammar nonterminal add myDigit {/ {t 0} {t 1} {t 2}}

7% # show its serialization
8% ::myGrammar serialize
grammar::pegc {myDigit {/ {t 0} {t 1} {t 2}}} {myDigit value} epsilon

9% # change the start symbol to the nonterminal 'myDigit'
10% ::myGrammar start {/ {n myDigit}}

11% # show its serialization
12% ::myGrammar serialize
grammar::pegc {myDigit {/ {t 0} {t 1} {t 2}}} {myDigit value} {/ {n myDigit}}

13% # Now we show how you could, instead, have started with a list representing
14% # the grammar. Since "myGrammar" exists, we will make "myGrammarFromList".
15% package require struct::list
1.6.1

16% # (Incidentally, Andreas Kupries states: 'struct::list' is a package in Tcllib, like grammar::peg, and
17% #                  guesses that grammar::peg forgot to 'package require' it.)
18% set mySerialization [list grammar::pegc \
> {myDigit {/ {t 0} {t 1} {t 2}}} {myDigit value} {/ {n myDigit}} \
> ]
grammar::pegc {myDigit {/ {t 0} {t 1} {t 2}}} {myDigit value} {/ {n myDigit}}

19% grammar::peg myGrammarFromList deserialize $mySerialization
::myGrammarFromList

20% # Next we attempt to use (interpret) the grammar on an input string.
21% #Load all relevant packages (this may be overkill)
22% set pattern {grammar::me|grammar::peg}; foreach packageName [package names] {if [regexp $pattern $packageName] {puts "[catch {package require $packageName}]  $packageName"}}
0  grammar::peg::interp
0  grammar::peg
0  grammar::me::tcl
0  grammar::me::cpu
0  grammar::me::util
0  grammar::me::cpu::core
0  grammar::me::cpu::gasm

23% # Initialize the interpreter.
24% grammar::peg::interp::setup ::myGrammar
25% # As of 3-9-8, that's as far as I have gotten.  The documentation at http://tcllib.sourceforge.net/doc/peg_interp.html
26% # suggests using
27% #                            ::grammar::peg::interp::parse nextcmd errorvar astvar
28% # which "interprets the loaded grammar and tries to match it against the stream of characters represented by the command prefix nextcmd".
29% # So I put the string "102" in "myFile.txt" and tried the following but, as you can see, was unsuccessful.

30% set f [open c:/myFile.txt r+]
file37bf6c8
31% chan configure $f -encoding cp1252
32% set offset 0
0
33% ::grammar::peg::interp::parse $f myErrorVar myAstVar
invalid command name "file37bf6c8"
34% chan close $f
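
The error message suggests that the interpreter invokes nextcmd as a command, so a bare channel handle cannot be passed in. What follows is only a minimal sketch of a possible workaround: it assumes the four-element token layout used by the nexttok proc in the last example on this page, and the namespace ::demo with its procs load and nexttok are hypothetical names, not part of Tcllib.

```tcl
# Hypothetical token source: keeps the input text and a read position
# in namespace variables and hands out one character per call.
namespace eval ::demo {
    variable src ""
    variable pos -1
}

# Store the text to be parsed and rewind the read position.
proc ::demo::load {text} {
    variable src $text
    variable pos -1
}

# Return the next character as a four-element list. The element order
# (character, empty string, offset, 0) is an assumption copied from the
# nexttok proc shown later on this page.
proc ::demo::nexttok {} {
    variable src
    variable pos
    incr pos
    return [list [string index $src $pos] {} $pos 0]
}

# With a file, the text would first be read from the channel:
#   set f [open c:/myFile.txt r]; ::demo::load [read $f]; close $f
::demo::load "102"

# The command name, not a channel, would then be handed to the interpreter:
#   ::grammar::peg::interp::parse ::demo::nexttok myErrorVar myAstVar
```

Once the input is exhausted, string index returns an empty string, so the first element of the list is empty from then on.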

JBR I've made my own attempt to create a grammar::peg example. Maybe someone who knows can make this example work. Thanks.

 #!/usr/bin/env tclsh8.6
 #

 package require grammar::peg
 package require grammar::peg::interp

 proc parse-string { string } {
    coroutine next-char apply { { string } {
        yield

        set i 1
        foreach ch [split $string {}] {
            yield [list $ch 1 1 $i]

            incr i
        }

        while { 1 } { yield {} } } } $string
 }

 ::grammar::peg parser

 parser nonterminal add Digit { / { t 1 } { t 2 } { t 3 } { t 4 } { t 5 } }
 parser nonterminal add Int   { + { n Digit } }
 parser start { n Int }

 parse-string 54321

 grammar::peg::interp::setup parser
 grammar::peg::interp::parse next-char err ast

When run I get this:

 wrong # args: should be "ict_match_token tok msg"
    while executing
 "ict_match_token "Expected $ch""
    (procedure "MatchExpr" line 40)
    invoked from within
 "MatchExpr $e"
    (procedure "MatchExpr" line 236)
    invoked from within
 "MatchExpr $ru($nt)"
    (procedure "MatchExpr" line 75)
    invoked from within
 "MatchExpr $sub"
    (procedure "MatchExpr" line 159)
    invoked from within
 "MatchExpr $ru($nt)"
    (procedure "MatchExpr" line 75)
    invoked from within
 "MatchExpr $se"
    (procedure "grammar::peg::interp::parse" line 9)
    invoked from within
 "grammar::peg::interp::parse next-char err ast"
    (file "./ISBL" line 30)

This appears to be a bug in grammar::peg::interp. I changed line 116 of peg_interp.tcl to:

 ict_match_token $ch "Expected $ch"

After also changing my coroutine to keep returning EOF once the input is exhausted, the string parses and returns:

 ALL 0 4 {Int 0 4 {Digit 0 0 {{} 0 0}} {Digit 1 1 {{} 1 1}} {Digit 2 2 {{} 2 2}} {Digit 3 3 {{} 3 3}} {Digit 4 4 {{} 4 4}}}

Now to learn how to do something useful with this. The input tokens don't appear to be represented in the output AST?


Category Package, a part of Tcllib Category Parsing

AK - 2011-01-21 14:40:21

Regarding the representation of input tokens: no, they are not represented directly. The numbers in the AST give the character range covered by each symbol, as offsets from the beginning of the string. This allows you to extract the lexemes/tokens from the input string. See http://docs.activestate.com/activetcl/8.5/tcllib/pt/pt_parser_api.html for an example of the tree and its contents.
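
As a rough illustration of that offset lookup, the sketch below walks the AST printed in JBR's example and pulls each symbol's text out of the input with string range. The proc name ast_lexemes is hypothetical, and the node layout {name start end ?child ...?} is taken from that printed output rather than from the package documentation. It needs Tcl 8.5 or later for lassign and {*}.

```tcl
# Hypothetical helper: return a list of {name lexeme} pairs for every
# named node of an AST whose nodes look like {name start end ?child ...?}.
proc ast_lexemes {ast input} {
    lassign $ast name start end
    set result {}
    # Nodes with an empty name (the terminal matches) are skipped.
    if {$name ne ""} {
        lappend result [list $name [string range $input $start $end]]
    }
    foreach child [lrange $ast 3 end] {
        lappend result {*}[ast_lexemes $child $input]
    }
    return $result
}

# Input string and AST as shown in the example above.
set input 54321
set ast {ALL 0 4 {Int 0 4 {Digit 0 0 {{} 0 0}} {Digit 1 1 {{} 1 1}} {Digit 2 2 {{} 2 2}} {Digit 3 3 {{} 3 3}} {Digit 4 4 {{} 4 4}}}}

foreach pair [ast_lexemes $ast $input] {
    puts [join $pair ": "]
}
```

For this input the Int node covers offsets 0 to 4, i.e. the whole string, while each Digit node covers a single character.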


tomas 2011-01-23

Here's a more complete example. It tries to implement the "classical" expression grammar, with (decimal integer) numbers, the four arithmetic operations and parentheses.

To note:

  * PEGs look a lot like BNF. They are not the same (as noted in the comments). See <http://en.wikipedia.org/wiki/Parsing_expression_grammar>
    for a discussion. I still lean on this similarity in the comments to make clear what the PEG statements "mean".
  * I stumbled upon the bug mentioned above and tried to file it in the SourceForge bug tracker: <https://sourceforge.net/tracker/?func=detail&atid=112883&aid=3163541&group_id=12883>
  * The program tries to visualize what's going on by printing the AST's structure with the (recursive) function print_nested.
  * Grokking whitespace is left as an exercise for the reader :-)

Comments very welcome!

====

 #!/usr/bin/tclsh
 #
 # peg: playing around with peg
 package require grammar::peg
 package require grammar::me::tcl
 package require grammar::peg::interp

 grammar::peg expression

 # PEGs ain't BNF, but the similarities are striking.
 # One important difference: the choices {/ ...} are _ordered_
 # choices, i.e. the first alternative that matches wins and the
 # remaining ones are never tried. See
 #  <http://en.wikipedia.org/wiki/Parsing_expression_grammar>
 # for the gory details.

 # Therefore the comments evoking BNF are just to tickle the
 # reader's gentle associative memory (or was it the associative
 # reader's gentle memory? Ah -- you know what I mean).

 # Sign ::= '+' | '-'
 expression nonterminal add Sign {/ {t +} {t -}}
 # Digit ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
 expression nonterminal add Digit {/ {t 0} {t 1} {t 2} {t 3} {t 4} {t 5} {t 6} {t 7} {t 8} {t 9}}
 # Addop ::= '+' | '-'
 expression nonterminal add Addop {/ {t +} {t -}}
 # Mulop ::= '*' | '/'
 expression nonterminal add Mulop {/ {t *} {t /}}

 # Number ::= Sign? Digit+
 expression nonterminal add Number {x {? {n Sign}} {+ {n Digit}}}
 # Factor ::= '(' Expression ')' | Number
 expression nonterminal add Factor {/ {x {t (} {n Expression} {t )}} {n Number}}
 # Term ::= Factor ( Mulop Factor )*
 expression nonterminal add Term {x {n Factor} {* {x {n Mulop} {n Factor}}}}
 # Expression ::= Term ( Addop Term )*
 expression nonterminal add Expression {x {n Term} {* {x {n Addop} {n Term}}}}
 # Start here:
 expression start {n Expression}

 puts [expression serialize]

 grammar::peg::interp::setup expression

 set i -1
 # The expression to be parsed. No whitespaces!
 set src {1024+256}

 proc nexttok {} {
   global src
   global i
   return [list [string index $src [incr i]] {} $i 0]
 }

 set errs {}
 set ast {}

 grammar::peg::interp::parse nexttok errs ast

 puts "errs = $errs"
 puts "ast = $ast"

 proc print_nested {ast {prefix ""}} {
   global src
   puts "$prefix [lindex $ast 0]: [string range $src [lindex $ast 1] [lindex $ast 2]]"
   foreach sub [lrange $ast 3 end] {
     print_nested $sub "$prefix  "
   }
 }

 print_nested $ast
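
====

The remark about ordered choice in the comments above can be made concrete with a toy that is independent of grammar::peg. The proc ordered_choice is a hypothetical helper written only for this illustration: it tries literal alternatives in the order given and stops at the first one that matches.

```tcl
# Hypothetical helper, not part of grammar::peg: ordered choice over
# literal strings. Returns the first alternative that matches $input
# at offset $pos, or an empty string if none does.
proc ordered_choice {alternatives input {pos 0}} {
    foreach alt $alternatives {
        set last [expr {$pos + [string length $alt] - 1}]
        if {[string range $input $pos $last] eq $alt} {
            # First match wins; later alternatives are not examined.
            return $alt
        }
    }
    return ""
}

puts [ordered_choice {a ab} "ab"]   ;# "a" matches first, "ab" is never tried
puts [ordered_choice {ab a} "ab"]   ;# with the order swapped, "ab" matches
```

In a BNF grammar the alternatives 'a' | 'ab' are unordered and both are candidates. In a PEG the second alternative of {/ {t a} ...} style choices is shadowed whenever the first one succeeds, so the order of alternatives matters when one is a prefix of another.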