ptparser

SS 23Mar2004 - ptparser is a Tcl parser in pure Tcl. It comes from the Sugar macro system, but can be useful alone. The following is the code and a very simple example that shows how to use it.

 # ptparser - A Tcl parser in pure Tcl.
 # Copyright (C) 2004 Salvatore Sanfilippo
 # Under the same license as Tcl version 8.4

 # The parser. It does not discard info about space and comments.
 # The return value is the "type" of the token (EOF EOL SPACE TOKEN).
 #
 # It may be interesting to note that this is half of a simple
 # Tcl interpreter.
 #
 # The fact that it is still so simple, compared to what it can
 # be in Python just to say one (much worst in Perl), it's an advice
 # that to add syntax to Tcl is a bad idea.
 namespace eval ptparser {}

 # Initialize the state of the interpreter.
 # Currently this parser is mostly stateless, it only needs
 # to save the type of the last returned token to know
 # if something starting with '#' is a comment or not.
 proc ptparser::parserInitState statevar {
     upvar $statevar state
     set state [list EOL]
 }

 proc ptparser::parser {text tokenvar indexvar statevar {dosubst 0}} {
     upvar $tokenvar token $indexvar i $statevar state
     set token {}
     set inside {}
     set dontstop $dosubst
     while 1 {
        # skip spaces
        while {!$dontstop && [string match "\[ \t\]" [string index $text $i]]} {
            append token [string index $text $i]
            incr i
        }
        # skip comments
        if {$state eq {EOL} && !$dontstop && [string equal [string index $text $i] #]} {
            while {[string length [string index $text $i]] &&
                  ![string match [string index $text $i] \n]} \
            {
                append token [string index $text $i]
                incr i
            }
        }
        # return a SPACE token if needed
        if {[string length $token]} {return [set state SPACE]}
        # check for special conditions
        if {!$dontstop} {
            switch -exact -- [string index $text $i] {
                {} {return [set state EOF]}
                {;} -
                "\n" {
                    append token [string index $text $i]
                    incr i
                    return [set state EOL]
                }
            }
        }
        # main parser loop
        while 1 {
            switch -exact -- [string index $text $i] {
                {} break
                { } -
                "\t" -
                "\n" -
                ";" {
                    if {!$dontstop} {
                        break;
                    }
                }
                \\ {
                    incr i
                    append token \\ [string index $text $i]
                    incr i
                    continue
                }
                \" {
                    if {[string equal $inside {}]} {
                        incr dontstop
                        set inside \"
                        append token \"
                        incr i
                        continue
                    } elseif {[string equal $inside \"]} {
                        incr dontstop -1
                        set inside {}
                        append token \"
                        incr i
                        continue
                    }
                }
                "\{" {
                    if {[string equal $inside {}]} {
                        incr dontstop
                        set inside "\{"
                        append token "\{"
                        incr i
                        continue
                    } elseif {[string equal $inside "\{"]} {
                        incr dontstop
                    }
                }
                "\}" {
                    if {[string equal $inside "\{"]} {
                        incr dontstop -1
                        if {$dontstop == 0} {
                            set inside {}
                            append token "\}"
                            incr i
                            continue
                        }
                    }
                }
                \$ {
                    if {![string equal $inside "\{"]} {
                        if {![string equal [string index $text [expr {$i+1}]] $]} {
                            set res [ptparser::substVar $text i]
                            append token "$$res"
                            continue
                        }
                    }
                }
                \[ {
                    if {![string equal $inside "\{"]} {
                        set res [ptparser::substCmd $text i]
                        append token "\[$res\]"
                        continue
                    }
                }
            }
            append token [string index $text $i]
            incr i
        }
        return [set state TOK]
     }
 }

 # Actually does not really substitute commands. You can
 # perform a real substitution if you like.
 proc ptparser::substCmd {text indexvar} {
     upvar $indexvar i
     set go 1
     set cmd {}
     incr i
     while {$go} {
        switch -exact -- [string index $text $i] {
            {} break
            \[ {incr go}
            \] {incr go -1}
        }
        append cmd [string index $text $i]
        incr i
     }
     string range $cmd 0 end-1
 }

 # Get the control when a '$' (not followed by $) is encountered,
 # extract the name of the variable, and return it.
 proc ptparser::substVar {text indexvar} {
     upvar $indexvar i
     set dontstop 0
     set varname {}
     incr i
     while {1} {
        switch -exact -- [string index $text $i] {
            \[ -
            \] -
            "\t" -
            "\n" -
            "\"" -
            \; -
            \{ -
            \} -
            \$ -
            ( -
            ) -
            { } -
            "\\" -
            {} {
                if {!$dontstop} {
                    break
                }
            }
            ( {incr dontstop}
            ) {incr dontstop -1}
            default {
                append varname [string index $text $i]
            }
        }
        incr i
     }
     return $varname
 }

 #################### TEST CODE #######################

 ptparser::parserInitState state
 set script [info body ptparser::parserInitState]
 set index 0
 while 1 {
     set type [ptparser::parser $script token index state]
     puts "$type \"$token\""
     if {$type eq {EOF}} break
 }

The output generated by the TEST CODE above can be useful to understand what's the parser output before to download/try it.

 EOL "
 "
 SPACE "    "
 TOK "upvar"
 SPACE " "
 TOK "$statevar"
 SPACE " "
 TOK "state"
 EOL "
 "
 SPACE "    "
 TOK "set"
 SPACE " "
 TOK "state"
 SPACE " "
 TOK "[list EOL]"
 EOL "
 "
 EOF ""

escargo 22 Mar 2004 - Looking at this, I wonder a couple of things. First, being lazy, I wonder how a quotation mark is represented in the token stream. Second, might { and } be better delimiters than quotation marks? That way, if you wrote EOF, SPACE, TOK, and EOF commands, you could use the output of the parser as input directly without having to worry about evaluation of the associated value strings. - RS: The quotes in the demo output are produced by the demo code above, not the parser himself. For brace quoting (where needed), just change the puts line to

 puts [list $type $token]

Lars H: As the author of parsetcl, I thought I should point that out as an alternative. Also I wonder what the proper terminology is for these things are. ptparser merely seems to generate a stream of tokens. parsetcl returns a tree. I think there are different names for these two functions.

For parsing Tcl code in general it isn't sufficient to handle just the basic Tcl syntax rules -- one must also be able to parse strings-as-lists (which is very similar to parsing scripts, but not identical) and expr expressions (which is very different). But perhaps you have solutions to that in other components of your Sugar, SS.

SS: Yes I've to deal with expr, so Sugar have a special parser for it, but it's very simple because all I need to do is to identify command substitutions inside expr's expressions, I don't need to really parse it. About lists, because the parser is in Tcl, and Tcl can parse Tcl list, why to add this inside the parser? A final bit, to turn etparser in a parser that returns a tree is very simple, but I think that's also useless, because you don't know what's a script and what's a string in Tcl just locking at the source (and should not be assumed from the name of the command *inside* the parser), my feeling is that a decent solution is that the parser's user will call the parser recursively against arguments that are belived to be scripts. That's how Sugar works, and why it is able to expand commands at any deepth inside a Tcl script.

Lars H: I was aiming at cross-referencing code rather than transforming it, which leads to slightly different needs -- in particular making it necessary to keep track of character positions relative to the original chunk of text. The built-in list parser doesn't keep track of that, and some Tcl commands have arguments that are lists. In particular (one form of) switch has an argument that is a list whose elements may be scripts ... But you have probably noticed that already. About trees: There are many other things than strings that are scripts which give rise to a tree structure. Consider

  list [set a] "xxx $Arr(yyy,$z,[www $u "x \n"]) sss!"

Quite a lot of the height of trees come from the ways in which substitutions can be nested, and that is part of the core Tcl syntax.

SS: Yes, I agree, the transformation have different needs. Actually you are right about switch: I'm using Tcl list commands to access the second form that can be used with this command, and it's the only sugar macro that breaks the indentation of the original code, for all the rest being etpaser able to return all the blank spaces, it's possible to take the indentation unaltered. In fact, I may rewrite the switch macro in order to use the parser itself to get every element of the list: for the first level of the list this should work anyway. And again from the different needs of the transformation there is the *need* to get every token that is a single argument as a single entity, even if contains any kind of command/variable expansion. that's why in my case to get "xxx $Arr(yyy,$z,[www $u "x \n"]) sss!" as a single argument is actually a goal: a trivial macro called [first] will receive one argument in all the cases, and will just use it to expand to [lindex <this-argument> 0].

Btw, the parser above *is* actually able to parse all this, because it was used for a different goal previously. As you can see the substCmd and substVar procedures are there for this and in this version of the parser are just returing what they parse for concatenation with the rest of the token. They will not stay longer without to be used because I'll probably add 'varmacro' macro type to sugar that will be called for every single variable expansion in the code, as plain arguments, inside "", interpolated together other commands/vars, and so on.


SS: Final note: expr syntax is not part of the Tcl syntax. It is important to support it because core commands implicitly call expr, and expr itself is the way to do math in Tcl (waiting for TIP174...), but should not be *formally* considered part of the Tcl syntax.


Category Package