Syntax parsing in Tcl

Richard Suchenwirth 2001-05-28 - The following will look immediately familiar to friends of formal syntax:

 set rules {
    s  {np vp}
    np {det n}
    np {npr}
    np {det adj n}
    vp {vi}
    vp {vt np}
    det "the"
    n   "dog"
    n   "man"
    npr "John"
    adj "lazy"
    adj "old"
    vi  "walks"
    vt  "feeds"
 }

(VL 11 jun - yes for the ps-freaks ps as in phrase structure rules not postscript... ;-) )

... a small grammar of English, consisting of substitution/expansion rules, and a minimal "dictionary". To apply this syntax to test sentences, the following minimal parser (which takes the term "substitution rules" literally with regsub) is sufficient:

 proc parse0 {s rules} {
    set fired 1; set s " $s "
    while {$fired} {
        set fired 0
        foreach {lhs rhs} $rules {
            incr fired [regsub -all " $rhs " $s " $lhs " s]
        }
    }
    set s
 }

It tries to repeatedly apply all rules until none fires any more, and returns the remaining list. If that equals s, the input was a well-formed sentence (according to the syntax).

Lars H (11 Jun 2003): How about using [string map] instead of [regsub]? That is much faster for fixed string replacements. It requires some preprocessing of the rules, though. - RS: wrote this years ago when I could not rely on string map being available on all my boxes - and speed is not the worry of this little plaything ;-)

More elaborated, we want to see what tree structure is parsed out of an input. This requires slightly more work:

 proc skim L {
    # returns all first elements of a list of lists
    set res [list]
    foreach i $L {lappend res [lindex $i 0]}
    set res
 }
 proc lmatch {list sublist} {
    # returns start and end positions of first occurrence of sublist in
    # list, or an empty list if not found
    set lsub [llength $sublist]
    for {set i 0} {$i<=[llength $list]-$lsub} {incr i} {
        set to [expr {$i+$lsub-1}]
        if {[lrange $list $i $to]==$sublist} {
            return [list $i $to]
        }
    }
    return {}
 }
 ######## enhanced parser, returns parse tree list
 proc parse1 {s rules} {
    set fired 1
    while {$fired} {
        set fired 0
        foreach {lhs rhs} $rules {
            if [llength [set t [lmatch [skim $s] $rhs]]] {
                foreach {from to} $t break
                set s [lreplace $s $from $to \
                        [eval list $lhs [lrange $s $from $to]]]
                incr fired
            }
        }
    }
    set s
 }

Of course, these micro-parsers have severe limitations: they can't detect ambiguities (like in He saw the man with the binoculars; but you also cannot express that "walks" can be intransitive (vi) or transitive (vt, "John walks the dog"), and the order of rules matters. But they are a nice starting point for syntax experiments (maybe one other weekend, the parse trees need some visualisation on a canvas...) - May 28,2001: Done - see Simple tree layout RS


AK -- With respect to working with trees I recommend to use the tree structure in tcllib. Working with that should be easier than direct manipulation of nested lists. It especially makes additional treewalks for modification of a syntax tree much easier. For visualization see VCG and other layouting tools (most are not Tk based).

Additional ideas: Recognition of sentences with a certain structure inside of gobs of text and highlighting them (text widget), or processing them in some way beyond simple recognition. For example numbers (i.e. two million four hundred twenty three).

RS -- See also English number reader which does the processing of such strings, but without full parsing... and, specialized, Expression parsing.

AK -- Another option is Yeti

jt -- Yet another option is taccle. You'll probably want a good scanner (insert cheap plug for fickle here).

Daniele Alberto GALLIANO I have two interesting small programs about this concepts, that are, more or less, porting of etok from C/C++ to Tcl. I added several rules in the grammar. Since sources are a bit longer than usual, I please You to refer to Universal Translator.


Gerald Lester (June 12, 2003) - Since english words can be multiple parts of speach (such as the word sail), evaluating the "sentence" with each possible meaning of the word and rating all of the resultant "parses" against the current context, then choosing the "best" fit to execute has been shown to work. As you narrow the context, you get better and better fits.


Category Human Language Arts and crafts of Tcl-Tk programming Category Parsing Category String Processing