Version 2 of Cloverfield - Parser

Updated 2008-02-08 17:02:48 by andy

AMG, 7 Feb 2008: Here's my attempt at a Cloverfield parser, written in Tcl:

proc parse {text {index_var ""} {nesting none}} {
    set out_scr {}      ;# List of commands.
    set out_cmd {}      ;# List of words.
    set out_word {}     ;# List of syllables.
    set out_mods {}     ;# List of modifiers on the word.
    set out_syl ""      ;# Literal character string.
    set state space     ;# Current state.
    set quote none      ;# Word quoting style.
    set begin 1         ;# Beginning of the command, before the first word?
    set backslash 0     ;# Inside a backslash sequence?
    set bsseq ""        ;# Characters in the backslash sequence.
    set braces 0        ;# Number of nested brace pairs.
    set brquote 0       ;# Inside double quotes inside brace pairs?
    set brcomment 0     ;# Inside a comment inside brace pairs?
    set brspace 0       ;# Inside whitespace inside brace pairs?
    set space {" " \f \n \r \t \v}  ;# Whitespace.
    set wordsep [list {*}$space \;] ;# Word delimiters.
    set cmdsep {; \n #}             ;# Command delimiters.
    set octal {0 1 2 3 4 5 6 7}     ;# Octal digits.

    # Use the caller's index variable, if provided, or start from the beginning.
    if {$index_var ne ""} {
        upvar 1 $index_var index
    } else {
        set index 0
    }

    if {$nesting eq "command"} {
        lappend wordsep ]
        lappend cmdsep ]
    } elseif {$nesting eq "variable"} {
        # TODO: some stuff
    }

    if {$::verbose} {puts "========== begin =========="}

    # Process each character of the input.
    for {} {$index < [string length $text]} {incr index} {
        set char [string index $text $index]

        # Everybody loves state machines.
        switch -- $state {
        space {
            # Currently in whitespace.  Alternately, the previous character was
            # a closing brace, double quote, or parenthesis, in which case this
            # character *should* be whitespace, except in the case of word
            # modifiers.
            set nonspace 0
            if {$backslash} {
                # Inside a backslash sequence.  Handle this character specially.
                if {$char ne "\n"} {
                    # Anything other than backslashed newline is part of a word.
                    set nonspace 1
                }
                set backslash 0
            } elseif {$char in $cmdsep} {
                # Unbackslashed command delimiter.  Note: the beginning of a
                # line comment is the end of a command.
                if {$char eq "#"} {
                    # If this is the beginning of a line comment, ignore
                    # everything through the end of the comment.
                    set state comment
                }
                if {!$begin} {
                    # Append the previous word to the output command.
                    lappend out_cmd $out_word
                    set out_word {}
                }
                if {[llength $out_cmd] != 0} {
                    # Append the command to the script, and start working on
                    # the next command.
                    lappend out_scr $out_cmd
                    set out_cmd {}
                    set begin 1
                }
                if {$char eq "]"} {
                    # A close bracket signifies the end of a nested script.
                    break
                }
            } elseif {$char eq "\\"} {
                # Unbackslashed backslash.  This may or may not be the start of
                # a word, depending on the next character.  Handle the next
                # character specially.
                set backslash 1
            } elseif {$char ni $space} {
                # Unbackslashed double quotes.
                set nonspace 1
            }
            if {$nonspace} {
                # This character is not whitespace.  It is the start of a word.
                if {$quote ne "none"} {
                    # The previous character was a closing quote of some kind.
                    switch -- $quote {
                    quote {error "extra characters after close-quote"}
                    paren {error "extra characters after close-parenthesis"}
                    brace {
                        error "word modifiers not implemented"
                        error "bad word modifier or extra characters\
                               after close-brace"
                    }}
                }
                if {!$begin} {
                    # Append the previous word to the output command.
                    lappend out_cmd $out_word
                    set out_word {}
                }
                if {$backslash} {
                    # Anything else backslashed is part of a word.  The word
                    # will not be quoted.  Reprocess both the backslash and the
                    # current character.
                    set state word
                    set quote none
                    incr index -2
                } else {
                    # The first character decides the quoting mechanism.
                    # Reprocess the first character of a non-quoted word.
                    set state word
                    switch -- $char {
                    \"      {set quote quote}
                     (      {set quote paren; error "parens not implemented"}
                    \{      {set quote brace; incr braces; set brspace 1}
                    default {set qoute none; incr index -1}
                    }
                }
            } elseif {!$backslash} {
                # Forget the quoting of the previous word after seeing at least
                # one whitespace character.  (Note: backslash may or may not be
                # whitespace, depending on the character that follows it.)
                set quote none
            }
        } word {
            # Currently in a word.
            set begin 0
            if {$backslash} {
                # Inside a backslash sequence.
                set first [string index $bsseq 0]
                if {$quote eq "brace"} {
                    # Simply pass this character through unmolested and
                    # uninterpreted if inside a brace-quoted word.
                    append out_syl $char
                    set backslash 0
                } elseif {$bsseq eq ""} {
                    # Handle the first character of a backslash sequence.
                    if {$char in {a b f n r t v}} {
                        # Handle single-character backslash sequences.
                        append out_syl [subst \\$char]
                        set backslash 0
                    } elseif {$char in [list {*}$octal x u \n]} {
                        # Handle multi-character backslash sequences.
                        append bsseq $char
                    } else {
                        # Handle ordinary backslash quoting.
                        append out_syl $char
                        set backslash 0
                    }
                } elseif {($first in {x u} && ![string is xdigit $char])
                       || ($first in $octal && $char ni $octal)} {
                    # If this character is one past the end of a valid
                    # multi-character backslash sequence, substitute in the
                    # backslash sequence and reprocess this character.
                    append out_syl [subst \\$bsseq]
                    set bsseq ""
                    set backslash 0
                    incr index -1
                } elseif {$first eq "\n"} {
                    # Ignore whitespace following a backslashed newline.
                    if {$char ni {" " \t}} {
                        # If this character is past the end of a sequence of
                        # quoted newline-whitespace, replace the sequence with
                        # a single space and reprocess this character.
                        append out_syl " "
                        set bsseq ""
                        set backslash 0
                        incr index -1
                    }
                } else {
                    # Append this character to the backslash sequence.
                    append bsseq $char
                }
            } elseif {$quote ne "brace" && $char eq "\$"} {
                # Unbackslashed dollar sign.  This is the start of a variable
                # substitution or reference.
                error "variable substitution and reference not implemented"
                if {$out_syl ne ""} {
                    lappend out_word literal $out_syl
                    set out_syl ""
                }
                incr index
                lappend out_word [parse $text index variable]
            } elseif {$quote ne "brace" && $char eq "\["} {
                # Unbackslashed opening bracket.  This is the start of a
                # command substitution.
                if {$out_syl ne ""} {
                    lappend out_word literal $out_syl
                    set out_syl ""
                }
                incr index
                lappend out_word cmdsub [parse $text index command]
            } elseif {$quote eq "none" && $char in $wordsep} {
                # Unbackslashed word ending character.
                if {$out_syl ne ""} {
                    lappend out_word literal $out_syl
                    set out_syl ""
                }
                set state space
                if {$char in $cmdsep} {
                    # Reprocess command delimiters.
                    incr index -1
                }
            } elseif {$quote eq "quote" && $char eq "\""} {
                # Unbackslashed closing double quote.  This is (or should be)
                # the last character of the word.
                if {$out_syl ne ""} {
                    lappend out_word literal $out_syl
                    set out_syl ""
                }
                set state space
            } elseif {$quote eq "paren" && $char eq "("} {
                # Unbackslashed opening parenthesis.  This is the start of a
                # parentheses-quoted word.
            } elseif {$quote eq "paren" && $char eq ")"} {
                # Unbackslashed closing parenthesis.
            } elseif {$quote eq "brace" && $char eq "\{"} {
                # Unbackslashed opening brace inside a braced word.  Increment
                # the brace count if not also inside a comment or pair of
                # unbackslashed double quotes.
                if {!$brquote && !$brcomment} {
                    incr braces
                }
                append out_syl $char
            } elseif {$quote eq "brace" && $char eq "\}"} {
                # Unbackslashed closing brace inside a braced word.  Decrement
                # the brace count if not also inside a comment or pair of
                # unbackslashed double quotes.  If this is the final closing
                # brace, the word is done.  Or this could just be a word
                # modifier preceding the real word.  The space state code
                # handles this possibility.
                if {!$brquote && !$brcomment} {
                    incr braces -1
                }
                if {$braces == 0} {
                    if {$out_syl ne ""} {
                        lappend out_word literal $out_syl
                        set out_syl ""
                    }
                    set state space
                } else {
                    append out_syl $char
                }
            } elseif {$quote eq "brace" && $char eq "\""} {
                # Unbackslashed double quotes inside a braced word.  Enable
                # brace quoting if this character is the start of the subword
                # or disable if not.
                set brquote $brspace
                set brspace 0
                append out_syl $char
            } elseif {$quote eq "brace" && $brspace && $char eq "#"} {
                # Unbackslashed quote character inside braces at the start of a
                # subword.  Begin comment.
                set brcomment 1
                set brspace 0
                append out_syl $char
            } elseif {$quote eq "brace" && !$brquote && $char in $wordsep} {
                # Unbackslashed, uncommented space or command separator
                # character.
                if {$brcomment && $char eq "\n"} {
                    # If this is a newline inside a comment, terminate the
                    # comment.
                    set brcomment 0
                    set brspace 1
                } elseif {!$brquote} {
                    # If this is a word separator outside of double quotes,
                    # begin whitespace.
                    set brspace 1
                }
                append out_syl $char
            } elseif {$char eq "\\"} {
                # Unbackslashed backslash.  Enter backslash quoting mode.  For
                # the purposes of brace counting, don't consider this character
                # to be either space or nonspace because the next character may
                # or may not be newline.
                set backslash 1
                if {$quote eq "brace"} {
                    append out_syl $char
                }
            } else {
                # Any other character.  Append to the output syllable.
                set brspace 0
                append out_syl $char
            }
        } comment {
            # Currently in a line comment.
            if {$backslash} {
                # Inside a backslash sequence.  Unconditionally ignore this
                # character.
                set backslash 0
            } elseif {$char eq "\n"} {
                # Unbackslashed newline.  The comment ends here.
                set state space
            } elseif {$char eq "\\"} {
                # Unbackslashed backslash.  Ignore the next character.
                set backslash 1
            }
        }}

        dump_state
    }

    if {$state eq "word" && $quote ne "none"} {
        # It's not legal for a script to end inside a quoted word.
        switch -- $quote {
        quote {error "missing close-quote"}
        paren {error "missing close-parenthesis"}
        brace {error "missing close-brace"}
        }
    }
    if {$backslash} {
        # Handle backslash sequences at end of script.
        if {$bsseq eq ""} {
            # Treat backslash at end of script as a literal backslash.
            append out_syl \\
        } elseif {[string index $bsseq 0] ne "\n"} {
            # Attempt to substitute in the backslash sequence.
            append out_syl [subst \\$bsseq]
        }
        set begin 0
    }
    if {$out_syl ne ""} {
        # Append final syllable.
        lappend out_word literal $out_syl
        set out_syl {}
    }
    if {!$begin} {
        # Append final word.
        lappend out_cmd $out_word
        set out_word {}
    }
    if {[llength $out_cmd] != 0} {
        # Append final command.
        lappend out_scr $out_cmd
        set out_cmd {}
    }

    dump_state
    if {$::verbose} {puts "========== end =========="}

    return $out_scr
}

proc regurgitate {script} {
    set result ""
    foreach cmd $script {
        set line ""
        foreach word $cmd {
            set text ""
            foreach {type syl} $word {
                if {$type eq "literal"} {
                    append text $syl
                } elseif {$type eq "cmdsub"} {
                    append text \[[regurgitate $syl]]
                } else {
                    error "bad type \"$type\""
                }
            }
            lappend line $text
        }
        lappend result $line
    }
    return [join $result \n]
}

proc dump_state {} {
    if {$::verbose} {
        uplevel {
            if {![info exists old]} {
                set old {}
            }

            foreach var [lsort [info vars]] {
                if {$var in {old var val}} {
                    continue
                }
                set val [set $var]
                if {![dict exists $old $var] || [dict get $old $var] != $val} {
                    puts "$var = [list $val]"
                    dict set old $var $val
                }
            }
            puts [string repeat - 20]
        }
    }
}

set verbose 0

puts [regurgitate [parse {
    set x 42 # This is a comment!
    set y [expr {$x + 1}]0 # Whee, a multiline\
                             comment.
    puts "(\$x + 1) * 10\
          = \$y"
}]]

puts [regurgitate [parse\
"proc foo {} {puts foo\\\}
# comment }
puts \"quotes }\"
puts b#\"ar}"]]

Script overview

[parse] takes one argument, the Cloverfield script it is to parse. It returns the parse tree. See below for more.

The [regurgitate] proc tries to reproduce the original script. It doesn't try very hard, though; it incorrectly braces what should be command substitution. Oh well; I'll fix it after I have implemented variable substitution.

[dump_state] prints all variables that have changed since the previous time it was called. It's used to trace through [parse]'s execution. Set verbose to true to see it in action. Be prepared for a lot of output.


Test cases

This script includes two test cases:

    set x 42 # This is a comment!
    set y [expr {$x + 1}]0 # Whee, a multiline\
                             comment.
    puts "(\$x + 1) * 10\
          = \$y"

and

proc foo {} {puts foo\}
# comment }
puts "quotes }"
puts b#"ar}

The first case tests basic functionality, including line comments that can begin anywhere a word can. (For the sake of comparison, Tcl comments can only begin where a command can.) The second case tests brace counting. See that the closing braces at the end of the first three lines aren't counted since they're backslashed, commented, and quoted, respectively. The closing brace on the last line is counted because it's not commented and not quoted--- neither # nor " have any special meaning when they don't appear at a place where a word can begin.

The output for the first test case is:

set x 42
set y {[expr {$x + 1}]0}
puts {($x + 1) * 10 = $y}

and for the second test case:

proc foo {} puts\ foo\\\}\n#\ comment\ \}\nputs\ \"quotes\ \}\"\nputs\ b#\"ar

As you can see, the first output is pretty much a normalized form of the input, except for the erroneous bracing of words containing command substitution. Sorry. The second output is the input all jammed on one line. It's on one line because there's only one command, on account of the way matching braces are counted.


Implemented features

  • Unquoted words: foo
  • Quoted words: "foo"
  • Braced words: {foo}
  • Backslash quoting: \{foo
  • Backslash substitutions: \n
  • Line comments: # foo
  • Line continuations: "foo\, # foo\
  • Brace counting: {foo {bar}}
  • Multiple commands: foo; bar
  • Command substitution: [foo; bar]
  • Concatenation: [foo]bar

Unimplemented features

  • Parenthesized words: (foo)
  • Variable substitution: $foo
  • Word comments: {#}foo
  • Raw data: {data}foo
  • Word modifiers: {meta foo}bar

Parse tree design

A Cloverfield script (parse tree) is a list of commands. A command is a list of words. A word is a list of syllables and modifiers. A syllable is a substitution type specifier and a string, a script (parse tree), or a variable name. A variable name is a name word and a list of indexes. An index is a type and one or two index value words.

The substitution type specifiers (i.e., the possible syllable types) are literal, cmdsub, varsub, varref, and space. The word modifiers are null, expand, flatten, metasub, metadef, delay, refer, and list. The index types are vector, range, and key.

The string value of a word is the concatenation of its syllables after substitution. Some word modifiers influence or disable substitution.

An unquoted word has one or more syllables. Each command or variable substitution is a syllable, and each sequence of literal characters or backslash substitutions is a literal syllable. Backslash substitutions are done immediately by the parser.

A quoted word is like an unquoted word, except that it is possible to have zero syllables in the case of empty string.

A braced word is either zero syllables (empty string) or one literal syllable.

A parenthesized word is like an unquoted word, except that sequences of unquoted whitespace and nested parenthesis characters are stored in space syllables. A parenthesized word has a list modifier. When substitution is performed on a list word, consecutive non-space syllables are quoted (if necessary) to preserve word boundaries.


Parse tree representation

[parse] returns a list of commands. Each command is a list of words. (Word modifiers aren't supported yet.) Each word is a list of alternating syllable-type, syllable-value pairs. The supported types are literal and cmdsub. A literal syllable's value is an ordinary string, and a cmdsub syllable's value is a full parse tree generated by recursively invoking [parse].


Discussion